Visualizing genomic regions using Clinker

Overview

Teaching: 10 min
Exercises: 50 min
Questions
  • How does the genetic region of your resistance gene look like

Objectives
  • Extracting and annotating a genetic region

  • Visualizing gene differences and similarities

Introduction

In the previous exercises we detected a resistance gene. If you want to investigate if the genetic region is similar, suggesting a common source or clonal spread, the region can be investigated using a tool called Clinker . Clinker is available via commandline but also via the https://cagecat.bioinformatics.nl/ webserver. We will discuss both options and perform one of the options below. We have also included a piece of shell code to extract and annotate the regions, however you can also cut and paste the sequence from the contigs using a text editor. For this course, we provide an example of several regions with a resistance gene, however the code can be used to generate your own at a later timepoint.

Webbased use of Clinker

Download the zip file with extracted regions here: regions.zip Unzip the extracted regions file to a folder called “regions”

Visit the website https://cagecat.bioinformatics.nl/ and press the “Start” button at the Clinker section. Click on Genome Files and the navigate to the unzipped “regions” folder. Select all gbk files by pressing Shift and selecting all files. Press Open and then press the Submit button. It takes about a minute to run the analysis. You can now adjust the output to suit your needs. Move the fragments around (click on the name and move up or down) or flip them (double click on the sequence). Use the “Scale factor”, “Show gene labels” and “Label type” options to improve the visualization.

Manual commandline use of Clinker

This manual Clinker exercise should only be done after you have completed the annotation exercise from day 3 and if you feel compfortable using the commandline. This is not an easy command. It requires a good understanding of the input files of Clinker and how to get annotations. The commandline extraction procedure is complex and possibly it is better to do this by hand.

Extracting regions

This exercise can be done on all the genomes from the study. If you want to run this, please perform the following commands. Please just copy and paste the code into a text document, adjust the settings if necessary, and then copy and paste into your shell.

#first we get all assembled genomes with all AMRFinderPlus outputs in them, same as you generated before, but now for all genomes
cp -r /mnt/netappits/users/courses2/shared/triumph/assembly_all ~/assembly_all

# We make the output folder
mkdir ~/regions/

# We run the script that finds CTX-M-14 from all the AMRFinderPlus outputs from all assembled genomes. You can change CTX-M-14 to another gene of interest (e.g. CTX-M-15). The gene itself, 5 kb left of the gene and 5 kb right of the gene are extracted and placed into a file in the folder ~/regions/

cd ~/assembly_all/
for sample in barcode01 barcode02 barcode03 barcode04 barcode05 barcode06 barcode07 barcode08 barcode09 barcode10 barcode11 barcode12 barcode13 barcode14 barcode15 barcode16 barcode17 barcode18 barcode19 barcode20 barcode21 barcode22 barcode23 barcode24; do
  grep "CTX-M-14" "$sample/amrfinderplus.txt" | cut -f 2,3,4 | while read contig start stop; do
    # Extract the sequence of the specified contig
    seq=$(awk -v contig="$contig" '
      BEGIN { found=0 }
      /^>/ {
        if(found) exit;
        found=($0 ~ ">"contig);
        next;
      }
      found { printf "%s", $0 }
    ' "$sample/assembly.fasta")

    # Define extraction range. In the example 5000 basepairs, but you can adjust this
    seq_length=${#seq}
    extract_start=$((start - 5000))
    extract_stop=$((stop + 5000))

    # Ensure the boundaries are valid
    if (( extract_start < 1 )); then extract_start=1; fi
    if (( extract_stop > seq_length )); then extract_stop=$seq_length; fi

    # Extract the subsequence
    extracted_seq=$(echo "$seq" | cut -c "$extract_start"-"$extract_stop")

    # Write to output file
    echo -e ">"$sample"\n$extracted_seq" > ~/regions/"$sample".fasta
  done
done

Annotating extracted regions

The above command was run on all the assembled genomes from this study. CTX-M-14 is found in barcode01 barcode06 barcode14 barcode15 barcode16 barcode17 barcode18 barcode19 barcode20 barcode24 according to the AMRFinderPlus analysis. We can annotate these using the following commands so that we can compare the genes in the regions using Clinker

$ cd ~/regions
$ for region in barcode01 barcode06 barcode14 barcode15 barcode16 barcode17 barcode18 barcode19 barcode20 barcode24; do
  prokka --outdir ~/regions/"$region" --prefix "$region" ~/regions/"$region".fasta --usegenus -genus Escherichia --cpus 1 --rawproduct --locustag "$region"
done

Running Clinker in the commandline

Finally, we can run Clinker on the annotated regions. Clinker needs to use gbk (Genbank) files

$ cd ~/regions/
$ clinker barcode*/*.gbk -p regions.html
$ 

Download the html file regions.html you have just created and open it into a webbrowser.

Challenge: Is the resistance gene location conserved?

Compare the regions, are the genes the same on all regions? What differences can you see?.

Solution

The output can be made to look like this after moving the fragments around (click on the name and move up or down) and flipping them (double click on the sequence)

image

Discuss with the class. The report.html can also be downloaded here: regions.html : use right mouse button, click save target as.

Key Points

  • Resistance genes can be found on the chromosome and on plasmids

  • Visualizing the region can help understand the history of the genetic locus containing the resistance gene