Sequence assembly

Overview

Teaching: 10 min
Exercises: 60 min
Questions
  • How can the information in the sequencing reads be reduced?

  • What are the different methods for assembly?

Objectives
  • Understand differences between assembly methods

  • Assemble the long reads

Sequence assembly means the alignment and merging of reads in order to reconstruct the original sequence. Assembling a genome from sequencing reads takes a while: from minutes up to several hours per genome.

Sequence assembly

The assembler we will run is Flye. Flye can be run on low quality nanopore data generated using the fast basecalling model, or on (super) high quality data from newer basecalling models with newer flowcells. It is recommended to always use the data from the best basecalling model (in this case “sup”) for assembly.

Assembly using commandline

Because the assembly of each genome might take a while, we will assemble two isolates per person. Once the assembly has started for most people, we can follow the assembly lecture (next page).

Preparation

$ cd ~
$ mkdir assembly

To run Flye we will use the flye command with the --nano-hq option, as we have high quality data, --out-dir for the output folder and --threads 2 for the number of CPUs. The following is an example. Replace barcode01 and barcode02 with the names of your isolates.

$ ls
$ cd ~/reads
$ ls
$ for sample in barcode01 barcode02; do
    flye --nano-hq "$sample.fastq" --threads 2 --out-dir ~/assembly/"$sample"/
  done
$ cd ~/assembly
$ ls 
$ cd barcode02
$ ls

Now it’s time to have a break, as assembly will take a while. The loop ensures it does both your genomes. In theory you can do this for 100s of genomes. Of course you have to change the names to suit your selected samples.
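When running the loop above for many genomes, it can help to skip samples whose reads are missing or that already have a finished assembly. The following is a minimal sketch, not part of the original lesson: the variable names reads_dir and out_root and the helper needs_assembly are illustrative, and the paths assume the ~/reads and ~/assembly layout used above.

```shell
reads_dir=~/reads          # assumed location of the FASTQ files
out_root=~/assembly        # assumed output root, created in the preparation step

# Hypothetical helper: succeeds only if the reads exist (and are not empty)
# and no non-empty assembly.fasta is present for this sample yet.
needs_assembly() {
    [ -s "$reads_dir/$1.fastq" ] && [ ! -s "$out_root/$1/assembly.fasta" ]
}

# Replace barcode01 and barcode02 with the names of your isolates.
for sample in barcode01 barcode02; do
    if needs_assembly "$sample"; then
        flye --nano-hq "$reads_dir/$sample.fastq" --threads 2 --out-dir "$out_root/$sample"
    else
        echo "Skipping $sample: reads missing or assembly already present"
    fi
done
```

Because the check looks at assembly.fasta, rerunning the loop after an interruption only restarts the samples that did not finish.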

Each assembly is written to its own subfolder of ~/assembly. The end result is called assembly.fasta. Inspect the log file as well (look at the end of the file) and the info file. In the next lecture we will explain reads, read pairs, contigs and scaffolds, terms that are used in Illumina and Nanopore sequencing.
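One way to look at the log file and the info file is sketched below. It assumes the file names flye.log and assembly_info.txt that Flye normally writes to its output folder. The helper show_assembly is illustrative and not part of Flye.

```shell
# Hypothetical helper: show the end of the Flye log and the per-contig info table.
show_assembly() {
    outdir=$1
    if [ ! -d "$outdir" ]; then
        echo "No assembly folder found at $outdir"
        return 0
    fi
    echo "== last lines of flye.log =="
    tail -n 20 "$outdir/flye.log"          # the summary is at the end of the log
    echo "== assembly_info.txt =="
    cat "$outdir/assembly_info.txt"        # one row per contig
}

# Replace barcode02 with the name of one of your isolates.
show_assembly ~/assembly/barcode02
```

The info file lists one contig per row, so it is a quick way to see contig lengths and coverage without opening the FASTA file.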

Nanopore assemblies sometimes need to be polished as well. See the paper by Ryan Wick, “Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing”. His latest blog post suggests that polishing is no longer necessary when using MinKNOW from October 2023 onwards with the “sup” basecalling mode. We will therefore skip it. If you want to assemble older nanopore reads, please take a look at Medaka.

Assembly using Epi2me

To run Flye inside Epi2me we will start Epi2me and use the bacterial genome assembly workflow. A complete manual is available on the setup page. Because the assembly of each genome might take a while, we will assemble two isolates per person. Once the assembly has started for most people, we can follow the assembly lecture (next page).

Challenge: How many contigs were generated by Flye?

Find out how many contigs or scaffolds there are in the E. coli isolates. Enter your solution in the Google Sheets table.

Hint:

$ grep -c

prints a count of matching lines for each input file.

Solution

You can read the log files for the necessary information, or you can get these from the assembly.fasta file:

$ cd ~/assembly
$ grep -c '>'  barcode*/assembly.fasta
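If you also want the total assembly length next to the contig count, the same idea can be extended with awk. This is a minimal sketch under the assumption that the assemblies are in ~/assembly/<sample>/assembly.fasta, as created above. The function name fasta_stats is illustrative.

```shell
# Hypothetical helper: print file name, number of contigs and total bases (tab separated).
fasta_stats() {
    awk -v name="$1" '
        /^>/ { contigs++; next }          # header line: one per contig
             { bases += length($0) }      # sequence line: add its length
        END  { printf "%s\t%d\t%d\n", name, contigs, bases }
    ' "$1"
}

for fasta in ~/assembly/*/assembly.fasta; do
    [ -e "$fasta" ] || continue           # skip if the pattern matched nothing
    fasta_stats "$fasta"
done
```

The contig count printed here should match the number reported by grep -c for the same file.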

At the moment, the assemblies of all samples are called assembly.fasta. This is not ideal. In the next episode we will rename the assemblies before processing them further.

Key Points

  • Assembly is a process which aligns and merges fragments from a longer DNA sequence in order to reconstruct the original sequence.

  • Assembly is a time-consuming process. Make sure you plan it well.