De novo assembly is a hierarchical data structure.
Overlapping reads form contiguous sequences (contigs), and contigs form scaffolds. Scaffolds are formed using insert size data from paired-end sequencing and elucidate the order and orientation of contigs.
There are two approaches for de novo genome assembly, which both make use of graphs. Reads are represented in such a way that following a path in the graph from left to right results in 'reading' the genomic sequence.
1. Overlap-Layout-Consensus Approach
An overlap graph is created in which nodes represent reads, and edges represent overlaps between them. A Hamiltonian cycle is found by walking through each node exactly once, and successive nodes (reads) are merged to form the consensus sequences.
This method is more suitable for fewer reads with longer overlaps, but is computationally very expensive due to the number of reads and the fact that finding a Hamiltonian cycle is an NP-complete problem. A more efficient approach...
2. De Bruijn Graphs
In de Bruijn graphs, nodes represent k-mers, which are factors of reads of length k. This results in a major reduction of redundancy in the graph, as reads are not considered whole, and each unique k-mer is represented by a node. Edges represent overlaps of length k-1 between k-mers.
The size of k is decided by taking read length and error rate into consideration, and has a great effect on the resulting consensus sequence.
An Eulerian cycle is found in the graph by visiting each edge once, and therefore, constructing the genome form overlapping k-mers. This is computationally less expensive than finding the Hamiltonian cycle, because the problem of finding the Eulerian cycle is in P.
This method is the most widely used method for de novo assembly of genomes.
Part one was posted on Day 65. Part three will be posted on Day 69 - stay tuned!
This post is based upon a lecture given by my supervisor Dr. Solon Pissis.