De novo assembly is a hierarchical data structure.

Â

Overlapping reads form contiguous sequences (contigs), and contigs form scaffolds.Â Scaffolds are formed using insert size data fromÂ paired-end sequencingÂ and elucidate the order and orientation of contigs.

Â

There are two approaches for de novo genome assembly, which both make use of graphs. Reads are represented in such a way that following a path in the graph from left to right results in 'reading' the genomic sequence.

Â

1. Overlap-Layout-Consensus Approach

Â

Â

An overlap graph is created in which nodes represent reads, and edges represent overlaps between them. A Hamiltonian cycle is found by walking through each node exactly once, and successive nodes (reads) are merged to form the consensus sequences.

This method is more suitable for fewer reads with longer overlaps, but is computationally very expensive due to the number of reads and the fact that finding a Hamiltonian cycle is an NP-complete problem. A more efficient approach...

Â

2. De Bruijn Graphs

Â

Â

In de Bruijn graphs, nodes representÂ k-mers, which are factors of reads of length k. This results in a major reduction of redundancy in the graph, as reads are not considered whole, and each unique k-mer is represented by a node. EdgesÂ represent overlaps of lengthÂ k-1 betweenÂ k-mers.

Â

The size ofÂ kÂ is decided by taking read length and error rate into consideration, and has a great effect on the resulting consensus sequence.

Â

An Eulerian cycle is found in the graph by visiting each edge once, and therefore, constructing the genome form overlappingÂ k-mers. This is computationally less expensive than finding the Hamiltonian cycle, because the problem of finding the Eulerian cycle is in P.

This method is the most widely used method for de novo assembly of genomes.

Â

Part one was posted on Day 65. Part three will be posted on Day 69Â - stay tuned!

This post is based upon a lecture given by my supervisor Dr. Solon Pissis.

[67]