This genome assembly approach utilises an existing complete reference genome sequence to assemble newly sequenced genome, by aligning reads with it.
This can be defines as a pattern matching problem, where the DNA sequencing data is the set of patterns which must be matched to the text (reference genome) with at most e errors (mismatches).
The naïve approach.
If gaps are not allowed in the alignment of patterns with the text, the easiest way to match the pattern to the text would be to slide the read along the text and report each time there is a match. This approach is computationally very expensive and inefficient.
If gaps are allowed, even more inefficient dynamic programming algorithms such as Needleman-Wunsch must be used.
We need more efficient approaches, and one way of achieving this is to index the text sequence for faster access.
Hashing-Based indexing with Seed and Extend
1. Choose a value of k significantly less than the length of the reads.
2. Store the positions of all k-mers of the text in an array of linked lists.
3. Select a k-mer from each read. This could be the first prefix, or an infix, of the read, which would usually contain the least number of sequencing errors.
4. Map each one to the genome using the hashing procedure. (The seed.)
5. For each match, try to extend the match to the rest of the read, using the hashing procedure or an algorithm similar to Needleman-Wunsch (if allowing for approximate matches with e errors).
Seeds may be very repetitive in the text, which makes this approach memory intensive.
An improvement would be to make use of the pigeonhole principle.
Divide the reads in to q non-overlapping substrings. If allowing for e mismatches, at least q - e of the set of q substrings would match the text. These substrings can act as seeds to be extended for matching reads, as long as they occur in the correct sequence (this would need to be checked). This very clever strategy is also known as partitioning into exact matching.
For example, say we set q = 4 and e = 2. We divide our string s of length | s | = 12 into q = 4 non-overlapping substrings of length 3. We then try to match each substring separately. If the errors we allow are distributed evenly in s, at least two substrings will still match our text exactly. If the errors occur very close together, in just one substring, more than q - e = 2 substrings will match exactly (maybe three; or in the case of an exact match of each substring - none).
Burrows-Wheeler Transform (BWT) for Indexing the Reference Genome.
Check out tomorrow's blog post to learn exactly what BWT is, and why so many of the best and current mapping algorithms are based upon its use.
Part one of this series on Genome Assembly was posted on Day 65 and part two was posted on Day 67 and part three was posted on Day 69.
This post is based upon a lecture given by my supervisor Dr. Solon Pissis.