#BioinformaticsTool: LTRharvest

11 Sep 2017

LTRharvest is a tool for de novo prediction of Long Terminal Repeat (LTR) retrotransposons in genomes.


The following is a list of the structural parameters that LTRharvest considers:

  • LTR length

  • distance between 5' and 3' LTRs

  • sequence similarity between 5' and 3' LTRs

  • Target site duplications (TSDs)

  • LTR motifs that are usually gapped palindromes

Note that the input can be an assembled genome or whole-genome shotgun sequencing reads. The algorithm disregards degenerate symbols: they never match anything, not even themselves. Examples are W = {A, T}, R = {A, G}, B = {C, G, T} and N = {A, C, G, T}.
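The rule for degenerate symbols can be written as a tiny comparison function. This is only a sketch of the behaviour described above, not code from LTRharvest; the name `symbols_match` is made up for illustration.

```python
# Only the four unambiguous bases may take part in a match.
UNAMBIGUOUS = frozenset("ACGT")


def symbols_match(a: str, b: str) -> bool:
    """Return True only when both symbols are the same unambiguous base.

    Degenerate symbols such as N, W, R or B never match, not even themselves.
    (Hypothetical helper, written for illustration.)
    """
    a, b = a.upper(), b.upper()
    return a == b and a in UNAMBIGUOUS
```

So `symbols_match("A", "A")` is true, while `symbols_match("N", "N")` and `symbols_match("W", "A")` are both false.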


LTRharvest uses an enhanced suffix array to find maximal exact repeats. These serve as seeds, which are extended by dynamic programming into approximate repeat pairs that account for the user-specified sequence similarity and length criteria.
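To make "maximal exact repeat" concrete, here is a minimal sketch that finds them on a toy sequence. It deliberately swaps the enhanced suffix array for a plain k-mer index, which is far slower and only suitable for small examples, and it skips the dynamic programming extension entirely. The function name and its parameters are assumptions made for illustration.

```python
from collections import defaultdict

BASES = frozenset("ACGT")


def maximal_exact_repeats(seq, min_len):
    """Return sorted (start1, start2, length) triples, start1 < start2, for
    every maximal exact repeat pair of at least min_len unambiguous bases.

    Uses a k-mer index instead of an enhanced suffix array (illustration only).
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        kmer = seq[i:i + min_len]
        if set(kmer) <= BASES:  # degenerate symbols never seed a match
            index[kmer].append(i)

    repeats = set()
    for positions in index.values():
        for x in range(len(positions)):
            for y in range(x + 1, len(positions)):
                i, j = positions[x], positions[y]
                # Extend to the left while both copies agree.
                left = 0
                while (i - left > 0
                       and seq[i - left - 1] == seq[j - left - 1]
                       and seq[i - left - 1] in BASES):
                    left += 1
                # Extend to the right while both copies agree.
                right = min_len
                while (j + right < len(seq)
                       and seq[i + right] == seq[j + right]
                       and seq[i + right] in BASES):
                    right += 1
                repeats.add((i - left, j - left, left + right))
    return sorted(repeats)
```

For example, `maximal_exact_repeats("GGACGTTGCACCACGTTGCATT", 6)` reports the single pair `(2, 12, 8)`: the 8-base copy `ACGTTGCA` occurs at positions 2 and 12 and cannot be extended in either direction.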


The resulting approximate repeat pairs are then checked for TSDs and palindromic motifs. The leftmost occurrence of a repeat corresponds to the 5' LTR of the candidate retrotransposon and the rightmost occurrence to the 3' LTR, so TSDs are searched for at the outer extremities of the pair: the left (5') end of the 5' repeat and the right (3') end of the 3' repeat. The search builds a suffix array from factors of the repeats, with lengths in a user-specified range, and looks for exact matches. If this option is switched on, any repeat pair without an exact TSD is omitted from the remaining steps of the algorithm. The authors state that LTR motifs 'consist of two pairs of two nucleotides and an allowed number of mismatches'. If the TSD search is switched off, motifs are checked for in its place; if it is switched on, motifs are checked at the internal boundaries of the TSDs.
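The two checks can be sketched on a toy candidate as follows. Several simplifications are assumed here and are not taken from the paper: the TSD check only compares the bases directly flanking the element rather than searching a neighbourhood with a suffix array, the default length range of 4 to 20 is illustrative, the TG...CA motif is just one commonly cited example, and mismatches are counted across both LTRs together. All function names are hypothetical.

```python
BASES = frozenset("ACGT")


def find_tsd(seq, element_start, element_end, min_len=4, max_len=20):
    """Return the longest exact TSD directly flanking
    seq[element_start:element_end], or None if there is none.

    Simplified: only the immediately adjacent bases are compared.
    """
    seq = seq.upper()
    for k in range(max_len, min_len - 1, -1):
        if element_start - k < 0 or element_end + k > len(seq):
            continue  # flank would run off the sequence
        left = seq[element_start - k:element_start]
        right = seq[element_end:element_end + k]
        if left == right and set(left) <= BASES:
            return left
    return None


def motif_mismatches(ltr, start_motif="TG", end_motif="CA"):
    """Count mismatches between the LTR's first/last two bases and the motif."""
    observed = ltr[:2].upper() + ltr[-2:].upper()
    expected = start_motif + end_motif
    return sum(o != e for o, e in zip(observed, expected))


def has_motif(ltr5, ltr3, max_mismatches=0):
    """True if both LTRs carry the motif within the mismatch budget
    (budget assumed to be shared across the two LTRs)."""
    return motif_mismatches(ltr5) + motif_mismatches(ltr3) <= max_mismatches
```

With `ltr = "TGACGTACGTCA"` and `seq = "GG" + "ATTGC" + ltr + "CCCCGGGG" + ltr + "ATTGC" + "CC"`, calling `find_tsd(seq, 7, 39)` returns `"ATTGC"` and `has_motif(ltr, ltr)` is true.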


The algorithm then eliminates any repeat pairs that do not satisfy the length and distance parameters. Ukkonen's approximate string matching algorithm is then used to calculate the sequence similarity of each remaining pair, and pairs below the user-defined minimum similarity are discarded.
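The filtering step might look roughly like the sketch below. It uses the textbook quadratic edit-distance recurrence rather than Ukkonen's faster algorithm, and it makes two assumptions that are not stated in this post: similarity is taken as 100 × (1 − edit distance / length of the longer LTR), and distance is measured between the start positions of the two LTRs. The function and parameter names are invented for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance via the standard DP table, kept row by row.

    (Plain quadratic version, not Ukkonen's algorithm.)
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def similarity_percent(a, b):
    """Assumed definition: 100 * (1 - edit distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


def passes_filters(seq, start5, len5, start3, len3, *,
                   min_len, max_len, min_dist, max_dist, min_similarity):
    """Apply the length, distance and similarity filters to one LTR pair."""
    if not (min_len <= len5 <= max_len and min_len <= len3 <= max_len):
        return False
    # Distance assumed to be measured between the LTR start positions.
    if not (min_dist <= start3 - start5 <= max_dist):
        return False
    ltr5 = seq[start5:start5 + len5]
    ltr3 = seq[start3:start3 + len3]
    return similarity_percent(ltr5, ltr3) >= min_similarity
```

Two 10-base LTRs that differ by a single substitution score 90% under this definition, so they pass a minimum similarity of 85 but fail at 95.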


What remains at the end of this pipeline is the set of predicted LTR retrotransposons. It must be noted that, in general, de novo predictions can only account for full-length transposable elements (TEs). Parameters such as sequence similarity can be relaxed to produce more evolutionarily relevant output, but the authors also recommend BLAST-ing the predictions against relevant TE databases to elucidate LTR retrotransposons that do not fit the canonical model perfectly.


The authors recommend a post-processing step when analysing genomes that are not well studied, and list several suggestions for refining the output. When they evaluated the algorithm on real data (S. cerevisiae and D. melanogaster), this post-processing step was crucial in realising that many apparent false positives among the full-length TE predictions were in fact true positives for partial or degenerated TEs.


The authors present a comprehensive performance comparison of their tool with seven others, and show that LTRharvest achieves the fastest running time and, with fine-tuned parameters and post-processing, one of the highest accuracy rates.




This blog post was based on the following papers:

Ellinghaus, D., Kurtz, S. and Willhoeft, U., 2008. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics, 9(1), p.18.

Ukkonen, E., 1985. Algorithms for approximate string matching. Information and Control, 64(1-3), pp.100-118.

PhDomics by Fatima