#Bioinformatics: Tools for predicting transposable elements

13 Sep 2017

Following my previous post, here is a list of other tools designed to predict transposable elements (TEs) in genomes, as well as some great general reading material.


1. Fiston-Lavier, A.S., Carrigan, M., Petrov, D.A. and González, J., 2010. T-lex: a program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nucleic acids research, 39(6), pp.e36-e36.


The ultimate goal of this tool is to analyse copy number variation of TEs in populations of genomes.


2. Ye, C., Ji, G. and Liang, C., 2016. detectMITE: A novel approach to detect miniature inverted repeat transposable elements in genomes. Scientific reports, 6.


This tool specifically focuses on finding MITEs in genomes.


3. Smit, A.F.A., Hubley, R. and Green, P. RepeatMasker. http://repeatmasker.org


This database search tool outputs a masked sequence, in which each base in a repeat is replaced with an N, along with a detailed annotation of the identified repeats. It searches the Repbase Update and Dfam databases, as well as its own database which, notably, only contains data for model organisms. The developers state that its running time is linear in the length of the input sequence. Masked sequences are needed during genome annotation, particularly in gene prediction, as repetitive sequences can greatly distort the output.
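As a rough illustration of what masking means, here is a minimal Python sketch. It is not RepeatMasker's code or interface: the function name `mask_repeats` and the interval format (0-based, end-exclusive `(start, end)` pairs) are assumptions made for this example only.

```python
def mask_repeats(sequence, repeat_intervals):
    """Return a copy of `sequence` with every base inside a repeat replaced by 'N'.

    `repeat_intervals` is an iterable of (start, end) pairs, 0-based and
    end-exclusive. This format is an assumption of the sketch, not the
    format of any real annotation file.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        # Clamp the interval to the sequence bounds before masking.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


print(mask_repeats("ACGTACGTAC", [(2, 5)]))  # ACNNNCGTAC
```

A gene predictor run on the masked sequence would then skip the N positions instead of trying to interpret repeat-derived sequence as genes.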


4. Bao, Z. and Eddy, S.R., 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome research, 12(8), pp.1269-1276.


This tool (RECON) extends an alignment-and-clustering algorithm with probabilistic methods which, the authors state, improve the identification of repeats and the refinement of results compared with standard de novo methods.


5. Price, A.L., Jones, N.C. and Pevzner, P.A., 2005. De novo identification of repeat families in large genomes. Bioinformatics, 21(suppl_1), pp.i351-i358.


When analysing the C. briggsae genome, this tool (RepeatScout) identifies more than 4% of the genome as repetitive elements beyond what RECON [4] reports. The authors attribute this improvement to its simple yet efficient algorithm, which uses the seed-and-extend method.


6. Smit, A.F.A. and Hubley, R., 2008-2015. RepeatModeler Open-1.0. http://www.repeatmasker.org


This tool conveniently runs both RECON [4] and RepeatScout [5], reporting their complementary results to the user.


7. Hubley, R., Finn, R.D., Clements, J., Eddy, S.R., Jones, T.A., Bao, W., Smit, A.F. and Wheeler, T.J., 2015. The Dfam database of repetitive DNA families. Nucleic acids research, 44(D1), pp.D81-D89.


This database contains families of repeats as multiple sequence alignments and profile hidden Markov models, which are produced from up to 2000 sequences in each family. Note that Repbase Update contains consensus sequences, which are used by Dfam to build alignments. Dfam is used by tools such as RepeatMasker [3], and the authors highlight that using Dfam with RepeatMasker, instead of Repbase Update, results in an increased coverage of annotation (+5% in humans, for example).


8. Chu, C., Nielsen, R. and Wu, Y., 2016. REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads. PloS one, 11(3), p.e0150719.


This tool uses a method analogous to genome assembly, and consequently makes use of pre-existing open-source genome assembly algorithms to identify repeats from sequence reads (de novo, without the need for a reference genome or repeat database). Based on the assumption that k-mers with the same frequencies will occur in the same repeat, the k-mers are binned and subsequently assembled into 'repeat contigs'. These contigs are then merged into scaffolds which represent repeats. Although very simple, this method proves to be effective in identifying known repeats, as well as novel ones.
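The counting and binning step can be sketched in a few lines of Python. This is a toy version and not REPdenovo's implementation: the function names, the `min_count` cutoff and the fixed `bin_width` are assumptions made for illustration, and reverse complements and the assembly step itself are left out.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bin_kmers_by_frequency(counts, min_count, bin_width):
    """Group k-mers with similar frequencies into the same bin.

    K-mers seen fewer than `min_count` times are dropped as unlikely to be
    repeats. Both parameters are hypothetical knobs for this sketch.
    """
    bins = defaultdict(list)
    for kmer, n in counts.items():
        if n >= min_count:
            bins[n // bin_width].append(kmer)
    return {b: sorted(kmers) for b, kmers in bins.items()}


reads = ["ACGACG", "ACGT"]
counts = count_kmers(reads, k=3)
print(bin_kmers_by_frequency(counts, min_count=1, bin_width=2))
```

Each bin would then be handed to an assembler separately, so that k-mers of similar abundance are assembled together into repeat contigs.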


9. Koch, P., Platzer, M. and Downie, B.R., 2014. RepARK—de novo creation of repeat libraries from whole-genome NGS reads. Nucleic acids research, 42(9), pp.e80-e80.


Published two years prior to REPdenovo [8], this tool uses a similar pipeline but without the construction of scaffolds. Its results are therefore similar to, but not as accurate as, those of REPdenovo.


10. Morgulis, A., Gertz, E.M., Schäffer, A.A. and Agarwala, R., 2005. WindowMasker: window-based masker for sequenced genomes. Bioinformatics, 22(2), pp.134-141.


This tool counts k-mers, choosing k as the smallest integer that satisfies L / 4^k < 5, where L is the sum of the lengths of all contigs. In the first pass, it assigns a score to each k-mer based on its frequency; in the second pass, it uses these scores to compute scores for windows of size k+4. Windows whose score passes a certain threshold are masked, producing output in a format similar to that of RepeatMasker [3], yet doing so two orders of magnitude faster and also more accurately.
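To make the choice of k and the window scoring concrete, here is a small Python sketch. It follows the formula as stated in this post and is not WindowMasker's own code; the function names are hypothetical, and scoring a window as the average of its k-mer scores is an assumption of the sketch.

```python
def choose_k(total_length):
    """Smallest integer k such that total_length / 4**k < 5."""
    k = 1
    while total_length / 4 ** k >= 5:
        k += 1
    return k


def window_scores(sequence, kmer_scores, k):
    """Score every window of k + 4 bases in `sequence`.

    Each window contains 5 overlapping k-mers. Averaging their scores is an
    assumption made for this sketch. K-mers missing from `kmer_scores`
    count as 0.
    """
    units_per_window = 5  # (k + 4) - k + 1
    unit = [kmer_scores.get(sequence[i:i + k], 0)
            for i in range(len(sequence) - k + 1)]
    return [sum(unit[i:i + units_per_window]) / units_per_window
            for i in range(len(unit) - units_per_window + 1)]


print(choose_k(3_000_000_000))                    # 15 for a ~3 Gb genome
print(window_scores("AAAAAAA", {"AA": 6}, k=2))   # [6.0, 6.0]
```

Windows whose score exceeds a chosen threshold would then be masked in the same way as any other repeat interval.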


11. Benson, G., 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research, 27(2), p.573.


This tool uses k-mer counting and probabilistic methods to predict tandem repeats.


12. Liu, X., Cheung, D.W.L., Ting, H.F., Lam, T.W. and Yiu, S.M., 2013, May. LCR_Finder: A de novo low copy repeat finder for human genome. In International Symposium on Bioinformatics Research and Applications (pp. 125-136). Springer, Berlin, Heidelberg.


This tool focuses on predicting complex low copy repeats >100kb in length.


13. McCarthy, E.M. and McDonald, J.F., 2003. LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics, 19(3), pp.362-367.


This tool is presented as one that is complementary to established homology-based methods. It uses a simple algorithm to predict LTR-retrotransposons based on their LTRs and TSDs.


14a. Hoen, D.R., Hickey, G., Bourque, G., Casacuberta, J., Cordaux, R., Feschotte, C., Fiston-Lavier, A.S., Hua-Van, A., Hubley, R., Kapusta, A. and Lerat, E., 2015. A call for benchmarking transposable element annotation methods. Mobile DNA, 6(1), p.13.

14b. Platt, R.N., Blanco-Berdugo, L. and Ray, D.A., 2016. Accurate transposable element annotation is vital when analyzing new genome assemblies. Genome biology and evolution, 8(2), pp.403-410.


Good overviews of the research area in general.


15. Lerat, E., 2010. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity, 104(6), p.520.


A very important review paper for biologists wanting to choose the best tool for their research, and for bioinformaticians interested in improving upon the results of currently available tools.


16. Sotero-Caio, C.G., Platt, R.N., Suh, A. and Ray, D.A., 2017. Evolution and Diversity of Transposable Elements in Vertebrate Genomes. Genome biology and evolution, 9(1), pp.161-177.


This paper highlights the need for careful and precise design of bioinformatic tools for TE prediction: because knowledge of the subject is still limited, some results can lead to incorrect conclusions.


17. Nelson, M.G., Linheiro, R.S. and Bergman, C.M., 2017. McClintock: An integrated pipeline for detecting transposable element insertions in whole genome shotgun sequencing data. bioRxiv, p.095372.


Published earlier this year, this paper describes a tool that combines six complementary tools into an easy-to-use pipeline with comprehensive and relatively accurate results.


Lastly, remember to check out https://omictools.com for a non-exhaustive but large list of tools for TE prediction.




PhDomics by Fatima