MITE-Hunter predicts the loci and consensus sequences of miniature inverted-repeat transposable elements (MITEs).
It claims to have a lower false-positive rate than TRANSPO, FINDMITE and MUST. TRANSPO requires the user to input known MITE sequences and is therefore unable to predict novel MITEs. FINDMITE and MUST find MITEs by looking for their structural features, but as there are only two such features, their false-positive rates are very high. Furthermore, neither of them can handle whole genome sequences as input. MITE-Hunter found 97.6% of the previously reported MITEs that occur in the rice genome and identified 16 new ones.
The following is an overview of the underlying pipeline:
1. Identifying candidates
The input genome is cut into overlapping fragments of 2.5 kb each, with an overlap of 500 bp. Candidate MITEs are found by searching for Terminal Inverted Repeats (TIRs) and Target Site Duplications (TSDs). The default TIR length is 10 bp, with one mismatch allowed between the pair. TSDs can range from 2 to 10 bp in length.
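As a rough illustration of this step, the sketch below cuts a sequence into overlapping windows and tests whether two 10 bp ends form a TIR pair with at most one mismatch. The function names (`fragments`, `revcomp`, `is_tir_pair`) are made up for this post and are not taken from MITE-Hunter itself, which may well implement the search differently.

```python
# Hypothetical helpers, not MITE-Hunter's own code.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def fragments(genome, size=2500, overlap=500):
    """Yield (start, fragment) pairs covering the genome with overlapping windows."""
    step = size - overlap
    for start in range(0, max(len(genome) - overlap, 1), step):
        yield start, genome[start:start + size]


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_tir_pair(left, right, max_mismatch=1):
    """True if `right` is the inverted repeat of `left`, allowing a few mismatches."""
    if len(left) != len(right):
        return False
    mismatches = sum(a != b for a, b in zip(left, revcomp(right)))
    return mismatches <= max_mismatch


if __name__ == "__main__":
    starts = [start for start, _ in fragments("A" * 6000)]
    print(starts)
    print(is_tir_pair("GGCCAGTCAC", "GTGACTGGCC"))
```

With the defaults, consecutive windows start 2 kb apart, so any element shorter than the 500 bp overlap falls entirely inside at least one fragment.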
2. Filtering false-positives
An all-by-all BLAST comparison is done for candidates after dividing them into groups based on their lengths (default difference of 100 bp). Candidates with similarities in their infixes but not within their TIRs are trimmed of the latter and retained for the next step.
3. Choosing representative candidates
An all-by-all BLAST comparison is done on all remaining candidates and they are clustered such that each cluster comprises a representative sequence that matches all other sequences in the group with >80% identity and >90% length. This process is iterated until all sequences have been clustered. The resultant representative candidates are retained for the next step.
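The iterative clustering can be sketched as a simple greedy loop. Two things here are my own assumptions rather than details from the paper: the representative is taken to be the first sequence still unclustered, and `toy_match` is a crude ungapped stand-in for a real BLAST hit, returning an (identity, length coverage) pair.

```python
def toy_match(a, b):
    """Stand-in for a BLAST comparison: ungapped, position-by-position identity."""
    shorter, longer = sorted((a, b), key=len)
    same = sum(x == y for x, y in zip(shorter, longer))
    return same / len(shorter), len(shorter) / len(longer)


def cluster(seqs, match=toy_match, min_identity=0.80, min_coverage=0.90):
    """Greedily group sequences around representatives until none are left."""
    remaining = list(seqs)
    clusters = []
    while remaining:
        rep = remaining.pop(0)  # assumption: first unclustered sequence represents
        members, rest = [], []
        for seq in remaining:
            identity, coverage = match(rep, seq)
            if identity > min_identity and coverage > min_coverage:
                members.append(seq)
            else:
                rest.append(seq)
        clusters.append((rep, members))
        remaining = rest
    return clusters


if __name__ == "__main__":
    for rep, members in cluster(["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]):
        print(rep, members)
```

Each pass removes one representative and everything that matches it, so the loop is guaranteed to finish once every sequence belongs to a cluster.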
4. Filtering false-positives (again)
Each candidate sequence is BLAST'd against the genome and homologous sequences are retrieved along with their flanking sequences (60 bp by default). Candidates with too few homologs in the input sequence (fewer than 4 by default) are filtered out. If a candidate has more than 35 homologs, only the 35 best-scoring ones are retained.
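A minimal sketch of the two thresholds and the flank retrieval described above, assuming the BLAST hits have already been parsed into `(score, start, end)` tuples. That tuple layout and the function names are hypothetical and only serve the example.

```python
def with_flanks(genome, start, end, flank=60):
    """Return (left flank, hit, right flank) for a hit at genome[start:end]."""
    left = genome[max(0, start - flank):start]
    right = genome[end:end + flank]
    return left, genome[start:end], right


def select_homologs(hits, min_homologs=4, max_homologs=35):
    """Reject candidates with too few hits; otherwise keep the best-scoring ones.

    `hits` is a list of tuples whose first element is the score.
    Returns None when the candidate should be filtered out.
    """
    if len(hits) < min_homologs:
        return None
    best_first = sorted(hits, key=lambda hit: hit[0], reverse=True)
    return best_first[:max_homologs]


if __name__ == "__main__":
    hits = [(score, 0, 0) for score in range(40)]
    kept = select_homologs(hits)
    print(len(kept), kept[0][0], kept[-1][0])
```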
5. Aligning the sets of sequences
Each candidate sequence is aligned with its corresponding homologous sequences in a multiple sequence alignment. Three average similarity scores are computed: one for the left flanking sequences (L), one for the right flanking sequences (R) and one for the infix (the internal homologous region, H). Outlier sequences that significantly shift the averages are removed and the scores are re-calculated. A candidate is filtered out if its homologs are highly similar in either flank, that is, if >50% of the left or right flanking sequences have >60% similarity. Copies of a true MITE should be similar within the element but not in the surrounding sequence, so candidates that have low L and R scores but high H scores are retained.
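The flank filter boils down to two cutoffs, which the following sketch applies to per-homolog similarity values between 0 and 1. How those values are derived from the alignment is not shown, and the function names are invented for illustration.

```python
def flanks_too_similar(similarities, sim_cutoff=0.60, frac_cutoff=0.50):
    """True if more than `frac_cutoff` of the flanks exceed `sim_cutoff` similarity."""
    if not similarities:
        return False
    high = sum(sim > sim_cutoff for sim in similarities)
    return high / len(similarities) > frac_cutoff


def keep_candidate(left_sims, right_sims):
    """Retain a candidate only if neither flank looks conserved across homologs."""
    return not (flanks_too_similar(left_sims) or flanks_too_similar(right_sims))


if __name__ == "__main__":
    print(keep_candidate([0.2, 0.3, 0.9, 0.1], [0.1, 0.2, 0.3, 0.2]))
    print(keep_candidate([0.9, 0.8, 0.7, 0.1], [0.1, 0.2, 0.3, 0.2]))
```

In the first call only one of four left flanks is above 60% similarity, so the candidate is kept. In the second, three of four are, so it is rejected.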
6. Predicting TSDs
The most common sub-sequence, ranging between 2-10bp, flanking the sequences in a candidate's set of homologs is predicted as the TSD for that set. This ultimately allows for classification of the transposon.
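One way to picture the TSD prediction: for every homolog, find the direct repeat that ends the left flank and starts the right flank, then take the most frequent repeat across the set. The sketch assumes exact matches and prefers the longest repeat per homolog. Both are simplifications of mine, not details from the paper.

```python
from collections import Counter


def tsd_of(left_flank, right_flank, min_len=2, max_len=10):
    """Longest exact direct repeat shared by the end of the left flank
    and the start of the right flank, or None if there is none."""
    longest = min(max_len, len(left_flank), len(right_flank))
    for k in range(longest, min_len - 1, -1):
        if left_flank[-k:] == right_flank[:k]:
            return left_flank[-k:]
    return None


def predict_tsd(flank_pairs):
    """Most common TSD over all (left flank, right flank) pairs, or None."""
    counts = Counter()
    for left, right in flank_pairs:
        tsd = tsd_of(left, right)
        if tsd is not None:
            counts[tsd] += 1
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    pairs = [("GGCATA", "TATTCC"), ("CCCTA", "TAGGG"), ("AAATTA", "TTAGGG")]
    print(predict_tsd(pairs))
```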
7. Creating consensus sequences
A consensus sequence is created from the multiple sequence alignment of each candidate set.
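A simple way to build such a consensus is a column-wise majority vote over the aligned sequences. The sketch below does exactly that and drops columns where a gap is the most common character. The paper may use a different rule, so treat this as an illustration only.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Majority-vote consensus of equal-length aligned sequences.

    Columns where the gap character is the most common symbol are skipped.
    """
    if len(set(len(seq) for seq in aligned)) != 1:
        raise ValueError("expected a non-empty alignment of equal-length sequences")
    bases = []
    for column in zip(*aligned):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:
            bases.append(symbol)
    return "".join(bases)


if __name__ == "__main__":
    print(consensus(["ACG-T", "ACGAT", "ACC-T"]))
```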
8. Grouping the candidates and choosing representatives
The same iterative clustering process as in step 3 is used to identify candidates that represent groups of consensus sequences.
When evaluated with rice chromosome 12 as input and compared to MUST and FINDMITE, the results were as follows:
FINDMITE: < 1 min
This comparison shows that MITE-Hunter not only outperforms the other two tools, but also has a high rate of accuracy and a reasonable running time.
Note that detectMITE, published years later, reports improved results:
Ye, C., Ji, G. and Liang, C., 2016. detectMITE: A novel approach to detect miniature inverted repeat transposable elements in genomes. Scientific reports, 6.
This blog post was a summary of the following paper:
Han, Y. and Wessler, S.R., 2010. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic acids research, 38(22), pp.e199-e199.