#LitReview: Protein sequence similarity searches using patterns as seeds

15 Jun 2017

This blog post is a summary of the paper below and mirrors its structure.


Zhang, Z., Miller, W., Schäffer, A.A., Madden, T.L., Lipman, D.J., Koonin, E.V. and Altschul, S.F., 1998. Protein sequence similarity searches using patterns as seeds. Nucleic acids research, 26(17), pp.3986-3990.




Short, conserved sub-sequences (patterns) occurring in protein sequences are studied to elucidate their function. If a particular pattern occurs in a family of proteins with related functions, it can be inferred that the pattern has some involvement in that function. Such a pattern is therefore useful for identifying additional sequences that belong to the protein family.


Single or even multiple occurrences of a weakly defined pattern could lead to false inferences, so it is crucial to take into account a model of the underlying amino acid composition of the sequences. A weakly defined pattern is one that contains many degenerate positions. For example, {P,L,F,V}{Y,M,D,N}{L,W,G,A} is a pattern of length 3 in which each position can be any one of four given amino acids. Since it is only three positions long and 4 x 4 x 4 = 64 different tripeptides satisfy it, you can see how this would be considered a very weakly defined pattern!
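
To put a number on how weak the example pattern is, here is a minimal Python sketch. The list-of-sets representation and the function name `matching_fraction` are my own illustrative choices, not something taken from the paper, and the count treats all 20 amino acids as equally likely, which real proteins are not.

```python
from itertools import product

# The example pattern: one set of allowed amino acids per position.
PATTERN = [set("PLFV"), set("YMDN"), set("LWGA")]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def matching_fraction(pattern):
    """Count how many peptides of the pattern's length satisfy the pattern."""
    total = 0
    hits = 0
    for peptide in product(AMINO_ACIDS, repeat=len(pattern)):
        total += 1
        if all(res in allowed for res, allowed in zip(peptide, pattern)):
            hits += 1
    return hits, total


hits, total = matching_fraction(PATTERN)
print(hits, total, hits / total)  # 64 8000 0.008
```

Under that uniform assumption, roughly 1 in 125 positions of any sequence would match the pattern purely by chance.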


The Pattern-Hit Initiated BLAST (PHI-BLAST) algorithm takes the following input and produces the following output.

Input: a well-defined pattern, optionally including gaps and wild cards; a query sequence containing at least one occurrence of the pattern; a database of sequences to search

Output: For each occurrence of the pattern in the query sequence, a local alignment is produced whenever the pattern also occurs in a similar region of one of the sequences in the database, and each alignment is complemented with statistical analyses. All such alignments are presented sorted by statistical significance.


Evaluating these local alignments helps to identify specific amino acid residues that are significant in a protein for a particular species or a particular cell state or, more generally, to establish that the pattern occurs in the query sequence because of a specific function and not by chance.


The PHI-BLAST Algorithm


First, the input pattern is checked against background amino acid frequencies to compute its expected frequency. The pattern is accepted only if it is expected to occur less than once per 5000 residues.
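
The check described above can be sketched as follows. This is a rough illustration under stated assumptions: the background is a uniform placeholder (0.05 per residue) rather than measured amino acid frequencies, positions are treated as independent, gaps and wild cards are ignored, and the names `expected_frequency` and `is_allowed` are hypothetical.

```python
from math import prod

# Placeholder background: every residue equally likely (0.05 each).
# Real amino acid frequencies are not uniform; substitute measured values.
BACKGROUND = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

MAX_FREQUENCY = 1 / 5000  # the threshold quoted in this post


def expected_frequency(pattern, background=BACKGROUND):
    """Probability that the pattern matches at one given sequence position.

    `pattern` is a list of sets of allowed residues, one set per position.
    """
    return prod(sum(background[aa] for aa in allowed) for allowed in pattern)


def is_allowed(pattern, background=BACKGROUND):
    """True if the pattern is rare enough to be used as a seed."""
    return expected_frequency(pattern, background) < MAX_FREQUENCY


weak = [set("PLFV"), set("YMDN"), set("LWGA")]
strong = [set("PLFV"), set("A"), set("W"), set("W"), set("V"), set("L")]
print(is_allowed(weak))    # False: about 0.008 per position
print(is_allowed(strong))  # True: about 6e-8 per position
```

With this placeholder background the three-position example pattern is rejected, while adding a run of fixed letters brings the expected frequency well below the threshold.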


Existing efficient pattern-matching algorithms have been adapted to first find matches to the simplest parts of a complex pattern, before filling in the gaps and locating the entire pattern. For example, in the pattern {P,L,F,V}AWWVL{Y,M,D,N}YLF{L,W,G,A}, the fixed run AWWVL would be searched first (first seed), then the fixed run YLF (second seed), and then the remainder.
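
As a rough illustration of the seed-first idea, the sketch below locates the longest run of fixed letters in a pattern (AWWVL in the example), searches for that run with a plain substring search, and only then checks the remaining positions. It is a simplification that uses a single seed and ignores gaps and wild cards; it is not the matching algorithm used in the paper, and all function names are hypothetical.

```python
import re


def parse_pattern(text):
    """Turn '{P,L}AW' into a list of sets: [{'P', 'L'}, {'A'}, {'W'}]."""
    tokens = re.findall(r"\{([^}]*)\}|([A-Z])", text)
    return [set(group.replace(",", "")) if group else {letter}
            for group, letter in tokens]


def longest_fixed_run(pattern):
    """Return (start, text) of the longest run of single-letter positions."""
    best_start, best_text = 0, ""
    start, text = 0, ""
    for i, allowed in enumerate(pattern):
        if len(allowed) == 1:
            if not text:
                start = i
            text += next(iter(allowed))
            if len(text) > len(best_text):
                best_start, best_text = start, text
        else:
            text = ""
    return best_start, best_text


def find_pattern(sequence, pattern):
    """Start positions of all pattern matches, located via the seed first."""
    offset, seed = longest_fixed_run(pattern)
    hits = []
    pos = sequence.find(seed)
    while pos != -1:
        begin = pos - offset
        window = sequence[begin:begin + len(pattern)] if begin >= 0 else ""
        # Only now verify the degenerate positions around the seed.
        if len(window) == len(pattern) and all(
            res in allowed for res, allowed in zip(window, pattern)
        ):
            hits.append(begin)
        pos = sequence.find(seed, pos + 1)
    return hits


pattern = parse_pattern("{P,L,F,V}AWWVL{Y,M,D,N}YLF{L,W,G,A}")
print(longest_fixed_run(pattern))                # (1, 'AWWVL')
print(find_pattern("MKPAWWVLDYLFGSS", pattern))  # [2]
```

The sequence in the usage lines is made up for the example. The point of the design is that a fixed run is rare and cheap to find, so the more expensive per-position check runs only at a handful of candidate sites.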


Starting from each pattern match, PHI-BLAST uses dynamic programming to perform gapped extensions that refine the extremities, and so finds the optimal local alignment containing the match.
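
The gapped extension step can be illustrated with a small dynamic programming sketch. It is only a toy under loud assumptions: it scores with a flat match/mismatch scheme and a linear gap cost instead of a substitution matrix and the gap costs PHI-BLAST actually uses, it fills the whole table rather than pruning, and the function names and default scores are invented for the example.

```python
def extend_right(a, b, match=2, mismatch=-1, gap=-2):
    """Best score of a gapped alignment of a prefix of `a` with a prefix of `b`.

    Both prefixes start at index 0 (the edge of the pattern hit), so the
    alignment is pinned at one end and free at the other. Returns 0 when
    no extension improves on stopping at the hit itself.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def extend_both(query, subject, q_hit, s_hit, length, **scores):
    """Left and right extension scores around a pattern hit of given length."""
    right = extend_right(query[q_hit + length:], subject[s_hit + length:], **scores)
    # Extending to the left is the same problem on the reversed flanks.
    left = extend_right(query[:q_hit][::-1], subject[:s_hit][::-1], **scores)
    return left, right


print(extend_right("ACDE", "ACE"))                        # 4
print(extend_both("MKLPAWDEF", "MKLPAWDEF", 3, 3, 3))     # (6, 6)
```

Reversing the left flanks lets one function serve both directions, which keeps the sketch short.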


Statistical Analysis


Scores are computed for each occurrence of the input pattern and for the gapped extensions flanking it. The results are sorted with respect to the sum of the scores of the flanking regions; this is because the scores for the input pattern are implicitly assumed to be uniform across all reported results.
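
The ranking rule just described amounts to sorting by the sum of the two flanking scores. A minimal sketch, in which the `Hit` record, its field names and the example values are all made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str   # hypothetical identifier of the database sequence
    left_score: int   # score of the extension to the left of the pattern
    right_score: int  # score of the extension to the right of the pattern

    @property
    def flank_score(self):
        # The pattern's own score is left out: it is assumed to be the
        # same for every reported hit, so it cannot change the order.
        return self.left_score + self.right_score


def rank_hits(hits):
    """Sort hits from highest to lowest combined flanking score."""
    return sorted(hits, key=lambda h: h.flank_score, reverse=True)


hits = [Hit("seqA", 10, 5), Hit("seqB", 30, 2), Hit("seqC", 8, 8)]
print([h.subject_id for h in rank_hits(hits)])  # ['seqB', 'seqC', 'seqA']
```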


The total number of occurrences of the pattern is reported, but no further analysis of this particular feature is done, as it is already known that the input pattern should be biologically important.


The score is then converted into an E value based on several factors, including the score's own random distribution, that is, the distribution of scores obtained by applying the algorithm to random synthetic sequences.


Implementation and Examples


Functionality has been added to complement PHI-BLAST: a query sequence is first used to search the PROSITE database of protein motifs, and a PHI-BLAST database search is then performed with each resulting pattern. Note that regions of low complexity or biased amino acid composition are omitted from the search. Subsequently, a position-specific scoring matrix is built using PSI-BLAST, ultimately allowing distant relationships between the query sequence and the sequences in the database to be discovered and detailed.


Performance Evaluation


Both accuracy and running time were tested, with successful outcomes. It is worth noting, however, that a weak pattern results in a longer running time, because more initial matches are found and so more extensions need to be performed.




The authors highlight that proteins belonging to the same family may not all contain highly conserved motifs, and that PHI-BLAST can help to discover novel and distant homologous sequences. They conclude by mentioning their intention to create a similar tool for DNA sequences (for DNA queries) and for translated DNA sequences (for protein queries), and they propose to change the gap penalty costs currently in use.





PhDomics by Fatima