Iliopoulos, C.S., Mohamed, M., Pissis, S.P. and Vayani, F., 2018, October. Maximal Motif Discovery in a Sliding Window. In International Symposium on String Processing and Information Retrieval (pp. 191-205). Springer, Cham.

β

Motifs are relatively short sequences that are biologically significant, and their discovery in molecular sequences is a well-researched subject. A don’t care is a special letter that matches every letter in the alphabet. Formally, a motif is a sequence of letters of the alphabet and don’t care letters. A motif πΜ_{π,π} that occurs at least k times in a sequence is maximal if it cannot be extended (to the left or right) nor can it be specialised (that is, its π′ ≤ π′ ≤ d don’t cares cannot be replaced with letters from the alphabet) without reducing its number of occurrences. Here we present a new dynamic data structure, and the first on-line algorithm, to discover all maximal motifs in a sliding window of length β on a sequence x of length n. Read the full paper here.

β

β

Grossi, R., Iliopoulos, C.S., Liu, C., Pisanti, N., Pissis, S.P., Retha, A., Rosone, G., Vayani, F. and Versari, L., 2017. On-line pattern matching on similar texts. In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 78). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

β

Pattern matching on a set of similar texts has received much attention, especially recently, mainly due to its application in cataloguing human genetic variation. In particular, many different algorithms have been proposed for the off-line version of this problem; that is, constructing a compressed index for a set of similar texts in order to answer pattern matching queries efficiently. However, the on-line, more fundamental, version of this problem is a rather undeveloped topic. Solutions to the on-line version can be beneficial for a number of reasons; for instance, efficient on-line solutions can be used in combination with partial indexes as practical trade-offs. We make here an attempt to close this gap via proposing two efficient algorithms for this problem. Notably, one of the algorithms requires time linear in the size of the texts' representation, for short patterns. Furthermore, experimental results confirm our theoretical findings in practical terms. Read the full paper here.

β

β

β

The information that can be inferred or predicted from knowing the genomic sequence of an organism is astonishing. String algorithms are critical to this process. This paper provides an overview of two particular problems that arise during computational molecular biology research, and recent algorithmic developments in solving them. Read the full paper here.

β

β

β

Sequence comparison is a fundamental step in many important tasks in bioinformatics; from phylogenetic reconstruction to the reconstruction of genomes. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are

generally computed through sequence alignment techniques. As circular molecular structure is a common phenomenon in nature, a caveat of the adaptation of alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be applied effectively and computed efficiently for circular sequence comparison. Experimental results, using real DNA, RNA, and protein sequences as well as synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining an accuracy very competitive to the state of the art. Read the full paper here.β

β

β

β

DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilise de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats, and filter them out; one such complex subgraph class is a so-called superbubble. In this paper, we propose a linear time algorithm to detect all superbubbles in a directed acyclic graph with n vertices and m (directed) edges, improving the best-known quasilinear time algorithm by Sung et al. Read the full paper here.

β

β

β

A degenerate symbol over an alphabet Σ is a non-empty subset of Σ, and a sequence of such symbols is a degenerate string. A degenerate string is said to be conservative if its number of non-solid symbols is upper-bounded by a fixed positive constant k. We consider here the matching problem of conservative degenerate strings and present the first linear-time algorithm that can find, for given degenerate strings P and T of total length n containing k non-solid symbols in total, the occurrences of P in T in O(nk) time. Read the full paper here.

β

β

β

Sequence comparison is a fundamental step in many important tasks in bioinformatics. Traditional algorithms for measuring approximation in sequence comparison are based on the notions of distance or similarity, and are generally computed through sequence alignment techniques. As circular genome structure is a common phenomenon in nature, a caveat of specialised alignment techniques for circular sequence comparison is that they are computationally expensive, requiring from super-quadratic to cubic time in the length of the sequences. In this paper, we introduce a new distance measure based on q-grams, and show how it can be computed efficiently for circular sequence comparison. Experimental results, using real and synthetic data, demonstrate orders-of-magnitude superiority of our approach in terms of efficiency, while maintaining an accuracy very competitive to the state of the art. Read the full paper here.

β

β

β

Multiple sequence alignment is a core computational task in bioinformatics and has been extensively studied over the past decades. This computation requires an implicit assumption on the input data: the left- and right-most position for each sequence is relevant. However, this is not the case for circular structures; for instance, MtDNA. Efforts have been made to address this issue but it is far from being solved. We have very recently introduced a fast algorithm for approximate circular string matching (Barton et al., Algo Mol Biol, 2014). Here, we first show how to extend this algorithm for approximate circular dictionary matching; and, then, apply this solution with agglomerative hierarchical clustering to find a sufficiently good rotation for each sequence. Furthermore, we propose an alternative method that is suitable for more divergent sequences. We implemented these methods in BEAR, a programme for improving multiple circular sequence alignment. Experimental results, using real and synthetic data, show the high accuracy and efficiency of these new methods in terms of the inferred likelihood-based phylogenies. Read the full paper here.

See my DBLP page here.

Please note that alphabetical order of authorship is the norm in my field.