Suffix Tree: The Basics

10 Jan 2017

 

This post is a summary of Chapter 5.2 of Prof. Dan Gusfield's book on Algorithms on Strings, Trees and Sequences.

  • A suffix tree T(S) for a string S of length|S| = n is a tree with n leaves, labelled 0 to n - 1.

  • It is a rooted tree and its edges are directed.

  • Each internal node (excludes root and leaf nodes) has at least two children.

  • Each edge in T(S) is labelled with a non-empty substring of S.

  • Given an internal node (or vertex) v, no two edges originating at v can share the same first character in their labels.

  • Concatenation of the edge labels from the root to any leaf node i produces the suffix of S beginning at position i in S, that is, S[i ... n - 1].

  • Therefore, the label of a path, from the root to any node, is the concatenation of the labels of the edges in that path, in order.

  • The path-label of a node v is the label of the path from the root to node v.

  • The string-depth of any node v in T is the number of characters in the path-label of v.

Below is a nice and simple visualisation of a suffix tree for the word banana. Ignore the dotted lines for now!

  • Notice that all path-labels for leaf nodes end in $. This is because the suffix tree is actually built on the string banana$. If one suffix of a string S is a prefix of another suffix then the path of the former would not end at a leaf node. For example, suffix 3 of banana is ana, which is the prefix of suffix 1, anana. If we add the $ to banana, suffix 3 becomes ana$, and due to the $ we know this substring cannot occur anywhere else, eliminating the problem. The termination character used is a dollar sign, $, because it does not occur in the alphabet upon which our string is based (we assume here that our alphabet is the English alphabet). Sometimes, this new string, built by the concatenation of S and $, is referred to as S$.

  • A path ending in the middle of an edge (u,v), splits its path-label at some point. For example, the substring an of banana$ begins at the root, down the leftmost edge ending at node α to spell a, then halfway down the edge labelled na. This path label can be defined as (α, 1), as it reaches node α and then extends by one character.

 

[9]

Share on Facebook
Share on Twitter
Share on LinkedIn
Please reload

Please reload

Related Posts
PhDomics by Fatima