suf_tree

download suf_tree

of 6

Transcript of suf_tree

  • 8/8/2019 suf_tree

    1/6

    Suffix tree 1

    Suffix tree

    Suffix tree for the string BANANA. Each substring is terminated

    with special character $. The six paths from the root to a leaf (shown

    as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$,

    ANANA$ and BANANA$. The numbers in the boxes give the start

    position of the corresponding suffix. Suffix links drawn dashed.

    In computer science, a suffix tree (also called PAT

    tree or, in an earlier form, position tree) is a data

    structure that presents the suffixes of a given string in a

    way that allows for a particularly fast implementation

    of many important string operations.

    The suffix tree for a string is a tree whose edges are

    labeled with strings, such that each suffix of

    corresponds to exactly one path from the tree's root to a

    leaf. It is thus a radix tree (more specifically, a Patricia

    trie) for the suffixes of .

    Constructing such a tree for the string takes time

    and space linear in the length of . Once constructed,

    several operations can be performed quickly, for

    instance locating a substring in , locating a substring

    if a certain number of mistakes are allowed, locating

    matches for a regular expression pattern etc. Suffix

    trees also provided one of the first linear-time solutions

    for the longest common substring problem. These

    speedups come at a cost: storing a string's suffix tree

    typically requires significantly more space than storing

    the string itself.

    History

    The concept was first introduced as a position tree by Weiner in 1973[1]

    in a paper which Donald Knuth

    subsequently characterized as "Algorithm of the Year 1973". The construction was greatly simplified by McCreight

    in 1976[2]

    , and also by Ukkonen in 1995[3]

    [4]

    . Ukkonen provided the first linear-time online-construction of suffix

    trees, now known as Ukkonen's algorithm.

    Definition

    The suffix tree for the string of length is defined as a tree such that ([5]

    page 90):

    the paths from the root to the leaves have a one-to-one relationship with the suffixes of ,

    edges spell non-empty strings,

    and all internal nodes (except perhaps the root) have at least two children.

    Since such a tree does not exist for all strings, is padded with a terminal symbol not seen in the string (usually

    denoted $). This ensures that no suffix is a prefix of another, and that there will be leaf nodes, one for each of the

    suffixes of . Since all internal non-root nodes are branching, there can be at most n 1 such nodes, and

    n + (n 1) + 1 = 2n nodes in total (n leaves, n 1 internal nodes, 1 root).

    Suffix links are a key feature for linear-time construction of the tree. In a complete suffix tree, all internal non-root

    nodes have a suffix link to another internal node. If the path from the root to a node spells the string , where

    is a single character and is a string (possibly empty), it has a suffix link to the internal node representing . Seefor example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used

    in some algorithms running on the tree.

    http://en.wikipedia.org/w/index.php?title=Ukkonen%27s_algorithmhttp://en.wikipedia.org/w/index.php?title=Donald_Knuthhttp://en.wikipedia.org/w/index.php?title=Longest_common_substring_problemhttp://en.wikipedia.org/w/index.php?title=Regular_expressionhttp://en.wikipedia.org/w/index.php?title=Substringhttp://en.wikipedia.org/w/index.php?title=Patricia_triehttp://en.wikipedia.org/w/index.php?title=Patricia_triehttp://en.wikipedia.org/w/index.php?title=Radix_treehttp://en.wikipedia.org/w/index.php?title=Tree_%28data_structure%29http://en.wikipedia.org/w/index.php?title=String_%28computer_science%29http://en.wikipedia.org/w/index.php?title=Suffix_%28computer_science%29http://en.wikipedia.org/w/index.php?title=Computer_sciencehttp://en.wikipedia.org/w/index.php?title=File:Suffix_tree_BANANA.svg
  • 8/8/2019 suf_tree

    2/6

    Suffix tree 2

    Functionality

    A suffix tree for a string of length can be built in time, if the letters come from an alphabet of integers

    in a polynomial range (in particular, this is true for constant-sized alphabets).[6]

    For larger alphabets, the running

    time is dominated by first sorting the letters to bring them into a range of size ; in general, this takes

    time. The costs below are given under the assumption that the alphabet is constant.

    Assume that a suffix tree has been built for the string of length , or that a generalised suffix tree has been built

    for the set of strings of total length . You can:

    Search for strings:

    Check if a string of length is a substring in time ([5]

    page 92).

    Find the first occurrence of the patterns of total length as substrings in time.

    Find all occurrences of the patterns of total length as substrings in time ([5]

    page 123). Search for a regular expressionP in time expected sublinear in (

    [7]).

    Find for each suffix of a pattern , the length of the longest match between a prefix of and a

    substring in in time ([5]

    page 132). This is termed the matching statistics for . Find properties of the strings:

    Find the longest common substrings of the string and in time ([5]

    page 125).

    Find all maximal pairs, maximal repeats or supermaximal repeats in time ([5]

    page 144).

    Find the LempelZiv decomposition in time ([5]

    page 166).

    Find the longest repeated substrings in time.

    Find the most frequently occurring substrings of a minimum length in time.

    Find the shortest strings from that do not occur in , in time, if there are such strings.

    Find the shortest substrings occurring only once in time.

    Find, for each , the shortest substrings of not occurring elsewhere in in time.

    The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in time ([5]

    chapter 8). You can then also:

    Find the longest common prefix between the suffixes and in ([5]

    page 196).

    Search for a patternP of length m with at most kmismatches in time, wherez is the number of hits

    ([5]

    page 200).

    Find all maximal palindromes in ([5]

    page 198), or time if gaps of length are allowed, or

    if mismatches are allowed ([5]

    page 201).

    Find all tandem repeats in , and k-mismatch tandem repeats in ([5]

    page 204).

    Find the longest substrings common to at least strings in for in time ([5]

    page

    205).

    http://en.wikipedia.org/w/index.php?title=Longest_common_substring_problemhttp://en.wikipedia.org/w/index.php?title=Tandem_repeatshttp://en.wikipedia.org/w/index.php?title=Palindromehttp://en.wikipedia.org/w/index.php?title=Lowest_common_ancestorhttp://en.wikipedia.org/w/index.php?title=Longest_repeated_substring_problemhttp://en.wikipedia.org/w/index.php?title=Lempel%E2%80%93Zivhttp://en.wikipedia.org/w/index.php?title=Maximal_pairhttp://en.wikipedia.org/w/index.php?title=Longest_common_substring_problemhttp://en.wikipedia.org/w/index.php?title=Sublinearhttp://en.wikipedia.org/w/index.php?title=Regular_expressionhttp://en.wikipedia.org/w/index.php?title=Generalised_suffix_treehttp://en.wikipedia.org/w/index.php?title=Sorting_algorithm
  • 8/8/2019 suf_tree

    3/6

    Suffix tree 3

    Applications

    Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search,

    computational biology and other application areas.[8]

    Primary applications include:[8]

    String search, in O(m) complexity, where m is the length of the sub-string (but with initial O(n) time required to

    build the suffix tree for the string)

    Finding the longest repeated substring

    Finding the longest common substring

    Finding the longest palindrome in a string

    Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which

    can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered

    their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can

    be used for the sorting stage of the BurrowsWheeler transform. Variants of the LZW compression schemes use

    suffix trees (LZSS). A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some

    search engines (first introduced in[9]

    ).

    Implementation

    If each node and edge can be represented in space, the entire tree can be represented in space. The

    total length of all the strings on all of the edges in the tree is , but each edge can be stored as the position and

    length of a substring of S, giving a total space usage of computer words. The worst-case space usage of a

    suffix tree is seen with a fibonacci word, giving the full nodes.An important choice when making a suffix tree implementation is the parent-child relationships between nodes. The

    most common is using linked lists called sibling lists. Each node has pointer to its first child, and to the next node in

    the child list it is a part of. Hash maps, sorted/unsorted arrays (with array doubling), and balanced search trees may

    also be used, giving different running time properties. We are interested in:

    The cost of finding the child on a given character.

    The cost of inserting a child.

    The cost of enlisting all children of a node (divided by the number of children in the table below).

    Let be the size of the alphabet. Then you have the following costs:

    Lookup Insertion Traversal

    Sibling lists / unsorted arrays

    Hash maps

    Balanced search tree

    Sorted arrays

    Hash maps + sibling lists

    Note that the insertion cost is amortised, and that the costs for hashing are given perfect hashing.

    The large amount of information in each edge and node makes the suffix tree very expensive, consuming about ten

    to twenty times the memory size of the source text in good implementations. The suffix array reduces this

    requirement to a factor of four, and researchers have continued to find smaller indexing structures.

    http://en.wikipedia.org/w/index.php?title=Suffix_arrayhttp://en.wikipedia.org/w/index.php?title=Self-balancing_binary_search_treehttp://en.wikipedia.org/w/index.php?title=Dynamic_arrayhttp://en.wikipedia.org/w/index.php?title=Array_data_structurehttp://en.wikipedia.org/w/index.php?title=Hash_maphttp://en.wikipedia.org/w/index.php?title=Linked_listhttp://en.wikipedia.org/w/index.php?title=Fibonacci_wordhttp://en.wikipedia.org/w/index.php?title=Data_clusteringhttp://en.wikipedia.org/w/index.php?title=Suffix_tree_clusteringhttp://en.wikipedia.org/w/index.php?title=LZSShttp://en.wikipedia.org/w/index.php?title=LZWhttp://en.wikipedia.org/w/index.php?title=Burrows%E2%80%93Wheeler_transformhttp://en.wikipedia.org/w/index.php?title=Data_compressionhttp://en.wikipedia.org/w/index.php?title=Proteinhttp://en.wikipedia.org/w/index.php?title=DNAhttp://en.wikipedia.org/w/index.php?title=Bioinformaticshttp://en.wikipedia.org/w/index.php?title=Palindromehttp://en.wikipedia.org/w/index.php?title=String_search%23Index_methods
  • 8/8/2019 suf_tree

    4/6

    Suffix tree 4

    External construction

    Suffix trees quickly outgrow the main memory on standard machines for sequence collections in the order of

    gigabytes. As such, their construction calls for external memory approaches.

    There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach et al.[10]

    is

    theoretically optimal, with an I/Os complexity equal to that of sorting. However, as discussed for example in[11]

    , the

    overall intricacy of this algorithm has prevented, so far, its practical implementation.

    On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few)

    GB/hours. The state of the art methods are TDD[12]

    , TRELLIS[13]

    , DiGeST[14]

    , and B^2ST[15]

    .

    TDD and TRELLIS scale up to the entire human genomeapproximately 3GBresulting in a disk-based suffix

    tree of a size in the tens of gigabytes[12]

    ,[13]

    . However, these methods cannot handle efficiently collections of

    sequences exceeding 3GB[14]

    . DiGeST performs significantly better and is able to handle collections of sequences

    in the order of 6GB in about 6 hours[14]

    . The source code and documentation for the latter is available from[16]

    .

    All these methods can efficiently build suffix trees for the case when the tree does not fit in main memory, but the

    input does. The most recent method, B2ST[15]

    , scales to handle inputs that do not fit in main memory.

    See also

    Suffix array

    References

    [1] P. Weiner (1973). "Linear pattern matching algorithm". 14th Annual IEEE Symposium on Switching and Automata Theory. pp. 111.

    [2] Edward M. McCreight (1976). "A Space-Economical Suffix Tree Construction Algorithm" (http://doi.acm.org/10.1145/321941.321946).

    Journal of the ACM23 (2): 262272. doi:10.1145/321941.321946. .

    [3] E. Ukkonen (1995). "On-line construction of suffix trees" (http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf).Algorithmica14

    (3): 249260. doi:10.1007/BF01206331. .

    [4] R. Giegerich and S. Kurtz (1997). "From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction"(http://www. zbh. uni-hamburg.de/staff/kurtz/papers/GieKur1997.pdf).Algorithmica19 (3): 331353. doi:10.1007/PL00009177. .

    [5] Gusfield, Dan (1999) [1997].Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology . USA: Cambridge

    University Press. ISBN 0-521-58519-8.

    [6] Martin Farach (1997). "Optimal suffix tree construction with large alphabets" (ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/

    TechReports/1996/96-48.ps.gz). . pp. 137143. .

    [7] Ricardo A. Baeza-Yates and Gaston H. Gonnet (1996). "Fast text searching for regular expressions or automaton searching on tries".Journal

    of the ACM(ACM Press) 43 (6): 915936. doi:10.1145/235809.235810.

    [8] Allison, L.. "Suffix Trees" (http://www.allisons.org/ll/AlgDS/Tree/Suffix/). . Retrieved 2008-10-14.

    [9] Oren Zamir and Oren Etzioni (1998). "Web document clustering: a feasibility demonstration". SIGIR '98: Proceedings of the 21st annual

    international ACM SIGIR conference on Research and development in information retrieval. ACM. pp. 4654.

    [10] Martin Farach-Colton, Paolo Ferragina, S. Muthukrishnan (2000). "On the sorting-complexity of suffix tree construction.".J. Acm 47(6)47

    (6): 9871011. doi:10.1145/355541.355547.

    [11] Smyth, William (2003). Computing Patterns in Strings. Addison-Wesley.

    [12] Sandeep Tata, Richard A. Hankins, and Jignesh M. Patel (2003). "Practical Suffix Tree Construction". VLDB '03: Proceedings of the 30th

    International Conference on Very Large Data Bases. Morgan Kaufmann. pp. 3647.

    [13] Benjarath Phoophakdee and Mohammed J. Zaki (2007). "Genome-scale disk-based suffix tree indexing". SIGMOD '07: Proceedings of the

    ACM SIGMOD International Conference on Management of Data. ACM. pp. 833844.

    [14] Marina Barsky, Ulrike Stege, Alex Thomo, and Chris Upton (2008). "A new method for indexing genomes using on-disk suffix trees".

    CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM. pp. 649658.

    [15] Marina Barsky, Ulrike Stege, Alex Thomo, and Chris Upton (2009). "Suffix trees for very large genomic sequences". CIKM '09:

    Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM.

    [16] "The disk-based suffix tree for pattern search in sequenced genomes" (http://webhome.cs.uvic.ca/~mgbarsky/DIGEST_SEARCH). .

    Retrieved 2009-10-15.

    http://webhome.cs.uvic.ca/~mgbarsky/DIGEST_SEARCHhttp://en.wikipedia.org/w/index.php?title=Addison-Wesleyhttp://www.allisons.org/ll/AlgDS/Tree/Suffix/ftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1996/96-48.ps.gzftp://dimacs.rutgers.edu/pub/dimacs/TechnicalReports/TechReports/1996/96-48.ps.gzhttp://www.zbh.uni-hamburg.de/staff/kurtz/papers/GieKur1997.pdfhttp://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdfhttp://doi.acm.org/10.1145/321941.321946http://en.wikipedia.org/w/index.php?title=Suffix_array
  • 8/8/2019 suf_tree

    5/6

    Suffix tree 5

    External links

    Suffix Trees (http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix. htm) by Dr. Sartaj Sahni (CISE

    Department Chair at University of Florida)

    Suffix Trees (http://www.allisons.org/ll/AlgDS/Tree/Suffix/) by Lloyd Allison

    NIST's Dictionary of Algorithms and Data Structures: Suffix Tree (http://www.nist.gov/dads/HTML/

    suffixtree.html) suffix_tree (http://mila.cs.technion.ac.il/~yona/suffix_tree/) ANSI C implementation of a Suffix Tree

    libstree (http://www.cl.cam.ac.uk/~cpk25/libstree/), a generic suffix tree library written in C

    Tree::Suffix (http://search.cpan.org/dist/Tree-Suffix/), a Perl binding to libstree

    Strmat (http://www.cs.ucdavis.edu/~gusfield/strmat. html) a faster generic suffix tree library written in C

    (uses arrays instead of linked lists)

    SuffixTree (http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/) a Python binding to Strmat

    Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice (http://www.

    balkenhol.net/papers/t1043. pdf.gz), application of suffix trees in the BWT

    Theory and Practice of Succinct Data Structures (http://www.cs.helsinki.fi/group/suds/), C++

    implementation of a compressed suffix tree]

    Practical Algorithm Template Library (http://code.google.com/p/patl/), a C++ library with suffix tree

    implementation on PATRICIA trie, by Roman S. Klyujkov

    http://code.google.com/p/patl/http://www.cs.helsinki.fi/group/suds/http://www.balkenhol.net/papers/t1043.pdf.gzhttp://www.balkenhol.net/papers/t1043.pdf.gzhttp://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/http://www.cs.ucdavis.edu/~gusfield/strmat.htmlhttp://search.cpan.org/dist/Tree-Suffix/http://www.cl.cam.ac.uk/~cpk25/libstree/http://mila.cs.technion.ac.il/~yona/suffix_tree/http://www.nist.gov/dads/HTML/suffixtree.htmlhttp://www.nist.gov/dads/HTML/suffixtree.htmlhttp://www.allisons.org/ll/AlgDS/Tree/Suffix/http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm
  • 8/8/2019 suf_tree

    6/6

    Article Sources and Contributors 6

    Article Sources and ContributorsSuffix tree Source: http://en.wikipedia.org/w/index.php?oldid=396896972 Contributors: 12hugo34, Alfio, Andre.holzner, Andreas Kaufmann, AxelBoldt, Bbi5291, B cat, Beetstra, Blahma,

    Charles Matthews, Christian Kreibich, CobaltBlue, Cybercobra, David Eppstein, Dcoetzee, Delirium, Deselaers, Dionyziz, DmitriyV, Ffaarr, Garyzx, Giftlite, Heavyrain2408, Illya.havsiyevych,

    Jamelan, Jemfinch, Jhclark, Jogloran, Johnbibby, Kbh3rd, Leafman, Luismsgomes, MaxEnt, Mechonbarsa, Michael Hardy, NVar, Nealjc, Nils Grimsmo, Nux, Oleg Alexandrov, P0nc, R. S.

    Shaw, Requestion, RomanPszonka, Ronni1987, Ru.spider, Safek, Sho Uemura, Shoujun, Sky Attacker, Squash, Sundar, TheMandarin, TheTaxman, TripleF, Vecter, Wsloand, Xevior,

    Xodarap00, Xutaodeng, 60 anonymous edits

    Image Sources, Licenses and ContributorsImage:Suffix tree BANANA.svg Source: http://en.wikipedia.org/w/index.php?title=File:Suffix_tree_BANANA.svg License: Public Domain Contributors: User:Nux

    License

    Creative Commons Attribution-Share Alike 3.0 Unportedhttp://creativecommons.org/licenses/by-sa/3.0/

    http://creativecommons.org/licenses/by-sa/3.0/