Tries and Suffix Tries


  • 8/10/2019 Tries and Suffix Tries


    Tries

    A trie (pronounced "try") is a tree representing a collection of strings with one node per common prefix

    Smallest tree such that:

      Each edge is labeled with a character c ∈ Σ

      A node has at most one outgoing edge labeled c, for c ∈ Σ

      Each key is "spelled out" along some path starting at the root

    Natural way to represent either a set or a map where keys are strings

    http://xlinux.nist.gov/dads/HTML/prefix.html
    http://xlinux.nist.gov/dads/HTML/node.html

    Tries: example

    Checking for presence of a key P, where n = |P|, is O(n) time

    If total length of all keys is N, trie has O(N) nodes

    What about |Σ|?

    Depends how we represent outgoing edges. If we don't assume |Σ| is a small constant, it shows up in one or both bounds.

    [Figure: trie for the keys instant, internal and internet, holding the values 1, 2 and 3]
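
    The remark about |Σ| can be made concrete with a minimal sketch, assuming a DNA alphabet and made-up keys (SIGMA, INDEX and all helper names are illustrative, not from the slides). With array children each step is a single index lookup, but every node reserves |Σ| slots, so space is O(N|Σ|). With hash-map children the space follows the edges that actually exist, and each step takes expected constant time.

```python
SIGMA = "ACGT"  # assumed alphabet, for illustration only
INDEX = {c: i for i, c in enumerate(SIGMA)}

def add_dict(root, key):
    """Hash-map children: one entry per edge that actually exists."""
    cur = root
    for c in key:
        cur = cur.setdefault(c, {})

def add_array(root, key):
    """Array children: every node reserves len(SIGMA) slots."""
    cur = root
    for c in key:
        i = INDEX[c]
        if cur[i] is None:
            cur[i] = [None] * len(SIGMA)
        cur = cur[i]

def edges_dict(node):
    """Number of child entries stored across the dict-based trie."""
    return len(node) + sum(edges_dict(ch) for ch in node.values())

def slots_array(node):
    """Number of child slots reserved across the array-based trie."""
    return len(node) + sum(slots_array(ch) for ch in node if ch is not None)

keys = ["ACG", "ACT", "GA"]  # made-up keys
dict_root, array_root = {}, [None] * len(SIGMA)
for k in keys:
    add_dict(dict_root, k)
    add_array(array_root, k)

print(edges_dict(dict_root))    # 6 edges stored
print(slots_array(array_root))  # 28 slots reserved (7 nodes x 4 symbols)
```

    Both tries have the same 7 nodes; only the per-node cost differs, which is where |Σ| enters the bounds.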


    Tries: another example

    We can index T with a trie. The trie maps substrings to offsets where they occur

    substring  offset
    ac         4
    ag         8
    at         14
    cc         12
    cc         2
    ct         6
    gt         18
    gt         0
    ta         10
    tt         16

    [Figure: trie built from the substrings above, starting at the root; its leaves hold the offset lists 4; 8; 14; 12, 2; 6; 18, 0; 10; 16]
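
    As a rough illustration of a trie used as a map from substrings to offsets, the sketch below inserts the (substring, offset) pairs listed on this slide into a nested-dict trie. The function names and the "offsets" key are assumptions made for this example; since edge labels are single characters, the multi-character key cannot collide with an edge.

```python
def add(root, key, offset):
    """Walk/create the path for key, then append offset at its end node."""
    cur = root
    for c in key:
        cur = cur.setdefault(c, {})
    cur.setdefault("offsets", []).append(offset)

def query(root, key):
    """Return the offsets stored for key, or [] if key is absent."""
    cur = root
    for c in key:
        if c not in cur:
            return []          # fell off the trie
        cur = cur[c]
    return cur.get("offsets", [])

# (substring, offset) pairs as listed on the slide
pairs = [("ac", 4), ("ag", 8), ("at", 14), ("cc", 12), ("cc", 2),
         ("ct", 6), ("gt", 18), ("gt", 0), ("ta", 10), ("tt", 16)]

root = {}
for k, v in pairs:
    add(root, k, v)

print(query(root, "cc"))  # [12, 2]
print(query(root, "gt"))  # [18, 0]
print(query(root, "ga"))  # []
```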


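    Tries: implementation

    class TrieMap(object):
        """ Trie implementation of a map, associating keys (strings) with values """

        def __init__(self, kvs):
            self.root = {}
            for (k, v) in kvs:  # for each key (string)/value pair
                self.add(k, v)

        def add(self, k, v):
            """ Add a key-value pair """
            cur = self.root
            for c in k:  # for each character in the key
                if c not in cur:
                    cur[c] = {}  # if not there, make a new edge on c
                cur = cur[c]
            cur['value'] = v  # at the end of the path, store the value

        def query(self, k):
            """ Given key, return associated value or None """
            cur = self.root
            for c in k:
                if c not in cur:
                    return None  # key wasn't in the trie
                cur = cur[c]
            # get value, or None if there's no value at this node
            return cur.get('value')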

    Python example:

    http://nbviewer.ipython.org/6603619

    Tries: alternatives

    Tries aren't the only tree structure that can encode sets or maps with string keys. E.g. binary or ternary search trees.

    [Figure: ternary search tree for as, at, be, by, he, in, is, it, of, on, or, to]

    Example from: Bentley, Jon L., and Robert Sedgewick. "Fast algorithms for sorting and searching strings." Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1997
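
    To show what a ternary search tree looks like in code, here is a minimal sketch over the twelve words from the figure. The Node class and the insert/contains functions are illustrative names, keys are assumed to be non-empty, and the tree's shape depends on insertion order, so it need not match the figure.

```python
class Node:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None  # smaller / next char / larger
        self.is_key = False

def insert(node, key, i=0):
    """Insert key[i:] below node and return the (possibly new) subtree root."""
    ch = key[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, key, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, key, i)
    elif i + 1 < len(key):
        node.eq = insert(node.eq, key, i + 1)
    else:
        node.is_key = True
    return node

def contains(node, key):
    """Return True iff key was inserted (key must be non-empty)."""
    i = 0
    while node is not None:
        ch = key[i]
        if ch < node.ch:
            node = node.lo
        elif ch > node.ch:
            node = node.hi
        elif i + 1 < len(key):
            node = node.eq
            i += 1
        else:
            return node.is_key
    return False

words = "as at be by he in is it of on or to".split()
root = None
for w in words:
    root = insert(root, w)

print(contains(root, "it"))  # True
print(contains(root, "an"))  # False
```

    Each node holds one character and three children, so the per-node space does not depend on the alphabet size.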


    Indexing with suffixes

    Until now, our indexes have been based on extracting substrings from T

    A very different approach is to extract suffixes from T. This will lead us to some interesting and practical index data structures:

    Suffix Trie, Suffix Tree, Suffix Array, FM Index

    [Figure: one panel per structure, built over T = BANANA$. Recoverable content follows.]

    Sorted suffixes with their offsets:

    6  $
    5  A$
    3  ANA$
    1  ANANA$
    0  BANANA$
    4  NA$
    2  NANA$

    Sorted rotations:

    $BANANA
    A$BANAN
    ANA$BAN
    ANANA$B
    BANANA$
    NA$BANA
    NANA$BA
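
    The sorted suffixes and sorted rotations of BANANA$ in the figure can be reproduced with a few lines of Python. This is a naive sketch for illustration only (it sorts whole suffixes, which is not how practical indexes are built), and the variable names are assumptions.

```python
t = "BANANA$"  # '$' sorts before the letters in ASCII

# Offsets of the suffixes of t, in lexicographic order of the suffixes
suffix_array = sorted(range(len(t)), key=lambda i: t[i:])

# All rotations of t, in lexicographic order
rotations = sorted(t[i:] + t[:i] for i in range(len(t)))

for i in suffix_array:
    print(i, t[i:])
print(rotations)
```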


    Suffix trie

    Build a trie containing all suffixes of a text T

    T: GTTATAGCTGATCGCGGCGTAGCGG$

    Suffixes of T:

    GTTATAGCTGATCGCGGCGTAGCGG$
     TTATAGCTGATCGCGGCGTAGCGG$
      TATAGCTGATCGCGGCGTAGCGG$
       ATAGCTGATCGCGGCGTAGCGG$
        TAGCTGATCGCGGCGTAGCGG$
         AGCTGATCGCGGCGTAGCGG$
          GCTGATCGCGGCGTAGCGG$
           CTGATCGCGGCGTAGCGG$
            TGATCGCGGCGTAGCGG$
             GATCGCGGCGTAGCGG$
              ATCGCGGCGTAGCGG$
               TCGCGGCGTAGCGG$
                CGCGGCGTAGCGG$
                 GCGGCGTAGCGG$
                  CGGCGTAGCGG$
                   GGCGTAGCGG$
                    GCGTAGCGG$
                     CGTAGCGG$
                      GTAGCGG$
                       TAGCGG$
                        AGCGG$
                         GCGG$
                          CGG$
                           GG$
                            G$
                             $

    m(m+1)/2 chars in total for a text of length m
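
    The m(m+1)/2 figure is just the sum 1 + 2 + ... + m of the suffix lengths. A quick check on the example text, taking m to be the length of the string including its terminator (an assumption made for this sketch):

```python
t = "GTTATAGCTGATCGCGGCGTAGCGG$"
m = len(t)

# Total number of characters over all suffixes t[0:], t[1:], ..., t[m-1:]
total = sum(len(t[i:]) for i in range(m))

print(m, total, m * (m + 1) // 2)  # 26 351 351
```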


    Suffix trie

    T: abaaba    T$: abaaba$

    Each path from root to leaf represents a suffix; each suffix is represented by some path from root to leaf

    [Figure: suffix trie of abaaba$, edges labeled a, b and $, with the shortest (non-empty) suffix and the longest suffix marked]

    Would this still be the case if we hadn't added $?


    Suffix trie

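    T: abaaba

    Each path from root to leaf represents a suffix; each suffix is represented by some path from root to leaf

    Would this still be the case if we hadn't added $? No: without $, a suffix that is also a prefix of a longer suffix (here a, ba and aba) ends at an internal node rather than at a leaf

    [Figure: suffix trie of abaaba built without $, edges labeled a and b]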
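
    The effect of the terminator can be checked with a small sketch that builds the trie of all suffixes with and without $ and counts leaves. The helper names are illustrative, and the nested-dict representation is an assumption of this example.

```python
def build(t):
    """Trie (nested dicts) containing every suffix of t, exactly as given."""
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root

def leaves(node):
    """Number of leaves (nodes with no children) at or below node."""
    if not node:
        return 1
    return sum(leaves(child) for child in node.values())

def ends_at_leaf(root, s):
    """True iff the path spelling s (assumed present) ends at a leaf."""
    cur = root
    for c in s:
        cur = cur[c]
    return not cur

t = "abaaba"
with_term = build(t + "$")
without = build(t)

print(leaves(with_term))             # 7: one leaf per suffix of abaaba$
print(leaves(without))               # 3: only abaaba, baaba and aaba end at leaves
print(ends_at_leaf(without, "aba"))  # False: aba ends at an internal node
```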


    Suffix trie

    We can think of nodes as having labels, where the label spells out characters on the path from the root to the node

    [Figure: suffix trie of abaaba$; the node reached by following b, a, a from the root has label baa]


    Suffix trie

    How do we check whether a string S is a substring of T?

    Note: Each of T's substrings is spelled out along a path from the root. I.e., every substring is a prefix of some suffix of T.

    Start at the root and follow the edges labeled with the characters of S

    If we "fall off" the trie, i.e. there is no outgoing edge for the next character of S, then S is not a substring of T

    If we exhaust S without falling off, S is a substring of T

    [Figure: suffix trie of abaaba$]

    S = baa: Yes, it's a substring


    Suffix trie

    How do we check whether a string S is a suffix of T?

    Same procedure as for substring, but additionally check whether the final node in the walk has an outgoing edge labeled $

    [Figure: suffix trie of abaaba$]

    S = baa: Not a suffix
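
    The suffix check described above can be sketched as follows. The function names are illustrative, the trie is a nested dict built naively from T$, and S is assumed not to contain $.

```python
def build_suffix_trie(t):
    """Nested-dict trie of all suffixes of t + '$'."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root

def follow_path(root, s):
    """Node at the end of the path spelling s, or None if we fall off."""
    cur = root
    for c in s:
        if c not in cur:
            return None
        cur = cur[c]
    return cur

def is_suffix(root, s):
    """Substring walk, plus a check for an outgoing edge labeled $."""
    node = follow_path(root, s)
    return node is not None and "$" in node

root = build_suffix_trie("abaaba")
print(is_suffix(root, "baa"))  # False: a substring, but no $ edge follows it
print(is_suffix(root, "aba"))  # True
```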


    Suffix trie

    How do we count the number of times a string S occurs as a substring of T?

    Follow path corresponding to S. Either we fall off, in which case answer is 0, or we end up at node n and the answer = # of leaf nodes in the subtree rooted at n.

    Leaves can be counted with depth-first traversal.

    [Figure: suffix trie of abaaba$ with the node n marked]
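
    A minimal sketch of the counting procedure, assuming a nested-dict suffix trie of T$ and a non-empty query S (names are illustrative). The depth-first traversal uses an explicit stack so that deep tries do not hit Python's recursion limit.

```python
def build_suffix_trie(t):
    """Nested-dict trie of all suffixes of t + '$'."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root

def count_occurrences(root, s):
    """Number of leaves below the node reached by spelling s (0 if we fall off)."""
    node = root
    for c in s:
        if c not in node:
            return 0
        node = node[c]
    leaves, stack = 0, [node]
    while stack:                      # depth-first traversal
        cur = stack.pop()
        if not cur:                   # no children: a leaf
            leaves += 1
        else:
            stack.extend(cur.values())
    return leaves

root = build_suffix_trie("abaaba")
print(count_occurrences(root, "aba"))  # 2 (offsets 0 and 3)
print(count_occurrences(root, "a"))    # 4
print(count_occurrences(root, "bb"))   # 0
```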


    Suffix trie

    How do we find the longest repeated substring of T?

    Find the deepest node with more than one child

    [Figure: suffix trie of abaaba$]
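
    The "deepest node with more than one child" rule can be sketched like this, again assuming a nested-dict suffix trie of T$ with illustrative names. When several nodes tie for deepest, which one is returned depends on traversal order.

```python
def build_suffix_trie(t):
    """Nested-dict trie of all suffixes of t + '$'."""
    t += "$"
    root = {}
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            cur = cur.setdefault(c, {})
    return root

def longest_repeated_substring(root):
    """Label of a deepest node that has more than one child."""
    best = ""
    stack = [(root, "")]              # (node, label spelled from the root)
    while stack:
        node, label = stack.pop()
        if len(node) > 1 and len(label) > len(best):
            best = label
        for c, child in node.items():
            stack.append((child, label + c))
    return best

print(longest_repeated_substring(build_suffix_trie("abaaba")))  # aba
print(longest_repeated_substring(build_suffix_trie("aaaa")))    # aaa (occurrences may overlap)
```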


    Suffix trie: implementation

    class SuffixTrie(object):

        def __init__(self, t):
            """ Make suffix trie from t """
            t += '$'  # special terminator symbol
            self.root = {}
            for i in range(len(t)):  # for each suffix
                cur = self.root
                for c in t[i:]:  # for each character in i'th suffix
                    if c not in cur:
                        cur[c] = {}  # add outgoing edge if necessary
                    cur = cur[c]

        def followPath(self, s):
            """ Follow path given by characters of s; return node at
                end of path, or None if we fall off """
            cur = self.root
            for c in s:
                if c not in cur:
                    return None
                cur = cur[c]
            return cur

        def hasSubstring(self, s):
            """ Return true iff s appears as a substring of t """
            return self.followPath(s) is not None


    Suffix trie

    How many nodes does the suffix trie have?

    Is there a class of string where the number of suffix trie nodes grows linearly with m?

    Yes: e.g. a string of m a's in a row (a^m)

    T = aaaa

    [Figure: suffix trie of aaaa$, a chain of a edges with a $ edge leaving every node on the chain]

    1 root

    m nodes with incoming a edge

    m + 1 nodes with incoming $ edge

    2m + 2 nodes
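
    The 2m + 2 count can be checked empirically with a small sketch that counts nodes while building the suffix trie of T$. The function name is an assumption of this example, and m here is the length of T without the terminator.

```python
def count_suffix_trie_nodes(t):
    """Number of nodes (root included) in the suffix trie of t + '$'."""
    t += "$"
    root = {}
    nodes = 1                         # the root
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            if c not in cur:
                cur[c] = {}
                nodes += 1            # one new node per new edge
            cur = cur[c]
    return nodes

for m in (1, 4, 10, 50):
    print(m, count_suffix_trie_nodes("a" * m), 2 * m + 2)
```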


    Suffix trie

    Is there a class of string where the number of suffix trie nodes grows with m^2?

    Yes: a^n b^n

    1 root

    n nodes along "b chain," right

    n nodes along "a chain," middle

    n chains of n "b" nodes hanging off each "a chain" node

    2n + 1 $ leaves (not shown)

    n^2 + 4n + 2 nodes, where m = 2n

    Figure & example by Carl Kingsford
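
    The n^2 + 4n + 2 count for a^n b^n can be verified the same way, by counting nodes while building the suffix trie of T$. This is a sketch with an illustrative function name; the text length m = 2n excludes the terminator.

```python
def count_suffix_trie_nodes(t):
    """Number of nodes (root included) in the suffix trie of t + '$'."""
    t += "$"
    root = {}
    nodes = 1                         # the root
    for i in range(len(t)):
        cur = root
        for c in t[i:]:
            if c not in cur:
                cur[c] = {}
                nodes += 1            # one new node per new edge
            cur = cur[c]
    return nodes

for n in (1, 2, 3, 10, 20):
    t = "a" * n + "b" * n
    print(n, count_suffix_trie_nodes(t), n * n + 4 * n + 2)
```

    Since m = 2n, the count is m^2/4 + 2m + 2, which is quadratic in the text length.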


    Suffix trie: upper bound on size

    Could worst-case # nodes be worse than O(m^2)?

    [Figure: outline of a suffix trie with the root and the deepest leaf marked]

    Max # nodes from top to bottom = length of longest suffix + 1 = m + 1

    Max # nodes from left to right = max # distinct substrings of any length ≤ m

    At most m + 1 levels with at most m nodes each, so O(m^2) is worst case


    Suffix trie: actual growth

    [Figure: # suffix trie nodes (y-axis, 0 to 250000) against length of the prefix over which the suffix trie was built (x-axis, 0 to 500); curves: m^2, actual, m]

    Built suffix tries for the first 500 prefixes of the lambda phage virus genome

    Black curve shows how # nodes increases with prefix length