Genome-scale Disk-based Suffix Tree Indexing

GENOME-SCALE DISK-BASED SUFFIX TREE INDEXING
Phoophakdee and Zaki

OUTLINE
Suffix tree introduction
Application in bioinformatics
Trellis
Trellis performance
Conclusion

EXAMPLE SUFFIX TREE
Sequence: ACGACG$
What are suffix links?
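The example sequence can be explored with a small Python sketch. It lists the suffixes of ACGACG$ and derives the suffix links between internal nodes by brute force, assuming the usual definition: the internal node whose path label is xα links to the node whose path label is α. The function names are illustrative and not taken from the slides, and the brute-force enumeration is only meant for tiny inputs.

```python
from collections import defaultdict

def suffixes(text):
    """All suffixes of text, longest first."""
    return [text[i:] for i in range(len(text))]

def internal_node_labels(text):
    """Path labels of internal (branching) nodes, found by brute force.

    A non-empty substring labels an internal node when it is followed by
    at least two distinct characters somewhere in the text.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            # Substring text[i:j] is followed by the character text[j].
            followers[text[i:j]].add(text[j])
    return sorted(label for label, chars in followers.items() if len(chars) >= 2)

def suffix_links(text):
    """Map each internal node label to its suffix-link target.

    The empty string stands for the root.
    """
    return {label: label[1:] for label in internal_node_labels(text)}

if __name__ == "__main__":
    for s in suffixes("ACGACG$"):
        print(s)
    print(suffix_links("ACGACG$"))
```

For ACGACG$ the internal nodes are ACG, CG and G, so the links form the chain ACG, CG, G, root.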

SUFFIX TREE RUNTIME
Time complexity
  Construction of suffix tree: O(n) time and space, where n is the size of the text being searched
  Substring search: O(m) time, where m is the size of the substring/search pattern
Knuth-Morris-Pratt and Boyer-Moore algorithm comparison
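The O(m) search bound can be illustrated with a minimal Python sketch. It uses a naive suffix trie (one character per edge) built by inserting every suffix, which takes O(n^2) time and space. This is an assumption made for brevity and is not the linear-time construction the slide refers to. The lookup, however, walks at most m edges regardless of the text length, which is the property of interest. The names build_suffix_trie and contains are hypothetical.

```python
def build_suffix_trie(text):
    """Naive suffix trie as nested dicts: one edge per character.

    Quadratic in len(text); used here only to show the search walk.
    """
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Return True if pattern occurs in the indexed text.

    Follows one edge per pattern character, so the cost is O(m).
    """
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

if __name__ == "__main__":
    trie = build_suffix_trie("ACGACG$")
    print(contains(trie, "GAC"))  # substring of the text
    print(contains(trie, "GG"))   # not a substring
```

A compressed suffix tree stores whole substrings on its edges instead of single characters, but the search follows the same top-down walk.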

APPLICATION IN BIOINFORMATICS
Database search
Exact matching
Approximate matching*
Longest common substring
Genome alignment*
Structural motifs*
Tandem repeats*
Sequence comparison

PROBLEMS WITH GENOME-SCALE SUFFIX TREES
Efficient O(n) suffix tree generating algorithms
  Tree must fit entirely in main memory
  e.g. Ukkonen’s algorithm
Genomes are very large
  Human genome is 3 Gbp (0.75 GB)
  Data structure no longer able to fit in memory
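The 0.75 GB figure follows from storing each of the four DNA bases in 2 bits. The Python sketch below reproduces that arithmetic and adds a rough index-size estimate. The value of 20 bytes per indexed character is purely an illustrative assumption, not a number from the slides, since the real overhead depends on the suffix tree implementation.

```python
def packed_size_bytes(num_bases, bits_per_base=2):
    """Size of a DNA sequence when each base is packed into bits_per_base bits."""
    return num_bases * bits_per_base / 8

def index_size_bytes(num_bases, bytes_per_base):
    """Rough index size given an assumed per-character overhead."""
    return num_bases * bytes_per_base

if __name__ == "__main__":
    human_genome = 3_000_000_000  # 3 Gbp, as stated on the slide

    print(packed_size_bytes(human_genome) / 1e9, "GB of packed sequence")

    # ASSUMPTION: 20 bytes per base is a hypothetical overhead chosen
    # only to show why the tree outgrows main memory.
    assumed_overhead = 20
    print(index_size_bytes(human_genome, assumed_overhead) / 1e9, "GB of index")
```

Even with a much smaller per-character overhead, the tree is many times larger than the packed sequence, which is why an in-memory construction stops being practical at this scale.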

WHAT TRELLIS SOLVES
Prevents data skew in prefix partitioning
  Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory.
  Skew comes from the non-uniform distribution of the alphabet/DNA
Efficient disk-based implementation
  Functions under low memory constraints
  Efficient disk I/O usage
Able to recover suffix links
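The skew problem can be seen in a short Python sketch that groups suffixes by a fixed-length prefix and counts how many land in each partition. The sequence used is a made-up, deliberately skewed example, and fixed_prefix_partition_sizes is a hypothetical helper, not part of Trellis.

```python
from collections import Counter

def fixed_prefix_partition_sizes(text, k):
    """Count how many suffixes start with each length-k prefix.

    Suffixes shorter than k are ignored in this simplified sketch.
    """
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

if __name__ == "__main__":
    skewed = "A" * 9 + "CGT"  # made-up sequence dominated by one base
    sizes = fixed_prefix_partition_sizes(skewed, 2)
    for prefix, count in sizes.most_common():
        print(prefix, count)
```

With k = 2 the partition for AA holds 8 of the 11 counted suffixes while AC, CG and GT hold one each, so a single partition dominates and may be the one that fails to fit in memory.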

TRELLIS STEPS
Prefix Creation Phase
Partitioning Phase
Merging Phase
Suffix Link Recovery Phase (Optional)

TRELLIS OVERVIEW

MERGING PHASE

THRESHOLD (t)
Determines partition of sequence
  Suffix subtree fits into memory during partitioning phase.
Determines cutoff for prefix set inclusion
  Recombined prefixed suffix subtree will fit entirely into memory during merging phase.
Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase
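One way to read the threshold's role as a cutoff for prefix set inclusion is sketched below in Python. A prefix is accepted once it occurs at most t times, and any prefix that occurs more often is extended by one character and checked again, so frequent prefixes become longer. This is a simplified interpretation of the slide, not the authors' implementation. The function names are hypothetical, and the occurrence counting rescans the text for clarity rather than speed.

```python
def count_occurrences(text, prefix):
    """Number of (possibly overlapping) occurrences of prefix in text."""
    count = 0
    start = text.find(prefix)
    while start != -1:
        count += 1
        start = text.find(prefix, start + 1)
    return count

def variable_length_prefixes(text, t):
    """Choose variable-length prefixes whose frequency is at most t.

    Assumes text ends with a unique terminator such as '$', so that
    every suffix is covered by exactly one accepted prefix.
    """
    if t < 1:
        raise ValueError("threshold t must be at least 1")
    alphabet = sorted(set(text))
    accepted = {}
    frontier = list(alphabet)
    while frontier:
        prefix = frontier.pop()
        freq = count_occurrences(text, prefix)
        if freq == 0:
            continue  # prefix never occurs, nothing to partition
        if freq <= t:
            accepted[prefix] = freq  # small enough for one partition
        else:
            # Too frequent: split by extending with every character.
            frontier.extend(prefix + ch for ch in alphabet)
    return accepted

if __name__ == "__main__":
    prefixes = variable_length_prefixes("A" * 9 + "CGT$", t=3)
    for prefix in sorted(prefixes):
        print(prefix, prefixes[prefix])
```

In the skewed example, the rare bases keep one-character prefixes while the run of A's is split into progressively longer prefixes, and no accepted prefix covers more than t suffixes.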

TRELLIS OVERVIEW

PERFORMANCE
O(n^2) time and O(n) space (where n is sequence length)
Comparison to TDD
  Currently only other algorithm that scales up to genome level
  Same time complexity
  Does not calculate suffix links

SUFFIX TREE CONSTRUCTION

QUERY TIMES

CONCLUSION
Efficient disk-based suffix tree generation that works well with limited memory
Suffix links are recoverable
Future work
  Extend to larger alphabets
  Buffer input sequence
  Parallelize partitioning and merging
