Genome-scale Disk-based Suffix Tree Indexing

GENOME-SCALE DISK-BASED SUFFIX TREE INDEXING Phoophakdee and Zaki


Transcript of Genome-scale Disk-based Suffix Tree Indexing

Page 1: Genome-scale Disk-based Suffix Tree Indexing

GENOME-SCALE DISK-BASED SUFFIX TREE INDEXING
Phoophakdee and Zaki

Page 2: Genome-scale Disk-based Suffix Tree Indexing

OUTLINE
Suffix Tree introduction
Application in Bioinformatics
Trellis
Trellis performance
Conclusion

Page 3: Genome-scale Disk-based Suffix Tree Indexing

EXAMPLE SUFFIX TREE
Sequence: ACGACG$
What are suffix links?

Page 4: Genome-scale Disk-based Suffix Tree Indexing

SUFFIX TREE RUNTIME
Time complexity
  Construction of suffix tree: O(n) time and space, where n is the size of the text being searched
  Substring search: O(m) time, where m is the size of the substring/search pattern
  Comparison with the Knuth-Morris-Pratt and Boyer-Moore algorithms
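The O(m) search bound can be illustrated on the example sequence ACGACG$ with a small sketch. Note the assumptions: this builds an uncompressed suffix trie by inserting every suffix, which takes O(n^2) time and is not the linear-time construction the slide refers to. The function names `build_suffix_trie` and `contains` are illustrative only. The search loop, however, follows one edge per pattern character, which is where the O(m) query time comes from.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie.

    Assumption: naive O(n^2) construction, used only to illustrate search.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow one edge per pattern character: O(m) for a pattern of length m."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ACGACG$")
print(contains(trie, "GAC"))  # True
print(contains(trie, "GG"))   # False
```

The query cost depends only on the pattern length, not on the length of the indexed text, which is the property that makes suffix trees attractive for repeated searches over one large sequence.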

Page 5: Genome-scale Disk-based Suffix Tree Indexing

APPLICATION IN BIOINFORMATICS
Database search
Exact matching
Approximate matching*
Longest common substring
Genome alignment*
Structural motifs*
Tandem repeats*
Sequence comparison

Page 6: Genome-scale Disk-based Suffix Tree Indexing

PROBLEMS WITH GENOME-SCALE SUFFIX TREES
Efficient O(n) suffix tree generating algorithms
  Tree must fit entirely in main memory
  e.g. Ukkonen’s algorithm
Genomes are very large
  Human genome is 3 Gbp (0.75 GB)
  Data structure no longer able to fit in memory
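The 0.75 GB figure follows from packing each of the four DNA bases into 2 bits. The short calculation below reproduces it and, as a rough illustration, estimates the size of an in-memory tree. The value of 20 bytes per indexed base is a hypothetical overhead chosen for the example, not a number from the slides.

```python
GENOME_BP = 3_000_000_000  # human genome size from the slide (3 Gbp)
BITS_PER_BASE = 2          # A, C, G, T each fit in 2 bits


def packed_size_gb(n_bases, bits_per_base=BITS_PER_BASE):
    """Size of the raw sequence when bases are bit-packed, in GB (10^9 bytes)."""
    return n_bases * bits_per_base / 8 / 1e9


def tree_size_gb(n_bases, bytes_per_base):
    """Rough tree size for an assumed per-base overhead, in GB."""
    return n_bases * bytes_per_base / 1e9


print(packed_size_gb(GENOME_BP))    # 0.75
print(tree_size_gb(GENOME_BP, 20))  # 60.0 (hypothetical 20 bytes per base)
```

Even under modest assumptions about node size, the index is far larger than the sequence itself, which is why an algorithm that needs the whole tree in main memory does not scale to this input.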

Page 7: Genome-scale Disk-based Suffix Tree Indexing

WHAT TRELLIS SOLVES
Prevents data skew in prefix partitioning
  Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory.
  Skew results from the non-uniform distribution of alphabet symbols in DNA
Efficient disk-based implementation
  Functions under low memory constraints
  Efficient disk I/O usage
Able to recover suffix links
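One way to picture how skew can be avoided is to let prefixes vary in length: any prefix shared by too many suffixes is extended by one character until every bucket is small enough. The sketch below illustrates that idea under stated assumptions. The names `count_prefix` and `select_prefixes`, the toy sequence, and the exact extension rule are illustrative and are not taken from the Trellis implementation. The text must end with a unique terminator `$` that is also part of the alphabet, so that every suffix lands in exactly one bucket.

```python
def count_prefix(text, prefix):
    """Number of suffixes of `text` that start with `prefix`."""
    return sum(
        1
        for i in range(len(text) - len(prefix) + 1)
        if text.startswith(prefix, i)
    )


def select_prefixes(text, alphabet, t):
    """Extend over-full prefixes until each one covers at most t suffixes."""
    selected = {}
    frontier = list(alphabet)
    while frontier:
        prefix = frontier.pop()
        freq = count_prefix(text, prefix)
        if freq == 0:
            continue  # prefix never occurs, nothing to index
        if freq <= t:
            selected[prefix] = freq
        else:
            # Too many suffixes share this prefix: split it one level deeper.
            frontier.extend(prefix + ch for ch in alphabet)
    return selected


genome = "AAAAAAACGT$"  # deliberately skewed toward 'A'
buckets = select_prefixes(genome, "ACGT$", t=3)
for prefix in sorted(buckets):
    print(prefix, buckets[prefix])
```

With fixed single-character prefixes the `A` bucket would hold 7 of the 11 suffixes. With variable-length prefixes no bucket exceeds the threshold, and the bucket sizes still sum to the number of suffixes.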

Page 8: Genome-scale Disk-based Suffix Tree Indexing

TRELLIS STEPS
Prefix Creation Phase
Partitioning Phase
Merging Phase
Suffix Link Recovery Phase (Optional)
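The partitioning and merging phases can be sketched in miniature. This is a simplified stand-in, not the Trellis algorithm itself: sorted lists of suffix start positions take the place of on-disk suffix subtrees, the prefix set is assumed to be given (the output of the prefix creation phase), and suffix link recovery is omitted. The names `partition_phase`, `merge_phase`, and `segment_len` are illustrative.

```python
def partition_phase(text, prefixes, segment_len):
    """Process the text one segment at a time, bucketing suffixes by prefix.

    Each per-segment bucket stands in for a prefixed subtree written to disk.
    """
    partitions = []
    for seg_start in range(0, len(text), segment_len):
        seg_end = min(seg_start + segment_len, len(text))
        buckets = {p: [] for p in prefixes}
        for i in range(seg_start, seg_end):
            for p in prefixes:
                if text.startswith(p, i):
                    buckets[p].append(i)
                    break
        partitions.append(buckets)
    return partitions


def merge_phase(text, prefixes, partitions):
    """For each prefix, combine that prefix's buckets from every partition."""
    merged = {}
    for p in prefixes:
        starts = [i for buckets in partitions for i in buckets[p]]
        merged[p] = sorted(starts, key=lambda i: text[i:])
    return merged


text = "ACGACG$"
prefixes = ["A", "C", "G", "$"]
partitions = partition_phase(text, prefixes, segment_len=3)
merged = merge_phase(text, prefixes, partitions)
print(merged)
```

The point of the structure is that only one segment is handled at a time during partitioning and only one prefix at a time during merging, so neither phase needs the full index in memory.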

Page 9: Genome-scale Disk-based Suffix Tree Indexing

TRELLIS OVERVIEW

Page 10: Genome-scale Disk-based Suffix Tree Indexing

MERGING PHASE

Page 11: Genome-scale Disk-based Suffix Tree Indexing

THRESHOLD (t)
Determines partition of sequence
  Suffix subtree fits into memory during partitioning phase.
Determines cutoff for prefix set inclusion
  Recombined prefixed suffix subtree will fit entirely into memory during merging phase.
Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase

Page 12: Genome-scale Disk-based Suffix Tree Indexing

TRELLIS OVERVIEW

Page 13: Genome-scale Disk-based Suffix Tree Indexing

PERFORMANCE
O(n^2) time and O(n) space (where n is sequence length)
Comparison to TDD
  Currently only other algorithm that scales up to genome level
  Same time complexity
  Does not calculate suffix links

Page 14: Genome-scale Disk-based Suffix Tree Indexing

SUFFIX TREE CONSTRUCTION

Page 15: Genome-scale Disk-based Suffix Tree Indexing

QUERY TIMES

Page 16: Genome-scale Disk-based Suffix Tree Indexing

QUERY TIMES

Page 17: Genome-scale Disk-based Suffix Tree Indexing

CONCLUSION
Efficient disk-based suffix tree generation that works well with limited memory
Suffix links are recoverable
Future work
  Extend to larger alphabets
  Buffer input sequence
  Parallelize partitioning and merging