Genome-scale Disk-based Suffix Tree Indexing
![Page 1: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/1.jpg)
GENOME-SCALE DISK-BASED SUFFIX TREE INDEXING
Phoophakdee and Zaki
![Page 2: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/2.jpg)
OUTLINE
- Suffix tree introduction
- Application in bioinformatics
- Trellis
- Trellis performance
- Conclusion
![Page 3: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/3.jpg)
EXAMPLE SUFFIX TREE
- Sequence: ACGACG$
- What are suffix links?
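To make the suffix link question concrete, here is a minimal sketch for the example sequence. It assumes the usual definition: an internal node whose path label is xα (x a single character) has a suffix link to the node whose path label is α. The function names are illustrative, and the brute-force enumeration is only meant for tiny inputs; it is not how a real construction algorithm finds these nodes.

```python
from collections import defaultdict

def internal_node_labels(text):
    """Path labels of the branching (internal) suffix-tree nodes, root excluded.

    A substring labels an internal node when it is followed by at least two
    distinct characters somewhere in the text. Brute force, small inputs only.
    """
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            # text[i:j] is followed by the character text[j]
            followers[text[i:j]].add(text[j])
    return sorted(label for label, nxt in followers.items() if len(nxt) >= 2)

def suffix_links(text):
    """Map each internal node label to the label its suffix link points to.

    The empty string stands for the root.
    """
    return {label: label[1:] for label in internal_node_labels(text)}

links = suffix_links("ACGACG$")
for source, target in links.items():
    print(source, "->", target or "(root)")
```

For ACGACG$ the internal nodes are ACG, CG and G, and the suffix links form the chain ACG -> CG -> G -> root.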
![Page 4: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/4.jpg)
SUFFIX TREE RUNTIME
- Time complexity
  - Construction of suffix tree: O(n) time and space, where n is the size of the text being searched
  - Substring search: O(m) time, where m is the size of the substring/search pattern
- Comparison with the Knuth-Morris-Pratt and Boyer-Moore algorithms
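The O(m) search bound can be illustrated with a small sketch. It uses an uncompressed suffix trie built from nested dictionaries as a stand-in for a real suffix tree; this naive build takes O(n^2) time and space, not the O(n) quoted above. The point is only that a query walks one edge per pattern character, independent of the text length. Function names are illustrative.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a trie of nested dicts (naive build)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Return True if `pattern` occurs in the indexed text.

    The loop runs at most len(pattern) steps, regardless of the text size.
    """
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ACGACG$")
print(contains(trie, "GAC"))  # True: "GAC" occurs at position 2
print(contains(trie, "GG"))   # False
```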
![Page 5: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/5.jpg)
APPLICATION IN BIOINFORMATICS
- Database search
- Exact matching
- Approximate matching*
- Longest common substring
- Genome alignment*
- Structural motifs*
- Tandem repeats*
- Sequence comparison
![Page 6: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/6.jpg)
PROBLEMS WITH GENOME-SCALE SUFFIX TREES
- Efficient O(n) suffix tree construction algorithms
  - Tree must fit entirely in main memory
  - e.g. Ukkonen’s algorithm
- Genomes are very large
  - Human genome is 3 Gbp (0.75 GB at 2 bits per base)
  - Data structure no longer able to fit in memory
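The 0.75 GB figure follows from packing each of the four DNA bases into 2 bits. The sketch below reproduces that arithmetic and adds a rough index-size estimate. The `bytes_per_base` value of 10 is a hypothetical placeholder chosen for illustration, not a number from the slides; actual suffix tree overhead depends on the implementation.

```python
def packed_size_gb(num_bases, bits_per_base=2):
    """Size in GB (10^9 bytes) of a sequence stored at a fixed bit width per base."""
    return num_bases * bits_per_base / 8 / 1e9

def index_size_gb(num_bases, bytes_per_base):
    """Rough index size in GB, given an assumed per-base overhead in bytes."""
    return num_bases * bytes_per_base / 1e9

human_genome = 3_000_000_000  # 3 Gbp, as stated on the slide

print(packed_size_gb(human_genome))                    # 0.75
print(index_size_gb(human_genome, bytes_per_base=10))  # 30.0 (hypothetical overhead)
```

Even with a modest per-base overhead, the tree is tens of times larger than the packed sequence, which is why an in-memory construction stops being practical at this scale.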
![Page 7: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/7.jpg)
WHAT TRELLIS SOLVES
- Prevents data skew in prefix partitioning
  - Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory
  - Skew comes from the non-uniform distribution of the alphabet/DNA
- Efficient disk-based implementation
  - Functions under low memory constraints
  - Efficient disk I/O usage
- Able to recover suffix links
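One way to picture how skew can be avoided is to let prefixes vary in length: a prefix that starts too many suffixes is extended by one character until every prefix group is small enough. The sketch below is a simplified illustration of that idea under stated assumptions, not the authors' implementation. The function name, the in-memory counting, and the treatment of `$` as part of the alphabet are all choices made for the example.

```python
from collections import Counter

def variable_length_prefixes(text, threshold, alphabet="ACGT$"):
    """Return {prefix: count} such that every count is <= threshold.

    A prefix whose count exceeds the threshold is replaced by its
    one-character extensions. Assumes `text` ends with a unique "$".
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    accepted = {}
    frontier = list(alphabet)
    while frontier:
        length = len(frontier[0])
        # Count every substring of the current prefix length in one pass.
        counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
        next_frontier = []
        for prefix in frontier:
            count = counts.get(prefix, 0)
            if count == 0:
                continue  # prefix never occurs, nothing to index
            if count <= threshold:
                accepted[prefix] = count
            else:
                next_frontier.extend(prefix + ch for ch in alphabet)
        frontier = next_frontier
    return accepted

print(variable_length_prefixes("ACGACG$", threshold=2))
print(variable_length_prefixes("ACGACG$", threshold=1))
```

With a threshold of 2, the single-character prefixes already suffice. With a threshold of 1, the repeated prefixes are extended until each one identifies exactly one suffix. In both cases the counts add up to the number of suffixes, so no suffix is lost.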
![Page 8: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/8.jpg)
TRELLIS STEPS
- Prefix creation phase
- Partitioning phase
- Merging phase
- Suffix link recovery phase (optional)
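The data flow of the partitioning and merging phases can be sketched roughly as follows. This is a toy model under loud assumptions: lists of suffix start positions stand in for the on-disk suffix subtrees, sorting stands in for merging subtrees, and the prefix set is hard-coded and assumed to be prefix-free. The names `partition_phase` and `merge_phase` are illustrative only.

```python
from collections import defaultdict

def partition_phase(text, prefixes, segment_len):
    """Split the text into segments; in each, bucket suffix starts by prefix.

    Each bucket stands in for one per-segment, per-prefix subtree.
    """
    per_segment = []
    for start in range(0, len(text), segment_len):
        buckets = defaultdict(list)
        for i in range(start, min(start + segment_len, len(text))):
            for prefix in prefixes:
                if text.startswith(prefix, i):
                    buckets[prefix].append(i)
                    break  # prefix-free set: at most one prefix matches
        per_segment.append(dict(buckets))
    return per_segment

def merge_phase(per_segment, text):
    """Combine the buckets that share a prefix across all segments.

    Sorting by suffix is a stand-in for merging the actual subtrees.
    """
    merged = defaultdict(list)
    for buckets in per_segment:
        for prefix, positions in buckets.items():
            merged[prefix].extend(positions)
    return {p: sorted(pos, key=lambda i: text[i:]) for p, pos in merged.items()}

text = "ACGACG$"
segments = partition_phase(text, prefixes=["A", "C", "G", "$"], segment_len=3)
print(segments)
print(merge_phase(segments, text))
```

Only one segment is processed at a time during partitioning, and only one prefix group at a time during merging, which is the property that lets each step stay within a memory budget.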
![Page 9: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/9.jpg)
TRELLIS OVERVIEW
![Page 10: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/10.jpg)
MERGING PHASE
![Page 11: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/11.jpg)
THRESHOLD (t)
- Determines partition of sequence
  - Suffix subtree fits into memory during partitioning phase
- Determines cutoff for prefix set inclusion
  - Recombined prefixed suffix subtree will fit entirely into memory during merging phase
- Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase
![Page 12: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/12.jpg)
TRELLIS OVERVIEW
![Page 13: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/13.jpg)
PERFORMANCE
- O(n^2) time and O(n) space (where n is sequence length)
- Comparison to TDD
  - Currently the only other algorithm that scales up to genome level
  - Same time complexity
  - Does not calculate suffix links
![Page 14: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/14.jpg)
SUFFIX TREE CONSTRUCTION
![Page 15: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/15.jpg)
QUERY TIMES
![Page 16: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/16.jpg)
QUERY TIMES
![Page 17: Genome-scale Disk-based Suffix Tree Indexing](https://reader035.fdocuments.us/reader035/viewer/2022062814/56816786550346895ddca011/html5/thumbnails/17.jpg)
CONCLUSION
- Efficient disk-based suffix tree generation that works well with limited memory
- Suffix links are recoverable
- Future work
  - Extend to larger alphabets
  - Buffer input sequence
  - Parallelize partitioning and merging