Compressed Full-Text Indexes for Highly Repetitive …jltsiren.kapsi.fi/files/lectio.pdf•...
Transcript of Compressed Full-Text Indexes for Highly Repetitive …jltsiren.kapsi.fi/files/lectio.pdf•...
Compressed Full-Text Indexes for Highly
Repetitive Collections
Lectio praecursoriaJouni Sirén 29.6.2012
ALGORITHM
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Are there papers withSadakane as the first author?
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
How many papers haveSadakane as the first author?
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
What are the papers withSadakane as the first author?
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
DATA STRUCTURE
• What if we have to preserve the original order of the records?
• We may want even faster queries.
• Perhaps there are too many records to fit into memory.
• Then we probably need another data structure.
INDEX
Sadakane: New text indexing functionalities of the compressed suffix
Burrows, Wheeler: A block sorting lossless data compression algorithm
Ferragina, Manzini: Indexing compressed text
Grossi, Vitter: Compressed suffix arrays and suffix trees with
Navarro, Mäkinen: Compressed full-text indexes
Raman, Raman, Rao: Succinct indexable dictionaries with applications
Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
Sadakane: Compressed suffix trees with full functionality
Manber, Myers: Suffix arrays: A new method for on-line string searches
Navarro
Raman
Grossi
Burrows
Ferragina
Manber
Sadakane
FULL-TEXT INDEX
$A C G T A C T G $A C T G $C G T A C T G $C T G $G $G A C G T A C T G $G T A C T G $T A C T G $T G $
GACGTACTG$
Suffix Array
10263791458
GACGTACTG$
Suffix Array
• While a character takes 1 byte, each pointer requires 4 or 8 bytes.
• Suffix array usually requires 5 or 9 times more space than the text.
• We need something smaller to handle large texts.
COMPRESSED INDEX
• Ferragina, Manzini 2000, 2005: FM-index
• Grossi, Vitter 2000, 2005: Compressed Suffix Array
• Use Burrows-Wheeler transform to simulate the suffix array.
• Compresses to 40% to 80% of text size.
• Yet some data should compress better.
HIGHLY REPETITIVE DATA
Individual GenomesG A C G T A - C T G C A G A T G - T A A T G CG A C G T A - C T G C A G A T G C T A A T C CG A C G T A - - - G C A G A T G C T A A T G CG A C G T A - C T G C A G - T G C T A A T G CG A C G T A - - - G C A G A T G C T A A T C CG A C G T A - C T G C T G A T G C T A A T G CG A C G T A C C T G C A G A T G C T A A T G CG A C G T A C C T G C A G - T G C T A A T G CG A C G T A - C T G C T G A T G C T A A T G CG A C G T A - C T G C A G A T G C T A A T C C
Version Historydhcp-eduroam-hy-138-42:thesis jltsiren$ svn diff -r 662 thesis.texIndex: thesis.tex===================================================================--- thesis.tex (revision 662)+++ thesis.tex (working copy)@@ -23,7 +23,7 @@ \isbnpdf{978-952-10-8052-4} \issn{1238-8645} \printhouse{Unigrafia}-\pubpages{108 + 72} % FIXME+\pubpages{97 + 63} \supervisorlist{Veli Mäkinen, University of Helsinki, Finland} \preexaminera{Kunihiko Sadakane, National Institute of Informatics, Japan} \preexaminerb{Jorma Tarhio, Aalto University, Finland}
Suffix array construction378 gigabytes
Finnish language Wikipedia with full version history42 gigabytes
Run-length compressed suffix array4.4 gigabytes
Do we have 378 gigabytes of memory?
INDEX CONSTRUCTION
Data Construction RLCSA
Suffix Array Direct Construction
Data Construction RLCSA
Suffix Array Direct Construction
INDEXING AUTOMATA
# G G GA AC C CT
T
T $
# G G G
A
A
A
A
A
A
C
C C C
T
T T $# GA GT
ACTA CTA
ACG CG
AT TGT
TA
AG
ACC
ACTG
CC CTG TG$ G$ $
HELSINGIN YLIOPISTOHELSINGFORS UNIVERSITET
UNIVERSITY OF HELSINKIMATEMAATTIS-LUONNONTIETEELLINEN TIEDEKUNTA
MATEMATISK-NATURVETENSKAPLIGA FAKULTETENFACULTY OF SCIENCE
Indexing Finite Language Representationof Population Genotypes Jouni Sirén, Niko Välimäki, Veli Mäkinen
ABSTRACTCompressed full-text indexes [6] based on the Bur-rows-Wheeler transform (BWT) are widely used inbioinformatics. Their most succesful application sofar has been mapping short reads to a referencesequence (e.g. Bowtie [3], BWA [4], SOAP2 [5]).These indexes use the BWT to simulate the suffixtree or the suffix array (SA), while using much lessspace than either of them. A simple generalizationallows indexing a set of sequences.
We propose a biologically motivated generalizationof the BWT to finite languages. Given a multiplealignment of sequences (e.g. individual genomes),we build a compressed index capable of simulatingthe suffix array over plausible recombinations of thesequences. Alternatively, we start from a referencesequence and a set of mutations, and build the in-dex over sequences containing any subset of themutations.
Our approach is based on finite automata. We startwith an automaton recognizing the input language.This automaton is transformed into an equivalentautomaton, where each state corresponds to a lexi-cographic range of suffixes of the language. A gen-eralization of the XBW transform for labeled trees[2] is used to index the transformed automaton.
FULL-TEXT INDEXES FOR PATTERN MATCHING AND SEQUENCE ANALYSIS
A
Suffix Tree SA Sorted Suffixes BWT
10
2
6
3
7
9
1
4
5
8
$
$GTCATGCAG $
10
2
6
3
7
9
1
4
5
8
$GTCATGCA
$GTCATGC
$GTCATG
$GTCAT
$GTCA
$GTC
$GT
$G
GTCATGCA
A
C
C
G
G
G
T
T
G
G
G
G
G
G
G
G
G
A
A
A
A
A
A
A
C
C
C
C
C
C
G
G
G
G
G
T
T
T
T
A
A
A
C
C
T
AC
C
$
G
T
GTACTG$
TG$
GTACTG$
TG$
$
ACGTACTG$
TACTG$
ACTG$
G$
$GTCATGCAGGC
A MATCH IN MULTIPLE ALIGNMENT
GTCATGCAG –
GATGCAG –
GTCATGAG –
GTCATCAG
– –
T
– CT TG GA
INITIAL AUTOMATON AND SORTED AUTOMATON
# G G GA AC C CT
T
T $
# G G G
A
A
A
A
A
A
C
C C C
T
T T $# GA GT
ACTA CTA
ACG CG
AT TGT
TA
AG
ACC
ACTG
CC CTG TG$ G$ $
GENERALIZED COMPRESSED SUFFIX ARRAY
$ ACC ACG ACTA ACTG AG AT CC CG CTA CTG G$ GA GT TA TG$ TGT #
BWT G T G G T T G A A A AC AT # CT CG C A $Edges 1 1 1 1 1 1 1 1 1 1 1 1 100 1 100 1 1 1
Basic operations are about 2 times slower than in regular BWT-based indexes. For reasonable mutationfrequencies f , the expected size of the sorted automaton is n(1 + f )O(log n), where n is the length of thereference sequence. For 1/f = W(log n), this becomes O(n). In our experiments, an index built for thehuman reference genome and the genetic variation found in the Finnish population sample of the 1000Genomes Project took approximately 2.8 gigabytes.
FUTURE DIRECTIONS• With our current algorithm, the construction of
a genome-scale index requires 12 hours and192 gigabytes of memory. We are currently in-vestigating other algorithms, such as externalmemory construction and distributed construc-tion in the MapReduce framework [1].
• In principle, our index can be used in any algo-rithm using a regular BWT-based index. Whatcan be done efficiently in practice?
• We are currently investigating several ways touse the generalized index in read alignment.Are there other applications, where our indexcould be superior to the existing approaches?
REFERENCES[1] J. Dean, S. Ghemawat: Simplified Data Pro-
cessing on Large Clusters. OSDI 2004.
[2] P. Ferragina et al.: Compressing and indexinglabeled trees, with applications. Journal of theACM, 2009.
[3] B. Langmead et al.: Ultrafast and memory-effi-cient alignment of short DNA sequences to thehuman genome. Genome Biology, 2009.
[4] H. Li, R. Durbin: Fast and accurate short readalignment with Burrows-Wheeler Transform.Bioinformatics, 2009.
[5] R. Li et al.: SOAP2: an improved ultrafast toolfor short read alignment. Bioinformatics, 2009.
[6] G. Navarro, V. Mäkinen: Compressed full-textindexes. ACM Computing Surveys, 2007.
Compressed Full-Text Indexes for Highly
Repetitive Collections