Tien Huynh (1), Michalis Vlachos (2), Isidore Rigoutsos (3)
EDBT 2010, March 22-26, 2010
Anchoring Millions of Distinct Reads on the Human Genome Within Seconds
(1) IBM TJ Watson Research, NY, USA
(2) IBM Zurich Research Laboratory, CH
(3) Thomas Jefferson University, USA
2
Introduction - Past
• Bioinformatics Sequencing Evolution
Human Genome
– 3.3B nucleotides
– 825MB raw data
– 20MB compressed
– 25,000 genes
3
Introduction – Past, Present, Future
It took 13 years for teams of scientists around the globe to first read the human genome – completing the project in 2001.
In 2007, it took 2 months to sequence the genome of DNA-co-discoverer James Watson.
By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg.
“Science 2009”
4
What we want to do
Question from industry: within 24 hours, can we find, FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome?
[Figure: a new generation sequencer produces millions of fragments of 20-75 nts (e.g. C G T T A C G, or A C G ? A G T A with a wildcard) that must be located on the reference genome (A C G T T A C G …).]
5
Applications
• Cancer Research: help isolate cancer-initiating mutations
• Better design of siRNAs (short interfering RNAs)
– RNA sequences of ~21 nucleotides -> RNA interference
– Therapeutic gene silencing
– Cancer therapy by tumor-related gene targeting
• Junk DNA analysis
• Re-sequencing
• etc
6
How can we achieve that?
• Solve as hash join between the reference and the query database
– Reference Genome/Database (eg human genome)
• Static, one time
– Query Database (produced fragments)
• Dynamic, recreated on every experiment
– How fast can we achieve this on a commodity desktop?
• We call the technique QPick (=Quick Pick)
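The hash-join idea can be illustrated with a minimal sketch. It assumes exact matches only, a single fixed fragment length k, and an in-memory dictionary; the function names are hypothetical, and this is not QPick's disk-based implementation.

```python
from collections import defaultdict


def build_query_table(fragments):
    """Hash the query fragments: fragment -> list of query ids (illustrative layout)."""
    table = defaultdict(list)
    for qid, frag in enumerate(fragments):
        table[frag].append(qid)
    return table


def hash_join(reference, fragments, k):
    """Return (query id, 0-based reference position) for every exact occurrence."""
    table = build_query_table(fragments)
    hits = []
    # Slide a window of k symbols over the reference and probe the hashtable.
    for pos in range(len(reference) - k + 1):
        for qid in table.get(reference[pos:pos + k], ()):
            hits.append((qid, pos))
    return hits


# Toy reference from the earlier slide; the fragments are made up for illustration.
hits = hash_join("ACGTTACG", ["CGTT", "TACG", "GGGG"], k=4)
print(hits)
```

The reference is scanned once, and each window costs one hashtable probe regardless of how many fragments are in the query set.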
7
[Figure: the QPick pipeline.
Target database: genomic sequence (the database) -> k-mer extraction (ACGG…TTT, CGG…TTTA, GG…TTTAG) -> split of each k-mer into head and tail (e.g. head acggttta, tail TTTAGGGGGCCAAAAAAATTT) -> key extraction & bit packetization -> hashtable (head as hashtable key, bit-packed tails as entries).
Query database: short-read generator -> output (with wildcards, written ?) -> head & tail separation -> query set expansion (wildcards + reverse strand) -> bit packetization / hashtable construction.
The target database and the query database are then combined by a hashtable join.]
8
QPick - Advantages
• Disk Based. Small Memory Requirements
– Other techniques run out of memory on large datasets…
– 3GB RAM is sufficient to
a) Index
b) Search Human Genome
c) Query Millions of Short Fragments
9
QPick - Advantages
• Highly Parallelizable
– A single CPU/core is sufficient.
10
QPick - Advantages
• Flexibility
– Rigid and wildcard matches
[Figure: a query with wildcards (A C G ? A G T A ?) and a rigid query (A C G A G T A C) matched against the reference genome.]
11
QPick - Advantages
• Completeness of results
– Various competitors failed to return all matches
– QPick does not miss any matches
[Figure: fragments (C G T T A C G, A C G ? A G T A, …) located on the reference genome (A C G T T A C G).]
12
Technical Highlights
• Data Compression
– Exploit small DNA alphabet
• Fast bitwise operations
– Take advantage of 64bit word comparisons
• Simple implementation (hash based)
• Data Pruning
[Figure: the symbols A C G T packed into a 64-bit word (…0011100011…).]
13
Data Representation
[Figure: a window of length Lmax slides over the reference genome.]
14
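Data Representation - Head
Codebook: A = 00, G = 01, T = 10, C = 11
window (from the reference genome): A G T C A G T C A G T C A G T C A G T C A G T C A C
head: A G T C A G T C A G T C A G = 00 01 10 11 00 01 10 11 00 01 10 11 00 01
Binary: 0001101100011011000110110001
Decimal: 28,422,577
[Figure: the encoded head (28,422,577) gives the hashtable position at which the tail of the window is stored.]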
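The head encoding can be reproduced with a few lines of Python. This is a sketch that uses the slide's 2-bit codebook; the function name `encode_head` is an assumption made for illustration.

```python
# 2-bit codebook taken from the slide.
HEAD_CODE = {"A": 0b00, "G": 0b01, "T": 0b10, "C": 0b11}


def encode_head(head):
    """Pack a wildcard-free head into an integer, two bits per symbol."""
    key = 0
    for symbol in head.upper():
        key = (key << 2) | HEAD_CODE[symbol]
    return key


key = encode_head("AGTCAGTCAGTCAG")  # the 14-symbol head from the slide
print(format(key, "028b"), key)  # 0001101100011011000110110001 28422577
```

A 14-symbol head needs 28 bits, so the key fits comfortably in a machine word and can serve directly as a hashtable index.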
15
Data Representation
Advantages
– Head = key for hashtable
– Reduction in dataset size
• We don't have to explicitly store the head
16
Data Representation - Tail
Codebook: A = 0001, G = 0010, T = 0100, C = 1000, . = 0000
[Figure: the tail of the window, padded with "." to the proper length, is encoded as 0100 1000 … 0000 and stored together with the position of the window (how many shifts we have made since the beginning of the string…).]
tail: 16 nts, 16 x 4 bits = one 64bit word
pattern p, query q: match when p & q = q
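The tail encoding and the bitwise test can be sketched as follows. The sketch assumes that the pattern p is the stored reference tail, that q is the query tail, and that shorter tails are right-padded with "."; the function names are hypothetical.

```python
# 4-bit one-hot codebook from the slide; "." (wildcard / padding) is 0000.
TAIL_CODE = {"A": 0b0001, "G": 0b0010, "T": 0b0100, "C": 0b1000, ".": 0b0000}
TAIL_LEN = 16  # 16 symbols x 4 bits = one 64-bit word


def encode_tail(tail):
    """Right-pad the tail with '.' to 16 symbols and pack it, four bits per symbol."""
    if len(tail) > TAIL_LEN:
        raise ValueError("tail longer than 16 symbols")
    word = 0
    for symbol in tail.upper().ljust(TAIL_LEN, "."):
        word = (word << 4) | TAIL_CODE[symbol]
    return word


def tail_matches(p, q):
    """The slide's test: pattern p matches query q when p & q == q."""
    return (p & q) == q


p = encode_tail("GTAGTCAGTCAC")                # reference tail (illustrative)
print(tail_matches(p, encode_tail("GTA.TC")))  # wildcard and shorter query: True
print(tail_matches(p, encode_tail("GTAA")))    # mismatch at the 4th symbol: False
```

Because a wildcard is all zeros, it passes the test against any reference symbol, while a concrete query symbol passes only against the identical one-hot code. A whole tail is compared in one 64-bit operation.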
17
Example
reference genome: ATAGACTAAAAAAAAAAAAAAATT (24 symbols)
window: 30 symbols (head in lower case, tail in upper case, "." = padding)

position  head            tail
1         atagactaaaaaaa  AAAAAAAATT......
2         tagactaaaaaaaa  AAAAAAATT.......
…
8         aaaaaaaaaaaaaa  ATT.............
9         aaaaaaaaaaaaaa  TT..............
10        aaaaaaaaaaaaat  T...............
…
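The windows in this example can be regenerated with a short sketch. It assumes a 14-symbol head, a 16-symbol tail, "." padding on the right, and that extraction stops once the head can no longer be filled with genome symbols; the last point is an assumption, not something the slide states.

```python
HEAD_LEN, TAIL_LEN = 14, 16  # head + tail = one 30-symbol window


def windows(genome, pad="."):
    """Yield (1-based position, head, tail) for each shift over the genome."""
    padded = genome + pad * TAIL_LEN
    for shift in range(len(genome) - HEAD_LEN + 1):
        head = padded[shift:shift + HEAD_LEN].lower()
        tail = padded[shift + HEAD_LEN:shift + HEAD_LEN + TAIL_LEN]
        yield shift + 1, head, tail


# The 24-symbol genome from the slide: ATAGACT, fifteen A's, then TT.
genome = "ATAGACT" + "A" * 15 + "TT"
rows = list(windows(genome))
for position, head, tail in rows:
    print(position, head, tail)
```

Positions 1, 2, 8, 9 and 10 of the output correspond to the rows shown on the slide.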
18
Up to now we have seen how to encode the reference genome
Now we show how to encode the query sequences
19
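Encoding the query DB
Now, instead of one long sequence, many shorter ones
[Figure: millions of short query sequences of different lengths, some ending in wildcards, e.g. A G T C A G T C . / A G C . / A G T C A G T C . . / A G T C G]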
20
Encoding the query DB
We do similar extractions in heads and tails
– head: 14 symbols: key
– tail: 16 symbols = 64bits
21
Encoding the query DB
Differences between target db and query db:
1. Query sequences may contain wildcards (in the db only used for padding), e.g. A G T C A G T C . .
22
Encoding the query DB
Differences between target db and query db:
2. Query sequences may have variable length (compared to the fixed size n-grams extracted from the target db), e.g. A G C . / A G T C A G T C . . / A G T C G
23
Encoding the query DB - wildcards
Treated differently depending where they appear
1. Head. Expanded to the possible symbols
[Figure: the head A G T C . is expanded into A G T C A, A G T C G, A G T C T and A G T C C.]
24
Encoding the query DB - wildcards
Treated differently depending where they appear
2. Tail. Encoded as binary wildcard 0000
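Head expansion can be sketched with `itertools.product`. The wildcard character "." and the function name `expand_head` are assumptions made for illustration.

```python
from itertools import product

SYMBOLS = "AGTC"


def expand_head(head, wildcard="."):
    """Replace every wildcard in the head by each of the four symbols."""
    choices = [SYMBOLS if s == wildcard else s for s in head]
    return ["".join(combo) for combo in product(*choices)]


expanded = expand_head("AGTC.")
print(expanded)  # ['AGTCA', 'AGTCG', 'AGTCT', 'AGTCC']
```

A head with w wildcards expands into 4^w concrete keys, each of which can be hashed like a wildcard-free head. Wildcards in the tail need no expansion because the 0000 code matches any symbol in the bitwise test.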
25
Handling Forward & Reverse DNA strands
[Figure: the reference genome with its forward strand and backward strand.]
If we explicitly encode the reverse strand we would be indexing twice as much data
26
Handling Forward & Reverse DNA strands
Form complementary sequences
[Figure: the sequence G G C C T T T and its complementary sequence on the other strand, A A A G G C C.]
28
Handling Forward & Reverse DNA strands
Form complementary sequences
[Figure: a further example on the forward and backward strands: C C T paired with G G A.]
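Since the pipeline lists "query set expansion (wildcards + reverse strand)", one way to avoid indexing the backward strand is to add the reverse complement of every query to the query set. The following sketch illustrates that reading; the function names and the "." wildcard are assumptions.

```python
# Complement table; the wildcard "." is its own complement.
COMPLEMENT = str.maketrans("ACGT.", "TGCA.")


def reverse_complement(seq):
    """Complement every symbol, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def with_reverse_strand(queries):
    """Return each query followed by its reverse complement."""
    expanded = []
    for query in queries:
        expanded.append(query)
        expanded.append(reverse_complement(query))
    return expanded


print(reverse_complement("GGCCTTT"))  # AAAGGCC, as on the slide
```

This doubles the (small, dynamic) query set instead of doubling the (large, static) reference index.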
29
Overall search process….
32
Was our hash key function a correct one?
…a bad key function would have led to many collisions…
YES! 98.5% of buckets contain fewer than 10 entries
33
Experiments
• Comparison with other Short Sequence Anchoring Tools
• QPick Performance
– Number of Wildcards
– Index Creation Time
– Query Time
• Datasets
– Mus musculus: ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/
– Homo sapiens: ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/
34
COMPARISON: Short Sequence Anchoring Tools
Up to 60x faster …

Tool      Hits        Misses     Time (sec)  Improvement
QPick     10,611,858  0          79          -
BWA       10,611,858  0          149         1.8x
fetchGWI  10,611,856  2          186         2.3x
Bowtie    10,611,858  0          331         4.1x
SOAP      10,573,528  38,330     4728        59.84x
Eland     10,559,799  5,014,059  555         7.0x
36
Comparison with Hash-Based techniques
QPick – 16x faster than FetchGWI
37
Varying the wildcards
• 5-6 seconds retrieval time for 0-1 wildcards
• <60 sec for up to 4 wildcards
38
Full Human-Genome Search
• Time to index
– Reference Genome
– Query sequences
• Time to Search
– 1M-10M queries (short DNA fragments)
39
Full Genome Search – Index Time
~ 7 hours (on single core CPU)
< 50 sec for query hashtables
40
Full Genome Search – Search Time
• ~130sec to find 1M matches (single core)
41
Conclusions
• QPick: Fast and complete search for short sequence fragments
• Takes Advantage of:
– Small DNA alphabet
– Bit packetization
– Hash Joins
• Up to 60x faster than competing techniques
• Applications for ….
• Future…