SNAP: Fast, accurate sequence alignment enabling biological applications
Ravi Pandya, Microsoft Research
ASHG 10/19/2014
SNAP
SNAP is fast*
    Align 50x genome in 1.2 hours (BWA-MEM = 11.75 hours)
    Sort + index + markdup BAM in 2 hours (samtools + sambamba = 4.25 hours)
SNAP is as accurate as BWA-MEM, Bowtie2, etc.
    ROC on simulated data
    % aligned on real data
    Variant calls on real data
* NA12878: ERR194147, Azure D14 (16 cores, 112GB RAM, 800GB SSD)
Sequence alignment
The problem:
    Given a read R and a reference genome G,
    find the position p in G that minimizes
    EditDistance(R, G[p .. p + |R|])
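To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not how SNAP works; it simply evaluates the definition at every position, which is what the later slides set out to avoid. The function names are illustrative assumptions.

```python
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic O(len(a) * len(b)) Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def brute_force_align(read: str, genome: str) -> Tuple[int, int]:
    """Return (p, distance) minimizing EditDistance(R, G[p .. p + |R|]).

    Ties are broken in favor of the leftmost position.
    """
    best_pos, best_dist = -1, len(read) + 1
    for p in range(len(genome) - len(read) + 1):
        d = edit_distance(read, genome[p:p + len(read)])
        if d < best_dist:
            best_pos, best_dist = p, d
    return best_pos, best_dist
```

For a genome of length |G| and reads of length n this costs O(|G| * n^2) per read, which is why the number and cost of comparisons both have to be reduced.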
SNAP solves this quickly and accurately because of:
    Efficient system architecture
    Reducing the number of comparisons
    Reducing the cost of comparisons
System architecture
[Diagram: pipeline stages labeled async read, align, sort, temp file, merge sort, mark duplicates, index, compress, async write; buffers labeled full and empty]
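The diagram itself did not survive extraction, so the following is only a rough Python sketch of the general idea it names: stages connected by bounded buffers so that reading, aligning, and writing can overlap. The three-stage layout, the `run_pipeline` name, and the `buffer_count` parameter are assumptions for illustration, not SNAP's actual design.

```python
import queue
import threading

SENTINEL = None  # marks end of input; reads themselves must not be None


def run_pipeline(reads, align, buffer_count=4):
    """Run reader -> aligner -> writer concurrently over bounded queues."""
    to_align = queue.Queue(maxsize=buffer_count)  # reader -> aligner
    to_write = queue.Queue(maxsize=buffer_count)  # aligner -> writer
    output = []

    def reader():
        for r in reads:
            to_align.put(r)       # blocks when all buffers are full
        to_align.put(SENTINEL)

    def aligner():
        while (r := to_align.get()) is not SENTINEL:
            to_write.put(align(r))
        to_write.put(SENTINEL)

    def writer():
        while (a := to_write.get()) is not SENTINEL:
            output.append(a)      # stands in for an asynchronous file write

    threads = [threading.Thread(target=f) for f in (reader, aligner, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output
```

Bounding the queues keeps memory use fixed: a fast reader stalls instead of buffering the whole input while the aligner catches up.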
The sequence alignment problem
The easy part:
    97% of 20-mers in the human genome occur only once, but at only 75% of locations
The hard part:
    The other 3% of 20-mers and 25% of locations
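The two percentages measure different things: the share of distinct k-mers that are unique, and the share of genome locations covered by those unique k-mers. A small Python sketch of how both could be computed on a toy string follows; the function name is an assumption, and a real genome would need a far more memory-efficient approach.

```python
from collections import Counter
from typing import Tuple


def kmer_uniqueness(genome: str, k: int) -> Tuple[float, float]:
    """Return (fraction of distinct k-mers that occur exactly once,
               fraction of locations whose k-mer occurs exactly once)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    unique_kmers = sum(1 for c in counts.values() if c == 1)
    total_locations = sum(counts.values())
    # Each unique k-mer covers exactly one location, so the same count
    # serves as the numerator for both fractions.
    return unique_kmers / len(counts), unique_kmers / total_locations
```

On the toy input `"AAAAACGT"` with `k=3`, three of the four distinct 3-mers are unique, but they cover only three of the six locations, because the repeated `AAA` takes up the rest.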
[Figure: CDF of per-read/pair alignment time, NA18705, 169M pairs (using deeper search parameters than current defaults). Axis: 0% to 100%. Series: Single Equally Weighted, Paired Equally Weighted, Single Time Weighted, Paired Time Weighted. Annotation: 10% of reads, 95% of time.]
Bill Bolosky, MSR
Hash table lookup
Build a multi-valued map (~30GB for hg19) from all seeds S in G to all locations of S in G
For all seeds in read, all locations of seed in genome:
    Score implied alignment of read, keep the best
    330 reads/s
Ignore frequent seeds (>300 occurrences)
Only use a few seeds/read
    14k reads/s (42x)
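The lookup scheme above can be sketched in a few lines of Python. This is an illustrative toy, assuming non-overlapping seeds and a plain dictionary; the function names and the `max_seeds` parameter are hypothetical, and SNAP's real index is a specialized hash table.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def build_seed_index(genome: str, seed_len: int) -> Dict[str, List[int]]:
    """Multi-valued map: seed -> all locations of that seed in the genome."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return dict(index)


def candidate_locations(read: str, index: Dict[str, List[int]], seed_len: int,
                        max_hits: int = 300,
                        max_seeds: Optional[int] = None) -> Dict[int, int]:
    """Map each implied read start position to its number of seed hits."""
    votes = defaultdict(int)
    # Non-overlapping seeds, optionally capped to "a few seeds per read".
    offsets = list(range(0, len(read) - seed_len + 1, seed_len))[:max_seeds]
    for off in offsets:
        hits = index.get(read[off:off + seed_len], [])
        if len(hits) > max_hits:
            continue  # ignore frequent seeds
        for pos in hits:
            start = pos - off  # where the read would begin in the genome
            if start >= 0:
                votes[start] += 1
    return dict(votes)
```

Each surviving start position is then scored with an edit distance computation, so skipping frequent seeds directly reduces the number of comparisons.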
Fast scoring
Edit distance: O(n^2) → Ukkonen O(nd), n = len, d = min(limit, actual)
    Use limit = best score so far + 2 (for MAPQ)
    92k reads/s (6.6x)
Sort candidates by # of seed hits
Skip locations with # seed misses > limit
    113k reads/s (1.2x)
    154k reads/s (1.4x; 470x overall)
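The limit-bounded scoring idea can be illustrated with a banded dynamic program in Python. This is a simplified stand-in in the spirit of Ukkonen's O(nd) method, not SNAP's implementation: it only fills cells within `limit` of the diagonal and gives up as soon as no cell can stay under the limit. The function names and the `max_dist` parameter are assumptions.

```python
from typing import Dict, Iterable, Optional


def bounded_edit_distance(a: str, b: str, limit: int) -> Optional[int]:
    """Return the edit distance if it is <= limit, otherwise None."""
    if abs(len(a) - len(b)) > limit:
        return None
    inf = limit + 1  # any value above the limit is treated as "too far"
    prev = {j: j for j in range(min(len(b), limit) + 1)}
    for i in range(1, len(a) + 1):
        curr = {}
        # Only columns within `limit` of the diagonal can matter.
        for j in range(max(0, i - limit), min(len(b), i + limit) + 1):
            if j == 0:
                best = i
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                best = min(prev.get(j - 1, inf) + cost,  # match/substitute
                           prev.get(j, inf) + 1,         # delete from a
                           curr.get(j - 1, inf) + 1)     # insert into a
            curr[j] = min(best, inf)
        if min(curr.values()) > limit:
            return None  # every path already exceeds the limit
        prev = curr
    d = prev.get(len(b), inf)
    return d if d <= limit else None


def score_candidates(read: str, genome: str, starts: Iterable[int],
                     max_dist: int) -> Dict[int, int]:
    """Score candidate starts, tightening the limit to best-so-far + 2."""
    scores, best, limit = {}, None, max_dist
    for s in starts:
        d = bounded_edit_distance(read, genome[s:s + len(read)], limit)
        if d is None:
            continue
        scores[s] = d
        if best is None or d < best:
            best, limit = d, min(max_dist, d + 2)
    return scores
```

Keeping the limit two above the best score, rather than equal to it, still lets near-best alternatives be scored, which the slide ties to MAPQ.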
Paired-end alignment
Find & score candidate location pairs
    C(R1:R2) = C(R1) ∩ C(R2) {± insert size}
    Enumerate in O(h log n), where h = |C(R1) ∩ C(R2)| and n = |C(R1)| + |C(R2)|
Increases accuracy by allowing a much higher limit on seed occurrences (e.g. 4k vs 300)
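One simple way to enumerate candidate pairs within an insert-size window is a binary search over sorted candidate lists, sketched below in Python. This is a simplified variant for illustration: it walks all of C(R1) rather than achieving the O(h log n) bound stated on the slide, and the function name and `max_insert` parameter are assumptions.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def candidate_pairs(c1: List[int], c2: List[int],
                    max_insert: int) -> List[Tuple[int, int]]:
    """Return all (p1, p2) with |p1 - p2| <= max_insert.

    c1 and c2 must be sorted lists of candidate locations for the two reads.
    """
    pairs = []
    for p1 in c1:
        lo = bisect_left(c2, p1 - max_insert)   # first mate inside the window
        hi = bisect_right(c2, p1 + max_insert)  # one past the last mate
        pairs.extend((p1, p2) for p2 in c2[lo:hi])
    return pairs
```

Because a location survives only if its mate has a candidate nearby, each read can tolerate many more seed hits before the pair list grows large.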
Results: simulated data
Mason-generated paired-end 100bp reads
Results: real data
NA18507 (Illumina HiSeq 50x)
* AWS cr1.8xlarge (32 cores, 244GB RAM, 2x120GB SSD)
Results: GATK variant calls
Broad GATK pipeline, curated NA12878 variant calls
Results: NIST Genome-in-a-Bottle
Appistry GATK pipeline, GIAB highly confident calls
Longer seeds are much faster, similar precision/recall
ERR194147*.fastq.gz, Azure D14 (16 cores, 112GB RAM, 800GB SSD)

Highly confident
Aligner   indel Recall   indel Precision   snp Recall   snp Precision
bwa-mem   97.24%         97.15%            99.57%       99.65%
snap-20   97.04%         97.48%            99.51%       99.57%
snap-24   97.04%         97.46%            99.52%       99.57%
snap-28   97.04%         97.45%            99.53%       99.57%
snap-32   97.00%         97.41%            99.51%       99.57%

Results: NIST Genome-in-a-Bottle
Lower confidence calls (qual>20, 2 platforms)

Lower confidence
Aligner   indel Recall   indel Precision   snp Recall   snp Precision
bwa-mem   96.38%         96.30%            99.00%       99.32%
snap-20   96.17%         96.68%            98.94%       99.25%
snap-24   96.17%         96.67%            98.95%       99.23%
snap-28   96.16%         96.62%            98.96%       99.21%
snap-32   96.11%         96.55%            98.94%       99.17%
Pathogen ID: SURPI (Charles Chiu, UCSF)
“This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete, Chiu said.”
SURPI
SNAP enables SURPI with:
    Fast filtering mode
    64-bit index for >40GB ntDB
    Secondary mapping output
Charles Chiu, UCSF
Acknowledgements
Microsoft Research
    Bill Bolosky
    Ravi Pandya
UC San Francisco
    Taylor Sittler
Broad Institute
    Christopher Hartl
UC Berkeley AMPLab
    Matei Zaharia
    Kristal Curtis
    Armando Fox
    Scott Shenker
    Ion Stoica
    David Patterson
Binaries, source, documentation (Apache 2.0 licensed)
http://snap.cs.berkeley.edu