genotyping - University Of Marylandusers.umiacs.umd.edu/.../lect15_genotyping/genotyping.pdf ·...
![Page 1: genotyping - University Of Marylandusers.umiacs.umd.edu/.../lect15_genotyping/genotyping.pdf · 2014. 4. 10. · Sequencing technologies - the next generation. Nat Rev Genet. 2010](https://reader036.fdocuments.us/reader036/viewer/2022071005/5fc27b2f8c07ea09fe62bc20/html5/thumbnails/1.jpg)
Genotyping
CMSC702 Spring 2014
What makes them different?
Much human variation is due to differences in ~6 million base pairs (0.1% of the genome), referred to as SNPs.
Genomic DNA: SNP
TACATAGCCATCGGTANGTACTCAATGATGATA (N = A or G)
Single Nucleotide Polymorphism (SNP)
Three genotypes
Genotype: AA
One copy from Mother, one from Father (each shown as both strands):
TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT
Genotype: AG
One copy from Mother, one from Father (each shown as both strands):
TACATAGCCATCGGTAAGTACTCAATGATGATA
ATGTATCGGTAGCCATTCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT
Genotype: GG
One copy from Mother, one from Father (each shown as both strands):
TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT

TACATAGCCATCGGTAGGTACTCAATGATGATA
ATGTATCGGTAGCCATCCATGAGTTACTACTAT
[Check, Nature 437]
Personal Genomics
Next-gen Sequencing Platforms
• Millions of short DNA fragments (~100 bp) sequenced in parallel
Source: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010
Source: Whiteford et al. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics. 2009
Each read: name, sequence, quality scores
x 100s of millions
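As an illustration of the name / sequence / quality-score triple, here is a minimal sketch of reading such records. It assumes the reads are stored as four-line FASTQ records with Phred+33 quality encoding; the slide does not name a file format or encoding, so both are assumptions, and `Read` and `parse_fastq` are hypothetical names.

```python
import io
from typing import Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle, offset: int = 33) -> Iterator[Read]:
    """Yield reads from a four-line-per-record FASTQ stream.

    offset=33 is an assumption (Phred+33); older Illumina files used 64.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of stream
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield Read(header[1:], sequence, [ord(c) - offset for c in quality])


# Toy record, not real data.
example = io.StringIO("@read1\nGTCGCAGTATCTGTCT\n+\nIIIIIIIIIIIIII##\n")
reads = list(parse_fastq(example))
```

A real run would stream hundreds of millions of such records rather than collect them in a list.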
Sequencing throughput
• HiSeq 2000: 25 billion bp per day (2010)
• GA IIx: 5 billion bp per day (2009)
• GA II: 1.6 billion bp per day (2008)
Images: www.illumina.com/systems
Numbers: www.politigenomics.com/next-generation-sequencing-informatics
Dates: Illumina press releases
Sequencing throughput
• HiSeq 2500: 60 billion bp per day (2012)
• GA IIx: 5 billion bp per day (2009)
• GA II: 1.6 billion bp per day (2008)
Images: www.illumina.com/systems
Numbers: www.politigenomics.com/next-generation-sequencing-informatics
Dates: Illumina press releases
Sec-gen Sequencing for SNPs
...TAACGATTC...
...ATTGCTAAG...

...TAACGTTTC...
...ATTGCAAAG...
Sec-gen Sequencing for SNPs
Reference:
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT

Align (read above, reference below):
GTCGCAGTANCTGTCT
||||||||| ||||||
GTCGCAGTATCTGTCT

GGATCTGCGATATACC
|||||| |||||||||
GGATCT-CGATATACC

AATCTGATCTTATTTT
||||||||||||||||
AATCTGATCTTATTTT

ATATATATATATATAT
||||||||||||||||
ATATATATATATATAT

TCTCTCCCANNAGAGC
|||||||||  |||||
TCTCTCCCAGGAGAGC

Aggregate (“Pileup” or “Coverage plot”; “Coverage”; “Depth of coverage” = 14):
GTCGCAGTATCTGTCT
GTCGCAGTATCTGTNN
TGTCGCAGTATCTGTC
TATGTCGCAGTATCTG
TATATCGCAGTATCTT
TATATCGCAGTATCTG
NATATCGCAGTATNTG
CCCTATATCGCAGTAT
ACACCCTATGTCGCA
ACACCCTATCTCGCA
ACACCCTATGTCGCA
GA-CACCCTATGTCGC
CCGGA-CACCCTATAT
CCGGA-CACCCTATAT
GCCGGA-CACCCTATG

Statistics:
Call: HET A, G p-value: 0.0023
(slide courtesy of Ben Langmead)
We want the probability of genotype given aligned bases:
$$P(T_i \mid D)$$
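The aggregate step on this slide can be sketched as collecting, for one reference position, the base that each overlapping read contributes. This is a toy illustration only: the `(start, aligned_sequence)` representation, the function name `pileup_column`, and the reads themselves are assumptions made for the example, not the format any particular aligner emits.

```python
from collections import Counter


def pileup_column(alignments, position):
    """Return the bases that aligned reads place at a reference position.

    alignments: list of (start, aligned_sequence) pairs, where start is the
    0-based reference coordinate of the first aligned base and '-' marks a
    deleted base, so each character covers exactly one reference position.
    """
    bases = []
    for start, sequence in alignments:
        index = position - start
        if 0 <= index < len(sequence):
            bases.append(sequence[index])
    return bases


# Hypothetical reads over the reference fragment GTCGCAGTATCTGTCTTT.
toy_alignments = [
    (0, "GTCGCAGTATCT"),
    (2, "CGCAGTATCTGT"),
    (4, "CAGTGTCTGTCT"),  # carries G where the reference has A (position 8)
    (6, "GTGTCTGTCTTT"),  # same mismatch
]

column = pileup_column(toy_alignments, 8)
depth = len(column)       # depth of coverage at this position
counts = Counter(column)  # allele counts that feed the genotype call
```

The column of bases (together with their quality scores, omitted here) is the data $D$ in $P(T_i \mid D)$.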
SNP calling
• We will look at SOAP and samtools today: [Ruiqiang Li et al., Genome Research 2009; Heng Li, Bioinformatics 2011].
• Both use a “Bayesian” formulation.
• This is also how “first” generation SNP-calling was done (BayesSNP).
• Many others use a similar formulation (MAQ, Atlas-SNP, FreeBayes).
• The main difference is in their probabilistic framework for the genotype.
SOAP: Short Oligonucleotide Analysis Package
samtools: Sequence Alignment/Map tools
SOAPsnp
$$P(T_i \mid D) = \frac{P(D \mid T_i)\,P(T_i)}{\sum_x P(D \mid T_x)\,P(T_x)}$$
$P(D \mid T_i)$: probability of data given genotype
$P(T_i)$: prior probability of genotype
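The Bayes rule above amounts to multiplying each genotype's likelihood by its prior and normalizing over all genotypes. A minimal sketch, with made-up likelihoods and priors for three genotypes (the numbers are illustrative assumptions, not SOAPsnp's values, and a diploid caller would consider all ten unordered genotypes):

```python
def posterior(likelihoods, priors):
    """Apply Bayes' rule: P(T_i | D) = P(D | T_i) P(T_i) / sum_x P(D | T_x) P(T_x)."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())  # the normalizing sum in the denominator
    return {g: joint[g] / total for g in joint}


# Hypothetical values for a site whose reference base is A.
toy_likelihoods = {"AA": 1e-6, "AG": 1e-2, "GG": 1e-6}
toy_priors = {"AA": 0.998, "AG": 0.001, "GG": 0.001}

post = posterior(toy_likelihoods, toy_priors)
best = max(post, key=post.get)  # genotype with the highest posterior
```

With real read depths the likelihoods become very small, so an implementation would typically work with log probabilities instead of raw products.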
Prior Probabilities
Assuming:
1) SNP rate is $10^{-3}$
2) Error rate in reference is $10^{-5}$
Data probability
$$P(D \mid T_i) = \prod_{k=1}^{n} P(d_k \mid T_i)$$
[Figure: pileup of aligned reads over the reference]
$$P(d_k \mid T_i) = \frac{P(d_k \mid H_m) + P(d_k \mid H_n)}{2}, \qquad T_i = H_m H_n$$
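The two formulas above can be sketched in a few lines: each observed base contributes the average of its probability under the two alleles of the genotype, and the contributions multiply across the pileup (summed here in log space). Note the swapped-in simplification: `base_likelihood` uses a uniform-miscall model driven by the Phred quality score, which is an assumption for illustration and not SOAPsnp's lookup table. All names and the toy column are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score to an error probability: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def base_likelihood(observed, allele, error):
    """P(d_k | H): simplified model (assumption), miscalls spread evenly over 3 bases."""
    return 1 - error if observed == allele else error / 3


def genotype_log_likelihood(observations, genotype):
    """log P(D | T_i) = sum_k log( (P(d_k | H_m) + P(d_k | H_n)) / 2 )."""
    h_m, h_n = genotype
    total = 0.0
    for base, quality in observations:
        e = phred_to_error(quality)
        p = 0.5 * (base_likelihood(base, h_m, e) + base_likelihood(base, h_n, e))
        total += math.log(p)
    return total


# The ten unordered diploid genotypes: AA, AC, AG, ..., TT.
GENOTYPES = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]

# Hypothetical pileup column: (observed base, Phred quality).
toy_column = [("A", 30), ("A", 30), ("G", 30), ("G", 30), ("A", 20)]
log_liks = {g: genotype_log_likelihood(toy_column, g) for g in GENOTYPES}
best = max(log_liks, key=log_liks.get)
```

For this column the heterozygous genotype scores highest because every homozygous genotype must explain two or three high-quality bases as sequencing errors.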
Data probability
Data for each base (allele) is
1. observed base (allele)
2. sequencing cycle
3. quality score (error probability)
4. occurrence
Data probability
$$P(d_k \mid H_m) = P(o_k, c_k, q_k \mid H_m) = P(o_k, c_k \mid H_m, q_k)\,P(q_k \mid H_m)$$
where $o_k$ is the observed base (allele), $c_k$ the sequencing cycle, and $q_k$ the quality score.
Data probability
No model here: use a lookup table!
$$P(o_k, c_k \mid H_m, q_k)$$
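One way such a lookup table could be estimated is by counting, for each (allele, quality) pair, how often each (observed base, cycle) combination occurs in a calibration set, and normalizing the counts. The sketch below assumes calibration tuples are already available (for example, from bases aligned to positions presumed to match the reference); the tuple layout, function names, and the `floor` value for unseen cells are all assumptions, not SOAPsnp's actual implementation.

```python
from collections import defaultdict


def build_lookup(calibration):
    """Estimate P(o, c | H, q) from (true_allele, quality, cycle, observed) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for allele, quality, cycle, observed in calibration:
        counts[(allele, quality)][(observed, cycle)] += 1
    table = {}
    for condition, cells in counts.items():
        total = sum(cells.values())
        # Normalize so each (allele, quality) row sums to 1 over (observed, cycle).
        table[condition] = {key: n / total for key, n in cells.items()}
    return table


def lookup(table, observed, cycle, allele, quality, floor=1e-6):
    """Look up P(o, c | H, q); unseen cells get a small floor (assumption)."""
    return table.get((allele, quality), {}).get((observed, cycle), floor)


# Hypothetical calibration data: true allele A, quality 30, cycles 1 and 2.
toy_calibration = (
    [("A", 30, 1, "A")] * 49
    + [("A", 30, 1, "G")] * 1
    + [("A", 30, 2, "A")] * 48
    + [("A", 30, 2, "C")] * 2
)
table = build_lookup(toy_calibration)
p_miscall = lookup(table, "G", 1, "A", 30)
```

Because the table is indexed by both observed base and cycle, it can capture cycle-dependent quality decay and biased substitution rates without a parametric error model.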
Quality score recalibration
Substitution Errors
SOAPsnp
• Quality score recalibration and biased substitution rates are incorporated
• Uses a “Bayesian” formulation
• Simple model, easily implemented
• Independence across genomic loci
• Easily parallelized (see Crossbow)