Informatics challenges and computer tools for sequencing 1000s of human genomes
![Page 1: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/1.jpg)
Informatics challenges and computer tools for sequencing 1000s of human genomes
Gabor T. Marth, Boston College Biology Department
Cold Spring Harbor Laboratory, Personal Genomes meeting, October 9-12, 2008
![Page 2: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/2.jpg)
Large-scale individual human resequencing
![Page 3: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/3.jpg)
Next-gen sequencers offer vast throughput…
[Figure: bases per machine run (log scale, 1 Mb to 10 Gb) vs. read length (log scale, 10 bp to 1,000 bp)]
• ABI capillary sequencer
• 454 pyrosequencer (100-400 Mb in 200-450 bp reads)
• Illumina, AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads)
![Page 4: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/4.jpg)
The resequencing informatics pipeline
(i) base calling
(ii) read mapping
(iii) read assembly
(iv) SNP and short INDEL calling
(v) SV calling
(vi) data validation, hypothesis generation
[Figure labels: REF, IND]
![Page 5: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/5.jpg)
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers
• SV callers
• assembly viewers
GigaBayes
![Page 6: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/6.jpg)
1. Base calling
base sequence
base quality (Q-value) sequence
• early manufacturer-supplied base callers were imperfect
• third party software made substantial improvements
• machine manufacturers are now focusing more on base calling
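The base quality (Q-value) mentioned on this slide is conventionally Phred-scaled, Q = -10·log10(p_error). A minimal sketch of the conversion in both directions follows; the function names are illustrative and do not come from any particular base caller.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality value Q into a base-call error probability."""
    return 10.0 ** (-q / 10.0)

def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability into a Phred-scaled quality value."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)

# Print the error rate implied by a few common quality values.
for q in (10, 20, 30, 40):
    print(f"Q{q}: expected error rate {phred_to_error_prob(q):.4%}")
```

Each step of 10 in Q corresponds to a tenfold drop in the expected error rate.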
![Page 7: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/7.jpg)
2. Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
… and they give you the picture on the box
Larger, more unique pieces are easier to place than others…
![Page 8: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/8.jpg)
Next-gen reads are generally short
[Figure: read length [bp], axis from 0 to 400; ranges shown: ~200-450 (variable), 25-70 (fixed), 25-50 (fixed), 20-60 (variable)]
![Page 9: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/9.jpg)
Base error rates are low
Illumina
454
![Page 10: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/10.jpg)
Strategies to deal with non-unique mapping
![Page 11: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/11.jpg)
Mapping probabilities (qualities)
0.8 0.19 0.01
read
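The slide shows one read with three candidate placements weighted 0.8, 0.19 and 0.01. A minimal sketch of one common way to get such numbers: normalize per-location alignment likelihoods into posterior probabilities, then Phred-scale the chance that the best placement is wrong. The function names and the quality cap of 60 are assumptions for illustration, not the behavior of any specific mapper.

```python
import math

def placement_posteriors(likelihoods):
    """Normalize per-location alignment likelihoods so that they sum to 1."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one placement must have positive likelihood")
    return [x / total for x in likelihoods]

def mapping_quality(posteriors, cap=60.0):
    """Phred-scale the probability that the best placement is wrong (capped)."""
    p_wrong = 1.0 - max(posteriors)
    if p_wrong <= 0.0:
        return cap  # a unique placement would otherwise give infinite quality
    return min(cap, -10.0 * math.log10(p_wrong))

# The three placement weights shown on the slide.
posteriors = placement_posteriors([0.8, 0.19, 0.01])
print(posteriors, round(mapping_quality(posteriors), 2))
```

Keeping the full set of placement probabilities, rather than only the best hit, lets downstream callers down-weight reads that map ambiguously.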
![Page 12: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/12.jpg)
Error types are very different
Illumina
454
![Page 13: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/13.jpg)
Gapped alignments
![Page 14: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/14.jpg)
MOSAIK
• fast
• accurate
• gapped
• versatile (short + long reads)
![Page 15: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/15.jpg)
3. SNP and short-INDEL calling
• deep alignments of 100s / 1000s of individuals
• trio sequences
![Page 16: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/16.jpg)
Allele discovery is a multi-step sampling process
Population Samples Reads
![Page 17: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/17.jpg)
Capturing the allele in the sample
[Figure: Prob(allele captured in sample), 0 to 1, vs. population AF, 1E-04 to 0.5 on a log scale; one curve each for n=100, n=200, n=400, n=800, n=1600]
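The shape of these curves can be sketched with a simple sampling argument: if chromosomes are drawn independently, an allele at population frequency f is missed by all 2n chromosomes of n diploid individuals with probability (1 - f)^(2n). Treating n as the number of diploid individuals is an assumption here; the slide does not say whether n counts individuals or chromosomes.

```python
def prob_allele_captured(pop_af: float, n_individuals: int, ploidy: int = 2) -> float:
    """Probability that at least one sampled chromosome carries the allele."""
    if not 0.0 <= pop_af <= 1.0:
        raise ValueError("allele frequency must be in [0, 1]")
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - pop_af) ** n_chromosomes

# Sample sizes from the slide, evaluated at a few allele frequencies.
for n in (100, 200, 400, 800, 1600):
    row = ", ".join(
        f"AF={af:g}: {prob_allele_captured(af, n):.3f}"
        for af in (0.0001, 0.001, 0.01)
    )
    print(f"n={n}: {row}")
```

Under this model, doubling the sample size has roughly the same effect as doubling the allele frequency for rare alleles.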
![Page 18: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/18.jpg)
Allele calling in the reads
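$$
\Pr(G_1^k,\ldots,G_i^k,\ldots,G_n^k \mid B) =
\frac{\prod_{i=1}^{n} \sum_{T_i} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^k)\;\Pr(G_1^k,\ldots,G_i^k,\ldots,G_n^k)}
{\sum_{l} \prod_{i=1}^{n} \sum_{T_i} \Pr(B_i \mid T_i)\,\Pr(T_i \mid G_i^l)\;\Pr(G_1^l,\ldots,G_i^l,\ldots,G_n^l)}
$$
Slide annotations: base quality (B); allele call in read (T); number of individuals (n)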
GigaBayes
![Page 19: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/19.jpg)
How many reads needed to call an allele?
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaCgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac

| Reads | Q30 | Q40 | Q50 | Q60 |
|---|---|---|---|---|
| 1 | 0.01 | 0.01 | 0.1 | 0.5 |
| 2 | 0.82 | 1.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |
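The trend in the table (more supporting reads and higher base quality both raise confidence) can be illustrated with a toy two-hypothesis Bayesian calculation. This is only a sketch under stated assumptions: a single individual, a homozygous-reference versus heterozygous comparison, a prior of 0.001 for a heterozygous site, and errors split evenly among the three other bases. It is not the GigaBayes model and is not expected to reproduce the slide's numbers.

```python
def het_posterior(n_reads: int, n_alt: int, q: float, prior_het: float = 0.001) -> float:
    """Posterior probability of a heterozygous site under a toy two-hypothesis model."""
    e = 10.0 ** (-q / 10.0)          # per-base error probability from Phred Q
    p_alt_if_ref = e / 3.0           # a miscall happens to produce this specific allele
    p_alt_if_het = 0.5 * (1.0 - e) + 0.5 * (e / 3.0)
    n_ref = n_reads - n_alt
    # The binomial coefficient is identical under both hypotheses and cancels.
    like_ref = (p_alt_if_ref ** n_alt) * ((1.0 - e) ** n_ref)
    like_het = (p_alt_if_het ** n_alt) * ((1.0 - p_alt_if_het) ** n_ref)
    num = prior_het * like_het
    return num / (num + (1.0 - prior_het) * like_ref)

# Four reads at the site, with 1 to 3 of them showing the alternate allele.
for n_alt in (1, 2, 3):
    row = ", ".join(f"Q{q}: {het_posterior(4, n_alt, q):.3f}" for q in (30, 40, 50, 60))
    print(f"{n_alt} alt read(s): {row}")
```

Even in this simplified model, a second independent read supporting the allele moves the posterior far more than a modest increase in base quality on a single read.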
![Page 20: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/20.jpg)
The need for accurate data…
![Page 21: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/21.jpg)
… and realistic base quality values
![Page 22: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/22.jpg)
Recalibrated base quality values (Illumina)
![Page 23: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/23.jpg)
More samples or deeper coverage / sample?
Shallower read coverage from more individuals …
…or deeper coverage from fewer samples?
simulation analysis by Aaron Quinlan
![Page 24: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/24.jpg)
Analysis indicates a balance
![Page 25: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/25.jpg)
SNP calling in trios
[Equation: Pr(G_C | G_M, G_F), a table giving the probability of each child genotype (11, 12, 22) for every combination of mother genotype G_M and father genotype G_F (11, 12, 22); the individual entries are not recoverable from the transcript]
• the child inherits one chromosome from each parent
• there is a small probability for a mutation in the child
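The two rules above are enough to sketch a transmission table Pr(G_C | G_M, G_F) for a biallelic site. The sketch assumes each parent passes on one of its two alleles with probability 1/2 and that the transmitted allele switches to the other allele with a per-allele mutation probability `mu`. The value of `mu` and the function names are illustrative assumptions, not the parameterization used on the slide.

```python
from itertools import product

GENOTYPES = ("11", "12", "22")

def transmit_prob_allele2(parent: str, mu: float) -> float:
    """Probability that a parent passes on allele '2', allowing for mutation."""
    frac2 = parent.count("2") / 2.0  # chance of picking allele 2 before mutation
    return frac2 * (1.0 - mu) + (1.0 - frac2) * mu

def child_genotype_probs(mother: str, father: str, mu: float = 1e-8) -> dict:
    """Pr(child genotype | mother genotype, father genotype) for a biallelic site."""
    pm = transmit_prob_allele2(mother, mu)
    pf = transmit_prob_allele2(father, mu)
    return {
        "11": (1.0 - pm) * (1.0 - pf),
        "12": pm * (1.0 - pf) + (1.0 - pm) * pf,
        "22": pm * pf,
    }

# Print the full 3 x 3 table for an exaggerated mutation probability.
for mother, father in product(GENOTYPES, repeat=2):
    probs = child_genotype_probs(mother, father, mu=1e-3)
    print(mother, father, {g: round(p, 6) for g, p in probs.items()})
```

With `mu` set to 0 the table reduces to plain Mendelian ratios; a small positive `mu` gives non-zero weight to child genotypes that neither parent could otherwise produce.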
![Page 26: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/26.jpg)
SNP calling in trios
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac
mother father
child
P=0.79
P=0.86
![Page 27: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/27.jpg)
4. Structural variation discovery
Read pair mapping pattern (breakpoint detection)
[Figure: DNA reference and read-pair pattern for each event type; labels LM, LF, Ldel, Ldup, Linv, LT1, LT2, Lins, LT]
• Deletion: LM ~ LF+Ldel & depth: low
• Tandem duplication: LM ~ LF-Ldup & depth: high
• Inversion: LM ~ +Linv & ends flipped, LM ~ -Linv & depth: normal
• Translocation: LM ~ LF+LT1, LM ~ LF+LT2, LM ~ LF-LT1-LT2 & depth: normal
• Insertion: un-paired read clusters & depth: normal
• Chromosomal translocation: LM ~ LF+LT & depth: normal & cross-paired read clusters
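The read-pair signatures on this slide can be sketched as a simple per-pair classifier. The sketch reads LM as the mapped distance between the two ends and LF as the expected fragment length, which is an interpretation of the slide's symbols. The `ReadPair` fields, the three-standard-deviation threshold, and the returned labels are illustrative assumptions. Read depth, which the slide uses as supporting evidence, is left out.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom1: str
    chrom2: str
    mapped_distance: int  # LM: signed distance between the mapped ends
    ends_flipped: bool    # True when relative orientation differs from the library layout

def classify_pair(pair: ReadPair, frag_mean: float, frag_sd: float, n_sd: float = 3.0) -> str:
    """Assign one read pair to a candidate structural-variation signature."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"       # cross-paired reads on different chromosomes
    if pair.ends_flipped:
        return "inversion"
    if pair.mapped_distance < 0:
        return "tandem_duplication"  # ends map in reversed order on the reference
    if pair.mapped_distance > frag_mean + n_sd * frag_sd:
        return "deletion"            # LM ~ LF + Ldel
    if pair.mapped_distance < frag_mean - n_sd * frag_sd:
        return "short"               # shorter than expected; not classified further here
    return "concordant"

# Made-up example pairs for a library with 400 +/- 40 bp fragments.
pairs = [
    ReadPair("chr1", "chr1", 410, False),
    ReadPair("chr1", "chr1", 5000, False),
    ReadPair("chr1", "chr1", -800, False),
    ReadPair("chr1", "chr1", 395, True),
    ReadPair("chr1", "chr2", 0, False),
]
for p in pairs:
    print(p, "->", classify_pair(p, frag_mean=400.0, frag_sd=40.0))
```

In practice a call would rest on clusters of pairs sharing the same signature rather than on any single pair.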
![Page 28: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/28.jpg)
Copy number estimation
Depth of read coverage
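A minimal sketch of depth-based copy number estimation: scale the observed depth in a window by the depth expected for a normal diploid region. The window depths below are made-up toy values, and a real analysis would first normalize for biases such as GC content and mappability.

```python
def estimate_copy_number(window_depth: float, expected_depth: float, reference_ploidy: int = 2) -> float:
    """Estimate copy number as reference ploidy times observed/expected depth."""
    if expected_depth <= 0:
        raise ValueError("expected depth must be positive")
    return reference_ploidy * window_depth / expected_depth

# Toy per-window depths: normal, a heterozygous-deletion-like dip, then a gain.
depths = [30, 31, 29, 15, 14, 16, 30, 45, 44]
expected = 30.0
for i, depth in enumerate(depths):
    cn = estimate_copy_number(depth, expected)
    print(f"window {i}: depth={depth}, copy number ~ {cn:.2f} (rounded {round(cn)})")
```

A heterozygous deletion shows up as roughly half the expected depth, which is why careful normalization matters for revealing it.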
![Page 29: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/29.jpg)
Deletion: Aberrant positive mapping distance
![Page 30: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/30.jpg)
Tandem duplication: negative mapping distance
![Page 31: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/31.jpg)
Het deletion “revealed” by normalization
Chip Stewart, Saturday poster session
![Page 32: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/32.jpg)
5. Data visualization
• software development
• data validation
• hypothesis generation
![Page 33: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/33.jpg)
Summary
• Next-generation sequencing is a boon for large-scale individual human resequencing
• Basic data mining tools are being applied and tested in the 1000 Genomes Project
• There is still a lot of fine-tuning to do
• A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes
![Page 34: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/34.jpg)
Credits
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
Several postdoc positions are available… … mail [email protected]
![Page 35: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/35.jpg)
Software tools for next-gen data
http://bioinformatics.bc.edu/marthlab/Beta_Release
![Page 37: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/37.jpg)
Individual genotype directly from sequence
AACGTTAGCATA AACGTTAGCATA AACGTTCGCATA AACGTTCGCATA
AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA AACGTTCGCATA
AACGTTAGCATA AACGTTAGCATA
individual 1
individual 3
individual 2
A/C
C/C
A/A
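Calling a genotype directly from one individual's reads can be sketched with per-genotype likelihoods. The example uses the three read groups from this slide, whose variable base sits at 0-based position 6. The symmetric error model (a read shows the other allele with probability `error`), the 1% error rate, and the function names are simplifying assumptions for illustration.

```python
import math

def genotype_log_likelihoods(bases, allele_a="A", allele_b="C", error=0.01):
    """Log-likelihood of the observed bases under genotypes a/a, a/b and b/b."""
    result = {}
    for name, frac_b in (
        (f"{allele_a}/{allele_a}", 0.0),
        (f"{allele_a}/{allele_b}", 0.5),
        (f"{allele_b}/{allele_b}", 1.0),
    ):
        # Probability that a read shows allele_b given the genotype and the error rate.
        p_b = frac_b * (1.0 - error) + (1.0 - frac_b) * error
        log_like = 0.0
        for base in bases:
            if base == allele_b:
                log_like += math.log(p_b)
            elif base == allele_a:
                log_like += math.log(1.0 - p_b)
            # Bases matching neither allele are ignored in this sketch.
        result[name] = log_like
    return result

def call_genotype(bases, **kwargs):
    """Return the genotype with the highest likelihood."""
    log_likes = genotype_log_likelihoods(bases, **kwargs)
    return max(log_likes, key=log_likes.get)

SITE = 6  # 0-based position of the variable base in the slide's reads
read_groups = [
    ["AACGTTAGCATA", "AACGTTAGCATA", "AACGTTCGCATA", "AACGTTCGCATA"],
    ["AACGTTCGCATA"] * 4,
    ["AACGTTAGCATA"] * 2,
]
for reads in read_groups:
    bases = [read[SITE] for read in reads]
    print("".join(bases), "->", call_genotype(bases))
```

With only two to four reads per individual the likelihoods are not strongly separated, which is where population-level priors and multi-sample calling help.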
![Page 38: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/38.jpg)
Genotyping from primary sequence data
[Figure: fraction of confident genotypes (0 to 1) vs. SNP position (0 to 50,000); second panel with axes 0 to 5 x 10^4 and 0 to 1600]
100 @ 16x: 0.975 +/- 0.121
200 @ 8x: 0.968 +/- 0.129
400 @ 4x: 0.924 +/- 0.151
800 @ 2x: 0.769 +/- 0.154
![Page 39: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/39.jpg)
Most reads contain no or few errors
![Page 40: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/40.jpg)
Paired-end reads help unique read placement
PE
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
MP
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
![Page 41: Informatics challenges and computer tools for sequencing 1000s of human genomes](https://reader035.fdocuments.us/reader035/viewer/2022062305/56815232550346895dc07986/html5/thumbnails/41.jpg)
How many reads needed to call an allele?
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac aatgtagtaCgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac
aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaAgtacctac aatgtagtaCgtacctac
P=0.82
P=0.08