Tools For HTS Analysis
description
Transcript of Tools For HTS Analysis
TOOLS FOR HTS ANALYSIS
Michael Brudno and Marc FiumeDepartment of Computer ScienceUniversity of Toronto
Outline•Lab focus•Our tools
•SHRiMP: read mapper•VARiD: SNP and indel finder•Savant: genome browser
• Discussion
Our Tools
READ MAPPING (SHRiMP)
READ MAPPING (SHRiMP)
SNP DETECTION(VARiD)
SNP DETECTION(VARiD)
INDEL DETECTION(MODiL)
INDEL DETECTION(MODiL)
CNV DETECTION(CNVer)
CNV DETECTION(CNVer)
ASSEMBLY(UNNAMED)ASSEMBLY
(UNNAMED)VISUALIZATION(SAVANT)
VISUALIZATION(SAVANT)
SHRIMP – SHORT READ MAPPING PACKAGE
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Key SHRiMP Features•High Sensitivity•Support for common formats (SAM, FASTQ, etc)•Flexible seeding framework•Multi-threading•Full support for SOLiD and Illumina (and 454) reads
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Sensitivity/Specificity Comparison
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Runtime comparison
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Unpaired 50bp Reads Paired 75bp Reads
Mapping 6 million reads to C. Savignyi (180 Mb)
VARID – SNP AND INDEL DETECTION
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
motivation | methods | results | summary
Variation detection from NGS reads
Reference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC
Donor: ??????????????????????????????????????? GCATCGACTGCA CGGGATCGACTGAligned reads: ATCCATTGCA GATCCACTGCAC
• Determine differences (variation) between reference and donorusing NGS reads of the donor
MotivationColor-space and Letter-space platforms
bring them together
MotivationColor-space and Letter-space platforms
bring them together
MethodsMethods
SummarySummary
ResultsResults
16
motivation | methods | results | summary
Sequencing Platforms
• letter-space Sanger, 454, Illumina, etc
> NC_005109.2 | BRCA1 SX3TCAGCATCGGCATCGACTGCACAGG
• color-space AB SOLiD less software tools available
> NC_005109.2 | BRCA1 AF3T212313230313232121311120
• many differences -> useful to combine this information• sequencing biases• inherent errors • advantages
17
A G
C T
2
1
2
1 3
0 0
00
A
A
C
0
G T
1 32
C 1
G
0
2
3 2
3 10
T 3 2 01
Color Space
motivation | methods | results | summary
Translation Matrix Translation Automata
18
Translating
> T212313230313232121311120> T
Sequencing Error vs SNP
Sequencing Error> T212313230313232121311120> T212313230310232121311120> TCAGCATCGGCAAGCTGACGTGTCC
SNP> TCAGCATCGGCATCGACTGCACAGG> TCAGCATCGGCAGCGACTGCACAGG> T212313230312332121311120
A G
C T
CAGCATCGGCATCGACTGCACAGG
Color Space
motivation | methods | results | summary
19
Color Space
motivation | methods | results | summary
20
• clear distinction between a sequencing error and a SNP• can this help us in SNP detection? sounds like it!
single color change error, 2 colors changed (likely) SNP.
Easy snp call Well covered bases Difficult Casereference incolor-space
reads
position
Detection• Heterozygous SNPs• Homozygous SNPs• Tri-allelic SNPs• small indels• account for various errors, quality values & misalignments
Motivation • variation caller to handle both letter-space & color-space reads
Motivation
motivation | methods | results | summary
21
VARiD• system to make inferences on the donor bases
• variation detection
Methods
Simple HMM Modelstates, emissions, transitions, FB
Extended HMM Modelgaps, diploids, exceptions
Methods
Simple HMM Modelstates, emissions, transitions, FB
Extended HMM Modelgaps, diploids, exceptions
MotivationMotivation
SummarySummary
ResultsResults
22
Statistical model for a system - states
Assume that system is a Markov process with state unobserved. Markov Process: next state depends only on current state
We can observe the state’s emission (output)each state has a probability distribution over outputs
Hidden Markov Model (HMM)
motivation | methods | results | summary
23
S1 S2 S3
e1
e2
e1
e2
e1
e2
Hidden Markov Model (HMM)
motivation | methods | results | summary
24
Apply HMM to variation detection: • we don’t know the state (donor), but • we can observe some output determined by the state (reads)
Hidden Markov Model (HMM)
motivation | methods | results | summary
25
. . . . . B6 B7 B8 B9 . . . . .
B6 B7 B7 B8 B8 B9
AA
AC
color 0
color 1
AA
AC
color 0
color 1
AA
AC
color 0
color 1
unknowndonor
Why pairs of letters? Handle colors.• AA and TT gives the same colors. Can’t just model colors
The donor could be:• letters: AA color 0• letters: AC color 1 :• letters: TT color 016 combinations
A G
C T
2
1
2
1 3
0 0
00
A
A
C
0
G T
1 32
C 1
G
0
2
3 2
3 10
T 3 2 01
States
motivation | methods | results | summary
26
B6 B7
States and Transitions
motivation | methods | results | summary
27
. . . B6 B7 B8 . . .
B6 B7 B7 B8
AA
CA
AT
TT
:
:
GA
CT
::
AA
TT
.
.
.
.
.
.
.
.
.
Poss
ible
Sta
tes
Transitions• only certain transitions allowed
• when allowed, p(Xt|Xt-1) = freq(Xt)
• each state depends only on the previous states (Markov Process)
States• 16 possible states• only look at second letter
T01020100311223 T1030101311223 T20100311223
ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC
Unknown genome
Color reads
Letter reads
Emissions
motivation | methods | results | summary
28
..... B6B7B8B9 ..... B7 B8
color 0
color 1
AA
AC
color 0
AA
AA
color 0
color 1
color 2
color 3
letters A
letters C
1 – 3ε
ε
ε
ε
emissionprobabilityp(em|AA)
letters T ξ
1- 3ξ
letters G ξ
ξ
Emission Probabilities
motivation | methods | results | summary
29
Same color emission distribution
TT
Different letter emission distribution
TT
])31[(])31[( 2112 Ep
E.g. For state CC:
Combining emission probabilities• probability that this state emitted these reads.
motivation | methods | results | summary
Emission Probabilities
30
T01020100311223 T1030101311223 T20100311223
ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC
..... B6B7B8B9 .....
Summary
• unknown state • donor pair at location
•transitions • transition probabilities
• emissions • reads at location• emission probabilities
motivation | methods | results | summary
Simple HMM
31
B6 B7
AA
AC
color 0
color 1
• Have set-up a form of an HMM• run Forward-Backward algorithm • get probability distribution over states at some position
AA
CA
AT
TT
:
:
GA
CT
::
likely state
motivation | methods | results | summary
Forward-Backward Algorithm
32
• Variation Detection:compare most likely state with reference:
ref: GCTATCCAdon: ...AT...
Methods
Simple HMM Modelstates, emissions, transitions, FB
Extended HMM Modelgaps, diploids, exceptions
Methods
Simple HMM Modelstates, emissions, transitions, FB
Extended HMM Modelgaps, diploids, exceptions
MotivationMotivation
SummarySummary
ResultsResults
33
Simple HMM • only detects homozygous SNPs
Extended HMM:• short indels• heterozygous SNPs• complex error profiles & quality values
motivation | methods | results | summary
Extended HMM
34
Expand states• Have states that include gaps
• emit: gap or color
A---
-G
AGTG
T-T-
• Have larger states, for diploids
• Transitions built in similar fashion as before• Same algorithm, but in all we have 1600 states with very sparse transitions
Expansion: Gaps and het. SNPs
motivation | methods | results | summary
35
• Emission probabilities o Support quality values o Use variable error rates for emissions
• Translate through the first lettero first color is incorrecto letter-space signal
Donor: ACAGCATCGGCATCGACTGC 1123132303123321213read: >T2123132303123321213 > C123132303123321213
Expansion
motivation | methods | results | summary
36
• Post-process putative SNPso uncorrelated adjacent errors may support het SNPso check putative SNPs
motivation | methods | results | summary
blue: varid steps
Summary
ResultsResults
MotivationMotivation
MethodsMethods
SummarySummary
38
Results
motivation | methods | results | summary
• Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773)
Color-space dataset:• Compare random subsets:
• Corona (with AB mapper) • VARiD (with SHRiMP) • VARiD (with AB mapper)
Conclusions:• Using F-measure, the three pipelines perform very similarly. • High-coverage results is as good as can be achieved
39
Results
motivation | methods | results | summary
• Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773)
Letter-space dataset:• Compare random subsets :
• GigaBayes (with Mosaik) • VARiD (with SHRiMP) • VARiD (with Mosaik)
Conclusion:• Using F-measure the three pipelines perform very similarly.• High-coverage results is as good as can be achieved
40
Results
VARiD: Combining Letter-space and Color-space Datato achieve increased accuracy in at-cost comparison
motivation | methods | results | summary
41
SummarySummary
MotivationMotivation
MethodsMethods
ResultsResults
42
Summary of VARiD• HMM modeling underlying donor• Treats color-space and letter-space together in the same framework• no translation – take advantage of each technology’s properties• accurately calls short SNPs, indels in both color- and letter-space
• improved results with hybrid data.
Summary
motivation | methods | results | summary
• Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available)
• Contact: [email protected]
• Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available)
• Contact: [email protected]
43
SAVANT GENOME BROWSER
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Challenge in Genomic Data Analysis• genomic data is generated in high volumes• interpretation and analysis challenge• typical pipeline employs many separate tools for computation and visualization
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Tools for HTS data analysisTool Cost Computation Visualization
Read Alignment e.g. Bowtie, BWA
Free Y N
File Format Conversion e.g. Galaxy, SAMTools
Free Y N
Other Comand-line Toolse.g. Genetic Variation Discovery, Comparitive Genomics, etc.
Free Y N
UCSC Genome Browser Free N Y
Integrative Genomics Viewer Free N Y
GBrowse Free N Y
CLC Genomics Workbench $$$ Y Y
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
• substantial disconnect between the processes of computational analysis and visualization
Tools for HTS data analysisTool Cost Computation Visualization
Read Alignment e.g. Bowtie, BWA
Free Y N
File Format Conversion e.g. Galaxy, SAMTools
Free Y N
Other Comand-line Toolse.g. Genetic Variation Discovery, Comparitive Genomics, etc.
Free Y N
UCSC Genome Browser Free N Y
Integrative Genomics Viewer Free N Y
GBrowse Free N Y
CLC Genomics Workbench $$$ Y Y
Savant Genome Browser Free Y Y
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
• substantial disconnect between the processes of computational analysis and visualization
Savant Genome Browser• platform for integrated visual analysis of genomic data
• feature-rich genome browser•computationally extensible via plugin framework
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
(Very) Short List of Features
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
FEATURE DEMONSTRATION
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
INTERFACEHTS READ ALIGNMENTSEXAMPLE PLUGINS: SNP FINDER
Power of visual analytics• task: find the correct parameter for command-line tool
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Plugin Framework• unlocks the potential for performing visual analytics
• mutually beneficial for both users and tool developers
for users: perform complex data analyses on-the-fly within a visual environment
for programmers: platform for simple development and deployment of various programs
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
CONCLUSIONS
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Conclusions• Savant is a platform for integrated visualization and analysis of genomic data
•stand-alone genome browser•novel features: e.g. table view, visualization modes, data selection, etc.
•computationally extensible through plugin framework
• makes interpretation and analysis of genomic data easier and more efficient
Savant Genome Browser - http://compbio.cs.toronto.edu/savant/
Acknowledgements
Recep Andrew Vlad MikeBrudno
Yue Marc
Vanessa OrionJoe Nilgun
Paul
Vera
Misko Yoni
Questions?
SHRiMPhttp://compbio.cs.toronto.edu/shrimp
VARiDhttp://compbio.cs.toronto.edu/varid
Savant Genome Browserhttp://compbio.cs.toronto.edu/savant