An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun...
-
Upload
ella-flynn -
Category
Documents
-
view
215 -
download
0
Transcript of An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun...
![Page 1: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/1.jpg)
naiveBayesCallAn efficient model-based base-calling algorithm
for high-throughput sequencingWei-Chun Kao and Yun S. Song, UC Berkeley
Presented by: Frank Chun Yat Li
![Page 2: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/2.jpg)
Introduction
![Page 3: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/3.jpg)
Extracting the DNA SequenceImage Analysis
Translate image data into fluorescence intensity data
Base-callingFrom the intensities infer the DNA sequence
![Page 4: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/4.jpg)
Base-calling AlgorithmsBustard
Ships with Illumina’s Genome AnalyzerVery efficient and based on matrix inversion But error rate very high in the later cycles
Alta-CyclicMore accurateBut needs large amounts of labelled training data
BayesCallMost accurateLittle training data But computation time is very high, not practical
![Page 5: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/5.jpg)
navieBayesCallBuilds upon BayesCall
An order of magnitude faster Comparable error rate
![Page 6: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/6.jpg)
Review of BayesCall’s ModelL total cyclesei elementary 4X1 column matrix, eg:
Goal is to produce the sequence of the kth cluster Sk
• (Sk) = (S1,k, … SL,k) with St,k Є { eA, eC, eG, eT }
• Sk is initialized to have uniform distribution of ei
• If the distribution of the Genome is known Sk can initialize to that to improve accuracy
൦
0100൪ => ൦𝐴𝐶𝐺𝑇൪
![Page 7: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/7.jpg)
Terminology Active template density At,k
Model of the density of the kth cluster that is still active at cycle t
It,k is a 4 X 1 matrix of the 4 intensities
![Page 8: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/8.jpg)
Terminology Phasing of each DNA strand is modelled by a L
X L matrix P, where
p = probability of phasing (no new base is synthesized)
q = probability of prephasing (2 new bases are synthesized instead of 1)
Qj,t is the probability of the template terminating at location j after t cyclesQj,t = [Pt]0,j
![Page 9: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/9.jpg)
Punch LineFrom At,k and Qj,t, we can estimate the
concentration of active DNA strands in a cluster at cycle tNamed See formula (2) in the paper
With , model other residual effects that propagate from one cycle to the nextNamed See formula (3) in the paper
Punch lineObserved fluorescence intensities is a 4D normal
distribution
![Page 10: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/10.jpg)
naiveBayesCallTry to apply the Viterbi algorithm to this
domainProblem 1: high order Markov model because
of all the prephasing and phasing effects, so very computationally expensive
Problem 2: the densities Ak = (A1,k, … , AL,k) is a series of continuous random variables and the Viterbi algorithm does not handle continuous variables
![Page 11: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/11.jpg)
Problem 1 SolutionIf we provide a good initialization of Sk, then
the algorithm can home in on the solution a lot fasterThe authors came up with a hybrid algorithm
using modeling in BayesCall and approach in Bustard to come up with the initializations quickly
See section 3.1 in the paper At,k can be estimated using Maximum a
Posteriori (MAP) estimation See forumlaes (10), (11), (12)
PROBLEM 2 SOLUTION
![Page 12: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/12.jpg)
naiveBayesCall Algorithm
![Page 13: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/13.jpg)
ResultsData came from sequencing PhiX174 virus.
All 4 algorithms were ran against the data (Bustard, Alta-Cyclic, BayesCall, naiveBayesCall)
naiveBayesCall is more accurate from 21+ bp
![Page 14: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/14.jpg)
ImprovementsnaiveBayesCall accuracy is higher than Alta-
Cyclic from all cyclesError rate is also lower than Bustard’s at
later cycles
![Page 15: An efficient model-based base-calling algorithm for high-throughput sequencing Wei-Chun Kao and Yun S. Song, UC Berkeley Presented by: Frank Chun Yat Li.](https://reader035.fdocuments.us/reader035/viewer/2022062422/56649efc5503460f94c0fe20/html5/thumbnails/15.jpg)
What is the Point?Each run of the sequencer costs $$With lower error rate, we can run the
sequencer longer and still obtain accurate data
less runs!