Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High...
-
Upload
naomi-robinson -
Category
Documents
-
view
215 -
download
2
Transcript of Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High...
![Page 1: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/1.jpg)
Jonathan Laserson
advised by
Daphne Koller
October 20th, 2011
Bayesian Assembly of Reads from High Throughput Sequencing
THESIS DEFENSE
![Page 2: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/2.jpg)
High-Throughput Sequencing Millions of noisy short reads from a sample A snapshot of the genomic material in the
sample
Roche 454 GS-FLX Titanium Avg read length 380b reads in a run 500k Base error rate 0.74%
(84% indels) Cost per base 0.05$/kb
[You et al., 2011]
genomes
reads
CGGACT
CGGACTATCCACTATCCACCGACCCTGTGGGTCCCCGTCAGAACCCGCCCTA...
Output
![Page 3: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/3.jpg)
Population Sequencing
x
x
x
x
x x
x
x
x
x
x
xxx
x
x
xx
noise
mutations Depth
Phylogenetic trees Immune-seq
Breadth Metagenomics de novo assembly
reads
sequences
![Page 4: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/4.jpg)
CTGGGAAAAG AGCCCAGCCCC CGTAGATCA CTGGAAAAG GCTGACGTAG AAAGCCCAG ACTGACCTG
CCCTCATTTC ACTGCCCTGG GACGTAGATCA GCCCTGGTGA CGGCTGACGT AGATCATAGA
AGCCCAGCCC CATAGAAACGAT
reads
an assembly
contigs
Input
Output
CGGCTGACGTAGATCATAGAAACGATCGGCTGACGT GCTGACGTAG GACGTAGATCA CGTAGATCA AGATCATAGA CATAGAAACGAT
GENOVO
ACTGCCCTGGGAAAAGCCCAGCCCCTCATTTCACTGCCCTGGACTGCCCTG GCCCTGGGA CTGGGAAAAG CTGGGAAAAG AAAGCCCAG AGCCCAGCCC AGCCCAGCCCC CCCTCATTTCreads
![Page 5: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/5.jpg)
Previous Work Velvet, Euler:
Assume most reads have no error Assume a single genome Trash low-abundance reads Fix reads with one error a-priori
Newbler: No code / publication
![Page 6: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/6.jpg)
Our Approach: Probabilistic Model
Fully Bayesian model for high-throughput sequencing Jointly performing 3 tasks:
de-noising the reads de novo sequence assembly multiple alignment
We do not throw away reads Can handle unbalanced populations of genomes
![Page 7: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/7.jpg)
Our Approach: Probabilistic Model
bso
oi
o=-∞..+∞
π
si
i=1..N
α
yi
ρs
s=1..∞
β
s ~ CRP(α, N)
bso ~ U[A,C,G,T]
ρs ~ Beta(1,1+β)
oi ~ G(ρs_i)
yi ~ read_model(b,si,oi)
log P(y,b,s,o|ρ) ≈ #errors · log(ε) - #bases · log(4) + #seq · log(α) min read errors min bases
we want to find argmax P(b,s,o | y)
b,s,o
![Page 8: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/8.jpg)
Inference
MCMC Making moves in a probabilistic space Stochastic process
escape local optimum multiple hypotheses
![Page 9: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/9.jpg)
18
Move 1: Read Mapping
A READ samples a sequence and an offset (inside the seq)si oi
10
candidate locations
Likelihood Term Stickiness Term Geometric TermP(si=s,oi=o | y,b,ρ,s-i,o-i) α
![Page 10: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/10.jpg)
Comparison:How many bases did you BLAST into?
horizontal line:total basescovered byall methodstogether
vertical line:threshold for a perfect match
![Page 11: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/11.jpg)
horizontal line:total basescovered byall methodstogether
vertical line:threshold for a perfect match
Comparison:How many bases did you BLAST into?
![Page 12: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/12.jpg)
A Score for an Assembly
=
α controls minimal overlapbetween adjacent readsmin read errors min bases
#errors · log(ε) - #bases · log(4) + #seq · log(α)
![Page 13: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/13.jpg)
Contributions Fully Bayesian model for high-
throughput sequencing An algorithm jointly performing 3 tasks:
de-noising the reads de novo sequence Assembly multiple alignment
A score for de novo assembly Code freely available
Laserson J, Jojic V, Koller D, Genovo: de novo Assembly for Metagenomes, Journal of Computational Biology, 2011. RECOMB 2010.
![Page 14: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/14.jpg)
Population Sequencing
x
x
x
x
x x
x
x
x
x
x
xxx
x
x
xx
noise
mutations Depth
Phylogenetic trees Immune-seq
Breadth Metagenomics de novo assembly
reads
sequences
![Page 15: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/15.jpg)
Immune System 1010 B-Cells in an
individual
![Page 16: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/16.jpg)
[Illustration: Wikipedia]
Immune System 1010 B-Cells in an
individual Phase 1: generate basic
seq choose V,D,J from catalog
V D J
57 x 27 x 6
~297 bases ~24 bases ~52 bases
![Page 17: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/17.jpg)
Immune System 1010 B-Cells in an
individual Phase 1: generate basic
seq choose V,D,J stitch V-D and D-J by:
eat up a few letters from ends insert a few letters
V D JN1 N2
~360 bases [Illustration: Wikipedia]
![Page 18: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/18.jpg)
Immune System 1010 B-Cells in an
individual Phase 1: generate basic
seq Phase 2: hyper-mutate
If detecting an invader, clone yourself with lots-of-mutations
Even better detection? keep cloning.
Remember the successful sequences for future reference, and keep producing them.
This set of sequences is one clone.V D JN1 N2
[Illustration: Janeway, 2001]
![Page 19: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/19.jpg)
Input
Output
igForest
reads
clones
x
x
xx
x
x
x x
x
xxx
x
x
V1 JD 61
xx
V7 JD
x
11
x
xV4 JD 62
germline
mutationssequencing noise
![Page 20: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/20.jpg)
ReperTree
Phase 1: Divide reads into clones 1M reads! In a healthy individual almost every
readis from a different clone.
Phase 2: Construct a mutation tree for each clone. How to tell a mutation from a
sequencing error?
De Novo Annotation for B-Cells
![Page 21: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/21.jpg)
Overview 10 (healthy) Organ donors 4 samples from each individual
Blood, spleen, two lymph nodes 380k reads total
8,000 clones 165k other B-cells variants (not clones)
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
JD
igForest
![Page 22: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/22.jpg)
A Repertoire
Vs (57)
Js (6)
![Page 23: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/23.jpg)
Variants (173k)
blood spleen lymph node 1 lymph node 2
Js (6)
Vs (57)
pati
ents
pati
ents
![Page 24: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/24.jpg)
1.8
2
2.8
1.9
2.1
2.2
2.9
2.9
3
V,J VJ VS JS VD JD VJS VJD
XX
X XX X
X X X X
X X X X XX
X
X X
Repertoire Factorization
X(v,j,donor, source) = a(v,j,source) x b(v,j,donor)
*1e-3
RMSE = 1.8e-3
![Page 25: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/25.jpg)
Properties Of VariantsJD
JD
JD
JD
JD
JD
JD
JD
JD
JD
length of variable region
length of N1
length of N2
distance to germline
fract
ion o
f vari
ants
![Page 26: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/26.jpg)
Clones (8,000)
JD
JD
JD
bloodspleenMesenteric Mediastinal
shared2 sources
pureclones
shared3 sources
pati
ents
Evidence of clone migration
How do the trees look like?
![Page 27: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/27.jpg)
Current Phylogenetic Trees
X*100 sequences Weeks to run No sequencing/PCR errors Single solution
Binary tree Data at the leaves Continuous mutation
model
X*10,000 √ Hours √ Noise model √ Multiple/
statistics √
√ √ Discrete √
igForesttraditional phylo trees
![Page 28: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/28.jpg)
Previous Work on B Cells
ihmmune-align [Gaëta et al. 2007], SoDA2 [Munshaw and Kepler 2010]
examine each read independently, no tree
igTree [Barak et al. 2008] heuristic search, no high-throughput support
Our Approach: Joint probabilistic model for hypermutation
and sequencing noise
igForest
![Page 29: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/29.jpg)
Generate Tree Start with one live cell at t=0 Give birth at rate λ Die at rate δ Stop when t= “snapshot time”.
time t=snapshot
root
t=0
(Birth/death process)
![Page 30: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/30.jpg)
Generate Sequences Start with the inferred germline at the root Generate each child sequence from the sequence of
its parent Using a Mutation Model
Continue so, recursively.time
root
t=0
x
x
x
x
x
x
xx
x
x
x
x
xx
x
t=snapshot
![Page 31: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/31.jpg)
Generate Reads
reads
live cells
Total N reads. Assign reads to live cells (uniformly). Copy read sequence from cell, based on read model.
t=snapshott=0
![Page 32: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/32.jpg)
Inference MCMC Moves to manipulate tree and
read assignments Generating samples of trees along the way
![Page 33: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/33.jpg)
t=snapshott=0
1
2
1
01
2
1
Output Tree
mutation distance
![Page 34: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/34.jpg)
Output Tree
t=snapshott=0
1
2
1
01
2
1
6 8
211
2 6
1 2
![Page 35: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/35.jpg)
Sample Trees
reads
mutations compared to germline
bloodspleenMesenteric
Mediastinal
![Page 36: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/36.jpg)
Sample TreesbloodspleenMesenteric
Mediastinal
![Page 37: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/37.jpg)
![Page 38: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/38.jpg)
Properties Of Treesmax mutational height
median mutational height
median mutational distance (edge length)
max mutational distance (edge length)
fract
ion o
f cl
ones
![Page 39: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/39.jpg)
FR2 CDR2 FR3 CDR3
clonesJD
JD
Clones
root down the tree
Mutation Analysis – V region
clonesJD
JD
variants
JD
JD
JD
JD
variants
Clones
FR2 CDR2 FR3 CDR3
fract
ion o
f m
uta
tion
s
mutations binned by their aligned codon position
FR2 CDR2 FR3 CDR3
fract
ion o
f m
uta
tion
s
germline to root
![Page 40: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/40.jpg)
FR2 CDR2 FR3 CDR3
clonesJD
JD
Clones
root down the tree
Mutation Analysis – V region
clonesJD
JD
variants
JD
JD
JD
JD
variants
Clones
FR2 CDR2 FR3 CDR3
fract
ion o
f m
uta
tion
s
mutations binned by their aligned codon position
FR2 CDR2 FR3 CDR3
fract
ion o
f m
uta
tion
s
germline to root
![Page 41: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/41.jpg)
Fully Bayesian phylogenetic analysis O(10k) reads Handles sequencing noise
Big Picture view of the B cell repertoire Intuitive clone description
Insight on the Immune System Evidence of clone migration Differences between tissues Deep hypermutation analysis
Code (will be) freely available
Contributions
![Page 42: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/42.jpg)
x
x
x
x
x x
x
x
x
x
x
xxx
x
x
xx
noise
mutations Depth
Breadth
reads
sequences
Thank you!
![Page 43: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/43.jpg)
Daphne Koller
Acknowledgements
![Page 44: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/44.jpg)
Vladimir Jojic
Scott BoydAndrew Fire
Katherine JacksonElyse Hope
Acknowledgements
Funding National Science Foundation Stanford Graduate Fellowship
Chair: Richard Olshen
![Page 45: Jonathan Laserson advised by Daphne Koller October 20 th, 2011 Bayesian Assembly of Reads from High Throughput Sequencing THESIS DEFENSE.](https://reader035.fdocuments.us/reader035/viewer/2022062802/56649eb15503460f94bb7e40/html5/thumbnails/45.jpg)
DAGS team Ben Packer Karen Sachs Geremy Heitz Serafim Batzoglou and his
group Orly Fuhrman Leor Vartikovski My Parents