Use of Bayesian Statistical Techniques in High-throughput...
Transcript of Use of Bayesian Statistical Techniques in High-throughput...
![Page 1: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/1.jpg)
Use of Bayesian Statistical Techniques in High-throughput Sequencing
Christopher E. Mason Assistant Professor
The Institute for Computational Biomedicine, Department of Physiology and Biophysics at
Weill Cornell Medical College, and the Tri-I Program (Cornell, MSKCC, Rockefeller) on Comp. Bio. & Medicine
February 21st , 2012
![Page 2: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/2.jpg)
Kahvejian, 2008
Since DNA defines the biochemical recipe for the genesis of organisms, sequencing allows us to create molecular portraits
of development and disease at single-base resolution.
![Page 3: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/3.jpg)
Table 1: Number of Genes vs. Genome Sizeorganism genes Base pairs
Plant ~50,000 100,000,000,000 Human, mouse or rat 25,000 3,000,000,000
Honey bee 15,000 300,000,000 Fruit Fly 13,767 130,000,000 Worm 19,000 97,000,000 Fungus 6,000 13,000,000
Bacterium 3,000 5,000,000 Mycoplasma genitalium 500 580,000
DNA virus 10–900 50,000 RNA virus 1–25 10,000
Viroid 0–1 500
![Page 4: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/4.jpg)
Understanding the genome’s mutation, selection, and/or drift
C A G T C T
H A K V D S
κ ν γ p β α
GI
Brain
Liver
Skin
SG
Gon
H. sapiens
Disease
How can we understand these interactions?
Evo-Devo
Integrated, spatiotemporal molecular profiling.
![Page 5: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/5.jpg)
What erudition do we have now on the functional elements?
1.2 million sequence reads in brain (1990-‐2009)
Ø Currently limited amount of EST info at NCBI Ø EST data is expensive, Ome-‐consuming (cloning), and exhibits 3’ bias. Ø Much EST and cDNA data is for whole brains, and few libraries exist with region-‐specific data.
New School: One run of a NGS machine = billions of sequence reads in days
Old School
![Page 6: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/6.jpg)
DescripOon/Discussion of the Various Technologies
• The goal of the Archon X prize in Genomics is to enable a $1,000 genome,
• Currently at $3,000-‐$50,000 • Certain pla^orms are be_er suited for certain tasks: – CounOng applicaOons (ChIP-‐Seq, RNA-‐Seq) need more reads
– De novo assembly work needs longer reads – Whole genome re-‐sequencing requires lower errors rate and high processivity
![Page 7: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/7.jpg)
But, there are many options:
Michael Metzker, 2010
![Page 8: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/8.jpg)
DNA (0.1-‐1.0 ug)
Single molecule array Sample
prepara9on Cluster growth 5’
5’ 3’
G
T
C
A
G
T
C
A
G
T
C
A
C
A
G
T C
A
T
C
A
C
C
T A G
C G
T A
G T
1 2 3 7 8 9 4 5 6
Image acquisi9on Base calling
T G C T A C G A T …
Sequencing
Illumina SBS Technology Reversible Terminator Chemistry Founda7on
h_p://www.illumina.com/technology/sequencing_technology.ilmn
![Page 9: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/9.jpg)
Quality Scores vary
Bansal et al, 2010
![Page 10: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/10.jpg)
Pyrosequencing
Metzker, 2010
![Page 11: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/11.jpg)
Single Molecule Real-‐Time
Metzker, 2010
![Page 12: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/12.jpg)
Direct DetecOon of MethylaOon
70.5 71.0 71.5 72.0 72.5 73.0 73.5 74.0 74.50
100
200
300
400
Fluo
resc
ence
inte
nsity
(a.u
.)
Time (s)
104.5 105.0 105.5 106.0 106.5 107.0 107.5 108.0 108.50
100
200
300
400
Fluo
resc
ence
inte
nsity
(a.u
.)
Time (s)
C
T G A T C G T A C
mA A G T C T A A
G C C A A A A
Approach: KineOc detecOon of methylated bases during SMRT DNA sequencing
Example: N6-‐methyladenosine (mA)
![Page 13: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/13.jpg)
detect other base modificaOons
0 10 20 30 40 50 60 70 80 900.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
Pulse width ra
9o
Pulse width
DNA Template Posi9on
0 10 20 30 40 50 60 70 80 90
0.5
1.0
1.5
2.0
2.5
3.0
IPD Ra
9o
Interpulse duraOon
5-‐methylcytosine (mC)
= Methylated posiOon
0 10 20 30 40 50 60 70 80 900.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
DNA Template Posi9on
Pulse width ra
9o
Pulse width
0 10 20 30 40 50 60 70 80 90
0.5
1.0
1.5
2.0
2.5
3.0
IPD Ra
9o
Interpulse duraOon
5-‐hydroxymethylcytosine (hmC)
OH
![Page 14: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/14.jpg)
Watch TranslaOon in Real-‐Time
Uemura et al, Nature, 2010
![Page 15: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/15.jpg)
Full-‐length cDNA sequencing
![Page 16: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/16.jpg)
“Post-‐Light,” Semi-‐Conductor Sequencing
Essentially, 11 million very small pH meters Purushothaman et al, 2005
IonTorrent, Inc.
![Page 17: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/17.jpg)
MinION GridION
Exonuclease-‐Seq Strand-‐Seq
DNA Sequencing
![Page 18: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/18.jpg)
Other (Maybe Killer) Apps
Analyte Protein Aptamer
Direct RNA Sequencing Small molecule
![Page 19: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/19.jpg)
Each Pla^orm has various sources of noise, and thus Error
• De-‐Phasing – Lagging strand dephasing from incomplete extension
– Leading strand dephasing from over-‐extension • Dark NucleoOdes • Polymerase errors (10-‐5 to 10-‐7) • Pla^orm-‐specific errors
– Illumina more likely to have error aker ‘G’ – PCR-‐based methods miss GC-‐ and AT-‐rich regions
![Page 20: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/20.jpg)
Alignment to the genome
![Page 21: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/21.jpg)
![Page 22: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/22.jpg)
Analyzing High-Resolution Data
• Bayesian Methods • Hidden Markov Models • Permutation Testing • Circular Binary Segmentation • Seed-seeking • Least Squares Regression • Democratic Voting
![Page 23: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/23.jpg)
Prior • The prior function Pr (H) gives the probability of different possible
values of the quantity of interest before the data are considered – that is, it represents the state of knowledge prior to the data.
• Prior may be broad or flat if we have few data (non-informative prior), or peaked if we have more information (informative prior).
Pro
babi
lity
Unknown quantity of interest
(Population size, length at sexual maturity, haplotype frequency, model parameter)
![Page 24: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/24.jpg)
Likelihood • The likelihood function Pr (data | H) gives the probability of
obtaining the data, given different possible values of the unknown quantity of interest (the “hypothesis” H).
• The likelihood is calculated using a statistical model, which represents the process that produced the data. The likelihood function connects the parameters of the model to the data.
Like
lihoo
d
Unknown quantity of interest
(Population size, length at sexual maturity, haplotype frequency, model parameter)
![Page 25: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/25.jpg)
Posterior • The posterior function Pr (H | data) gives the probability of different
possible values of the quantity of interest after the data are considered – that is, it represents the state of knowledge posterior to the data.
• The posterior is a combination of the prior (what we knew before) and the likelihood (what the data told us).
• The difference between the prior and the posterior indicates how much we learned from the data.
prior
posterior
Pro
babi
lity
Unknown quantity of interest
(Population size, length at sexual maturity, haplotype frequency, model parameter)
![Page 26: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/26.jpg)
Paradigm for Bayesian inference
posterior likelihood x prior
new state of knowledge
Thus, in Bayesian reasoning, new data update the current state of knowledge through Bayes’ Theorem. The result is a new state of knowledge represented by the posterior.
information from new data
current state of knowledge x
![Page 27: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/27.jpg)
Bayes’ Theorem
(H) The Hypothesis (the unknown) and x = data
p (H | data) = ——————————— p (data | H) p (H)
p (data)
posterior = likelihood x prior
![Page 28: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/28.jpg)
2 fundamental Bayesian concepts:
1. Things that are unknown are represented by probability distributions.
2. Things that are known (data) are used to improve the knowledge of unknowns through Bayes’ Theorem.
![Page 29: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/29.jpg)
Bayesian view of cancer
• 1% of women at 40 yrs have breast cancer (Prior)
• Mammography diagnoses 8/10 correctly (true positive rate, false negative rate)
• 10% of mammographies are false positives
• If you get a positive result, what are your odds of having breast cancer?
![Page 30: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/30.jpg)
healthy c
10,000 patients
100 patients
9,900 patients
c
8/10 true positives
80 patients
healthy 990
patients
10% false negatives
80 1070
=7.5%
![Page 31: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/31.jpg)
Bayes Theorem depends on the prior probability (pr(A)):
.8*.01 .8*.01 + .1*.99 =7.5%
A = have cancer B = positive result
![Page 32: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/32.jpg)
healthy c
1,000,000 patients
1 patient
999,999 patients
c
8/10 true positives
~.8 patients
healthy 9,999
patients
10% false negatives
0.8 10,000
=0.0008%
![Page 33: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/33.jpg)
Q. What is the Bayesian Conspiracy?
A. The Bayesian Conspiracy is a mulOnaOonal, interdisciplinary, and shadowy group of scienOsts that controls publicaOon, grants, tenure, and the illicit traffic in grad students. The best way to be accepted into the Bayesian Conspiracy is to join the Campus Crusade for Bayes in high school or college, and gradually work your way up to the inner circles. It is rumored that at the upper levels of the Bayesian Conspiracy exist nine silent figures known only as the Bayes Council.
h_p://yudkowsky.net/raOonal/bayes
![Page 34: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/34.jpg)
ApplicaOons of Bayes to DNA-‐Seq
![Page 35: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/35.jpg)
![Page 36: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/36.jpg)
Each Pla^orm is slightly different, and so intrinic errors are different
SeQC Consortium
![Page 37: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/37.jpg)
![Page 38: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/38.jpg)
SNIP-‐Seq SNP calling
For each potenOal variant site in the sequenced region: Set the base quality value for each base call to the Illumina
quality value For k = 1, 2,…
a. Sample a genotype for each individual from the posterior distribuOon using a heterozygote prior of 0.001.
b. Recalibrate the quality score for each base call using genotypes for all individuals.
If the genotype of any individual is different from the reference, idenOfy posiOon as a SNP.
Sample a genotype for each individual from the posterior distribuOon computed using a heterozygote prior of 0.2.
Bansal et al, 2010
![Page 39: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/39.jpg)
PopulaOon-‐based models can overcome systemaOc errors, but…
![Page 40: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/40.jpg)
Population Stratification is from the migration patterns of haplotypes
throughout human history
![Page 41: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/41.jpg)
Genes mirror geography within Europe Novembre et al, 2008
![Page 42: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/42.jpg)
Genes mirror geography within Europe Novembre et al, 2008
![Page 43: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/43.jpg)
ApplicaOons of Bayes to RNA-‐Seq
![Page 44: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/44.jpg)
Microarrays vs. RNA-‐Seq
Marioni and Mason et. al, Genome Research, 2008
RNA-‐Seq: An assessment of technical reproducibility and comparison with gene expression arrays
![Page 45: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/45.jpg)
• Early days—fold change cutoffs (e.g., 2x difference or better) • not a very satisfying approach: -doesn’t take into account variance -misses any small changes
Data Analysis: What genes are differentially expressed?
Here, “A” has a fold change >2.5, but varies greatly between replicate experiments. “B” has a fold change of only 1.75, but changes reliably each time the experiment is performed.
![Page 46: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/46.jpg)
Experimental Design: Liver vs. Kidney
![Page 47: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/47.jpg)
Metric for RNA-Seq Expression
RPKM: Reads per Kilobase per Million Reads
Normalizes for (1) gene size and (2) sequencing depth (~0.1-‐1 transcript/cell)
Y = (exons, introns, intergenic reads)
Mortazavi, Williams, et al. Nature Methods, 2008
FPKM=fragments-PKM is for paired-end data
![Page 48: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/48.jpg)
Comparing GA and Affy arrays Log2 fo
ld change (kidne
y/liver) -‐ Affy
metrix
Log2 fold change (kidney/liver) -‐ Solexa Spearman correlaOon = 0.72
![Page 49: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/49.jpg)
13,072 DifferenOally Expressed (DE) Genes
(37%) DE in Solexa 4,959
(12%) DE in Affy 1,579
(50%) Both 6,534
![Page 50: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/50.jpg)
Bias is introduced if these ratios are not kept:
Y All Unique Reads
Exons Intron Inter-‐ genic
Exons Intron Inter-‐ genic
Good Run
Bad Run
Good Run
Bad Run RPKM = 72.7
RPKM = 128.4
Mortazavi, Williams, et al. Nature Methods, 2008
![Page 51: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/51.jpg)
NormalizaOon is needed
Define Ygk as the observed count for gene g in library k summarized from the raw reads, μgk as the true and unknown expression level (number of transcripts), Lg as the length of gene g and Nk as total number of reads for library k. We can model the expected value of Ygk as: Sk represents the total RNA output of a sample. The problem underlying the analysis of RNA-‐seq data is that while Nk is known, Sk is unknown and can vary drasOcally from sample to sample, depending on the RNA composiOon.
![Page 52: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/52.jpg)
Robinson and Oshlack Genome Biology 2010 11:R25
NormalizaOon is needed
![Page 53: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/53.jpg)
NormalizaOon is needed
Robinson and Oshlack Genome Biology 2010 11:R25 Trimmed Mean M-‐values (TMM)
![Page 54: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/54.jpg)
![Page 55: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/55.jpg)
Katz et al, 2010
![Page 56: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/56.jpg)
Coverage Requirements: How many lanes/plates/wells?
Depends on: 1. Read Length
2. Size of Transcriptome 3. Complexity of Transcriptome
4. Complexity of Tissue 5. Biological Variance
6. Errors (random and systemaOc)
![Page 57: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/57.jpg)
Plateau of InformaOon Starts @ ~500Mb
Marioni,Mason et al, Genome Research, 2008
![Page 58: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/58.jpg)
Coverage Requirements
Wang, Gerstein, and Snyder, 2008
![Page 59: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/59.jpg)
No current visible end of gene discovery
![Page 60: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/60.jpg)
How many replicates do I need? CalculaOon of the number of replicates depends on:
1. An esOmate of σ2 obtained from previous experiments. 2. The size of the difference (δ) to be detected.
3. The assurance with which it is desired to detect the difference (i.e., Power of the test = 1-‐β).
4. The level of significance to be used in the actual experiment (i.e., Type I error). 5. The test required, whether a one-‐tail or two-‐tail test.
To determine the number of replicates use the following formula :
where: Zα/2 is associated with the Type I error (two-‐tailed) Zβ is associated with the Type II error
δ is the true difference to be detected, and σ is the known variance obtained from previous experiments
![Page 61: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/61.jpg)
Bayes in Chip-‐seq too!
![Page 62: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/62.jpg)
What’s the problem with Bayesian statistics? (according to non-Bayesians)
1. Priors can introduce subjective judgment into data analysis.
2. Priors affect the result. Different people can get different answers from the same data.
3. It’s too hard. There are no simple point-and-click programs.
![Page 63: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/63.jpg)
This requires a lot of space
1.2 million sequence reads in brain (1990-‐2009)
Old School
Stein L, 2010
![Page 64: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/64.jpg)
drugs
vitamin
hormone
N other species
mRNAs degrade
TFBS
Gene regulaOon
Robust Sample Prep
& Data AcquisiOon
Family Medical History
1. Small Variants (SNVs, indels) 2. DE by exon, gene, intron, & jxn 3. PolyA sites and miRNA
changes 4. TARs and ORF potenOal 5. Gene fusion events 6. Allele-‐specific expression
1. Variants (SNVs, indels) 2. Structural variants (CNVs, TxC) 3. Ancestry predicOon from
AIMs
RNA edits, eQTLs
StaOsOcal Models Built from Per-‐Base, High-‐Res Data
1. mC or hmC modificaOons 2. Histone marks (HXXX) 3. Regulatory informaOon
1. Mass spectrophotometry data 1. Protein-‐Protein InteracOons 2. Polysome-‐Associated mRNAs
Nucleosome effects
1. Metabolic changes
(G)enome Human
AnnotaOons, TDBGV miRbase, TCGA,
1KG,dbSNP
Bacteria /Fungi
G
Virus
G
And that is only the Genome!
![Page 65: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/65.jpg)
Meta-genomic phenotypes can persist for years, and “passenger genomes” can be a phenotype, as
well as their distributions.
Jakobsson et al, 2010
Normal throat
Throat + anObioOcs
![Page 66: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/66.jpg)
Risk factors for diseases usually involve many genes and pathways
. Li et al, PLOS GeneOcs, 2007.
![Page 67: Use of Bayesian Statistical Techniques in High-throughput …physiology.med.cornell.edu/people/banfelder/qbio/... · 2015-01-25 · Use of Bayesian Statistical Techniques in High-throughput](https://reader033.fdocuments.us/reader033/viewer/2022042922/5f6cde7373abac6a3e1a0587/html5/thumbnails/67.jpg)
There are other factors than these!
Systems biology requires spaOotemporal monitoring of the genome, epigenome, transcriptome, proteome, metabolome, and the environment, to see the interactome
SHOW NO DIFFERENCE