Parsing A Bacterial Genome
Mark Craven
Department of Biostatistics & Medical Informatics
University of Wisconsin
U.S.A.
www.biostat.wisc.edu/~craven
The Task
Given: a bacterial genome
Do: use computational methods to predict a “parts list” of regulatory elements
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar
The Central Dogma of Molecular Biology
Transcription in Bacteria
Operons in Bacteria
• operon: sequence of one or more genes transcribed as a unit under some conditions
• promoter: “signal” in DNA indicating where to start transcription
• terminator: “signal” indicating where to stop transcription
[diagram: an operon — promoter, followed by one or more genes, followed by a terminator — transcribed into a single mRNA]
The Task Revisited
Given:
– DNA sequence of E. coli genome
– coordinates of known/predicted genes
– known instances of operons, promoters, terminators
Do:
– learn models from known instances
– predict complete catalog of operons, promoters, terminators for the genome
Our Approach: Probabilistic Language Models
1. write down a “grammar” for elements of interest (operons, promoters, terminators, etc.) and relations among them
2. learn probability parameters from known instances of these elements
3. predict new elements by “parsing” uncharacterized DNA sequence
Transformational Grammars
• a transformational grammar characterizes a set of legal strings
• the grammar consists of
– a set of abstract nonterminal symbols
– a set of terminal symbols (those that actually appear in strings)
– a set of productions
nonterminal symbols: S, C1, C2, C3, C4
terminal symbols: a, c, g, t
productions:
S → C1
C1 → t C2
C2 → a C3
C2 → g C4
C3 → g
C3 → a
C4 → a
A Grammar for Stop Codons
• this grammar can generate the 3 stop codons: taa, tag, tga
• with a grammar we can ask questions like
– what strings are derivable from the grammar? (see the sketch below)
– can a particular string be derived from the grammar?
S → C1
C1 → t C2
C2 → a C3
C2 → g C4
C3 → g
C3 → a
C4 → a
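The derivability questions above are easy to answer mechanically for a grammar this small. Here is a minimal Python sketch (the representation and helper function are mine, not the talk's) that enumerates every derivable string:

```python
# A minimal sketch: enumerate all strings derivable from the stop-codon
# grammar by breadth-first expansion of sentential forms.
from collections import deque

# productions of the grammar above; keys are the nonterminals
PRODUCTIONS = {
    "S":  [("C1",)],
    "C1": [("t", "C2")],
    "C2": [("a", "C3"), ("g", "C4")],
    "C3": [("g",), ("a",)],
    "C4": [("a",)],
}

def derivable_strings(start="S"):
    """Expand leftmost nonterminals until only terminals remain."""
    results = set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # find the first nonterminal in the sentential form
        idx = next((i for i, s in enumerate(form) if s in PRODUCTIONS), None)
        if idx is None:
            results.add("".join(form))        # all terminals: a derived string
            continue
        for rhs in PRODUCTIONS[form[idx]]:    # expand it every possible way
            queue.append(form[:idx] + rhs + form[idx + 1:])
    return results

print(sorted(derivable_strings()))  # ['taa', 'tag', 'tga']
```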
The Parse Tree for tag
S → C1
C1 → t C2
C2 → a C3
C2 → g C4
C3 → g
C3 → a
C4 → a

derivation of tag: S ⇒ C1 ⇒ t C2 ⇒ t a C3 ⇒ t a g
A Probabilistic Version of the Grammar
• each production has an associated probability
• the probabilities for productions with the same left-hand side sum to 1
• this grammar has a corresponding Markov chain model

S → C1 (1.0)
C1 → t C2 (1.0)
C2 → a C3 (0.7)
C2 → g C4 (0.3)
C3 → g (0.2)
C3 → a (0.8)
C4 → a (1.0)
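Because every nonterminal here rewrites to at most one terminal plus one nonterminal, the grammar behaves like a Markov chain, and scoring a string just multiplies production probabilities along its derivation. A toy Python sketch under that reading (the data structure is mine):

```python
# A minimal sketch, assuming the production probabilities shown above.
PROB_PRODUCTIONS = {
    "S":  [(1.0, ("C1",))],
    "C1": [(1.0, ("t", "C2"))],
    "C2": [(0.7, ("a", "C3")), (0.3, ("g", "C4"))],
    "C3": [(0.2, ("g",)), (0.8, ("a",))],
    "C4": [(1.0, ("a",))],
}

def string_probability(s, symbols=("S",)):
    """Probability that the sentential form `symbols` derives exactly s."""
    if not symbols:
        return 1.0 if s == "" else 0.0
    head, rest = symbols[0], symbols[1:]
    if head not in PROB_PRODUCTIONS:          # terminal: must match next char
        return string_probability(s[1:], rest) if s.startswith(head) else 0.0
    # nonterminal: sum over its productions (here at most one path matches)
    return sum(p * string_probability(s, rhs + rest)
               for p, rhs in PROB_PRODUCTIONS[head])

for codon in ("taa", "tag", "tga"):
    print(codon, string_probability(codon))
# taa -> 0.7 * 0.8 = 0.56 ; tag -> 0.7 * 0.2 = 0.14 ; tga -> 0.3 * 1.0 = 0.30
```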
A Probabilistic Context Free Grammar for Terminators
START → PREFIX STEM_BOT1 SUFFIX
PREFIX → B B B B B B B B B
SUFFIX → B B B B B B B B B
STEM_BOT1 → tl STEM_BOT2 tr
STEM_BOT2 → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_MID → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_TOP2 → tl* STEM_TOP1 tr*
STEM_TOP1 → tl LOOP tr
LOOP → B B LOOP_MID B B
LOOP_MID → B LOOP_MID | ε
B → a | c | g | u

where tl, tr ∈ t = {a, c, g, u} and tl*, tr* ∈ t* = {a, c, g, u, ε}
[figure: an example terminator parse — prefix c-u-c-a-a-a-g-g, a stem pairing cgaccgc with gcuggcg, a short loop, and the poly-U suffix u-u-u-u-u-u-u-u; regions labeled prefix / stem / loop / suffix]
Inference with Probabilistic Grammars
• for a given string there may be many parses, but some are more probable than others
• we can do prediction by finding relatively high probability parses
• there are dynamic programming algorithms for finding the most probable parse efficiently
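The talk does not name the algorithm, but for a PCFG in Chomsky normal form the standard dynamic program for the most probable parse is a Viterbi-style CYK. A toy sketch (the grammar encoding and names are mine, not the talk's system):

```python
# A minimal Viterbi-CYK sketch: returns the log-probability of the most
# probable parse of s under a CNF PCFG. unary[A] = [(prob, terminal)],
# binary[A] = [(prob, (B, C))].
import math

def viterbi_cyk(s, unary, binary, start="S"):
    n = len(s)
    best = {}  # best[(i, j, A)] = best log-prob of deriving s[i:j] from A
    for i, ch in enumerate(s):                       # width-1 spans
        for A, rules in unary.items():
            for p, term in rules:
                if term == ch:
                    key = (i, i + 1, A)
                    best[key] = max(best.get(key, -math.inf), math.log(p))
    for width in range(2, n + 1):                    # wider spans, bottom up
        for i in range(n - width + 1):
            j = i + width
            for A, rules in binary.items():
                for p, (B, C) in rules:
                    for k in range(i + 1, j):        # split point
                        if (i, k, B) in best and (k, j, C) in best:
                            cand = math.log(p) + best[(i, k, B)] + best[(k, j, C)]
                            if cand > best.get((i, j, A), -math.inf):
                                best[(i, j, A)] = cand
    return best.get((0, n, start))

# toy CNF grammar: S -> A B (1.0), A -> 'a' (1.0), B -> 'b' (1.0)
print(viterbi_cyk("ab", {"A": [(1.0, "a")], "B": [(1.0, "b")]},
                  {"S": [(1.0, ("A", "B"))]}))      # 0.0 == log 1.0
```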
Learning with Probabilistic Grammars
• in this work, we write down the productions by hand, but learn the probability parameters
• to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar
• when there is hidden state (i.e. the correct parse is not known), we use Expectation Maximization (EM) algorithms
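When the parses of the training sequences are fully known, maximum-likelihood estimation reduces to normalized production counts. A minimal sketch with made-up observations:

```python
# A minimal sketch of supervised parameter estimation: the ML probability
# of a production A -> rhs is its count normalized over all productions
# that share the left-hand side A.
from collections import Counter, defaultdict

def estimate_probabilities(parsed_productions):
    """parsed_productions: iterable of (lhs, rhs) pairs from known parses."""
    counts = Counter(parsed_productions)
    totals = defaultdict(float)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {(lhs, rhs): c / totals[lhs] for (lhs, rhs), c in counts.items()}

# toy data: productions observed in parses of known examples
obs = [("C3", "g"), ("C3", "a"), ("C3", "a"), ("C3", "a"), ("C4", "a")]
print(estimate_probabilities(obs))
# {('C3', 'g'): 0.25, ('C3', 'a'): 0.75, ('C4', 'a'): 1.0}
```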
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics ‘03]
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar
A Model for Transcription Units

[figure: the transcription-unit model — a chain of components spanning untranscribed and transcribed regions. Promoter elements (-35 box, -10 box, TSS) are position-specific Markov models; ORFs, spacers, UTRs, and the post-promoter, intra-ORF, and pre-terminator regions are semi-Markov models; the rho-independent (RIT) and rho-dependent (RDT) terminator prefixes, stem-loops, and suffixes are SCFGs]
The Components of the Model
• stochastic context free grammars (SCFGs) represent variable-length sequences with long-range dependencies
• semi-Markov models represent variable-length sequences
Pr(x_1, …, x_l) = Pr(l) · ∏_{i=1..l} Pr(x_i | x_{i−1})

• position-specific Markov models represent fixed-length sequence motifs

Pr(x_1, …, x_l) = ∏_{i=1..l} Pr(x_i | x_{i−1}, i)
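As a concrete reading of the position-specific formula, here is a toy scorer in which each base's probability depends on both the previous base and its position (the probability tables are invented for illustration):

```python
# A minimal sketch of a position-specific Markov model for a fixed-length
# motif: tables[i] maps (prev_base, base) -> Pr(base | prev_base, position i).
def motif_probability(x, tables):
    """Probability of motif x under position-specific conditionals."""
    prob, prev = 1.0, None          # None marks the start of the motif
    for i, base in enumerate(x):
        prob *= tables[i][(prev, base)]
        prev = base
    return prob

# toy 2-position motif
tables = [
    {(None, "t"): 0.9, (None, "a"): 0.1},
    {("t", "a"): 0.8, ("t", "g"): 0.2, ("a", "a"): 0.5, ("a", "g"): 0.5},
]
print(motif_probability("ta", tables))  # 0.9 * 0.8 = 0.72
```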
Gene Expression Data
• in addition to DNA sequence data, we also use expression data to make our parses
• microarrays enable the simultaneous measurement of the transcription levels of thousands of genes
[figure: microarray heatmap — rows: experimental conditions; columns: genes/sequence positions]
Incorporating Expression Data
[figure: a DNA sequence paired with per-position expression measurements along the genome]
• our models parse two sequences simultaneously
– the DNA sequence of the genome
– a sequence of expression measurements associated with particular sequence positions
• the expression data is useful because it provides information about which subsequences look like they are transcribed together
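One toy way to picture the combination (this is my illustration, not the talk's actual model): score a candidate segment by adding an expression term to the sequence log-probability, rewarding segments whose measurements fit the "transcribed" profile:

```python
# A minimal, made-up sketch: combine a sequence log-probability with a toy
# Gaussian expression term whose mean depends on the region type.
def segment_score(seq_logprob, expr_levels, transcribed):
    mu = 2.0 if transcribed else 0.0   # invented means for the two region types
    expr_term = sum(-0.5 * (e - mu) ** 2 for e in expr_levels)
    return seq_logprob + expr_term

# a segment with uniformly high expression scores better as "transcribed"
levels = [1.8, 2.1, 2.2]
print(segment_score(-10.0, levels, True) > segment_score(-10.0, levels, False))  # True
```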
Predictive Accuracy for Operons
[bar chart (0–100): sensitivity, specificity, and precision of operon predictions using sequence only, expression only, and sequence+expression]

sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
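The three metrics translate directly into code (counts below are illustrative):

```python
# direct translations of the metrics above
def sensitivity(tp, fn): return tp / (tp + fn)
def specificity(tn, fp): return tn / (tn + fp)
def precision(tp, fp):   return tp / (tp + fp)

print(sensitivity(80, 20), specificity(90, 10), precision(80, 10))
# 0.8 0.9 0.888...
```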
Predictive Accuracy for Promoters
[bar chart (50–100): sensitivity, specificity, and precision of promoter predictions using sequence only, expression only, and sequence+expression; metrics as defined above]
Predictive Accuracy for Terminators
[bar chart (50–100): sensitivity, specificity, and precision of terminator predictions using sequence only, expression only, and sequence+expression; metrics as defined above]
Accuracy of Promoter & Terminator Localization
Terminator Predictive Accuracy
[ROC plot: true positive rate vs. false positive rate for four methods — SCFG; SCFG, no training; complementarity matrix; interpolated Markov model]

true positive rate = TP / (TP + FN)
false positive rate = FP / (FP + TN)
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training data with “weakly” labeled examples [Bockhorst & Craven, ICML ’02]
5. refining the structure of a stochastic context free grammar
Key Idea: Weakly Labeled Examples
• regulatory elements are inter-related
– promoters precede operons
– terminators follow operons
– etc.
• relationships such as these can be exploited to augment training sets with “weakly labeled” examples
Inferring “Weakly” Labeled Examples
[figure: a genome region containing genes g1–g5 and its DNA sequence]
• if we know that an operon ends at g4, then there must be a terminator shortly downstream
• if we know that an operon begins at g2, then there must be a promoter shortly upstream
• we can exploit relations such as these to augment our training sets
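A minimal sketch of that inference step (the window size and helper names are mine, not the paper's):

```python
# A minimal sketch of weak labeling: if an operon's last gene ends at
# position `operon_end`, the window just downstream must contain a
# terminator; symmetrically for promoters upstream of the first gene.
def weak_terminator_example(genome, operon_end, window=100):
    """Return a subsequence known to contain a terminator somewhere."""
    return genome[operon_end : operon_end + window]

def weak_promoter_example(genome, operon_start, window=100):
    """Return a subsequence known to contain a promoter somewhere."""
    return genome[max(0, operon_start - window) : operon_start]

genome = "acgt" * 50                       # toy genome string
print(weak_terminator_example(genome, 120, window=20))
```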
Strongly vs. Weakly Labeled Terminator Examples
strongly labeled terminator: the exact extent of the terminator, the end of its stem-loop, and its sub-class (rho-independent) are all marked

weakly labeled terminator: the same sequence, known only to contain a terminator somewhere

[figure: the sequence gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg shown with and without these annotations]
Training the Terminator Models:Strongly Labeled Examples
[diagram: rho-independent examples train the rho-independent terminator model; rho-dependent examples train the rho-dependent terminator model; negative examples train the negative model]
Training the Terminator Models:Weakly Labeled Examples
[diagram: weakly labeled examples train a combined terminator model (rho-independent + rho-dependent terminator models); negative examples train the negative model]
Do Weakly Labeled Terminator Examples Help?
• task: classification of terminators (both sub-classes) in E. coli K-12
• train SCFG terminator model using:
– S strongly labeled examples and
– W weakly labeled examples
• evaluate using area under ROC curves
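Area under the ROC curve can be computed as the probability that a randomly chosen positive outscores a randomly chosen negative. A minimal sketch (scores are illustrative):

```python
# A minimal sketch of the evaluation measure: AUC as the rank statistic
# Pr(random positive scores above random negative), ties counting half.
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3]))  # 5/6 ≈ 0.83
```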
Learning Curves using Weakly Labeled Terminators

[plot: area under ROC curve (0.5–1.0) vs. number of strong positive examples (0–140), for 0, 25, and 250 weak examples]
Are Weakly Labeled Examples Better than Unlabeled Examples?
• train SCFG terminator model using:
– S strongly labeled examples and
– U unlabeled examples
• vary S and U to obtain learning curves
Training the Terminator Models:Unlabeled Examples
[diagram: unlabeled examples train a combined model built from the rho-independent terminator, rho-dependent terminator, and negative models]
Learning Curves: Weak vs. Unlabeled

[paired plots: area under ROC curve (0.6–1.0) vs. number of strong positive examples (0–120); left, weakly labeled: 0, 25, and 250 weak examples; right, unlabeled: 0, 25, and 250 unlabeled examples]
Are Weakly Labeled Terminators from Predicted Operons Useful?
• train operon model with S labeled operons
• predict operons
• generate W weakly labeled terminators from W most confident predictions
• vary S and W
Learning Curves using Weakly Labeled Terminators

[plot: area under ROC curve (0.5–1.0) vs. number of strong positive examples (0–160), for 0, 25, 100, and 200 weak examples derived from predicted operons]
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar [Bockhorst & Craven, IJCAI ’01]
Learning SCFGs
• given the productions of a grammar, we can learn the probabilities using the Inside-Outside algorithm
• we have developed an algorithm that can add new nonterminals & productions to a grammar during learning
• basic idea:
– identify nonterminals that seem to be “overloaded”
– split these nonterminals into two; allow each to specialize
Refining the Grammar in a SCFG
• there are various “contexts” in which each grammar nonterminal may be used
• consider two contexts for the nonterminal w2: one where it appears under w1 → A w2 U, another under w1 → C w2 G
• if the probabilities for w2 look very different depending on its context, we add a new nonterminal and let each copy specialize

context 1: w2 → A w3 U (0.4) | C w3 G (0.4) | G w3 C (0.1) | U w3 A (0.1)
context 2: w2 → A w3 U (0.1) | C w3 G (0.1) | G w3 C (0.4) | U w3 A (0.4)
Refining the Grammar in a SCFG
• we can compare two probability distributions P and Q using Kullback-Leibler divergence
P (w2 in context 1): w2 → A w3 U (0.4) | C w3 G (0.4) | G w3 C (0.1) | U w3 A (0.1)
Q (w2 in context 2): w2 → A w3 U (0.1) | C w3 G (0.1) | G w3 C (0.4) | U w3 A (0.4)

H(P || Q) = Σ_i P(x_i) log( P(x_i) / Q(x_i) )
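This test translates directly into code. A minimal sketch using the two context distributions above (the split threshold is illustrative, not the paper's):

```python
# A minimal sketch of the refinement test: compare a nonterminal's
# production distributions in two contexts with KL divergence and split
# the nonterminal when the divergence is large.
import math

def kl_divergence(p, q):
    """H(P||Q) = sum_i P(x_i) log(P(x_i)/Q(x_i)); p, q map outcomes to probs."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

P = {"AU": 0.4, "CG": 0.4, "GC": 0.1, "UA": 0.1}  # w2 in context 1
Q = {"AU": 0.1, "CG": 0.1, "GC": 0.4, "UA": 0.4}  # w2 in context 2

if kl_divergence(P, Q) > 0.5:   # illustrative threshold
    print("split w2 into two specialized nonterminals")
```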
Learning Terminator SCFGs
• extracted grammar from the literature (~120 productions)
• data set consists of 142 known E. coli terminators and 125 sequences that do not contain terminators
• learn parameters using the Inside-Outside algorithm (an EM algorithm)
• consider adding nonterminals guided by three heuristics
– KL divergence
– chi-squared
– random
SCFG Accuracy After Adding 25 New Nonterminals
[ROC plot: true positive rate vs. false positive rate for grammars refined using the chi-square, KL divergence, and random heuristics, plus the original grammar]
SCFG Accuracy vs. Nonterminals Added
[plot: area under ROC curve (0.5–1.0) vs. number of additional nonterminals (0–25), for the chi-square, KL divergence, and random heuristics]
Conclusions
• summary
– we have developed an approach to predicting transcription units in bacterial genomes
– we have predicted a complete set of transcription units for the E. coli genome
• advantages of the probabilistic grammar approach
– can readily incorporate background knowledge
– can simultaneously get a coherent set of predictions for a set of related elements
– can be easily extended to incorporate other genomic elements
• current directions
– expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)
– handling overlapping elements
– making predictions for multiple related genomes
Acknowledgements
• Craven Lab: Joe Bockhorst, Keith Noto
• David Page, Jude Shavlik
• Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu
• funding from National Science Foundation, National Institutes of Health