Doug Raiford
Lesson 3
We have a fully sequenced genome
How do we identify the genes?
What do we know so far?
04/21/23 2Gene Prediction
Remember
• The start codon (ATG) codes for methionine
• The stop codons (TAG, TAA, or TGA) do not code for an amino acid
Does every ATG mark the beginning of a gene?
Does every TAG, TAA, or TGA mark the end?
The start and stop codons must be “in frame”
• A set of codons must fit between them
• Length evenly divisible by three
Open reading frame (ORF)
• Series of codons bracketed by start and stop codons (in frame)
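The in-frame rule can be sketched in a few lines of Python. This is only an illustration, not the method of any particular gene finder: the function name `find_orfs` and the example string are made up here, only the given strand is scanned, and each ORF is taken from the first ATG in a frame to the next in-frame stop codon.

```python
START = "ATG"
STOPS = {"TAG", "TAA", "TGA"}

def find_orfs(dna):
    """Return sorted (start, end) index pairs for ORFs on the given strand.

    Each ORF begins at an ATG and ends just after the first in-frame stop
    codon, so end - start is always divisible by three.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the open ATG in this frame, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Hypothetical example sequence
example = "CCATGAAATGGTAGCATGA"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

Note that the ATG at position 15 of the example never becomes an ORF: no in-frame stop codon follows it, which is one reason not every ATG marks a gene.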
In real genes, the distance between start and stop codons tends to be longer than expected by chance
How long would we expect that distance to be?
There are 64 different codons
• A given codon should show up randomly around once every 64 codons, or 192 nts (64 * 3)
There are 3 stop codons
• Expect 3 in every 64 codons, or one every 21 1/3 codons (21 1/3 * 3 = 64 nts)
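The arithmetic can be checked with a short Python sketch. It assumes, as the slide implicitly does, that all four bases are equally likely and independent, so every codon has probability 1/64; real genomes deviate from this.

```python
from fractions import Fraction
from itertools import product

STOPS = {"TAG", "TAA", "TGA"}

# Enumerate all 4**3 codons and find the fraction that are stop codons.
codons = ["".join(c) for c in product("ACGT", repeat=3)]
p_stop = Fraction(sum(c in STOPS for c in codons), len(codons))

# Geometric waiting time: on average 1/p codons are read up to and
# including the first stop codon.
expected_codons = 1 / p_stop
expected_nts = 3 * expected_codons

print(len(codons), p_stop, expected_codons, expected_nts)
```

Under that assumption the expected spacing is 64/3 codons, which is the 21 1/3 codons (64 nts) quoted above.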
Escherichia coli
Number of genes in E. coli is 4356
• Min 44 nts, max 8621 nts
• 8 are < 64 nts
• 143 are < 128 nts (3%)
ORF length is a good start, but there must be more
• Approximately 77,000 ORFs on each strand are longer than 2 * the expected length
To “find” a gene, we would look for nt sequences that look like the parts of a gene
[Figure: a gene with RNA polymerase, showing the promoter region, coding region, and terminator region]
Start codon: ‘ATG’ = methionine
Stop codon (non-coding): ‘TAA’, ‘TAG’, or ‘TGA’
Promoters attract polymerase
• Specific sequences
Promoters take part in gene regulation
• Each promoter has a unique pattern
• Motifs
[Figure: promoter layout, showing polymerase binding at the -35 and -10 sites, the transcription start site, the ribosomal binding site, the start codon, and the coding region]
-35 sequence: T T G A C A
-10 sequence: T A T A A T
Slightly different -35 and -10 motifs attract different sigma factors
Genes with similar upstream regions tend to be related: they express similarly
Hairpin, followed by a U-run
(an A-run in the DNA)
Weak uracil bonds, coupled with the hairpin binding to the NusA protein that is bound to the polymerase
[Figure: polymerase at the terminator, with the U-run of the mRNA (UUUUUUU) paired to the A-run of the DNA (AAAAAAAA)]
How do we find these regions?
Difficult: the motifs are fuzzy, not carved in stone
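One simple way to cope with fuzziness, shown here only as a baseline, is to accept windows that differ from the consensus in a few positions. The function names, the mismatch threshold, and the test sequence below are all hypothetical; this is plain Hamming-distance matching, not the statistical approach the rest of the lesson builds.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def fuzzy_hits(dna, motif="TATAAT", max_mismatches=1):
    """Return (position, window, mismatches) for every near match."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        d = hamming(window, motif)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits

# Hypothetical sequence with one exact and one near match
print(fuzzy_hits("GGCTATAATGCGTACAATCC"))
```

A fixed mismatch count treats every position and every substitution alike, which is the limitation a probabilistic model is meant to address.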
Hidden Markov Models (HMMs) are often used
All about the statistics
Markov chain: a series of events along with their probabilities
[Figure: state machine that begins at “Start”, steps through T, A, T, A, A, T, and ends at “Yay! I found one”; edge labels include “G or C” and “or T or A”]
The previous slide was a “state machine” representation
An HMM should have states and observations
• The states are “hidden”
[Figure: HMM with states 0 through 8; two of the states emit A, C, G, or T with probability .25 each, the others emit the motif letters T and A; arrows are labeled with transition probabilities]
Each state has a probability of “emitting” any given observation
Each state has a probability of “transitioning” to any given next state
Transition probability matrix (TRANS)
• Rows represent the current state (from state)
• Columns represent the state to which a transition will occur (to state)
• Each entry is the probability associated with that transition
Emission probability matrix (EMIS)
• Rows represent states
• Columns represent which observation is emitted
• Each entry is the probability associated with that emission
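To make the two matrices concrete, here is a small Python sketch with a made-up two-state model (the states and every probability are hypothetical). It checks that each row sums to 1 and multiplies transition and emission entries along one path. It assumes the model begins in a known start state and makes a transition before each emission.

```python
from fractions import Fraction as F

OBS = "ACGT"  # column order of EMIS

# Hypothetical model: state 0 = background, state 1 = AT-only region.
TRANS = [[F(9, 10), F(1, 10)],   # from state 0
         [F(1, 2), F(1, 2)]]     # from state 1
EMIS = [[F(1, 4), F(1, 4), F(1, 4), F(1, 4)],   # state 0 emits A, C, G, T
        [F(1, 2), F(0), F(0), F(1, 2)]]         # state 1 emits only A or T

def check_rows(matrix):
    """Every row is a probability distribution, so it must sum to 1."""
    return all(sum(row) == 1 for row in matrix)

def path_probability(states, seq, start_state=0):
    """Probability of visiting `states` in order while emitting `seq`."""
    prob = F(1)
    prev = start_state
    for state, symbol in zip(states, seq):
        prob *= TRANS[prev][state] * EMIS[state][OBS.index(symbol)]
        prev = state
    return prob

print(check_rows(TRANS), check_rows(EMIS))
print(path_probability([0, 1, 1], "GTA"))
```

Exact fractions are used so the products can be checked by hand; real implementations usually work with floating-point log probabilities.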
Building a model requires a subject matter expert
Often start with a state for each position in a possible match
Example: looking for something similar to TATAAT
• Might not have both A’s
• Might have an extra one in the first slot
• Never has G’s or C’s
Also need a state for non-participating regions
First guess as to the probabilities
• Maybe from the state associated with the first T to the A: 100%
• Then 50% / 50% whether A or T
• Then 50% / 50% whether A or T
• Then 100% T
Baum-Welch or Viterbi algorithm
Pass the algorithm a sequence of observations and a first guess as to the probabilities
It refines the probability matrices
• Assumes that the sequence adheres to the underlying probabilities
• Traverses the states, keeping track of the actual frequency of emissions and transitions
• Adjusts the matrices accordingly
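The "keep track of frequencies, then adjust" step can be sketched as follows. This shows only the counting and normalizing part, given one state path that is simply assumed here (in Viterbi-style training that path would come from decoding with the current matrices). Baum-Welch instead uses expected counts over all paths, which this sketch does not do. The function name, path, and sequence are hypothetical.

```python
def reestimate(states, seq, n_states, alphabet="ACGT"):
    """Count transitions and emissions along one path, then normalize rows."""
    trans = [[0] * n_states for _ in range(n_states)]
    emis = [[0] * len(alphabet) for _ in range(n_states)]
    for state, symbol in zip(states, seq):
        emis[state][alphabet.index(symbol)] += 1
    for cur, nxt in zip(states, states[1:]):
        trans[cur][nxt] += 1

    def normalize(row):
        total = sum(row)
        # A state that was never visited keeps an all-zero row.
        return [count / total for count in row] if total else [0.0] * len(row)

    return [normalize(r) for r in trans], [normalize(r) for r in emis]

# Hypothetical decoded path (0 = background, 1 = motif) and its sequence
states = [0, 0, 1, 1, 0, 0]
seq = "GCTAGC"
trans, emis = reestimate(states, seq, n_states=2)
print(trans)
print(emis)
```

In practice the decode and re-estimate steps are repeated until the matrices stop changing, and small pseudocounts are often added so that unseen events do not get probability zero.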
Called checking the posterior probabilities
• Given a sequence, check all possible paths through the model
• Multiply the associated probabilities
• The path with the highest probability is likely the path through the hidden states
• Can use the “forward algorithm” to cut down the number of paths (dynamic programming)
• The location in the sequence where the most probable states are “TATAAT” is a match
[Figure: HMM with states 0 through 5; states 0 and 5 emit A, C, G, or T with probability .25 each; states 1 through 4 emit T, A, T, A; state 0 returns to itself with probability 16/17 and moves to state 1 with probability 1/17; every other transition has probability 1]
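The saving from dynamic programming can be seen by computing the same quantity two ways. The Python sketch below uses the six-state model from the figure and assumes the model begins in state 0 and makes a transition before each emission. The forward recursion sums over all paths, giving the total probability of the sequence; swapping that sum for a max gives the closely related Viterbi recursion for the single best path. The short test sequence is made up.

```python
from fractions import Fraction as F
from itertools import product

OBS = "ACGT"
BG = [F(1, 4)] * 4  # background states emit every base equally
EMIS = [BG, [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], BG]
TRANS = [[F(16, 17), F(1, 17), 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 0],
         [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1]]
N = len(TRANS)

def forward(seq, start=0):
    """Total probability of `seq`, summed over every state path."""
    alpha = [F(int(s == start)) for s in range(N)]  # mass before any emission
    for symbol in seq:
        col = OBS.index(symbol)
        alpha = [sum(alpha[p] * TRANS[p][s] for p in range(N)) * EMIS[s][col]
                 for s in range(N)]
    return sum(alpha)

def brute_force(seq, start=0):
    """Same quantity by enumerating all N**len(seq) paths (tiny inputs only)."""
    total = F(0)
    for path in product(range(N), repeat=len(seq)):
        prob, prev = F(1), start
        for state, symbol in zip(path, seq):
            prob *= TRANS[prev][state] * EMIS[state][OBS.index(symbol)]
            prev = state
        total += prob
    return total

seq = "GTATAC"  # hypothetical short sequence containing TATA
print(forward(seq), forward(seq) == brute_force(seq))
```

The brute-force version visits 6^6 paths for six bases, while the forward version does a fixed amount of work per base, which is what makes whole-genome scans feasible.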
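MATLAB is very useful for matrix operations
% Observed sequence as letters, then encoded as 1 = A, 2 = C, 3 = G, 4 = T
seq = ['a','g','c','g','a','t','a','c','g','c','g','a','t','c','g','a','t','a','t','a','g','t','g','c'];
seq = [1,3,2,3,1,4,1,2,3,2,3,1,4,2,3,1,4,1,4,1,3,4,3,2];
% Emission probabilities: one row per state, columns are A, C, G, T
EMIS = [.25,.25,.25,.25;
        0,0,0,1;
        1,0,0,0;
        0,0,0,1;
        1,0,0,0;
        .25,.25,.25,.25];
% Transition probabilities: rows are the current state, columns the next state
TRANS = [16/17,1/17,0,0,0,0;
         0,0,1,0,0,0;
         0,0,0,1,0,0;
         0,0,0,0,1,0;
         0,0,0,0,0,1;
         0,0,0,0,0,1];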
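For readers without MATLAB, the same sequence and matrices can be decoded with a small hand-written Viterbi routine in Python. This is a sketch under stated assumptions, not the slide's own code: states are numbered 0 to 5 instead of 1 to 6, the model is assumed to begin in state 0 and make a transition before each emission, and exact fractions stand in for the usual log probabilities.

```python
from fractions import Fraction as F

OBS = "ACGT"
seq = "agcgatacgcgatcgatatagtgc".upper()  # the slide's sequence

BG = [F(1, 4)] * 4
EMIS = [BG, [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], BG]
TRANS = [[F(16, 17), F(1, 17), 0, 0, 0, 0],
         [0, 0, 1, 0, 0, 0],
         [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 1, 0],
         [0, 0, 0, 0, 0, 1],
         [0, 0, 0, 0, 0, 1]]

def viterbi(seq, start=0):
    """Most probable state path for `seq`, and that path's probability."""
    n = len(TRANS)
    best = [F(int(s == start)) for s in range(n)]
    back = []  # back[t][s] = best predecessor of state s at step t
    for symbol in seq:
        col = OBS.index(symbol)
        prev_of, new_best = [], []
        for s in range(n):
            p = max(range(n), key=lambda q: best[q] * TRANS[q][s])
            prev_of.append(p)
            new_best.append(best[p] * TRANS[p][s] * EMIS[s][col])
        back.append(prev_of)
        best = new_best
    state = max(range(n), key=lambda s: best[s])
    prob = best[state]
    path = [state]
    for prev_of in reversed(back[1:]):  # walk the back-pointers to step 0
        state = prev_of[state]
        path.append(state)
    return path[::-1], prob

path, prob = viterbi(seq)
hit = path.index(1)  # first base assigned to the first motif state
print(path)
print(hit, seq[hit:hit + 4])
```

The decoded path stays in the background state until the one place where T, A, T, A occurs, steps through the four motif states there, and finishes in the trailing background state.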
GeneMark (Georgia Institute of Technology): http://exon.biology.gatech.edu/
Genscan: http://genes.mit.edu/GENSCAN.html
Genie (Berkeley): http://www.fruitfly.org/seq_tools/genie.html
Glimmer (University of Maryland): http://www.cbcb.umd.edu/software/GlimmerHMM/
Can include all regions in the model
• States for each position in each region
• The coding region could be a simple set of three regions
[Figure: full gene layout, showing polymerase binding at the -35 sequence (T T G A C A) and -10 sequence (T A T A A T), the transcription start site, the ribosomal binding site, the start codon, the coding region, and the termination region]
Classic example: the states are rainy or sunny
If we know whether someone is walking, shopping, or cleaning, we can predict the state
[Figure: diagram labeling the states, the emissions, and the observations]
If something observable depends on an underlying state, we can use an HMM
With motifs, the sequence is visible; whether or not a region is a promoter site is not
Each state has a probability of emitting any given observation
Each state has a probability of transitioning to any given next state
Probabilistic parameters of a hidden Markov model (example)
x: states
y: possible observations
a: state transition probabilities
b: output probabilities