Regular Meeting December 22, 2008
Mark Borodovsky, Ivan Antonov
11/6/2008 GATech 2
Topics
1. What has been done
2. FSMark HMM implementation
3. Answers to the previous meeting's questions
4. Future work
What has been done
• The HMM implementation in FSMark has been changed
• Some questions from the previous meeting have been answered
FSMark HMM implementation
Current HMM implementation
• Currently, for a given position i, we look back at the 2 preceding nucleotides instead of looking forward
• FSMark starts examining the sequence from the 3rd position only (i=2), so that we have a complete emission string (starting from the 1st position gives strange results)
• Since FSMark starts at i=2, a gene without a frame shift will have state 2
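The backward-looking scheme above can be illustrated with a minimal sketch. It assumes the emission string at position i is the 3-letter window made of the 2 preceding nucleotides plus the current one. The function name and the example sequence are hypothetical and are not taken from the FSMark code.

```python
def backward_contexts(seq, lookback=2):
    """Yield (i, context) pairs, where context covers seq[i-lookback] .. seq[i].

    Positions are 0-based, so the first position with a complete
    context is i = lookback (i = 2, the 3rd nucleotide).
    """
    for i in range(lookback, len(seq)):
        yield i, seq[i - lookback:i + 1]


# Hypothetical example sequence, for illustration only.
contexts = list(backward_contexts("ATGGCC"))
for position, context in contexts:
    print(position, context)
```

Starting the loop at i = lookback is what guarantees that every context is complete. Starting earlier would produce truncated strings for the first two positions.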
FSMark prediction depends on FS letter
• A test was run on a sample gene, inserting a different letter in the middle of the gene each time. The FSMark-GM hmm_def file was used.
FS letter    FSMark prediction
A            Gene overlap
C            Frame shift
G            Frame shift
T            Frame shift
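A minimal sketch of how such test sequences could be generated, one per inserted letter. The sample gene below is made up for illustration, and "middle" is assumed to mean the integer midpoint of the sequence. The actual gene and insertion position used in the test are not given here.

```python
def insert_in_middle(gene, letter):
    """Return a copy of gene with one letter inserted at its midpoint."""
    mid = len(gene) // 2
    return gene[:mid] + letter + gene[mid:]


# Hypothetical sample gene, not the one used in the test.
sample_gene = "ATGAAACCCGGGTTTTAA"

# One frame-shifted variant per inserted letter.
variants = {letter: insert_in_middle(sample_gene, letter) for letter in "ACGT"}
for letter, variant in variants.items():
    print(letter, variant)
```

Each variant would then be passed to FSMark separately so that the predictions can be compared letter by letter.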
Answers to the previous meeting's questions
Control
Genome without frame shifts
GeneMark: 417 overlaps
FSMark-GM: 118 frame shifts
True Positive: 0
False Positive: 118
False Negative: 0
Experiment
Genome with frame shifts in 400 genes
GeneMark: 599 overlaps, 171 of them caused by frame shift
FSMark-GM: 325 frame shifts
True Positive: 113
False Positive: 212
False Negative: 287
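The FSMark-GM counts from the control and the experiment can be summarized with the standard precision and recall formulas. These two metrics are not named on the slides and are added here only as a convenient summary. The function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall); a value is None when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Counts taken from the Control and Experiment slides.
control = precision_recall(tp=0, fp=118, fn=0)
experiment = precision_recall(tp=113, fp=212, fn=287)

print("control:", control)
print("experiment:", experiment)
```

In the experiment, 113 of the 325 predictions are correct and 113 of the 400 inserted frame shifts are recovered. In the control there are no true frame shifts, so recall is undefined and is reported as None.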
Questions to answer
• Take a look at the distribution of overlap lengths in the GeneMark output
• Understand why GeneMark predicts a gene overlap for fewer than 50% of the genes with frame shifts. There are two possible reasons:
  – The short part is missing, i.e. GeneMark predicts one gene only
  – GeneMark predicts two genes, but they don’t overlap
• Try to understand why we got more false positives in the experiment than in the control
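For the first question, a minimal sketch of how a distribution of overlap lengths could be computed from predicted gene coordinates. It assumes genes are given as (start, end) pairs with 1-based inclusive coordinates, compares only neighbouring genes after sorting by start, and ignores strand. The coordinates are made up and the parsing of real GeneMark output is not shown.

```python
from collections import Counter


def overlap_lengths(genes):
    """Return overlap lengths between consecutive genes sorted by start.

    genes: iterable of (start, end) pairs, 1-based inclusive coordinates.
    """
    ordered = sorted(genes)
    lengths = []
    for (_, end1), (start2, end2) in zip(ordered, ordered[1:]):
        overlap = min(end1, end2) - start2 + 1
        if overlap > 0:
            lengths.append(overlap)
    return lengths


# Made-up coordinates, for illustration only.
genes = [(1, 100), (97, 300), (310, 500), (490, 600)]
lengths = overlap_lengths(genes)
histogram = Counter(lengths)
print(sorted(histogram.items()))
```

The resulting counter maps each overlap length to the number of gene pairs with that overlap, which is the quantity a length histogram would display.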
All overlap lengths (genome without FS)
[Histogram: x-axis ticks from 4 to 56, y-axis from 0 to 300]
Overlaps caused by frame shift
[Histogram: x-axis ticks from 8 to 140, y-axis from 0 to 25]
GeneMark analysis
• Why does GeneMark so rarely predict overlaps for genes with frame shifts?
• In my GeneMark output there are 357 typical genes (out of 400).
• Perhaps I am using the wrong GeneMark option?
GeneMark output statistics
Genome with frame shifts in 400 genes:
• 4,388 genes
• 599 gene overlaps
• 335 genes with fs
• 171 overlaps caused by fs
• 22 genes with fs are missing
• fs in 164 genes didn’t cause an overlap
• 4 fs caused a new gene downstream of the initial gene
• 163 decreased their lengths
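A quick arithmetic cross-check of the counts above. Reading the numbers as a partition (genes with fs found = overlaps caused by fs + fs without overlap) is an assumption about the diagram and is not stated on the slide. The variable names are hypothetical.

```python
# Counts copied from the GeneMark output statistics slide.
genes_with_fs_found = 335
overlaps_caused_by_fs = 171
fs_without_overlap = 164
genes_with_fs_missing = 22

# Assumed partition: every found gene with fs either caused an overlap or did not.
split_is_consistent = (
    overlaps_caused_by_fs + fs_without_overlap == genes_with_fs_found
)

# Found plus missing genes with fs.
found_plus_missing = genes_with_fs_found + genes_with_fs_missing

print(split_is_consistent, found_plus_missing)
```

Under this reading the two groups add up to the 335 found genes, and found plus missing gives 357, which coincides with the number of typical genes reported in the GeneMark analysis.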
Conclusions
• I need to check how to run GeneMark in order to get the same 400 typical genes
• It seems that the small chunk in the shifted frame is too short for GeneMark to predict a new gene
Time Table
Date           TODO
Dec 24, Wed    Insensitive zone length analysis for FSMark to determine the lengths of zones 1 and 3
2009           Apply FSMark-GM to 3 typical genomes using the zone 1 and 3 lengths found