Promoter prediction assessment by Vladimir B Bajic ENCODE Workshop 2005 at Sanger Institute.
-
Upload
vernon-craig -
Category
Documents
-
view
218 -
download
0
Transcript of Promoter prediction assessment by Vladimir B Bajic ENCODE Workshop 2005 at Sanger Institute.
Promoter prediction assessment
by
Vladimir B Bajic
ENCODE Workshop 2005ENCODE Workshop 2005at Sanger Instituteat Sanger Institute
Predictors• ENCODE participants (3):• 7-80-8 (McPromoter1) • 7-81-8 (McPromoter2)• 41-108-8 (Fprom)
additional predictors • Beyond ENCODE participants (4) (out of competition)• DBTSS (reference experimental dataset of capped flcDNA)• FirstEF• Dragon Gene Start Finder• Dragon Promoter Finder
Goals
• How good are promoter predictors?
• Does performance change on this dataset?
• Implications for future developments
Data (1)
Category “Known genes with CDS” (category = 2)
• 1061 annotated transcripts• 1009 -> 994 unique starts of transcripts (TSSs)• 319 unique TSSs in Encode ‘training’ set (13 regions)• 675 unique TSSs in Encode test set
• Length of ENCODE regions 29,998,060 bp• Length of ‘training’ regions 8,538,447 bp• Length of testing regions 21,459,613 bp
Data (2)
programs predictions unique
McPromoter1 694 694
McPromoter2 727 727
Fprom (combined with gene annotation from Fgenesh)
634 533
DBTSS (exp) 628 524
FEF 1266 1266
DPF 2168 2168
DGSF 628 628
Method for counting TP and FP
All hits to ‘orange’ count as FPsOnly one hit within A, B, or C counts as TP for unique position of TSS(3 hits within C will count only as 1 TP)Only minimum distance from all TSSs counts
Results
• Different measures of success
• Test ENCODE regions
• Also: comparison with other participants (test + all regions)
Se, ppv, AE (average positional error)
DIP1, DIP2, CC, ASM
Comments
• Compared to previous whole human genome analysis, now we use a more strict distance constraint: max allowed distance 1000 nt (vs. previous 2000 nt)
• Previously: Se [0.4 – 0.8], ppv [0.25 – 0.67]
• Now, for experimental DBTSS data: – Positional error ~100 nt, Se 0.61, ppv 0.93
• Computational promoter prediction (CPP) (using single genome, no transcripts):
positional error 200-300 nt (2-3 fold larger than DBTSS) (positive surprise)
• Se [0.32-0.62] (negative surprise but expected) – (reason poor G+C content of some of the test regions)
• CPP: ppv >80 (in some cases >90%) (positive surprise)
• Having in mind the type of information used for ab initio promoter finding, we see no dramatic difference in 5’ end prediction by methods class 1 and 3, and CPP (positive surprise); however, Se and ppv are better with methods of class 1 and class 3 for obvious reasons.
Future developments• Combine TSS predictors and gene finding programs or
transcript info (positive effects of this are visible in Fprom, 20-76-4 and 20-76-5, since in these cases the TSS search space is effectively restricted)
• This, however, requires retuning of TSS predictors and some change in their design philosophy
• Expected performance should be similar or better than in class 1 and class 3 systems as TSS finding systems should be more specialized for the 5’end type of signals
• More emphasis should be given to positional accuracy of TSS predictors
Thank you for your time
You may wake up now