Post on 28-Nov-2014
Finn Drabløs [tare.medisin.ntnu.no]
Computational discovery of composite motifs in DNA
Geir Kjetil Sandve, Osman Abul and Finn Drabløs
Basic gene regulation
Introduction
• Proteins (transcription factors, TFs) recognise binding sites (sequence motifs) in gene regulatory regions
• The transcription factors stabilise the transcription complex
• Distal promoters (enhancers) interact through DNA looping
Michael Lones
De novo prediction of binding sites
• Make a set of co-regulated genes
  – E.g. from microarray experiments, normally imperfect sets
• Extract assumed regulatory regions
  – Normally a fixed region upstream from TSS of each gene
• Search for overrepresented patterns in these regions
  – Use a model for what a motif should look like
    • Consensus sequence with mismatches
    • Position Weight Matrix (PWM) based on log odds scores for occurrences
  – Use a strategy to find (local) optima for this model
    • E.g. Gibbs sampling, expectation maximisation …
• Problem: More than 100 different methods
  – Which methods are reliable?
Motivation
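The PWM model mentioned on this slide can be illustrated with a minimal sketch. It builds a log-odds matrix from a handful of made-up aligned sites (illustrative strings, not real binding-site data), assuming a uniform background and a pseudocount of 1, and then scans a sequence for the best-scoring window. The names `build_pwm`, `score` and `best_hit` are hypothetical and not taken from any of the benchmarked tools.

```python
import math

BASES = "ACGT"

def build_pwm(sites, background=None, pseudocount=1.0):
    """Return one {base: log2-odds score} dict per motif position."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    n = len(sites)
    pwm = []
    for column in zip(*sites):  # iterate over aligned positions
        scores = {}
        for b in BASES:
            freq = (column.count(b) + pseudocount) / (n + 4 * pseudocount)
            scores[b] = math.log2(freq / background[b])
        pwm.append(scores)
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window."""
    width = len(pwm)
    hits = [(score(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits)

# Made-up aligned sites, for illustration only
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
best_score, best_pos = best_hit(pwm, "GGCATGACTCAGTT")
```

A de novo tool does not know the sites in advance; it searches for the alignment that maximises such a score, for example with Gibbs sampling or expectation maximisation.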
Benchmarking of de novo tools
• Tompa et al., Nature Biotech 23, 137-144 (2005)
• Tested 14 different tools for motif discovery
• Used 52 data sets from fly (6), human (26), mouse (12) and yeast (8)
• Used data sets with real (Transfac) binding sites in different sequence contexts
  – “real” – The actual promoter sequences
  – “generic” – Randomly chosen promoter sequences from same genome
  – “markov” – Sequences generated by Markov chain of order 3
• Measured performance at nucleotide level
Motivation
Average benchmark performance

Method          TP     FP    FN      TN
AlignAce       477   3789  8186  436048
ANN-Spec       754   7799  7909  432038
Consensus      178   1394  8485  438443
GLAM           223   5619  8440  434218
Improbizer     594   7942  8069  431895
MEME           581   4836  8082  435001
MEME3          673   6726  7990  433111
MITRA          272   4092  8391  435745
MotifSampler   520   4344  8143  435493
Oligo/dyad     345   1891  8318  437946
QuickScore     151   4856  8512  434981
SeSiMCMC       530  13813  8133  426024
Weeder         748   1748  7915  438089
YMF            554   3492  8109  436345

Average over all 14 methods
          Pred_P       Pred_N
Real_P    471 (TP)     8192 (FN)
Real_N    5167 (FP)    434670 (TN)
nCC = 0.053
Performance is close to random!
Too many FP, FN
Motivation
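The nCC value on this slide can be reproduced from the averaged counts. The sketch below assumes that nCC is the nucleotide-level correlation coefficient (the Pearson correlation between predicted and annotated site positions) used in the Tompa et al. assessment; the function name `ncc` is chosen here for illustration.

```python
import math

def ncc(tp, fp, fn, tn):
    """Nucleotide-level correlation coefficient from confusion counts."""
    numerator = tp * tn - fn * fp
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Return 0.0 when a row or column of the confusion matrix is empty
    return numerator / denominator if denominator else 0.0

# Averaged counts from the table on this slide
average_ncc = ncc(tp=471, fp=5167, fn=8192, tn=434670)
print(round(average_ncc, 3))  # 0.053
```

A value of 1 would be a perfect prediction and 0 corresponds to random guessing, which is why 0.053 is described as close to random.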
Can we improve performance?
• Use better motif representations
  – Hidden Markov Models
• Use better algorithms
  – More exhaustive searching
  – Discriminative motif discovery
• Use better background models
  – Real sequences (not Markov models)
• Filter out false positives
  – Identify “motif-like” solutions
  – Identify regulatory regions
  – Use co-occurrence of motifs
    • Modules, composite motifs
Motivation
TODAY!
Composite motif discovery
• TFs act together as modules
• Modules are not completely unique
Approach
Basic definitions
• Frequent modules
  – Modules (and motifs) can be ranked by support
    • Fraction of sequences where the module (or motif) is found
  – Support is monotonic
    • Adding a motif to a module can never increase module support
• Specific modules
  – Modules can be ranked by hit probability
    • Probability that a sequence supports the module
  – Hit probability is monotonic (as for support)
  – Specific modules have low hit probability in background sequences
• Significant modules
  – Modules can be ranked by significance
    • Probability that support in the sequence set ≠ support in background
Algorithm
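Support and its monotonicity can be shown with a small sketch. It assumes that each single motif has already been reduced to the set of sequence indices where it occurs, and it ignores any distance constraints between motifs. The motif names, hit sets and function names are invented for illustration.

```python
def support_set(module, hits, n_sequences):
    """Indices of sequences that contain every motif in the module."""
    supported = set(range(n_sequences))  # empty module: all sequences
    for motif in module:
        supported &= hits[motif]
    return supported

def support(module, hits, n_sequences):
    """Fraction of sequences where the whole module is found."""
    return len(support_set(module, hits, n_sequences)) / n_sequences

# Hypothetical hit sets: motif name -> indices of sequences with a hit
hits = {"A": {0, 1, 2, 3}, "B": {1, 2, 3, 5}, "C": {2, 5}}
n_sequences = 6

s_a = support(("A",), hits, n_sequences)              # 4 of 6 sequences
s_ab = support(("A", "B"), hits, n_sequences)         # 3 of 6 sequences
s_abc = support(("A", "B", "C"), hits, n_sequences)   # 1 of 6 sequences
```

Because each added motif can only remove sequences from the intersection, support never increases as the module grows.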
Search tree
• Discretized single motifs {1, 2, 3, …} organised as an implicit search tree
• Support set H and hit probability P are iteratively computed (monotonicity)
  – Initially H is the full sequence set and P is 1
• Search tree is efficiently pruned (indicated with X) based on H and P
• Final output can be ranked by module significance
Algorithm
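The search can be sketched as a depth-first enumeration that carries H and P down the tree. This is a simplified illustration under stated assumptions, not the Compo implementation: it prunes only on a minimum support threshold, uses P (taken as the product of per-motif background probabilities) only to filter reported modules, and ignores motif distances. All names, thresholds and input values are hypothetical.

```python
def find_modules(hits, bg_prob, n_sequences, min_support, max_hit_prob,
                 max_size=3):
    """Enumerate motif combinations, pruning branches with low support."""
    motifs = sorted(hits)
    results = []

    def extend(module, H, P, start):
        for i in range(start, len(motifs)):
            m = motifs[i]
            new_H = H & hits[m]  # support set can only shrink
            if len(new_H) / n_sequences < min_support:
                continue  # prune: no extension can regain support
            new_P = P * bg_prob[m]  # hit probability can only shrink
            new_module = module + (m,)
            if new_P <= max_hit_prob:
                results.append((new_module, len(new_H) / n_sequences, new_P))
            if len(new_module) < max_size:
                extend(new_module, new_H, new_P, i + 1)

    # Initially H is the full sequence set and P is 1
    extend((), set(range(n_sequences)), 1.0, 0)
    return results

# Hypothetical input: motif index -> supporting sequences / background prob.
hits = {1: {0, 1, 2, 3}, 2: {1, 2, 3, 5}, 3: {2, 5}}
bg_prob = {1: 0.5, 2: 0.4, 3: 0.1}
modules = find_modules(hits, bg_prob, n_sequences=6,
                       min_support=0.5, max_hit_prob=0.25)
```

With these toy values only the module (1, 2) is both frequent and specific; every branch that includes motif 3 is cut as soon as its support drops below the threshold.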
Module significance
• Position-level probability in background
  – Probability of single motif at specific location
  – Estimated from real DNA background sequences
• Sequence-level probability in background
  – Probability of single motif at least once in given background sequence
  – Estimated as union of position-level probabilities
• Hit-probability in background
  – Probability of composite motif at least once in background sequence
  – Estimated as product of individual motif components
• Significance p-value of observed support
  – Probability of seeing at least observed support in background set
  – Estimated as right tail of binomial distribution
    • At least k out of n successes given hit-probability
Implementation
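The four steps above can be chained in a short sketch. It assumes that positions are independent when taking the union and that motifs are independent when taking the product; the numbers (a position-level probability of 0.001, 500 positions, support in 8 of 10 sequences) are invented for illustration, and the function names are hypothetical.

```python
import math

def sequence_level_prob(position_probs):
    """P(at least one hit in a sequence), assuming independent positions."""
    miss = 1.0
    for p in position_probs:
        miss *= 1.0 - p
    return 1.0 - miss

def hit_probability(sequence_probs):
    """P(all motifs of a module hit a sequence), assuming independence."""
    return math.prod(sequence_probs)

def binomial_right_tail(k, n, p):
    """P(at least k successes out of n) for success probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Two hypothetical motifs, each with probability 0.001 at 500 positions
p_motif = sequence_level_prob([0.001] * 500)
p_module = hit_probability([p_motif, p_motif])

# Module observed in 8 of 10 sequences
p_value = binomial_right_tail(k=8, n=10, p=p_module)
```

A small p-value means the observed support would be unlikely if the sequences behaved like background.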
Problem specification
• Frequent and specific modules
  – Use thresholds on support and specificity
  – Complete solutions but multi-objective optimization
• Top-ranking modules
  – Combine objectives into single measure, e.g. p-value
• Pareto-optimal modules
  – Each objective is a separate dimension of optimality
  – Return Pareto front of composite motifs
Implementation
http://en.wikipedia.org/wiki/Pareto_efficiency
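A Pareto front can be computed with a simple dominance check. The sketch below assumes that every objective is expressed so that larger is better (hit probability is negated for that reason); the candidate modules and their values are invented for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {name: objectives
            for name, objectives in candidates.items()
            if not any(dominates(other, objectives)
                       for other in candidates.values())}

# Hypothetical modules: (support, -hit probability), both maximised
candidates = {
    "M1": (0.9, -0.30),
    "M2": (0.6, -0.05),
    "M3": (0.5, -0.20),  # dominated by M2
    "M4": (0.9, -0.40),  # dominated by M1
}
front = pareto_front(candidates)
```

M1 and M2 remain because each is better than the other on one objective; neither thresholds nor a combined score are needed to select them.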
Motif prediction flowchart
Implementation
Benchmark data set
• Known composite motifs from the TransCompel database
• Tests performance by adding “noise matrices” to input
  – Matrices for TFs assumed not to bind in sequence set
    • Will have random (false positive) hits
  – Selected at random from Transfac
    • Max noise level includes all Transfac matrices
  – Similar to actual usage
    • Searching for motifs consisting of unknown TFs
Benchmarking
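The noise procedure can be sketched as random sampling from a pool of matrices. This is an illustration only: the matrix identifiers are placeholders rather than real Transfac accessions, and `add_noise_matrices` is a hypothetical helper, not part of the benchmark code.

```python
import random

def add_noise_matrices(true_matrices, pool, n_noise, seed=0):
    """Return the true matrices plus n_noise matrices drawn from the pool."""
    rng = random.Random(seed)  # fixed seed keeps the benchmark repeatable
    candidates = [m for m in pool if m not in true_matrices]
    n_noise = min(n_noise, len(candidates))  # max noise level: whole pool
    return list(true_matrices) + rng.sample(candidates, n_noise)

# Placeholder identifiers standing in for matrices of a composite motif
true_matrices = ["TF_A", "TF_B"]
pool = ["TF_A", "TF_B"] + [f"NOISE_{i}" for i in range(20)]

low_noise = add_noise_matrices(true_matrices, pool, n_noise=5)
max_noise = add_noise_matrices(true_matrices, pool, n_noise=len(pool))
```

Raising `n_noise` moves the test closer to actual usage, where the relevant TFs are unknown and all available matrices must be scanned.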
General performance (nCC)
• Compo compared to several other tools
  – TransCompel benchmark set
• Compo clearly has the best performance, in particular at realistic settings (high noise level)
Benchmarking
Background and support
• Compo gains performance from realistic background (real DNA) and support
  – Random DNA based on multinomial sequence model
• Performance without real DNA background or support comparable to other tools
Benchmarking
Pareto front
• Pareto front on support, max motif distance and significance (colour)
• Compo prediction not optimal
  – Compo predicted Ets and GATA
  – Annotated motif is AP1 and NFAT
• Explore alternative solutions
• Explore parameter interactions
Future development
X – NFAT
O – AP1
The research group
BiGR
Drabløs, Finn
Postdocs / Researchers: Sætrom, Pål; Kusnierczyk, Wacek; Rye, Morten; Klein, Jörn; Anderssen, Endre; Wang, Xinhui (ERCIM); Capatana, Ana (ERCIM, starting 2009)
PhDs: Bratlie, Marit Skyrud; Klepper, Kjetil; Saito, Takaya; Lundbæk, Marie; Håndstad, Tony
Programmers / Technicians: Johansen, Jostein; Thomas, Laurent; Olsen, Lene C.
Others: Solbakken, Trude
Master students: Bolstad, Kjersti; Muiser, Iwe; Sponberg, Bjørn; Brands, Stef; Skaland, Even
Former members: Sandve, Geir Kjetil; Abul, Osman; Schwalie, Petra; Lones, Michael
Acknowledgements