Drablos Composite Motifs Bosc2009

Finn Drabløs [tare.medisin.ntnu.no]

Computational discovery of composite motifs in DNA

Geir Kjetil Sandve, Osman Abul and Finn Drabløs

Introduction

Basic gene regulation

• Proteins (transcription factors, TFs) recognise binding sites (sequence motifs) in gene regulatory regions

• The transcription factors stabilise the transcription complex

• Distal promoters (enhancers) interact through DNA looping

Michael Lones

De novo prediction of binding sites

• Make a set of co-regulated genes
  – E.g. from microarray experiments, normally imperfect sets
• Extract assumed regulatory regions
  – Normally a fixed region upstream from TSS of each gene
• Search for overrepresented patterns in these regions
  – Use a model for what a motif should look like
    • Consensus sequence with mismatches
    • Position Weight Matrix (PWM) based on log odds scores for occurrences
  – Use a strategy to find (local) optima for this model
    • E.g. Gibbs sampling, expectation maximisation …
• Problem: More than 100 different methods
  – Which methods are reliable?
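The PWM model named above can be sketched as follows. This is a minimal illustration, assuming a uniform background (0.25 per base) and a pseudocount of 1; the aligned sites, the test sequence and all function names are hypothetical and not taken from the talk.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per column) from equal-length aligned sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [s[i] for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum of per-position log-odds scores for one window of PWM width."""
    return sum(col[b] for col, b in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, position) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

# Hypothetical aligned binding sites and a hypothetical promoter fragment.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
pwm = build_pwm(sites)
best_score, best_pos = best_hit(pwm, "GGCATGACTCAGGT")
```

A de novo tool would have to learn the matrix itself (e.g. by Gibbs sampling or expectation maximisation); the sketch only shows how a given matrix scores a sequence.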

Motivation

Benchmarking of de novo tools

• Tompa et al, Nature Biotech 23, 137-144 (2005)
• Tested 14 different tools for motif discovery
• Used 52 data sets from fly (6), human (26), mouse (12) and yeast (8)
• Used data sets with real (Transfac) binding sites in different sequence contexts
  – “real” – The actual promoter sequences
  – “generic” – Randomly chosen promoter sequences from same genome
  – “markov” – Sequences generated by Markov chain of order 3
• Measured performance at nucleotide level
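To make the “markov” context concrete, here is a minimal sketch of generating a sequence from an order-3 Markov chain. The training sequence, the pseudocount and the function names are assumptions for illustration; the benchmark's actual generator is not described in the slides.

```python
import random
from collections import defaultdict

def train_markov(sequences, order=3, alphabet="ACGT", pseudocount=1.0):
    """Count which base follows each context of `order` bases."""
    counts = defaultdict(lambda: {b: pseudocount for b in alphabet})
    for seq in sequences:
        for i in range(order, len(seq)):
            if seq[i] in alphabet:  # skip ambiguous symbols such as N
                counts[seq[i - order:i]][seq[i]] += 1
    return counts

def sample_markov(counts, length, order=3, alphabet="ACGT", rng=None):
    """Sample a sequence; unseen contexts fall back to pseudocounts only."""
    rng = rng or random.Random()
    seq = [rng.choice(alphabet) for _ in range(order)]
    while len(seq) < length:
        weights = counts["".join(seq[-order:])]
        seq.append(rng.choices(alphabet, weights=[weights[b] for b in alphabet])[0])
    return "".join(seq[:length])

# Hypothetical training data; a real run would use promoter sequences.
model = train_markov(["ACGTACGTACGTTTGACAGGCTA"])
background = sample_markov(model, 50, rng=random.Random(0))
```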

Motivation

Average benchmark performance

Method          TP      FP     FN      TN
AlignAce       477    3789   8186  436048
ANN-Spec       754    7799   7909  432038
Consensus      178    1394   8485  438443
GLAM           223    5619   8440  434218
Improbizer     594    7942   8069  431895
MEME           581    4836   8082  435001
MEME3          673    6726   7990  433111
MITRA          272    4092   8391  435745
MotifSampler   520    4344   8143  435493
Oligo/dyad     345    1891   8318  437946
QuickScore     151    4856   8512  434981
SeSiMCMC       530   13813   8133  426024
Weeder         748    1748   7915  438089
YMF            554    3492   8109  436345

Average confusion matrix

          Pred_P        Pred_N
Real_P    471 (TP)      8192 (FN)
Real_N    5167 (FP)     434670 (TN)

nCC = 0.053

Performance is close to random!

Too many FP, FN
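The nCC value can be recomputed from the average confusion matrix. The sketch below assumes the usual Pearson/Matthews-style definition of the nucleotide-level correlation coefficient; the function name is ours.

```python
import math

def ncc(tp, fp, fn, tn):
    """Nucleotide-level correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fn * fp
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Average counts from the table above.
average_ncc = ncc(tp=471, fp=5167, fn=8192, tn=434670)
print(round(average_ncc, 3))  # 0.053
```

A perfect predictor would score 1 and a random one about 0, which is why 0.053 is read as close to random.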

Motivation

Can we improve performance?

• Use better motif representations
  – Hidden Markov Models
• Use better algorithms
  – More exhaustive searching
  – Discriminative motif discovery
• Use better background models
  – Real sequences (not Markov models)
• Filter out false positives
  – Identify “motif-like” solutions
  – Identify regulatory regions
  – Use co-occurrence of motifs
    • Modules, composite motifs (TODAY!)

Motivation

Composite motif discovery

• TFs act together as modules

• Modules are not completely unique

Approach

Basic definitions

• Frequent modules
  – Modules (and motifs) can be ranked by support
    • Fraction of sequences where the module (or motif) is found
  – Support is monotonic
    • Adding a motif to a module can never increase module support
• Specific modules
  – Modules can be ranked by hit probability
    • Probability that a sequence supports the module
  – Hit probability is monotonic (as for support)
  – Specific modules have low hit probability in background sequences
• Significant modules
  – Modules can be ranked by significance
    • Probability that support in sequence ≠ background
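Support and its monotonicity can be illustrated with a small sketch. The motif names, hit sets and function names are hypothetical, and the sketch ignores any distance constraint between motifs in a module.

```python
def support_set(module, hits, n_sequences):
    """Indices of sequences that contain every motif in the module."""
    supported = set(range(n_sequences))
    for motif in module:
        supported &= hits[motif]
    return supported

def support(module, hits, n_sequences):
    """Fraction of sequences where the whole module is found."""
    return len(support_set(module, hits, n_sequences)) / n_sequences

# Hypothetical hits: motif -> indices of the sequences it occurs in.
hits = {"AP1": {0, 1, 2, 4}, "NFAT": {1, 2, 3, 4}, "Ets": {0, 2}}
n = 5

s1 = support(("AP1",), hits, n)                # 4 of 5 sequences
s2 = support(("AP1", "NFAT"), hits, n)         # 3 of 5 sequences
s3 = support(("AP1", "NFAT", "Ets"), hits, n)  # 1 of 5 sequences
```

Each added motif intersects the support set with another hit set, so support can only stay equal or shrink.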

Algorithm

Search tree

• Discretized single motifs {1, 2, 3, …} organised as an implicit search tree
• Support set H and hit probability P are iteratively computed (monotonicity)
  – Initially H is the full sequence set and P is 1
• Search tree is efficiently pruned (indicated with X) based on H and P
• Final output can be ranked by module significance
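The search-tree idea can be sketched as a depth-first enumeration that carries H and P down the tree. This is an illustration under assumptions, not the Compo implementation: pruning here uses only a support threshold, P is taken as the product of hypothetical per-motif background probabilities, and all names are ours.

```python
def frequent_modules(hits, bg_prob, n_sequences, min_support, max_size=3):
    """Enumerate modules whose support is at least min_support."""
    motifs = sorted(hits)
    results = []

    def extend(start, module, h, p):
        for i in range(start, len(motifs)):
            motif = motifs[i]
            new_h = h & hits[motif]  # support set shrinks or stays equal
            if len(new_h) / n_sequences < min_support:
                continue  # prune: no extension of this branch can regain support
            new_module = module + (motif,)
            new_p = p * bg_prob[motif]  # hit probability also only decreases
            results.append((new_module, new_h, new_p))
            if len(new_module) < max_size:
                extend(i + 1, new_module, new_h, new_p)

    # Root of the tree: H is the full sequence set and P is 1.
    extend(0, (), set(range(n_sequences)), 1.0)
    return results

# Hypothetical input data.
hits = {"AP1": {0, 1, 2, 4}, "NFAT": {1, 2, 3, 4}, "Ets": {0, 2}}
bg_prob = {"AP1": 0.2, "NFAT": 0.3, "Ets": 0.1}
found = frequent_modules(hits, bg_prob, n_sequences=5, min_support=0.5)
```

Because support is monotonic, skipping a module that falls below the threshold also skips its whole subtree without losing any valid solution.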

Algorithm

Module significance

• Position-level probability in background
  – Probability of single motif at specific location
  – Estimated from real DNA background sequences
• Sequence-level probability in background
  – Probability of single motif at least once in given background sequence
  – Estimated as union of position-level probabilities
• Hit-probability in background
  – Probability of composite motif at least once in background sequence
  – Estimated as product of individual motif components
• Significance p-value of observed support
  – Probability of seeing at least observed support in background set
  – Estimated as right tail of binomial distribution
    • At least k out of n successes given hit-probability
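The chain from position-level probability to p-value can be sketched in a few lines. The sketch assumes independence between positions (for the union) and between motifs (for the product); the function names and example numbers are hypothetical.

```python
import math

def sequence_level_prob(position_probs):
    """P(motif occurs at least once), as a union over independent positions."""
    miss = 1.0
    for p in position_probs:
        miss *= 1.0 - p
    return 1.0 - miss

def hit_probability(sequence_probs):
    """P(all motifs of the module occur), as a product over the components."""
    return math.prod(sequence_probs)

def binomial_right_tail(k, n, p):
    """P(at least k successes out of n trials with success probability p)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p)**(n - i)
        for i in range(k, n + 1)
    )

# Hypothetical module of two motifs in 500-position background sequences.
p_a = sequence_level_prob([0.001] * 500)
p_b = sequence_level_prob([0.0005] * 500)
p_hit = hit_probability([p_a, p_b])
# p-value for observing the module in 8 of 10 sequences.
p_value = binomial_right_tail(8, 10, p_hit)
```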

Implementation

Problem specification

• Frequent and specific modules
  – Use thresholds on support and specificity
  – Complete solutions but multi-objective optimization
• Top-ranking modules
  – Combine objectives into single measure, e.g. p-value
• Pareto-optimal modules
  – Each objective is a separate dimension of optimality
  – Return Pareto front of composite motifs
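A Pareto front over candidate modules can be computed with a simple dominance check. In this sketch every objective is maximised, so hit probability is negated; the candidate names and values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    }

# Objectives per module: (support, -hit probability in background).
candidates = {
    "A": (0.8, -0.30),
    "B": (0.6, -0.06),
    "C": (0.5, -0.20),  # dominated by B
    "D": (0.8, -0.40),  # dominated by A
}
front = pareto_front(candidates)
```

The front keeps both A (more frequent) and B (more specific), leaving the trade-off to the user instead of collapsing it into a single score.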

Implementation

http://en.wikipedia.org/wiki/Pareto_efficiency

Motif prediction flowchart

Implementation

Benchmark data set

• Known composite motifs from the TransCompel database
• Tests performance by adding “noise matrices” to input
  – Matrices for TFs assumed not to bind in sequence set
    • Will have random (false positive) hits
  – Selected at random from Transfac
    • Max noise level includes all Transfac matrices
  – Similar to actual usage
    • Searching for motifs consisting of unknown TFs

Benchmarking

General performance (nCC)

• Compo compared to several other tools
  – TransCompel benchmark set
• Compo clearly has the best performance, in particular at realistic settings (high noise level)

Benchmarking

Background and support

• Compo gains performance from realistic background (real DNA) and support
  – Random DNA based on multinomial sequence model
• Performance without real DNA background or support comparable to other tools

Benchmarking

Pareto front

• Pareto front on support, max motif distance and significance (colour)
• Compo prediction not optimal
  – Compo predicted Ets and GATA
  – Annotated motif is AP1 and

• Explore alternative solutions

• Explore parameter interactions

Future development

X – NFAT, O – AP1

The research group

BiGR

Drabløs, Finn

Postdocs / Researchers: Sætrom, Pål; Kusnierczyk, Wacek; Rye, Morten; Klein, Jörn; Anderssen, Endre; Wang, Xinhui (ERCIM); Capatana, Ana (ERCIM, starting 2009)

PhDs: Bratlie, Marit Skyrud; Klepper, Kjetil; Saito, Takaya; Lundbæk, Marie; Håndstad, Tony

Programmers / Technicians: Johansen, Jostein; Thomas, Laurent; Olsen, Lene C.

Others: Solbakken, Trude

Master students: Bolstad, Kjersti; Muiser, Iwe; Sponberg, Bjørn; Brands, Stef; Skaland, Even

Former members: Sandve, Geir Kjetil; Abul, Osman; Schwalie, Petra; Lones, Michael

Acknowledgements
