REGULATORY MOTIF FINDING
Mohammed AlQuraishi
TALK OUTLINE
Biology Background
Algorithmic Problem
Papers
New Motif Finding Algorithm (MotifCut)
Analysis of Motif Finders’ Performance
CELL = FACTORY, PROTEINS = MACHINES
Biovisions, Harvard
Instructions for making the machines
DNA
“Coding” Regions
Instructions for when and where to make them
“Regulatory” Regions (Regulons)
TRANSCRIPTIONAL REGULATION
Regulatory regions are composed of “binding sites”
“Binding sites” attract a special class of proteins, known as “transcription factors”
Bound transcription factors can inhibit DNA transcription
DNA REGULATION
Source: Richardson, University College London
CELL REGULATION
Transcriptional regulation is one of many regulatory mechanisms in the cell
Focus of Talk
Source: Mallery, University of Miami
STRUCTURAL BASIS OF INTERACTION
Key Feature: Transcription factors are not 100% specific when binding DNA
Not one sequence, but a family of sequences, with varying affinities
[Figure: sequence family with values 0.54, 0.48, 0.32, 0.25, 0.11, 0.08]
MOTIF FINDING
Basic Objective: Find regions in the genome that bind transcription factors
Many classes of algorithms, which differ in:
Types of input data
Motif representation
INPUT DATA
Single sequence
Evolutionarily related set of sequences
Sequence + other data
Microarray expression profile
ChIP-chip
Others…
MOTIF REPRESENTATION
Probabilistic
Word-Based
Focus of Talk
MOTIF REPRESENTATION
Structural discussion immediately raises difficulties
Least Expressive: Single sequence
Most Expressive: 4^k-dimensional probability distribution
Independently assign a probability to each possible kmer
MOTIF REPRESENTATION
Standard Solution: Position-Specific Scoring Matrix (PSSM)
Assuming independence of positions, assign a probability to each base at each position
Fraught with problems… (Will revisit this)
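A minimal sketch of the PSSM idea, assuming aligned, equal-length example sites and simple frequency counts with no pseudocounts. The function names `build_pssm` and `kmer_probability` are illustrative, not from any particular tool. A PSSM of width k stores 4·k numbers, compared with 4^k for the fully general distribution.

```python
from collections import Counter

def build_pssm(sites, alphabet="ACGT"):
    """Column-wise base frequencies from aligned, equal-length sites."""
    n = len(sites)
    width = len(sites[0])
    pssm = []
    for i in range(width):
        counts = Counter(site[i] for site in sites)
        pssm.append({base: counts[base] / n for base in alphabet})
    return pssm

def kmer_probability(pssm, kmer):
    """Probability of a kmer under the position-independence assumption."""
    p = 1.0
    for column, base in zip(pssm, kmer):
        p *= column[base]
    return p

sites = ["AGTG", "AGTC", "AGAG", "TGTG"]  # made-up example sites
pssm = build_pssm(sites)
print(kmer_probability(pssm, "AGTG"))
```

Because the per-position probabilities are simply multiplied, any dependence between positions is lost, which is the weakness revisited later in the talk.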
REFERENCE
Authors: Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou
Title: MotifCut: regulatory motif finding with maximum density subgraphs
Publication: Bioinformatics Vol. 22 no. 14 2006, pages e150–e157
OVERVIEW
Motif Finding Algorithm (“MotifCut”)
Motivation
Oversimplicity of PSSMs
Intractability of more complex models
OVERSIMPLICITY OF PSSMS
Assumes independence between positions
~25% of TRANSFAC motifs have been shown to violate this assumption
Two Examples: ADR1 and YAP6
Generates potentially unseen motifs
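A small sketch of the "unseen motifs" point, assuming a count-based PSSM with no pseudocounts. Such a PSSM gives nonzero probability to every combination of letters observed per column, whether or not that combination was ever observed as a whole site. The helper name `pssm_support` and the two toy sites are illustrative.

```python
from itertools import product

def pssm_support(sites):
    """All kmers that a count-based PSSM built from `sites` can generate."""
    width = len(sites[0])
    allowed = [sorted({site[i] for site in sites}) for i in range(width)]
    return {"".join(letters) for letters in product(*allowed)}

observed = {"AC", "GT"}  # two perfectly correlated positions
unseen = pssm_support(sorted(observed)) - observed
print(sorted(unseen))  # kmers the PSSM generates but that were never observed
```

Here the two positions are fully dependent in the data, yet the PSSM spreads half of its probability mass over the two combinations that never occur.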
BASIC FEATURES OF MOTIFCUT
Does not assume an underlying PSSM
Represents a motif with a graph structure
In principle maximally expressive
In practice not quite
Motif finding cast as maximum density subgraph
Subquadratic complexity
MOTIF GRAPH REPRESENTATION
Nodes are kmers
Edge weights are distances between kmers
Generative model: Frequency of kmer node equal to frequency of generating kmer
Distance definition is complicated (will come back to this)
Same kmer node can appear multiple times
[Figure: graph on four kmer nodes (AGTGGGAC twice, AGTGCGAC, AGTGCTAC) with edge labels 0, 1, 1, 1, 2, 2]
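The edge labels in the figure appear consistent with plain Hamming distances between the four kmers. The sketch below checks that reading. It is an assumption about the figure, not the full distance definition used by MotifCut.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have equal length")
    return sum(x != y for x, y in zip(a, b))

# The four kmer nodes from the slide; the first kmer appears twice.
kmers = ["AGTGGGAC", "AGTGGGAC", "AGTGCGAC", "AGTGCTAC"]
distances = {
    (i, j): hamming(kmers[i], kmers[j])
    for i, j in combinations(range(len(kmers)), 2)
}
print(sorted(distances.values()))
```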
MOTIF FINDING
Find highest density subgraph
Density is defined as sum of edge weights per node
Somewhat limits representational power
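A short sketch of the density definition above: the sum of edge weights inside a node subset divided by the number of nodes in it. The edge-dictionary layout and the toy weights are assumptions made for illustration.

```python
def density(nodes, edge_weights):
    """Sum of weights of edges inside `nodes`, divided by the node count."""
    node_set = set(nodes)
    total = sum(
        weight
        for (u, v), weight in edge_weights.items()
        if u in node_set and v in node_set
    )
    return total / len(node_set)

# Toy graph: a tight triangle {0, 1, 2} plus a weakly attached node 3.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (2, 3): 0.25}
print(density([0, 1, 2], edges), density([0, 1, 2, 3], edges))
```

Adding the weakly attached node lowers the density, so the triangle is the denser subgraph.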
MOTIF FINDING
Read new sequence
Generate graph as previously described
Kmers are generated by shifting one base pair
Each kmer in the sequence gets a node, including identical kmers
Graph contains as many nodes as there are base pairs
Connect edges with weights based on distances between nodes
Find densest subgraph
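The graph-building steps above can be sketched as follows. The weight function `placeholder_weight` is a made-up stand-in (1 minus the normalized Hamming distance), not the simulation-derived weight discussed later in the talk. A sequence of length n yields n − k + 1 kmer nodes, which is roughly one per base pair.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length kmers."""
    return sum(x != y for x, y in zip(a, b))

def placeholder_weight(distance, k):
    """Hypothetical stand-in for the real edge-weight function."""
    return 1.0 - distance / k

def build_kmer_graph(sequence, k, weight=placeholder_weight):
    """One node per sliding-window kmer (duplicates kept), all pairs connected."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    edges = {}
    for i in range(len(kmers)):
        for j in range(i + 1, len(kmers)):
            edges[(i, j)] = weight(hamming(kmers[i], kmers[j]), k)
    return kmers, edges

kmers, edges = build_kmer_graph("ACGTACGT", 4)
print(kmers)
```

This naive version creates a quadratic number of edges, which is part of why a faster heuristic is needed for the densest-subgraph step.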
EDGE WEIGHTS
Heart of the algorithm, will focus on this
Semantics: Edge weight is the likelihood of two kmers being in the same motif
Use Hamming distance as a way to quantify distance between kmers
[Figure: Hamming distance example with distances 0, 1, 2, 3]
“Interpret” Hamming distance as a measure of the likelihood of two kmers being in the same motif:
F(Hamming distance) = likelihood of two kmers being in the same motif
EDGE WEIGHTS
Let’s make this a bit more precise:
[Equation: definition of the edge-weight function F]
But how to compute F?
Simulate it!
Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.
“GENOME SIMULATION”
Background + Motifs
No genes, promoters, signaling sequences, etc.
Background Model
3rd order Markov model
Probability of next base depends on previous 3 bases
Modeled on the yeast genome
Incorporates GC bias
Motif Model
PSSM
Based on empirically observed information content of yeast motifs
“GENOME SIMULATION”
Use Markov model to generate 10k – 20k length sequences of background
Seed with 20 motifs generated by the PSSM
Result is a simulated yeast genome
We know which parts are the real motifs, and which are not
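A rough sketch of the simulation described above, under stated assumptions. The transition table `PLACEHOLDER_TRANSITIONS` uses made-up weights rather than ones trained on yeast, the first three bases are drawn uniformly, and implanted sites are kept non-overlapping. All names are illustrative.

```python
import random

BASES = "ACGT"

# Placeholder 3rd-order model: the same made-up weights for every context.
PLACEHOLDER_TRANSITIONS = {
    a + b + c: [0.3, 0.2, 0.2, 0.3] for a in BASES for b in BASES for c in BASES
}

def sample_background(length, transitions, rng):
    """Generate background where each base depends on the previous three."""
    seq = [rng.choice(BASES) for _ in range(3)]  # arbitrary starting context
    while len(seq) < length:
        context = "".join(seq[-3:])
        seq.append(rng.choices(BASES, weights=transitions[context])[0])
    return seq

def sample_site(pssm, rng):
    """Draw one motif instance, one column at a time (columns ordered ACGT)."""
    return "".join(rng.choices(BASES, weights=column)[0] for column in pssm)

def simulate_genome(length, transitions, pssm, n_sites, rng):
    """Background sequence with n_sites implanted motifs at known positions."""
    seq = sample_background(length, transitions, rng)
    k = len(pssm)
    starts = []
    while len(starts) < n_sites:
        s = rng.randrange(0, length - k + 1)
        if all(abs(s - t) >= k for t in starts):  # keep sites non-overlapping
            starts.append(s)
    for s in starts:
        seq[s:s + k] = sample_site(pssm, rng)
    return "".join(seq), sorted(starts)
```

Because the implant positions are returned alongside the sequence, every kmer can later be labeled as a true motif or as background.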
EDGE WEIGHTS
Back to F:
α(k, l) is the number of true motifs of length k that are at distance l
β(k, l) is the number of non-motifs of length k that are at distance l
EDGE WEIGHTS
[Figure: example kmers GGGGGG (shown three times), GGGCGG, and GTGGGG, grouped into “True Motifs” and “False Motifs (Part of Background)”]
Let’s perform the calculation from the perspective of this motif: GGGGGG
All ≤1 distance away (Hamming distance): GGGCGG, GTGGGG
α(k = 6, l = 1) = 1
β(k = 6, l = 1) = 1
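A sketch of the counting in this example. The equation itself did not survive in this transcript, so the ratio α / (α + β) used below is an assumed reading of "likelihood of two kmers being in the same motif", not a confirmed formula from the paper. Treating GGGCGG as the true motif and GTGGGG as the background kmer is also an assumption about the figure.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length kmers."""
    return sum(x != y for x, y in zip(a, b))

def edge_likelihood(reference, true_sites, background_kmers, l):
    """Count true and background kmers at distance l from the reference.

    Returns (alpha, beta, alpha / (alpha + beta)); the ratio is an assumed
    form of F and is None when no kmer lies at distance l.
    """
    alpha = sum(hamming(reference, s) == l for s in true_sites)
    beta = sum(hamming(reference, s) == l for s in background_kmers)
    if alpha + beta == 0:
        return alpha, beta, None
    return alpha, beta, alpha / (alpha + beta)

print(edge_likelihood("GGGGGG", ["GGGCGG"], ["GTGGGG"], 1))
```

In a full simulation the counts would be accumulated over every motif instance in many simulated genomes, for each kmer length k and distance l.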
EDGE WEIGHTS
Computation provides an empirical estimate for F
Parameterized by two quantities:
k, the kmer length
l, the Hamming distance between two kmers
Fit to a sigmoidal function
EDGE WEIGHTS
Normalization step
Won’t go into details
This covers problem formulation
How is motif finding actually done?
MAXIMUM DENSITY SUBGRAPH
Standard graph theory method
Max-flow / min-cut
O(nm log(n^2/m))
Need faster method
Developed heuristic approach that utilizes max-flow / min-cut method with modifications
Remove all edges below a certain threshold
Pick one vertex (do this for every vertex)
Put back all neighboring edges for that vertex
Use standard algorithm to calculate densest subgraph
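For intuition only, here is a sketch of a densest-subgraph search using greedy peeling: repeatedly drop the node with the smallest weighted degree and remember the densest set seen. This is a swapped-in approximation technique, not the max-flow / min-cut method or the MotifCut heuristic described on the slides. The edge representation (a dict keyed by `frozenset` pairs) is an assumption.

```python
def densest_subgraph_greedy(nodes, weights):
    """Greedy peeling; `weights` maps frozenset({u, v}) to an edge weight."""
    remaining = set(nodes)
    degree = {u: 0.0 for u in nodes}
    total = 0.0
    for edge, w in weights.items():
        u, v = tuple(edge)
        degree[u] += w
        degree[v] += w
        total += w
    best_density, best_set = total / len(remaining), set(remaining)
    while len(remaining) > 1:
        u = min(remaining, key=lambda x: degree[x])  # weakest node
        remaining.remove(u)
        total -= degree[u]  # drop every edge still attached to u
        for v in remaining:
            degree[v] -= weights.get(frozenset((u, v)), 0.0)
        current = total / len(remaining)
        if current > best_density:
            best_density, best_set = current, set(remaining)
    return best_set, best_density

weights = {
    frozenset(("a", "b")): 1.0,
    frozenset(("a", "c")): 1.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("c", "d")): 0.25,
}
print(densest_subgraph_greedy(["a", "b", "c", "d"], weights))
```

Density here is the sum of edge weights per node, matching the definition given earlier in the talk.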
RESULTS
Synthetic Tests
Plenty of test cases
Measure performance as data set size grows
Avoid over-biasing on empirical data
Know the real answer, can unambiguously test performance
Yeast Test
Gold standard data (Harbison et al., 2004)
SYNTHETIC TESTS
Varied:
Motif length
Information content
Simulated genome (as before)
Correlated predicted PSSMs to real ones, counted as true positive if correlation > 0.7
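A sketch of the true-positive criterion, assuming "correlation" means the Pearson correlation of the flattened matrix entries and that the two PSSMs are already aligned and of equal width. The slides do not specify either detail. Function names are illustrative.

```python
import math

def pssm_correlation(p, q):
    """Pearson correlation between the flattened entries of two PSSMs."""
    x = [value for column in p for value in column]
    y = [value for column in q for value in column]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)  # undefined for a perfectly flat PSSM

def is_true_positive(predicted, real, threshold=0.7):
    """Apply the correlation > 0.7 rule from the slide."""
    return pssm_correlation(predicted, real) > threshold

real = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]]
wrong = [[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.7, 0.1]]
print(is_true_positive(real, real), is_true_positive(wrong, real))
```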
SYNTHETIC TESTS RESULTS
[Figure: synthetic test results]
YEAST TEST RESULTS
[Figure: yeast test results]
PERFORMANCE
[Figure: performance]
Analysis of Motif Finders’ Performance
Shorter but drier (no pretty pictures)
REFERENCE
Authors: Patrick Ng, Niranjan Nagarajan, Neil Jones, and Uri Keich
Title: Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone
Publication: Bioinformatics Vol. 22 no. 14 2006, pages e393–e401
OVERVIEW
Twilight Zone
Non-negligible probability that a maximally scoring random motif would have a higher score than motifs that overlap the “real” motif
Motivation
Behavior of Motif Finders in Twilight Zone is poorly understood
Understanding would aid in development of Motif Finders
Sheds light on whether it is theoretically possible
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
E-VALUE
E-value is defined in terms of information content
Information Content
[Equation: information content, with labeled terms: alignment length, number of sequences, alphabet size, frequency of jth letter at ith position, background frequency of jth letter]
E-value
Expected number of random alignments exhibiting an information content at least as high as that of the given alignment
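The exact formula is not recoverable from this transcript. As a sketch, the code below assumes the common relative-entropy form built from the labeled terms: the number of sequences N times the sum over positions i and letters j of f_ij · ln(f_ij / b_j), where f_ij is the frequency of letter j at position i and b_j its background frequency. The original slide may differ in scaling or log base.

```python
import math
from collections import Counter

def information_content(alignment, background):
    """Assumed form: N * sum_i sum_j f_ij * ln(f_ij / b_j)."""
    n = len(alignment)
    length = len(alignment[0])
    total = 0.0
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        for letter, count in counts.items():
            f = count / n  # letters with zero count contribute nothing
            total += f * math.log(f / background[letter])
    return n * total

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(information_content(["ACGT"] * 4, uniform))
```

A perfectly conserved alignment scores highest, while a column whose letter frequencies match the background contributes nothing.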
ANALYSIS
Generate 400 random datasets
Dataset = 40 sequences totaling 1485 bases
Implant a single motif of length 13 per dataset
High likelihood that motif finders would miss it
RESULTS
Reported E-value: 8 × 10^15
Very high, very statistically insignificant
In principle, theoretically impossible to find
Search results
Alignment covering ≥30% of motif found in 288/400 cases!
Data generated exactly in accordance with E-value model
WHAT’S GOING ON?
They don’t know, hand-wave it
Many “satellite” alignments boost the effective score
Difficult to characterize analytically
NEW METRIC: OPV
Also defined in terms of Information Content
OPV(s) (Overall p-value)
Probability that a random sample of the same size as the input set will contain an alignment with at least as much information content as s
Contrast
E-value: Expected number of alignments (in general)
OPV: Probability of finding an alignment in a dataset
ESTIMATION
Caveat
Random sample (no biasing)
Difficult to calculate analytically
Estimate empirically
General OPV
Finder-specific OPV
GENERAL OPV ESTIMATION
Generate 1600 random datasets
No implants
Run a collection of motif finders on each dataset
Pick highest scoring motif in each dataset
Out of all finders
Sort scores, then pick the score at the 95% quantile (call it s0)
Score such that 95% of scores are below it, 5% above it
Meaning
95% of the time, the highest scoring random motif scored less than s0
A score ≥ s0 arises by chance in ≤ 5% of random datasets
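The estimation procedure above can be sketched as follows, assuming the best score per random dataset has already been collected into a list. The names `opv_threshold` and `empirical_opv` are illustrative, and the quantile convention (smallest score with at least 95% of scores at or below it) is one of several reasonable choices.

```python
def opv_threshold(null_scores, percent=95):
    """Score s0 such that roughly `percent`% of null scores fall at or below it."""
    ordered = sorted(null_scores)
    rank = (len(ordered) * percent + 99) // 100  # ceiling of n * percent / 100
    return ordered[rank - 1]

def empirical_opv(score, null_scores):
    """Fraction of random datasets whose best score is at least `score`."""
    return sum(s >= score for s in null_scores) / len(null_scores)

# Stand-in for the best score found in each of 100 random datasets.
null_scores = list(range(1, 101))
s0 = opv_threshold(null_scores)
print(s0, empirical_opv(96, null_scores))
```

For the finder-specific variant, the same code applies with null scores collected from a single motif finder instead of the best over a collection.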
GENERAL OPV RESULTS
Run on previous 400 datasets
90% of the correct runs (the 288 out of 400) were classified as noise
Not good…
FINDER-SPECIFIC OPV ESTIMATION
Same as before, but use only one finder
Better biased toward the parameter space of the specific finder
FINDER-SPECIFIC OPV RESULTS
Tested it on Gibbs
Same 400 datasets
228 TPs
13 FPs
Much better…
USING OPV
Impractical
A priori generation is prohibitive given parameter space of motif finders
Per problem estimation is prohibitive
Requires ~100x more runs
Not theirs…
![Page 71: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/71.jpg)
ANOTHER METRIC: ILR (INCOMPLETE LIKELIHOOD RATIO)
Not defined in terms of Information Content
Number of Sequences
Length of nth sequence
Length of motif
Probability of subsequence starting at
m to be the motifMotif PSSM
Background PSSMProbability of subsequence
starting at m to be background
![Page 72: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/72.jpg)
ANOTHER METRIC: ILR (INCOMPLETE LIKELIHOOD RATIO)
Not defined in terms of Information Content
Ratio of null hypothesis to OOPS hypothesis
OOPS: One occurrence per sequence
Intuition behind it
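The slide's formula did not survive in this transcript, so the following is only a minimal sketch of one plausible reading of the terms listed for ILR. It assumes a zero-order background, a motif start position that is uniform over each sequence (the OOPS assumption), and a score reported as the log of OOPS over null, so that larger means a stronger motif. Negating the result gives the null-over-OOPS direction. The names `window_ratio`, `ilr`, and `column` and the toy sequences are hypothetical.

```python
import math

def window_ratio(window, motif, background):
    """P(window | motif PSSM) / P(window | background) for one subsequence."""
    ratio = 1.0
    for j, base in enumerate(window):
        ratio *= motif[j][base] / background[base]
    return ratio

def ilr(sequences, motif, background):
    """Assumed ILR form: sum over sequences of the log of the average
    window likelihood ratio, averaging over all possible motif starts m."""
    width = len(motif)  # length of motif
    total = 0.0
    for seq in sequences:  # one term per sequence
        n_starts = len(seq) - width + 1  # valid starts in this sequence
        if n_starts <= 0:
            raise ValueError("sequence shorter than motif")
        avg = sum(
            window_ratio(seq[m:m + width], motif, background)
            for m in range(n_starts)
        ) / n_starts
        total += math.log(avg)
    return total

def column(preferred):
    """Hypothetical PSSM column that strongly prefers one base (no zeros)."""
    return {b: 0.85 if b == preferred else 0.05 for b in "ACGT"}

background = {b: 0.25 for b in "ACGT"}
motif = [column(b) for b in "ACG"]

with_motif = ["TTACGTT", "ACGTTTT"]      # each contains one ACG
without_motif = ["TTTTTTT", "TTTTTTT"]   # no ACG anywhere

print(ilr(with_motif, motif, background))     # positive
print(ilr(without_motif, motif, background))  # negative
```

Because the score averages over every start position instead of committing to a single best site, it does not depend on a chosen alignment. That is one way to read "incomplete" in the metric's name.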
![Page 73: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/73.jpg)
ANOTHER METRIC: ILR (INCOMPLETE LIKELIHOOD RATIO)
![Page 74: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/74.jpg)
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
![Page 75: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/75.jpg)
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
![Page 76: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/76.jpg)
MOTIF FINDING USING ILR
Used existing algorithms, ranked final output by ILR
Developed simple new algorithm that uses ILR as objective function
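The first approach, re-ranking an existing finder's output, might look like the sketch below. The slides do not show how the ranking was done. The scoring function is an assumed ILR-style score (log of the average window likelihood ratio per sequence, OOPS over a zero-order background), and `score_ilr`, `rank_by_ilr`, `column`, and the candidate names are hypothetical.

```python
import math

def score_ilr(sequences, motif, background):
    """Assumed ILR-style score; requires every sequence to be at least
    as long as the motif and every PSSM entry to be nonzero."""
    width = len(motif)
    total = 0.0
    for seq in sequences:
        ratios = [
            math.prod(motif[j][seq[m + j]] / background[seq[m + j]]
                      for j in range(width))
            for m in range(len(seq) - width + 1)
        ]
        total += math.log(sum(ratios) / len(ratios))
    return total

def rank_by_ilr(candidates, sequences, background):
    """Return (score, name) pairs for candidate PSSMs, best score first."""
    scored = [
        (score_ilr(sequences, pssm, background), name)
        for name, pssm in candidates.items()
    ]
    return sorted(scored, reverse=True)

def column(preferred):
    """Hypothetical PSSM column that strongly prefers one base."""
    return {b: 0.85 if b == preferred else 0.05 for b in "ACGT"}

background = {b: 0.25 for b in "ACGT"}
sequences = ["TTACGTT", "CAACGTA", "ACGTTCA"]

# Stand-ins for the final motifs reported by existing finders.
candidates = {
    "candidate_1": [column(b) for b in "ACG"],
    "candidate_2": [column(b) for b in "GGG"],
}

ranking = rank_by_ilr(candidates, sequences, background)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

The finders run unchanged here; only the ordering of their reported motifs changes. The second approach on the slide instead uses the score as the objective during the search.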
![Page 77: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/77.jpg)
ILR MOTIF FINDING RESULTS
![Page 78: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/78.jpg)
ILR MOTIF FINDING RESULTS
![Page 79: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/79.jpg)
ILR MOTIF FINDING RESULTS
Promising…
![Page 80: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/80.jpg)
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
![Page 81: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/81.jpg)
OBJECTIVES
Analyze existing standard (E-value)
Statistical significance of motifs in Twilight Zone
Examine and suggest new metrics
Employ new metric for motif finding
One More Thing!
![Page 82: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/82.jpg)
![Page 83: R EGULATORY M OTIF F INDING Mohammed AlQuraishi. T ALK O UTLINE Biology Background Algorithmic Problem Papers New Motif Finding Algorithm (MotifCut) Analysis.](https://reader036.fdocuments.us/reader036/viewer/2022062423/56649ebd5503460f94bc5eee/html5/thumbnails/83.jpg)
THANK YOU