Motif instance identification using comparative genomics
![Page 1: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/1.jpg)
Motif instance identification using comparative genomics
Pouya Kheradpour
Joint work with: Alexander Stark, Sushmita Roy and Manolis Kellis
![Page 2: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/2.jpg)
Background and goal
[Figure labels: TF1, TF2, microRNA1]
• Regulators bind to short (5 to 20bp) sequence specific patterns (motifs)
• Genes are largely controlled through the binding of regulators
– Transcription factors (TFs) are proteins that bind near the transcription start site (TSS) of genes and either activate or repress transcription
– miRNAs bind to the 3’ untranslated region (UTR) of mRNAs to repress translation
• The goal of our work is to identify these binding sites (motif instances)
![Page 3: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/3.jpg)
Motivation
Network: Davidson and Erwin, Science (2006)
Mouse: Pennacchio, et al., Nature (2006)
Fly: Tomancak, et al., Genome Biology (2002)
• In all animals, genes are both temporally and spatially regulated to produce complex expression patterns
• Identifying the targets of regulators is vital to understanding this expression
• Conservation allows for identifying targets that are evolutionarily meaningful
![Page 4: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/4.jpg)
Previous work
• Single genome approaches
– Generally use positional clustering of motif matches to increase signal (e.g. Berman, et al. 2002; Schroeder, et al. 2004; Philippakis, et al. 2006)
  • A single 5mer match occurs on average 3 million times in a mammalian genome
– Require a set of specific factors that act together
– Miss instances of motifs that may occur alone
• Multi-genome approaches (phylogenetic footprinting)
– Blanchette and Tompa 2002 use an alignment-free phylogenetic approach to find k-mers that are unusually well conserved
– Moses, et al. 2004 use a strict phylogenetic model to find regions that evolve according to the motif and not the background
– Etwiller, et al. 2005 use both nearby species and distant species (fish) to identify motif instances
– Lewis, et al. 2005 find putative microRNA binding sites by requiring full conservation in five species
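The 5mer figure quoted on this slide can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a genome of roughly 3 billion bases, uniform base composition and a single strand; these are illustrative assumptions, not values stated in the talk.

```python
# Rough expected number of occurrences of one specific 5mer.
# Assumptions (not from the slides): ~3 Gb genome, equal base
# frequencies, one strand only.
GENOME_SIZE = 3_000_000_000
K = 5

# Each position matches a given k-mer with probability (1/4)^k.
expected_matches = GENOME_SIZE / 4 ** K

print(f"Expected matches for one {K}mer: {expected_matches:,.0f}")
```

Under these assumptions the estimate lands just under 3 million, in line with the number on the slide, which is why a single short match carries very little information on its own.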
![Page 5: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/5.jpg)
Approach outline
1. Produce a raw conservation score for each motif match (branch length score or BLS)
2. For each motif and region, produce a mapping from BLS to confidence
Advantages
• Now we have many complete, closely related genomes
– Gives enough power to identify binding sites (Eddy, 2005)
– Do not have to worry about dramatic divergence
• Account for non-motif conservation using globally derived statistics
• Robust against errors and evolutionary turnover
• Computationally feasible to run genome wide for all available motifs
![Page 6: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/6.jpg)
Large phylogeny challenges in instance identification
• Sequencing / assembly / alignment artifacts
– Low coverage sequencing, mis-alignments
• Evolutionary variation
– Individual binding sites can move / mutate
– Some instances found only in subset of species
Don’t require perfect conservation: Branch length score
Don’t require exact alignment: Search within a window
[Figure labels: motif instance, movement, missing sequence]
![Page 7: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/7.jpg)
Computing Branch Length Score (BLS)
CTCF
BLS = 2.23 substitutions per site (78%)
Does not over count redundant branch length
Allows for:
1. Mutations permitted by motif degeneracy
2. Misalignment/movement of motifs within window (up to hundreds of nucleotides)
3. Missing motif matches in dense species tree
[Figure labels: mutations, missing short branches, movement]
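One plausible reading of the branch length score is the total length of the subtree that connects all species in which a motif match was found, reported both in substitutions per site and as a fraction of the whole tree. The sketch below follows that reading; the toy tree, its branch lengths and the function name `branch_length_score` are hypothetical and only meant to illustrate why shared branches are counted once.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical toy tree: node -> (parent, branch length in substitutions per site).
TREE: Dict[str, Tuple[Optional[str], float]] = {
    "root": (None, 0.0),
    "anc1": ("root", 0.1),
    "human": ("anc1", 0.1),
    "mouse": ("anc1", 0.3),
    "dog": ("root", 0.2),
}


def branch_length_score(tree: Dict[str, Tuple[Optional[str], float]],
                        matched: Set[str]) -> Tuple[float, float]:
    """Return (BLS, BLS as a fraction of total tree length) for matched species."""
    # For every node, count how many matched species lie at or below it.
    below = {node: 0 for node in tree}
    for leaf in matched:
        node: Optional[str] = leaf
        while node is not None:
            below[node] += 1
            node = tree[node][0]

    n_matched = len(matched)
    # A branch is part of the connecting subtree only if matched species lie
    # on both sides of it. Each branch is added once, so branch length shared
    # by several species is not over-counted.
    bls = sum(length for node, (parent, length) in tree.items()
              if parent is not None and 0 < below[node] < n_matched)
    total = sum(length for parent, length in tree.values() if parent is not None)
    return bls, bls / total


bls, fraction = branch_length_score(TREE, {"human", "mouse"})
print(f"BLS = {bls:.2f} sps ({fraction:.0%} of the tree)")
```

A species with no match simply contributes no branch length, so missing sequence or a lost site lowers the score instead of discarding the instance. Searching within a window would happen before this step, when deciding which species count as matched.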
![Page 8: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/8.jpg)
Branch Length Score Confidence
1. Evaluate non-motif probability of a given score
• Sequence could also be conserved due to overlap with un-annotated element (e.g. non-coding RNA)
2. Account for differences in motif composition and length
• For example, short motif more likely to be conserved by chance
![Page 9: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/9.jpg)
Control motifs
• Control motifs are the basis of our estimation of the background level of conservation and for evaluating enrichment
• Each motif has its own set of controls
• They are chosen to:
– Have the same composition as the original motif
– Match the target regions (e.g. promoters) with approximately the same frequency (+/- 20%)
– Not be too similar to each other (to preserve diversity)
– Not be similar to known motifs (including the one being shuffled)
• Background level is estimated separately in each region type (e.g. promoters or 3’ UTRs)
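The selection criteria above can be sketched as a simple filter over random shuffles. This is a minimal illustration under several assumptions: motifs are plain consensus strings matched exactly, similarity is measured by Hamming distance with a hypothetical `min_mismatches` threshold, and the check against other known motifs is reduced to a check against the original motif only. The function names are made up for this example.

```python
import random
from typing import List


def count_matches(motif: str, sequences: List[str]) -> int:
    """Count exact (possibly overlapping) occurrences of motif in the sequences."""
    k = len(motif)
    return sum(seq[i:i + k] == motif
               for seq in sequences
               for i in range(len(seq) - k + 1))


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def select_controls(motif: str, sequences: List[str], n_shuffles: int = 100,
                    max_controls: int = 10, tolerance: float = 0.2,
                    min_mismatches: int = 3, seed: int = 0) -> List[str]:
    """Pick shuffled controls that satisfy the composition, frequency and
    diversity criteria (assumed thresholds, see lead-in)."""
    rng = random.Random(seed)
    target = count_matches(motif, sequences)
    controls: List[str] = []
    for _ in range(n_shuffles):
        letters = list(motif)
        rng.shuffle(letters)              # same composition by construction
        candidate = "".join(letters)
        # Reject shuffles too close to the original or to earlier controls.
        if any(hamming(candidate, other) < min_mismatches
               for other in [motif] + controls):
            continue
        # Require a similar number of matches in the target regions (+/- 20%).
        if abs(count_matches(candidate, sequences) - target) > tolerance * target:
            continue
        controls.append(candidate)
        if len(controls) == max_controls:
            break
    return controls
```

Because the frequency test is relative to the real motif's match count, the controls are only meaningful when they are evaluated on the same region type (for example promoters) as the motif itself.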
![Page 10: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/10.jpg)
Branch Length Score Confidence
1. Use motif-specific shuffled control motifs to determine the expected number of instances at each BLS by chance alone or due to non-motif conservation
2. Compute Confidence Score as fraction of instances over noise at a given BLS (= 1 – false discovery rate)
3. Select movement window that leads to the most instances at each confidence
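The confidence computation in steps 1 and 2 can be sketched as follows. The sketch assumes that "at a given BLS" means "at or above a BLS cutoff" and that the expected noise is the average count over the control motifs; both are assumptions made for illustration, as is the function name `confidence_at_cutoff`.

```python
from typing import List, Sequence


def confidence_at_cutoff(motif_bls: Sequence[float],
                         control_bls: List[Sequence[float]],
                         cutoff: float) -> float:
    """Confidence = 1 - (expected control instances / observed motif instances),
    counting instances whose BLS is at or above the cutoff."""
    observed = sum(1 for score in motif_bls if score >= cutoff)
    if observed == 0:
        return 0.0
    # Average over controls so the noise estimate is per motif-equivalent.
    expected = sum(sum(1 for score in scores if score >= cutoff)
                   for scores in control_bls) / len(control_bls)
    # Clamp at zero: controls can outnumber real matches at low cutoffs.
    return max(0.0, 1.0 - expected / observed)


# Toy BLS values for one motif and two of its shuffled controls.
motif_scores = [0.1, 0.5, 0.9, 0.9, 1.2, 1.5]
control_scores = [[0.1, 0.2, 0.9], [0.3, 0.4, 1.0, 1.3]]
print(confidence_at_cutoff(motif_scores, control_scores, cutoff=0.9))
```

Evaluating this function over a grid of cutoffs gives the per-motif mapping from BLS to confidence; repeating it for several window sizes would allow choosing the window that yields the most instances at each confidence level.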
![Page 11: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/11.jpg)
Confidence selects for functional instances
[Figure panels: transcription factor motifs and microRNA motifs, each shown across promoter, 5’UTR, CDS, intron and 3’UTR regions]
1. Confidence selects for transcription factor motif instances in promoters and miRNA motifs in 3’ UTRs
![Page 12: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/12.jpg)
Confidence selects for functional instances
1. Confidence selects for transcription factor motif instances in promoters and miRNA motifs in 3’ UTRs
2. miRNA motifs are found preferentially on the plus strand, whereas no such preference is found for TF motifs
Strand Bias
![Page 13: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/13.jpg)
Experimental identification of binding sites
• Chromatin immunoprecipitation (ChIP) combined with either sequencing (ChIP-seq) or microarrays (ChIP-chip) is an experimental procedure used to identify binding sites
– Not all binding is functional, so the false positive rate can be high
– Only binding that is active in the surveyed conditions is found
[Figure: ChIP-seq, from Mardis 2007]
![Page 14: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/14.jpg)
Intersection with CTCF ChIP-Seq regions
• Conserved CTCF motif instances highly enriched in ChIP-Seq sites
• High enrichment does not require low sensitivity
• Many motif instances are verified
ChIP data from Barski, et al., Cell (2007)
[Figure annotations, CTCF at 50% confidence: ≥ 50% of regions with a motif; 50% of motifs verified]
![Page 15: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/15.jpg)
Enrichment found for other factors in mammals and flies
Mammals: Barski, et al., Cell (2007); Odom, et al., Nature Genetics (2007); Lim, et al., Molecular Cell (2007); Wei, et al., Cell (2006); Zeller, et al., PNAS (2006); Lin, et al., PLoS Genetics (2007); Robertson, et al., Nature Methods (2006)
Flies: Abrams and Andrew, Devel (2005) (Not ChIP); Sandmann, et al., Devel Cell (2006); Zeitlinger, et al., Genes & Devel (2007); Sandmann, et al., Genes & Devel (2007)
![Page 16: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/16.jpg)
Enrichment increases in conserved bound regions
Human: Barski, et al., Cell (2007)
Mouse: Bernstein, unpublished
1. ChIP bound regions may not be conserved (Odom, et al. 2007)
2. For CTCF we also have binding data in mouse
3. Enrichment in intersection is dramatically higher
![Page 17: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/17.jpg)
Enrichment increases in conserved bound regions
Human: Barski, et al., Cell (2007)
Mouse: Bernstein, unpublished
Odom, et al., Nature Genetics (2007)
1. ChIP bound regions may not be conserved (Odom, et al. 2007)
2. For CTCF we also have binding data in mouse
3. Enrichment in intersection is dramatically higher
4. Trend persists for other factors where we have multi-species ChIP data
![Page 18: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/18.jpg)
Enrichment of instances in fly muscle genes
1. Motifs at 60% confidence and ChIP have similar enrichments (depletion for the repressor Snail) in the functional promoters
2. Enrichments persist even when you look at non-overlapping subsets
3. Intersection of the two has the strongest signal
4. Evolutionary and experimental evidence is complementary
• ChIP includes species-specific regions and differentiates tissues
• Conserved instances include binding sites not seen in tissues surveyed
ChIP data from: Zeitlinger, et al., G&D (2007); Sandmann, et al., G&D (2007); Sandmann, et al., Dev Cell (2006)
![Page 19: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/19.jpg)
Fly regulatory network at 60% confidence
TFs: 67 of 83 (81%) 46k instances
miRNAs: 49 of 67 (86%) 4k instances
• Several connections confirmed by literature (either directly or indirectly)
Global view of instances allows us to make network level observations:
• TFs were more targeted by TFs (P < 10^-20) and by miRNAs (P < 5 x 10^-5)
• TF in-degree associated with miRNA in-degree (high-high: P < 10^-4; low-low: P < 10^-6)
![Page 20: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/20.jpg)
Contributions
• A general methodology for regulatory motif instance identification using many, closely related genomes
– Robust against errors from sequencing, assembly and alignment
– Allows limited functional turnover and motif movement
– Provides statistical measurement of confidence for each instance, correcting for length, composition and overlap with other functional elements
• Validation and comparison to experimental data
– High enrichment of binding sites in ChIP regions for a variety of factors
– Functional enrichments suggest comparable ability to identify functional instances as ChIP
![Page 21: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/21.jpg)
Future directions
• Our predicted network was static, but real regulatory networks are dynamic
– They change throughout development and in different conditions
– They can vary greatly in different species
• We want to expand this work to learn about these network dynamics
– ChIP data is becoming increasingly available in a variety of conditions; we can use this to learn what causes changes in binding
– Multi-species data is also becoming more available
  • Can match motif binding to cross-species expression changes
– We can train on this data to find motifs that act together or compensate for each other
![Page 22: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/22.jpg)
Acknowledgments
• Alexander Stark
• Sushmita Roy
• Manolis Kellis
Mouse CTCF ChIP-Seq
• Tarjei Mikkelsen
• Brad Bernstein
Funding
• William C.H. Chao Fellowship
• NSF Graduate Research Fellowship
MIT CSAIL
• Matt Rasmussen
• Mike Lin
• Issao Fujiwara
• Rogerio Candeias
Broad Institute
• Or Zuk
• Michele Clamp
• Manuel Garber
• Mitch Guttman
• Eric Lander
![Page 23: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/23.jpg)
The End
![Page 24: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/24.jpg)
Implementation details
• A table lookup on the next 8 bases of the genome is used to find potential motif matches in the target genome
– Results in an order-of-magnitude increase in speed over scanning through all motifs
• In a first run, 100 shuffles of each motif are evaluated and up to 10 that fulfill the requirements are selected
• All motifs and their selected shuffles are matched to the target genome and their BLS values are computed
• The matches are evaluated at each branch length cutoff and a mapping is produced for each motif from branch length score to confidence
• All code is designed to run on the Broad cluster (often with parallelization) and is written in C
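The table-lookup idea in the first bullet can be illustrated with a small Python sketch (the original code is in C and is not shown in the talk). It assumes non-degenerate motifs of at least 8 bases, indexed by their first 8 bases; the names `build_table` and `scan_genome` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

K = 8  # number of genome bases used as the lookup key


def build_table(motifs: Dict[str, str]) -> Dict[str, List[str]]:
    """Index motif names by the first K bases of their consensus."""
    table: Dict[str, List[str]] = defaultdict(list)
    for name, consensus in motifs.items():
        if len(consensus) < K:
            raise ValueError(f"{name} is shorter than {K} bases")
        table[consensus[:K]].append(name)
    return table


def scan_genome(genome: str, motifs: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (position, motif name) for every exact match in the genome."""
    table = build_table(motifs)
    hits: List[Tuple[int, str]] = []
    for pos in range(len(genome) - K + 1):
        # One lookup per position yields only the motifs that can match here,
        # instead of testing every motif at every position.
        for name in table.get(genome[pos:pos + K], ()):
            if genome.startswith(motifs[name], pos):
                hits.append((pos, name))
    return hits


motifs = {"m1": "CTAATTAAA", "m2": "GCACGTGT"}
genome = "AA" + "CTAATTAAA" + "GG" + "GCACGTGT" + "AA"
print(scan_genome(genome, motifs))
```

With many motifs and their shuffled controls scanned together, the per-position cost stays close to a single lookup, which is the source of the speed-up described on the slide. Handling degenerate motifs would require expanding each motif into all 8-mers it can match before indexing.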
![Page 25: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/25.jpg)
Performance on mammalian TRANSFAC motifs
• Most motifs have confident instances up to 90% confidence with 18 mammals
• Substantial increase in the number of instances compared to only human, mouse, rat and dog
[Figure labels: 2.5x increase, 3.5x, 6.5x]
![Page 26: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/26.jpg)
The promise of many genomes
• Eddy showed that with many genomes, resolving binding sites using conservation is possible
• The goal of our work is to make this practical
– Integrate evidence from multiple informant species
– Determine which of the thousands of motif matches are functional using conservation
![Page 27: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/27.jpg)
Slides on motif discovery
![Page 28: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/28.jpg)
Related problem: computational motif discovery
• Discovery of the regulatory motifs (as opposed to their binding sites) has also been an active area of research for several years
• Single species work has generally required sequences thought to have similar regulation (for comparison, see Tompa, et al. 2005; Elemento, et al. 2007)
– Looked for patterns that were enriched in target sequences
• Use of conservation has been generally successful in re-identifying known binding affinities for TFs and miRNAs (e.g. Kellis, et al. 2003; Xie, et al. 2005; Etwiller, et al. 2005)
– Requires fewer species (i.e. less branch length) than instance identification because signal can be integrated over thousands of instances found genome-wide
![Page 29: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/29.jpg)
Motif discovery pipeline
1. Enumerate motif seeds
• Six non-degenerate characters with variable size gap in the middle
2. Score seed motifs
• Use a conservation ratio corrected for composition and small counts to rank seed motifs
3. Expand seed motifs
• Use expanded nucleotide IUPAC alphabet to fill unspecified bases around seed using hill climbing
4. Cluster to remove redundancy
• Using sequence similarity
[Figure: example seed GTC, gap, AGT and its expansion with degenerate IUPAC characters (R, Y, S, W)]
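Steps 1 and 3 of the pipeline rely on two small building blocks: enumerating gapped seeds and matching sequences against IUPAC consensus strings. The sketch below assumes that the six specified bases are split three and three around the gap and that the gap is written as a run of N; the maximum gap size and the function names are illustrative choices, not details from the talk.

```python
from itertools import product
from typing import Iterator

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def enumerate_seeds(max_gap: int) -> Iterator[str]:
    """Yield seeds of six fixed bases split 3 + 3 around a gap of 0..max_gap Ns."""
    for gap in range(max_gap + 1):
        for bases in product("ACGT", repeat=6):
            yield "".join(bases[:3]) + "N" * gap + "".join(bases[3:])


def matches(motif: str, site: str) -> bool:
    """True if the site is compatible with the IUPAC consensus at every position."""
    return len(motif) == len(site) and all(
        base in IUPAC[code] for code, base in zip(motif, site))


seeds = list(enumerate_seeds(max_gap=2))
print(len(seeds), seeds[:3])
print(matches("RTAAWTTAT", "ATAAATTAT"))
```

Hill climbing in step 3 would then repeatedly replace an N (or add a flanking position) with the IUPAC character that most improves the conservation score, using `matches` to find instances of each candidate.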
![Page 30: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/30.jpg)
Consensus MCS Matches to known Expression enrichment Promoters Enhancers
1 CTAATTAAA 65.6 engrailed (en) 25.4 2
2 TTKCAATTAA 57.3 reversed-polarity (repo) 5.8 4.2
3 WATTRATTK 54.9 araucan (ara) 11.7 2.6
4 AAATTTATGCK 54.4 paired (prd) 4.5 16.5
5 GCAATAAA 51 ventral veins lacking (vvl) 13.2 0.3
6 DTAATTTRYNR 46.7 Ultrabithorax (Ubx) 16 3.3
7 TGATTAAT 45.7 apterous (ap) 7.1 1.7
8 YMATTAAAA 43.1 abdominal A (abd-A) 7 2.2
9 AAACNNGTT 41.2 20.1 4.3
10 RATTKAATT 40 3.9 0.7
11 GCACGTGT 39.5 fushi tarazu (ftz) 17.9
12 AACASCTG 38.8 broad-Z3 (br-Z3) 10.7
13 AATTRMATTA 38.2 19.5 1.2
14 TATGCWAAT 37.8 5.8 2
15 TAATTATG 37.5 Antennapedia (Antp) 14.1 5.4
16 CATNAATCA 36.9 1.8 1.7
17 TTACATAA 36.9 5.4
18 RTAAATCAA 36.3 3.2 2.8
19 AATKNMATTT 36 3.6 0
20 ATGTCAAHT 35.6 2.4 4.6
21 ATAAAYAAA 35.5 57.2 -0.5
22 YYAATCAAA 33.9 5.3 0.6
23 WTTTTATG 33.8 Abdominal B (Abd-B) 6.3 6
24 TTTYMATTA 33.6 extradenticle (exd) 6.7 1.7
25 TGTMAATA 33.2 8.9 1.6
26 TAAYGAG 33.1 4.7 2.7
27 AAAKTGA 32.9 7.6 0.3
28 AAANNAAA 32.9 449.7 0.8
29 RTAAWTTAT 32.9 gooseberry-neuro (gsb-n) 11 0.8
30 TTATTTAYR 32.9 Deformed (Dfd) 30.7
Top 30 discovered fly motifs
1. Many of the top discovered motifs match known motifs
2. Motifs are associated with genes that are preferentially expressed in tissues
![Page 31: Motif instance identification using comparative genomics](https://reader036.fdocuments.us/reader036/viewer/2022062814/5681678f550346895ddcbcc2/html5/thumbnails/31.jpg)
Discovered motifs have functional enrichments
1. Most motifs avoided in ubiquitously expressed genes
2. Functional clusters emerge
[Figure: enrichment or depletion of a motif in the promoters of genes expressed in a tissue; axes: Tissues, Motifs]