Large scale genomic data mining Curtis Huttenhower 10-23-09 Harvard School of Public Health...
Date posted: 22-Dec-2015
Large scale genomic data mining
Curtis Huttenhower
10-23-09
Harvard School of Public Health
Department of Biostatistics
Mining Biological Data
~100 GB
More than 100GB
How can we ask and answer specific biomedical questions using thousands of genome-scale datasets?
5
Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics
6
A Definition of Functional Genomics
Genomic data, prior knowledge
Data → Function
Function → Function
Gene → Gene
Gene → Function
8
MEFIT: A Framework for Functional Genomics
Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…
Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…
[Figure: frequency histograms over low to high correlation for related and unrelated gene pairs; MEFIT]
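The related/unrelated histograms above can be turned into a probability of functional relationship with Bayes' rule. The following is only a minimal sketch of that idea, not MEFIT itself: the bin count, the pseudocount, the 0.5 prior, and the function names are all assumptions, and the example values are the pair scores shown on the slide, treated here as correlations.

```python
from collections import Counter

def bin_index(corr, n_bins=4):
    """Map a correlation in [-1, 1] to one of n_bins equal-width bins."""
    clipped = max(-1.0, min(1.0, corr))
    return min(int((clipped + 1.0) / 2.0 * n_bins), n_bins - 1)

def learn_bin_probs(values, n_bins=4, pseudocount=1.0):
    """Smoothed frequency of each correlation bin (one histogram)."""
    counts = Counter(bin_index(v, n_bins) for v in values)
    total = len(values) + pseudocount * n_bins
    return [(counts[b] + pseudocount) / total for b in range(n_bins)]

def posterior_related(corr, p_rel, p_unrel, prior=0.5):
    """P(related | correlation bin) from the two histograms via Bayes' rule."""
    b = bin_index(corr, len(p_rel))
    numerator = p_rel[b] * prior
    return numerator / (numerator + p_unrel[b] * (1.0 - prior))

# Pair scores from the slide, used as toy training data (an assumption).
related = [0.9, 0.8, 0.85]
unrelated = [0.1, 0.2, 0.15]
p_rel = learn_bin_probs(related)
p_unrel = learn_bin_probs(unrelated)
high = posterior_related(0.7, p_rel, p_unrel)
low = posterior_related(0.1, p_rel, p_unrel)
```

With several datasets, one such posterior term per dataset could be combined under a naive independence assumption.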
10
MEFIT: A Framework for Functional Genomics
Golub 1999
Butte 2000
Whitfield 2002
Hansen 1998
Functional Relationship
Biological Context
• Functional area
• Tissue
• Disease
• …
11
Functional Interaction Networks
MEFIT
Global interaction network
Autophagy network
Vacuolar transport network
Translation network
Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases
14
Predicting Gene Function
Cell cycle genes
Predicted relationships between genes (low confidence to high confidence)
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
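One simple way to read a gene's edges as a participation score is to average its edge confidences to the known process genes. This is a sketch under that assumption only; the gene names and weights are hypothetical, and no normalization for specificity is attempted.

```python
def process_score(gene, process_genes, weights):
    """Mean edge confidence between `gene` and the other genes in a process.

    `weights` maps frozenset({gene_a, gene_b}) to a confidence in [0, 1];
    missing edges count as 0 (an assumption of this sketch).
    """
    others = [g for g in process_genes if g != gene]
    if not others:
        return 0.0
    total = sum(weights.get(frozenset((gene, g)), 0.0) for g in others)
    return total / len(others)

# Hypothetical network: gene A is strongly tied to process {B, C}, gene D is not.
weights = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.7,
    frozenset(("D", "B")): 0.1,
}
score_a = process_score("A", ["B", "C"], weights)
score_d = process_score("D", ["B", "C"], weights)
```

Ranking all genes by this score gives a candidate list for the process of interest.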
15
Comprehensive Validation of Computational Predictions
Genomic data
Prior knowledge
Computational Predictions of Gene Function
• MEFIT
• SPELL (Hibbs et al 2007)
• bioPIXIE (Myers et al 2005)
Genes predicted to function in mitochondrion organization and biogenesis
Laboratory Experiments
• Petite frequency
• Growth curves
• Confocal microscopy
New known functions for correctly predicted genes
Retraining
With David Hess, Amy Caudy
16
Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis
• 106 Original GO Annotations
• 135 Under-annotations
• 82 Novel Confirmations, First Iteration
• 17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
17
Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis
• 106 Original GO Annotations
• 95 Under-annotations
• 40 Confirmed Under-annotations
• 82 Novel Confirmations, First Iteration
• 17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.
18
Functional Associations Between Contexts
Cell cycle genes
Predicted relationships between genes (low confidence to high confidence)
The average strength of these relationships indicates how cohesive a process is.
20
Functional Associations Between Contexts
Cell cycle genes
DNA replication genes
Predicted relationships between genes (low confidence to high confidence)
The average strength of these relationships indicates how associated two processes are.
21
Functional mapping: Scoring functional associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
\[ FA_{G_1,G_2} = \frac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)} \]
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.
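A small sketch of how such a score might be computed on a fully connected weighted network. It assumes each of the four measures is a mean edge weight, that the two gene sets are disjoint, and that together they contain at least one within-set pair; these are illustrative choices, and the toy genes and weights are hypothetical.

```python
from itertools import combinations

def mean_weight(pairs, w):
    """Average edge weight over a collection of gene pairs."""
    pairs = list(pairs)
    return sum(w[frozenset(p)] for p in pairs) / len(pairs)

def functional_association(g1, g2, genes, w):
    """FA = (between / background) * (baseline / within), using mean weights."""
    between = mean_weight(((a, b) for a in g1 for b in g2), w)
    within = mean_weight(list(combinations(g1, 2)) + list(combinations(g2, 2)), w)
    members = set(g1) | set(g2)
    # Every edge touching either set, counted once.
    incident = {frozenset((a, b)) for a in members for b in genes if a != b}
    background = mean_weight(incident, w)
    baseline = mean_weight(combinations(genes, 2), w)
    return (between / background) * (baseline / within)

# Toy network: two tight sets that are only weakly connected to each other.
genes = ["a", "b", "c", "d"]
w = {frozenset(p): 0.2 for p in combinations(genes, 2)}
w[frozenset(("a", "b"))] = 0.8
w[frozenset(("c", "d"))] = 0.8
fa_weak = functional_association(["a", "b"], ["c", "d"], genes, w)

# Swapping the pattern (strong between, weak within) raises the score.
w2 = {frozenset(p): 0.8 for p in combinations(genes, 2)}
w2[frozenset(("a", "b"))] = 0.2
w2[frozenset(("c", "d"))] = 0.2
fa_strong = functional_association(["a", "b"], ["c", "d"], genes, w2)
```

In the first network the sets are cohesive but unrelated, so the score falls below 1; in the second the between-set edges dominate and the score rises above 1.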
22
Functional mapping: Bootstrap p-values
• Scoring functional associations is great…
…how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?
Empirically!
[Figure: histograms of FAs for random gene sets of 1, 5, 10, and 50 genes]
For any graph, compute FA scores for many randomly chosen gene sets of different sizes. The null distribution is approximately normal with mean 1. The standard deviation is asymptotic in the sizes of both gene sets. This maps FA scores to p-values for any gene sets and underlying graph.
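The empirical procedure can be sketched as follows: draw random disjoint gene sets of the required sizes, score each pair, and use the sample mean and standard deviation in a normal upper-tail p-value. The scoring function, sample count, seed, and random toy network are assumptions made for illustration; the slide's parametric fit of the standard deviation is not reproduced here.

```python
import math
import random
import statistics
from itertools import combinations

def fa(g1, g2, genes, w):
    """FA score with each measure taken as a mean edge weight (assumption)."""
    def mean(pairs):
        pairs = list(pairs)
        return sum(w[frozenset(p)] for p in pairs) / len(pairs)
    between = mean((a, b) for a in g1 for b in g2)
    within = mean(list(combinations(g1, 2)) + list(combinations(g2, 2)))
    members = set(g1) | set(g2)
    background = mean({frozenset((a, b)) for a in members for b in genes if a != b})
    baseline = mean(combinations(genes, 2))
    return (between / background) * (baseline / within)

def null_parameters(genes, w, size1, size2, n_samples=200, seed=0):
    """Mean and standard deviation of FA over random disjoint gene sets."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        sample = rng.sample(genes, size1 + size2)
        scores.append(fa(sample[:size1], sample[size1:], genes, w))
    return statistics.mean(scores), statistics.stdev(scores)

def p_value(score, mu, sigma):
    """Upper-tail normal p-value: 1 - Phi((score - mu) / sigma)."""
    return 0.5 * math.erfc((score - mu) / (sigma * math.sqrt(2.0)))

# Hypothetical 12-gene network with random edge weights.
genes = [f"g{i}" for i in range(12)]
weight_rng = random.Random(1)
w = {frozenset(p): weight_rng.random() for p in combinations(genes, 2)}
mu, sigma = null_parameters(genes, w, size1=3, size2=3)
p_high = p_value(mu + 2 * sigma, mu, sigma)
```

A score two standard deviations above the null mean maps to a p-value of roughly 0.023 under this normal approximation.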
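[Figure: null distribution σs for one graph, plotted against |G1| (10^0 to 10^4) and |G2| (10^0 to 10^3); σ ranges from 0 to 0.45]
\[ \hat{\mu}_{FA}(G_i, G_j) = 1 \]
\hat{\sigma}_{FA}(G_i, G_j): a function of |G_i| and |G_j| with fitted constants A, B, and C
\[ P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x) \]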
25
Functional Associations Between Processes
Edges: associations between processes (moderately strong to very strong)
Nodes: cohesiveness of processes (below baseline; baseline, i.e. genomic background; very cohesive)
Borders: data coverage of processes (sparsely covered to well covered)
Processes shown:
• Hydrogen Transport
• Electron Transport
• Cellular Respiration
• Protein Processing
• Peptide Metabolism
• Cell Redox Homeostasis
• Aldehyde Metabolism
• Energy Reserve Metabolism
• Vacuolar Protein Catabolism
• Negative Regulation of Protein Metabolism
• Organelle Fusion
• Protein Depolymerization
• Organelle Inheritance
Example gene sets:
AHP1, DOT5, GRX1, GRX2, …
APE3, LAP4, PAI3, PEP4, …
26
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
27
Functional Maps: Focused Data Summarization
ACGGTGAACGTACAGTACAGATTACTAGGACATTAGGCCGTATCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data
28
Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics
29
HEFalMp: Predicting human gene function
HEFalMp
30
HEFalMp: Predicting human genetic interactions
HEFalMp
31
HEFalMp: Analyzing human genomic data
HEFalMp
32
HEFalMp: Understanding human disease
HEFalMp
33
Validating Human Predictions
Autophagy
Conditions: Not Starved; Starved (Autophagic)
Luciferase (Negative control), ATG5 (Positive control), LAMP2, RAB11A
Predicted novel autophagy proteins
5½ of 7 predictions currently confirmed
With Erin Haley, Hilary Coller
34
Current Work: Molecular Mechanisms in a Colon Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Health Professionals Follow-Up Study
Nurses' Health Study
• ~3,100 gastrointestinal subjects
• ~3,800 tissue samples
• ~1,450 colon cancer samples
• ~1,150 CpG island methylation
• ~1,200 LINE-1 methylation
• ~700 TMA immunohistochemistry
• ~2,100 cancer mutation tests
• ~775 gene expression
LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level
DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida
35
Colon Cancer: LINE-1 methylation levels
[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject; x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes 30 to 80; ρ = 0.718, p < 0.01; Ogino et al, 2008]
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
What does it all mean?
What is the biological mechanism linking LINE-1 methylation to colon cancer?
With Shuji Ogino, Charlie Fuchs
36
Colon Cancer: LINE-1 methylation levels
[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject; x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes 30 to 80; ρ = 0.718, p < 0.01; Ogino et al, 2008]
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
This suggests a genetic effect.
This suggests a copy number variation.
This suggests linkage to a cancer-related pathway.
Is anything different about these outliers?
What is the biological mechanism linking LINE-1 methylation to colon cancer?
With Shuji Ogino, Charlie Fuchs
37
Colon Cancer: LINE-1 methylation levels
What is the biological mechanism linking LINE-1 methylation to colon cancer?
Preliminary Data
• Six genes differentially expressed even using naïve methods
• One uncharacterized, one oncogene, three malignancy, one histone
• 1/3 are from a family with known variable GI expression, prognostic value
• 2/3 fall in same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a set of transmembrane receptors/channels
• Better analysis pulls out mostly one-carbon metabolism and a few more signaling pathways (neurotransmitters??)
Check back in a couple of months!
38
Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics
39
Next Steps: Microbial Communities
• Data integration is off to a great start in humans
– Complex communities of distinct cell types
– Very sparse prior knowledge
  • Concentrated in a few specific areas
– Variation across populations
– Critical to understand mechanisms of disease
40
Next Steps: Microbial Communities
• What about microbial communities?
– Complex communities of distinct species/strains
– Very sparse prior knowledge
  • Concentrated in a few specific species/strains
– Variation across populations
– Critical to understand mechanisms of disease
41
Next Steps: Functional Metagenomics
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Another data integration problem
– Must include datasets from multiple organisms
• Another context-specificity problem
– Now “context” can also mean “species”
• What questions can we answer?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– What’s shared within community X? What’s different? What’s unique?
– What’s perturbed in disease state Y? One organism, or many? Host interactions?
– Current methods annotate ~50% of synthetic data, <5% of environmental data
[Figure: network of genes from several organisms: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]
42
Next Steps: Microbial Communities
[Figure: network of genes from several organisms, shown three times: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, DLD, ARG2, AGA]
~120 available expression datasets
~70 species
Weskamp et al 2004
Flannick et al 2006
Kanehisa et al 2008
Tatusov et al 1997
• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!
43
Functional Maps for Functional Metagenomics
[Figure: network of genes YG1 to YG17 mapped onto a network of orthologous groups KO1 to KO9]
KO1: YG1, YG2, YG3
KO2: YG4
KO3: YG6
…
ECG1, ECG2
PAG1
ECG3, PAG2
…
44
Functional Maps for Functional Metagenomics
45
Validating Orthology-Based Functional Mapping
Does unweighted data integration predict functional relationships?
What is the effect of “projecting” through an orthologous space?
[Figure: four panels of log(Precision/Random) versus Recall, evaluated against KEGG and GO, comparing unsupervised integration with individual datasets]
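The axes of these panels can be computed from a ranked list of scored gene pairs and a gold standard. The sketch below assumes "Random" means the precision expected by chance (the fraction of gold-standard positives among all pairs) and uses a base-2 logarithm; both are assumptions, since the slide does not state them, and the toy pairs are hypothetical.

```python
import math

def log_precision_recall(scored_pairs):
    """Return (recall, log2(precision / random)) at each rank cutoff.

    `scored_pairs` is a list of (score, is_related) tuples; cutoffs before
    the first true positive are skipped because the log is undefined there.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    total_positive = sum(1 for _, related in ranked if related)
    random_precision = total_positive / len(ranked)
    curve = []
    true_positive = 0
    for rank, (_, related) in enumerate(ranked, start=1):
        if related:
            true_positive += 1
        if true_positive == 0:
            continue
        precision = true_positive / rank
        recall = true_positive / total_positive
        curve.append((recall, math.log2(precision / random_precision)))
    return curve

# Hypothetical predictions: the two related pairs are ranked first.
pairs = [(0.9, True), (0.8, True), (0.3, False), (0.1, False)]
curve = log_precision_recall(pairs)
```

A curve that stays above 0 indicates predictions better than chance at that recall level.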
46
Validating Orthology-Based Functional Mapping
[Figure: network of genes YG1 to YG17 divided into a holdout set (uncharacterized “genome”) and random subsets (characterized “genomes”)]
47
Validating Orthology-Based Functional Mapping
48
Validating Orthology-Based Functional Mapping
Can subsets of the yeast genome predict a heldout subset’s functional maps?
Can subsets of the yeast genome predict a heldout subset’s interactome?
[Figure: panels evaluated against KEGG and GO; values shown: 0.68, 0.48, 0.39, 0.25, 0.30, 0.37, 0.27, 0.39, 0.43, 0.40]
What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case
49
Functional Maps for Functional Metagenomics
Now, what happens if you do this for characterized microbes?
• ~20 (somewhat) well-characterized species
• 1-35 datasets each
• Integrate within species
• Evaluate using KEGG
• Then cross-validate by holding out species
[Figure: log(Precision/Random) versus Recall for unsupervised integrations, evaluated against KEGG]
50
Next Steps: Missing Methodology, Mining
• Most machine learning algorithms are optimized for one of two cases:
– Small, dense data
– Large, sparse data
• HEFalMp integrates ~300M records using ~1K features, relatively few of which are missing, in ~200 contexts
Feature selection
Regularization
Dimension reduction
Simple models, efficient algorithms
Slightly less
53
Next Steps: Missing Methodology, Models
Dataset #1
Dataset #2
Dataset #3 …
Functional Relationship
Cellular Processes
Tissue/Cell Lineage
Disease State
Developmental Stage
Cross-Species Orthology
Types of Interactions
Regulation
This is clearly not a sustainable system; novel large-scale hierarchical modeling is needed to capture the complex biology of metazoan and metagenomic interaction networks.
54
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
  • Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.
55
Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
  • Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
[Figure: original processing time versus current processing time; values shown: 8 hours, 1 minute, 30 years, 2 months, 18 hours, 2-3 hours]
56
Outline
1. Methodology: Algorithms for mining genome-scale datasets
• Bayesian system for genomic data integration
• Sleipnir software for efficient large scale data mining
• Functional mapping to statistically summarize large data collections
2. Applications: Human molecular data and clinical cancer cohorts
• HEFalMp system for human data analysis and integration
• Six confirmed predictions in autophagy
• Ongoing analysis of LINE-1 methylation in colon cancer
3. Next steps: Methods for microbial communities and functional metagenomics
• Data integration applied to microbial communities and functional metagenomics
• Efficient machine learning for large, dense feature spaces
57
Thanks!
NIGMS
http://function.princeton.edu/sleipnir
http://function.princeton.edu/hefalmp
Interested? We’re looking for students and postdocs!
Biostatistics Department
http://huttenhower.sph.harvard.edu
Hilary Coller, Erin Haley, Tsheko Mutungu
Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz
Shuji Ogino, Charlie Fuchs
59
Colon Cancer: Immunohistochemistry
Gene    Tumor #1    Tumor #2    …    Tumor #700
AKT1    0           11          …    55
AURKA   0           5           …    0
CCND1   25          0           …    30
…
Rows: genes. Columns: conditions. Values: quantities.
The world’s smallest, cheapest microarray!
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
60
Colon Cancer: Immunohistochemistry
~700 Tumor Samples
[Figure: IHC pseudoexpression (0 to 80) for STAT3, VDR, HIF1A, CDKN1B, AURKA, MAPK, CDX2, DNMT1, PPARG, CDK8, CTSB, PTEN, CCND1, comparing LINE-1 hypomethylated outliers (Low) with LINE-1 methylation “normal” (Normal)]
The world’s smallest, cheapest microarray!
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
61
Colon Cancer: Mining Microarrays
[Figure: log2(Low / Normal), axis -0.6 to 1.2, for STAT3, VDR, HIF1A, CDKN1B, AURKA, MAPK, CDX2, DNMT1, PPARG, CDK8, CTSB, PTEN, CCND1]
~650 datasets
~15,000 expression conditions
~24,000 genes
26 genes in signature
Conditions ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
62
Colon Cancer: Mining Microarrays
“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.”
Subramanian et al, 2005
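The quoted goal can be illustrated with an unweighted running-sum statistic: walk down the ranked list L, step up at each member of S and down otherwise, and report the largest deviation from zero. This is a simplified Kolmogorov-Smirnov style sketch, not the full weighted GSEA procedure with permutation testing, and the gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Signed maximum deviation of an unweighted running sum along the list.

    Positive values mean the set clusters toward the top of the ranking,
    negative values mean it clusters toward the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one gene inside and one outside the set")
    running = 0.0
    best = 0.0
    for hit in hits:
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})
es_bottom = enrichment_score(ranked, {"g5", "g6"})
```

Applied per condition, a score of this kind could rank conditions from most to least like a gene signature.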
Datasets ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature:
• Bleomycin effect on mutagen-sensitive lymphoblastoid cells
• Folic acid deficiency effect on colon cancer cells
• Bladder tumor stage classification
• Normal tissue of diverse types
• Muscle function and aging
• Non-diseased lung tissue
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
Dataset 1: Condition X, Condition Y, Condition Z
Dataset 2: Condition A, Condition B, Condition C, Condition D, Condition E
63
Colon Cancer: Mining Microarrays
“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.”
Subramanian et al, 2005
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
CNV 1: Gene X, Gene Y, Gene Z
CNV 2: Gene A, Gene B, Gene C, Gene D, Gene E
Most upregulated in significantly enriched datasets
Most downregulated
PSGs (11 genes on 19q13.3)
PCDHs (~50 genes on 5q31.3)
Misc. ~12 genes on 16p13.3
Iafrate et al, 2005
?
64
Colon Cancer: Mining Microarrays
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
Iafrate et al, 2005
Pregnancy specific β glycoproteins
Salahshor et al, 2005: “Differential gene expression profile reveals deregulation of pregnancy specific β1 glycoprotein 9 early during colorectal carcinogenesis”
“PSG9 is not found in the non-pregnant adult except in association with cancer, and it appears to be an early molecular event associated with colorectal cancer.”
65
Colon Cancer: Generating a Hypothesis
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
Pregnancy specific β glycoproteins
67
Colon Cancer: Using All the Data
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
Pregnancy specific β glycoproteins
GI cancers and chemotherapy
Yes (caveat investigator)
Get back to me in a couple of months…
What’s the state of the data?
• Extremely hypomethylated colon cancer carries a significantly poor prognosis
• In our cohort, these ~20 tumors are weakly enriched for a protein activity signature based on IHC
• The expression datasets most enriched for the same signature represent mainly GI cancer and chemotherapy conditions
• The PSG gene family is upregulated in these datasets and is linked to a known CNV
• HEFalMp associates the PSGs with cancer based on correlation with known colorectal cancer genes in a variety of expression datasets
Nothing definite – yet.
68
Human Regulatory Networks
Quiescence: reversible exit from the cell cycle
[Figure: expression of 6,829 genes in clusters G0 and I to X across serum starved and serum re-stimulated time courses (hrs; ticks 2, 4, 8, 24, 96 and 1, 2, 4, 8, 24, 48); cluster annotations: Development, Development, Cholesterol, Protein localization, Cell cycle, RNA processing, Metabolism; motifs: Elk-1, Sp1, NF-Y, YY1; FIRE: Elemento et al. 2007]
• Of only five regulators found, four have generic cell cycle/proliferation targets
• Just five basic regulators for ~7,000 genes?
• These motifs only appear upstream of ~half of the genes
69
Regulatory Modules: Expression Biclusters + Sequence Motifs
[Figure: expression matrix of genes CRG1 to CRG4 and RND1 to RND8 across conditions 1 to 8]
Bicluster: Coregulated subset of genes and conditions
71
Regulatory Modules: Expression Biclusters + Sequence Motifs
[Figure: expression matrix of genes CRG1 to CRG4 and RND1 to RND8 across conditions reordered as 3, 4, 7, 1, 2, 5, 6, 8]
Bicluster: Coregulated subset of genes and conditions
…any gene or condition can participate in multiple biclusters…
…any dataset can contain many overlapping biclusters…
…do all that, and simultaneously find (under)enriched sequence motifs!
72
COALESCE: Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction
Inputs
• Gene Expression
• DNA Sequence: 5’ UTR, 3’ UTR, upstream flank, downstream flank
• Evolutionary Conservation
• Nucleosome Positions
Steps
• Create a new module
• Identify conditions where genes coexpress
• Identify motifs enriched in genes’ sequences
• Select genes based on conditions and motifs
• Subtract mean from all data
Feature selection: Tests for differential expression/frequency
Bayesian integration
Regulatory modules
• Coregulated genes
• Conditions where they’re coregulated
• Putative regulating motifs
73
COALESCE: Selecting Coexpressed Conditions
• For each gene expression condition…
– Compare distributions of values for
  • Genes in the module versus
  • Genes not in the module
– If significantly different, include the condition
Preserving data structure:
• If multiple conditions derive from the same dataset, can be included/excluded as a unit
• For example, time course vs. deletion collection
• Test using multivariate z-test
• Precalculate covariance matrix; still very efficient
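The per-condition comparison can be sketched with a plain two-sample z-test applied to one condition at a time. This is a univariate simplification of the multivariate test mentioned above; the significance threshold, function names, and expression values are assumptions for illustration.

```python
import math
import statistics

def z_test(inside, outside):
    """Two-sample z statistic and two-sided normal p-value."""
    diff = statistics.mean(inside) - statistics.mean(outside)
    se = math.sqrt(statistics.variance(inside) / len(inside)
                   + statistics.variance(outside) / len(outside))
    z = diff / se
    return z, math.erfc(abs(z) / math.sqrt(2.0))

def select_conditions(expression, module, alpha=0.05):
    """Indices of conditions where module genes differ from all other genes.

    `expression` maps gene name to a list of values, one per condition.
    """
    n_conditions = len(next(iter(expression.values())))
    selected = []
    for c in range(n_conditions):
        inside = [v[c] for g, v in expression.items() if g in module]
        outside = [v[c] for g, v in expression.items() if g not in module]
        _, p = z_test(inside, outside)
        if p < alpha:
            selected.append(c)
    return selected

# Hypothetical data: module genes are high in condition 0 only.
expression = {
    "m1": [2.0, 0.1], "m2": [2.2, -0.2], "m3": [1.8, 0.0],
    "x1": [0.1, 0.0], "x2": [-0.1, 0.1], "x3": [0.0, -0.1], "x4": [0.2, 0.05],
}
chosen = select_conditions(expression, module={"m1", "m2", "m3"})
```

Grouping conditions from the same dataset would replace the scalar variances with a covariance matrix, as the slide notes.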
74
COALESCE: Selecting Significant Motifs
• COALESCE looks for three kinds of motifs:
– K-mers, e.g. ACGACGT
– Reverse complement pairs, e.g. ACGACAT | ATGTCGT
– Probabilistic Suffix Trees (PSTs)
• For every possible motif…
– Compare distributions of values for
  • Genes in the module versus
  • Genes not in the module
– If significantly different, include the motif
[Figure: example probabilistic suffix tree over the nucleotides A, C, G, T]
• This can distinguish flanks from UTRs
• Fast!
• Efficient enough to search coding sequence (e.g. exons/introns)
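The first two motif types can be illustrated by counting k-mers in a sequence and pooling each k-mer with its reverse complement. This is a minimal sketch with assumed function names; it does not cover probabilistic suffix trees or the in-module versus out-of-module test.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def kmer_counts(seq, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def rc_pair_counts(seq, k):
    """Count k-mers with each reverse complement pair pooled under one key.

    The key is the lexicographically smaller member of the pair.
    """
    pooled = Counter()
    for kmer, n in kmer_counts(seq, k).items():
        pooled[min(kmer, reverse_complement(kmer))] += n
    return pooled

pair = reverse_complement("ACGACAT")      # the pair shown on the slide
counts = kmer_counts("ACGACGT", 3)
pooled = rc_pair_counts("ACGT", 2)
```

Per-gene counts of this kind are the values whose distributions would then be compared between module and non-module genes.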
75
COALESCE: Selecting Probable Genes
• For each gene in the genome…
  For each significant condition…
    For each significant motif…
What’s the probability the gene came from the module’s distribution?
What’s the probability that it came from outside the module?
The probability of a gene being in the module given some data…
\[ P(g \in M \mid D) = \frac{P(D \mid g \in M)\,P(g \in M)}{P(D \mid g \in M)\,P(g \in M) + P(D \mid g \notin M)\,P(g \notin M)} \]
Distributions of each feature in and out of the developing module are observed from the data.
Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
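A sketch of this posterior for one gene, assuming each selected feature (condition or motif) is modeled as an independent Gaussian inside and outside the module. The Gaussian form, the independence across features, and all parameter values are assumptions for illustration; the slide only states that the distributions are observed from the data.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))

def posterior_in_module(values, in_params, out_params, prior=0.5):
    """P(g in M | D) for one gene, combining features with Bayes' rule.

    `values` holds the gene's value for each selected feature; `in_params`
    and `out_params` hold (mean, sd) of each feature in and out of the module.
    """
    log_in = math.log(prior)
    log_out = math.log(1.0 - prior)
    for x, (m_in, s_in), (m_out, s_out) in zip(values, in_params, out_params):
        log_in += gaussian_logpdf(x, m_in, s_in)
        log_out += gaussian_logpdf(x, m_out, s_out)
    diff = log_out - log_in
    if diff > 700.0:  # avoid overflow in exp; the posterior is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

# One feature: module genes center on 2.0, other genes on 0.0 (hypothetical).
p_member = posterior_in_module([2.0], [(2.0, 1.0)], [(0.0, 1.0)])
p_midway = posterior_in_module([1.0], [(2.0, 1.0)], [(0.0, 1.0)])
p_sticky = posterior_in_module([1.0], [(2.0, 1.0)], [(0.0, 1.0)], prior=0.8)
```

Raising the prior for genes already in the module, as in `p_sticky`, is one way to express the stabilizing effect described above.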
76
COALESCE: Integrating Additional Data Types
Nucleosome placement (N) and evolutionary conservation (C)
• Can be included as additional datasets and feature selected just like expression conditions/motifs.
• Or can be used as a prior or weight on the values of individual motifs.
Gene   N     C
G1     2.5   0.0
G2     0.6   0.5
G3     1.2   0.9
…      …     …
TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG
77
COALESCE Results: S. cerevisiae Modules
The haystack: ~2,200 conditions, ~6,000 genes
A needle: 100 genes, 80 conditions
78
COALESCE Results: Yeast TF/Target Accuracy
[Figure: Z-score (axis -0.3 to 1.3) of TF/target accuracy for COALESCE, cMonkey, FIRE, and Weeder across Bas1p, Hap4p, Met32p, Cup2p, Met31p, Zap1p, Upc2p, Mbp1p, Hsf1p, Gln3p, Hap3p, Gcn4p, Uga3p, Gis1p, Hap5p]
80
COALESCE Results: Yeast Clustering Accuracy
• ~2,200 yeast conditions
– Recapitulation of known biology from Gene Ontology
ASCL1 in 5’ flank, unch. sequences underenriched in 3’ UTR
M. musculus: Up in callosal and motor neurons
C. elegans: Up in larvae, down in adults
GATA in 5’ flank, miR-788 seed in 3’ UTR
AAGGGGC (zf?) and enriched in 5’ flank
H. sapiens: Up in normal muscle, down in diabetic
81
COALESCE: Coregulated Quiescence Modules
Down during quiescence entry, up during quiescence exit, down with adenoviral infection
Specific predicted uncharacterized reverse complement motif
Up during quiescence entry, down during quiescence exit
Many known related (proliferation) motifs: Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)
Down during quiescence entry, enriched for transport/trafficking
miR-297 motif predicted in 3’ UTR (CACATAC)
Down with let-7 exposure
let-7 motifs predicted in 3’ UTR (UACCUC)
82
Summary
• COALESCE algorithm for regulatory module prediction
– Biclustering + putative de novo motifs
– Optimized for complex organisms (fast!)
  • Large genomes, large data collections
– High accuracy, low false positives
– Leverage prior knowledge, multiple data types