Lecture 8 (D. Geman)
7/29/2019 Lecture 8 (D. Geman)
STATISTICAL LEARNING IN CANCER BIOLOGY: LECTURE 8
Donald Geman, Michael Ochs, Laurent Younes
Johns Hopkins University
ENS-Cachan
March 1, 2013
LECTURE SERIES
Lecture 1: Introduction (DG)
Lecture 2: Cancer Biology (MO)
Lecture 3: Cell Signaling Inference (MO)
Lecture 4: Genetic Variation (DG)
Lecture 5: Massive Testing (LY)
Lecture 6: Biomarker Discovery (LY)
Lecture 7: Phenotype Prediction (DG)
Lecture 8: Embedding Mechanism (DG)
OUTLINE
Results Without Biology
Gene Regulation in Cancer
Reversed Enrichment Analysis
Regulatory Motifs and Predictors
Looking Ahead
ACCURACY OF RANK-BASED CLASSIFIERS
[Figure: bar charts of estimated accuracy, one panel per dataset plus an average panel, for six classifiers (legend: TSP, TSM, SWP, kTSP, SVM, PAM). Bar labels in percent, as extracted, six per panel:]
Leukemia 2: 96 97 96 98 98 100
Leukemia 3: 98 98 98 98 100 97
Leukemia 4: 93 97 96 97 98 97
Prostate 1: 94 93 93 93 93 90
Prostate 2: 68 77 77 79 74 76
Prostate 3: 88 94 95 91 100 99
Prostate 4: 69 78 77 82 77 81
Prostate 5: 98 98 98 97 99 97
Breast 1: 86 86 86 88 83 87
Breast 2: 83 81 82 81 78 85
Average: 86 89 89 90 90 88
Figure: Estimated classification accuracy (ten runs of 10-fold CV) for
datasets D13,...,D21. Bottom diagram represents the average of the
accuracies across all data sets.
SO WHAT?
All about the same (based on cross-validation).
Generally, nothing stands up to cross-study validation, i.e.,
nothing is replicable.
What is missing, both for serious applications and probably even for robust performance, is mechanism.
Bring in the biology at the beginning, not just at the end (the customary story about the discovered genes).
TSP was originally motivated by a comment about comparing protein concentrations.
TOWARDS MECHANISM
What might be a mechanistic interpretation of the TSP classifier, where the context consists of only two genes?
Example: Obscurin and PRUNE2 form a TSP that perfectly distinguishes gastrointestinal stromal tumor (GIST) from leiomyosarcoma (LMS) (Price et al., PNAS, 2006).
It has recently been shown that both modulate RhoA activity (which controls many signaling events):
A splice variant of PRUNE2 is reported to decrease RhoA activity when over-expressed.
Obscurin contains a Rho-GEF binding domain, which helps to activate RhoA.
Hence, providing an explanation, a hypothesized mechanism, is not straightforward.
Can we say anything of a generic nature?
REGULATORY CANCER BIOLOGY
Hundreds of different cell types with different morphology
and biological functions exist in the human body.
Gene regulatory networks orchestrate this diversity by driving distinct gene expression programs, all encoded by the same genome.
A variety of molecular alterations (e.g., DNA mutation and epigenetic modification) can ultimately result in profound modifications of these regulatory networks and the resulting gene expression programs, in some cases causing cancer.
REGULATORY PATTERNS
These complex networks have distinct network motifs, defined as patterns of inter-connectivity that occur more frequently within a network than expected by chance and show distinctive structure and organization.
The analysis of the biochemical regulatory networks that control gene expression, organism development, and cellular signal transduction has revealed prominently enriched topologies among all possible triad motifs (motifs involving three nodes).
REVIEW: ACTIVATION AND REPRESSION
As seen in the lectures on cell signaling, perhaps the two most generic and elementary regulatory motifs are simply A activates B (denoted $A \to B$) and A inhibits B (denoted $A \dashv B$).
As examples of inhibition:
A may be constitutively on and B constitutively off after development.
Or perhaps A is a transcription factor or is involved in methylation of B.
In the normal phenotype we see A expressed, but perhaps A becomes inactivated in the cancer phenotype, resulting in the expression of B, and hence an expression reversal from normal to cancer.
REVIEW: TRANSCRIPTION FACTORS (TFS)
Transcription factors are proteins that usually bind upstream of genes and regulate transcription, either by activation or inhibition.
TF/MIR MOTIFS (I)
Until recently, gene expression regulatory motifs referred to
the molecular circuitry of transcription factors controlling
gene expression.
In recent years, however, the role of additional regulatory factors has been revealed.
MicroRNAs (miRs), a family of small non-coding RNA molecules, negatively regulate gene expression at both the transcriptional and post-transcriptional levels.
These changes control tissue development, stem cell maintenance, and key cellular processes like cell growth, differentiation and apoptosis.
REVIEW: MICRORNAS (MIRS)
miRs mark mRNAs for degradation by binding to the 3′ UTR.
miR targets can be predicted because binding relies on sequence complementarity.
TF/MIR MOTIFS (II)
TFs and miRs share common regulatory properties and
often co-regulate gene expression.
Statistical analyses have shown that modules involving a TF, a miR, and other non-TF genes are usually configured in a feed-forward-loop (FFL) topology, where the miR inhibits both the TF and the genes that the TF regulates.
These TF/miR regulatory motifs are crucial in organizing the body plan during development, controlling stem cells, orchestrating epithelial-to-mesenchymal transition (EMT), and distinguishing tissues.
FEED-FORWARD LOOP OF TF/MIR PAIRS
[Diagram: miR-124 and miR-30-5p each inhibit SOX9 and the SOX9 targets; SOX9 activates its targets.]
Figure: The miR inhibits both the TF SOX9 and its target genes. In the
example SOX9 activates the transcription of its target genes, while
miR-124 and miR-30-5p contribute to their degradation.
TF/MIR MOTIFS IN CANCER
Alterations of miR/TF regulatory modules have been implicated in cancer pathogenesis and progression.
For instance, a motif involving the tumor suppressor p53, the oncogene c-Myc, miR-34b, and miR-34c has recently been identified in prostate cancer.
Another circuit involving c-Myc, PTEN, E2F1, p21, and the miR-17-92 cluster has been implicated in lymphoma and in breast, prostate, stomach, colon, pancreatic, and lung cancers.
Finally, TF/miR regulatory motifs have also been shown to modulate therapy response.
These data underscore the role of regulatory motifs in cancer, suggesting that they can classify and predict cancer phenotypes.
HYPOTHESIS-DRIVEN LEARNING
Associate candidates for differential mechanism (e.g.,
regulatory motifs) with multivariate features.
Strategy: map motifs $\{M\}$ to features $g_M(X)$, where $M$ is an instantiated regulatory pattern (template).
However, doing this directly appears very difficult. Again, the sample size does not support the combinatorial search.
Instead, pass through differential expression on the way to $g(X)$.
DIFFERENTIAL EXPRESSION (DE)
How well does the expression of a single gene $i$ predict phenotype?
Decision rule: $f(x_i) = \mathbf{1}\{x_i > t\}$ for a threshold $t$.
Measure performance by the area under the ROC curve (AUROC).
Not surprisingly, TSPs are enriched for genes that are
significantly differentially expressed.
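A minimal sketch of how the single-gene score could be computed (an illustration, not the lecturers' code). Sweeping the threshold t over all values, the area under the ROC curve equals the fraction of (class 1, class 2) sample pairs in which the class 1 sample has the larger expression, with ties counted as one half. The function name `auroc` and the toy expression values are assumptions made for the example.

```python
def auroc(pos, neg):
    """AUROC of the rule 1{x > t} swept over all thresholds t.

    Computed as the fraction of (positive, negative) sample pairs in which
    the positive sample has the larger expression; ties count one half.
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy expression values for one gene (hypothetical data).
cancer = [5.1, 6.3, 7.0, 6.8]
normal = [4.2, 5.0, 5.5, 3.9]
score = auroc(cancer, normal)  # 15 of the 16 pairs are ordered correctly

# A gene can be informative in either direction, so one option is to rank
# genes by max(score, 1 - score) when selecting the top DE genes.
de_strength = max(score, 1.0 - score)
```

The double loop is quadratic in the number of samples, which is acceptable for the sample sizes typical of expression studies; a rank-based formula gives the same value faster.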
DE AND TSP
Let G be the top 100 DE genes (by AUROC).
Let $S$ be the top 100 TSPs (measured by the score $\Delta_{ij}$).
Then $S$ is enriched for pairs with $i, j \in G$ (Fisher exact test).
Example: Prostate data (similar results on other datasets):
                      $i$ or $j \notin G$    $i, j \in G$
$(i,j) \in S$         86                     14
$(i,j) \notin S$      79,368,664             4,936
Cond. probs.          $10^{-6}$              0.003
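The enrichment in the prostate table can be checked with a one-sided Fisher exact test, which is a hypergeometric tail probability. The sketch below uses only the four counts from the table; the helper names are illustrative, and evaluating binomial coefficients through log-gamma is simply one convenient way to handle the large counts, not necessarily how the original analysis was done. The conditional probabilities in the last row of the table are the fractions of pairs in each column that belong to S.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= b), where X is hypergeometric: the count in the
    top-right cell under independence with all margins fixed.
    """
    row1 = a + b          # pairs in S
    col2 = b + d          # pairs with both genes in G
    total = a + b + c + d
    log_denom = log_comb(total, row1)
    p = 0.0
    for k in range(b, min(row1, col2) + 1):
        p += exp(log_comb(col2, k) + log_comb(total - col2, row1 - k) - log_denom)
    return p


# Counts from the prostate table.
in_S_not_G, in_S_in_G = 86, 14
out_S_not_G, out_S_in_G = 79_368_664, 4_936

p_S_given_not_G = in_S_not_G / (in_S_not_G + out_S_not_G)   # about 1e-6
p_S_given_G = in_S_in_G / (in_S_in_G + out_S_in_G)          # about 0.003
p_value = fisher_upper_tail(in_S_not_G, in_S_in_G, out_S_not_G, out_S_in_G)
```

Under independence only about 0.006 of the 100 top pairs would be expected to have both genes in G, so observing 14 gives an extremely small tail probability.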
GENERAL QUESTION
Let $G_1$, $G_2$ be subsets of genes. Does $G_1 \subset G$ imply $G_2$ is enriched for DE genes?
      $G_1$          $G_2$
+     DE genes       miRNA regulators
-     DE miRNAs      Gene targets
+     DE TFs         Gene targets
+     DE genes       TF regulators
PARTIAL RANK SUM TEST
Let $A \subset B$ (features).
Rank all elements within $B$.
Test statistic: the sum of the ranks of the $N$ largest elements of $A$.
Reduces to the Wilcoxon rank sum test if $N = |A|$.
P-value computed by Monte Carlo.
Measure whether members of A are enriched near the very
top of B.
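The partial rank sum test can be sketched as follows. The slide specifies only the statistic and that the p-value is computed by Monte Carlo, so several details here are assumptions: rank 1 is given to the smallest score (so the top of B has the largest ranks), the null distribution is generated from random subsets of B with the same size as A, and an add-one correction is applied to the estimated p-value. All function and variable names are hypothetical.

```python
import random


def prst_statistic(ranks_of_A, n_top):
    """Sum of the n_top largest ranks among the members of A."""
    return sum(sorted(ranks_of_A, reverse=True)[:n_top])


def partial_rank_sum_test(scores, members_of_A, n_top, n_draws=10_000, seed=0):
    """Monte Carlo partial rank sum test.

    scores: dict mapping each element of B to a score (larger = nearer the top).
    members_of_A: the subset A of B being tested for enrichment near the top.
    Returns (observed statistic, Monte Carlo p-value).
    """
    # Rank all elements of B; rank 1 is the smallest score (ties keep input order).
    ordered = sorted(scores, key=scores.get)
    rank = {element: r for r, element in enumerate(ordered, start=1)}

    observed = prst_statistic([rank[a] for a in members_of_A], n_top)

    # Assumed null model: random subsets of B with the same size as A.
    rng = random.Random(seed)
    all_ranks = list(rank.values())
    size_A = len(members_of_A)
    hits = 0
    for _ in range(n_draws):
        if prst_statistic(rng.sample(all_ranks, size_A), n_top) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_draws + 1)


# Toy example (hypothetical data): 200 features; A holds the five top-scoring
# features plus five from the bottom.
scores = {f"g{i}": float(i) for i in range(200)}
A = [f"g{i}" for i in (199, 198, 197, 196, 195, 0, 1, 2, 3, 4)]
stat, p = partial_rank_sum_test(scores, A, n_top=5)
```

In the toy example a full Wilcoxon rank sum test (n_top equal to the size of A) would be diluted by the five bottom-ranked members, whereas the partial statistic only looks at the members of A nearest the top of B.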
DE TFS TO GENE TARGETS
Genes targeted by DE TFs (AUROC > 0.85) are enriched for
DE (p < 0.005, PRST with N = 100).
DE GENES TO TF REGULATORS
TFs that target DE genes (AUROC > 0.95) are marginally
enriched for DE (p = 0.05, PRST, N = 10).
DE GENES TO MIRNA REGULATORS
miRNAs which target DE genes are enriched for DE (p = 0.04,
PRST, N = 10).
MOTIVATION
Complex regulatory cross-talks involving microRNAs (miR)
and transcription factors (TF) control key cellular processes
like apoptosis and proliferation, and are perturbed in cancer.
Therefore, design novel prediction algorithms based on
miR/TF molecular circuitry.
PRELIMINARY EXPERIMENT
Designed to illustrate the impact of embedding these regulatory motifs into computational learning.
The phenotype is ER status in breast cancer. Whereas this
is not an open problem, it provides a test case in which the
clinical attributes of the phenotypes are well characterized.
The data are expressions of 9,000 genes common to the
breast cancer datasets GSE22220 (Dataset I) and
GSE19783 (Dataset II), training on the first one and
validating on the second one.
Compare the performance of classifiers based on the
relative expression of two genes chosen randomly versus
chosen under simple network-based constraints.
REGULATORY GRAPH
Start with experimentally-justified regulatory networks
among genes.
The regulation of miRs by TFs was obtained from the
miRgen 2.0 database.
The list of experimentally validated miR targets, which
includes the TF, was retrieved from the TarBase v5.0
database.
After cross-referencing to the common genes in Datasets I and II, this yields a network of 200 TFs, 373 miRs, and 2,772 target genes.
RANDOM PAIR CLASSIFIERS
Consider the predictor based on comparing the expressions
X1 and X2 of two genes g1 and g2.
Training consists in choosing the order which predicts ER+
and estimating accuracy. There are no parameters to
estimate.
Generate a baseline distribution of performance (measured on Dataset II) using random pairs, by sampling 100,000 classifiers out of the $\binom{9000}{2} \approx 4 \times 10^7$ possible pairs.
Compare the results to pairs (classifiers) derived from the network as follows.
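A two-gene comparison classifier of this kind can be sketched in a few lines. Training only chooses which ordering of the two expression values predicts ER+; the chosen rule is then scored on the validation set. The label coding (1 for ER+, 0 for ER-), the tie-handling, the function names, and the toy data are all assumptions made for illustration.

```python
def train_pair_classifier(x1, x2, labels):
    """Choose the direction of the comparison X1 < X2 that best predicts label 1.

    x1, x2: expression values of the two genes across training samples.
    labels: 1 for ER+ and 0 for ER- (this coding is an assumption).
    Returns True if 'X1 < X2' should predict ER+, else False.
    """
    agree = sum((a < b) == bool(y) for a, b, y in zip(x1, x2, labels))
    return agree >= len(labels) - agree


def predict(less_means_positive, a, b):
    """Predict 1 (ER+) or 0 (ER-) from one sample's two expression values."""
    return int((a < b) == less_means_positive)


def accuracy(less_means_positive, x1, x2, labels):
    """Fraction of samples classified correctly."""
    correct = sum(predict(less_means_positive, a, b) == y
                  for a, b, y in zip(x1, x2, labels))
    return correct / len(labels)


# Toy data (hypothetical): train on one dataset, validate on another.
train_x1, train_x2, train_y = [1.0, 2.0, 8.0, 9.0], [5.0, 6.0, 3.0, 4.0], [1, 1, 0, 0]
test_x1, test_x2, test_y = [0.5, 7.0, 2.5, 6.0], [4.0, 2.0, 3.0, 6.5], [1, 0, 1, 0]

direction = train_pair_classifier(train_x1, train_x2, train_y)
test_accuracy = accuracy(direction, test_x1, test_x2, test_y)
```

Repeating this for many randomly drawn gene pairs and collecting the validation accuracies would give a baseline distribution of the kind described above.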
NETWORK-BASED PAIRS
Consider now pairs of genes {g1, g2} both related to a hub s, either a TF or a miR.
Suppose g1 regulates s, which in turn regulates g2.
Example: g1 inhibits s and s activates g2. We might then expect X1 to be large and X2 to be small when s is off, and vice versa when s is on, so that s acts as a switch regulating the relative expressions of the two genes.
In our dataset, there are about 31,000 (resp., 42,000) such pairs with a TF hub (resp., miR hub). All were tested against the random pairs.
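One way to enumerate such hub-mediated pairs from a regulatory graph is sketched below. It assumes the graph is available as a list of directed (regulator, target) edges and ignores the sign of each interaction (activation versus inhibition); the function name, the edge format, and the node names in the toy graph are hypothetical and are not taken from miRGen or TarBase.

```python
from collections import defaultdict


def hub_pairs(edges, hubs):
    """Enumerate gene pairs (g1, g2) such that g1 regulates s and s regulates g2.

    edges: iterable of (regulator, target) tuples from the regulatory graph.
    hubs: the set of candidate hubs s (TFs or miRs).
    Returns a set of (g1, s, g2) triples with g1, s, g2 all distinct.
    """
    regulators_of = defaultdict(set)
    targets_of = defaultdict(set)
    for regulator, target in edges:
        regulators_of[target].add(regulator)
        targets_of[regulator].add(target)

    triples = set()
    for s in hubs:
        for g1 in regulators_of[s]:
            for g2 in targets_of[s]:
                if len({g1, s, g2}) == 3:
                    triples.add((g1, s, g2))
    return triples


# Toy graph with made-up node names.
edges = [("gene_a", "TF_1"), ("TF_1", "gene_x"), ("TF_1", "gene_y"),
         ("gene_b", "miR_1"), ("miR_1", "gene_x")]
tf_hub_triples = hub_pairs(edges, hubs={"TF_1"})
mir_hub_triples = hub_pairs(edges, hubs={"miR_1"})
```

Each triple yields one candidate classifier comparing the expressions of g1 and g2, which can then be evaluated in the same way as the random pairs.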
TEST RESULTS (I)
Random and network pairs with classification rates above
0.7 are compared.
There were respectively 862 (0.86%), 375 (1.2%) and 389 (0.96%) high-performing classifiers in each of the three categories (random, TF hub, miR hub).
A Wilcoxon rank-sum test comparing high-performing random classifiers with either TF or miR hubs has p-values of $10^{-14}$ and $10^{-26}$, respectively.
TEST RESULTS (II)
TEST RESULTS (III)
The top two-gene classifiers in the network class all involved the ESR1 gene, consistent with the biology of ER status in breast cancer.
More interesting were the genes paired with ESR1 in the best classifiers.
For instance, POU2F1 (OCT1) is a TF member of the POU family that physically interacts with BRCA1 and ER itself, and that recruits BRCA1 to the ESR1 promoter to control ER expression.
Notably, BRCA1-mutant breast tumors are typically ER negative.
FEEDBACK LOOPS (I)
Still more generally, a variety of regulatory feedback loops have been identified in mammals.
For instance, an example of a bi-stable loop is shown below.
Molecules A1, A2 (resp., B1, B2) are from the same species, for example two miRNAs (resp., two mRNAs).
Letters in boldface indicate an on state.
FEEDBACK LOOPS (II)
Due to the activation and suppression patterns, we might expect $P(X_{A_1} < X_{A_2} \mid Y = 1) \gg P(X_{A_1} < X_{A_2} \mid Y = 2)$ and $P(X_{B_1} < X_{B_2} \mid Y = 1) \ll P(X_{B_1} < X_{B_2} \mid Y = 2)$.
Thus there are two expression reversals, one between the two miRNAs and one, in the opposite direction, between the two mRNAs.
FEEDBACK LOOPS (III)
Given both miRNA and mRNA data, we might then build a classifier based on these two switches.
For example, the rank discriminant might simply be 2TSP, the number of reversals observed.
It is in this sense that expression comparisons may provide an elementary building block for a connection between rank-based decision rules and potential mechanism.
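A 2TSP-style rule built from the two switches might look like the sketch below. The orientation of each comparison is an assumption made for illustration: phenotype 1 is taken to favor A1 < A2 among the miRNAs and B2 < B1 among the mRNAs, i.e., reversals in opposite directions. The tie-breaking rule and all names are likewise hypothetical.

```python
def count_reversals(sample, pairs):
    """Number of pairs (u, v) for which sample[u] < sample[v]."""
    return sum(sample[u] < sample[v] for u, v in pairs)


def two_tsp_predict(sample, pairs=(("A1", "A2"), ("B2", "B1"))):
    """Classify one sample from two expression comparisons.

    Each pair (u, v) is oriented so that 'u < v' is the ordering expected
    under phenotype 1; the orientation used here is an illustrative assumption.
    Returns (predicted phenotype, number of comparisons voting for phenotype 1).
    """
    votes = count_reversals(sample, pairs)
    # 2 votes -> phenotype 1, 0 votes -> phenotype 2; a 1-1 split is a tie,
    # broken here (arbitrarily) in favor of phenotype 2.
    return (1 if votes == 2 else 2), votes


# Toy samples (hypothetical expression values).
s1 = {"A1": 2.0, "A2": 5.0, "B1": 7.0, "B2": 1.0}
s2 = {"A1": 6.0, "A2": 3.0, "B1": 2.0, "B2": 4.0}
pred1, votes1 = two_tsp_predict(s1)
pred2, votes2 = two_tsp_predict(s2)
```

Because the decision depends only on the within-sample ordering of two pairs of measurements, it is unchanged by any monotone rescaling applied separately to the miRNA and mRNA data.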
OPPORTUNITY KNOCKING
The potential impact of applied mathematics on cancer
systems biology is enormous.
But business as usual in the culture of mathematics is not likely to have an impact, which requires:
Collaborative efforts with biologists and doctors.
Stepping outside your intellectual comfort zone, e.g., learning some biology.
Right now is the most exciting and opportune entry point.