Lecture 8 (D. Geman)

  • 7/29/2019 Lecture 8 (D. Geman)

    1/45

    STATISTICAL LEARNING IN CANCER BIOLOGY: LECTURE 8

    Donald Geman, Michael Ochs, Laurent Younes

    Johns Hopkins University

    ENS-Cachan

    March 1, 2013

    LECTURE SERIES

    Lecture 1: Introduction (DG)

    Lecture 2: Cancer Biology (MO)

    Lecture 3: Cell Signaling Inference (MO)

    Lecture 4: Genetic Variation (DG)

    Lecture 5: Massive Testing (LY)

    Lecture 6: Biomarker Discovery (LY)

    Lecture 7: Phenotype Prediction (DG)

    Lecture 8: Embedding Mechanism (DG)

    OUTLINE

    Results Without Biology

    Gene Regulation in Cancer

    Reversed Enrichment Analysis

    Regulatory Motifs and Predictors

    Looking Ahead

    ACCURACY OF RANK-BASED CLASSIFIERS

    [Figure: bar charts of estimated accuracy, one panel per dataset plus an average panel. Bar labels (percent accuracy) as extracted, assuming the bars follow the legend order TSP, TSM, SWP, kTSP, SVM, PAM:]

    Dataset      TSP  TSM  SWP  kTSP  SVM  PAM
    Leukemia 2    96   97   96    98   98  100
    Leukemia 3    98   98   98    98  100   97
    Leukemia 4    93   97   96    97   98   97
    Prostate 1    94   93   93    93   93   90
    Prostate 2    68   77   77    79   74   76
    Prostate 3    88   94   95    91  100   99
    Prostate 4    69   78   77    82   77   81
    Prostate 5    98   98   98    97   99   97
    Breast 1      86   86   86    88   83   87
    Breast 2      83   81   82    81   78   85
    Average       86   89   89    90   90   88

    Figure: Estimated classification accuracy (ten runs of 10-fold CV) for datasets D13,...,D21. Bottom diagram represents the average of the accuracies across all data sets.

    SO WHAT?

    All about the same (based on cross-validation).

    Generally, nothing stands up to cross-study validation, i.e., nothing is replicable.

    What is missing, both for serious applications and probably even for robust performance, is mechanism.

    Bring in the biology at the beginning, not just at the end (the customary story about the discovered genes).

    TSP was originally motivated by a comment about comparing protein concentrations.

    TOWARDS MECHANISM

    What might be a mechanistic interpretation of the TSP classifier, where the context consists of only two genes?

    Example: Obscurin and PRUNE2 are a TSP that perfectly distinguishes between gastrointestinal stromal tumor (GIST) and leiomyosarcoma (LMS) (Price et al., PNAS, 2006).

    It has recently been shown that both modulate RhoA activity (which controls many signaling events):

      A splice variant of PRUNE2 is reported to decrease RhoA activity when over-expressed.

      Obscurin contains a Rho-GEF binding domain, which helps to activate RhoA.

    Hence, providing an explanation, a hypothesized mechanism, is not straightforward.

    Can we say anything of a generic nature?

    REGULATORY CANCER BIOLOGY

    Hundreds of different cell types with different morphology and biological functions exist in the human body.

    Gene regulatory networks orchestrate this diversity by regulating distinct gene expression programs, all encoded by the same genome.

    A variety of molecular alterations (e.g., DNA mutation and epigenetic modification) can ultimately result in profound modifications of these regulatory networks and the resulting gene expression programs, in some cases causing cancer.

    REGULATORY PATTERNS

    These complex networks have distinct network motifs, defined as patterns of inter-connectivity occurring more frequently within a network than expected by chance, showing distinctive structure and organization.

    The analysis of the biochemical regulatory networks that control gene expression, organism development, and cellular signal transduction has revealed prominently enriched topologies among all possible triad motifs (motifs involving three nodes).

    REVIEW: ACTIVATION AND REPRESSION

    As seen in the lectures on cell signaling, perhaps the two most generic and elementary regulatory motifs are simply A activates B (denoted $A \to B$) and A inhibits B (denoted $A \dashv B$).

    As examples of inhibition:

      A may be constitutively on and B constitutively off after development.

      Or perhaps A is a transcription factor or involved in methylation of B.

      In the normal phenotype we see A expressed, but perhaps A becomes inactivated in the cancer phenotype, resulting in the expression of B, and hence an expression reversal from normal to cancer.

    REVIEW: TRANSCRIPTION FACTORS (TFS)

    Transcription factors are proteins that usually bind upstream of genes and regulate transcription, either by activation or inhibition.

    TF/MIR MOTIFS (I)

    Until recently, gene expression regulatory motifs referred to

    the molecular circuitry of transcription factors controlling

    gene expression.

    In recent years, however, the role of additional regulatory factors has been revealed. MicroRNAs (miRs), a family of small non-coding RNA molecules, negatively regulate gene expression at both the transcriptional and post-transcriptional levels.

    These changes control tissue development, stem cell maintenance, and key cellular processes like cell growth, differentiation and apoptosis.

    REVIEW: MICRORNAS (MIRS)

    miRs mark mRNAs for degradation by binding to the 3' UTR.

    miR targets can be predicted due to complementary binding.

    TF/MIR MOTIFS (II)

    TFs and miRs share common regulatory properties and

    often co-regulate gene expression.

    Statistical analyses have shown that modules involving TFs, miRs, and other non-TF genes are usually configured in a feed-forward-loop (FFL) topology, where the miR inhibits both the TF and the genes that the TF regulates.

    These TF/miR regulatory motifs are crucial in organizing the body plan during development, controlling stem cells, orchestrating epithelial to mesenchymal transition (EMT), and distinguishing tissues.
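
    The FFL topology described above can be sketched as a search over a signed edge list. This is only an illustration under stated assumptions: the edge list, the helper name `find_ffls`, the "+"/"-" sign encoding, and the placeholder genes TARGET_1 and TARGET_2 are all hypothetical; only the names SOX9, miR-124, and miR-30-5p come from this lecture.

```python
from collections import defaultdict

# Hypothetical signed edges (regulator, target, sign):
# "-" means inhibition, "+" means activation.
EDGES = [
    ("miR-124", "SOX9", "-"),
    ("miR-124", "TARGET_1", "-"),
    ("SOX9", "TARGET_1", "+"),
    ("SOX9", "TARGET_2", "+"),
    ("miR-30-5p", "SOX9", "-"),   # no edge to a SOX9 target, so no loop
]


def find_ffls(edges, mirs, tfs):
    """Return (miR, TF, gene) triples in which the miR inhibits both
    the TF and a gene that the TF regulates."""
    sign = {(src, dst): s for src, dst, s in edges}
    targets = defaultdict(set)
    for src, dst, _ in edges:
        targets[src].add(dst)

    loops = []
    for mir in sorted(mirs):
        for tf in sorted(targets[mir] & tfs):
            if sign[(mir, tf)] != "-":
                continue
            # Genes regulated by the TF that the miR also targets.
            for gene in sorted(targets[tf] & targets[mir]):
                if sign[(mir, gene)] == "-":
                    loops.append((mir, tf, gene))
    return loops


ffls = find_ffls(EDGES, mirs={"miR-124", "miR-30-5p"}, tfs={"SOX9"})
print(ffls)
```

    In this toy network only miR-124 closes a loop, because it is the only miR with an edge to both SOX9 and one of the genes SOX9 regulates.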

    FEED-FORWARD LOOP OF TF/MIR PAIRS

    [Diagram: feed-forward loops linking miR-124 and miR-30-5p, the TF SOX9, and the SOX9 targets, with edges labeled "Inhibition" and "Activation".]

    Figure: The miR inhibits both the TF SOX9 and its target genes. In the example SOX9 activates the transcription of its target genes, while miR-124 and miR-30-5p contribute to their degradation.

    TF/MIR MOTIFS IN CANCER

    Alterations of miR/TF regulatory modules have been implicated in cancer pathogenesis and progression.

      For instance, a motif involving the tumor suppressor p53, the oncogene c-Myc, miR-34b, and miR-34c has recently been identified in prostate cancer.

      Another circuit involving c-Myc, PTEN, E2F1, p21, and the miR-17-92 cluster has been implicated in lymphoma and in breast, prostate, stomach, colon, pancreatic, and lung cancers.

    Finally, TF/miR regulatory motifs have also been shown to modulate therapy response.

    These data underscore the role of regulatory motifs in cancer, suggesting that they can classify and predict cancer phenotypes.

    HYPOTHESIS-DRIVEN LEARNING

    Associate candidates for differential mechanism (e.g., regulatory motifs) with multivariate features.

    Strategy: map motifs $\{M\}$ to features $g_M(X)$, where $M$ is an instantiated regulatory pattern (template).

    However, doing this directly appears very difficult. Again, the sample size does not support the combinatorial search.

    Instead, pass through differential expression on the way to $g(X)$.

    DIFFERENTIAL EXPRESSION (DE)

    How well does the expression of a single gene $i$ predict phenotype?

    Decision rule: $f(x_i) = \mathbf{1}\{x_i > t\}$ for a threshold $t$.

    Measure performance by the area under the ROC curve (AUROC).

    Not surprisingly, TSPs are enriched for genes that are significantly differentially expressed.
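
    As a rough sketch of the single-gene measure above, the AUROC of the threshold rule can be computed without sweeping the threshold, using its equivalence with the fraction of (positive, negative) sample pairs that are ordered correctly. The function name `auroc`, the 0/1 label coding, and the toy numbers are assumptions for illustration. Taking max(a, 1 - a) so that under-expressed genes also count as DE is likewise an assumption, not something stated on the slide.

```python
def auroc(values, labels):
    """Area under the ROC curve of the rule 1{x > t}, computed as the
    fraction of (positive, negative) pairs ranked correctly.
    Ties count one half."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample from each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy expression values for one gene (made-up numbers).
x = [2.1, 3.5, 1.0, 4.2, 1.9, 0.7]
y = [0, 1, 0, 1, 1, 0]

a = auroc(x, y)
# Assumption: a gene that is lower in class 1 is equally informative,
# so score it by its distance from chance in either direction.
de_strength = max(a, 1 - a)
print(a, de_strength)
```

    Here eight of the nine positive-negative pairs are ordered correctly, so the AUROC is 8/9.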

    DE AND TSP

    Let $G$ be the top 100 DE genes (by AUROC).

    Let $S$ be the top 100 TSPs (measured by score $\Delta_{ij}$).

    Then $S$ is enriched for pairs $i,j \in G$ (Fisher exact test).

    Example: Prostate data (similar results on other datasets):

                         $i$ or $j \notin G$    $i,j \in G$
    $(i,j) \in S$                         86             14
    $(i,j) \notin S$              79,368,664          4,936
    Cond. probs.                   $10^{-6}$          0.003
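
    The table above can be checked numerically. The sketch below recomputes the two conditional probabilities from the four counts and evaluates a one-sided Fisher exact test as a hypergeometric upper tail in exact arithmetic. The helper name `hypergeom_upper_tail` is hypothetical, and treating the test as one-sided is an assumption, since the slide does not say which alternative was used.

```python
from fractions import Fraction
from math import comb

# Counts from the prostate example. Rows: (i,j) in S / not in S.
# Columns: "i or j not in G" / "i,j in G".
in_S = (86, 14)
not_in_S = (79_368_664, 4_936)

# Conditional probability that a pair is a top TSP, given its column.
p_S_given_not_G = in_S[0] / (in_S[0] + not_in_S[0])
p_S_given_G = in_S[1] / (in_S[1] + not_in_S[1])

total_pairs = sum(in_S) + sum(not_in_S)   # all gene pairs
pairs_in_G = in_S[1] + not_in_S[1]        # pairs with both genes in G
n_S = sum(in_S)                           # size of S


def hypergeom_upper_tail(k_obs, population, successes, draws):
    """P(X >= k_obs) when `draws` items are taken without replacement
    from `population` items, `successes` of which are marked."""
    denom = comb(population, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_obs, min(successes, draws) + 1)
    )
    return Fraction(tail, denom)


# One-sided test: is the overlap of S with pairs inside G larger
# than expected if S were a random set of 100 pairs?
p_value = float(hypergeom_upper_tail(in_S[1], total_pairs, pairs_in_G, n_S))
print(p_S_given_not_G, p_S_given_G, p_value)
```

    The second column sums to 4,950, the number of pairs that can be formed from 100 genes, which is consistent with $G$ containing 100 genes.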

    GENERAL QUESTION

    Let $G_1$, $G_2$ be subsets of genes. Does $G_1 \subset G$ imply $G_2$ enriched for DE genes?

    +/-   G1           G2
    +     DE genes     miRNA regulators
    -     DE miRNAs    Gene targets
    +     DE TFs       Gene targets
    +     DE genes     TF regulators

    PARTIAL RANK SUM TEST

    Let $A \subset B$ (features).

    Rank all elements within $B$.

    Test statistic: the sum of the ranks of the $N$ largest elements of $A$.

    Reduces to the Wilcoxon rank-sum test if $N = |A|$.

    P-value computed by Monte Carlo.

    Measures whether members of $A$ are enriched near the very top of $B$.
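
    A minimal sketch of the partial rank sum test as described above, under a few assumptions that the slide does not spell out: rank 1 is assigned to the smallest score (so large ranks are near the top of B), ties are ignored, the null distribution is generated by drawing random subsets of B of size |A|, and an add-one correction is applied to the Monte Carlo estimate. All function names and the toy scores are hypothetical.

```python
import random


def rank_features(scores):
    """Map each feature in B to its rank; rank 1 is the smallest score."""
    ordered = sorted(scores, key=scores.get)
    return {f: r for r, f in enumerate(ordered, start=1)}


def partial_rank_sum(rank, subset, n_top):
    """Sum of the n_top largest ranks among the members of A."""
    return sum(sorted((rank[f] for f in subset), reverse=True)[:n_top])


def prst_p_value(rank, subset, n_top, n_draws=2000, seed=0):
    """Monte Carlo p-value: how often does a random subset of size |A|
    reach a statistic at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = partial_rank_sum(rank, subset, n_top)
    features = list(rank)
    hits = sum(
        partial_rank_sum(rank, rng.sample(features, len(subset)), n_top)
        >= observed
        for _ in range(n_draws)
    )
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)


# Toy example: B has 200 features scored 0..199, and three of the
# five members of A sit at the very top.
scores = {f"g{i}": float(i) for i in range(200)}
rank = rank_features(scores)
A = ["g199", "g198", "g197", "g150", "g20"]

stat = partial_rank_sum(rank, A, n_top=3)
p = prst_p_value(rank, A, n_top=3)
print(stat, p)
```

    Setting `n_top` to `len(A)` sums every rank in A, which is the Wilcoxon rank-sum statistic mentioned on the slide.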

    DE TFS TO GENE TARGETS

    Genes targeted by DE TFs (AUROC > 0.85) are enriched for DE (p < 0.005, PRST with N = 100).

    DE GENES TO TF REGULATORS

    TFs that target DE genes (AUROC > 0.95) are marginally

    enriched for DE (p = 0.05, PRST, N = 10).

    DE GENES TO MIRNA REGULATORS

    miRNAs which target DE genes are enriched for DE (p = 0.04,

    PRST, N = 10).

    MOTIVATION

    Complex regulatory cross-talks involving microRNAs (miR)

    and transcription factors (TF) control key cellular processes

    like apoptosis and proliferation, and are perturbed in cancer.

    Therefore, design novel prediction algorithms based on

    miR/TF molecular circuitry.

    PRELIMINARY EXPERIMENT

    Designed to illustrate the impact of embedding these regulatory motifs into computational learning.

    The phenotype is ER status in breast cancer. Whereas this

    is not an open problem, it provides a test case in which the

    clinical attributes of the phenotypes are well characterized.

    The data are expressions of 9,000 genes common to the

    breast cancer datasets GSE22220 (Dataset I) and

    GSE19783 (Dataset II), training on the first one and

    validating on the second one.

    Compare the performance of classifiers based on the

    relative expression of two genes chosen randomly versus

    chosen under simple network-based constraints.

    REGULATORY GRAPH

    Start with experimentally-justified regulatory networks

    among genes.

    The regulation of miRs by TFs was obtained from the

    miRgen 2.0 database.

    The list of experimentally validated miR targets, which

    includes the TF, was retrieved from the TarBase v5.0

    database.

    After cross-referencing to the common genes in Datasets I and II, this yields a network of 200 TFs, 373 miRs, and 2,772 target genes.

    RANDOM PAIR CLASSIFIERS

    Consider the predictor based on comparing the expressions

    X1 and X2 of two genes g1 and g2.

    Training consists in choosing the order which predicts ER+

    and estimating accuracy. There are no parameters to

    estimate.

    Generate a baseline distribution of performance (measured on Dataset II) using random pairs by sampling 100,000 classifiers out of the $\binom{9{,}000}{2} \approx 4 \cdot 10^7$ possible pairs.

    Compare the results to pairs (classifiers) derived from the network as follows.
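
    The pair classifier and the random baseline described above can be sketched as follows. The synthetic data stand in for Datasets I and II and are not the GSE22220 or GSE19783 values. The function names, the 1 = ER+ / 0 = ER- label coding, and the small sample and pair counts are assumptions chosen to keep the example fast.

```python
import random
from math import comb


def train_pair(x1, x2, y):
    """Choose the ordering of two genes that best predicts ER+ (y == 1).
    Returns (direction, training_accuracy): direction is +1 if
    "X1 > X2" predicts ER+, and -1 if "X1 < X2" predicts ER+."""
    n = len(y)
    acc_gt = sum((a > b) == (lab == 1) for a, b, lab in zip(x1, x2, y)) / n
    acc_lt = sum((a < b) == (lab == 1) for a, b, lab in zip(x1, x2, y)) / n
    return (1, acc_gt) if acc_gt >= acc_lt else (-1, acc_lt)


def predict_pair(direction, a, b):
    """Predict 1 (ER+) or 0 (ER-) from one sample's two expressions."""
    return int(a > b) if direction == 1 else int(a < b)


def accuracy(direction, x1, x2, y):
    hits = sum(predict_pair(direction, a, b) == lab
               for a, b, lab in zip(x1, x2, y))
    return hits / len(y)


n_pairs = comb(9000, 2)  # number of distinct gene pairs among 9,000 genes

rng = random.Random(0)


def toy_dataset(n_samples, n_genes):
    """Synthetic data: gene 0 is shifted upward in ER+ samples,
    every other gene is pure noise."""
    y = [rng.randint(0, 1) for _ in range(n_samples)]
    X = [[rng.gauss(2.0 * lab if g == 0 else 0.0, 1.0) for lab in y]
         for g in range(n_genes)]
    return X, y


X_train, y_train = toy_dataset(60, 20)   # stands in for Dataset I
X_test, y_test = toy_dataset(60, 20)     # stands in for Dataset II

baseline = []
for _ in range(200):                     # the lecture samples 100,000
    i, j = rng.sample(range(20), 2)
    direction, train_acc = train_pair(X_train[i], X_train[j], y_train)
    baseline.append(accuracy(direction, X_test[i], X_test[j], y_test))

print(n_pairs, sum(baseline) / len(baseline))
```

    The only thing learned from the training set is the direction of the comparison, which matches the remark that there are no parameters to estimate.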

    NETWORK BASED PAIRS

    Consider now pairs of genes {g1, g2} both related to a hub s, either a TF or a miR.

    Suppose g1 regulates s, which in turn regulates g2.

    Example: g1 inhibits s and s activates g2. We might then expect X1 to be large and X2 to be small when s is off, and vice versa when s is on, so that s acts as a switch regulating the relative expressions of the two genes.

    In our dataset, there are about 31,000 (resp., 42,000) such pairs with a TF hub (resp., miR hub). All were tested against the random pairs.
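
    One way to enumerate the network-based pairs described above is to collect, for each hub s, the genes that regulate it and the genes it regulates, then take every combination. The edge list, the gene names g3 and g4, and the helper `hub_pairs` are hypothetical; the lecture's actual network was built from the miRGen and TarBase databases.

```python
from collections import defaultdict

# Hypothetical signed regulatory edges (regulator, target, sign).
EDGES = [
    ("g1", "s", "-"),   # g1 inhibits the hub s
    ("s", "g2", "+"),   # s activates g2
    ("s", "g3", "-"),
    ("g4", "s", "+"),
]


def hub_pairs(edges, hubs):
    """List (g1, s, g2) with g1 -> s -> g2 for every hub s,
    where a hub is a TF or a miR."""
    regulators = defaultdict(set)   # hub -> genes that regulate it
    targets = defaultdict(set)      # hub -> genes it regulates
    for src, dst, _sign in edges:
        if dst in hubs:
            regulators[dst].add(src)
        if src in hubs:
            targets[src].add(dst)
    return sorted(
        (g1, s, g2)
        for s in hubs
        for g1 in regulators[s]
        for g2 in targets[s]
        if g1 != g2
    )


pairs = hub_pairs(EDGES, hubs={"s"})
print(pairs)
```

    Each triple yields one two-gene classifier that compares the expression of its first and last genes; the hub itself is used only to select the pair.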

    TEST RESULTS (I)

    Random and network pairs with classification rates above

    0.7 are compared.

    There were respectively 862 (0.86%), 375 (1.2%) and 389 (0.96%) high-performing classifiers in each of the three categories (random, TF hub, miR hub).

    A Wilcoxon rank-sum test comparing high-performing random classifiers with either TF or miR hubs has p-values of $10^{-14}$ and $10^{-26}$, respectively.

    TEST RESULTS (II)

    TEST RESULTS (III)

    The top two-gene classifiers in the network class all involved the ESR1 gene, consistent with the biology of ER status in breast cancer.

    More interesting were the genes paired with ESR1 in the best classifiers.

    For instance, POU2F1 (OCT1) is a TF member of the POU family that physically interacts with BRCA1 and ER itself, and that recruits BRCA1 to the ESR1 promoter to control ER expression. Notably, BRCA1-mutant breast tumors are typically ER negative.

    FEEDBACK LOOPS (I)

    Still more generally, a variety of regulatory feedback loops have been identified in mammals. For instance, an example of a bi-stable loop is shown below.

    Molecules A1, A2 (resp. B1, B2) are from the same species, for example two miRNAs (resp., two mRNAs). Letters in boldface indicate an on state.

    FEEDBACK LOOPS (II)

    Due to the activation and suppression patterns, we might expect $P(X_{A_1} < X_{A_2} \mid Y = 1) \gg P(X_{A_1} < X_{A_2} \mid Y = 2)$ and $P(X_{B_1} < X_{B_2} \mid Y = 1) \ll P(X_{B_1} < X_{B_2} \mid Y = 2)$.

    Thus there are two expression reversals, one between the two miRNAs and one, in the opposite direction, between the two mRNAs.

    FEEDBACK LOOPS (III)

    Given both miRNA and mRNA data, we might then build a classifier based on these two switches.

    For example, the rank discriminant might simply be 2TSP, the number of reversals observed.

    It is in this sense that expression comparisons may provide an elementary building block for a connection between rank-based decision rules and potential mechanism.
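
    A toy sketch of the two-switch idea: estimate the comparison probabilities for each pair in each class, orient each pair so that its comparison is the more frequent one in class 1, and count how many switches a new sample shows in that configuration. The data values, the dictionary layout, and the function names are made up for illustration, and class labels 1 and 2 follow the notation Y = 1, Y = 2 used in this lecture. How the count is turned into a decision (for example, by a threshold) is left open here.

```python
def prob_less(samples, labels, a, b, cls):
    """Empirical P(X_a < X_b | Y = cls)."""
    hits = [s[a] < s[b] for s, y in zip(samples, labels) if y == cls]
    return sum(hits) / len(hits)


def fit_switches(samples, labels, pairs):
    """Orient each pair so that 'first < second' is more frequent
    in class 1 than in class 2."""
    oriented = []
    for a, b in pairs:
        diff = (prob_less(samples, labels, a, b, 1)
                - prob_less(samples, labels, a, b, 2))
        oriented.append((a, b) if diff >= 0 else (b, a))
    return oriented


def count_votes(sample, oriented):
    """Number of switches found in the class-1 configuration
    (0, 1, or 2 when there are two pairs)."""
    return sum(sample[a] < sample[b] for a, b in oriented)


# Made-up profiles: A1, A2 play the two miRNAs, B1, B2 the two mRNAs.
samples = [
    {"A1": 1, "A2": 5, "B1": 6, "B2": 2},   # class 1
    {"A1": 2, "A2": 4, "B1": 7, "B2": 3},   # class 1
    {"A1": 6, "A2": 1, "B1": 2, "B2": 5},   # class 2
    {"A1": 5, "A2": 2, "B1": 1, "B2": 6},   # class 2
]
labels = [1, 1, 2, 2]

oriented = fit_switches(samples, labels, [("A1", "A2"), ("B1", "B2")])
votes = [count_votes(s, oriented) for s in samples]
mixed = count_votes({"A1": 1, "A2": 3, "B1": 2, "B2": 4}, oriented)
print(oriented, votes, mixed)
```

    In this toy data the miRNA pair keeps its orientation and the mRNA pair is flipped, which reflects the two reversals running in opposite directions.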

    OPPORTUNITY KNOCKING

    The potential impact of applied mathematics on cancer

    systems biology is enormous.

    But business as usual in the culture of mathematics is not likely to have an impact, which requires:

      Collaborative efforts with biologists and doctors.

      Stepping outside your intellectual comfort zone, e.g., learning some biology.

    Right now is the most exciting and opportunistic entry point.
