Georg Gerber, PhD, Gifford Laboratory, MIT CSAIL, April 9, 2009
Initial Steps Toward Computational Discovery of Genetic Regulatory Networks in Pancreatic Islet Development
Outline
Goals
Expression data overview
TF-TF interaction networks
◦ pair-wise mutual information
◦ Bayesian networks
Gene expression programs
ChIP-seq data
Directions for future work
Biological goals of building a transcriptional regulatory network of pancreatic specification
Knowledge of distinct signaling/transcriptional steps involved in pancreatic specification
◦ Optimize ES differentiation by determining signaling event(s) directly inducing each sequential TF
What is the network structure? Linear or cross-regulatory, parallel or all interrelated
◦ Direct reprogramming using TFs would benefit from knowing hierarchy of each network
◦ Are TFs that play role in specification of pancreas necessary for later function of pancreas or are they merely required to properly induce other necessary TFs?
Can knowledge of the pancreatic specification network teach us about lineage diversification within the pancreas (endocrine, exocrine, duct)?
Immediate computational goals
Determine set of transcription factors active at different developmental stages
Discover network “wiring”
Determine how network changes/evolves throughout development
Compare in vivo and ESC networks
Expression data overview
Figure: in vivo tissues profiled (stage labels E8.25 and E11.5): definitive endoderm (E7.75 and E8.75 as well); embryonic mesoderm; embryonic ectoderm/notochord; esophageal endoderm; lung endoderm; pancreatic endoderm (E10.5 as well); liver endoderm; stomach endoderm; intestinal endoderm
Figure: mESC differentiation schematic (markers labeled: Tcf2, Foxa2; treatments labeled: DMSO, 2 uM RA)
ES, Sox17 GFP+
50 ng/mL ActA, 6 days
DMSO / 2 uM RA, 6h/24h
FACS sort Sox17GFP+ Dpp4- definitive endoderm and perform microarray
1. Implant bead coated with DMSO/RA into foregut of E8.25 (4-6 somite) embryo
2. Explant embryo anterior to 1st somite
3. Culture for 6/24 hours
4. Dissociate, sort for EpCAM+ endoderm
5. Amplify RNA and profile on Illumina Mouse Ref8 v2 chips
Expression data overview (cont.)
120 Illumina arrays (18118 genes/array)
72 distinct experiments (41 in mESC’s)
Standardized mESC/in vivo experiments separately
2758 genes w/ ≥ 2-fold change in ≥ 5 experiments
154 TFs w/ ≥ 2-fold change in ≥ 5 experiments (out of 946 “definite” or “candidate” TFs from TFCat, Fulton et al, Genome Biology 2009)
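The "≥ 2-fold change in ≥ 5 experiments" filter can be sketched as below. This is a minimal illustration, not the analysis actually used: it assumes log2 intensities and measures fold change against each gene's median across experiments, since the slides do not state the reference. The function name `select_variable_genes` and the toy matrix are hypothetical.

```python
import numpy as np

def select_variable_genes(log2_expr, min_fold=2.0, min_experiments=5):
    """Return indices of genes with >= min_fold change in >= min_experiments.

    log2_expr: genes x experiments array of log2 intensities.
    Fold change is taken relative to each gene's median across experiments
    (an assumption; the reference used in the talk is not stated).
    """
    log2_expr = np.asarray(log2_expr, dtype=float)
    reference = np.median(log2_expr, axis=1, keepdims=True)
    log2_fc = log2_expr - reference
    # A 2-fold change in either direction is |log2 fold change| >= 1.
    changed = np.abs(log2_fc) >= np.log2(min_fold)
    return np.flatnonzero(changed.sum(axis=1) >= min_experiments)

# Toy data: 3 genes x 20 experiments, all at log2 intensity 8.
toy = np.full((3, 20), 8.0)
toy[1, :6] += 2.0   # gene 1: 4-fold up in 6 experiments
toy[2, :3] -= 1.5   # gene 2: about 2.8-fold down in only 3 experiments
selected = select_variable_genes(toy)
print(selected)
```

Restricting the same filter to a list of TF gene indices would give the TF subset.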
Limitations of expression data for genetic network reconstruction
Need 100’s of varied experiments for finding relevant/significant networks
Association ≠ causation
High false positive rates (high dimensional, noisy, dependent data)
High false negative rates (low TF transcript abundance, post-transcriptional regulation, etc.)
Pair-wise mutual information networks (CLR)
Context Likelihood of Relatedness method: Faith et al., PLoS Biology 2007
Computes MI between all genes
Innovation: considers MI distribution for both target and source to compute p-values/estimate FDR
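The idea of scoring each gene pair against the MI background of both genes can be sketched as follows. This is a simplified illustration under stated assumptions: MI is estimated with a plain histogram rather than the estimator used in the published method, the p-value/FDR step is omitted, and the function names and toy data are hypothetical.

```python
import numpy as np

def mutual_information(x, y, bins=5):
    """Histogram estimate of mutual information (nats) between two profiles."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def clr_scores(expr, bins=5):
    """CLR-style scores for a genes x experiments matrix."""
    n_genes = expr.shape[0]
    mi = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            mi[i, j] = mi[j, i] = mutual_information(expr[i], expr[j], bins)
    # Background for each gene: its MI values with every other gene.
    off_diag = ~np.eye(n_genes, dtype=bool)
    mean = np.array([mi[i, off_diag[i]].mean() for i in range(n_genes)])
    std = np.array([mi[i, off_diag[i]].std() for i in range(n_genes)])
    std[std == 0] = 1.0  # avoid division by zero for flat backgrounds
    z = np.maximum(0.0, (mi - mean[:, None]) / std[:, None])
    # Combine the z-score from gene i's background with the one from gene j's.
    scores = np.sqrt(z**2 + z.T**2)
    np.fill_diagonal(scores, 0.0)
    return scores

# Toy data: gene 1 is a noisy copy of gene 0; the other genes are independent.
rng = np.random.default_rng(0)
expr = rng.normal(size=(6, 80))
expr[1] = expr[0] + 0.1 * rng.normal(size=80)
scores = clr_scores(expr)
```

A pair scores highly only when its MI stands out from the typical MI of both genes, which is what suppresses genes that show weak MI with everything.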
CLR (cont.)
Figures: TF-TF network (MI) for each of:
◦ E8.25 4-6s definitive endoderm
◦ E8.75 13-15s definitive endoderm
◦ E9.5 definitive endoderm
◦ E10.5 pancreatic endoderm
◦ E11.5 pancreatic endoderm
◦ E11.5 intestinal endoderm
◦ 6h 83 uM RA bead; mES 2 uM RA 6h
◦ 24h 83 uM RA bead; mES 2 uM RA 24h
Bayesian networks
Directed networks, allow for multiple parents
Encode conditional independence
Penalize complexity automatically
Software: Banjo (Alexander Hartemink, Duke University)
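How a Bayesian score can penalize complexity automatically is illustrated below with a BDeu local score on discretized data. This is a sketch under stated assumptions: the slides do not say which scoring metric or discretization was used, and the function name `bdeu_local_score` and the toy data are hypothetical. It scores one node against candidate parent sets and does not perform structure search.

```python
import math
from collections import Counter

def bdeu_local_score(rows, child, parents, arity=2, ess=1.0):
    """Log BDeu marginal likelihood of one node given a candidate parent set.

    rows: list of tuples of discrete states in range(arity), one per experiment.
    child: column index of the node; parents: tuple of column indices.
    ess: equivalent sample size of the Dirichlet prior.
    """
    q = arity ** len(parents)       # number of parent configurations
    a_j = ess / q                   # prior mass per parent configuration
    a_jk = ess / (q * arity)        # prior mass per (configuration, state)
    n_jk = Counter((tuple(r[p] for p in parents), r[child]) for r in rows)
    n_j = Counter(tuple(r[p] for p in parents) for r in rows)
    score = 0.0
    # Unobserved parent configurations contribute zero and are skipped.
    for config, total in n_j.items():
        score += math.lgamma(a_j) - math.lgamma(a_j + total)
        for state in range(arity):
            count = n_jk[(config, state)]
            score += math.lgamma(a_jk + count) - math.lgamma(a_jk)
    return score

# Toy data with columns (A, B, C): C copies A 90% of the time, B is irrelevant.
rows = []
for a in (0, 1):
    for b in (0, 1):
        rows += [(a, b, a)] * 45 + [(a, b, 1 - a)] * 5

no_parent = bdeu_local_score(rows, child=2, parents=())
one_parent = bdeu_local_score(rows, child=2, parents=(0,))
two_parents = bdeu_local_score(rows, child=2, parents=(0, 1))
print(no_parent, one_parent, two_parents)
```

On this toy data the true parent alone scores best: adding the irrelevant parent B lowers the score even though it cannot lower the fit, because the marginal likelihood spreads prior mass over more parameters.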
Figures: TF-TF network (Bayes Net) for each of:
◦ E8.25 4-6s definitive endoderm
◦ E8.75 13-15s definitive endoderm
◦ E9.5 definitive endoderm
◦ E10.5 pancreatic endoderm
◦ E11.5 pancreatic endoderm
◦ 6h 83 uM RA bead; mES 2 uM RA 6h
◦ 24h 83 uM RA bead; mES 2 uM RA 24h
Advantages to methods that discover groups of genes
Infer more robust relationships because considering many genes
Allow for enrichment analysis
◦ Functional categories
◦ Signaling pathways
◦ TF DNA binding sequence motifs
GeneProgram
Gerber et al, PLoS Comp Bio 2007
Discovers sets of genes co-expressed across subsets of conditions
Innovations:
◦ Simultaneously models probabilistic structure of experiments (tissues) and genes
◦ Uses Hierarchical Dirichlet Processes, a fully Bayesian method for automatically determining the number of expression programs and tissue groups
◦ Outperforms state-of-the-art biclustering methods
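Why a Dirichlet process prior does not need the number of groups fixed in advance can be seen from the Chinese restaurant process, sketched below. This is only a single-level prior over partitions, not the hierarchical model or the inference procedure of GeneProgram; the function name `sample_crp` and its parameters are hypothetical.

```python
import random

def sample_crp(n_items, concentration=1.0, seed=0):
    """Sample a partition of n_items from a Chinese restaurant process.

    Returns (assignments, sizes): the group index of each item and the
    size of each group. The number of groups is an outcome of sampling.
    """
    rng = random.Random(seed)
    assignments = []
    sizes = []
    for _ in range(n_items):
        # An existing group k is chosen with weight sizes[k];
        # a new group is opened with weight `concentration`.
        weights = sizes + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments, sizes

assignments, sizes = sample_crp(50, concentration=2.0)
print(len(sizes), "groups with sizes", sizes)
```

Larger `concentration` values tend to produce more groups; in a full model the data likelihood, not just this prior, determines the grouping.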
Hierarchical clustering
Singular Value Decomposition (SVD)
Non-negative Matrix Factorization (NMF)
GeneProgram w/o tissue groups
Full GeneProgram model
GeneProgram produced a map of 12 tissue groups and 62 expression programs
Figure annotations: tissue groups; tissue; expression programs (sorted by generality score); expression program use by tissue
Expression program enrichment analysis
GO categories
◦ FDR controlled to 5%
TRANSFAC motifs
◦ Software: SAMBA
◦ Scans +3000 to -200 bp for each motif
◦ Uses PWM to score region, background to calculate p-value (Bonferroni corrected)
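A category enrichment test with FDR control at 5% can be sketched as below. The slides do not state which test or FDR procedure was used, so this assumes a hypergeometric tail probability and the Benjamini-Hochberg step-up procedure; the function names and all counts in the example are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n_program, n_category, n_total):
    """P(overlap >= k) when n_program genes are drawn without replacement
    from n_total genes, n_category of which belong to the category."""
    upper = min(n_program, n_category)
    total = comb(n_total, n_program)
    hits = sum(
        comb(n_category, i) * comb(n_total - n_category, n_program - i)
        for i in range(k, upper + 1)
    )
    return hits / total

def benjamini_hochberg(p_values, fdr=0.05):
    """Return sorted indices of tests rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Step-up: keep the largest rank whose p-value is under its threshold.
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical example: a 40-gene program shares 8 genes with a 100-gene
# category, out of a background of 2758 genes.
p_enriched = hypergeom_tail(8, 40, 100, 2758)
significant = benjamini_hochberg([p_enriched, 0.2, 0.6, 0.9])
print(p_enriched, significant)
```

The same tail probability applies to motif hits, with a stricter correction such as Bonferroni substituted for the FDR step if desired.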
Figures: expression programs (GO and motif enrichment) for each of:
◦ E8.25 4-6s definitive endoderm
◦ E8.75 13-15s definitive endoderm
◦ E9.5 definitive endoderm
◦ E10.5 pancreatic endoderm
Figures: expression programs showing TFs in programs and motif enrichment for each of:
◦ E8.25 4-6s definitive endoderm
◦ E8.75 13-15s definitive endoderm
◦ E9.5 definitive endoderm
◦ E10.5 pancreatic endoderm
◦ E11.5 pancreatic endoderm
Retinoic acid receptor ChIP-seq data
Generated in the Wichterle lab at Columbia (unpublished data, Motor Neuron Development Project)
mESC’s grown to embryoid body stage, profiled after 8h of RA exposure
ChIP-seq RAR binding: Cyp26a1
ChIP-seq RAR binding: Rarb
Overlap of Melton lab expression data and RAR binding data
Condition           # upreg genes   # bound genes   % bound genes   p-value
6h RA 83 uM bead    104             29              28%             0
1d RA 83 uM bead    369             29              8%              0.069
mESC 6h 2 uM RA     165             33              20%             0
mESC 1d 2 uM RA     220             38              17%             0
Binding events determined with modified MACS method (Zhang et al, Genome Biology 2008); called if significant peak found w/in 50 kb of gene start site
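The rule "bound if a significant peak lies within 50 kb of the gene start site" can be sketched as below. This is an illustration only: peak calling itself is not shown, and the data layout, the function name `bound_genes`, the gene names and the coordinates are all hypothetical.

```python
from bisect import bisect_left

def bound_genes(tss_by_gene, peaks_by_chrom, window=50_000):
    """Return the set of genes with at least one peak within `window` bp
    of the transcription start site (TSS).

    tss_by_gene: {gene: (chrom, tss_position)}
    peaks_by_chrom: {chrom: sorted list of peak positions}
    """
    bound = set()
    for gene, (chrom, tss) in tss_by_gene.items():
        peaks = peaks_by_chrom.get(chrom, [])
        # First peak at or after the left edge of the window.
        i = bisect_left(peaks, tss - window)
        if i < len(peaks) and peaks[i] <= tss + window:
            bound.add(gene)
    return bound

# Hypothetical coordinates, for illustration only.
tss = {
    "geneA": ("chr1", 100_000),
    "geneB": ("chr1", 500_000),
    "geneC": ("chr2", 100_000),
}
peaks = {"chr1": [140_000, 900_000]}
upregulated = {"geneA", "geneB"}

bound = bound_genes(tss, peaks)
fraction_bound = len(bound & upregulated) / len(upregulated)
print(bound, fraction_bound)
```

The percentage column of the table corresponds to `fraction_bound` computed per condition, for example 29 of 104 upregulated genes, or about 28%.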
Future computational directions
Add publicly available ES expression data
Apply more sophisticated TF binding motif methods (phylogeny, spatial arrangements, co-regulation)
Extend GeneProgram framework for add’l data types (TF expression, binding motifs, ChIP-seq, knockdown/overexpression, ?protein-protein interactions, etc.) → causal/predictive models
Infer dynamic rewiring networks over inferred developmental tree
Develop novel probabilistic methods for ChIP-seq data
Acknowledgements
Rich Sherwood (Melton lab) - all the expression data!
Arvind Jammalamadaka (Gifford lab) - initial data analysis/normalization methods
Shaun Mahony (Gifford lab) - RA ChIP-seq data analysis
Esteban Mazzoni (Wichterle lab) - RA ChIP-seq data
Backup slides
Figures: TF-TF network (MI) for each of:
◦ E11.5 stomach endoderm
◦ E12.5 esophagus endoderm
◦ E11.5 liver endoderm
◦ E11.5 lung endoderm
E8.25 anterior endoderm
E8.25 4-6s ectoderm
E8.25 4-6s mesoderm
6h 83 uM RA bead
d1 83 uM RA bead
mES 2 uM RA 6h
mES 2 uM RA 24h
mES differentiated 7d
GeneProgram outperformed popular biclustering algorithms in discovery of biologically meaningful gene sets from real microarray data
Data source   Algorithm     Gene dimension (GO category enrichment)   Tissue dimension (manually derived category enrichment)
N             GeneProgram   93%                                       76%
N             NMF           35%                                       29%
N             Samba         53%                                       9%
S             GeneProgram   66%                                       53%
S             NMF           28%                                       19%
S             Samba         51%                                       28%
N = Novartis Tissue Atlas v2 (141 mouse and human tissues)
S = Shyamsundar et al. (115 human tissues)