Use of Bionetworks to Build Maps of Diseases
Stephen Friend MD PhD
Sage Bionetworks (Non-Profit Organization) Seattle/ Beijing/ San Francisco
MIPS Seminar Series August 15th, 2011
why consider the fourth paradigm- data intensive science
thinking beyond the narrative, beyond pathways
advantages of an open innovation compute space
it is more about how than what
Alzheimer’s Diabetes
Cancer Obesity Treating Symptoms v.s. Modifying Diseases
Will it work for me?
Familiar but Incomplete
Reality: Overlapping Pathways
WHY NOT USE “DATA INTENSIVE” SCIENCE
TO BUILD BETTER DISEASE MAPS?
Equipment capable of generating massive amounts of data
“Data Intensive Science”- “Fourth Scientific Paradigm” For building: “Better Maps of Human Disease”
Open Information System
IT Interoperability
Evolving Models hosted in a Compute Space- Knowledge Expert
It is now possible to carry out comprehensive monitoring of many traits at the population level
Monitor disease and molecular traits in populations
Putative causal gene
Disease trait
what will it take to understand disease?
DNA RNA PROTEIN (dark matter)
MOVING BEYOND ALTERED COMPONENT LISTS
2002 Can one build a “causal” model?
trait trait trait trait trait trait trait trait trait trait trait trait trait
How is genomic data used to understand biology?
!Standard" GWAS Approaches Profiling Approaches
!Integrated" Genetics Approaches
Genome scale profiling provide correlates of disease Many examples BUT what is cause and effect?
Identifies Causative DNA Variation but provides NO mechanism
Provide unbiased view of molecular physiology as it
relates to disease phenotypes
Insights on mechanism
Provide causal relationships and allows predictions
RNA amplification Microarray hybirdization
Gene Index
Tum
ors
Tum
ors
Integration of Genotypic, Gene Expression & Trait Data
Causal Inference
Schadt et al. Nature Genetics 37: 710 (2005) Millstein et al. BMC Genetics 10: 23 (2009)
Chen et al. Nature 452:429 (2008) Zhang & Horvath. Stat.Appl.Genet.Mol.Biol. 4: article 17 (2005)
Zhu et al. Cytogenet Genome Res. 105:363 (2004) Zhu et al. PLoS Comput. Biol. 3: e69 (2007)
“Global Coherent Datasets” • population based
• 100s-1000s individuals
Constructing Co-expression Networks
Start with expression measures for genes most variant genes across 100s ++ samples
Note: NOT a gene expression heatmap
1 -0.1 -0.6 -0.8
-0.1 1 0.1 0.2
-0.6 0.1 1 0.8
-0.8 0.2 0.8 1 1
2
3
4
1 2 3 4
Correlation Matrix Brain sample
expr
essi
on
1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 1 1
2
3
4
1 2 3 4
Connection Matrix
1 0 0 0 0 1 1 1 0 1 1 1 0 1 1 1 1
2
4
3
1 2 4 3
4 1
3 2
4 1
2
Establish a 2D correlation matrix for all gene pairs
Define Threshold eg >0.6 for edge
Clustered Connection Matrix
Hierarchically cluster
sets of genes for which many pairs interact (relative to the total number of pairs in that
set)
Network Module
Identify modules
Preliminary Probabalistic Models- Rosetta /Schadt
Gene symbol Gene name Variance of OFPM explained by gene expression*
Mouse model
Source
Zfp90 Zinc finger protein 90 68% tg Constructed using BAC transgenics Gas7 Growth arrest specific 7 68% tg Constructed using BAC transgenics Gpx3 Glutathione peroxidase 3 61% tg Provided by Prof. Oleg
Mirochnitchenko (University of Medicine and Dentistry at New Jersey, NJ) [12]
Lactb Lactamase beta 52% tg Constructed using BAC transgenics Me1 Malic enzyme 1 52% ko Naturally occurring KO Gyk Glycerol kinase 46% ko Provided by Dr. Katrina Dipple
(UCLA) [13] Lpl Lipoprotein lipase 46% ko Provided by Dr. Ira Goldberg
(Columbia University, NY) [11] C3ar1 Complement component
3a receptor 1 46% ko Purchased from Deltagen, CA
Tgfbr2 Transforming growth factor beta receptor 2
39% ko Purchased from Deltagen, CA
Networks facilitate direct identification of genes that are causal for disease
Evolutionarily tolerated weak spots
Nat Genet (2005) 205:370
50 network papers http://sagebase.org/research/resources.php
List of Influential Papers in Network Modeling
(Eric Schadt)
Recognition that the benefits of bionetwork based molecular models of diseases are powerful but that they require significant resources
Appreciation that it will require decades of evolving representations as real complexity emerges and needs to be integrated with therapeutic interventions
Sage Mission
Sage Bionetworks is a non-profit organization with a vision to create a “commons” where integrative bionetworks are evolved by
contributor scientists with a shared vision to accelerate the elimination of human disease
Sagebase.org
Data Repository
Discovery Platform
Building Disease Maps
Commons Pilots
Sage Bionetworks Collaborators
Pharma Partners Merck, Pfizer, Takeda, Astra Zeneca,
Amgen, Johnson &Johnson
22
Foundations Kauffman CHDI, Gates Foundation
Government NIH, LSDF
Academic Levy (Framingham)
Rosengren (Lund)
Krauss (CHORI)
Federation Ideker, Califarno, Butte, Schadt
RULES GOVERN
Engaging Communities of Interest
PLAT
FORM
NEW
MAP
S NEW MAPS
Disease Map and Tool Users- ( Scientists, Industry, Foundations, Regulators...)
PLATFORM Sage Platform and Infrastructure Builders-
( Academic Biotech and Industry IT Partners...)
RULES AND GOVERNANCE Data Sharing Barrier Breakers-
(Patients Advocates, Governance and Policy Makers, Funders...)
NEW TOOLS Data Tool and Disease Map Generators- (Global coherent data sets, Cytoscape,
Clinical Trialists, Industrial Trialists, CROs…)
PILOTS= PROJECTS FOR COMMONS Data Sharing Commons Pilots-
(Federation, CCSB, Inspire2Live....)
Research Platform Research Platform Commons
Data Repository
Discovery Platform
Building Disease
Maps
Tools & Methods
Repository
Discovery
Maps
Tools &
Repository
Discovery Platform
Repository Repository
Discovery
Repository
Discovery
Commons Pilots
Outposts Federation
CCSB
LSDF-WPP Inspire2Live
POC
Cancer Neurological Disease
Metabolic Disease
Pfizer Merck Takeda
Astra Zeneca CHDI Gates NIH
Curation/Annotation
CTCAP Public Data Merck Data TCGA/ICGC
Hosting Data Hosting Tools
Hosting Models
LSDF
Bayesian Models Co-expression Models
KDA/GSVA
25
Example 1: Breast Cancer
Zhang B et al., manuscript
Bayesian Network
Survival Analysis
Coexpression Networks Module combination
Partition BN
4 Public Breast Cancer Datasets
NKI: van de Vijver et al. A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002 Dec 19;347(25):1999-2009.
Wang Y et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005 Feb 19-25;365(9460):671-9.
Miller: Pawitan Y et al. Gene expression profiling spares early breast cancer patients from adjuvant therapy: derived and validated in two population-based cohorts. Breast Cancer Res. 2005;7(6):R953-64.
Christos: Sotiriou C et al.. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006 Feb 15;98(4):262-72.
295 samples
286 samples
159 samples
189 samples
Generation of Co-expression & Bayesian Networks from published Breast Cancer Studies
Recovery of EGFR and Her2 oncoproteins downstream pathways by super modules downstream pathways by super modules
Comparison of Super-modules with EGFR and Her2 signaling and resistance pathways
29
Key Driver Analysis • Identify key regulators for a list of genes h and a network N • Check the enrichment of h in the downstream of each node in N • The nodes significantly enriched for h are the candidate drivers The nodes significantly enriched for
30
A) Cell Cycle (blue)
C) Pre-mRNA Processing (brown)
B) Chromatin modification (black)
D) mRNA Processing (red)
Global driver
Global driver & RNAi validation
Signaling between Super Modules
(View Poster presented by Bin Zhang)
Example 2. The Sage Non-Responder Project in Cancer
Sage Bionetworks • Non-Responder Project
• To identify Non-Responders to approved drug regimens so we can improve outcomes, spare patients unnecessary toxicities from treatments that have no benefit to them, and reduce healthcare costs
• Co-Chairs Stephen Friend, Todd Golub, Charles Sawyers & Rich Schilsky
• AML (at first relapse) • Non-Small Cell Lung Cancer
• Ovarian Cancer (at first relapse)
• Breast Cancer • Renal Cell
• Multiple Myeloma
Purpose:
Leadership:
Initial Studies:
Model of Alzheimer’s Disease Bin Zhang Jun Zhu
AD
normal
AD
normal
AD
normal
Cell cycle
http://sage.fhcrc.org/downloads/downloads.php
Blue module: 3000 genes Associated with Type 2 diabetes Elevated HbA1c Reduced insulin secretion
Global expression data from 64 human islet donors
340 genes in islet-specific open chromatin regions
168 overlapping genes, which have
• Higher connectivity • Markedly stronger association with
• Type 2 diabetes • Elevated HbA1c • Reduced insulin secretion
• Enrichment for beta-cell transcription factors and exocytotic proteins
New Type II Diabetes Disease Models Anders Rosengren
• Search across 1300 datasets in MetaGEO at Sage for similar expression profiles Top hit: Islet dedifferentiation study where the 168 genes were upregulated in mature islets and downregulated in dedifferentiated islets (Kutlu et al., Phys Gen 2009)
• Analyses of expression-SNPs and clinical SNPs as well as Causal Inference Test
• Identification of candidate key genes affecting beta-cell differentiation and chromatin
Working hypothesis:
Normal beta-cell: open chromatin in islet-specific regions, high expression of beta-cell transcription factors, differentiated beta-cells and normal insulin secretion
Diabetic beta-cell: lower expression of beta-cell transcription factors affecting the identified module, dedifferentiation, reduced insulin secretion and hyperglycemia
Next steps: Validation of hypothesis and suggested key genes in human islets
Anders Rosengren
New Type II Diabetes Disease Models
Clinical Trial Comparator Arm Partnership (CTCAP)
Description: Collate, Annotate, Curate and Host Clinical Trial Data with Genomic Information from the Comparator Arms of Industry and Foundation Sponsored Clinical Trials: Building a Site for Sharing Data and Models to evolve better Disease Maps.
Public-Private Partnership of leading pharmaceutical companies, clinical trial groups and researchers.
Neutral Conveners: Sage Bionetworks and Genetic Alliance [nonprofits].
Initiative to share existing trial data (molecular and clinical) from non-proprietary comparator and placebo arms to create powerful new tool for drug development.
Examples: The Sage Federation
• Founding Lab Groups
– Seattle- Sage Bionetworks – New York- Columbia: Andrea Califano – Palo Alto- Stanford: Atul Butte – San Diego- UCSD: Trey Ideker – San Francisco: UCSF/Sage: Eric Schadt
• Initial Projects – Aging – Diabetes – Warburg
• Goals: Share all datasets, tools, models Develop interoperability for human data
THE FEDERATION Butte Califano Friend Ideker Schadt
vs
Federation s Genome-wide Network and Modeling Approach
Califano group at Columbia Sage Bionetworks Butte group at Stanford
Human Aging Project
Brain A (n=363)
Brain B (n=145)
Blood A (n=~1000)
Blood B (n=~1000)
Brain C (n=400)
Adipose (n=~700)
Data Transformations
TF Activity Profile
Gene Set / Pathway Variation Analysis
Interactome
Machine Learning
Elastic Net
Network Prior Models
Tree Classifiers
Age Model
Deriving Master Regulators from Transcription Factors Regulatory Networks Glycolysis & Glycogenesis Metabolism Pathway
Inferring Prostate Cancer Regulatory Modules for Glycolysis &Glycogenesis Metabolism Pathway
Duarte N. et al (2006) PNAS 107(6):1777-1782
Glycolysis and Glycogenesis Metablism
Gene Set (GGMSE)
Prostate cancer global coherent data set (GSE21032) Taylor BS. et al (2010) Cancer Cell 18(1):11-22
Inferred Transcriptional Regulatory Network in Prostate
Cancer
Zhu J. et al (2008) Nature Genetics 40(7):854-61
Integrated Bayesian Approach
Prostate Cancer Regulatory Modules for GGMSE and Other
Metabolism Pathways
Cox Proportional-Hazards Regression model based on
individual gene for recurrence free survival
Metabolism pathways with regulatory modules enriched by poor prognosis genes
for prostate cancer
Sage bionetworks’ approach
Genes Associated with Poor Prognosis are disproportionally found among the networks regulating the !glycolysis" Genes
Size of the node proportional to -log10 P value for recurrence free survival.
>5 fold enrichment of recurrence free prognostic genes with the Glycolysis BN module than random selection (p<1e-100)
P-Value<0.005
Inferred regulatory module for GGMSE Inferred regulatory module for Oxidative Phosphorylation and Sphingolipid
Metabolism genes
Federated Aging Project : Combining analysis + narrative
=Sweave Vignette Sage Lab
Califano Lab Ideker Lab Califano Lab Ideker Lab
Shared Data Repository
JIRA: Source code repository & wiki
R code + narrative
PDF(plots + text + code snippets) R code + PDF(plots + text + code snippets)
Data objects
PDF(plots + text + code snippets) PDF(plots + text + code snippets) PDF(plots + text + code snippets) PDF(plots + text + code snippets)
HTML
Submitted Paper
Why not share clinical /genomic data and model building in the ways currently used by the software industry (power of tracking workflows and versioning
Synapse as a Github for building models of disease
Evolution of a Software Project
Biology Tools Support Collaboration
Potential Supporting Technologies
Taverna
Addama
tranSMART
Platform for Modeling
SYNAPSE
� � � � � � � � � � � � � �
INTEROPERABILITY (tranSMART)
TENURE FEUDAL STATES
IMPACT ON PATIENTS IMPACT ON PATIENTS
Eight Projects Initiated in last year
!
Group D LEGAL STACK-ENABLING PAIENTS: John Wilbanks
Arch2POCM
Restructuring Drug Discovery
“Absurdity” of Current R&D Ecosystem
• $200B per year in biomedical and drug discovery R&D • Handful of new medicines approved each year • Productivity in steady decline since 1950 • 90% of novel drugs entering clinical trials fail • NIH and EU just started spending billions to duplicate process
• Significant pharma revenues going off patent in next 5 years • >30,000 pharma employees fired in each of last four years • Number of R&D sites in Europe down from 29 to 16 since 2009
What is the problem? • Regulatory hurdles too high? • Low hanging fruit picked? • Payers unwilling to pay? • Genome has not delivered? • Valley of death? • Companies not large enough to execute on strategy? • Internal research costs too high? • Clinical trials in developed countries too expensive?
In fact, all are true but none is the real problem
What is the problem? • The current system is designed as if every new program is destined to
deliver an approved drug
• Past 20 years prove this assumption wrong (again and again)
• Why do promising early results rarely translate into approved drugs?
• Bottom line: we have poor understanding of biology
• Lack of early-data sharing within closed information systems dooms drug discovery for frequent avoidable failure
What is the problem?
We need to rebuild the drug discovery process so that we better understand disease biology before testing proprietary compounds on sick patients
The solution – Arch2POCM 1. Create an Archipelago of clinicians and scientists from public
and private sectors to take projects from ideas to Proof of Clinical Mechanism (POCM)
2. Arch2POCM is a collaborative, data-sharing network of scientists, whose drug discovery objective is to use robust compounds against new targets to disentangle the complexity of human biology, not to create a medicine
3. Success? • A compound that provides proof of concept for a novel target-
allowing companies to use this common information to compete, with dramatic increased chances of success
• Culling targets with doomed mechanisms before multiple companies waste money exploring them - at $50M a pop
Why data sharing through to Phase IIb?
• Most rapidly reveals limitations and opportunities associated with the target
• Increases probability of success for internal proprietary programs
• Scientific decisions are not influenced by market considerations or biased internal thinking
• Target mechanism is only properly tested at Phase IIb
Why no IP on “Common Stream” compounds?
• Allows multiple groups to test diverse indications without funds from Arch2POCM- crowdsourcing drug discovery
• Broader and faster data dissemination
• Far fewer legal agreements to negotiate
• Generates “freedom to operate” on target because there are no patent thickets to wade through
• Efficient way to access world’s top scientists and doctors without hassle
Existing Team Ready to Execute
First major milestones
2013- First Compound in clinical trials
2014- Go and No-Go Decisions from common stream of targets driving Proprietary Programs
2014- Full complement of target programs activated
2014- Core Clinical Programs joined by crowdsourced clinical trials
why consider the fourth paradigm- data intensive science
thinking beyond the narrative, beyond pathways
advantages of an open innovation compute space
it is more about how than what
OPPORTUNITIES FOR MIPS COMMUNITY
Data sets, Tools and Models
Joining Synapse Communities
Joining Federation Projects
Joinig Arch2POCM
Change reward structures for sharing data (patients and academics)
Top Related