Using Bayesian Networks to Analyze Whole-Genome Expression Data
Nir Friedman Iftach Nachman Dana Pe’er
Institute of Computer Science, The Hebrew University of Jerusalem
Biological Background
Similar DNA, yet different expression of proteins
The expression profile depends on:
• Tissue
• External conditions
• Growth stages
• ...
Gene expression is responsible for cell activity (including regulation of expression)
DNA → RNA → Protein
cDNA microarray
DNA hybridization measures abundance of RNA
• Recently developed technologies allow parallel measurement of the expression level of thousands of genes/proteins
• This allows biologists to view the cell as a complete system
Big Challenge
Extracting meaningful information from the expression data:
• Infer regulatory mechanisms
• Reveal function of proteins
• Experiment planning
Prior Work
Clustering of expression data
• Groups together genes with similar expression patterns
• Disadvantage: does not reveal structural relations between genes
Boolean Networks
• Deterministic models of the logical interactions between genes
• Disadvantages: deterministic, impractical for real data
We suggest a probabilistic framework capable of learning complex relations between genes.
Bayesian Networks
A Bayesian Network (BN) is a graphical representation of a probability distribution.
Advantages:
• Compact & intuitive representation
• Captures causal relationships
• Efficient model learning
• Deals with noisy data
• Integration of prior knowledge
• Effective inference for experiment planning
[Figure: example network over Gene A, Gene B, Gene C, Gene D and Gene E, with a conditional probability table P(A | E, B). Extracted table entries: 0.9 / 0.1, 0.2 / 0.8, 0.01 / 0.99, 0.9 / 0.1.]
Qualitative part: directed acyclic graph (DAG)
• Nodes: random variables of interest
• Edges: direct (causal) influence
Quantitative part:
• Local probability models
• Set of conditional probability distributions
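As a rough illustration of how the qualitative and quantitative parts fit together, the sketch below encodes a three-node fragment (E → A ← B) as a parent map plus conditional probability tables, and evaluates the joint probability as the product of the local terms. The parent map, the priors on E and B, and the assignment of table values to parent configurations are assumptions made for illustration; they are not taken from the slide.

```python
from itertools import product

# Qualitative part: each node lists its parents (hypothetical three-gene fragment).
parents = {"E": (), "B": (), "A": ("E", "B")}

# Quantitative part: P(node = True | parent values). All numbers are illustrative.
cpt = {
    "E": {(): 0.3},
    "B": {(): 0.6},
    "A": {
        (True, True): 0.9,
        (True, False): 0.2,
        (False, True): 0.9,
        (False, False): 0.01,
    },
}


def joint_probability(assignment):
    """Product of the local conditional probabilities for a full assignment."""
    prob = 1.0
    for node, node_parents in parents.items():
        key = tuple(assignment[p] for p in node_parents)
        p_true = cpt[node][key]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


def marginal_a():
    """P(A = True), obtained by summing the joint over E and B."""
    return sum(
        joint_probability({"E": e, "B": b, "A": True})
        for e, b in product([True, False], repeat=2)
    )


if __name__ == "__main__":
    print(joint_probability({"E": True, "B": False, "A": True}))
    print(marginal_a())
```

Storing one small table per node, rather than one table over all variables, is what makes the representation compact: the number of parameters grows with the number of parents of each node, not with the total number of genes.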
Preliminary Experiments
Data from Spellman et al. (Mol. Bio. of the Cell, 1998), http://genome-www.stanford.edu/cell-cycle
• Contains 76 samples of the whole yeast genome
• Different methods for synchronizing the cell cycle in yeast
• Time series at intervals of a few minutes (5-20 min)
Spellman et al.
• Identified 800 cell-cycle regulated genes, and clustered them
• 250 of these genes were combined in 8 clusters
We took these 250 genes and
• Discretized into three levels of expression
• Ran 100-fold bootstrap using our sparse learning algorithm
• Computed confidence in predictions
Evaluation
• Pairs with 80% confidence were evaluated against the original clustering
• 70% of these were intra-cluster
• The rest show interesting inter-cluster relations
Biological Insight
• M. Linial, Life Sciences, Hebrew U., examined relations
• Most relations involved unknown/putative proteins, but we can guess functions based on homologies, and they mostly make a lot of biological sense
• Only 3 pairs were considered suspicious
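The discretization into three expression levels mentioned above can be sketched as follows. The slides do not say how the levels were chosen, so the symmetric cutoff on log-ratio values (0.5) and the function name are assumptions for illustration only.

```python
def discretize_three_levels(log_ratios, threshold=0.5):
    """Map expression log-ratios to -1 (under-expressed), 0 (baseline), +1 (over-expressed).

    The threshold is a hypothetical choice, not a value taken from the slides.
    """
    levels = []
    for value in log_ratios:
        if value > threshold:
            levels.append(1)
        elif value < -threshold:
            levels.append(-1)
        else:
            levels.append(0)
    return levels


if __name__ == "__main__":
    print(discretize_three_levels([-1.2, 0.1, 0.7]))
```

Discretizing keeps the local probability models simple tables, at the cost of discarding fine-grained differences in expression level.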
Future Directions & Work
To get better results, we need:
• More data! Publicly available gene expression experiments are extremely small.
• Frequent samples: current sampling is far below the rate of the regulation process.
• External variables: we want to relate regulation to external events (stimuli, temperature, nutrient levels, etc.).
We plan to improve modeling by:
• More suitable local distribution models
• Correct handling of hidden variables: can we recognize hidden causes of coordinated regulation events?
• Improving computational efficiency
• Incorporating prior knowledge: we need to incorporate a large mass of biological knowledge, and insight from sequence/structure databases
• Learning from interventions: how to learn causality from knockout experiments? How to plan such experiments? Related issues have been examined in the BN literature.
References
N. Friedman, I. Nachman, and D. Pe’er. Learning of Bayesian Network structure from massive datasets: The “sparse candidate algorithm”. HUJI tech report CS99-3. (Submitted.)
N. Friedman, M. Goldszmidt, and A. Wyner. Data Analysis with Bayesian Networks: A Bootstrap Approach. HUJI tech report CS99-4. (Submitted.)
N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using Bayesian Networks to analyze whole genome expression data: A Preliminary Investigation. HUJI tech report CS99-6. (In preparation.)
D. Heckerman. A tutorial on learning with Bayesian Networks. In Learning Graphical Models, MIT Press, 1998.
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, Calif., 1988.
Spellman et al. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol. Bio. of the Cell, vol. 9, December 1998.
Bayesian Networks for Gene Expression
We want to apply methods for learning Bayesian networks to analyze gene expression experiments.
• Random variables affecting one another
• Measured expression level of each gene
• Possible extensions: random variables that measure external stimuli, environment parameters (temperature, nutrients, pH, etc.), biological factors
Learning Bayesian Networks
[Figure: Data + Prior information → Learner → network over nodes A, B, C, D, E]
Efficient algorithms exist for learning a BN from data. Learning a BN can:
• Reveal underlying structure of the domain: direct relations between variables
• Find causal influence
• Discover hidden variables
Technical Challenges
Issues:
• Massive number of variables (thousands)
• Small number of samples (dozens)
• Sparse networks (only a small number of genes directly affect one another)
Crucial aspects:
• Computational complexity
• Statistical significance of features in learned models
To address these issues we developed:
Sparse Candidate algorithm: efficient heuristic search that relies on sparseness
• Choose candidate set for direct influence for each gene
• Find optimal BN constrained on candidates
• Iteratively improve candidate set
[Figure labels: candidates; parents in BN]
Bootstrap confidence estimates
• Use resampling to generate perturbations of training data
• Use the number of times a feature is repeated among networks learned from these datasets to estimate confidence of Bayesian network features
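The bootstrap estimate described above can be sketched in a few lines. This is only an illustration under stated assumptions: the stand-in learner performs just a candidate-selection step, ranking partners by empirical mutual information (an assumed measure, not stated on the slides), and does not search for an optimal constrained network. The names `mutual_information`, `candidate_pairs` and `bootstrap_confidence` are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(data, i, j):
    """Empirical mutual information (in nats) between columns i and j."""
    n = len(data)
    count_i = Counter(row[i] for row in data)
    count_j = Counter(row[j] for row in data)
    count_ij = Counter((row[i], row[j]) for row in data)
    return sum(
        (c / n) * math.log(c * n / (count_i[a] * count_j[b]))
        for (a, b), c in count_ij.items()
    )


def candidate_pairs(data, k):
    """Stand-in learner (assumption): keep, for each gene, its k highest-MI partners."""
    n_genes = len(data[0])
    pairs = set()
    for i in range(n_genes):
        others = [j for j in range(n_genes) if j != i]
        others.sort(key=lambda j: mutual_information(data, i, j), reverse=True)
        for j in others[:k]:
            pairs.add(frozenset((i, j)))
    return pairs


def bootstrap_confidence(data, n_boot=100, k=1, seed=0):
    """Fraction of bootstrap replicates in which each pair feature appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample the experiments with replacement to perturb the training data.
        resample = [rng.choice(data) for _ in data]
        counts.update(candidate_pairs(resample, k))
    return {tuple(sorted(pair)): c / n_boot for pair, c in counts.items()}


if __name__ == "__main__":
    # Toy data: genes 0 and 1 always agree, genes 2 and 3 always agree.
    toy = [(a, a, b, b) for a in (-1, 0, 1) for b in (-1, 0, 1)] * 4
    for pair, conf in sorted(bootstrap_confidence(toy, n_boot=20).items()):
        print(pair, conf)
```

Replacing `candidate_pairs` with a real structure learner, and the pair feature with any other network feature (an edge, a Markov-blanket relation), leaves the counting loop unchanged.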
Network Learned
[Figure: learned network, with legend ranges 0.9-1.0, 0.8-0.9, 0.7-0.8, 0.6-0.7, 0.5-0.6, 0.4-0.5, 0.0-0.4]