Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014
Understanding your data with Bayesian networks
(in python)
Bartek Wilczyński
University of Warsaw
PyData Silicon Valley, May 5th 2014
Are you confused enough?
Or should I confuse you a bit more?
Image from xkcd.org/552/
Data show: Confused students score better!
Data from Eric Mazur
There may be factors we haven't thought about
● Maybe confusion helps with learning?
● Or maybe there is an alternative explanation?
● As long as these are just cartoon models – we cannot really rule out any structure
[Figure: two candidate models. One includes "Paying attention" alongside "Being confused" and "Correct answer"; the other has only "Being confused" and "Correct answer".]
What do I mean by data?
Sex   Age     Smoking     Stress   Lung     Heart    Feel
M     0-20    never       N        no       no       great
F     70+     sometimes   N        minor    no       OK
M     50-70   daily       Y        no       severe   Not-so-well
M     20-50   daily       N        no       minor    OK
F     70+     never       N        no       minor    great
F     20-50   sometimes   Y        severe   minor    Not-so-well
F     20-50   never       Y        no       no       great
M     20-50   sometimes   N        minor    no       great
M     50-70   never       Y        severe   no       OK
F     0-20    never       N        no       severe   OK
M     20-50   daily       Y        no       no       OK
M     0-20    daily       N        no       no       Not-so-well
M     20-50   never       N        minor    no       OK
...   ...     ...         ...      ...      ...      ...
Network of connections
Smoking (daily, sometimes, never)
Age (0-20, 20-50, 50-70, 70+)
Stressful job (yes, no)
Lung problems (no, minor, severe)
Heart problems (no, minor, severe)
Sex (male, female)
How did you feel this morning? (great, OK, not-so-well, terrible)
What is a Bayesian Network?
● A directed acyclic graph (a directed graph without cycles)
● with nodes representing random variables
● and edges between nodes representing dependencies (not necessarily causal)
● Each edge is directed from a parent to a child, so all nodes with edges pointing into a given node constitute its set of parents
● Each variable is associated with a value domain and a probability distribution conditional on its parents' values
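The definition above can be sketched in a few lines of Python. This is only an illustration, not how any particular library stores a network: the node names and edges are assumptions borrowed from the confused-students example, and `is_acyclic` is a hypothetical helper built on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Each node maps to the set of its parents (nodes with an edge pointing into it).
# ASSUMPTION: these edges are illustrative, taken from the student example.
parents = {
    "paying_attention": set(),
    "being_confused": {"paying_attention"},
    "correct_answer": {"paying_attention"},
}


def is_acyclic(parent_sets):
    """Return True if the parent sets describe a directed acyclic graph."""
    try:
        # TopologicalSorter expects node -> predecessors, which is our mapping.
        list(TopologicalSorter(parent_sets).static_order())
        return True
    except CycleError:
        return False


# A two-node loop is not a valid Bayesian network structure.
cyclic = {"a": {"b"}, "b": {"a"}}

print(is_acyclic(parents))
print(is_acyclic(cyclic))
```

Storing parent sets rather than edge lists mirrors the definition: each variable's conditional distribution only needs to know its own parents.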
Back to our confused students
● Let us consider our model of confused students
● We can consider the model with an additional variable
● We need to have data on the additional variable to be predictive
● Sometimes we need to use “wrong” models if they are predictive
[Figure: network with nodes "Paying attention", "Being confused" and "Correct answer"]

Being confused, given Paying attention:
               yes    no
confused       80%    0%
not confused   20%    100%

Correct answer, given Paying attention:
               yes    no
correct        50%    20%
incorrect      50%    80%
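To see how these two tables can produce "confused students score better" without confusion causing anything, here is a small sketch that sums out the hidden variable. The conditional probabilities are the ones from the slides; the prior for paying attention is NOT given in the talk, so the value 0.5 is purely an assumption, as is the function name.

```python
# Conditional probability tables read off the slides.
p_confused_given_attention = {True: 0.8, False: 0.0}
p_correct_given_attention = {True: 0.5, False: 0.2}

# ASSUMPTION: the slides give no prior for paying attention; 0.5 is illustrative.
p_attention = 0.5


def p_correct_given_confused(confused, prior=p_attention):
    """P(correct | confused) obtained by summing out 'paying attention'."""
    joint_correct = 0.0  # accumulates P(correct, confused = value)
    marginal = 0.0       # accumulates P(confused = value)
    for attention, p_a in ((True, prior), (False, 1.0 - prior)):
        p_conf = p_confused_given_attention[attention]
        p_state = p_conf if confused else 1.0 - p_conf
        marginal += p_a * p_state
        joint_correct += p_a * p_state * p_correct_given_attention[attention]
    return joint_correct / marginal


print(round(p_correct_given_confused(True), 2))
print(round(p_correct_given_confused(False), 2))
```

Under the assumed prior, confused students answer correctly about 50% of the time and non-confused students about 25% of the time, even though in this model confusion has no direct effect on the answer.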
Can we find the “best” Bayesian Network?
● Given a dataset with observations, we can try to find the “best” network topology (i.e. the best collection of parent sets)
● In order to do it automatically we need a scoring function to define what we mean by “best”
● A score function is useful if it can be written as a sum over variables, i.e. the best network consists of best parent sets for variables (modulo acyclicity)
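A decomposable score can be sketched as follows. This uses a BIC-style penalized log-likelihood as a stand-in, which is an assumption for illustration and not the BDe, MDL or MIT score used by any specific tool; the function names and the toy rows are hypothetical.

```python
import math
from collections import Counter


def local_score(rows, child, parent_set):
    """BIC-style local score of one variable given one candidate parent set."""
    parent_set = tuple(sorted(parent_set))
    joint = Counter((tuple(r[p] for p in parent_set), r[child]) for r in rows)
    parent_counts = Counter(tuple(r[p] for p in parent_set) for r in rows)
    # Maximized log-likelihood of the child's column given its parents.
    log_lik = sum(n * math.log(n / parent_counts[cfg])
                  for (cfg, _), n in joint.items())
    # Penalty: free parameters, estimated from the numbers of observed values.
    child_values = len({r[child] for r in rows})
    n_configs = math.prod(len({r[p] for r in rows}) for p in parent_set)
    penalty = 0.5 * math.log(len(rows)) * (child_values - 1) * n_configs
    return log_lik - penalty


def network_score(rows, parents):
    """The score decomposes into a sum of local scores, one per variable."""
    return sum(local_score(rows, child, ps) for child, ps in parents.items())


# Hypothetical toy data in which b always copies a.
rows = [{"a": 0, "b": 0}] * 4 + [{"a": 1, "b": 1}] * 4

print(local_score(rows, "b", {"a"}) > local_score(rows, "b", set()))
```

Because the total is a plain sum, the best parent set of each variable can be chosen separately whenever acyclicity is guaranteed some other way.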
How to find the best network?
● There are generally three main approaches to defining BN scores:
– Bayesian statistics, e.g. BDe (Herskovits et al. '95)
– Information theoretic, e.g. MDL (Lam et al. '94)
– Hypothesis testing, e.g. MMPC (Salehi et al. '10)
● There are also hybrid approaches, like the recent MIT (de Campos '06) approach that uses information theory and hypothesis testing
● We have two issues:
– There are exponentially many potential parent sets
– The desired network needs to have no cycles
● The second issue is more important and makes the problem NP-complete (Chickering '96)
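The first issue is easy to make concrete. The sketch below enumerates candidate parent sets for one variable of the health dataset shown earlier; `candidate_parent_sets` and its `max_parents` cap are hypothetical names, and capping the size is just one common way to keep the enumeration manageable, not necessarily what any given tool does.

```python
from itertools import combinations


def candidate_parent_sets(variables, child, max_parents=None):
    """Yield every candidate parent set for `child`, optionally capped in size."""
    others = [v for v in variables if v != child]
    limit = len(others) if max_parents is None else min(max_parents, len(others))
    for size in range(limit + 1):
        for subset in combinations(others, size):
            yield frozenset(subset)


variables = ["Sex", "Age", "Smoking", "Stress", "Lung", "Heart", "Feel"]

# Without a cap there are 2**6 subsets of the six other variables.
unbounded = sum(1 for _ in candidate_parent_sets(variables, "Feel"))
# With at most two parents only 1 + 6 + 15 subsets remain.
bounded = sum(1 for _ in candidate_parent_sets(variables, "Feel", max_parents=2))

print(unbounded, bounded)
```

For seven variables this is 64 versus 22 candidate sets; every extra variable doubles the unbounded count.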
Cycles are not always a problem
● Dynamic Bayesian Networks are a variant of BN models that describe temporal dependencies
● We can safely assume that the causal links only go forward in time
● That breaks the problem of cycles as we now have two versions of each variable: “before” and “after”
[Figure: a network over X1, X2, X3 unrolled into two copies of each variable, one at time t and one at time t+1]
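The "before" and "after" trick can be sketched as a data preparation step. The helper name, the `@t` / `@t+1` column suffixes and the toy series are all assumptions made for illustration.

```python
def make_transition_rows(series):
    """Turn a time-ordered list of observations into (before, after) rows."""
    rows = []
    for before, after in zip(series, series[1:]):
        # Every variable appears twice: once at time t, once at time t+1.
        row = {f"{name}@t": value for name, value in before.items()}
        row.update({f"{name}@t+1": value for name, value in after.items()})
        rows.append(row)
    return rows


# Hypothetical time series with three variables and three time points.
series = [
    {"X1": 0, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 0, "X3": 1},
]
rows = make_transition_rows(series)

# Parents are only allowed among the "@t" columns, children among "@t+1",
# so no choice of parent sets can create a cycle.
parent_candidates = [c for c in rows[0] if c.endswith("@t")]
children = [c for c in rows[0] if c.endswith("@t+1")]
print(len(rows), parent_candidates, children)
```

A series of n time points gives n - 1 training rows, and X1 at time t may be a parent of X1 at time t+1 without any loop.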
Different types of variables
● Another common situation is when we have different types of variables
● We may know that only certain types of connections are causal
● Or we may be interested only in certain types of connections
● This breaks the cycles as well
[Figure: three types of variables: Mutations, Protein expression, Diseases]
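One way to encode such a constraint is to order the variable types and only allow edges from an earlier type to a later one. The ordering below (mutations, then protein expression, then diseases) and all variable names are assumptions made for this sketch.

```python
# ASSUMPTION: edges may only go from an earlier layer to a later one.
LAYER_ORDER = ["mutation", "protein_expression", "disease"]

# Hypothetical variables, each tagged with its type.
variable_types = {
    "mut_A": "mutation",
    "mut_B": "mutation",
    "prot_X": "protein_expression",
    "disease_D": "disease",
}


def allowed_parents(types, child):
    """Return the variables whose type comes strictly before the child's type."""
    child_rank = LAYER_ORDER.index(types[child])
    return {v for v, t in types.items() if LAYER_ORDER.index(t) < child_rank}


for name in variable_types:
    print(name, sorted(allowed_parents(variable_types, name)))
```

Since every permitted edge moves strictly forward in the layer order, any combination of parent sets chosen from these candidates is acyclic.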
BNFinder – Python library for Bayesian Networks
● A library for identification of optimal Bayesian Networks
● Works under the assumption that acyclicity is guaranteed by external constraints (disjoint sets of variables or dynamic networks)
● Relatively fast and efficient
Example 1 – the simplest possible
Now, parallelize!
● Since we have external constraints on acyclicity, we can search for parent sets independently
● This leads to a simple parallelization scheme and good efficiency
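The scheme can be sketched with the standard library alone. This is not BNFinder's implementation: the function names, the toy score and the `TRUE_PARENTS` table are hypothetical, and a thread pool is used only to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def best_parent_set(child, candidates, local_score, max_parents=2):
    """Exhaustively pick the highest-scoring parent set for one variable."""
    options = [frozenset(c)
               for k in range(max_parents + 1)
               for c in combinations(sorted(candidates), k)]
    return max(options, key=lambda ps: local_score(child, ps))


def learn_structure(allowed, local_score, executor_cls=ThreadPoolExecutor):
    """Search each variable independently; `allowed` maps child -> candidates."""
    with executor_cls() as pool:
        futures = {child: pool.submit(best_parent_set, child, cands, local_score)
                   for child, cands in allowed.items()}
        return {child: f.result() for child, f in futures.items()}


# Hypothetical toy score: highest when the parent set matches a known answer.
TRUE_PARENTS = {"x": frozenset(), "y": frozenset({"x"}), "z": frozenset({"x", "y"})}


def toy_score(child, parent_set):
    return -len(parent_set ^ TRUE_PARENTS[child])


# External constraint: candidates follow the order x, y, z, so no cycles arise.
allowed = {"x": set(), "y": {"x"}, "z": {"x", "y"}}
print(learn_structure(allowed, toy_score))
```

For CPU-bound scoring in CPython, a process pool (for example `concurrent.futures.ProcessPoolExecutor` passed as `executor_cls`, with a module-level score function and a `__main__` guard) would be the more natural choice, since threads do not run Python bytecode in parallel.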
Bonn et al. Nat. Genet, 2012
[Figure labels: Active, Inactive]
Making the training set for “activity” variable
Handling continuous data
Network model
Does it provide useful predictions?
• 12 positive and 4 negative predictions tested
• >90% success (1 error)
Some more continuous data with perturbations
• 8008 enhancers compiled from 15 ChIP experiments (almost 20k binding peaks)
• Activity data for ~140 enhancers divided into
– 3 tissues (MESO, VM, SM)
– 5 stages (4-6, 7-8, 9-10, 11-12, 13-16)
• Gene expression data for 5082 genes from the BDGP database
Wilczynski et al., PLoS Comp. Biol. 2012
Predictions validated: 19/20 correct stage, 10/20 correct tissue
Summary
● Bayesian Networks can provide predictive models based on conditional probability distributions
● BNFinder is an effective tool for finding optimal networks given tabular data. And it's open source!
● It can be used as a command-line tool or as a library
● It can use continuous data as well as discrete
● Can be run in parallel on multiple cores (with good efficiency)
● Convenience functions (cross-validation, ROC plots) included
http://launchpad.net/bnfinder
Thanks!
● Norbert Dojer
● Alina Frolova
● Paweł Bednarz
● Agnieszka Podsiadło
● Questions?