Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Understanding your data with Bayesian networks (in Python), Bartek Wilczyński, University of Warsaw. PyData Silicon Valley, May 5th 2014

description

Today's world is full of data that is easily accessible to anyone. The problem now is how to make sense of this data and extract useful insights from terabytes of raw material. Typically, this involves using machine learning tools that allow you to build classifiers, cluster data, etc. Many of these approaches give you models that describe the data accurately but may be difficult to interpret. If you want to understand the result more intuitively, it is worth looking at Bayesian Networks - a graphical representation that simplifies a complex mathematical model into a most likely graph of dependencies between your variables. I will talk about BNFinder - a Python library that lets you take any tabular data and convert it to a much simplified representation of conditional dependencies between variables. It can then be used for classification of unseen objects, while the connection structure can be interpreted even by a non-specialist. BNFinder is publicly available under the GNU GPL and can be used by anyone on their own data.

Transcript of Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Page 1: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Understanding your data with Bayesian networks

(in python)

Bartek Wilczyński

University of Warsaw

PyData Silicon Valley, May 5th 2014

Page 2: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Are you confused enough?

Or should I confuse you a bit more?

Image from xkcd.org/552/

Page 3: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Data show: Confused students score better!

Data from Eric Mazur

Page 4: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

There may be factors we haven't thought about

● Maybe confusion helps with learning?

● Or maybe there is an alternative explanation?

● As long as these are just cartoon models – we cannot really rule out any structure

[Diagram: two candidate models, one over "Paying attention", "Being confused" and "Correct answer", or one over "Being confused" and "Correct answer" only]

Page 5: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

What do I mean by data?

Sex  Age    Smoking    Stress  Lung    Heart   Feel
M    0-20   never      N       no      no      great
F    70+    sometimes  N       minor   no      OK
M    50-70  daily      Y       no      severe  Not-so-well
M    20-50  daily      N       no      minor   OK
F    70+    never      N       no      minor   great
F    20-50  sometimes  Y       severe  minor   Not-so-well
F    20-50  never      Y       no      no      great
M    20-50  sometimes  N       minor   no      great
M    50-70  never      Y       severe  no      OK
F    0-20   never      N       no      severe  OK
M    20-50  daily      Y       no      no      OK
M    0-20   daily      N       no      no      Not-so-well
M    20-50  never      N       minor   no      OK
...  ...    ...        ...     ...     ...     ...

Page 6: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Network of connections

Smoking (daily, sometimes, never)

Age (0-20, 20-50, 50-70, 70+)

Stressful job (yes, no)

Lung problems (no, minor, severe)

Heart problems (no, minor, severe)

Sex (male, female)

How did you feel this morning? (great, OK, not-so-well, terrible)

Page 7: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

What is a Bayesian Network?

● A directed acyclic graph (i.e. a directed graph without cycles)

● with nodes representing random variables

● and edges between nodes representing dependencies (not necessarily causal)

● Each edge is directed from a parent to a child, so all nodes with edges pointing to a given node constitute its set of parents

● Each variable is associated with a value domain and a probability distribution conditional on parents' values
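
The definition above can be made concrete with a small sketch. It assumes a hypothetical three-node structure (the node names and edges are illustrative, borrowed from the confused-students example, and are not BNFinder's data model) stored as a mapping from each node to its set of parents. The acyclicity requirement is then checked by trying to build a topological order.

```python
def topological_order(parents):
    """Return the nodes in an order where every parent precedes its children.

    `parents` maps each node to the set of its parents; every parent must
    itself appear as a key. Raises ValueError if the graph has a cycle,
    i.e. if it is not a valid Bayesian network structure.
    """
    remaining = {node: set(ps) for node, ps in parents.items()}
    order = []
    while remaining:
        # Nodes whose parents have all been placed already.
        ready = sorted(n for n, ps in remaining.items() if not ps)
        if not ready:
            raise ValueError("graph contains a cycle")
        for node in ready:
            order.append(node)
            del remaining[node]
        for ps in remaining.values():
            ps.difference_update(ready)
    return order


# Illustrative structure (an assumption): one parent with two children.
network = {
    "paying_attention": set(),
    "being_confused": {"paying_attention"},
    "correct_answer": {"paying_attention"},
}
order = topological_order(network)
print(order)
```

Adding an edge back from a child to "paying_attention" would make the function raise, which is exactly the situation a Bayesian network structure must avoid.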

Page 8: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Back to our confused students

● Let us consider our model of confused students

● We can consider the model with an additional variable

● We need to have data on the additional variable to be predictive

● Sometimes we need to use “wrong” models if they are predictive

[Diagram: nodes "Paying attention", "Being confused", "Correct answer"]

                Paying attention
                yes     no
confused        80%     0%
not confused    20%     100%

[Diagram: nodes "Paying attention", "Being confused", "Correct answer"]

                Paying attention
                yes     no
correct         50%     20%
incorrect       50%     80%
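
Reading the two tables as conditional distributions of "Being confused" and "Correct answer" given "Paying attention", the following sketch shows how they combine into a joint distribution. The share of attentive students is not given on the slide, so the value 0.5 below is purely an assumption, as are all function and variable names.

```python
# Conditional probability tables from the slide, keyed by "paying attention".
p_confused_given_attention = {True: 0.8, False: 0.0}
p_correct_given_attention = {True: 0.5, False: 0.2}

# ASSUMPTION: the slides do not give the share of attentive students.
p_attention = 0.5


def joint(attention, confused, correct):
    """P(attention, confused, correct) as a product of local distributions."""
    p_a = p_attention if attention else 1 - p_attention
    p_conf = p_confused_given_attention[attention]
    p_corr = p_correct_given_attention[attention]
    return (p_a
            * (p_conf if confused else 1 - p_conf)
            * (p_corr if correct else 1 - p_corr))


def p_correct_given_confused(confused):
    """P(correct | confused), obtained by summing out 'attention'."""
    numerator = sum(joint(a, confused, True) for a in (True, False))
    denominator = sum(joint(a, confused, c)
                      for a in (True, False) for c in (True, False))
    return numerator / denominator


print(p_correct_given_confused(True))
print(p_correct_given_confused(False))
```

Under this assumed prior, confused students answer correctly about 50% of the time and non-confused students about 25% of the time, even though the model contains no edge between confusion and correctness.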

Page 9: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Can we find the “best” Bayesian Network?

● Given a dataset with observations, we can try to find the “best” network topology (i.e. the best collection of parent sets)

● In order to do it automatically we need a scoring function to define what we mean by “best”

● A scoring function is useful if it can be written as a sum over variables, so that the best network consists of the best parent set for each variable (modulo acyclicity)
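
To illustrate what "a sum over variables" means, here is a minimal sketch of a decomposable score. It uses a simple penalized log-likelihood as a stand-in; this is not BNFinder's implementation of any particular score, and the toy data and column names are hypothetical.

```python
import math
from collections import Counter


def local_score(rows, child, parents):
    """Penalized log-likelihood of one variable given one parent set.

    `rows` is a list of dicts mapping column names to discrete values.
    """
    n = len(rows)
    pair_counts = Counter(
        (tuple(r[p] for p in parents), r[child]) for r in rows)
    config_counts = Counter(tuple(r[p] for p in parents) for r in rows)
    log_lik = sum(count * math.log(count / config_counts[config])
                  for (config, _), count in pair_counts.items())
    child_values = len({r[child] for r in rows})
    free_params = len(config_counts) * (child_values - 1)
    # Penalize model complexity so that larger parent sets must earn their keep.
    return log_lik - 0.5 * math.log(n) * free_params


def network_score(rows, structure):
    """Decomposable score: the sum of local scores, one per variable."""
    return sum(local_score(rows, child, parents)
               for child, parents in structure.items())


# Hypothetical toy data in which "lung" follows "smoking" exactly.
rows = ([{"smoking": "daily", "lung": "severe"}] * 4
        + [{"smoking": "never", "lung": "no"}] * 4)
with_edge = {"smoking": (), "lung": ("smoking",)}
without_edge = {"smoking": (), "lung": ()}
print(network_score(rows, with_edge), network_score(rows, without_edge))
```

Because the total is a sum of per-variable terms, changing the parent set of one variable changes only that variable's term.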

Page 10: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

How to find the best network?

● There are generally three main approaches to defining BN scores:

– Bayesian statistics, e.g. BDe (Herskovits et al. '95)

– Information Theoretic, e.g. MDL (Lam et al. '94)

– Hypothesis testing, e.g. MMPC (Salehi et al. '10)

● There are also hybrid approaches, like the recent MIT (de Campos '06) approach that uses information theory and hypothesis testing

● We have two issues:

– There are exponentially many potential parent sets

– The desired network needs to have no cycles

● The second issue is more important and makes the problem NP-complete (Chickering '96)
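
The first issue can be seen by simply enumerating candidate parent sets. The sketch below uses the column names from the example table; the helper name and the optional cap on parent-set size are illustrative assumptions, shown only to make the growth visible.

```python
from itertools import combinations


def candidate_parent_sets(variables, child, limit=None):
    """Yield every parent set for `child`, optionally capped at `limit` parents."""
    others = [v for v in variables if v != child]
    max_size = len(others) if limit is None else min(limit, len(others))
    for size in range(max_size + 1):
        yield from combinations(others, size)


variables = ["Sex", "Age", "Smoking", "Stress", "Lung", "Heart", "Feel"]
unrestricted = sum(1 for _ in candidate_parent_sets(variables, "Feel"))
capped = sum(1 for _ in candidate_parent_sets(variables, "Feel", limit=2))
print(unrestricted, capped)  # 64 22
```

With 7 variables each child already has 2^6 = 64 candidate parent sets, and the count doubles with every additional variable.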

Page 11: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Cycles are not always a problem

● Dynamic Bayesian Networks are a variant of BN models that describe temporal dependencies

● We can safely assume that the causal links only go forward in time

● That breaks the problem of cycles as we now have two versions of each variable: “before” and “after”

[Diagram: variables X1, X2, X3, each shown in two copies, one at time t and one at time t+1]
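
A minimal sketch of the "before" and "after" idea: a time series is turned into rows that hold two copies of every variable. The column suffixes, the helper name and the measurements are hypothetical and do not reflect BNFinder's input format.

```python
def make_transition_rows(series):
    """Pair each time point with its successor.

    `series` is a list of dicts, one per time point, in temporal order.
    Each output row holds the "before" copy of every variable (suffix
    "_t") and the "after" copy (suffix "_t+1").
    """
    rows = []
    for before, after in zip(series, series[1:]):
        row = {f"{name}_t": value for name, value in before.items()}
        row.update({f"{name}_t+1": value for name, value in after.items()})
        rows.append(row)
    return rows


# Hypothetical measurements of three variables at four time points.
series = [
    {"X1": 0, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 1, "X3": 0},
    {"X1": 1, "X2": 0, "X3": 1},
    {"X1": 0, "X2": 0, "X3": 1},
]
rows = make_transition_rows(series)
print(len(rows), rows[0])
```

If parents are drawn only from the "_t" columns and children only from the "_t+1" columns, edges such as X1 to X2 and X2 to X1 can coexist without forming a cycle.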

Page 12: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Different types of variables

● Another common situation is when we have different types of variables

● We may know that only certain types of connections are causal

● Or we may be interested only in certain types of connections

● This breaks the cycles as well

Mutations

Protein expression

Diseases
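
One way to express such a constraint is to allow parents only from earlier variable types. The sketch below assumes the ordering mutations, then protein expression, then diseases; both this direction and the variable names are illustrative assumptions rather than something stated on the slide.

```python
# ASSUMPTION: an illustrative ordering of variable types; the measured
# variables in each layer are hypothetical.
layers = [
    ("mutations", ["mut_A", "mut_B"]),
    ("protein expression", ["prot_X", "prot_Y"]),
    ("diseases", ["disease_1"]),
]


def allowed_parents(layers):
    """Map each variable to the variables in strictly earlier layers."""
    allowed = {}
    earlier = []
    for _, members in layers:
        for variable in members:
            allowed[variable] = list(earlier)
        earlier.extend(members)
    return allowed


candidates = allowed_parents(layers)
print(candidates["disease_1"])
```

Since every allowed edge points from an earlier layer to a later one, no choice of parent sets under this restriction can create a cycle.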

Page 13: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

BNFinder – Python library for Bayesian Networks

● A library for identification of optimal Bayesian Networks

● Works under the assumption that acyclicity is guaranteed by external constraints (disjoint sets of variables or dynamic networks)

● Fast and efficient (relatively)

Page 14: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Example 1 – the simplest possible

Page 15: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Now, parallelize!

● Since we have external constraints on acyclicity, we can search for parent sets independently

● This leads to a simple parallelization scheme and good efficiency
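
The scheme can be sketched as one independent search per child variable, dispatched to a pool of workers. This is not BNFinder's code: the function names, the exhaustive search and the toy score are assumptions made for illustration, and a thread pool is used only to keep the example short (CPU-bound scoring in pure Python would normally call for a process pool).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def best_parent_set(child, candidates, score):
    """Exhaustively pick the highest-scoring parent set for one child."""
    options = [ps for size in range(len(candidates) + 1)
               for ps in combinations(candidates, size)]
    return max(options, key=lambda ps: score(child, ps))


def learn_structure(candidates_by_child, score, workers=4):
    """Run one independent search per child and collect the results."""
    children = list(candidates_by_child)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda child: best_parent_set(
                child, candidates_by_child[child], score),
            children)
        return dict(zip(children, results))


# Hypothetical toy score: reward one "true" parent, penalize set size.
true_parent = {"B1": "A1", "B2": "A2"}


def toy_score(child, parents):
    return (true_parent[child] in parents) - 0.1 * len(parents)


structure = learn_structure({"B1": ["A1", "A2"], "B2": ["A1", "A2"]}, toy_score)
print(structure)
```

No search needs the result of any other, so the work divides cleanly across workers.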

Page 16: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Bonn et al., Nat. Genet. 2012

Page 17: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Active Inactive

Page 18: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Making the training set for “activity” variable

Page 19: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Handling continuous data

Page 20: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Network model

Page 21: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014
Page 22: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Does it provide useful predictions?

• 12 positive and 4 negative predictions tested

• >90% success (1 error)

Page 23: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Some more continuous data with perturbations

Page 24: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

• 8008 enhancers compiled from 15 ChIP experiments (almost 20k binding peaks)

• Activity data for ~140 enhancers divided into

– 3 tissues (MESO, VM, SM)

– 5 stages (4-6, 7-8, 9-10, 11-12, 13-16)

• Gene expression data for 5082 genes from the BDGP database

Wilczynski et al., PLoS Comp. Biol. 2012

Page 25: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014
Page 26: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Predictions validated: 19/20 correct stage, 10/20 correct tissue

Page 27: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Summary

● Bayesian Networks can provide predictive models based on conditional probability distributions

● BNFinder is an effective tool for finding optimal networks given tabular data. And it's open source!

● It can be used as a command-line tool or as a library

● It can use continuous data as well as discrete

● Can be run in parallel on multiple cores (with good efficiency)

● Convenience functions (cross-validation, ROC plots) included

http://launchpad.net/bnfinder

Page 28: Understanding your data with Bayesian networks (in Python) by Bartek Wilczynski PyData SV 2014

Thanks!

● Norbert Dojer

● Alina Frolova

● Paweł Bednarz

● Agnieszka Podsiadło

● Questions?