Advanced Machine Learning - Carnegie Mellon School of...


Transcript of Advanced Machine Learning - Carnegie Mellon School of...


Advanced Machine Learning

Learning Graphical Model Structure

Eric Xing

Lecture 22, April 12, 2010

Reading:

Inference and Learning

A BN M describes a unique probability distribution P.

Typical tasks:

Task 1: How do we answer queries about P?

We use inference as a name for the process of computing answers to such queries. So far we have learned several algorithms for exact and approximate inference.

Task 2: How do we estimate a plausible model M from data D?

i. We use learning as a name for the process of obtaining a point estimate of M.

ii. But Bayesians seek p(M | D), which is actually an inference problem.

iii. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.


The goal: Learning Graphical Models

Given a set of independent samples (assignments of random variables), find the best (the most likely?) graphical model: both the graph and the CPDs.

[Figure: two candidate network structures over nodes E, R, B, A, C, learned from samples such as (B,E,A,C,R) = (T,F,F,T,F), (B,E,A,C,R) = (T,F,T,T,F), ..., (B,E,A,C,R) = (F,T,T,T,F), together with a CPT for P(A | E,B) whose rows for each (b,e) assignment include entries such as 0.9/0.1, 0.2/0.8, and 0.01/0.99.]

Structural Search

How many graphs over n nodes? $O(2^{n^2})$

How many trees over n nodes? $O(n!)$

But it turns out that we can find the exact solution of an optimal tree (under MLE)!

Trick: in a tree each node has only one parent!

Chow-Liu algorithm


Information Theoretic Interpretation of ML

$$
\begin{aligned}
\ell(\theta_G, G; D) &= \log p(D \mid \theta_G, G) \\
&= \log \prod_n \Big( \prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i(G)}, \theta_{i|\pi_i(G)}\big) \Big) \\
&= \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \mathrm{count}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}\big) \Big) \\
&= M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log p\big(x_i \mid \mathbf{x}_{\pi_i(G)}, \theta_{i|\pi_i(G)}\big) \Big)
\end{aligned}
$$

From sum over data points to sum over counts of variable states.

Information Theoretic Interpretation of ML (con'd)

$$
\begin{aligned}
\ell(\hat{\theta}_G, G; D) &= \log \hat{p}(D \mid \hat{\theta}_G, G) \\
&= M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \hat{p}\big(x_i \mid \mathbf{x}_{\pi_i(G)}\big) \Big) \\
&= M \sum_i \Big( \sum_{x_i, \mathbf{x}_{\pi_i(G)}} \hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) \log \frac{\hat{p}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)}{\hat{p}(x_i)\,\hat{p}\big(\mathbf{x}_{\pi_i(G)}\big)} \Big) + M \sum_i \Big( \sum_{x_i} \hat{p}(x_i) \log \hat{p}(x_i) \Big) \\
&= M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\end{aligned}
$$

Decomposable score and a function of the graph structure.
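To make the decomposition concrete, here is a minimal sketch (my own Python, not the lecture's code) that evaluates this score for a candidate parent assignment from empirical counts; the function names and the NumPy-based data layout are my assumptions.

```python
import numpy as np
from collections import Counter

def empirical_entropy(col):
    """H_hat(x_i): entropy of one column's empirical distribution."""
    p = np.array(list(Counter(col).values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log(p))

def empirical_mi(xi, xpa):
    """I_hat(x_i, x_pa): mutual information between a node and its parent set."""
    M = len(xi)
    rows = list(map(tuple, xpa))
    joint, pi, ppa = Counter(zip(xi, rows)), Counter(xi), Counter(rows)
    return sum((c / M) * np.log(c * M / (pi[a] * ppa[b]))
               for (a, b), c in joint.items())

def ml_score(data, parent_sets):
    """Decomposable ML score: M * sum_i I_hat(x_i, x_pa(i)) - M * sum_i H_hat(x_i)."""
    M, n = data.shape
    score = -M * sum(empirical_entropy(data[:, i]) for i in range(n))
    for i, pa in enumerate(parent_sets):
        if pa:  # an empty parent set contributes zero mutual information
            score += M * empirical_mi(data[:, i], data[:, pa])
    return score
```

Since the entropy term does not depend on the graph, comparing two structures reduces to comparing their summed mutual informations, which is exactly what the next slides exploit.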


Chow-Liu tree learning algorithm

Objective function:

$$
\ell(\theta_G, G; D) = \log \hat{p}(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\;\Rightarrow\;
C(G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)
$$

Chow-Liu: for each pair of variables $x_i$ and $x_j$:

Compute the empirical distribution:

$$\hat{p}(X_i, X_j) = \frac{\mathrm{count}(x_i, x_j)}{M}$$

Compute the mutual information:

$$\hat{I}(X_i, X_j) = \sum_{x_i, x_j} \hat{p}(x_i, x_j) \log \frac{\hat{p}(x_i, x_j)}{\hat{p}(x_i)\,\hat{p}(x_j)}$$

Define a graph with nodes $x_1, \ldots, x_n$; edge $(i, j)$ gets weight $\hat{I}(X_i, X_j)$.

Chow-Liu algorithm (con'd)

Objective function:

$$
\ell(\theta_G, G; D) = \log \hat{p}(D \mid \hat{\theta}_G, G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big) - M \sum_i \hat{H}(x_i)
\;\Rightarrow\;
C(G) = M \sum_i \hat{I}\big(x_i, \mathbf{x}_{\pi_i(G)}\big)
$$

Chow-Liu: optimal tree BN

Compute the maximum weight spanning tree.

Direction in the BN: pick any node as root, then do a breadth-first search to define the edge directions.

I-equivalence: different rootings of the same undirected tree give the same score.

[Figure: three trees over nodes A, B, C, D, E with the same skeleton but different roots.]

$$C(G) = I(A,B) + I(A,C) + I(C,D) + I(C,E)$$
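To make the procedure concrete, here is a minimal Chow-Liu sketch (my own Python, not the lecture's code): compute pairwise empirical mutual informations, take a maximum weight spanning tree, and orient edges by BFS from an arbitrary root.

```python
import numpy as np
import networkx as nx
from collections import Counter

def pairwise_mi(xi, xj):
    """Empirical mutual information I_hat(X_i, X_j) of two discrete columns."""
    M = len(xi)
    joint, pi, pj = Counter(zip(xi, xj)), Counter(xi), Counter(xj)
    return sum((c / M) * np.log(c * M / (pi[a] * pj[b]))
               for (a, b), c in joint.items())

def chow_liu(data, root=0):
    """Return directed edges (parent, child) of the optimal tree BN (under MLE)."""
    _, n = data.shape
    g = nx.Graph()
    for i in range(n):
        for j in range(i + 1, n):
            g.add_edge(i, j, weight=pairwise_mi(data[:, i], data[:, j]))
    skeleton = nx.maximum_spanning_tree(g)      # undirected optimal tree
    return list(nx.bfs_edges(skeleton, root))   # BFS from root assigns directions

# Toy usage: column 1 is a noisy copy of column 0; column 2 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
data = np.column_stack([x0, (x0 + (rng.random(500) < 0.1)) % 2,
                        rng.integers(0, 2, 500)])
print(chow_liu(data))  # the strong 0-1 edge is always kept in the tree
```

Because any rooting of the same skeleton yields the same score (the I-equivalence point above), the choice of root is arbitrary.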


Structure Learning for general graphs

Theorem: The problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2.

Most structure learning approaches use heuristics that exploit score decomposition. Two heuristics that exploit decomposition in different ways (a sketch of the second follows below):

Greedy search through the space of node-orders

Local search of graph structures
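A minimal sketch of the second heuristic (my own Python; `score` is a stand-in for any decomposable structure score, such as the regularized likelihood discussed later): greedily add or delete single edges, rejecting moves that create cycles.

```python
import itertools
import networkx as nx

def local_search(nodes, score, max_iters=100):
    """Greedy hill-climbing over DAGs using add/delete single-edge moves."""
    g = nx.DiGraph()
    g.add_nodes_from(nodes)
    best = score(g)
    for _ in range(max_iters):
        improved = False
        for u, v in itertools.permutations(nodes, 2):
            trial = g.copy()
            if trial.has_edge(u, v):
                trial.remove_edge(u, v)                # deletion move
            else:
                trial.add_edge(u, v)                   # addition move
                if not nx.is_directed_acyclic_graph(trial):
                    continue                           # reject cycle-creating moves
            s = score(trial)
            if s > best:
                g, best, improved = trial, s, True
        if not improved:
            break                                      # local optimum reached
    return g, best
```

As the slide warns, such a search generally converges only to a locally optimal network.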

Gene Expression Profiling by Microarrays

[Figure: a signaling pathway drawn as a graphical model over X1, ..., X8: Receptor A and Receptor B feed Kinase C, Kinase D, and Kinase E, which activate Trans. Factor F, which in turn regulates Gene G and Gene H.]


Microarray Data

[Figure: microarray snapshots at 1hr, 2hr, 3hr, 4hr.]

Structure Learning Algorithms

The original algorithm: Structural EM (Friedman 1998)

Sparse Candidate Algorithm (Friedman et al.)

Discretizing array signals
Hill-climbing search using local operators: add/delete/swap of a single edge
Feature extraction: Markov relations, order relations
Re-assemble high-confidence sub-networks from features

Module network learning (Segal et al.)

Heuristic search of structure in a "module graph"
Module assignment
Parameter sharing
Prior knowledge: possible regulators (TF genes)

[Figure: expression data fed into the learning algorithm, producing a network over nodes E, R, B, A, C.]


Scoring Networks

[Figure: bootstrap scheme: resample the data D into D1, D2, ..., Dm; learn a network over E, R, B, A, C from each resample.]

Learning GM structure

Learning the best CPDs given a DAG is easy: collect statistics of the values of each node given specific assignments to its parents.

Learning the graph topology (structure) is NP-hard: heuristic search must be applied, which generally leads to a locally optimal network.

Overfitting: it turns out that richer structures give higher likelihood P(D | G) to the data (adding an edge is always preferable):

$$P(C \mid A) \le P(C \mid A, B)$$

More parameters to fit ⇒ more freedom ⇒ there always exists a more "optimal" CPD(C).

We prefer simpler (more explanatory) networks. Practical scores regularize the likelihood improvement of complex networks.


Gaussian Graphical Model

Multivariate Gaussian density:

$$p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu) \Big\}$$

WOLG: let $Q = \Sigma^{-1}$ and $\mu = 0$:

$$p(x_1, \ldots, x_n \mid \mu = 0, Q) = \frac{|Q|^{1/2}}{(2\pi)^{n/2}} \exp\Big\{ -\tfrac{1}{2} \sum_i q_{ii} x_i^2 - \sum_{i<j} q_{ij} x_i x_j \Big\}$$

We can view this as a continuous Markov random field with potentials defined on every node and edge.

The covariance and the precision matrices

Covariance matrix: graphical model interpretation?

Precision matrix: graphical model interpretation?

How to prove the latter?


Sparse precision vs. sparse covariance in GGM

[Figure: a chain graph over nodes 1 - 2 - 3 - 4 - 5.]

$$
\Sigma^{-1} =
\begin{pmatrix}
1 & 6 & 0 & 0 & 0 \\
6 & 2 & 7 & 0 & 0 \\
0 & 7 & 3 & 8 & 0 \\
0 & 0 & 8 & 4 & 9 \\
0 & 0 & 0 & 9 & 5
\end{pmatrix}
$$

[The corresponding covariance matrix $\Sigma = (\Sigma^{-1})^{-1}$ is dense: its entries (values such as 0.10, 0.15, -0.13, ... on the slide) are all nonzero.]

$$\Sigma^{-1}_{15} = 0 \;\Leftrightarrow\; X_1 \perp X_5 \mid X_{\mathrm{nbrs}(1)} \text{ (or } X_{\mathrm{nbrs}(5)}\text{)}$$

$$X_1 \perp X_5 \;\Leftrightarrow\; \Sigma_{15} = 0$$
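A quick numerical check of this contrast (my own Python sketch, using a diagonally dominant chain precision matrix rather than the slide's numbers, so that it is positive definite): inverting a sparse chain-structured precision matrix yields a dense covariance, and inverting back recovers the zeros.

```python
import numpy as np

# Chain-structured (tridiagonal) precision matrix for X1 - X2 - X3 - X4 - X5.
# Illustrative values chosen for positive definiteness; not the slide's numbers.
K = np.diag([2.0] * 5)
for i in range(4):
    K[i, i + 1] = K[i + 1, i] = -0.9

Sigma = np.linalg.inv(K)
print(np.round(Sigma, 2))                 # dense: no zeros in the covariance
print(np.round(np.linalg.inv(Sigma), 2))  # inverting back recovers the sparse K
```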

Another example

How to estimate this MRF?

What if p >> n? The MLE does not exist in general! What about only learning a "sparse" graphical model?

This is possible when s = o(n). Very often it is the structure of the GM that is more interesting ...


Learning (sparse) GGM

Multivariate Gaussian over all continuous expressions:

$$p\big([x_1, \ldots, x_n]^T\big) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2} (\mathbf{x}-\mu)^T \Sigma^{-1} (\mathbf{x}-\mu) \Big\}$$

$$E\big(x_i \mid \mathbf{x}_{-i}\big) = -\sum_j \big(K_{ij}/K_{ii}\big)\, x_j$$

The precision matrix K = Σ⁻¹ reveals the topology of the (undirected) network: edge (i, j) ~ |K_ij| > 0.

Learning algorithm: covariance selection. We want a sparse estimate of K:

Regression for each node with a degree constraint (Dobra et al.)
Regression for each node with a hierarchical Bayesian prior (Li et al.)
Graphical lasso (we will describe it shortly)

Learning Ising Model (i.e. pairwise MRF)

Assuming the nodes are discrete and the edges are weighted, then for a sample x_d we have the pairwise likelihood below.

The graphical lasso has been used to obtain a sparse estimate of E with continuous X.

We can use L1-regularized logistic regression to obtain a sparse estimate of E with discrete X.
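The slide's formula did not survive extraction; a standard pairwise MRF (Ising) form it plausibly refers to, written from the general definitions rather than transcribed, is

$$p(\mathbf{x} \mid \Theta) = \frac{1}{Z(\Theta)} \exp\Big\{ \sum_i \theta_{ii}\, x_i + \sum_{i<j} \theta_{ij}\, x_i x_j \Big\},$$

under which the conditional of one node given the rest is exactly a logistic regression; this is why L1-regularized logistic regression recovers the neighborhoods.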


Recall lasso

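The slide body was an image; for reference, the standard lasso estimator (a textbook formula, not transcribed from the slide) is

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \|\mathbf{y} - X\beta\|_2^2 + \lambda \|\beta\|_1,$$

whose L1 penalty drives many coefficients exactly to zero.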

Graph Regression

Lasso: neighborhood selection

[Figure, repeated over three slides: each node is lasso-regressed on all remaining nodes; the nonzero coefficients select that node's neighborhood, and the per-node neighborhoods are stitched into a graph.]
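Here is a minimal neighborhood-selection sketch in the spirit of Meinshausen and Bühlmann (my own Python, using scikit-learn's Lasso; the OR rule for combining per-node neighborhoods is one common convention, not necessarily the lecture's choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_select(X, lam=0.1):
    """Estimate GGM edges by lasso-regressing each node on all the others."""
    _, p = X.shape
    edges = set()
    for i in range(p):
        others = [j for j in range(p) if j != i]
        coef = Lasso(alpha=lam).fit(X[:, others], X[:, i]).coef_
        # OR rule: keep an edge if either endpoint's regression selects it.
        edges.update(frozenset((i, others[k]))
                     for k in np.flatnonzero(np.abs(coef) > 1e-10))
    return edges
```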


Why is this reasonable?

Single-node Conditional

The conditional distribution of a single node i given the rest of the nodes can be written as: [equation rendered as an image; a standard reconstruction follows the next slide]

WOLG: let μ = 0.


Conditional auto-regression

From the single-node conditional, we can write the following conditional auto-regression function for each node: [equation rendered as an image; see the reconstruction below]

Neighborhood estimation is based on the auto-regression coefficients.
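The equations on these two slides are images; the standard GGM identities they rely on (my reconstruction from the definitions above, not a verbatim transcription) are

$$X_i \mid \mathbf{x}_{-i} \;\sim\; \mathcal{N}\Big(-\sum_{j \neq i} \frac{K_{ij}}{K_{ii}}\, x_j,\;\; \frac{1}{K_{ii}}\Big), \qquad x_i = \sum_{j \neq i} \beta_{ij}\, x_j + \varepsilon_i \;\text{ with }\; \beta_{ij} = -\frac{K_{ij}}{K_{ii}},$$

so $\beta_{ij} \neq 0$ exactly when $K_{ij} \neq 0$, i.e. exactly when $j$ is a neighbor of $i$.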

Conditional independence

From the auto-regression above, given an estimate of the neighborhood s_i, we have:

$$X_i \;\perp\; \mathbf{X}_{-i \setminus s_i} \;\mid\; \mathbf{X}_{s_i}$$

Thus the neighborhood s_i defines the Markov blanket of node i.


Consistency

Theorem: for the graphical regression algorithm, under certain verifiable conditions (omitted here for simplicity): [consistency result rendered as an image]

Note that from this theorem one should see that the regularizer is not actually used to introduce an "artificial" sparsity bias, but is a device to ensure consistency under finite-data and high-dimension conditions.

Recent trends in GGM:

Covariance selection (classical method)

Dempster [1972]: sequentially pruning the smallest elements in the precision matrix
Drton and Perlman [2008]: improved statistical tests for pruning
Serious limitations in practice: breaks down when the covariance matrix is not invertible

L1-regularization based methods (hot!)

Meinshausen and Bühlmann [Ann. Stat. 06]: used LASSO regression for neighborhood selection
Banerjee [JMLR 08]: block sub-gradient algorithm for finding the precision matrix
Friedman et al. [Biostatistics 08]: efficient fixed-point equations based on a sub-gradient algorithm (the graphical lasso; a usage sketch follows below)
Structure learning is possible even when # variables > # samples
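As a usage sketch (my own Python; scikit-learn's GraphicalLasso implements the Friedman et al. estimator, and the toy data here is mine):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Toy usage: p = 10 variables, n = 40 samples; the L1 penalty keeps the
# precision estimate sparse even when n is small relative to p.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))
model = GraphicalLasso(alpha=0.2).fit(X)
K = model.precision_                                # sparse precision estimate
edges = np.argwhere(np.triu(np.abs(K) > 1e-4, 1))   # nonzero off-diagonal entries
print(edges)                                        # estimated undirected edge list
```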


Learning GM

Learning the best CPDs given a DAG is easy: collect statistics of the values of each node given specific assignments to its parents.

Learning the graph topology (structure) is NP-hard: heuristic search must be applied, which generally leads to a locally optimal network.

We prefer simpler (more explanatory) networks: regularized graph regression.

New Problem: Evolving Social Networks

Can I get his vote?

Cooperativity, antagonism, cliques, ... over time?

[Figure: senate networks at three time points: March 2005, January 2006, August 2006.]


Time-Varying Gene Regulations


Departing from invariant GM estimation

Existing work:

Assuming networks or network time series are observable and given
Then model/analyze the generative and/or dynamic mechanisms

We assume:

Networks are not observable
So we need to INFER the networks from nodal attributes before analyzing them


Reverse engineer temporal/spatial-specific "rewiring" gene networks

[Figure: a time axis from T0 to TN with an estimation point t*; n = 1 or some small # of samples per time point; Drosophila development shown as the running example.]

Challenges

Very small sample size: observations are scarce and costly

Noisy data

Large dimensionality of the data: usually complexity regularization is required to avoid the curse of dimensionality, e.g. sparsity

And now the data are non-iid, since the underlying probability distribution is changing!


KELLER: Kernel-Weighted L1-regularized Logistic Regression

Inference I [Song, Kolar and Xing, Bioinformatics 09]

Constrained convex optimization
Estimates time-specific networks one by one
Could scale to ~10^4 genes, but under stronger smoothness assumptions

Conditional likelihood: [equation rendered as an image; see the sketch below]
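The slide's objective is an image; the following Python sketch (my own, with hypothetical helper names, not the authors' implementation) conveys the idea: to estimate node i's neighborhood at time t*, reweight each observation's logistic loss by a kernel in time and apply an L1 penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def keller_neighborhood(X, times, i, t_star, bandwidth=0.1, lam=0.1):
    """Kernel-weighted L1 logistic regression for node i's neighborhood at t_star.

    X: (T, p) array of binary node states; times: (T,) observation times.
    An illustrative sketch under stated assumptions, not the authors' code.
    """
    w = np.exp(-0.5 * ((times - t_star) / bandwidth) ** 2)  # kernel weights in time
    others = [j for j in range(X.shape[1]) if j != i]
    clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear")
    clf.fit(X[:, others], X[:, i], sample_weight=w)          # weighted logistic loss
    return {others[k] for k in np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-8)}
```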

Algorithm: neighborhood selection

Neighborhood: [definition rendered as an image]

Estimate at t*: [estimator rendered as an image]

where [the weighting kernel and the loss are defined in equations rendered as images].


Structural consistency of KELLER

Assumptions (define: [quantities rendered as images]):

A1: Dependency Condition
A2: Incoherence Condition
A3: Smoothness Condition
A4: Bounded Kernel

Theorem [Kolar and Xing, 09]: [statement rendered as an image]


TESLA: Temporally Smoothed L1-regularized Logistic Regression

Inference II [Ahmed and Xing, PNAS 09]

Constrained convex optimization
Scales to ~5000 nodes
Does not need a smoothness assumption; can accommodate abrupt changes

Modified estimation procedure:

(*) Estimate the block partition on which the coefficient functions are constant.

(**) Estimate the coefficient functions on each block of the partition.

[The equations labeled (*) and (**) are rendered as images; the idea is sketched below.]
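Since the (*)/(**) objectives are images, here is the flavor of the TESLA objective as I understand it (a hedged reconstruction consistent with the λ1, λ2 parameters reported later, not a verbatim transcription):

$$\hat{\beta}_i^{1:T} = \arg\min_{\beta^1, \ldots, \beta^T}\; \sum_{t=1}^{T} \ell\big(\beta^t; \mathbf{x}^t\big) \;+\; \lambda_1 \sum_{t=1}^{T} \|\beta^t\|_1 \;+\; \lambda_2 \sum_{t=2}^{T} \|\beta^t - \beta^{t-1}\|_1,$$

where $\ell$ is the logistic loss for node $i$'s neighborhood regression; the second penalty is a total-variation term that keeps the coefficients piecewise-constant over time, which is what allows abrupt changes.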


Structural Consistency of TESLA [Kolar and Xing, NIPS 2009]

I. It can be shown that, by applying the results for model selection of the randomized Lasso to a temporal-difference transformation of (*), the blocks are estimated consistently.

II. Then it can further be shown that, by applying the Lasso to (**), the neighborhood of each node on each of the estimated blocks is estimated consistently.

Further advantages of the two-step procedure:

Choosing the parameters is easier
Faster optimization procedure

Two Scenarios

[Figure: smoothly evolving graphs vs. abruptly evolving graphs.]


Comparison of KELLER and TESLA

[Figure: results on smoothly varying vs. abruptly varying networks.]

Senate network - 109th congress

Voting records from the 109th congress (2005 - 2006). There are 100 senators whose votes were recorded on the 542 bills; each vote is a binary outcome.

Estimating parameters:

KELLER: bandwidth parameter h_n = 0.174, penalty parameter λ1 = 0.195
TESLA: λ1 = 0.24 and λ2 = 0.28


Senate network - 109th congress

[Figure: estimated senate networks in March 2005, January 2006, August 2006.]

Senator Chafee

[Figure: Senator Chafee's estimated neighborhood over time.]


Senator Ben Nelson

[Figure: Senator Ben Nelson's estimated neighborhood at T = 0.2 and T = 0.8.]

Drosophila life cycle

From Arbeitman et al. (2002). Four stages: embryo, larva, pupa, adult.

66 microarray measurements across the full life cycle.

Focus on 588 development-related genes.


Dynamic Gene Interaction Networks of Drosophila melanogaster

[Figure: networks annotated by biological process, molecular function, and cellular component.]

Summary

Graphical Gaussian Model:

The precision matrix encodes the structure
Not estimable when p >> n

Neighborhood selection:

Conditional distribution under a GGM
Graphical lasso
Sparsistency

Time-varying GGM:

Kernel reweighting estimation
Total variation estimation