
10-701 Machine Learning - Spring 2012

Problem Set 4: Solutions (Out: March 21st, In: April 9th)

Byron Boots ([email protected]), School of Computer Science, Carnegie Mellon University

• Homework will be done individually: each student must hand in their own answers. It is acceptable for students to collaborate in figuring out answers and helping each other solve the problems. We will be assuming that, as participants in a graduate course, you will be taking the responsibility to make sure you personally understand the solution to any work arising from such collaboration. You also must indicate on each homework with whom you collaborated.

• Homework is due at the beginning of class on the due date. For programming questions, please submit your code and any plots/figures to the AFS submission folder:

/afs/cs.cmu.edu/academic/class/10701-s12-users/yourandrewid/hw4

where yourandrewid is your Andrew ID. To copy the files to the folder, you can follow these steps:

– Copy the homework file (pdf format) and other code and data files as needed to your home directory on an Andrew machine, e.g.

scp probset4.pdf [email protected]:yourhomepath
password: ********

– Start an Andrew session, e.g.

ssh -l yourandrewid linux.andrew.cmu.edu
password: *********

– Invoke aklog and copy the files over to your dedicated course folder. You have insert and listing rights for this folder, e.g.

aklog cs.cmu.edu
mkdir /afs/cs.cmu.edu/academic/class/10701-s12-users/yourandrewid/hw4
cp probset4.pdf /afs/cs.cmu.edu/academic/class/10701-s12-users/yourandrewid/hw4/

• Please also include a README file that describes your submission.


• You may not overwrite your submission. If you want to update a file, please submit a file with a new filename, keeping in mind a 20 MB space limit per student. The file with the newest timestamp prior to the due date and time will be evaluated.


1 Bayesian Networks [50 points]

1.1 Independence [14 points]

In this question we analyze how a probabilistic graphical model encodes probabilistic dependence assumptions. Given the graphical model below, which of the following statements are true, regardless of the conditional probability distributions? [2 points each]

[Figure: Bayesian network over the nodes A, B, C, D, E, F, G, H, I, J, K, L, M, N. The edge structure is referenced in the solutions below but is not reproduced in this transcript.]

(a) P(D, H) = P(D)P(H)

(b) P(A, I) = P(A)P(I)

(c) P(A, I | G) = P(A | G)P(I | G)

(d) P(J, G | F) = P(J | F)P(G | F)

(e) P(J, M | K, L) = P(J | K, L)P(M | K, L)

(f) P(E, C | A, G) = P(E | A, G)P(C | A, G)

(g) P(E, C | A) = P(E | A)P(C | A)

Solution:

(a) P(D, H) = P(D)P(H) is true. The possible paths from D to H are D-C-G-F-E-I-J-K-M-H, D-C-G-F-E-I-J-L-M-H, D-C-B-A-E-I-J-K-M-H, and D-C-B-A-E-I-J-L-M-H. For the first two paths, the triplet C-G-F forms a v-structure, and since neither G nor its descendants are observed, influence does not flow along these two paths. For the third and fourth paths, the triplet A-E-I forms a v-structure, and since neither E nor its descendants are observed, influence does not flow along these two paths. Since there is no active trail from D to H, this implies that D ⊥ H.

(b) P(A, I) = P(A)P(I) is true. The possible paths from A to I are A-E-I and A-B-C-G-F-E-I. For A-E-I, the triplet A-E-I forms a v-structure, and since neither E nor its descendants are observed, influence does not flow along this path. For A-B-C-G-F-E-I, the triplet C-G-F forms a v-structure, and since neither G nor its descendants are observed, influence does not flow along this path. Since there is no active trail from A to I, this implies that A ⊥ I.

(c) P(A, I | G) = P(A | G)P(I | G) is false. Influence flows along the path A-B-C-G-F-E-I from A to I, because observing G activates the v-structure C-G-F. Since there is an active trail from A to I, A ⊥ I | G does not hold.

(d) P(J, G | F) = P(J | F)P(G | F) is false. Influence flows along the path G-C-B-A-E-I-J from G to J: E behaves as observed because its descendant F is observed, which activates the v-structure A-E-I. Since there is an active trail from J to G, J ⊥ G | F does not hold.

(e) P(J, M | K, L) = P(J | K, L)P(M | K, L) is true. The possible paths from J to M are J-K-M and J-L-M. Since K and L are observed, influence does not flow along either path. Since there is no active trail from J to M, this implies that J ⊥ M | K, L.

(f) P(E, C | A, G) = P(E | A, G)P(C | A, G) is false. Influence flows along the path E-F-G-C from E to C, because observing G activates the v-structure F-G-C. Since there is an active trail from E to C, E ⊥ C | A, G does not hold.

(g) P(E, C | A) = P(E | A)P(C | A) is true. The possible paths from E to C are E-F-G-C and E-A-B-C. Influence does not flow along E-F-G-C because G is not observed. Influence does not flow along E-A-B-C because A is observed. Since there is no active trail from E to C, this implies that E ⊥ C | A.

1.2 Constructing a Network [6 points]

Consider three binary variables x, y, and z with the following joint distribution:

x  y  z  p(x, y, z)
0  0  0  0.135
0  0  1  0.09
0  1  0  0.005
0  1  1  0.02
1  0  0  0.1125
1  0  1  0.075
1  1  0  0.1125
1  1  1  0.45

Show that the joint distribution p(x, y, z) can be represented by a Bayes net that has just two edges.


Solution: From the table below, we see that p(x, y, z) = p(y)p(x | y)p(z | y). Thus X is independent of Z given Y, and we can represent p(x, y, z) as a Bayes net with 2 edges in the form of X ← Y → Z, X → Y → Z, or X ← Y ← Z.

x  y  z   p(y)    p(x | y)  p(z | y)  p(y)p(x | y)p(z | y)  p(x, y, z)
0  0  0   0.4125  0.54545   0.6       0.135                 0.135
0  0  1   0.4125  0.54545   0.4       0.09                  0.09
0  1  0   0.5875  0.04255   0.2       0.005                 0.005
0  1  1   0.5875  0.04255   0.8       0.02                  0.02
1  0  0   0.4125  0.45454   0.6       0.1125                0.1125
1  0  1   0.4125  0.45454   0.4       0.075                 0.075
1  1  0   0.5875  0.95745   0.2       0.1125                0.1125
1  1  1   0.5875  0.95745   0.8       0.45                  0.45
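As a quick numerical aside (not part of the original solution), the factorization can be checked in a few lines of code. The following Python sketch recomputes p(y), p(x | y), and p(z | y) from the joint table above and verifies that their product reproduces p(x, y, z) for every assignment:

import itertools

# Joint distribution p(x, y, z) from the problem statement.
p = {(0, 0, 0): 0.135,  (0, 0, 1): 0.09,  (0, 1, 0): 0.005,  (0, 1, 1): 0.02,
     (1, 0, 0): 0.1125, (1, 0, 1): 0.075, (1, 1, 0): 0.1125, (1, 1, 1): 0.45}

# Marginal and conditional tables computed from the joint.
p_y  = {y: sum(v for (x, yy, z), v in p.items() if yy == y) for y in (0, 1)}
p_xy = {(x, y): sum(v for (xx, yy, z), v in p.items() if xx == x and yy == y) / p_y[y]
        for x in (0, 1) for y in (0, 1)}
p_zy = {(z, y): sum(v for (x, yy, zz), v in p.items() if zz == z and yy == y) / p_y[y]
        for z in (0, 1) for y in (0, 1)}

# Check p(x, y, z) = p(y) p(x|y) p(z|y) for every assignment.
for x, y, z in itertools.product((0, 1), repeat=3):
    assert abs(p[(x, y, z)] - p_y[y] * p_xy[(x, y)] * p_zy[(z, y)]) < 1e-9
print("p(x, y, z) = p(y) p(x|y) p(z|y) holds, so X is independent of Z given Y")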

1.3 Variable Elimination [20 points]

[Figure: Bayesian network over the nodes A, B, C, D, E, F, G, with edges A → B, D → C, B → C, D → E, C → F, E → F, and F → G, as implied by the factorization in part (a).]

In this question we are going to practice variable elimination on the Bayesian network given above. All of the variables are binary valued {T, F}. The conditional probabilities are:

P(A = T) = 0.7,  P(D = T) = 0.6

P(B = T | A = T) = 0.9,  P(B = T | A = F) = 0.3

P(C = T | D = T, B = T) = 0.9,  P(C = T | D = T, B = F) = 0.5

P(C = T | D = F, B = T) = 0.7,  P(C = T | D = F, B = F) = 0.2

P(E = T | D = T) = 0.1,  P(E = T | D = F) = 0.4

P(F = T | E = T, C = T) = 0.2,  P(F = T | E = T, C = F) = 0.1

P(F = T | E = F, C = T) = 0.9,  P(F = T | E = F, C = F) = 0.2

P(G = T | F = T) = 0.5,  P(G = T | F = F) = 0.1

(a) Joint Probability [3 points] Write down the formula for the joint probability distribution that makes the same conditional independence assumptions as the above graph.

Solution: We can directly read the joint probability distribution from the graph. Thus we obtain

P(A, D, B, E, C, F, G) = P(A)P(D)P(B|A)P(E|D)P(C|D, B)P(F|E, C)P(G|F)

(b) Variable Elimination [7 points] Using variable elimination, compute the probability P(G = T | A = T). Show your work.

Solution: The conditional probability P(G = t | A = t) can be rewritten as:

P(G = t | A = t) = P(G = t, A = t) / P(A = t)

Since we already know P(A = t), we just need to solve for the joint probability P(G = t, A = t). To obtain P(G = t, A = t), we need to marginalize out all the other variables using variable elimination. We first write out the full set of sums and then factor each term out of the sums it does not depend on, to obtain

P(G = t, A = t) = ∑_f ∑_c ∑_e ∑_b ∑_d P(A = t) P(B|A = t) P(C|B, D) P(D) P(E|D) P(F|E, C) P(G = t|F)

                = P(A = t) ∑_f P(G = t|F) ∑_c ∑_e P(F|E, C) ∑_b P(B|A = t) ∑_d P(C|B, D) P(D) P(E|D)

We will eliminate D, B, E, and C in that order. Along the way we will create temporary functions (tables) g1, g2, g3, and g4:

eliminate D:   g1(E, B, C) = ∑_d P(C|B, D = d) P(D = d) P(E|D = d)

eliminate B:   g2(E, C) = ∑_b P(B = b|A = t) g1(E, B = b, C)

eliminate E:   g3(F, C) = ∑_e P(F|E = e, C) g2(E = e, C)

eliminate C:   g4(F) = ∑_c g3(F, C = c)

Thus we obtain the final form of the answer:

P(G = t, A = t) = P(A = t) ∑_f P(G = t|F = f) g4(F = f)

Performing the computation we obtain:

g1(E, B, C)     C = f    C = t
E = f, B = f    0.4620   0.3180
E = f, B = t    0.1260   0.6540
E = t, B = f    0.1580   0.0620
E = t, B = t    0.0540   0.1660

g2(E, C)   C = f    C = t
E = f      0.1596   0.6204
E = t      0.0644   0.1556

g3(F, C)   C = f    C = t
F = f      0.1856   0.1865
F = t      0.0384   0.5896

g4(F)
F = f   0.3722
F = t   0.6278

Finally we obtain P(G = t, A = t) = 0.2458, which we normalize (dividing by P(A = t) = 0.7) to get P(G = t | A = t) = 0.3511.

(c) Variable Elimination [10 points] Using variable elimination, compute the probability P(G = T | A = T, D = T). Show your work.

Solution: Here we want to compute:

P(G = t | A = t, D = t) = P(G = t, A = t, D = t) / P(A = t, D = t)

We know that P(A = t, D = t) = P(A = t)P(D = t) by the (marginal) independence structure encoded in the graph. Therefore, we need only adapt the elimination we used to compute P(G = t, A = t), this time with D fixed to t:

P(G = t, A = t, D = t) = P(A = t) P(D = t) ∑_f P(G = t|F) ∑_c ∑_e P(F|E, C) P(E|D = t) ∑_b P(B|A = t) P(C|B, D = t)

eliminate B:   g1(C) = ∑_b P(B = b|A = t) P(C|B = b, D = t)

eliminate E:   g2(F, C) = ∑_e P(F|E = e, C) P(E = e|D = t)

eliminate C:   g3(F) = ∑_c g2(F, C = c) g1(C = c)

so that

P(G = t, A = t, D = t) = P(A = t) P(D = t) ∑_f P(G = t|F = f) g3(F = f)

Performing the computation, we obtain:

g1(C)
C = f   0.14
C = t   0.86

g2(F, C)   C = f   C = t
F = f      0.81    0.17
F = t      0.19    0.83

g3(F)
F = f   0.2596
F = t   0.7404

Finally, we obtain P(G = t, A = t, D = t) = 0.166, which we normalize (dividing by P(A = t)P(D = t) = 0.42) to get P(G = t | A = t, D = t) = 0.396.
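Because the network has only seven binary variables, both answers can be cross-checked by brute-force enumeration of the joint distribution. The sketch below is in Python (the course itself used Matlab) and simply sums the factored joint over all 2^7 assignments; the helper names bern, joint, and prob are specific to this sketch, and it is a sanity check rather than an implementation of variable elimination:

import itertools

T, F = True, False

# Conditional probability tables from the problem statement (probability of "True").
pA, pD = 0.7, 0.6
pB = {T: 0.9, F: 0.3}                                      # P(B=T | A)
pC = {(T, T): 0.9, (T, F): 0.5, (F, T): 0.7, (F, F): 0.2}  # P(C=T | D, B)
pE = {T: 0.1, F: 0.4}                                      # P(E=T | D)
pF = {(T, T): 0.2, (T, F): 0.1, (F, T): 0.9, (F, F): 0.2}  # P(F=T | E, C)
pG = {T: 0.5, F: 0.1}                                      # P(G=T | F)

def bern(p, val):                 # P(X = val) when P(X = T) = p
    return p if val else 1.0 - p

def joint(a, b, c, d, e, f, g):   # P(A,B,C,D,E,F,G) under the factorization from part (a)
    return (bern(pA, a) * bern(pD, d) * bern(pB[a], b) * bern(pC[(d, b)], c)
            * bern(pE[d], e) * bern(pF[(e, c)], f) * bern(pG[f], g))

def prob(query, evidence):        # P(query | evidence) by summing the joint
    num = den = 0.0
    for vals in itertools.product([T, F], repeat=7):
        assign = dict(zip("ABCDEFG", vals))
        pr = joint(*[assign[v] for v in "ABCDEFG"])
        if all(assign[k] == v for k, v in evidence.items()):
            den += pr
            if all(assign[k] == v for k, v in query.items()):
                num += pr
    return num / den

print(prob({"G": T}, {"A": T}))            # approximately 0.3511, matching part (b)
print(prob({"G": T}, {"A": T, "D": T}))    # approximately 0.396, matching part (c)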

1.4 Admissible Bayesian Networks [10 points]

Provide an upper and lower bound on the number of possible Bayesian networks with n nodes and explain them. Be as tight as possible (remember that a Bayesian network is a directed acyclic graph).


Solution: A suitable upper bound is obtained by considering all pairs of nodes. In a Bayesian network with n nodes there are (n choose 2) = n(n−1)/2 pairs. For each pair of nodes i and j there can be an edge from i to j, an edge from j to i, or no edge at all between them. This gives an upper bound of 3^(n(n−1)/2). Note this is only an upper bound because it also counts some graphs with cycles (e.g., i → j, j → k, k → i).

For the lower bound we impose an arbitrary ordering on the nodes 1, ..., n and count all networks in which, for each pair of nodes i and j with i < j, there is either an edge from i to j or no edge between the two nodes. Counting only networks whose edges leave nodes earlier in the ordering and enter nodes later in the ordering ensures there are no cycles. Again there are (n choose 2) such pairs, making the lower bound 2^(n(n−1)/2).
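For comparison (not required by the problem), the exact number of labeled DAGs on n nodes can be computed with Robinson's recurrence and sits between the two bounds above. A minimal Python sketch, assuming math.comb (Python 3.8+):

from math import comb

def num_dags(n):
    # Robinson's recurrence: a(m) = sum_{k=1..m} (-1)^(k+1) C(m, k) 2^(k(m-k)) a(m-k), a(0) = 1.
    a = [1]
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

for n in range(1, 6):
    print(n, 2 ** comb(n, 2), num_dags(n), 3 ** comb(n, 2))
    # e.g. n = 3: lower bound 8, exact count 25, upper bound 27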

2 Semi-supervised Learning [25 points]

We begin by looking at the problem of Bernoulli Naive Bayes classification with one binary class variable Y and 3 binary feature variables X1, X2, X3. For the Naive Bayes classifier, we would like to learn the best choice of parameters for P(Y), P(X1 | Y), P(X2 | Y), and P(X3 | Y). Assume Y, X1 | Y, X2 | Y, and X3 | Y are all Bernoulli variables and let us denote the Bernoulli parameters as [1]

θ_{Y=y} = P(Y = y),   θ_{X1=x1|Y=y} = P(X1 = x1 | Y = y),

θ_{X2=x2|Y=y} = P(X2 = x2 | Y = y),   θ_{X3=x3|Y=y} = P(X3 = x3 | Y = y).

(a) [2 points] Write the log-probability of X and Y in terms of the parameters θ, first for a single example (X1 = x1, X2 = x2, X3 = x3, Y = y), then for n i.i.d. examples (X1^i = x1^i, X2^i = x2^i, X3^i = x3^i, Y^i = y^i) for i = 1, ..., n.

Solution: For a single example

log P(X1, X2, X3, Y | θ) = log [ P(Y | θ) ∏_j P(Xj | Y, θ) ]
                         = log P(Y | θ) + ∑_j log P(Xj | Y, θ)

For many examples we just sum over each example:

log P(X1^{1:n}, X2^{1:n}, X3^{1:n}, Y^{1:n} | θ) = ∑_i [ log P(Y^i | θ) + ∑_j log P(Xj^i | Y^i, θ) ]

[1] We only need θ_Y = P(Y = 1), θ_{Xi|Y=y} = P(Xi = 1 | Y = y), etc., since θ_{Y=0} = 1 − θ_{Y=1}, etc., but the set of θs defined here should help you notationally.


(b) [3 points] Next derive the maximum likelihood estimates of the parameters, θ_Y = argmax_θ ∑_{i=1}^n log P(Y^i | θ) and θ_{Xj=xj|Y=1} = argmax_θ ∑_{i=1}^n ∑_j log P(Xj^i | Y^i, θ).

Solution: For θ_Y:

∑_{i=1}^n log P(Y^i | θ) = ∑_{i=1}^n [ y^i log θ + (1 − y^i) log(1 − θ) ]

To find θ_Y = argmax_θ ∑_{i=1}^n log P(Y^i | θ), we take the partial derivative with respect to θ:

∂/∂θ ∑_{i=1}^n [ y^i log θ + (1 − y^i) log(1 − θ) ] = ∑_{i=1}^n [ y^i / θ − (1 − y^i) / (1 − θ) ]

We then set this to zero and solve for θ, giving

θ_Y = (∑_{i=1}^n y^i) / n

For θ_{Xj=xj|Y=1}: keeping only the terms that involve the Y = 1 parameters (the terms with y^i = 0 do not depend on θ_{Xj=xj|Y=1}),

∑_{i=1}^n ∑_j log P(Xj^i | Y^i, θ) = ∑_{i=1}^n y^i ∑_j [ xj^i log θ_{Xj=xj|Y=1} + (1 − xj^i) log(1 − θ_{Xj=xj|Y=1}) ] + const.

To find θ_{Xj=xj|Y=1} = argmax_θ ∑_{i=1}^n ∑_j log P(Xj^i | Y^i, θ), we take the partial derivative with respect to θ_{Xj=xj|Y=1}:

∂/∂θ_{Xj=xj|Y=1} ∑_{i=1}^n y^i ∑_j [ xj^i log θ_{Xj=xj|Y=1} + (1 − xj^i) log(1 − θ_{Xj=xj|Y=1}) ]
  = ∑_{i=1}^n y^i [ xj^i / θ_{Xj=xj|Y=1} − (1 − xj^i) / (1 − θ_{Xj=xj|Y=1}) ]

We then set this to zero and solve for θ_{Xj=xj|Y=1}, giving

θ_{Xj=xj|Y=1} = (∑_{i=1}^n y^i xj^i) / (∑_{i=1}^n y^i)
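In code, these closed-form estimates are just empirical frequencies. A minimal numpy sketch (Python rather than the course's Matlab; the function name and the synthetic data are purely illustrative):

import numpy as np

def fit_bernoulli_nb(X, y):
    # MLE for Bernoulli Naive Bayes: X is an (n, d) 0/1 matrix, y an (n,) 0/1 vector.
    theta_Y = y.mean()                        # (sum_i y^i) / n
    theta_X_given_1 = X[y == 1].mean(axis=0)  # (sum_i y^i x_j^i) / (sum_i y^i)
    theta_X_given_0 = X[y == 0].mean(axis=0)  # analogous estimate for the Y = 0 class
    return theta_Y, theta_X_given_0, theta_X_given_1

# Tiny synthetic example with d = 3 features, matching the setup of this problem.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
print(fit_bernoulli_nb(X, y))  # theta_Y = 0.5, then per-feature estimates for Y = 0 and Y = 1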

Next, consider the case where the class value Y is never directly observed, but it is approximately observed using a sensor. Let Z be the binary variable representing the sensor values. One morning you realize the sensor value is missing in some of the examples. From the sensor specifications, you learn that the probability of missing values is four times higher when Y = 1 than when Y = 0. More specifically, the exact values from the sensor specifications are:

P (Z missing|Y = 1) = .08, P (Z = 1|Y = 1) = .92

P (Z missing|Y = 0) = .02, P (Z = 0|Y = 0) = .98

(c) [2 points] Draw a Bayes net that represents this problem with a node Y that is the unobserved label, a node Z that is either a copy of Y or has the value “missing”, and the three features X1, X2, X3.

Solution: Here are two possibilities (each expresses the same independence assumptions); in both, Y is the parent of the three features, and the Y-Z edge may point in either direction:

Y → Z,  Y → X1,  Y → X2,  Y → X3      or      Z → Y,  Y → X1,  Y → X2,  Y → X3

(d) [3 points] What is the probability of the unobserved class label being 1 given no other information, i.e., P(Y = 1 | Z = “missing”)? Derive the quantity using Bayes' rule and write your final answer in terms of θ_{Y=1}, our estimate of P(Y = 1).

Solution:

P(Y = 1 | Z = missing) = P(Z = missing | Y = 1) P(Y = 1) / P(Z = missing)

                       = P(Z = missing | Y = 1) P(Y = 1) / [ P(Z = missing | Y = 1) P(Y = 1) + P(Z = missing | Y = 0) P(Y = 0) ]

                       = 0.08 θ_{Y=1} / (0.08 θ_{Y=1} + 0.02 (1 − θ_{Y=1}))
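As a quick sanity check with an assumed value of the estimate (0.5 is not given in the problem): if θ_{Y=1} = 0.5, then P(Y = 1 | Z = missing) = 0.08 · 0.5 / (0.08 · 0.5 + 0.02 · 0.5) = 0.04 / 0.05 = 0.8, i.e., a missing reading alone makes Y = 1 four times as likely as Y = 0, matching the 4:1 ratio of the missingness probabilities.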

(e) [5 points] Write the log-probability of X, Y and Z given θ, in terms of θ and P(Z|Y), first for a single example (X1 = x1, X2 = x2, X3 = x3, Z = z, Y = y), then for n i.i.d. examples (X1^i = x1^i, X2^i = x2^i, X3^i = x3^i, Z^i = z^i, Y^i = y^i) for i = 1, ..., n.


Solution: For a single example

log P(X1, X2, X3, Y, Z | θ) = log [ P(Z | Y) P(Y | θ) ∏_j P(Xj | Y, θ) ]
                            = log P(Z | Y) + log P(Y | θ) + ∑_j log P(Xj | Y, θ)

For many examples we just sum over each example:

log P(X1^{1:n}, X2^{1:n}, X3^{1:n}, Y^{1:n}, Z^{1:n} | θ) = ∑_i [ log P(Z^i | Y^i) + log P(Y^i | θ) + ∑_j log P(Xj^i | Y^i, θ) ]

(f) [10 points] Provide the E-step and M-step for performing expectation maximization of θ for this problem.

In the E-step, compute the distribution Q^{t+1}(Y | Z, X) using

Q^{t+1}(Y = 1 | Z, X) = E[Y | Z, X1, X2, X3, θ^t],

using your Bayes net from part (c) and the conditional probability from part (d) for the unobserved class label Y of a single example.

In the M-step, compute

θ^{t+1} = argmax_θ ∑_{i=1}^n ∑_y Q(Y^i = y | Z^i, X^i) log P(X1^i, X2^i, X3^i, Y^i = y, Z^i | θ)

using all of the examples (X1^1, X2^1, X3^1, Y^1, Z^1), ..., (X1^n, X2^n, X3^n, Y^n, Z^n). Note: it is OK to leave your answers in terms of Q(Y | Z, X).

Solution: The E-step simply computes the belief about each example's label given the example's features and Z:

Q^{t+1}(Y = 1 | Z, X) = P(Y = 1 | X1, X2, X3, Z, θ)

                      = P(Y = 1, X1, X2, X3, Z | θ) / P(X1, X2, X3, Z | θ)

                      = P(Y = 1, X1, X2, X3, Z | θ) / [ P(Y = 1, X1, X2, X3, Z | θ) + P(Y = 0, X1, X2, X3, Z | θ) ]

This can be further expanded using the chain rule and the independence assumptions of the Bayes net. For example, if we take the Bayes net with Y → Z, we get

Q^{t+1}(Y = 1 | Z, X) = P(Z | Y = 1) P(Y = 1 | θ) ∏_j P(Xj | Y = 1, θ)
                        / [ P(Z | Y = 1) P(Y = 1 | θ) ∏_j P(Xj | Y = 1, θ) + P(Z | Y = 0) P(Y = 0 | θ) ∏_j P(Xj | Y = 0, θ) ]

When Z isn't missing, we have Q(Y = z | Z = z, X) = 1. When Z is missing (which is a rare occurrence),

Q^{t+1}(Y = 1 | Z, X) = 0.08 P(Y = 1 | θ) ∏_j P(Xj | Y = 1, θ)
                        / [ 0.08 P(Y = 1 | θ) ∏_j P(Xj | Y = 1, θ) + 0.02 P(Y = 0 | θ) ∏_j P(Xj | Y = 0, θ) ]

The M-step finds θ such that

θ^{t+1} = argmax_θ ∑_{i=1}^n ∑_y Q(Y^i = y | Z^i, X^i) log P(X1^i, X2^i, X3^i, Y^i = y, Z^i | θ)

For each example,

∑_y Q(Y = y | Z, X) log P(X1, X2, X3, Y = y, Z | θ)
  = ∑_y Q(Y = y | Z, X) [ log P(Z | Y = y) + log P(Y = y | θ) + ∑_j log P(Xj | Y = y, θ) ]

Since we're maximizing with respect to θ_Y and θ_{Xj|Y}, we only need to consider

∑_y Q(Y = y | Z, X) [ log P(Y = y | θ) + ∑_j log P(Xj | Y = y, θ) ]

So then we can write

θ̂_Y = argmax_θ ∑_{i=1}^n ∑_y Q(Y^i = y | Z^i, X^i) log P(Y^i = y | θ)

θ̂_{Xj=xj|Y=a} = argmax_θ ∑_{i=1}^n ∑_y Q(Y^i = y | Z^i, X^i) ∑_j log P(Xj^i | Y^i = y, θ)

Since the probabilities are constrained to sum to 1, we get (using a Lagrangian):

θ̂_Y = ∑_i Q(Y^i = 1 | Z^i, X^i) / [ ∑_i Q(Y^i = 1 | Z^i, X^i) + ∑_i Q(Y^i = 0 | Z^i, X^i) ]

θ̂_{Xj=xj|Y=a} = ∑_i I(Xj^i = xj) Q(Y^i = a | Z^i, X^i) / [ ∑_{x'_j} ∑_i I(Xj^i = x'_j) Q(Y^i = a | Z^i, X^i) ]

We are basically just counting, but rather than weighting the counts by Y^i, which we don't always know, we weight them by Q(Y^i), our belief that Y^i takes on a certain value.
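The E- and M-steps above translate directly into a few lines of code. Below is a minimal Python/numpy sketch (the course used Matlab); the encoding of a missing sensor reading as Z = -1, the array shapes, and the function name em_step are assumptions of this sketch, not part of the problem:

import numpy as np

P_MISS = {1: 0.08, 0: 0.02}  # P(Z missing | Y) from the sensor specification

def em_step(X, Z, theta_Y, theta_X):
    # One EM iteration. X: (n, 3) 0/1 features; Z: (n,) in {0, 1, -1}, with -1 meaning missing.
    # theta_Y = P(Y = 1); theta_X[y, j] = P(X_j = 1 | Y = y).
    n = X.shape[0]
    # E-step: q[i] = Q(Y^i = 1 | Z^i, X^i).
    lik = np.ones((n, 2))
    for y in (0, 1):
        lik[:, y] = np.prod(np.where(X == 1, theta_X[y], 1 - theta_X[y]), axis=1)
    w1 = P_MISS[1] * theta_Y * lik[:, 1]
    w0 = P_MISS[0] * (1 - theta_Y) * lik[:, 0]
    q = np.where(Z == -1, w1 / (w1 + w0),      # Z missing: posterior over Y
                 (Z == 1).astype(float))       # Z observed: Y is known exactly
    # M-step: weighted counts.
    theta_Y_new = q.mean()
    theta_X_new = np.vstack([
        ((1 - q)[:, None] * X).sum(axis=0) / (1 - q).sum(),  # P(X_j = 1 | Y = 0)
        (q[:, None] * X).sum(axis=0) / q.sum(),              # P(X_j = 1 | Y = 1)
    ])
    return q, theta_Y_new, theta_X_new

# Usage: start from, e.g., theta_Y = 0.5 and theta_X = 0.5 * np.ones((2, 3)),
# then iterate em_step until the parameters stop changing.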

3 K-means and GMMs [25 points]

3.1 K-Means [5 points]

Consider the data set in Figure 1. The ‘+’ symbols indicate data points and the (centers of the) circles A, B, and C indicate the starting cluster centers. Show the results of running the K-means algorithm on this data set. To do this, use the remaining figures, and for each iteration, indicate which data points will be associated with each of the clusters, and show the locations of the updated class centers. If a cluster center has no points associated with it during the cluster update step, it will not move. Use as many figures as you need until the algorithm converges.

Solution: The solution is drawn on the graph.


[Figure 1: K-Means data set. The figure panels showing the data points, the initial centers A, B, and C, and the cluster assignments and updated centers for each iteration are not reproduced in this transcript.]
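For readers who want to check the hand-drawn iterations, here is a minimal Lloyd's-algorithm sketch in Python/numpy (the function name and iteration count are arbitrary); it implements exactly the assignment and update steps described above, including the rule that a center with no assigned points does not move:

import numpy as np

def kmeans(X, centers, n_iter=10):
    # Plain K-means (Lloyd's algorithm). X: (n, 2) data points; centers: (k, 2) initial centers.
    centers = centers.astype(float).copy()
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points;
        # a center with no assigned points stays where it is.
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers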


3.2 Gaussian Mixture Models [20 points]

In this problem we will be implementing Gaussian mixture models and working with the digits data set. The provided data set is a Matlab file consisting of 5000 10×10 pixel handwritten digits between 0 and 9. Each digit is a greyscale image represented as a 100-dimensional row vector (the images have been downsampled from the original 28×28 pixel images). The variable X is a 5000×100 matrix and the vector Y contains the true number for each image. Please submit your code and include in your write-up a copy of the plots that you generated for this problem.

(a) Implementation [10 points] Implement Expectation Maximization (EM) for the axis-aligned Gaussian mixture model. Recall that the axis-aligned Gaussian mixture model uses the Gaussian Naive Bayes assumption that, given the class, all features are conditionally independent Gaussians. The specific form of the model is given in Equation 1.

Z^i ∼ Multinomial(p_1, ..., p_K)

X^i | Z^i = z ∼ N( (μ_1^z, ..., μ_d^z)^T, diag((σ_1^z)^2, ..., (σ_d^z)^2) )        (1)

Remember, code should be written and turned in individually.
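One possible structure for such an implementation is sketched below in Python/numpy (the official solution and submissions were in Matlab, so this is only an illustration of the standard algorithm; the initialization, the variance floor min_var, and the fixed iteration count are choices of this sketch):

import numpy as np

def gmm_em(X, K, n_iter=50, min_var=1e-3, seed=0):
    # EM for an axis-aligned (diagonal-covariance) Gaussian mixture, as in Equation (1).
    n, d = X.shape
    rng = np.random.default_rng(seed)
    p = np.full(K, 1.0 / K)                                 # mixing weights p_1, ..., p_K
    mu = X[rng.choice(n, K, replace=False)].astype(float)   # means initialized at random data points
    var = np.ones((K, d))                                   # per-dimension variances (sigma_j^z)^2
    for _ in range(n_iter):
        # E-step: log responsibilities log P(Z = k | x_i), computed stably in the log domain.
        log_r = (np.log(p)[None, :]
                 - 0.5 * np.sum(np.log(2 * np.pi * var)[None, :, :]
                                + (X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :],
                                axis=2))
        m = log_r.max(axis=1, keepdims=True)
        log_r -= m + np.log(np.exp(log_r - m).sum(axis=1, keepdims=True))
        r = np.exp(log_r)                                   # (n, K) responsibilities
        # M-step: weighted MLE of weights, means, and per-dimension variances.
        Nk = r.sum(axis=0)
        p = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        var = (r.T @ X ** 2) / Nk[:, None] - mu ** 2
        var = np.maximum(var, min_var)                      # keep variances from collapsing to zero
    return p, mu, var, r

For part (b), each row of mu can then be reshaped into a 10×10 image for plotting (in Matlab, with subplot(4, 4, i) as instructed).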

(b) [5 points] Run EM to fit a Gaussian mixture model with 16 Gaussians on the digits data. Plot each of the means using subplot(4, 4, i) to save paper.

(c) [5 points] Evaluating Performance Evaluating clustering performance is difficult. However, because we have information about the ground truth data, we can roughly assess clustering performance. One possible metric is to label each cluster with the majority label for that cluster using the ground truth data. Then, for each point we predict the cluster label and measure the mean 0/1 loss. For the digits data set, report your loss for settings k = 1, 10, and 16.
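The metric described in (c) can be computed along the following lines (again a Python/numpy sketch; cluster_ids would be the argmax over the responsibilities r from the EM sketch above, and Y the ground-truth digits):

import numpy as np

def majority_label_loss(cluster_ids, Y):
    # Mean 0/1 loss after labeling each cluster with its majority ground-truth digit.
    Y = np.asarray(Y).ravel().astype(int)
    errors = 0
    for k in np.unique(cluster_ids):
        labels = Y[cluster_ids == k]
        majority = np.bincount(labels).argmax()  # most common true digit in cluster k
        errors += np.sum(labels != majority)
    return errors / len(Y)

# Usage: cluster_ids = r.argmax(axis=1); print(majority_label_loss(cluster_ids, Y))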
