BAYESIAN LEARNING & GAUSSIAN MIXTURE MODELS
Jianping Fan, Dept of Computer Science, UNC-Charlotte


Slide 1: BAYESIAN LEARNING & GAUSSIAN MIXTURE MODELS (title slide)
Jianping Fan, Dept of Computer Science, UNC-Charlotte.

Slide 2: BASIC CLASSIFICATION
Spam filtering: input is an email message, output is binary (spam vs. not-spam). Character recognition: input is a character image, output is multi-class (e.g. "C" vs. the other 25 characters).

Slide 3: STRUCTURED CLASSIFICATION
Handwriting recognition: structured output. 3D object recognition: labels such as building, tree, brace.

Slide 4: OVERVIEW OF BAYESIAN DECISION
Bayesian classification by example: how to decide whether a patient is sick or healthy, based on a probabilistic model of the observed data (data distributions) and prior knowledge (ratio or importance).

Slide 5: BAYES RULE
Who is who in Bayes' rule: posterior P(h|d) = likelihood P(d|h) x prior P(h) / evidence P(d).

Slide 6: CLASSIFICATION PROBLEM
Training data: examples of the form (d, h(d)), where d are the data objects to classify (inputs) and h(d) ∈ {1, ..., K} is the correct class label for d. Goal: given a new object d_new, provide h(d_new).

Slide 7: WHY BAYESIAN?
Provides practical learning algorithms, e.g. naive Bayes. Prior knowledge and observed data can be combined. It is a generative (model-based) approach, which offers a useful conceptual framework: any kind of object (e.g. sequences) can be classified, based on a probabilistic model specification.

Slide 8: Gaussian Mixture Model (GMM).

Slides 9-10: (figures, no transcript text)

Slide 11: UNIVARIATE NORMAL SAMPLE
Sampling from a univariate normal distribution N(μ, σ²).

Slide 12: MAXIMUM LIKELIHOOD
Given the sample x, the likelihood is a function of μ and σ²; we want to maximize it.

Slide 13: LOG-LIKELIHOOD FUNCTION
Maximize the log-likelihood instead, by setting ∂/∂μ = 0 and ∂/∂σ² = 0.

Slide 14: MAXIMIZING THE LOG-LIKELIHOOD FUNCTION

Slide 15: (derivation figure, no transcript text)

Slide 16: GAUSSIAN MIXTURE MODELS
Rather than identifying clusters by nearest centroids, fit a set of k Gaussians to the data: maximum likelihood over a mixture model.

Slide 17: GMM EXAMPLE (figure)

Slide 18: MIXTURE MODELS
Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution π.

Slide 19: GAUSSIAN MIXTURE MODELS
A GMM is the weighted sum of a number of Gaussians, where the weights are determined by a distribution π: p(x) = Σ_k π_k N(x | μ_k, Σ_k).

Slide 20: EXPECTATION MAXIMIZATION
Both the training of GMMs and of graphical models with latent variables can be accomplished using Expectation Maximization. Step 1, Expectation (E-step): evaluate the responsibilities of each cluster with the current parameters. Step 2, Maximization (M-step): re-estimate the parameters using the existing responsibilities. Similar to k-means training.

Slide 21: LATENT VARIABLE REPRESENTATION
We can represent a GMM with a latent variable z. What does this give us? (TODO: plate notation)

Slide 22: GMM DATA AND LATENT VARIABLES (figure)

Slide 23: ONE LAST BIT
We have representations of the joint p(x, z) and the marginal p(x). The conditional p(z|x) can be derived using Bayes' rule; it is the responsibility that a mixture component takes for explaining an observation x.

Slide 24: MAXIMUM LIKELIHOOD OVER A GMM
As usual: identify a likelihood function and set its partial derivatives to zero.

Slide 25: MAXIMUM LIKELIHOOD OF A GMM
Optimization of the means.

Slide 26: MAXIMUM LIKELIHOOD OF A GMM
Optimization of the covariances. Note the similarity to the regular MLE, apart from the responsibility terms.
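To make slides 18-23 concrete, here is a minimal NumPy/SciPy sketch (not part of the original slides) that evaluates the mixture density p(x) = Σ_k π_k N(x | μ_k, Σ_k) and the responsibilities γ(z_k) via Bayes' rule; the parameter values are illustrative placeholders.

# Sketch: GMM density and responsibilities (E-step quantities).
# The mixture parameters below are illustrative placeholders, not values from the slides.
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.5, 0.3, 0.2])                        # mixing coefficients pi_k
mus = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])  # means mu_k
covs = [np.eye(2), 2.0 * np.eye(2), np.eye(2)]         # covariances Sigma_k

def gmm_density(x, pis, mus, covs):
    # p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for pi, mu, cov in zip(pis, mus, covs))

def responsibilities(x, pis, mus, covs):
    # gamma(z_k) = pi_k N(x | mu_k, Sigma_k) / sum_j pi_j N(x | mu_j, Sigma_j)
    weighted = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
                         for pi, mu, cov in zip(pis, mus, covs)])
    return weighted / weighted.sum()

x = np.array([2.5, 2.0])
print(gmm_density(x, pis, mus, covs))       # marginal p(x)
print(responsibilities(x, pis, mus, covs))  # posteriors p(z_k = 1 | x); they sum to 1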
Slide 27: MAXIMUM LIKELIHOOD OF A GMM
Optimization of the mixing terms.

Slide 28: MLE OF A GMM (summary figure)

Slide 29: EM FOR GMMs
Initialize the parameters. Evaluate the log-likelihood. Expectation step (E-step): evaluate the responsibilities. Maximization step (M-step): re-estimate the parameters. Evaluate the log-likelihood and check for convergence.

Slide 30: EM FOR GMMs
E-step: evaluate the responsibilities.

Slide 31: EM FOR GMMs
M-step: re-estimate the parameters.

Slide 32: EM FOR GMMs
Evaluate the log-likelihood and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to the E-step.

Slide 33: VISUAL EXAMPLE OF EM (figure)

Slide 34: INCORRECT NUMBER OF GAUSSIANS (figure)

Slide 35: (figure, no transcript text)

Slide 36: TOO MANY PARAMETERS
For n-dimensional data, we need to estimate at least 3n parameters.

Slide 37: RELATIONSHIP TO K-MEANS
K-means makes hard decisions: each data point gets assigned to a single cluster. GMM/EM makes soft decisions: each data point yields a posterior p(z|x). Soft k-means is a special case of EM.

Slide 38: SOFT K-MEANS AS GMM/EM
Assume an equal, shared covariance matrix εI for every mixture component; the likelihood and responsibilities then simplify. As ε approaches zero, the responsibility of the nearest component approaches unity.

Slide 39: SOFT K-MEANS AS GMM/EM
Overall log-likelihood as ε approaches zero: the expectation in soft k-means is the intra-cluster variability (the k-means distortion). Note: only the means are re-estimated in soft k-means; the covariance matrices are all tied.

Slide 40: GENERAL FORM OF EM
Given a joint distribution over observed and latent variables, we want to maximize the likelihood of the observed data.
1. Initialize the parameters.
2. E-step: evaluate the posterior over the latent variables given the current parameters.
3. M-step: re-estimate the parameters (based on the expectation of the complete-data log-likelihood).
4. Check for convergence of the parameters or the likelihood.

Slide 41: GAUSSIAN MIXTURE MODEL (figure)

Slide 42: The conditional probability p(z_k = 1 | x), denoted γ(z_k), is obtained by Bayes' theorem:
γ(z_k) = π_k N(x | μ_k, Σ_k) / Σ_j π_j N(x | μ_j, Σ_j).
We view π_k as the prior probability of z_k = 1, and γ(z_k) as the posterior probability. γ(z_k) is the responsibility that component k takes for explaining the observation x.

Slide 43: MAXIMUM LIKELIHOOD ESTIMATION
Given a data set X = {x_1, ..., x_N}, the log of the likelihood function is
ln p(X | π, μ, Σ) = Σ_n ln Σ_k π_k N(x_n | μ_k, Σ_k), subject to Σ_k π_k = 1.

Slide 44: Setting the derivative of ln p(X | π, μ, Σ) with respect to μ_k to zero, we obtain
μ_k = (1/N_k) Σ_n γ(z_nk) x_n, with N_k = Σ_n γ(z_nk).

Slide 45: Finally, we maximize ln p(X | π, μ, Σ) with respect to the mixing coefficients π_k. This is a constrained problem (objective function f(x), constraint g(x): max f(x) s.t. g(x) = 0), so we use a Lagrange multiplier λ and maximize ln p(X | π, μ, Σ) + λ(Σ_k π_k - 1).

Slide 46: which gives λ = -N, and hence π_k = N_k / N.

Slide 47:
1. Initialize the means μ_k, covariances Σ_k and mixing coefficients π_k.
2. E-step: evaluate the responsibilities using the current parameter values.
3. M-step: re-estimate the parameters using the current responsibilities.
4. Evaluate the log-likelihood and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to step 2.

Slide 48: (figure, no transcript text)

Slide 49: SOURCE CODE FOR GMM
1. EPFL: http://lasa.epfl.ch/sourcecode/
2. Google Code project on GMM: http://code.google.com/p/gmmreg/
3. GMM & EM: http://crsouza.blogspot.com/2010/10/gaussian-mixture-models-and-expectation.html
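As a companion to the EM algorithm summarized on slides 29-32 and 47, here is a minimal NumPy/SciPy sketch of the full EM loop for a GMM; it is not taken from the slides, and the initialization scheme, the small regularization term and the convergence tolerance are illustrative assumptions.

# Sketch of EM for a GMM (cf. slides 29-32 and 47).
# Initialization, the 1e-6 regularizer and the tolerance are illustrative choices.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialize means, covariances and mixing coefficients.
    mus = X[rng.choice(N, size=K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: responsibilities gamma(z_nk) via Bayes' rule.
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=covs[k])
                                for k in range(K)])              # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: re-estimate parameters with the current responsibilities.
        Nk = gamma.sum(axis=0)                                   # effective counts N_k
        mus = (gamma.T @ X) / Nk[:, None]                        # mu_k = (1/N_k) sum_n gamma_nk x_n
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pis = Nk / N                                             # pi_k = N_k / N
        # 4. Log-likelihood under the parameters used in this E-step; check convergence.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pis, mus, covs, gamma

For instance, pis, mus, covs, gamma = em_gmm(X, K=3) fits a 3-component mixture to a data matrix X of shape (N, D).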
Slide 50: PROBABILITIES (auxiliary slide for memory refreshing)
Have two dice, h_1 and h_2. The probability of rolling an i given die h_1 is denoted P(i|h_1); this is a conditional probability. Pick a die at random with probability P(h_j), j = 1 or 2. The probability of picking die h_j and rolling an i with it is called the joint probability and is P(i, h_j) = P(h_j) P(i|h_j). For any events X and Y, P(X, Y) = P(X|Y) P(Y). If we know P(X, Y), then the so-called marginal probability P(X) can be computed as P(X) = Σ_Y P(X, Y).

Slide 51: DOES THE PATIENT HAVE CANCER OR NOT?
A patient takes a lab test and the result comes back positive. It is known that the test returns a correct positive result in only 98% of the cases and a correct negative result in only 97% of the cases. Furthermore, only 0.008 of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the diagnosis?

Slide 52: (figure, no transcript text)

Slide 53: CHOOSING HYPOTHESES
Maximum likelihood hypothesis: h_ML = argmax_h P(d|h). Generally we want the most probable hypothesis given the training data; this is the maximum a posteriori hypothesis: h_MAP = argmax_h P(h|d) = argmax_h P(d|h) P(h). Useful observation: it does not depend on the denominator P(d).

Slide 54: NOW WE COMPUTE THE DIAGNOSIS
To find the maximum likelihood hypothesis, we evaluate P(d|h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximizes it. To find the maximum a posteriori hypothesis, we evaluate P(d|h) P(h) for the data d and choose the hypothesis (diagnosis) that maximizes it. This is the same as choosing the hypothesis with the higher posterior probability. (A worked sketch of this calculation appears after these slide notes.)

Slide 55: NAIVE BAYES CLASSIFIER
What can we do if our data d has several attributes? The naive Bayes assumption: attributes that describe data instances are conditionally independent given the classification hypothesis. It is a simplifying assumption that may obviously be violated in reality; in spite of that, it works well in practice. The Bayesian classifier that uses the naive Bayes assumption and computes the MAP hypothesis is called the naive Bayes classifier. It is one of the most practical learning methods. Successful applications: medical diagnosis, text classification.

Slide 56: EXAMPLE: PLAY TENNIS DATA (table)

Slide 57: NAIVE BAYES SOLUTION
Classify any new datum instance x = (a_1, ..., a_T) as h_NB = argmax_h P(h) Π_t P(a_t|h). To do this based on training examples, we need to estimate the parameters from the training examples: P(h) for each target value (hypothesis) h, and P(a_t|h) for each attribute value a_t of each datum instance.

Slide 58: Based on the examples in the table, classify the following datum x:
x = (Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong).
That means: play tennis or not? Working:

Slide 59: LEARNING TO CLASSIFY TEXT
Learn from examples which articles are of interest. The attributes are the words. Observe that the naive Bayes assumption just means that we have a random sequence model within each class! NB classifiers are one of the most effective for this task. Resources for those interested: Tom Mitchell, Machine Learning (book), Chapter 6.

Slide 60: RESULTS ON A BENCHMARK TEXT CORPUS (figure)

Slide 61: REMEMBER
Bayes' rule can be turned into a classifier. Maximum a posteriori (MAP) hypothesis estimation incorporates prior knowledge; maximum likelihood doesn't. The naive Bayes classifier is a simple but effective Bayesian classifier for vector data (i.e. data with several attributes) that assumes the attributes are independent given the class. Bayesian classification is a generative approach to classification.

Slide 62: RESOURCES
Textbook reading (contains details about using naive Bayes for text classification): Tom Mitchell, Machine Learning (book), Chapter 6.
Software, NB for classifying text: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Useful reading for those interested to learn more about NB classification, beyond the scope of this module: http://www-2.cs.cmu.edu/~tom/NewChapters.html
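To make the diagnosis on slides 51-54 concrete, here is a small worked sketch (not from the slides) that applies the MAP rule to the numbers given on slide 51; the variable names are illustrative.

# Worked MAP calculation for the cancer-test example (slides 51-54).
# The probabilities come from slide 51; variable names are illustrative.
p_cancer = 0.008                    # prior P(cancer)
p_not = 1 - p_cancer                # prior P(not cancer) = 0.992
p_pos_given_cancer = 0.98           # P(+ | cancer): correct positive rate
p_pos_given_not = 1 - 0.97          # P(+ | not cancer) = 0.03: false positive rate

# Unnormalized posteriors P(+|h) P(h) for each hypothesis h.
score_cancer = p_pos_given_cancer * p_cancer    # 0.98 * 0.008 = 0.00784
score_not = p_pos_given_not * p_not             # 0.03 * 0.992 = 0.02976

# Normalized posteriors P(h | +).
total = score_cancer + score_not
print(round(score_cancer / total, 2))           # P(cancer | +) is about 0.21
print(round(score_not / total, 2))              # P(not cancer | +) is about 0.79

# MAP diagnosis: the hypothesis maximizing P(+|h) P(h).
print("cancer" if score_cancer > score_not else "not cancer")   # prints "not cancer"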