3. Generative Algorithms, Machine Learning
Generative Algorithms
By : Shedriko
Outline & Content

- Preliminary
- GDA (Gaussian Discriminant Analysis)
- NB (Naive Bayes)
Preliminary

We learned in the previous chapter about algorithms that try to learn $p(y|x)$ directly (such as logistic regression), or that try to learn mappings directly from the space of inputs $X$ to the labels $\{0, 1\}$ (such as the perceptron algorithm). These are called discriminative learning algorithms.
Preliminary

Here, we'll talk about algorithms that instead try to model $p(x|y)$ and $p(y)$. These are called generative learning algorithms. For example: if $y$ indicates whether an example is a dog ($y = 0$) or an elephant ($y = 1$), then $p(x|y = 0)$ models the distribution of dogs' features, and $p(x|y = 1)$ models the distribution of elephants' features.
Preliminary

After modeling $p(y)$ (called the class priors) and $p(x|y)$, our algorithm can use Bayes rule to derive the posterior distribution on $y$ given $x$:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$$

Here the denominator is given by $p(x) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0)$. If we are calculating $p(y|x)$ only in order to make a prediction, we don't need to calculate the denominator, since

$$\arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)\,p(y)}{p(x)} = \arg\max_y p(x|y)\,p(y)$$
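The Bayes-rule step above can be sketched numerically. The priors and likelihood values below are made-up numbers for a single example $x$, not from the text:

```python
# Made-up class priors p(y) and class-conditional likelihoods p(x|y)
# evaluated at one example x.
p_y = {0: 0.7, 1: 0.3}
p_x_given_y = {0: 0.05, 1: 0.20}

# Denominator: p(x) = sum over y of p(x|y) p(y)
p_x = sum(p_x_given_y[y] * p_y[y] for y in (0, 1))

# Full posterior p(y=1|x) via Bayes rule
posterior_1 = p_x_given_y[1] * p_y[1] / p_x

# For prediction we can skip the denominator: arg max_y p(x|y) p(y)
prediction = max((0, 1), key=lambda y: p_x_given_y[y] * p_y[y])
```

Note that `prediction` needs only the two numerators; the denominator is a shared constant that cannot change which class wins.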
GDA (Gaussian)

The multivariate normal distribution. In this model, we'll assume that $p(x|y)$ is distributed according to a multivariate normal distribution. The multivariate normal distribution in $n$ dimensions (also called the multivariate Gaussian distribution) is parameterized by a mean vector $\mu \in \mathbb{R}^n$ and a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$, where $\Sigma$ is symmetric and positive semi-definite. Also written $\mathcal{N}(\mu, \Sigma)$, its density is given by:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
GDA (Gaussian)
$|\Sigma|$ denotes the determinant of the matrix $\Sigma$. For a random variable $X$ distributed $\mathcal{N}(\mu, \Sigma)$, the mean is given by:

$$E[X] = \int_x x\, p(x; \mu, \Sigma)\, dx = \mu$$

The covariance of a vector-valued random variable $Z$ is defined as $\mathrm{Cov}(Z) = E[(Z - E[Z])(Z - E[Z])^T]$, or equivalently $\mathrm{Cov}(Z) = E[ZZ^T] - (E[Z])(E[Z])^T$. If $X \sim \mathcal{N}(\mu, \Sigma)$, then $\mathrm{Cov}(X) = \Sigma$.
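The density formula above can be evaluated directly. This is a minimal sketch with made-up 2-D parameters (zero mean, identity covariance, i.e. the standard normal):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Density of the n-dimensional Gaussian N(mu, Sigma) at point x."""
    n = mu.shape[0]
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.zeros(2)
Sigma = np.eye(2)  # standard normal: zero mean, identity covariance

# The density peaks at the mean; for n=2 and Sigma=I the peak is 1/(2*pi)
p_at_mean = gaussian_density(mu, mu, Sigma)
```
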
GDA (Gaussian)
Examples of what the density of a Gaussian distribution looks like:

The left-most figure shows a Gaussian with mean zero (that is, the 2×1 zero vector) and covariance matrix $\Sigma = I$ (the 2×2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution.
GDA (Gaussian)
The middle figure shows the density of a Gaussian with zero mean and $\Sigma = 0.6I$; the right-most figure shows one with $\Sigma = 2I$. We see that as $\Sigma$ becomes larger, the Gaussian becomes more "spread out".
GDA (Gaussian)
Here are some more examples:

The figures above show Gaussians with mean 0 and with covariance matrices $\begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}$, $\begin{bmatrix}1 & 0.5\\ 0.5 & 1\end{bmatrix}$, and $\begin{bmatrix}1 & 0.8\\ 0.8 & 1\end{bmatrix}$, respectively.
GDA (Gaussian)
The left-most figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in $\Sigma$, the density becomes more "compressed" towards the 45° line (given by $x_1 = x_2$), as can be seen in the contour plots of those three densities.
GDA (Gaussian)
Here is the last set of examples generated by varying $\Sigma$:

The plots above used, respectively, $\Sigma = \begin{bmatrix}1 & -0.5\\ -0.5 & 1\end{bmatrix}$, $\begin{bmatrix}1 & -0.8\\ -0.8 & 1\end{bmatrix}$, and $\begin{bmatrix}3 & 0.8\\ 0.8 & 1\end{bmatrix}$.
GDA (Gaussian)
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density becomes "compressed" again, but in the opposite direction. Lastly, as we vary the parameters, the contours will more generally form ellipses (the rightmost figure shows an example).
GDA (Gaussian)
Fixing $\Sigma = I$, by varying $\mu$ we can also move the mean of the density around; the plots above show densities with identity covariance and different values of $\mu$.
GDA (Gaussian)
The GDA model. When we have a classification problem in which the input features $x$ are continuous-valued random variables, we can use the GDA model. It models $p(x|y)$ using a multivariate normal distribution; the model is:

$$y \sim \mathrm{Bernoulli}(\phi)$$
$$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$$
GDA (Gaussian)
Writing out the distributions:

$$p(y) = \phi^y (1 - \phi)^{1 - y}$$
$$p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$$
$$p(x \mid y = 1) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$$

The parameters of the model are $\phi$, $\mu_0$, $\mu_1$, and $\Sigma$ (there are two different mean vectors $\mu_0$ and $\mu_1$, but only one covariance matrix $\Sigma$).
GDA (Gaussian)
The log-likelihood of the data is given by

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma)$$

By maximizing $\ell$, we find the maximum likelihood estimates of the parameters:

$$\phi = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = 1\}$$
$$\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}, \qquad \mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}$$
$$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$$
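The maximum likelihood estimates above are just counts and averages, so they are easy to sketch in code. The tiny 2-D dataset below is made up for illustration:

```python
import numpy as np

# Made-up training set: four 2-D examples, two per class
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([0, 0, 1, 1])
m = len(y)

phi = np.mean(y == 1)          # phi = (1/m) sum 1{y=1}
mu0 = X[y == 0].mean(axis=0)   # mean of the class-0 examples
mu1 = X[y == 1].mean(axis=0)   # mean of the class-1 examples

# Shared covariance: average outer product of each example's residual
# about the mean of its own class
mus = np.where((y == 1)[:, None], mu1, mu0)
Sigma = (X - mus).T @ (X - mus) / m
```
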
GDA (Gaussian)
Example of two Gaussians with different mean vectors $\mu_0$ and $\mu_1$ but a single shared covariance matrix $\Sigma$. Also shown in the figure is the straight line giving the decision boundary at which $p(y = 1 \mid x) = 0.5$. On one side of the boundary we'll predict $y = 1$, and on the other side $y = 0$.
GDA (Gaussian)
Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression. If we view the quantity $p(y = 1 \mid x; \phi, \mu_0, \mu_1, \Sigma)$ as a function of $x$, we'll find that it can be expressed in the form

$$p(y = 1 \mid x; \phi, \mu_0, \mu_1, \Sigma) = \frac{1}{1 + \exp(-\theta^T x)}$$

where $\theta$ is some appropriate function of $\phi, \Sigma, \mu_0, \mu_1$. (This is exactly the form that logistic regression, a discriminative algorithm, uses to model $p(y = 1 \mid x)$.)
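This claim can be checked numerically. The sketch below uses made-up GDA parameters, computes the posterior directly via Bayes rule, and compares it with a logistic function; here the intercept is written as a separate $\theta_0$ rather than folding a constant feature into $x$ (an assumption of this sketch, and the closed form for $\theta$ below is the standard shared-covariance derivation, not spelled out in the slides):

```python
import numpy as np

# Made-up GDA parameters
phi = 0.4
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])

def density(x, mu):
    """Gaussian density N(mu, Sigma) at x."""
    d = x - mu
    n = len(mu)
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (
        (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

x = np.array([1.0, -0.5])

# Posterior via Bayes rule
posterior = phi * density(x, mu1) / (
    phi * density(x, mu1) + (1 - phi) * density(x, mu0))

# Logistic form: theta, theta0 as functions of (phi, Sigma, mu0, mu1)
Sinv = np.linalg.inv(Sigma)
theta = Sinv @ (mu1 - mu0)
theta0 = (-0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu0 @ Sinv @ mu0
          + np.log(phi / (1 - phi)))
logistic = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))
```

The quadratic terms $x^T \Sigma^{-1} x$ cancel between the two classes because $\Sigma$ is shared, which is exactly why the posterior collapses to a linear-in-$x$ sigmoid.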
GDA (Gaussian)
Which one is better?

GDA:
- makes stronger modeling assumptions
- is more data efficient (requires less training data to learn "well") when those assumptions are correct or approximately correct

Logistic regression:
- makes weaker assumptions, and is more robust to deviations from the modeling assumptions
- when the data is indeed non-Gaussian, then in the limit of large datasets it will do better than GDA

For this last reason, in practice logistic regression is used more often than GDA.
NB (Naive Bayes)
In GDA, the feature vector $x$ was continuous; in Naive Bayes, the $x_i$'s are discrete-valued. For example: to classify whether an email is unsolicited commercial (spam) email or non-spam email.

We begin our spam filter by specifying the features $x_i$ used to represent an email. We'll represent an email via a feature vector whose length is equal to the number of words in the dictionary: if an email contains the $i$-th word of the dictionary, we set $x_i = 1$; otherwise we set $x_i = 0$.
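The encoding just described can be sketched with a toy six-word "dictionary" standing in for the full vocabulary (the words and the email below are made up):

```python
# Toy dictionary; in the text this would be the full 50,000-word vocabulary
vocabulary = ["a", "aardvark", "buy", "nips", "price", "zygmurgy"]

def email_to_features(email_words):
    """x_i = 1 if the i-th dictionary word appears in the email, else 0."""
    present = set(email_words)
    return [1 if word in present else 0 for word in vocabulary]

# "now" is out-of-vocabulary, so it simply does not appear in x
x = email_to_features(["buy", "now", "buy", "price"])
```

Note that this Bernoulli encoding records only presence/absence; repeating "buy" twice changes nothing (the multinomial event model at the end of this chapter does count repetitions).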
NB (Naive Bayes)
For instance, an email containing the words "a" and "buy" (but not "aardvark", "aardwolf", or "zygmurgy") would be encoded as a vector with 1's in the positions of "a" and "buy" and 0's everywhere else.

The set of words encoded into the feature vector is called the vocabulary, so the dimension of $x$ is equal to the size of the vocabulary. Then, we build a generative model. If, say, the vocabulary has 50,000 words, then $x \in \{0, 1\}^{50000}$ ($x$ is a 50,000-dimensional vector of 0's and 1's). If we modeled $x$ explicitly with a multinomial distribution over its $2^{50000}$ possible outcomes, we would need a $(2^{50000} - 1)$-dimensional parameter vector (far too many parameters).
NB (Naive Bayes)
To model $p(x|y)$, we therefore make a very strong assumption. Assume the $x_i$'s are conditionally independent given $y$ (this is called the NB assumption, and the resulting algorithm the NB classifier). For instance: if I tell you $y = 1$ (spam email), then knowing $x_{2087}$ (whether the word "buy" appears in the message) tells you nothing about $x_{39831}$ (whether the word "price" appears). This can be written:

$$p(x_{2087} \mid y) = p(x_{2087} \mid y, x_{39831})$$

i.e., $x_{2087}$ and $x_{39831}$ are conditionally independent given $y$.
NB (Naive Bayes)
We now have:

$$p(x_1, \ldots, x_{50000} \mid y) = p(x_1 \mid y)\, p(x_2 \mid y, x_1) \cdots p(x_{50000} \mid y, x_1, \ldots, x_{49999}) = \prod_{i=1}^{50000} p(x_i \mid y)$$

The first equality follows from the usual chain rule of probabilities; the second uses the NB assumption (an extremely strong assumption that nonetheless works well on many problems).
NB (Naive Bayes)
Our model is parameterized by $\phi_{i|y=1} = p(x_i = 1 \mid y = 1)$, $\phi_{i|y=0} = p(x_i = 1 \mid y = 0)$, and $\phi_y = p(y = 1)$. As usual, given a training set $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$, the joint likelihood of the data is:

$$\mathcal{L}(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^{m} p(x^{(i)}, y^{(i)})$$

Maximizing this with respect to $\phi_y$, $\phi_{j|y=0}$, and $\phi_{j|y=1}$ gives the maximum likelihood estimates:

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}, \qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}$$

(The $\wedge$ symbol means "and", so e.g. $\phi_{j|y=1}$ is the fraction of spam emails in which word $j$ appears.)
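These estimates are just counting, which a short sketch makes concrete. The tiny training set below (three-word vocabulary, two spam and two non-spam emails) is made up:

```python
# Rows: binary word-presence vectors; y marks spam (1) / non-spam (0)
X = [[1, 1, 0],   # spam
     [1, 0, 1],   # spam
     [0, 1, 0],   # non-spam
     [0, 0, 1]]   # non-spam
y = [1, 1, 0, 0]
m, n = len(X), len(X[0])

phi_y = sum(y) / m  # fraction of spam emails

# phi_{j|y=1}: fraction of spam emails containing word j (similarly y=0)
spam = [X[i] for i in range(m) if y[i] == 1]
ham = [X[i] for i in range(m) if y[i] == 0]
phi_j_y1 = [sum(row[j] for row in spam) / len(spam) for j in range(n)]
phi_j_y0 = [sum(row[j] for row in ham) / len(ham) for j in range(n)]
```
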
NB (Naive Bayes)
To make a prediction on a new example with features $x$, we compute

$$p(y = 1 \mid x) = \frac{\left(\prod_{i=1}^{n} p(x_i \mid y = 1)\right) p(y = 1)}{\left(\prod_{i=1}^{n} p(x_i \mid y = 1)\right) p(y = 1) + \left(\prod_{i=1}^{n} p(x_i \mid y = 0)\right) p(y = 0)}$$

and pick whichever class has the higher posterior probability. We have used NB with binary-valued features $x_i$; for $x_i$ taking values in $\{1, 2, \ldots, k_i\}$, we simply model $p(x_i \mid y)$ as multinomial rather than as Bernoulli. For example: if the input is the living area of a house (continuous-valued), we can discretize it into a small number of bins and then apply NB.
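The prediction rule above can be sketched with already-estimated parameters; the $\phi$ values below are made up for illustration, not taken from the text:

```python
# Made-up, already-estimated NB parameters for a 3-word vocabulary
phi_y = 0.5
phi_j_y1 = [0.8, 0.6, 0.1]   # p(x_j = 1 | y = 1)
phi_j_y0 = [0.1, 0.5, 0.4]   # p(x_j = 1 | y = 0)

def joint(x, phis, prior):
    """p(x|y) p(y) under the NB factorization, for Bernoulli features."""
    p = prior
    for xj, phij in zip(x, phis):
        p *= phij if xj == 1 else (1 - phij)
    return p

x = [1, 1, 0]
p1 = joint(x, phi_j_y1, phi_y)        # p(x|y=1) p(y=1)
p0 = joint(x, phi_j_y0, 1 - phi_y)    # p(x|y=0) p(y=0)

posterior_spam = p1 / (p1 + p0)       # full Bayes-rule posterior
prediction = 1 if p1 > p0 else 0      # arg max skips the denominator
```
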
NB (Naive Bayes)
Laplace smoothing. When the original continuous-valued attributes are not well-modeled by a multivariate normal distribution, discretizing the features and using NB (instead of GDA) will often result in a better classifier.

A simple change also makes the NB algorithm work much better, especially for text classification. For example: suppose you have never received an email containing the word "NIPS" before, and it is the 35,000th word in the dictionary. Your NB spam filter will therefore have picked its maximum likelihood estimates of the corresponding parameters to be:

$$\phi_{35000|y=1} = \frac{\sum_{i=1}^{m} 1\{x_{35000}^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}} = 0, \qquad \phi_{35000|y=0} = \frac{\sum_{i=1}^{m} 1\{x_{35000}^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}} = 0$$
NB (Naive Bayes)
i.e., because "nips" has never appeared in any training example (either spam or non-spam), both estimates are zero. When trying to decide whether a new message containing "nips" is spam, the classifier calculates the class posterior probabilities and obtains

$$p(y = 1 \mid x) = \frac{\prod_{i=1}^{n} p(x_i \mid y = 1)\, p(y = 1)}{\prod_{i=1}^{n} p(x_i \mid y = 1)\, p(y = 1) + \prod_{i=1}^{n} p(x_i \mid y = 0)\, p(y = 0)} = \frac{0}{0}$$
NB (Naive Bayes)
The result is $0/0$ because each of the products $\prod_i p(x_i \mid y)$ includes the term $p(x_{35000} \mid y) = 0$. To state the fix generally: consider estimating the mean of a multinomial random variable $z$ taking values in $\{1, \ldots, k\}$, parameterized by $\phi_j = p(z = j)$. Given $m$ independent observations, the maximum likelihood estimates are

$$\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}{m}$$

Some of these $\phi_j$ might end up as zero (which was the problem). To avoid this, we use Laplace smoothing:

$$\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} + 1}{m + k}$$
NB (Naive Bayes)
Returning to our NB classifier, with Laplace smoothing (here each $x_j$ is binary, so $k = 2$) to avoid zero estimates, we have:

$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} + 2}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} + 2}$$
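A minimal numeric sketch of the smoothing fix, with made-up counts (a word that never appears in any of 40 spam emails):

```python
# Made-up counts: word never seen in spam
count_word_in_spam = 0
num_spam = 40

# Unsmoothed MLE is exactly zero, which zeroes out the whole
# product p(x|y=1) for any email containing this word
mle = count_word_in_spam / num_spam

# Laplace smoothing for a binary feature (k = 2 outcomes):
# add 1 to the numerator and k to the denominator
smoothed = (count_word_in_spam + 1) / (num_spam + 2)
```
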
NB (Naive Bayes)
Event models for text classification. The NB classifier as presented so far uses what is called the multi-variate Bernoulli event model. In this model, an email is generated by first randomly determining whether it is spam or non-spam (according to the class prior $p(y)$); then the sender runs through the dictionary, deciding independently for each word whether to include it in the email, according to $p(x_i = 1 \mid y) = \phi_{i|y}$. Thus the probability of a message is given by

$$p(y) \prod_{i=1}^{n} p(x_i \mid y)$$
NB (Naive Bayes)
Here is a different, usually better, model, called the multinomial event model (MEM). We let $x_i$ denote the identity of the $i$-th word in the email, so $x_i$ takes values in $\{1, \ldots, |V|\}$, where $|V|$ is the size of our vocabulary. For example, if an email starts with "A NIPS …":

$x_1 = 1$ ("a" is the first word in the dictionary)

$x_2 = 35000$ (if "nips" is the 35,000th word in the dictionary)
NB (Naive Bayes)
In the MEM, we assume an email is generated via a random process in which spam/non-spam is first determined (according to $p(y)$, as before). Then the sender generates $x_1, x_2, x_3, \ldots$ from some multinomial distribution over words ($p(x_i \mid y)$). The overall probability of a message is given by

$$p(y) \prod_{i=1}^{n} p(x_i \mid y)$$

This formula looks the same as before, but now each $p(x_i \mid y)$ is a multinomial over words rather than a Bernoulli, and $n$ is the number of words in that email. The parameters are $\phi_y = p(y = 1)$, $\phi_{k|y=1} = p(x_j = k \mid y = 1)$ (for any $j$), and $\phi_{k|y=0} = p(x_j = k \mid y = 0)$.
NB (Naive Bayes)
Given a training set $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$, where $x^{(i)} = (x_1^{(i)}, \ldots, x_{n_i}^{(i)})$ is an email of $n_i$ words, the likelihood of the data is given by

$$\mathcal{L}(\phi_y, \phi_{k|y=0}, \phi_{k|y=1}) = \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}) = \prod_{i=1}^{m} \left(\prod_{j=1}^{n_i} p(x_j^{(i)} \mid y^{(i)})\right) p(y^{(i)})$$

Maximizing this yields the maximum likelihood estimates:

$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i}, \qquad \phi_{k|y=0} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 0\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_i}, \qquad \phi_y = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m}$$
NB (Naive Bayes)
Applying Laplace smoothing (adding 1 to the numerators and $|V|$ to the denominators, since each $x_j$ has $|V|$ possible values), the estimates become:

$$\phi_{k|y=1} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n_i + |V|}, \qquad \phi_{k|y=0} = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n_i + |V|}$$
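The smoothed MEM estimate is again just token counting. The sketch below computes $\phi_{k|y=1}$ only, on a made-up corpus of two short emails over a toy vocabulary of size $|V| = 4$:

```python
# Toy corpus: emails are lists of word indices into a vocabulary of size 4
V = 4
emails = [[0, 2, 2], [2, 3]]   # word-index sequences (made up)
labels = [1, 0]                # spam / non-spam

# Pool all tokens from spam emails
spam_tokens = [k for e, y in zip(emails, labels) if y == 1 for k in e]

# phi_{k|y=1} = (count of token k in spam + 1) / (spam token total + |V|)
phi_k_y1 = [(spam_tokens.count(k) + 1) / (len(spam_tokens) + V)
            for k in range(V)]
```

Note that the smoothed estimates still sum to 1 over the vocabulary, and every word, even one never seen in spam, gets a strictly positive probability.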
Thank you….