Machine Learning with Applications in Categorization, Popularity and Sequence Labeling
(linear models, decision trees, ensemble methods, evaluation)
Dr. Nicolas Nicolov<[email protected]>
2
Goals
• Introduce important ML concepts
• Illustrate ML techniques through examples in:
  – Categorization
  – Popularity
  – Sequence labeling
(tutorial aims to be self-contained and to explain the notation)
3
Outline
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
4
EXAMPLES OF MACHINE LEARNING
Why? – Get a flavor of the diversity of areas where ML is applied.
5
Sequence Labeling
George W. Bush discussed Iraq
PER  PER  PER  _  GPE      (per-token labels; GPE = Geo-Political Entity)
<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
(like search query analysis)
6
Spam
www.dietsthatwork.com
www . dietsthatwork . com
www . diets that work . com
SPAM!
further segmentation
classification
7
Tokenization
What!?I love the iphone:-)
What !? I love the iphone :-)
How difficult can that be? — 98.2% [Zhang et al. 2003]
NO TRESSPASSING VIOLATORS WILL BE PROSECUTED
8
NL Parsing
Unlike my sluggish Chevy the Audi handles the winding mountain roads superbly
[Figure: the syntactic structure of the sentence, with dependency labels PREP, POSS, MOD, DET, SUBJ, MANR, DOBJ, CONTR]
9
State Transitions
[Figure: parser state transitions over a stack (λ) and input buffer (β)]
LEFTARC:
RIGHTARC:
NOARC:
SHIFT:
using ML to make the decision which action to take
10
Two Ladies in a Men’s Club
11
We serve men      (ambiguous)
serve —IndirectObject→ men, as in: We serve food to men. We serve our community.
serve —DirectObject→ men, as in: We serve organic food. We serve coffee to connoisseurs.
12
Audi is an automaker that makes luxury cars and SUVs. The company was born in Germany. It was established by August Horch in 1910. Horch had previously founded another company and his models were quite popular. Audi started with four cylinder models. By 1914, Horch's new cars were racing and winning. August Horch left the Audi company in 1920 to take a position as an industry representative for the German motor vehicle industry federation. Currently Audi is a subsidiary of the Volkswagen group and produces cars of outstanding quality.
Coreference
13
Parts of Objects (Meronymy)
[…] the interior seems upscale with leatherette upholstery that looks and feels better than the real cow hide found in more expensive vehicles, a dashboard accented by textured soft-touch materials, a woven mesh
headliner, and other materials that give the New Beetle’s interior a sense of quality. […] Finally, and a big plus in my book, both front seats were height adjustable, and the steering column tilted and telescoped for optimum comfort.
14
Sentiment Analysis
I love pineapple nearly as much as I hate bananas.
POSITIVE sentiment regarding topic pineapple.
[Figure: posts about the topic Xbox classified as Positive / Negative / Neutral]
15
Chinese Sentiment
Car aspects Sentiment categories
Sentence
16
17
18
Categorization
• High-level task: – Given a restaurant, what is its sub-category?
• Encoding entities with features• Feature selection• Linear models
non-standard order
“Though this be madness, yet there is method in't.”
19
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
20
ENCODING OBJECTS WITH FEATURES
Why? – ML algorithms are "generic"; most of them are cast as solutions around vector encodings of the domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as feature vectors. How well we do this (the quality of the features) directly impacts system performance.
21
FlatObject
Encoding
37 | 1 0 0 1 1 1 0 1 …
37 = target class index (for "asian"); 1/0 = feature values (binary in this example).
Machine learning (training) instance/example/observation.
Default feature: always on.
Name: has "asian bistro"
Description has “china”
Description has “indonesia”
has FB page
Name: has “restaurant”
Name: has “ginger”
Can be a set; an object can belong to several classes.
URL has "french"
The number of features can be millions.
22
Structured Objects to Strings
to Features
a b c d e
Structured object:
f1
f2
f3
f4
f5
f6
uni-grams: "f2:f4>a", "f2:f4>b", "f2:f4>c", …
bi-grams: "f2:f4>a_b", "f2:f4>b_c", "f2:f4>c_d", …
tri-grams: "f2:f4>a_b_c", "f2:f4>b_c_d", …
Feature string Feature index
*DEFAULT* 0
… …
f2:f4>a 100
f2:f4>b 101
f2:f4>c 102
… …
f2:f4>a_b 105
f2:f4>b_c 106
f2:f4>c_d 107
… …
f2:f4>a_b_c 109
Read as field “f2:f4” contains feature “a”.
Table can be quite large.
23
Sliding Window (bi-grams)
SkyCity at the Space Needle
^ SkyCity at the Space Needle $      (add initial "^" and final "$" tokens)
[Figure: a sliding window moves over the token sequence, emitting the bi-grams ^_SkyCity, SkyCity_at, at_the, the_Space, Space_Needle, Needle_$]
24
Example: Feature Templates

public static List<string> NGrams( string field )
{
    var features = new List<string>();
    string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries );
    features.Add( string.Join( "", field.Split( SPLIT_CHARS ) ) );  // the entire field as one feature
    string unigram = string.Empty, bigram = "^", trigram;
    string previous1 = "^", previous2 = "^";
    for (int i = 0; i < tokens.Length; i++)
    {
        unigram = tokens[ i ];
        features.Add( unigram );
        bigram = previous1 + "_" + unigram;        // initial bi-gram is "^_tokens[0]"
        features.Add( bigram );
        if ( i >= 1 )
        {
            trigram = previous2 + "_" + bigram;    // initial tri-gram is "^_tokens[0]_tokens[1]"
            features.Add( trigram );
        }
        previous2 = previous1;
        previous1 = unigram;
    }
    features.Add( unigram + "_$" );                // last bi-gram is "tokens[n-1]_$"
    features.Add( bigram + "_$" );                 // last tri-gram is "tokens[n-2]_tokens[n-1]_$"
    return features;
}

We could add the field name as an argument and prefix all features with it.
25
The Art of Feature Engineering:Disjunctive Features
• Useful feature = triggers often and with a particular class.
• Rarely occurring (but indicative of a class) features can be combined in a disjunction. This results in:
  – Need for less data to achieve good performance.
  – Final system performance (with all available data) is higher.
• How can we get insights about such features: error analysis!
Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese| branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi| gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini| panino| parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto| radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu| tortellini|vitello|vongole");
if (ITALIAN_FOOD.Match(entity.description).Success) features.Add("Italian_Food_Matched_Description");
It is up to us what we call the feature. The if-statement is the triggering of the feature.
26
instance( class= 7, features=[0,300857,100739,200441,...])instance( class=99, features=[0,201937,196121,345758,13,...])instance( class=42, features=[0,99173,358387,1001,1,...])...
Generic Nature of ML Systems
human sees
computer “sees”
Default feature always triggers.
The number of features that trigger for individual instances is often not the same.
Indices of (binary) features that trigger.
27
Training Data
$X=\begin{pmatrix} x_0^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_0^{(N)} & \cdots & x_d^{(N)} \end{pmatrix} \qquad \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix}$
Each row is an instance with its outcome.
28
Feature Selection
• Templates: a powerful way to get lots of features.
• We get too many features.
• Danger of overfitting.
• Feature selection:
– CountCutOff.– TFxIDF.– Mutual information.– Information gain.– Chi square.
Doing well on seen data but poorly on unseen data.
e.g., 20M for dependency parsing.
Automatic ways of finding discriminative features.
We will examine in detail the implementation of this.
29
Mutual Information
• Measure of relative entropy between the distributions of two random variables.
• $MI(f,C)$ = expected value of $I(f,c)$ across all classes:

$I(f,c)=\log\frac{P(f,c)}{P(f)P(c)}=\log\frac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$

$MI(f,C)=\sum_{c\in C}P(c)\,I(f,c)=\sum_{c\in C}\frac{n_c}{N_t}\log\frac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$

• An alternative is to use:

$I_{max}(f,C)=\max_{c\in C}I(f,c)=\max_{c\in C}\log\frac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$
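A minimal sketch (not from the slides) of computing these scores from the co-occurrence counts; the array names are assumptions for illustration:

// nFC[c] = n_{f,c}, nC[c] = n_c, nF = n_f, Nt = total number of training instances.
static double MutualInformation(int[] nFC, int[] nC, int nF, int Nt)
{
    double mi = 0.0;
    for (int c = 0; c < nC.Length; c++)
    {
        if (nFC[c] == 0) continue;               // skip zero counts (log of 0 is undefined)
        double pfc = (double)nFC[c] / Nt;        // P(f,c)
        double pf  = (double)nF / Nt;            // P(f)
        double pc  = (double)nC[c] / Nt;         // P(c)
        mi += pc * Math.Log(pfc / (pf * pc));    // P(c) * I(f,c)
    }
    return mi;
}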
30
Information Gain
Balances the effect of the feature triggering for an object with the effect of the feature being absent:
$IG(f,C)=H(C)-P(f)H(C|f)-P(\neg f)H(C|\neg f)$
$=-\sum_{c\in C}P(c)\log P(c)-\Big(-\sum_{c\in C}P(f,c)\log P(c|f)\Big)-\Big(-\sum_{c\in C}P(\neg f,c)\log P(c|\neg f)\Big)$
$=-\sum_{c\in C}\Big(\frac{n_c}{N_t}\log\frac{n_c}{N_t}-\frac{n_{f,c}}{N_t}\log\frac{n_{f,c}}{n_f}-\frac{n_c-n_{f,c}}{N_t}\log\frac{n_c-n_{f,c}}{N_t-n_f}\Big)$
31
Chi Square
Quantifies the lack of independence between feature $f$ and class $c$:
$\chi^2(f,c)=\frac{N_t\big(P(f,c)P(\neg f,\neg c)-P(f,\neg c)P(\neg f,c)\big)^2}{P(f)P(\neg f)P(c)P(\neg c)}=\frac{N_t\big(n_{f,c}(N_t-n_f-n_c+n_{f,c})-(n_f-n_{f,c})(n_c-n_{f,c})\big)^2}{n_c\,n_f\,(N_t-n_c)(N_t-n_f)}$

float Chi2(int a, int b, int c, int d)
{
    // Note: in C#, ^ is XOR, not exponentiation, so the square must be spelled out;
    // doubles avoid integer overflow in the intermediate products.
    double n = (double)a + b + c + d;
    double num = n * ((double)a * d - (double)b * c) * ((double)a * d - (double)b * c);
    double den = (double)(a + b) * (a + c) * (c + d) * (b + d);
    return (float)(num / den);
}

Calling: Chi2( $n_{f,c}$, $n_f-n_{f,c}$, $n_c-n_{f,c}$, $N_t-n_f-n_c+n_{f,c}$ )
32
Exponent (Log) Trick
While the final output may not be big, intermediate results are. Solution: $x=e^{\ln x}$.

$\frac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}=e^{\ln\frac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}}=e^{\ln(a+b+c+d)+2\ln|ad-bc|-\ln(a+b)-\ln(a+c)-\ln(c+d)-\ln(b+d)}$

float Chi2_v2(int a, int b, int c, int d)
{
    double n   = Math.Log((double)a + b + c + d);
    double num = 2.0 * Math.Log(Math.Abs((double)a * d - (double)b * c));
    double den = Math.Log(a + b) + Math.Log(a + c) + Math.Log(c + d) + Math.Log(b + d);
    return (float)Math.Exp(n + num - den);
}
33
Chi Square: Score per Feature
• We know how to compute $\chi^2(f,c)$.
• Two options for an aggregate score across classes:
  – Weighted average: $\chi^2(f)=\sum_{c\in Classes}P(c)\,\chi^2(f,c)$
  – Highest score among any class: $\chi^2(f)=\max_{c\in Classes}\chi^2(f,c)$
34
Chi Square Feature Selection

int[] featureCounts = new int[ numFeatures ];
int numLabels = labelIndex.Count;
int[] classTotals = new int[ numLabels ];          // instances with that label.
float[] classPriors = new float[ numLabels ];      // class priors: classTotals[label]/numInstances.
int[,] counts = new int[ numLabels, numFeatures ]; // (label,feature) co-occurrence counts.
int numInstances = instances.Count;
...
float[] weightedChiSquareScore = new float[ numFeatures ];
for (int f = 0; f < numFeatures; f++)              // f is a feature index
{
    float score = 0.0f;
    for (int labelIdx = 0; labelIdx < numLabels; labelIdx++)
    {
        int a = counts[ labelIdx, f ];             // feature and class co-occur
        int b = classTotals[ labelIdx ] - a;       // class without the feature
        int c = featureCounts[ f ] - a;            // feature without the class
        int d = numInstances - (a + b + c);        // neither
        if (a >= MIN_SUPPORT && c >= MIN_SUPPORT)  // support check; MIN_SUPPORT = 5
        {
            score += classPriors[ labelIdx ] * Chi2( a, b, c, d );
        }
    }
    weightedChiSquareScore[ f ] = score;
}

Do a pass over the data and collect the above counts.
Weighted average across all classes.
35
⇒ Summary: Encoding
• Object representation is crucial.
• Humans: good at suggesting features (templates).
• Computers: good at filtering (feature selection).
• Feature engineering: Ensuring systems use the “right” features.
The system designer does not have to worry about which feature is more important or useful, and the job is left to the learning algorithm to assign appropriate weights to the corresponding features. The system designer’s job is to define a set of features that is large enough to represent most of the useful information, yet small enough to be manageable for the algorithms and the infrastructure.
36
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
37
MACHINE LEARNING: GENERAL FRAMEWORK
38
Machine Learning: Representation
object encoded with features (think DB attributes / OO member fields of primitive types); $d$ is the feature dimensionality.
classifier
prediction (response/dependent variable). Can be qualitative/quantitative (classification/regression).
Complex decision making: $Object \to Outcome$ — Entity → Category, Entity → Popularity, Entity → IsChainElement.
$X \to Y$, i.e. $\vec{x}=(x_0,\dots,x_d)\to Y$, where $\vec{x}$ is the input/independent variable.
We may know the relation for certain values of $\vec{x}$ and $y$: $(\vec{x},y)$.
In fact, we may know the relation for many $\vec{x}$s and $y$s: $\{(\vec{x}^{(1)},y^{(1)}),\dots,(\vec{x}^{(N)},y^{(N)})\}$. The $i$-th is: $(\vec{x}^{(i)},y^{(i)})$.
39
Notation
$\vec{x}^{(i)}=(x_0^{(i)},\dots,x_j^{(i)},\dots,x_d^{(i)})$ — the $i$-th instance.
$N$ is the total number of data items.
$(i)$ is not "to the power of", hence the parentheses.
$x_j^{(i)}$ is the $j$-th component of the feature vector $\vec{x}^{(i)}$.
We will often have $x_0$ be the default feature with value of 1.
40
TRAINING
Machine Learning
Input
Online System
object encoded with features
classifier
prediction(response/dependent variable)
FinalOutput
ModelOffline
TrainingSub-system
Training Data
where $f(X)=Y$.
The task $X\to Y$ is very complex; it is hard to construct a good $f$ directly. We construct an approximation $g$ to $f$ from a hypothesis space $\mathcal{H}$.
41
Classes of Learning Problems
• Classification: Assign a category to each item (Chinese | French | Indian | Italian | Japanese restaurant).
• Regression: Predict a real value for each item (stock/currency value, temperature).
• Ranking: Order items according to some criterion (web search results relevant to a user query).
• Clustering: Partition items into homogeneous groups (clustering twitter posts by topic).
• Dimensionality reduction: Transform an initial representation of items into a lower-dimensional representation while preserving some properties (preprocessing of digital images).
42
ML Terminology
• Examples: Items or instances used for learning or evaluation.
• Features: Set of attributes, represented as a vector, associated with an example.
• Labels: Values or categories assigned to examples. In classification the labels are categories; in regression the labels are real numbers.
• Target: The correct label for a training example. This is extra data that is needed for supervised learning.
• Output: The label predicted from an input set of features using the model of the machine learning algorithm.
• Training sample: Examples used to train a machine learning algorithm.
• Validation sample: Examples used to tune the parameters of a learning algorithm.
• Model: Information that the machine learning algorithm stores after training. The model is used when predicting the output labels of new, unseen examples.
• Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage.
• Loss function: A function that measures the difference/loss between a predicted label and a true label. We will design the learning algorithms so that they minimize the error (cumulative loss across all training examples).
• Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The learning algorithm chooses one function among those in the hypothesis set to return after training. Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters (e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the parameters that minimize the error.
• Model selection: Process for selecting the free parameters of the algorithm (actually of the function in the hypothesis set).
43
Classification
• Data: $\{(\vec{x}^{(i)},y^{(i)})\}_{i=1}^{N}$
• Binary classification:
  – Outcomes: $y\in\{-1,+1\}$
[Figure: positive (+) and negative (−) points in feature space, separated by a decision boundary]
Yes, this is mysterious at this point.
44
Multi-Class Classification
• Outcomes: $y\in\{1,\dots,k\}$ for $k$ classes.
• Common to use binary classification approaches: One-Versus-All (OVA), One-Versus-One (OVO).
45
One-Versus-All (OVA)
For each category in turn, create a binary classifier where an instance in the data belonging to the category is considered a positive example, all other examples are considered negative examples.
Given a new object, run all these binary classifiers and see which classifier has the “highest prediction”.
The scores from the different classifiers need to be calibrated!
$\hat{y}=\arg\max_{y\in Classes} PredictScore_y(\vec{x})$
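A minimal sketch of the OVA decision rule, assuming we already have one scoring function per class (the delegate array is an assumption for illustration):

// scorers[y](x) plays the role of PredictScore_y(x); the scores must be calibrated!
static int PredictOVA(Func<float[], float>[] scorers, float[] x)
{
    int best = 0;
    float bestScore = float.MinValue;
    for (int y = 0; y < scorers.Length; y++)
    {
        float s = scorers[y](x);
        if (s > bestScore) { bestScore = s; best = y; }   // argmax over classes
    }
    return best;
}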
46
One-Versus-One (OVO)For each pair of classes, create binary classifier on data labeled as either of the classes.
How many such classifiers?
Given a new instance run all classifiers and predict class with maximum number of wins.
$\binom{k}{2}=\frac{k(k-1)}{2}$
47
Errors — "Nobody is perfect, but then again, who wants to be nobody."
$\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})$ — the value predicted by the algorithm for input data point $\vec{x}^{(i)}$.
Point-wise error for data point $i$: $Loss(\hat{y}^{(i)},y^{(i)})$, comparing the prediction $\hat{y}^{(i)}$ and the true value $y^{(i)}$.
Average error across all instances: $Error=\frac{1}{N}\sum_{i=1}^{N}Loss(\hat{y}^{(i)},y^{(i)})$
For a binary classifier with $Loss(\hat{y},y)=|\hat{y}-y|$ (the "zero-one loss"; for simplicity we skip the indices), $Error=\frac{1}{N}\sum_{i=1}^{N}|\hat{y}^{(i)}-y^{(i)}|$ counts the misclassified examples (a penalty score of 1 for every misclassified example); this loss makes more sense with the $\{0,1\}$ label encoding than with $\{-1,+1\}$.
Goal: minimize the error. It is beneficial to have a differentiable loss function.
48
Error: Function of the Parameters
$\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})=g(\vec{x}^{(i)},params)$ — the value predicted by the algorithm for input data point $\vec{x}^{(i)}$.
The cumulative error across all instances is a function of the parameters:
$Error(params)=\frac{1}{N}\sum_{i=1}^{N}Loss(\hat{y}^{(i)},y^{(i)})=\frac{1}{N}\sum_{i=1}^{N}Loss\big(g(\vec{x}^{(i)},params),y^{(i)}\big)$
1. When the $\vec{x}$s and the $y$s are fixed we can compute (optimize) the params (training).
2. When the params are fixed we can compute $\hat{y}$ given $\vec{x}$ (testing).
49
Evaluation
• Motivation:– Benchmark algorithms (which system is better).– Tuning parameters during training.
50
Evaluation Measures
Generalization error: the probability of misclassifying an instance selected according to the distribution of the labeled instance space. Classification accuracy = 1 − generalization error.
Training error: the percentage of training examples which are misclassified — an optimistically biased estimate of the generalization error, especially if the inducer over-fits the (training) data.
Empirical estimation of the generalization error:
• Heldout method
• Re-sampling:
  1. Random resampling
  2. Cross-validation
51
Precision, Recall and F-measure
Let’s consider binary classification:
Space of all instances
Instances identified as positive by the system.
Positive instances in reality.
System identified these as positive but got them wrong(false positive).
System identified these as positive but got them correct(true positive).
System identified these as negative but got them wrong(false negative).
System identified these as negative and got them correct(true negative).
General Setup
52
Accuracy, Precision, Recall, and F-measure
Definitions (TP: true positives; FP: false positives; FN: false negatives; TN: true negatives):
Precision: $p=\frac{TP}{TP+FP}$
Recall: $r=\frac{TP}{TP+FN}$
Accuracy: $acc=\frac{TP+TN}{TP+TN+FP+FN}$
F-measure (harmonic mean of precision and recall): $F=\frac{1}{\frac{1}{2}\left(\frac{1}{p}+\frac{1}{r}\right)}=\frac{2pr}{p+r}$
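A small sketch computing these metrics from the four counts of the binary contingency table:

static void PrintMetrics(int tp, int fp, int fn, int tn)
{
    double p   = (double)tp / (tp + fp);                   // precision
    double r   = (double)tp / (tp + fn);                   // recall
    double f   = 2.0 * p * r / (p + r);                    // F-measure (harmonic mean)
    double acc = (double)(tp + tn) / (tp + tn + fp + fn);  // accuracy
    Console.WriteLine($"p={p:F3} r={r:F3} F={f:F3} acc={acc:F3}");
}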
53
Accuracy vs. Precision/Recall/F-measure
Accuracy can be misleading for evaluating a model on an imbalanced class distribution: when there are many more majority-class instances than minority-class instances, always predicting the majority class gives good accuracy.
Precision and recall (together) are better indicators.
As a single, aggregate number, the F-measure favors the lower of precision and recall.
54
Extreme Cases for Precision & Recall
[Figure: system predictions vs. actual positives over all instances]
If very few (one, in the extreme) instances are correctly predicted as belonging to the class, precision is 100% (FP = 0) but recall is low (FN is high).
If all instances are predicted as belonging to the class (some correctly, some not), recall is 100% (FN = 0) but precision is low (FP is high).
Precision can be traded for recall and vice versa.
55
Sensitivity & Specificity
Definitions:
Sensitivity (same as recall; aka true positive rate): $Sensitivity=\frac{TP}{TP+FN}$
Specificity (aka true negative rate): $Specificity=\frac{TN}{TN+FP}$
False positive rate: $FPR=\frac{FP}{FP+TN}$
False negative rate: $FNR=\frac{FN}{FN+TP}$
Misclassification rate: $1-Acc=\frac{FP+FN}{TP+TN+FP+FN}$
56
Venn Diagrams
John Venn (1880) “On the Diagrammatic and Mechanical Representation of Propositions and Reasonings”, Philosophical Magazine and Journal of Science, 5:10(59).
These visualization diagrams were introduced by John Venn:
What if there are three classes?
Four classes?
Six classes?
With more classes our visual intuitions are helping less and less.
A subtle point: These are just the actual/real classes without the system classes drawn on top!
57
Confusion Matrix
Predicted class A Predicted class B Predicted class C
Actual class ANumber of instances in the actual class A AND predicted as belonging to class A.
Number of instances in the actual class A BUT predicted as belonging to class B.
… Total number of actual instances of class A
Actual class B … … … Total number of actual instances of class B
Actual class C … … … Total number of actual instances of class C
Total number of instances predicted as class A
Total number of instances predicted as class B
Total number of instances predicted as class C
Total number of instances
Shows how the predictions of instances of an actual class are distributed across all classes.Here is an example confusion matrix for three classes:
Counts on the diagonal are the true positives for each class. Counts not on the diagonal are errors.Confusion matrices can handle many classes.
58
Confusion Matrix:Accuracy, Precision and Recall
                 Predicted A   Predicted B   Predicted C   Total
Actual class A        50            80            70         200
Actual class B        40           140           120         300
Actual class C       120           220           160         500
Total                210           440           350        1000

Given a confusion matrix, it is easy to compute accuracy, precision and recall (confusion matrices can, themselves, be confusing sometimes):
$Accuracy=\frac{50+140+160}{1000}$
$Precision_A=\frac{50}{50+40+120}$
$Recall_A=\frac{50}{50+80+70}$
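A sketch that computes these quantities for any number of classes from a confusion matrix m[actual, predicted] (the layout matches the table above; the method name is an assumption):

static void ConfusionMatrixMetrics(int[,] m)
{
    int k = m.GetLength(0), total = 0, correct = 0;
    for (int a = 0; a < k; a++)
        for (int p = 0; p < k; p++)
        {
            total += m[a, p];
            if (a == p) correct += m[a, p];     // diagonal = true positives per class
        }
    Console.WriteLine("accuracy = " + (double)correct / total);
    for (int c = 0; c < k; c++)
    {
        int colSum = 0, rowSum = 0;
        for (int i = 0; i < k; i++) { colSum += m[i, c]; rowSum += m[c, i]; }
        Console.WriteLine($"class {c}: precision = {(double)m[c, c] / colSum:F3}, recall = {(double)m[c, c] / rowSum:F3}");
    }
}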
59
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
60
LINEAR MODELS
Why? – Linear models are a good way to learn about core ML concepts.
61
Refresher: Vectors
[Figure: points and vectors in the plane; points are also vectors. Sum of vectors: $v_1+v_2$.]
The equation of the line $y=\frac{1}{3}x$ can be re-written: $3y=1x$, i.e. $(-1)x+3y=0$, or in $(x_1,x_2)$ coordinates $(-1)x_1+3x_2=0$:
$(-1,3)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$, and in vector notation $(w_1,w_2)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$.
$\vec{w}=\begin{pmatrix}w_0\\\vdots\\w_d\end{pmatrix}=(w_0,\dots,w_d)^T$ (transpose).
62
Refresher: Vectors (2)
The same line $y=\frac{1}{3}x$, i.e. $(-1)x_1+3x_2=0$, written as $(w_1,w_2)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$: the coefficient vector $(-1,3)$ is the normal vector — it is perpendicular to the line.
63
Refresher: Dot Product
$(w_1,w_2)\cdot\begin{pmatrix}x_1\\x_2\end{pmatrix}=w_1x_1+w_2x_2$
$\vec{w}\cdot\vec{x}=|\vec{w}||\vec{x}|\cos\gamma$

float DotProduct(float[] v1, float[] v2)
{
    float sum = 0.0f;
    for (int i = 0; i < v1.Length; i++)
        sum += v1[i] * v2[i];
    return sum;
}

[Figure: relative to the direction of $\vec{w}$, points with $\vec{w}\cdot\vec{x}>0$ lie on one side of the line $\vec{w}\cdot\vec{x}=0$, and points with $\vec{w}\cdot\vec{x}<0$ on the other.]
64
Refresher: Pos/Neg Classes
[Figure: the normal vector $\vec{w}$ points toward the positive (+) half-plane $\vec{w}\cdot\vec{x}>0$; the boundary is $\vec{w}\cdot\vec{x}=0$; the negative (−) half-plane is $\vec{w}\cdot\vec{x}<0$.]
65
sgn Function
In mathematics: $sgn(s)=\begin{cases}+1 & s>0\\ 0 & s=0\\ -1 & s<0\end{cases}$
We will use: $sgn(s)=\begin{cases}+1 & s\geq 0\\ -1 & s<0\end{cases}$
(We purposefully avoid using $x$ for the argument here; we reserve $\vec{x}$ for the feature vector. Informally drawn as a step function from $-1$ to $+1$.)
66
Two Linear Models
Perceptron: $g(\vec{x})=sign(\vec{w}^T\vec{x})$        Linear regression: $g(\vec{x})=\vec{w}^T\vec{x}$
The features of an object have associated weights indicating their importance.
Signal: $s=\vec{w}^T\vec{x}=\sum_{i=0}^{d}w_ix_i$
When $\vec{w}$ is known, the solution function is known; $\vec{w}$ determines the hypothesis space.
67
Why "Regression"?
Why is the term for quantitative output prediction "regression"?
“That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his anthropometric laboratory and recognized the same pattern with human heights. After measuring 205 pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were generally shorter than they were, while exceptionally short parents had children who were generally taller than their parents.
After reflecting upon this, we can understand why it must be the case. If very tall parents always produced even taller children, and if very short parents always produced even shorter ones, we would by now have turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting taller as a whole – due to better nutrition and public health – but the distribution of heights within the population is still contained.
Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now more generally known as regression to the mean.”
[A.Bellos pp.375]
68
On-Line (Sequential) Learning
• On-line = process one example at a time.
• Attractive for large-scale problems.

$parameters := Initialize()$
for iteration $t$ (epoch/time):
    $(\vec{x}^{(t)},y^{(t)}) := ReceiveInstance()$
    $\hat{y}^{(t)} := Predict(\vec{x}^{(t)})$
    $Loss(\hat{y}^{(t)},y^{(t)}) := \dots$    compute the loss
    $parameters := Update(\vec{x}^{(t)},y^{(t)},\hat{y}^{(t)},Loss,parameters)$
return parameters

Objective: minimize the cumulative loss $\sum_{t=1}^{T}Loss(\hat{y}^{(t)},y^{(t)})$
69
On-Line (Sequential) Learning (2)
Sometimes written out more explicitly:

$parameters := Initialize()$
for each of the # passes over the data:
    $RandomizeData()$
    for each data item $i$:
        $\vec{x}^{(i)} := ReceiveInstance()$
        $\hat{y}^{(i)} := Predict(\vec{x}^{(i)})$
        $y^{(i)} := ReceiveTrueLabel()$
        if $\hat{y}^{(i)} \neq y^{(i)}$:
            $parameters := Update(\vec{x}^{(i)},y^{(i)},\hat{y}^{(i)},Loss,parameters)$
return parameters
70
Perceptron
• One of the earliest ML algorithms (Rosenblatt 1958).
• On-line linear binary classification algorithm.
• Determines a hyperplane (a line in $\mathbb{R}^2$, a plane in $\mathbb{R}^3$, …) separating the points of the two classes.
[Figure: linearly separable data — a line separates the + points from the − points; non-linearly separable data — no line can separate them.]
71
First: Perceptron Update Rule
$\vec{w}_{new}=\vec{w}_{old}+y^{(t)}\vec{x}^{(t)}$
Simplification: lines pass through the origin, in order to simplify the update rule.
Example: the boundary $(-3)x_1+1x_2=0$, i.e. $(-3,1)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$, initially misclassifies the positive example $\begin{pmatrix}2\\2\end{pmatrix}$. The update gives
$\begin{pmatrix}w_1\\w_2\end{pmatrix}=\begin{pmatrix}-3\\1\end{pmatrix}+(+1)\begin{pmatrix}2\\2\end{pmatrix}=\begin{pmatrix}-1\\3\end{pmatrix}$
i.e. the new boundary $(-1)x_1+3x_2=0$ (the line $y=\frac{1}{3}x$). The example is now correctly classified with the new separating boundary. It is not always the case that we can achieve this with one update.
72
On-Line (Sequential) Learning
(Recap of the template: initialize the parameters; for each iteration $t$ receive $(\vec{x}^{(t)},y^{(t)})$, predict $\hat{y}^{(t)}$, compute the loss, and update the parameters; return the parameters.)
73
Perceptron Learning Algorithm

$parameters := Initialize()$: $\vec{w}=(w_0,\dots,w_d)^T=(0,\dots,0)^T\in\mathbb{R}^{d+1}$
for iteration $t$ (epoch/time):
    $(\vec{x}^{(t)},y^{(t)}) := ReceiveInstance()$
    $\hat{y}^{(t)} := sign(\vec{w}^T\vec{x}^{(t)}) \in \{-1,+1\}$
    Compute the zero-one loss: $Loss(\hat{y}^{(t)},y^{(t)}) := \frac{1}{2}|\hat{y}^{(t)}-y^{(t)}|$
    if $\hat{y}^{(t)} \neq y^{(t)}$: $\vec{w} := \vec{w}+y^{(t)}\vec{x}^{(t)}$
return parameters
74
Perceptron Learning Algorithm
(Same algorithm; $T$ denotes transpose, $N$ is the sample size, and the algorithm makes multiple passes over the data.)
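A compact sketch of the algorithm above (the data layout — X[i][0] = 1 as the default feature, labels in {−1,+1} — is an assumption for illustration):

static float[] TrainPerceptron(float[][] X, int[] y, int epochs)
{
    int d = X[0].Length;
    var w = new float[d];                                         // w = (0,...,0)
    for (int epoch = 0; epoch < epochs; epoch++)                  // multiple passes over the data
        for (int i = 0; i < X.Length; i++)
        {
            float s = 0f;
            for (int j = 0; j < d; j++) s += w[j] * X[i][j];      // signal w^T x
            int yHat = s >= 0 ? +1 : -1;                          // sgn
            if (yHat != y[i])
                for (int j = 0; j < d; j++) w[j] += y[i] * X[i][j];  // w := w + y x
        }
    return w;
}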
75
Perceptron Learning Algorithm (PLA)

Initialize weights: $\vec{w}$
while mis-classified examples exist:
    Select a mis-classified example $(\vec{x}^{(t)},y^{(t)})$ — misclassified means that with the current weights $y^{(t)} \neq sign(\vec{w}^T\vec{x}^{(t)})$
    Update weights: $\vec{w} := \vec{w}+y^{(t)}\vec{x}^{(t)}$
return $\vec{w}$

1. A challenge: the algorithm will not terminate for non-linearly separable data (outliers, noise).
2. Unstable: it can jump from a good perceptron to a really bad one within one update.
3. Attempting to minimize (more generally): $\min_{\vec{w}}\frac{1}{N}\sum_{t=1}^{N}[\![y^{(t)}\neq sign(\vec{w}^T\vec{x}^{(t)})]\!]$ — NP-hard.
76
Perceptron
If a point is classified incorrectly: $sgn(y^{(t)})\neq sgn(\vec{w}_{old}^T\vec{x}^{(t)})\Rightarrow y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}<0$.
Weight update: $\vec{w}_{new}=\vec{w}_{old}+y^{(t)}\vec{x}^{(t)}$. Then
$y^{(t)}\vec{w}_{new}^T\vec{x}^{(t)}=y^{(t)}\big(\vec{w}_{old}+y^{(t)}\vec{x}^{(t)}\big)^T\vec{x}^{(t)}=y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}+(y^{(t)})^2\|\vec{x}^{(t)}\|^2=y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}+\|\vec{x}^{(t)}\|^2>y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}$
(since $(y^{(t)})^2=1$ and $\|\vec{x}^{(t)}\|^2>0$).
Thus, the perceptron weight update pushes the signal in the "right direction".
77
Looks Simple – Does It Work?
Fact (margin-based upper bound on updates): let $\vec{x}^{(1)},\dots,\vec{x}^{(N)}\in\mathbb{R}^{d+1}$ and suppose there exist $\vec{v}$ and $\rho>0$ such that
$r\geq\|\vec{x}^{(i)}\|$ (for all $i$) and $\rho\leq\frac{y^{(i)}(\vec{v}\cdot\vec{x}^{(i)})}{\|\vec{v}\|}$ (for all $i$).
Then the number of updates made by the perceptron algorithm is $\leq\frac{r^2}{\rho^2}$.
The quantity $\rho$ is known as the "normalized margin".
Remarkable: the bound does not depend on the dimension of the feature space!
78
Compact Model Representation
Use float instead of double. Store only the non-zero weights — and, instead of absolute indices, store the difference from the last index where the weight was non-zero:

void Save( StreamWriter w, int labelIdx, float[] weights )
{
    w.Write( labelIdx );
    int previousIndex = 0;
    for (int i = 0; i < weights.Length; i++)
    {
        if (weights[ i ] != 0.0f)
        {
            w.Write( " " + (i - previousIndex) + " " + weights[ i ] );  // difference of indices
            previousIndex = i;   // remember the last index where the weight was non-zero
        }
    }
}
79
Linear Classification Solutions
A fixed choice of $\vec{w}$ defines the hyperplane and, thus, the solution to our (linear) task.
[Figure: the same linearly separable data admits different solutions — infinitely many separating lines.]
80
The Pocket Algorithm
A better perceptron algorithm: keep track of the error and update the stored weights only when we lower the error.

$bestErr := Double.MAX$
Initialize weights: $\vec{w}^{(0)}$
for $i$:
    Run PLA for one iteration and obtain the new $\vec{w}^{(i+1)}$.
    Compute the error (an expensive step — access to the entire data is needed!):
    $Err(\vec{w}^{(i+1)})=\frac{1}{N}\sum_{n=1}^{N}[\![sgn(\vec{w}^{(i+1)}\vec{x}^{(n)})\neq y^{(n)}]\!]$
    if $Err(\vec{w}^{(i+1)}) < bestErr$: keep $\vec{w}^{(i+1)}$ as the best weights — only update the best weights if we lower the error!
return the best weights
81
Voted Perceptron
• Training as in the usual perceptron algorithm (with some extra book-keeping).
• Decision rule: $\hat{y}=sgn\left(\left(\sum_t c_t\vec{w}^{(t)}\right)\cdot\vec{x}\right)$
The coefficient $c_t$ is proportional to the number of iterations $\vec{w}^{(t)}$ survives (the number of iterations between $\vec{w}^{(t)}$ and $\vec{w}^{(t+1)}$).
82
Dual Perceptron: Intuitions
[Figure: a separating line with its normal vector; the normal vector can be expressed in terms of the training points themselves, each contributing with its label $y^-=-1$ or $y^+=+1$ — the intuition behind the dual representation.]
83
Dual Perceptron
$parameters := Initialize()$: $\vec{\alpha}=(\alpha_1,\dots,\alpha_N)^T=(0,\dots,0)^T\in\mathbb{R}^N$
for iteration $t$ (epoch/time; the algorithm makes multiple passes over the data of sample size $N$):
    $(\vec{x}^{(t)},y^{(t)}) := ReceiveInstance()$
    $\hat{y}^{(t)} := sign\Big(\sum_{j=1}^{N}\alpha_j y^{(j)}\big(\vec{x}^{(j)T}\cdot\vec{x}^{(t)}\big)\Big) \in \{-1,+1\}$
    if $\hat{y}^{(t)} \neq y^{(t)}$: $\alpha_t := \alpha_t+1$
return parameters

Decision rule: $\hat{y} := sign\Big(\sum_{j=1}^{N}\alpha_j y^{(j)}\big(\vec{x}^{(j)T}\cdot\vec{x}\big)\Big)$
$\alpha_j$ gives a notion of how difficult instance $j$ is.
The kernel perceptron replaces the dot product $\vec{x}^{(j)T}\cdot\vec{x}$ with a kernel $K(\vec{x}^{(j)},\vec{x})$.
84
Exclusive OR (XOR) Function
Truth table, with inputs in $\{0,1\}^2$ and color-coding of the output $x_1\ \mathrm{XOR}\ x_2$: $(0,0)\to 0$, $(0,1)\to 1$, $(1,0)\to 1$, $(1,1)\to 0$.
Challenge: the data is not linearly separable (no straight line can be drawn that separates the green from the blue points).
85
Solution for the Exclusive OR (XOR)
We introduce another input dimension $x_3$ (a new feature computed from $x_1$ and $x_2$). Now the data is linearly separable in $(x_1,x_2,x_3)$ space.
86
Winnow Algorithm

$\vec{w} := \left(\frac{1}{d+1},\dots\right)^T$
for iteration $t$ (epoch):
    $(\vec{x}^{(t)},y^{(t)}) := ReceiveInstance()$
    $\hat{y}^{(t)} := sgn(\vec{w}^T\vec{x}^{(t)})$
    if $\hat{y}^{(t)} \neq y^{(t)}$:
        for $i$: $w_i := \frac{w_i e^{y^{(t)}x_i^{(t)}}}{Z}$      (multiplicative update)
        where $Z := \sum_{i=0}^{d}w_i e^{y^{(t)}x_i^{(t)}}$ is the normalizing constant
return parameters
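A sketch of a single Winnow-style update, to contrast the multiplicative step with the perceptron's additive one (the method shape is an assumption for illustration):

static void WinnowUpdate(float[] w, float[] x, int y)   // called only on a mistake
{
    double Z = 0.0;
    var u = new double[w.Length];
    for (int i = 0; i < w.Length; i++) { u[i] = w[i] * Math.Exp(y * x[i]); Z += u[i]; }  // w_i e^{y x_i}
    for (int i = 0; i < w.Length; i++) w[i] = (float)(u[i] / Z);                         // normalize by Z
}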
87
Training, Test Error and Complexity
[Figure: training error decreases with model complexity while test error first decreases and then increases.]
88
Logistic Regression
$g(\vec{x})=\theta(\vec{w}^T\vec{x})$
Logistic function: $\theta(s)=\frac{e^s}{1+e^s}$, with the property $1-\theta(s)=\theta(-s)$.
$y\in\{-1,+1\}$
Target: $f(\vec{x})=P(y=+1\mid\vec{x})$
The data does not give the probability explicitly.
89
Logistic Regression
How likely is it that we get output $y$ when we have input $\vec{x}$:
$P(y\mid\vec{x})=\begin{cases}f(\vec{x}) & \text{when } y=+1\\ 1-f(\vec{x}) & \text{when } y=-1\end{cases}$
Substituting the hypothesis $g$ and using $1-\theta(s)=\theta(-s)$:
$P(y\mid\vec{x})=\begin{cases}g(\vec{x})=\theta(\vec{w}^T\vec{x})=\theta(y\vec{w}^T\vec{x}) & \text{when } y=+1\\ 1-g(\vec{x})=1-\theta(\vec{w}^T\vec{x})=\theta(-\vec{w}^T\vec{x})=\theta(y\vec{w}^T\vec{x}) & \text{when } y=-1\end{cases}$
Data likelihood: $L(\vec{w})=\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})$ — which $\vec{w}$ maximizes this?
Negative log-likelihood (which $\vec{w}$ minimizes this?):
$-l(\vec{w})=-\frac{1}{N}\ln\Big(\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})\Big)=\frac{1}{N}\sum_{i=1}^{N}\ln\frac{1}{P(y^{(i)}\mid\vec{x}^{(i)})}=\frac{1}{N}\sum_{i=1}^{N}\ln\frac{1}{\theta(y^{(i)}\vec{w}^T\vec{x}^{(i)})}=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$
Error: $E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$
90
Refresher
Derivative: $f'(x)=\lim_{\Delta x\to 0}\frac{f(x+\Delta x)-f(x)}{\Delta x}$; for example $(x^2)'=\lim_{\Delta x\to 0}\frac{(x+\Delta x)^2-x^2}{\Delta x}=\lim_{\Delta x\to 0}(2x+\Delta x)=2x$.
Useful facts: $(3x^2)'=3\cdot 2\cdot x^{2-1}$; $(\ln x)'=\frac{1}{x}$; $(e^x)'=e^x$; chain rule: $(f(g))'=f'(g)\cdot g'$.
Partial derivative: $\frac{\partial}{\partial x}F(x,y)$; e.g. $\frac{\partial}{\partial x}(x^2+2xy+y^2)=2x+2y$.
Partial derivative at a point: $\frac{\partial}{\partial w_0}(w_0^2+2w_0w_1+w_1^2)\big|_{w_0=2,w_1=3}=(2w_0+2w_1)\big|_{w_0=2,w_1=3}=2\cdot 2+2\cdot 3$
Gradient (derivatives with respect to each component; this is a vector and we can compute it at a point): $\Big[\frac{\partial(\cdot)}{\partial w_0},\frac{\partial(\cdot)}{\partial w_1},\dots,\frac{\partial(\cdot)}{\partial w_d}\Big]$
Gradient of the error: $\nabla E(\vec{w})=\Big[\frac{\partial E}{\partial w_0},\frac{\partial E}{\partial w_1},\dots,\frac{\partial E}{\partial w_d}\Big]$
91
Hypothesis Space
The best $\vec{w}$ to use is the one which minimizes $E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$.
Different $\vec{w}$ give rise to different values of $E(\vec{w})$; $E(\vec{w})$ is the error surface over the weight space, and $-\nabla E(\vec{w})$ points downhill on it. [graph from T. Mitchell]
92
Math Fact
The gradient of the error, $\nabla E(\vec{w})=\Big[\frac{\partial E}{\partial w_0},\frac{\partial E}{\partial w_1},\dots,\frac{\partial E}{\partial w_d}\Big]$ (a vector in weight space), specifies the direction of the argument that leads to the steepest increase in the value of the error. The negative of the gradient gives the direction of the steepest decrease.
[Figure: from the best weights found up to iteration $t$, $\vec{w}^{(t)}=(w_0,w_1)$, stepping along the negative gradient $-\nabla E(\vec{w}^{(t)})$ gives the new best weights $\vec{w}^{(t+1)}$ (see next slides).]
93
Computing the Gradient
$E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$
$\nabla E(\vec{w})=\nabla\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)=\frac{1}{N}\sum_{i=1}^{N}\nabla\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$      (because the gradient is a linear operator)
$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}}\cdot e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\cdot\big(-y^{(i)}\vec{x}^{(i)}\big)$      (chain rule: $(\ln u)'=\frac{1}{u}\cdot u'$ and $(1+e^z)'=e^z\cdot z'$)
$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{1+\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}}\cdot\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}\cdot\big(-y^{(i)}\vec{x}^{(i)}\big)$      (using $e^{-z}=\frac{1}{e^z}$)
$=-\frac{1}{N}\sum_{i=1}^{N}\frac{y^{(i)}\vec{x}^{(i)}}{1+e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}$
94
(Batch) Gradient Descent
A general technique for minimizing a differentiable function like $E(\vec{w})$. ($\eta$ is the learning rate.)

Initialize weights: $\vec{w}$
repeat:
    Compute the gradient: $\vec{grad} := -\frac{1}{N}\sum_{i=1}^{N}\frac{y^{(i)}\vec{x}^{(i)}}{1+e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}$
    Update weights: $\vec{w} := \vec{w}-\eta\,\vec{grad}$
until Stop: max #iterations; marginal error improvement; or a small value for the error.
return $\vec{w}$

If a random training example is selected and the gradient is computed on it alone, the algorithm is called SGD (Stochastic Gradient Descent).
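A sketch of batch gradient descent for logistic regression, directly implementing the gradient derived above (data layout as in the earlier perceptron sketch):

static float[] TrainLogisticRegression(float[][] X, int[] y, float eta, int maxIter)
{
    int d = X[0].Length, N = X.Length;
    var w = new float[d];
    for (int it = 0; it < maxIter; it++)
    {
        var grad = new double[d];
        for (int i = 0; i < N; i++)
        {
            double s = 0.0;
            for (int j = 0; j < d; j++) s += w[j] * X[i][j];          // w^T x
            double scale = y[i] / (1.0 + Math.Exp(y[i] * s));         // y / (1 + e^{y w^T x})
            for (int j = 0; j < d; j++) grad[j] -= scale * X[i][j] / N;  // accumulate -1/N * sum
        }
        for (int j = 0; j < d; j++) w[j] -= (float)(eta * grad[j]);   // w := w - eta * grad
    }
    return w;
}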
95
Punch Line
With the best weights $\vec{w}$ computed using gradient descent, given an unknown input object encoded as a vector of features $\vec{x}$, the output probability that the object is in the class is:
$P(y=+1\mid\vec{x};\vec{w})=\frac{e^{\vec{w}^T\vec{x}}}{1+e^{\vec{w}^T\vec{x}}}$
Classification rule: the new object is in the class if $P(y=+1\mid\vec{x};\vec{w})>\tau$.
Predict $y=+1$ if $P>0.5$, or equivalently if $\vec{w}^T\vec{x}>0$. The larger $\vec{w}^T\vec{x}$, the larger $P$ will be, and so will our degree of confidence that $y=+1$. The prediction that $y=+1$ is very confident if $\vec{w}^T\vec{x}\gg 0$; similarly, logistic regression makes a very confident decision that $y=-1$ if $\vec{w}^T\vec{x}\ll 0$.
96
Newton's Method
• An alternate way to minimize a function (like $E(\vec{w})$).
• We need to find the derivative of the error (negative log-likelihood) and find for which values of the parameters the derivative is zero.
• Let $f$ be a function; we want to find $u^*$ such that $f(u^*)=0$. Iterate:
$u_{i+1} := u_i-\frac{f(u_i)}{f'(u_i)}$
[Figure: the tangent to $f$ at $u_i$ crosses the axis at $u_{i+1}$: $\frac{f(u_i)}{u_i-u_{i+1}}=\tan\gamma=f'(u_i)$]
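A tiny sketch of the iteration (the delegates stand in for $f$ and $f'$):

static double NewtonRoot(Func<double, double> f, Func<double, double> fPrime, double u, int iters)
{
    for (int i = 0; i < iters; i++)
        u = u - f(u) / fPrime(u);     // u_{i+1} = u_i - f(u_i)/f'(u_i)
    return u;
}
// E.g. NewtonRoot(x => x * x - 2, x => 2 * x, 1.0, 10) converges to sqrt(2).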
97
Newton-Raphson
• Generalization of Newton's method to the multidimensional case.
• The parameters are a vector $\theta$ (we used the notation $\vec{w}$).
$\theta := \theta-\frac{l'(\theta)}{l''(\theta)}$ (one dimension); in general $\theta := \theta-H^{-1}\cdot\nabla l(\theta)$, where $H$ is the Hessian matrix: $H_{ij}=\frac{\partial^2 l(\theta)}{\partial\theta_i\partial\theta_j}$
98
Robust Risk Minimization
Notation: input vector $\vec{x}$; label $y\in\{-1,+1\}$; training examples $(\vec{x}^{(i)},y^{(i)})$; weight vector $\vec{w}$; bias $b$; continuous linear model $p(\vec{x})=\vec{w}^T\vec{x}+b$.
Prediction rule: $\hat{y}(\vec{x})=\begin{cases}+1 & p(\vec{x})\geq 0\\ -1 & p(\vec{x})<0\end{cases}$
Classification error: $l(p(\vec{x}),y)=\begin{cases}1 & p(\vec{x})\cdot y\leq 0\\ 0 & p(\vec{x})\cdot y>0\end{cases}$
99
Robust Classification Loss
Parameter estimation: $(\hat{w},\hat{b})=\arg\min_{\vec{w},b}\frac{1}{N}\sum_{i=1}^{N}loss\big(\vec{w}^T\vec{x}^{(i)}+b,\,y^{(i)}\big)$
Hinge loss: $g(p(\vec{x}),y)=\begin{cases}1-p(\vec{x})y & p(\vec{x})y\leq 1\\ 0 & p(\vec{x})y>1\end{cases}$
Robust classification loss: $h(p(\vec{x}),y)=\begin{cases}-2p(\vec{x})y & p(\vec{x})y\leq -1\\ \frac{1}{2}(p(\vec{x})y-1)^2 & p(\vec{x})y\in[-1,1]\\ 0 & p(\vec{x})y>1\end{cases}$
100
Loss Functions: Comparison
101
Confidence and Regularization
Confidence $P(y=1\mid\vec{x})$: $\hat{P}(\vec{x})=\max\Big(0,\min\Big(1,\frac{\hat{w}\cdot\vec{x}+\hat{b}+1}{2}\Big)\Big)$
Regularization: $(\hat{w},\hat{b})=\arg\min_{\vec{w},b}\frac{1}{N}\sum_{i=1}^{N}h\big(\vec{w}^T\vec{x}^{(i)}+b,\,y^{(i)}\big)$ subject to $\|\vec{w}\|^2+b^2\leq A$
Unconstrained optimization (Lagrange multiplier): $(\hat{w},\hat{b})=\arg\min_{\vec{w},b}\Big[\frac{1}{N}\sum_{i=1}^{N}h\big(\vec{w}^T\vec{x}^{(i)}+b,\,y^{(i)}\big)+\frac{\lambda}{2}\big(\|\vec{w}\|^2+b^2\big)\Big]$
A smaller $\lambda$ corresponds to a larger $A$.
102
Robust Risk Minimization
Input: $\vec{x}^{(i)}\in\mathbb{R}^{d+1}$; $y^{(i)}\in\{-1,+1\}$; the number of passes over the data.
Initialization of $\vec{w}$, $b$ and the $\alpha_i$.
for each pass:
    for $i$ (go over the training data):
        $p := y^{(i)}\big(\vec{w}^T\cdot\vec{x}^{(i)}\big)$
        $d_i := \max\Big(\min\Big(2c-\alpha_i,\ \eta\frac{c-\alpha_i}{c-p}\Big),\ -\alpha_i\Big)$
        $\vec{w} := \vec{w}+d_i\,y^{(i)}\vec{x}^{(i)}$
        $b := b+d_i\,y^{(i)}$
        $\alpha_i := \alpha_i+d_i$
return $(\vec{w},b)$
103
Learning Curve
• Plots an evaluation metric against the fraction of training data used (on the same test set!).
• The highest performance is bounded by the human inter-annotator agreement (ITA).
• The leveling-off effect can guide us on how much data is needed.
[Figure: learning curve; e.g., an experiment with 50% of the training data yields an evaluation number of 70.]
104
Summary
• Examples of ML• Categorization• Object encoding• Linear models:
– Perceptron– Winnow– Logistic Regression– RRM
• Engineering aspects of ML systems
105
PART II: POPULARITY
106
Goal
• Quantify how popular an entity is.
Motivation:
• Used in the new local search relevance metric.
107
What is popularity?
• Use clicks on an entity as a proxy for popularity.
• Popularity score in [0..1].
• Goal: preserve the relative ranking between clicks and the predicted popularity score.
108
POPULARITY IN LOCAL SEARCH
109
Popularity
• Output a popularity score (regression)
• Ensemble methods
• Tree base procedure (non-linear)
• Boosting
110
When is a Local Entity Popular?
• Definition:Visited by many people in the context of alternative choices.
• Is the popularity of restaurants the same as the popularity of movies, etc.?
• How to operationalize “visit”, “many”, “alternative choices”?– Initially we are using: popular means clicked more.
• Going forward we will use:– “visit” = click given an impression.– “choice” = density of entities in the same primary category.– “many” = fraction of clicks from impressions.
111
Local Entity Popularity
Popularity = boosted Click-Through Rate (CTR) for entity $e$:
$Popularity(e)=CTR_e+(1-CTR_e)\cdot Density_e$
where $CTR(e)=\frac{Clicks(e)}{Impressions(e)}$ and $Density_e=\frac{2}{\pi}\cdot\tan^{-1}\big(numBusinessesNear(e)\big)$
$numBusinessesNear(e)$ = the number of entities in the same primary category as $e$ within a radius.
Both $CTR$ and $Density$ lie in $[0,1]$, so $Popularity(e)\in[0,1]$: the density term boosts the CTR by a fraction of the remaining headroom $1-CTR$. The model then will be regression.
112
Not all Clicks are Born the Same
• Click in the context of a named query:– Can even be argued we are not satisfying the user
information needs (and they have to click further to find out what they are looking for).
• Click in the context of a category query:– Much more significant (especially when alternative results
are present).
113
Local Entity Popularity
• Popularity & 1st page , current ranker.• Entities without URL.• Newly created entities.• Clicks vs. mouseovers.• Scenario: 50 French restaurants; best entity
has 2k clicks. 2 Italian restaurants; best entity has 2k clicks. The French entity is more popular because of higher available choice.
114
Entity Representation
8000 … 4000 65 4.7 73 … 1 …9000
feature valuesTarget
Machine learning (training) instance
clicks for week -1
clicks for week -9
# ratingsaggregate ratings
# reviews
has FB page
115
POISSON REGRESSION
Why? – We will practice the ML machinery on a different problem, re-iterating the concepts. Poisson regression is an example of log-linear models, good for modeling counts (e.g., the number of visitors to a store in a certain time).
116
Setup
Training data: $\{(\vec{x}^{(i)},y^{(i)})\}$ where the $y^{(i)}\in\{0,1,2,\dots\}$ are counts (rather than $y\in\mathbb{R}$ as in general regression problems). The $\vec{x}$ are the explanatory variables; $y$ is the response/outcome variable.
Goal: come up with a system which, given a new observation $\vec{x}$, can correctly predict the corresponding outcome $y$.
For our scenario these counts are the clicks on the web page. A good way to model counts of observations is the Poisson distribution.
117
Poisson Distribution: PreliminariesThe Poisson distribution realistically describes the pattern of requests over time in many client-server situations.
Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for storage/retrieval services from a database server, and interrupts to a central processor. It also has higher-dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the volume distribution of contaminants in well water. In such cases, the “events”, which are request arrivals or defects occurrences, are independent. Customers do not conspire to achieve some special pattern in their access to a bank teller; rather they operate as independent agents. The manufacture of hard disks or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a small area on the disk surface where the magnetic material is not spread uniformly or a shorted transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one point does not influence, for better or worse, the chance of a defect at another point. Moreover, if the time interval or spatial area is small, the probability of an event is correspondingly small. This is a characterizing feature of a Poisson distribution: event probability decreases with the window of opportunity and is linear in the limit. A second characterizing feature, negligible probability of two or more events in a small interval, is also present in the mentioned examples.
118
Poisson Distribution: FormallyThe Poisson distribution can be used to model situations in which the expected number of events scales with the length of the interval within which the events can occur. If is the expected number of events per unit interval, then the distribution of the number of events within an interval is:
$p(X=k\mid\lambda)=\frac{1}{k!}e^{-\lambda t}(\lambda t)^k$
For a unit-length interval ($t=1$): $p(X=k\mid\lambda)=\frac{\lambda^k}{k!}e^{-\lambda}$
Mean: $\lambda$. Variance: $\lambda$.
119
Poisson Distribution: Mental Steps
First, we keep $\vec{x}$ for the input. So we will write: $p(Y=y\mid\lambda)=\frac{1}{y!}e^{-\lambda t}(\lambda t)^y$
The output is determined by a single scalar parameter $\lambda$. We make $\lambda$ depend on the input in the following way:
$\mu=E[Y]=\lambda=e^{\vec{x}^T\cdot\vec{\beta}}$, i.e. $\ln(\lambda)=\vec{x}^T\cdot\vec{\beta}$
This comes from the theory of Generalized Linear Models (GLM): the log of $\lambda$ is a linear combination of the input features — hence the name log-linear model. In contrast, a linear model could potentially make $\lambda$ negative, but $\lambda$ models a count!
We used to write $\vec{w}$ (when discussing logistic regression). Now we call the parameters $\vec{\beta}$, and because in the training phase they are unknown we write them as the second argument in the dot product to emphasize they are the argument.
120
Poisson Distribution
$p(Y=y\mid\lambda)=\frac{1}{y!}e^{-\lambda}\lambda^y$ with $\lambda=e^{\vec{x}^T\cdot\vec{\beta}}$
Data likelihood: $L(\vec{\beta})=\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})$ — which $\vec{\beta}$ maximizes this?
Log-likelihood:
$l(\vec{\beta})=\ln\Big(\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})\Big)=\sum_{i=1}^{N}\ln P(y^{(i)}\mid\vec{x}^{(i)})=\sum_{i=1}^{N}\ln\Bigg(\frac{e^{-e^{\vec{x}^{(i)}\vec{\beta}}}\big(e^{\vec{x}^{(i)}\vec{\beta}}\big)^{y^{(i)}}}{y^{(i)}!}\Bigg)$
$=\sum_{i=1}^{N}\Big[\ln\big(e^{-e^{\vec{x}^{(i)}\vec{\beta}}}\big)+y^{(i)}\ln\big(e^{\vec{x}^{(i)}\vec{\beta}}\big)-\ln\big(y^{(i)}!\big)\Big]=\sum_{i=1}^{N}\Big[-e^{\vec{x}^{(i)}\vec{\beta}}+y^{(i)}\vec{x}^{(i)}\vec{\beta}-\ln\big(y^{(i)}!\big)\Big]$
121
Maximizing the Log-Likelihood
$l(\vec{\beta})=\sum_{i=1}^{N}\Big[-e^{\vec{x}^{(i)}\vec{\beta}}+y^{(i)}\vec{x}^{(i)}\vec{\beta}-\ln\big(y^{(i)}!\big)\Big]$ — which $\vec{\beta}$ maximizes this? Set the gradient to zero:
$\nabla l(\vec{\beta})=\sum_{i=1}^{N}\Big[-\vec{x}^{(i)}e^{\vec{x}^{(i)}\vec{\beta}}+y^{(i)}\vec{x}^{(i)}\Big]=\sum_{i=1}^{N}\big(y^{(i)}-e^{\vec{x}^{(i)}\vec{\beta}}\big)\vec{x}^{(i)}=0$
This is non-linear in $\vec{\beta}$ and does not have an analytical solution; we solve it iteratively.
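One simple iterative option is gradient ascent on the log-likelihood, using the gradient above; a sketch (the fixed learning rate eta is an assumption — Newton-Raphson is another common choice for this model):

static double[] FitPoisson(double[][] X, int[] y, double eta, int iters)
{
    int d = X[0].Length;
    var beta = new double[d];
    for (int it = 0; it < iters; it++)
    {
        var g = new double[d];
        for (int i = 0; i < X.Length; i++)
        {
            double s = 0.0;
            for (int j = 0; j < d; j++) s += X[i][j] * beta[j];   // x^T beta
            double resid = y[i] - Math.Exp(s);                    // y_i - lambda_i
            for (int j = 0; j < d; j++) g[j] += resid * X[i][j];  // (y_i - e^{x beta}) x_i
        }
        for (int j = 0; j < d; j++) beta[j] += eta * g[j];        // ascent step (maximize l)
    }
    return beta;
}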
122
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
123
DECISION TREES
Why? – DTs are an influential development in ML. Combined in an ensemble they provide very competitive performance. We will see ensemble techniques in the next part.
124
Decision Trees
[Figure: binary partitioning of the data during training (navigating to a leaf node during testing). At each internal node we select a dimension $x_i$ and a split value $s_1$ ($x_i<s_1$ vs. $x_i\geq s_1$), then possibly $x_j$ and $s_2$, and so on. Training instances in a node are more homogeneous in terms of the output variable (more pure) compared to ancestor nodes. Splitting stops when the instances are homogeneous or the node has a small number of instances. Leaves carry the prediction; the color of a training instance reflects the output variable (classification example).]
125
Decision Tree: Example
(classification example with categorical features)
[Example tree: the root attribute/feature/predicate is Parents Visiting — Yes → Cinema; No → test Weather. Weather: Sunny → Play tennis; Windy → test Money; Rainy → Stay in. Money: Rich → Shopping; Poor → Cinema. The leaves are the predicted classes; the branches carry the values of the attribute. The branching factor depends on the number of possible values for the attribute (as seen in the training set).]
126
Entropy (needed for describing how an attribute is selected)
$Entropy(Instances)=-\sum_{c\in Classes}p_c\cdot\log_2 p_c$
Example — entropy for two classes, varying the probability $p$ of one class (the probability of the other class is $1-p$):
$Entropy=-p_1\cdot\log_2 p_1-p_2\cdot\log_2 p_2=-p\cdot\log_2 p-(1-p)\cdot\log_2(1-p)$
[Figure: the two-class entropy is 0 at $p=0$ and $p=1$ and peaks at 1 bit for $p=0.5$.]
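A small sketch of the entropy computation over class counts (in bits):

static double Entropy(int[] classCounts)
{
    double total = 0.0, h = 0.0;
    foreach (int n in classCounts) total += n;
    foreach (int n in classCounts)
    {
        if (n == 0) continue;            // by convention 0 * log 0 = 0
        double p = n / total;
        h -= p * Math.Log(p, 2);         // - sum_c p_c log2 p_c
    }
    return h;
}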
127
Selecting an Attribute: Information Gain
Measure of the expected reduction in entropy, for instances $S$ and attribute $a$, where $S_v$ = instances with value $v$ for attribute $a$:
$Gain(S,a)=Entropy(S)-\sum_{v\in Values(a)}\frac{|S_v|}{|S|}Entropy(S_v)$
Choose the attribute with the highest information gain (the one that minimizes the weighted entropy of the split). See Mitchell'97, p.59 for an example.
128
Splitting ‘Hairs’
?
If there are only a small number of instances do not split the node further (statistics are unreliable).
If there are no instances in the current node, inherit statistics (majority class) from parent node.
𝑎𝑡𝑡𝑟=𝑣𝑎𝑙1 𝑎𝑡𝑡𝑟=𝑣𝑎𝑙2 𝑎𝑡𝑡𝑟=𝑣𝑎𝑙3
If there is more training data, the tree can be “grown” bigger.
129
ID3 Algorithm
ID3(Examples, Attributes): {
    root := new node
    if all Examples are positive then root.label := +; return root
    if all Examples are negative then root.label := −; return root
    if Attributes = ∅ then root.label := most common class among Examples; return root
    a := best attribute
    foreach v: possible value of attribute a:
        Examples_v := the examples that have value v for attribute a
        if Examples_v = ∅ then add a leaf labeled with the most common class among Examples
        else add the subtree ID3(Examples_v, Attributes without a)
    return root }
130
Alternative Attribute Selection: Gain Ratio [Quinlan 1986]
$GainRatio(S,a)=\frac{Gain(S,a)}{SplitInformation(S,a)}$
$SplitInformation(S,a)=-\sum_{v\in Values(a)}\frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}$
($S$ = instances, $a$ = attribute, $S_v$ = instances with value $v$ for attribute $a$.)
Examples:
– All $n$ values different: $SplitInformation(S,a)=-\sum_{v\in\{1..n\}}\frac{1}{n}\log_2\frac{1}{n}=-n\cdot\frac{1}{n}\log_2(n^{-1})=\log_2 n$
– $2n$ instances split evenly between two values (0 and 1): $SplitInformation(S,a)=-\sum_{v\in\{0,1\}}\frac{n}{2n}\log_2\frac{n}{2n}=-2\cdot\frac{1}{2}\log_2(2^{-1})=1$
131
Alternative Attribute Selection: GINI Index [Corrado Gini: Italian statistician]
$Gini(S,y)=1-\sum_{v\in Values(y)}\Big(\frac{|S_v|}{|S|}\Big)^2$      (the target $y$ is treated just like another attribute)
$GiniGain(S,a)=Gini(S,y)-\sum_{v\in Values(a)}\frac{|S_v|}{|S|}Gini(S_v,y)$
$\hat{a}=\arg\max_{a\in Attributes}GiniGain(S,a)$ — the selected attribute is the one that maximizes the $GiniGain$.
132
Space of Possible Decision Trees
Assume: a binary classifier; $n$ binary attributes; height $h$.
[Figure: at the root there are $n$ attribute choices, at the next level $n-1$, then $n-2$, …]
Number of possible trees: $2^{2^h}\cdot\Big[\sum_{i=0}^{h}2^i(n-i)\Big]$
133
Decision Trees and Rule Systems
The path from each leaf node to the root represents a conjunctive rule:
if (ParentsVisiting==No) & (Weather==Windy) & (Money==Poor) then Cinema.
[Figure: the Parents Visiting / Weather / Money example tree from earlier, with leaves Cinema, Shopping, Stay in, Play tennis.]
134
Decision Trees
• Different training sample -> different resulting tree (different structure).
• Learning does (conditional) feature selection.
135
Regression Trees
Like classification trees, but the prediction is a number (as suggested by "regression").
1. How do we split?
2. When do we stop?
[Figure: a tree with splits $x_i<s_1$, $x_j<s_2$ and leaf predictions (constants) $c_1,c_2,c_3\in\mathbb{R}$.]
136
Regression Trees: How to Split
Finding the dimension $j$ and the split value $s$:
$\langle j,s\rangle=\arg\min_{j,s}\Bigg(\min_{c_1}\sum_{X^{(i)}[j]<s}\big(Y^{(i)}-c_1\big)^2+\min_{c_2}\sum_{X^{(i)}[j]\geq s}\big(Y^{(i)}-c_2\big)^2\Bigg)$
where $X^{(i)}=(\dots,X[j]^{(i)},\dots)$; the inner minimizers $c_1,c_2$ are the means of the $Y^{(i)}$ on each side of the split.
137
Regression Trees: Pruning
A tree operation where a pre-terminal node gets its two leaves collapsed: the subtree testing $x_j<s_{20}$ with leaves $c_{20}$ and $c_{30}$ is replaced by a single leaf $c'$ under $x_i<s_{10}$.
138
Regression Trees: How to Stop
1. Don't stop.
2. Build a big tree.
3. Prune.
4. Evaluate the sub-trees.
139
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
140
BOOSTING
141
ENSEMBLE
Ensemble Methods
[Figure: the same INPUT (an object encoded with features) is fed to several systems (classifiers); their outputs (predictions/response variables) are combined by majority voting/averaging into the Final Output.]
142
Where the Systems Come from
Sequential ensemble scheme:
[Figure: System 1 is induced from Data 1; identifying the difficult examples (through weighting the examples) yields Data 2, from which System 2 is induced; and so on.]
143
Contrast with Bagging
Non-sequential ensemble scheme:
[Figure: each Data_i is obtained from DATA by sampling with replacement, and a classifier System_i is induced from it.]
The Data_i are independent of each other (likewise for the System_i).
144
Base Procedure: Decision Tree
[Figure: the base procedure (a System induced from Data) is a decision tree — binary partitioning of the data during training (navigating to a leaf node during testing), selecting a dimension and split value at each node, and stopping when the instances are homogeneous or few; see the earlier decision-tree slide.]
145
TRAINING DATA
Ensemble Scheme
[Figure: the base procedure $G_1(X)$ is trained on the original data $\{(X^{(1)},Y^{(1)}),\dots,(X^{(N)},Y^{(N)})\}$; $G_2(X),\dots,G_M(X)$ are trained on successively re-weighted data.]
Final prediction (regression): $g(X)=\sum_{m=1}^{M}\alpha_m\cdot G_m(X)$
The base systems are small and don't need to be perfect. The weights depend only on the previous iteration (memory-less). N.B.: data weights $\neq$ feature weights in linear models.
146
AdaBoost (classification)
[Figure: $G_1(X)$ is trained on the original data; $G_2(X),\dots,G_M(X)$ on re-weighted data.]
$w_i^{(1)}=\frac{1}{N}$ — the weight associated with the $i$-th training example.
$err_m=\frac{\sum_i w_i^{(m)}}{\sum_{j=1}^{N}w_j^{(m)}}$, summing in the numerator over the examples misclassified by $G_m$ (the denominator is the normalizing factor).
$\alpha_m=\log\Big(\frac{1-err_m}{err_m}\Big)$ — the goodness of predictor $G_m$.
$\tilde{w}_i=w_i^{(m)}\cdot e^{\alpha_m}$ for each misclassified example $i$; then $w_i^{(m+1)}=\frac{\tilde{w}_i}{\sum_{j=1}^{N}\tilde{w}_j}$.
Final prediction: $g(X)=sign\Big(\sum_{m=1}^{M}\alpha_m\cdot G_m(X)\Big)$
147
AdaBoost
Initializing the weights: $w_i^{(1)}=\frac{1}{N}$
for $m=1,\dots,M$:
    Create $G_m$ using the weights $w^{(m)}$.
    $err_m=\frac{\sum_{i=1}^{N}w_i^{(m)}\cdot[\![G_m(X^{(i)})\neq Y^{(i)}]\!]}{\sum_{j=1}^{N}w_j^{(m)}}$
    $\alpha_m=\log\Big(\frac{1-err_m}{err_m}\Big)$
    Weight update: $\tilde{w}_i=w_i^{(m)}\cdot e^{\alpha_m[\![G_m(X^{(i)})\neq Y^{(i)}]\!]}$; $w_i^{(m+1)}=\frac{\tilde{w}_i}{\sum_{j=1}^{N}\tilde{w}_j}$ (normalizing factor)
Final prediction: $g(X)=sign\Big(\sum_{m=1}^{M}\alpha_m\cdot G_m(X)\Big)=\arg\max_Y\sum_{m=1}^{M}\alpha_m\cdot[\![G_m(X)=Y]\!]$
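A sketch of the loop above with the weak learner abstracted as a delegate; Train fits a {−1,+1} classifier on weighted data (all names here are assumptions for illustration):

static Func<float[], int> AdaBoost(float[][] X, int[] y, int M,
                                   Func<float[][], int[], double[], Func<float[], int>> Train)
{
    int N = X.Length;
    var w = new double[N];
    for (int i = 0; i < N; i++) w[i] = 1.0 / N;            // uniform initial weights
    var G = new Func<float[], int>[M];
    var alpha = new double[M];
    for (int m = 0; m < M; m++)
    {
        G[m] = Train(X, y, w);                             // fit weak learner on weighted data
        double err = 0.0, sum = 0.0;
        for (int i = 0; i < N; i++) { sum += w[i]; if (G[m](X[i]) != y[i]) err += w[i]; }
        err /= sum;
        alpha[m] = Math.Log((1.0 - err) / err);            // goodness of G_m
        double z = 0.0;
        for (int i = 0; i < N; i++)
        {
            if (G[m](X[i]) != y[i]) w[i] *= Math.Exp(alpha[m]);  // up-weight the mistakes
            z += w[i];
        }
        for (int i = 0; i < N; i++) w[i] /= z;             // re-normalize
    }
    return x =>                                            // g(x) = sign(sum alpha_m G_m(x))
    {
        double s = 0.0;
        for (int m = 0; m < M; m++) s += alpha[m] * G[m](x);
        return s >= 0 ? +1 : -1;
    };
}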
148
Binary Classifier
• Constraint:– Must not have all zero clicks for current week, previous week and week before last
[shopping team uses stronger constraint: only instances with non-zero clicks for current week].
• Training: – 1.5M instances.– 0.5M instances (validation).
• Feature extraction:– 4.82mins (Cosmos job).
• Training time:– 2hrs 20mins.
• Testing:– 10k instances: 1sec.
149
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)
150
POPULARITY EVALUATION
How do we know we have a good popularity?
151
Rank Correlation Metrics
• Input: two rankings $R_1$ and $R_2$ (the actual input is a set of objects with two rank scores; ties are possible).
• Requirements:
  $-1\leq C(R_1,R_2)\leq 1$
  $C(R_1,R_2)=1$ — the two rankings are the same.
  $C(R_1,R_2)=-1$ — the two rankings are the reverse of each other.
152
Kendall’s Tau Coefficient
Considers concordant/discordant pairs in the two rankings (each ranking w.r.t. the other):
$\tau=1-\frac{2\cdot DiscordantPairs(R_1,R_2)}{n(n-1)}$
Complexity: $O(n^2)$ with naive pair counting ($O(n\log n)$ is possible).
153
What is a concordant pair?
[Example: objects a, b, c ranked a,b,c in $R_1$ and a,c,b in $R_2$.]
A pair $(a,c)$ is concordant when $R_1(a)-R_1(c)$ and $R_2(a)-R_2(c)$ have the same sign.
154
Kendall Tau: Example
$R_1$: A, B, C, D.      $R_2$: C, D, A, B.
Pairs (discordant pairs in red): (A,B), (A,C), (A,D), (B,C), (B,D), (C,D) — discordant: (A,C), (A,D), (B,C), (B,D).
Observation: the total number of discordant pairs = 2× the discordant pairs in one ranking w.r.t. the other, here $2\cdot 4=8$.
$\tau=1-\frac{2\cdot DiscordantPairs(R_1,R_2)}{n(n-1)}=1-\frac{2\cdot 8}{4\cdot(4-1)}=-\frac{1}{3}$
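A sketch that computes tau by brute-force pair counting (assumes rank arrays without ties):

// r1[j], r2[j]: rank positions of object j in the two rankings.
static double KendallTau(int[] r1, int[] r2)
{
    int n = r1.Length, discordant = 0;
    for (int a = 0; a < n; a++)
        for (int b = a + 1; b < n; b++)
            if (Math.Sign(r1[a] - r1[b]) != Math.Sign(r2[a] - r2[b]))
                discordant++;                             // pair ordered differently
    return 1.0 - 4.0 * discordant / (n * (n - 1.0));      // each pair counted once, hence 4/(n(n-1))
}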
155
Spearman's Coefficient
Considers the ranking differences for the same object:
$S(R_1,R_2)=1-\frac{6\cdot\sum_{j=1}^{n}\big(R_1(o_j)-R_2(o_j)\big)^2}{n(n^2-1)}$
Example (objects a, b, c ranked a,b,c in $R_1$ and a,c,b in $R_2$):
$S(R_1,R_2)=1-6\cdot\frac{(1-1)^2+(2-3)^2+(3-2)^2}{3(3^2-1)}=1-6\cdot\frac{2}{3\cdot 8}=\frac{1}{2}$
Complexity: $O(n)$ given the ranks. Note that $0\leq\sum_{j=1}^{n}\big(R_1(o_j)-R_2(o_j)\big)^2\leq\frac{n(n^2-1)}{3}$.
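A sketch of the corresponding computation (again assuming rank arrays without ties):

static double Spearman(int[] r1, int[] r2)
{
    int n = r1.Length;
    double sumSq = 0.0;
    for (int j = 0; j < n; j++)
    {
        double diff = r1[j] - r2[j];
        sumSq += diff * diff;                      // (R1(o_j) - R2(o_j))^2
    }
    return 1.0 - 6.0 * sumSq / (n * ((double)n * n - 1.0));
}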
156
Rank Intuitions: Setup
[Figure: objects 1..10 ordered by rank scores in $R_1$ and $R_2$; viewing $R_2$ as a scrambling of the order of $R_1$, the sequence ⟨3,1,4,10,5,9,2,6,8,7⟩ is sufficient to encode the two rankings.]
157
Rank Intuitions: Pairs
[Figure: the correlation scale from −1 (rankings in complete disagreement) through 0 to 1 (rankings in complete agreement), illustrated on pairs $R_1$, $R_2$.]
158
Rank Intuitions: Spearman
[Figure: Spearman's coefficient on the −1..1 scale for a family of rankings parameterized by $p=1,2,3,4,5,6,10,15,\dots,n$; segment lengths represent the $R_1$ rank scores.]
159
Rank Intuitions: Kendall
[Figure: Kendall's tau on the −1..1 scale for the same family of rankings ($p=1,2,3,4,5,6,10,15,\dots,n$); segment lengths represent the $R_1$ rank scores.]
160
What about ties?
The position of an object $o_j$ within a set of objects with the same scores in the rankings affects the rank correlation.
[Figure: objects with the same ranking scores; for example, the red positioning of $o_j$ leads to a lower Spearman's coefficient, the green one to a higher.]
161
Ties
• Kendall: strict discordance:
$sgn\big(score_1(a)-score_1(b)\big)\neq sgn\big(score_2(a)-score_2(b)\big)$
• Spearman:
  – Can use per-entity upper and lower bounds.
  – Or do as in the Olympics: objects with the same score have the same rank.
162
Ties: Kendall Tau-B
http://en.wikipedia.org/wiki/Kendall_tau#Tau-b
$\tau_B=\frac{n_c-n_d}{\sqrt{\big(\frac{n(n-1)}{2}-n_1\big)\big(\frac{n(n-1)}{2}-n_2\big)}}$
where: $n_c$ is the number of concordant pairs; $n_d$ is the number of discordant pairs; $n$ is the number of objects in the two rankings;
$n_1=\sum_i\frac{t_i(t_i-1)}{2}$ — the number of pairs among elements with ties in ranking 1;
$n_2=\sum_j\frac{u_j(u_j-1)}{2}$ — the number of pairs among elements with ties in ranking 2.
163
Uses of Popularity
Popularity can be used to augment the gain in NDCG by linearly scaling it: $Gain+(popularity)\cdot Gain$
label: 1 (poor) → gain −1;  2 (fair) → 1;  3 (good) → 3;  4 (excellent) → 7;  5 (perfect) → 15.
164
Next Steps
• How to determine popularity of new entities– Challenge: No historical data.– Usually there is an initial period of high popularity
(e.g., a new restaurant is featured in local paper, promotions, etc.).
• Good abandonment (no user clicks but good entity in terms of satisfying the user information needs, e.g., phone number).– Use number impressions for named queries.
165
References
1. Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook. [link]
2. Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press. [link]
3. David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press. [link]
4. Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd Edition. ACM Press Books. [link]
5. Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York. [link]
6. Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. [link]
7. Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. [link]
8. George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press. [link]
9. Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics. Springer. [link]
10. Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer. [link]
11. Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd Edition. Wiley-Interscience. [link]
12. Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer Series in Statistics. Springer. [link]
13. James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience. [link]
14. Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning series. MIT Press. [link]
15. David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press. [link]
16. Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer. [link]
17. Tom M. Mitchell (1997) Machine Learning. McGraw-Hill (Science/Engineering/Math). [link]
18. Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine Learning series. MIT Press. [link]
19. Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific. [link]
20. Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press. [link]
21. Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer. [link]
22. Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer. [link]
166
Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models
– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees• Boosting
– AdaBoost• Ranking evaluation
– Kendall tau and Spearman’s coefficient• Sequence labeling
– Hidden Markov Models (HMMs)
167
SEQUENCE LABELING: HIDDEN MARKOV MODELS (HMMs)
168
Outline
• The guessing game
• Tagging preliminaries
• Hidden Markov Models
• Trellis and the Viterbi algorithm
• Implementation (Python)
• Complexity of decoding
• Parameter estimation and smoothing
• Second order models
169
The Guessing Game
• A cow and a duck write an email message together.
• Goal: figure out which word was written by which animal.

The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).
170
What’s the Big Deal?

• The vocabularies of the cow and the duck can overlap, and it is not clear a priori who wrote a given word!
171
The Game (cont)
[Figure: the message “moo hello quack” with all three authors unknown (? ? ?), then partially resolved: moo → COW, hello → ?, quack → DUCK.]
172
The Game (cont)
[Figure: fully resolved: moo → COW, hello → COW, quack → DUCK.]
173
What about the Rest of the Animals?
[Figure: a five-word message word1 … word5; each word shows the full candidate set of authors {ZEBRA, PIG, DUCK, COW, ANT}, only one of which is correct per word — so the space of possible label sequences grows exponentially with the message length.]
174
A Game for Adults
• Instead of guessing which animal is associated with each word, guess the corresponding POS tag of each word.

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
175
POS Tags"CC", "CD", "DT", "EX", "FW","IN", "JJ", "JJR", "JJS", "LS","MD","NN", "NNS","NNP", "NNPS","PDT", "POS", "PRP", "PRP$", "RB","RBR", "RBS", "RP", "SYM", "TO","UH", "VB", "VBD", "VBG", "VBN","VBP", "VBZ", "WDT", "WP", "WP$","WRB", "#", "$", ".",",",":", "(", ")", "`", "``","'", "''"
"CC", "CD", "DT", "EX", "FW","IN", "JJ", "JJR", "JJS", "LS","MD","NN", "NNS","NNP", "NNPS","PDT", "POS", "PRP", "PRP$", "RB","RBR", "RBS", "RP", "SYM", "TO","UH", "VB", "VBD", "VBG", "VBN","VBP", "VBZ", "WDT", "WP", "WP$","WRB", "#", "$", ".",",",":", "(", ")", "`", "``","'", "''"
176
Tagging Preliminaries
• We want the best sequence of tags for a sequence of words (a sentence).
• W — a sequence of words
• T — a sequence of tags

T̂ = argmax_T P(T|W)
177
Bayes’ Theorem (1763)
P(T|W) = P(W|T) · P(T) / P(W)

posterior = likelihood · prior / marginal likelihood

Reverend Thomas Bayes — Presbyterian minister (1702–1761)
178
Applying Bayes’ Theorem

• How do we approach P(T|W)?
• Use Bayes’ theorem:

argmax_T P(T|W) = argmax_T P(W|T) · P(T) / P(W)

• So what? Why is it better?
• The denominator does not depend on T, so we can ignore it (and the question):

argmax_T P(T|W) = argmax_T P(W|T) · P(T)
179
Tag Sequence Probability

How do we get the probability P(T) of a specific tag sequence T?
• Counting the number of times the sequence occurs and dividing by the number of sequences of that length is not likely to work — far too sparse!
– Use the chain rule instead.
180
Chain Rule

P(T) = P(t₁, …, tₙ) = P(t₁) · P(t₂|t₁) · P(t₃|t₁, t₂) · … · P(tₙ|t₁, …, tₙ₋₁)

P(T) is a product of the probabilities of the N-grams that make it up; each conditioning context is the history.

Make a Markov assumption: the current tag depends only on the previous one:

P(t₁, …, tₙ) ≈ P(t₁) · ∏ᵢ₌₂ⁿ P(tᵢ|tᵢ₋₁)
181
Transition Probabilities

• Use counts from a large hand-tagged corpus.
• For bi-grams, count all the tᵢ₋₁tᵢ pairs:

P(tᵢ|tᵢ₋₁) = c(tᵢ₋₁tᵢ) / c(tᵢ₋₁)

• Some counts are zero – we’ll use smoothing to address this issue later.
What about P(W|T)?

• At first it seems odd — it asks for the probability of seeing “The white horse” given “Det Adj Noun”!
– Collect up all the times you see that tag sequence and see how often “The white horse” shows up … (too sparse to work).
• Assume each word in the sequence depends only on its corresponding tag:

P(W|T) = ∏ᵢ₌₁ⁿ P(wᵢ|tᵢ)	(emission probabilities)
183
Emission Probabilities

• What proportion of the time is the word wᵢ associated with the tag tᵢ (as opposed to another word):

P(wᵢ|tᵢ) = c(wᵢ, tᵢ) / c(tᵢ)

(see the estimation sketch below)
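A minimal sketch (not from the slides): maximum-likelihood estimates of the transition and emission probabilities from a tiny hypothetical tagged corpus; all names are illustrative:

from collections import Counter

corpus = [[('the', 'DT'), ('cow', 'NN'), ('moos', 'VBZ')],
          [('the', 'DT'), ('duck', 'NN'), ('quacks', 'VBZ')]]

tag_bigrams, tag_counts, word_tag = Counter(), Counter(), Counter()
for sentence in corpus:
    tags = ['^'] + [t for _, t in sentence]   # '^' marks the sentence start
    tag_bigrams.update(zip(tags, tags[1:]))   # count tag pairs t_{i-1} t_i
    tag_counts.update(tags)
    word_tag.update(sentence)                 # count (word, tag) pairs

def p_transition(t_prev, t):   # P(t_i | t_{i-1}) = c(t_{i-1} t_i) / c(t_{i-1})
    return tag_bigrams[(t_prev, t)] / tag_counts[t_prev]

def p_emission(w, t):          # P(w_i | t_i) = c(w_i, t_i) / c(t_i)
    return word_tag[(w, t)] / tag_counts[t]

print(p_transition('DT', 'NN'), p_emission('cow', 'NN'))   # 1.0 0.5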
184
The “Standard” Model

argmax_T P(T|W)
  = argmax_T P(W|T) · P(T) / P(W)
  = argmax_T P(W|T) · P(T)
  = argmax_T ∏ᵢ₌₁ⁿ P(wᵢ|tᵢ) · P(tᵢ|tᵢ₋₁)
185
Hidden Markov Models

• Stochastic process: a sequence X₁, X₂, … of random variables based on the same sample space Ω.
• Probabilities for the first observation: P(X₁ = xⱼ) for each outcome xⱼ.
• Next step given the previous history:

P(Xₜ = x_{iₜ} | X₁ = x_{i₁}, …, Xₜ₋₁ = x_{iₜ₋₁})
186
Markov Chain

• A Markov Chain is a stochastic process with the Markov property:

P(Xₜ = x_{iₜ} | X₁ = x_{i₁}, …, Xₜ₋₁ = x_{iₜ₋₁}) = P(Xₜ = x_{iₜ} | Xₜ₋₁ = x_{iₜ₋₁})

• Outcomes are called states.
• Probabilities for the next step – weighted finite-state automata.
187
State Transitions w/ Probabilities

[Figure: states START, COW, DUCK, END; transitions START→COW 1.0; COW→COW 0.5, COW→DUCK 0.3, COW→END 0.2; DUCK→DUCK 0.5, DUCK→COW 0.3, DUCK→END 0.2.]
188
Markov Model

Markov chain where each state can output signals (like “Moore machines”):

[Figure: the same START/COW/DUCK/END chain (START→COW 1.0; COW→COW 0.5, COW→DUCK 0.3, COW→END 0.2; DUCK→DUCK 0.5, DUCK→COW 0.3, DUCK→END 0.2), with emissions COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6; START: ^ 1.0; END: $ 1.0.]
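A minimal sketch (not from the slides): generating a message from the cow/duck Markov model above by sampling transitions and emissions; sample is a hypothetical helper:

import random

p = {'start': {'cow': 1.0},
     'cow':   {'cow': 0.5, 'duck': 0.3, 'end': 0.2},
     'duck':  {'duck': 0.5, 'cow': 0.3, 'end': 0.2}}
a = {'cow':  {'moo': 0.9, 'hello': 0.1},
     'duck': {'hello': 0.4, 'quack': 0.6}}

def sample(dist):
    # draw a key from a {outcome: probability} dictionary
    r, total = random.random(), 0.0
    for outcome, prob in dist.items():
        total += prob
        if r < total:
            return outcome
    return outcome   # guard against rounding

state, words = 'start', []
while True:
    state = sample(p[state])
    if state == 'end':
        break
    words.append(sample(a[state]))
print(' '.join(words))   # e.g. "moo moo hello quack"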
189
The Issue Was
• A given output symbol can potentially be emitted by more than one state — omnipresent ambiguity in natural language.
190
Markov Model

Finite set of states: {s₁, …, sₘ}
Signal alphabet: {σ₁, …, σₖ}
Transition matrix: P = [p_{ij}], where p_{ij} = P(X_{t+1} = s_j | X_t = s_i)
Emission probabilities: A = [a_{ij}], where a_{ij} = P(σ_j | s_i)
Initial probability vector: v = [v₁, …, vₘ], where v_j = P(X₁ = s_j)
191
Graphical Model

[Figure: a chain of STATE (tag) nodes, each pointing to the next state and emitting an OUTPUT (word) node.]
192
Hidden Markov Model

• A Markov Model for which it is not possible to observe the sequence of states.
• S: unknown — the sequence of states (the tags).
• O: known — the sequence of observations (the words).

S* = argmax_S P(S|O)
193
The State Space

[Figure: the cow/duck model unrolled for the observations moo hello quack — a COW/DUCK state pair per position, with transitions START→COW 1.0, COW→COW 0.5, COW→DUCK 0.3, DUCK→DUCK 0.5, DUCK→COW 0.3, COW→END 0.2, DUCK→END 0.2, and emissions COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6.]

More on how the probabilities come about (training) later.
194
Optimal State Sequence: The Viterbi Algorithm

We define δ_t(i), the joint probability of the most likely state sequence from time 1 to time t ending in state s_i, together with the observed sequence O_{≤t} up to time t:

δ_t(i) = max_{S_{t-1}} P(S_t = s_i; O_{≤t}) = max_{s_{i₁}, …, s_{i_{t-1}}} P(s_{i₁}, …, s_{i_{t-1}}, s_i; O_{≤t})
195
Key Observation

The most likely partial derivation leading to state s_i at position t consists of:
– the most likely partial derivation leading to some state s_{i_{t-1}} at the previous position t−1,
– followed by the transition from s_{i_{t-1}} to s_i.
196
Viterbi (cont)

Initialization: δ₁(i) = v_i · a_{i k₁}, where v_i = P(s_i) and a_{i k₁} = P(σ_{k₁} | s_i).

We will show that:

δ_{t+1}(j) = max_i [ δ_t(i) · p_{ij} ] · a_{j k_{t+1}}
197
Recurrence Equation

δ_{t+1}(j) = max_{S_t} P(S_{t+1} = s_j; S_t, O_{≤t+1})
           = max_i max_{S_{t-1}} P(s_j, σ_{k_{t+1}}, S_t = s_i, S_{t-1}; O_{≤t})
           = max_i max_{S_{t-1}} P(s_j, σ_{k_{t+1}} | S_t = s_i, S_{t-1}; O_{≤t}) · P(S_t = s_i, S_{t-1}; O_{≤t})
           = max_i [ P(s_j | s_i) · P(σ_{k_{t+1}} | s_j) · max_{S_{t-1}} P(S_t = s_i, S_{t-1}; O_{≤t}) ]
           = max_i [ p_{ij} · δ_t(i) ] · a_{j k_{t+1}}

The second-to-last step uses the Markov property: the next state depends only on the current state, and the emitted signal depends only on the emitting state.
198
Back Pointers

• The predecessor of state s_i in the path corresponding to δ_t(i):

ψ_{t+1}(j) = argmax_{1≤i≤m} [ δ_t(i) · p_{ij} ]

• Optimal state sequence:

s*_n = argmax_{1≤i≤m} δ_n(i)
s*_t = ψ_{t+1}(s*_{t+1})   for t = n−1, …, 1
199
The Trellis

[Figure: the Viterbi trellis for the observation sequence ^ moo hello quack $. The δ values per state and time:]

	t=0	t=1	t=2	t=3	t=4
START	1	0	0	0	0
COW	0	0.9	0.045	0	0
DUCK	0	0	0.108	0.0324	0
END	0	0	0	0	0.00648

(The competing path into DUCK at t=3, via COW, scores 0.045·0.3·0.6 = 0.0081 and loses to 0.108·0.5·0.6 = 0.0324.)
200
Implementation (Python)

observations = ['^','moo','hello','quack','$']   # signal sequence
states = ['start','cow','duck','end']

# Transition probabilities - p[FromState][ToState] = probability
p = {'start': {'cow':1.0},
     'cow':   {'cow':0.5, 'duck':0.3, 'end':0.2},
     'duck':  {'duck':0.5, 'cow':0.3, 'end':0.2}}

# Emission probabilities; special emission symbol '$' for 'end' state
a = {'cow':  {'moo':0.9, 'hello':0.1, 'quack':0.0, '$':0.0},
     'duck': {'moo':0.0, 'hello':0.4, 'quack':0.6, '$':0.0},
     'end':  {'moo':0.0, 'hello':0.0, 'quack':0.0, '$':1.0}}
201
Implementation (Viterbi)

T = len(observations)
# Initialize the viterbi table row by row; v[state][time] = prob
v = {}
for s in states: v[s] = [0.0] * T

# Initialize the back pointers
backPointer = {}
for s in states: backPointer[s] = [""] * T

v['start'][0] = 1.0
for t in range(T-1):              # populate column t+1 in v
    for s in states[:-1]:         # 'end' state has no outgoing arcs
        # only consider the 'start' state at time 0
        if t == 0 and s != 'start': continue
        for s1 in p[s].keys():    # s1 is the next state
            newScore = v[s][t] * p[s][s1] * a[s1][observations[t+1]]
            if v[s1][t+1] == 0.0 or newScore > v[s1][t+1]:
                v[s1][t+1] = newScore
                backPointer[s1][t+1] = s
202
Implementation (Best Path)
# Now recover the optimal state sequence by following the back pointers
state = 'end'
state_sequence = ['end']
for t in range(T-1, 0, -1):
    state = backPointer[state][t]
    state_sequence = [state] + state_sequence

print("Observations....: ", observations)
print("Optimal sequence: ", state_sequence)
203
Complexity of Decoding
• O(m²·n) — linear in n (the length of the string), where m is the number of states
• Initialization: O(m·n)
• Back tracing: O(n)
• Next step: O(m²):

for current_state in s1..sm    # at time t+1
    for prev_state in s1..sm   # at time t
        compute value
        compare with best_so_far

• There are n next steps.
204
Parameter Estimation for HMMs
• Need annotated training data (Brown corpus, Penn Treebank).
• Signal and state sequences are both known.
• Calculate observed relative frequencies (as in the counting sketch earlier).
• Complication — the sparse data problem (need for smoothing; see the sketch below).
• One can also train from raw, untagged data — the Baum-Welch (forward-backward) algorithm.
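A minimal sketch (not from the slides) of add-one (Laplace) smoothing for the emission estimates, so unseen (word, tag) pairs get a small non-zero probability; counts are as in the earlier estimation sketch and vocab_size is the number of distinct words:

def p_emission_smoothed(w, t, word_tag, tag_counts, vocab_size):
    # add-one smoothing: (c(w,t) + 1) / (c(t) + |V|)
    return (word_tag[(w, t)] + 1) / (tag_counts[t] + vocab_size)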
205
Optimization
• Build a vocabulary of possible tags for each word.
• Keep total counts for words.
• If a word occurs frequently (count > threshold), consider its tag set exhaustive.
• For frequent words, only consider the tags in its tag set (vs. all tags).
• For unknown words, don’t consider tags corresponding to closed-class words (e.g., DT) — see the sketch below.
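A minimal sketch (not from the slides) of pruning the candidate tags per word; the names and the closed-class tag set are hypothetical:

CLOSED_CLASS = {'DT', 'IN', 'CC', 'TO', 'MD'}

def candidate_tags(word, tag_dict, all_tags):
    if word in tag_dict:
        return tag_dict[word]            # known word: only the tags observed with it
    return all_tags - CLOSED_CLASS       # unknown word: open-class tags only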
206
Applications Using HMMs
• POS tagging (as we have seen).
• Chunking.
• Named Entity Recognition (NER).
• Speech recognition.
207
Exercises
• Implement the training (parameter estimation).
• Use a dictionary of valid tags for known words to constrain which tags are considered for a word.
• Implement a second-order model.
• Implement the decoder in Ruby.
208
Some POS Taggers
• Alias-I: http://www.alias-i.com/lingpipe
• AUTASYS: http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm
• Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
• CLAWS: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
• Connexor: http://www.connexor.com/software/tagger
• Edinburgh (LTG): http://www.ltg.ed.ac.uk/software/pos/index.html
• FLAT (Flexible Language Acquisition Tool): http://lanaconsult.com
• fnTBL: http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html
• GATE: http://gate.ac.uk
• Infogistics: http://www.infogistics.com/posdemo.htm
• Qtag: http://www.english.bham.ac.uk/staff/omason/software/qtag.html
• SNoW: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=POS
• Stanford: http://nlp.stanford.edu/software/tagger.shtml
• SVMTool: http://www.lsi.upc.edu/~nlp/SVMTool
• TNT: http://www.coli.uni-saarland.de/~thorsten/tnt
• Yamcha: http://chasen.org/~taku/software/yamcha/
209
References
1. Brants, Thorsten. 2000. TnT – A Statistical Part-of-Speech Tagger. 6th Applied NLP Conference (ANLP-2000), 224-231, Seattle, U.S.A.
2. Charniak, Eugene, Curtis Hendrickson, Neil Jacobson & Mike Perkowitz. 1993. Equations for part-of-speech tagging. 11th National Conference on Artificial Intelligence, 784-789. Menlo Park: AAAI Press/MIT.
3. Krenn, Brigitte & Christer Samuelsson. 1997. Statistical Methods in Computational Linguistics. ESSLLI Summer School Lecture Notes, 11-22 August, Aix-en-Provence, France.
4. Rabiner, Lawrence R. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, vol. 77, 256-286.
5. Samuelsson, Christer. 2000. Extending N-gram tagging to word graphs. Recent Advances in Natural Language Processing II, ed. by Nicolas Nicolov & Ruslan Mitkov. Current Issues in Linguistic Theory (CILT), vol. 189, pp 3-20. John Benjamins: Amsterdam/Philadelphia.
6. Shin, Jung Ho, Young S. Han & Key-Sun Choi. 1997. A HMM part-of-speech tagger for Korean with wordphrasal relations. Recent Advances in Natural Language Processing, ed. by Nicolas Nicolov & Ruslan Mitkov. Current Issues in Linguistic Theory (CILT), vol. 136, pp 439-450. John Benjamins: Amsterdam/Philadelphia.
210
Statistics Refresher

• Outcome: an individual atomic result of a (non-deterministic) experiment.
• Event: a set of outcomes.
• Probability: the limit of target outcome over number of experiments (frequentist view) or a degree of belief (Bayesian view).
• Normalization condition: probabilities for all outcomes sum to 1.
• Distribution: the probabilities associated with each outcome.
• Random variable: a mapping of the outcomes to real numbers.
• Joint distribution: conducting several (possibly related) experiments and observing the results; states the probability for a combination of values of several random variables.
• Marginal: finding the distribution of a random variable from a joint distribution.
• Conditional probability (Bayes’ rule): knowing the value of one variable constrains the distribution of another.
• Probability density function: the probability that a continuous variable is in a certain range.
• Probabilistic reasoning: introduce evidence (set certain variables) and compute probabilities of interest (conditioned on this evidence).
211
Definitions

Expectation: μ = E[X] = Σᵢ₌₁ⁿ xᵢ·p(xᵢ) = ∫₋∞^∞ x·p(x) dx

Mode: x* = argmaxᵢ p(xᵢ)

Variance: σ² = Var(X) = E[(X−μ)²] = E[X²] − μ²

Expectation of a function: E[f(X)] = Σᵢ₌₁ⁿ f(xᵢ)·p(xᵢ) = ∫₋∞^∞ f(x)·p(x) dx

n-th moment: E[Xⁿ] = Σᵢ₌₁ⁿ xᵢⁿ·p(xᵢ)   (μ is the first moment)

Properties: E[aX+b] = a·E[X]+b;   E[X+Y] = E[X]+E[Y];   Var[aX+b] = a²·Var[X]
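A minimal numeric check (not from the slides) of the identity Var(X) = E[X²] − μ², using a fair die as a hypothetical distribution:

outcomes = [1, 2, 3, 4, 5, 6]
p = 1.0 / len(outcomes)
mu = sum(x * p for x in outcomes)                            # E[X] = 3.5
var = sum((x - mu) ** 2 * p for x in outcomes)               # E[(X - mu)^2]
print(mu, var, sum(x * x * p for x in outcomes) - mu ** 2)   # 3.5 2.9166... 2.9166...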
212
Intuitions about Scale

[Figure: quantities placed along a logarithmic scale:]
• Weight in grams if the Earth were to be a black hole.
• Age of the universe in seconds.
• Number of cells in the human body (100 trillion).
• Number of neurons in the human brain.
• Standard Blu-ray disc size, XL 4-layer (128 GB).
• One year in seconds.
• Items in the Library of Congress (the largest library in the world).
• Length of the Nile in meters (the longest river).
213
Acknowledgements

• Bran Boguraev
• Chris Brew
• Jinho Choi
• William Headden
• Jingjing Li
• Jason Kessler
• Mike Mozer
• Shumin Wu
• Tong Zhang
• Amir Padovitz
• Bruno Bozza
• Kent Cedola
• Max Galkin
• Manuel Reyes Gomez
• Matt Hurst
• John Langford
• Priyank Singh