Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

Machine Learning with Applications in Categorization, Popularity and Sequence labeling (linear models, decision trees, ensemble methods, evaluation) Dr. Nicolas Nicolov <1st_last@yahoo.com>

description

Machine Learning tutorial series.

Transcript of Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

Page 1: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

(linear models, decision trees, ensemble methods, evaluation)

Dr. Nicolas Nicolov <[email protected]>

Page 2: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

2

Goals

• Introduce important ML concepts
• Illustrate ML techniques through examples in:
  – Categorization
  – Popularity
  – Sequence labeling

(tutorial aims to be self-contained and to explain the notation)

Page 3: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

3

Outline
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 4: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

4

EXAMPLES OF MACHINE LEARNING
Why? – Get a flavor of the diversity of areas where ML is applied.

Page 5: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

5

Sequence Labeling

George W. Bush discussed Iraq
PER    PER  PER                GPE        (GPE = Geo-Political Entity)

<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>

(like search query analysis)

Page 6: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

6

Spam

www.dietsthatwork.com

www . dietsthatwork . com

www . diets that work . com

SPAM!

further segmentation

classification

Page 7: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

7

Tokenization
What!?I love the iphone:-)

What !? I love the iphone :-)

How difficult can that be? — 98.2% [Zhang et al. 2003]

NO TRESSPASSING VIOLATORS WILL BE PROSECUTED

Page 8: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

8

NL Parsing

Unlike my sluggish Chevy the Audi handles the winding mountain roads superbly

syntactic structure (dependency labels in the parse figure include CONTR, PREP, POSS, DET, MOD, SUBJ, DOBJ, MANR)

Page 9: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

9

State Transitions

(figure: parser configurations with stack λ and buffer β before and after each action)

LEFTARC:  RIGHTARC:  NOARC:  SHIFT:

using ML to make the decision which action to take

Page 10: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

10

Two Ladies in a Men’s Club

Page 11: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

11

We serve men — SUBJ/IOBJ reading vs. SUBJ/DOBJ reading:

serve —IndirectObject→ men:  "We serve food to men."  "We serve our community."
serve —DirectObject→ men:  "We serve organic food."  "We serve coffee to connoisseurs."

Page 12: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

12

Audi is an automaker that makes luxury cars and SUVs. The company was born in Germany. It was established by August Horch in 1910. Horch had previously founded another company and his models were quite popular. Audi started with four cylinder models. By 1914, Horch's new cars were racing and winning. August Horch left the Audi company in 1920 to take a position as an industry representative for the German motor vehicle industry federation. Currently Audi is a subsidiary of the Volkswagen group and produces cars of outstanding quality.

Coreference

Page 13: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

13

Parts of Objects (Meronymy)

[…] the interior seems upscale with leatherette upholstery that looks and feels better than the real cow hide found in more expensive vehicles, a dashboard accented by textured soft-touch materials, a woven mesh

headliner, and other materials that give the New Beetle’s interior a sense of quality. […] Finally, and a big plus in my book, both front seats were height adjustable, and the steering column tilted and telescoped for optimum comfort.

Page 14: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

14

Sentiment Analysis

I love pineapple nearly as much as I hate bananas.

POSITIVE sentiment regarding topic pineapple.

(figure: mentions of Xbox classified as Positive, Negative, or Neutral)

Page 15: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

15

Chinese Sentiment

Car aspects Sentiment categories

Sentence

Page 16: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

16

Page 17: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

17

Page 18: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

18

Categorization

• High-level task:
  – Given a restaurant, what is its restaurant sub-category?
• Encoding entities with features
• Feature selection
• Linear models

non-standard order

“Though this be madness, yet there is method in't.”

Page 19: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

19

Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 20: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

20

ENCODING OBJECTS WITH FEATURESWhy?– ML algorithms are “generic”; most of them are cast as solutions around vector encodings of the domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as feature vectors. How well we do this (the quality of features) directly impacts system performance.

Page 21: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

21

Flat Object → Encoding

1 0 0 1 1 1 0 1 …    37

The 0/1 entries are the feature values (binary in this example); 37 is the target class index (for "asian"). Together they form a machine learning (training) instance/example/observation.

Example features: default feature (always on); Name has "asian bistro"; Name has "restaurant"; Name has "ginger"; Description has "china"; Description has "indonesia"; URL has "french"; has FB page.

The target can be a set — an object can belong to several classes. The number of features can be millions.

Page 22: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

22

Structured Objects to Strings to Features

Structured object with fields f1 … f6; field f2:f4 contains the string "a b c d e".

Feature strings:
  uni-grams: "f2:f4>a", "f2:f4>b", "f2:f4>c", …
  bi-grams: "f2:f4>a_b", "f2:f4>b_c", "f2:f4>c_d", …
  tri-grams: "f2:f4>a_b_c", "f2:f4>b_c_d", …

Feature string → Feature index
*DEFAULT* → 0
…
f2:f4>a → 100
f2:f4>b → 101
f2:f4>c → 102
…
f2:f4>a_b → 105
f2:f4>b_c → 106
f2:f4>c_d → 107
…
f2:f4>a_b_c → 109

Read as field “f2:f4” contains feature “a”.

Table can be quite large.

Page 23: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

23

Sliding Window (bi-grams)

SkyCity at the Space Needle
^ SkyCity at the Space Needle $        (add initial "^" and final "$" tokens)

A window slides over the token sequence, producing the bi-grams:
^_SkyCity, SkyCity_at, at_the, the_Space, Space_Needle, Needle_$

Page 24: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

24

Example: Feature Templates

public static List<string> NGrams( string field )
{
    var features = new List<string>();
    string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries );

    features.Add( string.Join( "", field.Split( SPLIT_CHARS ) ) ); // the entire field

    string unigram = string.Empty, bigram = "^", previous1 = "^", previous2 = "^", trigram;

    for (int i = 0; i < tokens.Length; i++)
    {
        unigram = tokens[ i ];
        features.Add( unigram );

        bigram = previous1 + "_" + unigram;
        features.Add( bigram );

        if ( i >= 1 )
        {
            trigram = previous2 + "_" + bigram;
            features.Add( trigram );
        }

        previous2 = previous1;
        previous1 = unigram;
    }
    features.Add( unigram + "_$" );
    features.Add( bigram + "_$" );

    return features;
}

The initial bi-gram is "^_tokens[0]"; the initial tri-gram is "^_tokens[0]_tokens[1]"; the last tri-gram is "tokens[tokens.Length-2]_tokens[tokens.Length-1]_$".

could add field name as argument and prefix all features

Page 25: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

25

The Art of Feature Engineering: Disjunctive Features

• Useful feature = triggers often and with a particular class.
• Rarely occurring (but indicative of a class) features can be combined in a disjunction. This results in:
  – Need for less data to achieve good performance.
  – Final system performance (with all available data) is higher.
• How can we get insights about such features? Error analysis!

Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese| branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi| gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini| panino| parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto| radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu| tortellini|vitello|vongole");

if (ITALIAN_FOOD.Match(entity.description).Success) features.Add("Italian_Food_Matched_Description");

It is up to us what we call the feature; the regex match is what triggers it.

Page 26: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

26

instance( class= 7, features=[0,300857,100739,200441,...])instance( class=99, features=[0,201937,196121,345758,13,...])instance( class=42, features=[0,99173,358387,1001,1,...])...

Generic Nature of ML Systems

human sees

computer “sees”

Default feature always triggers.

The number of features that trigger for individual instances is often not the same.

Indices of (binary) features that trigger.

Page 27: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

27

Training Data

$X=\begin{pmatrix} x_0^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_0^{(N)} & \cdots & x_d^{(N)} \end{pmatrix} \qquad \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix}$

Each row is an instance with its outcome.

Page 28: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

28

Feature Selection

• Templates: powerful way to get lots of features.
• We get too many features (e.g., 20M for dependency parsing).
• Danger of overfitting: doing well on seen data but poorly on unseen data.
• Feature selection — automatic ways of finding discriminative features:
  – CountCutOff
  – TF×IDF
  – Mutual information
  – Information gain
  – Chi square (we will examine its implementation in detail)

Page 29: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

29

Mutual Information
• A measure of relative entropy between the distributions of two random variables.
• $MI(f,C)$ = expected value of $I(f,c)$ across all classes.
• An alternative is to use $I_{max}(f,C)$.

$I(f,c)=\log\dfrac{P(f,c)}{P(f)\,P(c)}=\log\dfrac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$

$MI(f,C)=\sum_{c\in C}P(c)\,I(f,c)=\sum_{c\in C}\dfrac{n_c}{N_t}\log\dfrac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$

$I_{max}(f,C)=\max_{c\in C}I(f,c)=\max_{c\in C}\log\dfrac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$
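
A minimal sketch (not from the slides) of computing $MI(f,C)$ from the counts above; the variable names nFC, nC, nF and Nt are illustrative:

// Mutual information of one feature with the set of classes, from co-occurrence counts.
// nFC[c] = n_{f,c}, nC[c] = n_c, nF = n_f, Nt = total number of instances. Assumes 'using System;'.
static double MutualInformation(int[] nFC, int[] nC, int nF, int Nt)
{
    double mi = 0.0;
    for (int c = 0; c < nC.Length; c++)
    {
        if (nFC[c] == 0) continue;                 // skip log(0) terms
        double pJoint = (double)nFC[c] / Nt;       // P(f,c)
        double pF = (double)nF / Nt;               // P(f)
        double pC = (double)nC[c] / Nt;            // P(c)
        mi += pC * Math.Log(pJoint / (pF * pC));   // P(c) * I(f,c)
    }
    return mi;
}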

Page 30: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

30

Information Gain

Balances the effect of the feature triggering for an object with the effect of the feature being absent.

$IG(f,C)=H(C)-H(C\mid f)-H(C\mid\neg f)$

$=-\sum_{c\in C}P(c)\log P(c)-\Big(-\sum_{c\in C}P(f,c)\log P(c\mid f)\Big)-\Big(-\sum_{c\in C}P(\neg f,c)\log P(c\mid\neg f)\Big)$

$=-\sum_{c\in C}\Big(\dfrac{n_c}{N_t}\log\dfrac{n_c}{N_t}-\dfrac{n_{f,c}}{N_t}\log\dfrac{n_{f,c}}{n_f}-\dfrac{n_c-n_{f,c}}{N_t}\log\dfrac{n_c-n_{f,c}}{N_t-n_f}\Big)$

Page 31: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

31

Chi Square

Quantifies lack of independence between feature $f$ and class $c$:

$\chi^2(f,c)=\dfrac{N_t\big(P(f,c)P(\neg f,\neg c)-P(f,\neg c)P(\neg f,c)\big)^2}{P(f)P(\neg f)P(c)P(\neg c)}=\dfrac{N_t\big(n_{f,c}(N_t-n_f-n_c+n_{f,c})-(n_f-n_{f,c})(n_c-n_{f,c})\big)^2}{n_c\,n_f\,(N_t-n_c)(N_t-n_f)}$

With contingency-table counts a = n(f,c), b = n(¬f,c), c = n(f,¬c), d = n(¬f,¬c):

float Chi2(int a, int b, int c, int d)
{
    // (a+b+c+d) * (ad-bc)^2 / ((a+b)(a+c)(c+d)(b+d))
    // The square must be written as a multiplication ('^' is XOR in C#),
    // and the arithmetic is done in double to avoid integer overflow.
    double diff = (double)a * d - (double)b * c;
    double num = (double)(a + b + c + d) * diff * diff;
    double den = (double)(a + b) * (a + c) * (c + d) * (b + d);
    return (float)(num / den);
}

Calling: Chi2( , , , )

Page 32: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

32

Exponent (Log) Trick
While the final output may not be big, intermediate results are. Solution: compute in log space, using $x=e^{\ln x}$:

$\dfrac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}=e^{\ln\frac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}}=e^{\ln\big((a+b+c+d)(ad-bc)^2\big)-\ln\big((a+b)(a+c)(c+d)(b+d)\big)}=e^{\ln(a+b+c+d)+2\ln|ad-bc|-\ln(a+b)-\ln(a+c)-\ln(c+d)-\ln(b+d)}$

float Chi2_v2(int a, int b, int c, int d)
{
    double total = a + b + c + d;
    double n = Math.Log(total);
    double num = 2.0 * Math.Log(Math.Abs((double)a * d - (double)b * c));
    double den = Math.Log(a + b) + Math.Log(a + c) + Math.Log(c + d) + Math.Log(b + d);
    return (float)Math.Exp(n + num - den);
}

Page 33: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

33

Chi Square: Score per Feature

• We know how to compute $\chi^2(f,c)$.
• Two options for an aggregate score across classes:
  – Weighted average: $\chi^2(f)=\sum_{c\in Classes}P(c)\,\chi^2(f,c)$
  – Highest score among any class: $\chi^2(f)=\max_{c\in Classes}\chi^2(f,c)$

Page 34: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

34

Chi Square Feature Selection

int[] featureCounts = new int[ numFeatures ];
int numLabels = labelIndex.Count;
int[] classTotals = new int[ numLabels ];           // instances with that label
float[] classPriors = new float[ numLabels ];       // class priors: classTotals[label]/numInstances
int[,] counts = new int[ numLabels, numFeatures ];  // (label, feature) co-occurrence counts
int numInstances = instances.Count;

...  // do a pass over the data and collect the above counts

float[] weightedChiSquareScore = new float[ numFeatures ];
for (int f = 0; f < numFeatures; f++)  // f is a feature index
{
    float score = 0.0f;
    for (int labelIdx = 0; labelIdx < numLabels; labelIdx++)
    {
        int a = counts[ labelIdx, f ];              // label and feature co-occur
        int b = classTotals[ labelIdx ] - a;        // label without the feature
        int c = featureCounts[ f ] - a;             // feature without the label
        int d = numInstances - ( a + b + c );       // neither
        if (a >= MIN_SUPPORT && b >= MIN_SUPPORT)   // MIN_SUPPORT = 5
        {
            score += classPriors[ labelIdx ] * Chi2( a, b, c, d );  // weighted average across all classes
        }
    }
    weightedChiSquareScore[ f ] = score;
}

Page 35: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

35

⇒ Summary: Encoding

• Object representation is crucial.
• Humans: good at suggesting features (templates).
• Computers: good at filtering (feature selection).

• Feature engineering: Ensuring systems use the “right” features.

The system designer does not have to worry about which feature is more important or useful, and the job is left to the learning algorithm to assign appropriate weights to the corresponding features. The system designer’s job is to define a set of features that is large enough to represent most of the useful information, yet small enough to be manageable for the algorithms and the infrastructure.

Page 36: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

36

Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 37: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

37

MACHINE LEARNING: GENERAL FRAMEWORK

Page 38: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

38

Machine Learning: Representation

Complex decision making: Object → Outcome, e.g. Entity → Category, Entity → Popularity, Entity → IsChainElement.

$X\to Y$, i.e. $\vec{x}=(x_0,\ldots,x_d)\to Y$

$\vec{x}$: the object encoded with features (think DB attributes / OO member fields of primitive types); $d$ is the feature dimensionality (the input/independent variable).
$Y$: the prediction (response/dependent variable); it can be qualitative/quantitative (classification/regression).
The classifier maps the encoded object to the prediction.

We may know the relation for certain values of $\vec{x}$ and $y$: $(\vec{x},y)$.
In fact, we may know the relation for many $\vec{x}$'s and $y$'s: $\{(\vec{x}^{(1)},y^{(1)}),\ldots,(\vec{x}^{(N)},y^{(N)})\}$; the $i$-th is $(\vec{x}^{(i)},y^{(i)})$.

Page 39: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

39

Notation

$\vec{x}^{(i)}=(x_0^{(i)},\ldots,x_j^{(i)},\ldots,x_d^{(i)})$ is the $i$-th instance.

$N$ is the total number of data items.

$(i)$ is not "to the power of" — hence the parentheses.

$x_j^{(i)}$ is the $j$-th component of the feature vector $\vec{x}^{(i)}$.

We will often have $x_0$ be the default feature with value 1.

Page 40: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

40

TRAINING

Machine Learning

Online system: an input object encoded with features goes through the classifier, which produces the prediction (response/dependent variable) as the final output. The classifier uses a model produced offline by the training sub-system from training data.

$X\to Y$, where $f(X)=Y$. The task is very complex — it is hard to construct a good $f$ directly. We construct an approximation $g$ to $f$, chosen from a hypothesis space.

Page 41: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

41

Classes of Learning Problems

• Classification: Assign a category to each item (Chinese | French | Indian | Italian | Japanese restaurant).

• Regression: Predict a real value for each item (stock/currency value, temperature).

• Ranking: Order items according to some criterion (web search results relevant to a user query).

• Clustering: Partition items into homogeneous groups (clustering twitter posts by topic).

• Dimensionality reduction: Transform an initial representation of items into a lower-dimensional representation while preserving some properties (preprocessing of digital images).

Page 42: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

42

ML Terminology
• Examples: Items or instances used for learning or evaluation.
• Features: Set of attributes, represented as a vector, associated with an example.
• Labels: Values or categories assigned to examples. In classification the labels are categories; in regression the labels are real numbers.
• Target: The correct label for a training example. This is extra data that is needed for supervised learning.
• Output: The label predicted from an input set of features using the model of the machine learning algorithm.
• Training sample: Examples used to train a machine learning algorithm.
• Validation sample: Examples used to tune the parameters of a learning algorithm.
• Model: Information that the machine learning algorithm stores after training. The model is used when predicting the output labels of new, unseen examples.
• Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage.
• Loss function: A function that measures the difference/loss between a predicted label and a true label. We will design the learning algorithms so that they minimize the error (cumulative loss across all training examples).
• Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The learning algorithm chooses one function among those in the hypothesis set to return after training. Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters (e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the parameters that minimize the error.
• Model selection: Process for selecting the free parameters of the algorithm (actually of the function in the hypothesis set).

Page 43: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

43

Classification

• Data: $\{(\vec{x}^{(i)},y^{(i)})\}_{i=1}^{N}$
• Binary classification:
  – Outcomes: $y\in\{-1,+1\}$

(figure: + and − training points separated by a decision boundary)

Yes, this is mysterious at this point.

Page 44: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

44

Multi-Class Classification

• Outcomes: more than two classes.
• Common to use binary classification approaches: One-Versus-All (OVA), One-Versus-One (OVO).

Page 45: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

45

One-Versus-All (OVA)

For each category in turn, create a binary classifier where an instance in the data belonging to the category is considered a positive example, all other examples are considered negative examples.

Given a new object, run all these binary classifiers and see which classifier has the “highest prediction”.

The scores from the different classifiers need to be calibrated!

$\hat{y}=\arg\max_{y\in Classes}PredictScore_y(\vec{x})$
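
A minimal sketch (not the slides' code) of the OVA decision rule above; the scorers array of per-class prediction functions is an illustrative assumption:

// One-Versus-All: run every binary classifier and take the class with the highest (calibrated) score.
static int PredictOVA(Func<float[], float>[] scorers, float[] x)
{
    int best = -1;
    float bestScore = float.NegativeInfinity;
    for (int c = 0; c < scorers.Length; c++)
    {
        float s = scorers[c](x);
        if (s > bestScore) { bestScore = s; best = c; }
    }
    return best;
}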

Page 46: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

46

One-Versus-One (OVO)For each pair of classes, create binary classifier on data labeled as either of the classes.

How many such classifiers?

Given a new instance run all classifiers and predict class with maximum number of wins.

$\binom{k}{2}=\dfrac{k(k-1)}{2}$

Page 47: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

47

Errors
"Nobody is perfect, but then again, who wants to be nobody."

Binary classifier: $\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})$ is the value predicted by the algorithm for input data point $\vec{x}^{(i)}$; $y^{(i)}$ is the corresponding true value.

$Error=\dfrac{1}{N}\sum_{i=1}^{N}\big|\hat{y}^{(i)}-y^{(i)}\big|$ — the fraction of misclassified examples (a penalty score of 1 for every misclassified example); the $\{0,1\}$ encoding makes more sense here than $\{-1,+1\}$.

More generally: $Error=\dfrac{1}{N}\sum_{i=1}^{N}Loss(\hat{y}^{(i)},y^{(i)})$ — the average error across all instances, where $Loss(\hat{y},y)$ is the point-wise error for a data point. This particular function, $Loss(\hat{y},y)=|\hat{y}-y|$, is called "Zero-One Loss" (for simplicity we are skipping the indices).

Goal: minimize the Error. It is beneficial to have a differentiable loss function.

Page 48: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

48

Error: Function of the Parameters

$\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})=g(\vec{x}^{(i)},params)$ — the value predicted by the algorithm for input data point $\vec{x}^{(i)}$.

The cumulative error across all instances is a function of the parameters:

$Error(params)=\dfrac{1}{N}\sum_{i=1}^{N}Loss(\hat{y}^{(i)},y^{(i)})=\dfrac{1}{N}\sum_{i=1}^{N}Loss\big(g(\vec{x}^{(i)},params),y^{(i)}\big)$

1. When the $\vec{x}$'s and the $y$'s are fixed we can compute (optimize) the params (training).
2. When the params are fixed we can compute $\hat{y}$ given $\vec{x}$ (testing).

Page 49: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

49

Evaluation

• Motivation:– Benchmark algorithms (which system is better).– Tuning parameters during training.

Page 50: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

50

Evaluation Measures

Generalization error: the probability of misclassifying an instance selected according to the distribution of the labeled instance space. Classification accuracy = 1 − generalization error.

Training error: the percentage of training examples which are misclassified. It is an optimistically biased estimate of the generalization error, especially if the inducer over-fits the (training) data.

Empirical estimation of the generalization error:
• Heldout method
• Re-sampling:
  1. Random resampling
  2. Cross-validation

Page 51: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

51

Precision, Recall and F-measure

Let’s consider binary classification:

Space of all instances

Instances identified as positive by the system.

Positive instances in reality.

System identified these as positive but got them wrong(false positive).

System identified these as positive but got them correct(true positive).

System identified these as negative but got them wrong(false negative).

System identified these as negative and got them correct(true negative).

General Setup

Page 52: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

52

Accuracy, Precision, Recall, and F-measure

Definitions (TP: true positives, FP: false positives, FN: false negatives, TN: true negatives):

Precision: $p=\dfrac{TP}{TP+FP}$

Recall: $r=\dfrac{TP}{TP+FN}$

Accuracy: $acc=\dfrac{TP+TN}{TP+TN+FP+FN}$

F-measure (harmonic mean of precision and recall): $F=\dfrac{1}{\frac{1}{2}\big(\frac{1}{p}+\frac{1}{r}\big)}=\dfrac{2pr}{p+r}$
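
A minimal sketch (not from the slides) computing these four quantities from the raw counts:

// Precision, recall, F-measure and accuracy from the four confusion counts.
static (double p, double r, double f, double acc) PRF(int tp, int fp, int fn, int tn)
{
    double p = (double)tp / (tp + fp);
    double r = (double)tp / (tp + fn);
    double f = 2 * p * r / (p + r);                       // harmonic mean of p and r
    double acc = (double)(tp + tn) / (tp + tn + fp + fn);
    return (p, r, f, acc);
}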

Page 53: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

53

Accuracy vs. Prec/Rec/F-meas
Accuracy can be misleading for evaluating a model on an imbalanced class distribution. When there are many more majority-class instances than minority-class instances, always predicting the majority class gives good accuracy.

Precision and recall (together) are better indicators.

As a single, aggregate number f-measure favors the lower of the precision or recall.

Page 54: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

54

Extreme Cases for Precision & Recall

If very few (one in the extreme) instances are correctly predicted as belonging to the class, precision is 100% (FP = 0) but recall is low (FN is high).

If all instances are predicted as belonging to the class (some correctly, some not), recall is 100% (FN = 0) but precision is low (FP is high).

Precision can be traded for recall and vice versa.

Page 55: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

55

Sensitivity & Specificity

Definitions:

Sensitivity (same as recall; aka true positive rate): $Sensitivity=\dfrac{TP}{TP+FN}$

Specificity (aka true negative rate): $Specificity=\dfrac{TN}{TN+FP}$

False positive rate: $FPR=\dfrac{FP}{FP+TN}$

False negative rate: $FNR=\dfrac{FN}{FN+TP}$

Misclassification rate: $1-Acc=\dfrac{FP+FN}{TP+TN+FP+FN}$

Page 56: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

56

Venn Diagrams

John Venn (1880) “On the Diagrammatic and Mechanical Representation of Propositions and Reasonings”, Philosophical Magazine and Journal of Science, 5:10(59).

These visualization diagrams were introduced by John Venn:

What if there are three classes?

Four classes?

Six classes?

With more classes our visual intuitions are helping less and less.

A subtle point: These are just the actual/real classes without the system classes drawn on top!

Page 57: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

57

Confusion Matrix

Shows how the predictions of instances of an actual class are distributed across all classes. Layout for three classes:

                   Predicted class A                        Predicted class B                        Predicted class C
Actual class A     # in actual class A AND predicted as A   # in actual class A BUT predicted as B   …      (row total = actual instances of A)
Actual class B     …                                        …                                        …      (row total = actual instances of B)
Actual class C     …                                        …                                        …      (row total = actual instances of C)
                   (column total = predicted as A)          (column total = predicted as B)          (column total = predicted as C)      (grand total = all instances)

Counts on the diagonal are the true positives for each class. Counts off the diagonal are errors. Confusion matrices can handle many classes.

Page 58: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

58

Confusion Matrix: Accuracy, Precision and Recall

                  Predicted A   Predicted B   Predicted C   Total
Actual class A         50            80            70         200
Actual class B         40           140           120         300
Actual class C        120           220           160         500
Total                 210           440           350        1000

Given a confusion matrix, it's easy to compute accuracy, precision and recall (confusion matrices can, themselves, be confusing sometimes):

$Accuracy=\dfrac{50+140+160}{1000}$    $Precision_A=\dfrac{50}{50+40+120}$    $Recall_A=\dfrac{50}{50+80+70}$
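
A minimal sketch (not from the slides) of the same computations for any number of classes, where m[actual, predicted] holds the counts of the matrix above:

// Accuracy plus per-class precision and recall from a confusion matrix. Assumes 'using System;'.
static void ConfusionStats(int[,] m)
{
    int k = m.GetLength(0);
    int total = 0, diagonal = 0;
    for (int a = 0; a < k; a++)
        for (int p = 0; p < k; p++) { total += m[a, p]; if (a == p) diagonal += m[a, p]; }
    Console.WriteLine("Accuracy = " + (double)diagonal / total);

    for (int c = 0; c < k; c++)
    {
        int predictedAsC = 0, actuallyC = 0;
        for (int i = 0; i < k; i++) { predictedAsC += m[i, c]; actuallyC += m[c, i]; }
        Console.WriteLine("Class " + c + ": precision = " + (double)m[c, c] / predictedAsC
                          + ", recall = " + (double)m[c, c] / actuallyC);
    }
}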

Page 59: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

59

Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 60: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

60

LINEAR MODELS
Why? – Linear models are a good way to learn about core ML concepts.

Page 61: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

61

Refresher: Vectors

Points are also vectors; the sum of two vectors $\vec{v}_1+\vec{v}_2$ is another vector.

Equation of a line through the origin: $y=\frac{1}{3}x$, i.e. $3y=1x$, i.e. $(-1)x+3y=0$.

Renaming the axes $x\to x_1$, $y\to x_2$: $(-1)x_1+3x_2=0$, which can be re-written in vector notation as

$(-1,3)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$, or in general $(w_1,w_2)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$.

$\vec{w}=\begin{pmatrix}w_0\\ \vdots\\ w_d\end{pmatrix}=(w_0,\ldots,w_d)^T$  (T denotes transpose).

Page 62: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

62

Refresher: Vectors (2)

The same line $y=\frac{1}{3}x$, i.e. $(-1)x_1+3x_2=0$, i.e. $(w_1,w_2)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$ with $(w_1,w_2)=(-1,3)$: the weight vector $(-1,3)$ is the normal vector of the line (perpendicular to it).

Page 63: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

63

Refresher: Dot Product

$(w_1,w_2)\cdot\begin{pmatrix}x_1\\x_2\end{pmatrix}=w_1x_1+w_2x_2$

$\vec{w}\cdot\vec{x}=|\vec{w}||\vec{x}|\cos\gamma$

$\vec{w}\cdot\vec{x}>0$ for points on the side the normal $\vec{w}$ points to, $\vec{w}\cdot\vec{x}=0$ on the line, $\vec{w}\cdot\vec{x}<0$ on the other side.

float DotProduct(float[] v1, float[] v2)
{
    float sum = 0.0f;
    for (int i = 0; i < v1.Length; i++)
        sum += v1[i] * v2[i];
    return sum;
}

Page 64: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

64

Refresher: Pos/Neg Classes

The normal vector $\vec{w}$ points toward the positive class: $\vec{w}\cdot\vec{x}>0$ for the + points, $\vec{w}\cdot\vec{x}=0$ on the decision line, and $\vec{w}\cdot\vec{x}<0$ for the − points.

Page 65: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

65

sgn Function

In mathematics: $sgn(s)=\begin{cases}+1 & s>0\\ 0 & s=0\\ -1 & s<0\end{cases}$

We will use: $sgn(s)=\begin{cases}+1 & s\ge 0\\ -1 & s<0\end{cases}$

(We are purposefully avoiding the symbol $x$ for the argument here; we will use $\vec{x}$ for the feature vector. Informally the function is drawn as a step from $-1$ to $+1$.)

Page 66: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

66

Two Linear Models

Perceptron: $g(\vec{x})=sign(\vec{w}^T\vec{x})$        Linear regression: $g(\vec{x})=\vec{w}^T\vec{x}$

The features of an object have associated weights indicating their importance.

Signal: $s=\vec{w}^T\vec{x}=\sum_{i=0}^{d}w_ix_i$

When $\vec{w}$ is known the solution function is known; the form of $g$ determines the hypothesis space.

Page 67: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

67

Why “Regression”?Why the term for quantitative output prediction is “regression”?

“That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his anthropometric laboratory and recognized the same pattern with human heights. After measuring 205 pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were generally shorter than they were, while exceptionally short parents had children who were generally taller than their parents.

After reflecting upon this, we can understand why it must be the case. If very tall parents always produced even taller children, and if very short parents always produced even shorter ones, we would by now have turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting taller as a whole – due to better nutrition and public health – but the distribution of heights within the population is still contained.

Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now more generally known as regression to the mean.”

[A.Bellos pp.375]

Page 68: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

68

On-Line (Sequential) Learning
• On-line = process one example at a time.
• Attractive for large scale problems.

parameters := Initialize()
for iteration t (epoch/time):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t))
    Loss(ŷ(t), y(t)) := …                                        // compute loss
    parameters := Update(x⃗(t), y(t), ŷ(t), Loss, parameters)
return parameters

Objective: minimize the cumulative loss $\sum_{t=1}^{T}Loss(\hat{y}^{(t)},y^{(t)})$.

Page 69: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

69

On-Line (Sequential) Learning (2)
Sometimes written out more explicitly:

parameters := Initialize()
for each pass over the data:
    RandomizeData()
    for each data item i:
        x⃗(i) := ReceiveInstance()
        ŷ(i) := Predict(x⃗(i))
        y(i) := ReceiveTrueLabel()
        if ŷ(i) ≠ y(i):
            parameters := Update(x⃗(i), y(i), ŷ(i), Loss, parameters)
return parameters

Page 70: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

70

Perceptron

• One of the earliest ML algorithms (Rosenblatt 1958).
• On-line linear binary classification algorithm.
• Determines a hyperplane (a line in 2D, a plane in 3D, …) separating the points of the two classes.

(figures: linearly separable data — a line separates the + points from the − points; non-linearly separable data — no straight line can separate them)

Page 71: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

71

First: Perceptron Update Rule

$\vec{w}_{new}=\vec{w}_{old}+y^{(t)}\vec{x}^{(t)}$

Simplification: the separating lines pass through the origin, in order to simplify the update rule.

Example (initially misclassified): the boundary $(-3)x+1y=0$, i.e. $(-3,1)\begin{pmatrix}x\\y\end{pmatrix}=0$, misclassifies the positive example $\begin{pmatrix}2\\2\end{pmatrix}$. The update gives

$\begin{pmatrix}w_1\\w_2\end{pmatrix}=\begin{pmatrix}-3\\1\end{pmatrix}+(+1)\begin{pmatrix}2\\2\end{pmatrix}=\begin{pmatrix}-1\\3\end{pmatrix}$

i.e. the new boundary $(-1)x+3y=0$ (the line $y=\frac{1}{3}x$). The example is now correctly classified with the new separating boundary. It is not always the case that we can achieve this with one update.

Page 72: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

72

On-Line (Sequential) Learning (recap)

parameters := Initialize()
for t:
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t))
    Loss(ŷ(t), y(t)) := …
    parameters := Update(x⃗(t), y(t), ŷ(t), Loss, parameters)
return parameters

Page 73: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

73

Perceptron Learning Algorithm

parameters := Initialize():  w⃗ = (w_0, …, w_d)^T = (0, …, 0)^T ∈ ℝ^(d+1)
for iteration t (epoch/time):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t)) = sign(w⃗^T · x⃗(t)) ∈ {−1, +1}
    Compute the zero-one loss: Loss(ŷ(t), y(t)) := ½ |ŷ(t) − y(t)|
    if ŷ(t) ≠ y(t):
        w⃗ := w⃗ + y(t) · x⃗(t)
return parameters

Page 74: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

74

Perceptron Learning Algorithm (cont.)

The same algorithm; here T denotes transpose, t the iteration, and N the sample size (the algorithm makes multiple passes over the data):

w⃗ := (0, …, 0)^T ∈ ℝ^(d+1)
for iteration t:
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := sign(w⃗^T · x⃗(t)) ∈ {−1, +1}
    if ŷ(t) ≠ y(t):
        w⃗ := w⃗ + y(t) · x⃗(t)
return w⃗

Page 75: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

75

Perceptron Learning Algorithm (PLA)

Initialize weights: w⃗ := 0
while mis-classified examples exist:
    Select a mis-classified example (x⃗(t), y(t)):  y(t) ≠ sign(w⃗^T x⃗(t))
    Update weights: w⃗ := w⃗ + y(t) · x⃗(t)
return w⃗

A misclassified example means that with the current weights $y^{(t)}\ne sign(\vec{w}^T\vec{x}^{(t)})$, or more generally $y^{(t)}\,\vec{w}^T\vec{x}^{(t)}\le 0$.

1. A challenge: the algorithm will not terminate for non-linearly separable data (outliers, noise).
2. Unstable: it can jump from a good perceptron to a really bad one within one update.
3. It attempts to minimize $\min_{\vec{w}}\frac{1}{N}\sum_{t=1}^{N}[\![\,y^{(t)}\ne sign(\vec{w}^T\vec{x}^{(t)})\,]\!]$, which is NP-hard.
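
A minimal C# sketch (not the slides' code) of the perceptron training loop above; it assumes x[i][0] == 1 is the default feature and y[i] ∈ {−1, +1}:

// Multiple passes over the data; update the weights on every misclassified example.
static float[] TrainPerceptron(float[][] x, int[] y, int epochs)
{
    int d = x[0].Length;
    var w = new float[d];                                        // weights start at 0
    for (int epoch = 0; epoch < epochs; epoch++)
    {
        for (int i = 0; i < x.Length; i++)
        {
            float s = 0f;
            for (int j = 0; j < d; j++) s += w[j] * x[i][j];     // signal w·x
            int predicted = s >= 0 ? +1 : -1;                    // sgn as defined earlier
            if (predicted != y[i])
                for (int j = 0; j < d; j++) w[j] += y[i] * x[i][j];  // w := w + y·x
        }
    }
    return w;
}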

Page 76: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

76

Perceptron

If a point is classified incorrectly: $sgn(y^{(t)})\ne sgn(\vec{w}_{old}^T\vec{x}^{(t)})$, i.e. $y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}<0$.

Weight update: $\vec{w}_{new}=\vec{w}_{old}+y^{(t)}\vec{x}^{(t)}$

$y^{(t)}\vec{w}_{new}^T\vec{x}^{(t)}=y^{(t)}(\vec{w}_{old}+y^{(t)}\vec{x}^{(t)})^T\vec{x}^{(t)}=y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}+(y^{(t)})^2\|\vec{x}^{(t)}\|^2=y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}+\|\vec{x}^{(t)}\|^2>y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}$

Thus, the perceptron weight update pushes in the "right direction".

Page 77: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

77

Looks Simple — Does It Work?

Fact (a margin-based upper bound on updates): the number of updates made by the Perceptron Algorithm is at most $\dfrac{r^2}{\rho^2}$, where for $\vec{x}^{(1)},\ldots,\vec{x}^{(N)}\in\mathbb{R}^{d+1}$ there exist $r$, $\rho$ and a separating $\vec{v}$ such that:

$r\ge\|\vec{x}^{(i)}\|$ (for all $i$)

$\rho\le\dfrac{y^{(i)}(\vec{v}\cdot\vec{x}^{(i)})}{\|\vec{v}\|}$ (for all $i$)

The quantity $\rho$ is known as the "normalized margin".

Remarkable: the bound does not depend on the dimension of the feature space!

Page 78: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

78

Compact Model Representation

Use float instead of double. Store only non-zero weights (and indices). Store non-zero weights and the difference of indices (remember the last index where the weight was non-zero):

void Save( StreamWriter w, int labelIdx, float[] weights )
{
    w.Write( labelIdx );
    int previousIndex = 0;
    for (int i = 0; i < weights.Length; i++)
    {
        if (weights[ i ] != 0.0f)
        {
            w.Write( " " + (i - previousIndex) + " " + weights[ i ] );  // difference of indices
            previousIndex = i;
        }
    }
}
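
A matching Load could look like the sketch below (an assumption, not from the slides — it presumes each Save call wrote one line and that 'using System; using System.IO;' are present):

// Reads the label index followed by (index-delta, weight) pairs written by Save above.
static float[] Load(StreamReader r, int numFeatures, out int labelIdx)
{
    var weights = new float[numFeatures];
    string[] parts = r.ReadLine().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    labelIdx = int.Parse(parts[0]);
    int index = 0;
    for (int i = 1; i + 1 < parts.Length; i += 2)
    {
        index += int.Parse(parts[i]);                // undo the index differencing
        weights[index] = float.Parse(parts[i + 1]);
    }
    return weights;
}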

Page 79: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

79

Linear Classification Solutions

A fixed choice of $\vec{w}$ defines the hyperplane and, thus, the solution to our (linear) task. For linearly separable data there are different solutions — infinitely many.

Page 80: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

80

The Pocket Algorithm
A better perceptron algorithm: keep track of the error and only keep new weights when they lower the error.

Initialize weights w⃗(0); bestErr := Double.MAX
for i = 0, 1, …:
    Run PLA for one iteration and obtain a new w⃗(i+1).
    Compute the error (an expensive step — access to the entire data is needed!):
        Err(w⃗(i+1)) = (1/N) Σ_{n=1}^{N} ⟦ sgn(w⃗(i+1) x⃗(n)) ≠ y(n) ⟧
    if Err(w⃗(i+1)) < bestErr:
        keep w⃗(i+1) as the best weights; bestErr := Err(w⃗(i+1))   // only update the best weights if we lower the error
return the best weights

Page 81: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

81

Voted Perceptron
• Training as in the usual perceptron algorithm (with some extra book-keeping).
• Decision rule:

$\hat{y}=sgn\Big(\big(\sum_t c_t\,\vec{w}^{(t)}\big)\cdot\vec{x}\Big)$

The coefficient $c_t$ is proportional to the number of iterations that $\vec{w}^{(t)}$ survives (the number of iterations between $\vec{w}^{(t)}$ and $\vec{w}^{(t+1)}$).

Page 82: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

82

Dual Perceptron: Intuitions

(figure: + and − training points with the separating line and its normal vector)

The separating direction can be expressed in terms of the training points themselves: positive examples ($y^+=+1$) contribute along the normal vector, negative examples ($y^-=-1$) in the opposite direction — the idea behind the dual formulation on the next slide.

Page 83: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

83

Dual Perceptron

parameters := Initialize():  α⃗ = (α_1, …, α_N)^T = (0, …, 0)^T ∈ ℝ^N
for iteration t over the sample (the algorithm makes multiple passes over the data):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t)) = sign( Σ_{j=1}^{N} α_j y(j) (x⃗(j)^T · x⃗(t)) ) ∈ {−1, +1}
    if ŷ(t) ≠ y(t):
        α_t := α_t + 1
return parameters

Decision rule: $\hat{y}:=sign\Big(\sum_{j=1}^{N}\alpha_j\,y^{(j)}\big(\vec{x}^{(j)T}\cdot\vec{x}\big)\Big)$

$\alpha_j$ gives a notion of how difficult instance $j$ is. The kernel perceptron uses a kernel function in place of the dot product $\vec{x}^{(j)T}\cdot\vec{x}$.

Page 84: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

84

Exclusive OR (XOR) Function

Truth table: XOR(0,0)=0, XOR(0,1)=1, XOR(1,0)=1, XOR(1,1)=0.

Inputs in the $(x_1,x_2)$ plane with color-coding of the output: (0,1) and (1,0) form one class, (0,0) and (1,1) the other.

Challenge: the data is not linearly separable (no straight line can be drawn that separates the green from the blue points).

Page 85: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

85

Solution for the Exclusive OR (XOR)

We introduce another input dimension $x_3$ (a new feature computed from $x_1$ and $x_2$).

In the space $(x_1, x_2, x_3)$ the data is now linearly separable.
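
One common choice for the extra dimension — an assumption here, since the slide does not spell out the transform — is the product $x_3 = x_1 x_2$, with which a single linear threshold computes XOR:

// XOR via a linear decision in the lifted space (x1, x2, x3 = x1*x2):
// sign(x1 + x2 - 2*x1*x2 - 0.5) is +1 exactly for the XOR-true inputs (0,1) and (1,0).
static int XorLinear(int x1, int x2)
{
    double x3 = x1 * x2;
    double s = 1.0 * x1 + 1.0 * x2 - 2.0 * x3 - 0.5;   // linear in (x1, x2, x3)
    return s >= 0 ? 1 : 0;
}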

Page 86: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

86

Winnow Algorithm

w⃗ := (1/(d+1), …, 1/(d+1))^T
for iteration t (epoch):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t)) = sgn(w⃗^T · x⃗(t))
    if ŷ(t) ≠ y(t):
        Z := Σ_{i=0}^{d} w_i e^(y(t) x_i(t))        // normalizing constant
        for i = 0 … d:
            w_i := w_i e^(y(t) x_i(t)) / Z          // multiplicative update
return parameters
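
A minimal sketch (not the slides' code) of the normalized multiplicative update above, assuming y ∈ {−1, +1} and 'using System;':

// One Winnow-style update of the weight vector w on a misclassified example (x, y).
static void WinnowUpdate(float[] w, float[] x, int y)
{
    double z = 0.0;
    for (int i = 0; i < w.Length; i++) z += w[i] * Math.Exp(y * x[i]);   // normalizing constant Z
    for (int i = 0; i < w.Length; i++)
        w[i] = (float)(w[i] * Math.Exp(y * x[i]) / z);                   // multiplicative update
}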

Page 87: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

87

Training, Test Error and Complexity

(figure: training error and test error as functions of model complexity)

Page 88: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

88

Logistic Regression

$g(\vec{x})=\theta(\vec{w}^T\vec{x})$

Logistic function: $\theta(s)=\dfrac{e^s}{1+e^s}$, with $1-\theta(s)=\theta(-s)$.

$y\in\{-1,+1\}$

Target: $f(\vec{x})=P(y=+1\mid\vec{x})$

The data does not give the probability explicitly.

Page 89: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

89

Logistic Regression

How likely is it that we get output $y$ when we have input $\vec{x}$:

$P(y\mid\vec{x})=\begin{cases}f(\vec{x}) & \text{when } y=+1\\ 1-f(\vec{x}) & \text{when } y=-1\end{cases}$

$P(y\mid\vec{x})=\begin{cases}g(\vec{x})=\theta(\vec{w}^T\vec{x})=\theta(y\vec{w}^T\vec{x}) & \text{when } y=+1\\ 1-g(\vec{x})=1-\theta(\vec{w}^T\vec{x})=\theta(-\vec{w}^T\vec{x})=\theta(y\vec{w}^T\vec{x}) & \text{when } y=-1\end{cases}$

Data likelihood (which $\vec{w}$ maximizes this?): $L(\vec{w})=\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})$

Negative log-likelihood (which $\vec{w}$ minimizes this?):

$-l(\vec{w})=-\frac{1}{N}\ln\Big(\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})\Big)=\frac{1}{N}\sum_{i=1}^{N}\ln\frac{1}{P(y^{(i)}\mid\vec{x}^{(i)})}=\frac{1}{N}\sum_{i=1}^{N}\ln\frac{1}{\theta(y^{(i)}\vec{w}^T\vec{x}^{(i)})}=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$

Error: $E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$

Page 90: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

90

Refresher

Derivative: $f'(x)=\lim_{\Delta x\to 0}\frac{f(x+\Delta x)-f(x)}{\Delta x}$; e.g. $(3x^2)'=3\cdot 2\cdot x^{2-1}$, and
$(x^2)'=\lim_{\Delta x\to 0}\frac{(x+\Delta x)^2-x^2}{\Delta x}=\lim_{\Delta x\to 0}\frac{x^2+2x\Delta x+(\Delta x)^2-x^2}{\Delta x}=\lim_{\Delta x\to 0}(2x+\Delta x)=2x$

Useful facts: $(\ln x)'=\frac{1}{x}$, $(e^x)'=e^x$; chain rule: $(f(g))'=f'(g)\cdot g'$

Partial derivative $\frac{\partial}{\partial x}F(x,y)$: e.g. $\frac{\partial}{\partial x}(x^2+2xy+y^2)=2x+2y$

Partial derivative at a point: $\frac{\partial}{\partial w_0}(w_0^2+2w_0w_1+w_1^2)\big|_{w_0=2,w_1=3}=(2w_0+2w_1)\big|_{w_0=2,w_1=3}=2\cdot 2+2\cdot 3$

Gradient (derivatives with respect to each component): $\Big[\frac{\partial(\cdot)}{\partial w_0},\frac{\partial(\cdot)}{\partial w_1},\ldots,\frac{\partial(\cdot)}{\partial w_d}\Big]$

Gradient of the error: $\nabla E(\vec{w})=\Big[\frac{\partial E}{\partial w_0},\frac{\partial E}{\partial w_1},\ldots,\frac{\partial E}{\partial w_d}\Big]$ — this is a vector and we can compute it at a point.

Page 91: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

91

Hypothesis Space
The best $\vec{w}$ to use is the one which minimizes $E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$.

Different $\vec{w}$ give rise to different values of $E(\vec{w})$; plotted over the weight space, $E(\vec{w})$ is the error surface and $-\nabla E(\vec{w})$ points downhill on it. [graph from T. Mitchell]

Page 92: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

92

Math Fact
The gradient of the error, $\nabla E(\vec{w})=\big[\frac{\partial E}{\partial w_0},\frac{\partial E}{\partial w_1},\ldots,\frac{\partial E}{\partial w_d}\big]$ (a vector in weight space), specifies the direction of the argument that leads to the steepest increase in the value of the error. The negative of the gradient gives the direction of steepest decrease.

(figure: from the best weights found up to iteration $t$, $\vec{w}^{(t)}=(w_0,w_1)$, a step in the direction $-\nabla E(\vec{w}^{(t)})$ gives the new best weights $\vec{w}^{(t+1)}$ at iteration $t+1$ — see the next slides)

Page 93: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

93

Computing the Gradient

$E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$

$\nabla E(\vec{w})=\nabla\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)=\frac{1}{N}\sum_{i=1}^{N}\nabla\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$   (the gradient is a linear operator)

$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}}\cdot e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\cdot(-y^{(i)}\vec{x}^{(i)})$   (using $(\ln u)'=\frac{1}{u}\cdot u'$ and $(1+e^z)'=e^z\cdot z'$)

$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{1+\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}}\cdot\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}\cdot(-y^{(i)}\vec{x}^{(i)})$   (using $e^{-z}=\frac{1}{e^z}$)

$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}+1}\cdot(-y^{(i)}\vec{x}^{(i)})=-\frac{1}{N}\sum_{i=1}^{N}\frac{y^{(i)}\vec{x}^{(i)}}{1+e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}$

Page 94: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

94

(Batch) Gradient Descent

A general technique for minimizing a differentiable function like $E(\vec{w})$.

Initialize weights: w⃗ := 0
repeat:
    Compute the gradient: grad⃗ := −(1/N) Σ_{i=1}^{N} y(i) x⃗(i) / (1 + e^(y(i) w⃗^T x⃗(i)))
    Update weights: w⃗ := w⃗ − η · grad⃗
until Stop   (max #iterations reached; marginal error improvement; or the error becomes small)
return w⃗

η is the learning rate.

If a random training example is selected and the gradient is computed on it alone, the algorithm is called SGD (Stochastic Gradient Descent).
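
A minimal C# sketch (not the slides' code) of this batch gradient descent for logistic regression; x[i][0] == 1 is assumed to be the default feature and y[i] ∈ {−1, +1}:

// Batch gradient descent on E(w) = (1/N) Σ ln(1 + e^{-y w·x}); eta is the learning rate.
static double[] TrainLogisticRegression(double[][] x, int[] y, double eta, int maxIters)
{
    int d = x[0].Length;
    var w = new double[d];
    for (int iter = 0; iter < maxIters; iter++)
    {
        var grad = new double[d];
        for (int i = 0; i < x.Length; i++)
        {
            double s = 0.0;
            for (int j = 0; j < d; j++) s += w[j] * x[i][j];            // w·x
            double factor = -y[i] / (1.0 + Math.Exp(y[i] * s));         // per-example gradient factor
            for (int j = 0; j < d; j++) grad[j] += factor * x[i][j] / x.Length;
        }
        for (int j = 0; j < d; j++) w[j] -= eta * grad[j];              // w := w − η · grad
    }
    return w;
}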

Page 95: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

95

Punch Line

With the best weights $\vec{w}$ computed using gradient descent, given an unknown input object encoded as a vector of features $\vec{x}$, the output probability that the object is in the class is:

$P(y=+1\mid\vec{x};\vec{w})=\dfrac{e^{\vec{w}^T\vec{x}}}{1+e^{\vec{w}^T\vec{x}}}$

Classification rule: the new object is in the class if $P(y=+1\mid\vec{x};\vec{w})>\tau$.

Predict $y=+1$ if $P(y=+1\mid\vec{x};\vec{w})>0.5$, or equivalently if $\vec{w}^T\vec{x}>0$. The larger $\vec{w}^T\vec{x}$, the larger $P(y=+1\mid\vec{x};\vec{w})$ will be, and so will our degree of confidence that $y=+1$. The prediction that $y=+1$ is very confident if $\vec{w}^T\vec{x}\gg 0$; similarly, logistic regression makes a very confident decision that $y=-1$ if $\vec{w}^T\vec{x}\ll 0$.

Page 96: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

96

Newton's Method
• An alternate way to minimize a function (like $E(\vec{w})$).
• We need to take the derivative of the error (negative log-likelihood) and find for which values of the parameters the derivative is zero.
• Let $f$ be a function and suppose we want to find $u^*$ such that $f(u^*)=0$:

$u_{i+1}:=u_i-\dfrac{f(u_i)}{f'(u_i)}$

(figure: the tangent to $f$ at $u_i$ crosses the axis at $u_{i+1}$, since $\dfrac{f(u_i)}{u_i-u_{i+1}}=\tan\gamma=f'(u_i)$)
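
A minimal sketch (not from the slides) of the one-dimensional Newton iteration above; the tolerance and iteration cap are illustrative choices:

// u_{i+1} = u_i - f(u_i) / f'(u_i), iterated until f(u) is close to zero.
static double Newton(Func<double, double> f, Func<double, double> fPrime,
                     double u, double tol = 1e-8, int maxIters = 100)
{
    for (int i = 0; i < maxIters && Math.Abs(f(u)) > tol; i++)
        u = u - f(u) / fPrime(u);
    return u;
}
// Example: Newton(u => u * u - 2, u => 2 * u, 1.5) converges to sqrt(2).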

Page 97: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

97

Newton–Raphson
• Generalization of Newton's method to the multidimensional case.
• The parameters are a vector $\vec{\theta}$ (we used the notation $\vec{w}$).

One dimension: $\theta:=\theta-\dfrac{l'(\theta)}{l''(\theta)}$

Multidimensional: $\vec{\theta}:=\vec{\theta}-H^{-1}\cdot\nabla l(\vec{\theta})$, where $H$ is the Hessian matrix: $H_{ij}=\dfrac{\partial^2 l(\vec{\theta})}{\partial\theta_i\partial\theta_j}$

Page 98: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

98

Robust Risk Minimization

Notation: $\vec{x}$ — input vector; $y\in\{-1,+1\}$ — label; $(\vec{x}^{(i)},y^{(i)})$ — training examples; $\vec{w}$ — weight vector; $b$ — bias; $p(\vec{x})$ — continuous linear model.

Prediction rule: $\hat{y}(\vec{x})=\begin{cases}+1 & p(\vec{x})\ge 0\\ -1 & p(\vec{x})<0\end{cases}$

Classification error: $l(p(\vec{x}),y)=\begin{cases}1 & p(\vec{x})\cdot y\le 0\\ 0 & p(\vec{x})\cdot y>0\end{cases}$

Page 99: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

99

Robust Classification Loss

Parameter estimation: $(\hat{\vec{w}},\hat{b})=\arg\min_{\vec{w},b}\dfrac{1}{N}\sum_{i=1}^{N}loss\big(\vec{w}^T\cdot\vec{x}^{(i)}+b,\,y^{(i)}\big)$

Hinge loss: $g(p(\vec{x}),y)=\begin{cases}1-p(\vec{x})y & p(\vec{x})y\le 1\\ 0 & p(\vec{x})y>1\end{cases}$

Robust classification loss: $h(p(\vec{x}),y)=\begin{cases}-2p(\vec{x})y & p(\vec{x})y\le -1\\ \frac{1}{2}(p(\vec{x})y-1)^2 & p(\vec{x})y\in[-1,1]\\ 0 & p(\vec{x})y>1\end{cases}$

Page 100: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

100

Loss Functions: Comparison

Page 101: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

101

Confidence and Regularization

Confidence $P(y=1\mid\vec{x})$: $\hat{p}(\vec{x})=\max\Big(0,\min\big(1,\dfrac{\hat{\vec{w}}\cdot\vec{x}+\hat{b}+1}{2}\big)\Big)$

Regularization: $\|\vec{w}\|^2+b^2\le A$, with $(\hat{\vec{w}},\hat{b})=\arg\min_{\vec{w},b}\dfrac{1}{N}\sum_{i=1}^{N}h\big(\vec{w}^T\cdot\vec{x}^{(i)}+b,\,y^{(i)}\big)$ subject to that constraint.

Unconstrained optimization (Lagrange multiplier; a smaller λ corresponds to a larger A):

$(\hat{\vec{w}},\hat{b})=\arg\min_{\vec{w},b}\Big[\dfrac{1}{N}\sum_{i=1}^{N}h\big(\vec{w}^T\cdot\vec{x}^{(i)}+b,\,y^{(i)}\big)+\dfrac{\lambda}{2}\big(\|\vec{w}\|^2+b^2\big)\Big]$

Page 102: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

102

Robust Risk Minimization

Input: training data $\vec{x}^{(i)}\in\mathbb{R}^{d+1}$, $y^{(i)}\in\{-1,+1\}$; the number of passes over the data K (K = 40 is a good default); parameters c and η.

Initialization: w⃗ := 0; b := 0; α_i := 0.

for each of the K passes over the data:
    for i = 1 … N:                                  // go over the training data
        p := y(i) (w⃗^T · x⃗(i))
        d_i := max( min( 2c − α_i, η (c − α_i) / (c − p) ), −α_i )
        w⃗ := w⃗ + d_i y(i) x⃗(i)
        b := b + d_i y(i)
        α_i := α_i + d_i
return (w⃗, b)

Page 103: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

103

Learning Curve
• Plots an evaluation metric against the fraction of training data used (on the same test set!).
• The highest performance is bounded by human inter-annotator agreement (ITA).
• A leveling-off effect can guide us on how much data is needed.

(figure: evaluation metric vs. percentage of data used for each experiment; e.g., the experiment with 50% of the training data yields an evaluation number of 70)

Page 104: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

104

Summary

• Examples of ML
• Categorization
• Object encoding
• Linear models:
  – Perceptron
  – Winnow
  – Logistic Regression
  – RRM

• Engineering aspects of ML systems

Page 105: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

105
PART II: POPULARITY

Page 106: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

106

Goal

• Quantify how popular an entity is.

Motivation:• Used in the new local search relevance metric.

Page 107: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

107

What is popularity?

• Use clicks on an entity as a proxy for popularity.
• Popularity score in [0..1].
• Goal: preserve the relative ranking between clicks and the predicted popularity score.

Page 108: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

108

POPULARITY IN LOCAL SEARCH

Page 109: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

109

Popularity

• Output a popularity score (regression)
• Ensemble methods
• Tree-based procedure (non-linear)
• Boosting

Page 110: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

110

When is a Local Entity Popular?

• Definition: Visited by many people in the context of alternative choices.

• Is the popularity of restaurants the same as the popularity of movies, etc.?

• How to operationalize “visit”, “many”, “alternative choices”?– Initially we are using: popular means clicked more.

• Going forward we will use:– “visit” = click given an impression.– “choice” = density of entities in the same primary category.– “many” = fraction of clicks from impressions.

Page 111: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

111

Local Entity Popularity

Popularity = boosted Click-Through Rate (CTR) for entity $e$:

$Popularity(e)=CTR_e+(1-CTR_e)\cdot Density_e$

where:

$CTR(e)=\dfrac{Clicks(e)}{Impressions(e)}$   (a value in [0, 1])

$Density_e=\dfrac{2}{\pi}\cdot\tan^{-1}\big(numBusinessesNear(e)\big)$

$numBusinessesNear(e)$ = the number of entities in the same primary category as $e$ within a radius.

The boost $(1-CTR)\cdot Density$ fills part of the remaining $1-CTR$ headroom, scaled by the density. The model then will be regression.

112

Not all Clicks are Born the Same

• Click in the context of a named query:– Can even be argued we are not satisfying the user

information needs (and they have to click further to find out what they are looking for).

• Click in the context of a category query:– Much more significant (especially when alternative results

are present).

Page 113: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

113

Local Entity Popularity

• Popularity & 1st page, current ranker.
• Entities without a URL.
• Newly created entities.
• Clicks vs. mouseovers.
• Scenario: 50 French restaurants, the best entity has 2k clicks; 2 Italian restaurants, the best entity has 2k clicks. The French entity is more popular because of the higher available choice.

Page 114: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

114

Entity Representation

A machine learning (training) instance: a target value plus feature values — clicks for week −1, …, clicks for week −9, # ratings, aggregate rating, # reviews, has FB page, … (example values: 8000, 9000, …, 4000, 65, 4.7, 73, …, 1, …).

Page 115: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

115

POISSON REGRESSION
Why? – We will practice the ML machinery on a different problem, re-iterating the concepts. Poisson regression is an example of log-linear models, good for modeling counts (e.g., the number of visitors to a store in a certain time).

Page 116: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

116

Setup
Training data: $\{(\vec{x}^{(i)},y^{(i)})\}_{i=1}^{N}$, where the outcomes $y^{(i)}$ are counts (rather than arbitrary real values as in ordinary regression problems). The $\vec{x}$'s are the explanatory variables and $y$ is the response/outcome variable; for our scenario these counts are the clicks on the web page.

Goal: come up with a system which, given a new observation $\vec{x}$, can correctly predict the corresponding outcome $y$.

A good way to model counts of observations is the Poisson distribution.

Page 117: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

117

Poisson Distribution: PreliminariesThe Poisson distribution realistically describes the pattern of requests over time in many client-server situations.

Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for storage/retrieval services from a database server, and interrupts to a central processor. It also has higher-dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the volume distribution of contaminants in well water. In such cases, the “events”, which are request arrivals or defects occurrences, are independent. Customers do not conspire to achieve some special pattern in their access to a bank teller; rather they operate as independent agents. The manufacture of hard disks or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a small area on the disk surface where the magnetic material is not spread uniformly or a shorted transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one point does not influence, for better or worse, the chance of a defect at another point. Moreover, if the time interval or spatial area is small, the probability of an event is correspondingly small. This is a characterizing feature of a Poisson distribution: event probability decreases with the window of opportunity and is linear in the limit. A second characterizing feature, negligible probability of two or more events in a small interval, is also present in the mentioned examples.

Page 118: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

118

Poisson Distribution: Formally
The Poisson distribution can be used to model situations in which the expected number of events scales with the length of the interval within which the events can occur. If $\lambda$ is the expected number of events per unit interval, then the distribution of the number of events $X$ within an interval of length $t$ is:

$p(X=k\mid\lambda)=\dfrac{1}{k!}e^{-\lambda t}(\lambda t)^k$

For a unit-length interval ($t=1$): mean $=\lambda$, variance $=\lambda$.

Page 119: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

119

Poisson Distribution: Mental Steps
First, we keep $\vec{x}$'s for the input, so we write the outcome as $Y$:

$p(Y=y\mid\lambda)=\dfrac{1}{y!}e^{-\lambda t}(\lambda t)^y$

The output is determined by a single scalar parameter $\lambda$. We make $\lambda$ depend on the input in the following way (this comes from the theory of Generalized Linear Models, GLM):

$\mu=E[Y]=\lambda=e^{\vec{x}^T\cdot\vec{\beta}}$, i.e. $\ln(\lambda)=\vec{x}^T\cdot\vec{\beta}$

The log of $\lambda$ is a linear combination of the input features — hence the name log-linear model. In contrast, a plain linear model could potentially make $\lambda$ negative, but $\lambda$ models a count!

We used to write $\vec{w}$ (when discussing logistic regression). Now we call the parameters $\vec{\beta}$, and because in the training phase they are unknown we write them as the second argument in the dot product to emphasize that they are the argument.

Page 120: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

120

Poisson Distribution

Data likelihood: $L(\boldsymbol{\beta}) = \prod_{i=1}^{N} P(y^{(i)} \mid \mathbf{x}^{(i)})$

Log-likelihood (which $\boldsymbol{\beta}$ maximizes this?):

$l(\boldsymbol{\beta}) = \ln\Big(\prod_{i=1}^{N} P(y^{(i)} \mid \mathbf{x}^{(i)})\Big) = \sum_{i=1}^{N} \ln P(y^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{N} \ln\Big(\frac{e^{-e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}} \big(e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}\big)^{y^{(i)}}}{y^{(i)}!}\Big)$

$= \sum_{i=1}^{N} \Big[ \ln\big(e^{-e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}}\big) + \ln\big(e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}\big)^{y^{(i)}} - \ln\big(y^{(i)}!\big) \Big] = \sum_{i=1}^{N} \Big[ -e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} + y^{(i)}\, \mathbf{x}^{(i)} \boldsymbol{\beta} - \ln\big(y^{(i)}!\big) \Big]$

using $p(Y = y \mid \lambda) = \frac{1}{y!}\, e^{-\lambda} \lambda^{y}$ (unit interval) and $\lambda = e^{\mathbf{x}^{T} \cdot \boldsymbol{\beta}}$.

Page 121: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

121

Maximizing the Log-Likelihood

$l(\boldsymbol{\beta}) = \sum_{i=1}^{N} \Big[ -e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} + y^{(i)}\, \mathbf{x}^{(i)} \boldsymbol{\beta} - \ln\big(y^{(i)}!\big) \Big]$    Which $\boldsymbol{\beta}$ maximizes this? Set the gradient to zero: $\nabla l(\boldsymbol{\beta}) = 0$.

$\nabla l(\boldsymbol{\beta}) = \sum_{i=1}^{N} \Big[ -\mathbf{x}^{(i)} e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} + y^{(i)}\, \mathbf{x}^{(i)} \Big] = \sum_{i=1}^{N} \big( y^{(i)} - e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} \big)\, \mathbf{x}^{(i)} = 0$

This is non-linear in $\boldsymbol{\beta}$ and does not have an analytical solution; it is solved numerically (e.g., gradient ascent or Newton's method).
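Since there is no closed-form solution, one simple option is gradient ascent on the log-likelihood using the gradient derived above. The following is a minimal sketch assuming NumPy is available; the synthetic data, learning rate, and iteration count are illustrative assumptions, not values from the slides.

# Sketch: Poisson regression fit by gradient ascent.
# Gradient of the log-likelihood: sum_i (y_i - exp(x_i . beta)) * x_i
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # bias column + d features
beta_true = np.array([0.5, 0.8, -0.4, 0.3])                 # made-up "true" parameters
y = rng.poisson(np.exp(X @ beta_true))                      # synthetic count targets

beta = np.zeros(d + 1)
learning_rate = 1e-4        # illustrative; in practice tune it or use Newton's method
for _ in range(10000):
    grad = (y - np.exp(X @ beta)) @ X                       # the gradient from this slide
    beta += learning_rate * grad

print(beta)                 # should end up close to beta_true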

Page 122: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

122

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 123: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

123

DECISION TREESWhy?– DTs are an influential development in ML. Combined in ensembles they provide very competitive performance. We will see ensemble techniques in the next part.

Page 124: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

124

Decision Trees

Binary partitioning of the data during training (navigating to a leaf node during testing): the root tests $x_i < s_1$ vs. $x_i \ge s_1$, an internal node tests $x_j < s_2$ vs. $x_j \ge s_2$, and each leaf holds a prediction.

Selecting a dimension and a split value $s$: the training instances in a node are more homogeneous in terms of the output variable (more pure) compared to ancestor nodes.

Stopping when the instances are homogeneous or when only a small number of instances remain.

(Training instances pictured as points whose color reflects the output variable; classification example.)

Page 125: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

125

Decision Tree: Example

(classification example with categorical features)

Attribute/feature/predicate nodes: Parents Visiting, Weather, Money; branches carry the values of the attribute; leaves are the predicted classes.

Parents Visiting = Yes -> Cinema
Parents Visiting = No  -> Weather
    Weather = Sunny -> Play tennis
    Weather = Windy -> Money
        Money = Rich -> Shopping
        Money = Poor -> Cinema
    Weather = Rainy -> Stay in

Branching factor depends on the number of possible values for the attribute (as seen in the training set).

Page 126: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

126

Entropy (needed for describing how an attribute is selected.)

$\mathrm{Entropy}(Instances) = -\sum_{c \in Classes} p_c \cdot \log_2 p_c$

Entropy values for two classes, varying the probability $p$ of one class (the probability of the other class is $1 - p$):

$\mathrm{Entropy} = -p_1 \cdot \log_2 p_1 - p_2 \cdot \log_2 p_2 = -p \cdot \log_2 p - (1 - p) \cdot \log_2(1 - p)$

Example: plot of the two-class entropy as a function of $p \in [0, 1]$; it is 0 at $p = 0$ and $p = 1$ and reaches its maximum of 1 at $p = 0.5$.

Page 127: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

127

Selecting an Attribute: Information Gain

Measure of the expected reduction in entropy:

$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

where $S$ is the set of instances, $a$ is an attribute, and $S_v$ is the set of instances with value $v$ for attribute $a$.

Choose the attribute with the highest information gain (equivalently, the one that minimizes the weighted entropy term being subtracted).

See Mitchell '97, p. 59 for an example.
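A minimal sketch of these two formulas in plain Python (standard library only); the tiny weekend-activity dataset is a made-up illustration in the spirit of the earlier example, not data from the slides.

# Sketch: entropy and information gain for categorical attributes.
from math import log2
from collections import Counter

def entropy(labels):
    # Entropy(Instances) = -sum_c p_c * log2(p_c)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(instances, attribute, target):
    # Gain(S, a) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    labels = [inst[target] for inst in instances]
    gain = entropy(labels)
    for v in set(inst[attribute] for inst in instances):
        subset = [inst[target] for inst in instances if inst[attribute] == v]
        gain -= len(subset) / len(instances) * entropy(subset)
    return gain

# Tiny made-up dataset in the spirit of the weekend-activity example.
data = [
    {"parents": "yes", "weather": "sunny", "activity": "cinema"},
    {"parents": "no",  "weather": "sunny", "activity": "tennis"},
    {"parents": "yes", "weather": "rainy", "activity": "cinema"},
    {"parents": "no",  "weather": "rainy", "activity": "stay-in"},
    {"parents": "no",  "weather": "windy", "activity": "cinema"},
]
for attr in ("parents", "weather"):
    print(attr, information_gain(data, attr, "activity"))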

Page 128: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

128

Splitting ‘Hairs’

If there are only a small number of instances, do not split the node further (statistics are unreliable).

If there are no instances in the current node (no training examples reach one of the branches $attr = val_1$, $attr = val_2$, $attr = val_3$), inherit statistics (the majority class) from the parent node.

If there is more training data, the tree can be “grown” bigger.

Page 129: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

129

ID3 Algorithm
Training data: $\mathbf{x}^{(i)} \in \mathbb{R}^{d+1}$, $y^{(i)} \in \{-1, +1\}$ (here with categorical attribute values).

ID3(Examples, Attributes): {
    node := new node
    if all Examples have the same class c then label(node) := c; return node
    if Attributes = {} then label(node) := most common class among Examples; return node
    if Examples = {} then label(node) := most common class in the parent; return node
    a := best attribute (highest information gain)
    foreach v : possible value of attribute a :
        Examples_v := the Examples that have value v for attribute a
        if Examples_v = {} then attach a leaf labeled with the most common class among Examples
        else attach ID3(Examples_v, Attributes without a)
    return node }

Page 130: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

130

Alternative Attribute Selection: Gain Ratio

$\mathrm{GainRatio}(S, a) = \frac{\mathrm{Gain}(S, a)}{\mathrm{SplitInformation}(S, a)}$

$\mathrm{SplitInformation}(S, a) = -\sum_{v \in Values(a)} \frac{|S_v|}{|S|} \log_2\Big(\frac{|S_v|}{|S|}\Big)$

where $S$ is the set of instances, $a$ is an attribute, and $S_v$ is the set of instances with value $v$ for attribute $a$. [Quinlan 1986]

Examples:

Attribute with all different values ($n$ values, one instance each): $\mathrm{SplitInformation}(S, a) = -\sum_{v \in \{1..n\}} \frac{1}{n} \log_2\big(\frac{1}{n}\big) = -n \cdot \frac{1}{n} \log_2(n^{-1}) = \log_2 n$

Binary attribute splitting the data in half: $\mathrm{SplitInformation}(S, a) = -\sum_{v \in \{0, 1\}} \frac{n/2}{n} \log_2\big(\frac{n/2}{n}\big) = -2 \cdot \frac{1}{2} \log_2(2^{-1}) = 1$

Page 131: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

131

Alternative Attribute Selection: GINI Index

$\mathrm{Gini}(S, y) = 1 - \sum_{v \in Values(y)} \Big(\frac{|S_v|}{|S|}\Big)^{2}$    (the target $y$ is treated just like another attribute)

$\mathrm{GiniGain}(S, a) = \mathrm{Gini}(S, y) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|}\, \mathrm{Gini}(S_v, y)$

$\hat{a} = \operatorname*{argmax}_{a \in Attributes} \mathrm{GiniGain}(S, a)$    The selected attribute is the one that maximizes the GiniGain.

[Corrado Gini: Italian statistician]
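The same attribute-selection idea with the Gini criterion, as a minimal plain-Python sketch; the helper names are mine, not from the slides.

# Sketch: Gini index and GiniGain for categorical attributes.
from collections import Counter

def gini(labels):
    # Gini(S, y) = 1 - sum_v (|S_v| / |S|)^2 over the values of the target y
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(instances, attribute, target):
    # GiniGain(S, a) = Gini(S, y) - sum_v |S_v|/|S| * Gini(S_v, y)
    labels = [inst[target] for inst in instances]
    gain = gini(labels)
    for v in set(inst[attribute] for inst in instances):
        subset = [inst[target] for inst in instances if inst[attribute] == v]
        gain -= len(subset) / len(instances) * gini(subset)
    return gain

def select_attribute(instances, attributes, target):
    # argmax_a GiniGain(S, a)
    return max(attributes, key=lambda a: gini_gain(instances, a, target))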

Page 132: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

132

Space of Possible Decision Trees

Assume: a binary classifier; $n$ binary attributes; tree height $h$.

At the root there are $n$ attribute choices, at depth 1 there are $n - 1$, at depth 2 there are $n - 2$, and so on; there are $2^i$ nodes at depth $i$, and each leaf can be labeled with either class.

Number of possible trees: $2^{2^h} \cdot \Big[\sum_{i=0}^{h} 2^{i}\, (n - i)\Big]$

Page 133: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

133

Decision Trees and Rule SystemsPath from each leaf node to the root represents a conjunctive rule:

(Same tree as in the earlier example: root Parents Visiting, then Weather and Money, with leaves Cinema, Play tennis, Stay in, Shopping.)

if (ParentsVisiting==No) & (Weather==Windy) & (Money==Poor) then Cinema.

Page 134: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

134

Decision Trees

• Different training sample -> different resulting tree (different structure).

• Learning does (conditional) feature selection.

Page 135: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

135

Regression TreesLike classification trees but the prediction is a number (as suggested by “regression”).

1. How do we split?
2. When to stop?

(Tree with tests $x_i < s_1$ and $x_j < s_2$; the leaves hold constant predictions $c_1, c_2, c_3 \in \mathbb{R}$.)

Page 136: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

136

Regression Trees: How to Split

Finding: the dimension $j$ and the split value $s$:

$\langle j, s \rangle = \operatorname*{argmin}_{j,\, s} \Big( \min_{c_1} \sum_{X^{(i)}[j] < s} \big(Y^{(i)} - c_1\big)^{2} + \min_{c_2} \sum_{X^{(i)}[j] \ge s} \big(Y^{(i)} - c_2\big)^{2} \Big)$

where $X^{(i)}[j]$ is the $j$-th coordinate of training instance $X^{(i)} = (\ldots, X^{(i)}[j], \ldots)$ with target $Y^{(i)}$; the inner minimizers $c_1, c_2$ are simply the means of the targets on each side of the split.
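A brute-force sketch of this split search, assuming NumPy: for every dimension j and candidate split value s, the best constants c1 and c2 are just the means of Y on each side, so only the sum of squared errors needs to be compared. The toy data is made up for illustration.

# Sketch: exhaustive search for the best (dimension j, split value s) at a regression-tree node.
import numpy as np

def best_split(X, Y):
    # Return (j, s) minimizing the sum of squared errors around the per-side means.
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = Y[X[:, j] < s], Y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best[0], best[1]

# Toy usage with made-up data.
rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 2))
Y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, Y))   # should pick dimension 0 with a split near 0.5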

Page 137: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

137

Regression Trees: PruningTree operation where a pre-terminal gets its two leaves collapsed:

(Before: under the root test $x_i < s_1$, a pre-terminal node testing $x_j < s_2$ has two leaves $c_2$ and $c_3$; after pruning, that subtree is collapsed into a single leaf $c'$ under $x_i < s_1$.)

Page 138: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

138

Regression Trees: How to Stop1. Don’t stop.2. Build big tree.3. Prune.4. Evaluate sub-trees.

Page 139: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

139

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 140: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

140

BOOSTING

Page 141: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

141

ENSEMBLE

Ensemble Methods

The INPUT (an object encoded with features) is fed to several Systems (classifiers); each produces an Output (a prediction of the response/dependent variable); the Final Output is obtained by majority voting (classification) or averaging (regression).

Page 142: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

142

Where the Systems Come from

Sequential ensemble scheme: each System is induced from its own version of the Data; after a classifier is induced, the difficult examples are identified (through weighting the examples) and emphasized in the data used for the next System.

Page 143: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

143

Contrast with Bagging

Non-sequential ensemble scheme: each Data_i is obtained from DATA by sampling with replacement, and a classifier System_i is induced from it. The Data_i are independent of each other (likewise for the System_i).

Page 144: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

144

Base Procedure:Decision Tree

(A System is induced from Data using the base procedure below.)

Binary partitioning of the data during training (navigating to a leaf node during testing): nodes test $x_i < s_1$ / $x_i \ge s_1$ and $x_j < s_2$ / $x_j \ge s_2$, and each leaf holds a prediction. The dimension and split value are selected so that the training instances in a node are more homogeneous in terms of the output variable (more pure) than in ancestor nodes; splitting stops when the instances are homogeneous or few. (Training instances pictured as points whose color reflects the output variable; classification example.)

Page 145: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

145

TRAINING DATA

Ensemble Scheme with a base procedure:

The training data $\{(X^{(1)}, Y^{(1)}), \ldots, (X^{(N)}, Y^{(N)})\}$ is fed to the base procedure, producing $G(X)$: the base procedure applied to the original data gives $G_1(X)$, applied to weighted data gives $G_2(X)$, ..., applied to weighted data gives $G_M(X)$.

Final prediction (regression): $g(X) = \sum_{m=1}^{M} \alpha_m \cdot G_m(X)$

The individual systems are small and don't need to be perfect. The data weights depend only on the previous iteration (memory-less). N.B.: data weights are different from the feature weights in linear models.

Page 146: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

146

AdaBoost (classification)

$G_1(X)$ is trained on the original data, and $G_2(X), \ldots, G_m(X), \ldots, G_M(X)$ on successively re-weighted data. The final prediction is

$g(X) = \mathrm{sign}\Big(\sum_{m=1}^{M} \alpha_m \cdot G_m(X)\Big)$

Initial weights: $w_i^{(1)} = \frac{1}{N}$, where $w_i$ is the weight associated with the $i$-th training example.

Weighted error of predictor $G_m$ (the sum in the numerator runs over the misclassified examples $i$): $err_m = \frac{\sum_{i} w_i^{(m)}}{\sum_{j=1}^{N} w_j^{(m)}}$

Goodness of predictor $G_m$: $\alpha_m = \log\Big(\frac{1 - err_m}{err_m}\Big)$

Weight update: $\tilde{w}_i = w_i^{(m)} \cdot e^{\alpha_m}$ for each misclassified example $i$ (unchanged otherwise), followed by normalization: $w_i^{(m+1)} = \frac{\tilde{w}_i}{\sum_{j=1}^{N} \tilde{w}_j}$.

Page 147: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

147

AdaBoost

Initialize the weights: $w_i^{(1)} = \frac{1}{N}$ for $i = 1, \ldots, N$.

for $m = 1, \ldots, M$:

    Create $G_m$ using the weighted data $w^{(m)}$.

    $err_m = \frac{\sum_{i=1}^{N} w_i^{(m)} \cdot [\![\, G_m(X^{(i)}) \ne Y^{(i)} \,]\!]}{\sum_{j=1}^{N} w_j^{(m)}}$

    $\alpha_m = \log\Big(\frac{1 - err_m}{err_m}\Big)$

    Weight update: $\tilde{w}_i = w_i^{(m)} \cdot e^{\alpha_m [\![\, G_m(X^{(i)}) \ne Y^{(i)} \,]\!]}$, then normalize: $w_i^{(m+1)} = \frac{\tilde{w}_i}{\sum_{j=1}^{N} \tilde{w}_j}$.

Final prediction: $g(X) = \mathrm{sign}\Big(\sum_{m=1}^{M} \alpha_m \cdot G_m(X)\Big) = \operatorname*{argmax}_{Y} \sum_{m=1}^{M} \alpha_m \cdot [\![\, G_m(X) = Y \,]\!]$
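A compact sketch of these update rules, assuming NumPy and using one-level decision stumps as the base procedure G_m; the stump learner and the toy data are my own illustration, not part of the slides.

# Sketch: AdaBoost for labels Y in {-1, +1} with decision stumps as base learners.
import numpy as np

def fit_stump(X, Y, w):
    # Weighted-error-minimizing stump: predict `sign` when X[:, j] < s, else -sign.
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            for sign in (-1, 1):
                pred = np.where(X[:, j] < s, sign, -sign)
                err = w[pred != Y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, j, s, sign)
    return best[1:], best[0]

def stump_predict(stump, X):
    j, s, sign = stump
    return np.where(X[:, j] < s, sign, -sign)

def adaboost(X, Y, M=10):
    N = len(Y)
    w = np.full(N, 1.0 / N)                    # w_i^(1) = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump, err = fit_stump(X, Y, w)
        alpha = np.log((1 - err) / err)        # alpha_m = log((1 - err_m) / err_m)
        miss = stump_predict(stump, X) != Y
        w = w * np.exp(alpha * miss)           # up-weight the misclassified examples
        w /= w.sum()                           # normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

# Toy usage with made-up data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost(X, Y, M=20)
print((predict(stumps, alphas, X) == Y).mean())   # training accuracy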

Page 148: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

148

Binary Classifier

• Constraint:– Must not have all zero clicks for current week, previous week and week before last

[shopping team uses stronger constraint: only instances with non-zero clicks for current week].

• Training: – 1.5M instances.– 0.5M instances (validation).

• Feature extraction:– 4.82mins (Cosmos job).

• Training time:– 2hrs 20mins.

• Testing:– 10k instances: 1sec.

Page 149: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

149

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 150: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

150

POPULARITY EVALUATION

How do we know we have a good popularity?

Page 151: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

151

Rank Correlation Metrics

• Input: two rankings $R_1$ and $R_2$ of the same objects.
• Requirements:

$-1 \le C(R_1, R_2) \le 1$

$C(R_1, R_2) = 1$ : the two rankings are the same.

$C(R_1, R_2) = -1$ : the two rankings are the reverse of each other.

The actual input is a set of objects with two rank scores (ties are possible).

Page 152: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

152

Kendall’s Tau Coefficient

Considers concordant/discordant pairs in the two rankings (each ranking w.r.t. the other).

Complexity: $O(n^2)$ when computed directly over all pairs.

Page 153: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

153

What is a concordant pair?

Example: $R_1 = \langle a, b, c \rangle$ and $R_2 = \langle a, c, b \rangle$. A pair, e.g. $(a, c)$, is concordant when the differences $R_1(a) - R_1(c)$ and $R_2(a) - R_2(c)$ have the same sign (and discordant otherwise).

Page 154: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

154

Kendall Tau: Example

$R_1 = \langle A, B, C, D \rangle$,  $R_2 = \langle C, D, A, B \rangle$

Pairs: (A,B), (A,C), (A,D), (B,C), (B,D), (C,D); the discordant ones are (A,C), (A,D), (B,C), (B,D).

Observation: the total number of discordant ordered pairs is 2x the number of discordant pairs in one ranking w.r.t. the other, i.e. $2 \cdot 4 = 8$ here.

$\tau = 1 - \frac{2 \cdot \mathrm{DiscordantPairs}(R_1, R_2)}{n(n-1)} = 1 - \frac{2 \cdot 8}{4 \cdot (4 - 1)} = -\frac{1}{3}$
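A direct plain-Python sketch of this computation, counting discordant pairs in one ranking with respect to the other; on the A, B, C, D example it reproduces the value -1/3.

# Sketch: Kendall's tau from pairwise (dis)agreements; ranks are positions 1..n.
from itertools import combinations

def kendall_tau(rank1, rank2):
    # rank1, rank2: dicts mapping each object to its rank in R1 and R2
    objects = list(rank1)
    n = len(objects)
    discordant = 0
    for a, b in combinations(objects, 2):
        if (rank1[a] - rank1[b]) * (rank2[a] - rank2[b]) < 0:
            discordant += 1
    # 2 * discordant counts each discordant pair once per ranking (ordered pairs)
    return 1 - 2 * (2 * discordant) / (n * (n - 1))

R1 = {"A": 1, "B": 2, "C": 3, "D": 4}
R2 = {"C": 1, "D": 2, "A": 3, "B": 4}
print(kendall_tau(R1, R2))    # -0.333...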

Page 155: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

155

Spearman’s Coefficient

Considers ranking differences for the same object:

$S(R_1, R_2) = 1 - \frac{6 \cdot \sum_{j=1}^{n} \big(R_1(o_j) - R_2(o_j)\big)^{2}}{n(n^2 - 1)}$

Example: $R_1 = \langle a, b, c \rangle$, $R_2 = \langle a, c, b \rangle$:

$S(R_1, R_2) = 1 - \frac{6 \cdot \big[(R_1(a) - R_2(a))^2 + (R_1(b) - R_2(b))^2 + (R_1(c) - R_2(c))^2\big]}{3(3^2 - 1)} = 1 - \frac{6 \cdot \big[(1-1)^2 + (2-3)^2 + (3-2)^2\big]}{3 \cdot 8} = \frac{1}{2}$

Complexity: $O(n)$ given the rank scores.

Bound used for the normalization: $0 \le \sum_{j=1}^{n} \big(R_1(o_j) - R_2(o_j)\big)^{2} \le \frac{n(n^2 - 1)}{3}$
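The same kind of sketch for Spearman's coefficient (plain Python); on the a, b, c example above it gives 1/2.

# Sketch: Spearman's coefficient from rank differences; O(n) once ranks are known.
def spearman(rank1, rank2):
    objects = list(rank1)
    n = len(objects)
    d2 = sum((rank1[o] - rank2[o]) ** 2 for o in objects)
    return 1 - 6 * d2 / (n * (n * n - 1))

R1 = {"a": 1, "b": 2, "c": 3}
R2 = {"a": 1, "b": 3, "c": 2}
print(spearman(R1, R2))       # 0.5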

Page 156: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

156

Rank Intuitions: Setup

$R_1$ ranks the objects $1, 2, \ldots, 10$ in order; $R_2$ can be viewed as a scrambling of that order, so the sequence $\langle 3, 1, 4, 10, 5, 9, 2, 6, 8, 7 \rangle$ is sufficient to encode the two rankings. (Objects ordered by rank scores; $R_2$ viewed as if scrambling the order of $R_1$.)

Page 157: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

157

Rank Intuitions: Pairs

On the correlation scale from $-1$ to $1$: the value is $1$ when the rankings are in complete agreement and $-1$ when they are in complete disagreement (the reverse of each other).

Page 158: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

158

Rank Intuitions: Spearman

[Figure: Spearman's coefficient, on the $-1$ to $1$ scale, for a family of example rankings parameterized by $p$ ($p = 1, 2, 3, 4, 5, 6, 10, 15, \ldots, n$); segment lengths represent $R_1$ rank scores.]

Page 159: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

159

Rank Intuitions: Kendall

[Figure: Kendall's tau, on the $-1$ to $1$ scale, for the same family of example rankings ($p = 1, 2, 3, 4, 5, 6, 10, 15, \ldots, n$); segment lengths represent $R_1$ rank scores.]

Page 160: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

160

What about ties?

The position of an object within a set of objects that have the same scores in the rankings affects the rank correlation.

If object $o_j$ has the same ranking score as a group of other objects in $R_1$ and $R_2$, then where $o_j$ is placed within its tie group changes the result; for example, one placement of $o_j$ leads to a lower Spearman's coefficient, another to a higher one.

Page 161: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

161

Ties

• Kendall: use strict discordance, i.e. count a pair $(a, b)$ as discordant only when

$\mathrm{sgn}\big(score_1(a) - score_1(b)\big) \ne \mathrm{sgn}\big(score_2(a) - score_2(b)\big)$

• Spearman:
– Can use per-entity upper and lower bounds.
– Do as in the Olympics: objects with the same score get the same rank.

Page 162: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

162

Ties: Kendall TauB

http://en.wikipedia.org/wiki/Kendall_tau#Tau-b

$\tau_B = \frac{n_c - n_d}{\sqrt{\big(\frac{n(n-1)}{2} - n_1\big)\big(\frac{n(n-1)}{2} - n_2\big)}}$

where:
$n_c$ is the number of concordant pairs,
$n_d$ is the number of discordant pairs,
$n$ is the number of objects in the two rankings,
$n_1 = \sum_i \frac{t_i(t_i - 1)}{2}$ is the number of pairs among elements with ties in ranking 1 ($t_i$ is the size of the $i$-th tie group),
$n_2 = \sum_j \frac{u_j(u_j - 1)}{2}$ is the number of pairs among elements with ties in ranking 2 ($u_j$ is the size of the $j$-th tie group).
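In practice one usually reaches for an existing implementation when ties are involved; SciPy's kendalltau computes the tau-b variant. A minimal sketch, assuming SciPy is installed; the tied example values are made up.

# Sketch: tau-b on rank scores with ties (SciPy assumed available).
from scipy.stats import kendalltau

# Without ties, tau-b equals the plain tau from the earlier A, B, C, D example.
tau, p_value = kendalltau([1, 2, 3, 4], [3, 4, 1, 2])
print(tau)          # -0.333...

# With ties, the n1/n2 corrections in the denominator come into play.
tau_b, _ = kendalltau([1, 2, 2, 4], [1, 3, 2, 4])
print(tau_b)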

Page 163: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

163

Uses of popularity
Popularity can be used to augment the gain in NDCG by linearly scaling it:

label: 1 (poor) -> gain 1;  2 (fair) -> gain 3;  3 (good) -> gain 7;  4 (excellent) -> gain 15;  5 (perfect) -> gain 31   (i.e., gain = 2^label - 1)

$Gain + (popularity) \cdot Gain$

Page 164: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

164

Next Steps

• How to determine the popularity of new entities
– Challenge: no historical data.
– Usually there is an initial period of high popularity (e.g., a new restaurant is featured in the local paper, promotions, etc.).

• Good abandonment (no user clicks but a good entity in terms of satisfying the user's information needs, e.g., a phone number).
– Use the number of impressions for named queries.

Page 165: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

165

References
1. Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook. [link]
2. Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press. [link]
3. David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press. [link]
4. Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd Edition. ACM Press Books. [link]
5. Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York. [link]
6. Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. [link]
7. Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. [link]
8. George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press. [link]
9. Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics. Springer. [link]
10. Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer. [link]
11. Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd Edition. Wiley-Interscience. [link]
12. Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer Series in Statistics. Springer. [link]
13. James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience. [link]
14. Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning series. MIT Press. [link]
15. David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press. [link]
16. Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer. [link]
17. Tom M. Mitchell (1997) Machine Learning. McGraw-Hill (Science/Engineering/Math). [link]
18. Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine Learning series. MIT Press. [link]
19. Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific. [link]
20. Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press. [link]
21. Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer. [link]
22. Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer. [link]

Page 166: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

166

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 167: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

167

SEQUENCE LABELING:HIDDEN MARKOV MODELS (HMMs)

Page 168: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

168

Outline

• The guessing game• Tagging preliminaries• Hidden Markov Models• Trellis and the Viterbi algorithm• Implementation (Python)• Complexity of decoding• Parameter estimation and smoothing• Second order models

Page 169: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

169

The Guessing Game

• A cow and duck write an email message together.• Goal – figure out which word is written by which animal.

The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).

Page 170: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

170

What’s the Big Deal ?

• The vocabularies of the cow and the duck can overlap and it is not clear a priori who wrote a certain word!

Page 171: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

171

The Game (cont)

At first, ? ? ? above the words moo, hello, quack (the authors are unknown); then partially resolved: COW above moo, ? above hello, DUCK above quack.

Page 172: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

172

The Game (cont)

The fully labeled message: COW above moo, COW above hello, DUCK above quack.

Page 173: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

173

What about the Rest of the Animals?

(Each of word1 ... word5 could in principle have been produced by any of the animals — ZEBRA, PIG, DUCK, COW, ANT — so every word carries the full set of candidate labels until we disambiguate.)

Page 174: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

174

A Game for Adults

• Instead of guessing which animal is associated with each word guess the corresponding POS tag of a word.

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.

Page 175: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

175

POS Tags: "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB", "#", "$", ".", ",", ":", "(", ")", "`", "``", "'", "''"

Page 176: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

176

Tagging Preliminaries

• We want the best set of tags for a sequence of words (a sentence)

• W — a sequence of words• T — a sequence of tags

$\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)$

Page 177: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

177

Bayes’ Theorem (1763)

$P(T \mid W) = \frac{P(W \mid T) \cdot P(T)}{P(W)}$

posterior = likelihood x prior / marginal likelihood

Reverend Thomas Bayes — Presbyterian minister (1702-1761)

Page 178: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

178

Applying Bayes' Theorem
• How do we approach P(T|W)? Use Bayes' theorem:

$\operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} \frac{P(W \mid T) \cdot P(T)}{P(W)}$

• So what? Why is it better?
• Ignore the denominator (it does not depend on $T$):

$\operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} \frac{P(W \mid T) \cdot P(T)}{P(W)} = \operatorname*{argmax}_{T} P(W \mid T) \cdot P(T)$

Page 179: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

179

Tag Sequence Probability

• Count the number of times a sequence occurs and divide by the number of sequences of that length — not likely!– Use chain rule

How do we get the probability P(T) of a specific tag sequence T?

Page 180: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

180

P(T) is a product of the probabilities of the N-grams that make it up.

Chain Rule

$P(T) = P(t_1, \ldots, t_n) = P(t_1) \cdot P(t_2 \mid t_1) \cdot P(t_3 \mid t_1, t_2) \cdots P(t_n \mid t_1, \ldots, t_{n-1})$    (each factor is conditioned on the full history)

Make a Markov assumption: the current tag depends on the previous one only:

$P(t_1, \ldots, t_n) = P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})$

Page 181: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

181

• Use counts from a large hand-tagged corpus.
• For bi-grams, count all the $t_{i-1}\, t_i$ pairs.

• Some counts are zero – we'll use smoothing to address this issue later.

Transition Probabilities

$P(t_i \mid t_{i-1}) = \frac{c(t_{i-1}\, t_i)}{c(t_{i-1})}$

Page 182: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

182

What about P(W|T) ?

• First it's odd—it is asking the probability of seeing "The white horse" given "Det Adj Noun"!
– Collect up all the times you see that tag sequence and see how often "The white horse" shows up …

• Assume each word in the sequence depends only on its corresponding tag:

$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)$    (emission probabilities)

Page 183: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

183

Emission Probabilities

• What proportion of times is the word $w_i$ associated with the tag $t_i$ (as opposed to another word):

$P(w_i \mid t_i) = \frac{c(w_i, t_i)}{c(t_i)}$

Page 184: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

184

The “Standard” Model

$\hat{T} = \operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} \frac{P(W \mid T) \cdot P(T)}{P(W)} = \operatorname*{argmax}_{T} P(W \mid T) \cdot P(T) = \operatorname*{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i) \cdot P(t_i \mid t_{i-1})$

Page 185: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

185

Hidden Markov Models

• Stochastic process: a sequence $X_1, X_2, \ldots$ of random variables based on the same sample space $\Omega$.

• Probabilities for the first observation: $P(X_1 = x_j)$ for each outcome $x_j$.

• Next step given the previous history: $P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t})$

Page 186: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

186

• A Markov Chain is a stochastic process with the Markov property:

$P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t}) = P(X_{t+1} = x_{i_{t+1}} \mid X_t = x_{i_t})$

• Outcomes are called states.
• Probabilities for the next step – a weighted finite state automaton.

Page 187: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

187

State Transitions w/ Probabilities

(State diagram: START -> COW with probability 1.0; COW -> COW 0.5, COW -> DUCK 0.3, COW -> END 0.2; DUCK -> DUCK 0.5, DUCK -> COW 0.3, DUCK -> END 0.2.)

Page 188: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

188

Markov Model

Markov chain where each state can output signals (like "Moore machines"):

(Same state diagram as before — START -> COW 1.0; COW -> COW 0.5, COW -> DUCK 0.3, COW -> END 0.2; DUCK -> DUCK 0.5, DUCK -> COW 0.3, DUCK -> END 0.2 — with emissions COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6; START emits ^ and END emits $ with probability 1.0.)

Page 189: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

189

The Issue Was

• A given output symbol can potentially be emitted by more than one state — omnipresent ambiguity in natural language.

Page 190: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

190

Markov Model

Finite set of states: $\{s_1, \ldots, s_m\}$

Signal alphabet: $\{\sigma_1, \ldots, \sigma_k\}$

Transition matrix: $P = [p_{ij}]$ where $p_{ij} = P(s_j \text{ at } t+1 \mid s_i \text{ at } t)$

Emission probabilities: $A = [a_{ij}]$ where $a_{ij} = P(\sigma_j \mid s_i)$

Initial probability vector: $v = [v_1, \ldots, v_m]$ where $v_j = P(s_j \text{ at } t = 1)$

Page 191: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

191

Graphical Model

(Graphical model: a chain of STATE (tag) nodes, each emitting an OUTPUT (word) node.)

Page 192: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

192

Hidden Markov Model

• A Markov Model for which it is not possible to observe the sequence of states.

• S: unknown — the sequence of states (tags)
• O: known — the sequence of observations (words)

$\hat{S} = \operatorname*{argmax}_{S} P(S \mid O)$

Page 193: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

193

The State Space

(The state space: START -> COW 1.0, START -> DUCK 0.0; COW -> COW 0.5, COW -> DUCK 0.3, COW -> END 0.2; DUCK -> DUCK 0.5, DUCK -> COW 0.3, DUCK -> END 0.2; emissions COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6. The observed signal sequence is moo hello quack.)

More on how the probabilities come about (training) later.

Page 194: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

194

Optimal State Sequence:The Viterbi Algorithm

We define $\delta_t(i)$, the joint probability of the most likely state sequence from time 1 to time $t$ ending in state $s_i$, together with the observed sequence $O_{\le t}$ up to time $t$:

$\delta_t(i) = \max_{S_{t-1}} P(S_t = s_i;\, O_{\le t}) = \max_{s_{i_1}, \ldots, s_{i_{t-1}}} P(s_{i_1}, \ldots, s_{i_{t-1}}, S_t = s_i;\, O_{\le t})$

Page 195: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

195

Key Observation

The most likely partial derivation leading to state $s_i$ at position $t$ consists of:

– the most likely partial derivation leading to some state $s_{i_{t-1}}$ at the previous position $t-1$,

– followed by the transition from $s_{i_{t-1}}$ to $s_i$.

Page 196: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

196

Note:

We will show that:

$\delta_1(i) = v_i\, a_{i k_1}$, where $v_i = P(s_i)$ and $a_{i k_1} = P(\sigma_{k_1} \mid s_i)$

Viterbi (cont)

$\delta_{t+1}(j) = \big[\max_i \delta_t(i)\, p_{ij}\big] \cdot a_{j k_{t+1}}$

Page 197: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

197

Recurrence Equation

$\delta_{t+1}(j) = \max_{S_t} P(S_{t+1} = s_j;\, O_{\le t+1})$
$= \max_{i} \max_{S_{t-1}} P(S_{t+1} = s_j, \sigma_{k_{t+1}}, S_t = s_i;\, O_{\le t})$
$= \max_{i} \max_{S_{t-1}} P(S_{t+1} = s_j, \sigma_{k_{t+1}} \mid S_t = s_i;\, O_{\le t}) \cdot P(S_t = s_i;\, O_{\le t})$
$= \max_{i} \big[ P(\sigma_{k_{t+1}} \mid s_j) \cdot P(s_j \mid s_i) \cdot \max_{S_{t-1}} P(S_t = s_i;\, O_{\le t}) \big]$
$= \max_{i} \big[ \delta_t(i)\, p_{ij} \big] \cdot a_{j k_{t+1}}$

(The conditioning step uses the Markov and emission-independence assumptions: given $S_t = s_i$, the next transition and emission do not depend on the earlier states or observations.)

Page 198: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

198

Back Pointers

• The predecessor of state $s_i$ on the path corresponding to $\delta_t(i)$:

$\psi_t(i) = \operatorname*{argmax}_{j \in 1..m} \big[ \delta_{t-1}(j)\, p_{ji} \big]$

• Optimal state sequence (recovered by backtracking):

$s^*_n = \operatorname*{argmax}_{i \in 1..m} \delta_n(i)$,   $s^*_t = \psi_{t+1}(s^*_{t+1})$ for $t = n-1, \ldots, 1$

Page 199: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

199

The Trellis

(Trellis for the observations ^ moo hello quack $, with rows START, COW, DUCK, END and columns t = 0..4. The non-zero $\delta$ values are: t=0: START 1; t=1: COW 0.9; t=2: COW 0.045, DUCK 0.108; t=3: DUCK 0.0324 (reaching DUCK via COW would only give 0.0081); t=4: END 0.00648.)

Page 200: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

200

Implementation (Python)

observations = ['^','moo','hello','quack','$']  # signal sequence
states = ['start','cow','duck','end']

# Transition probabilities - p[FromState][ToState] = probability
p = {'start': {'cow':1.0},
     'cow':   {'cow':0.5, 'duck':0.3, 'end':0.2},
     'duck':  {'duck':0.5, 'cow':0.3, 'end':0.2}}

# Emission probabilities; special emission symbol '$' for the 'end' state
a = {'cow':  {'moo':0.9, 'hello':0.1, 'quack':0.0, '$':0.0},
     'duck': {'moo':0.0, 'hello':0.4, 'quack':0.6, '$':0.0},
     'end':  {'moo':0.0, 'hello':0.0, 'quack':0.0, '$':1.0}}

Page 201: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

201

Implementation (Viterbi)

T = len(observations)  # number of time steps
# Initializing the viterbi table row by row; v[state][time] = prob
v = {}
for s in states: v[s] = [0.0] * T

# Initializing back pointers
backPointer = {}
for s in states: backPointer[s] = [""] * T

v['start'][0] = 1.0
for t in range(T-1):               # populate column t+1 of v
    for s in states[:-1]:          # the 'end' state is never a source state
        # only consider the 'start' state at time 0
        if t == 0 and s != 'start': continue
        for s1 in p[s].keys():     # s1 is the next state
            newScore = v[s][t] * p[s][s1] * a[s1][observations[t+1]]
            if v[s1][t+1] == 0.0 or newScore > v[s1][t+1]:
                v[s1][t+1] = newScore
                backPointer[s1][t+1] = s

Page 202: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

202

Implementation (Best Path)

# Now recover the optimal state sequence by following the back pointers from 'end'
state = 'end'
state_sequence = [ state ]
for t in range(T-1, 0, -1):
    state = backPointer[state][t]
    state_sequence = [state] + state_sequence

print "Observations....: ", observations
print "Optimal sequence: ", state_sequence
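Given the probabilities defined above, running the three snippets end to end should print the optimal sequence ['start', 'cow', 'duck', 'duck', 'end'] for the observations ^ moo hello quack $; this matches the best path through the trellis on the earlier slide, whose score is 1.0 · 0.9 · 0.3 · 0.4 · 0.5 · 0.6 · 0.2 = 0.00648.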

Page 203: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

203

Complexity of Decoding

• O(m^2 n) — linear in n (the length of the string)
• Initialization: O(mn)
• Back tracing: O(n)
• Next step: O(m^2):

for current_state in s1..sm    # at time t+1
    for prev_state in s1..sm   # at time t
        compute value
        compare with best_so_far

• There are n next steps.

Page 204: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

204

Parameter Estimation for HMMs

• Need annotated training data (Brown, PTB).• Signal and state sequences both known.• Calculate observed relative frequencies.• Complications — sparse data problem (need for smoothing).

• One can use only raw data too — Baum-Welch (forward-backward) algorithm.
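A minimal sketch of the relative-frequency estimation (plain Python, no smoothing); the two-sentence toy corpus is a made-up stand-in for an annotated corpus such as Brown or the PTB.

# Sketch: maximum-likelihood (relative frequency) estimation of HMM parameters
# from tagged sentences: P(t_i | t_{i-1}) = c(t_{i-1} t_i) / c(t_{i-1}),
#                        P(w_i | t_i)     = c(w_i, t_i)   / c(t_i).
from collections import defaultdict

corpus = [  # toy stand-in for a hand-tagged corpus
    [("the", "DT"), ("cow", "NN"), ("moos", "VBZ")],
    [("the", "DT"), ("duck", "NN"), ("quacks", "VBZ")],
]

transition_counts = defaultdict(lambda: defaultdict(int))
emission_counts = defaultdict(lambda: defaultdict(int))
tag_counts = defaultdict(int)

for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        tag_counts[tag] += 1
        prev = tag
    transition_counts[prev]["END"] += 1

def transition_prob(t_prev, t):
    total = sum(transition_counts[t_prev].values())
    return transition_counts[t_prev][t] / total if total else 0.0

def emission_prob(word, tag):
    return emission_counts[tag][word] / tag_counts[tag] if tag_counts[tag] else 0.0

print(transition_prob("DT", "NN"))    # 1.0
print(emission_prob("cow", "NN"))     # 0.5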

Page 205: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

205

Optimization

• Build a vocabulary of possible tags for words.
• Keep total counts for words.
• If a word occurs frequently (count > threshold), consider its observed tag set exhaustive.
• For frequent words, only consider their tag set (vs. all tags).
• For unknown words, don't consider tags corresponding to closed-class words (e.g., DT).

Page 206: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

206

Applications Using HMMs

• POS tagging (as we have seen).• Chunking.• Named Entity Recognition (NER).• Speech recognition.

Page 207: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

207

Exercises

• Implement the training (parameter estimation).• Use a dictionary of valid tags for known words to constrain

which tags are considered for a word.• Implement a second-order model.• Implement the decoder in Ruby.

Page 208: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

208

Some POS Taggers

• Alias-I: http://www.alias-i.com/lingpipe
• AUTASYS: http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm
• Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
• CLAWS: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
• Connexor: http://www.connexor.com/software/tagger
• Edinburgh (LTG): http://www.ltg.ed.ac.uk/software/pos/index.html
• FLAT (Flexible Language Acquisition Tool): http://lanaconsult.com
• fnTBL: http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html
• GATE: http://gate.ac.uk
• Infogistics: http://www.infogistics.com/posdemo.htm
• Qtag: http://www.english.bham.ac.uk/staff/omason/software/qtag.html
• SNoW: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=POS
• Stanford: http://nlp.stanford.edu/software/tagger.shtml
• SVMTool: http://www.lsi.upc.edu/~nlp/SVMTool
• TNT: http://www.coli.uni-saarland.de/~thorsten/tnt
• Yamcha: http://chasen.org/~taku/software/yamcha/

Page 209: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

209


Page 210: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

210

Statistics Refresher
• Outcome: Individual atomic result of a (non-deterministic) experiment.
• Event: A set of outcomes.
• Probability: Limit of the frequency of the target outcome over the number of experiments (frequentist view) or degree of belief (Bayesian view).
• Normalization condition: Probabilities for all outcomes sum to 1.
• Distribution: Probabilities associated with each outcome.
• Random variable: Mapping of the outcomes to real numbers.
• Joint distributions: Conducting several (possibly related) experiments and observing the results. The joint distribution states the probability for a combination of values of several random variables.
• Marginal: Finding the distribution of one random variable from a joint distribution.
• Conditional probability (Bayes' rule): Knowing the value of one variable constrains the distribution of another.
• Probability density functions: Probability that a continuous variable is in a certain range.
• Probabilistic reasoning: Introduce evidence (set certain variables) and compute the probabilities of interest (conditioned on this evidence).

Page 211: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

211

Definitions

Expectation: $\mu = E[X] = \sum_{i=1}^{n} x_i \cdot p(x_i) = \int_{-\infty}^{\infty} x\, p(x)\, dx$

Mode: $x^{*} = \operatorname*{argmax}_{i} p(x_i)$

Variance: $\sigma^2 = Var(X) = E\big[(X - \mu)^2\big] = E[X^2] - \mu^2$

Expectation of a function: $E[f(X)] = \sum_{i=1}^{n} f(x_i) \cdot p(x_i) = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx$

$n$-th moment: $E[X^n] = \sum_{i=1}^{n} x_i^{n} \cdot p(x_i)$    ($\mu$ is the first moment)

Properties: $E[aX + b] = aE[X] + b$,   $E[X + Y] = E[X] + E[Y]$,   $Var[aX + b] = a^2\, Var[X]$

Page 212: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

212

Intuitions about Scale

Weight in grams if the Earth were to be a black hole.

Age of the universe in seconds.

Number of cells in the human body (100 trillion).

Number of neurons in the human brain.

Standard Blu-ray disc size, XL 4 layer (128GB).

One year in seconds.

Items in the Library of Congress (largest in the world).

Length of the Nile in meters (longest river).

Page 213: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

213

Acknowledgements

• Bran Boguraev
• Chris Brew
• Jinho Choi
• William Headden
• Jingjing Li
• Jason Kessler
• Mike Mozer
• Shumin Wu
• Tong Zhang
• Amir Padovitz
• Bruno Bozza
• Kent Cedola
• Max Galkin
• Manuel Reyes Gomez
• Matt Hurst
• John Langford
• Priyank Singh