Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

Machine Learning with Applications in Categorization, Popularity and Sequence labeling (linear models, decision trees, ensemble methods, evaluation) Dr. Nicolas Nicolov <1st_last@yahoo.com>

description

Machine Learning tutorial series.

Transcript of Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

Page 1: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

(linear models, decision trees, ensemble methods, evaluation)

Dr. Nicolas Nicolov <[email protected]>

Page 2: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

2

Goals

• Introduce important ML concepts
• Illustrate ML techniques through examples in:
  – Categorization
  – Popularity
  – Sequence labeling

(tutorial aims to be self-contained and to explain the notation)

Page 3: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

3

Outline
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 4: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

4

EXAMPLES OF MACHINE LEARNING
Why? – Get a flavor of the diversity of areas where ML is applied.

Page 5: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

5

Sequence Labeling

George W. Bush discussed Iraq
PER    PER  PER                GPE        (GPE = Geo-Political Entity)

<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>

(like search query analysis)

Page 6: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

6

Spam

www.dietsthatwork.com

www . dietsthatwork . com

www . diets that work . com

SPAM!

further segmentation

classification

Page 7: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

7

Tokenization
What!?I love the iphone:-)

What !? I love the iphone :-)

How difficult can that be? — 98.2% [Zhang et al. 2003]

NO TRESSPASSING VIOLATORS WILL BE PROSECUTED

Page 8: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

8

NL Parsing

Unlike my sluggish Chevy the Audi handles the winding mountain roads superbly

syntactic structure (dependency labels in the parse figure include CONTR, PREP, POSS, DET, MOD, SUBJ, DOBJ, MANR)

Page 9: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

9

State Transitions

(figure: parser configurations with stack λ and buffer β before and after each action)

LEFTARC:  RIGHTARC:  NOARC:  SHIFT:

using ML to make the decision which action to take

Page 10: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

10

Two Ladies in a Men’s Club

Page 11: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

11

We serve men — SUBJ/IOBJ reading vs. SUBJ/DOBJ reading:

serve —IndirectObject→ men:  "We serve food to men."  "We serve our community."
serve —DirectObject→ men:  "We serve organic food."  "We serve coffee to connoisseurs."

Page 12: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

12

Audi is an automaker that makes luxury cars and SUVs. The company was born in Germany. It was established by August Horch in 1910. Horch had previously founded another company and his models were quite popular. Audi started with four cylinder models. By 1914, Horch's new cars were racing and winning. August Horch left the Audi company in 1920 to take a position as an industry representative for the German motor vehicle industry federation. Currently Audi is a subsidiary of the Volkswagen group and produces cars of outstanding quality.

Coreference

Page 13: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

13

Parts of Objects (Meronymy)

[…] the interior seems upscale with leatherette upholstery that looks and feels better than the real cow hide found in more expensive vehicles, a dashboard accented by textured soft-touch materials, a woven mesh

headliner, and other materials that give the New Beetle’s interior a sense of quality. […] Finally, and a big plus in my book, both front seats were height adjustable, and the steering column tilted and telescoped for optimum comfort.

Page 14: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

14

Sentiment Analysis

I love pineapple nearly as much as I hate bananas.

POSITIVE sentiment regarding topic pineapple.

(figure: mentions of Xbox classified as Positive, Negative, or Neutral)

Page 15: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

15

Chinese Sentiment

Car aspects Sentiment categories

Sentence

Page 16: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

16

Page 17: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

17

Page 18: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

18

Categorization

• High-level task:
  – Given a restaurant, what is its restaurant sub-category?
• Encoding entities with features
• Feature selection
• Linear models

non-standard order

“Though this be madness, yet there is method in't.”

Page 19: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

19

Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 20: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

20

ENCODING OBJECTS WITH FEATURESWhy?– ML algorithms are “generic”; most of them are cast as solutions around vector encodings of the domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as feature vectors. How well we do this (the quality of features) directly impacts system performance.

Page 21: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

21

Flat Object → Encoding

1 0 0 1 1 1 0 1 …    37

The 0/1 entries are the feature values (binary in this example); 37 is the target class index (for "asian"). Together they form a machine learning (training) instance/example/observation.

Example features: default feature (always on); Name has "asian bistro"; Name has "restaurant"; Name has "ginger"; Description has "china"; Description has "indonesia"; URL has "french"; has FB page.

The target can be a set — an object can belong to several classes. The number of features can be millions.

Page 22: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

22

Structured Objects to Strings to Features

Structured object with fields f1 … f6; field f2:f4 contains the string "a b c d e".

Feature strings:
  uni-grams: "f2:f4>a", "f2:f4>b", "f2:f4>c", …
  bi-grams: "f2:f4>a_b", "f2:f4>b_c", "f2:f4>c_d", …
  tri-grams: "f2:f4>a_b_c", "f2:f4>b_c_d", …

Feature string → Feature index
*DEFAULT* → 0
…
f2:f4>a → 100
f2:f4>b → 101
f2:f4>c → 102
…
f2:f4>a_b → 105
f2:f4>b_c → 106
f2:f4>c_d → 107
…
f2:f4>a_b_c → 109

Read as field “f2:f4” contains feature “a”.

Table can be quite large.

Page 23: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

23

Sliding Window (bi-grams)

SkyCity at the Space Needle
^ SkyCity at the Space Needle $        (add initial "^" and final "$" tokens)

A window slides over the token sequence, producing the bi-grams:
^_SkyCity, SkyCity_at, at_the, the_Space, Space_Needle, Needle_$

Page 24: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

24

Example: Feature Templates

public static List<string> NGrams( string field )
{
    var features = new List<string>();
    string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries );

    features.Add( string.Join( "", field.Split( SPLIT_CHARS ) ) ); // the entire field

    string unigram = string.Empty, bigram = "^", previous1 = "^", previous2 = "^", trigram;

    for (int i = 0; i < tokens.Length; i++)
    {
        unigram = tokens[ i ];
        features.Add( unigram );

        bigram = previous1 + "_" + unigram;
        features.Add( bigram );

        if ( i >= 1 )
        {
            trigram = previous2 + "_" + bigram;
            features.Add( trigram );
        }

        previous2 = previous1;
        previous1 = unigram;
    }
    features.Add( unigram + "_$" );
    features.Add( bigram + "_$" );

    return features;
}

The initial bi-gram is "^_tokens[0]"; the initial tri-gram is "^_tokens[0]_tokens[1]"; the last tri-gram is "tokens[tokens.Length-2]_tokens[tokens.Length-1]_$".

could add field name as argument and prefix all features

Page 25: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

25

The Art of Feature Engineering: Disjunctive Features

• Useful feature = triggers often and with a particular class.
• Rarely occurring (but indicative of a class) features can be combined in a disjunction. This results in:
  – Need for less data to achieve good performance.
  – Final system performance (with all available data) is higher.
• How can we get insights about such features? Error analysis!

Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese| branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi| gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini| panino| parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto| radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu| tortellini|vitello|vongole");

if (ITALIAN_FOOD.Match(entity.description).Success) features.Add("Italian_Food_Matched_Description");

It is up to us what we call the feature; the regex match is what triggers it.

Page 26: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

26

instance( class= 7, features=[0,300857,100739,200441,...])instance( class=99, features=[0,201937,196121,345758,13,...])instance( class=42, features=[0,99173,358387,1001,1,...])...

Generic Nature of ML Systems

human sees

computer “sees”

Default feature always triggers.

The number of features that trigger for individual instances is often not the same.

Indices of (binary) features that trigger.

Page 27: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

27

Training Data

$X=\begin{pmatrix} x_0^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_0^{(N)} & \cdots & x_d^{(N)} \end{pmatrix} \qquad \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{pmatrix}$

Each row is an instance with its outcome.

Page 28: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

28

Feature Selection

• Templates: powerful way to get lots of features.
• We get too many features (e.g., 20M for dependency parsing).
• Danger of overfitting: doing well on seen data but poorly on unseen data.
• Feature selection — automatic ways of finding discriminative features:
  – CountCutOff
  – TF×IDF
  – Mutual information
  – Information gain
  – Chi square (we will examine its implementation in detail)

Page 29: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

29

Mutual Information
• A measure of relative entropy between the distributions of two random variables.
• $MI(f,C)$ = expected value of $I(f,c)$ across all classes.
• An alternative is to use $I_{max}(f,C)$.

$I(f,c)=\log\dfrac{P(f,c)}{P(f)\,P(c)}=\log\dfrac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$

$MI(f,C)=\sum_{c\in C}P(c)\,I(f,c)=\sum_{c\in C}\dfrac{n_c}{N_t}\log\dfrac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$

$I_{max}(f,C)=\max_{c\in C}I(f,c)=\max_{c\in C}\log\dfrac{n_{f,c}/N_t}{(n_f/N_t)\cdot(n_c/N_t)}$
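
A minimal sketch (not from the slides) of computing $MI(f,C)$ from the counts above; the variable names nFC, nC, nF and Nt are illustrative:

// Mutual information of one feature with the set of classes, from co-occurrence counts.
// nFC[c] = n_{f,c}, nC[c] = n_c, nF = n_f, Nt = total number of instances. Assumes 'using System;'.
static double MutualInformation(int[] nFC, int[] nC, int nF, int Nt)
{
    double mi = 0.0;
    for (int c = 0; c < nC.Length; c++)
    {
        if (nFC[c] == 0) continue;                 // skip log(0) terms
        double pJoint = (double)nFC[c] / Nt;       // P(f,c)
        double pF = (double)nF / Nt;               // P(f)
        double pC = (double)nC[c] / Nt;            // P(c)
        mi += pC * Math.Log(pJoint / (pF * pC));   // P(c) * I(f,c)
    }
    return mi;
}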

Page 30: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

30

Information Gain

Balances the effect of the feature triggering for an object with the effect of the feature being absent.

$IG(f,C)=H(C)-H(C\mid f)-H(C\mid\neg f)$

$=-\sum_{c\in C}P(c)\log P(c)-\Big(-\sum_{c\in C}P(f,c)\log P(c\mid f)\Big)-\Big(-\sum_{c\in C}P(\neg f,c)\log P(c\mid\neg f)\Big)$

$=-\sum_{c\in C}\Big(\dfrac{n_c}{N_t}\log\dfrac{n_c}{N_t}-\dfrac{n_{f,c}}{N_t}\log\dfrac{n_{f,c}}{n_f}-\dfrac{n_c-n_{f,c}}{N_t}\log\dfrac{n_c-n_{f,c}}{N_t-n_f}\Big)$

Page 31: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

31

Chi Square

Quantifies lack of independence between feature $f$ and class $c$:

$\chi^2(f,c)=\dfrac{N_t\big(P(f,c)P(\neg f,\neg c)-P(f,\neg c)P(\neg f,c)\big)^2}{P(f)P(\neg f)P(c)P(\neg c)}=\dfrac{N_t\big(n_{f,c}(N_t-n_f-n_c+n_{f,c})-(n_f-n_{f,c})(n_c-n_{f,c})\big)^2}{n_c\,n_f\,(N_t-n_c)(N_t-n_f)}$

With contingency-table counts a = n(f,c), b = n(¬f,c), c = n(f,¬c), d = n(¬f,¬c):

float Chi2(int a, int b, int c, int d)
{
    // (a+b+c+d) * (ad-bc)^2 / ((a+b)(a+c)(c+d)(b+d))
    // The square must be written as a multiplication ('^' is XOR in C#),
    // and the arithmetic is done in double to avoid integer overflow.
    double diff = (double)a * d - (double)b * c;
    double num = (double)(a + b + c + d) * diff * diff;
    double den = (double)(a + b) * (a + c) * (c + d) * (b + d);
    return (float)(num / den);
}

Calling: Chi2( , , , )

Page 32: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

32

Exponent (Log) Trick
While the final output may not be big, intermediate results are. Solution: compute in log space, using $x=e^{\ln x}$:

$\dfrac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}=e^{\ln\frac{(a+b+c+d)(ad-bc)^2}{(a+b)(a+c)(c+d)(b+d)}}=e^{\ln\big((a+b+c+d)(ad-bc)^2\big)-\ln\big((a+b)(a+c)(c+d)(b+d)\big)}=e^{\ln(a+b+c+d)+2\ln|ad-bc|-\ln(a+b)-\ln(a+c)-\ln(c+d)-\ln(b+d)}$

float Chi2_v2(int a, int b, int c, int d)
{
    double total = a + b + c + d;
    double n = Math.Log(total);
    double num = 2.0 * Math.Log(Math.Abs((double)a * d - (double)b * c));
    double den = Math.Log(a + b) + Math.Log(a + c) + Math.Log(c + d) + Math.Log(b + d);
    return (float)Math.Exp(n + num - den);
}

Page 33: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

33

Chi Square: Score per Feature

• We know how to compute $\chi^2(f,c)$.
• Two options for an aggregate score across classes:
  – Weighted average: $\chi^2(f)=\sum_{c\in Classes}P(c)\,\chi^2(f,c)$
  – Highest score among any class: $\chi^2(f)=\max_{c\in Classes}\chi^2(f,c)$

Page 34: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

34

Chi Square Feature Selection

int[] featureCounts = new int[ numFeatures ];
int numLabels = labelIndex.Count;
int[] classTotals = new int[ numLabels ];           // instances with that label
float[] classPriors = new float[ numLabels ];       // class priors: classTotals[label]/numInstances
int[,] counts = new int[ numLabels, numFeatures ];  // (label, feature) co-occurrence counts
int numInstances = instances.Count;

...  // do a pass over the data and collect the above counts

float[] weightedChiSquareScore = new float[ numFeatures ];
for (int f = 0; f < numFeatures; f++)  // f is a feature index
{
    float score = 0.0f;
    for (int labelIdx = 0; labelIdx < numLabels; labelIdx++)
    {
        int a = counts[ labelIdx, f ];              // label and feature co-occur
        int b = classTotals[ labelIdx ] - a;        // label without the feature
        int c = featureCounts[ f ] - a;             // feature without the label
        int d = numInstances - ( a + b + c );       // neither
        if (a >= MIN_SUPPORT && b >= MIN_SUPPORT)   // MIN_SUPPORT = 5
        {
            score += classPriors[ labelIdx ] * Chi2( a, b, c, d );  // weighted average across all classes
        }
    }
    weightedChiSquareScore[ f ] = score;
}

Page 35: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

35

⇒ Summary: Encoding

• Object representation is crucial.
• Humans: good at suggesting features (templates).
• Computers: good at filtering (feature selection).

• Feature engineering: Ensuring systems use the “right” features.

The system designer does not have to worry about which feature is more important or useful, and the job is left to the learning algorithm to assign appropriate weights to the corresponding features. The system designer’s job is to define a set of features that is large enough to represent most of the useful information, yet small enough to be manageable for the algorithms and the infrastructure.

Page 36: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

36

Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 37: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

37

MACHINE LEARNING: GENERAL FRAMEWORK

Page 38: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

38

Machine Learning: Representation

Complex decision making: Object → Outcome, e.g. Entity → Category, Entity → Popularity, Entity → IsChainElement.

$X\to Y$, i.e. $\vec{x}=(x_0,\ldots,x_d)\to Y$

$\vec{x}$: the object encoded with features (think DB attributes / OO member fields of primitive types); $d$ is the feature dimensionality (the input/independent variable).
$Y$: the prediction (response/dependent variable); it can be qualitative/quantitative (classification/regression).
The classifier maps the encoded object to the prediction.

We may know the relation for certain values of $\vec{x}$ and $y$: $(\vec{x},y)$.
In fact, we may know the relation for many $\vec{x}$'s and $y$'s: $\{(\vec{x}^{(1)},y^{(1)}),\ldots,(\vec{x}^{(N)},y^{(N)})\}$; the $i$-th is $(\vec{x}^{(i)},y^{(i)})$.

Page 39: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

39

Notation

$\vec{x}^{(i)}=(x_0^{(i)},\ldots,x_j^{(i)},\ldots,x_d^{(i)})$ is the $i$-th instance.

$N$ is the total number of data items.

$(i)$ is not "to the power of" — hence the parentheses.

$x_j^{(i)}$ is the $j$-th component of the feature vector $\vec{x}^{(i)}$.

We will often have $x_0$ be the default feature with value 1.

Page 40: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

40

TRAINING

Machine Learning

Online system: an input object encoded with features goes through the classifier, which produces the prediction (response/dependent variable) as the final output. The classifier uses a model produced offline by the training sub-system from training data.

$X\to Y$, where $f(X)=Y$. The task is very complex — it is hard to construct a good $f$ directly. We construct an approximation $g$ to $f$, chosen from a hypothesis space.

Page 41: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

41

Classes of Learning Problems

• Classification: Assign a category to each item (Chinese | French | Indian | Italian | Japanese restaurant).

• Regression: Predict a real value for each item (stock/currency value, temperature).

• Ranking: Order items according to some criterion (web search results relevant to a user query).

• Clustering: Partition items into homogeneous groups (clustering twitter posts by topic).

• Dimensionality reduction: Transform an initial representation of items into a lower-dimensional representation while preserving some properties (preprocessing of digital images).

Page 42: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

42

ML Terminology
• Examples: Items or instances used for learning or evaluation.
• Features: Set of attributes, represented as a vector, associated with an example.
• Labels: Values or categories assigned to examples. In classification the labels are categories; in regression the labels are real numbers.
• Target: The correct label for a training example. This is extra data that is needed for supervised learning.
• Output: The label predicted from an input set of features using the model of the machine learning algorithm.
• Training sample: Examples used to train a machine learning algorithm.
• Validation sample: Examples used to tune the parameters of a learning algorithm.
• Model: Information that the machine learning algorithm stores after training. The model is used when predicting the output labels of new, unseen examples.
• Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage.
• Loss function: A function that measures the difference/loss between a predicted label and a true label. We will design the learning algorithms so that they minimize the error (cumulative loss across all training examples).
• Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The learning algorithm chooses one function among those in the hypothesis set to return after training. Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters (e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the parameters that minimize the error.
• Model selection: Process for selecting the free parameters of the algorithm (actually of the function in the hypothesis set).

Page 43: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

43

Classification

• Data: $\{(\vec{x}^{(i)},y^{(i)})\}_{i=1}^{N}$
• Binary classification:
  – Outcomes: $y\in\{-1,+1\}$

(figure: + and − training points separated by a decision boundary)

Yes, this is mysterious at this point.

Page 44: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

44

Multi-Class Classification

• Outcomes: more than two classes.
• Common to use binary classification approaches: One-Versus-All (OVA), One-Versus-One (OVO).

Page 45: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

45

One-Versus-All (OVA)

For each category in turn, create a binary classifier where an instance in the data belonging to the category is considered a positive example, all other examples are considered negative examples.

Given a new object, run all these binary classifiers and see which classifier has the “highest prediction”.

The scores from the different classifiers need to be calibrated!

$\hat{y}=\arg\max_{y\in Classes}PredictScore_y(\vec{x})$
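
A minimal sketch (not the slides' code) of the OVA decision rule above; the scorers array of per-class prediction functions is an illustrative assumption:

// One-Versus-All: run every binary classifier and take the class with the highest (calibrated) score.
static int PredictOVA(Func<float[], float>[] scorers, float[] x)
{
    int best = -1;
    float bestScore = float.NegativeInfinity;
    for (int c = 0; c < scorers.Length; c++)
    {
        float s = scorers[c](x);
        if (s > bestScore) { bestScore = s; best = c; }
    }
    return best;
}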

Page 46: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

46

One-Versus-One (OVO)For each pair of classes, create binary classifier on data labeled as either of the classes.

How many such classifiers?

Given a new instance run all classifiers and predict class with maximum number of wins.

$\binom{k}{2}=\dfrac{k(k-1)}{2}$

Page 47: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

47

Errors
"Nobody is perfect, but then again, who wants to be nobody."

Binary classifier: $\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})$ is the value predicted by the algorithm for input data point $\vec{x}^{(i)}$; $y^{(i)}$ is the corresponding true value.

$Error=\dfrac{1}{N}\sum_{i=1}^{N}\big|\hat{y}^{(i)}-y^{(i)}\big|$ — the fraction of misclassified examples (a penalty score of 1 for every misclassified example); the $\{0,1\}$ encoding makes more sense here than $\{-1,+1\}$.

More generally: $Error=\dfrac{1}{N}\sum_{i=1}^{N}Loss(\hat{y}^{(i)},y^{(i)})$ — the average error across all instances, where $Loss(\hat{y},y)$ is the point-wise error for a data point. This particular function, $Loss(\hat{y},y)=|\hat{y}-y|$, is called "Zero-One Loss" (for simplicity we are skipping the indices).

Goal: minimize the Error. It is beneficial to have a differentiable loss function.

Page 48: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

48

Error: Function of the Parameters

$\hat{y}^{(i)}:=Predict(\vec{x}^{(i)})=g(\vec{x}^{(i)},params)$ — the value predicted by the algorithm for input data point $\vec{x}^{(i)}$.

The cumulative error across all instances is a function of the parameters:

$Error(params)=\dfrac{1}{N}\sum_{i=1}^{N}Loss(\hat{y}^{(i)},y^{(i)})=\dfrac{1}{N}\sum_{i=1}^{N}Loss\big(g(\vec{x}^{(i)},params),y^{(i)}\big)$

1. When the $\vec{x}$'s and the $y$'s are fixed we can compute (optimize) the params (training).
2. When the params are fixed we can compute $\hat{y}$ given $\vec{x}$ (testing).

Page 49: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

49

Evaluation

• Motivation:– Benchmark algorithms (which system is better).– Tuning parameters during training.

Page 50: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

50

Evaluation Measures

Generalization error: the probability of misclassifying an instance selected according to the distribution of the labeled instance space. Classification accuracy = 1 − generalization error.

Training error: the percentage of training examples which are misclassified. It is an optimistically biased estimate of the generalization error, especially if the inducer over-fits the (training) data.

Empirical estimation of the generalization error:
• Heldout method
• Re-sampling:
  1. Random resampling
  2. Cross-validation

Page 51: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

51

Precision, Recall and F-measure

Let’s consider binary classification:

Space of all instances

Instances identified as positive by the system.

Positive instances in reality.

System identified these as positive but got them wrong(false positive).

System identified these as positive but got them correct(true positive).

System identified these as negative but got them wrong(false negative).

System identified these as negative and got them correct(true negative).

General Setup

Page 52: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

52

Accuracy, Precision, Recall, and F-measure

Definitions (TP: true positives, FP: false positives, FN: false negatives, TN: true negatives):

Precision: $p=\dfrac{TP}{TP+FP}$

Recall: $r=\dfrac{TP}{TP+FN}$

Accuracy: $acc=\dfrac{TP+TN}{TP+TN+FP+FN}$

F-measure (harmonic mean of precision and recall): $F=\dfrac{1}{\frac{1}{2}\big(\frac{1}{p}+\frac{1}{r}\big)}=\dfrac{2pr}{p+r}$
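
A minimal sketch (not from the slides) computing these four quantities from the raw counts:

// Precision, recall, F-measure and accuracy from the four confusion counts.
static (double p, double r, double f, double acc) PRF(int tp, int fp, int fn, int tn)
{
    double p = (double)tp / (tp + fp);
    double r = (double)tp / (tp + fn);
    double f = 2 * p * r / (p + r);                       // harmonic mean of p and r
    double acc = (double)(tp + tn) / (tp + tn + fp + fn);
    return (p, r, f, acc);
}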

Page 53: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

53

Accuracy vs. Prec/Rec/F-meas
Accuracy can be misleading for evaluating a model on an imbalanced class distribution. When there are many more majority-class instances than minority-class instances, always predicting the majority class gives good accuracy.

Precision and recall (together) are better indicators.

As a single, aggregate number f-measure favors the lower of the precision or recall.

Page 54: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

54

Extreme Cases for Precision & Recall

If very few (one in the extreme) instances are correctly predicted as belonging to the class, precision is 100% (FP = 0) but recall is low (FN is high).

If all instances are predicted as belonging to the class (some correctly, some not), recall is 100% (FN = 0) but precision is low (FP is high).

Precision can be traded for recall and vice versa.

Page 55: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

55

Sensitivity & Specificity

Definitions:

Sensitivity (same as recall; aka true positive rate): $Sensitivity=\dfrac{TP}{TP+FN}$

Specificity (aka true negative rate): $Specificity=\dfrac{TN}{TN+FP}$

False positive rate: $FPR=\dfrac{FP}{FP+TN}$

False negative rate: $FNR=\dfrac{FN}{FN+TP}$

Misclassification rate: $1-Acc=\dfrac{FP+FN}{TP+TN+FP+FN}$

Page 56: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

56

Venn Diagrams

John Venn (1880) “On the Diagrammatic and Mechanical Representation of Propositions and Reasonings”, Philosophical Magazine and Journal of Science, 5:10(59).

These visualization diagrams were introduced by John Venn:

What if there are three classes?

Four classes?

Six classes?

With more classes our visual intuitions are helping less and less.

A subtle point: These are just the actual/real classes without the system classes drawn on top!

Page 57: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

57

Confusion Matrix

Shows how the predictions of instances of an actual class are distributed across all classes. Layout for three classes:

                   Predicted class A                        Predicted class B                        Predicted class C
Actual class A     # in actual class A AND predicted as A   # in actual class A BUT predicted as B   …      (row total = actual instances of A)
Actual class B     …                                        …                                        …      (row total = actual instances of B)
Actual class C     …                                        …                                        …      (row total = actual instances of C)
                   (column total = predicted as A)          (column total = predicted as B)          (column total = predicted as C)      (grand total = all instances)

Counts on the diagonal are the true positives for each class. Counts off the diagonal are errors. Confusion matrices can handle many classes.

Page 58: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

58

Confusion Matrix: Accuracy, Precision and Recall

                  Predicted A   Predicted B   Predicted C   Total
Actual class A         50            80            70         200
Actual class B         40           140           120         300
Actual class C        120           220           160         500
Total                 210           440           350        1000

Given a confusion matrix, it's easy to compute accuracy, precision and recall (confusion matrices can, themselves, be confusing sometimes):

$Accuracy=\dfrac{50+140+160}{1000}$    $Precision_A=\dfrac{50}{50+40+120}$    $Recall_A=\dfrac{50}{50+80+70}$
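
A minimal sketch (not from the slides) of the same computations for any number of classes, where m[actual, predicted] holds the counts of the matrix above:

// Accuracy plus per-class precision and recall from a confusion matrix. Assumes 'using System;'.
static void ConfusionStats(int[,] m)
{
    int k = m.GetLength(0);
    int total = 0, diagonal = 0;
    for (int a = 0; a < k; a++)
        for (int p = 0; p < k; p++) { total += m[a, p]; if (a == p) diagonal += m[a, p]; }
    Console.WriteLine("Accuracy = " + (double)diagonal / total);

    for (int c = 0; c < k; c++)
    {
        int predictedAsC = 0, actuallyC = 0;
        for (int i = 0; i < k; i++) { predictedAsC += m[i, c]; actuallyC += m[c, i]; }
        Console.WriteLine("Class " + c + ": precision = " + (double)m[c, c] / predictedAsC
                          + ", recall = " + (double)m[c, c] / actuallyC);
    }
}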

Page 59: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

59

Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
  – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees, DTs)
  – Classification Decision Trees, Regression Trees
• Boosting
  – AdaBoost
• Ranking evaluation
  – Kendall tau and Spearman's coefficient
• Sequence labeling
  – Hidden Markov Models (HMMs)

Page 60: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

60

LINEAR MODELS
Why? – Linear models are a good way to learn about core ML concepts.

Page 61: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

61

Refresher: Vectors

Points are also vectors; the sum of two vectors $\vec{v}_1+\vec{v}_2$ is another vector.

Equation of a line through the origin: $y=\frac{1}{3}x$, i.e. $3y=1x$, i.e. $(-1)x+3y=0$.

Renaming the axes $x\to x_1$, $y\to x_2$: $(-1)x_1+3x_2=0$, which can be re-written in vector notation as

$(-1,3)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$, or in general $(w_1,w_2)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$.

$\vec{w}=\begin{pmatrix}w_0\\ \vdots\\ w_d\end{pmatrix}=(w_0,\ldots,w_d)^T$  (T denotes transpose).

Page 62: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

62

Refresher: Vectors (2)

The same line $y=\frac{1}{3}x$, i.e. $(-1)x_1+3x_2=0$, i.e. $(w_1,w_2)\begin{pmatrix}x_1\\x_2\end{pmatrix}=0$ with $(w_1,w_2)=(-1,3)$: the weight vector $(-1,3)$ is the normal vector of the line (perpendicular to it).

Page 63: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

63

Refresher: Dot Product

$(w_1,w_2)\cdot\begin{pmatrix}x_1\\x_2\end{pmatrix}=w_1x_1+w_2x_2$

$\vec{w}\cdot\vec{x}=|\vec{w}||\vec{x}|\cos\gamma$

$\vec{w}\cdot\vec{x}>0$ for points on the side the normal $\vec{w}$ points to, $\vec{w}\cdot\vec{x}=0$ on the line, $\vec{w}\cdot\vec{x}<0$ on the other side.

float DotProduct(float[] v1, float[] v2)
{
    float sum = 0.0f;
    for (int i = 0; i < v1.Length; i++)
        sum += v1[i] * v2[i];
    return sum;
}

Page 64: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

64

Refresher: Pos/Neg Classes

The normal vector $\vec{w}$ points toward the positive class: $\vec{w}\cdot\vec{x}>0$ for the + points, $\vec{w}\cdot\vec{x}=0$ on the decision line, and $\vec{w}\cdot\vec{x}<0$ for the − points.

Page 65: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

65

sgn Function

In mathematics: $sgn(s)=\begin{cases}+1 & s>0\\ 0 & s=0\\ -1 & s<0\end{cases}$

We will use: $sgn(s)=\begin{cases}+1 & s\ge 0\\ -1 & s<0\end{cases}$

(We are purposefully avoiding the symbol $x$ for the argument here; we will use $\vec{x}$ for the feature vector. Informally the function is drawn as a step from $-1$ to $+1$.)

Page 66: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

66

Two Linear Models

Perceptron: $g(\vec{x})=sign(\vec{w}^T\vec{x})$        Linear regression: $g(\vec{x})=\vec{w}^T\vec{x}$

The features of an object have associated weights indicating their importance.

Signal: $s=\vec{w}^T\vec{x}=\sum_{i=0}^{d}w_ix_i$

When $\vec{w}$ is known the solution function is known; the form of $g$ determines the hypothesis space.

Page 67: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

67

Why “Regression”?Why the term for quantitative output prediction is “regression”?

“That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his anthropometric laboratory and recognized the same pattern with human heights. After measuring 205 pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were generally shorter than they were, while exceptionally short parents had children who were generally taller than their parents.

After reflecting upon this, we can understand why it must be the case. If very tall parents always produced even taller children, and if very short parents always produced even shorter ones, we would by now have turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting taller as a whole – due to better nutrition and public health – but the distribution of heights within the population is still contained.

Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now more generally known as regression to the mean.”

[A.Bellos pp.375]

Page 68: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

68

On-Line (Sequential) Learning
• On-line = process one example at a time.
• Attractive for large scale problems.

parameters := Initialize()
for iteration t (epoch/time):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t))
    Loss(ŷ(t), y(t)) := …                                        // compute loss
    parameters := Update(x⃗(t), y(t), ŷ(t), Loss, parameters)
return parameters

Objective: minimize the cumulative loss $\sum_{t=1}^{T}Loss(\hat{y}^{(t)},y^{(t)})$.

Page 69: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

69

On-Line (Sequential) Learning (2)
Sometimes written out more explicitly:

parameters := Initialize()
for each pass over the data:
    RandomizeData()
    for each data item i:
        x⃗(i) := ReceiveInstance()
        ŷ(i) := Predict(x⃗(i))
        y(i) := ReceiveTrueLabel()
        if ŷ(i) ≠ y(i):
            parameters := Update(x⃗(i), y(i), ŷ(i), Loss, parameters)
return parameters

Page 70: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

70

Perceptron

• One of the earliest ML algorithms (Rosenblatt 1958).
• On-line linear binary classification algorithm.
• Determines a hyperplane (a line in 2D, a plane in 3D, …) separating the points of the two classes.

(figures: linearly separable data — a line separates the + points from the − points; non-linearly separable data — no straight line can separate them)

Page 71: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

71

First: Perceptron Update Rule

$\vec{w}_{new}=\vec{w}_{old}+y^{(t)}\vec{x}^{(t)}$

Simplification: the separating lines pass through the origin, in order to simplify the update rule.

Example (initially misclassified): the boundary $(-3)x+1y=0$, i.e. $(-3,1)\begin{pmatrix}x\\y\end{pmatrix}=0$, misclassifies the positive example $\begin{pmatrix}2\\2\end{pmatrix}$. The update gives

$\begin{pmatrix}w_1\\w_2\end{pmatrix}=\begin{pmatrix}-3\\1\end{pmatrix}+(+1)\begin{pmatrix}2\\2\end{pmatrix}=\begin{pmatrix}-1\\3\end{pmatrix}$

i.e. the new boundary $(-1)x+3y=0$ (the line $y=\frac{1}{3}x$). The example is now correctly classified with the new separating boundary. It is not always the case that we can achieve this with one update.

Page 72: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

72

On-Line (Sequential) Learning (recap)

parameters := Initialize()
for t:
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t))
    Loss(ŷ(t), y(t)) := …
    parameters := Update(x⃗(t), y(t), ŷ(t), Loss, parameters)
return parameters

Page 73: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

73

Perceptron Learning Algorithm

parameters := Initialize():  w⃗ = (w_0, …, w_d)^T = (0, …, 0)^T ∈ ℝ^(d+1)
for iteration t (epoch/time):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t)) = sign(w⃗^T · x⃗(t)) ∈ {−1, +1}
    Compute the zero-one loss: Loss(ŷ(t), y(t)) := ½ |ŷ(t) − y(t)|
    if ŷ(t) ≠ y(t):
        w⃗ := w⃗ + y(t) · x⃗(t)
return parameters

Page 74: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

74

Perceptron Learning Algorithm (cont.)

The same algorithm; here T denotes transpose, t the iteration, and N the sample size (the algorithm makes multiple passes over the data):

w⃗ := (0, …, 0)^T ∈ ℝ^(d+1)
for iteration t:
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := sign(w⃗^T · x⃗(t)) ∈ {−1, +1}
    if ŷ(t) ≠ y(t):
        w⃗ := w⃗ + y(t) · x⃗(t)
return w⃗

Page 75: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

75

Perceptron Learning Algorithm (PLA)

Initialize weights: w⃗ := 0
while mis-classified examples exist:
    Select a mis-classified example (x⃗(t), y(t)):  y(t) ≠ sign(w⃗^T x⃗(t))
    Update weights: w⃗ := w⃗ + y(t) · x⃗(t)
return w⃗

A misclassified example means that with the current weights $y^{(t)}\ne sign(\vec{w}^T\vec{x}^{(t)})$, or more generally $y^{(t)}\,\vec{w}^T\vec{x}^{(t)}\le 0$.

1. A challenge: the algorithm will not terminate for non-linearly separable data (outliers, noise).
2. Unstable: it can jump from a good perceptron to a really bad one within one update.
3. It attempts to minimize $\min_{\vec{w}}\frac{1}{N}\sum_{t=1}^{N}[\![\,y^{(t)}\ne sign(\vec{w}^T\vec{x}^{(t)})\,]\!]$, which is NP-hard.
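
A minimal C# sketch (not the slides' code) of the perceptron training loop above; it assumes x[i][0] == 1 is the default feature and y[i] ∈ {−1, +1}:

// Multiple passes over the data; update the weights on every misclassified example.
static float[] TrainPerceptron(float[][] x, int[] y, int epochs)
{
    int d = x[0].Length;
    var w = new float[d];                                        // weights start at 0
    for (int epoch = 0; epoch < epochs; epoch++)
    {
        for (int i = 0; i < x.Length; i++)
        {
            float s = 0f;
            for (int j = 0; j < d; j++) s += w[j] * x[i][j];     // signal w·x
            int predicted = s >= 0 ? +1 : -1;                    // sgn as defined earlier
            if (predicted != y[i])
                for (int j = 0; j < d; j++) w[j] += y[i] * x[i][j];  // w := w + y·x
        }
    }
    return w;
}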

Page 76: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

76

Perceptron

If a point is classified incorrectly: $sgn(y^{(t)})\ne sgn(\vec{w}_{old}^T\vec{x}^{(t)})$, i.e. $y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}<0$.

Weight update: $\vec{w}_{new}=\vec{w}_{old}+y^{(t)}\vec{x}^{(t)}$

$y^{(t)}\vec{w}_{new}^T\vec{x}^{(t)}=y^{(t)}(\vec{w}_{old}+y^{(t)}\vec{x}^{(t)})^T\vec{x}^{(t)}=y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}+(y^{(t)})^2\|\vec{x}^{(t)}\|^2=y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}+\|\vec{x}^{(t)}\|^2>y^{(t)}\vec{w}_{old}^T\vec{x}^{(t)}$

Thus, the perceptron weight update pushes in the "right direction".

Page 77: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

77

Looks Simple — Does It Work?

Fact (a margin-based upper bound on updates): the number of updates made by the Perceptron Algorithm is at most $\dfrac{r^2}{\rho^2}$, where for $\vec{x}^{(1)},\ldots,\vec{x}^{(N)}\in\mathbb{R}^{d+1}$ there exist $r$, $\rho$ and a separating $\vec{v}$ such that:

$r\ge\|\vec{x}^{(i)}\|$ (for all $i$)

$\rho\le\dfrac{y^{(i)}(\vec{v}\cdot\vec{x}^{(i)})}{\|\vec{v}\|}$ (for all $i$)

The quantity $\rho$ is known as the "normalized margin".

Remarkable: the bound does not depend on the dimension of the feature space!

Page 78: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

78

Compact Model Representation

Use float instead of double. Store only non-zero weights (and indices). Store non-zero weights and the difference of indices (remember the last index where the weight was non-zero):

void Save( StreamWriter w, int labelIdx, float[] weights )
{
    w.Write( labelIdx );
    int previousIndex = 0;
    for (int i = 0; i < weights.Length; i++)
    {
        if (weights[ i ] != 0.0f)
        {
            w.Write( " " + (i - previousIndex) + " " + weights[ i ] );  // difference of indices
            previousIndex = i;
        }
    }
}
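
A matching Load could look like the sketch below (an assumption, not from the slides — it presumes each Save call wrote one line and that 'using System; using System.IO;' are present):

// Reads the label index followed by (index-delta, weight) pairs written by Save above.
static float[] Load(StreamReader r, int numFeatures, out int labelIdx)
{
    var weights = new float[numFeatures];
    string[] parts = r.ReadLine().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    labelIdx = int.Parse(parts[0]);
    int index = 0;
    for (int i = 1; i + 1 < parts.Length; i += 2)
    {
        index += int.Parse(parts[i]);                // undo the index differencing
        weights[index] = float.Parse(parts[i + 1]);
    }
    return weights;
}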

Page 79: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

79

Linear Classification Solutions

A fixed choice of $\vec{w}$ defines the hyperplane and, thus, the solution to our (linear) task. For linearly separable data there are different solutions — infinitely many.

Page 80: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

80

The Pocket Algorithm
A better perceptron algorithm: keep track of the error and only keep new weights when they lower the error.

Initialize weights w⃗(0); bestErr := Double.MAX
for i = 0, 1, …:
    Run PLA for one iteration and obtain a new w⃗(i+1).
    Compute the error (an expensive step — access to the entire data is needed!):
        Err(w⃗(i+1)) = (1/N) Σ_{n=1}^{N} ⟦ sgn(w⃗(i+1) x⃗(n)) ≠ y(n) ⟧
    if Err(w⃗(i+1)) < bestErr:
        keep w⃗(i+1) as the best weights; bestErr := Err(w⃗(i+1))   // only update the best weights if we lower the error
return the best weights

Page 81: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

81

Voted Perceptron
• Training as in the usual perceptron algorithm (with some extra book-keeping).
• Decision rule:

$\hat{y}=sgn\Big(\big(\sum_t c_t\,\vec{w}^{(t)}\big)\cdot\vec{x}\Big)$

The coefficient $c_t$ is proportional to the number of iterations that $\vec{w}^{(t)}$ survives (the number of iterations between $\vec{w}^{(t)}$ and $\vec{w}^{(t+1)}$).

Page 82: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

82

Dual Perceptron: Intuitions

(figure: + and − training points with the separating line and its normal vector)

The separating direction can be expressed in terms of the training points themselves: positive examples ($y^+=+1$) contribute along the normal vector, negative examples ($y^-=-1$) in the opposite direction — the idea behind the dual formulation on the next slide.

Page 83: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

83

Dual Perceptron

parameters := Initialize():  α⃗ = (α_1, …, α_N)^T = (0, …, 0)^T ∈ ℝ^N
for iteration t over the sample (the algorithm makes multiple passes over the data):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t)) = sign( Σ_{j=1}^{N} α_j y(j) (x⃗(j)^T · x⃗(t)) ) ∈ {−1, +1}
    if ŷ(t) ≠ y(t):
        α_t := α_t + 1
return parameters

Decision rule: $\hat{y}:=sign\Big(\sum_{j=1}^{N}\alpha_j\,y^{(j)}\big(\vec{x}^{(j)T}\cdot\vec{x}\big)\Big)$

$\alpha_j$ gives a notion of how difficult instance $j$ is. The kernel perceptron uses a kernel function in place of the dot product $\vec{x}^{(j)T}\cdot\vec{x}$.

Page 84: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

84

Exclusive OR (XOR) Function

Truth table: XOR(0,0)=0, XOR(0,1)=1, XOR(1,0)=1, XOR(1,1)=0.

Inputs in the $(x_1,x_2)$ plane with color-coding of the output: (0,1) and (1,0) form one class, (0,0) and (1,1) the other.

Challenge: the data is not linearly separable (no straight line can be drawn that separates the green from the blue points).

Page 85: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

85

Solution for the Exclusive OR (XOR)

We introduce another input dimension $x_3$ (a new feature computed from $x_1$ and $x_2$).

In the space $(x_1, x_2, x_3)$ the data is now linearly separable.
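
One common choice for the extra dimension — an assumption here, since the slide does not spell out the transform — is the product $x_3 = x_1 x_2$, with which a single linear threshold computes XOR:

// XOR via a linear decision in the lifted space (x1, x2, x3 = x1*x2):
// sign(x1 + x2 - 2*x1*x2 - 0.5) is +1 exactly for the XOR-true inputs (0,1) and (1,0).
static int XorLinear(int x1, int x2)
{
    double x3 = x1 * x2;
    double s = 1.0 * x1 + 1.0 * x2 - 2.0 * x3 - 0.5;   // linear in (x1, x2, x3)
    return s >= 0 ? 1 : 0;
}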

Page 86: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

86

Winnow Algorithm

w⃗ := (1/(d+1), …, 1/(d+1))^T
for iteration t (epoch):
    (x⃗(t), y(t)) := ReceiveInstance()
    ŷ(t) := Predict(x⃗(t)) = sgn(w⃗^T · x⃗(t))
    if ŷ(t) ≠ y(t):
        Z := Σ_{i=0}^{d} w_i e^(y(t) x_i(t))        // normalizing constant
        for i = 0 … d:
            w_i := w_i e^(y(t) x_i(t)) / Z          // multiplicative update
return parameters
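
A minimal sketch (not the slides' code) of the normalized multiplicative update above, assuming y ∈ {−1, +1} and 'using System;':

// One Winnow-style update of the weight vector w on a misclassified example (x, y).
static void WinnowUpdate(float[] w, float[] x, int y)
{
    double z = 0.0;
    for (int i = 0; i < w.Length; i++) z += w[i] * Math.Exp(y * x[i]);   // normalizing constant Z
    for (int i = 0; i < w.Length; i++)
        w[i] = (float)(w[i] * Math.Exp(y * x[i]) / z);                   // multiplicative update
}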

Page 87: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

87

Training, Test Error and Complexity

(figure: training error and test error as functions of model complexity)

Page 88: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

88

Logistic Regression

$g(\vec{x})=\theta(\vec{w}^T\vec{x})$

Logistic function: $\theta(s)=\dfrac{e^s}{1+e^s}$, with $1-\theta(s)=\theta(-s)$.

$y\in\{-1,+1\}$

Target: $f(\vec{x})=P(y=+1\mid\vec{x})$

The data does not give the probability explicitly.

Page 89: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

89

Logistic Regression

How likely is it that we get output $y$ when we have input $\vec{x}$:

$P(y\mid\vec{x})=\begin{cases}f(\vec{x}) & \text{when } y=+1\\ 1-f(\vec{x}) & \text{when } y=-1\end{cases}$

$P(y\mid\vec{x})=\begin{cases}g(\vec{x})=\theta(\vec{w}^T\vec{x})=\theta(y\vec{w}^T\vec{x}) & \text{when } y=+1\\ 1-g(\vec{x})=1-\theta(\vec{w}^T\vec{x})=\theta(-\vec{w}^T\vec{x})=\theta(y\vec{w}^T\vec{x}) & \text{when } y=-1\end{cases}$

Data likelihood (which $\vec{w}$ maximizes this?): $L(\vec{w})=\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})$

Negative log-likelihood (which $\vec{w}$ minimizes this?):

$-l(\vec{w})=-\frac{1}{N}\ln\Big(\prod_{i=1}^{N}P(y^{(i)}\mid\vec{x}^{(i)})\Big)=\frac{1}{N}\sum_{i=1}^{N}\ln\frac{1}{P(y^{(i)}\mid\vec{x}^{(i)})}=\frac{1}{N}\sum_{i=1}^{N}\ln\frac{1}{\theta(y^{(i)}\vec{w}^T\vec{x}^{(i)})}=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$

Error: $E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$

Page 90: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

90

Refresher

Derivative: $f'(x)=\lim_{\Delta x\to 0}\frac{f(x+\Delta x)-f(x)}{\Delta x}$; e.g. $(3x^2)'=3\cdot 2\cdot x^{2-1}$, and
$(x^2)'=\lim_{\Delta x\to 0}\frac{(x+\Delta x)^2-x^2}{\Delta x}=\lim_{\Delta x\to 0}\frac{x^2+2x\Delta x+(\Delta x)^2-x^2}{\Delta x}=\lim_{\Delta x\to 0}(2x+\Delta x)=2x$

Useful facts: $(\ln x)'=\frac{1}{x}$, $(e^x)'=e^x$; chain rule: $(f(g))'=f'(g)\cdot g'$

Partial derivative $\frac{\partial}{\partial x}F(x,y)$: e.g. $\frac{\partial}{\partial x}(x^2+2xy+y^2)=2x+2y$

Partial derivative at a point: $\frac{\partial}{\partial w_0}(w_0^2+2w_0w_1+w_1^2)\big|_{w_0=2,w_1=3}=(2w_0+2w_1)\big|_{w_0=2,w_1=3}=2\cdot 2+2\cdot 3$

Gradient (derivatives with respect to each component): $\Big[\frac{\partial(\cdot)}{\partial w_0},\frac{\partial(\cdot)}{\partial w_1},\ldots,\frac{\partial(\cdot)}{\partial w_d}\Big]$

Gradient of the error: $\nabla E(\vec{w})=\Big[\frac{\partial E}{\partial w_0},\frac{\partial E}{\partial w_1},\ldots,\frac{\partial E}{\partial w_d}\Big]$ — this is a vector and we can compute it at a point.

Page 91: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

91

Hypothesis Space
The best $\vec{w}$ to use is the one which minimizes $E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$.

Different $\vec{w}$ give rise to different values of $E(\vec{w})$; plotted over the weight space, $E(\vec{w})$ is the error surface and $-\nabla E(\vec{w})$ points downhill on it. [graph from T. Mitchell]

Page 92: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

92

Math Fact
The gradient of the error, $\nabla E(\vec{w})=\big[\frac{\partial E}{\partial w_0},\frac{\partial E}{\partial w_1},\ldots,\frac{\partial E}{\partial w_d}\big]$ (a vector in weight space), specifies the direction of the argument that leads to the steepest increase in the value of the error. The negative of the gradient gives the direction of steepest decrease.

(figure: from the best weights found up to iteration $t$, $\vec{w}^{(t)}=(w_0,w_1)$, a step in the direction $-\nabla E(\vec{w}^{(t)})$ gives the new best weights $\vec{w}^{(t+1)}$ at iteration $t+1$ — see the next slides)

Page 93: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

93

Computing the Gradient

$E(\vec{w})=\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$

$\nabla E(\vec{w})=\nabla\frac{1}{N}\sum_{i=1}^{N}\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)=\frac{1}{N}\sum_{i=1}^{N}\nabla\ln\big(1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\big)$   (the gradient is a linear operator)

$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{1+e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}}\cdot e^{-y^{(i)}\vec{w}^T\vec{x}^{(i)}}\cdot(-y^{(i)}\vec{x}^{(i)})$   (using $(\ln u)'=\frac{1}{u}\cdot u'$ and $(1+e^z)'=e^z\cdot z'$)

$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{1+\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}}\cdot\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}\cdot(-y^{(i)}\vec{x}^{(i)})$   (using $e^{-z}=\frac{1}{e^z}$)

$=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}+1}\cdot(-y^{(i)}\vec{x}^{(i)})=-\frac{1}{N}\sum_{i=1}^{N}\frac{y^{(i)}\vec{x}^{(i)}}{1+e^{y^{(i)}\vec{w}^T\vec{x}^{(i)}}}$

Page 94: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

94

(Batch) Gradient Descent

A general technique for minimizing a differentiable function like $E(\vec{w})$.

Initialize weights: w⃗ := 0
repeat:
    Compute the gradient: grad⃗ := −(1/N) Σ_{i=1}^{N} y(i) x⃗(i) / (1 + e^(y(i) w⃗^T x⃗(i)))
    Update weights: w⃗ := w⃗ − η · grad⃗
until Stop   (max #iterations reached; marginal error improvement; or the error becomes small)
return w⃗

η is the learning rate.

If a random training example is selected and the gradient is computed on it alone, the algorithm is called SGD (Stochastic Gradient Descent).
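
A minimal C# sketch (not the slides' code) of this batch gradient descent for logistic regression; x[i][0] == 1 is assumed to be the default feature and y[i] ∈ {−1, +1}:

// Batch gradient descent on E(w) = (1/N) Σ ln(1 + e^{-y w·x}); eta is the learning rate.
static double[] TrainLogisticRegression(double[][] x, int[] y, double eta, int maxIters)
{
    int d = x[0].Length;
    var w = new double[d];
    for (int iter = 0; iter < maxIters; iter++)
    {
        var grad = new double[d];
        for (int i = 0; i < x.Length; i++)
        {
            double s = 0.0;
            for (int j = 0; j < d; j++) s += w[j] * x[i][j];            // w·x
            double factor = -y[i] / (1.0 + Math.Exp(y[i] * s));         // per-example gradient factor
            for (int j = 0; j < d; j++) grad[j] += factor * x[i][j] / x.Length;
        }
        for (int j = 0; j < d; j++) w[j] -= eta * grad[j];              // w := w − η · grad
    }
    return w;
}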

Page 95: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

95

Punch Line

With the best weights $\vec{w}$ computed using gradient descent, given an unknown input object encoded as a vector of features $\vec{x}$, the output probability that the object is in the class is:

$P(y=+1\mid\vec{x};\vec{w})=\dfrac{e^{\vec{w}^T\vec{x}}}{1+e^{\vec{w}^T\vec{x}}}$

Classification rule: the new object is in the class if $P(y=+1\mid\vec{x};\vec{w})>\tau$.

Predict $y=+1$ if $P(y=+1\mid\vec{x};\vec{w})>0.5$, or equivalently if $\vec{w}^T\vec{x}>0$. The larger $\vec{w}^T\vec{x}$, the larger $P(y=+1\mid\vec{x};\vec{w})$ will be, and so will our degree of confidence that $y=+1$. The prediction that $y=+1$ is very confident if $\vec{w}^T\vec{x}\gg 0$; similarly, logistic regression makes a very confident decision that $y=-1$ if $\vec{w}^T\vec{x}\ll 0$.

Page 96: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

96

Newton's Method
• An alternate way to minimize a function (like $E(\vec{w})$).
• We need to take the derivative of the error (negative log-likelihood) and find for which values of the parameters the derivative is zero.
• Let $f$ be a function and suppose we want to find $u^*$ such that $f(u^*)=0$:

$u_{i+1}:=u_i-\dfrac{f(u_i)}{f'(u_i)}$

(figure: the tangent to $f$ at $u_i$ crosses the axis at $u_{i+1}$, since $\dfrac{f(u_i)}{u_i-u_{i+1}}=\tan\gamma=f'(u_i)$)
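
A minimal sketch (not from the slides) of the one-dimensional Newton iteration above; the tolerance and iteration cap are illustrative choices:

// u_{i+1} = u_i - f(u_i) / f'(u_i), iterated until f(u) is close to zero.
static double Newton(Func<double, double> f, Func<double, double> fPrime,
                     double u, double tol = 1e-8, int maxIters = 100)
{
    for (int i = 0; i < maxIters && Math.Abs(f(u)) > tol; i++)
        u = u - f(u) / fPrime(u);
    return u;
}
// Example: Newton(u => u * u - 2, u => 2 * u, 1.5) converges to sqrt(2).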

Page 97: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

97

Newton–Raphson
• Generalization of Newton's method to the multidimensional case.
• The parameters are a vector $\vec{\theta}$ (we used the notation $\vec{w}$).

One dimension: $\theta:=\theta-\dfrac{l'(\theta)}{l''(\theta)}$

Multidimensional: $\vec{\theta}:=\vec{\theta}-H^{-1}\cdot\nabla l(\vec{\theta})$, where $H$ is the Hessian matrix: $H_{ij}=\dfrac{\partial^2 l(\vec{\theta})}{\partial\theta_i\partial\theta_j}$

Page 98: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

98

Robust Risk Minimization

Notation: $\vec{x}$ — input vector; $y\in\{-1,+1\}$ — label; $(\vec{x}^{(i)},y^{(i)})$ — training examples; $\vec{w}$ — weight vector; $b$ — bias; $p(\vec{x})$ — continuous linear model.

Prediction rule: $\hat{y}(\vec{x})=\begin{cases}+1 & p(\vec{x})\ge 0\\ -1 & p(\vec{x})<0\end{cases}$

Classification error: $l(p(\vec{x}),y)=\begin{cases}1 & p(\vec{x})\cdot y\le 0\\ 0 & p(\vec{x})\cdot y>0\end{cases}$

Page 99: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

99

Robust Classification Loss

Parameter estimation: $(\hat{\vec{w}},\hat{b})=\arg\min_{\vec{w},b}\dfrac{1}{N}\sum_{i=1}^{N}loss\big(\vec{w}^T\cdot\vec{x}^{(i)}+b,\,y^{(i)}\big)$

Hinge loss: $g(p(\vec{x}),y)=\begin{cases}1-p(\vec{x})y & p(\vec{x})y\le 1\\ 0 & p(\vec{x})y>1\end{cases}$

Robust classification loss: $h(p(\vec{x}),y)=\begin{cases}-2p(\vec{x})y & p(\vec{x})y\le -1\\ \frac{1}{2}(p(\vec{x})y-1)^2 & p(\vec{x})y\in[-1,1]\\ 0 & p(\vec{x})y>1\end{cases}$

Page 100: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

100

Loss Functions: Comparison

Page 101: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

101

Confidence and Regularization

Confidence $P(y=1\mid\vec{x})$: $\hat{p}(\vec{x})=\max\Big(0,\min\big(1,\dfrac{\hat{\vec{w}}\cdot\vec{x}+\hat{b}+1}{2}\big)\Big)$

Regularization: $\|\vec{w}\|^2+b^2\le A$, with $(\hat{\vec{w}},\hat{b})=\arg\min_{\vec{w},b}\dfrac{1}{N}\sum_{i=1}^{N}h\big(\vec{w}^T\cdot\vec{x}^{(i)}+b,\,y^{(i)}\big)$ subject to that constraint.

Unconstrained optimization (Lagrange multiplier; a smaller λ corresponds to a larger A):

$(\hat{\vec{w}},\hat{b})=\arg\min_{\vec{w},b}\Big[\dfrac{1}{N}\sum_{i=1}^{N}h\big(\vec{w}^T\cdot\vec{x}^{(i)}+b,\,y^{(i)}\big)+\dfrac{\lambda}{2}\big(\|\vec{w}\|^2+b^2\big)\Big]$

Page 102: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

102

Robust Risk Minimization

Input: training data $\vec{x}^{(i)}\in\mathbb{R}^{d+1}$, $y^{(i)}\in\{-1,+1\}$; the number of passes over the data K (K = 40 is a good default); parameters c and η.

Initialization: w⃗ := 0; b := 0; α_i := 0.

for each of the K passes over the data:
    for i = 1 … N:                                  // go over the training data
        p := y(i) (w⃗^T · x⃗(i))
        d_i := max( min( 2c − α_i, η (c − α_i) / (c − p) ), −α_i )
        w⃗ := w⃗ + d_i y(i) x⃗(i)
        b := b + d_i y(i)
        α_i := α_i + d_i
return (w⃗, b)

Page 103: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

103

Learning Curve
• Plots an evaluation metric against the fraction of training data used (on the same test set!).
• The highest performance is bounded by human inter-annotator agreement (ITA).
• A leveling-off effect can guide us on how much data is needed.

(figure: evaluation metric vs. percentage of data used for each experiment; e.g., the experiment with 50% of the training data yields an evaluation number of 70)

Page 104: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

104

Summary

• Examples of ML
• Categorization
• Object encoding
• Linear models:
  – Perceptron
  – Winnow
  – Logistic Regression
  – RRM

• Engineering aspects of ML systems

Page 105: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

105
PART II: POPULARITY

Page 106: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

106

Goal

• Quantify how popular an entity is.

Motivation:• Used in the new local search relevance metric.

Page 107: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

107

What is popularity?

• Use clicks on an entity as a proxy for popularity.
• Popularity score in [0..1].
• Goal: preserve the relative ranking between clicks and the predicted popularity score.

Page 108: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

108

POPULARITY IN LOCAL SEARCH

Page 109: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

109

Popularity

• Output a popularity score (regression)
• Ensemble methods
• Tree-based procedure (non-linear)
• Boosting

Page 110: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

110

When is a Local Entity Popular?

• Definition: Visited by many people in the context of alternative choices.

• Is the popularity of restaurants the same as the popularity of movies, etc.?

• How to operationalize “visit”, “many”, “alternative choices”?– Initially we are using: popular means clicked more.

• Going forward we will use:– “visit” = click given an impression.– “choice” = density of entities in the same primary category.– “many” = fraction of clicks from impressions.

Page 111: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

111

Local Entity Popularity

Popularity = boosted Click-Through Rate (CTR) for entity $e$:

$Popularity(e)=CTR_e+(1-CTR_e)\cdot Density_e$

where:

$CTR(e)=\dfrac{Clicks(e)}{Impressions(e)}$   (a value in [0, 1])

$Density_e=\dfrac{2}{\pi}\cdot\tan^{-1}\big(numBusinessesNear(e)\big)$

$numBusinessesNear(e)$ = the number of entities in the same primary category as $e$ within a radius.

The boost $(1-CTR)\cdot Density$ fills part of the remaining $1-CTR$ headroom, scaled by the density. The model then will be regression.

112

Not all Clicks are Born the Same

• Click in the context of a named query:– Can even be argued we are not satisfying the user

information needs (and they have to click further to find out what they are looking for).

• Click in the context of a category query:– Much more significant (especially when alternative results

are present).

Page 113: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

113

Local Entity Popularity

• Popularity & 1st page, current ranker.
• Entities without a URL.
• Newly created entities.
• Clicks vs. mouseovers.
• Scenario: 50 French restaurants, the best entity has 2k clicks; 2 Italian restaurants, the best entity has 2k clicks. The French entity is more popular because of the higher available choice.

Page 114: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

114

Entity Representation

A machine learning (training) instance: a target value plus feature values — clicks for week −1, …, clicks for week −9, # ratings, aggregate rating, # reviews, has FB page, … (example values: 8000, 9000, …, 4000, 65, 4.7, 73, …, 1, …).

Page 115: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

115

POISSON REGRESSION
Why? – We will practice the ML machinery on a different problem, re-iterating the concepts. Poisson regression is an example of log-linear models, good for modeling counts (e.g., the number of visitors to a store in a certain time).

Page 116: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

116

Setup
Training data: $\{(\vec{x}^{(i)},y^{(i)})\}_{i=1}^{N}$, where the outcomes $y^{(i)}$ are counts (rather than arbitrary real values as in ordinary regression problems). The $\vec{x}$'s are the explanatory variables and $y$ is the response/outcome variable; for our scenario these counts are the clicks on the web page.

Goal: come up with a system which, given a new observation $\vec{x}$, can correctly predict the corresponding outcome $y$.

A good way to model counts of observations is the Poisson distribution.

Page 117: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

117

Poisson Distribution: PreliminariesThe Poisson distribution realistically describes the pattern of requests over time in many client-server situations.

Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for storage/retrieval services from a database server, and interrupts to a central processor. It also has higher-dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the volume distribution of contaminants in well water. In such cases, the “events”, which are request arrivals or defects occurrences, are independent. Customers do not conspire to achieve some special pattern in their access to a bank teller; rather they operate as independent agents. The manufacture of hard disks or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a small area on the disk surface where the magnetic material is not spread uniformly or a shorted transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one point does not influence, for better or worse, the chance of a defect at another point. Moreover, if the time interval or spatial area is small, the probability of an event is correspondingly small. This is a characterizing feature of a Poisson distribution: event probability decreases with the window of opportunity and is linear in the limit. A second characterizing feature, negligible probability of two or more events in a small interval, is also present in the mentioned examples.

Page 118: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

118

Poisson Distribution: Formally
The Poisson distribution can be used to model situations in which the expected number of events scales with the length of the interval within which the events can occur. If $\lambda$ is the expected number of events per unit interval, then the distribution of the number of events $X$ within an interval of length $t$ is:

$p(X=k\mid\lambda)=\dfrac{1}{k!}e^{-\lambda t}(\lambda t)^k$

For a unit-length interval ($t=1$): mean $=\lambda$, variance $=\lambda$.

Page 119: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

119

Poisson Distribution: Mental Steps
First, we keep $\vec{x}$'s for the input, so we write the outcome as $Y$:

$p(Y=y\mid\lambda)=\dfrac{1}{y!}e^{-\lambda t}(\lambda t)^y$

The output is determined by a single scalar parameter $\lambda$. We make $\lambda$ depend on the input in the following way (this comes from the theory of Generalized Linear Models, GLM):

$\mu=E[Y]=\lambda=e^{\vec{x}^T\cdot\vec{\beta}}$, i.e. $\ln(\lambda)=\vec{x}^T\cdot\vec{\beta}$

The log of $\lambda$ is a linear combination of the input features — hence the name log-linear model. In contrast, a plain linear model could potentially make $\lambda$ negative, but $\lambda$ models a count!

We used to write $\vec{w}$ (when discussing logistic regression). Now we call the parameters $\vec{\beta}$, and because in the training phase they are unknown we write them as the second argument in the dot product to emphasize that they are the argument.

Page 120: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

120

Poisson Distribution

Data likelihood: $L(\boldsymbol{\beta}) = \prod_{i=1}^{N} P(y^{(i)} \mid \mathbf{x}^{(i)})$

Log-likelihood (which $\boldsymbol{\beta}$ maximizes this?):

$l(\boldsymbol{\beta}) = \ln\Big(\prod_{i=1}^{N} P(y^{(i)} \mid \mathbf{x}^{(i)})\Big) = \sum_{i=1}^{N} \ln P(y^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{N} \ln\Big(\frac{e^{-e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}} \big(e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}\big)^{y^{(i)}}}{y^{(i)}!}\Big)$

$= \sum_{i=1}^{N} \Big[ \ln\big(e^{-e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}}\big) + \ln\big(e^{\mathbf{x}^{(i)} \boldsymbol{\beta}}\big)^{y^{(i)}} - \ln\big(y^{(i)}!\big) \Big] = \sum_{i=1}^{N} \Big[ -e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} + y^{(i)}\, \mathbf{x}^{(i)} \boldsymbol{\beta} - \ln\big(y^{(i)}!\big) \Big]$

using $p(Y = y \mid \lambda) = \frac{1}{y!}\, e^{-\lambda} \lambda^{y}$ (unit interval) and $\lambda = e^{\mathbf{x}^{T} \cdot \boldsymbol{\beta}}$.

Page 121: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

121

Maximizing the Log-Likelihood

$l(\boldsymbol{\beta}) = \sum_{i=1}^{N} \Big[ -e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} + y^{(i)}\, \mathbf{x}^{(i)} \boldsymbol{\beta} - \ln\big(y^{(i)}!\big) \Big]$    Which $\boldsymbol{\beta}$ maximizes this? Set the gradient to zero: $\nabla l(\boldsymbol{\beta}) = 0$.

$\nabla l(\boldsymbol{\beta}) = \sum_{i=1}^{N} \Big[ -\mathbf{x}^{(i)} e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} + y^{(i)}\, \mathbf{x}^{(i)} \Big] = \sum_{i=1}^{N} \big( y^{(i)} - e^{\mathbf{x}^{(i)} \boldsymbol{\beta}} \big)\, \mathbf{x}^{(i)} = 0$

This is non-linear in $\boldsymbol{\beta}$ and does not have an analytical solution; it is solved numerically (e.g., gradient ascent or Newton's method).
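Since there is no closed-form solution, one simple option is gradient ascent on the log-likelihood using the gradient derived above. The following is a minimal sketch assuming NumPy is available; the synthetic data, learning rate, and iteration count are illustrative assumptions, not values from the slides.

# Sketch: Poisson regression fit by gradient ascent.
# Gradient of the log-likelihood: sum_i (y_i - exp(x_i . beta)) * x_i
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # bias column + d features
beta_true = np.array([0.5, 0.8, -0.4, 0.3])                 # made-up "true" parameters
y = rng.poisson(np.exp(X @ beta_true))                      # synthetic count targets

beta = np.zeros(d + 1)
learning_rate = 1e-4        # illustrative; in practice tune it or use Newton's method
for _ in range(10000):
    grad = (y - np.exp(X @ beta)) @ X                       # the gradient from this slide
    beta += learning_rate * grad

print(beta)                 # should end up close to beta_true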

Page 122: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

122

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 123: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

123

DECISION TREESWhy?– DTs are an influential development in ML. Combined in ensembles they provide very competitive performance. We will see ensemble techniques in the next part.

Page 124: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

124

Decision Trees

Binary partitioning of the data during training (navigating to a leaf node during testing): the root tests $x_i < s_1$ vs. $x_i \ge s_1$, an internal node tests $x_j < s_2$ vs. $x_j \ge s_2$, and each leaf holds a prediction.

Selecting a dimension and a split value $s$: the training instances in a node are more homogeneous in terms of the output variable (more pure) compared to ancestor nodes.

Stopping when the instances are homogeneous or when only a small number of instances remain.

(Training instances pictured as points whose color reflects the output variable; classification example.)

Page 125: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

125

Decision Tree: Example

(classification example with categorical features)

Attribute/feature/predicate nodes: Parents Visiting, Weather, Money; branches carry the values of the attribute; leaves are the predicted classes.

Parents Visiting = Yes -> Cinema
Parents Visiting = No  -> Weather
    Weather = Sunny -> Play tennis
    Weather = Windy -> Money
        Money = Rich -> Shopping
        Money = Poor -> Cinema
    Weather = Rainy -> Stay in

Branching factor depends on the number of possible values for the attribute (as seen in the training set).

Page 126: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

126

Entropy (needed for describing how an attribute is selected.)

$\mathrm{Entropy}(Instances) = -\sum_{c \in Classes} p_c \cdot \log_2 p_c$

Entropy values for two classes, varying the probability $p$ of one class (the probability of the other class is $1 - p$):

$\mathrm{Entropy} = -p_1 \cdot \log_2 p_1 - p_2 \cdot \log_2 p_2 = -p \cdot \log_2 p - (1 - p) \cdot \log_2(1 - p)$

Example: plot of the two-class entropy as a function of $p \in [0, 1]$; it is 0 at $p = 0$ and $p = 1$ and reaches its maximum of 1 at $p = 0.5$.

Page 127: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

127

Selecting an Attribute: Information Gain

Measure of the expected reduction in entropy:

$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$

where $S$ is the set of instances, $a$ is an attribute, and $S_v$ is the set of instances with value $v$ for attribute $a$.

Choose the attribute with the highest information gain (equivalently, the one that minimizes the weighted entropy term being subtracted).

See Mitchell '97, p. 59 for an example.
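A minimal sketch of these two formulas in plain Python (standard library only); the tiny weekend-activity dataset is a made-up illustration in the spirit of the earlier example, not data from the slides.

# Sketch: entropy and information gain for categorical attributes.
from math import log2
from collections import Counter

def entropy(labels):
    # Entropy(Instances) = -sum_c p_c * log2(p_c)
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(instances, attribute, target):
    # Gain(S, a) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    labels = [inst[target] for inst in instances]
    gain = entropy(labels)
    for v in set(inst[attribute] for inst in instances):
        subset = [inst[target] for inst in instances if inst[attribute] == v]
        gain -= len(subset) / len(instances) * entropy(subset)
    return gain

# Tiny made-up dataset in the spirit of the weekend-activity example.
data = [
    {"parents": "yes", "weather": "sunny", "activity": "cinema"},
    {"parents": "no",  "weather": "sunny", "activity": "tennis"},
    {"parents": "yes", "weather": "rainy", "activity": "cinema"},
    {"parents": "no",  "weather": "rainy", "activity": "stay-in"},
    {"parents": "no",  "weather": "windy", "activity": "cinema"},
]
for attr in ("parents", "weather"):
    print(attr, information_gain(data, attr, "activity"))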

Page 128: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

128

Splitting ‘Hairs’

If there are only a small number of instances, do not split the node further (statistics are unreliable).

If there are no instances in the current node (no training examples reach one of the branches $attr = val_1$, $attr = val_2$, $attr = val_3$), inherit statistics (the majority class) from the parent node.

If there is more training data, the tree can be “grown” bigger.

Page 129: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

129

ID3 Algorithm
Training data: $\mathbf{x}^{(i)} \in \mathbb{R}^{d+1}$, $y^{(i)} \in \{-1, +1\}$ (here with categorical attribute values).

ID3(Examples, Attributes): {
    node := new node
    if all Examples have the same class c then label(node) := c; return node
    if Attributes = {} then label(node) := most common class among Examples; return node
    if Examples = {} then label(node) := most common class in the parent; return node
    a := best attribute (highest information gain)
    foreach v : possible value of attribute a :
        Examples_v := the Examples that have value v for attribute a
        if Examples_v = {} then attach a leaf labeled with the most common class among Examples
        else attach ID3(Examples_v, Attributes without a)
    return node }

Page 130: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

130

Alternative Attribute Selection: Gain Ratio

$\mathrm{GainRatio}(S, a) = \frac{\mathrm{Gain}(S, a)}{\mathrm{SplitInformation}(S, a)}$

$\mathrm{SplitInformation}(S, a) = -\sum_{v \in Values(a)} \frac{|S_v|}{|S|} \log_2\Big(\frac{|S_v|}{|S|}\Big)$

where $S$ is the set of instances, $a$ is an attribute, and $S_v$ is the set of instances with value $v$ for attribute $a$. [Quinlan 1986]

Examples:

Attribute with all different values ($n$ values, one instance each): $\mathrm{SplitInformation}(S, a) = -\sum_{v \in \{1..n\}} \frac{1}{n} \log_2\big(\frac{1}{n}\big) = -n \cdot \frac{1}{n} \log_2(n^{-1}) = \log_2 n$

Binary attribute splitting the data in half: $\mathrm{SplitInformation}(S, a) = -\sum_{v \in \{0, 1\}} \frac{n/2}{n} \log_2\big(\frac{n/2}{n}\big) = -2 \cdot \frac{1}{2} \log_2(2^{-1}) = 1$

Page 131: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

131

Alternative Attribute Selection: GINI Index

$\mathrm{Gini}(S, y) = 1 - \sum_{v \in Values(y)} \Big(\frac{|S_v|}{|S|}\Big)^{2}$    (the target $y$ is treated just like another attribute)

$\mathrm{GiniGain}(S, a) = \mathrm{Gini}(S, y) - \sum_{v \in Values(a)} \frac{|S_v|}{|S|}\, \mathrm{Gini}(S_v, y)$

$\hat{a} = \operatorname*{argmax}_{a \in Attributes} \mathrm{GiniGain}(S, a)$    The selected attribute is the one that maximizes the GiniGain.

[Corrado Gini: Italian statistician]
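The same attribute-selection idea with the Gini criterion, as a minimal plain-Python sketch; the helper names are mine, not from the slides.

# Sketch: Gini index and GiniGain for categorical attributes.
from collections import Counter

def gini(labels):
    # Gini(S, y) = 1 - sum_v (|S_v| / |S|)^2 over the values of the target y
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(instances, attribute, target):
    # GiniGain(S, a) = Gini(S, y) - sum_v |S_v|/|S| * Gini(S_v, y)
    labels = [inst[target] for inst in instances]
    gain = gini(labels)
    for v in set(inst[attribute] for inst in instances):
        subset = [inst[target] for inst in instances if inst[attribute] == v]
        gain -= len(subset) / len(instances) * gini(subset)
    return gain

def select_attribute(instances, attributes, target):
    # argmax_a GiniGain(S, a)
    return max(attributes, key=lambda a: gini_gain(instances, a, target))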

Page 132: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

132

Space of Possible Decision Trees

Assume: a binary classifier; $n$ binary attributes; tree height $h$.

At the root there are $n$ attribute choices, at depth 1 there are $n - 1$, at depth 2 there are $n - 2$, and so on; there are $2^i$ nodes at depth $i$, and each leaf can be labeled with either class.

Number of possible trees: $2^{2^h} \cdot \Big[\sum_{i=0}^{h} 2^{i}\, (n - i)\Big]$

Page 133: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

133

Decision Trees and Rule SystemsPath from each leaf node to the root represents a conjunctive rule:

(Same tree as in the earlier example: root Parents Visiting, then Weather and Money, with leaves Cinema, Play tennis, Stay in, Shopping.)

if (ParentsVisiting==No) & (Weather==Windy) & (Money==Poor) then Cinema.

Page 134: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

134

Decision Trees

• Different training sample -> different resulting tree (different structure).

• Learning does (conditional) feature selection.

Page 135: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

135

Regression TreesLike classification trees but the prediction is a number (as suggested by “regression”).

1. How do we split?
2. When to stop?

(Tree with tests $x_i < s_1$ and $x_j < s_2$; the leaves hold constant predictions $c_1, c_2, c_3 \in \mathbb{R}$.)

Page 136: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

136

Regression Trees: How to Split

Finding: the dimension $j$ and the split value $s$:

$\langle j, s \rangle = \operatorname*{argmin}_{j,\, s} \Big( \min_{c_1} \sum_{X^{(i)}[j] < s} \big(Y^{(i)} - c_1\big)^{2} + \min_{c_2} \sum_{X^{(i)}[j] \ge s} \big(Y^{(i)} - c_2\big)^{2} \Big)$

where $X^{(i)}[j]$ is the $j$-th coordinate of training instance $X^{(i)} = (\ldots, X^{(i)}[j], \ldots)$ with target $Y^{(i)}$; the inner minimizers $c_1, c_2$ are simply the means of the targets on each side of the split.
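A brute-force sketch of this split search, assuming NumPy: for every dimension j and candidate split value s, the best constants c1 and c2 are just the means of Y on each side, so only the sum of squared errors needs to be compared. The toy data is made up for illustration.

# Sketch: exhaustive search for the best (dimension j, split value s) at a regression-tree node.
import numpy as np

def best_split(X, Y):
    # Return (j, s) minimizing the sum of squared errors around the per-side means.
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = Y[X[:, j] < s], Y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best[0], best[1]

# Toy usage with made-up data.
rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 2))
Y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, Y))   # should pick dimension 0 with a split near 0.5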

Page 137: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

137

Regression Trees: PruningTree operation where a pre-terminal gets its two leaves collapsed:

(Before: under the root test $x_i < s_1$, a pre-terminal node testing $x_j < s_2$ has two leaves $c_2$ and $c_3$; after pruning, that subtree is collapsed into a single leaf $c'$ under $x_i < s_1$.)

Page 138: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

138

Regression Trees: How to Stop1. Don’t stop.2. Build big tree.3. Prune.4. Evaluate sub-trees.

Page 139: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

139

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 140: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

140

BOOSTING

Page 141: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

141

ENSEMBLE

Ensemble Methods

The INPUT (an object encoded with features) is fed to several Systems (classifiers); each produces an Output (a prediction of the response/dependent variable); the Final Output is obtained by majority voting (classification) or averaging (regression).

Page 142: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

142

Where the Systems Come from

Sequential ensemble scheme: each System is induced from its own version of the Data; after a classifier is induced, the difficult examples are identified (through weighting the examples) and emphasized in the data used for the next System.

Page 143: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

143

Contrast with Bagging

Non-sequential ensemble scheme: each Data_i is obtained from DATA by sampling with replacement, and a classifier System_i is induced from it. The Data_i are independent of each other (likewise for the System_i).

Page 144: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

144

Base Procedure:Decision Tree

(A System is induced from Data using the base procedure below.)

Binary partitioning of the data during training (navigating to a leaf node during testing): nodes test $x_i < s_1$ / $x_i \ge s_1$ and $x_j < s_2$ / $x_j \ge s_2$, and each leaf holds a prediction. The dimension and split value are selected so that the training instances in a node are more homogeneous in terms of the output variable (more pure) than in ancestor nodes; splitting stops when the instances are homogeneous or few. (Training instances pictured as points whose color reflects the output variable; classification example.)

Page 145: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

145

TRAINING DATA

Ensemble Scheme with a base procedure:

The training data $\{(X^{(1)}, Y^{(1)}), \ldots, (X^{(N)}, Y^{(N)})\}$ is fed to the base procedure, producing $G(X)$: the base procedure applied to the original data gives $G_1(X)$, applied to weighted data gives $G_2(X)$, ..., applied to weighted data gives $G_M(X)$.

Final prediction (regression): $g(X) = \sum_{m=1}^{M} \alpha_m \cdot G_m(X)$

The individual systems are small and don't need to be perfect. The data weights depend only on the previous iteration (memory-less). N.B.: data weights are different from the feature weights in linear models.

Page 146: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

146

AdaBoost (classification)

$G_1(X)$ is trained on the original data, and $G_2(X), \ldots, G_m(X), \ldots, G_M(X)$ on successively re-weighted data. The final prediction is

$g(X) = \mathrm{sign}\Big(\sum_{m=1}^{M} \alpha_m \cdot G_m(X)\Big)$

Initial weights: $w_i^{(1)} = \frac{1}{N}$, where $w_i$ is the weight associated with the $i$-th training example.

Weighted error of predictor $G_m$ (the sum in the numerator runs over the misclassified examples $i$): $err_m = \frac{\sum_{i} w_i^{(m)}}{\sum_{j=1}^{N} w_j^{(m)}}$

Goodness of predictor $G_m$: $\alpha_m = \log\Big(\frac{1 - err_m}{err_m}\Big)$

Weight update: $\tilde{w}_i = w_i^{(m)} \cdot e^{\alpha_m}$ for each misclassified example $i$ (unchanged otherwise), followed by normalization: $w_i^{(m+1)} = \frac{\tilde{w}_i}{\sum_{j=1}^{N} \tilde{w}_j}$.

Page 147: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

147

AdaBoost

Initialize the weights: $w_i^{(1)} = \frac{1}{N}$ for $i = 1, \ldots, N$.

for $m = 1, \ldots, M$:

    Create $G_m$ using the weighted data $w^{(m)}$.

    $err_m = \frac{\sum_{i=1}^{N} w_i^{(m)} \cdot [\![\, G_m(X^{(i)}) \ne Y^{(i)} \,]\!]}{\sum_{j=1}^{N} w_j^{(m)}}$

    $\alpha_m = \log\Big(\frac{1 - err_m}{err_m}\Big)$

    Weight update: $\tilde{w}_i = w_i^{(m)} \cdot e^{\alpha_m [\![\, G_m(X^{(i)}) \ne Y^{(i)} \,]\!]}$, then normalize: $w_i^{(m+1)} = \frac{\tilde{w}_i}{\sum_{j=1}^{N} \tilde{w}_j}$.

Final prediction: $g(X) = \mathrm{sign}\Big(\sum_{m=1}^{M} \alpha_m \cdot G_m(X)\Big) = \operatorname*{argmax}_{Y} \sum_{m=1}^{M} \alpha_m \cdot [\![\, G_m(X) = Y \,]\!]$
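A compact sketch of these update rules, assuming NumPy and using one-level decision stumps as the base procedure G_m; the stump learner and the toy data are my own illustration, not part of the slides.

# Sketch: AdaBoost for labels Y in {-1, +1} with decision stumps as base learners.
import numpy as np

def fit_stump(X, Y, w):
    # Weighted-error-minimizing stump: predict `sign` when X[:, j] < s, else -sign.
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            for sign in (-1, 1):
                pred = np.where(X[:, j] < s, sign, -sign)
                err = w[pred != Y].sum() / w.sum()
                if best is None or err < best[0]:
                    best = (err, j, s, sign)
    return best[1:], best[0]

def stump_predict(stump, X):
    j, s, sign = stump
    return np.where(X[:, j] < s, sign, -sign)

def adaboost(X, Y, M=10):
    N = len(Y)
    w = np.full(N, 1.0 / N)                    # w_i^(1) = 1/N
    stumps, alphas = [], []
    for m in range(M):
        stump, err = fit_stump(X, Y, w)
        alpha = np.log((1 - err) / err)        # alpha_m = log((1 - err_m) / err_m)
        miss = stump_predict(stump, X) != Y
        w = w * np.exp(alpha * miss)           # up-weight the misclassified examples
        w /= w.sum()                           # normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

# Toy usage with made-up data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost(X, Y, M=20)
print((predict(stumps, alphas, X) == Y).mean())   # training accuracy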

Page 148: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

148

Binary Classifier

• Constraint:– Must not have all zero clicks for current week, previous week and week before last

[shopping team uses stronger constraint: only instances with non-zero clicks for current week].

• Training: – 1.5M instances.– 0.5M instances (validation).

• Feature extraction:– 4.82mins (Cosmos job).

• Training time:– 2hrs 20mins.

• Testing:– 10k instances: 1sec.

Page 149: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

149

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 150: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

150

POPULARITY EVALUATION

How do we know we have a good popularity?

Page 151: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

151

Rank Correlation Metrics

• Input: two rankings $R_1$ and $R_2$ of the same objects.
• Requirements:

$-1 \le C(R_1, R_2) \le 1$

$C(R_1, R_2) = 1$ : the two rankings are the same.

$C(R_1, R_2) = -1$ : the two rankings are the reverse of each other.

The actual input is a set of objects with two rank scores (ties are possible).

Page 152: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

152

Kendall’s Tau Coefficient

Considers concordant/discordant pairs in the two rankings (each ranking w.r.t. the other).

Complexity: $O(n^2)$ when computed directly over all pairs.

Page 153: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

153

What is a concordant pair?

Example: $R_1 = \langle a, b, c \rangle$ and $R_2 = \langle a, c, b \rangle$. A pair, e.g. $(a, c)$, is concordant when the differences $R_1(a) - R_1(c)$ and $R_2(a) - R_2(c)$ have the same sign (and discordant otherwise).

Page 154: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

154

Kendall Tau: Example

$R_1 = \langle A, B, C, D \rangle$,  $R_2 = \langle C, D, A, B \rangle$

Pairs: (A,B), (A,C), (A,D), (B,C), (B,D), (C,D); the discordant ones are (A,C), (A,D), (B,C), (B,D).

Observation: the total number of discordant ordered pairs is 2x the number of discordant pairs in one ranking w.r.t. the other, i.e. $2 \cdot 4 = 8$ here.

$\tau = 1 - \frac{2 \cdot \mathrm{DiscordantPairs}(R_1, R_2)}{n(n-1)} = 1 - \frac{2 \cdot 8}{4 \cdot (4 - 1)} = -\frac{1}{3}$
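A direct plain-Python sketch of this computation, counting discordant pairs in one ranking with respect to the other; on the A, B, C, D example it reproduces the value -1/3.

# Sketch: Kendall's tau from pairwise (dis)agreements; ranks are positions 1..n.
from itertools import combinations

def kendall_tau(rank1, rank2):
    # rank1, rank2: dicts mapping each object to its rank in R1 and R2
    objects = list(rank1)
    n = len(objects)
    discordant = 0
    for a, b in combinations(objects, 2):
        if (rank1[a] - rank1[b]) * (rank2[a] - rank2[b]) < 0:
            discordant += 1
    # 2 * discordant counts each discordant pair once per ranking (ordered pairs)
    return 1 - 2 * (2 * discordant) / (n * (n - 1))

R1 = {"A": 1, "B": 2, "C": 3, "D": 4}
R2 = {"C": 1, "D": 2, "A": 3, "B": 4}
print(kendall_tau(R1, R2))    # -0.333...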

Page 155: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

155

Spearman’s Coefficient

Considers ranking differences for the same object:

$S(R_1, R_2) = 1 - \frac{6 \cdot \sum_{j=1}^{n} \big(R_1(o_j) - R_2(o_j)\big)^{2}}{n(n^2 - 1)}$

Example: $R_1 = \langle a, b, c \rangle$, $R_2 = \langle a, c, b \rangle$:

$S(R_1, R_2) = 1 - \frac{6 \cdot \big[(R_1(a) - R_2(a))^2 + (R_1(b) - R_2(b))^2 + (R_1(c) - R_2(c))^2\big]}{3(3^2 - 1)} = 1 - \frac{6 \cdot \big[(1-1)^2 + (2-3)^2 + (3-2)^2\big]}{3 \cdot 8} = \frac{1}{2}$

Complexity: $O(n)$ given the rank scores.

Bound used for the normalization: $0 \le \sum_{j=1}^{n} \big(R_1(o_j) - R_2(o_j)\big)^{2} \le \frac{n(n^2 - 1)}{3}$
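The same kind of sketch for Spearman's coefficient (plain Python); on the a, b, c example above it gives 1/2.

# Sketch: Spearman's coefficient from rank differences; O(n) once ranks are known.
def spearman(rank1, rank2):
    objects = list(rank1)
    n = len(objects)
    d2 = sum((rank1[o] - rank2[o]) ** 2 for o in objects)
    return 1 - 6 * d2 / (n * (n * n - 1))

R1 = {"a": 1, "b": 2, "c": 3}
R2 = {"a": 1, "b": 3, "c": 2}
print(spearman(R1, R2))       # 0.5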

Page 156: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

156

Rank Intuitions: Setup

$R_1$ ranks the objects $1, 2, \ldots, 10$ in order; $R_2$ can be viewed as a scrambling of that order, so the sequence $\langle 3, 1, 4, 10, 5, 9, 2, 6, 8, 7 \rangle$ is sufficient to encode the two rankings. (Objects ordered by rank scores; $R_2$ viewed as if scrambling the order of $R_1$.)

Page 157: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

157

Rank Intuitions: Pairs

On the correlation scale from $-1$ to $1$: the value is $1$ when the rankings are in complete agreement and $-1$ when they are in complete disagreement (the reverse of each other).

Page 158: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

158

Rank Intuitions: Spearman

[Figure: Spearman's coefficient, on the $-1$ to $1$ scale, for a family of example rankings parameterized by $p$ ($p = 1, 2, 3, 4, 5, 6, 10, 15, \ldots, n$); segment lengths represent $R_1$ rank scores.]

Page 159: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

159

Rank Intuitions: Kendall

[Figure: Kendall's tau, on the $-1$ to $1$ scale, for the same family of example rankings ($p = 1, 2, 3, 4, 5, 6, 10, 15, \ldots, n$); segment lengths represent $R_1$ rank scores.]

Page 160: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

160

What about ties?

The position of an object within a set of objects that have the same scores in the rankings affects the rank correlation.

If object $o_j$ has the same ranking score as a group of other objects in $R_1$ and $R_2$, then where $o_j$ is placed within its tie group changes the result; for example, one placement of $o_j$ leads to a lower Spearman's coefficient, another to a higher one.

Page 161: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

161

Ties

• Kendall: use strict discordance, i.e. count a pair $(a, b)$ as discordant only when

$\mathrm{sgn}\big(score_1(a) - score_1(b)\big) \ne \mathrm{sgn}\big(score_2(a) - score_2(b)\big)$

• Spearman:
– Can use per-entity upper and lower bounds.
– Do as in the Olympics: objects with the same score get the same rank.

Page 162: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

162

Ties: Kendall TauB

http://en.wikipedia.org/wiki/Kendall_tau#Tau-b

$\tau_B = \frac{n_c - n_d}{\sqrt{\big(\frac{n(n-1)}{2} - n_1\big)\big(\frac{n(n-1)}{2} - n_2\big)}}$

where:
$n_c$ is the number of concordant pairs,
$n_d$ is the number of discordant pairs,
$n$ is the number of objects in the two rankings,
$n_1 = \sum_i \frac{t_i(t_i - 1)}{2}$ is the number of pairs among elements with ties in ranking 1 ($t_i$ is the size of the $i$-th tie group),
$n_2 = \sum_j \frac{u_j(u_j - 1)}{2}$ is the number of pairs among elements with ties in ranking 2 ($u_j$ is the size of the $j$-th tie group).
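In practice one usually reaches for an existing implementation when ties are involved; SciPy's kendalltau computes the tau-b variant. A minimal sketch, assuming SciPy is installed; the tied example values are made up.

# Sketch: tau-b on rank scores with ties (SciPy assumed available).
from scipy.stats import kendalltau

# Without ties, tau-b equals the plain tau from the earlier A, B, C, D example.
tau, p_value = kendalltau([1, 2, 3, 4], [3, 4, 1, 2])
print(tau)          # -0.333...

# With ties, the n1/n2 corrections in the denominator come into play.
tau_b, _ = kendalltau([1, 2, 2, 4], [1, 3, 2, 4])
print(tau_b)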

Page 163: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

163

Uses of popularity
Popularity can be used to augment the gain in NDCG by linearly scaling it:

label: 1 (poor) -> gain 1;  2 (fair) -> gain 3;  3 (good) -> gain 7;  4 (excellent) -> gain 15;  5 (perfect) -> gain 31   (i.e., gain = 2^label - 1)

$Gain + (popularity) \cdot Gain$

Page 164: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

164

Next Steps

• How to determine the popularity of new entities
– Challenge: no historical data.
– Usually there is an initial period of high popularity (e.g., a new restaurant is featured in the local paper, promotions, etc.).

• Good abandonment (no user clicks but a good entity in terms of satisfying the user's information needs, e.g., a phone number).
– Use the number of impressions for named queries.

Page 165: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

165

References
1. Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook. [link]
2. Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press. [link]
3. David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press. [link]
4. Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd Edition. ACM Press Books. [link]
5. Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York. [link]
6. Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. [link]
7. Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. [link]
8. George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press. [link]
9. Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics. Springer. [link]
10. Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer. [link]
11. Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd Edition. Wiley-Interscience. [link]
12. Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer Series in Statistics. Springer. [link]
13. James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience. [link]
14. Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning series. MIT Press. [link]
15. David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press. [link]
16. Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer. [link]
17. Tom M. Mitchell (1997) Machine Learning. McGraw-Hill (Science/Engineering/Math). [link]
18. Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine Learning series. MIT Press. [link]
19. Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific. [link]
20. Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press. [link]
21. Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer. [link]
22. Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer. [link]

Page 166: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

166

Roadmap• Examples of applications of Machine Learning• Encoding objects with features• The Machine Learning framework• Linear models

– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)• Tree models (Decision Trees DTs)

– Classification Decision Trees, Regression Trees• Boosting

– AdaBoost• Ranking evaluation

– Kendall tau and Spearman’s coefficient• Sequence labeling

– Hidden Markov Models (HMMs)

Page 167: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

167

SEQUENCE LABELING:HIDDEN MARKOV MODELS (HMMs)

Page 168: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

168

Outline

• The guessing game• Tagging preliminaries• Hidden Markov Models• Trellis and the Viterbi algorithm• Implementation (Python)• Complexity of decoding• Parameter estimation and smoothing• Second order models

Page 169: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

169

The Guessing Game

• A cow and duck write an email message together.• Goal – figure out which word is written by which animal.

The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).

Page 170: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

170

What’s the Big Deal ?

• The vocabularies of the cow and the duck can overlap and it is not clear a priori who wrote a certain word!

Page 171: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

171

The Game (cont)

At first, ? ? ? above the words moo, hello, quack (the authors are unknown); then partially resolved: COW above moo, ? above hello, DUCK above quack.

Page 172: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

172

The Game (cont)

The fully labeled message: COW above moo, COW above hello, DUCK above quack.

Page 173: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

173

What about the Rest of the Animals?

(Each of word1 ... word5 could in principle have been produced by any of the animals — ZEBRA, PIG, DUCK, COW, ANT — so every word carries the full set of candidate labels until we disambiguate.)

Page 174: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

174

A Game for Adults

• Instead of guessing which animal is associated with each word guess the corresponding POS tag of a word.

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.

Page 175: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

175

POS Tags: "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD", "NN", "NNS", "NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB", "#", "$", ".", ",", ":", "(", ")", "`", "``", "'", "''"

Page 176: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

176

Tagging Preliminaries

• We want the best set of tags for a sequence of words (a sentence)

• W — a sequence of words• T — a sequence of tags

$\hat{T} = \operatorname*{argmax}_{T} P(T \mid W)$

Page 177: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

177

Bayes’ Theorem (1763)

$P(T \mid W) = \frac{P(W \mid T) \cdot P(T)}{P(W)}$

posterior = likelihood x prior / marginal likelihood

Reverend Thomas Bayes — Presbyterian minister (1702-1761)

Page 178: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

178

Applying Bayes' Theorem
• How do we approach P(T|W)? Use Bayes' theorem:

$\operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} \frac{P(W \mid T) \cdot P(T)}{P(W)}$

• So what? Why is it better?
• Ignore the denominator (it does not depend on $T$):

$\operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} \frac{P(W \mid T) \cdot P(T)}{P(W)} = \operatorname*{argmax}_{T} P(W \mid T) \cdot P(T)$

Page 179: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

179

Tag Sequence Probability

• Count the number of times a sequence occurs and divide by the number of sequences of that length — not likely!– Use chain rule

How do we get the probability P(T) of a specific tag sequence T?

Page 180: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

180

P(T) is a product of the probabilities of the N-grams that make it up.

Chain Rule

$P(T) = P(t_1, \ldots, t_n) = P(t_1) \cdot P(t_2 \mid t_1) \cdot P(t_3 \mid t_1, t_2) \cdots P(t_n \mid t_1, \ldots, t_{n-1})$    (each factor is conditioned on the full history)

Make a Markov assumption: the current tag depends on the previous one only:

$P(t_1, \ldots, t_n) = P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})$

Page 181: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

181

• Use counts from a large hand-tagged corpus.
• For bi-grams, count all the $t_{i-1}\, t_i$ pairs.

• Some counts are zero – we'll use smoothing to address this issue later.

Transition Probabilities

$P(t_i \mid t_{i-1}) = \frac{c(t_{i-1}\, t_i)}{c(t_{i-1})}$

Page 182: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

182

What about P(W|T) ?

• First it's odd—it is asking the probability of seeing "The white horse" given "Det Adj Noun"!
– Collect up all the times you see that tag sequence and see how often "The white horse" shows up …

• Assume each word in the sequence depends only on its corresponding tag:

$P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)$    (emission probabilities)

Page 183: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

183

Emission Probabilities

• What proportion of times is the word $w_i$ associated with the tag $t_i$ (as opposed to another word):

$P(w_i \mid t_i) = \frac{c(w_i, t_i)}{c(t_i)}$

Page 184: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

184

The “Standard” Model

$\hat{T} = \operatorname*{argmax}_{T} P(T \mid W) = \operatorname*{argmax}_{T} \frac{P(W \mid T) \cdot P(T)}{P(W)} = \operatorname*{argmax}_{T} P(W \mid T) \cdot P(T) = \operatorname*{argmax}_{T} \prod_{i=1}^{n} P(w_i \mid t_i) \cdot P(t_i \mid t_{i-1})$

Page 185: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

185

Hidden Markov Models

• Stochastic process: a sequence $X_1, X_2, \ldots$ of random variables based on the same sample space $\Omega$.

• Probabilities for the first observation: $P(X_1 = x_j)$ for each outcome $x_j$.

• Next step given the previous history: $P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t})$

Page 186: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

186

• A Markov Chain is a stochastic process with the Markov property:

$P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t}) = P(X_{t+1} = x_{i_{t+1}} \mid X_t = x_{i_t})$

• Outcomes are called states.
• Probabilities for the next step – a weighted finite state automaton.

Page 187: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

187

State Transitions w/ Probabilities

(State diagram: START -> COW with probability 1.0; COW -> COW 0.5, COW -> DUCK 0.3, COW -> END 0.2; DUCK -> DUCK 0.5, DUCK -> COW 0.3, DUCK -> END 0.2.)

Page 188: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

188

Markov Model

Markov chain where each state can output signals (like "Moore machines"):

(Same state diagram as before — START -> COW 1.0; COW -> COW 0.5, COW -> DUCK 0.3, COW -> END 0.2; DUCK -> DUCK 0.5, DUCK -> COW 0.3, DUCK -> END 0.2 — with emissions COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6; START emits ^ and END emits $ with probability 1.0.)

Page 189: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

189

The Issue Was

• A given output symbol can potentially be emitted by more than one state — omnipresent ambiguity in natural language.

Page 190: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

190

Markov Model

Finite set of states: $\{s_1, \ldots, s_m\}$

Signal alphabet: $\{\sigma_1, \ldots, \sigma_k\}$

Transition matrix: $P = [p_{ij}]$ where $p_{ij} = P(s_j \text{ at } t+1 \mid s_i \text{ at } t)$

Emission probabilities: $A = [a_{ij}]$ where $a_{ij} = P(\sigma_j \mid s_i)$

Initial probability vector: $v = [v_1, \ldots, v_m]$ where $v_j = P(s_j \text{ at } t = 1)$

Page 191: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

191

Graphical Model

(Graphical model: a chain of STATE (tag) nodes, each emitting an OUTPUT (word) node.)

Page 192: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

192

Hidden Markov Model

• A Markov Model for which it is not possible to observe the sequence of states.

• S: unknown — the sequence of states (tags)
• O: known — the sequence of observations (words)

$\hat{S} = \operatorname*{argmax}_{S} P(S \mid O)$

Page 193: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

193

The State Space

(The state space: START -> COW 1.0, START -> DUCK 0.0; COW -> COW 0.5, COW -> DUCK 0.3, COW -> END 0.2; DUCK -> DUCK 0.5, DUCK -> COW 0.3, DUCK -> END 0.2; emissions COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6. The observed signal sequence is moo hello quack.)

More on how the probabilities come about (training) later.

Page 194: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

194

Optimal State Sequence:The Viterbi Algorithm

We define $\delta_t(i)$, the joint probability of the most likely state sequence from time 1 to time $t$ ending in state $s_i$, together with the observed sequence $O_{\le t}$ up to time $t$:

$\delta_t(i) = \max_{S_{t-1}} P(S_t = s_i;\, O_{\le t}) = \max_{s_{i_1}, \ldots, s_{i_{t-1}}} P(s_{i_1}, \ldots, s_{i_{t-1}}, S_t = s_i;\, O_{\le t})$

Page 195: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

195

Key Observation

The most likely partial derivation leading to state $s_i$ at position $t$ consists of:

– the most likely partial derivation leading to some state $s_{i_{t-1}}$ at the previous position $t-1$,

– followed by the transition from $s_{i_{t-1}}$ to $s_i$.

Page 196: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

196

Note:

We will show that:

$\delta_1(i) = v_i\, a_{i k_1}$, where $v_i = P(s_i)$ and $a_{i k_1} = P(\sigma_{k_1} \mid s_i)$

Viterbi (cont)

$\delta_{t+1}(j) = \big[\max_i \delta_t(i)\, p_{ij}\big] \cdot a_{j k_{t+1}}$

Page 197: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

197

Recurrence Equation

$\delta_{t+1}(j) = \max_{S_t} P(S_{t+1} = s_j;\, O_{\le t+1})$
$= \max_{i} \max_{S_{t-1}} P(S_{t+1} = s_j, \sigma_{k_{t+1}}, S_t = s_i;\, O_{\le t})$
$= \max_{i} \max_{S_{t-1}} P(S_{t+1} = s_j, \sigma_{k_{t+1}} \mid S_t = s_i;\, O_{\le t}) \cdot P(S_t = s_i;\, O_{\le t})$
$= \max_{i} \big[ P(\sigma_{k_{t+1}} \mid s_j) \cdot P(s_j \mid s_i) \cdot \max_{S_{t-1}} P(S_t = s_i;\, O_{\le t}) \big]$
$= \max_{i} \big[ \delta_t(i)\, p_{ij} \big] \cdot a_{j k_{t+1}}$

(The conditioning step uses the Markov and emission-independence assumptions: given $S_t = s_i$, the next transition and emission do not depend on the earlier states or observations.)

Page 198: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

198

Back Pointers

• The predecessor of state $s_i$ on the path corresponding to $\delta_t(i)$:

$\psi_t(i) = \operatorname*{argmax}_{j \in 1..m} \big[ \delta_{t-1}(j)\, p_{ji} \big]$

• Optimal state sequence (recovered by backtracking):

$s^*_n = \operatorname*{argmax}_{i \in 1..m} \delta_n(i)$,   $s^*_t = \psi_{t+1}(s^*_{t+1})$ for $t = n-1, \ldots, 1$

Page 199: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

199

The Trellis

(Trellis for the observations ^ moo hello quack $, with rows START, COW, DUCK, END and columns t = 0..4. The non-zero $\delta$ values are: t=0: START 1; t=1: COW 0.9; t=2: COW 0.045, DUCK 0.108; t=3: DUCK 0.0324 (reaching DUCK via COW would only give 0.0081); t=4: END 0.00648.)

Page 200: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

200

Implementation (Python)

observations = ['^','moo','hello','quack','$']  # signal sequence
states = ['start','cow','duck','end']

# Transition probabilities - p[FromState][ToState] = probability
p = {'start': {'cow':1.0},
     'cow':   {'cow':0.5, 'duck':0.3, 'end':0.2},
     'duck':  {'duck':0.5, 'cow':0.3, 'end':0.2}}

# Emission probabilities; special emission symbol '$' for the 'end' state
a = {'cow':  {'moo':0.9, 'hello':0.1, 'quack':0.0, '$':0.0},
     'duck': {'moo':0.0, 'hello':0.4, 'quack':0.6, '$':0.0},
     'end':  {'moo':0.0, 'hello':0.0, 'quack':0.0, '$':1.0}}

Page 201: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

201

Implementation (Viterbi)

T = len(observations)  # number of time steps
# Initializing the viterbi table row by row; v[state][time] = prob
v = {}
for s in states: v[s] = [0.0] * T

# Initializing back pointers
backPointer = {}
for s in states: backPointer[s] = [""] * T

v['start'][0] = 1.0
for t in range(T-1):               # populate column t+1 of v
    for s in states[:-1]:          # the 'end' state is never a source state
        # only consider the 'start' state at time 0
        if t == 0 and s != 'start': continue
        for s1 in p[s].keys():     # s1 is the next state
            newScore = v[s][t] * p[s][s1] * a[s1][observations[t+1]]
            if v[s1][t+1] == 0.0 or newScore > v[s1][t+1]:
                v[s1][t+1] = newScore
                backPointer[s1][t+1] = s

Page 202: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

202

Implementation (Best Path)

# Now recover the optimal state sequence by following the back pointers from 'end'
state = 'end'
state_sequence = [ state ]
for t in range(T-1, 0, -1):
    state = backPointer[state][t]
    state_sequence = [state] + state_sequence

print "Observations....: ", observations
print "Optimal sequence: ", state_sequence
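Given the probabilities defined above, running the three snippets end to end should print the optimal sequence ['start', 'cow', 'duck', 'duck', 'end'] for the observations ^ moo hello quack $; this matches the best path through the trellis on the earlier slide, whose score is 1.0 · 0.9 · 0.3 · 0.4 · 0.5 · 0.6 · 0.2 = 0.00648.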

Page 203: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

203

Complexity of Decoding

• O(m^2 n) — linear in n (the length of the string)
• Initialization: O(mn)
• Back tracing: O(n)
• Next step: O(m^2):

for current_state in s1..sm    # at time t+1
    for prev_state in s1..sm   # at time t
        compute value
        compare with best_so_far

• There are n next steps.

Page 204: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

204

Parameter Estimation for HMMs

• Need annotated training data (Brown, PTB).• Signal and state sequences both known.• Calculate observed relative frequencies.• Complications — sparse data problem (need for smoothing).

• One can use only raw data too — Baum-Welch (forward-backward) algorithm.
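A minimal sketch of the relative-frequency estimation (plain Python, no smoothing); the two-sentence toy corpus is a made-up stand-in for an annotated corpus such as Brown or the PTB.

# Sketch: maximum-likelihood (relative frequency) estimation of HMM parameters
# from tagged sentences: P(t_i | t_{i-1}) = c(t_{i-1} t_i) / c(t_{i-1}),
#                        P(w_i | t_i)     = c(w_i, t_i)   / c(t_i).
from collections import defaultdict

corpus = [  # toy stand-in for a hand-tagged corpus
    [("the", "DT"), ("cow", "NN"), ("moos", "VBZ")],
    [("the", "DT"), ("duck", "NN"), ("quacks", "VBZ")],
]

transition_counts = defaultdict(lambda: defaultdict(int))
emission_counts = defaultdict(lambda: defaultdict(int))
tag_counts = defaultdict(int)

for sentence in corpus:
    prev = "START"
    for word, tag in sentence:
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        tag_counts[tag] += 1
        prev = tag
    transition_counts[prev]["END"] += 1

def transition_prob(t_prev, t):
    total = sum(transition_counts[t_prev].values())
    return transition_counts[t_prev][t] / total if total else 0.0

def emission_prob(word, tag):
    return emission_counts[tag][word] / tag_counts[tag] if tag_counts[tag] else 0.0

print(transition_prob("DT", "NN"))    # 1.0
print(emission_prob("cow", "NN"))     # 0.5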

Page 205: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

205

Optimization

• Build a vocabulary of possible tags for words.
• Keep total counts for words.
• If a word occurs frequently (count > threshold), consider its observed tag set exhaustive.
• For frequent words, only consider their tag set (vs. all tags).
• For unknown words, don't consider tags corresponding to closed-class words (e.g., DT).

Page 206: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

206

Applications Using HMMs

• POS tagging (as we have seen).• Chunking.• Named Entity Recognition (NER).• Speech recognition.

Page 207: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

207

Exercises

• Implement the training (parameter estimation).• Use a dictionary of valid tags for known words to constrain

which tags are considered for a word.• Implement a second-order model.• Implement the decoder in Ruby.

Page 208: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

208

Some POS Taggers

• Alias-I: http://www.alias-i.com/lingpipe
• AUTASYS: http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm
• Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
• CLAWS: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
• Connexor: http://www.connexor.com/software/tagger
• Edinburgh (LTG): http://www.ltg.ed.ac.uk/software/pos/index.html
• FLAT (Flexible Language Acquisition Tool): http://lanaconsult.com
• fnTBL: http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html
• GATE: http://gate.ac.uk
• Infogistics: http://www.infogistics.com/posdemo.htm
• Qtag: http://www.english.bham.ac.uk/staff/omason/software/qtag.html
• SNoW: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=POS
• Stanford: http://nlp.stanford.edu/software/tagger.shtml
• SVMTool: http://www.lsi.upc.edu/~nlp/SVMTool
• TNT: http://www.coli.uni-saarland.de/~thorsten/tnt
• Yamcha: http://chasen.org/~taku/software/yamcha/

Page 209: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

209


Page 210: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

210

Statistics Refresher
• Outcome: Individual atomic result of a (non-deterministic) experiment.
• Event: A set of outcomes.
• Probability: Limit of the frequency of the target outcome over the number of experiments (frequentist view) or degree of belief (Bayesian view).
• Normalization condition: Probabilities for all outcomes sum to 1.
• Distribution: Probabilities associated with each outcome.
• Random variable: Mapping of the outcomes to real numbers.
• Joint distributions: Conducting several (possibly related) experiments and observing the results. The joint distribution states the probability for a combination of values of several random variables.
• Marginal: Finding the distribution of one random variable from a joint distribution.
• Conditional probability (Bayes' rule): Knowing the value of one variable constrains the distribution of another.
• Probability density functions: Probability that a continuous variable is in a certain range.
• Probabilistic reasoning: Introduce evidence (set certain variables) and compute the probabilities of interest (conditioned on this evidence).

Page 211: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

211

Definitions

Expectation: $\mu = E[X] = \sum_{i=1}^{n} x_i \cdot p(x_i) = \int_{-\infty}^{\infty} x\, p(x)\, dx$

Mode: $x^{*} = \operatorname*{argmax}_{i} p(x_i)$

Variance: $\sigma^2 = Var(X) = E\big[(X - \mu)^2\big] = E[X^2] - \mu^2$

Expectation of a function: $E[f(X)] = \sum_{i=1}^{n} f(x_i) \cdot p(x_i) = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx$

$n$-th moment: $E[X^n] = \sum_{i=1}^{n} x_i^{n} \cdot p(x_i)$    ($\mu$ is the first moment)

Properties: $E[aX + b] = aE[X] + b$,   $E[X + Y] = E[X] + E[Y]$,   $Var[aX + b] = a^2\, Var[X]$

Page 212: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

212

Intuitions about Scale

Weight in grams if the Earth were to be a black hole.

Age of the universe in seconds.

Number of cells in the human body (100 trillion).

Number of neurons in the human brain.

Standard Blu-ray disc size, XL 4 layer (128GB).

One year in seconds.

Items in the Library of Congress (largest in the world).

Length of the Nile in meters (longest river).

Page 213: Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

213

Acknowledgements

• Bran Boguraev
• Chris Brew
• Jinho Choi
• William Headden
• Jingjing Li
• Jason Kessler
• Mike Mozer
• Shumin Wu
• Tong Zhang
• Amir Padovitz
• Bruno Bozza
• Kent Cedola
• Max Galkin
• Manuel Reyes Gomez
• Matt Hurst
• John Langford
• Priyank Singh