Machine Learning for Natural Language Processing

Seminar: Statistical NLP, Girona, June 2003

Lluís Màrquez
TALP Research Center, Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya

Page 1:

Machine Learning for Natural Language Processing

Lluís Màrquez
TALP Research Center
Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya

Page 2: Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP

Page 3: Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP

Page 4: ML4NLP

Machine Learning

• There are many general-purpose definitions of Machine Learning (or artificial learning): making a computer automatically acquire some kind of knowledge from a concrete data domain.
• Learners are computers: we study learning algorithms.
• Resources are scarce: time, memory, data, etc.
• It has (almost) nothing to do with cognitive science, neuroscience, the theory of scientific discovery and research, etc.
• Biological plausibility is welcome but not the main goal.

Page 5: ML4NLP

Machine Learning

• Learning... but what for?
  – To perform some particular task
  – To react to environmental inputs
  – Concept learning from data:
    • modelling the concepts underlying the data
    • predicting unseen observations
    • compacting the knowledge representation
    • knowledge discovery for expert systems
• We will concentrate on: supervised inductive learning for classification (= discriminative learning)

Page 6: ML4NLP

Machine Learning

A more precise definition: obtaining a description of the concept in some representation language that explains the observations and helps to predict new instances of the same distribution.

• What to read? Machine Learning (Mitchell, 1997)

Page 7: ML4NLP

Empirical NLP

1990s: application of Machine Learning (ML) techniques to NLP problems.

• Lexical and structural ambiguity problems = classification problems:
  – Word selection (SR, MT)
  – Part-of-speech tagging
  – Semantic ambiguity (polysemy)
  – Prepositional phrase attachment
  – Reference ambiguity (anaphora)
  – etc.

• What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)

Page 8: ML4NLP

NLP "classification" problems

• Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity resolution = classification.

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

Page 9: NLP "classification" problems

• Morpho-syntactic ambiguity

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

[Slide graphic: alternative part-of-speech tags (NN, VB, JJ) shown over the ambiguous words]

Page 10: NLP "classification" problems

• Morpho-syntactic ambiguity: Part-of-Speech Tagging

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

[Slide graphic: alternative part-of-speech tags (NN, VB, JJ) shown over the ambiguous words]

Page 11: NLP "classification" problems

• Semantic (lexical) ambiguity

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

[Slide graphic: alternative senses body-part / clock-part shown over "hand"]

Page 12: NLP "classification" problems

• Semantic (lexical) ambiguity: Word Sense Disambiguation

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

[Slide graphic: alternative senses body-part / clock-part shown over "hand"]

Page 13: NLP "classification" problems

• Structural (syntactic) ambiguity

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

Page 14: NLP "classification" problems

• Structural (syntactic) ambiguity

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

Page 15: NLP "classification" problems

• Structural (syntactic) ambiguity: PP-attachment disambiguation

He was shot in the hand as he (chased (the robbers)NP (in the back street)PP)
(The Wall Street Journal Corpus)

Page 16: Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms in detail
• Applications to NLP

Page 17: Classification

Feature Vector Classification (AI perspective)

• An instance is a vector x = <x1, ..., xn> whose components, called features (or attributes), are discrete or real-valued.
• Let X be the space of all possible instances.
• Let Y = {y1, ..., ym} be the set of categories (or classes).
• The goal is to learn an unknown target function f : X → Y.
• A training example is an instance x ∈ X labelled with the correct value of f(x), i.e., a pair <x, f(x)>.
• Let D be the set of all training examples.

Page 18: Classification

Feature Vector Classification

• The hypothesis space H is the set of functions h : X → Y that the learner can consider as possible definitions.

The goal is to find a function h ∈ H such that for every pair <x, f(x)> ∈ D, h(x) = f(x).

Page 19: Classification

An Example

Example | SIZE  | COLOR | SHAPE    | CLASS
1       | small | red   | circle   | positive
2       | big   | red   | circle   | positive
3       | small | red   | triangle | negative
4       | big   | blue  | circle   | negative

Rules:
(COLOR=red) ∧ (SHAPE=circle) → positive
otherwise → negative

Decision Tree:
COLOR
├─ red → SHAPE
│   ├─ circle → positive
│   └─ triangle → negative
└─ blue → negative

Page 20: Classification

An Example

Example | SIZE  | COLOR | SHAPE    | CLASS
1       | small | red   | circle   | positive
2       | big   | red   | circle   | positive
3       | small | red   | triangle | negative
4       | big   | blue  | circle   | negative

Rules:
(SIZE=small) ∧ (SHAPE=circle) → positive
(SIZE=big) ∧ (COLOR=red) → positive
otherwise → negative

Decision Tree:
SIZE
├─ small → SHAPE
│   ├─ circle → pos
│   └─ triang → neg
└─ big → COLOR
    ├─ red → pos
    └─ blue → neg
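The two slides above show that several hypotheses (rule sets, trees) can be consistent with the same four examples. As a concrete illustration, here is a minimal Python sketch that induces a decision tree from this toy data set; the use of scikit-learn and the one-hot encoding of the categorical features are our own illustrative choices, not part of the original slides.

# Induce a decision tree from the 4-example toy data set above.
# Assumes scikit-learn is installed; categorical features are one-hot encoded.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

X = [["small", "red",  "circle"],
     ["big",   "red",  "circle"],
     ["small", "red",  "triangle"],
     ["big",   "blue", "circle"]]
y = ["positive", "positive", "negative", "negative"]

enc = OneHotEncoder()                      # SIZE/COLOR/SHAPE -> binary features
Xe = enc.fit_transform(X).toarray()
tree = DecisionTreeClassifier().fit(Xe, y)

print(export_text(tree, feature_names=list(enc.get_feature_names_out())))
print(tree.predict(enc.transform([["big", "red", "triangle"]]).toarray()))

Depending on the tie-breaking of the induction algorithm, the learned tree may match either of the two trees shown above; both classify the training data perfectly.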

Page 21: Classification

Some important concepts

• Inductive Bias: "Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias" (Mooney & Cardie, 99)
  – Language / search bias

Decision Tree (illustration):
COLOR
├─ red → SHAPE
│   ├─ circle → positive
│   └─ triangle → negative
└─ blue → negative

Page 22: Classification

Some important concepts

• Inductive bias
• Training error and generalization error
• Generalization ability and overfitting
• Batch learning vs. on-line learning
• Symbolic vs. statistical learning
• Propositional vs. first-order learning

Page 23: Classification

Propositional vs. Relational Learning

• Propositional learning:
  color(red) ∧ shape(circle) → classA

• Relational learning = ILP (induction of logic programs):
  course(X) ∧ person(Y) ∧ link_to(Y,X) → instructor_of(X,Y)
  research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) → member_proj(X,Z)

Page 24: Classification

The Classification Setting (CoLT/SLT perspective): class, point, example, data set, ...

• Input space: X ⊆ R^n
• (Binary) output space: Y = {+1, -1}
• A point, pattern or instance: x ∈ X, x = (x1, x2, ..., xn)
• Example: (x, y) with x ∈ X, y ∈ Y
• Training set: a set of m examples generated i.i.d. according to an unknown distribution P(x, y):
  S = {(x1, y1), ..., (xm, ym)} ∈ (X × Y)^m

Page 25: Classification

The Classification Setting: learning, error, ...

• The hypothesis space H is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form:

$h(\mathbf{x}) = \sum_{i=1}^{n} w_i \phi_i(\mathbf{x}) + b$

• The goal is to find a function h ∈ H such that the expected misclassification error on new examples, also drawn from P(x, y), is minimal (Risk Minimization, RM).

Page 26: Classification

The Classification Setting: learning, error, ...

• Expected error (risk):

$R(h) = \int \operatorname{loss}(h(\mathbf{x}), y) \, dP(\mathbf{x}, y)$

• Problem: P itself is unknown; only the training examples are known → an induction principle is needed.

• Empirical Risk Minimization (ERM): find the function h ∈ H for which the training error (empirical risk) is minimal:

$R_{emp}(h) = \frac{1}{m} \sum_{i=1}^{m} \operatorname{loss}(h(\mathbf{x}_i), y_i)$
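With the 0-1 loss, the empirical risk is simply the fraction of training examples that a hypothesis misclassifies. A minimal Python sketch of this computation (the function and variable names are ours, for illustration only):

def empirical_risk(h, S):
    # R_emp(h) = (1/m) * sum over the m training examples of loss(h(x_i), y_i),
    # here with the 0-1 loss: 1 if h(x) != y, else 0.
    return sum(1 for x, y in S if h(x) != y) / len(S)

# Usage: a simple threshold hypothesis on 1-D points with labels in {+1, -1}
S = [((0.2,), -1), ((0.4,), -1), ((0.7,), +1), ((0.9,), +1)]
h = lambda x: +1 if x[0] > 0.5 else -1
print(empirical_risk(h, S))   # 0.0: h is consistent with this toy sample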

Page 27: Classification

The Classification Setting: error, over(under)fitting, ...

• Does low training error imply low true error?
• The overfitting dilemma (Müller et al., 2001):

[Figure: decision boundaries of increasing complexity on the same data, ranging from underfitting to overfitting]

• Trade-off between training error and complexity
• Different learning biases can be used

Page 28: Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP

Page 29: Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
  – Decision Trees
  – AdaBoost
  – Support Vector Machines
• Applications to NLP

Page 30: Algorithms

Learning Paradigms

• Statistical learning: HMM, Bayesian networks, ME, CRF, etc.
• Traditional methods from Artificial Intelligence (ML, AI): decision trees/lists, exemplar-based learning, rule induction, neural networks, etc.
• Methods from Computational Learning Theory (CoLT/SLT): Winnow, AdaBoost, SVMs, etc.

Page 31: Algorithms

Learning Paradigms

• Classifier combination: bagging, boosting, randomization, ECOC, stacking, etc.
• Semi-supervised learning (learning from labelled and unlabelled examples): bootstrapping, EM, transductive learning (SVMs, AdaBoost), co-training, etc.
• etc.

Page 32: Algorithms

Decision Trees

• Decision trees are a way to represent the rules underlying training data, with hierarchical structures that recursively partition the data.
• They have been used by many research communities (pattern recognition, statistics, ML, etc.) for data exploration with some of the following purposes: description, classification, and generalization.
• From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.

Page 33: Algorithms

Decision Trees

• Acquisition: Top-Down Induction of Decision Trees (TDIDT)
• Systems: CART (Breiman et al. 84); ID3, C4.5, C5.0 (Quinlan 86, 93, 98); ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95); etc.

Page 34: Algorithms

An Example

[Figure: a generic n-ary decision tree, with attribute nodes A1, A2, A3, A5 branching on values v1...v7 and class leaves C1, C2, C3]

Decision Tree:
SIZE
├─ small → SHAPE
│   ├─ circle → pos
│   └─ triang → neg
└─ big → COLOR
    ├─ red → pos
    └─ blue → neg

Page 35: Algorithms

Learning Decision Trees

Training: Training Set + TDIDT = DT
Test: DT + Example = Class

Page 36: Algorithms

General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
  var tree1, tree2: decision-tree;
      X': set-of-examples;
      A': set-of-features
  end-var
  if stopping_criterion(X) then
    tree1 := create_leaf_tree(X)
  else
    amax := feature_selection(X, A);
    tree1 := create_tree(X, amax);
    for-all val in values(amax) do
      X' := select_examples(X, amax, val);
      A' := A - {amax};
      tree2 := TDIDT(X', A');
      tree1 := add_branch(tree1, tree2, val)
    end-for
  end-if
  return (tree1)
end-function

Page 37: Algorithms

General Induction Algorithm (same TDIDT pseudocode as Page 36)
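The pseudocode above can be turned into running code directly. Below is a minimal Python rendering of the TDIDT scheme, using information gain (Quinlan 86) as the feature_selection function and class purity as the stopping_criterion; it is a sketch of the generic algorithm, not a reimplementation of any of the systems cited earlier.

from collections import Counter
from math import log2

def entropy(examples):
    # H over the class labels of the example set
    counts = Counter(cls for _, cls in examples)
    total = len(examples)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(examples, a):
    # Gain(X, a) = H(X) - sum_v (|X_v| / |X|) * H(X_v)
    total = len(examples)
    remainder = 0.0
    for val in {feats[a] for feats, _ in examples}:
        subset = [(f, c) for f, c in examples if f[a] == val]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def tdidt(examples, features):
    classes = {cls for _, cls in examples}
    if len(classes) == 1 or not features:        # stopping_criterion(X)
        return Counter(c for _, c in examples).most_common(1)[0][0]
    amax = max(features, key=lambda a: information_gain(examples, a))
    branches = {}                                # create_tree(X, amax)
    for val in {feats[amax] for feats, _ in examples}:
        subset = [(f, c) for f, c in examples if f[amax] == val]
        branches[val] = tdidt(subset, features - {amax})   # add_branch
    return (amax, branches)

# Usage on the SIZE/COLOR/SHAPE toy data set from Pages 19-20
data = [({"SIZE": "small", "COLOR": "red",  "SHAPE": "circle"},   "positive"),
        ({"SIZE": "big",   "COLOR": "red",  "SHAPE": "circle"},   "positive"),
        ({"SIZE": "small", "COLOR": "red",  "SHAPE": "triangle"}, "negative"),
        ({"SIZE": "big",   "COLOR": "blue", "SHAPE": "circle"},   "negative")]
print(tdidt(data, {"SIZE", "COLOR", "SHAPE"}))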

Page 38: Algorithms

Feature Selection Criteria

• Functions derived from Information Theory: Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from distance measures: Gini Diversity Index (Breiman et al. 84); RLM (López de Mántaras 91)
• Statistically based: Chi-square test (Sestito & Dillon 94); Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: a variant of RELIEFF (Kononenko 94)

Page 39: Algorithms

Extensions of DTs (Murthy 95)

• Pruning (pre/post)
• Minimizing the effect of the greedy approach: lookahead
• Non-linear splits
• Combination of multiple models
• Incremental learning (on-line)
• etc.

Page 40: Algorithms

Decision Trees and NLP

• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
• POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
• Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
• Parsing (Magerman 95, 96; Haruno et al. 98, 99)
• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
• Text summarization (Mani & Bloedorn 98)
• Dialogue act tagging (Samuel et al. 98)

Page 41: Algorithms

Decision Trees and NLP

• Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95)
• Discourse analysis in information extraction (Soderland & Lehnert 94)
• Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
• Verb classification in Machine Translation (Tanaka 96; Siegel 97)

Page 42: Algorithms

Decision Trees: pros & cons

• Advantages
  – Acquire symbolic knowledge in an understandable way
  – Very well studied ML algorithms and variants
  – Can be easily translated into rules
  – Availability of existing software: C4.5, C5.0, etc.
  – Can be easily integrated into an ensemble

Page 43: Algorithms

Decision Trees: pros & cons

• Drawbacks
  – Computationally expensive when scaling to large natural language domains: many training examples, features, etc.
  – Data sparseness and data fragmentation: the problem of small disjuncts → probability estimation
  – DTs are a model with high variance (unstable)
  – Tendency to overfit the training data: pruning is necessary
  – Require quite a big effort in tuning the model

Page 44: Algorithms

Boosting algorithms

• Idea: "to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier"
• AdaBoost (Freund & Schapire 95) has been studied extensively, both theoretically and empirically
• Many other variants and extensions (1997-2003): http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html

Page 45: Algorithms

AdaBoost: general scheme

[Figure: the AdaBoost training loop. At each round t = 1...T a weak learner is trained on training set TSt under probability distribution Dt, producing weak hypothesis ht; the distribution is then updated for the next round. The final classifier is a linear combination F(h1, h2, ..., hT).]

Page 46: Algorithms

AdaBoost: algorithm (Freund & Schapire 97)

[Figure: the AdaBoost pseudocode, shown as an image in the original slides]
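Since the algorithm itself appears only as an image in the transcript, here is a from-scratch Python sketch of binary AdaBoost with decision stumps as weak learners, following the general scheme of Freund & Schapire 97; all names and the stump learner are our own illustrative choices.

import math

def train_stump(X, y, w):
    # Weak learner: the best single-feature threshold rule under weights w
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for pol in (+1, -1):
                h = lambda x, j=j, t=thr, p=pol: p if x[j] >= t else -p
                err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
                if best is None or err < best[0]:
                    best = (err, h)
    return best                                  # (weighted error, hypothesis)

def adaboost(X, y, T=10):
    m = len(X)
    w = [1.0 / m] * m                            # initial distribution D1
    ensemble = []
    for _ in range(T):
        err, h = train_stump(X, y, w)
        err = max(err, 1e-12)                    # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Distribution update: raise the weight of misclassified examples
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    # Final classifier: sign of the linear combination F(h1, ..., hT)
    return lambda x: +1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Usage on a small 2-D toy problem
X = [(1, 1), (2, 1), (1, 3), (4, 4), (5, 3), (4, 5)]
y = [-1, -1, -1, +1, +1, +1]
f = adaboost(X, y, T=5)
print([f(x) for x in X])                         # reproduces y on this toy set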

Page 47: Algorithms

AdaBoost: example

Weak hypotheses = vertical/horizontal hyperplanes

Page 48: Algorithms

AdaBoost: round 1

Page 49: Algorithms

AdaBoost: round 2

Page 50: Algorithms

AdaBoost: round 3

Page 51: Algorithms

Combined Hypothesis

www.research.att.com/~yoav/adaboost

Page 52: Algorithms

AdaBoost and NLP

• POS tagging (Abney et al. 99; Màrquez 99)
• Text and speech categorization (Schapire & Singer 98; Schapire et al. 98; Weiss et al. 99)
• PP-attachment disambiguation (Abney et al. 99)
• Parsing (Haruno et al. 99)
• Word sense disambiguation (Escudero et al. 00, 01)
• Shallow parsing (Carreras & Màrquez 01a, 02)
• Email spam filtering (Carreras & Màrquez 01b)
• Term extraction (Vivaldi et al. 01)

Page 53: Algorithms

AdaBoost: pros & cons

+ Easy to implement and few parameters to set
+ Time and space grow linearly with the number of examples; able to manage very large learning problems
+ Does not explicitly constrain the complexity of the learner
+ Naturally combines feature selection with learning
+ Has been successfully applied to many practical problems

Page 54: Algorithms

AdaBoost: pros & cons

± Seems rather robust to overfitting (in the number of rounds) but sensitive to noise
± Performance is very good when there are relatively few relevant terms (features)
– Can perform poorly when the training data is insufficient relative to the complexity of the base classifiers, or when the training errors of the base classifiers become too large too quickly

Page 55: Algorithms

SVM: A General Definition

• "Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory." (Cristianini & Shawe-Taylor, 2000)

Page 56: Algorithms

SVM: A General Definition

• "Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high dimensional feature space, trained with a learning algorithm from optimisation theory that implements a learning bias derived from statistical learning theory." (Cristianini & Shawe-Taylor, 2000)

Key Concepts: hypothesis space of linear functions; high dimensional feature space; learning algorithm from optimisation theory; learning bias from statistical learning theory.

Page 57: Algorithms

Linear Classifiers

• Hyperplanes in R^N
• Defined by a weight vector (w) and a threshold (b)
• They induce a classification rule:

$h(\mathbf{x}) = \operatorname{sign}(\langle \mathbf{w}, \mathbf{x} \rangle + b) = \begin{cases} +1 & \text{if } \sum_{i=1}^{N} w_i x_i + b \ge 0 \\ -1 & \text{otherwise} \end{cases}$

[Figure: a hyperplane separating positive (+) from negative (-) examples, with normal vector w and offset b]
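In code, the induced classification rule is a single comparison; a pure-Python sketch (names ours):

def h(x, w, b):
    # h(x) = sign(<w, x> + b): +1 if sum_i w_i * x_i + b >= 0, else -1
    return +1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

print(h((2.0, -1.0), w=(1.0, 1.0), b=-0.5))   # +1: the point falls on the positive side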

Page 58: Algorithms

Optimal Hyperplane: Geometric Intuition

[Figure: several hyperplanes that all separate the training data]

Page 59: Algorithms

Optimal Hyperplane: Geometric Intuition

[Figure: the maximal margin hyperplane; the examples lying on the margin are the support vectors]

Page 60: Algorithms

Linearly separable data

Maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$ subject to the constraints:

$y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1$ for all $i = 1, \dots, l$

(geometric margin = $2 / \|\mathbf{w}\|_2$)

→ Quadratic Programming

Page 61: Algorithms

Non-separable case (soft margin)

Minimize $\|\mathbf{w}\|^2 + C \sum_{i=1}^{l} \xi_i$ subject to the constraints:

$y_i(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1 - \xi_i$ for all $i = 1, \dots, l$
$\xi_i \ge 0$ for all $i = 1, \dots, l$

where $\xi_1, \dots, \xi_l$ are positive slack variables introduced to penalize margin violations (costs).
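The effect of the C parameter (explored in the toy examples on Pages 66-68) can be reproduced with any solver of this quadratic program; a minimal sketch using scikit-learn's SVC, whose C argument is the cost parameter above (the data set is our own toy example):

from sklearn.svm import SVC

X = [[0, 0], [1, 0], [0, 1], [2, 2], [3, 2], [2, 3], [1.4, 1.4]]
y = [-1, -1, -1, +1, +1, +1, -1]               # the last point is an outlier

for C in (100.0, 0.1):   # high C: fit the outlier; low C: allow slack, wider margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.support_)                      # indices of the support vectors found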

Page 62: Algorithms

Non-linear SVMs

• Implicit mapping into feature space via kernel functions
• Non-linear mapping: $\Phi : X \to F$
• Set of hypotheses: $f(\mathbf{x}) = \sum_{i=1}^{n} w_i \phi_i(\mathbf{x}) + b$
• Dual formulation: $f(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle + b$
• Kernel function: $K(\mathbf{x}, \mathbf{z}) = \langle \phi(\mathbf{x}), \phi(\mathbf{z}) \rangle$
• Evaluation: $f(\mathbf{x}) = \sum_{i=1}^{l} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b$

Page 63: Algorithms

Non-linear SVMs

• Kernel functions
  – Must be efficiently computable
  – Characterization via Mercer's theorem
  – "One of the curious facts about using a kernel is that we do not need to know the underlying feature map in order to be able to learn in the feature space!" (Cristianini & Shawe-Taylor, 2000)
  – Examples: polynomials, Gaussian radial basis functions, two-layer sigmoidal neural networks, etc.
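A quick numerical check of the kernel idea (our own illustration): the degree-2 polynomial kernel K(x, z) = <x, z>^2 computes an inner product in the explicit feature space of all degree-2 monomials without ever constructing that space:

import itertools

def poly2_kernel(x, z):
    return sum(a * b for a, b in zip(x, z)) ** 2       # K(x, z) = <x, z>^2

def phi(x):
    # Explicit degree-2 feature map: all products x_i * x_j
    return [a * b for a, b in itertools.product(x, repeat=2)]

x, z = (1.0, 2.0), (3.0, 0.5)
print(poly2_kernel(x, z))                              # 16.0
print(sum(a * b for a, b in zip(phi(x), phi(z))))      # 16.0: the same value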

Page 64: Algorithms

Non-linear SVMs

[Figure: decision boundaries of a degree-3 polynomial kernel on a linearly separable and a linearly non-separable data set]

Page 65: Algorithms

Toy Examples

• All examples have been run with the 2D graphic interface of LIBSVM (Chang and Lin, National Taiwan University):

"LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification. The basic algorithm is a simplification of both SMO by Platt and SVMLight by Joachims. It is also a simplification of the modification 2 of SMO by Keerthi et al. Our goal is to help users from other fields to easily use SVM as a tool. LIBSVM provides a simple interface where users can easily link it with their own programs..."

• Available from: www.csie.ntu.edu.tw/~cjlin/libsvm (it includes a Web-integrated demo tool)
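For reference, a minimal sketch of training and evaluating a C-SVC with LIBSVM's own Python bindings; we assume the libsvm package is installed, and the option string follows LIBSVM's documented command-line flags:

from libsvm.svmutil import svm_train, svm_predict

y = [+1, +1, -1, -1]
x = [{1: 0.9, 2: 0.8}, {1: 0.7, 2: 0.9},   # sparse {index: value} format
     {1: 0.1, 2: 0.2}, {1: 0.3, 2: 0.1}]

# '-s 0': C-SVC, '-t 0': linear kernel, '-c 1': cost parameter C
model = svm_train(y, x, '-s 0 -t 0 -c 1')
labels, accuracy, values = svm_predict(y, x, model)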

Page 66: Algorithms

Toy Examples (I)

Linearly separable data set. Linear SVM. Maximal margin hyperplane.

What happens if we add a blue training example here?

[Figure: 2D plot of the separable data set and its maximal margin hyperplane, with the hypothetical new example marked]

Page 67: Algorithms

Toy Examples (I)

(Still) linearly separable data set. Linear SVM. High value of the C parameter. Maximal margin hyperplane.

The example is correctly classified.

Page 68: Algorithms

Toy Examples (I)

(Still) linearly separable data set. Linear SVM. Low value of the C parameter: trade-off between margin and training error.

The example is now a bounded SV.

Page 69: Algorithms

Toy Examples (II)

Page 70: Algorithms

Toy Examples (II)

Page 71: Algorithms

Toy Examples (II)

Page 72: Algorithms

Toy Examples (III)

Page 73: Algorithms

SVM: Summary

• SVMs were introduced in COLT'92 (Boser, Guyon & Vapnik, 1992). Great development since then
• Kernel-induced feature spaces: SVMs work efficiently in very high dimensional feature spaces (+)
• Learning bias: maximal margin optimisation. Reduces the danger of overfitting. Generalization bounds for SVMs (+)
• Compact representation of the induced hypothesis. The solution is sparse in terms of SVs (+)

Page 74: Algorithms

SVM: Summary

• Due to Mercer's conditions on the kernels, the optimisation problems are convex. No local minima (+)
• Optimisation theory guides the implementation. Efficient learning (+)
• Mainly for classification, but also for regression, density estimation, clustering, etc.
• Success in many real-world applications: OCR, vision, bioinformatics, speech recognition, NLP (text categorization, POS tagging, chunking, parsing, etc.) (+)
• Parameter tuning (–). Implications for convergence times, sparsity of the solution, etc.

Page 75: Outline

• Machine Learning for NLP
• The Classification Problem
• Three ML Algorithms
• Applications to NLP

Page 76: Applications

NLP problems

• Warning! We will not focus on final NLP applications, but on intermediate tasks...
• We will classify the NLP tasks according to their (structural) complexity

Page 77: Applications

NLP problems: structural complexity

• Decisional problems: text categorization, document filtering, word sense disambiguation, etc.
• Sequence tagging and detection of sequential structures: POS tagging, named entity extraction, syntactic chunking, etc.
• Hierarchical structures: clause detection, full parsing, IE of complex concepts, composite named entities, etc.

Page 78: Applications

POS tagging

• Morpho-syntactic ambiguity: Part-of-Speech Tagging

He was shot in the hand as he chased the robbers in the back street
(The Wall Street Journal Corpus)

[Slide graphic: alternative part-of-speech tags (NN, VB, JJ) shown over the ambiguous words]


Applications: POS tagging

The “preposition-adverb” tree: a learned decision tree for disambiguating the word forms “As”/“as” between preposition (IN) and adverb (RB):

root: P(IN)=0.81, P(RB)=0.19 − test: Word Form
  Word Form = “As”/“as”: P(IN)=0.83, P(RB)=0.17 − test: tag(+1) (others: ...)
    tag(+1) = RB: P(IN)=0.13, P(RB)=0.87 − test: tag(+2) (others: ...)
      tag(+2) = IN: leaf, P(IN)=0.013, P(RB)=0.987

Probabilistic interpretation:

P̂( RB | word=“As”/“as”, tag(+1)=RB, tag(+2)=IN ) = 0.987
P̂( IN | word=“As”/“as”, tag(+1)=RB, tag(+2)=IN ) = 0.013

(A small sketch of applying this tree follows below.)
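As an illustration of how such a tree could be used at tagging time, here is a small sketch. Only the probabilities are taken from the tree above; the fallback behaviour for the elided “others” branches is a simplifying assumption.

# Sketch of walking the "preposition-adverb" tree; the 'others' branches of the
# real tree are omitted, so we return the distribution at the last matched node
# (a simplifying assumption).
def disambiguate_as(word, next_tag, next_next_tag):
    if word.lower() != "as":
        return None                       # the tree shown only covers "As"/"as"
    if next_tag != "RB":
        return {"IN": 0.83, "RB": 0.17}   # after the Word Form test
    if next_next_tag != "IN":
        return {"IN": 0.13, "RB": 0.87}   # after observing tag(+1)=RB
    return {"IN": 0.013, "RB": 0.987}     # leaf: tag(+1)=RB and tag(+2)=IN

# "as well as": the first "as" is almost certainly an adverb (RB)
print(disambiguate_as("as", "RB", "IN"))  # {'IN': 0.013, 'RB': 0.987}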


Applications: POS tagging

Collocations resolved by the same “preposition-adverb” tree:

“as_RB much_RB as_IN”
“as_RB well_RB as_IN”
“as_RB soon_RB as_IN”



Applications: POS tagging

RTT (Màrquez & Rodríguez 97):
Raw text → Morphological analysis → Disambiguation → Tagged text
The disambiguation stage iterates Classify → Update → Filter over the text, consulting the learned Language Model, until the stop condition holds (a schematic sketch follows below).

A Sequential Model for Multi-class Classification: NLP/POS Tagging (Even-Zohar & Roth, 01)
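A schematic sketch of the classify-update-filter loop. This is not the actual RTT implementation: tree_distribution and stop_condition are hypothetical helpers, and the 0.05 threshold is an invented placeholder.

# Schematic RTT-style disambiguation loop (illustrative only).
def rtt_disambiguate(words, candidate_tags, tree_distribution, stop_condition):
    # tags[i] holds the set of tags still possible for word i
    tags = {i: set(cands) for i, cands in enumerate(candidate_tags)}
    while not stop_condition(tags):
        for i, word in enumerate(words):
            if len(tags[i]) <= 1:
                continue                              # already disambiguated
            dist = tree_distribution(word, i, tags)   # Classify (consult the trees)
            # Filter: drop tags whose estimated probability is too low
            kept = {t for t in tags[i] if dist.get(t, 0.0) > 0.05}
            if kept:
                tags[i] = kept                        # Update the context
    return tags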


Applications: POS tagging

STT (Màrquez & Rodríguez 97):
Raw text → Morphological analysis → Viterbi algorithm → Tagged text
The Language Model supplies lexical probabilities and contextual probabilities to the Viterbi disambiguation step (a minimal sketch follows below).

The Use of Classifiers in sequential inference: Chunking (Punyakanok & Roth, 00)
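A minimal Viterbi sketch over lexical and contextual (bigram) probabilities. This is a toy stand-in, not the STT code: the table shapes, the 1e-12 smoothing constant, and the omission of an initial tag prior are all simplifying assumptions.

import math

# Minimal Viterbi sketch: lex[(word, tag)] ~ P(word|tag), ctx[(prev, tag)] ~ P(tag|prev).
def viterbi(words, tags, lex, ctx):
    def lp(table, key):                   # smoothed log-probability lookup
        return math.log(table.get(key, 1e-12))
    V = [{t: lp(lex, (words[0], t)) for t in tags}]   # initial tag prior omitted
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] + lp(ctx, (p, t)))
            row[t] = V[-1][prev] + lp(ctx, (prev, t)) + lp(lex, (w, t))
            ptr[t] = prev
        V.append(row)
        back.append(ptr)
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):            # follow the back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))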


Applications: Detection of sequential and hierarchical structures

• Named Entity recognition

• Clause detection

(An illustrative BIO-encoding example follows below.)
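Such sequence-detection tasks are commonly reduced to per-word classification via a BIO encoding, so that the classifiers above apply directly. A small illustrative example (the sentence and entities are invented):

# BIO encoding: each word gets one label; chunks are maximal B-/I- runs.
words = ["John", "Smith", "works", "in", "New", "York"]
bio   = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]

def decode(words, bio):
    chunks, current = [], None
    for w, t in zip(words, bio):
        if t.startswith("B-"):                    # a new chunk starts here
            if current:
                chunks.append(current)
            current = (t[2:], [w])
        elif t.startswith("I-") and current and current[0] == t[2:]:
            current[1].append(w)                  # continue the open chunk
        else:                                     # "O" (or inconsistent I-) closes it
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(ws)) for label, ws in chunks]

print(decode(words, bio))  # [('PER', 'John Smith'), ('LOC', 'New York')]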


Conclusions: Summary

• We have briefly outlined:

− The ML setting: “supervised learning for classification”

− Three concrete machine learning algorithms

− How to apply them to solve intermediate NLP tasks


Conclusions: Summary

• Any ML algorithm for NLP should be:

– Robust to noise and outliers

– Efficient in large feature/example spaces

– Adaptive to new/changing domains: portability, tuning, etc.

– Able to take advantage of unlabelled examples: semi-supervised learning


Conclusions: Summary

• Statistical and ML-based Natural Language Processing is a very active and multidisciplinary area of research


Conclusions: Some current research lines

• Appropriate learning paradigms for all kinds of NLP problems: TiMBL (DBZ99), TBEDL (Brill95), ME (Ratnaparkhi98), SNoW (Roth98), CRF (Pereira & Singer02)

• Definition of an adequate (and task-specific) feature space: mapping from the input space to a high-dimensional feature space, kernels, etc.

• Resolution of complex NLP problems: inference with classifiers + constraint satisfaction

• etc.


Conclusions: Bibliography

• You may find additional information at:

http://www.lsi.upc.es/~lluism/tesi.html
http://www.lsi.upc.es/~lluism/publicacions/pubs.html
http://www.lsi.upc.es/~lluism/cursos/talks.html
http://www.lsi.upc.es/~lluism/cursos/MLandNL.html
http://www.lsi.upc.es/~lluism/cursos/emnlp1.html

• This talk at:

http://www.lsi.upc.es/~lluism/udg03.ppt.gz
