Transcript of ppt

Page 1: ppt

1

Overview of Machine Learning for NLP Tasks: part I

(based partly on slides by Kevin Small and Scott Yih)

Page 2: ppt

Page 2

Goals of Introduction

Frame specific natural language processing (NLP) tasks as machine learning problems

Provide an overview of a general machine learning system architecture

Introduce a common terminology

Identify typical needs of ML system

Describe some specific aspects of our tool suite in regards to the general architecture

Build some intuition for using the tools

Focus here is on supervised learning

Page 3: ppt

Page 3

Overview

1. Some Sample NLP Problems
2. Solving Problems with Supervised Learning
3. Framing NLP Problems as Supervised Learning Tasks
4. Preprocessing: cleaning up and enriching text
5. Machine Learning System Architecture
6. Feature Extraction using FEX

Page 4: ppt

Page 4

Context Sensitive Spelling[2]

A word level tagging task:

I would like a peace of cake for desert.

I would like a piece of cake for dessert.

In principal, we can use the solution to the duel problem.

In principle, we can use the solution to the dual problem.

Page 5: ppt

Page 5

Part of Speech (POS) Tagging

Another word-level task:

Allen Iverson is an inconsistent player. While he can shoot very well, some nights he will score only a few points.

(NNP Allen) (NNP Iverson) (VBZ is) (DT an) (JJ inconsistent) (NN player) (. .) (IN While) (PRP he) (MD can) (VB shoot) (RB very) (RB well) (, ,) (DT some) (NNS nights) (PRP he) (MD will) (VB score) (RB only) (DT a) (JJ few) (NNS points) (. .)

Page 6: ppt

Page 6

Phrase Tagging

Named Entity Recognition – a phrase-level task:

After receiving his M.B.A. from Harvard Business School, Richard F. America accepted a faculty position at the McDonough School of Business (Georgetown University) in Washington.

After receiving his [MISC M.B.A.] from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] ([ORG Georgetown University]) in [LOC Washington].

Page 7: ppt

Page 7

Some Other Tasks

Text Categorization
Word Sense Disambiguation
Shallow Parsing
Semantic Role Labeling
Preposition Identification
Question Classification
Spam Filtering
...

Page 8: ppt

Page 8

Supervised Learning/SNoW

Page 9: ppt

Page 9

Learning Mapping Functions

Binary Classification: f : X → {0, 1}

Multi-class Classification: f : X → {0, 1, 2, ..., k}

Ranking: f : X → an ordering over Y

Regression: f : X → R

{Feature, Instance, Input} Space – space used to describe each instance; often X ⊆ R^d, {0,1}^d, or N^d

Output Space – space of possible output labels; very dependent on problem

Hypothesis Space – space of functions that can be selected by the machine learning algorithm; algorithm dependent (obviously)

Page 10: ppt

Page 10

Multi-class Classification[3,4]

One Versus All (OvA)

Constraint Classification

Prediction rule: y* = argmax_{y ∈ Y} f_y(x)
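The argmax rule above can be made concrete with a minimal sketch (illustrative Python, not SNoW's own code): keep one linear scorer per label and predict the highest-scoring label. The confusion-set labels and weight values here are toy assumptions.

# One-versus-all prediction sketch: one linear scorer per label,
# predict y = argmax_{y in Y} f_y(x). Weights are toy values.
labels = ["to", "too", "two"]          # assumed confusion set
weights = {                            # one weight vector per label
    "to":  [0.2, -0.1, 0.5],
    "too": [0.1,  0.4, -0.3],
    "two": [-0.2, 0.3,  0.1],
}

def score(w, x):
    """Linear score f_y(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(x):
    """Return the label whose scorer fires most strongly."""
    return max(labels, key=lambda y: score(weights[y], x))

print(predict([1.0, 0.0, 1.0]))   # -> "to" with these toy weights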

Page 11: ppt

Page 11

Online Learning[5]

SNoW algorithms include Winnow, Perceptron

Learning algorithms are mistake driven

Search for linear discriminant along function gradient (unconstrained optimization)

Provides the best hypothesis using the data presented up to the present example

Learning rate determines convergence

Too small and it will take forever
Too large and it will not converge
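A minimal sketch of a mistake-driven online learner in the spirit of Perceptron (not SNoW's implementation; the toy data and learning rate below are made up for illustration). The weights are only updated when the current hypothesis misclassifies the example, and the learning rate scales each correction.

# Mistake-driven Perceptron sketch: update only on errors,
# scaled by a learning rate (too small = slow, too large = unstable).
def train_perceptron(examples, dim, rate=0.1, epochs=5):
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in examples:               # y is +1 or -1
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:         # mistake: hypothesis disagrees with label
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b

# Toy linearly separable data (illustrative only).
data = [([1.0, 1.0], 1), ([0.0, 1.0], 1), ([1.0, 0.0], -1), ([0.0, 0.0], -1)]
print(train_perceptron(data, dim=2))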

Page 12: ppt

Page 12

Framing NLP Problems as Supervised Learning Tasks

Page 13: ppt

Page 13

Defining Learning Problems[6]

ML algorithms are mathematical formalisms and problems must be modeled accordingly

Feature Space – space used to describe each instance; often R^d, {0,1}^d, N^d

Output Space – space of possible output labels, e.g.

Set of part-of-speech tags
Correctly spelled word (possibly from a confusion set)

Hypothesis Space – space of functions that can be selected by the machine learning algorithm, e.g.

Boolean functions (e.g. decision trees)
Linear separators in R^d

Page 14: ppt

Page 14

Context Sensitive Spelling

Did anybody (else) want too sleep for to more hours this morning?

Output Space
Could use the entire vocabulary: Y = {a, aback, ..., zucchini}
Could also use a confusion set: Y = {to, too, two}

Model as (single label) multi-class classification

Hypothesis space is provided by SNoW
Need to define the feature space
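One hedged sketch of how such training examples could be obtained without hand labeling: every occurrence of a confusion-set word in text assumed to be correct becomes an example whose label is the word that actually appeared and whose input is the surrounding context. The window size and tokenization below are illustrative assumptions, not part of the tool suite.

# Turn occurrences of confusion-set words into (label, context) examples.
CONFUSION_SET = {"to", "too", "two"}

def make_examples(sentence, window=2):
    tokens = sentence.split()
    examples = []
    for i, tok in enumerate(tokens):
        if tok.lower() in CONFUSION_SET:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            examples.append((tok.lower(), left + right))   # label = observed word
    return examples

print(make_examples("Did anybody else want to sleep for two more hours this morning ?"))
# [('to', ['else', 'want', 'sleep', 'for']), ('two', ['sleep', 'for', 'more', 'hours'])]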

Page 15: ppt

Page 15

What are ‘feature’, ‘feature type’, anyway?

A feature type is any characteristic (relation) you can define over the input representation.

Example: feature TYPE = word bigrams

Sentence:

The man in the moon eats green cheese.

Features:

[The_man], [man_in], [in_the], [the_moon]….

In Natural Language Text, sparseness is often a problem

How many times are we likely to see “the_moon”?
How often will it provide useful information?
How can we avoid this problem?
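A small sketch of extracting word-bigram feature types of the kind shown above (an illustration of the idea, not FEX itself):

# Extract word-bigram features from a sentence.
def word_bigrams(sentence):
    tokens = sentence.rstrip(".").split()
    return ["%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:])]

print(word_bigrams("The man in the moon eats green cheese."))
# ['The_man', 'man_in', 'in_the', 'the_moon', 'moon_eats', 'eats_green', 'green_cheese']

Each individual bigram such as the_moon is rare, which is exactly the sparseness problem the slide raises; enriching the text with coarser information such as POS tags (next pages) is one common remedy.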

Page 16: ppt

Page 16

Preprocessing: cleaning up and enriching text

Assuming we start with plain text:

The quick brown fox jumped over the lazy dog. It landed on Mr. Tibbles, the slow blue cat.

Problems:
Often, want to work at the level of sentences, words
Where are sentence boundaries – ‘Mr.’ vs. ‘Cat.’?
Where are word boundaries – ‘dog.’ vs. ‘dog’?

Enriching the text: e.g. POS-tagging:

(DT The) (JJ quick) (NN brown) (NN fox) (VBD jumped) (IN over) (DT the) (JJ lazy) (NN dog) (. .)
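To see why ‘Mr.’ is hard, here is a naive period-based splitter next to one that consults an abbreviation list, analogous in spirit to the -d HONORIFICS option used on the next pages. This is only a sketch under assumed rules, not the distributed sentence-boundary.pl.

import re

HONORIFICS = {"Mr.", "Mrs.", "Dr.", "Prof."}   # assumed abbreviation list

def naive_split(text):
    # Splits after every '.', '!' or '?' -- breaks on "Mr. Tibbles".
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def split_with_honorifics(text):
    parts, out = naive_split(text), []
    for part in parts:
        # Re-join a fragment whose previous piece ended with an honorific.
        if out and out[-1].split()[-1] in HONORIFICS:
            out[-1] += " " + part
        else:
            out.append(part)
    return out

text = ("The quick brown fox jumped over the lazy dog. "
        "It landed on Mr. Tibbles, the slow blue cat.")
print(naive_split(text))            # 3 pieces: breaks after "Mr."
print(split_with_honorifics(text))  # 2 sentences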

Page 17: ppt

Page 17

Download Some Tools

http://l2r.cs.uiuc.edu/~cogcomp/
Software::tools, Software::packages

Sentence segmenter
Word segmenter
POS-tagger
FEX

NB: RIGHT-CLICK on “download” link

select “save link as...”

Page 18: ppt

Page 18

Preprocessing scripts

http://l2r.cs.uiuc.edu/~cogcomp/

sentence-boundary.pl:
./sentence-boundary.pl -d HONORIFICS -i nyttext.txt -o nytsentence.txt

word-splitter.pl:
./word-splitter.pl nytsentence.txt > nytword.txt

Invoking the tagger:
./tagger -i nytword.txt -o nytpos.txt

Check output

Page 19: ppt

Page 19

Problems running .pl scripts?

Check the first line: #!/usr/bin/perl

Find where perl is installed on your own machine. E.g. might need...

#!/local/bin/perl

Check file permissions...
> ls -l sentence-boundary.pl
> chmod 744 sentence-boundary.pl

Page 20: ppt

Page 20

Minor Problems with install

Possible (system-dependent) compilation errors:

doesn’t recognize ‘optarg’
POS-tagger: change Makefile in subdirectory snow/ where indicated
sentence-boundary.pl: try ‘perl sentence-boundary.pl’

Link error (POS tagger): linker can’t find -lxnet
remove ‘-lxnet’ entry from Makefile

Generally, check README, makefile for hints

Page 21: ppt

Page 21

The System View

Page 22: ppt

Page 22

A Machine Learning System

[Block diagram of a machine learning system: Raw Text → Preprocessing → Formatted Text → Feature Extraction → Feature Vectors; Feature Vectors plus Labels form Training Examples for the Machine Learner, which outputs Function Parameters to the Classifier(s); Testing Examples go to the Classifier(s), whose outputs feed Inference.]

Page 23: ppt

Page 23

Preprocessing Text

Sentence splitting, Word Splitting, etc.

Put data in a form usable for feature extraction

They recently recovered a small piece of a live Elvis concert recording. He was singing gospel songs, including “Peace in the Valley.”

0 0 0 They
0 0 1 recently
0 0 2 recovered
0 0 3 a
0 0 4 small
piece 0 5 piece
0 0 6 of
:
0 1 6 including
0 1 7 QUOTE
peace 1 8 Peace
0 1 9 in
0 1 10 the
0 1 11 Valley
0 1 12 .
0 1 13 QUOTE
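A hedged sketch of producing the column layout above (label, sentence index, word index, word) from already sentence- and word-split text. The reading of the columns follows this slide's example; the confusion-set handling is simplified and the variable names are assumptions.

# Emit "label sentence-index word-index word" lines, one token per line.
# A token from the confusion set becomes its own label; everything else gets 0.
CONFUSION_SET = {"piece", "peace"}

def to_column_format(sentences):
    lines = []
    for s_idx, sentence in enumerate(sentences):
        for w_idx, word in enumerate(sentence.split()):
            label = word.lower() if word.lower() in CONFUSION_SET else "0"
            lines.append("%s %d %d %s" % (label, s_idx, w_idx, word))
    return lines

sents = ["They recently recovered a small piece of a live Elvis concert recording .",
         "He was singing gospel songs , including QUOTE Peace in the Valley . QUOTE"]
print("\n".join(to_column_format(sents)))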

Page 24: ppt

Page 24

A Machine Learning System

[Diagram (front half of the system above): Raw Text → Preprocessing → Formatted Text → Feature Extraction → Feature Vectors.]

Page 25: ppt

Page 25

Feature Extraction with FEX

Page 26: ppt

Page 26

Feature Extraction with FEX

FEX (Feature Extraction tool) generates abstract representations of text input

Has a number of specialized modes suited to different types of problem

Can generate very expressive features
Works best when text is enriched with other knowledge sources

– i.e., need to preprocess text

S = I would like a piece of cake too!

FEX converts input text into a list of active features…

1: 1003, 1005, 1101, 1330…

Where each numerical feature corresponds to a specific textual feature:

1: label[piece]
1003: word[like] BEFORE word[a]

Page 27: ppt

Page 27

Feature Extraction

Converts formatted text into feature vectors

Lexicon file contains feature descriptions

0 0 0 They
0 0 1 recently
0 0 2 recovered
0 0 3 a
0 0 4 small
piece 0 5 piece
0 0 6 of
:
0 1 6 including
0 1 7 QUOTE
peace 1 8 Peace
0 1 9 in
0 1 10 the
0 1 11 Valley
0 1 12 .
0 1 13 QUOTE

0, 1001, 1013, 1134, 1175, 1206

1, 1021, 1055, 1085, 1182, 1252

Lexicon File

Page 28: ppt

Page 28

Role of FEX

Why won't you accept the facts?

No one saw her except the postman.

1, 1001, 1003, 1004, 1006:

2, 1002, 1003, 1005, 1006:

Feature Extraction (FEX)

lab[accept], w[you], w[the], w[you*], w[*the]

lab[except], w[her], w[the], w[her*], w[*the]
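The feature names above (w[you], w[the], w[you*], w[*the]) can be read as the words in a ±1 window around the target, once without and once with their side of the target marked by '*'. A sketch of that reading follows; it is an illustration inferred from this slide, not FEX's actual extraction code.

# Window-word features around a target, with and without location marking.
# The "*" stands for the target: "you*" = word just before it, "*the" = just after.
def window_features(tokens, target_idx, width=1):
    feats = ["lab[%s]" % tokens[target_idx]]
    for offset in range(-width, width + 1):
        pos = target_idx + offset
        if offset == 0 or not (0 <= pos < len(tokens)):
            continue
        word = tokens[pos]
        feats.append("w[%s]" % word)                               # location-insensitive
        feats.append("w[%s*]" % word if offset < 0 else "w[*%s]" % word)
    return feats

print(window_features(["Why", "wo", "n't", "you", "accept", "the", "facts", "?"], 4))
# ['lab[accept]', 'w[you]', 'w[you*]', 'w[the]', 'w[*the]']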

Page 29: ppt

Page 29

Four Important Files

FEX sits at the center of four files:

Script – 1. controls FEX’s behavior; 2. defines the “types” of features
Corpus – a new representation of the raw text data
Example – feature vectors for SNoW
Lexicon – mapping of feature and feature id

Page 30: ppt

Page 30

Corpus – General Linear Format

The corpus file contains the preprocessed input with a single sentence per line.

When generating examples, Fex never crosses line boundaries.

The input can be any combination of:
1st form: words separated by white spaces
2nd form: tag/word pairs in parentheses
There is a more complicated 3rd form, but it is deprecated in view of an alternative, more general format (later)

Page 31: ppt

Page 31

Corpus – Context Sensitive Spelling

Why won't you accept the facts?

(WRB Why) (VBD wo) (NN n't) (PRP you) (VBP accept) (DT the) (NNS facts) (. ?)

No one saw her except the postman.

(DT No) (CD one) (VBD saw) (PRP her) (IN except) (DT the) (NN postman) (. .)

Page 32: ppt

Page 32

Script – Means of Feature Engineering

Fex does not decide or find good features. Instead, Fex provides you an easy method to define the feature types and extract the corresponding features from the data.

Feature Engineering is in fact very important in practical learning tasks.

Page 33: ppt

Page 33

Script – Description of Feature Types

What can be good features? Let’s try some combinations of words and tags.

Feature types in mind:
Words around the target word (accept, except)
POS tags around the target
Conjunctions of words and POS tags?
Bigrams or trigrams?
Include relative locations?

Page 34: ppt

Page 34

Graphical Representation

Position:  0    1    2    3    4        5    6      7
Tag:       WRB  VBD  NN   PRP  VBP      DT   NNS    .
Word:      Why  won  't   you  accept   the  facts  ?
Offset:    -4   -3   -2   -1   0        1    2      3
                               (Target)

Window [-2,2] covers the two words on either side of the target.

Page 35: ppt

Page 35

Script – Syntax

Syntax: targ [inc] [loc]: RGF [[left-offset, right-offset]]

targ – target index
If targ is ‘-1’…

target file entries are used to identify the targets

If no target file is specified, then EVERY word is treated as a target

inc – use the actual target instead of the generic place-holder (‘*’)

loc – include the location of feature relative to the target

RGF – define “types” of features like words, tags, conjunctions, bigrams, trigrams, …, etc

left-offset and right-offset: specify the window range

Page 36: ppt

Page 36

Basic RGF’s – Sensors (1/2)

Type     Mnemonic   Interpretation                            Example
Word     w          the word (spelling)                       w[you]
Tag      t          part-of-speech tag                        t[NNP]
Vowel    v          active if the word starts with a vowel    v[eager]
Length   len        length of the word                        len[5]

Sensor is the fundamental method of defining “feature types.” It is applied on the element, and generates active features.
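A sketch of what word-level sensors of this kind might look like as plain functions (illustrative only; FEX's real sensors are C++ code in Sensors.h):

# Toy word-level sensors: each maps an element to zero or more active features.
def sensor_w(word, tag):      # Word sensor
    return ["w[%s]" % word]

def sensor_t(word, tag):      # Tag sensor
    return ["t[%s]" % tag]

def sensor_v(word, tag):      # Vowel sensor: active only if the word starts with a vowel
    return ["v[%s]" % word] if word[:1].lower() in "aeiou" else []

def sensor_len(word, tag):    # Length sensor
    return ["len[%d]" % len(word)]

for sensor in (sensor_w, sensor_t, sensor_v, sensor_len):
    print(sensor("eager", "JJ"))
# prints ['w[eager]'], ['t[JJ]'], ['v[eager]'], ['len[5]'] on separate lines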

Page 37: ppt

Page 37

Basic RGF’s – Sensors (2/2)

Type         Mnemonic   Interpretation                                 Example
City List    isCity     active if the phrase is the name of a city    isCity[Chicago]
Verb Class   vCls       returns Levin’s verb class                    vCls[51.2]

More sensors can be found by looking at FEX source (Sensors.h)

lab: a special RGF that generates labels lab(w), lab(t), …

Sensors are also an elegant way to incorporate our background knowledge.

Page 38: ppt

Page 38

Complex RGF’s

Existential Usage: len(x=3), v(X)

Conjunction and Disjunction: w&t; w|t

Collocation and Sparse Collocation:
coloc(w,w); coloc(w,t,w); coloc(w|t,w|t)
scoloc(t,t); scoloc(t,w,t); scoloc(w|t,w|t)

Page 39: ppt

Page 39

(Sparse) Collocation

Position:  0    1    2    3    4        5    6      7
Tag:       WRB  VBD  NN   PRP  VBP      DT   NNS    .
Word:      Why  won  't   you  accept   the  facts  ?
Offset:    -4   -3   -2   -1   0        1    2      3
                               (Target)

-1 inc: coloc(w,t)[-2,2]

w[‘t]-t[PRP], w[you]-t[VBP], w[accept]-t[DT], w[the]-t[NNS]

-1 inc: scoloc(w,t)[-2,2]

w[‘t]-t[PRP], w[‘t]-t[VBP], w[‘t]-t[DT], w[‘t]-t[NNS], w[you]-t[VBP], w[you]-t[DT], w[you]-t[NNS], w[accept]-t[DT], w[accept]-t[NNS], w[the]-t[NNS]
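A sketch of the difference shown above (illustrative, not FEX's implementation): coloc(w,t) pairs each window word with the tag of the next word, while scoloc(w,t) pairs it with the tag of every later word in the window.

# coloc vs. scoloc over the window [-2,2] around the target "accept".
window = [("'t", "NN"), ("you", "PRP"), ("accept", "VBP"),
          ("the", "DT"), ("facts", "NNS")]

def coloc_w_t(win):
    """Adjacent pairs: word_i with the tag of word_{i+1}."""
    return ["w[%s]-t[%s]" % (w1, t2) for (w1, _), (_, t2) in zip(win, win[1:])]

def scoloc_w_t(win):
    """Sparse pairs: word_i with the tag of every later word in the window."""
    return ["w[%s]-t[%s]" % (win[i][0], win[j][1])
            for i in range(len(win)) for j in range(i + 1, len(win))]

print(coloc_w_t(window))    # 4 features, as on the slide
print(scoloc_w_t(window))   # 10 features, as on the slide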

Page 40: ppt

Page 40

Examples – 2 Scripts

Download examples from tutorial page:

‘context sensitive spelling materials’ link

accept-except-simple.scr:
-1: lab(w)
-1: w[-1,1]

accept-except.scr:
-1: lab(w)
-1: w|t [-2,2]
-1 loc: coloc(w|t,w|t) [-3,-3]

Page 41: ppt

Page 41

Lexicon & Example (1/3)

Corpus:
… (NNS prices) (CC or) (VB accept) (JJR slimmer) (NNS profits) …

Script: ae-simple.scr
-1 lab(w); -1: w[-1,1]

Lexicon:
1 label[w[except]]
2 label[w[accept]]
1001 w[or]
1002 w[slimmer]

Example:
2, 1001, 1002;

Generated by lab(w)

Generated by w[-1,1]

Feature indices of lab start from 1.

Feature indices of regular features start from 1001.
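A sketch of how a lexicon could hand out those indices (labels numbered from 1, regular features from 1001, and previously seen entries reused so the same feature never gets two ids, mirroring the behavior described on the next slide). This is an illustration, not FEX's code.

# Toy lexicon: label features start at 1, regular features at 1001,
# and a feature seen before keeps its old index.
class Lexicon:
    def __init__(self):
        self.labels, self.features = {}, {}

    def label_id(self, name):
        return self.labels.setdefault(name, len(self.labels) + 1)

    def feature_id(self, name):
        return self.features.setdefault(name, len(self.features) + 1001)

lex = Lexicon()
print(lex.label_id("w[except]"))     # 1
print(lex.label_id("w[accept]"))     # 2
print(lex.feature_id("w[or]"))       # 1001
print(lex.feature_id("w[slimmer]"))  # 1002
print(lex.feature_id("w[or]"))       # 1001 again -- no duplicate index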

Page 42: ppt

Page 42

Lexicon & Example (2/3)

Target file (fex -t ae.targ …):
accept
except

Lexicon file
If the file does not exist, fex will create it.
If the file already exists, fex will first read it, and then append the new entries to this file.

This is important because we don’t want two different feature indices representing the same feature.

We treat only these two words as targets.

Page 43: ppt

Page 43

Lexicon & Example (3/3)

Example file
If the file does not exist, fex will create it.
If the file already exists, fex will append new examples to it.
Only active features and their corresponding lexicon items are generated.

If the read-only lexicon option is set, only those features from the lexicon that are present (active) in the current instance are listed.

Page 44: ppt

Page 44

Now practice – change script, run FEX, look at the resulting lexicon/examples

> ./fex -t ae.targ ae-simple.scr ae-simple.lex short-ae.pos short-ae.ex

Page 45: ppt

Page 45

Citations

1) F. Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.

2) A. R. Golding and D. Roth. A Winnow-Based Approach to Spelling Correction. Machine Learning, 34:107-130, 1999.

3) E. Allwein, R. Schapire, and Y. Singer. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. Journal of Machine Learning Research, 1:113-141, 2000.

4) S. Har-Peled, D. Roth, and D. Zimak. Constraint Classification: A New Approach to Multiclass Classification. In Proc. 13th Annual Intl. Conf. of Algorithmic Learning Theory, pp. 365-379, 2002.

5) A. Blum. On-Line Algorithms in Machine Learning. 1996.

Page 46: ppt

Page 46

Citations

6) T. Mitchell. Machine Learning. McGraw Hill, 1997.

7) A. Blum. Learning Boolean Functions in an Infinite Attribute Space. Machine Learning, 9(4):373-386, 1992.

8) J. Kivinen and M. Warmuth. The Perceptron Algorithm vs. Winnow: Linear vs. Logarithmic Mistake Bounds when few Input Variables are Relevant. UCSC-CRL-95-44, 1995.

9) T. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895-1923, 1998.