Machine Learning in Natural Language Processing

Post on 11-Jun-2015



Panel talk given to the ATL Data Science meet-up. http://www.meetup.com/Data-Science-ATL/events/205956952/

Transcript of Machine Learning in Natural Language Processing

Jinho D. Choi jinho.choi@emory.edu

Machine Learning in Natural Language Processing

Data Science ATL Meetup October 9th, 2014

Natural Language Processing


According to Wikipedia:

NLP is a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.

What are NLP tasks?

Natural Language Processing


John  bought  two  books  from  me   that  he   wanted
NNP   VBD     CD   NNS    IN    PRP  WDT   PRP  VBD

[Figure: the sentence above annotated for five tasks.]

Part-of-speech Tagging: the tag under each word.

Dependency Parsing: a tree over the words with arcs labeled nsubj, dobj, prep, num, rcmod, pobj, nsubj, dobj.

Semantic Role Labeling: arguments labeled agent, theme, source, theme, agent.

Semantic Understanding: start possession, end possession.

Coreference Resolution

How?

Rule-based Approach


if w[i].form == 'John': w[i].pos = 'noun'

if w[i].form == 'majors': w[i].pos = 'noun'

if w[i].form == 'majors' and w[i-1].form == 'two': w[i].pos = 'noun'

if w[i].form == 'studies' and w[i-1].pos == 'num': w[i].pos = 'noun'

Really?

Too specific!

Keep doing this?

Find the part-of-speech tag of each word.

Good.

John  has   two  majors
noun  verb  num  noun

John  majors  in    Math
noun  verb    prep  noun
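The rules on this slide can be tried out directly. The following is a minimal sketch, not code from the talk: the `Word` class and the `tag_with_rules` function are hypothetical names introduced for illustration, and the rules are applied in the order they appear on the slide.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical token type: a word form and its (initially empty) tag."""
    form: str
    pos: str = ""


def tag_with_rules(words):
    """Apply the four hand-written rules from the slide to each word in order."""
    for i, w in enumerate(words):
        prev = words[i - 1] if i > 0 else None
        if w.form == "John":
            w.pos = "noun"
        if w.form == "majors":
            w.pos = "noun"
        if w.form == "majors" and prev is not None and prev.form == "two":
            w.pos = "noun"
        if w.form == "studies" and prev is not None and prev.pos == "num":
            w.pos = "noun"
    return words


s1 = tag_with_rules([Word(f) for f in "John has two majors".split()])
s2 = tag_with_rules([Word(f) for f in "John majors in Math".split()])
print([(w.form, w.pos) for w in s1])
print([(w.form, w.pos) for w in s2])
```

The second rule tags "majors" as a noun in both sentences, although the slide labels it a verb in "John majors in Math". Every other word stays untagged until someone writes another rule for it.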

Machine Learning Approach


John  has   two  majors
noun  verb  num  noun

John  majors  in    Math
noun  verb    prep  noun

Extract features for each word.

w[i-1].form  w[i].form  w[i+1].form  w[i-1].f + w[i].f  w[i].f + w[i+1].f  Label
∅            John       has          ∅                  John_has           noun
two          majors     ∅            two_majors         ∅                  noun
John         majors     in           John_majors        majors_in          verb
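The feature templates in the table can be sketched as a small function. This is an illustrative assumption, not the speaker's implementation: `extract_features` is a hypothetical name, and `None` stands in for the empty value ∅ at sentence boundaries.

```python
def extract_features(forms, i):
    """Return the five string features from the table for the word at index i.

    `forms` is the list of word forms in one sentence. A missing neighbor
    (sentence boundary) yields None, matching the ∅ cells in the table.
    """
    prev = forms[i - 1] if i > 0 else None
    curr = forms[i]
    nxt = forms[i + 1] if i + 1 < len(forms) else None
    return {
        "w[i-1].form": prev,
        "w[i].form": curr,
        "w[i+1].form": nxt,
        "w[i-1].f+w[i].f": f"{prev}_{curr}" if prev is not None else None,
        "w[i].f+w[i+1].f": f"{curr}_{nxt}" if nxt is not None else None,
    }


first = extract_features("John has two majors".split(), 0)
last = extract_features("John has two majors".split(), 3)
verb = extract_features("John majors in Math".split(), 1)
print(first)
print(last)
print(verb)
```

The three calls correspond to the three rows of the table: "John" at the start of the first sentence, "majors" at its end, and "majors" used as a verb in the second sentence.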

Convert string features into vector.

[Figure: binary vectors indexed by the words John, has, two, majors, in, Math. Each vector shown has a single 1 among 0s, for example 0 1 0 0 0 0 and 0 0 0 0 1 0.]

Space?
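One common way to turn string features into vectors is to give every distinct "template=value" string its own column. The sketch below assumes that scheme; it may not be the exact encoding used in the talk. The helper names `build_index`, `to_dense` and `to_sparse` are hypothetical, and the three sample rows are taken from the feature table, restricted to the three single-word templates.

```python
def build_index(rows):
    """Assign a column number to each distinct 'template=value' string."""
    index = {}
    for feats in rows:
        for name, value in feats.items():
            if value is not None:
                index.setdefault(f"{name}={value}", len(index))
    return index


def to_sparse(feats, index):
    """Return only the column numbers whose value is 1."""
    keys = (f"{n}={v}" for n, v in feats.items() if v is not None)
    return sorted(index[k] for k in keys if k in index)


def to_dense(feats, index):
    """Return the full binary vector, one entry per known feature string."""
    vec = [0] * len(index)
    for col in to_sparse(feats, index):
        vec[col] = 1
    return vec


rows = [
    {"w[i-1].form": None, "w[i].form": "John", "w[i+1].form": "has"},
    {"w[i-1].form": "two", "w[i].form": "majors", "w[i+1].form": None},
    {"w[i-1].form": "John", "w[i].form": "majors", "w[i+1].form": "in"},
]
index = build_index(rows)
dense = [to_dense(r, index) for r in rows]
sparse = [to_sparse(r, index) for r in rows]
print(dense)
print(sparse)
```

The dense form grows with the number of distinct feature strings, while the sparse form stores only the few active columns per word, which is one answer to the space question raised on the slide.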

Issues with NLP Features


NLP tasks often deal with 1 to 10 million features.

These feature vectors are very sparse.

The values in these vectors are often binary.

Many features are redundant in some way.

Feature selection takes a long time.
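Sparsity and binary values together allow a cheap representation: a vector can be stored as the list of its active column numbers, and a linear model's score becomes a sum of the weights at those columns. The sketch below illustrates this under assumed names (`sparse_score`, `dense_score`) and made-up weights; none of it comes from the talk.

```python
def sparse_score(active, weights):
    """Score a binary sparse vector: sum the weights of its active columns.

    `active` lists the columns whose value is 1; `weights` maps a column
    number to its weight, with absent columns treated as weight 0.
    """
    return sum(weights.get(j, 0.0) for j in active)


def dense_score(vec, weights):
    """Same score computed from a full binary vector, for comparison."""
    return sum(x * weights.get(j, 0.0) for j, x in enumerate(vec))


# Made-up weights and a word with three active features out of one million.
n_features = 1_000_000
weights = {3: 0.5, 42: -1.25, 7: 2.0}
active = [3, 42, 999_999]

dense_vec = [0] * n_features
for j in active:
    dense_vec[j] = 1

print(sparse_score(active, weights))
print(dense_score(dense_vec, weights))
```

Both functions return the same score, but the sparse version touches three entries while the dense version walks all one million.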

Is machine learning easier or harder for NLP?