
A Survey of NLP Toolkits

Jing Jiang

Mar 8, 2007


Outline

• WordNet

• Statistics-based phrases

• POS taggers

• Parsers

• Chunkers (syntax-based phrases)

• NER (named entity recognition)

• SNoW, OpenNLP and LingPipe


Outline (cont.)

• What does the tool provide?

• Is the tool easy to use as a stand-alone program?

• Is the tool easy to modify or integrate with my program?


WordNet

• Background:
  – Princeton, George Miller, 1985
  – “WordNet: An Electronic Lexical Database”
  – Current version: WordNet 3.0

• What does it provide?
  – A database of words and their relations
    • Nouns, verbs, adjectives and adverbs
    • Lexical relations: morphology
    • Semantic relations: synonyms, hypernyms/hyponyms, holonyms/meronyms, etc.
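To make the hypernym/hyponym relation concrete, here is a minimal Python sketch over a small hand-made taxonomy. The words, the links between them, and the function names are all illustrative assumptions; nothing here is read from WordNet or uses its API.

```python
# Toy hypernym ("is-a") links, child -> parent.
# Illustrative only; these entries are not taken from WordNet.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "cat": "feline",
    "feline": "carnivore",
    "mammal": "animal",
}

def hypernym_chain(word):
    """Return the word followed by its successively more general hypernyms."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def hyponyms(word):
    """Return the direct hyponyms (more specific terms) of a word."""
    return sorted(w for w, parent in HYPERNYM.items() if parent == word)

print(hypernym_chain("dog"))
print(hyponyms("carnivore"))
```

Hypernyms are followed upward from specific to general, while hyponyms are found by inverting the same links.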


WordNet

• To use as a stand-alone program?
  – A command line program
  – Web interface

• To modify or integrate with my program?
  – API in C
  – Online manual not very clear (http://wordnet.princeton.edu/doc)
  – Interfaces in other languages (http://wordnet.princeton.edu/links#local)
    • Java
    • Perl
    • Many others


WordNet::Similarity

• Background
  – Ted Pedersen et al.

• What does it provide:
  – Semantic similarity between two words measured in various ways using WordNet
  – Need to understand the measures to make the best use

• Demo:
  – http://marimba.d.umn.edu/cgi-bin/similarity.cgi
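As an illustration of what one such measure computes, the sketch below implements a simple path-based similarity: the inverse of the number of nodes on the shortest is-a path between two concepts. It runs over a toy hierarchy invented for this example, is written in Python rather than Perl, and does not use the WordNet::Similarity module; the function names are hypothetical.

```python
# Toy is-a hierarchy (child -> parent); illustrative only, not WordNet data.
PARENT = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "horse": "mammal",
}

def ancestors(word):
    """Map each node on the path from word up to the root to its distance in edges."""
    dist, steps = {}, 0
    while True:
        dist[word] = steps
        if word not in PARENT:
            return dist
        word, steps = PARENT[word], steps + 1

def path_similarity(a, b):
    """Return 1 / (number of nodes on the shortest is-a path between a and b)."""
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    if not common:
        return 0.0  # no shared ancestor in this toy hierarchy
    edges = min(da[c] + db[c] for c in common)
    return 1.0 / (edges + 1)

print(path_similarity("dog", "cat"))
print(path_similarity("dog", "canine"))
```

Other measures weight the same hierarchy differently (for example by depth or by corpus-based information content), which is why the choice of measure matters.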


WordNet::Similarity

• To use as a stand-alone program?
  – A Perl script to call from command line
  – Web interface

• To modify or integrate with my program?
  – A Perl module
  – Online API with details and examples


Ngram Statistics Package

• What does it provide:
  – N-grams from a corpus ranked by a user-selected statistical measure of association (e.g. mutual information, chi-squared test)
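The ranking idea can be sketched in a few lines of Python, using pointwise mutual information (PMI) as the association measure. This is only an illustration of the technique on a made-up sentence; it is not NSP's code, and NSP's own measures may differ in their details. The function name and the `min_count` threshold are assumptions.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI estimates are unreliable for rare pairs
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

text = "new york is big . new york is busy . the city is big ."
ranked = rank_bigrams_by_pmi(text.split())
for pair, score in ranked:
    print(pair, round(score, 2))
```

A pair scores high when it occurs together more often than its individual word frequencies would predict, which is what makes it a collocation candidate.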


Ngram Statistics Package

• To use as a stand-alone program?
  – count.pl, statistic.pl
  – Input can be flat text
  – Regular expressions to define tokens can be specified by the user

• To modify or integrate with my program?
  – Perl module
  – Online API with details and examples
  – User can define new statistical measures of association


LingPipe: Significant Phrases

• What does it provide:
  – Collocations (similar to NSP)
  – Relatively new terms
    • Foreground vs. background
    • Web application: Amazon “SIPs”, Yahoo “Buzz Index”, Google “in the news”
    • http://www.alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html
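The foreground vs. background idea can be illustrated with a simple frequency ratio: a term is "new" if it is much more frequent in the recent (foreground) text than in the older (background) text. The sketch below is an assumption-laden stand-in written in Python; it is not LingPipe's scoring method or API, and the add-one smoothing and function name are choices made for this example.

```python
from collections import Counter

def novelty_scores(foreground, background, smoothing=1.0):
    """Score each foreground term by its smoothed relative-frequency ratio
    against the background corpus; a higher score means a 'newer' term."""
    fg, bg = Counter(foreground), Counter(background)
    vocab = set(fg) | set(bg)
    fg_total = len(foreground) + smoothing * len(vocab)
    bg_total = len(background) + smoothing * len(vocab)
    scores = {}
    for term, count in fg.items():
        p_fg = (count + smoothing) / fg_total
        p_bg = (bg[term] + smoothing) / bg_total
        scores[term] = p_fg / p_bg
    return scores

background = "the market rose the market fell".split()
foreground = "the podcast market rose the podcast boom".split()
scores = novelty_scores(foreground, background)
print(max(scores, key=scores.get))
```

Smoothing keeps terms that never appear in the background from producing a division by zero.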


POS Taggers

• What do they provide?
  – POS tags

• How many POS tags are there?
  – Penn Treebank Tag Set: http://www.cis.upenn.edu/~treebank/
  – Which tags are useful to your task?


Brill Tagger

• Background
  – Eric Brill, PhD thesis, U Penn, 1993
  – Transformation-based error-driven learning

• Accuracy and speed
  – ~96% accuracy
  – ~5000 sentences in ~4 seconds
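A transformation-based tagger first assigns each word an initial tag and then applies an ordered list of learned correction rules. The Python sketch below shows only the application step, with a tiny hand-written lexicon and a single hand-written rule ("change NN to VB when the previous tag is TO"). The lexicon, the rule list, and the function names are assumptions for illustration; the error-driven learning of rules from a corpus is not shown, and this is not the Brill Tagger's code.

```python
# Hypothetical lexicon of most-frequent tags; a real tagger learns this from a corpus.
MOST_FREQUENT_TAG = {"we": "PRP", "want": "VBP", "to": "TO", "race": "NN",
                     "the": "DT", "is": "VBZ", "long": "JJ"}

# Each rule is (from_tag, to_tag, required_previous_tag).
RULES = [("NN", "VB", "TO")]

def initial_tags(words):
    """Step 1: tag every word with its most frequent tag (NN if unknown)."""
    return [MOST_FREQUENT_TAG.get(w.lower(), "NN") for w in words]

def apply_rules(tags, rules):
    """Step 2: apply each transformation in order, scanning left to right."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

words = "We want to race".split()
tags = apply_rules(initial_tags(words), RULES)
print(list(zip(words, tags)))
```

Here "race" starts as a noun and is corrected to a verb because it follows "to", while "The race is long" would keep the noun tag.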


Brill Tagger

• To use as a stand-alone program?
  – Call from command line
  – Input must be one sentence per line, tokenized
    • E.g. We ’re going today , are you ?

• To modify or integrate with my program?
  – No API
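Because the tagger expects pre-tokenized input, raw text has to be prepared first. The following is a rough Python sketch of such a preprocessing step, assuming straight apostrophes and a small set of English clitics; the regular expressions and function names are illustrative assumptions, not part of the Brill Tagger distribution.

```python
import re

def tokenize(sentence):
    """Split off clitics such as 're and n't, and separate punctuation marks."""
    # Detach clitics: "We're" -> "We 're", "don't" -> "do n't".
    sentence = re.sub(r"(\w)(n't|'(?:re|s|m|ve|ll|d))\b", r"\1 \2", sentence)
    # Surround punctuation with spaces, but leave decimal points such as 1.8 alone.
    sentence = re.sub(r"([,!?;:]|\.(?!\d))", r" \1 ", sentence)
    return sentence.split()

def to_tagger_input(sentences):
    """Produce one tokenized sentence per line."""
    return "\n".join(" ".join(tokenize(s)) for s in sentences)

print(to_tagger_input(["We're going today, are you?", "I don't know."]))
```

Real text needs more care (abbreviations, quotes, hyphens), so a tested tokenizer is preferable for anything beyond a quick experiment.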


Charniak Parser

• Background
  – Eugene Charniak, Brown University
  – State-of-the-art

• What does it provide?
  – Syntactic parse tree


Charniak Parser

• To use as a stand-alone program?
  – Call from command line
  – Input must be one sentence per line

• To modify or integrate with my program?
  – No API


Collins Parser

• Background
  – Michael Collins, PhD thesis, U Penn, 1999
  – Head-driven statistical models

• What does it provide?
  – Syntactic parse trees
  – Head word for each production (dependency relations, but no relation labels)
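The link between head words and unlabeled dependencies can be shown with a short Python sketch: once every production marks which child is its head, each non-head child's head word depends on the head child's head word. The tree encoding, the head positions, and the function names below are assumptions made for this example; they do not reproduce the Collins parser's output format.

```python
# A leaf is (tag, word); an internal node is (label, head_index, children).
# Head positions here are chosen by hand for illustration.
TREE = ("S", 1, [
    ("NP", 1, [("DT", "the"), ("NN", "dog")]),
    ("VP", 0, [("VBD", "chased"),
               ("NP", 1, [("DT", "a"), ("NN", "cat")])]),
])

def is_leaf(node):
    return len(node) == 2

def head_word(node):
    """Follow head children down to the lexical head of a constituent."""
    while not is_leaf(node):
        _, head_index, children = node
        node = children[head_index]
    return node[1]

def dependencies(node):
    """Return unlabeled (head, dependent) pairs, one per non-head child."""
    if is_leaf(node):
        return []
    _, head_index, children = node
    head = head_word(children[head_index])
    pairs = []
    for i, child in enumerate(children):
        if i != head_index:
            pairs.append((head, head_word(child)))
        pairs.extend(dependencies(child))
    return pairs

print(dependencies(TREE))
```

The pairs say which word depends on which, but nothing about the type of relation (subject, object, and so on), which matches the "no relation labels" caveat.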


Collins Parser

• To use as a stand-alone program?
  – Call from command line
  – Input must be one sentence per line, tokenized, POS tagged

• To modify or integrate with my program?
  – No API


MiniPar

• Background
  – Dekang Lin, U Alberta

• What does it provide?
  – Dependency parse trees
  – Dependency relation labels

• Accuracy and speed
  – ~88% precision, ~80% recall for dependency relations
  – 300 words / second (Pentium II 300, 128MB)
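Precision and recall over dependency relations are computed by comparing the parser's relations with a gold standard: precision is the fraction of predicted relations that are correct, recall the fraction of gold relations that were found. The Python sketch below uses invented triples purely to show the arithmetic; it is not the evaluation behind the figures quoted above.

```python
def precision_recall(predicted, gold):
    """Compare two collections of (head, relation, dependent) triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Made-up triples for illustration only.
gold = {("say", "subj", "jury"), ("say", "obj", "produce"),
        ("jury", "det", "the"), ("produce", "subj", "investigation"),
        ("investigation", "det", "an")}
predicted = {("say", "subj", "jury"), ("jury", "det", "the"),
             ("produce", "subj", "investigation"), ("say", "mod", "friday")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

A parser with higher precision than recall, as reported here, tends to leave relations out rather than guess them wrongly.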


Examples of Dependency Relations

• The Fulton County Grand Jury said Friday an investigation of Atlanta 's recent primary election produced…

• say V:s:N Fulton County Grand Jury
• Fulton County Grand Jury N:det:Det the
• Fulton County Grand Jury N:lex-mod:U Fulton
• Fulton County Grand Jury N:lex-mod:U County
• Fulton County Grand Jury N:lex-mod:U Grand
• say V:subj:N Fulton County Grand Jury
• say V:guest:N Friday
• produce V:s:N investigation
• investigation N:det:Det an
• investigation N:mod:Prep of
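Each relation above has the shape "head category:relation:category dependent". A small Python sketch for reading lines in that shape into structured records follows. It assumes the layout shown on this slide, which may not match MiniPar's raw output exactly; the `Dependency` type and `parse_triple` function are hypothetical names.

```python
import re
from typing import NamedTuple

class Dependency(NamedTuple):
    head: str
    head_cat: str
    relation: str
    dep_cat: str
    dependent: str

# head words, then Cat:relation:Cat, then dependent words.
TRIPLE = re.compile(r"^(.+?)\s+(\w+):([\w-]+):(\w+)\s+(.+)$")

def parse_triple(line):
    """Parse one 'head Cat:rel:Cat dependent' line into a Dependency."""
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dependency triple: {line!r}")
    return Dependency(*match.groups())

lines = [
    "say V:subj:N Fulton County Grand Jury",
    "Fulton County Grand Jury N:det:Det the",
    "investigation N:mod:Prep of",
]
for dep in map(parse_triple, lines):
    print(dep.relation, "|", dep.head, "->", dep.dependent)
```

Keeping the relation label as its own field makes it easy to filter, for example, for subject relations only.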


MiniPar

• To use as a stand-alone program?
  – A command line program
  – Input must be one sentence per line

• To modify or integrate with my program?
  – API in C
  – Parse tree and dependency relations are stored in some data structure for easy access


Comparison of Parsers

• Accuracy:
  – Charniak > Collins > MiniPar

• Dependency relations:
  – Collins, MiniPar

• Dependency relation labels:
  – MiniPar

• Speed:
  – MiniPar


Chunkers (Shallow Parsers)

• What do they provide?
  – Phrase structure of a sentence
  – E.g. [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September]

• Compare with collocations
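Bracketed chunker output like the example above is easy to post-process. The Python sketch below reads the bracket notation into (label, words) pairs and pulls out the noun phrases. It assumes flat, non-nested brackets exactly as shown on the slide; the function names are hypothetical and not tied to any particular chunker.

```python
import re

# Matches "[LABEL word word ...]" with no nested brackets.
CHUNK = re.compile(r"\[(\w+)\s+([^\]]+)\]")

def parse_chunks(text):
    """Return a list of (label, [words]) pairs from bracketed chunker output."""
    return [(label, words.split()) for label, words in CHUNK.findall(text)]

def noun_phrases(text):
    """Return the NP chunks as plain strings."""
    return [" ".join(words) for label, words in parse_chunks(text) if label == "NP"]

output = ("[NP He] [VP reckons] [NP the current account deficit] "
          "[VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September]")
print(noun_phrases(output))
```

Unlike statistically ranked collocations, these phrases come from the syntax of a single sentence, so a phrase is returned even if it occurs only once in the corpus.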


Named Entity Recognizers

• What do they provide?
  – Named entities of various pre-defined types (e.g. Person, Location, Organization, Number, etc.)


SNoW-based Tools

• Use SNoW as the underlying learner

• In C++

• API available for many components


SNoW-based Tools

• Sentence splitter

• Tokenizer

• POS tagger

• Dependency parser

• Chunker

• NE tagger

• SRL (semantic role labeling)


OpenNLP

• Java-based, open source project

• Maximum entropy models

• Pipeline structure
  – Sentence detector → tokenizer → POS tagger → chunker

• Java API
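The pipeline structure means each component consumes the previous component's output. The sketch below mimics that flow in Python with deliberately naive stand-ins for each stage (regular expressions and a lookup table instead of maximum entropy models). It is not OpenNLP's Java API; every function name and the tiny lexicon are assumptions for illustration.

```python
import re

def detect_sentences(text):
    """Naive sentence detector: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag(tokens):
    """Placeholder tagger: a tiny lookup table, defaulting to NN."""
    lexicon = {"the": "DT", "a": "DT", "dog": "NN", "cat": "NN",
               "barks": "VBZ", "sleeps": "VBZ", ".": "."}
    return [(t, lexicon.get(t.lower(), "NN")) for t in tokens]

def chunk(tagged):
    """Placeholder chunker: group runs of DT/NN tokens into NP chunks."""
    chunks, current = [], []
    for token, pos in tagged:
        if pos in ("DT", "NN"):
            current.append(token)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((pos, [token]))
    if current:
        chunks.append(("NP", current))
    return chunks

def pipeline(text):
    """Sentence detector -> tokenizer -> POS tagger -> chunker."""
    return [chunk(tag(tokenize(s))) for s in detect_sentences(text)]

result = pipeline("The dog barks. A cat sleeps.")
for sentence in result:
    print(sentence)
```

One consequence of this design is that errors propagate: a wrong sentence boundary or token affects every later stage.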


OpenNLP

• Sentence boundary detector

• Tokenizer

• POS tagger

• Chunker

• Parser

• Name Finder

• Coreference


LingPipe

• Java-based libraries for various kinds of linguistic analysis

• http://www.alias-i.com/lingpipe/index.html