
CSA4050: Information Extraction II (December 2004)

Page 1

CSA405: Advanced Topics in NLP

Information Extraction II

Named Entity Recognition

Page 2

Sources

– D. Appelt and D. Israel, Introduction to IE Technology, tutorial given at IJCAI-99

– Mikheev et al EACL 1999: Named Entity Recognition without Gazetteers

– Daniel M. Bikel, Richard Schwartz and Ralph M. Weischedel. 1999. An Algorithm that Learns What’s in a Name

Page 3

Outline

• NER – what is involved

• The MUC6/7 task definition

• Two approaches:
– Mikheev 1999 (rule-based)
– Bikel 1999 (NER based on HMMs)

Page 4

The Named Entity Recognition Task

• Named Entity task introduced as part of MUC-6 (1995), and continued at MUC-7 (1998)

• Different kinds of named entity:
– temporal expressions
– numeric expressions
– name expressions

Page 5

Temporal Expressions (TIMEX tag)

• DATE: complete or partial date expression

• TIME: complete or partial expression of time of day

• Absolute temporal expressions only, e.g.
– "Monday"
– "10th of October"
– but not "first day of the month"

Page 6

More TIMEX Examples

• "twelve o'clock noon" <TIMEX TYPE="TIME">twelve o'clock noon</TIMEX>

• "January 1990" <TIMEX TYPE="DATE">January 1990</TIMEX>

• "third quarter of 1991" <TIMEX TYPE="DATE">third quarter of 1991</TIMEX>

• "the fourth quarter ended Sept. 30" <TIMEX TYPE="DATE">the fourth quarter ended Sept. 30</TIMEX>

Page 7

Time Expressions - Difficulties

• Problems interpreting some task specs: "Relative time expressions are not to be tagged, but any absolute times expressed as part of the entire expression are to be tagged"
– this <TIMEX TYPE="DATE">June</TIMEX>

– thirty days before the end of the year (no markup)

– the end of <TIMEX TYPE="DATE">1991</TIMEX>

Page 8

Temporal Expressions

• DATE/TIME distinction relatively straightforward to handle

• Can typically be captured by Regular Expressions

• Need to handle missing elements properly, e.g. "Jan 21st" vs. "Jan 21st 2002"
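To make the regular-expression point concrete, here is a minimal Python sketch (not from the original slides; the patterns and the tag_dates function are illustrative only) that tags day-month dates with an optional year, so that both "Jan 21st" and "Jan 21st 2002" are covered:

import re

# Month name, day with optional ordinal suffix, and an optional four-digit year.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
DAY = r"\d{1,2}(?:st|nd|rd|th)?"
YEAR = r"\d{4}"
DATE_RE = re.compile(r"\b" + MONTH + r"\s+" + DAY + r"(?:,?\s+" + YEAR + r")?")

def tag_dates(text):
    # Wrap every match in a TIMEX element of TYPE DATE.
    return DATE_RE.sub(lambda m: '<TIMEX TYPE="DATE">' + m.group(0) + '</TIMEX>', text)

print(tag_dates("The deal closed on Jan 21st and was reported on Jan 21st 2002."))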

Page 9

Number Expressions (NUMEX)

• Monetary expressions

• Percentages.

• Numbers may be expressed in either numeric or alphabetic form.

• Categorized as “MONEY” or “PERCENT” via the TYPE attribute.

Page 10

NUMEX Tag

• The entire string is to be tagged. <NUMEX TYPE="MONEY">20 million New Pesos</NUMEX>

• Modifying words are to be excluded from the NUMEX tag. over <NUMEX TYPE="MONEY">$90,000</NUMEX>

• Nested tags allowed <NUMEX TYPE="MONEY"><ENAMEX TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>

• Numeric expressions that do not use currency/percentage terms are not to be tagged: 12 points (no markup)

Page 11

NUMEX Examples

• "about 5%" about <NUMEX TYPE="PERCENT">5%</NUMEX>

• "over $90,000" over <NUMEX TYPE="MONEY">$90,000</NUMEX>

• "several million dollars" <NUMEX TYPE="MONEY" ALT="million dollars">several million dollars</NUMEX>

• "US$43.6 million" <NUMEX TYPE="MONEY"><ENAMEX TYPE="LOCATION">US</ENAMEX>$43.6 million</NUMEX>

Page 12

Name Expressions

• Two related subtasks:
– Identification – which piece of text
– Classification – what kind of name

Page 13

Name Recognition: Identification and Classification

• The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic.
– Locations

– Persons

– Organizations

Page 14

Annotator Guidelines

TYPE           DESCRIPTION
Organisation   Named corporate, governmental, or other organizational entity
Person         Named person or family
Location       Name of a politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)

Page 15

MUC-6 Output Format

• Output in terms of SGML markup:

  <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX>

  (ENAMEX is the tag; TYPE="ORGANIZATION" is the type attribute)

Page 16

Name Expressions: Problems

• Recognition
– Sentence-initial uppercase is unreliable

• Delimitation
– Conjunctions: to bind or not to bind, e.g. Victoria and Albert (Museum)

• Type Ambiguity
– Persons versus Organisations versus Locations, e.g. J. Arthur Rank, Washington

Page 17

Example 2

1. MATSUSHITA ELECTRIC INDUSTRIAL CO . HAS REACHED AGREEMENT …

2. IF ALL GOES WELL, MATSUSHITA AND ROBERT BOSCH WILL …

3. VICTOR CO. OF JAPAN ( JVC ) AND SONY CORP.

4. IN A FACTORY OF BLAUPUNKT WERKE , A ROBERT BOSCH SUBSIDIARY , …

5. TOUCH PANEL SYSTEMS , CAPITALIZED AT 50 MILLION YEN, IS OWNED …

6. MATSUSHITA EILL DECIDE ON THE PRODUCTION SCALE. …

Page 18

Example 2

1. EASY – keyword present

2. EASY – shortened form is computable

3. EASY – acronym is computable

4. HARD – difficult to tell ROBERT BOSCH is an organisation name

5. HARD – cf. 4.

6. HARD – spelling error difficult to spot.

Page 19

Name Expressions: Sources of Information

• Occurrence-specific
– capitalisation; presence of immediately surrounding clue words (e.g. Mr.)

• Document-specific
– previous mention of a name (cf. symbol tables)
– same document; same collection

• External
– gazetteers: e.g. person names; place names; zip codes

Page 20

Gazetteers

• System that recognises only entities stored in its lists (gazetteers).

• Advantages - Simple, fast, language independent, easy to retarget (just create lists)

• Disadvantages – impossible to enumerate all names, cannot deal with name variants, cannot resolve ambiguity.
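A purely list-based recogniser of this kind can be sketched in a few lines of Python (the tiny gazetteer and the tag function below are invented for illustration); its shortcomings are exactly the disadvantages listed above:

# Longest-match lookup against fixed lists; nothing outside the lists is found.
GAZETTEER = {
    ("Robert", "Bosch"): "ORGANIZATION",
    ("Sony", "Corp."): "ORGANIZATION",
    ("Sarajevo",): "LOCATION",
    ("Radovan", "Karadzic"): "PERSON",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def tag(tokens):
    i, out = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):  # try longest span first
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                out.append((" ".join(span), GAZETTEER[span]))
                i += n
                break
        else:
            out.append((tokens[i], None))
            i += 1
    return out

print(tag("talks with Bosnian Serb leader Radovan Karadzic".split()))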

Page 21

Gazetteers

• Limited availability

• Maintenance (organisations change)

• Criteria for building effective gazetteers are unclear, e.g. size.

• Better to use small gazetteers of well-known names than large ones of low-frequency names (Mikheev et al. 1999).

Page 22

Sources for Creation of Gazetteers

• Yellow pages for person and organisation names.

• US GEOnet Names Server (GNS) data – 3.9 million locations with 5.37 million names: http://earth-info.nga.mil/gns/html/

• UN site: http://unstats.un.org/unsd/citydata

• Automatic collection from annotated training data

Page 23

Recognising Names

• Two main approaches

• Rule-based systems
– usually based on finite-state (FS) methods

• Automatically trained systems
– usually based on HMMs

• Rule based systems tend to have a performance advantage

Page 24

Mikheev et al 1999

• How important are gazetteers?

• Is it important that they are big?

• If gazetteers are important but their size isn't, what are the criteria for building them?

Page 25

Mikheev – Experiment

• Learned lists
– training data (200 articles from MUC-7)
– 1228 persons, 809 organisations, 770 locations

• Common lists
– CIA World Factbook
– 33K organisations, 27K persons, 5K locations

• Combined

Page 26

Mikheev – Results of Experiment

Page 27

Mikheev’s System

• Hybrid approach – c. 100 rules

• Rules make heavy use of capitalisation

• Rules based on internal structure which reveals the type, e.g.
  Word Word plc
  Prof. Word Word

• Modest but well-chosen gazetteers – 5,000 company names, 1,000 human names, 20,000 locations; 2-3 weeks of effort

Page 28

Mikheev et al. (1999): Architecture

1. Sure-fire Rules
2. Partial Match 1
3. Rule Relaxation
4. Partial Match 2
5. Title Assignment

Page 29

Sure-Fire Rules

• Fire when a possible candidate expression is surrounded by a suggestive context
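The patterns below are an invented Python illustration of what such sure-fire contextual rules might look like (they are not Mikheev et al.'s actual rule set): a capitalised candidate next to a highly suggestive keyword or title is assigned its class directly.

import re

CAP = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"   # one or more capitalised words
SURE_FIRE = [
    (re.compile(r"\b(" + CAP + r" (?:Ltd\.?|plc|Co\.))"), "ORGANIZATION"),
    (re.compile(r"\b(?:Mr|Prof|Dr)\. (" + CAP + r")"), "PERSON"),
    (re.compile(r"\bshares in (" + CAP + r")"), "ORGANIZATION"),
]

def sure_fire(text):
    hits = []
    for pattern, label in SURE_FIRE:
        hits += [(m.group(1), label) for m in pattern.finditer(text)]
    return hits

print(sure_fire("Prof. Adam Kluver founded Adam Kluver Ltd."))
# [('Adam Kluver Ltd.', 'ORGANIZATION'), ('Adam Kluver', 'PERSON')]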

Page 30

Partial Match 1

• Collect all named entities already identified, e.g. Adam Kluver Ltd.

• Generate all subsequences: Adam, Adam Kluver; Kluver, Kluver Ltd, Ltd.

• Check for occurrences of these subsequences and mark them as possible items of the same class as the original named entity

• Check against pre-trained maximum entropy model.
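A sketch of the subsequence step in Python (the maximum entropy filter itself is not shown; the function name is illustrative):

def subsequences(name):
    # All contiguous sub-phrases of an already identified name, excluding the name itself.
    words = name.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)} - {name}

print(subsequences("Adam Kluver Ltd."))
# {'Adam', 'Adam Kluver', 'Kluver', 'Kluver Ltd.', 'Ltd.'}  (set order may vary)

Each subsequence found elsewhere in the document would then be proposed as a candidate of the same class and passed to the maximum entropy model described on the next slide.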

Page 31

Maximum Entropy Model

• This model takes into account contextual information for named entities:
– sentence position
– whether they exist in lowercase in general
– whether they are used in lowercase elsewhere in the same document, etc.

• These features are passed to the model as attributes of the partially matched words.

• If the model provides a positive answer for a partial match, the system makes a definite assignment.

Page 32

Rule Relaxation

• More relaxed contextual constraints

• Make use of information from existing markup and from previous stages to
– resolve conjunctions within named entities, e.g. China Import and Export Co.
– resolve ambiguity, e.g. Murdoch’s News Corp

Page 33

Partial Match 2

• Handle single-word names not covered by Partial Match 1 (e.g. Hughes – Hughes Communication Ltd)

• United States and Russia: if there is evidence for two items and one has already been tagged "Location", then it is likely that XXX and YYY are of the same type. Hence conclude that United States is of type Location.

Page 34

Title Assignment

• Newswire titles are uppercase

• Mark up entities in title by matching or partially matching entities found in text

Page 35

Mikheev: System Results

Page 36

Use of Gazetteers

Page 37

Mikheev - Conclusions

• Locations suffer without gazetteers, but the addition of a small number of key entries (e.g. country names) makes a big difference.

• Main point: relatively small gazetteers are sufficient to give good precision and recall.

• Experiments were carried out on a particular text type (journalistic English with mixed case).

Page 38

Bikel 99 – Trainable Systems: Hidden Markov Models

• An HMM is a probabilistic model of a sequence of events – in this case, words.

• Whether a word is part of a name is an event with an estimable probability that can be determined from a training corpus.

• With an HMM we assume that there is an underlying probabilistic FSM that changes state with each input event.

• The probability that a word is part of a name is also conditional on the state of the machine.

Page 39

Creating HMMs

• Constructing an HMM depends upon:
– having a good hidden state model
– having enough training data to estimate the probabilities of the state transitions given sequences of words

• When the recogniser is run, it computes the maximum likelihood path through the hidden state model, given the input word sequence.

• Viterbi Algorithm finds the path.
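The following is a generic Viterbi sketch in Python (the textbook dynamic-programming algorithm, not Bikel's exact model; the toy states and probability tables are invented for illustration):

import math

def viterbi(words, states, log_start, log_trans, log_emit):
    # Most likely state sequence for `words`. log_start[s], log_trans[(prev, s)]
    # and log_emit[(s, w)] are log probabilities; missing entries count as impossible.
    NEG = float("-inf")
    V = [{s: log_start.get(s, NEG) + log_emit.get((s, words[0]), NEG) for s in states}]
    back = [{}]
    for w in words[1:]:
        col, ptrs = {}, {}
        for s in states:
            score, prev = max(
                ((V[-1][p] + log_trans.get((p, s), NEG) + log_emit.get((s, w), NEG), p)
                 for p in states), key=lambda t: t[0])
            col[s], ptrs[s] = score, prev
        V.append(col)
        back.append(ptrs)
    state = max(V[-1], key=V[-1].get)          # best final state
    path = [state]
    for ptrs in reversed(back[1:]):            # follow the back-pointers
        state = ptrs[state]
        path.append(state)
    return list(reversed(path))

log = math.log
states = ["PERSON", "NOT-A-NAME"]
log_start = {"PERSON": log(0.2), "NOT-A-NAME": log(0.8)}
log_trans = {("PERSON", "PERSON"): log(0.1), ("PERSON", "NOT-A-NAME"): log(0.9),
             ("NOT-A-NAME", "PERSON"): log(0.2), ("NOT-A-NAME", "NOT-A-NAME"): log(0.8)}
log_emit = {("NOT-A-NAME", "Mr."): log(0.1), ("PERSON", "Mr."): log(0.0001),
            ("PERSON", "Jones"): log(0.05), ("NOT-A-NAME", "Jones"): log(0.001),
            ("NOT-A-NAME", "eats"): log(0.1), ("PERSON", "eats"): log(0.0001)}
print(viterbi(["Mr.", "Jones", "eats"], states, log_start, log_trans, log_emit))
# ['NOT-A-NAME', 'PERSON', 'NOT-A-NAME']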

Page 40

The HMM for NER (Bikel)

(Diagram: one state per name class – organisation, person, other name classes, not-a-name – plus start-of-sentence and end-of-sentence states)

Page 41

Name Class Categories

• Eight name classes + not-a-name (NAN).

• Within each name class, use a bigram language model (the number of states in each class is the vocabulary size V).

• Aim, for a given sentence, is to find the most likely sequence of name-classes (NC) given a sequence of words (W):

• NC = argmax(P(NC|W))

Page 42

Model of Word Production

• Select a name class NC, conditioning on the previous name-class (NC-1) and previous word w-1.

• Generate the first word inside NC, conditioning on NC and NC-1.

• Generate all subsequent words inside NC, where each subsequent word is conditioned on its immediate predecessor (using standard bigram language model).

Page 43

Example

• Sentence: Mr. Jones eats.

• According to MUC-6 rules, the correct labelling is
  Mr. <ENAMEX TYPE="PERSON">Jones</ENAMEX> eats.
  i.e. the name-class sequence NAN PERSON NAN.

• According to the model, the likelihood of this word/name-class sequence is given by the following expression (which should turn out to be the most likely, given sufficient training).

Page 44

Likelihood Under the Model

Pr(NOT-A-NAME | START-OF-SENTENCE, “+end+”) *
Pr(“Mr.” | NOT-A-NAME, START-OF-SENTENCE) *
Pr(+end+ | “Mr.”, NOT-A-NAME) *
Pr(PERSON | NOT-A-NAME, “Mr.”) *
Pr(“Jones” | PERSON, NOT-A-NAME) *
Pr(+end+ | “Jones”, PERSON) *
Pr(NOT-A-NAME | PERSON, “Jones”) *
Pr(“eats” | NOT-A-NAME, PERSON) *
Pr(“.” | “eats”, NOT-A-NAME) *
Pr(+end+ | “.”, NOT-A-NAME) *
Pr(END-OF-SENTENCE | NOT-A-NAME, “.”)

Page 45

Words and Word Features

• Word features are a language dependent part of the model

Word feature            Example    Intuition
twoDigitNum             90         Two-digit year
fourDigitNum            1990       Four-digit year
containsDigitAndAlpha   A8956-67   Product code
containsDigitAndDash    09-96      Date
containsDigitAndSlash   11/9/89    Date
containsDigitAndComma   23,000.00  Monetary amount
containsDigitAndPeriod  1.00       Monetary amount
allCaps                 BBN        Organization
capPeriod               M.         Person name initial
initCap                 Sally      Capitalized word
other                   ,          Punctuation, all other words
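A rough Python re-implementation of this feature mapping, for illustration only (the exact patterns and the order of the checks are assumptions based on the table above):

import re

def word_feature(word):
    # Map a token to one of the word-feature classes in the table above.
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNum"
    if re.search(r"\d", word) and re.search(r"[A-Za-z]", word):
        return "containsDigitAndAlpha"
    if re.fullmatch(r"\d+(?:-\d+)+", word):
        return "containsDigitAndDash"
    if re.fullmatch(r"\d+(?:/\d+)+", word):
        return "containsDigitAndSlash"
    if "," in word and re.fullmatch(r"[\d,]+(?:\.\d+)?", word):
        return "containsDigitAndComma"
    if re.fullmatch(r"\d+\.\d+", word):
        return "containsDigitAndPeriod"
    if re.fullmatch(r"[A-Z]{2,}", word):
        return "allCaps"
    if re.fullmatch(r"[A-Z]\.", word):
        return "capPeriod"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "initCap"
    return "other"

print([word_feature(w) for w in ["90", "1990", "A8956-67", "23,000.00", "BBN", "M.", "Sally", ","]])
# ['twoDigitNum', 'fourDigitNum', 'containsDigitAndAlpha', 'containsDigitAndComma',
#  'allCaps', 'capPeriod', 'initCap', 'other']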

Page 46

Three Sub Models

• Model to generate a name class

• Model to generate first word

• Model to generate subsequent words

Page 47

How the Model Works

Model to generate a name class

Model to generate first word

Model to generate subsequent words

Page 48

Generate First Word in NC

• Likelihood = P(transition from NC-1 to NC) * P(generate first word w in NC)
             = P(NC | NC-1, w-1) * P(<w,f>first | NC, NC-1)

• N.B. underlying intuitions:
– the transition to NC is strongly influenced by the previous word and the previous name class
– the first word of a name class is strongly influenced by the preceding name class

Page 49

Generate Subsequent Words in Name Class

• Here there are two cases:

– Normal – the likelihood of w following w-1 within a particular NC: P(<w,f> | <w,f>-1, NC)

– Final word – the likelihood of w in NC being the final word of the class. This uses a distinguished “+end+” word with feature “other”: P(<+end+,other> | <w,f>final, NC)

Page 50

Estimating Probabilities

• P(NC|NC-1,w-1) = c(NC,NC-1,w-1) / c(NC-1,w-1)

• P(<w,f>first|NC,NC-1) = c(<w,f>first,NC,NC-1)/c(NC,NC-1)

• P(<w,f>|<w,f>-1,NC) = c(<w,f>,<w,f>-1,NC)/c(<w,f>-1,NC)

Page 51

Backoff Models and Smoothing

• System knows about all words/bigrams encountered during training.

• However, in real applications, unknown words are also encountered, and mapped to _UNK_

• The system must therefore handle bigram probabilities involving _UNK_: as first word, as second word, or as both.

Page 52

Constructing Unknown Word Model

• Based on "held out" data.• Divide data into 2 halves.• Use first half to create vocabulary, and train

on second half.• When performing name recognition, the

unknown word model is used whenever either or both words of a bigram is unknown.

Page 53

Backoff Strategy

• However, even with the unknown word model, it is possible to be faced with a bigram that has never been encountered. In this case a backoff strategy is used.

• Underlying such a strategy is a series of fallback models.

• Data for successive members of the series are easier to obtain, but of lower quality.

Page 54

Backoff Models for Name Class Bigrams

P(NC | NC-1, w-1)
  ↓
P(NC | NC-1)
  ↓
P(NC)
  ↓
1/(number of name classes)

Page 55

Backoff Weighting

• The weight for each backoff model is computed on the fly

• If computing P(X|Y), assign weight λ to the direct estimate and weight (1 - λ) to the backoff model, where

  λ = (1 - old c(Y)/c(Y)) * 1/(1 + (unique outcomes of Y)/c(Y))
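Read this way (the exact form of the formula is a reconstruction from the slide, so treat it as an assumption), the weight can be computed as follows in Python; the function names are illustrative:

def backoff_weight(c_y, unique_outcomes_y, old_c_y=0.0):
    # Trust the direct estimate more when the context Y has been seen often
    # and with few distinct outcomes. "old c(Y)" is kept as a parameter here.
    if c_y == 0:
        return 0.0
    return (1 - old_c_y / c_y) * (1 / (1 + unique_outcomes_y / c_y))

def smoothed(p_direct, p_backoff, c_y, unique_outcomes_y):
    lam = backoff_weight(c_y, unique_outcomes_y)
    return lam * p_direct + (1 - lam) * p_backoff

# A context seen 100 times with 5 distinct outcomes leans on the direct estimate;
# one seen 3 times with 3 distinct outcomes leans much more on the backoff model.
print(smoothed(0.5, 0.1, c_y=100, unique_outcomes_y=5))   # ~0.48
print(smoothed(0.5, 0.1, c_y=3, unique_outcomes_y=3))     # 0.3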

Page 56

Results of Evaluation

Language                    Best Rules   IdentiFinder
Mixed Case English (WSJ)    96.4         94.9
Upper Case English (WSJ)    89           93.6
Speech Form English (WSJ)   74           90.7
Mixed Case Spanish          93           90

Page 57

How Much Data is Needed?

• Performance increase of 1.5 F-points for each doubling in the quantity of training data.

• 1.2 million words of training data = 200 hours of broadcast news or 1777 Wall Street Journal articles = 20 person-weeks

Page 58

Bikel - Conclusion

• Old-fashioned techniques

• Simple probabilistic model

• Near human performance

• Higher F-measure than any other system when case information is missing.