Page 1:

CS 6998 NLP for the Web, Columbia University, 04/22/2010

Analyzing Wikipedia and Gold-Standard Corpora for

NER Training

William Y. Wang, Computer Science

Nothman et al. 2009, EACL

Page 2:

Outline

1. Motivation

2. NER and Gold-Standard Corpora

3. The Problem: Cross-corpus Performance

4. Wikipedia for NER

5. Results

6. Conclusion and My Observations

Page 3:

Motivation

1. Manual annotation is “expensive”: it is (1) costly, (2) time-consuming, and (3) prone to extra problems.

Can we use linguistic resources to create an NER corpus automatically?

2. What is the cross-corpus NER performance?

3. How can we utilize Web resources (e.g. Wikipedia) to improve NER?

Page 4:

NER Gold Corpora

1. MUC-7: locations (LOC), organizations (ORG), personal names (PER)

2. CoNLL-03: LOC, ORG, PER, miscellaneous (MISC)

3. BBN: 54 tags in the Penn Treebank

Corpus     Tags   Train Tokens   Dev Tokens   Test Tokens
MUC-7      3      83601          18655        60436
CoNLL-03   4      203621         51362        46435
BBN        54     901894         142218       129654

Page 5:

Problem: Poor Cross-corpus Performance

Train      With MISC          Without MISC
           CoNLL    BBN       MUC     CoNLL   BBN
MUC        —        —         73.5    55.5    67.5
CoNLL      81.2     62.3      65.9    82.1    62.4
BBN        54.7     86.7      77.9    53.9    88.4

Page 6:

Corpus and Error Analysis

• N-gram tag variation: check the tags of all n-grams that appear multiple times to see whether their NE tags are consistent

• Entity type frequency: (1) POS tag paired with the NE tag (e.g. nationalities often carry JJ or NNPS); (2) word types; (3) word types with function words (e.g. Bank of New England → Aaa of Aaa Aaa)

• Tag sequence confusion: examining the details of the confusion matrix
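As a concrete illustration, the word-type mapping can be sketched in a few lines; the function-word list and the collapsing rules here are illustrative assumptions, not the paper’s exact implementation:

```python
import re

# Toy function-word list (an assumption, not the paper's list).
FUNCTION_WORDS = frozenset({"of", "the", "and", "for"})

def word_type(token):
    """Collapse a token to its shape: uppercase runs become 'A',
    lowercase runs 'aa', digit runs '0'; function words are kept as-is."""
    if token.lower() in FUNCTION_WORDS:
        return token.lower()
    shape = re.sub(r"[A-Z]+", "A", token)
    shape = re.sub(r"[a-z]+", "aa", shape)
    shape = re.sub(r"[0-9]+", "0", shape)
    return shape

print(" ".join(word_type(w) for w in "Bank of New England".split()))
# → Aaa of Aaa Aaa
```

This reproduces the slide’s example: the shape pattern abstracts away the specific words while keeping capitalization structure and function words.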

Page 7:

Using Wikipedia to Build NER Corpus

1. Classify all articles into entity classes

2. Split Wikipedia articles into sentences

3. Label NEs according to link targets

4. Select sentences for inclusion in a corpus
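The steps above can be sketched with toy data; the article classes, the token representation, and the simplified selection rule are all illustrative assumptions rather than the authors’ actual pipeline:

```python
# Toy stand-in for step 1: article title -> entity class.
ARTICLE_CLASS = {"Australia": "LOC", "BHP": "ORG"}

def label_sentence(tokens):
    """Step 3: tag each (word, link_target) pair via the class of the
    link's target article; unlinked words get 'O'."""
    return [(word, ARTICLE_CLASS.get(target, "O") if target else "O")
            for word, target in tokens]

def select_sentence(tagged):
    """Step 4 (simplified): keep a sentence only if every capitalized
    non-initial word received an entity tag, so unlinked entities do
    not become false negatives in the corpus."""
    return all(tag != "O" or not word[:1].isupper()
               for i, (word, tag) in enumerate(tagged) if i > 0)

sent = [("BHP", "BHP"), ("operates", None), ("in", None),
        ("Australia", "Australia")]
tagged = label_sentence(sent)
print(tagged)                  # → [('BHP', 'ORG'), ('operates', 'O'),
                               #    ('in', 'O'), ('Australia', 'LOC')]
print(select_sentence(tagged)) # → True
# "Sydney" is capitalized but unlinked, so the sentence is rejected:
print(select_sentence(label_sentence(
    [("He", None), ("visited", None), ("Sydney", None)])))  # → False
```

The selection step matters because Wikipedia authors only link a mention’s first occurrence; keeping sentences with unlinked entities would train the tagger to ignore real names.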

Page 8:

Improve Wikipedia NER

• Baseline: 58.9% and 62.3% on CoNLL and BBN

1. Inferring extra links using Wikipedia Disambiguation Pages

2. Personal titles: not all preceding titles indicate PER (e.g. Prime Minister of Australia)

3. Previously missed JJ entities (e.g. American / MISC)

4. Miscellaneous changes
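Item 2 might look roughly like the following heuristic; the title list and the “of”-phrase rule are hypothetical stand-ins for the paper’s actual rules:

```python
# Toy title list (a hypothetical assumption, not the paper's lexicon).
TITLES = {"president", "minister", "prime", "dr", "mr", "mrs"}

def title_marks_person(tokens, i):
    """Return True if tokens[i] is a title that introduces a PER name."""
    if tokens[i].lower() not in TITLES:
        return False
    # Skip over a chain of title words ("Prime Minister").
    j = i + 1
    while j < len(tokens) and tokens[j].lower() in TITLES:
        j += 1
    # A title chain followed by "of" names an office
    # ("Prime Minister of Australia"), not a person.
    if j < len(tokens) and tokens[j].lower() == "of":
        return False
    # Otherwise a following capitalized word suggests a personal name.
    return j < len(tokens) and tokens[j][:1].isupper()

print(title_marks_person("Prime Minister of Australia".split(), 0))  # → False
print(title_marks_person("Prime Minister John Howard".split(), 0))   # → True
```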

Page 9:

Results

Train      With MISC          Without MISC
           CoNLL    BBN       MUC     CoNLL   BBN
MUC        —        —         82.3    54.9    69.3
CoNLL      85.9     61.9      69.9    86.9    60.2
BBN        59.4     86.5      80.2    59.0    88.0
WP0        62.8     69.7      69.7    64.7    70.0
WP1        67.2     73.4      75.3    67.7    73.6
WP2        69.0     74.0      76.6    69.4    75.1
WP3        68.9     73.5      77.2    69.5    73.7
WP4        66.2     72.3      75.6    67.3    73.3

Dev set results (higher than, but similar to, test set results)

Page 10:

Conclusion

• The choice of training corpus has a huge impact on NER performance: models do best on test data drawn from the same corpus

• Annotation-free NER corpora were created automatically from Wikipedia

• The Wikipedia-derived data performs better on the cross-corpus NER task

• Still much room for improvement

Page 11:

Comments

What I like about this paper:

• The scope of this paper is unique (analogy: cross-cultural studies)

• Utilizing novel linguistic resources to solve basic NLP problems

• Good results

• Relatively clear and easy to understand

What I don’t like about this paper:

• The overall method for improving Wikipedia NER training is a collection of heuristics rather than a principled approach

Page 12:

Overall Assessment:

8/10

Page 13:

Thank you!