Ling 570 Day 17: Named Entity Recognition Chunking.
Ling 570 Day 17: Named Entity Recognition
Chunking
Sequence Labeling
• Goal: Find most probable labeling of a sequence
• Many sequence labeling tasks
  – POS tagging
  – Word segmentation
  – Named entity tagging
  – Story/spoken sentence segmentation
  – Pitch accent detection
  – Dialog act tagging
NER AS SEQUENCE LABELING
NER as Classification Task
• Instance: token
• Labels:
  – Position: B(eginning), I(nside), O(utside)
  – NER types: PER, ORG, LOC, NUM
  – Label: Type-Position, e.g. PER-B, PER-I, O, …
  – How many tags?
    • (|NER types| × 2) + 1
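The tag count can be checked with a short sketch. The function name `build_tagset` is an illustrative assumption, not part of the course materials; it emits a B and an I tag for each type plus a single shared O tag.

```python
def build_tagset(ner_types):
    """Return Type-Position labels: a B and an I tag per type, plus one O tag."""
    tags = []
    for ner_type in ner_types:
        tags.append(f"{ner_type}-B")
        tags.append(f"{ner_type}-I")
    tags.append("O")  # O is shared by all types, hence the "+ 1"
    return tags


tagset = build_tagset(["PER", "ORG", "LOC", "NUM"])
print(len(tagset))  # 9, i.e. (4 x 2) + 1
print(tagset)
```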
NER as Classification: Features
• What information can we use for NER?
  – Predictive tokens: e.g. MD, Rev, Inc, ...
    • How general are these features?
      – Language? Genre? Domain?
NER as Classification: Shape Features
• Shape types:
  – lower: e.g. e. e. cummings
    • All lower case
  – capitalized: e.g. Washington
    • First letter uppercase
  – all caps: e.g. WHO
    • All letters capitalized
  – mixed case: e.g. eBay
    • Mixed upper and lower case
  – Capitalized with period: e.g. H.
  – Ends with digit: e.g. A9
  – Contains hyphen: e.g. H-P
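A minimal sketch of how these shape classes might be computed for a single token. The function name, the returned label strings, and the order in which the checks are applied are all assumptions made for illustration; a token such as A9 could reasonably be assigned to more than one class, and a real system might emit several shape features per token.

```python
import re


def word_shape(token):
    """Map a token to one coarse shape class; earlier checks take priority."""
    if "-" in token:
        return "contains-hyphen"
    if token and token[-1].isdigit():
        return "ends-with-digit"
    if re.fullmatch(r"[A-Z]\.", token):
        return "capitalized-with-period"
    if token.islower():
        return "lower"
    if token.isupper():
        return "all-caps"
    if token[:1].isupper() and token[1:].islower():
        return "capitalized"
    if any(c.isupper() for c in token) and any(c.islower() for c in token):
        return "mixed-case"
    return "other"


for tok in ["cummings", "Washington", "WHO", "eBay", "H.", "A9", "H-P"]:
    print(tok, word_shape(tok))
```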
Example Instance Representation
• Example
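One way an instance might be represented, as a rough sketch: each token becomes a dictionary of features drawn from the kinds of information discussed above (the word itself, capitalization cues, neighboring words, predictive tokens). The feature names, the padding symbols, and the example sentence are hypothetical and are not taken from the course materials.

```python
# Example predictive tokens from the slides; a real list would be much longer.
PREDICTIVE_TOKENS = {"MD", "Rev", "Inc"}


def ner_features(tokens, i):
    """Features for tokens[i]: the word, simple shape cues, and neighboring words."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "word": word,
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "prev_word": prev_word,
        "next_word": next_word,
        "next_is_predictive": next_word.strip(".,") in PREDICTIVE_TOKENS,
    }


# Hypothetical sentence, used only to show the shape of one instance.
tokens = ["Acme", "Inc", "hired", "Jane", "Smith", "MD"]
print(ner_features(tokens, 0))
```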
Sequence Labeling
• Example
Evaluation
• System: output of automatic tagging
• Gold Standard: true tags
• Precision: # correct chunks / # system chunks
• Recall: # correct chunks / # gold chunks
• F-measure: F_β = (β² + 1)PR / (β²P + R)
• F1 (β = 1) balances precision & recall: F1 = 2PR / (P + R)
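A sketch of chunk-level scoring under these definitions. A chunk counts as correct only if its type and both boundaries match the gold standard exactly. The helper names are assumptions, and the tag sequences are made up for illustration; treating a stray I tag as the start of a new chunk is one lenient convention among several.

```python
def extract_chunks(tags):
    """Turn Type-Position tags (e.g. PER-B, PER-I, O) into a set of (type, start, end) spans."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel O closes a trailing chunk
        kind, _, pos = tag.rpartition("-")        # "PER-B" -> ("PER", "-", "B"); "O" -> ("", "", "O")
        continues = pos == "I" and kind == ctype
        if ctype is not None and not continues:
            chunks.add((ctype, start, i))         # end index is exclusive
            start, ctype = None, None
        if pos == "B" or (pos == "I" and ctype is None):
            start, ctype = i, kind                # lenient: a stray I starts a new chunk
    return chunks


def precision_recall_f1(gold_tags, system_tags):
    """Chunk-level precision, recall, and F1 (exact match on type and boundaries)."""
    gold, system = extract_chunks(gold_tags), extract_chunks(system_tags)
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["PER-B", "PER-I", "O", "ORG-B", "O", "LOC-B"]
system = ["PER-B", "PER-I", "O", "ORG-B", "ORG-I", "O"]
print(precision_recall_f1(gold, system))
```

Here the system proposes two chunks and only the PER chunk matches one of the three gold chunks, so precision is 1/2 and recall is 1/3. The ORG chunk has the right type and start but the wrong end, so it earns no credit.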
Evaluation
• Standard measures:
  – Precision, Recall, F-measure
  – Computed on entity types (CoNLL evaluation)
• Classifiers vs. evaluation measures
  – Classifiers optimize tag accuracy
    • Most common tag?
      – O: most tokens aren't NEs
  – Evaluation measures focus on NEs
• State-of-the-art:
  – Standard tasks: PER, LOC: 0.92; ORG: 0.84
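The mismatch between tag accuracy and entity-focused measures can be seen on a toy example. The tag sequence below is invented for illustration: a baseline that always predicts O scores well on per-token accuracy while proposing no entities at all.

```python
# Made-up gold tags for a ten-token sentence containing two entities.
gold = ["O", "O", "PER-B", "PER-I", "O", "O", "O", "O", "LOC-B", "O"]
baseline = ["O"] * len(gold)  # always predict the most common tag

tag_accuracy = sum(g == b for g, b in zip(gold, baseline)) / len(gold)

# Every entity starts with exactly one B tag, so counting B tags counts entities.
gold_entities = sum(tag.endswith("-B") for tag in gold)
baseline_entities = sum(tag.endswith("-B") for tag in baseline)

print(tag_accuracy)       # 0.7
print(gold_entities)      # 2
print(baseline_entities)  # 0, so no gold entity is recovered and recall is 0
```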
Hybrid Approaches
• Practical systems
  – Exploit lists, rules, learning…
  – Multi-pass:
    • Early passes: high precision, low recall
    • Later passes: noisier sequence learning
• Hybrid system:
  – High-precision rules tag unambiguous mentions
    • Use string matching to capture substring matches
  – Tag items from domain-specific name lists
  – Apply sequence labeler
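The passes above can be sketched roughly as follows. Everything here is an illustrative assumption: the single title rule, the single-token name list, the example sentence, and the `fallback_labeler` argument, which stands in for a trained sequence labeler. For brevity the later pass does not use the earlier tags as features, which a real system likely would.

```python
def hybrid_tag(tokens, title_words, name_list, fallback_labeler):
    """Multi-pass tagging: precise rules first, then a name list, then a sequence labeler."""
    tags = [None] * len(tokens)

    # Pass 1: high-precision rule: a title followed by a capitalized word marks a person.
    known_persons = set()
    for i in range(len(tokens) - 1):
        if tokens[i] in title_words and tokens[i + 1][:1].isupper():
            tags[i + 1] = "PER-B"
            known_persons.add(tokens[i + 1])

    # Pass 1b: string matching: tag other mentions of names found by the rule.
    for i, tok in enumerate(tokens):
        if tags[i] is None and tok in known_persons:
            tags[i] = "PER-B"

    # Pass 2: domain-specific name list (single-token entries only, for brevity).
    for i, tok in enumerate(tokens):
        if tags[i] is None and tok in name_list:
            tags[i] = name_list[tok] + "-B"

    # Pass 3: the sequence labeler fills in whatever is still undecided.
    predicted = fallback_labeler(tokens)
    return [t if t is not None else p for t, p in zip(tags, predicted)]


def all_outside(toks):
    """Stand-in for a trained sequence labeler: tags everything O."""
    return ["O"] * len(toks)


tokens = ["Dr.", "Chen", "joined", "Acme", "after", "Chen", "left", "Boston"]
result = hybrid_tag(tokens, {"Dr.", "Rev."}, {"Acme": "ORG"}, all_outside)
print(list(zip(tokens, result)))
```

The second mention of Chen has no title in front of it and is picked up only by the string-matching step. Boston stays O here because the stand-in labeler knows nothing; that is the gap the final learned pass is meant to fill.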
CHUNKING
What is Chunking?
• Form of partial (shallow) parsing
  – Extracts major syntactic units, but not full parse trees
• Task: identify and classify
  – Flat, non-overlapping segments of a sentence
  – Basic non-recursive phrases
  – Correspond to major POS
    • May ignore some categories; e.g. base NP chunking
  – Create simple bracketing
    • [NP The morning flight] [PP from] [NP Denver] [VP has arrived]
    • [NP The morning flight] from [NP Denver] has arrived
Example
• Parse tree: (S (NP (NNP Breaking) (NNP Dawn)) (VP (VBZ has) (VP (VBN broken) (PP (IN into) (NP (DT the) (NN box) (NN office) (NN top) (NN ten))))))
• Chunks: [NP Breaking Dawn] [VP has broken] [PP into] [NP the box office top ten]
Why Chunking?
• Used when full parse unnecessary
  – Or infeasible or impossible (when?)
• Extraction of subcategorization frames
  – Identify verb arguments
    • e.g. VP NP
    • VP NP NP
    • VP NP to NP
• Information extraction: who did what to whom
• Summarization: Base information, remove mods
• Information retrieval: Restrict indexing to base NPs
Processing Example
• Tokenization: The morning flight from Denver has arrived
• POS tagging: DT JJ N PREP NNP AUX V
• Chunking: NP PP NP VP
• Extraction: NP NP VP
• etc
Approaches
• Finite-state approaches
  – Grammatical rules in FSTs
  – Cascade to produce more complex structure
• Machine learning
  – Similar to POS tagging
Finite-State Rule-Based Chunking
• Hand-crafted rules model phrases
  – Typically application-specific
• Left-to-right longest match (Abney 1996)
  – Start at beginning of sentence
  – Find longest matching rule
  – Greedy approach, not guaranteed optimal
Finite-State Rule-Based Chunking
• Chunk rules:
  – Cannot contain recursion
    • NP -> Det Nominal: Okay
    • Nominal -> Nominal PP: Not okay
• Examples:
  – NP -> (Det) Noun* Noun
  – NP -> Proper-Noun
  – VP -> Verb
  – VP -> Aux Verb
• Consider: Time flies like an arrow
• Is this what we want?
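A minimal sketch of greedy left-to-right longest-match chunking with the four example rules. It uses regular expressions over a string of coarse tags rather than real FSTs; the tag names (`Det`, `Noun`, `Prep`, and so on), the function name, and both tag sequences assigned to the example sentence are assumptions for illustration.

```python
import re

# Each rule is (chunk label, regex over a space-separated tag string).
RULES = [
    ("NP", re.compile(r"(Det )?(Noun )+")),  # NP -> (Det) Noun* Noun
    ("NP", re.compile(r"Proper-Noun ")),     # NP -> Proper-Noun
    ("VP", re.compile(r"Aux Verb ")),        # VP -> Aux Verb
    ("VP", re.compile(r"Verb ")),            # VP -> Verb
]


def chunk(tags):
    """Greedy left-to-right chunking: at each position take the longest matching rule."""
    chunks, i = [], 0
    while i < len(tags):
        rest = "".join(t + " " for t in tags[i:])
        best_label, best_len = None, 0
        for label, pattern in RULES:
            m = pattern.match(rest)
            if m:
                n = m.group(0).count(" ")  # number of tags consumed by this rule
                if n > best_len:
                    best_label, best_len = label, n
        if best_label is None:
            i += 1  # no rule applies: the token stays outside any chunk
        else:
            chunks.append((best_label, i, i + best_len))  # end index is exclusive
            i += best_len
    return chunks


# "Time flies like an arrow" under two hypothetical taggings.
print(chunk(["Noun", "Verb", "Prep", "Det", "Noun"]))
print(chunk(["Noun", "Noun", "Verb", "Det", "Noun"]))
```

With the second tagging, the longest-match rule groups the first two words into a single NP. The chunker commits to whatever the tags allow and never reconsiders.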
Cascading FSTs
• Richer partial parsing
  – Pass output of FST to next FST
• Approach:
  – First stage: Base phrase chunking
  – Next stage: Larger constituents (e.g. PPs, VPs)
  – Highest stage: Sentences
Example
Chunking by Classification
• Model chunking as task similar to POS tagging
• Instance: tokens
• Labels:
  – Simultaneously encode segmentation & identification
  – IOB (or BIO) tagging (also BIOE, BIOSE, etc.)
    • Segment: B(eginning), I(nternal), O(utside)
    • Identity: Phrase category: NP, VP, PP, etc.
    • The morning flight from Denver has arrived
    • NP-B NP-I NP-I PP-B NP-B VP-B VP-I
    • NP-B NP-I NP-I NP-B (base NP chunking only; the other tokens are outside any chunk)
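A short sketch of the encoding, assuming chunks are given as (label, start, end) spans with an exclusive end index. The function name and span format are illustrative choices; the sentence and labels are the ones from the slide.

```python
def chunks_to_tags(n_tokens, chunks):
    """Encode (label, start, end) chunks as Type-Position tags; uncovered tokens get O."""
    tags = ["O"] * n_tokens
    for label, start, end in chunks:
        tags[start] = f"{label}-B"
        for i in range(start + 1, end):
            tags[i] = f"{label}-I"
    return tags


tokens = "The morning flight from Denver has arrived".split()
full = [("NP", 0, 3), ("PP", 3, 4), ("NP", 4, 5), ("VP", 5, 7)]
base_np = [("NP", 0, 3), ("NP", 4, 5)]

print(chunks_to_tags(len(tokens), full))
# ['NP-B', 'NP-I', 'NP-I', 'PP-B', 'NP-B', 'VP-B', 'VP-I']
print(chunks_to_tags(len(tokens), base_np))
# ['NP-B', 'NP-I', 'NP-I', 'O', 'NP-B', 'O', 'O']
```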
Features for Chunking
• What are good features?
  – Preceding tags
    • for 2 preceding words
  – Words
    • for 2 preceding, current, 2 following
  – Parts of speech
    • for 2 preceding, current, 2 following
• Vector includes those features + true label
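A sketch of how one training instance might be assembled from these features. The feature names, the `<pad>` symbol for positions beyond the sentence boundary, and the use of gold chunk tags for the two preceding words are assumptions; the sentence, POS tags, and chunk labels reuse the running example from the slides.

```python
def chunk_features(words, pos_tags, prev_chunk_tags, i):
    """Features for token i: words and POS in a +/-2 window, plus the 2 preceding chunk tags."""
    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else "<pad>"

    feats = {}
    for offset in (-2, -1, 0, 1, 2):
        feats[f"word[{offset}]"] = at(words, i + offset)
        feats[f"pos[{offset}]"] = at(pos_tags, i + offset)
    for offset in (-2, -1):
        feats[f"chunk[{offset}]"] = at(prev_chunk_tags, i + offset)
    return feats


words = "The morning flight from Denver has arrived".split()
pos = ["DT", "JJ", "N", "PREP", "NNP", "AUX", "V"]
gold = ["NP-B", "NP-I", "NP-I", "PP-B", "NP-B", "VP-B", "VP-I"]

i = 3  # the token "from"
# Only tags to the left of position i are visible as features.
instance = (chunk_features(words, pos, gold[:i], i), gold[i])  # features + true label
print(instance)
```

At test time the preceding chunk tags would come from the classifier's own earlier decisions rather than from the gold standard.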
Chunking as Classification
• Example
Evaluation
• System: output of automatic tagging
• Gold Standard: true tags
  – Typically extracted from parsed treebank
• Precision: # correct chunks / # system chunks
• Recall: # correct chunks / # gold chunks
• F-measure: F_β = (β² + 1)PR / (β²P + R)
• F1 (β = 1) balances precision & recall: F1 = 2PR / (P + R)
State-of-the-Art
• Base NP chunking: 0.96
• Complex phrases:
  – Learning: 0.92-0.94
    • Most learners achieve similar results
  – Rule-based: 0.85-0.92
• Limiting factors:
  – POS tagging accuracy
  – Inconsistent labeling (parse tree extraction)
  – Conjunctions
    • Late departures and arrivals are common in winter
    • Late departures and cancellations are common in winter