I256: Applied Natural Language Processing. Marti Hearst, Nov 13, 2006.
Today
Automating Lexicon Construction
PMI (Turney 2001)
Pointwise Mutual Information, posed as an alternative to LSA.
score(choice_i) = log2( p(problem & choice_i) / (p(problem) p(choice_i)) )
With various assumptions, this simplifies to:
score(choice_i) = p(problem & choice_i) / p(choice_i)
Turney conducts experiments with four ways of computing this from search-engine hit counts, e.g.:
score1(choice_i) = hits(problem AND choice_i) / hits(choice_i)
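A minimal sketch of how score1 might be computed once the hit counts are in hand. The hit counts below are invented for illustration; in Turney's experiments they come from a web search engine.

```python
def score1(hits_problem_and_choice, hits_choice):
    """Turney's score1: hits(problem AND choice) / hits(choice).
    In the original work the counts come from a search engine;
    here they are plain numbers supplied by the caller."""
    if hits_choice == 0:
        return 0.0
    return hits_problem_and_choice / hits_choice

# TOEFL-style synonym question: which choice is closest to the
# problem word?  (Invented hit counts, for illustration only.)
hits = {
    "imposed":  (1500, 90000),    # (problem AND choice, choice)
    "believed": (80, 500000),
}
best = max(hits, key=lambda w: score1(*hits[w]))
```

Dividing by hits(choice) is what keeps very frequent words from winning simply by co-occurring with everything.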
Dependency Parser (Lin 98)
A syntactic parser that emphasizes dependency relationships between lexical items.

Alice is the author of the book.
The book is written by Alice.

[The parse diagrams are lost in conversion; their arcs carried dependency labels such as det, mod, pred, s, be, by-subj, and pcomp-n.]

Illustration by Bengi Mizrahi
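Although the diagrams are lost, a dependency analysis can still be conveyed as (head, relation, dependent) triples. The triples below are hand-written for illustration, using the relation labels that survive from the slide; they are not actual output of Lin's parser.

```python
# Hand-written (head, relation, dependent) triples for
# "Alice is the author of the book." -- illustrative only,
# not actual Minipar output.
TRIPLES = [
    ("is", "s", "Alice"),        # surface subject
    ("is", "pred", "author"),    # predicate nominal
    ("author", "det", "the"),
    ("author", "mod", "of"),     # prepositional modifier
    ("of", "pcomp-n", "book"),   # nominal complement of preposition
    ("book", "det", "the"),
]

def dependents(triples, head, rel=None):
    """Dependents of `head`, optionally restricted to one relation."""
    return [d for (h, r, d) in triples
            if h == head and (rel is None or r == rel)]
```

Representing the parse as triples is what makes dependency output convenient for lexicon work: similarity and extraction methods can operate on (head, relation, dependent) tuples directly.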
Automating Lexicon Construction
(Slide adapted from Manning & Raghavan)
What is a Lexicon?
A database of the vocabulary of a particular domain (or a language). More than a list of words/phrases; there is usually some linguistic information:
- Morphology (manag- e/es/ing/ed → manage)
- Syntactic patterns (transitivity etc.)
Often some semantic information:
- Is-a hierarchy
- Synonymy
- Numbers converted to normal form: Four → 4
- Dates converted to normal form
- Alternative names converted to explicit form: Mr. Carr, Tyler, Presenter → Tyler Carr
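A sketch of two of the normalizations above. The helper names and the exact normal forms chosen (digits, ISO dates) are assumptions for illustration, not part of any particular lexicon.

```python
import re

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def normalize_numbers(text):
    """Spelled-out numbers to digits: 'Four' -> '4'."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: NUMBER_WORDS.get(m.group(0).lower(), m.group(0)),
                  text)

def normalize_date(text):
    """'Nov 13, 2006'-style dates to ISO 'YYYY-MM-DD'."""
    m = re.match(r"([A-Za-z]{3})\w*\.?\s+(\d{1,2}),\s*(\d{4})", text)
    if not m:
        return text
    return f"{m.group(3)}-{MONTHS[m.group(1).lower()]:02d}-{int(m.group(2)):02d}"
```

The point of normal forms is that downstream lookups compare canonical strings, so "Four" and "4", or "Nov 13, 2006" and "2006-11-13", hit the same lexicon entry.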
(Slide adapted from Manning & Raghavan)
Lexica in Text Mining
Many text mining tasks require named entity recognition, and named entity recognition in most cases requires a lexicon.
Example 1: Question answering
- "Where is Mount Everest?" A list of geographic locations increases accuracy.
Example 2: Information extraction
- Consider scraping book data from amazon.com. The template contains the field "publisher"; a list of publishers increases accuracy.
Manual construction is expensive: 1000s of person hours! Sometimes an unstructured inventory is sufficient, but often you need more structure, e.g., a hierarchy.
Semantic Relation Detection
Goal: automatically augment a lexical database. Many potential relation types:
- ISA (hypernymy/hyponymy)
- Part-Of (meronymy)
Idea: find unambiguous contexts which (nearly) always indicate the relation of interest.
Lexico-Syntactic Patterns (Hearst 92)
Adding a New Relation
Automating Semantic Relation Detection
Lexico-syntactic patterns:
- Should occur frequently in text
- Should (nearly) always suggest the relation of interest
- Should be recognizable with little pre-encoded knowledge
These patterns have been used extensively by other researchers.
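One widely cited Hearst (1992) pattern is "NP0 such as NP1, NP2, ... (and|or) NPn", which suggests NPi ISA NP0. A minimal sketch, with a crude word-level regex standing in for real NP chunking:

```python
import re

# Pattern sketch: "NP0 such as NP1, NP2, ... (and|or) NPn".
# A real implementation matches over NP-chunked text; here short
# 1-2 word sequences stand in for noun phrases.
SUCH_AS = re.compile(
    r"(\w+(?:\s\w+)?)\s+such as\s+"
    r"((?:\w+(?:\s\w+)?(?:,\s*|\s+(?:and|or)\s+))*\w+(?:\s\w+)?)")

def hyponym_pairs(sentence):
    """Return (hyponym, hypernym) pairs suggested by the pattern."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hypernym = m.group(1)
        for hyponym in re.split(r",\s*|\s+(?:and|or)\s+", m.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs
```

For example, "injuries such as bruises, wounds and broken bones" yields ISA candidates (bruises, injuries), (wounds, injuries), and (broken bones, injuries) without any pre-encoded domain knowledge.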
(Slide adapted from Manning & Raghavan)
Lexicon Construction (Riloff 93)
Attempt 1: iterative expansion of a phrase list.
1. Start with a large text corpus and a list of seed words
2. Identify "good" seed word contexts
3. Collect close nouns in contexts
4. Compute confidence scores for nouns
5. Iteratively add high-confidence nouns to the seed word list; go to step 2
Output: ranked list of candidates
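The loop above can be sketched as follows. The single hard-coded context ("<A> and <B>") and the counting-based score are drastic simplifications of the real system; everything here is illustrative.

```python
import re
from collections import Counter

def bootstrap_lexicon(sentences, seeds, n_iters=2, top_k=2):
    """Toy sketch of Riloff-style iterative expansion.  The only
    'good context' recognized is the conjunction "<A> and <B>": a
    word conjoined with a known category member earns one point,
    and the top-scoring words join the lexicon each round."""
    lexicon = set(seeds)
    for _ in range(n_iters):
        scores = Counter()
        for s in sentences:
            for a, b in re.findall(r"(\w+) and (\w+)", s):
                if a in lexicon and b not in lexicon:
                    scores[b] += 1
                if b in lexicon and a not in lexicon:
                    scores[a] += 1
        for w, _count in scores.most_common(top_k):
            lexicon.add(w)
    return lexicon
```

Each round the newly added words themselves anchor new contexts, which is both the power of the method and, as discussed later, the source of drift.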
(Slide adapted from Manning & Raghavan)
Lexicon Construction: Example
Category: weapon
Seed words: bomb, dynamite, explosives
Context: <new-phrase> and <seed-phrase>
Iterate:
- Context: "They use TNT and other explosives." → add word: TNT
Other words added by the algorithm: rockets, bombs, missile, arms, bullets
(Slide adapted from Manning & Raghavan)
Lexicon Construction: Attempt 2
Multilevel bootstrapping (Riloff and Jones 1999): generate two data structures in parallel:
- the lexicon
- a list of extraction patterns
Input as before:
- corpus (not annotated)
- list of seed words
(Slide adapted from Manning & Raghavan)
Multilevel Bootstrapping
Initial lexicon: the seed words.
Level 1: mutual bootstrapping
- Extraction patterns are learned from lexicon entries
- New lexicon entries are learned from extraction patterns
- Iterate
Level 2: filter the lexicon
- Retain only the most reliable lexicon entries
- Go back to level 1
The 2-level version performs better than level 1 alone.
(Slide adapted from Manning & Raghavan)
Scoring of Patterns
Example concept: company. Pattern: owned by <x>.
Patterns are scored as follows:
score(pattern) = (F / N) * log(F)
- F = number of unique lexicon entries produced by the pattern
- N = total number of unique phrases produced by the pattern
This selects for patterns that are:
- selective (the F/N part)
- high-yield (the log(F) part)
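The scoring formula, transcribed directly. Log base 2 is an assumption here; the slide does not state the base.

```python
import math

def pattern_score(extracted_phrases, lexicon):
    """score(pattern) = (F / N) * log(F), where
    F = unique extractions already in the lexicon and
    N = all unique extractions produced by the pattern.
    (Log base 2 assumed; the slide does not give the base.)"""
    unique = set(extracted_phrases)
    n = len(unique)
    f = len(unique & lexicon)
    if f == 0 or n == 0:
        return 0.0
    return (f / n) * math.log2(f)
```

A pattern that extracts only known companies (high F/N) but fires rarely scores lower than one that is both selective and productive, which is exactly the trade-off the F/N and log(F) factors encode.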
(Slide adapted from Manning & Raghavan)
Scoring of Noun Phrases
Noun phrases are scored as follows:
score(NP) = Σ_k (1 + 0.01 * score(pattern_k))
where the sum runs over all patterns that fire for the NP.
The main criterion is the number of independent patterns that fire for this NP; NPs found by high-confidence patterns get a higher score.
Example:
- New candidate phrase: boeing
- Occurs in: owned by <x>, sold to <x>, offices of <x>
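The NP score, transcribed directly; the pattern scores used in the example are hypothetical.

```python
def np_score(firing_pattern_scores):
    """score(NP) = sum over patterns that extract the NP of
    (1 + 0.01 * score(pattern)): each firing pattern contributes
    roughly 1, so the count of independent patterns dominates, with
    a small bonus from high-confidence patterns."""
    return sum(1 + 0.01 * s for s in firing_pattern_scores)

# "boeing" fires in three patterns (hypothetical pattern scores
# for owned by <x>, sold to <x>, offices of <x>):
boeing = np_score([2.0, 1.5, 0.7])
```

Note the weighting: a phrase found by three mediocre patterns outranks one found by a single excellent pattern, since the 0.01 factor keeps pattern quality a tiebreaker rather than the main signal.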
(Slide adapted from Manning & Raghavan)
Shallow Parsing
Shallow parsing is needed:
- for identifying noun phrases and their heads
- for generating extraction patterns
For scoring: when are two noun phrases the same?
Head phrase matching: X matches Y if X is the rightmost substring of Y.
- "New Zealand" matches "Eastern New Zealand"
- "New Zealand cheese" does not match "New Zealand"
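Head phrase matching as defined above can be sketched as:

```python
def head_matches(x, y):
    """X matches Y iff X is the rightmost word-aligned substring of Y,
    i.e. the two phrases end in the same head words."""
    xw, yw = x.lower().split(), y.lower().split()
    return 0 < len(xw) <= len(yw) and yw[-len(xw):] == xw
```

Matching on the rightmost words works because English noun phrases are head-final: "Eastern New Zealand" is still a kind of "New Zealand", but "New Zealand cheese" is a kind of cheese, not a kind of New Zealand.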
(Slide adapted from Manning & Raghavan)
Seed Words
(Slide adapted from Manning & Raghavan)
Mutual Bootstrapping
(Slide adapted from Manning & Raghavan)
Extraction Patterns
(Slide adapted from Manning & Raghavan)
Level 1: Mutual Bootstrapping
- Drift can occur: it only takes one bad apple to spoil the barrel (example: "head")
- Introduce level-2 bootstrapping to prevent drift
(Slide adapted from Manning & Raghavan)
Level 2: Meta-Bootstrapping
(Slide adapted from Manning & Raghavan)
Evaluation
(Slide adapted from Manning & Raghavan)
CoTraining (Collins&Singer 99)
A similar back and forth between an extraction algorithm and a lexicon.
New: they use word-internal features:
- Is the word all caps? (IBM)
- Is the word all caps with at least one period? (N.Y.)
- Does it contain a non-alphabetic character? (AT&T)
- The constituent words of the phrase ("Bill" is a feature of the phrase "Bill Clinton")
Classification formalism: decision lists
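A sketch of such word-internal ("spelling") features. The feature names and the exact inventory are illustrative, not Collins & Singer's full feature set.

```python
import re

def spelling_features(phrase):
    """Word-internal ('spelling') features in the spirit of
    Collins & Singer (1999); the inventory here is illustrative."""
    feats = set()
    if phrase.isalpha() and phrase.isupper():
        feats.add("all-caps")                  # IBM
    if "." in phrase and phrase.replace(".", "").isupper():
        feats.add("all-caps-with-period")      # N.Y.
    if re.search(r"[^A-Za-z.\s]", phrase):
        feats.add("non-alphabetic-char")       # AT&T
    for word in phrase.split():
        feats.add(f"contains({word})")         # Bill, Clinton
    return feats
```

These features are the second "view" of each phrase: they say nothing about context, so they can be co-trained against contextual features, each view supplying labels for the other.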
(Slide adapted from Manning & Raghavan)
Collins&Singer: Seed Words
Note that categories are more generic than in the case of Riloff/Jones.
(Slide adapted from Manning & Raghavan)
Collins&Singer: Algorithm
1. Train decision rules on the current lexicon (initially: the seed words). Result: a new set of decision rules.
2. Apply the decision rules to the training set. Result: a new lexicon.
3. Repeat.
(Slide adapted from Manning & Raghavan)
Collins&Singer: Results
Per-token evaluation?
More Recent Work
- KnowItAll system at U. Washington
- WebFountain project at IBM
(Slide adapted from Manning & Raghavan)
Lexica: Limitations
Named entity recognition is more than lookup in a list.
- Linguistic variation: manage, manages, managed, managing
- Non-linguistic variation: human gene MYH6 in the lexicon, MYH7 in the text
- Ambiguity: what if a phrase has two different semantic classes? Bioinformatics example: gene/protein metonymy
(Slide adapted from Manning & Raghavan)
Discussion
Partial resources are often available; e.g., you have a gazetteer and want to extend it to a new geographic area.
- Some manual post-editing is necessary for high quality.
- Semi-automated approaches offer good coverage with much reduced human effort.
- Drift is not a problem in practice if there is a human in the loop anyway.
- An approach that can deal with diverse evidence is preferable.
- Hand-crafted features (the period in "N.Y.") help a lot.