INFORMATION EXTRACTION
CS8691-AI-AI-UNITV-INFORMATION EXTRACTION
Text Classification by Example
How could you build a text classifier?
• Take some ideas from machine learning– Supervised learning setting– Examples of each class (a few or thousands)
• Take some ideas from machine translation– Generative models– Language models
• Simplify each and stir thoroughly
Basic Approach of Generative Modeling
1. Pick representation for data
2. Write down probabilistic generative model
3. Estimate model parameters with training data
4. Turn model around to calculate unknown values for new data
Naïve Bayes: Bag of Words Representation
Corn prices rose today while corn futures dropped in surprising trading activity. Corn ...
All words in dictionary, with occurrence counts:

  activity  1
  cable     0
  corn      3
  damp      0
  drawer    0
  dropped   1
  elbow     0
  earning   0
  ...       ...
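The bag-of-words mapping above can be sketched in a few lines. The example sentence and vocabulary come from the slide; the regex tokenizer is an illustrative assumption, not part of the original model:

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Map a document to occurrence counts over a fixed dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())   # crude tokenizer (assumption)
    counts = Counter(tokens)
    return {w: counts.get(w, 0) for w in vocabulary}

doc = ("Corn prices rose today while corn futures dropped "
       "in surprising trading activity. Corn ...")
vocab = ["activity", "cable", "corn", "damp", "drawer", "dropped", "elbow", "earning"]
bow = bag_of_words(doc, vocab)
```

Words not in the dictionary are simply ignored; every dictionary word gets a count, even if it is zero.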
Naïve Bayes: Mixture of Multinomials Model
1. Pick the class: P(class)
2. For every word, pick from the class urn: P(word|class)
[Diagram: two urns of class-specific words. The SPORTS urn holds words such as "the", "ball", "polo", "soccer", "dropped", "while", "activity"; the COMPUTERS urn holds words such as "the", "in", "web", "windows", "java", "modem", "again".]

Word independence assumption!
Naïve Bayes: Estimating Parameters
• Just like estimating biased coin-flip probabilities
• Estimate MAP word probabilities:
  P(word | class) = (1 + N(word, class)) / (|Vocab| + Σ_w N(w, class))
• Estimate MAP class priors:
  P(class) = (1 + N(class)) / (|Classes| + N(docs))
  where N(word, class) counts occurrences of word in documents of class, N(class) counts training documents labeled class, and N(docs) is the total number of training documents.
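The add-one (MAP) estimates can be sketched as follows. The function name and data layout are illustrative assumptions; the smoothing follows the slide's formulas:

```python
from collections import Counter

def train_naive_bayes(docs, labels, vocab):
    """Add-one MAP estimates:
    P(word|class) = (1 + N(word, class)) / (|Vocab| + total words in class)
    P(class)      = (1 + docs in class) / (|Classes| + total docs)
    """
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}   # N(word, class)
    doc_counts = Counter(labels)                    # N(class)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    priors = {c: (1 + doc_counts[c]) / (len(classes) + len(docs)) for c in classes}
    cond = {}
    for c in classes:
        total = sum(word_counts[c].values())
        cond[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total) for w in vocab}
    return priors, cond

docs = [["corn", "corn", "prices"], ["ball", "soccer"]]
labels = ["business", "sports"]
vocab = sorted({w for d in docs for w in d})
priors, cond = train_naive_bayes(docs, labels, vocab)
```

With the smoothing, unseen words still get a small nonzero probability instead of zeroing out the whole product.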
Naïve Bayes: Performing Classification
• Word independence assumption
• Take the class with the highest probability:
  P(class | doc) = P(doc | class) P(class) / P(doc)
  class* = argmax_class P(class) Π_{word ∈ doc} P(word | class)
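A minimal sketch of the argmax, using the sum-of-logs trick from the rules-of-thumb slide to avoid underflow. The dictionaries of priors and conditional probabilities are assumed to come from training (toy values below are illustrative only):

```python
import math

def classify(doc_words, priors, cond):
    """argmax_class P(class) * prod P(word|class), computed as a sum of logs."""
    scores = {}
    for c in priors:
        s = math.log(priors[c])
        for w in doc_words:
            if w in cond[c]:          # skip out-of-vocabulary words
                s += math.log(cond[c][w])
        scores[c] = s
    return max(scores, key=scores.get)

priors = {"sports": 0.5, "computers": 0.5}
cond = {"sports":    {"ball": 0.4,  "the": 0.3, "windows": 0.05},
        "computers": {"ball": 0.05, "the": 0.3, "windows": 0.4}}
label = classify(["the", "ball", "ball"], priors, cond)
```

Multiplying hundreds of small probabilities directly would underflow to 0.0 in floating point; sums of logs are monotone in the same ordering, so the argmax is unchanged.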
Classification Tricks of the Trade
• Stemming
  – run, runs, running, ran → run
  – table, tables, tabled → table
  – computer, compute, computing → compute
• Stopwords
  – Very frequent function words, generally uninformative
  – if, in, the, like, …
• Information-gain feature selection
  – Keep just the most indicative words in the vocabulary
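A toy preprocessing sketch for the first two tricks. The stopword set and suffix list are illustrative assumptions only; a real system would use a proper stemmer (e.g. Porter's algorithm) rather than this crude suffix stripping:

```python
STOPWORDS = {"if", "in", "the", "like", "of", "and", "a"}

def preprocess(tokens):
    """Drop stopwords, then apply a crude suffix-stripping 'stem'.
    The suffix list is a toy stand-in for a real stemmer."""
    out = []
    for t in tokens:
        if t in STOPWORDS:
            continue
        for suf in ("ning", "ing", "es", "ed", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        out.append(t)
    return out

result = preprocess(["the", "runs", "running", "tables"])
```

Note the imperfections ("tables" becomes "tabl", not "table"); as the next slide says, these tricks may or may not help in practice.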
Naïve Bayes Rules of Thumb
• Need hundreds of labeled examples per class for good performance (~85% accuracy)
• Stemming and stopwords may or may not help
• Feature selection may or may not help
• Predicted probabilities will be very extreme
• Use sum of logs instead of multiplying probabilities for underflow prevention
• Coding this up is trivial, either as a mapreduce or not
Information Extraction with Generative Models
Example: A Problem
Genomics job
Mt. Baker, the school district
Baker Hostetler, the company
Baker, a job opening
Example: A Solution
Job Openings:
  Category = Food Services
  Keyword = Baker
  Location = Continental U.S.
Extracting Job Openings from the Web
Title: Ice Cream Guru
Description: If you dream of cold creamy…
Contact: [email protected]
Category: Travel/Hospitality
Function: Food Services
Potential Enabler of Faceted Search
Lots of Structured Information in Text
IE from Research Papers
What is Information Extraction?
• Recovering structured data from formatted text
  – Identifying fields (e.g. named entity recognition)
  – Understanding relations between fields (e.g. record association)
  – Normalization and deduplication
• Today, focus on field identification
IE History
Pre-Web
• Mostly news articles
  – De Jong's FRUMP [1982]: hand-built system to fill Schank-style "scripts" from news wire
  – Message Understanding Conference (MUC) DARPA ['87-'95], TIPSTER ['92-'96]
• Most early work dominated by hand-built models
  – E.g. SRI's FASTUS, hand-built FSMs
  – But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al '98]
Web
• AAAI '94 Spring Symposium on "Software Agents"
  – Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni
• Tom Mitchell's WebKB, '96
  – Build KBs from the Web
• Wrapper Induction
  – Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], …
IE Posed as a Machine Learning Task
• Training data: documents marked up with ground truth
• In contrast to text classification, local features are crucial. Features of:
  – Contents
  – Text just before the item
  – Text just after the item
  – Begin/end boundaries
00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
prefix contents suffix
… …
Good Features for Information Extraction
Example word features:– identity of word
– is in all caps
– ends in “-ski”
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are “and Associates”
begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
contains-question-word
ends-with-question-mark
first-alpha-is-capitalized
indented
indented-1-to-4
indented-5-to-10
more-than-one-third-space
only-punctuation
prev-is-blank
prev-begins-with-ordinal
shorter-than-30
Creativity and Domain Knowledge Required!
Is Capitalized
Is Mixed Caps
Is All Caps
Initial Cap
Contains Digit
All lowercase
Is Initial
Punctuation
Period
Comma
Apostrophe
Dash
Preceded by HTML tag
Character n-gram classifier says string is a person name (80% accurate)
In stopword list(the, of, their, etc)
In honorific list(Mr, Mrs, Dr, Sen, etc)
In person suffix list(Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list;segmented by P(name)
In Census firstname list;segmented by P(name)
In locations lists(states, cities, countries)
In company name list(“J. C. Penny”)
In list of company suffixes(Inc, & Associates, Foundation)
Word Features
– Lists of job titles
– Lists of prefixes
– Lists of suffixes
– 350 informative phrases

HTML/Formatting Features
– {begin, end, in} x {<b>, <i>, <a>, <hN>} x {lengths 1, 2, 3, 4, or longer}
– {begin, end} of line

Creativity and Domain Knowledge Required!
Landscape of ML Techniques for IE:
Any of these models can be used to capture words, formatting or both.
Classify Candidates
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Sliding Window
Abraham Lincoln was born in Kentucky.
Classifier
which class?
Try alternate window sizes:
Boundary Models
Abraham Lincoln was born in Kentucky.
Classifier
which class?
BEGIN END BEGIN END
BEGIN
Finite State Machines
Abraham Lincoln was born in Kentucky.
Most likely state sequence?
Wrapper Induction
<b><i>Abraham Lincoln</i></b> was born in Kentucky.
Learn and apply pattern for a website
<b>
<i>
PersonName
Sliding Windows & Boundary Detection
Information Extraction by Sliding Windows
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm, 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for the seminar location
Information Extraction with Sliding Windows
[Freitag 97, 98; Soderland 97; Califf 98]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
   w_{t-m} … w_{t-1} | w_t … w_{t+n} | w_{t+n+1} … w_{t+n+m}
   prefix              contents        suffix
• Standard supervised learning setting
  – Positive instances: windows with a real label
  – Negative instances: all other windows
  – Features based on candidate, prefix and suffix
• Special-purpose rule learning systems work well:

  courseNumber(X) :-
      tokenLength(X, =, 2),
      every(X, inTitle, false),
      some(X, A, <previousToken>, inTitle, true),
      some(X, B, <>, tripleton, true)
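The window enumeration underlying this setting can be sketched simply; a trained classifier would then score each candidate using features of its prefix, contents, and suffix. The function name and length bounds are illustrative assumptions:

```python
def sliding_windows(tokens, min_len=1, max_len=4):
    """Enumerate candidate windows (start, end, contents) over a token sequence.
    Every window is a classification instance; windows matching a labeled field
    are positives, all others negatives."""
    for start in range(len(tokens)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end <= len(tokens):
                yield start, end, tokens[start:end]

tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
windows = list(sliding_windows(tokens, max_len=3))
```

Even for short documents the candidate set grows quickly (roughly length x max window size), which is why feature-based scoring has to be cheap.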
IE by Boundary Detection
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm, 7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
E.g. looking for the seminar location
BWI: Learning to detect boundaries
• Another formulation: learn three probabilistic classifiers:
  – START(i) = Prob(position i starts a field)
  – END(j) = Prob(position j ends a field)
  – LEN(k) = Prob(an extracted field has length k)
• Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j - i)
• LEN(k) is estimated from a histogram
• START(i) and END(j) learned by boosting over simple boundary patterns and features
[Freitag & Kushmerick, AAAI 2000]
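The scoring rule above is easy to sketch. The dictionaries stand in for the learned boosted classifiers (START, END) and the training-set length histogram (LEN); all names and the toy probabilities are illustrative assumptions, not part of BWI itself:

```python
def score_extraction(i, j, start_prob, end_prob, length_hist):
    """BWI-style score for a candidate span (i, j): START(i) * END(j) * LEN(j - i).
    Unseen positions or lengths score 0."""
    return start_prob.get(i, 0.0) * end_prob.get(j, 0.0) * length_hist.get(j - i, 0.0)

def best_span(positions, start_prob, end_prob, length_hist):
    """Pick the highest-scoring (start, end) pair among candidate positions."""
    candidates = [(i, j) for i in positions for j in positions if j > i]
    return max(candidates,
               key=lambda ij: score_extraction(*ij, start_prob, end_prob, length_hist))

start_prob = {2: 0.9, 5: 0.1}          # toy learned START probabilities
end_prob = {4: 0.8, 6: 0.2}            # toy learned END probabilities
length_hist = {1: 0.2, 2: 0.6, 4: 0.2} # toy empirical LEN histogram
span = best_span([2, 4, 5, 6], start_prob, end_prob, length_hist)
```

The LEN factor is what couples the two boundary decisions: a confident start and a confident end still score poorly if their distance is an implausible field length.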
Problems with Sliding Windows and Boundary Finders
• Decisions in neighboring parts of the input are made independently from each other.
– Sliding Window may predict a “seminar end time” before the “seminar start time”.
– It is possible for two overlapping windows to both be above threshold.
– In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
Hidden Markov Models
Citation Parsing
• Fahlman, Scott & Lebiere, Christian (1989). The cascade-correlation learning architecture. Advances in Neural Information Processing Systems, pp. 524-532.
• Fahlman, S.E. and Lebiere, C., "The Cascade Correlation Learning Architecture," Neural Information Processing Systems, pp. 524-532, 1990.
• Fahlman, S. E. (1991) The recurrent cascade-correlation learning architecture. NIPS 3, 190-205.
Can we do this with probabilistic generative models?
• Could have classes for {author, title, journal, year, pages}
• Could classify every word or sequence?
  – Which sequences?
• Something interesting in the sequence of fields that we'd like to capture
  – Authors come first
  – Title comes before journal
  – Page numbers come near the end
Hidden Markov Models: The Representation
• A document is a sequence of words
• Each word is tagged by its class
• fahlman s e and lebiere c | the cascade correlation learning architecture | neural information processing systems | pp 524 532 | 1990  (segments tagged author / title / journal / pages / year)
HMM: Generative Model (1)
[Diagram: states Author, Title, Journal, Year, Pages, with transitions between them]
HMM: Generative Model (2)
[Diagram: a variant state topology over Author, Title, Year, Pages]
HMM: Generative Model (3)
• States: x_i
• State transitions: P(x_i|x_j) = a[x_i|x_j]
• Output probabilities: P(o_i|x_j) = b[o_i|x_j]
• Markov independence assumption
HMMs: Estimating Parameters
• With fully-labeled data, just like naïve Bayes
• Estimate MAP output probabilities:
  b[o_i | x_j] = (1 + N(o_i, x_j)) / (|Vocab| + Σ_w N(w, x_j))
• Estimate MAP state transitions:
  a[x_i | x_j] = (1 + N(x_j → x_i)) / (|States| + Σ_k N(x_j → x_k))
  where N(o_i, x_j) counts emissions of word o_i from state x_j, and N(x_j → x_i) counts transitions from x_j to x_i in the labeled data.
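From fully labeled sequences these counts reduce to simple tallying. A minimal sketch, assuming each training document is a list of (word, state) pairs (the function name and data layout are illustrative):

```python
from collections import Counter

def estimate_hmm(tagged_docs, vocab, states):
    """Add-one MAP estimates from labeled sequences:
    b[(o, x)]  = (1 + emissions of o from x)  / (|Vocab|  + emissions from x)
    a[(x2,x1)] = (1 + transitions x1 -> x2)   / (|States| + transitions out of x1)"""
    emit, trans = Counter(), Counter()
    emit_totals, trans_totals = Counter(), Counter()
    for seq in tagged_docs:                      # seq: list of (word, state) pairs
        for (w, s) in seq:
            emit[(w, s)] += 1
            emit_totals[s] += 1
        for (_, s1), (_, s2) in zip(seq, seq[1:]):
            trans[(s2, s1)] += 1
            trans_totals[s1] += 1
    b = {(w, s): (1 + emit[(w, s)]) / (len(vocab) + emit_totals[s])
         for w in vocab for s in states}
    a = {(s2, s1): (1 + trans[(s2, s1)]) / (len(states) + trans_totals[s1])
         for s1 in states for s2 in states}
    return a, b

seq = [("fahlman", "author"), ("s", "author"), ("the", "title"), ("cascade", "title")]
a, b = estimate_hmm([seq], vocab=["fahlman", "s", "the", "cascade"],
                    states=["author", "title"])
```

As with naïve Bayes, the add-one terms keep unseen emissions and transitions from getting probability zero.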
HMMs: Performing Extraction
• Given output words:– fahlman s e 1991 the recurrent cascade correlation learning architecture nips 3 190 205
• Find the state sequence that maximizes:
  Π_i a[x_i | x_{i-1}] · b[o_i | x_i]
• Lots of possible state sequences to test (5^14)
Hmm…
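Rather than enumerating all 5^14 sequences, dynamic programming over the trellis (the Viterbi algorithm) finds the best sequence in time linear in the sentence length. A minimal sketch in log space; the dictionary-based parameter layout and the toy two-state example are illustrative assumptions:

```python
import math

def viterbi(words, states, a, b, init):
    """Most likely state sequence: max over paths of init * prod a[...] * b[...],
    computed column-by-column over the trellis with backpointers."""
    V = [{s: math.log(init[s]) + math.log(b[(words[0], s)]) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(a[(s, p)]))
            col[s] = V[-1][prev] + math.log(a[(s, prev)]) + math.log(b[(w, s)])
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):        # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["author", "year"]
a = {("author", "author"): 0.2, ("year", "author"): 0.8,
     ("author", "year"): 0.5, ("year", "year"): 0.5}
b = {("fahlman", "author"): 0.9, ("fahlman", "year"): 0.1,
     ("1991", "author"): 0.1, ("1991", "year"): 0.9}
init = {"author": 0.9, "year": 0.1}
path = viterbi(["fahlman", "1991"], states, a, b, init)
```

Each trellis column keeps only the best score per state, so the work is O(n · |states|²) instead of exponential.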
Representation for Paths: Trellis
HMM Example: Nymble  [Bikel, et al 97]
Task: Named Entity Extraction
Train on 450k words of news wire text.
States: Person, Org, Other (plus five other name classes), start-of-sentence, end-of-sentence
• Bigram within classes
• Backoff to unigram
• Special capitalization and number features …

Results:
  Case    Language    F1
  Mixed   English     93%
  Upper   English     91%
  Mixed   Spanish     90%
Nymble word features
HMMs: A Plethora of Applications
• Information extraction
• Part-of-speech tagging
• Word segmentation
• Gene finding
• Protein structure prediction
• Speech recognition
• Economics, climatology, robotics, …