Information Extraction - GitHub Pages · 2020-06-17 · Pattern Complexity E.g., word patterns...
Information Extraction
Pedro Szekely
Information Sciences Institute,
USC Viterbi School of Engineering
Agenda
Information extraction classification
Text extraction techniques
Storing extractions in knowledge graphs
myDIG demo
Summary
Document Features
Text paragraphs without formatting
Grammatical sentences plus some formatting & links
Non-grammatical snippets, rich formatting & links
Tables
Charts
Example of a text paragraph without formatting:
Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.
Kejriwal, Szekely
Scope
Web site specific
Genre specific (e.g., forums)
Wide, non-specific
Pattern Complexity
E.g., word patterns
Closed set (U.S. states):
He was born in Alabama…
The big Wyoming sky…
Regular set (U.S. phone numbers):
Phone: (413) 545-1323
The CALD main office can be reached at 412-268-1299
Complex pattern (U.S. postal addresses):
University of Arkansas, P.O. Box 140, Hope, AR 71802
Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210
Ambiguous patterns, needing context and many sources of evidence (person names):
…was among the six houses sold by Hope Feldman that year.
Pawel Opalinski, Software Engineer at WhizBang Labs.
Courtesy of Andrew McCallum
“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤ 7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
small amount of relevant content
irrelevant content very similar to relevant content
Practical Considerations
How good (precision/recall) is necessary?
High precision when showing extractions to users
High recall when used for ranking results
How long does it take to construct?
Minutes, hours, days, months
What expertise do I need?
None (domain expertise), patience (annotation), simple scripting, machine learning guru
What tools can I use?
Many …
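The precision/recall question can be made concrete with a few lines of Python. This is a minimal sketch; the function name and the example extractions are illustrative assumptions, not part of the slides.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of a set of extractions against a gold set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extractions that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical run: three extractions, two of them correct, four gold items.
extracted = {"Alabama", "Wyoming", "Hope"}
gold = {"Alabama", "Wyoming", "Arkansas", "Ohio"}
p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

An extractor shown directly to users would aim to push the first number up; one that feeds a ranker can tolerate lower precision in exchange for higher recall.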
Information Extraction Process
Segmentation Data Extraction
Name: Legacy Ventures Intl, Inc.
Stock: LGYV
Date: 2017-07-14
Market Cap: 391,030
Segmentation
Semi-structured extraction
Table extraction
Main content identification
Custom regular expressions
Text segments
Text Extraction Techniques
Glossary
Regular expressions
Natural language rules
Named entity recognition
Sequence labeling (Conditional Random Fields)
Glossary Extraction
Simple: a list of words or phrases to extract
Challenges:
Ambiguity: Charlotte is the name of both a person and a city
Colloquial expressions: “Asia Broadband, Inc.” vs “Asia Broadband”
Research:
Improving precision of glossary extractions using context
Creating/extending glossaries automatically
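A glossary extractor can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, matching is case-insensitive, and longer phrases are preferred over their shorter variants.

```python
import re


def build_glossary_matcher(phrases):
    """Compile one case-insensitive regex that matches any glossary phrase."""
    # Longest phrases first so "Asia Broadband, Inc." wins over "Asia Broadband".
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # Lookarounds act as word boundaries that also work for phrases ending in ".".
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def extract(text, matcher):
    """Return (start, end, matched text) for every glossary hit."""
    return [(m.start(), m.end(), m.group(0)) for m in matcher.finditer(text)]


glossary = ["Asia Broadband, Inc.", "Asia Broadband", "Charlotte"]
matcher = build_glossary_matcher(glossary)
hits = extract("Charlotte bought shares of Asia Broadband, Inc. in Charlotte.", matcher)
for hit in hits:
    print(hit)
```

Both occurrences of "Charlotte" are returned, whether they refer to a person or a city, which is the ambiguity problem listed above; resolving it needs context beyond the glossary.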
Regex Extraction
Extraction Using Regular Expressions
Too difficult for non-programmers
regex for North American phone numbers:
^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
Brittle and difficult to adapt to unusual domains
unusual nomenclature and short-hands
obfuscation
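To show how such a pattern behaves in practice, here is a minimal Python sketch that applies the phone-number regex above. The wrapper function `parse_phone` and the test strings are assumptions made for illustration; the pattern itself is the one quoted on the slide, split across string literals only for readability.

```python
import re

NANP_PHONE = re.compile(
    r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?"
    r"(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)"
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?"
    r"([0-9]{4})"
    r"(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"
)


def parse_phone(text):
    """Return (area, exchange, line, extension), or None if the string does not match."""
    m = NANP_PHONE.match(text.strip())
    if m is None:
        return None
    area = m.group(1) or m.group(2)  # area code with or without parentheses
    return area, m.group(3), m.group(4), m.group(5)


print(parse_phone("(413) 545-1323"))
print(parse_phone("412-268-1299"))
print(parse_phone("4one2 268 1299"))  # a spelled-out digit defeats the pattern
```

The last call illustrates the brittleness noted above: the pattern is anchored to the whole string and expects plain digits, so even light obfuscation yields no match.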
NLP Rule-Based Extraction
Tokenization
Pattern Matching
Tokenization
My name is Pedro → My | name | is | Pedro
310-822-1511 → 310-822-1511 or 310 | - | 822 | - | 1511
Candy is here → Candy | is | here
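A tokenizer along these lines can be sketched with the standard library alone. This is an assumption about one reasonable design, not the tokenizer used in the talk: it splits on whitespace and emits every punctuation mark or symbol (including emoji) as its own token.

```python
def tokenize(text):
    """Split on whitespace; emit each punctuation mark or symbol as its own token."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():
            current.append(ch)  # keep building the current word or number
            continue
        if current:  # any non-alphanumeric character ends the current token
            tokens.append("".join(current))
            current = []
        if not ch.isspace():
            tokens.append(ch)  # punctuation or symbol becomes a token
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("My name is Pedro"))
print(tokenize("310-822-1511"))
print(tokenize("Candy \u2764 is here"))
```

Splitting the phone number into digit groups and separators makes it possible to write token-level patterns over it later.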
Token Properties
Surface properties: literal, type, shape, capitalization, length, prefix, suffix, minimum, maximum
Language properties: part of speech tag, lemma, dependency
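The surface properties can be computed directly from the token string. The sketch below is illustrative: the dictionary keys, the three-character prefix/suffix length, and the shape alphabet (X, x, d) are assumptions. Language properties such as part of speech or lemma need an NLP library and are not covered here.

```python
def token_properties(token):
    """Compute a few surface properties of a single token."""
    if token.isdigit():
        kind = "numeric"
    elif token.isalpha():
        kind = "alphabetic"
    elif token.isalnum():
        kind = "alphanumeric"
    else:
        kind = "other"
    # Shape: uppercase -> X, lowercase -> x, digit -> d, anything else unchanged.
    shape = "".join(
        "X" if c.isupper() else "x" if c.islower() else "d" if c.isdigit() else c
        for c in token
    )
    return {
        "literal": token,
        "type": kind,
        "shape": shape,
        "capitalized": token[:1].isupper(),
        "length": len(token),
        "prefix": token[:3],
        "suffix": token[-3:],
    }


print(token_properties("Pedro"))
print(token_properties("1511"))
```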
Token Types
Patterns
Pattern := Token-Spec
         | [Token-Spec]        (optional)
         | Token-Spec +        (one or more)
         | Token-Spec Pattern
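The grammar above can be turned into a small backtracking matcher. This is a sketch under assumptions: a pattern is a list of (predicate, quantifier) pairs, where the quantifier is "1" (required), "?" (optional) or "+" (one or more); these encodings and all function names are hypothetical.

```python
def match_at(tokens, start, pattern):
    """Return the end index of a match of `pattern` beginning at `start`, or None."""
    if not pattern:
        return start
    (test, quant), rest = pattern[0], pattern[1:]
    here = start < len(tokens) and test(tokens[start])
    if quant == "?":  # [Token-Spec]: optional
        if here:
            end = match_at(tokens, start + 1, rest)
            if end is not None:
                return end
        return match_at(tokens, start, rest)
    if not here:
        return None
    if quant == "+":  # Token-Spec +: try to consume more tokens first (greedy)
        end = match_at(tokens, start + 1, pattern)
        if end is not None:
            return end
    return match_at(tokens, start + 1, rest)


def find_all(tokens, pattern):
    """Return (start, end) spans of non-overlapping matches, scanning left to right."""
    spans, i = [], 0
    while i < len(tokens):
        end = match_at(tokens, i, pattern)
        if end is not None and end > i:
            spans.append((i, end))
            i = end
        else:
            i += 1
    return spans


def literal(word):
    """Token spec: case-insensitive literal match."""
    return lambda token: token.lower() == word


def is_capitalized(token):
    """Token spec: token starts with an uppercase letter."""
    return token[:1].isupper()


# Optional "my", then "name is", then one or more capitalized tokens.
pattern = [(literal("my"), "?"), (literal("name"), "1"),
           (literal("is"), "1"), (is_capitalized, "+")]
tokens = ["My", "name", "is", "Pedro", "Szekely", "of", "ISI"]
print(find_all(tokens, pattern))
```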
Positive/Negative Patterns
Positive: generate candidates
Negative: remove candidates
Output overlaps positive candidates
General
Specific
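One way to read the positive/negative idea is sketched below, under the assumption that a positive candidate is dropped whenever it overlaps a span matched by a negative pattern. Character-level regexes stand in for token patterns to keep the example short; the patterns and the sentence are illustrative.

```python
import re


def spans(regexes, text):
    """All (start, end) spans matched by any of the regexes."""
    return [m.span() for rx in regexes for m in re.finditer(rx, text)]


def overlaps(a, b):
    """True if half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def extract(text, positive, negative):
    """Keep positive candidates that do not overlap any negative match."""
    rejected = spans(negative, text)
    return [
        text[s:e]
        for s, e in spans(positive, text)
        if not any(overlaps((s, e), r) for r in rejected)
    ]


# General positive pattern: every capitalized word is a candidate name.
positive = [r"\b[A-Z][a-z]+\b"]
# Specific negative patterns: known non-names that the general rule also catches.
negative = [r"\bThe\b", r"\bWyoming\b"]
print(extract("The big Wyoming sky greeted Pedro", positive, negative))
```

Keeping the positive rule general and the negative rules specific means each new false positive can be handled by adding one narrow rule rather than rewriting the broad one.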
DIG Demo
https://spacy.io/docs/usage/rule-based-matching
Advantages/Disadvantages
Advantages:
Easy to define
High precision
Recall increases with number of rules
Disadvantages:
Text must follow strict patterns
NLP Rule-Based Extraction
Tokenization for unusual domains:
tokenize on white-space, punctuation and emojis
Token properties:
literal, part of speech tag, lemma, in/out of dictionary
dependency parsing relationships (advanced)
type (alphanumeric, alphabetic, numeric)
shape (pattern of digits and characters), capitalization, prefix and suffix
number of characters, range (numbers)
Pattern:
sequence of required/optional tokens
positive and negative patterns
Named Entity Recognizers
Machine learning models: people, places, organizations and a few others
spaCy: complete NLP toolkit, Python (Cython), MIT license
code: https://github.com/explosion/spaCy
demo: http://textanalysisonline.com/spacy-named-entity-recognition-ner
Stanford NER: part of Stanford’s NLP software library, Java, GNU GPL license
code: https://nlp.stanford.edu/software/CRF-NER.shtml
demo: http://nlp.stanford.edu:8080/ner/process
https://spacy.io/docs/usage/entity-recognition
https://demos.explosion.ai/displacy-ent
Advantages/Disadvantages
Advantages:
Easy to use
Tolerant of some noise
Easy to train
Disadvantages:
Performance degrades rapidly for new genres, language models
Requires hundreds to thousands of training examples
Conditional Random Fields
Discriminative Vs. Generative
● Generative model: a model that generates observed data randomly
● Naïve Bayes: once the class label is known, all the features are independent
● Discriminative: directly estimate the posterior probability; aim at modeling the “discrimination” between different outputs
● MaxEnt classifier: linear combination of feature functions in the exponent
Both generative models and discriminative models describe distributions over (y, x), but they work in different directions.
slide by Daniel Khashabi
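The formulas on this slide did not survive extraction. For reference, the standard textbook forms of the two models named above are sketched here; the symbols ($K$ features $x_k$, weights $\lambda_k$, feature functions $f_k$, normalizer $Z$) are assumptions about notation, not a reconstruction of the original slide.

```latex
% Naive Bayes (generative): models the joint distribution, with features
% conditionally independent given the class label y.
p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y)

% MaxEnt classifier (discriminative): models the posterior directly, with a
% linear combination of feature functions in the exponent.
p(y \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{k} \lambda_k f_k(y, \mathbf{x}) \Big),
\qquad
Z(\mathbf{x}) = \sum_{y'} \exp\Big( \sum_{k} \lambda_k f_k(y', \mathbf{x}) \Big)
```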
Chain CRFs
● Each potential function operates on pairs of adjacent label variables
● The model consists of parameters to be estimated and feature functions
slide by Daniel Khashabi
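The formula itself is missing from the transcript. The usual linear-chain CRF definition is given below as a hedged reference; it uses feature functions of the form $f_j(\mathbf{x}, y_{i-1}, y_i, i)$ with weights $\lambda_j$ as the parameters to be estimated, and the exact notation of the original slide is an assumption.

```latex
% Linear-chain CRF: each factor looks at a pair of adjacent labels
% (y_{i-1}, y_i) together with the observation sequence x.
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
    \prod_{i=1}^{n} \exp\Big( \sum_{j} \lambda_j \, f_j(\mathbf{x}, y_{i-1}, y_i, i) \Big)

% The normalizer sums over all possible label sequences y'.
Z(\mathbf{x}) = \sum_{\mathbf{y}'}
    \prod_{i=1}^{n} \exp\Big( \sum_{j} \lambda_j \, f_j(\mathbf{x}, y'_{i-1}, y'_i, i) \Big)
```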
Chain CRF
● We can change it so that each state depends on more observations
● Or inputs at previous steps
● Or all inputs
slide by Daniel Khashabi
Modeling Problems With CRF
| i | X1 (word) | X2 (capitalized) | X3 (POS tag) | Y (entity) |
|---|---|---|---|---|
| 1 | My | 1 | Possessive Pron | Other |
| 2 | name | 0 | Noun | Other |
| 3 | is | 0 | Verb | Other |
| 4 | Pedro | 1 | Proper Noun | Person-Name |
| 5 | Szekely | 1 | Proper Noun | Person-Name |
Other common features: lemma, prefix, suffix, length
Feature functions: f_j(x, y_{i-1}, y_i, i)
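To make the feature-function signature concrete, the sketch below encodes the example table and three hand-written feature functions, then scores a label sequence. The specific features, the weights, and the "START" placeholder label are hypothetical; a trained CRF would estimate the weights from annotated data.

```python
# Rows of the example table: (word, capitalized, POS tag), plus the labels.
x = [("My", 1, "Possessive Pron"), ("name", 0, "Noun"), ("is", 0, "Verb"),
     ("Pedro", 1, "Proper Noun"), ("Szekely", 1, "Proper Noun")]
y = ["Other", "Other", "Other", "Person-Name", "Person-Name"]


def f_cap_person(x, y_prev, y_cur, i):
    """1 if the current word is capitalized and labeled Person-Name."""
    return int(x[i][1] == 1 and y_cur == "Person-Name")


def f_propn_person(x, y_prev, y_cur, i):
    """1 if the current word is a proper noun and labeled Person-Name."""
    return int(x[i][2] == "Proper Noun" and y_cur == "Person-Name")


def f_person_person(x, y_prev, y_cur, i):
    """Transition feature: a Person-Name label follows a Person-Name label."""
    return int(y_prev == "Person-Name" and y_cur == "Person-Name")


features = [f_cap_person, f_propn_person, f_person_person]
weights = [1.0, 2.0, 0.5]  # hypothetical values, not learned


def score(x, y):
    """Unnormalized log-score: sum over positions i and features j of w_j * f_j."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"
        for w, f in zip(weights, features):
            total += w * f(x, y_prev, y[i], i)
    return total


print(score(x, y))
print(score(x, ["Other"] * len(x)))
```

With these weights the labeling from the table scores higher than labeling every token "Other", which is the behavior training is meant to produce.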
Advantages/Disadvantages
Advantages:
Expressive
Tolerant of noise
Stood test of time
Software packages available
Disadvantages:
Requires feature engineering
Requires thousands of training examples
Open Information Extraction
http://openie.allenai.org/
Practical IE Technologies
| | Glossary | Regex | NLP Rules | Semi-Structured | CRF | NER | Table |
|---|---|---|---|---|---|---|---|
| Effort | assemble glossary | hours | hours | minutes | O(1000) annotations | zero | O(10) annotations |
| Expertise | minimal | high, programmer | low | minimal | low-medium | zero | minimal |
| Precision | medium (ambiguity) | high | high | high | medium-high | medium-high | high |
| Recall | medium (formatting) | low, f(# regex) | medium, f(# rules) | high | medium | medium | high |
how to represent KGs?
KG Definition
a directed, labeled multi-relational graph
representing facts/assertions as triples
(h, r, t) head entity, relation, tail entity
(s, p, o) subject, predicate, object
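The triple representation maps naturally onto a set of tuples. The sketch below is a minimal in-memory illustration; the `Triple` type, the `query` helper, and the direction of each edge are assumptions, with entity and relation names borrowed from the stock-promotion example used in this deck.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical facts; edge directions are assumed for illustration.
kg = {
    Triple("Legacy Ventures International Inc", "stock-ticker", "LGYV"),
    Triple("Legacy Ventures International Inc", "is-a", "Company"),
    Triple("Legacy Ventures International Inc", "promoter", "Damn Good Penny Stocks"),
}


def query(kg, subject=None, predicate=None, obj=None):
    """Return triples matching the given fields; None acts as a wildcard."""
    return sorted(
        t for t in kg
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    )


for triple in query(kg, subject="Legacy Ventures International Inc"):
    print(triple)
```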
Simplest Knowledge Graph
Entities
Nodes: LGYV; Legacy Ventures International Inc; Damn Good Penny Stocks
Edge label: mentions
Easiest to build
Simple, But Useful KG
Entities + properties
Nodes: LGYV; Legacy Ventures International Inc; Damn Good Penny Stocks
Label: company
“Easy” to build
Semantic Web KG (RDF/OWL)
Entities + properties + classes
Nodes: LGYV; Legacy Ventures International Inc; Damn Good Penny Stocks
Class: Company
Edge labels: stock-ticker, promoter, is-a
Very hard to build
“Ideal” KG
Entities + properties + classes + qualifiers
Nodes: LGYV; Legacy Ventures International Inc; Damn Good Penny Stocks
Class: Company
Edge labels: stock-ticker, promoter, is-a
Qualifiers: start-date June 2017; source stockreads.com
Very very hard to build
“More Ideal” KG
Entities + properties + provenance + confidence + qualifiers
“Not so hard” to build
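One way to picture such a graph is as a node-centric document in which every extracted value carries its own confidence, provenance and qualifiers. The sketch below is purely illustrative: all field names, the confidence numbers, and the extraction-method labels are assumptions, not the schema used by any particular tool.

```python
import json

# One entity as a node-centric document (hypothetical schema and values).
entity = {
    "id": "company/LGYV",
    "stock-ticker": [
        {"value": "LGYV", "confidence": 0.95,
         "provenance": {"source": "stockreads.com", "method": "glossary"}},
    ],
    "promoter": [
        {"value": "Damn Good Penny Stocks", "confidence": 0.7,
         "qualifiers": {"start-date": "June 2017"},
         "provenance": {"source": "stockreads.com", "method": "nlp-rule"}},
    ],
}


def confident_values(entity, prop, threshold):
    """Values of a property whose extraction confidence meets the threshold."""
    return [v["value"] for v in entity.get(prop, []) if v["confidence"] >= threshold]


print(json.dumps(entity, indent=2))
print(confident_values(entity, "stock-ticker", 0.8))
print(confident_values(entity, "promoter", 0.8))  # below the threshold, so empty
```

Storing confidence and provenance next to each value lets an application filter noisy extractions at query time instead of discarding them during construction.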
Where to Store KGs?
Serializing Knowledge Graphs
Resource Description Framework (RDF)
Database (triple store): AllegroGraph, Virtuoso
Query: SPARQL (SQL-like)
Key-Value, Document Stores
Data model: node-centric
Databases: HBase, MongoDB, Elasticsearch, …
Query: filters, keywords, aggregation (no joins)
Graph Databases
Data model: graph
Databases: Neo4j, Cayley, MarkLogic, GraphDB, Titan, OrientDB, Oracle, …
Query: GraphQL, Gremlin, Cypher
Popularity Ranking Of Graph Databases
https://db-engines.com/en/ranking_trend/graph+dbms
ElasticSearch, MongoDB & Neo4J Have Wide Adoption
Triple Stores
https://db-engines.com/en/ranking_trend/graph+dbms
myDIG: A KG Construction Toolkit
Python, MIT license, https://github.com/usc-isi-i2/dig-etl-engine
Enable end-users to construct domain-specific KGs: end users from 5 government orgs constructed KGs in less than one day
Suite of extraction techniques: semi-structured HTML pages, glossaries, NLP rules, NER, tables (coming soon)
KG includes provenance and confidences: enable research to improve extractions and KG quality
Scalable: runs on laptop (~100K docs), cluster (> 100M docs)
Robust: deployed to many law enforcement agencies
Easy to install: Docker deployment with single “docker compose up” installation
myDIG Demo
Summary
Partition pages into segments
Select technology based on segment features
Do knowledge graph completion (next topic)
Choose representation based on application demands