Information Extraction

Sources:
• Sarawagi, S. (2008). Information extraction. Foundations and Trends in Databases, 1(3), 261–377.
• Hobbs, J. R., & Riloff, E. (2010). Information extraction. Handbook of Natural Language Processing, 2nd ed.
CONTEXT
History
• Genesis: recognition of named entities (organization & people names)
• Online access pushes towards:
  – personal desktops -> structured databases
  – scientific publications -> structured records
  – Internet -> structured fact-finding queries
Driving workshops / conferences
– 1987–97: MUC (Message Understanding Conference): filling slots, named entities & coreference (1995–)
– 1999–2008: ACE (Automatic Content Extraction): "supporting various classification, filtering, and selection applications by extracting and representing language content"
– 2008–now: TAC (Text Analysis Conference)
  • Knowledge Base Population (2009–11)
  • Others: textual entailment, summarization, QA (until 2009)
Example: MUC
0.  MESSAGE: ID                   TST1-MUC3-0001
1.  MESSAGE: TEMPLATE             1
2.  INCIDENT: DATE                02 FEB 90
3.  INCIDENT: LOCATION            GUATEMALA: SANTO TOMAS (FARM)
4.  INCIDENT: TYPE                ATTACK
5.  INCIDENT: STAGE OF EXECUTION  ACCOMPLISHED
6.  INCIDENT: INSTRUMENT ID       -
7.  INCIDENT: INSTRUMENT TYPE     -
8.  PERP: INCIDENT CATEGORY       TERRORIST ACT
9.  PERP: INDIVIDUAL ID           "GUERRILLA COLUMN" / "GUERRILLAS"
10. PERP: ORGANIZATION ID         "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
11. PERP: ORGANIZATION CONFIDENCE REPORTED AS FACT / CLAIMED OR ADMITTED: "GUATEMALAN NATIONAL REVOLUTIONARY UNITY" / "URNG"
12. PHYS TGT: ID                  "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
13. PHYS TGT: TYPE                GOVERNMENT OFFICE OR RESIDENCE: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
14. PHYS TGT: NUMBER              1: "\"SANTO TOMAS\" PRESIDENTIAL FARM" / "PRESIDENTIAL FARM"
15. PHYS TGT: FOREIGN NATION      -
16. PHYS TGT: EFFECT OF INCIDENT  -
17. PHYS TGT: TOTAL NUMBER        -
18. HUM TGT: NAME                 "CEREZO"
19. HUM TGT: DESCRIPTION          "PRESIDENT": "CEREZO"  "CIVILIAN"
20. HUM TGT: TYPE                 GOVERNMENT OFFICIAL: "CEREZO"  CIVILIAN: "CIVILIAN"
21. HUM TGT: NUMBER               1: "CEREZO"  1: "CIVILIAN"
22. HUM TGT: FOREIGN NATION       -
23. HUM TGT: EFFECT OF INCIDENT   NO INJURY: "CEREZO"  DEATH: "CIVILIAN"
24. HUM TGT: TOTAL NUMBER         -
Applications
• Enterprise applications
  – News tracking (terrorists, disease)
  – Customer care (linking mails to products, etc.)
  – Data cleaning
  – Classified ads
• Personal Information Management (PIM)
• Scientific applications (e.g. bio-informatics)
• Web-oriented
  – Citation databases
  – Opinion databases
  – Community websites (DBLife, Rexa – UMass)
  – Comparison shopping
  – Ad placement on webpages
  – Structured web searches
IE – Taxonomy
• Types of structures extracted
  – Entities, records, relationships
  – Open/closed IE
• Sources
  – Granularity of extraction
  – Heterogeneity: machine generated, (semi)structured, open
• Input resources
  – Structured DB
  – Labelled unstructured text
  – Preprocessing (tokenizer, chunker, parser)
Process (I)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
Process (I)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules evaluated by humans
Process (II)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
Process (III)
• Annotated documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF
• Decomposition into a series of subproblems
  – Complex words, basic phrases, complex phrases, events and merging
Process (IV)
• Annotated documents
• Relevant & non-relevant documents
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF
Process (V)
• Annotated documents
• Relevant & non-relevant documents
• Seeds -> bootstrapping
• Rules hand-crafted by humans (1500 hours!)
• Rules generated by a system
• Rules learnt
• Models
  – Logic: First Order Logic
  – Sequence: e.g. HMM
  – Classifiers: e.g. MEM, CRF
RECOGNIZING ENTITIES / FILLING SLOTS
Rule-based systems
• Rules to mark an entity (or more)
  – Before the start of the entity
  – Tokens of the entity
  – After the end of the entity
• Rules to mark the boundaries
• Conflicts between rules
  – Larger span
  – Merge (if same action)
  – Order the rules
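As a minimal sketch, a contextual rule can be expressed as a pattern over the tokens before, inside, and after an entity, with the "larger span" heuristic resolving overlaps. The rule patterns, labels, and sample sentence below are all invented for illustration; this is not the formalism of any particular system.

```python
import re

# Each rule encodes before-context, entity tokens, and after-context in one
# regex; group(1) is the entity span. Patterns are invented for illustration.
RULES = [
    # "Mr."/"Dr."/"Ms." before capitalized tokens -> Person
    ("Person", re.compile(r"\b(?:Mr|Dr|Ms)\.\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")),
    # Capitalized tokens before "Inc."/"Corp." -> Organization
    ("Organization", re.compile(r"\b([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\s+(?:Inc|Corp)\.")),
]

def extract(text):
    """Apply all rules; on overlapping matches keep the larger span."""
    matches = []
    for label, pat in RULES:
        for m in pat.finditer(text):
            matches.append((m.start(1), m.end(1), label, m.group(1)))
    # Conflict resolution: sort by start, prefer longer spans, drop overlaps.
    matches.sort(key=lambda t: (t[0], -(t[1] - t[0])))
    kept, last_end = [], -1
    for start, end, label, span in matches:
        if start >= last_end:
            kept.append((label, span))
            last_end = end
    return kept

print(extract("Mr. John Smith joined Acme Widgets Inc. last year."))
# [('Person', 'John Smith'), ('Organization', 'Acme Widgets')]
```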
Entity Extraction – rule based
Learning rules
• Algorithms are based on
  – Coverage [how many cases are covered by the rule]
  – Precision
• Two approaches
  – Top-down (e.g. FOIL): start with coverage = 100%
  – Bottom-up: start with precision = 100%
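The two quantities being traded off can be made concrete with a toy sketch. A rule is modelled as a predicate over instances; the dataset and the rule below are invented. Top-down learners (e.g. FOIL) start from a maximally general rule (coverage = 100%) and specialize it; bottom-up learners start from a seed example (precision = 100%) and generalize.

```python
# Coverage = fraction of instances matched by the rule;
# precision = fraction of matched instances that are true positives.
def coverage_and_precision(rule, instances):
    covered = [(x, y) for x, y in instances if rule(x)]
    coverage = len(covered) / len(instances)
    hits = sum(1 for _, y in covered if y)
    precision = hits / len(covered) if covered else 0.0
    return coverage, precision

# Toy task: is a token part of a person name? Rule = "token is capitalized".
data = [("Smith", True), ("Paris", False), ("bomb", False), ("Cerezo", True)]
cov, prec = coverage_and_precision(lambda tok: tok[0].isupper(), data)
print(cov, prec)  # the rule covers 3/4 of the instances, 2/3 of them correctly
```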
Rules – AutoSlog
• Rule learning
  – Look at sentences containing targets
  – Heuristic: looking for a linguistic pattern

Riloff, E. (1993). Automatically constructing a dictionary for information extraction tasks. In Proceedings of AAAI-93.
Rules – LIEP
• Learn (sets of) meta-heuristics by using syntactic paths that relate two role-filling constituents, e.g. [subject(Bob, named), object(named, CEO)]
• Followed by generalization (matching + disjunction)

Huffman, S. B. (1996). Learning information extraction patterns from examples.
Statistical models
• How to label
  – IOB sequences (Inside, Outside, Beginning)
  – Sequences
  – Segmentation
  Example: Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I .
  – Grammar-based (longer dependencies)
• Many ML models:
  – HMM
  – ME, CRF
  – SVM
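The IOB encoding can be decoded back into spans with a single left-to-right pass. The sketch below uses a prefix of the slide's tagged sentence; the function is a minimal illustration, not a full IOB2/IOBES decoder.

```python
# Decode (token, tag) pairs into spans: B opens a new span, I extends the
# current one, O (or a stray I with no open span) closes it.
def iob_to_spans(tagged):
    spans, current = [], []
    for token, tag in tagged:
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tagged = [("Alleged", "B"), ("guerrilla", "I"), ("urban", "I"),
          ("commandos", "I"), ("launched", "O"), ("two", "B"),
          ("highpower", "I"), ("bombs", "I")]
print(iob_to_spans(tagged))
# ['Alleged guerrilla urban commandos', 'two highpower bombs']
```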
Statistical models (cont’d)
• Features
  – Word
  – Orthographic
  – Dictionary
  – …
• First order
  – Position:
  – Segment:
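A sketch of the three feature families named above for a single token: word identity, orthographic shape, and dictionary membership. The gazetteer and feature names are invented for illustration.

```python
# A tiny gazetteer (dictionary feature source) -- invented for this example.
CITY_DICT = {"salvador", "guatemala", "paris"}

def token_features(token):
    return {
        "word": token.lower(),                       # word identity
        "is_capitalized": token[0].isupper(),        # orthographic
        "has_digit": any(c.isdigit() for c in token),
        "suffix3": token[-3:].lower(),               # orthographic (suffix)
        "in_city_dict": token.lower() in CITY_DICT,  # dictionary lookup
    }

print(token_features("Salvador"))
```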
Examples of features
Statistical models (cont’d)
• Learning:
  – Likelihood
  – Max-Margin
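The two training criteria can be written in their standard form for a model with feature vector f(x, y) and weights θ; the notation below is assumed here, not taken from the slides. Likelihood training maximizes the conditional probability of the annotated labellings, while max-margin training (structured-SVM style) requires the correct labelling to beat every alternative by a margin that grows with the loss Δ:

```latex
% Conditional likelihood (e.g. CRF training):
\max_\theta \sum_i \log p\bigl(y^{(i)} \mid x^{(i)}; \theta\bigr),
\qquad
p(y \mid x; \theta) =
  \frac{\exp\bigl(\theta^\top f(x, y)\bigr)}
       {\sum_{y'} \exp\bigl(\theta^\top f(x, y')\bigr)}

% Max-margin training:
\min_\theta \tfrac{1}{2}\|\theta\|^2
\quad \text{s.t.} \quad
\theta^\top f\bigl(x^{(i)}, y^{(i)}\bigr) - \theta^\top f\bigl(x^{(i)}, y\bigr)
\ge \Delta\bigl(y^{(i)}, y\bigr)
\quad \forall i,\; \forall y \ne y^{(i)}
```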
PREDICTING RELATIONSHIPS
Overall
• Goal: classify (E1, E2, x)
• Features
  – Surface tokens (words, entities)
    [Entity label of E1 = Person, Entity label of E2 = Location]
  – Parse tree (syntactic, dependency graph)
    [POS = (noun, verb, noun), flag = “(1, none, 2)”, type = “dependency”]
Models
• Standard classifier (e.g. SVM)
• Kernel-based methods
  – e.g. measure of common properties between two paths in the dependency tree
  – Convolution-based kernels
• Rule-based methods
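A toy version of the "common properties between two paths" idea, in the spirit of shortest-dependency-path kernels: aligned positions contribute the number of shared features, the kernel is the product over positions, and paths of different length score zero. The feature dictionaries below are invented for illustration.

```python
# Each path position is a dict of features (word, POS, ...). The kernel
# multiplies per-position counts of features the two paths agree on.
def path_kernel(p1, p2):
    if len(p1) != len(p2):
        return 0
    score = 1
    for a, b in zip(p1, p2):
        common = sum(1 for k in a if k in b and a[k] == b[k])
        if common == 0:  # one incompatible position kills the match
            return 0
        score *= common
    return score

p1 = [{"word": "his", "pos": "PRP"}, {"pos": "VBD"}, {"word": "capital", "pos": "NN"}]
p2 = [{"word": "its", "pos": "PRP"}, {"pos": "VBD"}, {"word": "capital", "pos": "NN"}]
print(path_kernel(p1, p2))  # 1 * 1 * 2 = 2
```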
Extracting entities for a set of relationships
• Three steps
  – Learn extraction patterns for the seeds
    • Find documents where entities appear close to each other
    • Filtering
  – Generate candidate triplets
    • Pattern- or keyword-based
  – Validation
    • # of occurrences
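The three steps can be sketched end-to-end on a toy (person, employer) relation. The seed pair, corpus, and string-based pattern matching below are all invented; real systems score and filter patterns and require far higher occurrence counts for validation.

```python
SEEDS = {("Smith", "Acme")}
CORPUS = [
    "Smith , the CEO of Acme , resigned .",
    "Jones , the CEO of Initech , was hired .",
    "Jones visited Acme yesterday .",
]

# Step 1: learn patterns from sentences where both seed entities appear
# close to each other (here: the literal text between them).
patterns = set()
for sent in CORPUS:
    for e1, e2 in SEEDS:
        if e1 in sent and e2 in sent:
            patterns.add(sent.split(e1)[1].split(e2)[0].strip())

# Step 2: generate candidate pairs by matching the learned patterns.
candidates = {}
for sent in CORPUS:
    for pat in patterns:
        marker = f" {pat} "
        if marker in sent:
            left, right = sent.split(marker, 1)
            cand = (left.split()[-1], right.split()[0])
            candidates[cand] = candidates.get(cand, 0) + 1

# Step 3: validate candidates by number of occurrences.
validated = {c for c, n in candidates.items() if n >= 1}
print(validated)  # recovers the unseen pair ('Jones', 'Initech')
```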
MANAGEMENT
Summary
• Performance
  – Document selection: subset, crawling
  – Queries to DB: related entities (top-k retrieval)
• Handling changes
  – Detecting when a page has changed
• Integration
  – Detecting duplicate entities
  – Redundant extractions (open IE)
EVALUATION
Metrics
• Precision & Recall
• F-measure (harmonic mean of precision and recall)
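These metrics are standardly computed over the set of extracted items versus a gold set; the sample sets below are invented.

```python
# Precision = correct extractions / all extractions;
# recall = correct extractions / all gold items;
# F1 = harmonic mean of the two.
def prf(extracted, gold):
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"CEREZO", "URNG", "SANTO TOMAS"}
extracted = {"CEREZO", "URNG", "GUATEMALA"}
print(prf(extracted, gold))  # precision, recall and F1 all equal 2/3 here
```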
The 60% barrier