Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da...
![Page 1: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/1.jpg)
Unsupervised Strategies for Information Extraction by Text Segmentation
Eli Cortez, Altigran da Silva
Federal University of Amazonas - BRAZIL
![Page 2: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/2.jpg)
Outline
Information Extraction by Text Segmentation (IETS)
◦ Scenario and Problem
◦ Challenges and Motivation
◦ Related Work
ONDUX
◦ Preliminary Experiments
Next Steps
![Page 3: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/3.jpg)
Information Extraction by Text Segmentation
Text documents containing implicit semi-structured data records
◦ Addresses
◦ Bibliographic References
◦ Classified Ads
◦ Product Descriptions
![Page 4: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/4.jpg)
Information Extraction by Text Segmentation
Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Address: Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214
Bibliographic Reference: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
Neighborhood, Price, Number, Street,..., Phone
![Page 5: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/5.jpg)
Information Extraction by Text Segmentation
Why extract information?
◦ Database Storage, Query…
◦ Data Mining
◦ Record Linkage
Classified Ad: Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
<Neighborhood> : Regent Square
<Price> : $228,900
<No.> : 1028
<Street> : Mifflin Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273
![Page 6: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/6.jpg)
Information Extraction by Text Segmentation
Given an input string I representing an implicit textual record (e.g. a classified ad), the IETS task consists of:
1. Segmenting I
2. Assigning to each segment a label corresponding to an attribute
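The two steps of the task can be pictured as a mapping from one string to an ordered list of (label, segment) pairs. The sketch below is only an illustration of that input/output shape: the `Segment` type and the `is_valid_segmentation` helper are hypothetical names, not part of the presented method.

```python
from typing import List, Tuple

# A labeled segment: (attribute label, segment text). Hypothetical type for illustration.
Segment = Tuple[str, str]


def is_valid_segmentation(text: str, segments: List[Segment]) -> bool:
    """Check that the segment texts occur in `text`, left to right, without overlap."""
    pos = 0
    for _label, value in segments:
        idx = text.find(value, pos)
        if idx < 0:
            return False
        pos = idx + len(value)
    return True


# Input string I: the classified ad used throughout the slides.
ad = ("Regent Square $228,900 1028 Mifflin Ave.; "
      "6 Bedrooms; 2 Bathrooms. 412-638-7273")

# Desired IETS output: each segment of I paired with an attribute label.
segments: List[Segment] = [
    ("Neighborhood", "Regent Square"),
    ("Price", "$228,900"),
    ("No.", "1028"),
    ("Street", "Mifflin Ave."),
    ("Bed.", "6 Bedrooms"),
    ("Bath.", "2 Bathrooms"),
    ("Phone", "412-638-7273"),
]
```

Representing the output as an ordered list keeps the original attribute order of the record, which matters later because positioning and sequencing are used as evidence.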
![Page 7: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/7.jpg)
IETS – Challenges (I)
Information Extraction by Text Segmentation (IETS)
◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09
Diversity of templates and styles
◦ Attribute Ordering
◦ Capitalization
◦ Abbreviations
Different applications share similar domains
◦ Ex.: Address and Ads
◦ Records from both domains contain address information
![Page 8: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/8.jpg)
IETS – Challenges (II)
Diversity of templates and styles
◦ Attribute Ordering; Capitalization; Abbreviations.
HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006
DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221(2006)
ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
![Page 9: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/9.jpg)
IETS – Challenges (III)
Existing approaches to this problem use Machine Learning techniques
◦ Hidden Markov Models (HMM)
◦ Conditional Random Fields (CRF)
◦ Support Vector Machines (SVM) (SSVM)
• Supervised approaches require a hand-labeled training set created by an expert.
• Each generated model is particular to a given application.
• High computational cost.
![Page 10: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/10.jpg)
Related Work
(Semi) Supervised Approaches
[Borkar et al. @ SIGMOD 2001]
◦ Supervised extraction method based on Hidden Markov Models (HMM)
[McCallum et al. @ ICML 2001]
◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF)
[Mansuri et al. @ ICDE 2006]
◦ Semi-supervised approach based on CRF models
All of these approaches require an expert to create a hand-labeled training set for each application.
![Page 11: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/11.jpg)
Related Work
(Semi) Supervised Approaches
Hand-labeled examples
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
<Neighborhood> Regent Square </Neighborhood> <Price> $228,900 </Price> <No> 1028 </No> <Street> Mifflin Ave. </Street> <Bed> 6 Bedrooms </Bed> <Bath> 2 Bathrooms </Bath> <Phone> 412-638-7273 </Phone>
◦ CRF and HMM learn lexical, style, positioning and sequencing features from the given examples
◦ Examples are source-dependent
◦ Scalability problem; reusing pre-existing models?
![Page 12: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/12.jpg)
Related Work
UNsupervised Approaches
Semi-structured Records
Knowledge Bases: Wikipedia Infobox, DBpedia, FreeBase
Structured Records
![Page 13: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/13.jpg)
Related Work
UNsupervised Approaches
Supervised × UNsupervised
Supervised
◦ Hand-labeled examples
◦ Source Dependent
◦ Scalability Problem
◦ Reusability
UNsupervised
◦ Pre-existing information
◦ Domain Representation
◦ Easily adaptable
![Page 14: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/14.jpg)
Related Work
UNsupervised Approaches
[Agichtein et al. @ SIGKDD 2004]
◦ Use of reference tables to create an unsupervised model using Hidden Markov Models (HMM)
[Zhao et al. @ SIAM ICDM 2008]
◦ Use of reference tables to create unsupervised CRF models (U-CRF)
[Cortez et al. @ JASIST 2009]
◦ Unsupervised method to extract bibliographic information
◦ Domain-specific heuristics, not of general application
Both models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)
![Page 15: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/15.jpg)
Basic Concepts (I1)
Knowledge Base
◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$
◦ Building process trivial
◦ Web Databases (Freebase, Googlebase)
$KB = \{(\text{Neighborhood}, O_{\text{Neigh.}}), (\text{Street}, O_{\text{Street}}), (\text{Phone}, O_{\text{Phone}})\}$
$O_{\text{Neigh.}} = \{\text{“Regent Square”}, \text{“Milenight Park”}\}$
$O_{\text{Street}} = \{\text{“Regent St.”}, \text{“Morewood Ave.”}, \text{“Square Ave. Park”}\}$
$O_{\text{Phone}} = \{\text{“323 462-6252”}, \text{“(171) 289-7527”}\}$
KB: Domain Representation
Hand-labeled examples: Source representation
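As a rough illustration, the Knowledge Base can be held as a mapping from each attribute name to the set of values known for it. The sketch below uses the slide's own example values; the `KnowledgeBase` alias and the `attributes_containing` helper are hypothetical names introduced only for this example.

```python
from typing import Dict, List, Set

# Hypothetical representation: attribute name -> set of known values for it.
KnowledgeBase = Dict[str, Set[str]]

kb: KnowledgeBase = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def attributes_containing(kb: KnowledgeBase, term: str) -> List[str]:
    """Return, sorted, the attributes having a known value that contains `term` as a token."""
    term = term.lower()
    return sorted(
        attr
        for attr, values in kb.items()
        if any(term in value.lower().split() for value in values)
    )
```

With these values, a term such as "Regent" is known for both Neighborhood and Street, which shows why a single term is often ambiguous evidence on its own.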
![Page 16: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/16.jpg)
Proposed Method
ONDUX [Cortez et al. @ SIGMOD 2010]
◦ Blocking
◦ Matching
◦ Reinforcement
![Page 17: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/17.jpg)
ONDUX (II)
Overview
[Figure: overview of ONDUX, with its steps numbered 1, 2 and 3]
![Page 18: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/18.jpg)
ONDUX (III)
Blocking
◦ Split the input text into substrings called blocks
◦ Consider the co-occurrence of consecutive terms based on the KB
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
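A minimal sketch of the blocking idea, under stated assumptions: two consecutive terms are put in the same block when they appear together in at least one KB value. The slides do not give the exact co-occurrence test or the normalization, so `normalize`, `cooccur` and `blocking` below are illustrative choices, and the Bedrooms/Bathrooms values in the toy KB are invented for the example.

```python
from typing import Dict, List, Set


def normalize(term: str) -> str:
    """Lowercase a term and drop surrounding punctuation (assumed normalization)."""
    return term.lower().strip(".,;:")


def cooccur(kb: Dict[str, Set[str]], a: str, b: str) -> bool:
    """True if terms a and b appear together in some known value of the KB."""
    a, b = normalize(a), normalize(b)
    for values in kb.values():
        for value in values:
            tokens = {normalize(t) for t in value.split()}
            if a in tokens and b in tokens:
                return True
    return False


def blocking(text: str, kb: Dict[str, Set[str]]) -> List[str]:
    """Split text into blocks, merging consecutive terms that co-occur in the KB."""
    blocks: List[List[str]] = []
    for term in text.split():
        if blocks and cooccur(kb, blocks[-1][-1], term):
            blocks[-1].append(term)  # extend the current block
        else:
            blocks.append([term])    # start a new block
    return [" ".join(block) for block in blocks]


# Toy KB: the first two attributes reuse the slide values, the last two are invented.
toy_kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Bedrooms": {"3 Bedrooms", "6 Bedrooms"},
    "Bathrooms": {"2 Bathrooms"},
}
ad = ("Regent Square $228,900 1028 Mifflin Ave.; "
      "6 Bedrooms; 2 Bathrooms. 412-638-7273")
blocks = blocking(ad, toy_kb)
```

With this toy KB, "Mifflin" and "Ave.;" end up in separate blocks because "Mifflin" is unknown to the KB, which is the kind of situation the later steps have to repair.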
![Page 19: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/19.jpg)
ONDUX (IV)
Matching
◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base
◦ We use distinct matching functions:
Textual Values: FF Function (Field Frequency)
Numeric Values: NM Function (Numeric Matching)
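The slides name the FF and NM functions but do not give their formulas, so the sketch below uses simple stand-ins and should not be read as the actual definitions: `text_score` is the fraction of a block's terms seen among an attribute's known values, and `numeric_score` is a Gaussian-shaped closeness to the attribute's known numbers. All function names and the Price and No. values in the toy KB are assumptions for illustration.

```python
import math
import re
from statistics import mean, pstdev
from typing import Dict, List, Optional, Set

NUMERIC = re.compile(r"^\$?\d[\d,]*(\.\d+)?$")


def parse_number(block: str) -> Optional[float]:
    """Parse blocks such as '$228,900' or '1028'; return None for non-numeric text."""
    cleaned = block.strip(".;:, ")
    if NUMERIC.match(cleaned):
        return float(cleaned.lstrip("$").replace(",", ""))
    return None


def terms(text: str) -> List[str]:
    return [t.lower().strip(".,;:") for t in text.split()]


def text_score(block: str, values: Set[str]) -> float:
    """Stand-in for a textual function: share of block terms known for the attribute."""
    vocab = {t for v in values for t in terms(v)}
    block_terms = terms(block)
    return sum(t in vocab for t in block_terms) / len(block_terms) if block_terms else 0.0


def numeric_score(x: float, values: Set[str]) -> float:
    """Stand-in for a numeric function: Gaussian-shaped closeness to known numbers."""
    nums = [n for n in (parse_number(v) for v in values) if n is not None]
    if not nums:
        return 0.0
    mu, sigma = mean(nums), pstdev(nums) or 1.0
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))


def match(block: str, kb: Dict[str, Set[str]]) -> Optional[str]:
    """Return the best-scoring attribute, or None when no attribute matches ('???')."""
    x = parse_number(block)
    scores = {
        attr: numeric_score(x, values) if x is not None else text_score(block, values)
        for attr, values in kb.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy KB: Neighborhood and Street reuse the slide values; Price and No. are invented.
toy_kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Price": {"$200,000", "$250,000"},
    "No.": {"100", "1500"},
}
```

Under these assumptions a block whose terms are all unknown, such as "Mifflin", gets no label at all, while a block like "Regent Square" scores equally for Neighborhood and Street, so content alone does not settle every case.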
![Page 20: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/20.jpg)
ONDUX (V)
Matching
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Labels assigned to the blocks, in order: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone
![Page 21: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/21.jpg)
ONDUX (VI)
How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Labels assigned to the blocks, in order: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone
![Page 22: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/22.jpg)
ONDUX (VII)
Reinforcement
◦ Review the labeling task performed in the Matching step
Unmatched blocks must receive a label of a given attribute
Mismatched blocks must be correctly labeled
◦ How to handle these cases? Using positioning and sequencing information that is obtained On-Demand.
![Page 23: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/23.jpg)
ONDUX (VIII)
Reinforcement
◦ Given the extraction output of the Matching step, ONDUX automatically builds a graphical structure, the PSM.
PSM: Positioning and Sequencing Model.
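One way to picture a positioning and sequencing model is as two tables of counts gathered from the Matching output: which label tends to follow which, and which label tends to appear at which block position. The sketch below is an assumption-laden illustration, not the PSM as defined by the authors: the names `build_psm` and `reinforce` are hypothetical, evidence is combined by simply adding counts, and only unmatched blocks are relabeled (correcting mismatched blocks is not attempted here).

```python
from collections import Counter, defaultdict
from typing import DefaultDict, List, Optional, Sequence, Tuple

Labels = Sequence[Optional[str]]  # None marks a block left unmatched ('???')


def build_psm(
    records: Sequence[Labels],
) -> Tuple[DefaultDict[str, Counter], DefaultDict[int, Counter]]:
    """Count label-to-label transitions and label-at-position occurrences."""
    transitions: DefaultDict[str, Counter] = defaultdict(Counter)
    positions: DefaultDict[int, Counter] = defaultdict(Counter)
    for labels in records:
        for i, label in enumerate(labels):
            if label is None:
                continue
            positions[i][label] += 1
            if i > 0 and labels[i - 1] is not None:
                transitions[labels[i - 1]][label] += 1
    return transitions, positions


def reinforce(labels: Labels, transitions, positions) -> List[Optional[str]]:
    """Fill unmatched blocks with the label best supported by the counts."""
    out = list(labels)
    for i, label in enumerate(out):
        if label is not None:
            continue
        candidates: Counter = Counter()
        candidates.update(positions.get(i, {}))         # positioning evidence
        prev = out[i - 1] if i > 0 else None
        if prev is not None:
            candidates.update(transitions.get(prev, {}))  # sequencing evidence
        if candidates:
            out[i] = candidates.most_common(1)[0][0]
    return out


# Matching output for three hypothetical records; the last one has an unmatched block.
matched = [
    ["Neighborhood", "Price", "No.", "Street"],
    ["Neighborhood", "Price", "No.", "Street"],
    ["Neighborhood", "Price", "No.", None],
]
transitions, positions = build_psm(matched)
repaired = reinforce(matched[2], transitions, positions)
```

Because the counts come from the Matching output itself, no hand-labeled examples are needed, which is the sense in which this information is obtained on demand.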
![Page 24: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/24.jpg)
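ONDUX (IX)
Reinforcement
◦ Extraction Result
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Labels after Matching, in order: Street | Price | No. | ??? | Street | Bed. | Bath. | Phone
Labels after Reinforcement, in order: Neighborhood | Price | No. | Street | Street | Bed. | Bath. | Phone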
![Page 25: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/25.jpg)
Experiments (I)
Setup
◦ We tested our proposed approach on Bibliographic Data (CORA, PersonalBib)
◦ Collections are available on the Web

| Dataset (Test Set) | #Attributes | #Records | Source (KB, Reference Table, …) | #Attributes | #Records |
| --- | --- | --- | --- | --- | --- |
| CORA | 1..13 | 150 | Cora | 1..13 | 350 |
| CORA | 1..13 | 150 | PersonalBib | 7 | 395 |
![Page 26: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/26.jpg)
Experiments (II)
Evaluation
◦ Metrics
Precision, Recall and F-Measure
T-Test for the statistical validation of the results
◦ Baseline
Conditional Random Fields (CRF)
U-CRF (Unsupervised method)
S-CRF (Classical supervised method)
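For reference, the three metrics can be computed as in the sketch below. It assumes that an extracted value counts as correct only when both its attribute and its text match the expected pair exactly; the slides do not state the matching criterion, and the function name and sample data are illustrative.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (attribute, extracted value)


def precision_recall_f1(
    extracted: Iterable[Pair], expected: Iterable[Pair]
) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over exact (attribute, value) matches."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical output for one record: the first pair is mislabeled, one value is missing.
extracted = [("Street", "Regent Square"), ("Price", "$228,900"),
             ("No.", "1028"), ("Phone", "412-638-7273")]
expected = [("Neighborhood", "Regent Square"), ("Price", "$228,900"),
            ("No.", "1028"), ("Phone", "412-638-7273"), ("Street", "Mifflin Ave.")]
p, r, f = precision_recall_f1(extracted, expected)
```

Here three of the four extracted pairs are correct and three of the five expected pairs are found, so precision is 0.75 and recall is 0.6.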
![Page 27: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/27.jpg)
Experiments (III)
Extraction Quality
S-CRF achieves higher results than U-CRF due to the hand-labeled training set
CORA includes a variety of styles and information (jconference, books)
In general, the Matching and Reinforcement steps of ONDUX outperform the CRF models
![Page 28: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/28.jpg)
Experiments (IV)
Extraction Quality
As discussed earlier, U-CRF is not able to deal with different attribute orderings
Due to its Matching and Reinforcement strategies, ONDUX outperforms the CRF models
![Page 29: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/29.jpg)
Conclusions and Future Work (I)
Partial results of our research on unsupervised strategies for information extraction
ONDUX
◦ Flexible: does not consider any particular style
◦ Unsupervised: does not require any human effort to create a training set
◦ On-Demand: ordering and positioning information is learned through the Matching phase
![Page 30: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/30.jpg)
Conclusions and Future Work (II)
The proposed strategy achieves good precision and recall
◦ Comparison with the state of the art
Future work
◦ Investigate different matching functions
◦ Multi-record extraction
◦ Active learning and feedback
◦ Error detection
◦ Nested structures?
![Page 31: Unsupervised Strategies for Information Extraction by Text Segmentation Eli Cortez, Altigran da Silva Federal University of Amazonas - BRAZIL.](https://reader036.fdocuments.us/reader036/viewer/2022062804/5697bf811a28abf838c85326/html5/thumbnails/31.jpg)
Questions?