Domain-Centric Information Extraction
Transcript of Domain-Centric Information Extraction
![Page 1: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/1.jpg)
Domain-Centric Information Extraction
Nilesh Dalvi, Yahoo! Research
![Page 2: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/2.jpg)
![Page 3: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/3.jpg)
Scaling IE
[Figure: plot with axes "# of domains" and "# of sources", with the labels "traditional IE" and "Domain-centric Extraction"]
![Page 4: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/4.jpg)
Domain-centric Information Extraction : given a schema, populate it by extracting information from the entire Web.
![Page 5: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/5.jpg)
Outline
Part I : Problem Analysis.
Part II : Our approach.
![Page 6: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/6.jpg)
Part I : Analysis of Data on the Web
![Page 7: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/7.jpg)
Questions
Spread : Given a domain, how is the information about the domain spread across the Web?
Connectivity : How is the information connected? How easy is it to discover sources in a domain?
Value : How much value do the tail entities in a domain have?
Details can be found in the paper “An analysis of structured data on the Web”, in VLDB 12.
![Page 8: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/8.jpg)
Spread
How many websites are needed to build a complete database for a domain?
We look at domains with the following two properties:
We already have access to a large, comprehensive database of entities in the domain.
The entities have some attribute that can uniquely (or nearly uniquely) identify the entity, e.g., phone numbers of businesses and ISBNs of books.
![Page 9: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/9.jpg)
Spread
We analyzed several domains : restaurants, schools, libraries, retail & shopping, books, etc.
We used the entire web cache of the Yahoo! search engine.
We say that a given webpage has a given entity if it contains the identifying attribute of the entity.
We aggregate the set of all entities found on each website.
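The aggregation described above can be sketched as a recall curve: add websites one at a time and measure what fraction of the reference database is covered so far. This is a minimal illustration on toy data, not the analysis code from the talk; the host names, the entity identifiers, and the largest-site-first ordering are all assumptions.

```python
from typing import Dict, List, Set

def recall_curve(site_entities: Dict[str, Set[str]], database: Set[str]) -> List[float]:
    """Recall of the reference database after adding sites one by one."""
    # Assumption: rank sites by how many database entities they mention.
    ranked = sorted(site_entities,
                    key=lambda s: len(site_entities[s] & database),
                    reverse=True)
    covered: Set[str] = set()
    curve = []
    for site in ranked:
        covered |= site_entities[site] & database
        curve.append(len(covered) / len(database))
    return curve

# Toy data: identifying attributes (e.g. phone numbers) found on each website.
database = {"p1", "p2", "p3", "p4", "p5"}
site_entities = {
    "aggregator.example": {"p1", "p2", "p3"},
    "blog.example": {"p3", "p4"},
    "tiny.example": {"p5"},
}
curve = recall_curve(site_entities, database)
print(curve)
```

In this toy case the largest site alone covers only part of the database, and full recall needs the smallest site as well.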
![Page 10: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/10.jpg)
[Figure: plot of recall against # of websites]
![Page 11: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/11.jpg)
[Figure: plot of recall against # of websites]
![Page 12: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/12.jpg)
[Figure: plot of recall against # of websites]
![Page 13: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/13.jpg)
Even for domains with well-established aggregator sites, we need to go to the long tail of websites to build a reasonably complete database.
![Page 14: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/14.jpg)
Connectivity
How well is the information connected in a given domain?
We consider the entity-source graph for various domains:
a bipartite graph with entities and websites as nodes.
There is an edge between entity e and website h if some webpage in h contains e.
We study various properties of the graph, like its diameter and connected components.
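To make the entity-source graph concrete, here is a small sketch that builds the bipartite graph from (entity, website) pairs and finds its connected components with breadth-first search. The edge list and host names are made-up toy data, and this is only one straightforward way to compute components.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """edges: iterable of (entity, website) pairs; returns a list of node sets."""
    adj = defaultdict(set)
    for entity, site in edges:
        # Tag nodes so an entity and a website with the same name stay distinct.
        e, h = ("entity", entity), ("site", site)
        adj[e].add(h)
        adj[h].add(e)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(comp)
    return components

# Toy edges: entity e appears on some page of website h.
edges = [("e1", "a.example"), ("e2", "a.example"),
         ("e2", "b.example"), ("e3", "c.example")]
components = connected_components(edges)
print(sorted(len(c) for c in components))
```

Here e1, e2, a.example and b.example form one component because e2 appears on both sites, while e3 and c.example are isolated from the rest.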
![Page 15: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/15.jpg)
![Page 16: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/16.jpg)
Content in a domain is well-connected, with a high degree of redundancy and overlap.
![Page 17: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/17.jpg)
Part II : Domain-centric extraction from script-generated sites
![Page 18: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/18.jpg)
A primer on script-generated sites.
![Page 19: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/19.jpg)
[Figure: DOM tree of a script-generated movie page. Node labels: html, head, body, title, div (class=‘head’), div (class=‘content’), table (width=80%), td. Leaf text: Godfather; Title : Godfather; Director : Coppola; Runtime 118min]
We can use the following XPath rule to extract directors:
W = /html/body/div[2]/table/td[2]/text()
Such a rule is called a wrapper, and can be learnt with a small amount of site-level supervision.
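As an illustration of how one wrapper applies to every page produced by the same script, the sketch below runs the rule over two pages built from a single template. The template is a guess at the slide's page layout, the second movie is a placeholder, and Python's `xml.etree.ElementTree` supports only a subset of XPath (no `text()` step, paths relative to the root), so the rule is adapted accordingly.

```python
import xml.etree.ElementTree as ET

# Hypothetical page template standing in for the site's script.
PAGE_TEMPLATE = """<html><head><title>{title}</title></head><body>
<div class="head">{title}</div>
<div class="content"><table>
<td>{title}</td><td>{director}</td><td>{runtime}</td>
</table></div></body></html>"""

def apply_wrapper(page_xml: str) -> list:
    root = ET.fromstring(page_xml)  # root is the <html> element
    # Equivalent of /html/body/div[2]/table/td[2]/text():
    # select the nodes, then read their text ourselves.
    return [td.text for td in root.findall("body/div[2]/table/td[2]")]

pages = [
    PAGE_TEMPLATE.format(title="Godfather", director="Coppola", runtime="118min"),
    PAGE_TEMPLATE.format(title="Example Movie", director="Example Director",
                         runtime="100min"),
]
directors = [apply_wrapper(p) for p in pages]
print(directors)
```

Real web pages are rarely well-formed XML, so a practical system would need an HTML parser; the point here is only that the same path picks out the director on each page.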
![Page 20: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/20.jpg)
Domain-centric Extraction
Problem : Populate a given schema from the entire Web.
Use supervision at domain-level:
Set of attributes to extract
Seed set of entities
Dictionaries/language models for attributes
Domain-constraints
Main idea : use content redundancy across websites and structural coherency within websites.
![Page 21: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/21.jpg)
Our Extraction Pipeline
Discover -> Cluster -> Annotate -> Extract
![Page 22: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/22.jpg)
Step 1 : Discover
Start from a seed set of entities.
1. Construct web search queries from entities.
2. Look at the top-k results for each query.
3. Aggregate the hosts and pick the top hosts.
Extract entities from the hosts.
Update the seed set and repeat.
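The discovery loop above can be sketched as follows. Everything concrete here is an assumption made for illustration: `INDEX` is a toy stand-in for the Web, `search` replaces a real search engine, `extract` replaces the rest of the pipeline, and skipping hosts that were already picked is a choice of this sketch rather than something the slide states.

```python
from collections import Counter

# Hypothetical toy "web": host -> entities mentioned on that host.
INDEX = {
    "a.example": {"e1", "e2", "e3"},
    "b.example": {"e3", "e4"},
    "c.example": {"e4", "e5"},
}

def search(entity, k):
    """Stand-in for a web search: hosts whose pages mention the entity."""
    return sorted(h for h, ents in INDEX.items() if entity in ents)[:k]

def extract(host):
    """Stand-in for the later pipeline steps that extract entities from a host."""
    return INDEX[host]

def discover(seeds, k=10, top_hosts=1, rounds=3):
    entities, hosts = set(seeds), set()
    for _ in range(rounds):
        counts = Counter()
        for entity in entities:
            counts.update(search(entity, k))  # one vote per host per query
        # Assumption: skip hosts picked in earlier rounds so each round adds sources.
        ranked = [h for h, _ in counts.most_common() if h not in hosts]
        for host in ranked[:top_hosts]:
            hosts.add(host)
            entities |= extract(host)
    return entities, hosts

entities, hosts = discover({"e1", "e2"})
print(sorted(entities), sorted(hosts))
```

Starting from two seeds that appear on one host only, each round reaches a new host through an entity learned in the previous round.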
![Page 23: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/23.jpg)
Step 2 : Cluster
Problem : Given a set of pages of the form <url, content>, cluster them so that pages from the same “script” are grouped together.
Need a solution which:
Is unsupervised
Works at web-scale
![Page 24: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/24.jpg)
Previous Techniques
State-of-the-art approaches [Crescenzi et al. 2002, Gottron 2008] look at pairwise similarity of pages and then use standard clustering algorithms (single linkage, k-means, etc.).
Problems:
They do not scale to large websites.
Their accuracy is not very high.
![Page 25: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/25.jpg)
Example
There are 2 clusters:
1. site.com/*/*/eats/*
2. site.com/*/*/todo/*
u1 : site.com/CA/SanFrancisco/eats/id1.html
u2 : site.com/WA/Seattle/eats/id2.html
v1 : site.com/WA/Seattle/todo/id3.html
v2 : site.com/WA/Portland/todo/id4.html
Observation : Pair-wise similarity is not effective.
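A quick check on the example URLs shows why pairwise similarity misleads. The similarity measure below (fraction of aligned path segments that match) is an assumption chosen for simplicity; the slide does not say which measure the earlier approaches use.

```python
def url_similarity(a: str, b: str) -> float:
    """Fraction of aligned '/'-separated segments that match exactly."""
    ta, tb = a.split("/"), b.split("/")
    matches = sum(x == y for x, y in zip(ta, tb))
    return matches / max(len(ta), len(tb))

u1 = "site.com/CA/SanFrancisco/eats/id1.html"
u2 = "site.com/WA/Seattle/eats/id2.html"
v1 = "site.com/WA/Seattle/todo/id3.html"

same_cluster = url_similarity(u1, u2)   # both match site.com/*/*/eats/*
cross_cluster = url_similarity(u2, v1)  # an eats page vs. a todo page
print(same_cluster, cross_cluster)
```

Under this measure u2 looks closer to v1, which belongs to the other cluster, than to u1, because the state and city segments agree while the segment that actually identifies the script differs.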
![Page 26: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/26.jpg)
Our Approach
We look at the set of pages holistically.
We find a set of patterns that “best” explains the given set of pages.
We use an information-theoretic framework:
encode webpages using patterns
find the set of patterns that minimizes the description length of the encoding
Details can be found in our paper, “Efficient algorithms for structural clustering of large websites”, in WWW 11.
![Page 27: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/27.jpg)
Step 3 : Annotate
We construct annotators for each attribute in the schema.
Several classes of annotators can be defined:
Dictionary-based : for names of people, places, etc.
Pattern-based : dates, prices, phone numbers, etc.
Language model-based : reviews, descriptions, etc.
Annotators only need to provide weak guarantees:
Less than perfect precision
Arbitrarily low recall
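Two of the annotator classes above are easy to sketch. The phone-number regex, the one-entry dictionary, and the node texts are illustrative assumptions; the small dictionary is deliberate, to show an annotator with low recall that still yields usable labels.

```python
import re

# Assumed US-style pattern; a real pattern-based annotator would cover more formats.
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")

def pattern_annotator(texts):
    """Indices of text nodes that contain something shaped like a phone number."""
    return [i for i, t in enumerate(texts) if PHONE_RE.search(t)]

def dictionary_annotator(texts, dictionary):
    """Indices of text nodes whose whole text is a known dictionary entry."""
    return [i for i, t in enumerate(texts) if t.strip().lower() in dictionary]

node_texts = ["Coppola", "Call 408-555-0100", "Runtime 118min", "Scorsese"]
director_labels = dictionary_annotator(node_texts, {"coppola"})
phone_labels = pattern_annotator(node_texts)
print(director_labels, phone_labels)
```

The dictionary annotator misses "Scorsese" because the name is not in its dictionary, which is the kind of incomplete labeling the extraction step has to tolerate.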
![Page 28: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/28.jpg)
Step 4 : Extract
Idea : make Wrapper Induction tolerant to noise
![Page 29: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/29.jpg)
![Page 30: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/30.jpg)
Our Approach
A generic framework that can incorporate any wrapper inductor.
Input : A wrapper inductor Φ, a set of labels L
Idea : Apply Φ on all subsets of L and choose the wrapper that gives the best list.
Need to solve two problems : enumeration and ranking.
![Page 31: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/31.jpg)
Enumeration
Input : A wrapper inductor Φ and a set of labels L
The wrapper space of L is defined as W(L) = {Φ(S) | S ⊆ L}
Problem : Enumerate the wrapper space of L in time polynomial in the size of the wrapper space and L.
For a certain class of well-behaved wrappers, we can solve the enumeration problem efficiently.
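The definition of W(L) can be illustrated with a brute-force enumeration. This is deliberately the naive exponential version, not the efficient algorithm the slide refers to. The inductor `phi` is a toy assumption (keep a path step when all labels agree, otherwise generalize it to `*`), and the empty subset is skipped.

```python
from itertools import combinations

def phi(labels):
    """Toy wrapper inductor: keep a path step if all labels agree, else use '*'."""
    paths = [label.split("/") for label in labels]
    return "/".join(steps[0] if len(set(steps)) == 1 else "*"
                    for steps in zip(*paths))

def wrapper_space(L):
    """W(L) = {phi(S) | S a non-empty subset of L}, by trying every subset."""
    space = set()
    for size in range(1, len(L) + 1):
        for subset in combinations(L, size):
            space.add(phi(subset))
    return space

# Hypothetical labeled nodes, written as simplified paths of equal depth.
L = ["div[1]/td[1]", "div[1]/td[2]", "div[2]/td[2]"]
space = wrapper_space(L)
print(sorted(space))
```

Three labels give seven non-empty subsets but only six distinct wrappers, since two subsets generalize to `*/*`. The wrapper space can be much smaller than the number of subsets, which is why running time is measured against the size of the wrapper space.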
![Page 32: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/32.jpg)
Ranking
Given a set of wrappers, we want to output one that gives the “best” list.
Let X be the list of nodes returned by a wrapper w.
Choose the wrapper that maximizes P[X | L], or equivalently, P[L | X] P[X] (by Bayes' rule, since P[L] does not depend on the wrapper).
![Page 33: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/33.jpg)
Components in Ranking
P[L | X]
Assume a simple annotator model with precision p and recall r that independently labels each node.
A wrapper that includes most of the input labels gets a high score.
P[X]
Captures the structural coherency of the output, independent of the labels.
An output with nice repeating structure gets a high score.
Details can be found in the paper, “Automatic Wrappers for Large Scale Web Extraction”, in VLDB 11.
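A rough sketch of the ranking score, under an assumed reading of the annotator model: each node in X is labeled independently with probability r, and each node outside X is labeled with probability 1 - p. The exact model is in the cited paper and may differ. The structural prior P[X] is left as a caller-supplied function (a constant by default), and the node ids and candidate wrappers are toy data.

```python
import math

def log_likelihood(X, L, all_nodes, p=0.9, r=0.5):
    """log P[L | X] under the assumed independent-labeling model."""
    X, L, N = set(X), set(L), set(all_nodes)
    labeled_in = len(X & L)        # nodes in X that were labeled
    unlabeled_in = len(X - L)      # nodes in X the annotator missed
    labeled_out = len(L - X)       # labels the wrapper does not cover
    unlabeled_out = len(N - X - L) # everything else on the page
    return (labeled_in * math.log(r) + unlabeled_in * math.log(1 - r)
            + labeled_out * math.log(1 - p) + unlabeled_out * math.log(p))

def best_wrapper(candidates, L, all_nodes, log_prior=lambda X: 0.0):
    """Name of the candidate maximizing log P[L | X] + log P[X]."""
    return max(candidates,
               key=lambda name: log_likelihood(candidates[name], L, all_nodes)
               + log_prior(candidates[name]))

all_nodes = range(10)
L = {0, 1, 2}                     # noisy labels from an annotator
candidates = {
    "narrow": {0},                # covers only one label
    "list": {0, 1, 2, 3},         # covers all labels plus one unlabeled node
    "everything": set(range(10)), # covers the whole page
}
best = best_wrapper(candidates, L, all_nodes)
print(best)
```

With these toy numbers the wrapper that covers all labels with few extra nodes wins over both the overly narrow and the overly broad candidate.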
![Page 34: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/34.jpg)
Experiments
Datasets:
DEALERS : Used automatic form-filling techniques to obtain dealer listings from 300 store locator pages.
DISCOGRAPHY : Crawled 14 music websites that contain track listings of albums.
Task : Automatically learn wrappers to extract business names/track titles for each of the websites.
![Page 35: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/35.jpg)
![Page 36: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/36.jpg)
![Page 37: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/37.jpg)
Conclusions
Domain-centric extraction is a promising first step towards the general problem of web-scale information extraction.
Domain-level supervision, along with content redundancy across sites and structural coherency within sites, can be effectively leveraged.
![Page 38: Domain-Centric Information Extraction](https://reader035.fdocuments.us/reader035/viewer/2022070500/56816859550346895dde8d3a/html5/thumbnails/38.jpg)
Thank You!