Exploring the Application Potential of Relational Web Tables
WebTables: Exploring the Power of Tables on the Web
Transcript of WebTables: Exploring the Power of Tables on the Web
![Page 1: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/1.jpg)
WebTables: Exploring the Power of Tables on the Web
Michael J. Cafarella, University of Washington (presently with University of Michigan)
Alon Halevy, Google
Daisy Zhe Wang, UC Berkeley
Eugene Wu, MIT
Yang Zhang, MIT
Proceedings of VLDB '08, Auckland, New Zealand
Presented by : Udit Joshi
![Page 2: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/2.jpg)
Introduction
• Web: a corpus of unstructured documents
• Relational data is often encountered
• 14.1 billion HTML tables extracted by crawl
• Non-relational tables filtered out
• Corpus of 154M (about 1%) high-quality relations
• Searching and ranking
• Leveraging the statistical information
![Page 3: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/3.jpg)
A typical use of the table tag to describe relational data
![Page 4: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/4.jpg)
Contribution
• Ample user demand for structured data and visualisation
• Around 30 million queries from Google’s 1-day log
• Extracting a corpus of high-quality relations (previous work)
• Determining effective relation ranking methods for search
• Analyzing and leveraging this corpus
![Page 5: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/5.jpg)
Outline
Data Model
• Relation Extraction
• Attribute Correlation Statistics Database (ACSDb)
Relation Search
• Challenges
• Ranking Algorithms
ACSDb Applications
• Schema auto-complete
• Attribute synonym finding
• Join graph traversal
Experimental Results
![Page 6: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/6.jpg)
Data Model
• Relation Extraction
• Attribute Correlation Statistics Database (ACSDb)
![Page 7: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/7.jpg)
Relation Recovery
• Crawl based on <table> tag
• Filter out non-relational data
Relation extraction pipeline
![Page 8: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/8.jpg)
Use of Table Tag to Describe Relational Data
![Page 9: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/9.jpg)
Deep Web
• Tables behind HTML forms
• Examples: http://factfinder.census.gov/, http://www.cars.com/
• Most deep web data is not crawlable
• The amount of data in the Deep Web is huge
• Google’s Deep Web Crawl Project uses ‘Surfacing’
• Precomputes a set of relevant form submissions
• Search query for “citibank atm 94043” returns a parameterized URL: http://locations.citibank.com/citibankV2/Index.aspx?zip=94022
• 40% of the corpus comes from deep web sources
![Page 10: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/10.jpg)
Relational Recovery
• Two stages for the extraction system:
– Relational filtering (for “good” relations)
– Metadata detection (in top row of table)
• HTML parser run on a page crawl
• 14.1B instances of the <table> tag
• Script to disregard tables used for layout, forms, calendars, etc.
![Page 11: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/11.jpg)
Relational Filtering
• Human judgment needed
• 2 independent judges given training data
• Scored from 1 to 5
• Qualifying score > 4
![Page 12: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/12.jpg)
Relational Filtering
• Machine-learning classification problem
• Pair human classifications with a set of automatically extracted table features
• Forms a supervised training set for the statistical learner
Table: statistics to help distinguish relational tables (slide annotations: "> 1", "less variation")
![Page 13: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/13.jpg)
Metadata Detection
• Only per-attribute labels needed
• Used in improving rank quality, data visualization, and construction of the ACSDb
Features to detect the header row in a table
![Page 14: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/14.jpg)
Relation Extractor’s Performance
high recall low precision
equal weight
![Page 15: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/15.jpg)
Data Model
• Relation Extraction
• Attribute Correlation Statistics Database (ACSDb)
![Page 16: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/16.jpg)
Attribute Correlation Statistics Database (ACSDb)
• Simple collection of statistics about schema attributes
• Derived from corpus of HTML tables
• Example counts: combo_make_model_year = 13, single_make = 3068
• Available as a single file for download
• 5.4M unique attribute names, 2.6M unique schemas
Source : http://www.eecs.umich.edu/~michjc/acsdb.html
![Page 17: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/17.jpg)
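Recovered Relations

| name | addr | city | state | zip |
|---|---|---|---|---|
| Dan S | 16 Park | Seattle | WA | 98195 |
| Alon H | 129 Elm | Belmont | CA | 94011 |

| make | model | year |
|---|---|---|
| Toyota | Camry | 1984 |

| name | size | last-modified |
|---|---|---|
| Readme.txt | 182 | Apr 26, 2005 |
| cac.xml | 813 | Jul 23, 2008 |

| make | model | year | color |
|---|---|---|---|
| Chrysler | Volare | 1974 | yellow |
| Nissan | Sentra | 1994 | red |

| make | model | year |
|---|---|---|
| Mazda | Protégé | 2003 |
| Chevrolet | Impala | 1979 |

ACSDb

| Schema | Freq |
|---|---|
| {make, model, year} | 2 |
| {name, size, last-modified} | 1 |
| {name, addr, city, state, zip} | 1 |
| {make, model, year, color} | 1 |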
• ACSDb used for computing attribute probabilities
– p(“make”) = 3/5
– p(“zip”) = 1/5
– p(“addr” | “name”) = 1/2
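The three probabilities above can be reproduced from the toy ACSDb on this slide. The following is a minimal sketch, assuming each schema is stored as a set of attribute names with its frequency; the helper names `p` and `p_cond` are illustrative and do not come from the paper.

```python
# Toy ACSDb copied from the slide: schema (set of attribute names) -> frequency.
acsdb = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}


def p(attrs, acsdb):
    """Fraction of schema occurrences that contain every attribute in attrs."""
    attrs = frozenset(attrs)
    total = sum(acsdb.values())
    hits = sum(count for schema, count in acsdb.items() if attrs <= schema)
    return hits / total


def p_cond(target, given, acsdb):
    """Conditional probability p(target | given) = p(target, given) / p(given)."""
    return p(set(target) | set(given), acsdb) / p(given, acsdb)


p_make = p({"make"}, acsdb)                       # 3 of 5 schema occurrences
p_zip = p({"zip"}, acsdb)                         # 1 of 5 schema occurrences
p_addr_given_name = p_cond({"addr"}, {"name"}, acsdb)  # 1 of the 2 "name" schemas
```

The counts are schema occurrences, so {make, model, year} contributes twice to the total of five.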
![Page 18: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/18.jpg)
Structure of Corpus
• Corpus ℛ of databases
• Each database R ∈ ℛ is a single relation
• URL R_u and offset R_i within the page define R
• Schema R_S is an ordered list of attributes, e.g. R_S = [Grand Prix, Date, Winning Driver, …]
• R_T is the list of tuples; the size of each tuple t is ≤ |R_S|
![Page 19: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/19.jpg)
Extracting ACSDb from Corpus
Function createACS(ℛ)
    A = {}
    seenDomains = {}
    for all R ∈ ℛ
        if getDomain(R.u) ∉ seenDomains[R.S] then
            seenDomains[R.S].add(getDomain(R.u))
            A[R.S] = A[R.S] + 1
        end if
    end for
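A runnable version of createACS might look like the sketch below. It assumes each relation is given as a (URL, schema) pair, that a schema is treated as an unordered set of attribute names, and that the host part of the URL is a good enough stand-in for getDomain; all three are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def get_domain(url):
    # Hypothetical stand-in for getDomain: use the host part of the URL.
    return urlparse(url).netloc


def create_acs(relations):
    """Count each schema at most once per source domain.

    relations: iterable of (url, schema) pairs, schema being a list of
    attribute names.
    """
    counts = defaultdict(int)
    seen_domains = defaultdict(set)
    for url, schema in relations:
        key = frozenset(schema)  # assumption: attribute order is ignored
        domain = get_domain(url)
        if domain not in seen_domains[key]:
            seen_domains[key].add(domain)
            counts[key] += 1
    return dict(counts)


relations = [
    ("http://cars.example/a.html", ["make", "model", "year"]),
    ("http://cars.example/b.html", ["make", "model", "year"]),  # same domain
    ("http://autos.example/list.html", ["year", "make", "model"]),
    ("http://files.example/dir.html", ["name", "size", "last-modified"]),
]
acs = create_acs(relations)
```

Counting once per domain keeps a single site that repeats the same schema on many pages from dominating the statistics.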
![Page 20: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/20.jpg)
Distribution of frequency-ordered unique schemas in ACSDb
Small number of schemas appear very frequently
![Page 21: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/21.jpg)
Relational Search
• Challenges
• Ranking Algorithms
![Page 22: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/22.jpg)
Relational Search
• Search-engine-style keyword-based queries
• Query-appropriate visualizations
• Structured operations supported over search results
• Good search relevance is the key
![Page 23: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/23.jpg)
Relational Search
Keyword query
Possible visualization
Ranked list of databases returned
![Page 24: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/24.jpg)
Relation Ranking Challenges
• Relations are a mixture of “structure” and “content”
• Lack incoming hyperlink anchor text used in traditional IR
• PageRank-style metrics unsuitable
• Inverted index unsuitable
![Page 25: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/25.jpg)
Relation Ranking Challenges
• No domain-specific schema graph
• Applying word frequency to embedded tables
• Factoring in relation-specific features: schema elements, presence of keys, size of relation, # of NULLs
![Page 26: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/26.jpg)
Relational Search
• Challenges
• Ranking Algorithms
![Page 27: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/27.jpg)
Naïve Rank
• Query q and top-k parameter as input
• Query sent to search engine
• Fetches top-k pages, extracts tables from each page
• Stops even if fewer than k tables are returned

Function naiveRank(q, k):
    let U = urls from web search for query q
    for i = 0 to k do
        emit getRelations(U[i])
    end for
![Page 28: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/28.jpg)
Filter Rank
• Slight improvement
• Ensures k relations extracted

Function filterRank(q, k):
    let U = ranked urls from web search for query q
    let numEmitted = 0
    for all u ∈ U do
        for all r ∈ getRelations(u) do
            if numEmitted >= k then
                return
            end if
            emit r; numEmitted++
        end for
    end for
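The difference between naiveRank and filterRank shows up when some top pages contain no relations. The sketch below assumes the web search has already produced a ranked URL list and that `get_relations` is a caller-supplied function returning the relations found on a page; both are hypothetical stand-ins for the search engine and extractor.

```python
from itertools import islice


def naive_rank(ranked_urls, get_relations, k):
    """Emit every relation found on the top-k pages, however few that is."""
    results = []
    for url in ranked_urls[:k]:
        results.extend(get_relations(url))
    return results


def filter_rank(ranked_urls, get_relations, k):
    """Walk down the ranked pages until k relations have been emitted."""
    def all_relations():
        for url in ranked_urls:
            yield from get_relations(url)
    return list(islice(all_relations(), k))


# Toy data: the top-ranked page has no relational table at all.
pages = {"u1": [], "u2": ["r1"], "u3": ["r2", "r3"]}
urls = ["u1", "u2", "u3"]
naive = naive_rank(urls, pages.get, 2)
filtered = filter_rank(urls, pages.get, 2)
```

Here the naive variant looks only at u1 and u2 and so returns a single relation, while the filtered variant keeps going until it has two.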
![Page 29: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/29.jpg)
Feature Rank
• No reliance on existing search engine
• Uses several features to score each extracted relation in the corpus
• Feature scores combined using Linear Regression Estimation (LRE)
• LRE trained on a thousand (q, relation) pairs
• Judged by two judges on a scale of 1-5
• Results sorted on score
![Page 30: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/30.jpg)
Feature Rank
Function featureRank(q, k):
    let R = set of all relations extracted from corpus
    let score(r ∈ R) = combination of per-relation features
    sort r ∈ R by score(r)
    for i = 0 to k do
        emit R[i]
    end for

Query-independent features:
• # rows, # cols
• has-header?
• # of NULLs in table

Query-dependent features:
• document-search rank of source page
• # hits on header
• # hits on leftmost column
• # hits on second-to-leftmost column
• # hits on table body

Slide annotations: "Subject matter", "Semantic key"
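As a rough illustration of featureRank, the sketch below scores each relation with a weighted sum of a few of the listed features and returns the top k. The weights are invented for the example; in the system described they would come from the trained linear regression, and the feature extraction here is a simplified assumption.

```python
def featurize(relation, query_terms):
    """Compute a handful of the slide's features for one relation.

    relation: dict with "header" (list of str) and "rows" (list of lists).
    """
    terms = {t.lower() for t in query_terms}
    cells = [c for row in relation["rows"] for c in row]
    return {
        "num_rows": len(relation["rows"]),
        "num_nulls": sum(1 for c in cells if c in ("", None)),
        "header_hits": sum(1 for h in relation["header"] if h.lower() in terms),
        "leftmost_hits": sum(
            1 for row in relation["rows"] if row and str(row[0]).lower() in terms
        ),
    }


def feature_rank(relations, query_terms, weights, k):
    """Sort relations by a linear combination of their features; keep top k."""
    def score(relation):
        feats = featurize(relation, query_terms)
        return sum(weights[name] * value for name, value in feats.items())
    return sorted(relations, key=score, reverse=True)[:k]


# Made-up weights, for illustration only.
weights = {"num_rows": 0.1, "num_nulls": -0.5, "header_hits": 2.0, "leftmost_hits": 1.0}
cars = {"id": "cars", "header": ["make", "model"], "rows": [["Toyota", "Camry"]]}
files = {"id": "files", "header": ["name", "size"], "rows": [["cac.xml", ""]]}
top = feature_rank([files, cars], ["make", "toyota"], weights, 1)
```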
![Page 31: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/31.jpg)
Schema Rank
• Uses ACSDb-based schema coherency score
• A coherent schema implies a tighter relation
• High: {make, model}
• Low: {make, zipcode}
• Pointwise Mutual Information (PMI) determines how strongly two items are related
• Positive (strongly correlated), negative (negatively correlated), 0 (independent)
• Coherency score for schema S is the average of the pairwise PMI scores over all pairs of attributes in the schema
![Page 32: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/32.jpg)
Schema Rank
• Coherency Score
• Pointwise Mutual Information (PMI)
• 0, + & -

Function cohere(R):
    totalPMI = 0
    for all a ∈ attrs(R), b ∈ attrs(R), a ≠ b do
        totalPMI = totalPMI + PMI(a, b)
    end for
    return totalPMI / (|R| ∗ (|R| − 1))
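The coherency score can be tried on the small ACSDb from the earlier slide. This sketch uses the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))) with probabilities taken from schema frequencies. Two choices are assumptions made here, not taken from the slides: a pair that never co-occurs gets a PMI of negative infinity, and a one-attribute schema scores 0.

```python
import math
from itertools import permutations

# Toy ACSDb: schema (set of attribute names) -> frequency.
acsdb = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}


def p(attrs, acsdb):
    """Fraction of schema occurrences containing all of attrs."""
    attrs = frozenset(attrs)
    total = sum(acsdb.values())
    return sum(n for schema, n in acsdb.items() if attrs <= schema) / total


def pmi(a, b, acsdb):
    """Pointwise mutual information of two attributes."""
    joint = p({a, b}, acsdb)
    if joint == 0:
        return float("-inf")  # assumption: never co-occurring pairs score -inf
    return math.log(joint / (p({a}, acsdb) * p({b}, acsdb)))


def cohere(schema, acsdb):
    """Average PMI over all ordered pairs of distinct attributes."""
    attrs = list(schema)
    n = len(attrs)
    if n < 2:
        return 0.0  # assumption: a single attribute is trivially coherent
    total = sum(pmi(a, b, acsdb) for a, b in permutations(attrs, 2))
    return total / (n * (n - 1))


high = cohere(["make", "model"], acsdb)
low = cohere(["make", "zip"], acsdb)
```

On this toy data {make, model} gets a positive score and {make, zip} the lowest possible one, which matches the High/Low examples on the previous slide.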
![Page 33: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/33.jpg)
Indexing
• Inverted index (term -> docid, offset)
• WebTables data exists in two dimensions
• (term -> tableid, (x, y) offsets) is better suited for the ranking function
• Supports queries with spatial operators like samerow and samecol
• Example: Paris and France on the same row; Paris, London and Madrid in the same column
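A two-dimensional index of this kind can be sketched in a few lines. The version below maps each term to (table id, x, y) postings and implements samerow and samecol as set intersections; the data layout and function names are assumptions for illustration, not the system's actual index format.

```python
from collections import defaultdict


def build_index(tables):
    """tables: dict of table_id -> list of rows, each row a list of cell strings."""
    index = defaultdict(list)
    for table_id, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                for term in cell.lower().split():
                    index[term].append((table_id, x, y))
    return index


def samerow(index, term1, term2):
    """Tables in which both terms occur somewhere on the same row."""
    rows1 = {(tid, y) for tid, _x, y in index.get(term1, [])}
    return {tid for tid, _x, y in index.get(term2, []) if (tid, y) in rows1}


def samecol(index, *terms):
    """Tables in which all terms occur in the same column."""
    common = None
    for term in terms:
        cols = {(tid, x) for tid, x, _y in index.get(term, [])}
        common = cols if common is None else common & cols
    return {tid for tid, _x in (common or set())}


tables = {
    "capitals": [["city", "country"], ["Paris", "France"],
                 ["London", "UK"], ["Madrid", "Spain"]],
    "mixed": [["Paris", "London"], ["France", "Madrid"]],
}
index = build_index(tables)
row_hits = samerow(index, "paris", "france")
col_hits = samecol(index, "paris", "london", "madrid")
```

A plain (term -> docid, offset) index could not distinguish the two tables here, since both contain all four terms.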
![Page 34: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/34.jpg)
Web Tables Search System
Index split across servers
![Page 35: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/35.jpg)
ACSDb Applications
• Schema Auto-Complete
• Attribute Synonym-Finding
• Join Graph Traversal
![Page 36: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/36.jpg)
Schema Auto-Complete
• To assist novice database designers
• User enters one or more domain-specific attributes (example: “make”)
• System guesses suggestions appropriate to the target domain (example: “model”, “year”, “price”, “mileage”)
![Page 37: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/37.jpg)
Schema Auto-Complete
• Given the input attributes I, find a schema S that maximizes p(S − I | I)
• Probability values computed from ACSDb
• Attributes added to S from the overall attribute set A
• Threshold t set to .01
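One way to read the bullets above is as a greedy loop: keep adding the attribute that leaves p(S − I | I) highest, and stop once that probability would fall to the threshold t or below. The sketch below follows that reading on the toy ACSDb from the earlier slide; the loop structure and tie-breaking (alphabetical) are assumptions, not a transcription of the paper's algorithm.

```python
# Toy ACSDb: schema (set of attribute names) -> frequency.
acsdb = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}


def p(attrs, acsdb):
    """Fraction of schema occurrences containing all of attrs."""
    attrs = frozenset(attrs)
    total = sum(acsdb.values())
    return sum(n for schema, n in acsdb.items() if attrs <= schema) / total


def schema_suggest(inputs, acsdb, t=0.01):
    """Greedily extend the input attributes while p(S - I | I) stays above t."""
    given = frozenset(inputs)
    p_given = p(given, acsdb)
    if p_given == 0:
        return []  # the input attributes never appear together
    all_attrs = set().union(*acsdb)
    suggested = []
    while True:
        best, best_p = None, 0.0
        for a in sorted(all_attrs - given - set(suggested)):
            prob = p(given | set(suggested) | {a}, acsdb) / p_given
            if prob > best_p:
                best, best_p = a, prob
        if best is None or best_p <= t:
            return suggested
        suggested.append(best)


loose = schema_suggest({"make"}, acsdb, t=0.01)
strict = schema_suggest({"make"}, acsdb, t=0.5)
```

With the slide's threshold of .01 the rare attribute "color" is still suggested; a stricter threshold keeps only attributes that nearly always accompany "make".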
![Page 38: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/38.jpg)
ACSDb Applications
• Schema Auto-Complete
• Attribute Synonym-Finding
• Join Graph Traversal
![Page 39: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/39.jpg)
Attribute Synonym-Finding
• Traditionally done using thesauri
• Thesauri do not support non-natural-language strings, e.g. tel-#
• Input: set of context attributes, C
• Output: list of attribute pairs P likely to be synonymous in schemas that contain C
• Example: for the context attribute “artist”, the output is “song/track”
![Page 40: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/40.jpg)
Attribute Synonym-Finding
• For synonymous attributes a, b: p(a,b) = 0
• If p(a,b) = 0 and p(a)p(b) is large, the syn score is high
• Synonyms appear in similar contexts C: for a third attribute z, z ∉ C, z ∈ A, p(z|a,C) ≈ p(z|b,C)
• If a, b always “replace” each other then the denominator ≈ 0, else the denominator is large
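The "denominator" in the last bullet refers to a scoring formula that did not survive in this transcript. As recalled from the WebTables paper, and offered here as a reconstruction rather than a quotation, the score has the following form, where ε is a small constant that keeps the denominator from being zero:

```latex
\[
  \mathrm{syn}(a, b) \;=\;
  \frac{p(a)\,p(b)}
       {\epsilon + \sum_{z \in A} \bigl( p(z \mid a, C) - p(z \mid b, C) \bigr)^{2}}
\]
```

The numerator rewards attributes that are each common, and the sum in the denominator shrinks toward zero when a and b are surrounded by the same attributes z, which is the "replace each other" case.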
![Page 41: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/41.jpg)
Attribute Synonym-Finding
Function SynFind(C, t):
    R = []
    A = all attributes that appear in ACSDb with C
    for a ∈ A, b ∈ A, s.t. a ≠ b do
        if (a, b) ∉ ACSDb then
            // Score candidate pair with syn function
            if syn(a, b) > t then
                R.append(a, b)
            end if
        end if
    end for
    sort R in descending syn order
    return R
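SynFind can be exercised on a small, invented music ACSDb. The sketch below assumes the score is p(a)p(b) divided by ε plus the summed squared differences of p(z | a, C) and p(z | b, C); it also excludes a and b themselves from the sum and visits each unordered pair once. The toy data, the value of ε and those two choices are assumptions for illustration.

```python
from itertools import combinations

# Invented toy ACSDb: schema (set of attribute names) -> frequency.
acsdb = {
    frozenset({"artist", "song", "album"}): 4,
    frozenset({"artist", "track", "album"}): 3,
    frozenset({"artist", "song", "label"}): 2,
}


def syn_find(context, acsdb, t, eps=0.01):
    """Return attribute pairs scored above t, best first."""
    ctx = frozenset(context)
    total = sum(acsdb.values())

    def p(attrs):
        attrs = frozenset(attrs)
        return sum(n for s, n in acsdb.items() if attrs <= s) / total

    def p_given(z, a):
        """p(z | a, C)."""
        denom = p(ctx | {a})
        return p(ctx | {a, z}) / denom if denom else 0.0

    # All attributes that appear in some schema together with the context.
    candidates = set().union(*(s for s in acsdb if ctx <= s)) - ctx
    scored = []
    for a, b in combinations(sorted(candidates), 2):
        if p({a, b}) > 0:
            continue  # attributes seen together are not treated as synonyms
        others = candidates - {a, b}
        spread = sum((p_given(z, a) - p_given(z, b)) ** 2 for z in others)
        score = p({a}) * p({b}) / (eps + spread)
        if score > t:
            scored.append((score, a, b))
    scored.sort(reverse=True)
    return [(a, b) for _score, a, b in scored]


pairs = syn_find({"artist"}, acsdb, t=0.5)
```

On this data "song" and "track" never co-occur, are both frequent, and share the neighbour "album", so they are the only pair above the threshold.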
![Page 42: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/42.jpg)
ACSDb Applications
• Schema Auto-Complete
• Attribute Synonym-Finding
• Join Graph Traversal
![Page 43: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/43.jpg)
Join Graph Traversal
• Assist a schema designer
• Join graph (N, L)
• Node for every unique schema, undirected join link between any 2 schemas sharing a label
• Join graph is cluttered
• Cluster together similar schema neighbors
![Page 44: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/44.jpg)
Join Neighbor Similarity
• Measures whether a shared attribute D plays a similar role in schemas X and Y
• Similar to the coherency score, except that the probability inputs to the PMI function are conditioned on the presence of D
• Two schemas that cohere well are clustered together
• Used as a distance metric to cluster schemas sharing an attribute with S
• User can choose from fewer outgoing links
![Page 45: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/45.jpg)
Join Graph Traversal
// input: ACSDb A, focal schema F
// output: Join Graph (N, L) connecting any two schemas with shared attributes
Function ConstructJoinGraph(A, F):
    N = {}
    L = {}
    // schema S, shared attribute c
    for (S, c) ∈ A do
        N.add(S) // add node
    end for
    for (S, c) ∈ A do
        for attr ∈ F do
            if attr ∈ S then
                L.add((attr, F, S)) // add link
            end if
        end for
    end for
    return N, L
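The join graph construction translates almost directly into Python. This sketch assumes the ACSDb is a mapping from schema (a set of attribute names) to frequency, and it skips links from the focal schema to itself, which the pseudocode does not spell out.

```python
# Toy ACSDb: schema (set of attribute names) -> frequency.
acsdb = {
    frozenset({"make", "model", "year"}): 2,
    frozenset({"name", "size", "last-modified"}): 1,
    frozenset({"name", "addr", "city", "state", "zip"}): 1,
    frozenset({"make", "model", "year", "color"}): 1,
}


def construct_join_graph(acsdb, focal):
    """Return (nodes, links) where each link is (shared attr, focal, schema)."""
    focal = frozenset(focal)
    nodes = set(acsdb)  # one node per unique schema
    links = set()
    for schema in acsdb:
        if schema == focal:
            continue  # assumption: no self-links for the focal schema
        for attr in focal & schema:
            links.add((attr, focal, schema))
    return nodes, links


nodes, links = construct_join_graph(acsdb, {"make", "model", "year"})
shared = sorted(attr for attr, _focal, _schema in links)
```

Each shared attribute produces its own link, which is why the graph gets cluttered and the neighbors are later clustered.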
![Page 46: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/46.jpg)
Experimental Results
![Page 47: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/47.jpg)
Fraction of High Scoring Relevant Tables in Top-k
• Ranking: compared 4 algorithms on a test dataset, with two judges
• Judges rate (query, relation) pairs from 1-5
• 1000 pairs over 30 queries
• Queries chosen by hand
• Fraction of top-k that are relevant (≥4) shows better performance at higher grain

| k | Naïve | Filter | Rank | Rank-ACSDb |
|---|---|---|---|---|
| 10 | 0.26 | 0.35 (35%) | 0.43 (65%) | 0.47 (81%) |
| 20 | 0.33 | 0.47 (42%) | 0.56 (70%) | 0.59 (79%) |
| 30 | 0.34 | 0.59 (74%) | 0.66 (94%) | 0.68 (100%) |
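The percentages in parentheses appear to be the relative improvement of each method over Naïve at the same k. The slide does not say so explicitly, so this is an inference; the short check below recomputes them from the fractions in the table.

```python
# Fractions from the table: k -> (Naive, Filter, Rank, Rank-ACSDb).
results = {
    10: (0.26, 0.35, 0.43, 0.47),
    20: (0.33, 0.47, 0.56, 0.59),
    30: (0.34, 0.59, 0.66, 0.68),
}


def improvement(baseline, value):
    """Relative improvement over the baseline, as a whole-number percentage."""
    return round(100 * (value / baseline - 1))


improvements = {
    k: tuple(improvement(naive, v) for v in others)
    for k, (naive, *others) in results.items()
}
```

The recomputed values agree with every percentage shown on the slide, which supports that reading.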
![Page 48: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/48.jpg)
Schema Auto-Completion
Figure: example auto-completion outputs (panels: "Baseball at-bats", "File system contents")
![Page 49: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/49.jpg)
Rate of attribute recall for 10 expert generated test schemas
• Output schema almost always coherent
• Need to get most relevant attributes
• 6 humans created schema for each case
• Retained attributes ≥ 2
• 3 tries
Slide annotations: "files sys -> address book", "Incremental improvement", "Ambiguous data"
![Page 50: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/50.jpg)
Synonym Finding
Fraction of correct synonyms in top-k ranked list from the synonym finder
Judge determines accuracy
![Page 51: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/51.jpg)
Join Neighbor Similarity
• Join Graph Traversal
Neighbor Schemas
Dataset generated from a workload of 10 focal schemas
Very few incorrect schema members
![Page 52: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/52.jpg)
Future Scope
• Using tuple-keys as an analogue to attribute labels, create a “data-suggest” feature
• Creating new data sets by integrating this corpus with a user’s private data
• Expanding the WebTables search engine to incorporate a page quality metric like PageRank
• Including non-HTML tables, deep web databases and HTML lists
![Page 53: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/53.jpg)
Conclusion
• First large-scale attempt to extract relational information from a corpus of HTML tables
• Created unique ACSDb statistics
• Showed utility of ACSDb
![Page 54: WebTables: Exploring the Power of Tables on the Web](https://reader036.fdocuments.us/reader036/viewer/2022062302/56816715550346895ddb82d7/html5/thumbnails/54.jpg)
References
• V. Hristidis and Y. Papakonstantinou, “Discover: Keyword search in relational databases”, In VLDB, 2002.
• J. Madhavan, A. Y. Halevy, S. Cohen, X. L. Dong, S. R. Jeffery, D. Ko, and C. Yu, “Structured data meets the web: A few observations”, IEEE Data Eng. Bull., 29(4):19–26, 2006.
• M. Cafarella, J. Madhavan, and A. Halevy, “Web-Scale Extraction of Structured Data”, SIGMOD Record 37(4): 55–61, 2008.
• M. Cafarella, A. Halevy, Z. Wang, E. Wu, and Y. Zhang, “Uncovering the relational web”, Eleventh International Workshop on the Web and Databases (WebDB), June 2008, Vancouver, Canada.
• M. Cafarella, A. Halevy, and J. Madhavan, “Structured Data on the Web”, Communications of the ACM 54(2): 72–79, 2011.