Answering Table Queries on the Web using Column Keywords
Rakesh Pimplikar IBM Research
Sunita Sarawagi IIT Bombay
1
User Query
Answer Table
Table Query Example
Name of Explorers Nationality Areas Explored
Vasco da Gama Portuguese Sea route to India
Abel Tasman Dutch Oceania
Christopher Columbus Caribbean
. . . . . . . . .
2
Name of Explorers Nationality Areas Explored
Mountains in North America Mountains in North America
Mount McKinley
Mount Saint Elias
Mount Lucania
. . .
Types of Structured Queries
• Entities
• Relationship between two entities
• Entities with values of attributes
Pain Killers Side Effects Pain Killers Side Effects
aspirin asthma
ibuprofen asthma, upset stomach
naproxen sodium upset stomach
. . . . . .
Name of Explorers
Nationality Areas Explored
Name of Explorers Nationality Areas Explored
Vasco da Gama Portuguese Sea route to India
Abel Tasman Dutch Oceania
Christopher Columbus Caribbean
. . . . . . . . .
3
Our Data Source: Tables on the Web
(elizabethan-era.org.uk)
4
(wikipedia.org)
(vaughns-1-pagers.com)
• Richer sources of structured knowledge than free-format text
Afghanistan Kabul
Albania Tirana
Algeria Algiers
Andorra Andorra la Vella
. . . . . .
Challenges
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
Forest reserves
ID Name Area
7 Shakespeare Hills 2236
9 Plains Creek 880
13 Welcome Swamp 168
. . . . . . . . .
<table>
<tr><th>…</th></tr>
…</table>
<table>
<tr><td>…</td></tr>
…</table>
5
Year Name Subject
1902 Ronald Ross Medicine
1907 Rudyard Kipling Literature
. . . . . . . . .
The present list contains winners under the country/countries that are stated by the Nobel Prize committee on its website.
Nobel Prize Winners
User Query
• Limited column specific information
The query is a set of keywords; web tables have headers.
• Designated HTML table header tag is not always used (80%).
• Many tables have no headers (18%).
• Header text is often uninformative.
• Context of a table can be helpful, but it does not give column specific information and it is often noisy.
System Architecture of WWT
6
The Column Mapping Task
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
Exploration Who (explorer)
(Chronological order)
Sea route to India Vasco da Gama
Caribbean Christopher Columbus
Oceania Abel Tasman
. . . . . .
This article lists the explorations in history. For the documentary 'Explorations, powered by Duracell', see Explorations (TV)
List of explorers - Wikipedia, the free encyclopedia
Forest reserves
ID Name Area
7 Shakespeare Hills 2236
9 Plains Creek 880
13 Welcome Swamp 168
. . . . . . . . .
Other Formal Reserves 1.3 Forest Reserves under the Forestry Act 1920
All areas will be available for mineral exploration and mining
Name of Explorers Nationality Areas Explored
User Query
Web Table 1, Web Table 2, Web Table 3
Index Probe Relevant Tables
7
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
User Query
The Column Mapping Task
Name of Explorers Nationality Areas Explored
Name of Explorers Nationality Areas Explored
Vasco da Gama Portuguese Sea route to India
Abel Tasman Dutch Oceania
Christopher Columbus Caribbean
Alexander Mackenzie British Canada
. . . . . . . . .
Answer Table
Map Columns
Consolidation
Name of Explorers Nationality Areas Explored
8
A Baseline Approach
• For each table t
  • Step 1: Is t relevant?
    IR_Sim(Q, C) + λ · IR_Sim(Q, h) > Threshold ?
  • Step 2: If yes, map columns of t to columns in Q
    Edge Weight = IR_Sim(Qi, hj)
[Figure: bipartite graph between the columns Q1, Q2, Q3 of the user query Q and the headers h1, h2, h3, h4 of table t, which has context C]
9
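The two baseline steps can be sketched in a few lines. This is only an illustration under stated assumptions: `ir_sim` uses plain cosine similarity over term counts as a stand-in for IR_Sim, the values `lam=1.0` and `threshold=0.3` are made up, and the brute-force matching in `map_columns` is a hypothetical helper that is only practical for the handful of columns a table query has.

```python
import math
from collections import Counter
from itertools import permutations


def ir_sim(a: str, b: str) -> float:
    """Cosine similarity over raw term counts (an assumed stand-in for IR_Sim)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def is_relevant(query_cols, context, headers, lam=1.0, threshold=0.3):
    """Step 1: IR_Sim(Q, C) + lam * IR_Sim(Q, h) > threshold."""
    q, h = " ".join(query_cols), " ".join(headers)
    return ir_sim(q, context) + lam * ir_sim(q, h) > threshold


def map_columns(query_cols, headers):
    """Step 2: brute-force maximum-weight bipartite matching.

    Returns {query column index: header index}; zero-weight pairs are dropped.
    """
    weights = [[ir_sim(q, h) for h in headers] for q in query_cols]
    # Pad with None so that a query column may stay unmatched.
    slots = list(range(len(headers))) + [None] * len(query_cols)
    best, best_score = {}, -1.0
    for perm in set(permutations(slots, len(query_cols))):
        pairs = {
            i: j
            for i, j in enumerate(perm)
            if j is not None and weights[i][j] > 0
        }
        score = sum(weights[i][j] for i, j in pairs.items())
        if score > best_score:
            best, best_score = pairs, score
    return best


query = ["Name of Explorers", "Nationality", "Areas Explored"]
headers = ["Name", "Nationality", "Main areas explored"]
if is_relevant(query, "List of explorers", headers):
    print(map_columns(query, headers))
```

With these assumed settings the explorers table passes the relevance test, while the forest reserves table, whose only overlap with the query is the word "Name", does not.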
Limitations of Baseline
• Cannot match tables with poor/missing headers.
• Exploit content overlap with related tables. How?
E.g.
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
. . . . . . . . .
10
Graphical Model Approach
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
Exploration Who (explorer) Century
(Chronological order)
Sea route to India Vasco da Gama 15th/16th
Caribbean Christopher Columbus 15th/16th
Oceania Abel Tasman 17th
. . . . . . . . .
Forest reserves
ID Name Area
7 Shakespeare Hills 2236
9 Plains Creek 880
13 Welcome Swamp 168
. . . . . . . . .
[Figure: one node per table column, N1 to N9]
Name of Explorers Nationality Areas Explored
User Query
• Create a node for every column
11
Graphical Model Approach
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
Forest reserves
ID Name Area
7 Shakespeare Hills 2236
9 Plains Creek 880
13 Welcome Swamp 168
. . . . . . . . .
Name of Explorers Nationality Areas Explored
User Query
• Possible labels for every node
1. Name of explorers
2. Nationality
3. Areas Explored
12
[Figure: column nodes N1 to N9]
Exploration Who (explorer) Century
(Chronological order)
Sea route to India Vasco da Gama 15th/16th
Caribbean Christopher Columbus 15th/16th
Oceania Abel Tasman 17th
. . . . . . . . .
[Figure: example labels on the column nodes: 1, 1, 2, 3, 3]
Graphical Model Approach
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
Forest reserves
ID Name Area
7 Shakespeare Hills 2236
9 Plains Creek 880
13 Welcome Swamp 168
. . . . . . . . .
Name of Explorers Nationality Areas Explored
User Query
• Possible labels for every node
1. Name of explorers
2. Nationality
3. Areas Explored
4. NA (Not Assigned)
5. NR (Not Relevant)
13
[Figure: column nodes N1 to N9]
Exploration Who (explorer) Century
(Chronological order)
Sea route to India Vasco da Gama 15th/16th
Caribbean Christopher Columbus 15th/16th
Oceania Abel Tasman 17th
. . . . . . . . .
[Figure: example labeling of the nine column nodes with labels 1, 1, 2, 3, 3, NA, NR, NR, NR]
Graphical Model Approach
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
Exploration Who (explorer) Century
(Chronological order)
Sea route to India Vasco da Gama 15th/16th
Caribbean Christopher Columbus 15th/16th
Oceania Abel Tasman 17th
. . . . . . . . .
[Figure: column nodes N1 to N6 of the two tables]
Name of Explorers Nationality Areas Explored
User Query
• Edges
  • Complete bipartite graph between nodes of two tables
  • Content overlap between column contents and headers
  • Maximum bipartite matching
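A rough sketch of how edges between the columns of two tables could be weighted and pruned. It rests on assumptions not stated on the slide: overlap is measured as Jaccard similarity over normalized cell values only (header overlap is left out), and the matching is found by brute force, which is adequate only for tables with a few columns. The names `content_overlap` and `match_columns` are hypothetical.

```python
from itertools import permutations


def content_overlap(col_a, col_b):
    """Jaccard overlap between the normalized cell values of two columns."""
    a = {cell.strip().lower() for cell in col_a}
    b = {cell.strip().lower() for cell in col_b}
    return len(a & b) / len(a | b) if a | b else 0.0


def match_columns(table_a, table_b):
    """One-to-one column matching of maximum total overlap, by brute force.

    Each table maps a column header to its list of cell strings.
    Returns {header in table_a: (header in table_b, weight)}, positive weights only.
    """
    names_a, names_b = list(table_a), list(table_b)
    weights = {
        (a, b): content_overlap(table_a[a], table_b[b])
        for a in names_a
        for b in names_b
    }
    # Pad with None so that a column of table_a may stay unmatched.
    slots = names_b + [None] * len(names_a)
    best, best_score = {}, -1.0
    for perm in set(permutations(slots, len(names_a))):
        pairs = {
            a: (b, weights[a, b])
            for a, b in zip(names_a, perm)
            if b is not None and weights[a, b] > 0
        }
        score = sum(w for _, w in pairs.values())
        if score > best_score:
            best, best_score = pairs, score
    return best


table_1 = {
    "Name": ["Abel Tasman", "Vasco da Gama", "Alexander Mackenzie"],
    "Nationality": ["Dutch", "Portuguese", "British"],
    "Main areas explored": ["Oceania", "Sea route to India", "Canada"],
}
table_2 = {
    "Exploration": ["Sea route to India", "Caribbean", "Oceania"],
    "Who (explorer)": ["Vasco da Gama", "Christopher Columbus", "Abel Tasman"],
}
edges = match_columns(table_1, table_2)
```

On the three rows shown on the slides, "Name" pairs with "Who (explorer)" and "Main areas explored" pairs with "Exploration", even though the headers share no words.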
[Figure: edge weights between the column nodes of the two tables: 0.6, 0.2, 0.1, 0.7, 0, 0, 0, 0, 0.1]
14
Graphical Model Approach
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
Vasco da Gama Portuguese Sea route to India
Alexander Mackenzie British Canada
. . . . . . . . .
Exploration Who (explorer) Century
(Chronological order)
Sea route to India Vasco da Gama 15th/16th
Caribbean Christopher Columbus 15th/16th
Oceania Abel Tasman 17th
. . . . . . . . .
Forest reserves
ID Name Area
7 Shakespeare Hills 2236
9 Plains Creek 880
13 Welcome Swamp 168
. . . . . . . . .
[Figure: column nodes N1 to N9]
Name of Explorers Nationality Areas Explored
User Query
• Edge Potentials
  • Large weights favor the same label on both nodes
  • Soft constraint
[Figure: edges between column nodes with weights 0.6, 0.7, 0.3, 0.4, 0.1]
15
Node Potentials
16
Score expressing the affinity of a table column ci to a query column Qj
Baseline approach: IR similarity between Qj and header of ci
Limitations of Baseline Similarity
• Generic IR similarity is not a good fit for the typical roles of context and headers
  • Context: “topic” of a table
  • Header: label of a column
• New model based on a two-part segmentation of query words over context and header.
17
The present list contains laureates under the country/countries that are stated by the Nobel Prize committee on its website.
Year Winners Subject
1902 Ronald Ross Medicine
1907 Rudyard Kipling Literature
. . . . . . . . .
Nobel Prize Winners
User Query
This article presents a comprehensive list of peaks in North America, highlighting some of the important features.
Mountain Peak Region Elevation
Mount McKinley Alaska 6194 m
Mount Logan Yukon 5956 m
. . . . . . . . .
Mountains in North America
User Query
Limitations of Baseline Similarity
• Matches with other parts of the table are ignored
  • Frequent words in a column
  • Multi-row headers
    • Split headers match to the union of words, vs. sub-headers match only to one header
    • How to detect which of the two?
• Take a soft-max over matches with the context, body, and other headers of the table for one part of the segmented query.
18
Band name Country Genre
Aarcon Germany Black Metal
Act of God Russia Melodic Black
Adragard Italy Black Metal
. . . . . . . . .
Black metal bands
Name Nationality Main areas explored
Abel Tasman Dutch Oceania
. . . . . . . . .
Exploration Who (explorer)
(Chronological order)
Sea route to India Vasco da Gama
. . . . . .
User Query
Name of Explorers Nationality Areas Explored
User Query
Segmented Similarity
• Similarity score between a table column ci and a query column Qj
[Figure: user query with columns "Nobel Prize Winners" and "Year"; Qj is a query column and ci is a table column]
19
Segmented Similarity
• Similarity score between a table column ci and a query column Qj
• Maximum soft-max score over matches of different segments of the query with different parts of a table
[Figure: the query segments "Nobel Prize" and "Winners" matched against the title, context, and header rows of a table. Sections shown for column ci: header text, other headers in the same row, other header rows in the same column, frequent body contents, context, and title, with a soft-max over matches over all sections]
20
Segmented Similarity
• Step 1: Segment query column keywords into two parts
• Step 2: Similarity scores between each part and different sections of table
• Step 3: Soft-max over sections of table where each part matches
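The three steps can be illustrated with a small sketch. Everything beyond the step outline is an assumption made for illustration: `overlap` is a simple word-containment score, `soft_max` is an exponentially weighted average with a made-up temperature, the suffix is matched against the column's own header while the prefix is soft-maxed over the other sections, and the two parts are combined by a product. The slides do not give the exact scoring function.

```python
import math


def overlap(part, text):
    """Fraction of the words in `part` that occur in `text`."""
    if not part:
        return 0.0
    text_words = set(text.lower().split())
    return sum(word in text_words for word in part) / len(part)


def soft_max(scores, temperature=0.1):
    """Exponentially weighted average; approaches max(scores) as temperature -> 0."""
    if not scores:
        return 0.0
    weights = [math.exp(s / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def segmented_similarity(query_col, header, other_sections, temperature=0.1):
    """Best score over all two-part segmentations of the query column keywords."""
    words = query_col.lower().split()
    best = 0.0
    for k in range(len(words)):  # the suffix words[k:] is never empty
        prefix, suffix = words[:k], words[k:]
        header_score = overlap(suffix, header)  # Step 2, suffix vs. header of ci
        if prefix:
            # Steps 2 and 3: prefix vs. every other section, then soft-max.
            rest_score = soft_max(
                [overlap(prefix, text) for text in other_sections], temperature
            )
        else:
            rest_score = 1.0  # nothing left to explain outside the header
        best = max(best, header_score * rest_score)
    return best


context = (
    "The present list contains laureates under the country/countries "
    "that are stated by the Nobel Prize committee on its website."
)
sim_winners = segmented_similarity("Nobel Prize Winners", "Winners", [context, "Year Subject"])
sim_year = segmented_similarity("Nobel Prize Winners", "Year", [context, "Winners Subject"])
```

In this toy setting the "Winners" column scores close to 1 because "Nobel Prize" is found in the context and "Winners" in the header, whereas the "Year" column scores 0.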
[Figure: the user query column "Nobel Prize Winners" is split into a prefix and a suffix and matched against the title, context, header rows, and current header row of column ci]
21
Hard Constraints
• MUTEX Constraint
At most one column in a table can be mapped to any given query column.
• ALL-IRR Constraint
If one column in a table is assigned a label NR, then all columns of table must be assigned NR.
• MUST-MATCH Constraint
Every relevant table must contain the first query column.
• MIN-MATCH Constraint
Every relevant table must contain at least m out of q query columns. m = 2 if q >= 2.
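The four constraints can be checked on a single table's labeling with a short function. This is a sketch under assumptions: labels are 0-based query column indices or the strings "NA" and "NR", the first query column is index 0, and `m = 1` is assumed when the query has a single column (the slide only gives m for q >= 2). The function name is hypothetical.

```python
def satisfies_hard_constraints(labels, num_query_cols, min_match=None):
    """Check MUTEX, ALL-IRR, MUST-MATCH and MIN-MATCH for one table.

    `labels` holds one label per table column: a 0-based query column
    index, "NA" (not assigned) or "NR" (not relevant).
    """
    if min_match is None:
        # m = 2 if q >= 2 (from the slide); m = 1 otherwise is an assumption.
        min_match = 2 if num_query_cols >= 2 else 1

    # ALL-IRR: one NR label forces NR on every column of the table.
    if "NR" in labels:
        return all(label == "NR" for label in labels)

    mapped = [label for label in labels if label != "NA"]

    # MUTEX: no query column is used by more than one table column.
    if len(mapped) != len(set(mapped)):
        return False

    # MUST-MATCH: the first query column must be present.
    if 0 not in mapped:
        return False

    # MIN-MATCH: at least m query columns must be present.
    return len(mapped) >= min_match


print(satisfies_hard_constraints([2, 0, "NA"], num_query_cols=3))
```

The printed example corresponds to a table whose columns map to the third query column, the first query column, and nothing, which satisfies all four constraints.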
23
Labeling the Graphical Model
• Final goal: Jointly assign one of |Q|+2 labels to each column to
• maximize sum of node and edge potentials
• satisfy the hard constraints
• NP-Hard
24
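To make the objective concrete, the sketch below scores a labeling as the sum of node and edge potentials and finds the best one by exhaustive search. Several things here are assumptions: the edge potential is a Potts-style reward that pays the edge weight when both endpoints share a query label, only the MUTEX constraint is enforced, the NR label is left out for brevity, and all numbers are invented. Exhaustive search is exponential in the number of columns and is shown only because the graph is tiny; it is not one of the inference algorithms used in the talk.

```python
from itertools import product


def labeling_score(labels, node_pot, edges):
    """Sum of node potentials plus Potts-style edge potentials (assumed form)."""
    score = sum(node_pot[n].get(label, 0.0) for n, label in labels.items())
    # An edge pays its weight only when both ends share a query label.
    score += sum(
        w for (u, v), w in edges.items()
        if labels[u] == labels[v] and labels[u] != "NA"
    )
    return score


def mutex_ok(labels, tables):
    """MUTEX: within a table, each query label is used at most once."""
    for cols in tables:
        used = [labels[c] for c in cols if labels[c] != "NA"]
        if len(used) != len(set(used)):
            return False
    return True


def best_labeling(tables, label_set, node_pot, edges):
    """Exhaustive search over all labelings that satisfy MUTEX."""
    nodes = [c for cols in tables for c in cols]
    best, best_score = None, float("-inf")
    for combo in product(label_set, repeat=len(nodes)):
        labels = dict(zip(nodes, combo))
        if not mutex_ok(labels, tables):
            continue
        score = labeling_score(labels, node_pot, edges)
        if score > best_score:
            best, best_score = labels, score
    return best, best_score


# Two tables; the second has headers that barely match the query.
tables = [["N1", "N2", "N3"], ["N4", "N5"]]
node_pot = {"N1": {1: 0.6}, "N2": {2: 1.0}, "N3": {3: 0.8}, "N4": {}, "N5": {1: 0.2}}
edges = {("N1", "N5"): 0.6, ("N3", "N4"): 0.7}
best, score = best_labeling(tables, [1, 2, 3, "NA"], node_pot, edges)
```

Here N4 has no header evidence at all, yet it receives label 3 because its edge to N3 rewards agreement, which is the effect the edge potentials are meant to produce.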
[Figure: column nodes N1 to N9]
Inference Algorithms
• Collective Inference: Table-Centric
  • Edge potentials are used to modify node potentials
  • Optimal table-level inference
• Collective Inference: Edge-Centric
  • Edge potentials are given central importance.
  • Existing inference algorithms: Belief Propagation, TRWS, MPLP, α-Expansion, etc.
  • Modified α-Expansion works best in our case.
25
Experimental Setup
• Workload
59 multi-column queries mostly collected from Amazon Mechanical Turk (AMT) service [Cafarella et al, 2009]
• Data source
25 million tables from a web crawl of 500 million pages
• Ground Truth
Manual labeling for 1906 web tables
26
Column Mapping Methods
• Baseline
• NbrText
Baseline augmented with similarity scores from neighboring columns
• PMI [Cafarella et al, 2009]
Baseline augmented with corpus wide co-occurrence score of column contents and a label
• WWT
Our graphical model based approach with table-centric collective inference
27
Column Mapping Methods Comparison
28
• Overall error: WWT 30.3%, Baseline and PMI 34.7%, NbrText 34.2%.
Running Time
29
• Actual column mapping takes less than half a second.
• Time for table & index read can be improved with better machine configuration.
Summary
• Presented a graphical model approach for answering table queries
• A novel method to find similarity using a two-part query segmentation model
• A robust mechanism for exploiting content and header overlap across table columns
• Different algorithms for inference in the graphical model
• 12% reduction in error relative to a baseline method
• Future Work
  • Exploiting newer corpus-wide co-occurrence statistics
  • Alternative structured sources such as ontologies
  • Enhance the search experience via faceted search and user feedback.
30
Thank you.
31
Related Work
• Query-By-Example Paradigm [2]
  • Extracting tables from lists on the web
  • Web tables are not considered
• Halevy et al. [3, 4] highlight the potential of web tables as a source of structured information
  • Collecting offline information like attribute columns, attribute associations, etc.
• OCTOPUS [1]
  • Multiple user interactions are necessary
  • PMI score for relevance ranking is not effective in our case.
• Schema Matching [5, 6]
  • Managing complex alignment between the large number of schema elements in two databases
  • Web tables are noisy, unlike database tables.
32
References
1. M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data integration for the relational web. PVLDB, 2(1):1090–1101, 2009.
2. R. Gupta and S. Sarawagi. Answering table augmentation queries from unstructured lists on the web. PVLDB, 2(1):289–300, 2009.
3. M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
4. M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu. Uncovering the relational web. In WebDB, 2008.
5. A. Doan and A. Y. Halevy. Semantic integration research in the database community: A brief survey. The AI Magazine, 26(1):83–94, 2005.
6. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.
33
Segmented Similarity
34
• Overall error reduction from 33.3% to 30.3%
• Reduction is more than 10% in 8 cases.