Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita...

34
Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1

Transcript of Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita...

Page 1: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Answering Table Queries on the Web using Column Keywords

Rakesh PimplikarIBM Research

Sunita Sarawagi IIT Bombay

1

Page 2: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

User Query

Answer Table

Table Query Example

Name of Explorers Nationality Areas Explored

Vasco da Gama Portuguese Sea route to India

Abel Tasman Dutch Oceania

Christopher Columbus Caribbean

. . . . . . . . .

2

Name of Explorers Nationality Areas Explored

Page 3: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Mountains in North America Mountains in North America

Mount McKinley

Mount Saint Elias

Mount Lucania

. . .

Types of Structured Queries• Entities

• Relationship between two entities

• Entities with values of attributes

Pain Killers Side Effects Pain Killers Side Effects

aspirin asthma

ibuprofen asthma, upset stomach

naproxen sodium upset stomach

. . . . . .

Name of Explorers

Nationality Areas Explored

Name of Explorers Nationality Areas Explored

Vasco da Gama Portuguese Sea route to India

Abel Tasman Dutch Oceania

Christopher Columbus Caribbean

. . . . . . . . .

3

Page 4: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Our Data SourceTables on the Web

(elizabethan-era.org.uk)

4

(wikipedia.org)

(vaughns-1-pagers.com)

• Richer sources of structured knowledge than free-format text

Page 5: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Afghanistan Kabul

Albania Tirana

Algeria Algiers

Andorra Andorra la Vella

. . . . . .

Challenges

Name Nationality Main areas

exploredAbel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

Forest reserves

ID Name Area7 Shakespeare Hills 2236

9 Plains Creek 880

13 Welcome Swamp 168

. . . . . . . . .

<table>

<tr><th>…</th></tr>

…</table>

<table>

<tr><td>…</td></tr>

…</table>

5

Year Name Subject1902 Ronald Ross Medicine

1907 Rudyard Kipling Literature

. . . . . . . . .

The present list contains winners under the country/countries that are stated by the Nobel Prize committee on its website.Nobel Prize Winners

User Query

• Limited column specific information

Query has set of keywords. Web tables have headers.

• Designated HTML table header tag is not always used (80%).

• Many tables have no headers (18%).

• Header text is often uninformative.

• Context of a table can be helpful, but it does not give column specific information and it is often noisy.

Page 6: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

System Architecture of WWT

6

Page 7: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

The Column Mapping Task

Name Nationality Main areas

explored

Abel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

Exploration Who (explorer)

(Chronological order)

Sea route to India Vasco da Gama

Caribbean Christopher Columbus

Oceania Abel Tasman

. . . . . .

This article lists the explorations in history. For the documentary 'Explorations, powered by Duracell', see Explorations (TV)List of explorers - Wikipedia, the free encyclopedia

Forest reserves

ID Name Area

7 Shakespeare Hills 2236

9 Plains Creek 880

13 Welcome Swamp 168

. . . . . . . . .

Other Formal Reserves 1.3 Forest Reserves under the Forestry Act 1920All areas will be available for mineral exploration and mining

Name of Explorers Nationality Areas Explored

User Query

Web Table 3Web Table 2Web Table 1

Index Probe Relevant Tables

7

Page 8: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Name Nationality Main areas

explored

Abel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

User Query

The Column Mapping Task

Name of Explorers Nationality Areas Explored

Name of Explorers Nationality Areas Explored

Vasco da Gama Portuguese Sea route to India

Abel Tasman Dutch Oceania

Christopher Columbus Caribbean

Alexander Mackenzie British Canada

. . . . . . . . .

Answer Table

Map Columns

Consolidation

Name of Explorers Nationality Areas Explored

Exploration Who (explorer)

(Chronological order)

Sea route to India Vasco da Gama

Caribbean Christopher Columbus

Oceania Abel Tasman

. . . . . .

This article lists the explorations in history. For the documentary 'Explorations, powered by Duracell', see Explorations (TV)List of explorers - Wikipedia, the free encyclopedia

Forest reserves

ID Name Area

7 Shakespeare Hills 2236

9 Plains Creek 880

13 Welcome Swamp 168

. . . . . . . . .

Other Formal Reserves 1.3 Forest Reserves under the Forestry Act 1920All areas will be available for mineral exploration and mining

Web Table 3Web Table 2Web Table 1

8

Page 9: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Q1 Q2 Q3

• For each table t

• Step 1: Is t relevant?

IR_Sim(Q, C) + λ . IR_Sim(Q, h) > Threshold ?

• Step 2: If yes, map columns of t to columns in Q

A Baseline Approach

9

User Query, Q

h1 h2 h3 h4

Table, t

Context, CEdge Weight = IR_Sim(Qi , hj)

Page 10: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Limitations of Baseline Cannot match tables with poor/missing headers.

E.g.

Exploit content overlap with related tables How?

10

Name Nationality Main areas

exploredAbel Tasman Dutch Oceania

. . . . . . . . .

Page 11: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Graphical Model ApproachName Nationality Main areas

explored

Abel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

Exploration Who (explorer) Century

(Chronological order)

Sea route to India Vasco da Gama 15th/16th

Caribbean Christopher Columbus 15th/16th

Oceania Abel Tasman 17th

. . . . . . . . .

Forest reserves

ID Name Area

7 Shakespeare Hills 2236

9 Plains Creek 880

13 Welcome Swamp 168

. . . . . . . . .

N1 N2

N4 N5

N3

N7 N8 N9

Name of Explorers Nationality Areas Explored

User Query

• Create a node for every column

11

N6

Page 12: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Graphical Model ApproachName Nationality Main areas

explored

Abel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

Forest reserves

ID Name Area

7 Shakespeare Hills 2236

9 Plains Creek 880

13 Welcome Swamp 168

. . . . . . . . .

Name of Explorers Nationality Areas Explored

User Query

• Possible labels for every node

1. Name of explorers

2. Nationality

3. Areas Explored

12

N1 N2

N4 N5

N3

N7 N8 N9N6

Exploration Who (explorer) Century

(Chronological order)

Sea route to India Vasco da Gama 15th/16th

Caribbean Christopher Columbus 15th/16th

Oceania Abel Tasman 17th

. . . . . . . . .

1

1

2 3

3

Page 13: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Graphical Model ApproachName Nationality Main areas

explored

Abel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

Forest reserves

ID Name Area

7 Shakespeare Hills 2236

9 Plains Creek 880

13 Welcome Swamp 168

. . . . . . . . .

Name of Explorers Nationality Areas Explored

User Query

• Possible labels for every node

1. Name of explorers

2. Nationality

3. Areas Explored

4. NA (Not Assigned)

5. NR (Not Relevant)

13

N1 N2

N4 N5

N3

N7 N8 N9N6

Exploration Who (explorer) Century

(Chronological order)

Sea route to India Vasco da Gama 15th/16th

Caribbean Christopher Columbus 15th/16th

Oceania Abel Tasman 17th

. . . . . . . . .

1

1

2 3

3 NA NR NR NR

Page 14: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Graphical Model ApproachName Nationality Main areas

explored

Abel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

Exploration Who (explorer) Century

(Chronological order)

Sea route to India Vasco da Gama 15th/16th

Caribbean Christopher Columbus 15th/16th

Oceania Abel Tasman 17th

. . . . . . . . .

N1 N2

N4 N5

N3

Name of Explorers Nationality Areas Explored

User Query

• Edges Complete Bipartite Graph

between nodes of two tables Content overlap between

column contents and headers

Maximum Bipartite Matching

14

N6

Edge

Weights0.6

0.20.1

0.7

0

0

0

00.1

Page 15: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Graphical Model ApproachName Nationality Main areas

explored

Abel Tasman Dutch Oceania

Vasco da Gama Portuguese Sea route to India

Alexander Mackenzie British Canada

. . . . . . . . .

Exploration Who (explorer) Century

(Chronological order)

Sea route to India Vasco da Gama 15th/16th

Caribbean Christopher Columbus 15th/16th

Oceania Abel Tasman 17th

. . . . . . . . .

Forest reserves

ID Name Area

7 Shakespeare Hills 2236

9 Plains Creek 880

13 Welcome Swamp 168

. . . . . . . . .

N1 N2

N4 N5

N3

N7 N8 N9

Name of Explorers Nationality Areas Explored

User Query

• Edge Potentials Large weights Same label Soft Constraint

15

N6

0.6

0.7

0.3

0.4 0.1

Page 16: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Node Potentials

16

Score expressing the affinity of a table column ci to a query column Qj

Baseline approach: IR similarity between Qj and header of ci

Page 17: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Limitations of Baseline Similarity Generic IR similarity not a good fit for typical

roles of context + headers Context “topic” of a table Header label of a column

New model based on a two part segmentation of query words over context and header. 17

The present list contains laureates under the country/countries that are stated by the Nobel Prize committee on its website.

Year Winners Subject

1902 Ronald Ross Medicine

1907 Rudyard Kipling Literature

. . . . . . . . .

Nobel Prize WinnersUser Query

This article presents a comprehensive list of peaks in North America, highlighting some of the important features.

Mountain Peak Region Elevation

Mount McKinley Alaska 6194 m

Mount Logan Yukon 5956 m

. . . . . . . . .

Mountains in North AmericaUser Query

Page 18: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Limitations of Baseline Similarity Matches with other parts of table ignored

Frequent words in a column Multi-row headers

Split headers match to union of words Vs

Sub-headers match only to one header How to detect which of the two?

Take soft-max over matches over context, body, other headers of table on one part of the segmented query.

18

Band name Country GenreAarcon Germany Black MetalAct of God Russia Melodic BlackAdragard Italy Black Metal

. . . . . . . . .

Black metal bands

Name Nationality Main areas

exploredAbel Tasman Dutch Oceania

. . . . . . . . .

Exploration Who (explorer)

(Chronological order)Sea route to India Vasco da Gama

. . . . . .

User Query

Name of Explorers Nationality Areas ExploredUser Query

Page 19: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Segmented Similarity

Nobel Prize Winners

• Similarity score between a table column ci and a query column Qj

ci

19

Year

User Query

Qj

Page 20: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Segmented Similarity• Similarity score between a table column ci and a query column Qj

• Maximum soft-max score over matches of different segments of query with different parts of a table

Winners

Nobel Prize

.........................................................

.................... Winners ......................

.......... Nobel Prize ...........................Title

Context

Header Rows

User Query

20

Nobel Prize Winners Year

ci

Qj

Page 21: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Frequent Body ContentsHeader Text

Other Headers in the Same

Row

Other Header Rows in Same

ColumnContextTitle

Soft-max over matches over all sections

Segmented Similarity• Step 1: Segment query column keywords into two parts

• Step 2: Similarity scores between each part and different sections of table

• Step 3: Soft-max over sections of table where each part matches

User Query

21

Title

ContextNobel Prize Winners Year

CurrentHeader Row

Prefix Suffixci

Header Rows

Page 22: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Segmented Similarity• Step 1: Segment query column keywords into two parts

• Step 2: Similarity scores between each part and different sections of table

• Step 3: Soft-max over sections of table where each part matches

User Query

22

Title

Context

Header Rows

Nobel Prize Winners Year

Prefix Suffixci

Soft-max over matches over all sections

CurrentHeader Row

Page 23: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Hard Constraints• MUTEX Constraint

At most one column in a table can be mapped to a query column.

• ALL-IRR Constraint

If one column in a table is assigned a label NR, then all columns of table must be assigned NR.

• MUST-MATCH Constraint

Every relevant table must contain the first query column.

• MIN-MATCH Constraint

Every relevant table must contain at least m out of q query columns. m = 2 if q >= 2.

23

Page 24: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Labeling the Graphical Model

• Final goal: Jointly assign one of |Q|+2 labels to each column to

• maximize sum of node and edge potentials

• satisfy the hard constraints

• NP-Hard24

N1 N2

N4 N5

N3

N7 N8 N9N6

Page 25: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Inference Algorithms• Collective Inference: Table-Centric

Edge Potentials are used to modify node potentials Optimal table level inference

• Collective Inference: Edge-Centric Edge potentials are given central importance. Existing inference algorithms: Belief Propagation,

TRWS, MPLP, α-Expansion, etc. Modified α-Expansion works best in our case.

25

Page 26: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Experimental Setup• Workload

59 multi-column queries mostly collected from Amazon Mechanical Turk (AMT) service [Cafarella et al, 2009]

• Data source

25 million tables from a web crawl of 500 million pages

• Ground Truth

Manual labeling for 1906 web tables

26

Page 27: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Column Mapping Methods• Baseline

• NbrText

Baseline augmented with similarity scores from neighboring columns

• PMI [Cafarella et al, 2009]

Baseline augmented with corpus wide co-occurrence score of column contents and a label

• WWT

Our graphical model based approach with table-centric collective inference

27

Page 28: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Column Mapping Methods Comparison

28

• Overall error WWT: 30.3%, Baseline & PMI: 34.7%, NbrText: 34.2%.

BaselineB

asel

ine

Page 29: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Running Time

29

• Actual column mapping takes less than half a second.

• Time for table & index read can be improved with better machine configuration.

Page 30: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Summary• Presented a graphical model approach for answering

table queries

• A novel method to find similarity using two part query segmentation model

• Robust mechanism of exploiting content and header overlap across table columns

• Different algorithms for inferencing in graphical model

• 12% reduction in error relative to a baseline method

• Future Work Exploiting newer corpus wide co-occurrence statistics Alternative structured sources such as ontologies Enhance the search experience via faceted search and user

feedback. 30

Page 31: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Thank you.

31

Page 32: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Related Work• Query-By-Example Paradigm [2]

Extracting tables from lists on the web Web tables are not considered

• Halevy et al [3, 4] highlight the potential of web tables as a source of structured information

Collecting offline information like attribute columns, attribute associations, etc.

• OCTOPUS [1] Multiple user interactions are necessary PMI score for relevance ranking is not effective in our case.

• Schema Matching [5, 6] Managing complex alignment between the large number of

schema elements in two databases Web tables are noisy unlike database tables. 32

Page 33: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

References1. M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data

integration for the relational web. PVLDB, 2(1):1090–1101, 2009.

2. R. Gupta and S. Sarawagi. Answering table augmentation queries from unstructured lists on the web. PVLDB, 2(1):289–300, 2009.

3. M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. Webtables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.

4. M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu. Uncovering the relational web. In WebDB, 2008.

5. A. Doan and A. Y. Halevy. Semantic integration research in the database community: A brief survey. The AI Magazine, 26(1):83–94, 2005.

6. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001.

33

Page 34: Answering Table Queries on the Web using Column Keywords Rakesh Pimplikar IBM Research Sunita Sarawagi IIT Bombay 1.

Segmented Similarity

34

• Overall error reduction from 33.3% to 30.3%

• Reduction is more than 10% in 8 cases.