Collectively Representing Semi-Structured Data from the Web

Bhavana Dalvi, William W. Cohen and Jamie Callan

Language Technologies Institute

Carnegie Mellon University

Paper ID : 02

This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

Motivation

• Entities on the Web can be present in multiple datasets, e.g. HTML tables, text documents, etc.

• Traditional systems : an entity is represented as a sparse vector of the IDs of the documents in which it occurs.

• We propose a low-dimensional representation for such entities.

• It helps to efficiently perform different tasks with a small number of primitive operations : Semi-Supervised Learning (SSL), Set Expansion (SE), Automatic Set Instance Acquisition (ASIA)

Entities in HTML tables

TC-2 TC-3

Country   Sports
India     Hockey
UK        Cricket
USA       Tennis

Country   Capital City
India     Delhi
USA       Washington DC
Canada    Ottawa
France    Paris

[Figure: Entity-Column bipartite graph connecting entity nodes (USA, Hockey, Cricket, Tennis) to table-column nodes]
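The entity-column bipartite graph on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the table ids (`table_a`, `table_b`), the `table:column` naming of column nodes, and the function name are assumptions made for the example.

```python
from collections import defaultdict

# Toy tables taken from the slide; the first row of each table is the header.
# The table ids are hypothetical labels used only in this sketch.
tables = {
    "table_a": [["Country", "Sports"],
                ["India", "Hockey"], ["UK", "Cricket"], ["USA", "Tennis"]],
    "table_b": [["Country", "Capital City"],
                ["India", "Delhi"], ["USA", "Washington DC"],
                ["Canada", "Ottawa"], ["France", "Paris"]],
}

def build_entity_column_graph(tables):
    """Map each entity (cell value) to the set of table-column nodes it occurs in."""
    entity_to_columns = defaultdict(set)
    for table_id, rows in tables.items():
        for row in rows[1:]:  # skip the header row
            for col_index, cell in enumerate(row):
                entity_to_columns[cell].add(f"{table_id}:{col_index}")
    return dict(entity_to_columns)

graph = build_entity_column_graph(tables)
print(sorted(graph["India"]))  # the two "Country" columns
```

Entities that share many column nodes (here India and USA) are the ones expected to end up close together in the embedding.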

Entities in unstructured text

Example sentences :

• Countries such as India are developing rapidly in terms of infrastructure.

• Outdoor sports include Tennis and Cricket.

[Figure: "Such as" bipartite graph connecting entity nodes (Hockey, Cricket, Tennis) to "such as" concept nodes (Country, Location, Sports)]
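As a rough sketch of how (concept, entity) edges for the "such as" graph might be collected from text, the snippet below matches a single naive pattern. The regular expression, the function name, and the lack of concept normalization (e.g. "Countries" to "Country") are simplifying assumptions; the slides do not specify the extraction procedure.

```python
import re

# Naive pattern (an assumption): a one-word concept, the literal "such as",
# then one capitalized word taken to be the entity.
SUCH_AS = re.compile(r"\b(\w+) such as ([A-Z]\w*)")

def extract_suchas_edges(sentences):
    """Return (concept, entity) pairs found by the naive 'such as' pattern."""
    edges = []
    for sentence in sentences:
        for concept, entity in SUCH_AS.findall(sentence):
            edges.append((concept, entity))
    return edges

sentences = [
    "Countries such as India are developing rapidly in terms of infrastructure.",
    "Outdoor sports include Tennis and Cricket.",
]
edges = extract_suchas_edges(sentences)
print(edges)  # only the first sentence contains "such as"
```

The second sentence relies on a different lexical pattern ("include"), which this single-pattern sketch does not cover.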

Resultant Tri-partite Graph

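[Figure: tripartite graph in which entity nodes (Hockey, Cricket, Tennis) are linked both to "such as" concept nodes (Country, Location, Sports) and to table-column nodes, combining the "such as" and Entity-Column bipartite graphs]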

Encoding the graph : "Entity-Column" bipartite graph

[Figure: Entity-Column bipartite graph connecting entity nodes (Hockey, Cricket, Tennis) to table-column nodes]

Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

Entity    X1     X2
USA       0.43   0.66
India     0.41   0.69
Hockey    0.36   0.80
Cricket   0.35   0.82
Tennis    0.34   0.79

Entities with similar X1/X2 values should be ontologically similar; the values summarize tabular co-occurrence.
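A minimal sketch of a power-iteration embedding on an entity-column bipartite graph follows. It assumes the affinity between entities is A = F Fᵀ (F being the entity-by-column incidence matrix), applies the row-normalized update v ← D⁻¹ F (Fᵀ v) followed by L1 normalization, and stops after a fixed, small number of iterations. The toy graph, the parameter names (`num_dims`, `num_iters`, `seed`) and the fixed iteration count are assumptions for illustration, not the settings used in the cited PIC papers.

```python
import random

def pic_embedding(entity_to_columns, num_dims=2, num_iters=5, seed=0):
    """Embed entities with truncated power iteration; one random start per dimension.

    Assumes every entity occurs in at least one column (so its degree is > 0).
    """
    rng = random.Random(seed)
    entities = sorted(entity_to_columns)
    columns = sorted({c for cols in entity_to_columns.values() for c in cols})

    # Row sums of the implicit affinity A = F F^T, computed without building A.
    column_size = {c: 0 for c in columns}
    for cols in entity_to_columns.values():
        for c in cols:
            column_size[c] += 1
    degree = {e: sum(column_size[c] for c in entity_to_columns[e]) for e in entities}

    embedding = {e: [] for e in entities}
    for _dim in range(num_dims):
        v = {e: rng.random() for e in entities}
        total = sum(v.values())
        v = {e: x / total for e, x in v.items()}
        for _step in range(num_iters):
            # First hop: push entity mass to columns (F^T v).
            col_mass = {c: 0.0 for c in columns}
            for e in entities:
                for c in entity_to_columns[e]:
                    col_mass[c] += v[e]
            # Second hop: pull it back to entities and row-normalize (D^-1 F ...).
            new_v = {e: sum(col_mass[c] for c in entity_to_columns[e]) / degree[e]
                     for e in entities}
            norm = sum(abs(x) for x in new_v.values())
            v = {e: x / norm for e, x in new_v.items()}
        for e in entities:
            embedding[e].append(v[e])
    return embedding

# Hypothetical toy graph: two countries share one column, three sports share another.
toy_graph = {
    "USA": {"country_col"}, "India": {"country_col"},
    "Hockey": {"sports_col"}, "Cricket": {"sports_col"}, "Tennis": {"sports_col"},
}
emb = pic_embedding(toy_graph)
```

In this toy graph the two groups are disconnected, so entities in the same group receive identical coordinates; on a connected graph, stopping the iteration early leaves entities of the same cluster with close, rather than identical, values.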

Encoding the graph

[Figure: "Such as" bipartite graph connecting entity nodes (Hockey, Cricket, Tennis) to "such as" concept nodes (Country, Location, Sports)]

Low-dimensional embedding using bipartite Power Iteration Clustering (Lin & Cohen, ICML 2010/ECAI 2010)

Entity    Y1     Y2
USA       0.23   0.76
India     0.21   0.79
Hockey    0.66   0.35
Cricket   0.16   0.92
Tennis    0.14   0.89

Entities with similar Y1/Y2 values should be ontologically similar; the values summarize "such as pattern" co-occurrence.

Low-dimensional PIC3 embedding

• n * t entity-tableColumn bipartite graph → PIC → n * m PIC embedding (m << t)

• n * s entity-suchas bipartite graph → PIC → n * m PIC embedding (m << s)

• Concatenate the two embeddings → n * 2m PIC3 embedding

Entity    X1     X2     Y1     Y2
USA       0.43   0.66   0.23   0.76
India     0.41   0.69   0.21   0.79
Hockey    0.36   0.80   0.66   0.35
Cricket   0.35   0.82   0.16   0.92
Tennis    0.34   0.79   0.14   0.89
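The concatenation step can be illustrated with the toy values from the slides. The function name is hypothetical, and dropping entities that are missing from either embedding is an assumption of this sketch; the slides do not say how such entities are handled.

```python
# Toy embeddings copied from the slides: (X1, X2) from the entity-column graph,
# (Y1, Y2) from the "such as" graph.
table_embedding = {
    "USA": [0.43, 0.66], "India": [0.41, 0.69], "Hockey": [0.36, 0.80],
    "Cricket": [0.35, 0.82], "Tennis": [0.34, 0.79],
}
suchas_embedding = {
    "USA": [0.23, 0.76], "India": [0.21, 0.79], "Hockey": [0.66, 0.35],
    "Cricket": [0.16, 0.92], "Tennis": [0.14, 0.89],
}

def concatenate_embeddings(first, second):
    """Join two m-dimensional embeddings into one 2m-dimensional embedding.

    Assumption: entities absent from either input are skipped.
    """
    return {entity: first[entity] + second[entity]
            for entity in first if entity in second}

pic3 = concatenate_embeddings(table_embedding, suchas_embedding)
print(pic3["USA"])  # [0.43, 0.66, 0.23, 0.76]
```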

Using PIC3 Representation

• Semi-Supervised Learning : Given a few seed examples for each class, predict class-labels for unlabeled data-points.

• Set Expansion : Given a set of seed entities, find more entities similar to the seed entities.

• Automatic Set Instance Acquisition (ASIA) : Given a concept name, automatically find instances of that concept.

Quantitative Evaluation: Datasets

Dataset                               Toy_Apple   Delicious_Sports
#entities                             14,996      438
#table-columns                        156         925
#entity-table column edges            176,598     9,192
#suchas concepts                      2,348       1,649
#entity-suchas edges                  7,683       4,799
#general entity classes (NELL KB)     11          3
#entities in general classes          419         39
#hand-coded column types              31          30
#columns in labeled types             156         925

Link to dataset: http://rtw.ml.cmu.edu/wk/WebSets/wsdm_2012_online

Task : Semi-Supervised Learning

Training : PIC3 + Train SVM classifier

Testing : Predict using learnt SVM model

SSL using PIC3

Input : A few seed examples for each class label

Output : Class-labels for unlabeled data-points

PIC clusters similar entities together → better SVM classifier on unlabeled data (use of background data)

SSL Task - I

# dimensions : 2504 → 10

SSL Task - II

# dimensions : 2574 → 10

Task : Set Expansion

Training : PIC3

Testing : Centroid(entity set) + K-NN (centroid)

Set Expansion using PIC3

Input : A few seed entities, e.g. Football, Hockey, Tennis

Output : More entities of the same type as the seeds, e.g. Baseball, Badminton, Cricket, Golf, ...

K-NN operation is extremely efficient using KD-trees.

Query Times

• PIC3 preprocessing : 0.02 sec

• # SE queries = 881

• Precision-Recall Curve : K-NN + PIC3 consistently beats K-NN-Baseline. The Modified Adsorption (MAD) method is better on 2/5 query classes, at the expense of a larger query time.

Method           Total Query Time (s)
K-NN + PIC3      12.7
K-NN-Baseline    80.1
MAD              38.2

Modified Adsorption (MAD) : Graph-based label propagation algorithm
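The centroid-plus-K-NN recipe for Set Expansion can be sketched as below, using the toy PIC3 vectors from the earlier slide. The function names are hypothetical, and the nearest-neighbour search is a brute-force scan for brevity; at scale it would be replaced by a KD-tree lookup, as the slide notes.

```python
import math

# Toy 4-dimensional PIC3 vectors (X1, X2, Y1, Y2) copied from the slides.
pic3 = {
    "USA": [0.43, 0.66, 0.23, 0.76],
    "India": [0.41, 0.69, 0.21, 0.79],
    "Hockey": [0.36, 0.80, 0.66, 0.35],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis": [0.34, 0.79, 0.14, 0.89],
}

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def expand_set(seeds, embedding, k):
    """Return the k non-seed entities closest (Euclidean) to the seed centroid."""
    center = centroid([embedding[seed] for seed in seeds])
    candidates = [entity for entity in embedding if entity not in seeds]
    candidates.sort(key=lambda entity: math.dist(embedding[entity], center))
    return candidates[:k]

print(expand_set(["Cricket"], pic3, k=1))  # ['Tennis']
print(expand_set(["USA"], pic3, k=1))      # ['India']
```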

Task : Automatic Set Instance Acquisition

Training : PIC3 + Inverted index (suchasConcept → entities)

Testing : seeds = top-k-entities (lookup concept in index) + Set Expansion (seeds)

Automatic Set Instance Acquisition (ASIA) using PIC3

Input : Class label e.g. Country

Output : Entities belonging to the given class label e.g. India, China, USA, Canada, Japan …..

Previously described Set Expansion algorithm is used as a subroutine here.

Query Times

• PIC3 preprocessing : 0.02 sec

• # ASIA queries = 25

• Precision-Recall Curve : K-NN + PIC3 consistently beats K-NN-Baseline. The Modified Adsorption (MAD) method is better on 2/4 query classes, at the expense of a much larger query time.

Method           Total Query Time (s)
K-NN + PIC3      0.5
K-NN-Baseline    1.4
MAD              150.0
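The ASIA recipe (look up a concept in the inverted index, take the top-k entities as seeds, then run Set Expansion) can be sketched as follows. The index contents and their counts are made-up illustrative values, the function and parameter names are hypothetical, and returning the seeds together with the expanded entities is a choice of this sketch rather than something the slides specify.

```python
import math

# Toy PIC3 vectors copied from the slides.
pic3 = {
    "USA": [0.43, 0.66, 0.23, 0.76],
    "India": [0.41, 0.69, 0.21, 0.79],
    "Hockey": [0.36, 0.80, 0.66, 0.35],
    "Cricket": [0.35, 0.82, 0.16, 0.92],
    "Tennis": [0.34, 0.79, 0.14, 0.89],
}

# Hypothetical inverted index: concept -> {entity: "such as" co-occurrence count}.
suchas_index = {"Country": {"India": 3, "USA": 2}}

def expand_set(seeds, embedding, k):
    """Brute-force Set Expansion: k nearest non-seed entities to the seed centroid."""
    vectors = [embedding[seed] for seed in seeds]
    center = [sum(component) / len(vectors) for component in zip(*vectors)]
    candidates = [entity for entity in embedding if entity not in seeds]
    candidates.sort(key=lambda entity: math.dist(embedding[entity], center))
    return candidates[:k]

def asia(concept, index, embedding, num_seeds, k):
    """Find instances of a concept: index lookup for seeds, then Set Expansion."""
    ranked = sorted(index.get(concept, {}).items(),
                    key=lambda item: (-item[1], item[0]))  # highest count first
    seeds = [entity for entity, _count in ranked[:num_seeds] if entity in embedding]
    if not seeds:
        return []
    return seeds + expand_set(seeds, embedding, k)

print(asia("Country", suchas_index, pic3, num_seeds=1, k=1))  # ['India', 'USA']
```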

Conclusions & Future Work

• Presented a novel low-dimensional PIC3 representation for entities on the Web using Power Iteration Clustering (PIC).

• Simple primitive operations on PIC3 perform the following tasks : Semi-Supervised Learning, Set Expansion, Automatic Set Instance Acquisition

• Future work : Use the PIC3 representation for named entity disambiguation and unsupervised class-instance pair acquisition.

Thank You !!


Please visit our poster ID : 02

Examples : Set Expansion

Examples : ASIA

Set Expansion

ASIA Task
