Cost-Effective Conceptual Design over Taxonomies

Cost-Effective Conceptual Design over Taxonomies

Yodsawalai Chodpathumwan

University of Illinois at Urbana-Champaign

Ali Vakilian

Massachusetts Institute of Technology

Arash Termehchy, Amir Nayyeri

Oregon State University

Users have to query over unstructured dataset.

Medical articles, HTML pages, …

“John Adams, politician”

keyword query

ranked list

<article id=1>

<article id=2>

<article id=3>

Only Article id 1 is about a politician.

poor ranking quality!

precision = 1/3

Precision = #returned relevant answers#returned answers

<article id=1>

John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …

<article id=2>

John Adams is a composer whose music inspired by nature …

<article id=3>

John Adams is a public high school located on the east side of Cleveland, Ohio, …

Wikipedia article excerpts

Annotating a dataset helps answering the query.

We can annotate the dataset using concepts from a taxonomy.

thing

agent place

person organization populated place

artistpolitician athlete

DBpedia taxonomy

schoollegislature state city

<article id=1>

John Adams has been a former member of Ohio House of Representative

from 2007 to 2014 …

<article id=2>


<article id=3>



Taxonomy:* Tree-shaped graph* Vertex = concept* Edge = subclass relation

artistpolitician

city

school

state

legislature

Users can submit structured queries over annotated dataset.

Politician(“John Adams”)

Structured keyword query

ranked list

<article id=1>

<article id=1>


from 2007 to 2014 …

<article id=2>


<article id=3>



precision = 1/1 = 1

Perfect!

cityschool state

artistpolitician

legislature

Concept annotation is costly

Instances of concepts are annotated by a program called concept annotator.

It is costly to develop, execute, and maintain a concept annotator.

• Hand-tuned program rules – need experts, time-consuming

• Machine learning technique – lots of relevant features, thousands of rules

• Executing concepts annotator may take several days and require lots of computational resources

• Datasets evolve over time – rewrite and re-execute concept annotators

It is not possible to always annotate all concepts.

Ideally, we would like to annotate instances of all concepts in a given taxonomy from a dataset to answer all queries effectively.

Reality, we can only annotate instances of some concepts.

thing

agent place


artistpolitician athlete

DBpedia taxonomy

schoollegislature state city

<article id=1>

John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …

<article id=2>


<article id=3>



person

organization

person

Annotating dataset with only a subset of concepts from a taxonomy still helps.

Politician(“John Adams”)

Structured keyword query

ranked list

<article id=1>


from 2007 to 2014 …

<article id=2>


<article id=3>



organization

personperson

…

person organization

politician … …

<article id=1>

<article id=2>

precision = 1/2 > 1/3

Precision over unannotated dataset

Many taxonomies contain large number of concepts.

• Medical Subject Headings (MeSH), Plant Ontology, …

• An organization has limited amount of resources

• Annotate a dataset using only a subset of concepts from a given taxonomy: a conceptual design for the data

Which conceptual design to pick?

I can only annotate a few concepts

over this dataset.

dataset

Query

I want largest average precision over these

queries.

Find a cost-effective subset of concepts from an input taxonomy that maximizes the effectiveness of answering queries.

thing

agent place


artistpolitician athlete schoollegislature state city

Precision@k

Problem of Cost-Effective Conceptual Design (CECD)

Given a dataset, a query workload, a taxonomy, a fixed budget

We would like to select a conceptual design 𝑆 such that

• 𝐶∈𝑆𝑤(𝐶) ≤ 𝐵

• 𝑆 provides the largest precision@k of answering queries more than other designs that satisfy the budget constraint.

Fixed budget

Cost function

Let’s quantify the amount of improvement for precision@k: the Queriability of a design

Partitions of a conceptual design

Given a design 𝑆 over a taxonomy 𝑋,

the partition of a concept 𝑐 ∈ 𝑆 or 𝒑𝒂𝒓𝒕(𝒄) is a subset of leaf nodes in 𝑋such that, for every concept 𝑑 ∈ 𝑝𝑎𝑟𝑡(𝑐), the lowest ancestor of 𝑑 in 𝑺 is 𝑐 or 𝑑 = 𝑐.

Each leaf concept in 𝑋 belongs to at most one partition of a design 𝑆.

thing

agent place


artistpolitician athlete schoollegislature state city

𝑆 = {agent, person}

𝑝𝑎𝑟𝑡 person = {politician, athlete, artist}

𝑝𝑎𝑟𝑡 agent = {legislature, school}

𝑓𝑟𝑒𝑒 𝑆 = {state, city}

A set of leaf concepts that do not belong to any partition of 𝑆 is called 𝑓𝑟𝑒𝑒(𝑆).

Conceptual design 𝑺 helps answering queries whose concepts are in partitions of 𝑺.

dataset

person

politicianartist

Total improvement from design 𝑺 is 𝑷∈𝑺 𝒄∈𝒑𝒂𝒓𝒕(𝑷)𝒖 𝒄 𝒅 𝒄

𝒅(𝑷)

agent

organizationperson …

artist politician school … …

…

…

𝑑 𝑐 : frequency of documents of concept 𝑐

school(…)

politician(…)politician(…)

artist(…)

query workload

𝑢 𝑐 : popularity of concept 𝑐in query workload

Improvement is 𝑢(politician)𝑑 politician

𝑑 person

Portion of queries about “politician” is 𝑢 politician

Fraction of “politician” documents

amongst “person” is 𝑑 politician𝑑 person

Total improvement from partition of “person” is𝑢(politician)𝑑 politician

𝑑 person+𝑢(artist)𝑑 artist

𝑑 person+ ⋯ =

𝑐∈𝑝𝑎𝑟𝑡(person)

𝑢(𝑐)𝑑(𝑐)

𝑑(person)

𝐶 = “politician”, 𝑆 = {person,…}

The contribution of a design for queries whose concepts are not in any partition of the design.

Generally, the concepts with more instances in the dataset are more likely to appear in the top answers.

Thus, it is more likely they contain some relevant answers for the query.

The total improvement by concepts in 𝑓𝑟𝑒𝑒(𝑆) is

𝒄∈𝒇𝒓𝒆𝒆 𝑺

𝒖 𝒄 𝒅 𝒄

Portion of instances in the dataset that belong to 𝒄Portion of queries

whose concepts are 𝒄

dataset

person

organizationorganization

person

answers

relevant answers

organization

person

Formal definition of Cost-Effective Conceptual Design Problem

Given a taxonomy 𝑋, a dataset 𝐷, query workload 𝑄 and a budget 𝐵,

find a conceptual design 𝑆 over 𝑋 such that

𝑐∈𝑆

𝑤 𝑐 ≤ 𝐵

and 𝑆 maximizes the queriablity

𝑄𝑈 𝑆 =

𝑃∈𝑆

𝑐∈𝑝𝑎𝑟𝑡(𝑃)

𝑢 𝑐 𝑑 𝑐 𝑝𝑟(𝑃)

𝑑(𝑃)+

𝑐∈𝑓𝑟𝑒𝑒(𝑆)

𝑢 𝑐 𝑑(𝑐)

We have proposed an approximation algorithm called “Level-wise Algorithm” (LW)Find a design whose concepts are all from a same level of the input taxonomy

Infections

Skin-InfectionsEye-Infections Bone-Infections

Trachoma Hordeolum Ecthyma Erysipelas Periostitis Spondylitis

…

…

…

…

Find the design with maximum queriability for each level using APM algorithm [Termehchy, SIGMOD’14]

APM returns a design with largest queriability over a set of concepts.

𝑄𝑈({Infections,…})

𝑄𝑈({Eye−Infections,...})

𝑄𝑈({Trachoma,...})

𝑺𝒍𝒆𝒗𝒆𝒍 ← a design with 𝐦𝐚𝐱{𝑸𝑼,𝑸𝑼,𝑸𝑼,… }

𝑺𝒍𝒆𝒂𝒇 ← leaf concept with largest popularity (𝒖)

Return a design with 𝐦𝐚𝐱 𝑸𝑼 𝑺𝒍𝒆𝒗𝒆𝒍 , 𝑸𝑼 𝑺𝒍𝒆𝒂𝒇

Level-wise algorithm has a bounded approximation ratio over a special case of the CECD problem

• Sometimes it is easier to use and manage a conceptual design whose concepts are not subclass/superclass of each other.

• We call this design a disjoint design.

• May restrict the solution in the CECD problem to disjoint designs.

• We call this problem a disjoint CECD problem.

TheoremThe Level-wise algorithm is a 𝑂 log |𝐶| -approximation for the disjoint CECD problem.

Experiment Settings

• 8 extracted tree taxonomies from YAGO ontology, T1-T8• Number of concepts between 10 – 400 with height of 2 – 9

• 8 Datasets of articles from English Wikipedia Collection

• Bing (bing.com) query log whose relevant answers are Wikipedia article.

• Effectiveness metric: precision at 3 (𝑝@3)

• Two cost models: uniform cost and random cost

Validation of Queriability Function

• Oracle: enumerate all feasible designs and find design with maximum precision at 3.

• Queriability Maximization (QM): enumerate all feasible designs and find design with maximum queriability.

BT1 T2 T3

Oracle QM Oracle QM Oracle QM

Un

ifo

rm

0.10.20.30.4…

0.1490.1680.1770.192

0.1490.1680.1770.192

0.2410.3030.3180.320

0.2320.2850.3150.318

0.2220.2810.3040.306

0.2100.2690.3040.304

B=1 : enough budget to annotate all concepts

Results for random cost is similar to the results for uniform cost

Effectiveness of Level-Wise Algorithm

• Compare LW with APM.

BT1 T2 T3 T4 T5 T6 T7 T8

APM LW APM LW APM LW APM LW APM LW APM LW APM LW APM LW

Un

ifo

rm

0.10.20.30.4…

.089

.149

.164

.164

.103

.164

.164

.183

.234

.253

.292

.320

.232

.285

.316

.318

.208

.258

.288

.297

.210

.269

.297

.304

.158

.177

.191

.215

.179

.212

.231

.240

.178

.214

.228

.237

.206

.227

.242

.248

.229

.244

.247

.248

.240

.248

.248

.248

.243

.259

.260

.261

.254

.261

.261

.261

.250

.262

.262

.263

.259

.263

.263

.263


Results for random cost is similar to the results for uniform cost

Efficiency of Level-Wise Algorithm

Level-Wise algorithm• Took 2 seconds over T4 and T5

• Took 5 seconds over T6

• Took 6 seconds over T7 and T8

APM algorithm• Took 2 seconds over T4, T5 and T6



Conclusion

• We introduce the cost-effective conceptual design over taxonomies

• We propose an efficient approximation (LW) algorithm for the problem.

• Our empirical results show that LW is generally effective and scalable.

• For future work, we are working on variations of the problem including taxonomies that are directed acyclic graphs, multi-concept queries, and cost-dependency model.

Unused Slides

23

Running time

• APM algorithm is 𝑂( 𝐶 log 𝐶 )

• LW algorithm is 𝑂(ℎ 𝐶 log 𝐶 )Since we perform APM for each level in the taxonomy, in fact, for balanced tree, the running time of LW is

𝑂( 𝐶1 log 𝐶1 +⋯ 𝐶ℎ log 𝐶ℎ )

which is smaller than

𝑂 𝐶1 ∪⋯∪ 𝐶ℎ log 𝐶1 ∪⋯∪ 𝐶ℎ = 𝑂( 𝐶 log 𝐶 ).

24

How to quantify the queriability of a design?

• Effectiveness of returned answers is usually measured using precision (MRR, recall, ...).• Precision = #returned relevant answers / #returned answers

• If a conceptual design helps improve ranking quality, it should replace non-relevant answers with the relevant one.

• It also improves the precision (MRR, recall, …)

• Queriability – estimates the improve in query answering using the amount by which a design increases the fraction of relevant answers in the returned answers.

Level-Wise Algorighm

26

Conceptual design 𝑺 helps answering queries whose concepts are in partitions of 𝑺.

Given a query 𝐶(terms) such that 𝐶 belongs to the partition 𝑃 in the design 𝑆

Infections

Skin-InfectionsEye-Infections …

Trachoma Hordeolum Ecthyma Erysipelas …

dataset

Eye-Infections

Trachoma

𝐶 = “Trachoma”, 𝑆 = {Eye-Infections,…}

Let 𝑑(𝐶) be a frequency of documents with concept 𝐶

Let 𝑢(𝐶) be a popularity of queries with concept 𝐶

Fraction of documents about “Trachoma” within the set of documents annotated by “Eye-Infection” is 𝑑 Trachoma

𝑑 Eye−Infections.

Portion of query about “Trachoma” is 𝑢(Trachoma).

Hence, the improvement is 𝑢 Trachoma 𝑑 Trachoma𝑑 Eye−Infections

.

Total improvement from partition of “Eye-Infection” is𝑢 Trachoma 𝑑 Trachoma

𝑑 Eye−Infections+𝑢 Hordeolum 𝑑 Hordeolum

𝑑 Eye−Infections+⋯Hordeolum

For each partition, 𝑐∈𝑝𝑎𝑟𝑡(𝑃)𝑢(𝑐)𝑑(𝐶)

𝑑(𝑃).

Given 𝑆, the total improvement is 𝑃∈𝑆 𝑐∈𝑝𝑎𝑟𝑡(𝑃)𝑢 𝑐 𝑑 𝑐

𝑑(𝑃)

Validation of Queriability Function

• Oracle: enumerate all feasible designs and find design with maximum precision at 3.

• Queriability Maximization (QM): enumerate all feasible designs and find design with maximum queriability.

BT1 T2 T3

Oracle QM Oracle QM Oracle QM

Un

ifo

rm

0.10.20.30.4…

0.1490.1680.1770.192

0.1490.1680.1770.192

0.2410.3030.3180.320

0.2320.2850.3150.318

0.2220.2810.3040.306

0.2100.2690.3040.304

Ran

do

m

0.10.20.30.4…

0.1240.1630.1790.187

0.1240.1630.1770.183

0.2640.3200.3170.323

0.2620.2950.3160.318

0.2480.2880.3040.306

0.2390.2810.3040.306


Effectiveness of Level-Wise Algorithm

• Compare LW with APM.B

T1 T2 T3 T4 T5 T6 T7 T8

APM LW APM LW APM LW APM LW APM LW APM LW APM LW APM LW

Un

ifo

rm

0.10.20.30.4…

.089

.149

.164

.164

.103

.164

.164

.183

.234

.253

.292

.320

.232

.285

.316

.318

.208

.258

.288

.297

.210

.269

.297

.304

.158

.177

.191

.215

.179

.212

.231

.240

.178

.214

.228

.237

.206

.227

.242

.248

.229

.244

.247

.248

.240

.248

.248

.248

.243

.259

.260

.261

.254

.261

.261

.261

.250

.262

.262

.263

.259

.263

.263

.263

Ran

do

m

0.10.20.30.4…

.097

.111

.122

.164

.104

.145

.175

.185

.239

.263

.300

.321

.257

.291

.317

.320

.240

.263

.275

.294

.235

.283

.301

.305

.177

.180

.188

.212

.189

.212

.230

.239

.183

.216

.231

.239

.210

.230

.242

.248

.231

.245

.247

.248

.240

.248

.248

.248

.245

.259

.260

.261

.256

.261

.261

.261

.255

.262

.263

.263

.259

.263

.263

.263


<article id=1>

Granular conjunctivitis causes pain in the outer surface or cornea …

<article id=2>

Stye may lead to pain on the eyelids …

<article id=3>

GAS caused infections cause pain in tissues …

Cost-Effective Conceptual Design over Taxonomies

Documents

Transcript of Cost-Effective Conceptual Design over Taxonomies