Cost-Effective Conceptual Design over Taxonomies
Yodsawalai Chodpathumwan
University of Illinois at Urbana-Champaign
Ali Vakilian
Massachusetts Institute of Technology
Arash Termehchy, Amir Nayyeri
Oregon State University
Users have to query over unstructured datasets.
Medical articles, HTML pages, …
“John Adams, politician”
keyword query
ranked list
<article id=1>
<article id=2>
<article id=3>
Only Article id 1 is about a politician.
poor ranking quality!
precision = 1/3
Precision = #returned relevant answers / #returned answers
<article id=1>
John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located on the east side of Cleveland, Ohio, …
Wikipedia article excerpts
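The precision figure above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `precision` and the article identifiers are illustrative choices, not part of the original slides.

```python
def precision(returned, relevant):
    """Fraction of the returned answers that are relevant."""
    if not returned:
        return 0.0
    hits = sum(1 for answer in returned if answer in relevant)
    return hits / len(returned)


# Keyword query "John Adams, politician": three articles are returned,
# but only article 1 is about a politician.
returned = ["article1", "article2", "article3"]
relevant = {"article1"}
print(precision(returned, relevant))
```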
Annotating a dataset helps answer the query.
We can annotate the dataset using concepts from a taxonomy.
DBpedia taxonomy
thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city
<article id=1>
John Adams has been a former member of Ohio House of Representative
from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located on the east side of Cleveland, Ohio, …
Wikipedia article excerpts
Taxonomy:
• Tree-shaped graph
• Vertex = concept
• Edge = subclass relation
Concept annotations shown on the excerpts: artist, politician, city, school, state, legislature
Users can submit structured queries over the annotated dataset.
Politician(“John Adams”)
Structured keyword query
ranked list
<article id=1>
<article id=1>
John Adams has been a former member of Ohio House of Representative
from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located on the east side of Cleveland, Ohio, …
Wikipedia article excerpts
precision = 1/1 = 1
Perfect!
Concept annotations shown on the excerpts: city, school, state, artist, politician, legislature
Concept annotation is costly
Instances of concepts are annotated by a program called a concept annotator.
It is costly to develop, execute, and maintain a concept annotator.
• Hand-tuned program rules – need experts, time-consuming
• Machine learning technique – lots of relevant features, thousands of rules
• Executing a concept annotator may take several days and require lots of computational resources
• Datasets evolve over time – rewrite and re-execute concept annotators
It is not possible to always annotate all concepts.
Ideally, we would like to annotate instances of all concepts in a given taxonomy from a dataset to answer all queries effectively.
In reality, we can only annotate instances of some concepts.
DBpedia taxonomy
thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city
<article id=1>
John Adams has been a former member of Ohio House of Representative from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located on the east side of Cleveland, Ohio, …
Wikipedia article excerpts
Annotations in the figure: articles 1 and 2 are annotated as person, article 3 as organization
Annotating a dataset with only a subset of concepts from a taxonomy still helps.
Politician(“John Adams”)
Structured keyword query
ranked list
<article id=1>
John Adams has been a former member of Ohio House of Representative
from 2007 to 2014 …
<article id=2>
John Adams is a composer whose music inspired by nature …
<article id=3>
John Adams is a public high school located on the east side of Cleveland, Ohio, …
Wikipedia article excerpts
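Annotations in the figure: articles 1 and 2 are annotated as person, article 3 as organization
Taxonomy fragment in the figure: person (politician, …), organization (…)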
<article id=1>
<article id=2>
precision = 1/2 > 1/3 (the precision over the unannotated dataset)
Many taxonomies contain a large number of concepts.
• Medical Subject Headings (MeSH), Plant Ontology, …
• An organization has a limited amount of resources
• Annotating a dataset using only a subset of concepts from a given taxonomy gives a conceptual design for the data
Which conceptual design to pick?
“I can only annotate a few concepts over this dataset.”
“I want the largest average precision over these queries.”
Find a cost-effective subset of concepts from an input taxonomy that maximizes the effectiveness of answering queries.
thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city
Precision@k
Problem of Cost-Effective Conceptual Design (CECD)
Given a dataset, a query workload, a taxonomy, and a fixed budget $B$,
we would like to select a conceptual design $S$ such that
• $\sum_{C \in S} w(C) \le B$, where $w$ is the cost function and $B$ is the fixed budget
• $S$ provides the largest precision@k for answering the queries among all designs that satisfy the budget constraint.
Let’s quantify the amount of improvement for precision@k: the Queriability of a design
Partitions of a conceptual design
Given a design 𝑆 over a taxonomy 𝑋,
the partition of a concept 𝑐 ∈ 𝑆, written 𝒑𝒂𝒓𝒕(𝒄), is the set of leaf nodes 𝑑 in 𝑋 such that the lowest ancestor of 𝑑 in 𝑺 is 𝑐, or 𝑑 = 𝑐.
Each leaf concept in 𝑋 belongs to at most one partition of a design 𝑆.
thing
  agent
    person: artist, politician, athlete
    organization: school, legislature
  place
    populated place: state, city
𝑆 = {agent, person}
𝑝𝑎𝑟𝑡(person) = {politician, athlete, artist}
𝑝𝑎𝑟𝑡(agent) = {legislature, school}
𝑓𝑟𝑒𝑒(𝑆) = {state, city}
A set of leaf concepts that do not belong to any partition of 𝑆 is called 𝑓𝑟𝑒𝑒(𝑆).
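The partition and free-set definitions can be checked mechanically on the example taxonomy. The sketch below assumes the taxonomy is stored as a child-to-parent dictionary; the names `PARENT`, `leaves`, `lowest_in_design`, and `partitions` are hypothetical helpers introduced only for illustration.

```python
# Taxonomy from the slide, stored as child -> parent (the root has parent None).
PARENT = {
    "thing": None,
    "agent": "thing", "place": "thing",
    "person": "agent", "organization": "agent",
    "populated place": "place",
    "artist": "person", "politician": "person", "athlete": "person",
    "school": "organization", "legislature": "organization",
    "state": "populated place", "city": "populated place",
}


def leaves(parent):
    """Concepts that are not the parent of any other concept."""
    inner = set(parent.values())
    return [c for c in parent if c not in inner]


def lowest_in_design(concept, design, parent):
    """The concept itself if it is in the design, otherwise its lowest
    ancestor in the design, or None if no ancestor is in the design."""
    node = concept
    while node is not None:
        if node in design:
            return node
        node = parent[node]
    return None


def partitions(design, parent):
    """Return (part, free): part maps each design concept to its leaves,
    free is the set of leaves that belong to no partition."""
    part = {c: set() for c in design}
    free = set()
    for leaf in leaves(parent):
        owner = lowest_in_design(leaf, design, parent)
        if owner is None:
            free.add(leaf)
        else:
            part[owner].add(leaf)
    return part, free


part, free = partitions({"agent", "person"}, PARENT)
print(part["person"], part["agent"], free)
```

Because each leaf is assigned to exactly one owner or to the free set, the property that a leaf belongs to at most one partition holds by construction.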
Conceptual design 𝑺 helps answering queries whose concepts are in partitions of 𝑺.
Example: $C$ = “politician”, $S$ = {person, …}
• $d(c)$: frequency of documents of concept $c$ in the dataset
• $u(c)$: popularity of concept $c$ in the query workload, e.g. school(…), politician(…), politician(…), artist(…)
• Portion of queries about “politician” is $u(\text{politician})$
• Fraction of “politician” documents amongst “person” documents is $d(\text{politician}) / d(\text{person})$
• Improvement is $u(\text{politician}) \cdot d(\text{politician}) / d(\text{person})$
Total improvement from the partition of “person” is
$$\frac{u(\text{politician})\, d(\text{politician})}{d(\text{person})} + \frac{u(\text{artist})\, d(\text{artist})}{d(\text{person})} + \cdots = \sum_{c \in part(\text{person})} \frac{u(c)\, d(c)}{d(\text{person})}$$
Total improvement from design $S$ is
$$\sum_{P \in S} \sum_{c \in part(P)} \frac{u(c)\, d(c)}{d(P)}$$
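As a worked example of the per-partition sum, the sketch below plugs in made-up numbers. The document counts and query popularities are hypothetical and chosen only so the arithmetic is easy to follow; `partition_improvement` is an illustrative helper name.

```python
def partition_improvement(part, u, d, design_concept):
    """Sum of u(c) * d(c) / d(P) over the leaves c in part(P)."""
    return sum(u.get(c, 0.0) * d[c] / d[design_concept] for c in part)


# Hypothetical document counts; only the ratio d(c) / d(person) matters here.
d = {"person": 40, "politician": 20, "artist": 10, "athlete": 10}
# Hypothetical share of queries per concept (athlete is never queried).
u = {"politician": 0.5, "artist": 0.25}

improvement = partition_improvement(
    ["politician", "artist", "athlete"], u, d, "person"
)
print(improvement)  # 0.5 * 20/40 + 0.25 * 10/40 + 0 = 0.3125
```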
The contribution of a design for queries whose concepts are not in any partition of the design.
Generally, the concepts with more instances in the dataset are more likely to appear in the top answers.
Thus, it is more likely they contain some relevant answers for the query.
The total improvement by concepts in $free(S)$ is
$$\sum_{c \in free(S)} u(c)\, d(c)$$
• $u(c)$: portion of queries whose concepts are $c$
• $d(c)$: portion of instances in the dataset that belong to $c$
[Figure: answers and relevant answers over a dataset annotated with person and organization]
Formal definition of Cost-Effective Conceptual Design Problem
Given a taxonomy $X$, a dataset $D$, a query workload $Q$, and a budget $B$,
find a conceptual design $S$ over $X$ such that
$$\sum_{c \in S} w(c) \le B$$
and $S$ maximizes the queriability
$$QU(S) = \sum_{P \in S} \sum_{c \in part(P)} \frac{u(c)\, d(c)\, pr(P)}{d(P)} + \sum_{c \in free(S)} u(c)\, d(c)$$
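The queriability formula can be evaluated directly once $u$, $d$, and the taxonomy are known. The sketch below is one possible reading of the formula, not the authors' implementation. All numbers are hypothetical, and since $pr(P)$ is not defined on these slides it is treated as an optional per-concept factor that defaults to 1.

```python
# DBpedia fragment from the slides, stored as child -> parent.
PARENT = {
    "thing": None,
    "agent": "thing", "place": "thing",
    "person": "agent", "organization": "agent",
    "populated place": "place",
    "artist": "person", "politician": "person", "athlete": "person",
    "school": "organization", "legislature": "organization",
    "state": "populated place", "city": "populated place",
}


def leaves(parent):
    inner = set(parent.values())
    return [c for c in parent if c not in inner]


def owner(concept, design, parent):
    """Lowest concept in the design on the path from `concept` to the root."""
    while concept is not None and concept not in design:
        concept = parent[concept]
    return concept


def queriability(design, parent, u, d, pr=None):
    """QU(S): partition terms plus the free(S) term."""
    pr = pr or {}  # assumption: pr(P) = 1 when no value is supplied
    total = 0.0
    for leaf in leaves(parent):
        gain = u.get(leaf, 0.0) * d[leaf]
        p = owner(leaf, design, parent)
        if p is not None:  # leaf lies in part(p)
            gain *= pr.get(p, 1.0) / d[p]
        total += gain      # otherwise the leaf is in free(S)
    return total


# Hypothetical document fractions (inner concepts sum their leaves).
d = {
    "artist": 0.125, "politician": 0.25, "athlete": 0.125,
    "school": 0.125, "legislature": 0.125, "state": 0.125, "city": 0.125,
    "person": 0.5, "organization": 0.25, "agent": 0.75,
    "populated place": 0.25, "place": 0.25, "thing": 1.0,
}
# Hypothetical query popularities.
u = {"politician": 0.5, "artist": 0.25, "school": 0.25}

qu_empty = queriability(set(), PARENT, u, d)
qu_design = queriability({"agent", "person"}, PARENT, u, d)
print(qu_empty, qu_design)
```

With these numbers the design {agent, person} scores higher than the empty design, which matches the intuition that annotating even coarse concepts helps.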
We have proposed an approximation algorithm called the “Level-wise Algorithm” (LW).
Find a design whose concepts are all from the same level of the input taxonomy.
Infections
  Eye-Infections: Trachoma, Hordeolum, …
  Skin-Infections: Ecthyma, Erysipelas, …
  Bone-Infections: Periostitis, Spondylitis, …
  …
Find the design with maximum queriability for each level using the APM algorithm [Termehchy, SIGMOD’14].
APM returns a design with the largest queriability over a set of concepts.
One design per level: $QU(\{\text{Infections}, \ldots\})$, $QU(\{\text{Eye-Infections}, \ldots\})$, $QU(\{\text{Trachoma}, \ldots\})$, …
$S_{level}$ ← the design with $\max\{QU, QU, QU, \ldots\}$ over the levels
$S_{leaf}$ ← the leaf concept with the largest popularity ($u$)
Return the design with $\max\{QU(S_{level}), QU(S_{leaf})\}$
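The level-wise procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the per-level search uses brute-force enumeration (`best_subset`) as a stand-in for APM, $pr(P)$ is taken to be 1, the leaf candidate is restricted to affordable leaves, and the taxonomy fragment, costs, and numbers are hypothetical.

```python
from itertools import combinations

PARENT = {  # hypothetical fragment of the taxonomy on the slide
    "Infections": None,
    "Eye-Infections": "Infections", "Skin-Infections": "Infections",
    "Trachoma": "Eye-Infections", "Hordeolum": "Eye-Infections",
    "Ecthyma": "Skin-Infections", "Erysipelas": "Skin-Infections",
}


def depth(c, parent):
    n = 0
    while parent[c] is not None:
        c, n = parent[c], n + 1
    return n


def leaves(parent):
    inner = set(parent.values())
    return [c for c in parent if c not in inner]


def owner(c, design, parent):
    while c is not None and c not in design:
        c = parent[c]
    return c


def queriability(design, parent, u, d):
    total = 0.0
    for leaf in leaves(parent):
        p = owner(leaf, design, parent)
        share = 1.0 if p is None else 1.0 / d[p]  # free(S) term vs. part(p) term
        total += u.get(leaf, 0.0) * d[leaf] * share
    return total


def best_subset(concepts, budget, cost, score):
    """Brute-force stand-in for APM: best feasible subset of `concepts`."""
    best, best_score = frozenset(), score(frozenset())
    for r in range(1, len(concepts) + 1):
        for combo in combinations(concepts, r):
            s = frozenset(combo)
            if sum(cost[c] for c in s) > budget:
                continue
            value = score(s)
            if value > best_score:
                best, best_score = s, value
    return best


def level_wise(parent, u, d, cost, budget):
    score = lambda s: queriability(s, parent, u, d)
    levels = {}
    for c in parent:
        levels.setdefault(depth(c, parent), []).append(c)
    s_level = max((best_subset(cs, budget, cost, score) for cs in levels.values()),
                  key=score)
    # Assumption: only leaves that fit the budget are considered for S_leaf.
    affordable = [c for c in leaves(parent) if cost[c] <= budget]
    s_leaf = (frozenset([max(affordable, key=lambda c: u.get(c, 0.0))])
              if affordable else frozenset())
    return max([s_level, s_leaf], key=score)


d = {"Trachoma": 0.25, "Hordeolum": 0.25, "Ecthyma": 0.25, "Erysipelas": 0.25,
     "Eye-Infections": 0.5, "Skin-Infections": 0.5, "Infections": 1.0}
u = {"Trachoma": 0.5, "Hordeolum": 0.25, "Ecthyma": 0.25}
cost = {c: 1 for c in PARENT}  # uniform cost model

design = level_wise(PARENT, u, d, cost, budget=2)
print(sorted(design), queriability(design, PARENT, u, d))
```

The brute-force helper is exponential in the size of a level, so it only serves to make the control flow concrete on a toy taxonomy.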
Level-wise algorithm has a bounded approximation ratio over a special case of the CECD problem
• Sometimes it is easier to use and manage a conceptual design whose concepts are not subclass/superclass of each other.
• We call this design a disjoint design.
• We may restrict the solutions of the CECD problem to disjoint designs.
• We call this problem the disjoint CECD problem.
Theorem: The Level-wise algorithm is an $O(\log |C|)$-approximation for the disjoint CECD problem.
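Whether a design is disjoint can be tested by walking from each concept towards the root. The sketch below assumes a child-to-parent dictionary for the taxonomy; `is_disjoint` is a hypothetical helper name.

```python
PARENT = {
    "thing": None,
    "agent": "thing", "place": "thing",
    "person": "agent", "organization": "agent",
    "politician": "person", "school": "organization",
}


def is_disjoint(design, parent):
    """True if no concept in the design is an ancestor of another one."""
    for concept in design:
        node = parent[concept]
        while node is not None:
            if node in design:
                return False
            node = parent[node]
    return True


print(is_disjoint({"agent", "person"}, PARENT))         # person is a subclass of agent
print(is_disjoint({"person", "organization"}, PARENT))  # siblings
```

Any design whose concepts all come from one level of the taxonomy is disjoint, which is why the designs considered by the Level-wise algorithm satisfy the restriction.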
Experiment Settings
• 8 tree taxonomies extracted from the YAGO ontology, T1-T8
• Number of concepts between 10 and 400, with heights of 2 to 9
• 8 datasets of articles from the English Wikipedia collection
• Bing (bing.com) query log whose relevant answers are Wikipedia articles
• Effectiveness metric: precision at 3 (𝑝@3)
• Two cost models: uniform cost and random cost
Validation of Queriability Function
• Oracle: enumerate all feasible designs and find design with maximum precision at 3.
• Queriability Maximization (QM): enumerate all feasible designs and find design with maximum queriability.
Uniform cost
B      T1              T2              T3
       Oracle  QM      Oracle  QM      Oracle  QM
0.1    0.149   0.149   0.241   0.232   0.222   0.210
0.2    0.168   0.168   0.303   0.285   0.281   0.269
0.3    0.177   0.177   0.318   0.315   0.304   0.304
0.4    0.192   0.192   0.320   0.318   0.306   0.304
…
B = 1: enough budget to annotate all concepts
Results for random cost are similar to the results for uniform cost.
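Both Oracle and QM rely on the same exhaustive search and differ only in the objective they maximize. A minimal sketch of that enumeration is shown below. The concept names, costs, and the toy `gain` score are hypothetical, and the budget is expressed in the same units as the costs.

```python
from itertools import combinations


def feasible_designs(concepts, cost, budget):
    """Yield every subset of concepts whose total cost fits the budget."""
    concepts = list(concepts)
    for r in range(len(concepts) + 1):
        for combo in combinations(concepts, r):
            if sum(cost[c] for c in combo) <= budget:
                yield frozenset(combo)


def best_design(concepts, cost, budget, score):
    """Feasible design with the largest score (p@3 for Oracle, QU for QM)."""
    return max(feasible_designs(concepts, cost, budget), key=score)


concepts = ["person", "organization", "place"]
cost = {c: 1 for c in concepts}  # uniform cost model
gain = {"person": 0.5, "organization": 0.3, "place": 0.1}  # toy objective


def toy_score(design):
    return sum(gain[c] for c in design)


designs = list(feasible_designs(concepts, cost, budget=2))
print(len(designs), sorted(best_design(concepts, cost, 2, toy_score)))
```

The number of subsets grows exponentially with the number of concepts, so this kind of search is only practical on small taxonomies.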
Effectiveness of Level-Wise Algorithm
• Compare LW with APM.
Uniform cost
B      T1            T2            T3            T4            T5            T6            T7            T8
       APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW
0.1    .089   .103   .234   .232   .208   .210   .158   .179   .178   .206   .229   .240   .243   .254   .250   .259
0.2    .149   .164   .253   .285   .258   .269   .177   .212   .214   .227   .244   .248   .259   .261   .262   .263
0.3    .164   .164   .292   .316   .288   .297   .191   .231   .228   .242   .247   .248   .260   .261   .262   .263
0.4    .164   .183   .320   .318   .297   .304   .215   .240   .237   .248   .248   .248   .261   .261   .263   .263
…
B = 1: enough budget to annotate all concepts
Results for random cost are similar to the results for uniform cost.
Efficiency of Level-Wise Algorithm
Level-Wise algorithm
• Took 2 seconds over T4 and T5
• Took 5 seconds over T6
• Took 6 seconds over T7 and T8
APM algorithm
• Took 2 seconds over T4, T5 and T6
• Took 13 seconds over T7
• Took 40 seconds over T8
Conclusion
• We introduce the problem of cost-effective conceptual design over taxonomies.
• We propose an efficient approximation algorithm (LW) for the problem.
• Our empirical results show that LW is generally effective and scalable.
• For future work, we are working on variations of the problem, including taxonomies that are directed acyclic graphs, multi-concept queries, and cost-dependency models.
Unused Slides
Running time
• APM algorithm is $O(|C| \log |C|)$
• LW algorithm is $O(h\,|C| \log |C|)$
Since we perform APM for each level in the taxonomy, in fact, for balanced tree, the running time of LW is
$O(|C_1| \log |C_1| + \cdots + |C_h| \log |C_h|)$
which is smaller than
$O(|C_1 \cup \cdots \cup C_h| \log |C_1 \cup \cdots \cup C_h|) = O(|C| \log |C|)$.
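The comparison between the per-level sum and the bound for the whole taxonomy can be checked numerically. The level sizes below are hypothetical, and the sketch reads $C_i$ as the set of concepts at level $i$, which is an interpretation of the slide rather than something it states explicitly.

```python
import math


def n_log_n(n):
    """n * log2(n), with the convention that it is 0 for n <= 1."""
    return n * math.log2(n) if n > 1 else 0.0


level_sizes = [1, 3, 9, 27]  # hypothetical balanced tree with branching factor 3

per_level = sum(n_log_n(n) for n in level_sizes)  # sum of |C_i| log |C_i|
whole = n_log_n(sum(level_sizes))                 # |C| log |C|
print(per_level, whole)
```

The inequality holds in general because each $\log |C_i|$ is at most $\log |C|$ and the level sizes sum to $|C|$.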
How to quantify the queriability of a design?
• Effectiveness of returned answers is usually measured using precision (MRR, recall, ...).
• Precision = #returned relevant answers / #returned answers
• If a conceptual design helps improve ranking quality, it should replace non-relevant answers with relevant ones.
• It also improves the precision (MRR, recall, …)
• Queriability estimates the improvement in query answering by the amount by which a design increases the fraction of relevant answers among the returned answers.
Level-Wise Algorithm
Conceptual design 𝑺 helps answering queries whose concepts are in partitions of 𝑺.
Given a query $C$(terms) such that $C$ belongs to the partition of $P$ in the design $S$
Example: $C$ = “Trachoma”, $S$ = {Eye-Infections, …}
Infections
  Eye-Infections: Trachoma, Hordeolum, …
  Skin-Infections: Ecthyma, Erysipelas, …
  …
Let $d(C)$ be the frequency of documents with concept $C$
Let $u(C)$ be the popularity of queries with concept $C$
Fraction of documents about “Trachoma” within the set of documents annotated by “Eye-Infections” is $d(\text{Trachoma}) / d(\text{Eye-Infections})$.
Portion of queries about “Trachoma” is $u(\text{Trachoma})$.
Hence, the improvement is $u(\text{Trachoma}) \cdot d(\text{Trachoma}) / d(\text{Eye-Infections})$.
Total improvement from the partition of “Eye-Infections” is
$$\frac{u(\text{Trachoma})\, d(\text{Trachoma})}{d(\text{Eye-Infections})} + \frac{u(\text{Hordeolum})\, d(\text{Hordeolum})}{d(\text{Eye-Infections})} + \cdots$$
For each partition, the improvement is
$$\sum_{c \in part(P)} \frac{u(c)\, d(c)}{d(P)}$$
Given $S$, the total improvement is
$$\sum_{P \in S} \sum_{c \in part(P)} \frac{u(c)\, d(c)}{d(P)}$$
Validation of Queriability Function
• Oracle: enumerate all feasible designs and find design with maximum precision at 3.
• Queriability Maximization (QM): enumerate all feasible designs and find design with maximum queriability.
Uniform cost
B      T1              T2              T3
       Oracle  QM      Oracle  QM      Oracle  QM
0.1    0.149   0.149   0.241   0.232   0.222   0.210
0.2    0.168   0.168   0.303   0.285   0.281   0.269
0.3    0.177   0.177   0.318   0.315   0.304   0.304
0.4    0.192   0.192   0.320   0.318   0.306   0.304
…
Random cost
B      T1              T2              T3
       Oracle  QM      Oracle  QM      Oracle  QM
0.1    0.124   0.124   0.264   0.262   0.248   0.239
0.2    0.163   0.163   0.320   0.295   0.288   0.281
0.3    0.179   0.177   0.317   0.316   0.304   0.304
0.4    0.187   0.183   0.323   0.318   0.306   0.306
…
B = 1: enough budget to annotate all concepts
Effectiveness of Level-Wise Algorithm
• Compare LW with APM.
Uniform cost
B      T1            T2            T3            T4            T5            T6            T7            T8
       APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW
0.1    .089   .103   .234   .232   .208   .210   .158   .179   .178   .206   .229   .240   .243   .254   .250   .259
0.2    .149   .164   .253   .285   .258   .269   .177   .212   .214   .227   .244   .248   .259   .261   .262   .263
0.3    .164   .164   .292   .316   .288   .297   .191   .231   .228   .242   .247   .248   .260   .261   .262   .263
0.4    .164   .183   .320   .318   .297   .304   .215   .240   .237   .248   .248   .248   .261   .261   .263   .263
…
Random cost
B      T1            T2            T3            T4            T5            T6            T7            T8
       APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW     APM    LW
0.1    .097   .104   .239   .257   .240   .235   .177   .189   .183   .210   .231   .240   .245   .256   .255   .259
0.2    .111   .145   .263   .291   .263   .283   .180   .212   .216   .230   .245   .248   .259   .261   .262   .263
0.3    .122   .175   .300   .317   .275   .301   .188   .230   .231   .242   .247   .248   .260   .261   .263   .263
0.4    .164   .185   .321   .320   .294   .305   .212   .239   .239   .248   .248   .248   .261   .261   .263   .263
…
B = 1: enough budget to annotate all concepts
<article id=1>
Granular conjunctivitis causes pain in the outer surface or cornea …
<article id=2>
Stye may lead to pain on the eyelids …
<article id=3>
GAS caused infections cause pain in tissues …