I256: Applied Natural Language Processing. Marti Hearst. Oct 2, 2006.
Nearly-Automated Metadata Hierarchy Creation
Emilia Stoica and Marti Hearst
SIMS, University of California, Berkeley
Motivation
Want to assign items labels from multiple hierarchies
Motivation
Description: 19th c. paint horse; saddle and hackamore; spurs; bandana on rider; old time cowboy hat; underchin thong; flying off.
Assigned labels, one path per hierarchy:
Nature => Animal => Mammal => Horse
Occupations => Cowboy
Clothing => Hats => Cowboy Hat
Media => Engraving => Wood Eng.
Location => North America => America
Use in Browsing Interfaces like Flamenco
How to Obtain the Hierarchies?
Goal: Help an information architect get started
Currently they do it all by hand!
Assume they will do some editing
Nearly automated
Multiple hierarchies (facets)
Automatically assign items to multiple hierarchies
Related Work
Automated text categorization
LOTS of work on this
Assumes that a set of categories is already created
To be intuitive, a categorization should contain sets of IS-A relations (hierarchical)
Rosenfeld and Morville (2002); Pratt, Hearst, and Fagan (1999)
Current automated approaches contain only associative relations
Examples of Associative Relations
Hofmann 1999
Collection: Machine learning abstracts
Top-level categories: learn, paper, base, model, new, train
Problem: These are not intuitive categories for machine learning
Sanderson and Croft 1999
Collection: Medical texts
Top-level categories: disease, post polio, serious disease, dengue, infection control, immunology, …
Problem: These are at different levels of generality
Examples of Associative Relations
Schuetze 1993
Collection: Arts descriptions
Sample groupings:
carriage cart horse ride walk passing horseback wagon men chicken rider
bald balding head facing hand faced arm hat haired glove long
Problem: Terms are associated with one another, but are not organized into hierarchies that can be navigated.
Our Approach
Leverage the structure of WordNet
[Pipeline diagram: Documents -> Select terms -> Get hypernym paths (from WordNet) -> Build tree -> Compress tree]
1. Select Terms
Select well-distributed terms from the collection
Example terms: red, blue
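The slide does not say how "well distributed" is measured. As a minimal sketch, assuming it means that a term's document frequency falls inside a band (the thresholds `min_frac` and `max_frac` and the whitespace tokenizer are illustrative assumptions, not part of the talk), term selection might look like this:

```python
from collections import Counter

def select_terms(docs, min_frac=0.01, max_frac=0.5):
    """Keep terms whose document frequency lies in [min_frac, max_frac].

    The band criterion is an assumption; the talk only says the
    selected terms should be well distributed in the collection.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs:
        # Count each term at most once per document.
        doc_freq.update(set(text.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_frac <= df / n_docs <= max_frac
    )

docs = [
    "red barn with a horse",
    "blue sky over a red barn",
    "a blue boat",
    "a horse and a rider",
]
print(select_terms(docs, min_frac=0.5, max_frac=0.75))
```

Terms that occur in every document (here "a") and terms that occur only once are both dropped by the band.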
2. Get Hypernym Path
Get the hypernym path for each term
red: abstraction => property => visual property => color => chromatic color => red, redness
blue: abstraction => property => visual property => color => chromatic color => blue, blueness
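A hypernym path can be read off by following parent links up to the root. The sketch below uses a small hand-written child-to-parent table copied from the color example; a real system would query WordNet instead, and WordNet synsets can have more than one hypernym, which this toy table ignores.

```python
# Toy hypernym links (child -> parent) taken from the slide's example.
# This table is a stand-in for WordNet lookups, not WordNet itself.
HYPERNYM = {
    "red, redness": "chromatic color",
    "blue, blueness": "chromatic color",
    "chromatic color": "color",
    "color": "visual property",
    "visual property": "property",
    "property": "abstraction",
}

def hypernym_path(synset):
    """Return the path from the root down to `synset`."""
    path = [synset]
    # Climb until we reach a node with no recorded parent (the root).
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return list(reversed(path))

print(" => ".join(hypernym_path("red, redness")))
```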
3. Build Tree
Merge hypernym paths to build a tree
Paths for the terms red and blue:
abstraction => property => visual property => color => chromatic color => red, redness
abstraction => property => visual property => color => chromatic color => blue, blueness
Merged tree:
abstraction
  property
    visual property
      color
        chromatic color
          red, redness
          blue, blueness
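Merging paths amounts to inserting each root-to-leaf path into a shared prefix tree. A minimal sketch, assuming the tree is stored as nested dictionaries keyed by node name (the representation and the helper names are illustrative choices, not from the talk):

```python
def build_tree(paths):
    """Merge root-to-leaf paths into a nested dict keyed by node name."""
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            # Reuse the node if a previous path already created it.
            node = node.setdefault(name, {})
    return tree

def render(tree, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines

prefix = ["abstraction", "property", "visual property", "color", "chromatic color"]
paths = [prefix + ["red, redness"], prefix + ["blue, blueness"]]
tree = build_tree(paths)
print("\n".join(render(tree)))
```

Shared prefixes collapse automatically, so the two color paths branch only at "chromatic color".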
4. Compress Tree
Eliminate a parent with fewer than n children, unless it is the root or its distribution is larger than 0.1*maxdist
Before:
color
  chromatic color
    red, redness
      red
    blue, blueness
      blue
    green, greenness
      green
After:
color
  chromatic color
    red
    blue
    green
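The first compression rule can be sketched as a bottom-up pass over a nested-dict tree. Several details here are assumptions: "distribution" is taken to be a per-node item count, the counts in `dist` are invented for illustration, leaves are never treated as parents, and the children of an eliminated node are attached to its grandparent.

```python
def prune_small_parents(name, children, n, dist, max_dist, is_root=True):
    """Apply the rule bottom-up.

    Returns the (name, subtree) pairs that should replace this node
    in its parent: either the node itself or its surviving children.
    """
    kept = {}
    for child, grandchildren in children.items():
        kept.update(prune_small_parents(child, grandchildren, n, dist, max_dist, False))
    is_parent = bool(children)
    too_small = is_parent and len(kept) < n
    protected = is_root or dist.get(name, 0) > 0.1 * max_dist
    if too_small and not protected:
        return list(kept.items())  # splice the children into the grandparent
    return [(name, kept)]

tree = {
    "chromatic color": {
        "red, redness": {"red": {}},
        "blue, blueness": {"blue": {}},
        "green, greenness": {"green": {}},
    }
}
# Invented item counts, used only to exercise the 0.1 * maxdist test.
dist = {"color": 100, "chromatic color": 100,
        "red, redness": 5, "blue, blueness": 4, "green, greenness": 3}
compressed = dict(prune_small_parents("color", tree, n=2, dist=dist,
                                      max_dist=max(dist.values())))
print(compressed)
```

With n = 2, each single-child synset node such as "red, redness" is removed and its term moves up under "chromatic color", as in the slide's before and after trees.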
4. Compress Tree (cont.)
Eliminate a child whose name appears within parent’s
Before:
color
  chromatic color
    red
    blue
    green
After:
color
  red
  blue
  green
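The second compression rule can be sketched in the same nested-dict style. The name test below is an assumption: it compares whole words and accepts containment in either direction, so that "chromatic color" under "color" is removed as in the slide's example. The children of a removed node are attached to its parent.

```python
def words(name):
    """Split a node name such as 'red, redness' into a set of words."""
    return set(name.replace(",", " ").split())

def shares_name(child, parent):
    """True if all words of one name occur in the other (assumed test)."""
    c, p = words(child), words(parent)
    return c <= p or p <= c

def drop_redundant_children(name, children):
    """Return the compressed children of the node called `name`."""
    new_children = {}
    for child, grandchildren in children.items():
        sub = drop_redundant_children(child, grandchildren)
        if shares_name(child, name):
            new_children.update(sub)   # remove the child, keep its subtree
        else:
            new_children[child] = sub
    return new_children

before = {"chromatic color": {"red": {}, "blue": {}, "green": {}}}
after = drop_redundant_children("color", before)
print(after)
```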
Disambiguation
Ambiguity in:
Word senses
Paths up the hypernym tree
Sense 1 for word “tuna”:
organism, being => plant, flora => vascular plant => succulent => cactus => tuna
Sense 2 for word “tuna”:
organism, being => fish => food fish => tuna
organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna
Sense 1 vs. sense 2: 2 paths for same word
Within sense 2: 2 paths for same sense
How to Select the Right Senses and Paths?
(This part is not in the paper.)
Solution: Modify the algorithm
First: build core tree
(1) Create paths for words with only one sense
(2) Use Domains
WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
Automatically scan the collection to see which domains apply
The user selects which of the suggested domains to use, or may add their own
Paths for terms that match the selected domains are added to the core tree
Then: add remaining terms to the core tree.
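The talk does not describe how the collection is scanned for domains. One simple possibility, sketched here under stated assumptions, is to count how often terms carrying each domain label occur and to suggest the domains that pass a minimum count. The `TERM_DOMAINS` table, the `min_hits` cutoff, and the function name are all hypothetical; in the talk the domain labels come from WordNet.

```python
from collections import Counter

# Hypothetical term -> domain labels, standing in for WordNet's domains.
TERM_DOMAINS = {
    "aspirin": ["medicine"],
    "fracture": ["medicine"],
    "theorem": ["mathematics"],
    "goalkeeper": ["soccer"],
}

def suggest_domains(docs, min_hits=2):
    """Suggest domains whose terms occur at least `min_hits` times."""
    hits = Counter()
    for text in docs:
        for word in text.lower().split():
            hits.update(TERM_DOMAINS.get(word, []))
    # Most frequent domains first; the user would pick from this list.
    return [domain for domain, count in hits.most_common() if count >= min_hits]

docs = [
    "aspirin after a fracture",
    "fracture clinic notes",
    "a theorem about primes",
]
print(suggest_domains(docs))
```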
Enrich Core Tree
For each new term t:
    Q(t) = {}    // set of candidate paths
    for each path p of t:
        compute the fraction fp(t) of nodes in p that are shared with a path in the core tree
        if fp(t) > thresh: Q(t) = Q(t) U {p}
    if Q(t) = {}: choose the first sense of t
    else: among all p's in Q(t), choose the path in the core tree with the most items assigned
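The candidate-filtering part of this procedure can be sketched as follows. Paths are lists of node names from the root down, and "shared with a path in the core tree" is interpreted as the longest common prefix with any core path. The threshold value 0.7 is an assumption (the slides do not give one), and the sample paths are the "chip" example used in this talk.

```python
def shared_fraction(path, core_paths):
    """Fraction of nodes in `path` shared (as a prefix) with the best core path."""
    best = 0
    for core in core_paths:
        shared = 0
        for a, b in zip(path, core):
            if a != b:
                break
            shared += 1
        best = max(best, shared)
    return best / len(path)

def candidate_paths(term_paths, core_paths, thresh):
    """Build Q(t); fall back to the first sense's path if Q(t) is empty."""
    q = [p for p in term_paths if shared_fraction(p, core_paths) > thresh]
    return q if q else term_paths[:1]

food = ["entity", "substance, matter", "food, nutrient", "nutriment", "dish"]
device = ["entity", "object", "artifact", "instrumentality", "device",
          "conductor", "semiconductor"]
core = [food + ["fondue, fondu"],
        device + ["diode", "light-emitting diode (led)"]]
p1 = food + ["snack food", "chip"]
p2 = device + ["chip"]
print(shared_fraction(p1, core), shared_fraction(p2, core))
print(len(candidate_paths([p1, p2], core, thresh=0.7)))
```

The final choice among several candidates depends on the item counts in the core tree and is not covered by this sketch.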
Enrich Core Tree (example)
Core tree:
entity => substance, matter => food, nutrient => nutriment => dish => fondue, fondu
entity => object => artifact => instrumentality => device => conductor => semiconductor => diode => light-emitting diode (led)
Example item: Toaster with led indicators
Candidate paths for the new term "chip":
Chip (p1): entity => substance, matter => food, nutrient => nutriment => dish => snack food => chip
Chip (p2): entity => object => artifact => instrumentality => device => conductor => semiconductor => chip
fp1(Chip) = 5/7
fp2(Chip) = 7/8
Q = {p1, p2}
Enrich Core Tree (cont’d)
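Core tree, with the number of items assigned shown in parentheses:
entity => substance, matter => food, nutrient => nutriment => dish (1699) => fondue, fondu (40)
entity => object => artifact => instrumentality => device => conductor => semiconductor (45) => diode => light-emitting diode (led)
Candidate attachments for "chip":
dish (1699) => snack food => chip (chose this path since it has more items assigned)
semiconductor (45) => chip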
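The selection between candidate paths can be sketched as a comparison of item counts. How the count for a path is computed is not spelled out on the slide; the sketch assumes it is the sum of items assigned to the core-tree nodes that the candidate path passes through. The counts are the ones shown in the example, and the function names are illustrative.

```python
def items_on_path(path, core_counts):
    """Sum the items assigned to core-tree nodes on this path (assumed score)."""
    return sum(core_counts.get(node, 0) for node in path)

def pick_path(candidates, core_counts):
    """Choose the candidate whose core-tree nodes hold the most items."""
    return max(candidates, key=lambda p: items_on_path(p, core_counts))

# Item counts shown for the core tree in the example.
core_counts = {"dish": 1699, "fondue, fondu": 40, "semiconductor": 45}

p1 = ["entity", "substance, matter", "food, nutrient", "nutriment",
      "dish", "snack food", "chip"]
p2 = ["entity", "object", "artifact", "instrumentality", "device",
      "conductor", "semiconductor", "chip"]

chosen = pick_path([p1, p2], core_counts)
print(" => ".join(chosen))
```

Under this scoring the food path wins (1699 items against 45), which matches the choice shown in the example.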
Results on a Recipes/Kitchen Appliances Data Set
Discussion
This is very simple, but works very well
Why hasn’t this been done before?
Because WordNet did not have enough coverage?
Conclusions
Can nearly-automatically build a set of hierarchies by finding IS-A relations between terms using WordNet
The method has been tested on various domains: medicine, mathematics, recipes, news, arts
User study in progress
Limitations:
The ontology has to be appropriate for the target domain
No disambiguation between nouns, verbs, and adjectives