Nearly-Automated Metadata Hierarchy Creation Emilia Stoica and Marti Hearst SIMS University of California, Berkeley
Date posted: 20-Dec-2015

Page 1:

Nearly-Automated Metadata Hierarchy Creation

Emilia Stoica and Marti Hearst

SIMS, University of California, Berkeley

Page 2:

Motivation

Want to assign items labels from multiple hierarchies

Page 3:

Motivation

Description: 19th c. paint horse; saddle and hackamore; spurs; bandana on rider; old time cowboy hat; underchin thong; flying off.

Nature > Animal > Mammal > Horse

Occupations > Cowboy

Clothing > Hats > Cowboy Hat

Media > Engraving > Wood Eng.

Location > North America > America

Page 4:

Use in Browsing Interfaces like Flamenco

Page 5:

Use in Browsing Interfaces like Flamenco

Page 6:

How to Obtain the Hierarchies?

Goal: Help an information architect get started
  Currently they do it all by hand!
  Assume they will do some editing

Nearly automated

Multiple hierarchies (facets)

Automatically assign items to multiple hierarchies

Page 7:

Related Work

Automated text categorization
  LOTS of work on this
  Assumes that a set of categories is already created

To be intuitive, a categorization should contain sets of IS-A relations (hierarchical)
  Rosenfeld and Morville (2002)
  Pratt, Hearst, and Fagan (1999)

Current automated approaches contain only associative relations

Page 8:

Examples of Associative Relations

Hofmann 1999
  Collection: Machine learning abstracts
  Top-level categories: learn, paper, base, model, new, train
  Problem: These are not intuitive categories for machine learning

Sanderson and Croft 1999
  Collection: Medical texts
  Top-level categories: disease, post polio, serious disease, dengue, infection control, immunology, …
  Problem: These are at different levels of generality

Page 9:

Examples of Associative Relations

Schuetze 1993
  Collection: Arts descriptions
  Sample groupings:
    carriage cart horse ride walk passing horseback wagon men chicken rider
    bald balding head facing hand faced arm hat haired glove long
  Problem: Terms are associated with one another, but are not organized into hierarchies that can be navigated.

Page 10:

Our Approach

Leverage the structure of WordNet

[Pipeline diagram: Documents -> Select terms -> Get hypernym paths (from WordNet) -> Build tree -> Compress tree]

Page 11:

1. Select Terms

Select well-distributed terms from the collection

Example terms: red, blue

[Pipeline diagram, current step: Select terms]
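The slide does not say how "well distributed" is measured. A minimal sketch, assuming it means a term's document frequency falls inside a band; the function name `select_terms`, the thresholds `min_frac` and `max_frac`, and the sample documents are all hypothetical:

```python
from collections import Counter


def select_terms(documents, min_frac=0.01, max_frac=0.5):
    """Keep terms whose document frequency lies in [min_frac, max_frac].

    Assumption: "well distributed" is approximated by a document-frequency
    band; the slides do not give the actual criterion.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_frac <= df / n_docs <= max_frac
    )


# Made-up item descriptions, only to exercise the function.
docs = [
    "red horse with saddle",
    "blue hat with bandana",
    "red hat with spurs",
    "blue saddle with spurs",
]
print(select_terms(docs, min_frac=0.5, max_frac=0.75))
```

With these thresholds, "with" is dropped for appearing in every document, and "horse" and "bandana" are dropped for appearing in only one.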

Page 12:

2. Get Hypernym Path

Get hypernym path for each term

red: abstraction => property => visual property => color => chromatic color => red, redness => red

blue: abstraction => property => visual property => color => chromatic color => blue, blueness => blue

[Pipeline diagram, current step: Get hypernym paths]

Page 13:

3. Build Tree

Merge hypernym paths to build a tree

Paths:
red: abstraction => property => visual property => color => chromatic color => red, redness => red
blue: abstraction => property => visual property => color => chromatic color => blue, blueness => blue

Merged tree:
abstraction
  property
    visual property
      color
        chromatic color
          red, redness
            red
          blue, blueness
            blue

[Pipeline diagram, current step: Build tree]
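Merging the paths can be sketched as inserting each root-to-leaf path into a nested dictionary, so shared prefixes collapse into one branch. This is an illustration, not the authors' code: the paths are hardcoded from the slide rather than read from WordNet, and `build_tree` and `show` are hypothetical names.

```python
def build_tree(paths):
    """Merge root-to-leaf paths into a nested dict keyed by node name."""
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            # Reuse the existing branch if this prefix was already inserted.
            node = node.setdefault(name, {})
    return tree


def show(tree, depth=0):
    """Return the tree as a list of indented lines."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(show(children, depth + 1))
    return lines


# Hypernym paths for "red" and "blue" as shown on the slide.
prefix = ["abstraction", "property", "visual property", "color", "chromatic color"]
paths = [
    prefix + ["red, redness", "red"],
    prefix + ["blue, blueness", "blue"],
]
tree = build_tree(paths)
print("\n".join(show(tree)))
```

The two paths share everything down to "chromatic color", so the merged tree branches only there.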

Page 14:

4. Compress Tree

Eliminate a parent with fewer than n children, unless it is the root or its distribution is larger than 0.1*maxdist

Before:
color
  chromatic color
    red, redness
      red
    blue, blueness
      blue
    green, greenness
      green

After:
color
  chromatic color
    red
    blue
    green

[Pipeline diagram, current step: Compress tree]
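The first compression rule can be sketched on a nested-dict tree as below. Several things here are assumptions rather than facts from the slides: that `dist` maps a node name to the number of items containing it, that `maxdist` is the largest such value, that leaves are never removed by this rule, and that children are compressed before their parents. The counts are made up.

```python
def compress(tree, dist, n=2, is_root=True):
    """Splice out internal nodes that have fewer than n children.

    A node is kept if it is a root, is a leaf, has at least n children,
    or its distribution exceeds 0.1 * maxdist.
    """
    max_dist = max(dist.values(), default=0)
    result = {}
    for name, children in tree.items():
        kids = compress(children, dist, n, is_root=False)
        keep = (
            is_root
            or not kids                      # leaf terms are left alone
            or len(kids) >= n
            or dist.get(name, 0) > 0.1 * max_dist
        )
        if keep:
            result[name] = kids
        else:
            result.update(kids)              # promote children one level up
    return result


before = {
    "color": {
        "chromatic color": {
            "red, redness": {"red": {}},
            "blue, blueness": {"blue": {}},
            "green, greenness": {"green": {}},
        }
    }
}
dist = {"red": 30, "blue": 25, "green": 20}   # hypothetical item counts
after = compress(before, dist, n=2)
print(after)
```

Each of "red, redness", "blue, blueness" and "green, greenness" has a single child and no distribution of its own, so all three are removed, which matches the before/after trees on the slide.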

Page 15:

4. Compress Tree (cont.)

Eliminate a child whose name appears within parent’s

Before:
color
  chromatic color
    red
    blue
    green

After:
color
  red
  blue
  green

[Pipeline diagram, current step: Compress tree]
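The second compression rule can be sketched in the same nested-dict representation. The slide's example removes "chromatic color" under "color", so this sketch assumes a child is redundant when the words of one name are contained in the words of the other; the exact matching rule is not given on the slide, and `shares_name` and `drop_redundant_children` are hypothetical names.

```python
def shares_name(parent, child):
    """True if one name's words are all contained in the other's (assumed rule)."""
    p, c = set(parent.lower().split()), set(child.lower().split())
    return p <= c or c <= p


def drop_redundant_children(tree, parent=None):
    """Remove children whose name overlaps the parent's, re-attaching grandchildren."""
    result = {}
    for name, children in tree.items():
        if parent is not None and shares_name(parent, name):
            # Skip this node; its children now hang off (and are compared to) parent.
            result.update(drop_redundant_children(children, parent))
        else:
            result[name] = drop_redundant_children(children, name)
    return result


before_rule2 = {"color": {"chromatic color": {"red": {}, "blue": {}, "green": {}}}}
after_rule2 = drop_redundant_children(before_rule2)
print(after_rule2)
```

Whitespace tokenization is crude (a name such as "red, redness" yields the token "red,"), so a real implementation would need proper normalization.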

Page 16:

Disambiguation

Ambiguity in:
  Word senses
  Paths up the hypernym tree

2 paths for same word:

Sense 1 for word “tuna”
  organism, being => plant, flora => vascular plant => succulent => cactus => tuna

Sense 2 for word “tuna” (2 paths for same sense)
  organism, being => fish => food fish => tuna
  organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna

Page 17:

How to Select the Right Senses and Paths?

(This part is not in the paper.)

Solution: Modify the algorithm

First: build core tree
  (1) Create paths for words with only one sense
  (2) Use Domains
      WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
      Automatically scan the collection to see which domains apply
      The user selects which of the suggested domains to use, or may add their own
      Paths for terms that match the selected domains are added to the core tree

Then: add remaining terms to the core tree.
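The two sources of core-tree paths can be sketched as below. The data layout (one path and one domain label per sense), the function name `core_paths`, and the domain labels themselves are hypothetical. The labels are illustrative and not taken from WordNet's domain list.

```python
def core_paths(senses_by_term, selected_domains):
    """Return (paths that seed the core tree, terms deferred to the enrichment step)."""
    core, deferred = [], []
    for term, senses in senses_by_term.items():
        if len(senses) == 1:
            core.append(senses[0]["path"])            # (1) only one sense
            continue
        matches = [s["path"] for s in senses if s["domain"] in selected_domains]
        if matches:
            core.extend(matches)                      # (2) sense in a selected domain
        else:
            deferred.append(term)                     # handled later
    return core, deferred


senses_by_term = {
    "fondue": [
        {"domain": "cooking", "path": ["entity", "substance, matter", "food, nutrient",
                                        "nutriment", "dish", "fondue, fondu"]},
    ],
    "tuna": [
        {"domain": "botany", "path": ["organism, being", "plant, flora", "vascular plant",
                                       "succulent", "cactus", "tuna"]},
        {"domain": "cooking", "path": ["organism, being", "fish", "food fish", "tuna"]},
    ],
    "chip": [
        {"domain": "snacks", "path": ["entity", "substance, matter", "food, nutrient",
                                       "nutriment", "dish", "snack food", "chip"]},
        {"domain": "electronics", "path": ["entity", "object", "artifact", "instrumentality",
                                            "device", "conductor", "semiconductor", "chip"]},
    ],
}
core, deferred = core_paths(senses_by_term, selected_domains={"cooking"})
print(len(core), deferred)
```

Here "fondue" enters the core tree because it has one sense, "tuna" enters through the selected domain, and "chip" is left for the later step.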

Page 18:

Enrich Core Tree

For each new term t
  Q(t) = {}   // set of candidate paths
  for each path p of t
    compute the fraction fp(t) of nodes in p that are shared with a path in the core tree
    if (fp(t) > thresh) Q(t) = Q(t) U {p}
  if (Q(t) = {}) choose first sense of t
  else among all p’s in Q(t), choose path in core tree with most items assigned

Page 19:

Enrich Core Tree

Core tree:
  entity => substance, matter => food, nutrient => nutriment => dish => fondue, fondu
  entity => object => artifact => instrumentality => device => conductor => semiconductor => diode => light-emitting diode (led)

Example item: Toaster with led indicators

Candidate paths for the new term "chip":
  Chip (p1): entity => substance, matter => food, nutrient => nutriment => dish => snack food => chip
  Chip (p2): entity => object => artifact => instrumentality => device => conductor => semiconductor => chip

Page 20:

Enrich Core Tree

Chip (p1): entity => substance, matter => food, nutrient => nutriment => dish => snack food => chip

fp1(Chip) = 5/7 (5 of the 7 nodes in p1 are shared with the core tree)

Q = {p1}

Page 21:

Enrich Core Tree

Chip (p2): entity => object => artifact => instrumentality => device => conductor => semiconductor => chip

fp1(Chip) = 5/7

fp2(Chip) = 7/8 (7 of the 8 nodes in p2 are shared with the core tree)

Q = {p1, p2}

Page 22:

Enrich Core Tree (cont’d)

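Core tree (number of items assigned in parentheses):
  entity => substance, matter => food, nutrient => nutriment => dish (1699) => fondue, fondu (40)
  entity => object => artifact => instrumentality => device => conductor => semiconductor (45) => diode => light-emitting diode (led)

Candidate placements for chip:
  dish => snack food => chip (choose this path since it has more items assigned)
  semiconductor => chip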
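The enrichment procedure and the chip example from these slides can be sketched as follows. The threshold value is not given on the slides, so `thresh=1/2` is a placeholder. Reading "most items assigned" as the item count at the deepest node a path shares with the core tree is an assumption based on the counts shown for dish and semiconductor. `shared_fraction` and `choose_path` are hypothetical names.

```python
from fractions import Fraction


def shared_fraction(path, core_paths):
    """f_p(t): largest fraction of nodes in path shared with one core-tree path."""
    best = max(len(set(path) & set(core)) for core in core_paths)
    return Fraction(best, len(path))


def choose_path(paths, core_paths, items_assigned, thresh):
    """Pick one path for a new term; paths are assumed to be ordered by sense."""
    candidates = [p for p in paths if shared_fraction(p, core_paths) > thresh]
    if not candidates:
        return paths[0]                       # fall back to the first sense
    core_nodes = set().union(*core_paths)

    def support(path):
        # Items assigned to the deepest node this path shares with the core tree.
        shared = [node for node in path if node in core_nodes]
        return items_assigned.get(shared[-1], 0)

    return max(candidates, key=support)


core = [
    ["entity", "substance, matter", "food, nutrient", "nutriment", "dish",
     "fondue, fondu"],
    ["entity", "object", "artifact", "instrumentality", "device", "conductor",
     "semiconductor", "diode", "light-emitting diode (led)"],
]
p1 = ["entity", "substance, matter", "food, nutrient", "nutriment", "dish",
      "snack food", "chip"]
p2 = ["entity", "object", "artifact", "instrumentality", "device", "conductor",
      "semiconductor", "chip"]
items = {"dish": 1699, "fondue, fondu": 40, "semiconductor": 45}  # counts from the slide

print(shared_fraction(p1, core), shared_fraction(p2, core))
chosen = choose_path([p1, p2], core, items, thresh=Fraction(1, 2))
print(" => ".join(chosen))
```

Both paths pass the placeholder threshold, and p1 wins because dish has 1699 items assigned against 45 for semiconductor.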

Page 23:

Results on a Recipes/Kitchen Appliances Data Set

Page 24:

Results on a Recipes/Kitchen Appliances Data Set

Page 25:

Discussion

This is very simple, but works very well

Why hasn’t this been done before?
  Because WordNet did not have enough coverage?

Page 26:

Conclusions

Can nearly-automatically build a set of hierarchies by finding IS-A relations between terms using WordNet

The method has been tested on various domains: medicine, mathematics, recipes, news, arts

User study in progress

Limitations:
  The ontology has to be appropriate for the target domain
  No disambiguation between nouns, verbs, and adjectives