Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Page 1: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Using Metadata in Search

Prof. Marti Hearst

SIMS 202, Lecture 27

Page 2: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Marti A. Hearst, SIMS 202, Fall 1997

Today

- Search using metadata
  - Comparing search with controlled vocabulary vs. free text
  - A GUI for browsing and search on metadata + free text
- More videos!

Page 3: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

What is Metadata for?

- “Normalizing” natural language
  - distinguish homonyms
  - group synonyms together
- Organizing information
  - for search
  - for browsing
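The "normalizing" idea can be sketched in a few lines of Python. This is only an illustration: the vocabulary table and the function name `descriptors_for` are made up here, not taken from the lecture. Synonyms collapse onto a single descriptor, while a homonym such as "bug" maps to several descriptors, so the ambiguity is exposed rather than hidden.

```python
# Hypothetical controlled vocabulary: each descriptor lists the free-text
# variants it absorbs. All terms are invented for illustration.
CONTROLLED_VOCAB = {
    "insects": {"insect", "insects", "bug", "bugs", "entomology"},
    "software defects": {"bug", "bugs", "defect", "glitch"},
}


def descriptors_for(phrase):
    """Return the sorted descriptors whose variant list contains the phrase."""
    key = phrase.strip().lower()
    return sorted(d for d, variants in CONTROLLED_VOCAB.items() if key in variants)


# Synonym grouping: different surface forms lead to the same descriptor.
print(descriptors_for("Entomology"))
# Homonym: one surface form leads to two candidate descriptors.
print(descriptors_for("bug"))
```

A real vocabulary would need stemming, phrase matching, and some way for the searcher to pick among the candidate descriptors; none of that is attempted here.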

Page 4: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

UWMS Data Mining Workshop. Marti A. Hearst, SIMS 202, Fall 1997

What Categories Do

- Summarize a document according to pre-defined main topics
- Compress the many ways of representing a concept into one
- Identify which subset of attributes are salient for a collection

Page 5: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Controlled Vocabularies

- Assign metadata from a pre-defined set of allowable categories or descriptors
- Some studies confuse human-assigned categories and controlled vocabularies
  - Could be human-assigned but from an uncontrolled set
  - Could be computer-assigned from a controlled set

Page 6: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Controlled Vocabulary (Svenonius 86)

- Original uses of metadata were for classification and organization
- Computers allow for search
  - initially, search over subject codes
  - more recently, free text search
  - still more recent: free text search on full text
- Controlled vocab is seen in contrast to free text search on titles, abstracts, or body

Page 7: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Problems with Uncontrolled Vocabularies

- If terms extracted from titles
  - Titles may not be informative
  - Docs on same topic may not be expressed using the same vocabulary
    - insects vs. entomology
    - free trade vs. tariff
- Additionally, if terms extracted from full text
  - term co-occurrence may be incidental
  - passing references
  - many more candidate terms

Page 8: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Problems with Controlled Vocabularies

- Too vague or high-level
- Potentially out of date
- Expensive to build
- Difficult to search with
  - Most not designed for search
  - How to locate categories of interest?

Page 9: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Category Search and Browsing

Massicotte 88 (cited in Drabenstott & Weller 96)

“The problem we are faced with is the undue display length of a browse list under a given search term. … indexes will continue to expand at an ever-increasing rate. This factor alone will eventually make the alphabetical index less and less viable as a method of searching.”

How to make use of all that category information?

Page 10: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Free Text vs. Controlled Vocab

- Usually, the two methods
  - retrieve different sets of documents
  - controlled vocab -> higher recall
  - free text -> higher precision
- Studies usually find it’s best to use both

Page 11: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Free Text vs. Controlled Vocab

- Controlled vocab -> higher recall
  - Once you locate the right category, you can retrieve all docs within that category
    - all insects!
    - all insects + bugs + vermin
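A rough sketch of why locating the right category helps recall: one lookup pulls in every document filed under that category or anything beneath it. The tree, the document ids, and the function names are hypothetical. In particular, placing "bugs" and "vermin" beneath "insects" is an assumption made for the example, not something the slide specifies.

```python
# Hypothetical category tree (parent -> children) and document labels.
CHILDREN = {
    "insects": ["bugs", "vermin"],
    "bugs": [],
    "vermin": [],
}
DOC_LABELS = {
    "doc1": {"insects"},
    "doc2": {"bugs"},
    "doc3": {"vermin"},
    "doc4": {"tariffs"},
}


def subtree(category):
    """Collect a category together with all of its descendants."""
    found = {category}
    for child in CHILDREN.get(category, []):
        found |= subtree(child)
    return found


def docs_in(category):
    """Return documents labeled with the category or any descendant of it."""
    wanted = subtree(category)
    return sorted(doc for doc, labels in DOC_LABELS.items() if labels & wanted)


print(docs_in("insects"))  # every document under the insects subtree
print(docs_in("bugs"))     # only the narrower category
```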

Page 12: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Free Text vs. Controlled Vocab

- Controlled vocab -> lower precision
  - accuracy traded off for consistency
  - limited number of categories
  - free text can be more precise
    - just two specific insects
    - insect name + what it eats
  - Blair & Maron using free text got high precision (~70%) and low recall (~25%)
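For readers who want the two measures spelled out: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The counts in the sketch below are invented, chosen only so the result lands near the rough figures quoted above. They are not data from any study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two sets of document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up collection: 100 relevant documents; the search returns 35
# documents, 25 of which are relevant.
relevant = set(range(100))
retrieved = set(range(25)) | set(range(1000, 1010))

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

High precision with low recall means most of what comes back is useful, but most of the useful material never comes back.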

Page 13: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Free Text vs. Controlled Vocab

- A contradiction: (Markey et al. 80)
  - ERIC dataset and descriptors (c.v.)
  - 165 free text queries
  - 1 in 8 free text queries could not be expressed with descriptors
  - C.V. produced higher precision and lower recall -- contradicting most other studies

Page 14: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Free Text vs. Controlled Vocab

- Why do the Markey et al. results differ from most other studies?
  - Perhaps ERIC descriptors are sparse
  - Contradiction implies a need for more investigation

Page 15: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Free Text vs. Controlled Vocab

- General agreement:
  - Usually the two approaches retrieve different sets of (relevant) documents
- Implication:
  - Need ranking algorithms that combine the two
- Strategies:
  - Automatically map query words into c.v.
  - Modified relevance feedback: (Srinivasan 96)
    - find some good documents
    - find more docs that share their category labels (as opposed to those docs that share their free text terms)
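The feedback strategy can be illustrated with a toy ranking function. It is a minimal sketch under stated assumptions, not the method from Srinivasan 96: documents, labels, and the scoring rule (count of shared labels, weighted by how often each label occurs among the good documents) are all invented for the example.

```python
from collections import Counter

# Hypothetical documents with category labels; ids and labels are invented.
DOC_LABELS = {
    "d1": {"D1", "S3", "A1"},
    "d2": {"D3", "S2", "S3"},
    "d3": {"D1", "D2", "S2", "A2"},
    "d4": {"D2", "A3"},
}


def feedback_by_category(good_docs, doc_labels):
    """Rank unjudged documents by the labels they share with the good ones."""
    label_weight = Counter()
    for doc in good_docs:
        label_weight.update(doc_labels[doc])
    scores = {
        doc: sum(label_weight[label] for label in labels)
        for doc, labels in doc_labels.items()
        if doc not in good_docs
    }
    # Highest score first; ties broken by document id for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc, score) for doc, score in ranked if score > 0]


print(feedback_by_category({"d1"}, DOC_LABELS))
print(feedback_by_category({"d1", "d3"}, DOC_LABELS))
```

The same skeleton could score by shared free-text terms instead of labels, which is the contrast the slide draws.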

Page 16: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

How to Use Text Categories

- Mapping query words to controlled vocabulary
  - lots of research on this
  - helps in some cases, hurts in others
- Organizing retrieval results (new!)
  - problems:
    - too many categories/document
    - too many documents/category
    - the right categories aren’t there
- Idea: address difficulties by devising a better user interface.

Page 17: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Example: MeSH and MedLine

- MeSH Medical Category Hierarchy
  - ~18,000 labels
  - manually assigned
  - ~8 labels/article on average
  - avg depth: 4.5, max depth 9
- Top Level Categories:

  anatomy    diagnosis    related disc
  animals    psych        technology
  disease    biology      humanities
  drugs      physics

Page 18: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Multiple Categories per Document

  Drug    Symptom    Anatomy
  D1      S1         A1
  D2      S2         A2
  D3      S3         A3

Medical articles contain combinations of these concept types

Page 19: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

How to Group the Category Types?

[D1 S3 A1] [D3 S2 S3] [D1 D2 S2 A2] …

Dx Sx Ax

Dx Sx A1    Dx S1 Ax    D1 Sx Ax

Dx S1 A1    D1 S1 Ax    D1 Sx A1

D1 S1 A1
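One way to read the lattice above: each node is a pattern in which a slot such as "Dx" stands for "any drug, or none" and a slot such as "D1" requires that specific label. Under that reading (an assumption about the diagram, not something the slide states), the sketch below enumerates every generalization of a fully specified triple and counts how many of the three example documents fall under each. The function names are invented.

```python
from itertools import product


def generalizations(triple):
    """Yield every pattern obtained by replacing slots with a wildcard."""
    for mask in product([False, True], repeat=len(triple)):
        yield tuple(label if keep else label[0] + "x"
                    for label, keep in zip(triple, mask))


def matches(pattern, labels):
    """A wildcard slot (ending in 'x') places no constraint on the document."""
    return all(slot.endswith("x") or slot in labels for slot in pattern)


# The three example documents listed on the slide.
DOCS = [{"D1", "S3", "A1"}, {"D3", "S2", "S3"}, {"D1", "D2", "S2", "A2"}]

counts = {
    " ".join(p): sum(matches(p, doc) for doc in DOCS)
    for p in generalizations(("D1", "S1", "A1"))
}

for pattern, n in counts.items():
    print(f"{pattern}: {n} document(s)")
```

Even with three documents the groups overlap heavily, which is the difficulty the slide title raises: no single grouping of the category types suits every document.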

Page 20: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Large Category Sets

- Problems for User Interfaces
  - Too many categories to browse
  - Too many docs per category
  - Docs belong to multiple categories
  - Need to integrate search
  - Need to show the documents

Page 21: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Grateful Med Query Specification

Page 22: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Grateful Med Category SubTree

Page 23: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Using Grateful Med

- Problems:
  - Does not integrate category selection with viewing of categories
  - Only a few categories visible at a time, with little context
  - Does not show relationship of retrieved documents to the category structure

Page 24: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Cat-a-Cone: (Hearst & Karadi 97)
Multiple Simultaneous Categories

- Key Ideas:
  - Separate documents from category labels
  - Show both simultaneously
- Link the two for iterative feedback
- Distinguish between:
  - Searching for Documents vs. Searching for Categories

Page 25: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

[Diagram: Collection, Retrieved Documents, and Category Hierarchy, with links labeled search, browse, and query terms]

Page 26: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

[Diagram: Collection, Retrieved Documents, and Category Hierarchy, with links labeled search, browse, and query terms]

Page 27: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Cat-a-Cone (Hearst & Karadi 97)

- Catacomb: (definition 2b, online Websters)
  “A complex set of interrelated things”
- Makes use of earlier PARC work on 3D+animation:
  - Rooms (Henderson and Card 86)
  - IV: Cone Tree (Robertson, Card, Mackinlay 93)
  - Web Book (Card, Robertson, York 96)

Page 28: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

ConeTree for Category Labels

- Browse/explore category hierarchy
  - by search on label names
  - by growing/shrinking subtrees
  - by spinning subtrees
- Affordances
  - learn meaning via ancestors, siblings
  - disambiguate meanings
  - all cats simultaneously viewable

Page 29: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Virtual Book for Result Sets

Categories on Page (Retrieved Document) linked to Categories in Tree

Flipping through Book Pages causes some Subtrees to Expand and Contract

Most Subtrees remain unchanged

Book can be Stored for later Re-Use

Page 30: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Example Query

Patient Query on Breast Cancer dataset:

“Do I have to have radiation if I have a mastectomy, and what would be the effects?”

How does the user know which categories?

Page 31: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Interactive Category Hierarchy

- Smoothly interlink:
  - search over categories
  - search over document contents
  - browsing of categories
  - browsing of retrieved documents

Page 32: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Improvements over Grateful Med

- Integrate category selection with viewing of categories
- Show all categories + context
- Show relationship of retrieved documents to the category structure

Page 33: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Comparison Study

H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS, to appear

- Comparison: Kohonen Map and Yahoo
- Task:
  - “Window shop” for interesting home page
  - Repeat with other interface
- Results:
  - Starting with map could repeat in Yahoo (8/11)
  - Starting with Yahoo unable to repeat in map (2/14)

Page 34: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Concept Landscapes

[Figure: concept landscape with regions labeled Pharmacology, Anatomy, Legal, Disease, Hospitals]

(e.g., Lin, Chen, Wise et al.)

- Single concept per document
- No titles
- Browsing without search

Page 35: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Comparison Study (cont.)

- Participants liked:
  - Correspondence of region size to number of documents in region
  - Overview (but also wanted zoom)
  - Ease of jumping from one topic to another
  - Multiple routes to topics
  - Use of category and subcategory labels

Page 36: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Comparison Study (cont.)

- Participants wanted:
  - hierarchical organization
  - other ordering of concepts (alphabetical)
  - integration of browsing and search
  - correspondence of color to meaning
  - more meaningful labels
  - labels at same level of abstraction
  - fit more labels in the given space
  - combined keyword and category search
  - multiple category assignment (sports+entertain)

Page 37: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Comparison Study (cont.)

- Cat-a-Cone
  - contains most of the desired properties
  - lacks the disliked properties

Page 38: Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.

Summary: Cat-a-Cone

- Interface that smoothly integrates
  - search over multiple categories
  - search over document contents
  - browsing of multiple categories
  - browsing of retrieved documents
- Iterative, Interactive
- Retain partial results in a workspace