Using Metadata in Search Prof. Marti Hearst SIMS 202, Lecture 27.
Using Metadata in Search
Prof. Marti Hearst
SIMS 202, Lecture 27
Marti A. Hearst, SIMS 202, Fall 1997
Today
  Search using Metadata
    Comparing search with controlled vocabulary vs. free text
    A GUI for browsing and search on metadata + free text
  More Videos!
What is Metadata for?
  “Normalizing” natural language
    distinguish homonyms
    group synonyms together
  Organizing information
    for search
    for browsing
UWMS Data Mining Workshop, Marti A. Hearst, SIMS 202, Fall 1997
What Categories Do
  Summarize a document according to pre-defined main topics
  Compress the many ways of representing a concept into one
  Identify which subset of attributes are salient for a collection
Controlled Vocabularies
  Assign metadata from a pre-defined set of allowable categories or descriptors
  Some studies confuse human-assigned categories and controlled vocabularies
    Could be human-assigned but from an uncontrolled set
    Could be computer-assigned from a controlled set
Controlled Vocabulary (Svenonius 86)
  Original uses of metadata were for classification and organization
  Computers allow for search
    initially, search over subject codes
    more recently, free text search
    still more recent: free text search on full text
  Controlled vocab is seen in contrast to free text search on titles, abstracts, or body
Problems with Uncontrolled Vocabularies
  If terms extracted from titles
    Titles may not be informative
    Docs on same topic may not be expressed using the same vocabulary
      insects vs. entomology
      free trade vs. tariff
  Additionally, if terms extracted from full text
    term co-occurrence may be incidental
    passing references
    many more candidate terms
Problems with Controlled Vocabularies
  Too vague or high-level
  Potentially out of date
  Expensive to build
  Difficult to search with
    Most not designed for search
    How to locate categories of interest?
Category Search and Browsing
  Massicotte 88 (cited in Drabenstott & Weller 96)
    “The problem we are faced with is the undue display length of a browse list under a given search term. … indexes will continue to expand at an ever-increasing rate. This factor alone will eventually make the alphabetical index less and less viable as a method of searching.”
  How to make use of all that category information?
Free Text vs. Controlled Vocab
  Usually, the two methods
    retrieve different sets of documents
    controlled vocab -> higher recall
    free text -> higher precision
  Studies usually find it’s best to use both
Free Text vs. Controlled Vocab
  Controlled vocab -> higher recall
    Once you locate the right category, you can retrieve all docs within that category
      all insects!
      all insects + bugs + vermin
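A minimal sketch of why a controlled vocabulary tends to raise recall, assuming a toy collection. The documents, the `CATEGORY_OF` synonym table, and the function names are all hypothetical; the slides do not describe an implementation.

```python
# Hypothetical toy collection: document id -> free text.
DOCS = {
    "d1": "insects damage wheat crops",
    "d2": "bugs found in stored grain",
    "d3": "vermin control in warehouses",
    "d4": "free trade and tariff policy",
}

# Hypothetical controlled vocabulary: several surface terms map to one descriptor.
CATEGORY_OF = {"insects": "INSECTS", "bugs": "INSECTS", "vermin": "INSECTS"}


def assign_categories(text):
    """Return the set of descriptors whose surface terms occur in the text."""
    return {CATEGORY_OF[w] for w in text.split() if w in CATEGORY_OF}


def free_text_search(term):
    """Match only documents that contain the literal query term."""
    return sorted(d for d, text in DOCS.items() if term in text.split())


def category_search(term):
    """Map the query term to its descriptor, then return every document under it."""
    descriptor = CATEGORY_OF.get(term)
    return sorted(d for d, text in DOCS.items()
                  if descriptor in assign_categories(text))
```

With this data, `free_text_search("insects")` finds only the document that uses the word itself, while `category_search("insects")` also returns the documents that say "bugs" or "vermin".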
Free Text vs. Controlled Vocab
  Controlled vocab -> lower precision
    accuracy traded off for consistency
    limited number of categories
    free text can be more precise
      just two specific insects
      insect name + what it eats
    Blair & Maron using free text got high precision (~70%) and low recall (~25%)
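For reference, precision and recall can be computed as below. This is a sketch: the document ids and counts are hypothetical, chosen only so the result lands near the approximate figures quoted on the slide, and they are not data from the Blair & Maron study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 40 relevant documents exist in the collection;
# the query retrieves 10 of them plus 4 non-relevant documents.
relevant = set(range(40))
retrieved = set(range(10)) | set(range(100, 104))
p, r = precision_recall(retrieved, relevant)
```

Here precision is 10/14 (about 0.71) and recall is 10/40 (0.25): most of what comes back is relevant, but most relevant documents are never retrieved.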
Free Text vs. Controlled Vocab
  A contradiction: (Markey et al. 80)
    ERIC dataset and descriptors (c.v.)
    165 free text queries
    1 in 8 free text queries could not be expressed with descriptors
    C.V. produced higher precision and lower recall -- contradicting most other studies
Free Text vs. Controlled Vocab
  Why do the Markey et al. results differ from most other studies?
    Perhaps ERIC descriptors are sparse
    Contradiction implies a need for more investigation
Free Text vs. Controlled Vocab
  General agreement:
    Usually the two approaches retrieve different sets of (relevant) documents
  Implication:
    Need ranking algorithms that combine the two
  Strategies:
    Automatically map query words into c.v.
    Modified relevance feedback: (Srinivasan 96)
      find some good documents
      find more docs that share their category labels (as opposed to those docs that share their free text terms)
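The category-based feedback strategy above can be sketched roughly as follows. This is an illustration only and not Srinivasan's actual algorithm; the `LABELS` table, the scoring rule (count of shared label occurrences), and the function name are assumptions.

```python
from collections import Counter

# Hypothetical documents: id -> set of controlled-vocabulary labels.
LABELS = {
    "d1": {"breast cancer", "mastectomy"},
    "d2": {"breast cancer", "radiation"},
    "d3": {"breast cancer", "mastectomy", "radiation"},
    "d4": {"insects"},
}


def feedback_by_category(good_docs, labels=LABELS):
    """Rank unseen documents by the labels they share with the good documents."""
    # Weight each label by how many of the good documents carry it.
    weights = Counter(label for d in good_docs for label in labels[d])
    scores = {
        d: sum(weights[label] for label in doc_labels)
        for d, doc_labels in labels.items()
        if d not in good_docs
    }
    # Drop documents sharing no label; break ties by id for determinism.
    return sorted((d for d, s in scores.items() if s > 0),
                  key=lambda d: (-scores[d], d))
```

Starting from `["d1"]`, the function ranks `d3` (two shared labels) ahead of `d2` (one shared label) and omits `d4`, regardless of which free-text words the documents contain.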
How to Use Text Categories
  Mapping query words to controlled vocabulary
    lots of research on this
    helps in some cases, hurts in others
  Organizing retrieval results (new!)
    problems:
      too many categories/document
      too many documents/category
      the right categories aren’t there
  Idea: address difficulties by devising a better user interface.
Example: MeSH and MedLine
  MeSH Medical Category Hierarchy
    ~18,000 labels
    manually assigned
    ~8 labels/article on average
    avg depth: 4.5, max depth 9
  Top Level Categories:
    anatomy    diagnosis    related disc
    animals    psych        technology
    disease    biology      humanities
    drugs      physics
Multiple Categories per Document
  Drug    Symptom    Anatomy
  D1      S1         A1
  D2      S2         A2
  D3      S3         A3
  Medical articles contain combinations of these concept types
How to Group the Category Types?
  [D1 S3 A1] [D3 S2 S3] [D1 D2 S2 A2] …
            Dx Sx Ax
  Dx Sx A1  Dx S1 Ax  D1 Sx Ax
  Dx S1 A1  D1 S1 Ax  D1 Sx A1
            D1 S1 A1
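The lattice on this slide, from the most specific combination D1 S1 A1 up to the fully general Dx Sx Ax, can be enumerated mechanically. The sketch below assumes each label is a type letter followed by an index, and that `x` stands for "any label of this type"; the function names are hypothetical.

```python
from itertools import product

WILDCARD = "x"  # stands for "any label of this type", as in Dx, Sx, Ax


def generalizations(labels):
    """List every pattern made by keeping or wildcarding each label.

    ("D1", "S1", "A1") yields D1 S1 A1, D1 S1 Ax, ..., Dx Sx Ax.
    """
    options = [(label, label[0] + WILDCARD) for label in labels]
    return [" ".join(choice) for choice in product(*options)]


def by_level(labels):
    """Group patterns by the number of wildcarded slots (0 = most specific)."""
    levels = {}
    for pattern in generalizations(labels):
        n_wild = sum(part.endswith(WILDCARD) for part in pattern.split())
        levels.setdefault(n_wild, []).append(pattern)
    return levels
```

For three concept types this produces 8 patterns in four levels (1, 3, 3, 1), matching the rows of the diagram. Every additional concept type doubles the number of groupings, which is one way to see why grouping by category combination becomes hard to present.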
Large Category Sets
  Problems for User Interfaces
    Too many categories to browse
    Too many docs per category
    Docs belong to multiple categories
    Need to integrate search
    Need to show the documents
Grateful Med Query Specification
Grateful Med Category SubTree
Using Grateful Med
  Problems:
    Does not integrate category selection with viewing of categories
    Only a few categories visible at a time, with little context
    Does not show relationship of retrieved documents to the category structure
Cat-a-Cone: Multiple Simultaneous Categories (Hearst & Karadi 97)
  Key Ideas:
    Separate documents from category labels
    Show both simultaneously
  Link the two for iterative feedback
  Distinguish between:
    Searching for Documents vs. Searching for Categories
[Diagram: Collection, Category Hierarchy, and Retrieved Documents; labels: search, browse, query terms]
Cat-a-Cone (Hearst & Karadi 97)
  Catacomb: (definition 2b, online Websters) “A complex set of interrelated things”
  Makes use of earlier PARC work on 3D+animation:
    Rooms (Henderson and Card 86)
    IV: Cone Tree (Robertson, Card, Mackinlay 93)
    Web Book (Card, Robertson, York 96)
ConeTree for Category Labels
  Browse/explore category hierarchy
    by search on label names
    by growing/shrinking subtrees
    by spinning subtrees
  Affordances
    learn meaning via ancestors, siblings
    disambiguate meanings
    all cats simultaneously viewable
Virtual Book for Result Sets
  Categories on Page (Retrieved Document) linked to Categories in Tree
  Flipping through Book Pages causes some Subtrees to Expand and Contract
  Most Subtrees remain unchanged
  Book can be Stored for later Re-Use
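One way to picture the page-to-tree linkage is as a set difference over the paths that must be visible for each page. The sketch below is an assumption-laden illustration and not the Cat-a-Cone implementation; the `PARENT` hierarchy and both function names are hypothetical.

```python
# Hypothetical category hierarchy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "disease": "root",
    "neoplasms": "disease",
    "breast neoplasms": "neoplasms",
    "therapy": "root",
    "radiotherapy": "therapy",
    "surgery": "therapy",
    "mastectomy": "surgery",
}


def visible_nodes(page_categories, parent=PARENT):
    """Return the page's categories together with all of their ancestors."""
    shown = set()
    for node in page_categories:
        # Walk up to the root, stopping early at already-visited ancestors.
        while node is not None and node not in shown:
            shown.add(node)
            node = parent[node]
    return shown


def flip(old_page, new_page):
    """Return (to_expand, to_contract) when turning from one page to the next."""
    old, new = visible_nodes(old_page), visible_nodes(new_page)
    return new - old, old - new
```

Flipping from a page labeled {breast neoplasms, mastectomy} to one labeled {breast neoplasms, radiotherapy} expands only `radiotherapy` and contracts `mastectomy` and `surgery`; the shared paths stay as they are, which corresponds to most subtrees remaining unchanged.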
Example Query
  Patient Query on Breast Cancer dataset:
    “Do I have to have radiation if I have a mastectomy, and what would be the effects?”
  How does the user know which categories?
Interactive Category Hierarchy
  Smoothly interlink:
    search over categories
    search over document contents
    browsing of categories
    browsing of retrieved documents
Improvements over Grateful Med
  Integrate category selection with viewing of categories
  Show all categories + context
  Show relationship of retrieved documents to the category structure
Comparison Study
  H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS, to appear
  Comparison: Kohonen Map and Yahoo
  Task:
    “Window shop” for interesting home page
    Repeat with other interface
  Results:
    Starting with map could repeat in Yahoo (8/11)
    Starting with Yahoo unable to repeat in map (2/14)
Concept Landscapes
  [Map regions: Pharmacology, Anatomy, Legal, Disease, Hospitals]
  (e.g., Lin, Chen, Wise et al.)
  Single concept per document
  No titles
  Browsing without search
Comparison Study (cont.)
  Participants liked:
    Correspondence of region size to number of documents in region
    Overview (but also wanted zoom)
    Ease of jumping from one topic to another
    Multiple routes to topics
    Use of category and subcategory labels
Comparison Study (cont.)
  Participants wanted:
    hierarchical organization
    other ordering of concepts (alphabetical)
    integration of browsing and search
    correspondence of color to meaning
    more meaningful labels
    labels at same level of abstraction
    fit more labels in the given space
    combined keyword and category search
    multiple category assignment (sports+entertain)
Comparison Study (cont.)
  Cat-a-Cone
    contains most of the desired properties
    lacks the disliked properties
Summary: Cat-a-Cone
  Interface that smoothly integrates
    search over multiple categories
    search over document contents
    browsing of multiple categories
    browsing of retrieved documents
  Iterative, Interactive
  Retain partial results in a workspace