CS 430: Information Discovery Lecture 21 Thesauruses and Gazetteers.


CS 430: Information Discovery

Lecture 21

Thesauruses and Gazetteers


Course Administration


Lexicon and Thesaurus

Lexicon contains information about words, their morphological variants, and their grammatical usage.

Thesaurus relates words by meaning:

ship, vessel, sail; craft, navy, marine, fleet, flotilla

book, writing, work, volume, tome, tract, codex

search, discovery, detection, find, revelation

(From Roget's Thesaurus, 1911)


Thesaurus in Information Retrieval

Use of a thesaurus in indexing (precoordination)

A. Manual

Used to guide human indexer to assign standard terms and associations.

computer-aided instruction
    see also education
    UF teaching machines
    BT educational computing
    TT computer applications
    RT education
    RT teaching

From: INSPEC Thesaurus
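The record above can be read as a small data structure. The following is a minimal sketch of one way to hold it, reading the abbreviations in their usual thesaurus sense (UF = used for, BT = broader term, TT = top term, RT = related term). The class and function names are hypothetical and are not part of INSPEC.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One controlled-vocabulary record (hypothetical layout)."""
    preferred: str                                  # the standard term
    see_also: list = field(default_factory=list)
    used_for: list = field(default_factory=list)    # UF: non-preferred variants
    broader: list = field(default_factory=list)     # BT
    top: list = field(default_factory=list)         # TT
    related: list = field(default_factory=list)     # RT


# The sample record from the slide.
ENTRY = ThesaurusEntry(
    preferred="computer-aided instruction",
    see_also=["education"],
    used_for=["teaching machines"],
    broader=["educational computing"],
    top=["computer applications"],
    related=["education", "teaching"],
)


def build_lookup(entries):
    """Map every preferred term and every UF variant to the preferred term."""
    lookup = {}
    for entry in entries:
        lookup[entry.preferred] = entry.preferred
        for variant in entry.used_for:
            lookup[variant] = entry.preferred
    return lookup


LOOKUP = build_lookup([ENTRY])


def standard_term(term):
    """Return the standard term an indexer should assign, or None if unknown."""
    return LOOKUP.get(term.lower())
```

An indexer who meets "teaching machines" in a document would be guided to assign "computer-aided instruction"; the BT and RT lists suggest further terms to consider.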


Thesaurus in Information Retrieval

Use of a thesaurus in indexing (precoordination)

B. Automatic

Divide terms into thesaurus classes. Replace similar terms by a thesaurus class.

408  dislocation, junction, minority-carrier, n-p-n, p-n-p, point-contact, recombine, transition, unijunction

409  blast-cooled, heat-flow, heat-transfer

410  anneal, strain

From: Salton and McGill
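A minimal sketch of the replacement step, using the three classes as read from the excerpt above. The function name and the choice to leave unknown terms unchanged are assumptions made for illustration, not a description of Salton and McGill's implementation.

```python
# Thesaurus classes as read from the excerpt (class number -> member terms).
THESAURUS_CLASSES = {
    "408": ["dislocation", "junction", "minority-carrier", "n-p-n", "p-n-p",
            "point-contact", "recombine", "transition", "unijunction"],
    "409": ["blast-cooled", "heat-flow", "heat-transfer"],
    "410": ["anneal", "strain"],
}

# Invert the table so each term points at its class number.
TERM_TO_CLASS = {
    term: class_id
    for class_id, terms in THESAURUS_CLASSES.items()
    for term in terms
}


def to_class_ids(terms):
    """Replace each term by its thesaurus class; keep terms with no class."""
    return [TERM_TO_CLASS.get(term, term) for term in terms]
```

With this mapping, a document containing "heat-flow" and a query containing "heat-transfer" both index under class 409 and therefore match.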


Desirable Properties for Information Retrieval

• Thesaurus is specific to a subject area. Contains only terms of interest for identification within that subject area.

• Ambiguous terms are coded only for the senses important for that field.

• Each thesaurus class should contain terms of moderate frequency. Ideally, all classes should have similar total frequency.
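One way to check the frequency property is to sum term counts per class and compare the totals. The sketch below assumes collection-wide term counts are already available; the counts, the two-class mapping, and the function names are invented for illustration.

```python
from collections import Counter


def class_frequencies(term_counts, term_to_class):
    """Sum the frequency of every term that belongs to a thesaurus class."""
    totals = Counter()
    for term, count in term_counts.items():
        class_id = term_to_class.get(term)
        if class_id is not None:          # terms outside the thesaurus are skipped
            totals[class_id] += count
    return dict(totals)


def frequency_spread(frequencies):
    """Ratio of the largest to the smallest class total (1.0 means balanced)."""
    return max(frequencies.values()) / min(frequencies.values())


# Hypothetical data: a two-class thesaurus and made-up collection counts.
term_to_class = {"heat-flow": "409", "heat-transfer": "409",
                 "anneal": "410", "strain": "410"}
term_counts = {"heat-flow": 12, "heat-transfer": 18,
               "anneal": 9, "strain": 21, "silicon": 40}

frequencies = class_frequencies(term_counts, term_to_class)
```

A spread far above 1.0 would suggest splitting a heavy class or merging light ones.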


Art and Architecture Thesaurus

• Controlled vocabulary for describing and retrieving information: fine art, architecture, decorative art, and material culture.

• Almost 120,000 terms for objects, textual materials, images, architecture, and culture from all periods and all cultures.

• Used by archives, museums, and libraries to describe items in their collections.

• Used to search for materials.

• Used by computer programs for information retrieval and natural language processing.

A project of the J. Paul Getty Trust


Art and Architecture Thesaurus

Provides the terminology for objects, and the vocabulary necessary to describe them, such as style, period, shape, color, construction, or use, and scholarly concepts, such as theories, or criticism.

Concept:

a cluster of terms, one of which is established as the preferred term, or descriptor.

Categories:

associated concepts, physical attributes, styles and periods, agents, activities, materials, and objects.


Art and Architecture Thesaurus: Sample Record

Record ID: 198841

Descriptor: rhyta

Note: Refers to vessels from Ancient Greece, eastern Europe, or the Middle East that typically have a closed form with two openings, one at the top for filling and one at the base so that liquid could stream out. They are often in the shape of a horn or an animal's head, and were typically used as a drinking cup or for pouring wine into another vessel.

Hierarchy:

Containers [TQ]
    <containers by function or context>
        <culinary containers>
            <containers for serving and consuming food>


Art and Architecture Thesaurus: Sample Record (continued)

Terms:

rhyta
rhyton (alternate, singular)
protomai
protome
rhea
rheon
rheons

Related concepts:

stirrup cups
sturzbechers
drinking vessels
ceremonial vessels


MeSH -- Medical Subject Headings

Controlled vocabulary for indexing articles, for cataloging books and other holdings, and for searching MeSH-indexed databases, including MEDLINE.

• About 19,000 primary subject headings.

• Thesaurus of 110,000 chemical terms.

• Total vocabulary over 300,000 terms.

National Library of Medicine provides MeSH subject headings for each of the 400,000 articles that it indexes every year.

"MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts."


MeSH -- Medical Subject Headings

MeSH hierarchy:

general terms, e.g., anatomy, organisms, diseases, biological sciences;

anatomy is divided into sixteen topics, e.g., body regions and musculoskeletal system;

body regions is divided into sections, e.g., abdomen, axilla, back

etc.


Example of MeSH hierarchy

Biological Sciences [G]
    Biological Sciences [G01] +
    Health Occupations [G02] +
    Environment and Public Health [G03] +
    Biological Phenomena, Cell Phenomena, and Immunity [G04] +
    Genetics [G05] +
    Biochemical Phenomena, Metabolism, and Nutrition [G06] +
    Physiological Processes [G07] +
    Reproductive and Urinary Physiology [G08] +
    Circulatory and Respiratory Physiology [G09] +
    Digestive, Oral, and Skin Physiology [G10] +
    Musculoskeletal, Neural, and Ocular Physiology [G11] +
    Chemical and Pharmacologic Phenomena [G12] +


Example of MeSH hierarchy (continued)

Physiological Processes [G07]
    Adaptation, Physiological [G07.062] +
    Aging [G07.168] +
    Body Constitution [G07.265] +
    Body Temperature [G07.315]
        Body Temperature Regulation [G07.315.232] +
        Skin Temperature [G07.315.753]
    Chronobiology [G07.450] +
    Electrophysiology [G07.453] +
    Fluid Shifts [G07.503]
    Growth and Embryonic Development [G07.553] +
    Homeostasis [G07.621] +
    Tensile Strength [G07.900]
    Tropism [G07.950] +


Example of MeSH hierarchy (continued)

MeSH Heading Body Temperature

Tree Number E01.370.600.120

Tree Number G07.315

Entry Term Organ Temperature

See Also Fever

See Also Thermography

See Also Thermometers

Allowable Qualifiers DE GE IM PH RE

Unique ID D001831
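The tree numbers encode the hierarchy directly: each dot-separated segment is one level, so a heading's ancestors are the prefixes of its tree number, and a heading with two tree numbers (as here) sits in two places in the hierarchy. The sketch below illustrates that reading; the function names are hypothetical and it is not an NLM tool.

```python
def ancestors(tree_number):
    """Return the tree numbers of all ancestors, from the top level down."""
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts))]


def is_descendant(tree_number, ancestor):
    """True if tree_number lies strictly below ancestor in the hierarchy."""
    return tree_number.startswith(ancestor + ".")


# The two tree numbers listed for the heading "Body Temperature".
BODY_TEMPERATURE = ["E01.370.600.120", "G07.315"]

# Its parent in each hierarchy is the last ancestor of each tree number.
PARENTS = [ancestors(number)[-1] for number in BODY_TEMPERATURE]
```

A search on a heading can be widened to everything beneath it by selecting every tree number for which `is_descendant` is true, for example G07.315.232 and G07.315.753 under G07.315.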


Observations about Manually Maintained Thesauruses

• Permit a very rich structure of relationships

• Most effective when the user of the search system is skilled in the discipline and trained in the use of the thesaurus (e.g., a medical librarian)

• Need continual updating as a field develops new terminology

• Expensive to create and maintain


Gazetteers

The Alexandria Digital Library (ADL): a geolibrary at the University of California, Santa Barbara, in which a primary attribute of each object (e.g., a map or satellite photograph) is its location on Earth.

Geographic footprint: latitude and longitude values that represent a point, a bounding box, a linear feature, or a complete polygonal boundary.

Gazetteer: list of geographic names, with geographic locations and other descriptive information.

Geographic name: proper name for a geographic place or feature (e.g., Santa Barbara County, Mount Washington, St. Francis Hospital, and Southern California)


Alexandria Thesaurus: Example

canals

A feature type category for places such as the Erie Canal.

Used for:

The category canals is used instead of any of the following.

canal bends, canalized streams, ditch mouths, ditches, drainage canals, drainage ditches, ... more ...

Broader Terms:

Canals is a sub-type of hydrographic structures.


Alexandria Thesaurus: Example (continued)

canals (continued)

Related Terms:

The following is a list of other categories related to canals (non-hierarchical relationships).

channels, locks, transportation features, tunnels

Scope Note:

Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power.


Use of a Gazetteer

• Answers the "Where is" question; for example, "Where is Santa Barbara?"

• Translates between geographic names and locations. A user can find objects by matching the footprint of a geographic name to the footprints of the collection objects.

• Locates particular types of geographic features in a designated area. For example, a user can draw a box around an area on a map and find the schools, hospitals, lakes, or volcanoes in the area.
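The first and third uses can be sketched with a toy in-memory gazetteer. The entries below are the Tulsa features from the lecture's sample search, with coordinates converted to rounded decimal degrees and point footprints only; the class, field, and function names are hypothetical and do not reflect the Alexandria implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    feature_type: str
    lat: float   # decimal degrees, north positive
    lon: float   # decimal degrees, east positive


GAZETTEER = [
    Entry("Tulsa", "pop pl", 36.1539, -95.9925),
    Entry("Tulsa Country Club", "locale", 36.1661, -96.0033),
    Entry("Tulsa County", "civil", 36.1000, -95.9000),
    Entry("Tulsa Helicopters Incorporated Heliport", "airport", 36.0833, -95.8681),
]


def where_is(name):
    """Answer 'Where is X?' with every location recorded under that exact name."""
    return [(e.lat, e.lon) for e in GAZETTEER if e.name == name]


def features_in_box(south, west, north, east, feature_type=None):
    """Names of features whose point falls inside the box, optionally by type.

    Simplification: the box must not cross the 180-degree meridian.
    """
    return [
        e.name
        for e in GAZETTEER
        if south <= e.lat <= north
        and west <= e.lon <= east
        and (feature_type is None or e.feature_type == feature_type)
    ]
```

Drawing a box around the Tulsa area and asking for airports would correspond to `features_in_box(36.0, -96.1, 36.2, -95.8, "airport")`. Matching collection objects to a name works the same way, with the name's footprint used as the query box.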


Alexandria Gazetteer: Example from a search on "Tulsa"

Feature name                              State  County  Type     Latitude  Longitude
Tulsa                                     OK     Tulsa   pop pl   360914N   0955933W
Tulsa Country Club                        OK     Osage   locale   360958N   0960012W
Tulsa County                              OK     Tulsa   civil    360600N   0955400W
Tulsa Helicopters Incorporated Heliport   OK     Tulsa   airport  360500N   0955205W
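The coordinates in this result are read here as packed degrees, minutes, and seconds followed by a hemisphere letter (DDMMSS for latitude, DDDMMSS for longitude); that reading is an assumption based on the values shown. Under it, a conversion to signed decimal degrees can be sketched as follows, with a hypothetical function name.

```python
def dms_to_decimal(value):
    """Convert packed DMS such as '360914N' or '0955933W' to decimal degrees.

    South and west come back negative. Assumes two-digit minutes and seconds
    at the end of the digit string, with the degrees in front of them.
    """
    hemisphere = value[-1].upper()
    digits = value[:-1]
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere in {value!r}")
    if not digits.isdigit() or len(digits) not in (6, 7):
        raise ValueError(f"expected DDMMSS or DDDMMSS digits in {value!r}")

    seconds = int(digits[-2:])
    minutes = int(digits[-4:-2])
    degrees = int(digits[:-4])

    decimal = degrees + minutes / 60 + seconds / 3600
    return -decimal if hemisphere in ("S", "W") else decimal
```

For the first row, 36 degrees 9 minutes 14 seconds north becomes about 36.1539, and 95 degrees 59 minutes 33 seconds west becomes -95.9925.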


Challenges for the Alexandria Gazetteer

Content standard: A standard conceptual schema for gazetteer information.

Feature types: A type scheme for categorizing individual features that is rich in term variants and extensible.

Temporal aspects: Geographic names and attributes change through time.

"Fuzzy" footprints: Extent of a geographic feature is often approximate or ill-defined (e.g., Southern California).


Challenges for the Alexandria Gazetteer (continued)

Quality aspects:

(a) Indicate the accuracy of latitude and longitude data.

(b) Ensure that the reported coordinates agree with the other elements of the description.

Spatial extents:

(a) Points do not represent the extent of the geographic locations and are therefore only minimally useful.

(b) Bounding boxes often include too much territory (e.g., the bounding box for California also includes Nevada).
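The bounding-box problem can be illustrated with a deliberately crude outline of California. The eight vertices below are rough approximations chosen for this sketch, not survey data, and the city coordinates are rounded. The polygon test is the standard ray-casting method, which is offered as one possible approach rather than the one used by Alexandria.

```python
# Very rough California outline as (longitude, latitude) pairs. Illustrative only.
CALIFORNIA = [
    (-124.4, 42.0), (-120.0, 42.0), (-120.0, 39.0), (-114.6, 35.0),
    (-114.7, 32.7), (-117.1, 32.5), (-120.6, 34.5), (-124.4, 40.4),
]


def bounding_box(polygon):
    """Smallest (west, south, east, north) box that contains the polygon."""
    lons = [lon for lon, _ in polygon]
    lats = [lat for _, lat in polygon]
    return min(lons), min(lats), max(lons), max(lats)


def in_box(lon, lat, box):
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def in_polygon(lon, lat, polygon):
    """Ray casting: count edge crossings of a ray running east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):                  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:                         # crossing lies east of the point
                inside = not inside
    return inside


LAS_VEGAS = (-115.14, 36.17)      # Nevada
LOS_ANGELES = (-118.24, 34.05)    # California
BOX = bounding_box(CALIFORNIA)
```

Las Vegas falls inside the box but outside the polygon, while Los Angeles falls inside both, which is the false match that item (b) describes.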


Examples of Gazetteers

Alexandria Digital Library

Linda L. Hill, James Frew, and Qi Zheng, "Geographic Names: The Implementation of a Gazetteer in a Georeferenced Digital Library," D-Lib Magazine, 5(1), January 1999. http://www.dlib.org/dlib/january99/hill/01hill.html

Getty Thesaurus of Geographic Names

http://www.getty.edu/research/tools/vocabulary/tgn/