Terminology identification from full text: OCLC’s WordSmith experience

Jean Godby, Senior Research Scientist, OCLC Online Computer Library Center, Inc.
SOASIST Full-Day Workshop on Aboutness, June 21, 2001


Page 1: Terminology identification from full text: OCLC’s WordSmith experience

Terminology identification from full text: OCLC’s WordSmith experience

Jean Godby
Senior Research Scientist
OCLC Online Computer Library Center, Inc.

SOASIST Full-Day Workshop on Aboutness
June 21, 2001

Page 2: Terminology identification from full text: OCLC’s WordSmith experience

Outline of this talk

• The need for terminology

• Sources of terminology

• Extracting terminology from free text

• Organizing it

• Mapping it to library classification schemes

Page 3: Terminology identification from full text: OCLC’s WordSmith experience

Increasing subject access to document collections

[Diagram: a spectrum running from "More human effort" and a "More abstract view of the data" to "Less human effort" and a "Less abstract" view. Labels on the diagram: Cataloging, Classification, Indexing, Tokenizing, WordSmith, Scorpion, Classification Research.]

Page 4: Terminology identification from full text: OCLC’s WordSmith experience

Subject terminology from library classification schemes

• Strengths
– Derived from scholarship in subject analysis and classification theory
– Permits interoperability between Web resources and traditional published materials

• Weaknesses
– Literary warrant is based on traditional published materials.
– Human effort is required to keep them current.
– They must be modified for use in automated systems.
– They aren’t free.

Page 5: Terminology identification from full text: OCLC’s WordSmith experience

Subject terminology from full text

• Strengths
– Literary warrant is based on current text.
– Coverage is not restricted to traditionally published material.
– The style is closer to the user’s vocabulary.

• Weaknesses
– The data is noisy and difficult to organize.

Page 6: Terminology identification from full text: OCLC’s WordSmith experience

Terminology identification

• ...is an essential first step in the analysis of a document's content.

• ...is one of the most mature research subjects in natural language processing.

Page 7: Terminology identification from full text: OCLC’s WordSmith experience

Lexical phrases

• Are the names of persistent concepts.

• Act like words.

• Are commonly used to name new concepts in rapidly evolving technical subject domains.

Page 8: Terminology identification from full text: OCLC’s WordSmith experience

A lexical phrase: “Recurrent erosion”

Page 9: Terminology identification from full text: OCLC’s WordSmith experience

Not a lexical phrase: “Recurrent problem”

Page 10: Terminology identification from full text: OCLC’s WordSmith experience

Identifying lexical phrases

Tokenized text: ...Planetary scientists think the convex shape came about as lava welled up beneath the crater's solid floor….

Ngrams: planetary scientists think, convex shape, welled up, coincided with, five times greater than, easiest way, Milky Way, absolute magnitudes brighter than, added material, advanced study, African American

Index filter: planetary scientists, convex shape, easiest way, Milky Way, absolute magnitudes, added material, advanced study, African American

Topic filter: planetary scientists, Milky Way
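The four stages on this slide can be sketched in a few lines of Python. This is a minimal illustration, not WordSmith's actual implementation: the stopword list, the "no stopwords" rule in `index_filter`, and the recurrence threshold in `topic_filter` are all assumptions made for the example, as is the sample text.

```python
import re
from collections import Counter

# Hypothetical stopword list (function words plus a few common verbs);
# a real system would use a much larger, tuned list.
STOPWORDS = {"the", "a", "an", "as", "up", "with", "than", "came", "about",
             "think", "of", "in", "and", "is"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, max_n=3):
    """Yield every contiguous sequence of 2..max_n tokens."""
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def index_filter(grams):
    """Keep only n-grams that contain no stopwords (an assumed rule)."""
    return [g for g in grams if not any(w in STOPWORDS for w in g)]

def topic_filter(grams, min_count=2):
    """Keep n-grams that recur at least min_count times (an assumed rule)."""
    counts = Counter(grams)
    return sorted(" ".join(g) for g, c in counts.items() if c >= min_count)

text = ("Planetary scientists think the convex shape came about as lava "
        "welled up. Planetary scientists study the Milky Way. "
        "The Milky Way has a convex bulge.")
candidates = index_filter(ngrams(tokenize(text)))
topics = topic_filter(candidates)
print(topics)
```

In this toy sample, "convex shape" survives the index filter but occurs only once, so the topic filter drops it, which mirrors the narrowing shown on the slide.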

Page 11: Terminology identification from full text: OCLC’s WordSmith experience

Strategies in the topic filter

• Word/phrase frequency and strength of association

• “Knowledge-poor” text analysis

• More sophisticated but computable text analysis

Page 12: Terminology identification from full text: OCLC’s WordSmith experience

Word and phrase frequencies

• Word/phrase frequency
high: dublin core, metadata, element, electronic resources
low: availability period, background, applicable terminologies

• Weighted frequency
1. core element, date element, metadata element
2. author name, entity name, corporate name
3. HTML tag, end tag, meta tag
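Raw frequency and "strength of association" can be computed together. The slides do not say which association measure WordSmith uses; the sketch below assumes pointwise mutual information (PMI), a common choice, and the sample sentence is invented for illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {bigram: (count, PMI)} for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi          # probability of the pair
        p_w1 = unigrams[w1] / n_uni    # probability of each word alone
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = (count, math.log2(p_pair / (p_w1 * p_w2)))
    return scores

tokens = ("the dublin core metadata element set defines the core element "
          "and the date element of the dublin core").split()
scores = bigram_pmi(tokens)

# Rank pairs by association strength, strongest first.
for pair, (count, pmi) in sorted(scores.items(), key=lambda kv: -kv[1][1])[:5]:
    print(" ".join(pair), count, round(pmi, 2))
```

A pair such as "dublin core" scores higher than "the core" because its words occur together more often than their individual frequencies would predict. PMI is known to favor rare pairs, so in practice it is usually combined with a minimum frequency.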

Page 13: Terminology identification from full text: OCLC’s WordSmith experience

Knowledge-poor techniques 1: parts of speech in local context

• Some noun phrase heads usually appear in text only with adjective or noun modifiers.

holes--black holes, grey holes, central holes

• Others usually appear without modifiers.

galaxy--cartwheel galaxies, spiral galaxy
a galaxy, if galaxies; ...however, galaxy formation
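One way to approximate this test without a part-of-speech tagger is to measure how often a head noun is immediately preceded by a content word. The sketch below is an assumption-laden illustration: the `FUNCTION_WORDS` list is hypothetical, and "preceded by a non-function word" is only a rough stand-in for "has an adjective or noun modifier".

```python
import re

# Hypothetical list of function words that do not count as modifiers.
FUNCTION_WORDS = {"a", "an", "the", "if", "however", "of", "in", "and",
                  "this", "that", "these", "those", "is", "are", "into"}

def modifier_ratio(text, head_forms):
    """Fraction of occurrences of the head that directly follow a content word."""
    # Punctuation is kept as separate tokens so that a head opening a clause
    # is not treated as modified by the last word of the previous clause.
    tokens = re.findall(r"[a-z]+|[.;,!?]", text.lower())
    modified = total = 0
    for i, tok in enumerate(tokens):
        if tok in head_forms:
            total += 1
            prev = tokens[i - 1] if i > 0 else "."
            if prev.isalpha() and prev not in FUNCTION_WORDS:
                modified += 1
    return modified / total if total else 0.0

holes_text = ("Black holes are dense. Grey holes are hypothetical. "
              "Central holes appear in the disk.")
galaxy_text = ("A galaxy may collide with another. If galaxies merge, "
               "a spiral galaxy can form. However, galaxy formation is slow.")

holes_ratio = modifier_ratio(holes_text, {"hole", "holes"})
galaxy_ratio = modifier_ratio(galaxy_text, {"galaxy", "galaxies"})
print(holes_ratio, galaxy_ratio)
```

A head with a ratio near 1 (like "holes" here) rarely stands alone, while a head with a low ratio (like "galaxy") is a candidate topical term on its own.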

Page 14: Terminology identification from full text: OCLC’s WordSmith experience

Consequences

• We can identify topical single terms:

galaxy, star, sun, moon
government, abortion, communism
metadata, html, Internet, information

• We can create subject taxonomies:

galaxy (-ies)          *hole(s)
cartwheel galaxy       black holes
elliptical galaxy      drill holes
spiral galaxy          grey holes
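A taxonomy of this kind can be approximated by grouping extracted phrases under their final (head) word. The sketch below is illustrative only: the `singular` helper is a crude, hypothetical plural stripper rather than a real stemmer, and treating the last word as the head is an assumption that holds for simple English noun phrases.

```python
from collections import defaultdict

def singular(word):
    """Very rough plural stripping; an assumption, not a real stemmer."""
    if word.endswith("ies"):
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def taxonomy(phrases):
    """Map each head noun to the sorted multiword phrases it heads."""
    groups = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        if len(words) > 1:
            head = singular(words[-1])
            groups[head].add(" ".join(words[:-1] + [head]))
    return {head: sorted(members) for head, members in groups.items()}

phrases = ["cartwheel galaxies", "spiral galaxy", "elliptical galaxy",
           "black holes", "drill holes", "grey holes"]
tree = taxonomy(phrases)
for head, members in tree.items():
    print(head, "->", ", ".join(members))
```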

Page 15: Terminology identification from full text: OCLC’s WordSmith experience

Knowledge-poor techniques 2: subject probes

• Goal: to get high-quality subject terms

• Look for indications that something is talked about, written about, or studied: topics in, study of, analysis of, (on the) subject of, major in, is called, is known as

• Probes differ in specificity.

topics in: sciences, arts, humanities, library science, astronomy, physics, business, data visualization, computer science, mathematics, computer and network security, mathematics, number theory, medicine

analysis of: metabolic regulation, numerical analysis, saline water phenomena, coals, iron ore, cereal grains, income dynamics among men, working hours, inflation, mass belief systems, aerial photography
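Subject probes lend themselves to simple pattern matching. In the sketch below the probe phrases come from the slide, but everything else is an assumption for illustration: the rule for where a captured term ends (at punctuation or a word in the hypothetical `STOP` list) and the sample sentence.

```python
import re

# Probe phrases taken from the slide.
PROBES = ["topics in", "study of", "analysis of", "subject of",
          "major in", "is called", "is known as"]
PROBE_RE = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in PROBES) + r")\b",
    re.IGNORECASE,
)

# Hypothetical words that end a candidate term.
STOP = {"the", "a", "an", "and", "or", "at", "for", "with", "is", "are",
        "was", "to", "that", "by", "in", "on"}

def probe_terms(text):
    """Return (probe, term) pairs for each probe found in the text."""
    results = []
    for match in PROBE_RE.finditer(text):
        probe = match.group(1).lower()
        words = []
        # Collect words after the probe until punctuation or a stop word.
        for word in re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text[match.end():]):
            if not word.isalpha() or word.lower() in STOP:
                break
            words.append(word.lower())
        if words:
            results.append((probe, " ".join(words)))
    return results

text = ("She wrote an analysis of income dynamics among men. He took "
        "topics in number theory, and this field is known as data "
        "visualization.")
found = probe_terms(text)
for probe, term in found:
    print(f"{probe}: {term}")
```

A probe followed directly by an article ("analysis of the ...") yields nothing under this rule; a fuller version would skip leading determiners.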

Page 16: Terminology identification from full text: OCLC’s WordSmith experience

More clues can be identified with “knowledge-rich” processing

You can sum up the big difference between beans on the one hand and Java applets and applications on the other in one word (okay, two words) : component model. Chapter 2 contains a nice, thorough discussion of component models (which is a pretty important concept, so I devoted an entire chapter to the subject).

Java Beans for Dummies. Emily Vander Veer. Chicago, IL: IDG Books Worldwide. 1997, p. 14.

Page 17: Terminology identification from full text: OCLC’s WordSmith experience

Some results

Page 18: Terminology identification from full text: OCLC’s WordSmith experience

Terminology lists: tokenizing vs. indexing

Tokenizing:
have
havei
havel
haven
havens
havera
haverty
havey
havice
havill
havilland

Indexing:
health care
health care coverage
health insurance
housing
housing policy
...
world trade
world trade accord
world trade agreement
world trade center
world trade center bombing

Page 19: Terminology identification from full text: OCLC’s WordSmith experience

Terminology extraction works best with:

• Full text

• Collections of text, not isolated documents

• Text from a single subject domain

• Algorithms that are tuned to the style of the text

Page 20: Terminology identification from full text: OCLC’s WordSmith experience

An application: browse displays

Page 21: Terminology identification from full text: OCLC’s WordSmith experience

Organizing terminology

[Diagram: variant terms linked by labeled relationships. Terms shown: Dewey; Dewey Decimal; Dewey numbers; Dewey call numbers; Dewey decimal classification numbers; cutter numbers; DDC; DDC and LCSH; Library of Congress Subject Headings; Subject Headings. Relationship labels: B/N (Broad/Narrow), Acronym, Ellipsis, Coordination.]

Page 22: Terminology identification from full text: OCLC’s WordSmith experience

An application: a topic map for a collection of Web resources

Page 23: Terminology identification from full text: OCLC’s WordSmith experience

Another application: a terminology server

Page 24: Terminology identification from full text: OCLC’s WordSmith experience

Mapping vocabulary to library classification schemes

• Explicit
– For each document in a collection, extract terminology using WordSmith.
– Assign Dewey Decimal Classification (DDC) numbers using Scorpion.
– Identify the highest associations between extracted terms and DDC numbers.

• Implicit
– Make both sources of subject information available in a user interface.
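The third step of the explicit approach, finding the strongest term-to-class associations, can be sketched as follows. The slides do not name the association measure, so document-level pointwise mutual information is assumed here. The input format, the function name `best_ddc_for_terms`, and the sample data (including the class numbers, which are placeholders) are hypothetical; in the real workflow the terms would come from WordSmith and the class numbers from Scorpion.

```python
import math
from collections import Counter

def best_ddc_for_terms(docs):
    """docs: list of (set_of_terms, ddc_number). Returns {term: (ddc, pmi)}."""
    n_docs = len(docs)
    term_counts = Counter()   # documents containing each term
    ddc_counts = Counter()    # documents assigned each class number
    pair_counts = Counter()   # documents with both the term and the class
    for terms, ddc in docs:
        ddc_counts[ddc] += 1
        for term in set(terms):
            term_counts[term] += 1
            pair_counts[(term, ddc)] += 1
    best = {}
    for (term, ddc), count in pair_counts.items():
        # PMI over documents: log2( P(term, ddc) / (P(term) * P(ddc)) )
        pmi = math.log2(count * n_docs / (term_counts[term] * ddc_counts[ddc]))
        if term not in best or pmi > best[term][1]:
            best[term] = (ddc, pmi)
    return best

# Placeholder data: terms per document plus one assigned class number.
docs = [
    ({"milky way", "black holes"}, "523"),
    ({"milky way", "spiral galaxy"}, "523"),
    ({"water pollution", "air pollution"}, "363"),
    ({"water pollution"}, "363"),
]
best = best_ddc_for_terms(docs)
for term, (ddc, pmi) in sorted(best.items()):
    print(term, ddc, round(pmi, 2))
```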

Page 25: Terminology identification from full text: OCLC’s WordSmith experience

Terminology mapping works best when:

• The upstream processes for extracting terminology are clean.

• It operates on a large collection of domain-specific text.

• The classification scheme is simplified.

Page 26: Terminology identification from full text: OCLC’s WordSmith experience

The Desire database of Web documents about engineering

Page 27: Terminology identification from full text: OCLC’s WordSmith experience

Science aspects

Page 28: Terminology identification from full text: OCLC’s WordSmith experience

Social science aspects

Page 29: Terminology identification from full text: OCLC’s WordSmith experience

Links to documents about other types of pollution

Page 30: Terminology identification from full text: OCLC’s WordSmith experience

In sum

• We can automatically extract useful terminology from full text.

• The terminology can be embedded in applications of varying complexity.

• There is a tradeoff between accuracy and technical sophistication.

Page 31: Terminology identification from full text: OCLC’s WordSmith experience

For more information

Godby, Jean and Reighart, Ray. 1998. “The WordSmith indexing system.” Accessible at: http://www.oclc.org/oclc/research/publications/review98/godby_reighart/wordsmith.htm

Godby, Jean; Miller, Eric; and Reighart, Ray. 2000. “Automatically generated topic maps of World Wide Web resources.” Accessible at: http://www.oclc.org/oclc/research/publications/review99/godby/topicmaps.htm

Godby, Jean and Reighart, Ray. 2001. “Terminology identification in a collection of Web resources.” In: K. Calhoun and J. Riemer, eds. CORC: New tools and possibilities for electronic resource description. New York: The Haworth Press, Inc., 49-66.