A New Kind of Catalog Charley Pennell Principal Cataloger for Metadata North Carolina State...

Post on 17-Dec-2015


A New Kind of Catalog

Charley Pennell

Principal Cataloger for Metadata

North Carolina State University

North Carolina Library Association 2007

Where is this talk headed?

Local motivation

National trends

What is Endeca?

Features

Does Endeca work?

Where are we going from here?

Where is everybody else going?

Why a new catalog? What was wrong with the old one?

A little TRLN catalog primer

TRLN libraries (Duke, NCCU, NCSU, UNC-CH) jointly develop and maintain BIS, 1985-1992

DRA implemented for catalog (UNC & Duke continue Acq/Serials modules), 1991-1993

No integrated keyword/browse capability, 1993-1999

Web2 catalog implemented, 1999-

Sirsi & DRA “merge” in 2002; Taos DOA

A little TRLN catalog primer 2

NCSU & NCCU to Unicorn; Duke to Aleph; UNC-CH to Millennium, 2003-2004

Sirsi/Dynix merger, 2004: vendor focus shifts (even more) toward school/public market

While agreeing to continue to support Web2, S/D increasingly looking to merge all product catalogs into single interface

What was the catalog lacking?

Simplicity: a simple, hopefully uncluttered interface

Interactivity: ways to interact with results to get better results

Forgiveness: just fix my typos and case errors, don’t make me feel stupid!

Response time: always

Real-time sorting: the limit is how many?!!

Relevance ranking: as if!

Web services: use the Web to repurpose data, enable mash-ups, add-ons & improvements

Which interface is ready for immediate use?


So, why DOES everyone think that the catalog stinks?

"Most integrated library systems, as they are currently configured and used, should be removed from public view."

- Roy Tennant, OCLC

The old model

The integrated library system

Historically, the ILS developed as an inventory control system for use by library staff only

First library automation systems (Plessey, CLSI, Geac, Innovative) were designed around circulation or acquisitions functions

Interaction time was calibrated to the slow pace of backroom work where the audience was basically captive

Staff focus on known-item searching, not resource discovery

The catalog as part of the ILS

The first integrated OPACs were veneers on top of existing inventory management systems—patrons & staff competed for system resources! They still do!

First OPACs allowed for browse only; early keyword searching restricted to certain fields (A/T/S) only

Libraries with no IT support were stuck with what their vendor provided and the enhancement process for improvements

Libraries with IT support created their own systems: BIS, NOTIS, Claremont Colleges, Georgetown, PALS, DOBIS/LIBIS

The state of the ILS in 2007

Customer demands for increasing functionality in a marketplace with little $$ to spend have reduced the ILS vendor pool through mergers and buyouts

New functionality (multi-search, ERMS, E-Ref, ILL, etc.) increasingly being met by stand-alone and third party applications

Increasing competition from open source (Koha, Evergreen, Scriblio, LibraryThing) and e-commerce

Q: Is our dogged adherence to MARC the only thing keeping the remaining ILS vendors afloat?

The state of the catalog 2007

Library users’ search expectations have been conditioned by interactions with commercial Websites and Google, with which Libraries can barely afford to compete, but must

Libraries are becoming increasingly virtual as users interact with us online (e-resources, Second Life)

User expectations for online experiences are more interactive, instantaneous, and inviting

Perhaps most importantly…

The information resources represented in the catalog represent a shrinking percentage of what end users need or want

Calhoun’s Aristotelian vs. Copernican views of the catalog

What do users want from the OPAC?

Make subject searching in online catalogs easier using post-Boolean probabilistic searching with automatic spelling correction, term weighting, intelligent stemming, relevance feedback, and output ranking

Streamline users' book selection decisions at the catalog by adding tables of contents and back-of-the-book indexes to cataloging (i.e., metadata) records

Reduce the many failed subject searches by expanding the online catalog with full texts: journal and newspaper articles, encyclopedias, dissertations, government documents, etc.

Increase finding strategies in online catalogs through the library classification

-- Markey, Karen (2007). “The online library catalog: Paradise lost and paradise regained”, D-Lib Magazine, 13(1/2).

“Many researchers express surprise at the brevity (from one to three words) of the queries people submit to online systems. Belkin tells why so few words make up their queries, "Precisely because of the inquirer's lack of knowledge about a problem area, it is impossible to specify what would resolve it." For Belkin, the saving grace is the inquirer's ability to recognize what he or she wants or does not want during the course of the search. Therein lies an important solution to the problem—information systems that report results for easy eyeballing and instantaneous recognition of relevant possibilities.” – Karen Markey

What is an Endeca?

A software company based in Cambridge, MA

A search and information access technology provider for a number of major e-commerce websites

Developers of the Endeca Information Access Platform

Endeca features

Commercial-strength search/sort speeds

Site customizable relevance ranking

Faceted browse

True browsing (LC classification)

Spell-checking

“Did you mean?”

Automatic word stemming

Endeca at NCSU Libraries

Went live in January 2006

Works with a text version of a daily snapshot of Libraries’ MARC & other metadata

Used to improve the discovery portion of the library catalog

Interoperates with ILS for holdings, current availability status

Web2 interface still present for known item & authority searching

Implementation timeline

License / negotiation: Spring 2005

Acquire: Summer 2005

Implementation:
August 2005: vendor training
September 2005: finalize requirements
October 2005 – January 2006: design and development
January 12, 2006: go-live date

Widen to TRLN partners: Winter 2008

Implementation Team

Implementation Team brought together from IT, DLI, Cataloging, Collections, Reference, Circulation

Worked on indexing, UI, usability testing, etc.

Areas of contention:
Number of initial search boxes (1 or 2)
Order, grouping of facets
Placement of classification hierarchies, breadcrumbs
Use of “search” and “browse” on tabs

Visualization aided by Tito’s wireframes

8th (and Final) Revision: Aggregate holdings information by library.

Reduces complexity of continuing and online resources.

Brief view vs. Full view gives user choice about displaying holdings.

NCSU Endeca features

Facets

Results

Call # browse

Breadcrumbs

Features we started with

Faceted browse

Availability facet

Breadcrumbs

Spell check / Did you mean

Hierarchical subject browse based on LCC

Fuzzy link to live Web2 data

New book browse for titles added in last week only

Features that we’ve added

New book browse based on relative date (last week, last month, last three months)

RSS feeds based on user results

“Search within” results

Send search to TRLN partners

Static unique link to live Web2 data

Relevance ranking

Based on locally customizable algorithm:

Most relevant: query exactly as entered

For multi-term searches: phrase match

Field match: title match more relevant than notes match

Other factors:
number of fields matched
weighted frequency
static ordering (publication date, circulation stats)
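A minimal sketch of how a tiered ranking like this could be expressed, in Python. This is not Endeca's algorithm or its configuration syntax: the field names, the weights, and the use of publication year as the static tie-break are all assumptions made for illustration, and term matching is simplified to substring tests. Weighted frequency is left out.

```python
# Illustrative only: fields and weights are assumptions, not Endeca's settings.
FIELD_WEIGHTS = {"title": 3, "subject": 2, "notes": 1}  # hypothetical weights


def score(record, query):
    """Return a sortable tuple; a larger tuple means a more relevant record."""
    q = query.lower().strip()
    terms = q.split()
    exact = phrase = fields_matched = weight = 0
    for field, w in FIELD_WEIGHTS.items():
        text = record.get(field, "").lower()
        if text == q:
            exact = 1                    # query exactly as entered
        if len(terms) > 1 and q in text:
            phrase = 1                   # multi-term search: phrase match
        if all(t in text for t in terms):
            fields_matched += 1          # number of fields matched
            weight = max(weight, w)      # title match outranks notes match
    # Static ordering (publication year here) breaks any remaining ties.
    return (exact, phrase, weight, fields_matched, record.get("year", 0))


def rank(records, query):
    """Keep records matching in at least one field, most relevant first."""
    hits = [r for r in records if score(r, query)[3] > 0]
    return sorted(hits, key=lambda r: score(r, query), reverse=True)
```

Because the score is a tuple, each tier only matters when every tier before it is tied, which mirrors the ordering of the factors on the slide.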

Faceting at the NCSU Libraries

Follows on what we have learned from the commercial Web search model

Mines metadata already available via MARC record, local class number, ILS item categories, circ status, and date stamping

Required massive clean-up of 6xx subdivisions

Allows both pre- and post-coordinate limits

Uses table mapping to enable drilling down through call number results
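The table-mapping idea for call numbers can be sketched as follows. The table below is a four-row excerpt typed in for illustration (the captions follow the Library of Congress Classification outline); the function name and the output format are assumptions, and the table NCSU actually used is not shown in these slides.

```python
import re

# Tiny excerpt of a class-letter-to-caption table; a real one covers all of LCC.
LCC_TABLE = {
    "Q": "Science",
    "QA": "Mathematics",
    "P": "Language and literature",
    "PS": "American literature",
}


def lcc_path(call_number):
    """Return the hierarchy path, broadest class first, for an LC call number."""
    match = re.match(r"[A-Z]{1,3}", call_number.strip().upper())
    if not match:
        return []
    letters = match.group(0)
    path = []
    # Walk the prefixes "Q", "QA", ... so each level is one drill-down step.
    for i in range(1, len(letters) + 1):
        prefix = letters[:i]
        if prefix in LCC_TABLE:
            path.append(f"{prefix} - {LCC_TABLE[prefix]}")
    return path
```

Each element of the returned path can serve as one level of a hierarchical facet, so a user can start at "Q - Science" and narrow to "QA - Mathematics".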

Facet refinements

Availability

Author

Library

Format

Language

New(ness)

LC Classification

Subject: Topic

Subject: Genre

Subject: Region

Subject: Era

A single facet need not represent data from a single field

Single Unicorn item types (Book, Kit, Manuscript, Map, Data set)

Multiple Unicorn item types (Audio, Microform, Thesis/Dissertation, Software & Multimedia, Videos)

Leader byte 07 (Bib lvl): Journal, Magazine

Library (Online)
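A rough sketch of a facet fed from more than one source, in Python. The item type codes, library code, and labels here are placeholders invented for the example, and treating the library value as a source for an "Online" format is an assumption; only the meaning of MARC leader position 07 ("s" for serial) comes from the MARC format itself.

```python
# All codes and labels below are placeholders for illustration.
SINGLE_TYPE = {"BOOK": "Book", "MAP": "Map", "KIT": "Kit"}
MULTI_TYPE = {"CD": "Audio", "CASSETTE": "Audio", "DVD": "Videos", "VHS": "Videos"}


def format_facet(item_type, leader, library):
    """Collect Format facet values for one record from three sources."""
    values = set()
    if item_type in SINGLE_TYPE:
        values.add(SINGLE_TYPE[item_type])
    elif item_type in MULTI_TYPE:
        values.add(MULTI_TYPE[item_type])    # several item types, one label
    # MARC leader position 07 is bibliographic level; "s" means serial.
    if len(leader) > 7 and leader[7] == "s":
        values.add("Journal, Magazine")
    if library == "ONLINE":
        values.add("Online")
    return sorted(values)
```

The facet value is decided at indexing time, so the searcher sees one Format list regardless of which underlying field supplied each value.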

Ranking facet results by number of postings makes sense in a short list, but not in a long list

The author facet is less useful in some types of searches …

… than others!

Technical overview

[Architecture diagram: Raw MARC data -> NCSU exports and reformats -> Flat text files -> Data Foundry (parse text files) -> Indices -> MDEX Engine -> HTTP -> NCSU Web Application -> HTTP. The Endeca components are labeled “Information Access Platform”.]

MARC ingest

MARC converted to flat text file(s) for ingest by Endeca. Transformation accomplished with MARC4J. Opportunity to manipulate data on the back-end.
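To make the flattening step concrete, here is a small Python sketch. It does not use MARC4J (which is a Java library) and does not reproduce the file layout Endeca expects; the pipe-delimited "id|property|value" layout, the in-memory record shape, and the punctuation clean-up are all assumptions standing in for whatever the real export does.

```python
def flatten(record_id, fields):
    """Turn (tag, [(code, value), ...]) pairs into 'id|property|value' lines."""
    lines = []
    for tag, subfields in fields:
        for code, value in subfields:
            # Back-end clean-up hook: trim trailing ISBD punctuation.
            cleaned = value.strip().rstrip(" /:;,")
            lines.append(f"{record_id}|{tag}{code}|{cleaned}")
    return "\n".join(lines)
```

For example, a 245 field with $a "Moby Dick /" and $c "Herman Melville." becomes two lines, one per subfield, with the trailing slash removed from the title. The clean-up line is where the "manipulate data on the back-end" opportunity would live.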

Transformed data

The end result…

Video

Other Endeca library catalogs

Phoenix Public Library: http://www.phoenixpubliclibrary.org/

McMaster University: http://libcat.mcmaster.ca

Florida Center for Library Automation: http://catalog.fcla.edu/

Individual Florida universities: http://fs.catalog.fcla.edu/, etc.

Does Endeca work?

Problems: authority control

Endeca is a keyword search engine; “browse” can only be effected using sort options

There is no authority control within Endeca itself, rather it relies on AC within ILS

To make use of available metadata, subjects were split along subdivisions. Authors were not

Talks were held with the vendor to explain the potential for drawing on authority x-refs to collocate searches

Problems: subject context

Problems with wrong delimiter values (esp. $v)

Problems maintaining context in atomized LCSH:

One-way relationships: English language$vDictionaries$xSpanish

Chronological headings devoid of geographic context: Cuba$xHistory$yRevolution, 1959

Phrase headings expressed in multiple subdivisions: Prisoners$xAbuse of
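The loss of context is easy to see in a small Python sketch that splits a heading along its subdivisions. The mapping of subfield codes to the four subject facets is an assumption based on the facet names in this talk, not the actual NCSU indexing rules, and the function name is hypothetical.

```python
import re

# Assumed mapping of subdivision codes to the subject facets named in the talk.
FACET_BY_CODE = {"x": "Topic", "v": "Genre", "y": "Era", "z": "Region"}


def atomize(heading, tag="650"):
    """Split 'Main$xSub$yDate' into (facet, value) pairs, one per subfield."""
    main_facet = "Region" if tag == "651" else "Topic"
    # re.split with a capture group yields [main, code, value, code, value, ...]
    parts = re.split(r"\$([a-z])", heading)
    pairs = [(main_facet, parts[0])]
    for code, value in zip(parts[1::2], parts[2::2]):
        pairs.append((FACET_BY_CODE.get(code, "Topic"), value))
    return pairs
```

Run on the Cuba heading with tag "651", this yields a Region value "Cuba", a Topic value "History", and an Era value "Revolution, 1959" that no longer says which country's revolution is meant. "Prisoners$xAbuse of" becomes two separate Topic values, neither of which carries the phrase.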

Problems: subject hierarchies

Chronological hierarchy not built into $y:
“19th century” does not subsume 1800-1809, 1801-1861, 1809-1817, 1815-1861, 1817-1825, Civil War, 1861-1865, etc.
Geological periods exist as text only (Ordovician, Pleistocene, etc.)

Some chronological headings are expressed as text in 650$a: Middle Ages; Nineteen sixties

Geographic hierarchy not consistent between 651 and 650: $zNorth Carolina$zRaleigh vs. $aRaleigh (N.C.)

BT/NT/RT relationships from authority file lacking
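One way a chronological hierarchy could be derived from $y values is sketched below in Python. This is an illustration, not something described in the talk: it assumes that four-digit years can be read out of the heading text and that the year 1800 counts as 19th century, and it does nothing for text-only eras, which would need a lookup table.

```python
import re


def centuries(era):
    """Return the century labels overlapped by the years found in an era string."""
    years = [int(y) for y in re.findall(r"\b(\d{4})\b", era)]
    if not years:
        return []    # text-only eras (e.g. "Middle Ages") need a lookup table
    first, last = min(years), max(years)
    labels = []
    # Assumed convention: years 1800-1899 belong to the 19th century.
    for c in range(first // 100 + 1, last // 100 + 2):
        if 10 <= c % 100 <= 20:
            suffix = "th"
        else:
            suffix = {1: "st", 2: "nd", 3: "rd"}.get(c % 10, "th")
        labels.append(f"{c}{suffix} century")
    return labels
```

With labels like these added at indexing time, a "19th century" refinement could pick up "Civil War, 1861-1865" even though the heading never mentions the century.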

Some potential solutions

Search behavior education

FAST (Faceted Application of Subject Terminology)

Web2 x-refs to redirect searches to Endeca

Combining $z hierarchies

Hierarchy lists

What do our users think?

“The new Endeca system is incredible. It would be difficult to exaggerate how much better it is than our old online card catalog (and therefore that of most other universities). I've found myself searching the catalog just for fun, whereas before it was a chore to find what I needed.”

- NCSU Undergrad, Statistics

“The new library catalog search features are a big improvement over the old system. Not only is the search extremely fast, but seemingly it's much more intelligent as well.”

- NCSU faculty, Psychology

Usability testing

Task Difficulty: Old Catalog

Easy 43%, Medium 12%, Hard 22%, Failed 23%

Task Difficulty: New Catalog

Easy 59%, Medium 12%, Hard 7%, Failed 22%

Usability testing

Average Task Duration: Old vs New Catalog

[Bar chart: average duration of Tasks 1 through 10, Old Catalog vs. New Catalog; time axis from 00:00 to 03:36]

Usage statistics

Searches by Field Type: July 06 - Jan 07

[Bar chart: number of searches (axis 0 to 420,000) by field type: Keyword (default), ISBN, Title, Author, Subject, Multi-Field]

Search and Navigation

Search 67%, Navigation 8%, Search -> Navigation 25%

Newness wearing off?

March ‘06 - May ‘06

July ‘06-January ‘07

Requests by Search Type

Search -> Navigation 29%, Navigation 20%, Search 51%

Navigation by Dimensions

Subject: Topic 26%
LC Classification 21%
Format 10%
New 10%
Library 10%
Subject: Genre 6%
Author 6%
Subject: Region 4%
Language 3%
Subject: Era 2%
Availability 2%

July 06 – Jan 07

Navigation by Dimension (most used)

[Bar chart: requests per dimension (axis 0 to 140,000), in ascending order of use: Availability, Subject: Era, Language, Subject: Region, Author, Subject: Genre, Library, New, Format, LC Classification, Subject: Topic]

July 06 – Jan 07

Navigation by Dimension (order of UI presentation)

Dimension: Requests
Author: 32,650
Language: 16,009
Subject: Era: 12,257
Subject: Region: 22,818
Library: 54,476
Format: 57,667
Subject: Genre: 34,096
Subject: Topic: 145,589
LC Classification: 120,644
Availability: 9,286

July 06 – Jan 07

Where are we going from here?

Future directions

Additional hierarchies (geographic names, dates)

Make use of NAF, SAF, particularly cross-reference structure

Massage underlying metadata:
Addition of Date Cataloged – Done!
Addition of LC Class numbers to e-resources – Done!
FRBR work numbers/records? – Tested!
FAST headings?

Accommodation of true browse for all indexes

Future opportunities

Expanding the scope of the implementation to the 10M records in TRLN (Duke, NCCU, NCSU, UNC-Chapel Hill)

Enrich catalog through external web services: book jackets, reviews, TOC, etc. – Amazon, OCLC, LibraryThing, Bowker Syndetics

Build use-case based cross-application shopping cart functionality

Integrate catalog w/other tools through web services: “Free the Data”

Web services…

Mobile device searching

Where is everybody else going?

Catalogs detaching themselves from ILS:
Detached data lends itself to experimentation
Don’t have to throw out baby with bathwater when better interfaces come out
Data itself safe and secure in ILS

MARC becoming superfluous; MARC’s granularity NOT!

Social interaction: reviews, folksonomic tags, ratings

Phoenix Public Library on Endeca

III’s new faceted catalog, Encore

ExLibris Primo at Vanderbilt

Athens County, OH—Koha Zoom open source

Georgia PINES—Evergreen open source

Casey Bisson’s Scriblio

Danbury Public powered by LibraryThing

OCLC WorldCat Local at UW

Thanks for listening!

Charley Pennell

Principal Cataloger for Metadata

NCSU Libraries

North Carolina State University

Raleigh, NC 27695-7111

cpennell@ncsu.edu

More info at: http://www.lib.ncsu.edu/endeca/