Accommodating Diverse Search Requirements over a Fedora Repository Michael Durbin and Jon W. Dunn Fedora User Group – Open Repositories 2008 April 3, 2008

Accommodating Diverse Search Requirements over a Fedora

Repository

Michael Durbin and Jon W. Dunn

Fedora User Group – Open Repositories 2008

April 3, 2008


Background

o Indiana University Digital Library Program
  • Started in 1997

o Diversity of formats and collections
  • Text, image, musical scores, audio, video, …

o Diversity of search systems
  • DLXS, XTF, Lucene, DB2 NSE, Oracle Text

o Current project to unify architecture for storage, discovery, and delivery around Fedora


Search System Development

o Phase one: create a search architecture and template for an image-based search and discovery application

o Phase two: extend the template and architecture to support more advanced search and discovery applications over different object types


PHASE I: CREATING A BASIC IMAGE SEARCH


Phase One: Simple Image Search

o Slocum puzzle collection: ideal test case
o Small number of objects
o Simple content model
  • Each object represents a single physical puzzle
  • Basic metadata: METS, MODS, DC
  • RELS-EXT isMemberOf relationship with a collection object
  • Pre-scaled derivative images


Requirements: Identifier Resolution

o External identifiers rather than Fedora PIDs
  • Seamless migration to Fedora
  • No commitment to any underlying repository architecture
o Requirement: Quickly resolve our identifier (PURL) to the Fedora PID


Requirements: PURL Identifier Resolution

[Diagram: PURL identifier resolution. Components shown: Hypothetical ID Resolution Service, OCLC PURL Resolver.]

Fedora URL: http://fedora.dlib.indiana.edu:8080/fedora/get/iudl:19794/THUMBNAIL

PURL: http://purl.dlib.indiana.edu/iudl/lilly/slocum/thumbnail/LL-SLO-004696
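
As a rough illustration of the resolution step, the sketch below maps the PURL shown on this slide to the Fedora URL shown on this slide. The lookup table, the function name, and the rule that the view segment of the PURL becomes an upper-case datastream ID are assumptions made for the example, not details given in the presentation.

```python
from urllib.parse import urlparse

# Base of the Fedora "get" URLs, taken from the slide.
FEDORA_BASE = "http://fedora.dlib.indiana.edu:8080/fedora/get"

# Hypothetical lookup table: external item identifier -> Fedora PID.
# A real service would presumably query an index rather than a dict.
ID_TO_PID = {"LL-SLO-004696": "iudl:19794"}


def resolve_purl(purl: str) -> str:
    """Translate a PURL into a Fedora datastream URL (illustrative only)."""
    segments = urlparse(purl).path.strip("/").split("/")
    # Assumed PURL layout: .../<view>/<item identifier>
    view, item_id = segments[-2], segments[-1]
    pid = ID_TO_PID[item_id]  # raises KeyError for unknown identifiers
    # Assumption: the view name maps to the datastream ID in upper case.
    return f"{FEDORA_BASE}/{pid}/{view.upper()}"


fedora_url = resolve_purl(
    "http://purl.dlib.indiana.edu/iudl/lilly/slocum/thumbnail/LL-SLO-004696"
)
```

Keeping the mapping behind a single function is what lets the external identifier stay stable if the underlying repository changes.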


Requirements: Keyword and Fielded Search

o Very basic search requirements for any discovery and delivery web application
  • Keyword search should maximize discovery
  • MODS fields should be searchable to maximize accuracy of matches
  • Search results paging
  • Support for simple Boolean operators
  • Wildcard searches are a requirement
  • Full metadata record (MODS) returned


Remaining Requirements

o User interface
  • Extensible, Reusable, Customizable

o Service oriented approach
  • Centralize core search system
  • Standards-based access for integration with other services and end-user tools


Requirements: Search System

[Diagram: a UI layer (Slocum Webapp, Generic Search Webapp) and a search layer (PURL Resolution, Fielded Search, Fedora Integration)]


Solutions: Search Protocol

o Search/Retrieve via URL (SRU)
  • One of very few standard search protocols
  • Extremely powerful and flexible query language (CQL)
  • Can return records of any type
  • Most commonly used with DC, MODS, MARCXML
  • Has mechanisms for extension in case special needs arise
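
To make the protocol concrete, here is a minimal sketch of building an SRU searchRetrieve request with a CQL query. The parameter names (operation, version, query, startRecord, maximumRecords, recordSchema) come from the SRU 1.1 specification; the base URL, the index names in the query, and the "mods" schema short name are placeholders assumed for illustration.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start_record=1,
                   maximum_records=10, record_schema="mods"):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,                 # CQL, e.g. fielded and Boolean
        "startRecord": start_record,        # 1-based offset, used for paging
        "maximumRecords": maximum_records,  # page size
        "recordSchema": record_schema,      # assumed server-side short name
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index names.
url = sru_search_url(
    "http://example.org/sru/slocum",
    'dc.title any "puzzle box" and dc.subject = wood',
)
```

Paging through results then amounts to re-issuing the request with a larger startRecord.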


Search System Solutions: SRU

[Diagram: a UI layer (Slocum Webapp, Generic Search Webapp) connected over SRU to a search layer (PURL Resolution, Fielded Search, Fedora Integration)]


Solutions: Existing Products

o Fedora Search
  • Good for finding items based on basic Fedora metadata, but not for more sophisticated searching

o Fedora Resource Index Search
  • Also limited to searching basic metadata, not the content of datastreams


Solutions: Existing Products

o Fedora Generic Search Service (GSearch)
  • Hooks into Fedora
  • Works with Lucene
  • Easy to customize search fields through XSLT transformation of existing metadata
o OCLC SRU/W Implementation
  • Relatively complete implementation in Java, with ongoing development
  • Others have had success using it with Lucene


Search System

[Diagram: the Fedora Generic Search Service updates a Lucene index; the OCLC SRU Implementation, through a Lucene database extension, reads that index and exposes it over SRU]


Phase 1 Solution: General Applicability

o Pieces of this solution have been used for other image collections

o SRU is used to expose these collections to OneSearch@IU, our federated search service

o The XSLT that assigned metadata to Lucene index fields was a solid base for the indexing needs of other collections.


Phase 1 Solution: Lingering Problems

o Our XSLT for the Generic Search Service wasn’t perfect

o Some complications prevented full automation
o We punted on getting the perfect Lucene analyzer configuration


PHASE II: EXTENDING FOR DIFFERENT COLLECTIONS


EVIA Digital Archive


Requirement: EVIADA Video Annotation Collection

[Diagram: a field collection, showing a Field Collection Object, several Video Objects, and the Custom Annotation Software]


Requirement: EVIADA Video Annotation Collection

o Complex data model
  • One Fedora object which is addressable and discoverable in parts
o New features
  • Faceted Search and Browse
  • Extensive custom fields


Requirements: IN Harmony Sheet Music Collection


Requirements: IN Harmony Sheet Music Collection

o Complex content model
  • Three types of objects below the collection
  • Sheet music
  • Individual Score
  • Page Image

[Example shown: Chariot Race March]


Requirements: IN Harmony Sheet Music Collection

o New Features
  • Faceted Search and Browse
  • Exact match searches
  • Date range searches
  • Dozens of very specific fields
  • Sorting by date or title


Options:

o Extend our existing implementation
  • All too appealing because of familiarity and “sunk costs”
  • Major conflicts between existing model and desired model could result in unmaintainable “hackish” implementations

o Switch to a new infrastructure
  • Would be great, if something existed that met our needs without having to rework everything

o Some combination
  • Best of both worlds?


Options: Faceted Search and Browse

o Use Solr
  • Built-in support for facets
  • Is a service layer with an XML response
  • But do we really want to abandon SRU, or maintain two search service protocols?


Options: Faceted Search and Browse

o Extend SRU Implementation
  • Prevents the need for yet another service layer
  • Has wide reuse potential
  • Could be backed by Solr without substantially more effort.


Solution: Faceted Search over SRU

[Diagram: SRU Service (now with facet support)]
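
The slides do not say how facet values were computed or how they were carried in the SRU response. As a generic illustration of what a facet is, the sketch below counts the values of chosen fields across a set of matching records; the record layout and the field names are invented for the example.

```python
from collections import Counter


def facet_counts(records, facet_fields):
    """Count how often each value of each facet field occurs in the hits."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            # Fields are multi-valued lists; a missing field contributes nothing.
            for value in record.get(field, []):
                counts[field][value] += 1
    return counts


# Hypothetical search hits for a sheet music collection.
hits = [
    {"genre": ["march"], "decade": ["1900s"]},
    {"genre": ["waltz"], "decade": ["1900s"]},
    {"genre": ["march"], "decade": ["1910s"]},
]
facets = facet_counts(hits, ["genre", "decade"])
```

A browse interface would typically show each value with its count and narrow the query when one is selected.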


Solution: Other SRU Improvements

o More complete CQL support
  • Easy Improvements
  • Operators (and, or, not, any, all)
  • Application-specific fields


Solutions: Other SRU Improvements

o More complete CQL support
  • Difficult Improvements
  • “cql.exact” relation
  • facet implementation
  • sort support

[Diagram: the query dc.subject exact “United Kingdom” run against an index in which dc.subject is stored as the fields dc.subject, dc.subject.exact, and dc.subject.sort]
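
One way to read the field names on this slide is that each metadata field is indexed several times, and the CQL relation decides which copy is searched. The sketch below follows that reading for a single clause; the regular expression, the function names, and the routing rule are assumptions for illustration, not the authors' implementation, and this is not a full CQL parser.

```python
import re

# Matches one simple clause: <index> <relation> <term>, term optionally quoted.
CLAUSE = re.compile(
    r'^\s*(?P<index>[\w.]+)\s+(?P<relation>[\w.=]+)\s+"?(?P<term>[^"]*)"?\s*$'
)


def to_index_lookup(clause):
    """Return (index field name, term) for a single CQL clause."""
    match = CLAUSE.match(clause)
    if match is None:
        raise ValueError(f"unsupported clause: {clause!r}")
    index, relation, term = match.group("index", "relation", "term")
    if relation in ("exact", "cql.exact"):
        # Exact matching goes to the untokenized copy of the field.
        return index + ".exact", term
    # Everything else goes to the ordinary tokenized field.
    return index, term


def sort_field(index):
    """Name of the field copy assumed to be used for sorting."""
    return index + ".sort"


lookup = to_index_lookup('dc.subject exact "United Kingdom"')
```

The cost of this approach is a larger index, since every field that needs exact matching or sorting is stored more than once.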


Options: Index Generation

[Diagram: two options for index generation: Fedora Generic Search Service or a Homegrown Solution]


Reconsideration: GSearch

o Limited by the one-to-one relationship between Lucene documents and Fedora objects

o Wrapping valid XML in CDATA for storage in Lucene is messy and prone to error as the metadata becomes more diverse

o We really only use it to generate a Lucene index


Consideration: Solr

o Robust wrapper for Lucene
  • Exposes service to update index
  • Exposes search features as a service
  • Abstracts away much of the complexity of Lucene
o Migrating existing search indexes would be prohibitively time consuming, but it might be the best tool to bring up new collections


Solution: Custom index service

o A service whose initial functionality is simply to create and maintain Lucene index directories that are served by SRU.
  • Can easily be extended/configured to use different search engines or to delegate the process entirely (perhaps to Solr)

o Support for existing GSearch-style XSLT
o Simple Java interface to allow for easy index implementations.
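
The slide mentions a simple Java interface for index implementations but does not show it. The sketch below is a Python stand-in for that idea, with invented class and method names and plain lists in place of Lucene indexes. It also illustrates why a custom writer helps with compound objects: one repository object can produce several index documents instead of exactly one.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical interface: turn one repository object into index documents."""

    @abstractmethod
    def documents_for(self, pid, metadata):
        """Return a list of field dictionaries to index for this object."""


class BasicIndexWriter(IndexWriter):
    def documents_for(self, pid, metadata):
        # One document per object.
        return [{"pid": pid, "title": metadata.get("title", "")}]


class CompoundModelIndexWriter(IndexWriter):
    def documents_for(self, pid, metadata):
        # One document per addressable part of the object.
        parts = metadata.get("parts", [])
        return [{"pid": pid, "part": n, **part}
                for n, part in enumerate(parts, start=1)]


class IndexService:
    """Keeps each named index up to date using its registered writer."""

    def __init__(self):
        self._writers = {}
        self._indexes = {}

    def register(self, index_name, writer):
        self._writers[index_name] = writer
        self._indexes[index_name] = []

    def update(self, pid, metadata):
        for name, writer in self._writers.items():
            docs = self._indexes[name]
            docs[:] = [d for d in docs if d["pid"] != pid]  # drop stale entries
            docs.extend(writer.documents_for(pid, metadata))

    def index(self, name):
        return list(self._indexes[name])


service = IndexService()
service.register("basic", BasicIndexWriter())
service.register("compound", CompoundModelIndexWriter())
service.update("example:1", {"title": "Field collection",
                             "parts": [{"title": "Scene 1"}, {"title": "Scene 2"}]})
```

Swapping in a different search engine would mean adding another writer or index backend behind the same interface, leaving the service itself unchanged.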


Search Service

[Diagram: the Custom Index Service uses four index writers (Basic Index Writer, GSearch Style XSLT Index Writer, New Style XSLT Index Writer, Compound Model Java Index Writer) to build Lucene databases configured for quick id resolution, basic search, advanced search, and compound model searches; the OCLC SRU Implementation serves these indexes]


Search Service

[Diagram: the Custom Index Service with its index writers (Basic Index Writer, GSearch Style XSLT Index Writer, New Style XSLT Index Writer, Compound Model Java Index Writer) and Lucene databases configured for quick id resolution, basic search, advanced search, and compound model searches, all served by the OCLC SRU Implementation; added here are a Solr Wrapping Index and a Solr database configured to interface with Solr]


Future Plans

o Full text searching
  • Search text of entire books or journals
  • Determine where in the hierarchy the match occurred
  • Provide snippets with highlighted matches in context for the search results listing
o Solutions
  • XTF, Solr through our custom index service


Conclusion

o Most of the work is configuring the index, which is a requirement that cannot be avoided.

o Migration doesn’t have to be difficult or disruptive
o Always be willing and able to consider new products and technologies


Thanks! Any Questions?

o www.dlib.indiana.edu
o wiki.dlib.indiana.edu/confluence/x/AQI

o [email protected] [email protected]
