Search Technologies for Digital Libraries

Post on 05-Dec-2014

Introduction Search & Information Retrieval Technologies for Digital Libraries

Contemporary Search Technologies - also for Libraries?

Clemens Neudecker, KB – 20/04/2011

Table of contents

Retrieval: Status Quo

New ways of searching

Prototypes & Outlook

Lossau (dlib, 2004)

How to position the library as an information provider in the 21st century?

Search services are critical!

http://www.dlib.org/dlib/june04/lossau/06lossau.html

Library as a “depot”

Collect

Preserve

Library as a “gateway”

New ways of searching and/or browsing

Service infrastructure

User-Generated content

Competition: Internet Search Engines

Simple Search

• By keyword

• Boolean operators
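
A minimal sketch of keyword search with Boolean operators, assuming a tiny in-memory collection. The documents and the helper names `terms_of` and `matching` are invented for illustration; a real catalogue would query an index instead of scanning every text.

```python
# Toy collection; the titles are invented for illustration.
DOCS = {
    1: "history of the dutch newspaper",
    2: "digital library search services",
    3: "newspaper digitisation in the library",
}


def terms_of(text):
    """Split a text into a set of lowercase keywords."""
    return set(text.lower().split())


def matching(term):
    """Return the ids of all documents containing the keyword."""
    return {doc_id for doc_id, text in DOCS.items() if term.lower() in terms_of(text)}


# Boolean operators map directly onto set operations.
both = matching("newspaper") & matching("library")      # AND
either = matching("dutch") | matching("digital")        # OR
without = matching("library") - matching("newspaper")   # NOT

print(sorted(both), sorted(either), sorted(without))
```

Treating each keyword as a set of document ids keeps the operators trivial: AND is intersection, OR is union, NOT is difference.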

Advanced Search

Facets

Views

Phrases

Meta-Search
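
Facets can be sketched as value counts over a metadata field, which the user then clicks to narrow the result set. The records, field names, and the helpers `facet_counts` and `refine` below are assumptions made for this example only.

```python
from collections import Counter

# Invented catalogue records; the field names are assumptions.
RECORDS = [
    {"title": "Daily paper, 1806", "type": "newspaper", "language": "nl"},
    {"title": "Weekly paper, 1810", "type": "newspaper", "language": "de"},
    {"title": "Travel account", "type": "book", "language": "nl"},
]


def facet_counts(records, field):
    """Count how many records carry each value of the given field."""
    return Counter(r[field] for r in records if field in r)


def refine(records, field, value):
    """Keep only the records whose field equals the selected facet value."""
    return [r for r in records if r.get(field) == value]


dutch_only = refine(RECORDS, "language", "nl")
print(facet_counts(RECORDS, "type"))
print(facet_counts(dutch_only, "type"))
```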

Basics

• Crawling

• Indexing

• Searching

• Ranking results

http://nlp.stanford.edu/IR-book/
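
The indexing, searching, and ranking steps above can be sketched with an inverted index and a plain TF-IDF score. This is a simplified illustration, not how Lucene or any particular engine scores results; crawling is replaced by a hard-coded dictionary of invented documents, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def search(index, n_docs, query):
    """Searching and ranking: doc ids ordered by summed TF-IDF, best first."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # unknown terms contribute nothing
        idf = math.log(n_docs / len(postings))  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Ties are broken by document id to keep the order deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Stand-in for crawled content.
docs = {
    1: "library search search engine",
    2: "library catalogue",
    3: "web search engine",
}
index = build_index(docs)
print(search(index, len(docs), "search catalogue"))
```

Document 2 ranks first for this query because "catalogue" occurs in only one document and therefore carries the highest weight.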

Technology

Apache Lucene/Solr (KB: migration from Verity)

http://lucene.apache.org/

http://lucene.apache.org/solr/

SRU = Search/Retrieve via URL

http://www.loc.gov/standards/sru/

CQL = Contextual Query Language

http://www.loc.gov/standards/sru/specs/cql.html
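
A minimal sketch of an SRU request: the CQL query travels as an ordinary URL parameter of a `searchRetrieve` operation. The base URL is a placeholder and the helper `sru_url` is hypothetical; the parameter names follow SRU 1.2, and an actual server may support other versions or index names.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def sru_url(base_url, cql_query, max_records=10, version="1.2"):
    """Build a searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "maximumRecords": max_records,
    }
    return base_url + "?" + urlencode(params)


# Placeholder endpoint; the dc.* index names assume the server
# supports the Dublin Core context set.
cql = 'dc.title = "digital library" and dc.creator = lossau'
url = sru_url("http://sru.example.org/catalogue", cql)
print(url)
```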

Retrieval: Status Quo

Catalogue

Metadata

Catalogue Search

Metadata

Dublin Core (DCMI)

http://dublincore.org/
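
As an illustration, a simple Dublin Core record can be read with the standard XML parser. The record content and the wrapper element `record` are invented; only the `dc` element namespace is the real one.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Invented example record; the wrapper element name is an assumption.
RECORD = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Title</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:date>2011</dc:date>
</record>"""


def dc_fields(xml_text):
    """Collect Dublin Core elements as {element name: [values]}."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        # ElementTree writes qualified tags as "{namespace}localname".
        if element.tag.startswith("{" + DC_NS + "}"):
            name = element.tag.split("}", 1)[1]
            fields.setdefault(name, []).append(element.text)
    return fields


print(dc_fields(RECORD))
```

Values are kept in lists because every Dublin Core element may be repeated within one record.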

Z39.50

http://www.loc.gov/z3950/agency/

Metadata Harvesting

Open Archives Initiative: OAI-PMH

http://www.openarchives.org/
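
Harvesting with OAI-PMH comes down to plain HTTP requests with a `verb` parameter. The sketch below only builds the request URLs for `ListRecords`; the repository address is a placeholder and the helper `list_records_url` is hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # resumptionToken is an exclusive argument: when continuing a
        # partial list, no other argument besides the verb may be sent.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder repository address.
first = list_records_url("http://example.org/oai")
following = list_records_url("http://example.org/oai", resumption_token="abc")
print(first)
print(following)
```

A harvester would fetch the first URL, read the `resumptionToken` from the response, and repeat with the second form until the repository returns no further token.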

Linked Data

Authority Data

Named Entities

(Persons, Places, Institutions)

http://viaf.org/

Gazetteers

http://www.world-gazetteer.com/

Other examples:

LocAuth, PND, NaCo

Persistent Identifier

URN = Uniform Resource Name

NBN = National Bibliography Number

Resolver = Translation into web address
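
A resolver can be sketched as a function that validates a URN:NBN and turns it into a web address. The resolver address and the identifier are made up, and the pattern is deliberately simplified; it does not cover the full URN:NBN syntax.

```python
import re

# Hypothetical resolver address, for illustration only.
RESOLVER_BASE = "http://resolver.example.org/"

# Simplified pattern: "urn:nbn:", a two-letter country code, then either
# ":" (sub-namespace follows) or "-" (identifier follows).
URN_NBN = re.compile(r"^urn:nbn:([a-z]{2})[:-](.+)$", re.IGNORECASE)


def parse_urn_nbn(urn):
    """Return (country code, remainder) or raise ValueError."""
    match = URN_NBN.match(urn)
    if not match:
        raise ValueError("not a URN:NBN: " + urn)
    return match.group(1).lower(), match.group(2)


def resolve(urn):
    """Translate a valid URN:NBN into a web address at the resolver."""
    parse_urn_nbn(urn)  # validation only
    return RESOLVER_BASE + urn


print(resolve("urn:nbn:nl:ui:13-example"))
```

Because only the resolver knows the current location, the URN printed in a citation stays valid when the object moves to a new web address.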

Problems

Correctness of data

Coverage

Formats

Alignment

Multilingualism

What happened since

Google Books

The European Library

Europeana

Wolfram/Watson

What’s next?

Google Book Search

http://books.google.com/

Google Ngram Viewer

http://ngrams.googlelabs.com/

The European Library

http://search.theeuropeanlibrary.org

Europeana

http://www.europeana.eu/portal/

Michael

http://www.michael-culture.org

WorldCat

http://www.worldcat.org/

IBM Watson

www.ibm.com/uk/Watson

Wolfram

http://www.wolframalpha.com/

The web

The web is not limited to the www!

Data deluge

“Deep web” – not indexed (dynamic) parts

Web of users – currently ~2 billion

Internet Archive

http://www.archive.org/

Wayback Machine

http://web.archive.org/

Web archiving

The web as a resource

Knowledge Extraction (not the actual data!)

→ Semantic Web

(web of knowledge,

rather than data)

Semantic Web

RDF http://www.w3.org/RDF/

OWL http://www.w3.org/2004/OWL/

SPARQL http://www.w3.org/TR/rdf-sparql-query/

SKOS http://www.w3.org/2004/02/skos/

Ontologies

Ontology = “Model of the World”

Classes

Instances

Properties
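
Classes, instances, and properties can be sketched as subject-predicate-object triples, the data model behind RDF. The `ex:` names are invented, and the `match` helper only imitates a single SPARQL-style triple pattern in plain Python; it is not an RDF library.

```python
# Tiny invented ontology written as (subject, predicate, object) triples.
TRIPLES = {
    ("ex:Library", "rdf:type", "rdfs:Class"),         # a class
    ("ex:Place", "rdf:type", "rdfs:Class"),           # another class
    ("ex:KB", "rdf:type", "ex:Library"),              # an instance
    ("ex:TheHague", "rdf:type", "ex:Place"),          # an instance
    ("ex:KB", "ex:locatedIn", "ex:TheHague"),         # a property
}


def match(pattern, triples=TRIPLES):
    """Return the triples fitting an (s, p, o) pattern; None is a wildcard."""
    return sorted(
        triple
        for triple in triples
        if all(want is None or want == got for want, got in zip(pattern, triple))
    )


# "Which instances are libraries?"
print(match((None, "rdf:type", "ex:Library")))
# "What is known about ex:KB?"
print(match(("ex:KB", None, None)))
```

Read as a graph, every triple is an edge from subject to object labelled with the predicate.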

Semantic Graphs

New resources

Digital libraries (images + OCR)

Born-digital material

The web

→ Interoperability (STITCH, CATCH)

Full text (OCR)

"... tte->e°n.m.66-..ie k>okke cire-5^ea. ver.è. 6.or ^ ^ ^ °

kiesrellj-oe-ikei^, v-in eeo ^elj-escdapeo ^UOI^, 7

^n>5«--'-/-r. veel8-Iiec-jc ttui5vroll^ v,a 'z » ^ v e . X. «. ^ ^ I» 2 L t. L ^-i ? > " Z Z^

l»v«e».ic. sx ^ ^ , 6en 2 l8c«. Leb. ^ L I L I tZ.

6eo zc> ^pr>!, >«(ZS. 8 O II 0 v ? L W. . L^-L"

. . ^ ... ,. , ^,a «ore Vrienilea ea Lekenaêll zeven dy aeeea ^^

^ LLQ d2i« 4 urea, 18 myoe ttuisvi-ouiv, van Kenoi5, Sis asr 0v?e darlelvk >zetief6e Vscier', ?. L08, op L

«eea vel.^esckspLa ^5^()I>Z verlof. Ke6ed w»cj6zZ reo l2urev, as eev Verval vsn ^evev^drscdceo, ^ ^ ^ "A.

Oevki>i7L«., K0>.^^Q8N()VL^, secZerr z ''Vckev öeclle^ri^ te , jv6evou6er6oru " ^

<Zen Zv ^pri!, 1806. ^x>0lè:ecsr. vsv dyQ!l 92 ^sr^n, ker ^clelvke vzet det Leu^visie vervvzilelc! 'O L ^ ^ ^ '-

".' «eckea mi6ck»z ruim êên uur verlatte ovvorfpieck-z. i>«kl. ^0-6 k»rskter verdeaxSe »Ue ryve iiinöeren en L--»S « > I L^Z

OCR Lexica

Word matching (fuzzy words)

Frequency

Morphology

Historic forms

Inflected forms
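
Fuzzy word matching against a lexicon can be sketched with the standard library's `difflib`. The lexicon, the misread tokens, and the cutoff value are assumptions chosen for the example; a real OCR lexicon would also carry the frequency, historic, and inflected forms listed above.

```python
import difflib

# Small invented lexicon of correct word forms.
LEXICON = ["huisvrouw", "vrienden", "bekenden", "verlof", "kinderen"]


def correct(token, lexicon=LEXICON, cutoff=0.75):
    """Return the closest lexicon word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(token.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Typical OCR confusions: "d" read as "cl", "o" read as "0".
for token in ["kincleren", "verl0f", "xqzt"]:
    print(token, "->", correct(token))
```

The cutoff trades recall against false corrections: a lower value repairs more damaged words but also maps more noise onto real words.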

Visibility

“Hidden” - only indexed

Highlighting in image

Full text behind image (PDF)

Parallel/switched mode

User correction/annotation

Hidden in index

Image highlighting

PDF

Parallel/Switched

Crowdsourcing

Crowdsourcing examples

UIBK Catalogue

NLA Newspapers

http://trove.nla.gov.au/newspaper

Digitalkoot

http://www.digitalkoot.fi/en/splash

Concert

TranscriBentham

http://www.transcribe-bentham.da.ulcc.ac.uk/td/Transcribe_Bentham

UIBK Catalogue

Trove I

Trove II

Digitalkoot

Concert

TranscriBentham

Prototypes

Prototype: FEP

Prototype: Assets

http://virserv.isti.cnr.it:8080/assetsIRService/index

Prototype: Semantic Search

http://eculture.cs.vu.nl/europeana/session/search

Prototype: Waisda

http://waisda.q42.net/, http://blog.waisda.nl/

Prototype: Geospatial Search

Prototype: Image Annotation

http://dme.arcs.ac.at/annotation/

Problem: No Flash in Europeana (A/V content)

Prototype: Random Image Explorer

http://europeana.fe2.nl/ (Willem Jan Faber, KB)

Solution: Common API

API = Application Programming Interface

Set of descriptions defining how to access an electronic resource/application through a common interface

API

Documented Interface Definition

Machine readable

Public/shared

API Benefits

Data/functionality available through documented, public interfaces

Anybody can use it

Can be integrated in other services/tools

Can be compared, combined, linked

Libraries need not be the actual host
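
The benefits above can be sketched with two providers that expose the same `search(query)` interface, so that a third party can combine their results without knowing how either one stores its data. All class, function, and collection names here are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    identifier: str
    title: str
    source: str


class Collection:
    """One provider exposing the shared search(query) interface."""

    def __init__(self, name, titles):
        self.name = name
        self.titles = titles  # maps identifier -> title

    def search(self, query):
        wanted = query.lower()
        return [
            Record(identifier, title, self.name)
            for identifier, title in self.titles.items()
            if wanted in title.lower()
        ]


def search_all(query, collections):
    """Combine results from every provider that implements search()."""
    results = []
    for collection in collections:
        results.extend(collection.search(query))
    return results


library = Collection("library", {"lib:1": "Old Newspapers", "lib:2": "Maps"})
archive = Collection("archive", {"arc:7": "Newspaper Clippings"})
hits = search_all("newspaper", [library, archive])
print([(hit.source, hit.identifier) for hit in hits])
```

`search_all` depends only on the documented interface, which is why the combining service need not be hosted by either provider.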