mod_oai: Metadata Harvesting for Everyone
![Page 1: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/1.jpg)
mod_oai: Metadata Harvesting for Everyone
Michael L. Nelson, Herbert Van de Sompel, Xiaoming Liu, Aravind Elango
{mln,aelango}@cs.odu.edu, {herbertv,liu_x}@lanl.gov
DLF 2004 Fall Forum, Baltimore MD
October 25-27, 2004
mod_oai is sponsored by the Andrew Mellon Foundation
![Page 2: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/2.jpg)
Outline
• mod_oai
– crawling vs. harvesting
– complex objects & OAI-PMH
– how mod_oai works
– scenarios
– demos
• More information
– http://www.modoai.org/
– http://www.openarchives.org/
![Page 3: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/3.jpg)
Inefficient Web Crawlers
www.getty.edu
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
doc100; last mod 2003-09-113…
what documents have been modified since 2003-11-15?
robot image from: http://www.q-design.com/toy/ToyArt/robots/55.JPEG
![Page 4: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/4.jpg)
A More Efficient Way…
www.getty.edu with OAI-PMH
doc1; last mod 2003-03-12
doc2; last mod 2002-07-19
doc3; last mod 2003-11-29
doc4; last mod 2002-10-03
doc100; last mod 2003-09-113…
what documents have been modified since 2003-11-15?
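The contrast between the two slides can be sketched in a few lines: instead of visiting every document, the harvester asks one date-qualified question. This is only an illustration using the four fully legible datestamps from the slide; `catalog` and `modified_since` are hypothetical names, and the cutoff is treated as inclusive, as the OAI-PMH `from` argument is.

```python
from datetime import date

# Datestamps copied from the slide. doc100 is omitted because its
# date is garbled in the transcript.
catalog = {
    "doc1": date(2003, 3, 12),
    "doc2": date(2002, 7, 19),
    "doc3": date(2003, 11, 29),
    "doc4": date(2002, 10, 3),
}


def modified_since(docs, cutoff):
    """Return the names of documents whose datestamp is on or after cutoff."""
    return sorted(name for name, stamp in docs.items() if stamp >= cutoff)


updates = modified_since(catalog, date(2003, 11, 15))
print(updates)
```

A crawler without this information would have to request all of the documents to reach the same answer.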
![Page 5: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/5.jpg)
mod_oai
• Goal: integrate OAI-PMH functionality into the web server itself…
• mod_oai: an Apache 2.0 module to automatically answer OAI-PMH requests for an http server
– written in C
– respects values in .htaccess, httpd.conf
• Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)
• www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=video:mpeg
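A request like the one on this slide can be assembled programmatically. The sketch below assumes the standard OAI-PMH convention that the operation is passed as a `verb` argument; `oai_request` is a hypothetical helper, not part of mod_oai, and `www.foo.edu` is the slide's placeholder host.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, arguments=None):
    """Build an OAI-PMH request URL: baseURL + '?' + verb + other arguments."""
    query = {"verb": verb}
    query.update(arguments or {})
    # Keep ':' unescaped so the set spec reads the same as on the slide.
    return base_url + "?" + urlencode(query, safe=":")


url = oai_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    {"metadataPrefix": "oai_dc", "from": "2004-09-15", "set": "video:mpeg"},
)
print(url)
```

Passing the arguments as a dict avoids the clash between the `from` argument and the Python keyword of the same name.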
![Page 6: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/6.jpg)
OAI-PMH data model
[Diagram: a resource is represented by an item. The item's OAI-PMH identifier = entry point to all records pertaining to the resource. Each record is metadata pertaining to the resource, i.e., a modeled representation of the resource.]
• Dublin Core metadata: simple model
• MARCXML metadata: more expressive model
• MPEG-21 DIDL: complex model
• METS: complex model
![Page 7: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/7.jpg)
OAI-PMH and complex models
• OAI-PMH record == modeled representation of the resource
• Can be selectively harvested via OAI-PMH ~ datestamp, set
• Resource can be:
– simple object (1 file)
– compound object (multiple files)
• OAI-PMH records can contain:
– Typical metadata
– Actual resource(s)
  • By-Value – base64 encoded
  • By-Reference – http address of resource
  • both
– Identifiers of metadata and resource(s), unambiguously mapped to the identified data
– A variety of secondary information
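The by-value and by-reference options can be illustrated with a small sketch. It uses a plain dict rather than real MPEG-21 DIDL XML; `package_resource`, `unpack_value`, the dict keys, and the example URL are all hypothetical.

```python
import base64


def package_resource(url, content=None):
    """Describe a resource by-reference (URL only) or, when content is
    supplied, by-reference and by-value (base64-encoded bytes)."""
    entry = {"ref": url}
    if content is not None:
        entry["value"] = base64.b64encode(content).decode("ascii")
    return entry


def unpack_value(entry):
    """Recover the original bytes from a by-value entry."""
    return base64.b64decode(entry["value"])


by_ref = package_resource("http://www.foo.edu/paper.pdf")
both = package_resource("http://www.foo.edu/paper.pdf", b"%PDF-1.4 example bytes")
print(sorted(by_ref), sorted(both))
```

By-value delivery lets a harvester obtain the resource in the same response as its metadata, at the cost of a larger record; by-reference keeps the record small but requires a second HTTP request.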
![Page 8: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/8.jpg)
Complex Objects & OAI-PMH
• LANL Repository
– OAI-PMH as a Repository Access Protocol to access metadata and content represented as DIDLs
• APS/LANL/LoC Mirroring
– OAI-PMH transfer of APS content represented in application neutral format (DIDLs)
• LANL DSpace Plug-in
– Exposes MPEG-21 DIDL documents through built-in DSpace OAI-PMH infrastructure
![Page 9: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/9.jpg)
How mod_oai works
• Install on an Apache 2.0 server
– compile & edit httpd.conf
http://www.foo.edu/ now has an OAI-PMH baseURL of:
http://www.foo.edu/modoai
![Page 10: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/10.jpg)
OAI-PMH characteristics: Typical Repository

| OAI-PMH Entity | | value | description |
|---|---|---|---|
| Resource | | URL | PDF, PS, XML, HTML or other file |
| Item | identifier | OAI Identifier | DNS-based name of metadata about resource |
| Item | set membership | LCSH | Library of Congress Subject Heading |
| Record | metadataPrefix | oai_dc | bibliographic metadata in Dublin Core |
| Record | datestamp | 2004-10-18 | modification date of DC record |
| Record | metadataPrefix | oai_marc | bibliographic metadata in MARC |
| Record | datestamp | 2004-07-31 | modification date of MARC record |
![Page 11: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/11.jpg)
OAI-PMH Data Model in mod_oai
[Diagram: the resource (http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf) is represented by an item whose records are the DC, HTTP, and DIDL modeled representations.]
• OAI Identifier == URL of Resource
• Set membership == MIME type
• Records: Dublin Core metadata; HTTP headers; DIDL: base64 or urls + HTTP headers
![Page 12: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/12.jpg)
OAI-PMH characteristics: mod_oai

| OAI-PMH Entity | | value | description |
|---|---|---|---|
| Resource | | URL | HTML, GIF, PDF or other web file |
| Item | identifier | URL | same URL as the resource |
| Item | set membership | MIME type | MIME type of the resource |
| Record | metadataPrefix | http_header | the http headers that would have been returned via HTTP GET/HEAD |
| Record | datestamp | 2004-07-31 | modification date of resource |
| Record | metadataPrefix | oai_dc | a subset of http_header in DC |
| Record | datestamp | 2004-07-31 | modification date of resource |
| Record | metadataPrefix | oai_didl | MPEG-21 DIDL: base64 encoded resource + http_header metadata |
| Record | datestamp | 2004-07-31 | modification date of resource |
![Page 13: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/13.jpg)
OAI-PMH Concepts

| concept | mod_oai interpretation |
|---|---|
| OAI Identifier | URL of resource |
| set | MIME type of resource |
| datestamp | change time of resource |
| deleted records | “no” deleted records |
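The mapping above can be sketched as a function from a web file to the three OAI-PMH values. This is not mod_oai's C implementation. `describe` and the example path are hypothetical, the `type:subtype` set spelling is inferred from the `set=video:mpeg` example earlier in the talk, and a POSIX timestamp stands in for the file's change time.

```python
import mimetypes
from datetime import datetime, timezone


def describe(base_url, path, changed_at):
    """Map a web file to (identifier, set, datestamp) in the mod_oai sense."""
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "application/octet-stream"  # fallback for unknown types
    stamp = datetime.fromtimestamp(changed_at, tz=timezone.utc)
    return {
        "identifier": base_url.rstrip("/") + "/" + path.lstrip("/"),  # URL of resource
        "set": mime.replace("/", ":"),  # MIME type of resource
        "datestamp": stamp.strftime("%Y-%m-%d"),  # change time of resource
    }


# 1091275200 is 2004-07-31 12:00:00 UTC.
item = describe("http://www.foo.edu", "/reports/paper.pdf", 1091275200)
print(item)
```

Because the identifier is simply the URL, a harvester needs no lookup table to get from an OAI-PMH header back to the resource itself.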
![Page 14: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/14.jpg)
http_header
![Page 15: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/15.jpg)
Use Cases
• Regular Web Crawling
– use ListIdentifiers to discover URLs
– add new URLs to the list of URLs to be crawled
• Harvesting Resources w/ OAI-PMH
– use ListRecords to extract the entire resource as an MPEG-21 DIDL AIP
![Page 16: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/16.jpg)
Regular Crawling: ListIdentifiers
harvester issues a ListIdentifiers, finds the updates, and does HTTP GETs on just the updates
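The crawler side of this exchange might look like the sketch below, which pulls the URLs out of a ListIdentifiers response so that only those need an HTTP GET. The XML namespace is the standard OAI-PMH 2.0 one; `SAMPLE_RESPONSE` is a trimmed, made-up response (a real one also carries `responseDate` and `request` elements), and `urls_to_fetch` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical, abbreviated response body for illustration only.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/index.html</identifier>
      <datestamp>2004-09-20</datestamp>
      <setSpec>text:html</setSpec>
    </header>
    <header>
      <identifier>http://www.foo.edu/talk.mpg</identifier>
      <datestamp>2004-10-01</datestamp>
      <setSpec>video:mpeg</setSpec>
    </header>
  </ListIdentifiers>
</OAI-PMH>
"""


def urls_to_fetch(response_xml):
    """Return the identifier of every header; in mod_oai these are URLs."""
    root = ET.fromstring(response_xml.strip())
    return [
        header.findtext("oai:identifier", namespaces=OAI_NS)
        for header in root.iterfind(".//oai:header", OAI_NS)
    ]


to_crawl = urls_to_fetch(SAMPLE_RESPONSE)
print(to_crawl)
```

The resulting list can be appended to an existing crawl queue, so the rest of the crawler does not need to know about OAI-PMH.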
![Page 17: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/17.jpg)
Resource Harvesting: ListRecords
harvester issues a ListRecords, and gets the updates in DIDLs (http headers + by-value or by-ref resources)
![Page 18: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/18.jpg)
Demo
• Repository Explorer
– http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
– http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai?archive=http://whiskey.cs.odu.edu/modoai
• Direct URLs
– http://whiskey.cs.odu.edu/modoai?verb=Identify
– http://whiskey.cs.odu.edu/modoai?verb=ListMetadataFormats
– http://whiskey.cs.odu.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc
– http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=http_header
– http://whiskey.cs.odu.edu/modoai?verb=ListRecords&metadataPrefix=oai_didl
![Page 19: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/19.jpg)
Datestamps and Etags
• Procedure
– 16 harvests over 1 month of 465,374 .dk domains
– 5,543,470 possible downloads
– 5,182,034 successful downloads
– 599,143 changes
Datestamp and Etag Example
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004
http://www.netarchive.dk/website/publications/Etags-2004.pdf
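The counts above imply two ratios that the slide does not state. The arithmetic below is only a derived illustration; in particular, dividing the changes by the successful downloads is an assumption about how the counts relate.

```python
# Counts copied from the slide.
possible_downloads = 5_543_470
successful_downloads = 5_182_034
changes = 599_143

# Share of attempted downloads that succeeded.
success_rate = successful_downloads / possible_downloads

# Share of successful downloads that were changes (assumed denominator).
change_rate = changes / successful_downloads

print(f"successful: {success_rate:.1%}, changed: {change_rate:.1%}")
```

On this reading, the large majority of successful downloads retrieved content that had not changed, which is the inefficiency that change indicators are meant to reduce.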
![Page 20: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/20.jpg)
Errors in Datestamps and Etags Indicating Change

| | Etags | Datestamps |
|---|---|---|
| missed change | 0.087% | 0.30% |
| redundant crawl | 32% | 10.7% |

40.1% of pages without Etags
0.07% of pages without Datestamps
L. Clausen, “Concerning Etags and Datestamps”, 4th International Web Archiving Workshop, ECDL 2004
http://www.netarchive.dk/website/publications/Etags-2004.pdf
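One way to read the two error types is sketched below: a "missed change" is content that changed while the indicator stayed the same, and a "redundant crawl" is an indicator that changed while the content did not. These definitions are an interpretation of the slide's labels, not taken from the cited paper, and the snapshot dicts and `classify` helper are hypothetical.

```python
import hashlib


def digest(body):
    """Fingerprint the page body so two snapshots can be compared."""
    return hashlib.sha256(body).hexdigest()


def classify(prev, curr, indicator):
    """Compare what the indicator ('etag' or 'last_modified') claims
    with what actually happened to the body between two harvests."""
    indicated = prev[indicator] != curr[indicator]
    actual = digest(prev["body"]) != digest(curr["body"])
    if actual and not indicated:
        return "missed change"
    if indicated and not actual:
        return "redundant crawl"
    return "correct"


first = {"etag": '"a1"', "last_modified": "2004-09-01", "body": b"hello"}
same_body_new_etag = {"etag": '"b2"', "last_modified": "2004-09-01", "body": b"hello"}
new_body_same_date = {"etag": '"c3"', "last_modified": "2004-09-01", "body": b"hello!"}

print(classify(first, same_body_new_etag, "etag"))
print(classify(first, new_body_same_date, "last_modified"))
```

Both indicators are imperfect in the figures above, so a harvester may want to consider both when deciding whether to skip a download.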
![Page 21: mod_oai: Metadata Harvesting for Everyone](https://reader035.fdocuments.us/reader035/viewer/2022081603/56813d54550346895da70eee/html5/thumbnails/21.jpg)
mod_oai…
• is:
– a simple way to more efficiently harvest web pages
– a possible impact on robots.txt
– fully OAI-PMH compliant
  • works with existing harvesters
• is not:
– yet suitable for dynamic files
– a replacement for
  • DSpace
  • Fedora
  • eprints.org
  • other digital libraries / repositories / cms