NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Synchronization
Transcript of NISO Forum, Denver, September 24, 2012: ResourceSync: Web-Based Resource Synchronization
ResourceSync – Herbert Van de Sompel NISO Forum, September 24 2012, Denver, CO
ResourceSync: Web-Based
Resource Synchronization
Herbert Van de Sompel Los Alamos National Laboratory
@hvdsomp
ResourceSync is funded by The Sloan Foundation & JISC #resourcesync
ResourceSync Core Team – NISO & OAI
Los Alamos National Laboratory & OAI: Martin Klein, Robert Sanderson, Herbert Van de Sompel
Cornell University & OAI: Bernhard Haslhofer, Simeon Warner
Old Dominion University & OAI: Michael L. Nelson
University of Michigan & OAI: Carl Lagoze
NISO: Todd Carpenter, Nettie Lagace, Peter Murray
ResourceSync Technical Group
• Manuel Bernhardt, Delving B.V.
• Kevin Ford, Library of Congress
• Richard Jones, JISC
• Graham Klyne, JISC
• Stuart Lewis, JISC
• David Rosenthal, LOCKSS
• Christian Sadilek, Red Hat
• Shlomo Sanders, Ex Libris, Inc.
• Sjoerd Siebinga, Delving B.V.
• Ed Summers, Library of Congress
• Jeff Young, OCLC Online Computer Library Center
ResourceSync
ResourceSync: What & Why?
Problem Perspective & Conceptual Approach
Possible Technical Choices
Q&A
Synchronize What?
• Web resources – things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies etc.)
• Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)
• That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary
• Focus on needs of research communication and cultural heritage organizations, but aim for generality
Why?
… because lots of projects and services are doing synchronization but have to resort to ad-hoc, case by case, approaches!
• Project team involved with projects that need this
• Experience with OAI-PMH: widely used in repos but
  o XML metadata only
  o Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) not successful.
  o Web technology has moved on since 1999
• Devise a shared solution for data, metadata, linked data?
Use Cases – The Basics
Use Cases - More
Out Of Scope (For Now)
• Bidirectional synchronization
• Destination-defined selective synchronization (query)
• Bulk URI migration
Use Case: arXiv Mirroring
• 1M article versions, ~800/day created or updated at 8 PM US Eastern Time
• Metadata and full-text for each article
• Accuracy important
• Want low barrier for others to use
• Look for more general solution than current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)
Use Case: DBpedia Live Duplication
• Average of 2 updates per second
• Want low latency => need a push technology
ResourceSync Problem
• Consideration:
  • Source (server) A has resources that change over time: they get created, modified, deleted
  • Destination (servers) X, Y, and Z leverage (some) resources of Source A.
• Problem:
  • Destinations want to keep in step with the resource changes at source A: resource synchronization.
• Goal:
  • Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
  • The approach must scale better than recurrent HTTP HEAD/GET on resources.
Destination: 3 Basic Synchronization Needs
1. Baseline synchronization – A destination must be able to perform an initial load or catch-up with a source
- avoid out-of-band setup
2. Incremental synchronization – A destination must have some way to keep up-to-date with changes at a source
- subject to some latency; minimal: create/update/delete
- allow catching up after destination has been offline
3. Audit – A destination should be able to determine whether it is synchronized with a source
- subject to some latency
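The audit need can be illustrated with a small sketch. It assumes, purely for illustration, that source and destination can each be summarized as a mapping from resource URI to last-modification timestamp; the `audit` and `in_sync` names are hypothetical and not taken from any ResourceSync draft.

```python
from typing import Dict, Set, Tuple

# Hypothetical model: URI -> last-modification time as an ISO 8601 UTC string.
# Strings in the fixed form YYYY-MM-DDTHH:MM:SSZ sort chronologically as text.
Inventory = Dict[str, str]


def audit(source: Inventory, destination: Inventory) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (missing, stale, extra) URIs as seen from the destination."""
    missing = set(source) - set(destination)   # created at source, never fetched
    extra = set(destination) - set(source)     # deleted at source, still held
    stale = {
        uri
        for uri in set(source) & set(destination)
        if destination[uri] < source[uri]      # updated at source since last fetch
    }
    return missing, stale, extra


def in_sync(source: Inventory, destination: Inventory) -> bool:
    """True when the audit finds nothing to create, update, or delete."""
    return not any(audit(source, destination))
```

A non-empty `missing` or `stale` set calls for GETs, and a non-empty `extra` set calls for local deletions; how much drift is acceptable depends on the latency the destination tolerates.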
Source Capability 1: Describing Content
In order to advertise the resources that a source wants destinations to know about, it may describe them:
o Publish an inventory of resource URIs and possibly associated metadata
  - Destination GETs the Content Description
  - Destination GETs listed resources by their URI
Source Capability 2: Communicating Change Events
In order to achieve lower latency, a source may communicate about changes to its resources:
o 2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
  - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
o 2.2. Push Change Set: Push a list of recent change events (create, update, delete resource) towards (a) destination(s)
  - Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
Source Capability 3: Providing Access to Versions
In order to allow a destination to catch up with missed changes, a source may support:
o 3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set
o 3.2. Historical Content: Provide access to prior resource versions
Source Capability 4: Transferring Content
By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:
o 4.1. Dump: Publish a package of resource representations and necessary metadata
  - Destination GETs the Dump
  - Destination unpacks the Dump
o 4.2. Alternate Content Transfer: Support alternative mechanisms to optimize getting content (see later)
Source: Advertise Capabilities
A source needs to advertise the capabilities it supports to allow a destination to discover them
• Some capabilities may be provided by a third party, not the source itself
  o e.g. Historical Change Sets, Historical Content
  o But the source should still make those third party capabilities discoverable - trust
ResourceSync: A Framework of Capabilities
• Modular framework allowing selective deployment of capabilities
• A Source selects which capabilities to support in order to meet local and community needs
• A Source’s Capabilities can be discovered via capability descriptions
BY REFERENCE!
BY VALUE!
Sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2012-08-08T13:22:00Z</lastmod>
  </url>
</urlset>
Baseline Matching - Sitemap
• Periodic publication of up-to-date Sitemap, which is a “by reference” inventory of a Source’s resources
• Use “as is” with resource location and last modification date as core elements
• Introduce extension elements aimed at supporting audit: e.g. MD5 hash of content
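How a destination might consume such a Sitemap for baseline matching can be sketched as follows. This is an illustration under assumptions, not a prescribed client: `parse_sitemap` and `plan_baseline` are hypothetical helper names, and the destination's local state is assumed to be a plain mapping from URI to the last-modification value it last saw.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Same content as the Sitemap slide.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res1</loc><lastmod>2012-08-08T08:15:00Z</lastmod></url>
  <url><loc>http://example.com/res2</loc><lastmod>2012-08-08T13:22:00Z</lastmod></url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return {URI: lastmod or None} for every <url> entry."""
    root = ET.fromstring(xml_text.strip().encode("utf-8"))
    inventory = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if loc:
            inventory[loc] = lastmod or None
    return inventory


def plan_baseline(inventory, local):
    """URIs to GET: unknown locally, or held with a different lastmod."""
    return sorted(uri for uri, lastmod in inventory.items() if local.get(uri) != lastmod)


inventory = parse_sitemap(SITEMAP)
to_fetch = plan_baseline(inventory, {"http://example.com/res1": "2012-08-08T08:15:00Z"})
```

An audit extension such as a content hash would be read the same way as `lastmod`, once its element name is settled.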
robots.txt!
discovery
Baseline Matching – Dump
• A Dump is a “by-value” inventory of a Source’s resources
• Periodic publication of an up-to-date Dump
• Possible technology: ZIP file consisting of:
  • Special-purpose Sitemap that acts as a manifest for resources contained in the ZIP file
    • Introduce an element to express correspondence between resource URI and filename in the ZIP file
  • Resource bitstreams
• Possible technology: WARC file
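The ZIP option can be sketched roughly as below. The slide does not name the manifest file or the element that links a URI to a filename, so `manifest.xml`, the `rs:path` element, and the `build_dump` / `unpack_dump` helpers are all assumptions made for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"
MANIFEST = "manifest.xml"  # hypothetical name for the manifest Sitemap


def build_dump(resources):
    """Pack {URI: bytes} into a ZIP holding the bitstreams plus a manifest."""
    root = ET.Element(f"{{{SM}}}urlset")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i, (uri, body) in enumerate(sorted(resources.items())):
            name = f"resources/{i:06d}"
            zf.writestr(name, body)
            url = ET.SubElement(root, f"{{{SM}}}url")
            ET.SubElement(url, f"{{{SM}}}loc").text = uri
            # rs:path is a hypothetical element mapping the URI to the ZIP member.
            ET.SubElement(url, f"{{{RS}}}path").text = name
        zf.writestr(MANIFEST, ET.tostring(root, encoding="utf-8"))
    return buf.getvalue()


def unpack_dump(data):
    """Read the manifest and return {URI: bytes} for every listed resource."""
    ns = {"sm": SM, "rs": RS}
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read(MANIFEST))
        return {
            url.findtext("sm:loc", namespaces=ns): zf.read(url.findtext("rs:path", namespaces=ns))
            for url in root.findall("sm:url", ns)
        }
```

Keeping the URI-to-filename mapping in the manifest means the destination never has to derive filenames from URIs, which avoids tying the Dump to a particular filesystem layout.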
Change Communication – Pull Change Sets
• Periodic publication of a Change Set that describes recent changes
• A Change Set is a Sitemap-style document, enhanced to express change events rather than inventory. Per change event, convey:
  • About the event:
    • datetime
    • event type: create/update/delete (maybe move/copy)
  • About the changed resource:
    • URI
    • Information relevant for audit, e.g. fixity, size, mime type
    • Further information to aid accessing the resource (see later)
Change Set, Based on Sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod rs:type="created">2012-08-08T10:22:00Z</lastmod>
  </url>
</urlset>
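A destination could act on a Sitemap-based Change Set along the following lines. This is a sketch only: the slide shows the `updated` and `created` event types, so the `deleted` value, the third sample entry, and the `parse_change_set` / `apply_events` helpers are assumptions added for illustration, and `fetch` stands in for an HTTP GET.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"

CHANGE_SET = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset rs:type="changeset"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url><loc>http://example.com/res1</loc>
       <lastmod rs:type="updated">2012-08-08T08:15:00Z</lastmod></url>
  <url><loc>http://example.com/res2</loc>
       <lastmod rs:type="created">2012-08-08T10:22:00Z</lastmod></url>
  <url><loc>http://example.com/res3</loc>
       <lastmod rs:type="deleted">2012-08-08T11:00:00Z</lastmod></url>
</urlset>"""


def parse_change_set(xml_bytes):
    """Return (datetime, event type, URI) tuples in chronological order."""
    root = ET.fromstring(xml_bytes)
    events = []
    for url in root.findall(f"{{{SM}}}url"):
        loc = url.findtext(f"{{{SM}}}loc").strip()
        lastmod = url.find(f"{{{SM}}}lastmod")
        events.append((lastmod.text.strip(), lastmod.get(f"{{{RS}}}type"), loc))
    return sorted(events)


def apply_events(events, store, fetch):
    """GET created/updated resources into the store, drop deleted ones."""
    for _when, kind, uri in events:
        if kind in ("created", "updated"):
            store[uri] = fetch(uri)
        elif kind == "deleted":  # assumed event type value, not shown on the slide
            store.pop(uri, None)


store = {"http://example.com/res1": b"old", "http://example.com/res3": b"stale"}
events = parse_change_set(CHANGE_SET)
apply_events(events, store, fetch=lambda uri: b"body of " + uri.encode("ascii"))
```

Applying events in datetime order matters when one resource appears more than once in a Change Set, for example created and then deleted.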
Change Set, from Scratch
<?xml version="1.0" encoding="UTF-8"?>
<changeset xmlns="http://www.openarchives.org/rs/changeset">
  <change>
    <link rel="created" length="1234" type="text/html"
          href="http://example.com/res1.html"/>
    <date>2012-09-25T09:00:00Z</date>
    <fixity>ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx</fixity>
  </change>
</changeset>
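The `ni:///sha-256;...` value in the fixity element has the shape of a Named Information URI (RFC 6920): the SHA-256 digest of the content, encoded as URL-safe base64 without padding. Assuming that reading, which the slide itself does not state, a destination could check a fetched resource as sketched below; `ni_sha256` and `fixity_matches` are hypothetical helper names.

```python
import base64
import hashlib


def ni_sha256(content: bytes) -> str:
    """Build an ni:///sha-256 URI for the given bytes (RFC 6920 style)."""
    digest = hashlib.sha256(content).digest()
    # URL-safe base64 with the trailing '=' padding removed.
    token = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "ni:///sha-256;" + token


def fixity_matches(content: bytes, fixity: str) -> bool:
    """Compare fetched content against the fixity value from a change event."""
    return ni_sha256(content) == fixity


uri = ni_sha256(b"Hello World!")
ok = fixity_matches(b"Hello World!", uri)
```

A full SHA-256 token is 43 characters long, so a destination would need the complete value from the Change Set for the comparison to succeed.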
Change Communication – Push Change Sets
• Use a push technology to convey changes
• Express changes using same Sitemap-style document
  • A Change Set in this case might convey only one change event
• Possible technology: XMPP PubSub
<XMPP PubSub Intermezzo>
XMPP Publish-Subscribe: Client to Subscription Service, Subscription Service to Client(s) communication
• One of the XMPP (Extensible Messaging and Presence Protocol) extensions http://xmpp.org/extensions/xep-0060.html
• Apple Notifications based on XMPP PubSub
• Both client and server tools widely available
</XMPP PubSub Intermezzo>
Source Destination PubSub Server
Change Communication Memory
• Publication of one or more Change Sets that convey historical (rather than recent) changes
• All historical Change Sets use same Sitemap-style document
• Same approach irrespective of whether pull or push is used for Change Communication
Resource Transfer
• Resources are obtained in bulk by obtaining a Dump
• An individual resource is, by default, obtained by dereferencing a resource’s URI listed in:
  • Sitemap
  • Change Set
• Alternative access mechanisms are introduced to obtain an individual resource:
  • From a mirror site
  • Access to diff with previous version instead of access to the entire changed resource
  • Resource version
Resource Memory
• Requires a (short or long term) archive of resource versions
• Access to specific version can be expressed as an alternative access mechanism in e.g. Change Set.
  • Via a link to a version resource that is the result of the change expressed in the Change Set
  • Via a link to a Memento TimeGate that supports access to all available prior versions
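In Memento, a client asks a TimeGate for the version of a resource that was current at a given moment by sending that moment in an `Accept-Datetime` request header. A minimal sketch of building the header follows; the `timegate_headers` helper is hypothetical, and the TimeGate URI itself would come from the link carried in the Change Set.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def timegate_headers(when: datetime) -> dict:
    """Build request headers asking a TimeGate for the version current at `when`."""
    if when.tzinfo is None:
        raise ValueError("when must be timezone-aware")
    # Accept-Datetime uses the HTTP date format, expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


headers = timegate_headers(datetime(2012, 8, 8, 8, 15, 0, tzinfo=timezone.utc))
```

These headers can be passed to any HTTP client issuing the GET against the TimeGate; the response then leads to the archived version closest to the requested time.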
<Memento Intermezzo>
http://www.mementoweb.org/
Original Resources and Mementos
Bridge from Present to Past
Bridge from Past to Present
Memento Framework
Original Resource: http://lanlsource.lanl.gov/pics/picoftheday.png
Memento Framework
Time Travel across Versions of a Picture of the Day
Movie at: http://www.mementoweb.org/demo/picoftheday.mov
Original Resource: http://dbpedia.org/resource/France
Memento Framework
Time-Series Analysis across DBpedia Versions
Data collected through HTTP Navigation
Paper at http://arxiv.org/abs/1003.3661
</Memento Intermezzo>
http://www.mementoweb.org/
ResourceSync Timeline
• August 2012
  o First draft spec shared for feedback with ResourceSync team
• September 2012
  o Problem Statement paper in D-Lib Magazine
  o In-person meeting of ResourceSync Team
• October 2012
  o Revise spec, conduct experiments
  o Solicit broad feedback
• December 2012
  o Finalize specification (?)
Pointers
• First ResourceSync draft spec (do not implement!): http://www.openarchives.org/rs/0.1/resourcesync
• ResourceSync Simulator code on github http://github.org/resync/simulator
• NISO ResourceSync workspace http://www.niso.org/workrooms/resourcesync/
• Memento http://mementoweb.org
ResourceSync: Get the Sticker!
Herbert Van de Sompel Los Alamos National Laboratory
@hvdsomp
ResourceSync is funded by The Sloan Foundation & JISC