ResourceSync: Web-Based Resource Synchronization

description

Presentation about the NISO/OAI ResourceSync effort used at TICER 2012 Summer School.

Transcript of ResourceSync: Web-Based Resource Synchronization

Page 1: ResourceSync: Web-Based Resource Synchronization

ResourceSync – Herbert Van de Sompel TICER Summer School, August 22 2012, Tilburg, The Netherlands

ResourceSync: Web-Based Resource Synchronization

Herbert Van de Sompel

Los Alamos National Laboratory @hvdsomp

ResourceSync is funded by The Sloan Foundation & JISC

Page 2: ResourceSync: Web-Based Resource Synchronization

ResourceSync Core Team – NISO & OAI

Cornell University & OAI: Bernhard Haslhofer, Carl Lagoze, Simeon Warner

Old Dominion University & OAI: Michael L. Nelson

Los Alamos National Laboratory & OAI: Martin Klein, Robert Sanderson, Herbert Van de Sompel

NISO: Todd Carpenter, Nettie Lagace, Peter Murray

Page 3: ResourceSync: Web-Based Resource Synchronization

ResourceSync Technical Group

•  Manuel Bernhardt, Delving B.V.
•  Kevin Ford, Library of Congress
•  Richard Jones, JISC
•  Graham Klyne, JISC
•  Stuart Lewis, JISC
•  David Rosenthal, LOCKSS
•  Christian Sadilek, Red Hat
•  Shlomo Sanders, Ex Libris, Inc.
•  Sjoerd Siebinga, Delving B.V.
•  Ed Summers, Library of Congress
•  Jeff Young, OCLC Online Computer Library Center

Page 4: ResourceSync: Web-Based Resource Synchronization

ResourceSync

•  ResourceSync: What & Why?
•  Problem Perspective & Conceptual Approach
•  Technical Details
•  Q&A

Page 5: ResourceSync: Web-Based Resource Synchronization

ResourceSync

•  ResourceSync: What & Why?
•  Problem Perspective & Conceptual Approach
•  Technical Details
•  Q&A

Page 6: ResourceSync: Web-Based Resource Synchronization

Synchronize What?

•  Web resources – things with a URI that can be dereferenced and are cache-able (no dependency on underlying OS, technologies etc.)

•  Small websites/repositories (a few resources) to large repositories/datasets/linked data collections (many millions of resources)

•  That change slowly (weeks/months) or quickly (seconds), and where latency needs may vary

•  Focus on needs of research communication and cultural heritage organizations, but aim for generality

Page 7: ResourceSync: Web-Based Resource Synchronization

Why?

… because lots of projects and services are doing synchronization but have to resort to ad hoc, case-by-case approaches!

•  Project team involved with projects that need this

•  Experience with OAI-PMH: widely used in repos but
   o  XML metadata only
   o  Attempts at synchronizing actual content via OAI-PMH (complex object formats, dc:identifier) not successful
   o  Web technology has moved on since 1999

•  Devise a shared solution for data, metadata, linked data?

Page 8: ResourceSync: Web-Based Resource Synchronization

Use Cases – The Basics

Page 9: ResourceSync: Web-Based Resource Synchronization

Use Cases - More

Page 10: ResourceSync: Web-Based Resource Synchronization

Out Of Scope (For Now)

•  Bidirectional synchronization

•  Destination-defined selective synchronization (query)

•  Bulk URI migration

Page 11: ResourceSync: Web-Based Resource Synchronization

Use Case: arXiv Mirroring

•  1M article versions, ~800/day created or updated at 8 PM US Eastern Time

•  Metadata and full-text for each article

•  Accuracy important

•  Want low barrier for others to use

•  Look for more general solution than current homebrew mirroring (running with minor modifications since 1994!) and occasional rsync (filesystem layout specific, auth issues)

Page 12: ResourceSync: Web-Based Resource Synchronization

Use Case: DBpedia Live Duplication

•  Average of 2 updates per second

•  Want low latency => need a push technology

Page 13: ResourceSync: Web-Based Resource Synchronization

ResourceSync

•  ResourceSync: What & Why?
•  Problem Perspective & Conceptual Approach
•  Technical Details
•  Q&A

Page 14: ResourceSync: Web-Based Resource Synchronization

ResourceSync Problem

•  Consideration:
   •  Source (server) A has resources that change over time: they get created, modified, deleted
   •  Destination (servers) X, Y, and Z leverage (some) resources of Source A.

•  Problem:
   •  Destinations want to keep in step with the resource changes at Source A: resource synchronization.

•  Goal:
   •  Design an approach for resource synchronization aligned with the Web Architecture that has a fair chance of adoption by different communities.
   •  The approach must scale better than recurrent HTTP HEAD/GET on resources.
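
To make the scaling goal concrete, the sketch below compares request counts per synchronization cycle for two strategies, using the arXiv figures from the earlier use case (about 1M article versions, about 800 created or updated per day). The function names, and the assumption that a list of changes can be fetched with a single request, are illustrative only and are not taken from the draft specification.

```python
def requests_by_polling(total_resources: int) -> int:
    """One HTTP HEAD per resource in every cycle, whether or not it changed."""
    return total_resources


def requests_by_change_list(changed_resources: int, list_requests: int = 1) -> int:
    """Fetch the list of changes (assumed: one request), then GET only what changed."""
    return list_requests + changed_resources


if __name__ == "__main__":
    total = 1_000_000  # arXiv: about 1M article versions
    changed = 800      # arXiv: about 800 created or updated per day
    print("recurrent HEAD polling:", requests_by_polling(total))
    print("change list plus GETs: ", requests_by_change_list(changed))
```

Under these assumptions a daily polling cycle costs one request per resource, while the change-driven cycle costs one request per changed resource plus the request for the list itself.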

Page 15: ResourceSync: Web-Based Resource Synchronization

Destination: 3 Basic Synchronization Needs

1.  Baseline synchronization – A destination must be able to perform an initial load or catch up with a source
    -  avoid out-of-band setup

2.  Incremental synchronization – A destination must have some way to keep up to date with changes at a source
    -  subject to some latency; minimal: create/update/delete
    -  allow catching up after the destination has been offline

3.  Audit – A destination should be able to determine whether it is synchronized with a source
    -  subject to some latency
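
The audit need can be illustrated with a minimal sketch, assuming both sides can be summarized as a mapping from resource URI to a last-modified timestamp string. The `AuditReport` type and the `audit` function are hypothetical names introduced here for illustration; the draft specification may define audit differently.

```python
from typing import Dict, List, NamedTuple


class AuditReport(NamedTuple):
    missing: List[str]  # listed by the source, absent at the destination
    stale: List[str]    # present on both sides, but timestamps differ
    extra: List[str]    # held by the destination, no longer listed by the source


def audit(source: Dict[str, str], destination: Dict[str, str]) -> AuditReport:
    """Compare URI -> lastmod maps and report where the destination is out of sync."""
    missing = sorted(uri for uri in source if uri not in destination)
    stale = sorted(
        uri for uri in source
        if uri in destination and source[uri] != destination[uri]
    )
    extra = sorted(uri for uri in destination if uri not in source)
    return AuditReport(missing, stale, extra)


def in_sync(report: AuditReport) -> bool:
    """The destination is synchronized when all three lists are empty."""
    return not (report.missing or report.stale or report.extra)
```

Because the source keeps changing while the comparison runs, a report like this is only valid subject to some latency, as the slide notes.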

Page 16: ResourceSync: Web-Based Resource Synchronization

Source Capability 1: Describing Content

In order to advertise the resources that a source wants destinations to know about, it may describe them:

o  Publish an inventory of resource URIs and possibly associated metadata
   -  Destination GETs the Content Description
   -  Destination GETs listed resources by their URI
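
As a rough illustration of the destination side, the sketch below reads an inventory expressed as a standard Sitemap (the format the framework builds on later in the talk) and lists the resource URIs with their optional `lastmod` values. The example document and the `list_resources` name are assumptions made for illustration; actually fetching each listed URI is left out.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Namespace of the standard Sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inventory, for illustration only.
EXAMPLE_INVENTORY = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/res1</loc><lastmod>2012-08-08T08:15:00Z</lastmod></url>
  <url><loc>http://example.com/res2</loc></url>
</urlset>"""


def list_resources(xml_text: str) -> List[Tuple[str, Optional[str]]]:
    """Return (URI, lastmod or None) for every <url> entry in a Sitemap document."""
    root = ET.fromstring(xml_text)
    resources = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            resources.append((loc.strip(), lastmod))
    return resources


if __name__ == "__main__":
    for uri, lastmod in list_resources(EXAMPLE_INVENTORY):
        print(uri, lastmod)
```

A destination would follow this with one GET per listed URI, for example with `urllib.request.urlopen`.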

Page 17: ResourceSync: Web-Based Resource Synchronization
Page 18: ResourceSync: Web-Based Resource Synchronization
Page 19: ResourceSync: Web-Based Resource Synchronization

Source Capability 2: Communicating Change Events

In order to achieve lower latency, a source may communicate about changes to its resources:

o  2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
   -  Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.

Page 20: ResourceSync: Web-Based Resource Synchronization
Page 21: ResourceSync: Web-Based Resource Synchronization
Page 22: ResourceSync: Web-Based Resource Synchronization
Page 23: ResourceSync: Web-Based Resource Synchronization
Page 24: ResourceSync: Web-Based Resource Synchronization

Source Capability 2: Communicating Change Events

In order to achieve lower latency, a source may communicate about changes to its resources:

o  2.1. Change Set: Publish a list of recent change events (create, update, delete resource)
   -  Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.

o  2.2. Push Change Set: Push a list of recent change events (create, update, delete resource) towards (a) destination(s)
   -  Destination acts upon change events, e.g. GETs created/updated resources, removes deleted resources.
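
How a destination might act upon change events can be sketched as follows, whether the Change Set was pulled or pushed. The event representation (a pair of change type and URI), the type labels `"created"`, `"updated"` and `"deleted"`, and the injected `fetch` callable are all assumptions for illustration, not the draft's wire format.

```python
from typing import Callable, Dict, Iterable, Tuple

ChangeEvent = Tuple[str, str]  # (change type, resource URI); hypothetical representation


def apply_change_set(
    events: Iterable[ChangeEvent],
    store: Dict[str, bytes],
    fetch: Callable[[str], bytes],
) -> None:
    """Apply change events in order to a local copy held in `store`."""
    for change, uri in events:
        if change in ("created", "updated"):
            # GET the created or updated resource and keep the new representation.
            store[uri] = fetch(uri)
        elif change == "deleted":
            # Remove the local copy; ignore resources we never held.
            store.pop(uri, None)
        else:
            raise ValueError(f"unknown change type: {change!r}")
```

Passing `fetch` in as a parameter keeps the sketch independent of any HTTP library; a real destination could supply a function that issues a GET against the source.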

Page 25: ResourceSync: Web-Based Resource Synchronization

Source Capability 3: Providing Access to Versions

In order to allow a destination to catch up with missed changes, a source may support:

o  3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set

Page 26: ResourceSync: Web-Based Resource Synchronization
Page 27: ResourceSync: Web-Based Resource Synchronization
Page 28: ResourceSync: Web-Based Resource Synchronization
Page 29: ResourceSync: Web-Based Resource Synchronization

Source Capability 3: Providing Access to Versions

In order to allow a destination to catch up with missed changes, a source may support:

o  3.1. Historical Change Sets: Provide access to change events that occurred prior to the ones listed in the current Change Set

o  3.2. Historical Content: Provide access to prior resource versions

Page 30: ResourceSync: Web-Based Resource Synchronization

Source Capability 4: Transferring Content

By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:

o  4.1. Dump: Publish a package of resource representations and necessary metadata
   -  Destination GETs the Dump
   -  Destination unpacks the Dump

Page 31: ResourceSync: Web-Based Resource Synchronization
Page 32: ResourceSync: Web-Based Resource Synchronization

Source Capability 4: Transferring Content

By default, content is transferred in response to a GET issued by a destination against a URI of a source’s resource. But a source may support additional mechanisms:

o  4.1. Dump: Publish a package of resource representations and necessary metadata
   -  Destination GETs the Dump
   -  Destination unpacks the Dump

o  4.2. Alternate Content Transfer: Support alternative mechanisms to optimize getting content, e.g. getting content via a mirror site, or getting only the changes rather than the entire changed resource.

Page 33: ResourceSync: Web-Based Resource Synchronization

Source: Advertise Capabilities

A source needs to advertise the capabilities it supports to allow a destination to discover them

•  Some capabilities may be provided by a third party, not the source itself
   o  e.g. Historical Change Sets, Historical Content
   o  But the source should still make those third party capabilities discoverable - trust

Page 34: ResourceSync: Web-Based Resource Synchronization

ResourceSync

•  ResourceSync: What & Why?
•  Problem Perspective & Conceptual Approach
•  Technical Details
•  Q&A

Page 35: ResourceSync: Web-Based Resource Synchronization

So Many Choices

[Diagram] Technologies shown: XMPP, AtomPub, SDShare, RSS, Atom, PubSubHubbub, Sitemap, rsync, OAI-PMH, WebDAV Col. Syn., OAI-ORE, DSNotify, RDFsync, SWORD, SPARQLpush

[Diagram] Labels shown: Crawl, Push, Pull

Page 36: ResourceSync: Web-Based Resource Synchronization

Page 37: ResourceSync: Web-Based Resource Synchronization

Page 38: ResourceSync: Web-Based Resource Synchronization

A Framework Based on Sitemaps

•  Modular framework allowing selective deployment

•  Sitemap is the core component throughout the framework
   o  Introduce extension elements and attributes:
      -  In ResourceSync namespace (rs:) to accommodate synchronization needs
      -  In XHTML namespace (xhtml:) mainly to accommodate discovery needs
   o  Reuse Sitemap format for Change Sets (both current and historical) and for manifest in Dump

Page 39: ResourceSync: Web-Based Resource Synchronization

Source Capabilities – Destination Needs

Page 40: ResourceSync: Web-Based Resource Synchronization

Source Capabilities – Destination Needs

Page 41: ResourceSync: Web-Based Resource Synchronization

Sitemap with Added Datetime

Page 42: ResourceSync: Web-Based Resource Synchronization

Change Types: Extend lastmod, Use expires

Page 43: ResourceSync: Web-Based Resource Synchronization

Sitemap with lastmod and expires

Page 44: ResourceSync: Web-Based Resource Synchronization

Sitemap Discovery via robots.txt
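
The slide image is not part of this transcript. As a hedged illustration of the general technique, the sketch below extracts `Sitemap:` entries from the text of a robots.txt file, which is the discovery mechanism of the standard Sitemap protocol. The `sitemap_urls` name and the example content are assumptions; how the ResourceSync draft itself uses this mechanism is not shown here.

```python
from typing import List


def sitemap_urls(robots_txt: str) -> List[str]:
    """Return the URLs given in 'Sitemap:' lines of a robots.txt document."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, separator, value = line.partition(":")
        # The field name is matched case-insensitively.
        if separator and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    example = (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "Sitemap: http://example.com/sitemap.xml\n"
    )
    print(sitemap_urls(example))
```

Only the first colon separates the field from its value, so the colon inside the URL scheme is preserved.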

Page 45: ResourceSync: Web-Based Resource Synchronization

Source Capabilities – Destination Needs

Page 46: ResourceSync: Web-Based Resource Synchronization

Change Set: An rs Typed Sitemap

Page 47: ResourceSync: Web-Based Resource Synchronization

More rs Extension Elements

Page 48: ResourceSync: Web-Based Resource Synchronization

Change Set with rs and xhtml Extensions

Page 49: ResourceSync: Web-Based Resource Synchronization

Change Set Discovery via Sitemap

Page 50: ResourceSync: Web-Based Resource Synchronization

Pushing Change Sets via XMPP PubSub

XMPP Publish-Subscribe: Client to Subscription Service, Subscription Service to Client(s) communication

•  One of the XMPP (Extensible Messaging and Presence Protocol) extensions: http://xmpp.org/extensions/xep-0060.html

•  Apple Notifications based on XMPP PubSub

•  Available tools, see http://xmpp.org/about-xmpp/technology-overview/pubsub/#impl-client
   o  XMPP Servers with PubSub support:
      -  ejabberd, OpenFire, Tigase, SleekXMPP
   o  XMPP libraries with PubSub support:
      -  Strophe (C, JavaScript), XMPP4R (Ruby), SleekXMPP (Python), PubSub Client (Python)

Page 51: ResourceSync: Web-Based Resource Synchronization

Pushing Change Sets via XMPP PubSub

Page 52: ResourceSync: Web-Based Resource Synchronization

Change Set via XMPP

Page 53: ResourceSync: Web-Based Resource Synchronization

Push Change Set Discovery via Sitemap

Page 54: ResourceSync: Web-Based Resource Synchronization

Source Capabilities – Destination Needs

Page 55: ResourceSync: Web-Based Resource Synchronization

Discovering a Historical Change Set via a Current Change Set

Page 56: ResourceSync: Web-Based Resource Synchronization

Source Capabilities – Destination Needs

Page 57: ResourceSync: Web-Based Resource Synchronization

Discovering Historical Content – Link to Version Resource

Page 58: ResourceSync: Web-Based Resource Synchronization

Memento Intermezzo

http://www.mementoweb.org/

Page 59: ResourceSync: Web-Based Resource Synchronization

Original Resources and Mementos

Page 60: ResourceSync: Web-Based Resource Synchronization

Bridge from Present to Past

Page 61: ResourceSync: Web-Based Resource Synchronization

Bridge from Past to Present

Page 62: ResourceSync: Web-Based Resource Synchronization

Memento Framework

Page 63: ResourceSync: Web-Based Resource Synchronization

Discovering Historical Content – Link to Memento TimeGate

Page 64: ResourceSync: Web-Based Resource Synchronization

Source Capabilities – Destination Needs

Page 65: ResourceSync: Web-Based Resource Synchronization

Dump

•  Two formats currently under discussion:
   o  Format based on ZIP:
      -  Package content
      -  Add manifest (manifest.xml) expressed in Sitemap format
      -  ZIP it up
   o  WARC files as used by the web archiving community
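
The three steps of the ZIP-based format can be sketched as below. This is an illustration under stated assumptions, not the draft's format: the `build_dump` name, the `resources/NNNNN` layout inside the archive, and above all the `RS_NS` namespace URI are placeholders invented here. The `rs:path` element name comes from a later slide, but its real namespace and exact usage are not given in this transcript.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Dict

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# ASSUMPTION: placeholder URI, NOT the real ResourceSync (rs:) namespace.
RS_NS = "http://example.org/rs-placeholder"


def build_dump(resources: Dict[str, bytes]) -> bytes:
    """Package content, add manifest.xml in Sitemap format, and ZIP it up."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("rs", RS_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, (uri, content) in enumerate(sorted(resources.items())):
            path = f"resources/{index:05d}"  # hypothetical layout inside the ZIP
            archive.writestr(path, content)  # step 1: package content
            url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = uri
            # Map the resource URI to its file path in the package.
            ET.SubElement(url, f"{{{RS_NS}}}path").text = path
        # Step 2: add the manifest; step 3 (zipping) completes when the archive closes.
        archive.writestr("manifest.xml", ET.tostring(urlset, encoding="utf-8"))
    return buffer.getvalue()
```

A destination would reverse the process: open the ZIP, read manifest.xml, and use each entry's path to associate the packaged file with its original URI.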

Page 66: ResourceSync: Web-Based Resource Synchronization

Mapping URI to File Path with rs:path

Page 67: ResourceSync: Web-Based Resource Synchronization

Manifest (manifest.xml) Expressed in Sitemap Format

Page 68: ResourceSync: Web-Based Resource Synchronization

Dump Discovery via Sitemap

Page 69: ResourceSync: Web-Based Resource Synchronization

Source Capabilities – Destination Needs

Page 70: ResourceSync: Web-Based Resource Synchronization

Alternate Location

Page 71: ResourceSync: Web-Based Resource Synchronization

Alternate Protocol, e.g. Obtain Changes Only

Page 72: ResourceSync: Web-Based Resource Synchronization

Timeline

•  August 2012
   o  First draft spec shared for feedback with ResourceSync team

•  September 2012
   o  In-person meeting of ResourceSync Team
   o  Revise spec, conduct experiments
   o  Solicit broad feedback
   o  Paper in D-Lib Magazine

•  December 2012 – Finalize specification (?)

Page 73: ResourceSync: Web-Based Resource Synchronization

Pointers

•  First draft spec: http://www.openarchives.org/rs/0.1/resourcesync

•  Simulator code on github: http://github.org/resync/simulator

•  NISO workspace: http://www.niso.org/workrooms/resourcesync/

•  List for public comment coming soon
