ResourceSync: Leveraging Sitemaps for Resource Synchronization

30
ResourceSync: Leveraging Sitemaps for Resource Synchronization WWW 2013, Rio de Janeiro, May 17th Bernhard Haslhofer | University of Vienna Simeon Warner | Cornell University Carl Lagoze | University of Michigan Martin Klein, Robert Sanderson | Los Alamos National Labs Michael L. Nelson | Old Dominion University Herbert van de Sompel | Los Alamos National Labs http://www.openarchives.org/rs/

description

 

Transcript of ResourceSync: Leveraging Sitemaps for Resource Synchronization

Page 1: ResourceSync: Leveraging Sitemaps for Resource Synchronization

ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, Rio de Janeiro, May 17th

Bernhard Haslhofer | University of ViennaSimeon Warner | Cornell UniversityCarl Lagoze | University of MichiganMartin Klein, Robert Sanderson | Los Alamos National LabsMichael L. Nelson | Old Dominion UniversityHerbert van de Sompel | Los Alamos National Labs

http://www.openarchives.org/rs/

Page 2: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

ResourceSync

• What and Why?

• Synchronization Scenarios

• ResourceSync Basics

• Demos

• Status and Next Steps

2

Page 3: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

What?

• A framework for synchronizing Web resources from a Source to a Destination

3

Web

sync

$ resync http://example.com

Page 4: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Why?

• rsync: filesystem sync, but not Web

• OAI-PMH: metadata, but not resources

• Web-DAV: extends HTTP, requires server installation at source

• ...

4

… because lots of projects and services are doing synchronization but rely on ad-hoc solutions!

Page 5: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

ResourceSync

• What and Why?

• Synchronization Scenarios

• ResourceSync Basics

• Demos

• Status and Next Steps

5

Page 6: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

arxiv.org mirroring

• 2.4M resources (PDF, metadata, Latex src)

• ~800/day created or updated

• uses homebrew mirroring since 1994 (!)

• look for more general solution to support independent destinations

6

Page 7: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Wikipedia

• 1.4 updates / sec

• many dependent services reusing Wikipedia content (e.g., DBPedia, Freebase, etc.)

• harvest articles via OAI-PMH, retrieve changes via IRC, download dumps

7

Page 8: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

data.europeana.eu

• aggregates metadata from >200 data providers in Europe

• 10 largest providers contribute 80%

• >190 providers contribute 20%

8

Page 9: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Design Guidelines

• Sync small websites / repositories (few resources) but also large data collections (millions of resources)

• Support low change frequency (weeks / months) to high change frequency (seconds) sources

• Low adoption barrier!

9

Page 10: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

ResourceSync

• What and Why?

• Synchronization Scenarios

• ResourceSync Basics

• Demos

• Status and Next Steps

10

Page 11: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Resource List

11

Destination

Source

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> </url> <url> <loc>http://example.com/res2</loc> </url></urlset>

$ resync -b http://example.com

XML Sitemap

Page 12: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Resource List

12

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/> </url> <url> <loc>http://example.com/res2</loc> <lastmod>2013-01-02T14:00:00Z</lastmod> <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/> </url></urlset>

Source

Page 13: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Change List

13

Destination

Source

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="changelist" modified="2013-01-03T11:00:00Z"/> <url> <loc>http://example.com/res2</loc> <lastmod>2013-01-02T13:00:00Z</lastmod> <rs:md change="updated"/> </url> <url> <loc>http://example.com/res3</loc> <lastmod>2013-01-02T18:00:00Z</lastmod> <rs:md change="deleted"/> </url></urlset>

$ resync -b http://example.com$ resync -i http://example.com

XML Sitemap

Page 14: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Resource Dump

14

Source

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcedump" modified="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/resourcedump.zip</loc> <lastmod>2013-01-03T09:00:00Z</lastmod> </url></urlset>

XML Sitemap

Page 15: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Resource Dump

15

http://example.com/resourcedump.zip

|- manifest.xml|- resources

|- res1|- res2

Page 16: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Resource Dump Manifest

16

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcedump-manifest" modified="2013-01-03T09:00:00Z"/> <url> <loc>http://example.com/res1</loc> <lastmod>2013-01-03T03:00:00Z</lastmod> <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" path="/resources/res1"/> </url> <url> <loc>http://example.com/res2</loc> <lastmod>2013-01-03T04:00:00Z</lastmod> <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e" path="/resources/res2"/> </url></urlset>

manifest.xml (XML Sitemap)

Page 17: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Capability List

17

Destination

Source

<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:ln href="http://example.com/info-about-source.xml" rel="describedby" type="application/xml"/> <rs:md capability="capabilitylist" modified="2013-01-02T14:00:00Z"/> <url> <loc>http://example.com/dataset1/resourcelist.xml</loc> <rs:md capability="resourcelist"/> </url> <url> <loc>http://example.com/dataset1/resourcedump.xml</loc> <rs:md capability="resourcedump"/> </url> <url> <loc>http://example.com/dataset1/changelist.xml</loc> <rs:md capability="changelist"/> </url></urlset>

$ resync -x http://example.com

XML Sitemap

Page 18: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Large Resource Lists

18

<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:rs="http://www.openarchives.org/rs/terms/"> <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/> <sitemap> <loc>http://example.com/resourcelist-part2.xml</loc> <lastmod>2013-01-03T09:00:00Z</lastmod> </sitemap> <sitemap> <loc>http://example.com/resourcelist-part1.xml</loc> <lastmod>2013-01-03T09:00:00Z</lastmod> </sitemap></sitemapindex>

Source

Page 19: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Other Capabilities

Page 20: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

ResourceSync

• What and Why?

• Synchronization Scenarios

• ResourceSync Basics Walkthrough

• Demos

• Status and Next Steps

20

Page 21: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Available code

• ResourceSync client and library (Python)

• ResourceSync source simulator

21

http://github.com/resync

Page 22: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Install resync client/library

22

$ git clone git://github.com/resync/resync.git$ cd resync/$ python setup.py build$ sudo python setup.py install

$ sudo easy_install resync

$ sudo pip install resync

or

or

Page 23: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Install resync simulator

23

$ git clone git://github.com/resync/simulator.git$ cd simulator/$ chmod u+x simulate-source$ ./simulate-source

$ sudo easy_install tornado

Page 24: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Run client against simulator

24

$ resync -b http://localhost:8888

$ resync -i http://localhost:8888

Page 25: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

resync @ arxiv.org

25

resync -v --noauth http://resync.library.cornell.edu/arxiv-q-bio\=/tmp/qbio http://resync.library.cornell.edu/arxiv\=/tmp/arxiv

Page 26: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

resync @ en.wikipedia.org

26

Page 27: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

ResourceSync

• What and Why?

• Synchronization Scenarios

• ResourceSync Basics Walkthrough

• Demos

• Status and Next Steps

27

Page 28: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Status

• Beta spec (v.0.6) for public commenthttp://www.openarchives.org/rs/0.6/resourcesync

• Tool development started

• Separate documents for archiving and push deployments

28

Page 29: ResourceSync: Leveraging Sitemaps for Resource Synchronization

WWW 2013, May 17th

Next Steps

• Continue tool development & deployment

• Collect

• public comments on [email protected]

• implementation issues onhttps://github.com/resync/resync/issues

• Version 0.9 to be released in Summer 2013

• Version 1.0 in fall 2013 (NISO standard)

29