Surviving the Deluge: Lessons from DOIs and Electronic Publishing at John Wiley & Sons
By Matthew Larson
Who I Am
• Developer, tech lead, and system architect at Wiley for the last 9 years
• Responsible for DOI registration and other systems that support electronic publishing activities
The Challenge
• A lot of content from a lot of places
• “A lot of content”: more than 1,000 new DOIs per day from journals, books and major reference works
• “A lot of places”: Offices in New Jersey, Boston, San Francisco, Oxford, Germany, Singapore, and elsewhere
• How to handle DOI registration for all this content?
Surviving the Deluge: Some Principles
• Here are five system design principles we’ve used for handling the constant stream of content
• We’ll look at how we employ them in our CrossRef registration application, XIRS (eXternal Identifier Registration System)
– XIRS was built and launched in 2008 to help handle the combined Wiley/Blackwell content load
– XIRS receives CrossRef XML, submits it to CrossRef, and tracks the CrossRef responses
– Written in Java (1.6) and deployed on Tomcat 6
XIRS Data Flows
Principle 1: Centralization
• Content from multiple sources must go through a single system to be manageable
– It might be tempting to offer a client library instead, but that invites trouble
• XIRS handles all Wiley DOI registrations
– All error handling, reporting and support in one place
• Any system that needs to store or process content must be on the critical path for publication
– Otherwise you will always be chasing synchronization
Principle 2: Transparency
• Make the system as transparent as possible
– It’s easy to miss errors when batch processing this much content
• How?
– Always make real-time system views available
• Provides quick assurance that everything is alright and shows when it isn’t
– Email daily error reports with the most critical data in the subject line
– Show as much information as possible to make support easier
– Provide easy-to-use reporting so anyone can query the system
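One way to read "the most critical data in the subject line" is that the failure count should be visible without opening the email. The following is a minimal sketch in Java under that reading. The class name, the `Submission` type, and the subject format are illustrative assumptions, not the actual XIRS report.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: names and subject format are assumptions,
// not the real XIRS daily report.
final class DailyErrorReport {

    /** Outcome of one submitted file as the report sees it (illustrative). */
    static final class Submission {
        final String fileName;
        final boolean failed;

        Submission(String fileName, boolean failed) {
            this.fileName = fileName;
            this.failed = failed;
        }
    }

    /** Builds a subject line that carries the failure count up front. */
    static String subjectFor(String date, List<Submission> submissions) {
        int failures = 0;
        for (Submission s : submissions) {
            if (s.failed) {
                failures++;
            }
        }
        String status = failures == 0 ? "OK" : "ERRORS";
        return String.format("[DOI registration %s] %s: %d failed of %d submitted",
                date, status, failures, submissions.size());
    }

    public static void main(String[] args) {
        // Sample data only; file names and date are made up.
        List<Submission> today = Arrays.asList(
                new Submission("journal-batch-1.xml", false),
                new Submission("book-batch-7.xml", true));
        System.out.println(subjectFor("2012-05-01", today));
    }
}
```

With a subject like this, a quiet day and a bad day can be told apart from the inbox listing alone.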
Real-time System Views
Notifications
As Much Info As Possible
Simple Reporting
Principle 3: Heal Thyself
• Systems that batch process this much content must be self-healing
• Lowers support load and increases accuracy
• One good method: queue everything
– Assume that nothing will work on the first try
– Assume that networks, disks, and external services will fail
– The system that sends data to XIRS queues that data; if XIRS is down, the data flow will start as soon as XIRS is available
– XIRS itself queues the data it receives. If CrossRef is down, it will keep retrying until the submission succeeds
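The "queue everything" idea can be sketched as a queue whose items leave only after a confirmed delivery. This is a minimal in-memory illustration in Java. `RetryQueue` and `Sender` are hypothetical names, and a production system would presumably persist the queue so it survives restarts, which this sketch does not attempt.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of "queue everything"; not XIRS code.
final class RetryQueue {

    /** Anything that can fail: a network call, a disk write, an external service. */
    interface Sender {
        boolean send(String payload) throws Exception;
    }

    private final Deque<String> pending = new ArrayDeque<>();
    private final Sender sender;

    RetryQueue(Sender sender) {
        this.sender = sender;
    }

    /** Accepting work never depends on the downstream system being up. */
    void enqueue(String payload) {
        pending.addLast(payload);
    }

    /**
     * Tries to deliver queued items in order. Stops at the first failure and
     * leaves that item at the head of the queue for the next pass.
     * Returns the number of items delivered in this pass.
     */
    int drain() {
        int delivered = 0;
        while (!pending.isEmpty()) {
            String next = pending.peekFirst();
            boolean ok;
            try {
                ok = sender.send(next);
            } catch (Exception e) {
                ok = false; // treat any exception as "try again later"
            }
            if (!ok) {
                break;
            }
            pending.removeFirst();
            delivered++;
        }
        return delivered;
    }

    int size() {
        return pending.size();
    }
}
```

A scheduler would call `drain()` periodically. While the downstream service is down, each pass delivers nothing and loses nothing; once it comes back, the backlog flows in its original order.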
Principle 3: Heal Thyself
• Another good technique: belt and braces (have backup methods)
– We process the CrossRef emails coming back
– But if we get no CrossRef response (for whatever reason), we call CrossRef directly to get the response
– Lets us avoid extra support work if there are email problems; the system will find those responses itself
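The belt-and-braces pattern can be sketched as a lookup that consults a primary channel and falls back to a second one when the first has nothing. In this Java sketch, `ResponseSource` is a hypothetical interface standing in for "parsed notification emails" and "a direct query". It does not model any real CrossRef API.

```java
// Hypothetical sketch: ResponseSource stands in for the email channel and
// the direct query; neither is a real CrossRef interface.
final class ResponseLookup {

    /** A place a submission result might be found; returns null if not found. */
    interface ResponseSource {
        String find(String submissionId);
    }

    private final ResponseSource primary;  // e.g. parsed notification emails
    private final ResponseSource fallback; // e.g. a direct status query

    ResponseLookup(ResponseSource primary, ResponseSource fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    /** Uses the primary source, falling back when it has nothing or throws. */
    String resultFor(String submissionId) {
        String result = null;
        try {
            result = primary.find(submissionId);
        } catch (RuntimeException e) {
            // Primary channel is broken; fall through to the backup.
        }
        if (result == null) {
            result = fallback.find(submissionId);
        }
        return result;
    }
}
```

The caller never needs to know which channel produced the answer, which is what keeps a mail outage from turning into a support ticket.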
Principle 4: Trust No One
• Assume that bad data will show up sooner rather than later
• QA everything
– We parse the CrossRef XML before it is sent to XIRS
– And then XIRS parses it again and rejects it if it doesn’t parse
• Proactively check and fix data
– We insert the timestamp in XIRS so it’s always the latest
– We shorten fields to fit within the size limits
– We check every file for multiple prefixes in the same file and split the file up automatically
– All XSLT must check for string-length(.) > 0 before inserting any elements to avoid empty elements
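Two of the checks above, rejecting XML that does not parse and splitting a file that mixes DOI prefixes, can be sketched in Java as follows. `DepositChecks` and its method names are illustrative assumptions. The parse check tests well-formedness only and does not validate against the CrossRef schema, and the prefix is taken as the part of a DOI before the first "/".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; not the actual XIRS checks.
final class DepositChecks {

    /** True if the text parses as well-formed XML (no schema validation). */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    /** Groups DOIs by prefix, preserving the order in which prefixes appear. */
    static Map<String, List<String>> splitByPrefix(List<String> dois) {
        Map<String, List<String>> byPrefix = new LinkedHashMap<>();
        for (String doi : dois) {
            int slash = doi.indexOf('/');
            if (slash <= 0) {
                throw new IllegalArgumentException("Not a DOI: " + doi);
            }
            String prefix = doi.substring(0, slash);
            List<String> group = byPrefix.get(prefix);
            if (group == null) {
                group = new ArrayList<>();
                byPrefix.put(prefix, group);
            }
            group.add(doi);
        }
        return byPrefix;
    }
}
```

Each group returned by `splitByPrefix` would then become its own deposit file, so no single file carries more than one prefix.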
Principle 4: Trust No One
• Assume files will be too big
– Some reference articles can have 500+ citations
– We limit by DOI count and size on disk
• Assume files will be too small
– Files can come very quickly with the same container-level (journal) DOI
– Causes database contention
– We have to remove container-level DOIs that have already been registered to avoid this
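Limiting by both DOI count and size on disk can be sketched as a splitter that closes a batch as soon as either limit would be exceeded. This Java sketch treats each record as a string and measures size as UTF-8 bytes. `Batcher`, the record representation, and the limit parameters are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; limits and record representation are assumptions.
final class Batcher {

    /**
     * Splits records into batches of at most maxRecords items and at most
     * maxBytes of UTF-8 text. A single record larger than maxBytes is placed
     * in a batch of its own rather than dropped.
     */
    static List<List<String>> split(List<String> records, int maxRecords, long maxBytes) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String record : records) {
            long size = record.getBytes(StandardCharsets.UTF_8).length;
            boolean full = current.size() >= maxRecords
                    || currentBytes + size > maxBytes;
            if (full && !current.isEmpty()) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(record);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

Putting an oversized record in its own batch keeps the splitter from losing data; whether such a batch should then be rejected or submitted is a separate policy decision.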
Principle 5: Synchronicity
• All system communications should be synchronous wherever possible
– Delayed/asynchronous responses will be lost when batch processing large amounts of data and support will be much tougher
– Long waits and timeout values are acceptable to ensure that the calling system knows whether the call was successful or not
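A synchronous call with generous timeouts can be sketched with the JDK's `HttpURLConnection`: the caller blocks until the server answers or a timeout expires, and gets a definite yes or no either way. The class name, the timeout values, and the choice of HTTP POST are assumptions for illustration, not details of how XIRS talks to other systems.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch; timeout values and transport are assumptions.
final class SyncCall {

    static final int CONNECT_TIMEOUT_MS = 60_000;   // one minute to connect
    static final int READ_TIMEOUT_MS = 10 * 60_000; // ten minutes for an answer

    /** Configures a connection to block for a long time; does not connect yet. */
    static HttpURLConnection prepare(URL endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        conn.setReadTimeout(READ_TIMEOUT_MS);
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        return conn;
    }

    /**
     * Sends the body and blocks until the server answers or a timeout expires.
     * Returns true only for a 2xx response; any I/O failure, including a
     * timeout, is reported as false so the caller can requeue the work.
     */
    static boolean post(URL endpoint, byte[] body) {
        try {
            HttpURLConnection conn = prepare(endpoint);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body);
            }
            int status = conn.getResponseCode();
            conn.disconnect();
            return status >= 200 && status < 300;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Because `post` always returns a definite answer, it pairs naturally with a retry queue: a `false` simply means the item stays queued for the next attempt.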