NoSQL technologies from an STM publishing perspective
Bradley P. Allen, Elsevier Labs
Presentation at NoSQL Now 2011
San Jose, CA, USA
2011-08-25
Peak physical media: is it here?
• “Music Sales”, New York Times, 1 August 2009. http://www.nytimes.com/imagepages/2009/08/01/opinion/01blow.ready.html
• “Initial Circs per student”, William Denton, 31 January 2011. http://www.miskatonic.org/2011/01/31/initial-circs-student
• “Rise of e-book Readers to Result in Decline of Book Publishing Business”, Steven Mather, iSuppli, 28 April 2011. http://www.isuppli.com/Home-and-Consumer-Electronics/News/Pages/Rise-of-e-book-Readers-to-Result-in-Decline-of-Book-Publishing-Business.aspx
In any case, the challenge to STM publishers is clear
• Print revenue is softening
• Online channels are exploding
– Changing the way customers create and consume our content
– Leading to new requirements and market opportunities for online products
Additional challenges in STM publishing
• Academic context and tradition inhibit business model innovation
• Technology and business are traditionally separate concerns
• Acquisitions create content and data silos
• Global market drives lowest-common-denominator technology choices
A simple model of the evolution of STM publishing
Print era: 1600s – 1980
• Packaged as books and journals
• Physically distributed
• Access and discovery through libraries
Digital Library era: 1980 – 2010s
• Packaged as books and journals
• Digitally distributed
• Access and discovery through search engines
Platform-as-a-service era: 2010s
• Packaged as apps
• Digitally distributed
• Access and discovery through social networks
STM publishing use cases in transition
• Use case: A new medical term relevant to an emerging healthcare issue (e.g. a new type of avian flu virus) needs to be incorporated into a search index immediately
– Digital Library era: Organizational governance issues about how taxonomies are updated, coupled with manually intensive workflows and ad hoc approaches to content tagging, inhibit rapid response
– Platform-as-a-service era: A single, automated and standardized taxonomy management and content enhancement workflow allows rapid and timely update of search applications
• Use case: Application developers want to mash up epidemiological data with medical journal articles to create a topic-specific Web resource
– Digital Library era: Data silos without easy means of programmatic access by developers, coupled with governance and business model questions, inhibit data reuse
– Platform-as-a-service era: Content API and single-point-of-access repository allow data and content to be accessed, discovered and reused across multiple applications
• Use case: Digital library developers want to stage content into a single repository for unified search index generation
– Digital Library era: Duplication of core content leads to synchronization and quality control issues
– Platform-as-a-service era: Consolidation of duplicate repositories into a single point of truth across all content, accessible and discoverable through a Content API, eliminates the need for duplication and synchronization
• Use case: Third party solutions providers want to integrate content (e.g. tagged medical journal articles, medical taxonomies) into point-of-care solutions
– Digital Library era: No standards, no APIs for point-of-care content integration across all content and data
– Platform-as-a-service era: Standards and APIs that scale across multiple partners, for all content types, for all delivery formats
• Use case: Publishers want to deliver their content to tablets and e-readers in delivery formats that take advantage of the displays and interaction modalities on those devices
– Digital Library era: No clear standard or approach for targeting emerging eReader and tablet devices; multiple and divergent approaches lead to siloed solutions and duplication of effort
– Platform-as-a-service era: Web and industry standards for eReader and tablet devices supported as part of standard automated processing into delivery channel-specific formats, regularly updated and exposed through a Content API
• Use case: Journal publisher wants to integrate content enhancements across multiple subject matter areas to add value to products leveraging Article of the Future technology
– Digital Library era: No single point of access to content enhancements, no standards for content enhancement suppliers and partners to deliver enhancements for integration
– Platform-as-a-service era: Easy access to multiple opportunities for content enhancements embedded in standard next-generation article formats and provided using standard content enhancement formats
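The first use case above (a new medical term that must become searchable immediately) can be sketched as a tag index that re-tags existing content whenever a taxonomy term is added. This is a minimal illustration under stated assumptions, not a description of any Elsevier system: the `TaggedIndex` class, its method names, the crude substring matching and the placeholder term "HxNy" are all hypothetical.

```python
class TaggedIndex:
    """Hypothetical tag index: taxonomy term -> ids of documents mentioning it."""

    def __init__(self):
        self.docs = {}      # doc_id -> text
        self.terms = set()  # known taxonomy terms, lower-cased
        self.index = {}     # term -> set of doc_ids

    def add_document(self, doc_id, text):
        """Store a document and tag it against every known term."""
        self.docs[doc_id] = text
        for term in self.terms:
            self._tag(term, doc_id, text)

    def add_term(self, term):
        """Register a new taxonomy term and tag existing content right away."""
        term = term.lower()
        self.terms.add(term)
        self.index.setdefault(term, set())
        for doc_id, text in self.docs.items():
            self._tag(term, doc_id, text)

    def _tag(self, term, doc_id, text):
        # Assumption: case-insensitive substring match stands in for real tagging.
        if term in text.lower():
            self.index.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self.index.get(term.lower(), set()))


idx = TaggedIndex()
idx.add_document("a1", "Early reports describe a novel HxNy avian flu virus.")
before = idx.search("HxNy")   # term not in the taxonomy yet
idx.add_term("HxNy")          # one automated step, no manual re-tagging
after = idx.search("HxNy")
idx.add_document("a2", "HxNy surveillance dataset")
```

The point of the sketch is that the taxonomy update and the index update are a single automated operation, rather than separate manual workflows.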
Facets of STM publishing processes
[Figure: facets of STM publishing processes, grouped here by the figure's legend categories]
• Process Type: Acquisition, Transformation, Access and discovery, Enhancement, Composition, Delivery
• Activity: submitting, crawling, syndicating, formatting, mapping, cleansing, indexing, querying, updating, storing, annotating, subject tagging, classification, entity recognition, entity extraction, fact extraction, clustering, aggregating, ordering, summarizing, filtering, analysis, rendering, design, publishing, accessing, retrieving, deleting
• Entity: author, supplier, Web site, typesetter, automated process, subject matter expert, search engine, content repository, entity registry, product catalog, editor, reviewer, user, designer, developer, e-book, mobile app, mobile-enhanced Web site, API
• Content Type: article, book, media object, entity record, taxonomy, ontology, user-generated content
Emerging content requirements
• Broad range of content types
– Must treat as first-class objects video, audio, images, datasets, metadata and knowledge organization systems in addition to articles and books
• Standards-based
– Web-standard formats to support ease of integration and interoperability
• Fine-grained
– Must be decomposable into and addressable in fragments smaller than the unit of publication; e.g., down to the level of specific words, phrases, images, table cells in articles or book chapters, key frames and segments in videos
• Discoverable
– Must be easily located across all levels of granularity
• Accessible
– Must be easily accessed through content creation, retrieval, update and deletion (CRUD) services
• Flexible
– New content types and associated schemas must be easily added through configuration
• Reusable
– It must be efficient for product developers to aggregate and compose content fragments into new products
• Modifiable
– Support the enhancement and correction of content at any time following creation
• Broad range of delivery formats
– Content standards and services must support fulfillment, delivery and presentation across desktop, notebook, tablet and mobile computing devices
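Several of these requirements (fine-grained addressing, CRUD access, new content types without schema changes) can be illustrated together. The following is a minimal sketch, assuming an in-memory dictionary as a stand-in for a content repository; the `ContentStore` class, the path-tuple addressing scheme and the sample documents are hypothetical and not taken from the presentation.

```python
from copy import deepcopy


class ContentStore:
    """Hypothetical in-memory stand-in for a schemaless content repository."""

    def __init__(self):
        self._docs = {}

    def create(self, doc_id, doc):
        if doc_id in self._docs:
            raise KeyError(f"{doc_id} already exists")
        self._docs[doc_id] = deepcopy(doc)

    def retrieve(self, doc_id, path=()):
        """Return a whole document, or the fragment addressed by `path`."""
        node = self._docs[doc_id]
        for step in path:  # each step is a dict key or a list index
            node = node[step]
        return deepcopy(node)

    def update(self, doc_id, path, value):
        """Replace the fragment at a non-empty `path`."""
        node = self._docs[doc_id]
        for step in path[:-1]:
            node = node[step]
        node[path[-1]] = deepcopy(value)

    def delete(self, doc_id):
        del self._docs[doc_id]


store = ContentStore()
store.create("article-1", {
    "type": "article",
    "title": "An example article",
    "sections": [
        {"heading": "Results",
         "table": [["dose", "response"], ["10 mg", "0.42"]]},
    ],
})
# A different content type is added without declaring a schema first.
store.create("video-1", {
    "type": "video",
    "key_frames": [{"t": 12.5, "caption": "Assay setup"}],
})

# Address a single table cell, then correct it after publication.
cell_path = ("sections", 0, "table", 1, 1)
cell = store.retrieve("article-1", cell_path)
store.update("article-1", cell_path, "0.43")
```

Addressing fragments by path keeps the unit of retrieval independent of the unit of publication, which is what the "Fine-grained" and "Modifiable" bullets ask for.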
Emerging content architecture
[Figure: content architecture diagram. Labels: Linked data; Acquire; Transform, Enhance, Compose; Deliver; Document; Entity record; Media object; Relational metadata]
Content acquisition and transformation
Content enhancement and analytics
Content composition and delivery
Why NoSQL is important to STM publishing
• NoSQL emphasizes design choices that focus on delivering robust, scalable Web applications
– Document-centric
– Schemaless
– Support for analytics
– Read/write at Web scale
– Move scale-out from development to operations
• As we shift to the platform-as-a-service era, these features become an important part of the STM publishing technology stack
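One way to read "move scale-out from development to operations" is that application code asks a routing layer where a document lives, while the list of storage nodes is an operational setting. The sketch below illustrates that idea with consistent hashing, a technique many distributed stores use in some form; the `HashRing` class, the node names and the replica count are assumptions for illustration only.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent-hash ring mapping document ids to storage nodes."""

    def __init__(self, nodes, replicas=64):
        # Each node is placed at `replicas` points on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, doc_id):
        """Return the node owning `doc_id`: the first ring point after its hash."""
        idx = bisect.bisect(self._points, self._hash(doc_id)) % len(self._ring)
        return self._ring[idx][1]


# Application code only ever calls node_for(); the node list is configuration.
ring_before = HashRing(["node-a", "node-b", "node-c"])
ring_after = HashRing(["node-a", "node-b", "node-c", "node-d"])

doc_ids = [f"doc-{i}" for i in range(1000)]
moved = [d for d in doc_ids if ring_before.node_for(d) != ring_after.node_for(d)]
```

With this scheme, adding a node relocates only the documents that the new node takes over, so capacity can be changed without touching application code.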
How NoSQL addresses STM publishing’s needs
• Schemaless, document-centric stores
– Ease repository extension to accommodate expanding range of new, finer-grained content types
– Fit HTML5/JS/CSS content stack providing web-based alternatives to native apps
– Expedite application stack refresh in support of authoring and editorial workflow portals and tools
• Support for analytics eases innovation in scientometrics
• Read/write at Web scale accommodates solutions incorporating content at more dynamic, fine-grained scale
– Entity records
– Annotations
– Other forms of community-contributed content
– Linked data integration of heterogeneous information resources across the Web for mashups/solutions
• Moving scale-out from development to operations reduces time-to-market, cost of failure for emerging, niche publishing opportunities
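As an illustration of how analytics over schemaless documents might support scientometrics, the sketch below computes a per-author h-index in a map/reduce style. The record layout (`authors`, `cited_by`), the sample data and the function names are hypothetical; the h-index itself is the standard metric (the largest h such that an author has h papers with at least h citations each).

```python
from collections import defaultdict

# Hypothetical article records; fields may be missing, as in a schemaless store.
articles = [
    {"id": "a1", "authors": ["Lee", "Gupta"], "cited_by": 10},
    {"id": "a2", "authors": ["Lee"], "cited_by": 3},
    {"id": "a3", "authors": ["Lee"], "cited_by": 1},
    {"id": "a4", "authors": ["Gupta"]},  # no citation count recorded yet
]


def map_article(doc):
    """Map step: emit one (author, citation count) pair per author."""
    for author in doc.get("authors", []):
        yield author, doc.get("cited_by", 0)


def reduce_h_index(citation_counts):
    """Reduce step: largest h such that h papers have at least h citations."""
    h = 0
    for rank, count in enumerate(sorted(citation_counts, reverse=True), start=1):
        if count >= rank:
            h = rank
    return h


def h_index_by_author(docs):
    grouped = defaultdict(list)
    for doc in docs:
        for author, count in map_article(doc):
            grouped[author].append(count)
    return {author: reduce_h_index(counts) for author, counts in grouped.items()}


result = h_index_by_author(articles)
```

Because the map step tolerates missing fields, new record shapes can be added to the store without breaking the analysis.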
Where STM publishing can drive NoSQL requirements
• Integrated support for search
– Free text retrieval
– Faceted navigation
• Query language functionality
– Nearest-neighbor matching
– Joins vs. join-free
• Primitives/support for analytics design patterns
– Clustering
– Classification
– Entity resolution
• Primitives/support for semantic enhancement
– Linked data
– Language processing
• Versioning for document stores
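To make the "integrated support for search" item concrete, here is a minimal sketch of free text retrieval combined with faceted navigation over a small set of documents. It assumes naive whole-word matching on titles and exact-match facet filters; the `search` function, field names and sample records are hypothetical and far simpler than what a production search engine provides.

```python
from collections import Counter

# Hypothetical content records of mixed types.
docs = [
    {"id": 1, "title": "Avian flu transmission in poultry", "type": "article", "subject": "virology"},
    {"id": 2, "title": "Avian flu vaccine trial", "type": "article", "subject": "immunology"},
    {"id": 3, "title": "Avian flu outbreak map", "type": "dataset", "subject": "epidemiology"},
    {"id": 4, "title": "Cardiac imaging atlas", "type": "book", "subject": "cardiology"},
]


def search(docs, text, filters=None, facet_fields=("type", "subject")):
    """Return (hits, facets): documents matching every query word and every
    facet filter, plus per-field value counts computed over those hits."""
    terms = text.lower().split()
    filters = filters or {}
    hits = [
        d for d in docs
        if all(t in d.get("title", "").lower().split() for t in terms)
        and all(d.get(field) == value for field, value in filters.items())
    ]
    facets = {
        field: Counter(d[field] for d in hits if field in d)
        for field in facet_fields
    }
    return hits, facets


# Free text query, then drill down by selecting a facet value.
hits, facets = search(docs, "avian flu")
article_hits, article_facets = search(docs, "avian flu", filters={"type": "article"})
```

The facet counts returned with each result set are what a user interface would render as clickable refinements.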
Elsevier applications of NoSQL technologies
• Entity registries
• Metadata repositories
• Big data analytics
• User-built apps
Linked Data Repository
SciVal
SciVerse
Conclusions
• STM publishing is in transition
• This is driving new requirements for content
• Many of these requirements are well met by NoSQL solutions
• Some requirements point to areas of future work for NoSQL technologists and vendors