Reconceiving the Web as a Distributed (NoSQL) Data System
Daniel Austin
PayPal, Inc.
NoSQL Now! Conference
August 22, 2013
V1.2
The Big Idea
“The World-Wide Web is the World’s Largest NoSQL Distributed Data System”
The Mind Map
History
• DNS (1983): The first large-scale DDS, using flat files
• WWW (1989): “a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help”
(Berners-Lee & Cailliau, 1989)
But Why NoSQL?
WWWDB: Anatomy
WWW
HTML (Presentation)
URI (Addressing)
HTTP (Transport)
Typology of Hyperlink Queries
• Hypertext links come in two flavors: transitive and intransitive
• Transitive queries are usually for inactive content – presentation material to supplement the user’s queried data
• Intransitive queries are user-actuated and usually provide navigation and business logic for the query
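The two link flavors can be illustrated with a small sketch. It assumes (this mapping is not stated on the slide) that targets a browser fetches automatically, such as img, script and stylesheet links, count as transitive, and that targets a user must activate, such as anchors and forms, count as intransitive. The names LinkClassifier and classify_links and the sample page are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: (tag, attribute) pairs fetched automatically by the browser
# are "transitive"; pairs that need a user action are "intransitive".
TRANSITIVE = {("img", "src"), ("script", "src"), ("link", "href")}
INTRANSITIVE = {("a", "href"), ("form", "action")}


class LinkClassifier(HTMLParser):
    """Collects link targets from an HTML document, split by link flavor."""

    def __init__(self):
        super().__init__()
        self.transitive = []
        self.intransitive = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if (tag, name) in TRANSITIVE:
                self.transitive.append(value)
            elif (tag, name) in INTRANSITIVE:
                self.intransitive.append(value)


def classify_links(html):
    """Return (transitive_targets, intransitive_targets) for a page."""
    parser = LinkClassifier()
    parser.feed(html)
    return parser.transitive, parser.intransitive


page = """
<html><head><link rel="stylesheet" href="/static/site.css"></head>
<body><img src="/static/logo.png">
<a href="/account">My account</a>
<form action="/search"><input name="q"></form>
</body></html>
"""
transitive, intransitive = classify_links(page)
print("transitive:", transitive)
print("intransitive:", intransitive)
```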
Data Clients Query Data Sources
What Do HTTP URIs Identify?
• Not a single resource
• WWWDB query syntax is split between HTTP ‘verbs’ (POST, GET, PUT, DELETE) and their objects, addressed by URIs
• URI encapsulates a resource as the object identified by a query
(Note that transitive and intransitive hyperlinks almost always go to different locations)
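As a rough sketch of the verb/object split, the following treats an HTTP verb as the operation and the URI path as the key of the object it acts on. The in-memory store, the execute function and the example.com URIs are illustrative assumptions, and the verb semantics are simplified rather than a faithful HTTP implementation.

```python
from urllib.parse import urlparse

store = {}  # in-memory stand-in for a data source, keyed by URI path


def execute(verb, uri, body=None):
    """Apply an HTTP-style verb to the object addressed by a URI."""
    key = urlparse(uri).path
    if verb == "GET":
        return store.get(key)
    if verb == "PUT":
        # Create or replace the object stored at this key.
        store[key] = body
        return body
    if verb == "POST":
        # Append a new member to the collection stored at this key.
        store.setdefault(key, []).append(body)
        return body
    if verb == "DELETE":
        return store.pop(key, None)
    raise ValueError(f"unsupported verb: {verb}")


execute("PUT", "http://example.com/users/1", {"name": "Ada"})
execute("POST", "http://example.com/events", {"type": "login"})
print(execute("GET", "http://example.com/users/1"))
```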
CDN as a Caching Mechanism
• CDNs such as Akamai and CloudFront provide local caching services for WWWDB, mostly for static, presentation-related objects
– Frequency-based caching for transitive hyperlinks
– Most secondary queries go to the CDN
– 95%+ of all the bytes transported over the Web
– ~90% of all WWWDB queries (HTTP requests/responses)
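Frequency-based caching can be sketched as a least-frequently-used cache in front of an origin. This is a simplified illustration, not how any particular CDN works; FrequencyCache and fetch_from_origin are hypothetical names, and the capacity is assumed to be at least 1.

```python
from collections import Counter


class FrequencyCache:
    """Keeps the most frequently requested objects, up to a fixed capacity."""

    def __init__(self, capacity, fetch_from_origin):
        self.capacity = capacity              # assumed to be >= 1
        self.fetch_from_origin = fetch_from_origin
        self.objects = {}                     # URI -> cached body
        self.hits = Counter()                 # URI -> request count while cached

    def get(self, uri):
        if uri in self.objects:
            self.hits[uri] += 1
            return self.objects[uri]
        body = self.fetch_from_origin(uri)
        if len(self.objects) >= self.capacity:
            # Evict the cached object with the fewest requests so far.
            coldest = min(self.objects, key=lambda u: self.hits[u])
            del self.objects[coldest]
            del self.hits[coldest]
        self.objects[uri] = body
        self.hits[uri] = 1
        return body


origin_calls = []


def fetch_from_origin(uri):
    """Hypothetical origin fetch that records every request it serves."""
    origin_calls.append(uri)
    return f"body of {uri}"


cache = FrequencyCache(capacity=2, fetch_from_origin=fetch_from_origin)
for uri in ["/a", "/a", "/b", "/c", "/a", "/b"]:
    cache.get(uri)
print(origin_calls)
```

Only requests that miss the cache reach the origin, so the frequently requested object stays cached while the rarely requested ones are evicted and refetched.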
APIs as Secondary Queries
• Active Subqueries
• Usually dynamic
• URIs function as a selection mechanism
• Often User-Actuated, Intransitive Events
• Query results often modify the display
REST as a Query Syntax Mechanism
• Common Semantics
– REST provides a means of specifying the proper query for an object in a specific state
• Demands NoSQL due to state constraints
• Uses query strings for ranged searches
Image courtesy IBM
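A ranged search expressed through a query string might look like the sketch below. The PRODUCTS collection, the min_price and max_price parameter names and the example.com URI are all assumptions made for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical product collection standing in for a resource.
PRODUCTS = [
    {"id": 1, "price": 5},
    {"id": 2, "price": 15},
    {"id": 3, "price": 25},
]


def ranged_search(uri):
    """Return products whose price falls within the URI's min/max bounds."""
    params = parse_qs(urlparse(uri).query)
    # A missing bound is treated as unbounded on that side.
    low = float(params.get("min_price", ["-inf"])[0])
    high = float(params.get("max_price", ["inf"])[0])
    return [p for p in PRODUCTS if low <= p["price"] <= high]


print(ranged_search("http://example.com/products?min_price=10&max_price=20"))
```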
Indexing WWWDB
• Google, Bing, Yahoo! and other ‘index searches’ on WWWDB
– Inconsistent results are accepted
• Query Cache or a Data Cache?
• Secondary Query Routing
• Alternative query indices – Wolfram Alpha, Index Mundi, Twitter act as ‘almanacs’
Does the CAP Theorem Apply?
Yes, It Does, But Only Partially
• Partition and Availability – 404s, DDoS
• WWWDB Relaxes the Consistency Constraint
• We accept inconsistent queries and broken links as a tradeoff for real-time availability and high-velocity updates
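One way to picture the relaxed consistency constraint is a client that returns its last known copy when the origin cannot be reached, trading freshness for availability. This is a minimal sketch under that assumption; StaleTolerantClient and the fetch callback are hypothetical names.

```python
class StaleTolerantClient:
    """Prefers an answer over a fresh answer when the origin is unreachable."""

    def __init__(self, fetch):
        self.fetch = fetch      # callable that may raise ConnectionError
        self.last_seen = {}     # URI -> most recent body successfully fetched

    def get(self, uri):
        try:
            body = self.fetch(uri)
        except ConnectionError:
            if uri in self.last_seen:
                # Possibly inconsistent with the origin, but available.
                return self.last_seen[uri], "stale"
            raise
        self.last_seen[uri] = body
        return body, "fresh"
```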
But We Can Do Better!
Drawbacks of the CAP Model
• Caching – All data is not cached everywhere
– Some sites are single-location/single-source
– Hard (static) assets are far more widely cached
• What does CAP mean when data is only partially distributed?
– Very little – consistency only applies to part of the queries
Improving WWWDB
• Better Data Clients
– HTML5 provides new query mechanisms via WebSockets, Web Storage, and other means
– Still mostly presentation-level improvements
• Better Caching, Distribution & Transport
– Work currently being done at IETF on HTTP 2.0
• Better Queries
– Very little work being done – more on this later!
RDF and the Semantic Web
• Changes query patterns but not storage
– Queries based on semantic ID of resource
• Requires content to be semantically labeled
• Work on SPARQL reduces query limitations
– But may also make things slower (!)
• Cloud computing and query distribution will prove a more powerful force for improving WWWDB than semantic queries
Browsers as Data Clients
• Presentation First!
– Data is treated as secondary
• Designed for Browsing Not Querying
– Query patterns are inefficient
– Semi-stateful nature of Web sessions
• Bedeviled with Legacy Issues
Optimizing Web Queries
• REST doesn’t imply FAST
– Use a domain model to limit query endpoints
– May require unnecessary requests
• Query-string semantics allow for joins and arbitrary comparisons
• Recognize that some queries require state and use it
• Distribute intransitive queries more widely
Reforming Hypertext for Querying WWWDB
• Enlarge the number of link types
• Distinguish transitive links
• Add bidirectional linking
• Enhance the semantics of the query string
• Make hypertext more useful for mobile and other devices
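Typed, bidirectional linking could be sketched as an index that records every link in both directions, so that "what links here?" becomes an answerable query. LinkIndex, its method names and the link types shown are hypothetical and not part of any existing hypertext standard.

```python
from collections import defaultdict


class LinkIndex:
    """Records each typed link in both directions so backlinks can be queried."""

    def __init__(self):
        self.outgoing = defaultdict(set)   # source URI -> {(target, link_type)}
        self.incoming = defaultdict(set)   # target URI -> {(source, link_type)}

    def add_link(self, source, target, link_type="related"):
        self.outgoing[source].add((target, link_type))
        self.incoming[target].add((source, link_type))

    def links_from(self, uri):
        return sorted(self.outgoing.get(uri, ()))

    def links_to(self, uri):
        return sorted(self.incoming.get(uri, ()))


index = LinkIndex()
index.add_link("/articles/1", "/articles/2", "cites")
index.add_link("/articles/3", "/articles/2", "embeds")
print(index.links_to("/articles/2"))
```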
IPv6 and Query Routing for WWWDB
• The IPv6 space is large enough to allow for multiple query addressing schemes:
– Semantic addressing of objects by type
– Objects in the Internet of Things
– Dynamic, context-driven addressing
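To give a feel for semantic addressing by type, the sketch below packs an object-type code and an object identifier into an IPv6 address. The layout is purely an assumption for illustration: the documentation prefix 2001:db8::/32, a 16-bit type code, and a 64-bit object ID. It is not a proposed or standardized scheme.

```python
import ipaddress

# Assumptions for illustration only: the documentation prefix 2001:db8::/32,
# a 16-bit object-type code, and a 64-bit object identifier.
PREFIX = ipaddress.IPv6Network("2001:db8::/32")
TYPE_SHIFT = 64
TYPE_MASK = (1 << 16) - 1
ID_MASK = (1 << 64) - 1


def semantic_address(type_code, object_id):
    """Build an address whose bits encode the object's type and identifier."""
    if not 0 <= type_code <= TYPE_MASK:
        raise ValueError("type_code must fit in 16 bits")
    if not 0 <= object_id <= ID_MASK:
        raise ValueError("object_id must fit in 64 bits")
    value = int(PREFIX.network_address) | (type_code << TYPE_SHIFT) | object_id
    return ipaddress.IPv6Address(value)


def decode(address):
    """Recover (type_code, object_id) from an address built above."""
    value = int(address)
    return (value >> TYPE_SHIFT) & TYPE_MASK, value & ID_MASK


address = semantic_address(type_code=1, object_id=42)
print(address, decode(address))
```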
Scaling the WWWDB
• This may require expanding our notions of URIs and links (queries)
• Semantic mapping of resources requires additional complexity for queries
• Explicit state management for efficiency
Every system has a scaling limit
Final Thoughts
• The Web is the largest NoSQL Distributed Data System
– URIs address the result set of a NoSQL query
– Transitive and Intransitive hyperlinks
• We can add power and simplicity to our queries by carefully reforming the URI syntax and the current implementations of hypertext
• HTTP and HTML are undergoing significant evolution – now it’s time for URIs!
Reconceiving the Web as a Distributed Data System
Thank You!
@daniel_b_austin