SDEC2011 NoSQL Data modelling
NoSQL Data Modeling: Concepts and Cases
Shashank Tiwari
blog: shanky.org | twitter: @tshanky
NoSQL?
NoSQL : Various Shapes and Sizes
• Document Databases
• Column-family Oriented Stores
• Key/value Data stores
• XML Databases
• Object Databases
• Graph Databases
Key Questions
• How do I model data for my application?
• How do I determine which one is right for me?
• Can I easily shift from one database to the other?
• Is there a standard way of storing, accessing, and querying data?
Agenda for this session
• Explore some of the main NoSQL products
• Understand how they are similar and different
• How best to use these products in the stack
Document Databases
• also GenieDB, SimpleDB
What is a document db?
• One that stores documents
• Popular options:
• MongoDB -- C++
• CouchDB -- Erlang
• Also Amazon’s SimpleDB
• ...what exactly is a document?
In the real world
• (Source: http://guide.couchdb.org/draft/why.html)
In terms of JSON
• {"name": "John Doe",
• "zip": 10001}
What about db schema?
• Schema-less
• Different documents could be stored in a single collection
Data types: MongoDB
• Essential JSON types:
• string
• integer
• boolean
• double
Data types: MongoDB (...cont)
• Additional JSON types
• null, array and object
• BSON types -- BSON is a binary-encoded serialization of JSON-like documents
• date, binary data, object id, regular expression and code
• (Reference: bsonspec.org)
A BSON example: object id
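The picture from this slide is not part of the transcript. As a rough illustration only, and assuming the documented ObjectId layout in which the first 4 of the 12 bytes hold a big-endian Unix timestamp in seconds, the creation time can be read back from the hex string. The helper name objectid_timestamp is hypothetical; the sample value is the _id that appears in the MongoDB read slides of this deck.

```python
from datetime import datetime, timezone


def objectid_timestamp(hex_id: str) -> datetime:
    """Return the creation time encoded in the first 4 bytes of an ObjectId.

    Hypothetical helper for illustration; it only inspects the hex string.
    """
    if len(hex_id) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    seconds = int(hex_id[:8], 16)  # big-endian seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = objectid_timestamp("4c97053abe67000000003857")
print(created.isoformat())
```

The remaining 8 bytes are not decoded here, since their meaning has varied between MongoDB versions.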
Data types: CouchDB
• Everything JSON
• Large objects: attachments
CRUD operations for documents
• Create
• Read
• Update
• Delete
MongoDB: Create Document
• use mydb
• w = {name: "John Doe", zip: 10001};
• db.location.save(w);
Create db and collection
• Lazily created
• Implicitly created
• use mydb
• db.collection.save(w)
MongoDB: Read Document
• db.location.find({zip: 10001});
• { "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe", "zip" : 10001 }
MongoDB: Read Document (...cont)
• db.location.find({name: "John Doe"});
• { "_id" : ObjectId("4c97053abe67000000003857"), "name" : "John Doe", "zip" : 10001 }
MongoDB: Update Document
• Atomic operations on single documents
• db.location.update( { name:"John Doe" }, { $set: { name: "Jane Doe" } } );
CouchDB: RESTful
• Supports REST verbs: GET, HEAD, PUT, POST, DELETE
• Supports Replication
• Supports the notion of attachments
• Could work in offline modes and supports small footprint profiles
Sorted Ordered Column-family Datastores
• Sorted
• Ordered
• Distributed
• Map
Essential schema
Multi-dimensional View
A Map/Hash View
• {
• "row_key_1" : {
• "name" : { "first_name" : "Jolly", "last_name" : "Goodfellow" },
• "location" : { "zip" : "94301" }
• }
• }
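The map/hash view can be sketched as a nested dictionary whose row keys are kept in sorted order, which is what makes range scans over adjacent rows cheap. This is a toy in-memory model, not HBase's API: the class SortedRowStore, its method names, and the extra row "row_key_2" with the value "Jane" are all assumptions made for illustration.

```python
import bisect


class SortedRowStore:
    """Toy model: row key -> column family -> column -> value."""

    def __init__(self):
        self._keys = []   # row keys, always kept sorted
        self._rows = {}   # row key -> {family: {column: value}}

    def put(self, row_key, family, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)  # preserve key order on insert
            self._rows[row_key] = {}
        self._rows[row_key].setdefault(family, {})[column] = value

    def get(self, row_key, family, column):
        return self._rows[row_key][family][column]

    def scan(self, start_key, stop_key):
        """Yield (row_key, row) for start_key <= row_key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


store = SortedRowStore()
store.put("row_key_2", "name", "first_name", "Jane")   # made-up second row
store.put("row_key_1", "name", "first_name", "Jolly")
store.put("row_key_1", "name", "last_name", "Goodfellow")
store.put("row_key_1", "location", "zip", "94301")

scanned = [key for key, _ in store.scan("row_key_1", "row_key_3")]
print(scanned)
```

Rows come back in key order regardless of insertion order, which is why row key design matters so much in this family of stores.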
Architectural View (HBase)
The Persistence Mechanism
Model Wrappers (The GAE Way)
• Python
• Model, Expando, PolyModel
• Java
• JDO, JPA
HBase Data Access
• Thrift + Avro
• Java API -- HTable, HBaseAdmin
• Hive (SQL like)
• MapReduce -- sink and/or source
Transactions
• Atomic row level
• GAE Entity Groups
Indexes
• Row ordered
• Secondary indexes
• GAE style multiple indexes
• thinking from output to query
Use cases
• Many of Google’s products
• Facebook Messaging
• StumbleUpon
• OpenTSDB
• Mahalo, Ning, Meetup, Twitter, Yahoo!
• Lily -- open source CMS built on HBase & Solr
Brewer’s CAP Theorem
• http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• http://theory.lcs.mit.edu/tds/papers/Gilbert/Brewer6.ps
Distributed Systems & Consistency (case: success)
Distributed Systems & Consistency (case: failure)
Binding by Transactions
Consistency Spectrum
Inconsistency Window
RWN Math
• R – Number of nodes that are read from.
• W – Number of nodes that are written to.
• N – Total number of nodes in the cluster.
• In general: R < N and W < N for higher availability
R + W > N
• Easy to determine consistent state
• R + W = 2N
• absolutely consistent, can provide ACID guarantees
• In all cases when R + W > N there is some overlap between read and write nodes.
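The overlap claim can be checked by brute force for small clusters. The sketch below enumerates every possible read set and write set and confirms that they always share at least one node exactly when R + W > N. The function name always_overlap is hypothetical, and nodes are modeled as plain integers.

```python
from itertools import combinations


def always_overlap(n: int, r: int, w: int) -> bool:
    """True if every read set of size r intersects every write set of size w."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)  # an empty intersection is falsy
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1), (5, 2, 3)]:
    print(f"N={n} R={r} W={w}: R+W>N is {r + w > n}, overlap is {always_overlap(n, r, w)}")
```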
R = 1, W = N
• suits workloads with more reads than writes
• W = N
• 1 node failure = no write can succeed, so the entire system is unavailable for writes
R = N, W = 1
• W = 1
• Chance of data inconsistency quite high
• R = N
• Read only possible when all nodes in the cluster are available
R = W = ceiling ((N + 1)/2)
Effective quorum for eventual consistency
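As a small worked example of the quorum formula, the sketch below computes R = W = ceiling((N + 1)/2) for a few cluster sizes and how many nodes can be down while a quorum can still be formed. The function names are hypothetical.

```python
import math


def quorum_size(n: int) -> int:
    """R = W = ceiling((N + 1) / 2), as on the slide."""
    return math.ceil((n + 1) / 2)


def tolerated_failures(n: int) -> int:
    """Nodes that may be down while the rest can still form a quorum."""
    return n - quorum_size(n)


for n in (3, 4, 5, 7):
    q = quorum_size(n)
    print(f"N={n}: R=W={q}, R+W>N is {2 * q > n}, tolerates {tolerated_failures(n)} down")
```

With this choice R + W always exceeds N, so read and write sets overlap, while neither reads nor writes need every node to be up.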
Eventual consistency variants
• Causal consistency -- if A writes and then informs B, B always sees the updated value
• Read-your-writes consistency -- after A writes a new value, A never sees the old one
• Session consistency -- read-your-writes consistency within a client session
• Monotonic read consistency -- once a process has seen a new value, it never sees an older one
• Monotonic write consistency -- writes by the same process are serialized
Dynamo Techniques
• Consistent Hashing (Incremental scalability)
• Vector clocks (high availability for writes)
• Sloppy quorum and hinted handoff (recover from temporary failure)
• Gossip-based membership protocol (periodic, pairwise, inter-process interactions; low reliability; random peer selection)
• Anti-entropy using Merkle trees
• (source: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf)
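Of the techniques listed, vector clocks are the easiest to show in a few lines. The sketch below is a minimal illustration and not Dynamo's implementation: a clock is a plain dict from node name to counter, and the node names and function names are assumptions. Two versions are concurrent, and so need reconciliation, when neither clock has seen everything the other has.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if version a has seen every event that version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    a_sees_b, b_sees_a = descends(a, b), descends(b, a)
    if a_sees_b and b_sees_a:
        return "equal"
    if a_sees_b:
        return "a supersedes b"
    if b_sees_a:
        return "b supersedes a"
    return "concurrent"


v1 = increment({}, "node_x")   # first write, handled by node_x
v2 = increment(v1, "node_y")   # later update of v1, handled by node_y
v3 = increment(v1, "node_z")   # independent update of v1, handled by node_z

print(compare(v2, v1))
print(compare(v2, v3))
```

Accepting both v2 and v3 instead of rejecting one of them is what keeps writes available; the conflict is resolved later, on read.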
Consistent Hashing
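The diagram from this slide is not in the transcript. As a stand-in, here is a minimal sketch of a hash ring with virtual nodes. The class name HashRing, the node names, the key format, the choice of SHA-256, and the figure of 64 virtual nodes per server are all assumptions for illustration. The point it demonstrates is incremental scalability: when a node joins, the only keys that change owner are the ones the new node takes over.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a point on the ring (SHA-256 is an arbitrary choice)."""
    return int(hashlib.sha256(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=64):
        self._vnodes = vnodes     # virtual nodes per physical node
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> physical node
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        for i in range(self._vnodes):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def lookup(self, key: str) -> str:
        """Walk clockwise from the key's position to the next virtual node."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing(["node_a", "node_b", "node_c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("node_d")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner after adding node_d")
```

With a naive hash(key) mod N scheme, by contrast, changing N would reassign most keys.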
CouchDB MVCC Style
• (Source: http://guide.couchdb.org/draft/consistency.html)
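The figure from this slide is not in the transcript. The core idea it illustrates can be sketched as optimistic, revision-checked updates: a writer must present the revision it read, and a write based on a stale revision is rejected instead of blocking or silently overwriting. This is a toy in-memory model, not CouchDB's API. The class and exception names are hypothetical, and revisions are simplified to integers, whereas CouchDB uses string _rev values and reports the conflict as an HTTP 409 response.

```python
class Conflict(Exception):
    """Raised when an update is based on a stale revision."""


class RevisionedStore:
    def __init__(self):
        self._docs = {}   # doc id -> (revision number, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return rev, dict(body)   # hand out a copy; readers never block writers

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise Conflict(f"{doc_id}: expected rev {current_rev}, got {rev}")
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev


store = RevisionedStore()
store.put("john", {"name": "John Doe", "zip": 10001})

rev_a, doc_a = store.get("john")   # client A reads revision 1
rev_b, doc_b = store.get("john")   # client B reads revision 1 as well

doc_a["zip"] = 94301
store.put("john", doc_a, rev=rev_a)   # A wins; the document is now at revision 2

conflict_seen = False
try:
    doc_b["name"] = "Jane Doe"
    store.put("john", doc_b, rev=rev_b)   # B still holds revision 1
except Conflict:
    conflict_seen = True   # B must re-read, merge, and retry

print(conflict_seen, store.get("john"))
```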
Key/value Stores
• Memcached
• Membase
• Redis
• Tokyo Cabinet
• Kyoto Cabinet
• Berkeley DB
Questions?
• blog: shanky.org | twitter: @tshanky