MongoDB Basics

Sarang Shravagi Python Developer,ScaleArc @_sarangs


MongoDB workshop given by me at MIT, Pune. This PDF has an example of how to design a MongoDB schema according to application usage.

Transcript of MongoDB Basics

Page 1: MongoDB Basics

Sarang Shravagi Python Developer,ScaleArc

@_sarangs

Page 2: MongoDB Basics

Let’s Know Each Other

• Why are you attending?

• Do you code?

• OS?

• Programming Language?

• JSON?

• MongoDB?

Page 3: MongoDB Basics

Agenda

• SQL and NoSQL Database

• What is MongoDB?

• Hands-On and Assignment

• Design Models

• MongoDB Language Driver

• Disaster Recovery

• Handling BigData

Page 4: MongoDB Basics

Data Patterns & Storage Needs

• Product Information

• User Information

• Purchase Information

• Product Reviews

• Site Interactions

• Social Graph

• Search Index

Page 5: MongoDB Basics

SQL to NoSQL

Design Paradigm Shift

Page 6: MongoDB Basics

Database Evolution

Page 7: MongoDB Basics

SQL Storage

• Was designed when
  – Storage and data transfer were costly
  – Processing was slow
  – Applications were oriented more towards data collection

• Initial adopters were financial institutions

Page 8: MongoDB Basics

SQL Storage

• Structured – schema

• Relational – foreign keys, constraints

• Transactional – Atomicity, Consistency, Isolation, Durability

• High Availability through robustness – Minimize failures

• Optimized for Writes

• Typically Scale Up

Page 9: MongoDB Basics

NoSQL Storage

• Designed for a time when
  – Storage is cheap
  – Data transfer is fast
  – Much more processing power is available
    • Clustering of machines is also possible
  – Applications are oriented towards consumption of user-generated content
  – Better on-screen user experience is in demand

Page 10: MongoDB Basics

NoSQL Storage

• Semi-structured – Schemaless

• Consistency, Availability, Partition Tolerance

• High Availability through clustering – expect failures

• Optimized for Reads

• Typically Scale Out

Page 11: MongoDB Basics

Different Databases

Half Level Deep

Page 12: MongoDB Basics

SQL: RDBMS

• MySQL, PostgreSQL, Oracle etc.
• Stores data in tables having columns
  – Basic (number, text) data types
• Strong query language
• Transparent values
  – Query language can read and filter on them
  – Relationship between tables based on values

• Suited for user info and transactions

Page 13: MongoDB Basics

NoSQL Data Model

Page 14: MongoDB Basics

NoSQL: Key/Value

Page 15: MongoDB Basics

NoSQL: Document

• MongoDB, CouchDB etc.

• Object-oriented data models
  – Stores data in document objects having fields
  – Basic and compound (list, dict) data types
• SQL-like queries
• Transparent values
  – Can be part of query

• Suited for product info and its reviews

Page 16: MongoDB Basics

NoSQL: Document

Page 17: MongoDB Basics

NoSQL: Column Family

• Cassandra, Bigtable etc.
• Stores data in columns
• Transparent values
  – Can be part of query
• SQL-like queries

• Suited for search

Page 18: MongoDB Basics

NoSQL: Graph

• Neo4j

• Stores data in form of nodes and relationships

• Query is in form of traversal

• In-memory

• Suited for social graph

Page 19: MongoDB Basics

NoSQL: Graph

Page 20: MongoDB Basics

What is MongoDB?

Page 21: MongoDB Basics

MongoDB is a ___________ database

1. Document

2. Open source

3. High performance

4. Horizontally scalable

5. Full featured

Page 22: MongoDB Basics

1. Document Database

• Not for .PDF & .DOC files

• A document is essentially an associative array

• Document = JSON object

• Document = PHP Array

• Document = Python Dict

• Document = Ruby Hash

• etc
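As a small illustration of the equivalence above, the sketch below builds a document as a plain Python dict and round-trips it through JSON using only the standard library. The field names are made up for the example, and MongoDB itself stores documents as BSON, which adds types (dates, ObjectIds) that plain JSON lacks.

```python
import json

# A document is nested key/value data; these field names are illustrative only.
document = {
    "title": "Welcome to MongoDB",
    "tags": ["intro", "nosql"],        # compound type: list
    "author": {"name": "sarangs"},     # compound type: embedded dict
}

as_json = json.dumps(document, sort_keys=True)  # dict -> JSON text
round_tripped = json.loads(as_json)             # JSON text -> dict
```

The same dict could be handed to a driver as-is, which is why no object-relational mapping layer is strictly needed.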

Page 23: MongoDB Basics

Database Landscape

Page 24: MongoDB Basics

2. Open Source

• MongoDB is an open source project

• On GitHub

• Licensed under the AGPL (server releases since October 2018 use the SSPL instead)

• Started & sponsored by MongoDB Inc (formerly known as 10gen)

• Commercial licenses available

• Contributions welcome

Page 25: MongoDB Basics

7,000,000+ MongoDB Downloads

150,000+ Online Education Registrants

35,000+ MongoDB Management Service (MMS) Users

30,000+ MongoDB User Group Members

20,000+ MongoDB Days Attendees

Global Community

Page 26: MongoDB Basics

3. High Performance

• Written in C++

• Extensive use of memory-mapped files, i.e. read-through, write-through memory caching

• Runs nearly everywhere

• Data serialized as BSON (fast parsing)

• Full support for primary & secondary indexes

• Document model = less work

Page 27: MongoDB Basics

Performance: better data locality, in-memory caching, in-place updates

Page 28: MongoDB Basics

4. Scalability

Auto-Sharding

• Increase capacity as you go

• Commodity and cloud architectures

• Improved operational simplicity and cost visibility

Page 29: MongoDB Basics

High Availability

• Automated replication and failover

• Multi-data center support

• Improved operational simplicity (e.g., HW swaps)

• Data durability and consistency

Page 30: MongoDB Basics

Scalability: MongoDB Architecture

Page 31: MongoDB Basics

5. Full Featured

• Ad Hoc queries

• Real time aggregation

• Rich query capabilities

• Strongly consistent

• Geospatial features

• Support for most programming languages

• Flexible schema

Page 32: MongoDB Basics

MongoDB is Fully Featured

Page 33: MongoDB Basics

MongoDB Architecture

Page 34: MongoDB Basics

Terminology

Page 35: MongoDB Basics

Do More With Your Data

MongoDB Rich Queries
• Find Paul’s cars
• Find everybody in London with a car built between 1970 and 1980

Geospatial
• Find all of the car owners within 5km of Trafalgar Sq.

Text Search
• Find all the cars described as having leather seats

Aggregation
• Calculate the average value of Paul’s car collection

Map Reduce
• What is the ownership pattern of colors by geography over time? (Is purple trending up in China?)

{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}
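To make two of the questions above concrete, here is a minimal sketch. The filter dicts show how such queries might be expressed with the `$elemMatch`, `$gte` and `$lte` operators; they are only constructed here, not sent to a server. The helper functions are hypothetical plain-Python equivalents so the logic can be checked without a database.

```python
paul = {
    "first_name": "Paul", "surname": "Miller", "city": "London",
    "location": [45.123, 47.232],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Filter documents as they might be passed to a driver's find() call.
pauls_cars_filter = {"first_name": "Paul"}
london_70s_filter = {
    "city": "London",
    "cars": {"$elemMatch": {"year": {"$gte": 1970, "$lte": 1980}}},
}

def has_car_built_between(person, first_year, last_year):
    """Plain-Python equivalent of the $elemMatch range condition."""
    return any(first_year <= car["year"] <= last_year for car in person["cars"])

def average_car_value(person):
    """Plain-Python equivalent of averaging cars.value for one owner."""
    values = [car["value"] for car in person["cars"]]
    return sum(values) / len(values)

matches_london_70s = (paul["city"] == "London"
                      and has_car_built_between(paul, 1970, 1980))
average_value = average_car_value(paul)
```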

Page 36: MongoDB Basics

Hands-On & Assignment

Page 37: MongoDB Basics

mongodb.org/downloads

Page 38: MongoDB Basics

$ tar -zxvf mongodb-osx-x86_64-2.6.0.tgz

$ cd mongodb-osx-x86_64-2.6.0/bin

$ mkdir -p /data/db

$ ./mongod

Running MongoDB

Page 39: MongoDB Basics

MongoDB: Core Binaries

• mongod – Database server

• mongo – Database client shell

• mongos – Router for Sharding

Page 40: MongoDB Basics

Getting Help

• For mongo shell
  – mongo --help
  – Shows options available for running the shell
• Inside mongo shell
  – db.help()
  – Shows commands available on the object

Page 41: MongoDB Basics

Database Operations

• Database creation

• Creating/changing collection

• Data insertion

• Data read

• Data update

• Creating indices

• Data deletion

• Dropping collection

Page 42: MongoDB Basics

MacBook-Pro-:~ $ mongo

MongoDB shell version: 2.6.0

connecting to: test

> db.cms.insert({text: 'Welcome to MongoDB'})

> db.cms.find().pretty()

{

"_id" : ObjectId("51c34130fbd5d7261b4cdb55"),

"text" : "Welcome to MongoDB"

}

Mongo Shell

Page 43: MongoDB Basics

Diagnostic Tools

• mongostat

• mongoperf

• mongosniff

• mongotop

Page 44: MongoDB Basics

Import Export Tools

• For objects
  – mongodump
  – mongorestore
  – bsondump
  – mongooplog
• For data items
  – mongoimport
  – mongoexport

Page 45: MongoDB Basics

Assignment

• Tasks – assignments.txt

• Data – students.json

Page 46: MongoDB Basics

Questions?

Page 47: MongoDB Basics

Sarang Shravagi

@_sarangs

Thank You

Page 48: MongoDB Basics

Design Models

Page 49: MongoDB Basics

First step in any application is

Determine your entities

Page 50: MongoDB Basics

Entities in our Blogging System

• Users (post authors)

• Article

• Comments

• Tags, Category

• Interactions (views, clicks)

Page 51: MongoDB Basics

In a relational-based app

We would start by doing schema design

Page 52: MongoDB Basics

Typical (relational) ERD

Page 53: MongoDB Basics

In a MongoDB based app

We start building our app and let the schema evolve

Page 54: MongoDB Basics

MongoDB ERD

Page 55: MongoDB Basics

Disk seeks and data locality

[Figure: Post, Author and Comment records. Seek = 5+ ms; Read = really really fast]

Page 56: MongoDB Basics

Disk seeks and data locality

[Figure: Post, Author and five Comment records]

Page 57: MongoDB Basics

MongoDB Language Driver

Page 58: MongoDB Basics

Real applications are not built in the shell

Page 59: MongoDB Basics

MongoDB has native bindings for over 12 languages

Page 60: MongoDB Basics

Drivers & Ecosystem

Drivers: support for the most popular languages and frameworks (Java, Python, Perl, Ruby)

Frameworks: Morphia, MEAN Stack

Page 61: MongoDB Basics

Working With MongoDB

Page 62: MongoDB Basics

# Python dictionary (or object)
>>> from datetime import datetime
>>> article = { 'title' : 'Schema design in MongoDB',
...             'author' : 'sarangs',
...             'section' : 'schema',
...             'slug' : 'schema-design-in-mongodb',
...             'text' : "Data in MongoDB has a flexible schema. "
...                      "So, 2 documents needn't have same structure. "
...                      "It allows implicit schema to evolve.",
...             'date' : datetime.utcnow(),
...             'tags' : ['MongoDB', 'schema'] }
# db is a PyMongo database handle; insert_one() replaces the PyMongo 2.x insert()
>>> db['articles'].insert_one(article)

Design schema.. In application code

Page 63: MongoDB Basics

>>> from datetime import datetime
>>> from bson.binary import Binary
# open in binary mode so the image bytes are read unchanged
>>> img_data = Binary(open('article_img.jpg', 'rb').read())
>>> article = { 'title' : 'Schema evolution in MongoDB',
...             'author' : 'mattbates',
...             'section' : 'schema',
...             'slug' : 'schema-evolution-in-mongodb',
...             'text' : 'MongoDB has dynamic schema. For good '
...                      'performance, you would need an implicit '
...                      'structure and indexes',
...             'date' : datetime.utcnow(),
...             'tags' : ['MongoDB', 'schema', 'migration'],
...             'headline_img' : {
...                 'img' : img_data,
...                 'caption' : 'A sample document at the shell'
...             }}
# insert_one() replaces the PyMongo 2.x insert()
>>> db['articles'].insert_one(article)

Let’s add a headline image

Page 64: MongoDB Basics

>>> article = { 'title' : 'Favourite web application framework',
...             'author' : 'sarangs',
...             'section' : 'web-dev',
...             'slug' : 'web-app-frameworks',
...             'gallery' : [
...                 { 'img_url' : 'http://x.com/45rty', 'caption' : 'Flask' },  # ..
...                 # ..
...             ],
...             'date' : datetime.utcnow(),
...             'tags' : ['Python', 'web'],
...           }
# insert_one() replaces the PyMongo 2.x insert()
>>> db['articles'].insert_one(article)

And different types of article

Page 65: MongoDB Basics

>>> user = {
...     'user' : 'sarangs',
...     'email' : '[email protected]',
...     'password' : 'sarang',
...     'joined' : datetime.utcnow(),
...     'location' : { 'city' : 'Mumbai' },
... }
# insert_one() replaces the PyMongo 2.x insert()
>>> db['users'].insert_one(user)

Users and profiles

Page 66: MongoDB Basics

Modelling comments (1)

• Two collections: articles and comments
• Use a reference (i.e. foreign key) to link together
• But: N+1 queries to retrieve an article and its comments

{
  '_id' : ObjectId(..),
  'title' : 'Schema design in MongoDB',
  'author' : 'mattbates',
  'date' : ISODate(..),
  'tags' : ['MongoDB', 'schema'],
  'section' : 'schema',
  'slug' : 'schema-design-in-mongodb',
  'comments' : [ ObjectId(..), … ]
}

{
  '_id' : ObjectId(..),
  'article_id' : 1,
  'text' : 'A great article, helped me understand schema design',
  'date' : ISODate(..),
  'author' : 'johnsmith'
}
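The N+1 cost of the referenced model can be sketched without a server. The `CountingCollection` class below is a hypothetical in-memory stand-in for a collection (not a PyMongo class) that simply counts lookups; the ids and comment texts are made up.

```python
class CountingCollection:
    """Tiny in-memory stand-in for a collection that counts lookups."""

    def __init__(self, docs):
        self._docs = {doc["_id"]: doc for doc in docs}
        self.queries = 0

    def find_one(self, doc_id):
        self.queries += 1
        return self._docs.get(doc_id)

articles = CountingCollection([
    {"_id": "a1", "title": "Schema design in MongoDB",
     "comments": ["c1", "c2", "c3"]},
])
comments = CountingCollection([
    {"_id": "c1", "article_id": "a1", "text": "A great article"},
    {"_id": "c2", "article_id": "a1", "text": "Very helpful"},
    {"_id": "c3", "article_id": "a1", "text": "Thanks"},
])

def load_article_with_comments(article_id):
    article = articles.find_one(article_id)                # 1 query
    loaded = [comments.find_one(comment_id)                # N queries
              for comment_id in article["comments"]]
    return article, loaded

article, loaded_comments = load_article_with_comments("a1")
total_queries = articles.queries + comments.queries
```

With three comments this takes four lookups. A real driver could fetch all comments with one `$in` filter on `_id`, which brings it down to two round trips but still not one.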

Page 67: MongoDB Basics

Modelling comments (2)

• Single articles collection: embed comments in article documents
• Pros
  – Single query, document designed for the access pattern
  – Locality (disk, shard)
• Cons
  – Comments array is unbounded; documents will grow in size (remember the 16MB document limit)

{
  '_id' : ObjectId(..),
  'title' : 'Schema design in MongoDB',
  'author' : 'mattbates',
  'date' : ISODate(..),
  'tags' : ['MongoDB', 'schema'],
  'comments' : [
    {
      'text' : 'A great article, helped me understand schema design',
      'date' : ISODate(..),
      'author' : 'johnsmith'
    },
  ]
}
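To get a feel for how quickly an embedded comments array approaches the 16MB limit, the sketch below estimates document size. It uses the length of the JSON encoding as a rough proxy, which is an assumption: real BSON sizes differ somewhat. The comment contents are sample values.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16MB document limit

def approximate_size(document):
    """Rough size proxy: bytes of the UTF-8 JSON encoding (not true BSON size)."""
    return len(json.dumps(document).encode("utf-8"))

article = {"title": "Schema design in MongoDB", "author": "mattbates",
           "comments": []}
comment = {"text": "A great article, helped me understand schema design",
           "date": "2014-04-10T16:00:51Z", "author": "johnsmith"}

size_empty = approximate_size(article)
article["comments"].extend(dict(comment) for _ in range(1000))
size_with_1000 = approximate_size(article)

# Extrapolate how many comments of this size would fit before the limit.
bytes_per_comment = (size_with_1000 - size_empty) / 1000
comments_until_limit = int((MAX_DOCUMENT_BYTES - size_empty) // bytes_per_comment)
```

Short comments leave a lot of headroom, but the array has no upper bound, and every read of the article pays for all of it.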

Page 68: MongoDB Basics

Modelling comments (3)

• Another option: hybrid of (1) and (2), embed top x comments (e.g. by date, popularity) into the article document
  – Fixed-size (2.4 feature) comments array
  – All other comments ‘overflow’ into a comments collection (double write) in buckets
• Pros
  – Document size is more fixed, so fewer moves
  – Single query built
  – Full comment history with rich query/aggregation

Page 69: MongoDB Basics

Modelling comments (3)

{
  '_id' : ObjectId(..),
  'title' : 'Schema design in MongoDB',
  'author' : 'mattbates',
  'date' : ISODate(..),
  'tags' : ['MongoDB', 'schema'],
  'comments_count' : 45,
  'comments_pages' : 1,
  'comments' : [
    {
      'text' : 'A great article, helped me understand schema design',
      'date' : ISODate(..),
      'author' : 'johnsmith'
    },
  ]
}

• Total number of comments (comments_count)
  – Integer counter updated by update operation as comments are added/removed
• Number of pages (comments_pages)
  – A page is a bucket of 100 comments (see next slide)
• Fixed-size comments array (comments)
  – 10 most recent
  – Sorted by date on insertion

Page 70: MongoDB Basics

Modelling comments (3)

{
  '_id' : ObjectId(..),
  'article_id' : ObjectId(..),
  'page' : 1,
  'count' : 42,
  'comments' : [
    {
      'text' : 'A great article, helped me understand schema design',
      'date' : ISODate(..),
      'author' : 'johnsmith'
    },
  ]
}

• One comment bucket (page) document containing up to about 100 comments
• Array of up to 100 comment sub-documents
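The hybrid model can be simulated in memory to see how the two writes interact. This is a sketch under assumptions: zero-based page numbers, integer stand-ins for dates, and plain list operations imitating what `$push` with `$sort` and `$slice` would do on the server.

```python
BUCKET_SIZE = 100   # comments per bucket (page)
RECENT_SIZE = 10    # embedded most-recent comments

def add_comment(article, buckets, comment):
    """Write the comment to its bucket and to the article's recent list."""
    page = article["comments_count"] // BUCKET_SIZE       # bucket that receives it
    bucket = buckets.setdefault(
        page, {"page": page, "count": 0, "comments": []})
    bucket["comments"].append(comment)                    # like $push
    bucket["count"] += 1                                  # like $inc

    recent = article["comments"] + [comment]
    recent.sort(key=lambda c: c["date"])                  # like $sort by date
    article["comments"] = recent[-RECENT_SIZE:]           # like $slice: -10
    article["comments_count"] += 1
    article["comments_pages"] = len(buckets)

article = {"title": "Schema design in MongoDB", "comments_count": 0,
           "comments_pages": 0, "comments": []}
buckets = {}
for i in range(250):
    add_comment(article, buckets, {"text": "comment %d" % i, "date": i})
```

After 250 comments the article still holds only the latest 10, while three buckets carry the full history.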

Page 71: MongoDB Basics

Modelling interactions

• Interactions
  – Article views
  – Comments
  – (Social media sharing)
• Requirements
  – Time series
  – Pre-aggregated in preparation for analytics

Page 72: MongoDB Basics

Modelling interactions

• Document per article per day: ‘bucketing’
• Daily counter and hourly sub-document counters for interactions
• Bounded array (24 hours)
• Single query to retrieve daily article interactions; ready-made for graphing and further aggregation

{
  '_id' : ObjectId(..),
  'article_id' : ObjectId(..),
  'section' : 'schema',
  'date' : ISODate(..),
  'daily' : { 'views' : 45, 'comments' : 150 },
  'hourly' : {
    '0' : { 'views' : 10 },
    '1' : { 'views' : 2 },
    '23' : { 'comments' : 14, 'views' : 10 }
  }
}
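A sketch of how the per-day bucket might be addressed and incremented. `build_interaction_update` and `apply_inc` are hypothetical helpers: the first builds the filter and `$inc` document an upsert could use, the second imitates `$inc` with dotted paths on a plain dict so the result can be inspected locally. The `daily` and `hourly` field names are assumptions of this sketch.

```python
from datetime import datetime

def build_interaction_update(article_id, kind, ts):
    """Return (filter, update) for the article's bucket on the day of ts."""
    day = datetime(ts.year, ts.month, ts.day)      # truncate to midnight
    query = {"article_id": article_id, "date": day}
    update = {"$inc": {
        "daily.{}".format(kind): 1,
        "hourly.{}.{}".format(ts.hour, kind): 1,
    }}
    return query, update

def apply_inc(document, update):
    """Imitate $inc with dotted paths, creating sub-dicts as needed."""
    for path, amount in update["$inc"].items():
        *parents, leaf = path.split(".")
        target = document
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = target.get(leaf, 0) + amount
    return document

ts = datetime(2014, 4, 10, 16, 0, 51)
query, update = build_interaction_update("article-1", "views", ts)
bucket = dict(query)          # what an upsert would start from
apply_inc(bucket, update)
apply_inc(bucket, update)     # a second view in the same hour
```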

Page 73: MongoDB Basics

JSON and RESTful API

Real applications are not built at a shell, so let’s build a RESTful API.

[Diagram: client-side JSON (e.g. AngularJS), HTTP(S) REST, Python web app, PyMongo driver, (BSON)]

Examples to follow: Python RESTful API using the Flask microframework

Page 74: MongoDB Basics

myCMS REST endpoints

Method  URI                                   Action
GET     /articles                             Retrieve all articles
GET     /articles-by-tag/[tag]                Retrieve all articles by tag
GET     /articles/[article_id]                Retrieve a specific article by article_id
POST    /articles                             Add a new article
GET     /articles/[article_id]/comments       Retrieve all article comments by article_id
POST    /articles/[article_id]/comments       Add a new comment to an article
POST    /users                                Register a user
GET     /users/[username]                     Retrieve user’s profile
PUT     /users/[username]                     Update a user’s profile
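The endpoint table can be read as a dispatch table. The sketch below resolves a method and path to an action using only the standard library; the action names and the `resolve` helper are hypothetical, and a real application would use a web framework's routing instead.

```python
import re

# (method, path pattern, hypothetical action name, names of captured parts)
ROUTES = [
    ("GET",  r"^/articles$",                 "list_articles",        []),
    ("GET",  r"^/articles-by-tag/([^/]+)$",  "list_articles_by_tag", ["tag"]),
    ("GET",  r"^/articles/([^/]+)$",         "get_article",          ["article_id"]),
    ("POST", r"^/articles$",                 "add_article",          []),
    ("GET",  r"^/articles/([^/]+)/comments$", "list_comments",       ["article_id"]),
    ("POST", r"^/articles/([^/]+)/comments$", "add_comment",         ["article_id"]),
    ("POST", r"^/users$",                    "register_user",        []),
    ("GET",  r"^/users/([^/]+)$",            "get_user",             ["username"]),
    ("PUT",  r"^/users/([^/]+)$",            "update_user",          ["username"]),
]

def resolve(method, path):
    """Return (action, params) for the first matching route, else (None, {})."""
    for route_method, pattern, action, names in ROUTES:
        match = re.match(pattern, path)
        if route_method == method and match:
            return action, dict(zip(names, match.groups()))
    return None, {}
```

For example, `resolve("POST", "/articles/42/comments")` yields the `add_comment` action with `article_id` set to `"42"`.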

Page 75: MongoDB Basics

$ git clone http://www.github.com/mattbates/mycms_mongodb

$ cd mycms_mongodb

$ virtualenv venv

$ source venv/bin/activate

$ pip install -r requirements.txt

$ mkdir -p data/db

$ mongod --dbpath=data/db --fork --logpath=mongod.log

$ python web.py

[$ deactivate]

Getting started with the skeleton code

Page 76: MongoDB Basics

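import json
from bson import json_util
from flask import jsonify

# app (Flask) and db (PyMongo database) are created elsewhere in the skeleton code

@app.route('/cms/api/v1.0/articles', methods=['GET'])
def get_articles():
    """Retrieves all articles in the collection
    sorted by date
    """
    # query all articles and return a cursor sorted by date
    cur = db['articles'].find().sort('date')
    # iterate the cursor and collect the docs in a list
    # (an empty collection simply yields an empty list)
    articles = list(cur)
    # json_util.default serializes BSON types such as ObjectId and datetime;
    # note that the list is embedded in the response as a JSON string
    return jsonify({'articles' : json.dumps(articles, default=json_util.default)})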

RESTful API methods in Python + Flask

Page 77: MongoDB Basics

from bson.objectid import ObjectId
from pymongo import ReturnDocument

@app.route('/cms/api/v1.0/articles/<string:article_id>/comments', methods=['POST'])
def add_comment(article_id):
    """Adds a comment to the specified article and a
    bucket, as well as updating a view counter
    """
    # `article` and `comment` are prepared earlier in the function (not shown on the slide)
    page_id = article['last_comment_id'] // 100
    # push the comment to the latest bucket and $inc the count
    # (find_one_and_update replaces the PyMongo 2.x find_and_modify)
    page = db['comments'].find_one_and_update(
        { 'article_id' : ObjectId(article_id),
          'page' : page_id },
        { '$inc' : { 'count' : 1 },
          '$push' : { 'comments' : comment } },
        projection={'count' : 1},
        upsert=True,
        return_document=ReturnDocument.AFTER)

RESTful API methods in Python + Flask

Page 78: MongoDB Basics

    # $inc the page count if bucket size (100) is exceeded
    if page['count'] > 100:
        db['articles'].update_one(
            { '_id' : ObjectId(article_id),
              'comments_pages' : article['comments_pages'] },
            { '$inc' : { 'comments_pages' : 1 } })
    # let's also add to the article itself
    # most recent 10 comments only
    res = db['articles'].update_one(
        {'_id' : ObjectId(article_id)},
        {'$push' : {'comments' : { '$each' : [comment],
                                   '$sort' : {'date' : 1},
                                   '$slice' : -10}},
         '$inc' : {'comments_count' : 1}})

RESTful API methods in Python + Flask

Page 79: MongoDB Basics

import datetime
from bson.objectid import ObjectId
from pymongo.write_concern import WriteConcern

def add_interaction(article_id, type):
    """Record the interaction (view/comment) for the
    specified article into the daily bucket and
    update an hourly counter
    """
    ts = datetime.datetime.utcnow()
    # unacknowledged (w=0) writes trade delivery guarantees for performance
    interactions = db['interactions'].with_options(write_concern=WriteConcern(w=0))
    # $inc daily and hourly counters in the day/article stats bucket
    interactions.update_one(
        { 'article_id' : ObjectId(article_id),
          'date' : datetime.datetime(ts.year, ts.month, ts.day)},
        { '$inc' : {
            'daily.{}'.format(type) : 1,
            'hourly.{}.{}'.format(ts.hour, type) : 1
        }},
        upsert=True)

RESTful API methods in Python + Flask

Page 80: MongoDB Basics

$ curl -i http://localhost:5000/cms/api/v1.0/articles

HTTP/1.0 200 OK

Content-Type: application/json

Content-Length: 335

Server: Werkzeug/0.9.4 Python/2.7.5

Date: Thu, 10 Apr 2014 16:00:51 GMT

{

"articles": "[{\"title\": \"Schema design in MongoDB\", \"text\": \"Data in MongoDB

has a flexible schema..\", \"section\": \"schema\", \"author\": \"sarangs\", \"date\":

{\"$date\": 1397145312505}, \"_id\": {\"$oid\": \"5346bef5f2610c064a36a793\"},

\"slug\": \"schema-design-in-mongodb\", \"tags\": [\"MongoDB\", \"schema\"]}]"}

Testing the API – retrieve articles
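Note that the `articles` value in the response is itself a JSON string, so a client has to decode twice. A minimal sketch of that, using a shortened copy of the response body; `from_extended_date` is a hypothetical helper for the `{"$date": milliseconds}` form produced by `json_util`.

```python
import json
from datetime import datetime, timezone

# Shortened copy of the response body, rebuilt here so the sketch is self-contained.
body = json.dumps({"articles": json.dumps([{
    "title": "Schema design in MongoDB",
    "date": {"$date": 1397145312505},
    "_id": {"$oid": "5346bef5f2610c064a36a793"},
}])})

outer = json.loads(body)                   # first pass: the envelope
articles = json.loads(outer["articles"])   # second pass: the embedded string

def from_extended_date(value):
    """Convert {"$date": <ms since epoch>} to a timezone-aware datetime."""
    return datetime.fromtimestamp(value["$date"] / 1000, tz=timezone.utc)

first = articles[0]
published = from_extended_date(first["date"])
```

Returning the list directly rather than a pre-serialized string would spare clients the second decode, at the cost of handling BSON types in the API layer.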

Page 81: MongoDB Basics

$ curl -H "Content-Type: application/json" -X POST \
    -d '{"text":"An interesting article and a great read."}' \
    http://localhost:5000/cms/api/v1.0/articles/52ed73a30bd031362b3c6bb3/comments

{
  "comment": "{\"date\": {\"$date\": 1391639269724}, \"text\": \"An interesting article and a great read.\"}"
}

Testing the API – comment on an article

Page 82: MongoDB Basics

Disaster Recovery

Introduction to Replica Sets and

High Availability

Page 83: MongoDB Basics

Disasters

• Physical failure
  – Hardware
  – Network
• Solution: replica sets
  – Provide redundant storage for high availability
  – Real-time data synchronization
  – Automatic failover for near-zero downtime

Page 84: MongoDB Basics

Replication

Page 85: MongoDB Basics

Multi Replication

• Data can be replicated to multiple places simultaneously

• Use an odd number of voting members in a replica set, so that an election can always reach a majority

Page 86: MongoDB Basics

Single Replication

• If you want only one secondary (or any odd number of secondaries, which gives an even number of data-bearing members), you need to set up an arbiter to break ties

Page 87: MongoDB Basics

Failover

• When the primary fails, the remaining members vote to elect a new primary
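The preference for an odd member count follows from majority voting. The arithmetic can be sketched as below; the function names are hypothetical and the model ignores priorities and non-voting members.

```python
def majority(voting_members):
    """Votes needed to elect a primary: strictly more than half."""
    return voting_members // 2 + 1

def fault_tolerance(voting_members):
    """How many members can be lost while an election is still possible."""
    return voting_members - majority(voting_members)

def can_elect_primary(voting_members, reachable_members):
    return reachable_members >= majority(voting_members)

tolerance_by_size = {n: fault_tolerance(n) for n in range(1, 6)}
```

Going from 3 to 4 members does not raise the number of failures the set can survive, which is why an even set is usually topped up with an arbiter rather than left as is.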

Page 88: MongoDB Basics

Handling Big Data

Introduction to Map/Reduce

and Sharding

Page 89: MongoDB Basics

Large Data Sets

• Problem 1: performance
  – Queries become slow
• Solution: Map/Reduce

Page 90: MongoDB Basics

Aggregation

Page 91: MongoDB Basics

Map Reduce

• A way to divide large query computation into smaller chunks

• May run in multiple processes across multiple machines

• Think of it as GROUP BY of SQL

Page 92: MongoDB Basics

Map/Reduce Example

• The map function scans the data and emits the required values

Page 93: MongoDB Basics

Map/Reduce Example

• The reduce function takes the output of the map function and generates an aggregated value
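The two phases can be sketched in plain Python. This is an in-process illustration of the idea, counting articles per tag, not MongoDB's own mapReduce command; the sample articles reuse titles and tags from earlier slides.

```python
from collections import defaultdict

articles = [
    {"title": "Schema design in MongoDB", "tags": ["MongoDB", "schema"]},
    {"title": "Schema evolution in MongoDB",
     "tags": ["MongoDB", "schema", "migration"]},
    {"title": "Favourite web application framework", "tags": ["Python", "web"]},
]

def map_article(article):
    """Map: emit one (tag, 1) pair per tag on the article."""
    for tag in article["tags"]:
        yield tag, 1

def reduce_counts(key, values):
    """Reduce: collapse all values emitted for one key into a total."""
    return sum(values)

def map_reduce(documents, map_fn, reduce_fn):
    grouped = defaultdict(list)
    for doc in documents:                       # map phase
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values)         # reduce phase
            for key, values in grouped.items()}

tag_counts = map_reduce(articles, map_article, reduce_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over several processes or machines.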

Page 94: MongoDB Basics

Large Data Sets

• Problem 2: vertical scaling of hardware
  – Can’t increase machine size beyond a limit
• Solution: sharding

Page 95: MongoDB Basics

Sharding

• A method for storing data across multiple machines

• Data is partitioned using Shard Keys

Page 96: MongoDB Basics

Data Partitioning: Range Based

• Each chunk holds a contiguous range of shard key values
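Range-based placement amounts to a sorted lookup. A minimal sketch with made-up split points, using the standard `bisect` module to find the chunk whose range contains a key:

```python
from bisect import bisect_right

# Boundaries between chunks; values are illustrative.
# Chunk 0: key < 100, chunk 1: 100 <= key < 200,
# chunk 2: 200 <= key < 300, chunk 3: key >= 300
SPLIT_POINTS = [100, 200, 300]

def chunk_for_key(shard_key):
    """Index of the chunk whose range contains shard_key."""
    return bisect_right(SPLIT_POINTS, shard_key)
```

Neighbouring keys land in the same chunk, which makes range queries cheap but can concentrate writes on one chunk when keys grow monotonically.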

Page 97: MongoDB Basics

Data Partitioning: Hash Based

• A hash function on Shard Keys decides the chunk
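Hash-based placement can be sketched the same way. SHA-256 and the fixed chunk count are assumptions chosen for the illustration; MongoDB uses its own hash function for hashed shard keys.

```python
import hashlib

NUM_CHUNKS = 4  # illustrative

def hashed_chunk_for_key(shard_key):
    """Pick a chunk from a hash of the key, so nearby keys scatter."""
    digest = hashlib.sha256(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CHUNKS

# Consecutive keys end up spread across chunks.
assignments = {key: hashed_chunk_for_key(key) for key in range(1000, 1020)}
```

The spread evens out write load, at the cost of turning range queries on the shard key into scatter-gather queries across shards.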

Page 98: MongoDB Basics

Sharded Cluster

Page 99: MongoDB Basics

Optimizing Shards: Splitting

• Within a shard, when a chunk grows beyond the configured size, it is split into two chunks
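A toy version of splitting, assuming a document-count threshold for simplicity (real splits are driven by chunk size in bytes) and a split at the median key:

```python
MAX_CHUNK_DOCS = 4  # illustrative threshold

def split_if_needed(chunk):
    """Return [chunk] unchanged, or two halves split at the median key."""
    if len(chunk) <= MAX_CHUNK_DOCS:
        return [chunk]
    keys = sorted(chunk)
    middle = len(keys) // 2
    return [keys[:middle], keys[middle:]]

small = split_if_needed([1, 2, 3])
large = split_if_needed([5, 1, 4, 2, 3, 6])
```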

Page 100: MongoDB Basics

Optimizing Shards: Balancing

• When one shard holds noticeably more chunks than the others, a few chunks are migrated to other shards
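Balancing can be sketched as repeatedly moving a chunk from the fullest shard to the emptiest one until the difference falls below a threshold. The threshold, shard names and chunk names are made up, and the real balancer applies more elaborate rules.

```python
def balance(shards, threshold=2):
    """Migrate chunks until the fullest and emptiest shard differ by < threshold.

    Returns the number of migrations performed.
    """
    migrations = 0
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) < threshold:
            return migrations
        shards[emptiest].append(shards[fullest].pop())   # migrate one chunk
        migrations += 1

shards = {
    "shard_a": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "shard_b": ["c7"],
    "shard_c": [],
}
moved = balance(shards)
```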

Page 101: MongoDB Basics

Schema iteration

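New feature in the backlog?

Documents have a dynamic schema, so we just iterate the object schema.

>>> user = { 'username' : 'matt',
...          'first' : 'Matt',
...          'last' : 'Bates',
...          'preferences' : { 'opt_out' : True } }
# save through the collection (a dict has no save method);
# insert_one() is the current PyMongo call for a new document
>>> db['users'].insert_one(user)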
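Once the schema has been iterated, old and new document shapes coexist in the same collection, so reading code should tolerate missing fields. A small sketch with made-up user documents and a hypothetical `has_opted_out` helper:

```python
users = [
    {"username": "sarangs", "first": "Sarang"},                  # older shape
    {"username": "matt", "first": "Matt", "last": "Bates",
     "preferences": {"opt_out": True}},                          # newer shape
]

def has_opted_out(user):
    """Treat a missing preferences sub-document as 'not opted out'."""
    return user.get("preferences", {}).get("opt_out", False)

opt_out_by_user = {user["username"]: has_opted_out(user) for user in users}
```

Defaults applied at read time like this avoid a bulk migration, though a one-off backfill may still be simpler when many code paths read the field.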

Page 102: MongoDB Basics

docs.mongodb.org

Page 103: MongoDB Basics

Online Training at MongoDB University

Page 104: MongoDB Basics

For More Information

Resource              Location
MongoDB Downloads     mongodb.com/download
Free Online Training  education.mongodb.com
Webinars and Events   mongodb.com/events
White Papers          mongodb.com/white-papers
Case Studies          mongodb.com/customers
Presentations         mongodb.com/presentations
Documentation         docs.mongodb.org
Additional Info       [email protected]

Page 105: MongoDB Basics

We've introduced a lot of

concepts here

Page 106: MongoDB Basics

Schema Design @

Page 107: MongoDB Basics

Replication @

Page 108: MongoDB Basics

Indexing @

Page 109: MongoDB Basics

Sharding @

Page 110: MongoDB Basics

Questions?

Page 111: MongoDB Basics

Sarang Shravagi

@_sarangs

Thank You