MongoDB at Sailthru: Scaling and Schema Design


Ian White (@eonwhite)

NoSQL Now! 8/25/11

Sunday, August 7, 2011


Sailthru

• API-based transactional email led to...

• Mass campaign email led to...

• Intelligence and user behavior

• Three engineers built the ESP we always wanted to use

• Some Clients: Huffpo-AOL, Thrillist, Refinery 29, Flavorpill, Business Insider, Fab, Totsy, New York Observer


How We Got To MongoDB from SQL

• JSON was part of Sailthru infrastructure from start (SQL columns and S3)

• Kept a close eye on CouchDB project

• MongoDB felt like natural fit

• Used for user profiles and analytics initially

• Migrated one table at a time (very, very carefully)


Sailthru Architecture

• User interface to display stats, build campaigns and templates, etc (PHP/EC2)

• API, link rewriting, and onsite endpoints (PHP/EC2)

• Core mailer engine (Java/EC2 and colo)

• Modified-postfix SMTP servers (colo)

• 11 database servers on EC2 (for now)


MongoDB Overview

• 13 instances on EC2 (6 two-member replica sets, 1 backup server)

• About 40 collections

• About 1TB

• Largest single collection is 500m docs


Users are Documents

• Users aren’t records split among multiple tables

• End user’s lists, clickstream interests, geolocation, browser, time of day, purchase history becomes one ever-growing document


Profiles Accessible Everywhere

• Put abandoned shopping cart notifications within a mass email

{if profile.purchase_incomplete}
  <p>This is what’s in your cart:</p>
  {foreach profile.purchase_incomplete.items as item}
    {item.qty} <a href="{item.url}">{item.title}</a><br/>
  {/foreach}
{/if}


Profiles Accessible Everywhere

• Show a section of content conditional on the user’s location

{if profile.geo.city['New York, NY US']}
  <div>Come to the New York Meetup on the 27th!</div>
{/if}


Profiles Accessible Everywhere

• Show different content depending on user interests as measured by on-site behavior

{select}
  {case horizon_interest('black,dark')}
    <img src="http://example.com/dress-image-black.jpg" />
  {/case}
  {case horizon_interest('green')}
    <img src="http://example.com/dress-image-green.jpg" />
  {/case}
  {case horizon_interest('purple,polka_dot,pattern')}
    <img src="http://example.com/dress-image-polkadot.jpg" />
  {/case}
{/select}


Profiles Accessible Everywhere

• Pick top content from a data feed based on tags

{content = horizon_select(content,10)}

{foreach content as c}
  <a href="{c.url}">{c.title}</a><br/>
{/foreach}


Other Advantages of MongoDB

• High performance

• Take any parameters from our clients

• Really flexible development

• Great for analytics (internal and external)

• No more downtime for schema migrations or reindexing


How We Run mongod

• mongod --dbpath /path/to/db --logpath /path/to/log/mongodb.log --logappend --fork --rest --replSet main1 --journal

• Don’t ever run without replication

• Don’t ever kill -9

• Don’t run without writing to a log

• Run behind a firewall

• Use journaling now that it’s there

• Use --rest, it’s handy


Separate DBs By Collections

• Lower-effort than auto-sharding

• Separate databases for different usage patterns

• Consider consequences of database failure/unavailability

• But make sure your backup and monitoring strategy is prepared for multiple DBs


Our Five Replica Sets

• main: most of the stuff on the UI, lots of small/medium collections

• horizon: realtime onsite browsing data

• profile: user profile data (60m user docs)

• message: last three months of emails

• archive: emails older than three months


Monitoring

• Some stuff to monitor: faults/sec, index misses, % locked, queue size, load average

• We check basic status once a minute on all database servers (SMS alerts if down) and send email warnings on thresholds every 10 minutes

• We have been beta-testing 10gen’s MMS product


Backups

• Used to use mongodump - don’t do that anymore

• Have single node of each replica set on a backup server

• Two-hour slave delay

• fsync/lock, freeze xfs file system, EBS snapshot, unfreeze, unlock
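
The order of the last bullet matters: lock and freeze before the snapshot, then always unfreeze and unlock afterwards, even if the snapshot fails. A minimal sketch of that ordering follows. The step functions are hypothetical placeholders, not Sailthru's actual scripts; in a real script they would wrap the MongoDB fsync+lock command, xfs_freeze, and the EBS snapshot call.

```javascript
// Sketch only: each step is an injected, hypothetical function.
// No MongoDB, filesystem, or EC2 calls are made here.
function runSnapshotBackup(steps) {
  const log = [];
  const run = (name) => {
    steps[name]();
    log.push(name);
  };

  run("fsyncLock"); // flush pending writes to disk and block new ones
  try {
    run("freezeFilesystem"); // freeze the xfs data volume
    try {
      run("snapshotVolume"); // take the EBS snapshot
    } finally {
      run("unfreezeFilesystem"); // always thaw, even if the snapshot failed
    }
  } finally {
    run("unlock"); // always release the MongoDB write lock
  }
  return log;
}

// Usage with no-op steps, just to show the ordering.
const noop = () => {};
const order = runSnapshotBackup({
  fsyncLock: noop,
  freezeFilesystem: noop,
  snapshotVolume: noop,
  unfreezeFilesystem: noop,
  unlock: noop,
});
console.log(order.join(" -> "));
```

The nested try/finally is the point of the sketch: a failed snapshot should never leave the backup node frozen or locked.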


The Great EC2 EBS Outage Adventure

• We survived

• Most of our nodes unavailable for 2-4 days

• Were able to spin up new instances from backup server, snapshots, and get operational within hours

• Wasn’t fun


DESIGN


Develop Your Mental Model of MongoDB

• You don’t need to look at the internals

• But try to gain a working understanding of how MongoDB operates, especially RAM and indexes


Big-Picture Design Questions

• What is the data I want to store?

• How will I want to use that data later?

• How big will the data get?

• If the answers are “I don’t know yet”, guess with your best YAGNI


“But premature optimization is evil”

• Knuth said that about code, which is flexible and easy to optimize later

• Data is not as flexible as code

• So doing some planning for performance is usually good when it comes to your data


Specific MongoDB Design Questions

• Embed vs top-level collection?

• Denormalize (double-store data)?

• How many/which indexes?

• Arrays vs hashes for embedding?

• Implicit schema (field names and types)


Short Field Names?

• Disk space: cheap

• RAM: not cheap

• Developer Time: expensive

• Err towards compact, readable field names

• Might be worth writing a mapper

• We probably wish we’d used c instead of client_id
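
A mapper of the kind mentioned above could look roughly like this: code keeps readable names, storage gets short ones. This is a minimal sketch. The mapping table and the `toStorage`/`fromStorage` helpers are assumptions for illustration, not Sailthru's implementation, and only top-level keys are handled.

```javascript
// Hypothetical mapping from readable names (used in code) to short names
// (stored in MongoDB). Only client_id -> c comes from the slide; the other
// entries are illustrative.
const FIELD_MAP = { client_id: "c", email: "e", template: "t" };
const REVERSE_MAP = Object.fromEntries(
  Object.entries(FIELD_MAP).map(([longName, shortName]) => [shortName, longName])
);

// Rename top-level keys using the given map; unknown keys pass through.
function renameKeys(doc, map) {
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    const mapped = Object.prototype.hasOwnProperty.call(map, key) ? map[key] : key;
    out[mapped] = value;
  }
  return out;
}

const toStorage = (doc) => renameKeys(doc, FIELD_MAP);
const fromStorage = (doc) => renameKeys(doc, REVERSE_MAP);

const stored = toStorage({ client_id: 76, email: "user@example.com", opens: 3 });
console.log(stored); // short keys, ready to save
console.log(fromStorage(stored)); // readable keys again
```

The tradeoff is one more layer to maintain in exchange for smaller documents in RAM.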


Favor Human-Readable Foreign Keys

• DBRefs are a bit cumbersome

• Referencing by MongoId often means doing extra lookups

• Build human-readable references to save you doing lookups and manual joins


Example

• Store the Template and the Email as strings on the message object

• { template: "Internal - Blast Notify", email: "[email protected]" }

• No external reference lookups required

• The tradeoff is basically just disk space

Embed vs Top-Level Collections?

• Major question of MongoDB schema design

• If you can ask the question at all, you might want to err on the side of embedding

• Don’t embed if the embedding could get huge

• Don’t feel too bad about denormalizing by embedding AND storing in a top-level collection


Typical Properties of Top-Level Collections

• Independence: They don’t “belong” conceptually to another collection

• Nouns: the building blocks of your system

• Easily referenceable and updatable


Embedding Pros

• Super-fast retrieval of document with related data

• Atomic updates

• “Ownership” of embedded document is obvious

• Usually maps well to code structures


Embedding Cons

• Harder to get at embedded data and run mass queries across it

• Does not scale indefinitely; a document will hit the 16MB size limit

• Hard to create references to embedded object

• Limited ability to indexed-sort the embedded objects


If You Think You Can Embed

• You probably should

• I take advantage of embedding in my designs more often now than I did three years ago

• It’s a gift MongoDB gives you in exchange for giving up your joins


Design Example: User Permissions

• Users can have various broad permission levels for any number of clients

• For example, user ‘ploki’ might have permission level ‘admin’ for client 76 and permission level ‘reports_only’ for client 450


How Will We Use This Data?

• Retrieve all clients for a given user

• Retrieve all users for a given client

• Retrieve a permission level for a given client for a given user


How Will This Data Grow?

• In the medium term, it will stay small

• Number of clients and number of users can both grow infinitely


Back in SQL-land

• There’s a fairly standard way to do it

• It’s a many-many relationship, so

• Use a join table (client_user)


Should We Use a New Top-Level Collection?

db.client.user.save({ client_id: 76, username: 'ploki', permission: 'admin' });
db.client.user.save({ client_id: 450, username: 'ploki', permission: 'reports_only' });

db.client.user.ensureIndex({ client_id: 1 });
db.client.user.ensureIndex({ username: 1 });

// get all users belonging to a client
db.client.user.find({ client_id: 76 });

// get all clients a user has access to
db.client.user.find({ username: 'ploki' });

// get the current user's permission level for one client
db.client.user.findOne({ username: user.name, client_id: 76 });


Probably Not

• Only needed if we have lots of clients per user AND lots of users per client

• This is a case where we can embed, so let’s do so


Three Ways to Embed

• Object
  'clients': { '76': 'admin', '450': 'reports_only' }
  index: ???
  Not good: can’t do a multikey index on the keys of a hash

• Array of objects
  'clients': [ { '_id': 76, 'access': 'admin' }, { '_id': 450, 'access': 'reports_only' } ]
  index: { 'clients._id': 1 }
  Okay: but have to search through the array to find by _id on the retrieved doc

• Array and object
  'clients': [ 76, 450 ],
  'clients_access': { '76': 'admin', '450': 'reports_only' }
  index: { clients: 1 }
  Our approach: fields next to each other alphabetically
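
The third layout (array plus object) covers all three access patterns from the permissions example. The sketch below shows this using an in-memory array in place of a real collection. The `db.user` collection name in the comments and the second user are assumptions for illustration.

```javascript
// In-memory stand-in for a user collection. 'ploki' comes from the design
// example; 'other_user' is made up for illustration.
const users = [
  {
    username: "ploki",
    clients: [76, 450],
    clients_access: { "76": "admin", "450": "reports_only" },
  },
  {
    username: "other_user",
    clients: [450],
    clients_access: { "450": "admin" },
  },
];

// All users for a client. Shell equivalent (assumed collection name):
//   db.user.find({ clients: 76 })  // served by the { clients: 1 } index
function usersForClient(clientId) {
  return users
    .filter((u) => u.clients.includes(clientId))
    .map((u) => u.username);
}

// All clients for a user: read the array off the retrieved document.
function clientsForUser(username) {
  const user = users.find((u) => u.username === username);
  return user ? user.clients : [];
}

// Permission level for one client: direct key lookup, no array scan.
function permissionFor(username, clientId) {
  const user = users.find((u) => u.username === username);
  return user ? user.clients_access[String(clientId)] : undefined;
}

console.log(usersForClient(450));
console.log(clientsForUser("ploki"));
console.log(permissionFor("ploki", 76));
```

The cost is that `clients` and `clients_access` must be updated together; since both live in one document, a single update can change them atomically.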


Indexes

• Index all highly frequent queries

• Do less-indexed queries only on secondaries

• Reduce the size of indexes wherever you can on big collections

• Don’t sweat the medium-sized collections, focus on the big wins


Take Advantage of Multiple-Field Indexes

• Order matters

• If you have an index on {client_id: 1, email: 1 }

• Then you also have the {client_id: 1} index “for free”

• but not { email: 1}
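
The prefix rule can be written as a small check. This is a simplified model of the rule on this slide for equality queries, not MongoDB's actual query planner; `usesIndexPrefix` is a hypothetical helper.

```javascript
// Returns true if the queried fields form a leading prefix of the index's
// fields. The order of fields inside the query does not matter; the order of
// fields in the index does.
function usesIndexPrefix(indexFields, queryFields) {
  if (queryFields.length === 0 || queryFields.length > indexFields.length) {
    return false;
  }
  const prefix = new Set(indexFields.slice(0, queryFields.length));
  return queryFields.every((field) => prefix.has(field));
}

const index = ["client_id", "email"]; // i.e. { client_id: 1, email: 1 }

console.log(usesIndexPrefix(index, ["client_id"])); // true
console.log(usesIndexPrefix(index, ["client_id", "email"])); // true
console.log(usesIndexPrefix(index, ["email"])); // false
```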


Use your _id

• Every document must have an _id, and every collection gets an _id index, which will cost you index size

• So do something useful with _id


Take advantage of fast ^-anchored (prefix) regex queries on indexes

• Messages have _ids like: 32423.00000341

• Need all messages in blast 32423:

• db.message.blast.find( { _id: /^32423\./ } );

• (Yeah, I know the \. is ugly. Don’t use a dot if you do this.)
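
If the prefix is built from a variable rather than typed by hand, the separator needs escaping. A minimal sketch, assuming the `<blast id>.<sequence>` _id format shown above; `escapeRegex` and `blastPrefix` are hypothetical helpers.

```javascript
// Escape regex metacharacters so the prefix is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build an anchored regex matching every _id that starts with "<blastId>.".
// A case-sensitive pattern that begins with ^ and a literal prefix is the
// kind of regex MongoDB can answer from an index.
function blastPrefix(blastId) {
  return new RegExp("^" + escapeRegex(String(blastId) + "."));
}

const pattern = blastPrefix(32423);
console.log(pattern.source); // ^32423\.

// Shell usage (collection name from the slide):
//   db.message.blast.find({ _id: blastPrefix(32423) })
console.log(pattern.test("32423.00000341")); // true
console.log(pattern.test("324231.00000001")); // false: the dot is literal
```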


Manual Range Partitioning

• We moved a big message.blast collection into per-day collections:

• message.blast.20110605, message.blast.20110606, message.blast.20110607, etc.

• Keeps working set indexes smaller

• When we move data into the archive, drop() is much faster than remove()
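
With per-day collections, application code has to work out which collection a message belongs in. A minimal sketch of that naming step, following the message.blast.YYYYMMDD pattern above. Using UTC for day boundaries is an assumption; the slides do not say which time zone applies.

```javascript
// Collection name for a given day, e.g. message.blast.20110605.
function blastCollectionName(date) {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0");
  const dd = String(date.getUTCDate()).padStart(2, "0");
  return `message.blast.${yyyy}${mm}${dd}`;
}

// Names for every day in an inclusive range, e.g. to query a date span one
// collection at a time.
function blastCollectionsBetween(start, end) {
  const names = [];
  const day = new Date(
    Date.UTC(start.getUTCFullYear(), start.getUTCMonth(), start.getUTCDate())
  );
  while (day <= end) {
    names.push(blastCollectionName(day));
    day.setUTCDate(day.getUTCDate() + 1);
  }
  return names;
}

const names = blastCollectionsBetween(
  new Date(Date.UTC(2011, 5, 5)), // months are zero-based: 5 is June
  new Date(Date.UTC(2011, 5, 7))
);
console.log(names);
```

In the mongo shell, a computed name can be used with `db.getCollection(name)`, so archiving a day ends with `db.getCollection(name).drop()`.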


Questions? Looking for a job?

[email protected]/eonwhite
