Production MongoDB in the Cloud

From Essentials to Corner Cases

Who are we?

Mike Hobbs & Bridget Kromhout

Social Commerce & Brand Interest Graph Analytics

http://www.8thbridge.com/

Why MongoDB?

● Scalable, high-performance, open source
● Dynamic schemas for unstructured data
● Query language close to SQL in power
● "Eventually consistent" is hard to program right

Our configuration

12-node cluster (4 shards, each a 3-member replica set)
Several other non-sharded replica sets
Desired webapp response time is < 10 ms

Total data size: 110 GB
Total index size: 28 GB
Largest collection: 49 GB
Largest index: 8.1 GB

EC2: EBS, instance size, replication
MongoDB: right for only some data sets

Memory & iowait

Working set needs to fit in memory

● Indexes
● Frequently accessed records

Avoid swapping!
EBS latency in EC2 is an issue.

Fragmentation

Fragmentation steals from your most precious resource by reserving memory that is not used.

Run a compaction when your storageSize significantly exceeds your data size:

mongos> db.widgets.stats()
...
"size" : 5097988,
"storageSize" : 22507520,
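The rule of thumb above can be turned into a quick check. The following is a minimal sketch, assuming a stats document shaped like the output shown; the function names and the 2.0 threshold are illustrative assumptions, not MongoDB defaults.

```python
def fragmentation_ratio(stats):
    """Return storageSize / size for a collection stats document."""
    size = stats["size"]
    if size == 0:
        return 0.0  # empty collection: nothing to compact
    return stats["storageSize"] / size


def needs_compaction(stats, threshold=2.0):
    """True when allocated storage is at least `threshold` times the data size.

    The threshold is an assumed value; pick one that suits your workload.
    """
    return fragmentation_ratio(stats) >= threshold


# Figures taken from the db.widgets.stats() output above.
widgets_stats = {"size": 5097988, "storageSize": 22507520}
ratio = fragmentation_ratio(widgets_stats)
print("storageSize is %.1fx the data size" % ratio)
```

For the widgets collection the allocated storage is more than four times the data size, which is the kind of gap that makes a compaction worth scheduling.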

Padding can reduce fragmentation and I/O:

db.widgets.insert({widg_id: "72120", padding: "XXXX...XXX"})
db.widgets.update({widg_id: "72120"}, { $unset: {padding: ""}, $set: {desc: "Grout remover", price: "13.39", instock: true} })
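The same insert-then-fill pattern can be sketched from Python. This illustration only builds the documents; the helper names and the 64-byte pad are assumptions, and the resulting dictionaries would be passed to whatever insert and update calls your driver version provides.

```python
def padded_document(widg_id, pad_bytes=64):
    """Build a placeholder document whose padding field reserves space."""
    return {"widg_id": widg_id, "padding": "X" * pad_bytes}


def fill_in_update(fields):
    """Build an update spec that drops the padding and sets the real fields."""
    return {"$unset": {"padding": ""}, "$set": dict(fields)}


# Values mirror the shell example above; pad_bytes is an assumed size.
doc = padded_document("72120")
spec = fill_in_update({"desc": "Grout remover", "price": "13.39", "instock": True})
```

The pad should be roughly as large as the fields you expect to add later, so that the later update can reuse the reserved space.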

Replica sets

"optime" : { "t" : 1365165841000 , "i" : 1 },
"optimeDate" : { "$date" : "Fri Apr 5 07:44:01 2013" },

[Diagram: replica set members test-3-1.yourdomain, test-3-2.yourdomain, test-3-3.yourdomain]

Elections

Primary always determined by an election.

2-member replSet without an arbiter: if the secondary goes offline, the primary will step down:

08:52:06 [rsMgr] can't see a majority of the set, relinquishing primary
08:52:06 [rsMgr] replSet relinquishing primary state
08:52:06 [rsMgr] replSet SECONDARY
08:52:12 [rsMgr] replSet can't see a majority, will not try to elect self

Priorities can rig elections.

Ensure availability of an odd number of voting members.
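The majority rule behind these election slides is simple arithmetic. A minimal sketch, with hypothetical function names, of why a 2-member set loses its primary when one member disappears while a set with a third voter does not:

```python
def majority(voting_members):
    """Smallest number of votes that is more than half of the set."""
    return voting_members // 2 + 1


def can_hold_primary(voting_members, reachable):
    """True if a member that can reach `reachable` voters (itself included)
    still sees a majority of the set."""
    return reachable >= majority(voting_members)


# 2-member set, secondary offline: the primary sees only itself.
two_member = can_hold_primary(voting_members=2, reachable=1)
# Same two data nodes plus an arbiter: primary and arbiter form a majority.
with_arbiter = can_hold_primary(voting_members=3, reachable=2)
print(two_member, with_arbiter)  # prints: False True
```

An even number of voters adds a member that can fail without raising the number of failures the set can tolerate, which is why an odd count is the usual advice.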

Manual primary changes

No "become primary now" command. Manual stepdowns with a recusal timeout are the best option.

test-1:PRIMARY> rs.stepDown(300)
Wed Apr 3 11:45:36 DBClientCursor::init call() failed
Wed Apr 3 11:45:36 query failed : admin.$cmd { replSetStepDown: 300.0 } to: 127.0.0.1:27017
Wed Apr 3 11:45:36 Error: error doing query: failed src/mongo/shell/collection.js:155
Wed Apr 3 11:45:36 trying reconnect to 127.0.0.1:27017
Wed Apr 3 11:45:36 reconnect 127.0.0.1:27017 ok
test-1:SECONDARY>

This triggers an election.

(Obviously, make sure your preferred candidate(s) can win.)

States: down (initializing), startup2, secondary, primary

replSet back to standalone? No.

Test server: replica set of 1, shard of 1. We removed --replSet, but the shard configuration needed a manual update:

db.shards.update({host:"testreplset/test.domain.net"}, {$set:{host:"test.domain.net"}})

updatedExisting values were no longer returned by mongos, but were still visible when connected directly to mongod:

> db.schedule.update({_id:...}, {$set:{lock:true}}, false, true); db.runCommand("getlasterror")
{ "updatedExisting" : true, "n" : 1, "connectionId" : 73, "err" : null, "ok" : 1 }

Solution: re-add --replSet to the mongod startup line and revert the shard configs. (Bug open with 10gen.)

Sharding

Can increase parallelization of CPU & I/O
Carefully choose a shard key (nontrivial to change)
Must run config servers & mongos
Doesn't ensure high availability
Doesn't help if you're already out of memory

256GB collection max for initial sharding

Rebalancing data across shards

Queries block while servers negotiate final hand-off.

Updating indexes after hand-off can be slow.

Best run off-peak:

mongos> use config
switched to db config
mongos> db.settings.find()
{ "_id" : "balancer", "activeWindow" : { "start" : "23:00", "stop" : "6:00" } }
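A window such as 23:00 to 6:00 wraps past midnight, which is easy to get wrong when reasoning about whether balancing may be running at a given moment. The sketch below is illustrative only: the helper names are hypothetical, and treating the stop time as exclusive is an assumption rather than a statement about how the balancer itself evaluates the window.

```python
from datetime import time


def parse_hhmm(text):
    """Parse 'H:MM' or 'HH:MM' into a datetime.time."""
    hours, minutes = text.split(":")
    return time(int(hours), int(minutes))


def in_window(now, start, stop):
    """True if `now` lies in [start, stop), allowing a wrap past midnight."""
    if start <= stop:
        return start <= now < stop
    # Window crosses midnight: late evening or early morning both count.
    return now >= start or now < stop


# Same values as the activeWindow setting shown above.
window = {"start": "23:00", "stop": "6:00"}
start, stop = parse_hhmm(window["start"]), parse_hhmm(window["stop"])
print(in_window(time(2, 30), start, stop), in_window(time(12, 0), start, stop))
```

A check like this can be handy in monitoring scripts that want to treat migration-related slowness differently inside and outside the window.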

Mongos & replSet primary changes

Application-level errors talking to mongos after an election:

pymongo.errors.AutoReconnect: could not connect to localhost:27020: [Errno 111] Connection refused
pymongo.errors.OperationFailure: database error: error querying server

Mongos errors talking to mongod on original primary:

Tue Apr 2 09:01:05 [conn3288] Socket say send() errno:110 Connection timed out 10.141.131.214:27017
Tue Apr 2 09:01:05 [conn3288] DBException in process: socket exception [SEND_ERROR] for 10.141.131.214:27017

Connection pool checked lazily; invalid connections can persist for days, depending on load. Can clear manually:

mongos> db.adminCommand({connPoolSync:1});
{ "ok" : 1 }
mongos>

Failure handling

Applications must handle fail-over outages: AutoReconnect & OperationFailure in pymongo

    import time

    import pymongo.errors

    def auto_reconnect(func, *args, **kwargs):
        """Executes func, retrying on AutoReconnect or OperationFailure."""
        for _ in range(100):
            try:
                return func(*args, **kwargs)
            except pymongo.errors.AutoReconnect:
                pass
            except pymongo.errors.OperationFailure:
                pass
            # Wait briefly before the next attempt (about 10 seconds in total).
            time.sleep(0.1)
        raise TimeoutError()
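One possible refinement of the retry helper is to back off between attempts instead of sleeping a fixed interval. The sketch below is a generic variant, not the authors' code: `retry_with_backoff`, `RetriesExhausted`, and all default values are hypothetical, and the exception types are passed in so the example runs without a database. With pymongo you would pass `(pymongo.errors.AutoReconnect,)` as `retriable`.

```python
import time


class RetriesExhausted(Exception):
    """Raised when every attempt failed (hypothetical name)."""


def retry_with_backoff(func, retriable, attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call func(), retrying on `retriable` with capped exponential backoff."""
    delay = base_delay
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except retriable as exc:
            last_error = exc
        if attempt < attempts - 1:
            sleep(delay)                       # wait before the next attempt
            delay = min(delay * 2, max_delay)  # double the wait, up to the cap
    raise RetriesExhausted(str(last_error))


# Stand-in for a call that fails twice during a simulated fail-over.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated fail-over")
    return "ok"


result = retry_with_backoff(flaky, retriable=(ConnectionError,),
                            sleep=lambda seconds: None)
```

Retrying is only safe for operations that can be repeated without harm; a write that was applied just before the connection dropped would be applied again by a blind retry.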

MMS (MongoDB Monitoring Service)

● free; hosted by 10gen
● need to run agent locally
● 10gen's commercial support relies on MMS

Profiling queries [1]

Finding bad queries that are actively running:

$ mongo | tee mongo.log
> db.currentOp()
...
bye
$ grep numYields mongo.log
"numYields" : 0,
"numYields" : 62247,
"numYields" : 0,
...

# Use your favorite viewer to find the op with 62247 yields
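Instead of scanning the log by eye, the same search can be done on the currentOp result itself. A minimal sketch, assuming a result shaped like `{"inprog": [...]}` with `opid` and `numYields` fields on each entry; the sample data is made up, and pairing the opid with the 62247 figure is an assumption for illustration.

```python
def op_with_most_yields(current_op):
    """Return the in-progress operation with the highest numYields, or None."""
    ops = current_op.get("inprog", [])
    if not ops:
        return None
    return max(ops, key=lambda op: op.get("numYields", 0))


# Made-up sample document; only the numYields values echo the grep output.
sample = {"inprog": [
    {"opid": 10883897, "numYields": 0},
    {"opid": 10883898, "numYields": 62247},
    {"opid": 10883899, "numYields": 0},
]}
worst = op_with_most_yields(sample)
print(worst["opid"], worst["numYields"])
```

The opid of the returned entry is the value you would hand to killOp.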

Helpful to get server back to a responsive state:

$ mongo
> db.killOp(10883898)

Profiling queries [2]

Using nscanned to find queries that likely aren't using indexes:

$ grep -P 'nscanned:\d\d' /var/log/mongodb.log

... or in real-time:

$ tail -f /var/log/mongodb.log | grep -P 'nscanned:\d\d'
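When a fixed two-digit pattern is too coarse, the same filter can be written with an adjustable threshold. This is a minimal sketch assuming log lines that contain `nscanned:<number>` as in the grep pattern above; the function name and the sample lines are hypothetical.

```python
import re

NSCANNED = re.compile(r"nscanned:(\d+)")


def slow_scans(lines, min_scanned=10):
    """Yield (nscanned, line) for lines whose nscanned is at least min_scanned."""
    for line in lines:
        match = NSCANNED.search(line)
        if match and int(match.group(1)) >= min_scanned:
            yield int(match.group(1)), line.rstrip("\n")


# Hypothetical log lines, for illustration only.
sample_log = [
    "[conn12] query shop.widgets nscanned:3 nreturned:3 2ms\n",
    "[conn12] query shop.widgets nscanned:48213 nreturned:1 911ms\n",
]
hits = list(slow_scans(sample_log))
for scanned, line in hits:
    print(scanned, line)
```

Passing an open file object as `lines` works the same way, and raising `min_scanned` narrows the output to the heaviest scans.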

MongoDB also provides the setProfilingLevel() command, which can log all queries to the system.profile collection:

> db.system.profile.find({nscanned:{$gte:10}})

system.profile does incur some performance overhead, though.

Nagios

● plugin uses pymongo
● set up service groups

https://github.com/mzupan/nagios-plugin-mongodb

Ideas for the future

● Better reconnect handling in applications
● Lose the EBS? Ephemeral disk faster; rely on replication to keep data persistent.
● Intelligent use of mongo profiling (reduce observer effect of setProfilingLevel)
● Use more MMS alerts
● Going to 2.4.x (fast counts, hashed sharding)