Recovery Field Guide MongoDB Backup and - Percona...Optimise restore time, not backup run time Users...
Transcript of Recovery Field Guide MongoDB Backup and - Percona...Optimise restore time, not backup run time Users...
MongoDB Backup and Recovery Field Guide
Tim VaillancourtSr Technical Operations Architect, Percona
2
`whoami`
{name: “tim”,lastname: “vaillancourt”,employer: “percona”,techs: [
“mongodb”,“mysql”,“cassandra”,“redis”,“rabbitmq”,“solr”,“python”,“golang”
]}
3
Agenda
● History● Methods
○ Logical○ Binary
■ Cold■ LVM■ Hot Backup
● Integrity / Consistency○ mongodb_consistent_backup
● Architecture● Restore and Validation
4
History
● 3000-4000 BC: Culturally significant data backed up in a universal format
● 1400: The Printing Press● 1600-1800: Chapultepec Aqueduct● 1990s: Floppy and Zip Disks● 2000s: No more Floppy/Zip Disks● Present: All my data is on Google
Drive and I have 7 days of hourly Time Machine backups!
● Future: ?
5
Replication != Backup
● Replication is not a backup!○ Replication is High Availability○ Including
■ Binary/Statement-based Replication of any type● Delayed Replication***
■ RAID Arrays● <EOF>
Backup Methods
7
Logical Backups
● Tools○ mongodump
■ Uses find() queries with $snapshot to backup all collections■ Supports Gzip and Threading in 3.2+■ Outputs a directory containing bson files in various subdirectories
○ Custom Queries■ The client API could be used similarly to mongodump to perform logical backups
● Benefits○ Reduced storage footprint○ Replication awareness○ Compatibility
● Drawbacks
8
Binary Backups: Cold Backup
● Very simple process● Causes full outage to MongoDB instance!● Process
○ Stop mongod○ Copy and archive dbPath○ Start mongod
9
Binary Backups: LVM / Filer / Cloud Disk
● Process○ If Non-Journalled
■ db.fsyncLock()■ Keep session open
○ Create block-device snapshot○ Unlock the database
■ db.fsyncUnlock()○ Copy or achive the snapshot directory○ Remove block devics snapshot (as quickly as
possible!)● LVM
○ Snapshots have been demonstrated to cause up to 30%* write latency impact to disk due to COW
10
Binary Backups: Hot Backup
● PSMDB or MongoDB Enterprise○ Pay $$$ for MongoDB Enterprise or download
PSMDB for free(!)○ db.adminCommand({
createBackup: 1,backupDir: "/data/mongodb/backup"
})○ Copy/archive the output path○ Delete the backup output path○ NOTE:
■ RocksDB-based createBackup creates filesystem hardlinks whenever possible!
■ Delete RocksDB backupDir as soon as possible to reduce bloom filter overhead!
Backup Integrity / Consistency
12
The “Distributed Cluster Backup Problem”
● Mongodump is single node consistent only!● Common to most or all database techs in
sharded environment● Problems:
○ Backup tools consider single-instance integrity only
○ Backups of different shards may complete at different times
○ Changes replicate asynchronously○ Data may be balancing / moving in the cluster
● Risks:○ Orphaned documents / references○ Holes in data
13
Backups: mongodb_consistent_backup
● Python project by Percona-Lab for consistent backups● URL: https://github.com/Percona-Lab/mongodb_consistent_backup● Best-effort support, not a “Percona Product”● Created to solve limitations in MongoDB backup tools:
○ Replica Set and Sharded Cluster awareness○ Cluster-wide Point-in-time consistency○ In-line Oplog backup (vs post-backup)○ Notifications of success / failure
● Extra Features○ Remote Upload (AWS S3, Google Cloud Storage and Rsync)○ Archiving (Tar or ZBackup deduplication and optional AES-at-rest)○ CentOS/RHEL7 RPMs and Docker-based releases (.deb soon!)
14
Backups: mongodb_consistent_backup
● 1.2.0○ Multi-threaded Rsync Upload○ Replica Set Tags support○ Support for MongoDB SSL / TLS connections and client auth○ Rotation / Expiry of old backups (locally-stored only)
● Future○ Incremental Backups○ Binary-level Backups (Hot Backup, Cold Backup, LVM, Cloud-based, etc)○ More Notification Methods (PagerDuty, Email, etc)○ Restore Helper Tool○ Instrumentation / Metrics○ <YOUR AWESOME IDEA HERE> we take GitHub PRs (and it’s Python)!
Backup Architecture
16
Architecture: Simple Example
● Method○ Run mongodump (with --oplog) using a plain secondary○ Store backups with on-site remote storage (filer, rsync,
etc)● Potential Issues
○ Application Impact■ I/O and CPU impact due to backups may affect application■ Storage-engine and FS caches will become dirty■ Primary Failure
● A failure of the Primary may cause the Secondary backing-up to become Primary
● This can be avoided by using a Read Preference of ‘secondary’ (supported in recent mongodump versions)
○ No Disaster Recovery
17
Architecture: Tag-Based Example
● Replica Set Tags○ Allow selection of MongoDB nodes using key/value pairs○ Represented in JSON/single document○ Many key/value pairs is possible
● Example Backup from “west” Only○ Specify a single node with a tag such as { location: “west” }○ Use Read Preference Tag in mongodump/mongodb_consistent_backup to target a
specific node.
18
Architecture: Offsite Backup Example
● Example○ Create backup within local datacenter○ Upload completed backups to other datacenter, cloud,
etc■ mongodb_consistent_backup supports Amazon S3,
Google Cloud Storage and Rsync for remote upload!● Benefits
○ Fast backup time due to in-datacenter latency● Drawbacks
○ A full backup data uploaded each backup job
19
Architecture: Disaster Recovery Example
● Example○ Place a SECONDARY node in another location
■ Dedicated node is recommended to reduce impact■ hidden:true recommended
○ Run backup from off-site SECONDARY member○ Optionally upload to Cloud Storage
● Benefits○ Only changes (replication) replicated to offsite location○ Potentially faster uploads to Cloud Storage
● Drawbacks○ Bootstrap / Initial Sync may use high bandwidth (if not
seeded by backup)
Restore and Validation“It’s not a backup system, it’s a restore system” ~ Raymond Blum, Google SRE
21
Restoring and Validation
● Methodology○ Optimise restore time, not backup run time
■ Users and business care how fast their data is back, not how long it takes to backup■ Binary-level backups are much faster to restore in MongoDB
● Validation○ This is very application specific○ Random sample restored data and validate
■ Example: Compare to Production● Compare real Production item, user, article, etc to backup● Ensure backup age doesn’t cause false alarms, ie: test data older than backup
■ Example: Integration Test / QA● Run code integration tests or QA on restored data
■ Example: Production Backup as Test Data● Copy Production Data to Test periodically using backups
22
Thank You Sponsors!
23
SAVE THE DATE!
CALL FOR PAPERS OPENING SOON!www.perconalive.com
April 23-25, 2018Santa Clara Convention Center
24
Questions?