RDS for MySQL, No BS Operations and Patterns

RDS for MySQL No BS Operations and Patterns Laine Campbell, CEO PalominoDB

description

Amazon's RDS for MySQL is a wonderful tool with significant value. It can also create a lot of havoc if you are not aware of its limitations and changes before you make it a core part of your environment. In this deck, we discuss those issues.

Transcript of RDS for MySQL, No BS Operations and Patterns

Page 1: RDS for MySQL, No BS Operations and Patterns

RDS for MySQL, No BS Operations and Patterns

Laine Campbell, CEO PalominoDB

Page 2: RDS for MySQL, No BS Operations and Patterns

The Party Line

Relational Database Service
● Fully Managed
● Simple to Deploy
● Easy to Scale
● Reliable
● Cost Effective

Page 3: RDS for MySQL, No BS Operations and Patterns

Fully Managed

Ignore the man behind the curtain
● Backups
● Provisioning
● Patching
● Performance Management
● Failover
● Replication

Page 4: RDS for MySQL, No BS Operations and Patterns

Fully Managed

Backups
Snapshot Based - Same as EBS

Snapshots cause spikes in latency
Avoided in Multi-AZ

Snapshots are taken from the master
Or from the standby in Multi-AZ

● Set up automatic schedules
● Point-in-time recovery via binlogs
● User-executed snapshots

Page 5: RDS for MySQL, No BS Operations and Patterns

RDS Backups

Can I snapshot a replica?
Nope. Back up from your master.

Of course, you can promote a replica, then snapshot it for testbeds.

Page 6: RDS for MySQL, No BS Operations and Patterns

RDS Backups

I like RDS Backups
When using Multi-AZ

AND

When loads are minimal

It's like unicorns are flying my binlogs to heaven

Page 7: RDS for MySQL, No BS Operations and Patterns

Fully Managed

Provisioning

Rapid Master Launches
● Master in a few minutes (or it's free?)
● Standby in a different AZ? Push a button!

Rapid Replica Builds
● Need more replicas? Push a button!

Page 8: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Provisioning your master
Standalone - no failover or redundancy

Multi-AZ - standby in a separate availability zone

Pick your Version

Pick your maintenance window

Page 9: RDS for MySQL, No BS Operations and Patterns
Page 10: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Overview of AZ and Regions
Amazon Regions equate to data centers in different geographical regions. (99.5% SLA, based on more than one AZ being unavailable)

Availability zones are isolated from one another in the same region to minimize impact of failures.

RDS does not interact across regions.

Page 11: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Can multiple AZs save me?
Amazon states AZs do not share:

● Cooling
● Network
● Security
● Generators
● Facilities

Page 12: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Can multiple AZs save me?

Apr, 2011 - US East Region EBS Failed
* Incorrect network failover.
* Saturated intra-node communications.
* Cascading failures impacted EBS in all AZs.

Jul, 2012 - US East Partial Impact
* Electrical storms impacted multiple sites.
* Failover of metadata DB took too long.
* EBS I/O was frozen to minimize corruption.

Page 13: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Can multiple AZs save me?

They can reduce risk.

Cross-AZ latency can vary by as much as 3x. (Too slow to run MySQL Cluster across AZs.)

A Multi-AZ failover can leave you in a degraded performance condition when minimal latency is required.

Page 14: RDS for MySQL, No BS Operations and Patterns

Multi-AZ Failover

From AWS Docs

Page 15: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Multi-AZ Magical Failover
Replicates via unicorn express

Fails over quite often, with up to 30 seconds of downtime

You do not get to choose your failover AZ

Typical I/O write impact for synchronous replication
aka unicorn express

Page 16: RDS for MySQL, No BS Operations and Patterns

Multi-AZ Failover

From AWS Blog

Page 17: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Pick Your Version
MySQL 5.1 or MySQL 5.5

:( No MariaDB :(
:( No XtraDB :(
:( No Drizzle :(
:( No TokuDB :(

Page 18: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Pick Your Maintenance Window
● A 30-minute window in which software patching can occur
● Can be different for different instances
● You need to plan ahead for instances to be out of service.

Page 19: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

They'll shut off my DB????

Page 20: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Auto Minor Version Upgrade
● If you choose no, you will not experience automatic upgrades (and thus downtime).
● Some critical security patches can still be applied.
● The RDS team is fairly good about communicating upgrades.

Page 21: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Basic Instance Types
● Micro - 630 MB RAM, 2 ECU - Low I/O
● Small - 1.7 GB RAM, 1 ECU - Med I/O
● Large - 7.5 GB RAM, 4 ECU - High I/O
● XLarge - 15 GB RAM, 8 ECU - High I/O

Page 22: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Fancy Instance Types

● High Mem XL - 17.1 GB RAM, 6.5 ECU - High I/O
● High Mem 2XL - 34 GB RAM, 13 ECU - High I/O
● High Mem 4XL - 68 GB RAM, 26 ECU - High I/O

Page 23: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Storage Provisioning
● From 5 GB to 3 TB
● At 300 GB, EBS volumes start to get striped.
● Striping = better performance
● Provisioned IOPS (up to 30,000) = more stable I/O and costs more too!

Page 24: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Virtual Private Cloud (VPC)
Allows you to create your own virtual network simulating traditional DC networks.

You must create a DB Subnet Group in VPC

VPC Subnets cannot cross availability zones.

VPC security group allows access control to your DB

Page 25: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Virtual Private Cloud (VPC)
Mixed architectures, with some VPC and some non-VPC, create major issues.

Auto-scaling becomes difficult.

Don't do it!

Page 26: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Database Security Groups

Controls all MySQL access to RDS instances.

Defaults to "deny all"

Access can be granted by IP Range and EC2 sec groups.

Page 27: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Database Security Groups

Don't grant access to 10.x.x.x, use a security group.

IPs entered with CIDR - Classless Inter-Domain Routing

Make sure you understand CIDR! (or you may have unwelcome visitors!)
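
To make the CIDR warning concrete, here is a minimal sketch using Python's standard `ipaddress` module to show how many addresses a given range actually admits. The example ranges are illustrative and not taken from the deck.

```python
import ipaddress

def describe_cidr(cidr):
    """Return (address count, first address, last address) for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses, str(net[0]), str(net[-1])

# Illustrative ranges: a whole private class A, one /24 subnet, one host.
for cidr in ("10.0.0.0/8", "10.1.2.0/24", "10.1.2.3/32"):
    count, first, last = describe_cidr(cidr)
    print(f"{cidr}: {count} addresses, {first} to {last}")
```

A /8 covers more than 16 million addresses, which is why granting access to all of 10.x.x.x is far broader than granting it to a single EC2 security group.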

Page 28: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Parameter Groups
Defines parameters used by your RDS instances.

There is a "default" group that you can modify.

One or more RDS instances can map to an individual parameter group.

Page 29: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Parameter Group Best Practices
Don't ever use the default group.

The default group doesn't allow dynamic parameter changes. Everything requires a restart.

Build different groups for each MySQL master/replica grouping.

Page 30: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Parameter Group Best Practices
Use different parameter groups for masters vs. replicas.

Consider using different parameter groups for different replica types (app query, ad hoc, ETL)

Remember to use test environments. Test!!!

Page 31: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Why different parameter groups?
Granularity - Do you want to apply the same parameter to everything in the cluster?

● Read Only?
● Slow Logging?
● innodb_flush_method
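
As a rough illustration of the granularity point, the sketch below compares two parameter sets and reports what differs between them. The values are hypothetical, chosen only to mirror the three examples on the slide. They are not recommendations, and the sketch does not check whether a given parameter is modifiable on RDS.

```python
def parameter_differences(master, replica):
    """Return {name: (master_value, replica_value)} for parameters that differ."""
    names = sorted(set(master) | set(replica))
    return {
        name: (master.get(name), replica.get(name))
        for name in names
        if master.get(name) != replica.get(name)
    }

# Hypothetical values, for illustration only.
master_params = {"read_only": "0", "slow_query_log": "0", "innodb_flush_method": "O_DIRECT"}
replica_params = {"read_only": "1", "slow_query_log": "1", "innodb_flush_method": "O_DIRECT"}

for name, (m, r) in parameter_differences(master_params, replica_params).items():
    print(f"{name}: master={m} replica={r}")
```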

Page 32: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Page 33: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Provisioning your Replicas
Does not have to be the same instance type as the master.

Pick your availability zone (great for mapping replicas to app servers in the same AZ.)

Don't forget to apply a different parameter group than your master.

Page 34: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Provisioning your Replicas
Adding a replica impacts your master performance. (If not in Multi-AZ)

You can only launch in serial - and it can take a non-trivial amount of time to launch.

Adding many replicas can take a while. Script it!
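
One way to script the serial build is sketched below. It assumes `rds` is a boto3 RDS client (the deck does not name a tool) and uses that client's `create_db_instance_read_replica` call and `db_instance_available` waiter. The identifiers and instance class in the usage comment are hypothetical.

```python
def build_replicas(rds, source_id, replica_ids, instance_class):
    """Create read replicas one at a time, waiting for each to become available.

    `rds` is assumed to be a boto3 RDS client, e.g. boto3.client("rds").
    """
    waiter = rds.get_waiter("db_instance_available")
    for replica_id in replica_ids:
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=replica_id,
            SourceDBInstanceIdentifier=source_id,
            DBInstanceClass=instance_class,
        )
        # Block until this replica is up before starting the next one.
        waiter.wait(DBInstanceIdentifier=replica_id)
    return list(replica_ids)

# Hypothetical usage:
#   import boto3
#   build_replicas(boto3.client("rds"), "prod-master",
#                  ["prod-replica-1", "prod-replica-2"], "db.m1.large")
```

Passing the client in as an argument keeps the loop easy to dry-run against a stub before pointing it at a real account.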

Page 35: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

What can I do with my replica?
Send queries to it

Promote it to a master

Poke it with a stick

Use it for special purposes (mysqldump, ETL, ad hoc)

Page 36: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Sending queries to the replica?
Set up Route53 CNAMEs - weighted round robin.

Internal elastic load balancer in the VPC.

VPC/Route53 does not do a mysql health check.

HAProxy can be leveraged.
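
Since neither Route53 nor the VPC load balancer checks MySQL itself, a health check has to come from somewhere else. The sketch below shows the kind of decision such a check might make, given one row of `SHOW SLAVE STATUS` output as a dict. The 30-second lag threshold is an arbitrary assumption, and wiring the result into HAProxy is left out.

```python
def replica_is_healthy(slave_status, max_lag_seconds=30):
    """Decide whether a replica should receive reads.

    `slave_status` is one row of SHOW SLAVE STATUS as a dict.
    `max_lag_seconds` is an assumed threshold; tune it for your application.
    """
    if slave_status.get("Slave_IO_Running") != "Yes":
        return False
    if slave_status.get("Slave_SQL_Running") != "Yes":
        return False
    lag = slave_status.get("Seconds_Behind_Master")
    # Seconds_Behind_Master is NULL (None) when replication is not running.
    return lag is not None and int(lag) <= max_lag_seconds
```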

Page 37: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Replica Master Promotion
This is a great way to build a test environment.

Can be leveraged for rolling migrations

But a replica can't have a replica! Must promote first!

Page 38: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

Replica promotion for failover
This can be used instead of Multi-AZ. Why?

When using sync_binlog=0, a master failover in Multi-AZ may strand your replicas.

The old log doesn't close correctly. The replica cannot proceed. And you can't move to the next log!

Page 39: RDS for MySQL, No BS Operations and Patterns

RDS Provisioning

All of my replicas must be rebuilt!

Page 40: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

What does an RDS DBA do?

Page 41: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

What does an RDS DBA do?

Need a replica?
Push a button or call an API.

Need to create a test environment?
Promote a replica, call an API.

New Cluster?
Push a button or call an API.

Page 42: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

What does an RDS DBA do?

Need a backup?
Push a button or call an API.

Need to recover a database?
Push a button or call an API.

New Cluster?
Push a button or call an API.

Page 43: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

Need to do a query review?

You don't have access to the logs at the filesystem level.

You can look in the console or via API for some initial diagnostics.

Page 44: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

Query Reviews
Need to do a REAL query review?

Log to the CSV table, mysql.slow_log, then export it in slow-log file format and digest it:

    mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT(
        '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H:%i:%s'), '\n',
        '# User@Host: ', user_host, '\n',
        '# Query_time: ', TIME_TO_SEC(query_time),
        ' Lock_time: ', TIME_TO_SEC(lock_time),
        ' Rows_sent: ', rows_sent,
        ' Rows_examined: ', rows_examined, '\n',
        sql_text, ';'
    ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log

    pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt
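
If the rows are fetched through a client library rather than the mysql command line, the same formatting can be done in application code. The sketch below is a Python equivalent of the CONCAT expression, assuming each row arrives as a dict with `start_time` as a `datetime` and the two durations as `timedelta` values. That row shape is an assumption about the driver, not something the deck specifies.

```python
from datetime import datetime, timedelta

def format_slow_log_entry(row):
    """Render one mysql.slow_log row (a dict) in slow-log file format."""
    ts = row["start_time"].strftime("%y%m%d %H:%M:%S")
    # Whole seconds only, matching TIME_TO_SEC in the SQL version.
    query_secs = int(row["query_time"].total_seconds())
    lock_secs = int(row["lock_time"].total_seconds())
    return (
        f"# Time: {ts}\n"
        f"# User@Host: {row['user_host']}\n"
        f"# Query_time: {query_secs} Lock_time: {lock_secs}"
        f" Rows_sent: {row['rows_sent']} Rows_examined: {row['rows_examined']}\n"
        f"{row['sql_text']};\n"
    )

# Hypothetical row, for illustration only.
sample = {
    "start_time": datetime(2013, 4, 22, 10, 15, 30),
    "user_host": "app[app] @ [10.1.2.3]",
    "query_time": timedelta(seconds=3),
    "lock_time": timedelta(seconds=0),
    "rows_sent": 10,
    "rows_examined": 250000,
    "sql_text": "SELECT * FROM orders WHERE status = 'open'",
}
print(format_slow_log_entry(sample), end="")
```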

Page 45: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

Query Reviews
No Microsecond Patch

● Using long_query_time=0 logs all queries
● But their query times are recorded as 0
● You have no accurate profile of query time for queries under 1 sec.

You also can't use tcpdump on the MySQL instance. We often use tcpdump when logging everything would drop performance on the DB instance to unacceptable levels.

WHICH IT CAN

Page 46: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

Need to rotate logs?

call mysql.rds_rotate_slow_log;

call mysql.rds_rotate_general_log;

Page 47: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

Need to kill a process?

call mysql.rds_kill_query (99);

kills the current query for this thread.

call mysql.rds_kill (99);

kills the thread.
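
A small sketch of how these procedures might be driven from a script: given rows from `information_schema.PROCESSLIST` as dicts, build the `CALL mysql.rds_kill_query(...)` statements for queries that have run too long. The time threshold and the list of accounts to leave alone are assumptions for illustration.

```python
def kill_statements(processlist, max_seconds, protected_users=("rdsadmin", "system user")):
    """Build rds_kill_query calls for queries running longer than max_seconds.

    `processlist` is a list of dicts with the PROCESSLIST columns
    ID, USER, COMMAND and TIME. `protected_users` is an assumed skip list.
    """
    statements = []
    for proc in processlist:
        if proc["COMMAND"] != "Query":
            continue  # sleeping or replication threads are not running a query
        if proc["USER"] in protected_users:
            continue
        if proc["TIME"] > max_seconds:
            statements.append(f"CALL mysql.rds_kill_query({proc['ID']});")
    return statements
```

Generating the statements rather than executing them directly leaves room for a human to review the list first.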

Page 48: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

Managing Replication

Need to stop replication? Break it yourself!

call mysql.rds_skip_repl_error;

Skips the current replication error.

Page 49: RDS for MySQL, No BS Operations and Patterns

A Day in the Life

Reviewing Status Trends

Global Status History

A scheduled event snapshots global status into mysql.rds_global_status_history.

You can trend this into many tools.
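
Most status variables are ever-increasing counters, so trending them usually means turning successive snapshots into rates. A minimal sketch, assuming the snapshots have already been read into `(unix_timestamp, value)` pairs for one variable (the query that fetches them is not shown):

```python
def counter_rates(samples):
    """Convert counter snapshots into per-second rates.

    `samples` is a time-ordered list of (unix_timestamp, counter_value).
    Returns a list of (timestamp, rate) for each usable interval.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0 or v1 < v0:
            # Skip bad intervals and counter resets (e.g. after a restart).
            continue
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates
```

The resulting pairs can be pushed to whichever trending tool is already in use.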

Page 50: RDS for MySQL, No BS Operations and Patterns

Monitoring MySQL

Cloudwatch
● CPUUtilization
● Database Connections
● FreeStorageSpace
● Network In/Out
● Read/Write IOPs
● Read/Write Bytes
● Read/Write Latency

Page 51: RDS for MySQL, No BS Operations and Patterns

Monitoring MySQL

Where are the MySQL Metrics?

Cloudwatch doesn't expose them.

You can use: Cacti, Graphite, Zabbix, etc... for trending.

Page 52: RDS for MySQL, No BS Operations and Patterns

Monitoring MySQL

Can I alert on cloudwatch metrics?

Cloudwatch allows you to set up your alerts.

But you probably want all metrics and alerts in the same system, don't you?

Page 53: RDS for MySQL, No BS Operations and Patterns

Monitoring MySQL

Also cloudwatch is unreliable

It often doesn't poll at every interval.

Can miss/skip important events.
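
One way to see how often intervals are skipped is to scan the datapoint timestamps for gaps wider than the expected period. The sketch below assumes the timestamps have already been pulled out of the monitoring system as epoch seconds. How they are fetched is left open.

```python
def find_gaps(timestamps, period_seconds):
    """Return (previous, next) timestamp pairs that are further apart than one period."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > period_seconds:
            gaps.append((prev, cur))
    return gaps
```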

Page 54: RDS for MySQL, No BS Operations and Patterns

Monitoring MySQL

What can I use?

Nagios can poll mysql directly

Poll from graphite

Page 55: RDS for MySQL, No BS Operations and Patterns

Some things that suck

Moving data in and out

Want to do a dump and load upgrade?

Want to migrate to a new region?

Want to do multi-layer replication?

Page 56: RDS for MySQL, No BS Operations and Patterns

Some things that suck

Migrations/Upgrades out of RDS

● Take a replica out of service.
● Dump your data.
● Upgrade your binaries.
● Load your data.
● Give replicas to your replica.
● Failover reads, then writes.

MINIMAL DOWNTIME

Page 57: RDS for MySQL, No BS Operations and Patterns

Some things that suck

Migrations/Upgrades in RDS

Page 58: RDS for MySQL, No BS Operations and Patterns

Some things that suck

Migrations/Upgrades in RDS

● Dump a bunch of tables.
● Load deltas via tons of scripting.
● Keep the deltas on each table minimal.
● Take a few hours downtime.
● Sync the delta.
● Test.
● Go live and drink a lot.

Page 59: RDS for MySQL, No BS Operations and Patterns

Some things that suck

This also applies to:

Moving data between regions.

Migration to EC2 from RDS.

Migrating to a datacenter from AWS

Page 60: RDS for MySQL, No BS Operations and Patterns

Patterns for RDS

Prototyping and Testing:

Rapid build and destroy.

Short lifecycles.

Quick testing lifecycles.

Page 61: RDS for MySQL, No BS Operations and Patterns

Patterns for RDS

Moderate Uptime SLAs:

Region-level SLA is 99.5% across two AZs (43.8 hours of downtime per year)

Add in failover times for a Multi-AZ master (6 more hours)

Expect around 4 days of downtime without multi-region
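
The 43.8-hour figure follows directly from the SLA percentage, as the short calculation below shows. It uses only the two numbers stated on the slide. Their sum comes to roughly two days, so the slide's "around 4 days" is best read as a more conservative planning figure. That reading is an assumption, since the deck does not itemize the remainder.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def allowed_downtime_hours(sla_percent):
    """Hours per year a service may be down while still meeting its SLA."""
    return HOURS_PER_YEAR * (100 - sla_percent) / 100

sla_hours = allowed_downtime_hours(99.5)   # region-level SLA from the slide
failover_hours = 6                         # Multi-AZ failover allowance from the slide
total_hours = sla_hours + failover_hours

print(f"SLA downtime:  {sla_hours:.1f} hours")
print(f"With failover: {total_hours:.1f} hours ({total_hours / 24:.1f} days)")
```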

Page 62: RDS for MySQL, No BS Operations and Patterns

Patterns for RDS

That doesn't include:

Downtime from bad queries

Downtime from user error

Downtime from upgrades/migrations

Page 63: RDS for MySQL, No BS Operations and Patterns

Patterns for RDS

Relaxed Latency Requirements:

Multi-AZ can introduce cross-AZ latency without AZ-specific architectural design.

EBS storage can introduce unpredictable latency without P-IOPS

Snapshots of master, replica builds and Multi-AZ failovers can impact write latency.

Page 64: RDS for MySQL, No BS Operations and Patterns

Patterns for RDS

Relaxed Latency Requirements:

If you use write-through cache, this can be mitigated

If you use significant caching, this can be mitigated

If you use AZ aware design, this can be mitigated

Page 65: RDS for MySQL, No BS Operations and Patterns

Patterns for RDS

Dataset Specifics:

Small datasets can allow for rapid region migrations

Read only datasets can also allow for this

Data you don't mind losing can also allow for this

Page 66: RDS for MySQL, No BS Operations and Patterns

Patterns for RDS

No DBA(s):

You still need DBAs to design, tune and configure.

But RDS does reduce some DBA overhead.

With investment in automation, this overhead is not significant.

Still, automation requires money/hours. If you have no budget, RDS is a good way to start.

Page 67: RDS for MySQL, No BS Operations and Patterns

War Stories

Obama for America:
US-East Region

Multi-AZ

5 Clusters, 30 Instances

Provisioned IOPs, 1 TB Storage

Page 68: RDS for MySQL, No BS Operations and Patterns

Obama for America

Data Growth:
Opsview had no visibility into the OS, and thus we were surprised regularly by storage growth. Had to build custom plugins.

Upgrading storage or instance size in Multi-AZ can cause an unpredictable downtime window.

Downtime is small, but the whole process can take 30 minutes and you don't know when the REAL downtime will occur.
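
A custom plugin of the kind mentioned above could be as simple as projecting when free storage runs out from two readings. The sketch below assumes free-space samples as `(day_number, free_gb)` pairs, for example taken from the FreeStorageSpace metric, and uses a straight line between the first and last sample. It is an illustration, not the plugin the team actually built.

```python
def days_until_full(samples):
    """Estimate days of storage left from (day_number, free_gb) samples.

    Uses the average consumption between the first and last sample.
    Returns None if free space is flat or growing.
    """
    (d0, f0), (d1, f1) = samples[0], samples[-1]
    if d1 <= d0:
        raise ValueError("samples must span a positive time range")
    burn_per_day = (f0 - f1) / (d1 - d0)  # GB consumed per day
    if burn_per_day <= 0:
        return None
    return f1 / burn_per_day
```

Alerting on the projected days remaining, rather than on a fixed free-space threshold, gives time to schedule the storage upgrade and its downtime window.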

Page 69: RDS for MySQL, No BS Operations and Patterns

Obama for America

Hurricane Sandy:
Hurricane Sandy was poised to strike Virginia and US East.

Luckily we had built out EC2 and data migration scripts.

Took 3 days solid for the whole team to build out US-West region.

Page 70: RDS for MySQL, No BS Operations and Patterns

Obama for America

Human Error:
While doing rolling DDL, sql_log_bin was disabled at the global level on the master. (Damn you 5.5!!!!)

No access to binlogs made troubleshooting very challenging.

An hour of troubleshooting because we blamed the disk and had no visibility.

Had to rebuild all replicas in serial overnight once

Page 71: RDS for MySQL, No BS Operations and Patterns

Obama for America

Migration to P-IOPs:

Things that make you go hmmm....

Page 72: RDS for MySQL, No BS Operations and Patterns

War Stories

Call of Duty, Black Ops 2:
5 Clusters, 25 instances.

US East

Multi-AZ

Provisioned IOPs

Page 73: RDS for MySQL, No BS Operations and Patterns

CoD Black Ops 2

Hurricane Sandy:
Data migration scripts not set up for continuous replication.

Had to draw a line in the sand on when to move data.

Any additional data would be lost, if cutover occurred.

Page 74: RDS for MySQL, No BS Operations and Patterns

CoD Black Ops 2

Multi-AZ Failover:
Writes required sync_binlog=0

Master failed over to standby.

All replicas stopped replicating.

DBA couldn't “change master”

Read load swarmed the master while we rebuilt.

Page 75: RDS for MySQL, No BS Operations and Patterns

CoD Black Ops 2

Provisioned IOPs:
Came out, super exciting!

Let's migrate!

Oh, no push button migration.

2 Senior DBAs, 3 weeks to build migration scripts and test/migrate.

Page 76: RDS for MySQL, No BS Operations and Patterns

Q&A

Laine Campbell, CEO PalominoDB

http://www.slideshare.net/lainecampbell