Upgrade to MySQL 5.6 without downtime

Olivier Dasini - @freshdaz

Upgrade to MySQL 5.6 without downtime

Meetup LeMug.fr @Dailymotion - Paris - Sept 17, 2015

Agenda

● Me, Myself & I
● Technical background
● Why upgrade to 5.6?
● Performance testing
● Preprod upgrade
● Production upgrade
● Wrap-up

Me, Myself & I

Olivier DASINI - @freshdaz

● MySQL geek & data enthusiast
● Technical writer, blogger and speaker
● Insatiable hunger for learning
● Co-creator of the French MySQL User Group

Technical background 1/3

MySQL users can be split into 3 types by the order of magnitude of their working set:

● <= Tens of GBs : 20%
○ MySQL usage probably not (so) critical
○ Migration (quite) easy, could be manual
● <= Tens of TBs : 75%
○ MySQL is critical => strong production constraints
○ Migration should be carefully planned
○ Automation is needed, although some parts can stay manual
● >= Hundreds of TBs : 5%
○ MySQL is highly critical: think twice (or more) before upgrading
○ Same as above, with automation everywhere

Technical background 2/3

The company :

● Software development

● Provides a cloud-based customer service platform

○ ~ 1,000 people

○ ~ 60,000 paid customers in 150 countries

Technical background 3/3

MySQL flavour : Percona Server 5.5 on Fusion IO

Data size : ~ 30 TB | Daily growth rate : up to 40 GB

# MySQL group of replicas (1 Master / n Slaves) : ~ 50

# MySQL instances : ~ 200

Mostly OLTP oriented workload - InnoDB tables

Thousands of qps, mostly reads (SELECTs)

Replication lag sensitive

No downtime allowed!!!

Why upgrade to 5.6? 1/3

Tons of cool new stuff :

● Security improvements
● InnoDB enhancements
● Partitioning
● Performance Schema
● Replication and logging
● Optimizer enhancements
● …

Complete list : http://dev.mysql.com/doc/refman/5.6/en/mysql-nutshell.html

Why upgrade to 5.6? 2/3

Choose which features we'd like to have.

Team brainstorming...

● Define which added features suit us
○ Schedule when we'll use them
○ Avoid too many changes at one time
● Pay attention to deprecated features
○ They'll probably be removed in a future version
○ Shouldn't be used anymore
● Pay extra attention to removed features
○ They'll break your server

Why upgrade to 5.6? 3/3

Team brainstorming result :

● InnoDB enhancements
○ Persistent stats
○ Online DDL
○ New flushing algorithm
○ New checksum algorithm
● Performance Schema
● Replication
○ Smaller row image for row-based replication
○ Crash-safe master ⇔ crash-safe binlog
○ Crash-safe slave ⇔ table logging for master / slave info
○ GTID (for automatic switchover/failover) : [Phase 2]
○ Parallel replication : [Phase 3]
● Optimizer enhancements...

Upgrade Confidence Index : 60%

Performance testing 1/13

5.6 upgrade will be awesome

(at least in theory)

Many articles prove it, yeah!

http://dimitrik.free.fr/blog/archives/2013/02/mysql-performance-mysql-56-vs-mysql-55-vs-mariadb-55.html
https://blogs.oracle.com/MySQL/entry/mysql_5_6_is_a

Benchmarks never lie :)… but is their truth ours?

In real life, performance depends on many factors such as workload, hardware, configuration, …

What about us?

Performance testing 2/13

● The plan is to get our own numbers
● Compare 5.5 and 5.6 performance in a production context
● Unfortunately we have customers !!! :)
● So: out of production, but in a context as similar as possible
○ Data
○ Queries
○ Workload
○ Hardware
○ Configuration...

=> Ad-hoc 5.6 upgrade on 1 server

Performance testing 3/13

Build a 5.6 test server from a 5.5 slave.

Choose a "small" cluster (1.5 TB)

An ad-hoc upgrade is quite straightforward:

Clone a 5.5 server -> Upgrade to 5.6 -> Set up replication

Steps
● Take a binary backup (Xtrabackup) from db5.5 (5.5 instance)
● Restore the binary backup on the new server (5.6 candidate, but still running 5.5)
● Upgrade the binaries to 5.6 + new configuration (5.6 my.cnf)
● Run mysql_upgrade
● Start replication (the master is still on 5.5)

Performance testing 4/13

Issue : Fatal replication error 1/2

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event 'db_master_5.5-bin-log.003440' at 974453835, the last event read from '/var/log/mysql/db_master_5.5-bin-log.003440' at 974453835, the last byte read from '/var/log/mysql/db_master_5.5-bin-log.003440' at 974453854.'

On the master binary log:

ERROR: Error in Log_event::read_log_event(): 'Event too big', data_len: 1852797793, event_type: 104

Could not read entry at offset 974453835: Error in log format or read error.

#150318 18:09:39 server id 174326798 end_log_pos 107 Start: binlog v 4, server v 5.5.32-31.0-log created 150318 18:09:39
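When triaging this kind of error, it can help to pull the binlog coordinates out of Last_IO_Error and to compare the data_len reported by mysqlbinlog with the ceiling of max_allowed_packet. The sketch below is illustrative only: the helper names are hypothetical, and the 1 GB ceiling is stated as an assumption about 5.5/5.6.

```python
import re

# Assumption: 1 GB is the largest value max_allowed_packet accepts in 5.5/5.6.
MAX_ALLOWED_PACKET_CEILING = 1024 ** 3

_FIRST_EVENT = re.compile(r"the first event '(?P<file>[^']+)' at (?P<pos>\d+)")


def first_event_coordinates(last_io_error):
    """Return (binlog_file, position), or None if the message has another shape."""
    match = _FIRST_EVENT.search(last_io_error)
    if match is None:
        return None
    return match.group("file"), int(match.group("pos"))


def can_fit_in_a_packet(data_len):
    """True if an event of data_len bytes could fit under the packet ceiling."""
    return data_len <= MAX_ALLOWED_PACKET_CEILING


message = (
    "Got fatal error 1236 from master when reading data from binary log: "
    "'log event entry exceeded max_allowed_packet; Increase max_allowed_packet "
    "on master; the first event 'db_master_5.5-bin-log.003440' at 974453835, "
    "the last event read from '/var/log/mysql/db_master_5.5-bin-log.003440' "
    "at 974453835, the last byte read from "
    "'/var/log/mysql/db_master_5.5-bin-log.003440' at 974453854.'"
)
print(first_event_coordinates(message))
print(can_fit_in_a_packet(1852797793))
```

Under that assumption, the reported data_len could not fit in any packet size the server accepts, so raising max_allowed_packet alone would not be expected to help.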

Performance testing 5/13

Issue : Fatal replication error 2/2

● We never found an explanation.
● We tried to increase max_allowed_packet dynamically on both the master and the 5.6 slave… with no effect.
● Only the 5.6 slave was impacted, i.e. no issues on the 5.5 slaves
● No fix except ignoring this binlog, i.e. switching to the next one.
○ Meaning a risk of losing events…
○ And a high risk of inconsistency

So we dropped the data and reloaded a fresh 5.5 dump + mysql_upgrade.

Performance testing 6/13

The goal is to compare performance between 5.5 & 5.6

5.6 status :

○ Replicating data like any other 5.5 slave
○ Contains production data
○ Same hardware characteristics

Ready to start our benchmarks \o/

Performance testing 7/13

Tool

pt-upgrade : https://www.percona.com/doc/percona-toolkit/2.2/pt-upgrade.html

pt-upgrade executes queries in the given MySQL LOGS on each DSN, compares the results, and reports any significant differences. The tool can also save the results for later analyses. LOGS can be slow, general, binary, tcpdump and raw.

Best practices

● Split your (slow) logs into small chunks : 200 ~ 500 MB of data
○ Easier to manage
○ Output easier to analyse
● Choose your data samples carefully
○ Capture queries at different times
○ Reduce the risk of missing important queries
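Splitting a slow log into chunks only works if the cuts fall between entries, never in the middle of a query. Here is a minimal sketch of one way to do that, assuming each entry starts with a "# Time:" line or, when that line is absent, with a "# User@Host:" line. The function names are hypothetical and this is not how the talk's team did it.

```python
def iter_entries(lines):
    """Yield slow-log entries (lists of lines), cutting at entry headers only."""
    entry, previous = [], ""
    for line in lines:
        new_entry = line.startswith("# Time:") or (
            line.startswith("# User@Host:") and not previous.startswith("# Time:")
        )
        if new_entry and entry:
            yield entry
            entry = []
        entry.append(line)
        previous = line
    if entry:
        yield entry


def pack_entries(entries, max_bytes):
    """Group entries into chunks of at most max_bytes.

    An entry bigger than max_bytes is kept whole in a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for entry in entries:
        entry_size = sum(len(line.encode("utf-8")) for line in entry)
        if current and size + entry_size > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.extend(entry)
        size += entry_size
    if current:
        chunks.append(current)
    return chunks


sample = [
    "# Time: 150318 18:09:39\n",
    "# User@Host: app[app] @ web1 []\n",
    "# Query_time: 0.000120  Lock_time: 0.000040 Rows_sent: 1  Rows_examined: 1\n",
    "SELECT 1;\n",
    "# User@Host: app[app] @ web2 []\n",
    "# Query_time: 0.090000  Lock_time: 0.000050 Rows_sent: 3  Rows_examined: 30\n",
    "SELECT 2;\n",
]
print(len(pack_entries(iter_entries(sample), max_bytes=1)))
```

With real logs, max_bytes would be somewhere in the 200 ~ 500 MB range mentioned above. Any file header lines before the first entry end up attached to the first chunk.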

Performance testing 8/13

Phase 1 - Collect Slow Logs

For each collection :

● Connect to a 5.5 slave in production
● Set long_query_time to 0
○ mysql> SET GLOBAL long_query_time = 0;
● Clean the slow log
○ $ cp /dev/null /var/log/mysql/slow-log
● Wait for X mins or watch the slow log grow to ~300 MB (whichever comes first)
● Set long_query_time back to its default value
○ mysql> SET GLOBAL long_query_time = <DEFAULT_VALUE>;
● Copy the slow log under a dated name
○ $ cp /var/log/mysql/slow-log ./slow-log-$(date +"%F-%H-%M-%S")
● Clean the slow log
○ $ cp /dev/null /var/log/mysql/slow-log
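If the collection is scripted, the two rules worth encoding are the stop condition ("whichever comes first") and the dated file name. A small sketch follows, with hypothetical function names and a 300 MB default taken from the step above.

```python
from datetime import datetime


def should_stop(elapsed_s, log_bytes, max_s, max_bytes=300 * 1024 ** 2):
    """Stop collecting when either limit is reached, whichever comes first."""
    return elapsed_s >= max_s or log_bytes >= max_bytes


def dated_copy_name(now):
    """Same naming as: slow-log-$(date +"%F-%H-%M-%S")."""
    return now.strftime("slow-log-%Y-%m-%d-%H-%M-%S")


print(dated_copy_name(datetime(2015, 9, 17, 18, 9, 39)))
```

The value of max_s stands for the "X mins" of the procedure and is left to the caller.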

Performance testing 9/13

Phase 2 - Benchmarks (cold & warm buffers) and Compare 1/2

1. Ensure both slaves - 5.5 & 5.6 - have no replication lag
2. Stop replication on db_5.5:
a. mysql_5.5> STOP SLAVE;
3. Wait for a few seconds...
4. Stop replication on db_5.6:
a. mysql_5.6> STOP SLAVE;
5. Note down the master log file and position from step 4.
6. Both slaves should be in perfect sync. Bring db_5.5 up to db_5.6's master log file and position, so that when pt-upgrade is run both servers return the same result sets and the same number of rows:
a. mysql_5.5> START SLAVE SQL_THREAD UNTIL MASTER_LOG_FILE = '<log_file>', MASTER_LOG_POS = <log_position>;
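The alignment in step 6 comes down to comparing two (file, position) pairs and, if db_5.5 is behind, building the UNTIL statement. A minimal sketch, assuming the coordinates were read from Relay_Master_Log_File and Exec_Master_Log_Pos in SHOW SLAVE STATUS; the function names and sample values are hypothetical.

```python
def binlog_sequence(log_file):
    """Numeric suffix of a binlog file name, e.g. 'bin-log.003440' -> 3440."""
    return int(log_file.rsplit(".", 1)[1])


def is_behind(current, target):
    """True if the (file, pos) pair `current` is strictly before `target`."""
    current_key = (binlog_sequence(current[0]), current[1])
    target_key = (binlog_sequence(target[0]), target[1])
    return current_key < target_key


def start_until_statement(target):
    """Build the statement from step 6 for the given (file, pos) target."""
    log_file, log_pos = target
    return (
        f"START SLAVE SQL_THREAD UNTIL MASTER_LOG_FILE = '{log_file}', "
        f"MASTER_LOG_POS = {log_pos};"
    )


# Made-up coordinates for illustration.
db_55 = ("db_master_5.5-bin-log.003440", 1000)
db_56 = ("db_master_5.5-bin-log.003440", 2500)
if is_behind(db_55, db_56):
    print(start_until_statement(db_56))
```

Since only the SQL thread is started, this relies on db_5.5's relay log already holding the events up to the target position.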

Performance testing 10/13

Phase 2 - Benchmarks (cold & warm buffers) and Compare 2/2

7. Run pt-upgrade on db_5.5 (reference results)
a. Cold bench (after a mysql restart)
b. Warm bench (after the first run)
8. Run pt-upgrade on db_5.6
a. Cold bench (after a mysql restart)
b. Warm bench (after the first run)
9. Put db_5.5 back in production
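One possible way to script steps 7 and 8 is the two-pass form shown in the pt-upgrade 2.2 synopsis: save the reference results first, then compare a second server against them. The sketch below only builds the command lines. The DSNs, directory and file names are hypothetical, and the exact options the team used are not given in the talk.

```python
def reference_run(dsn, slow_log, results_dir):
    """Replay slow_log on the reference server and save its results."""
    return ["pt-upgrade", dsn, "--save-results", results_dir, slow_log]


def compare_run(results_dir, dsn):
    """Replay the saved queries on another server and compare the results."""
    return ["pt-upgrade", results_dir, dsn]


# Hypothetical hosts and paths.
print(" ".join(reference_run("h=db_5.5", "slow-log-chunk-01", "results_5.5/")))
print(" ".join(compare_run("results_5.5/", "h=db_5.6")))
```

A cold bench would run each command right after a mysql restart, and a warm bench would simply repeat it.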

Performance testing 11/13

Our tests were interesting

Query response time was usually equal or better in 5.6

However we found 1 big query regression

● Query time: From (0.09 sec) to (16 min 40.35 sec)
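To put the regression in perspective, the two timings can be turned into a slowdown factor. This is plain arithmetic on the numbers above:

```python
before_s = 0.09                # 5.5 timing, in seconds
after_s = 16 * 60 + 40.35      # 16 min 40.35 sec, in seconds
slowdown = after_s / before_s

print(f"{after_s:.2f} s vs {before_s:.2f} s: about {slowdown:,.0f}x slower")
```

That is roughly four orders of magnitude for a single query.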

Upgrade Confidence Index : 75%

Performance testing 12/13

Issue : Query regression

● Basically, the optimizer was choosing the wrong index.
● Bug opened with MySQL (by Percona)

Possible fixes :

● Disable the index extensions algorithm (pre 5.6.9 behavior)
○ SET optimizer_switch="use_index_extensions=off";
● Use an index hint: IGNORE / FORCE INDEX
○ … IGNORE INDEX (bad_index) … || … FORCE INDEX (good_index) …
● Use the NULL-safe equal operator, i.e. replace "IS NULL" with "<=> NULL"
○ … column_id <=> NULL …
● Rewrite the query
○ The most sustainable choice
○ Many possibilities… worked out with the appropriate dev team

Performance testing 13/13

As soon as the query was fixed and tested, we put the 5.6 server in production.

● The 5.6 server serves traffic like the other 5.5 slaves
● Monitored closely for weeks
● Slow query log analysis showed good numbers
○ Fewer slow queries
○ Smaller total slow query time
● Lower CPU usage
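The two slow-log metrics above (number of slow queries, total slow query time) can be computed from the "# Query_time:" lines of a slow log. A minimal sketch with a hypothetical function name; the talk does not say which tool produced its numbers.

```python
import re

_QUERY_TIME = re.compile(r"^# Query_time: (?P<seconds>\d+(?:\.\d+)?)")


def summarize_slow_log(lines, threshold_s=0.0):
    """Return (query_count, total_seconds) for queries at or above threshold_s."""
    count, total = 0, 0.0
    for line in lines:
        match = _QUERY_TIME.match(line)
        if match is None:
            continue
        seconds = float(match.group("seconds"))
        if seconds >= threshold_s:
            count += 1
            total += seconds
    return count, total


# Made-up sample lines for illustration.
sample_log = [
    "# Query_time: 0.500000  Lock_time: 0.000040 Rows_sent: 1  Rows_examined: 1\n",
    "SELECT 1;\n",
    "# Query_time: 2.000000  Lock_time: 0.000050 Rows_sent: 3  Rows_examined: 30\n",
    "SELECT 2;\n",
    "# Query_time: 0.250000  Lock_time: 0.000010 Rows_sent: 0  Rows_examined: 0\n",
    "SELECT 3;\n",
]
print(summarize_slow_log(sample_log, threshold_s=1.0))
```

Running the same summary on logs from a 5.5 slave and from the 5.6 slave over comparable periods gives the before/after comparison.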

So far so good…

Upgrade Confidence Index : 90%

Preprod upgrade 1/9

● Workload different from production : smaller
● Data size different from production : tinier
● Hardware also different

=> Not relevant for performance tests

But it is very important to :

● Test the upgrade process
○ Can't do it manually
○ Should be transparent for our customers
● Know how our internal tools / other apps will behave with 5.6
○ Databases are used in so many different ways
○ Can't test them all, so if something breaks someone will shout!
● Make other MySQL consumers aware of this migration
○ We need their feedback

This step is also very important because an entire cluster downgrade (back to 5.5) is a painful operation

Preprod upgrade 2/9

Preprod technical context

Flavour : Percona Server 5.5 on VMs

Data size : ~ GBs

# MySQL group of replicas : 4

# MySQL instances : 12

Mostly OLTP oriented workload - InnoDB tables

Hundreds of qps, mostly reads (SELECTs)

Replication lag sensitive - Preferably no downtime

Preprod upgrade 3/9

Overall process - Upgrade the 1st slave

● Put one slave per cluster out of rotation (OOR)
● Upgrade the slave ⇔ [more details later]
● Put it back in rotation (as a replica)
● Checks / Tests / Monitor
● Back up the slave (binary backup w/ Xtrabackup)
○ Base backup for the other slaves

Similar to what we'll use in production (obvious!)

Preprod upgrade 4/9

Overall process - Upgrade the 2nd (other) slave(s)

● Put the 5.5 slave out of rotation
● Drop the data
● Upgrade the binaries
● Restore the 5.6 binary backup on this slave.
● Put it back in rotation
● Checks / Tests / Monitor

● So far, a downgrade is still quite easy:
○ Binary backup from the master, restored to the slave after a binaries downgrade

Preprod upgrade 5/9

Overall process - Upgrade the master

Last step, easy but very sensitive

● Switch master failover
○ Promote a 5.6 slave to become the new master
○ Usually less than 1 second in read-only mode
● Then upgrade the old master & restore it from the 5.6 backup
● We have our own internal tool for the switch master failover
○ but 5.6 broke it…
○ Whole cluster in a read-only state without a master, i.e. no writes allowed
○ Fortunately that happened in preprod :)
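The overall process (first slave, other slaves, then the master) is mostly a matter of ordering. The sketch below lays that order out for one cluster. The function name, host names and step labels are hypothetical, and the real process has more checks between steps.

```python
def rolling_upgrade_plan(master, slaves):
    """Return ordered (host, action) pairs: slaves first, old master last."""
    if not slaves:
        raise ValueError("at least one slave is needed to upgrade without downtime")
    first, *others = slaves
    plan = [
        (first, "out of rotation, upgrade to 5.6, back in rotation"),
        (first, "take the 5.6 binary backup"),
    ]
    for slave in others:
        plan.append((slave, "out of rotation, restore the 5.6 backup, back in rotation"))
    plan.append((first, "promote to master"))
    plan.append((master, "upgrade binaries, restore the 5.6 backup, rejoin as a slave"))
    return plan


for host, action in rolling_upgrade_plan("db1", ["db2", "db3"]):
    print(f"{host}: {action}")
```

The point of the ordering is that the cluster keeps a 5.5 master until every slave is already on 5.6, so a downgrade stays cheap for as long as possible.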

Preprod upgrade 6/9

Issue : Internal tools broken - Switch master failover

The tool used the deprecated statements SLAVE START and SLAVE STOP instead of START SLAVE and STOP SLAVE, and they were removed in 5.6.

From the 5.5 manual: "In old versions of MySQL (before 4.0.5), this statement was called SLAVE START. This usage is still accepted in MySQL 5.5 for backward compatibility, but is deprecated and is removed in MySQL 5.6" : https://dev.mysql.com/doc/refman/5.5/en/start-slave.html

From the 5.6 manual: "The SLAVE START and SLAVE STOP statements. Use The START SLAVE and STOP SLAVE statements" : http://dev.mysql.com/doc/refman/5.6/en/mysql-nutshell.html

Fix: Use the right statements

=> avoid usage of deprecated commands / functions /...
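A quick way to catch this before an upgrade is to scan tool sources for the removed spellings. The sketch below is a line-based heuristic with a hypothetical function name. It only looks for SLAVE START / SLAVE STOP and will miss statements assembled at runtime.

```python
import re

_REMOVED = re.compile(r"\bSLAVE[ \t]+(START|STOP)\b", re.IGNORECASE)


def find_removed_statements(source):
    """Return (line_number, matched_text, replacement) for each removed statement."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = _REMOVED.search(line)
        if match:
            verb = match.group(1).upper()
            hits.append((number, match.group(0), f"{verb} SLAVE"))
    return hits


# Made-up tool source for illustration.
tool_source = 'run("SLAVE STOP")\nrun("START SLAVE")\nrun("slave  start")\n'
for hit in find_removed_statements(tool_source):
    print(hit)
```

The same idea extends to any other removed feature that has a recognisable spelling.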

Preprod upgrade 7/9

Issue : Internal tools broken - Internal usage

Because of the new configuration, new information is logged in the binlog:

You can also cause the server to write checksums for the events using CRC32 checksums by setting the binlog_checksum system variable : http://dev.mysql.com/doc/refman/5.6/en/mysql-nutshell.html

http://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_checksum

These tools parse the binlog…

Fix : Development by the relevant team
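For a tool that reads raw binlog events, the practical consequence is a few extra bytes at the end of each event. The sketch below assumes that, with binlog_checksum=CRC32, each event ends with a 4-byte little-endian CRC32 computed over the bytes that precede it. That layout is an assumption stated for illustration (the format description event has extra subtleties), and the function name is hypothetical.

```python
import struct
import zlib

CHECKSUM_LEN = 4  # assumed size of the CRC32 trailer, in bytes


def split_checksummed_event(event):
    """Return (body, checksum_ok) for an event assumed to end with a CRC32 trailer."""
    if len(event) <= CHECKSUM_LEN:
        raise ValueError("event too short to carry a checksum")
    body, trailer = event[:-CHECKSUM_LEN], event[-CHECKSUM_LEN:]
    (stored,) = struct.unpack("<I", trailer)
    return body, stored == (zlib.crc32(body) & 0xFFFFFFFF)


# Synthetic bytes, not a real binlog event.
body = b"synthetic event bytes"
event = body + struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF)
print(split_checksummed_event(event))
```

A parser written for 5.5 that treats those trailing bytes as event data would misread every event, which is one plausible way such tools break.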

Preprod upgrade 8/9

Upgrade workflow 1/2

1. Extract schema and data + Pre-upgrade checks

2. Drop MySQL directories (datadir, logdir)

[ binaries upgraded to 5.6 by OPS + Disk encryption ] : OPS tasks

3. Load schema + Post-upgrade checks

4. Load data + Post-upgrade check2 & Compare differences in "before" & "after" checks

Checks: object count, charset,...
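Comparing the "before" and "after" checks is a dictionary diff once each check has been reduced to a name and a value. A minimal sketch; the check names and values are made up, and how each check is collected is left out.

```python
def compare_checks(before, after):
    """Return {check_name: (before_value, after_value)} for every difference."""
    differences = {}
    for name in sorted(set(before) | set(after)):
        if before.get(name) != after.get(name):
            differences[name] = (before.get(name), after.get(name))
    return differences


# Made-up check results for illustration.
before = {"tables": 120, "views": 4, "default_charset": "utf8"}
after = {"tables": 120, "views": 3, "default_charset": "utf8"}
print(compare_checks(before, after))
```

An empty result means the reload preserved everything the checks look at. A check missing on one side shows up with None for that side.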

Preprod upgrade 9/9

Upgrade workflow 2/2

● The upgrade process was split into a dozen scripts
● These scripts were called by 4 main wrapper scripts for convenience
● 2 levels of granularity provide more flexibility
○ In case of an issue, DBAs can resume the process "manually" at any step
○ An extra step can easily be added (e.g. a schema modification)
● Automation is important
○ Tasks are pretty straightforward but time consuming
○ Lowers the risk of error
○ Hundreds of servers
● DBAs need to be aware of the status
● The scripts send emails to DBAs when
○ A task is completed
○ An error occurs
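The wrapper idea (small steps, resumable at any point, a notification on completion or error) can be sketched in a few lines. Everything here is hypothetical: the step names, the runner, and the notify callback, which stands in for the emails sent to the DBAs.

```python
def run_steps(steps, notify, start_at=0):
    """Run (name, action) steps in order from start_at.

    Returns the index of the failed step, or None if all steps completed.
    """
    for index in range(start_at, len(steps)):
        name, action = steps[index]
        try:
            action()
        except Exception as error:
            notify(f"FAILED at step {index} ({name}): {error}")
            return index
        notify(f"completed step {index} ({name})")
    return None


# Illustrative steps that do nothing; real ones would call the upgrade scripts.
steps = [
    ("extract schema and data", lambda: None),
    ("load schema", lambda: None),
    ("load data", lambda: None),
]
print(run_steps(steps, notify=print))
```

After a failure, a DBA can fix the problem and call run_steps again with start_at set to the returned index, which is the "resume at any step" behaviour described above.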

Upgrade Confidence Index : 95%

Prod upgrade

Final step(s), final tests

● Preprod is similar but not identical to prod.
● To be more comfortable we
○ Added extra slaves on our smaller clusters
○ Ran the full process on them
● Not possible to test the switch master failover
● But we were confident enough to start, so we started
○ In progress...

Upgrade Confidence Index : 99%

Wrap-up

● Identify what's relevant for you in the new release
○ Understand the changes : added / removed features
○ Don't be an early adopter (if you don't have a proper support team) : let others clear the way
● Run your own tests
○ Performance : related to your workload / data set
○ Functional : do your apps depend on a removed/changed feature?
● Split the work into batches
○ Easier to manage/debug/...
● Automation
○ Manual things are error prone
○ Write it once, use it at will
● Communication
○ Explain / describe what you are going to do
○ Involve consumers and look for their feedback

Questions?

Thank you!

Olivier DASINI

Twitter : @freshdaz

Mail : [email protected]

Skype : olivier.dasini