Optimizing MySQL without touching SQL or my · Optimizing MySQL without touching SQL or my.cnf...

Post on 06-Aug-2020

6 views 0 download

Transcript of Optimizing MySQL without touching SQL or my · Optimizing MySQL without touching SQL or my.cnf...

Percona Live, Santa Clara 2017

Optimizing MySQL without touching SQL or my.cnf

Maxim Bublis, Peter Boros

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

We want to be more efficient• Less hardware • Faster responses • Power savings

Percona Live, Santa Clara 2017

OS-level optimizationsSet cpufreq scaling governor to performance

$ echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Percona Live, Santa Clara 2017

What’s going onstrace and perf are our friends

Percona Live, Santa Clara 2017

TCP Wrapper libraryWritten by Wietse Venema in 1990

Host-based networking ACL

It is NOT the same as MySQL ACL – doesn’t verify credentials

Percona Live, Santa Clara 2017

How it worksHow libwrap works in general – it reads /etc/hosts.allow and /etc/hosts.deny

Verifies that client host can connect to server

Percona Live, Santa Clara 2017

Let’s run strace$ sudo strace -c -p 28894 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 94.39 56.109363 678 82816 poll 0.82 0.489786 1 328312 read 0.66 0.391561 2 164156 munmap 0.55 0.328845 1 248451 fcntl 0.50 0.298584 2 157468 865 futex 0.49 0.291448 2 164156 open 0.44 0.259529 2 164156 mmap 0.39 0.232314 1 164894 738 setsockopt 0.38 0.228411 1 164156 close 0.35 0.209210 1 164156 fstat 0.32 0.192175 2 82079 rt_sigaction 0.26 0.156935 2 82817 accept 0.18 0.108958 1 82079 getsockname 0.17 0.100658 1 82079 getpeername 0.01 0.003137 3 1045 clone ------ ----------- ----------- --------- --------- ---------------- 100.00 59.446237 2132821 1603 total

Percona Live, Santa Clara 2017

Let’s run strace$ sudo strace -c -p 28894 % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 94.39 56.109363 678 82816 poll 0.82 0.489786 1 328312 read 0.66 0.391561 2 164156 munmap 0.55 0.328845 1 248451 fcntl 0.50 0.298584 2 157468 865 futex 0.49 0.291448 2 164156 open 0.44 0.259529 2 164156 mmap 0.39 0.232314 1 164894 738 setsockopt 0.38 0.228411 1 164156 close 0.35 0.209210 1 164156 fstat 0.32 0.192175 2 82079 rt_sigaction 0.26 0.156935 2 82817 accept 0.18 0.108958 1 82079 getsockname 0.17 0.100658 1 82079 getpeername 0.01 0.003137 3 1045 clone ------ ----------- ----------- --------- --------- ---------------- 100.00 59.446237 2132821 1603 total

Percona Live, Santa Clara 2017

Better to use iptablesMakes tens of thousands context switches per second

• open() • fstat() • read() • close() • etc

We spend 2.5% of system time there

Percona Live, Santa Clara 2017

Fast Mutexes

Percona Live, Santa Clara 2017

Mutex implementation On Linux it is spin lock + futex.

Futex is user-space mutex implementation provided by kernel.

Percona Live, Santa Clara 2017

Fast Mutexes“Fast” mutex

volatile ulong j;

j = 0;

for (i = 0; i < delayloops * 50; i++) j += i;

return(j);

NPTL

do { if (cnt++ >= max_cnt) { LLL_MUTEX_LOCK (mutex); break; } atomic_spin_nop (); } while (LLL_MUTEX_TRYLOCK (mutex) != 0);

Percona Live, Santa Clara 2017

Fast Mutexes“Fast” mutex

Does load+store+add per loop iteration

Number of spins is produced by PRNG

PRNG uses floating-point arithmetics

Utilizes CPU execution port

NPTL

Emits nop / pause instruction

Adaptive number of spins

Percona Live, Santa Clara 2017

Fixed in 5.7.8https://dev.mysql.com/worklog/task/?id=4601

“Pretending that we can implement faster mutexes with such a simple code is justnaive and labeling then as "fast mutex" without any evidence is misleading toour users.”

Percona Live, Santa Clara 2017

sysbenchhttps://github.com/akopytov/sysbench

Scriptable multi-threaded benchmark tool based on LuaJIT

Provides a collection of OLTP-like database benchmarks

Percona Live, Santa Clara 2017

sysbench example$ sysbench \ --num-threads=$num_threads \ --max-time=300 \ --max-requests=0 \ --test=/usr/local/share/sysbench/oltp_read_write.lua \ --mysql-user=sbtest \ --mysql-password=sbtest \ --mysql-host=127.0.0.1 \ --mysql-port=3306 \ --table-size=1000000 \ --tables=64 \ --report-interval=1 \ --auto-inc=off \ run

Percona Live, Santa Clara 2017

CompilersUbuntu 16.04

• GCC 4.9 • GCC 5.4 • Clang 3.8

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Profile-guided optimizations• Requires “training data” • Usually requires to rebuild binary • Common mistake: collecting profile data from unit tests

To build “instrumented” binary: -fprofile-generate

To use profiling data: -fprofile-use (GCC additionally requires -fprofile-correction)

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

AutoFDOhttps://github.com/google/autofdo

Helps to convert perf data into profiling data

Works with both GCC and Clang

Percona Live, Santa Clara 2017

AutoFDOAlmost as efficient as building instrumented binaries

Requires LBR support (last branch record) from both CPU and kernel

Regular perf works well, but works better with perf -b

Percona Live, Santa Clara 2017

Link-time optimizationsAlso known as Full Program Optimization)

Supported by both modern GCC and Clang with -flto option

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Native instruction setsBy default compilers generate code using k8 instruction set for x86_64

It was created and released by AMD in 2000.

-march=native and -mtune=native

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

PGO + LTOWe can combine it!

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

PGO + LTO + NativeEven better!

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Percona Live, Santa Clara 2017

Real world scenarios• Artificial profiling data won’t help much in real world scenarios • Probably will hurt in reality – like slower replication, etc.

Percona Live, Santa Clara 2017

Percona Query Playbackhttps://github.com/Percona-Lab/query-playback

Workload reproduction based on slow log

Percona Live, Santa Clara 2017

Options for capturingslow_query_log_timestamp_precision = microsecond slow_query_log_timestamp_always = 1 log_slow_verbosity = full long_query_time = 0

log_slow_rate_limit = 1

Percona Live, Santa Clara 2017

Slow log event example# Time: 170330 6:25:14.040355 # User@Host: edgestore_user[edgestore_user] @ [127.0.0.1] Id: 10671469 # Schema: stage_edgestore_shard232 Last_errno: 0 Killed: 0 # Query_time: 0.000274 Lock_time: 0.000037 Rows_sent: 0 Rows_examined: 0 Rows_affected: 1 # Bytes_sent: 11 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0 # InnoDB_trx_id: 125F7CBE2A # QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No # Filesort: No Filesort_on_disk: No Merge_passes: 0 # InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000 # InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000 # InnoDB_pages_distinct: 5 SET timestamp=1490855114; ACTUAL QUERY SQLTEXT;

Percona Live, Santa Clara 2017

Learn more about Edgestore

26 April - 3:30 PM - 4:20 PM @ Ballroom A

Percona Live, Santa Clara 2017

Query Playback• For every thread id in slow log it opens new connection • For each client connection it enqueues queue_depth number of statements to

replay • Replays those on those connections

Percona Live, Santa Clara 2017

Query Playback issues• Leaks connections • Often runs into lock waits, and seems to do nothing

(Slow log ordering issue) • Output is unreadable, reasonable debugging is next to impossible

Percona Live, Santa Clara 2017

Query Playback enhancements• No more connection leak (preprocessing) • Correct ordering (preprocessing) • Accurate playback mode

Percona Live, Santa Clara 2017

Kudos to Marius Wachtlerhttps://github.com/undingen

Pyston JIT compiler team member

Joined us for a couple of planning cycles

Bicycling from Austria to Australia now

Percona Live, Santa Clara 2017

Example$ percona-playback \

--mysql-user=testuser \

--mysql-host=127.0.0.1 \

--mysql-password=secret \

--query-log-file my.slow.log \

--queue-depth 1 \

--mysql-schema=stage_edgestore_shard232 \

--disable-reporting-plugin error_report \

--ignore-row-result-diffs \

--mysql-max-retries=0

Percona Live, Santa Clara 2017

Accurate mode• Controlled by query-log-accurate-mode option • Use with high queue depth

• In theory ordering issues are still possible, but we didn’t experience any, restoring the timing solves it.

• Keeps the gap between queries if there were any • Query 1 starts at T, query 2 at T + 5 sec

• If Query 1 was 1 second originally, and was played back in 0.8 seconds, playback will wait 4.2 seconds before Query 2

• If Query 1 was 1 second originally, and was played back in 2 seconds, playback will wait 2 seconds before Query 2

• If Query 1 was 1 seconds originally, and was played back in 6 seconds, playback will start Query 2 immediately

Percona Live, Santa Clara 2017

Tips for larger queue depth• “Faster than possible” playback, while ordering is guaranteed

to have issues • The session_init_query option can help here, for example for

reducing innodb_lock_wait_timeout • New mysql_filter_error option to filter out known issues

to a specific workload

Percona Live, Santa Clara 2017

We are hiring!

Percona Live, Santa Clara 2017