Getput suite
A Swift Benchmarking Tool
Mark Seger
Hewlett Packard
Cloud Services
4/19/2013
Getput Swift Performance Tools
Problem Statement
• Performance Measurements
– Consistent/standard mechanisms for controlled experiments
– Ability to easily modify test parameters
– Minimal installation, configuration and use
– Easy to compare results of multiple runs
– Easy to clean up when done
• Benchmarking – run performance tests at scale
– Repeat tests while increasing demand for resources
– Parallel tests must be coordinated: start/finish together
Getput Suite
• Multiple tools organized in a hierarchy
– getput: actual workhorse, runs tests on single client
– gpmaster: coordinates running getput on multiple clients
– gpsuite: defines suites of tests to minimize the number of switches needed
– yourscript: can call gpsuite multiple times when desired
getput.py
• Uses swiftclient library
• Lots of switches, lots of different behaviors
– Standalone
    • Basic: creds, cname, oname, size, num/runtime, tests, rep count
    • More: processes, container type (shared/byproc/bynode), latency details, operation logging, and still more
– Multi-node (controlled by gpmaster)
    • start time, rank
gpmaster.py
• Coordinates running of getput on multiple clients
– Ensures all clients start together and finish at approximately the same time
– Summarizes results as a single line
– Unlike getput, runs only 1 test at a time; sequencing multiple tests is a job for gpsuite
• More required switches than getput
– Credentials file
– Rank
– Start time
– Hosts file or single client name, may need ssh key too
– And a few more…
• But rarely run by itself!
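A minimal sketch of the coordinated start idea, assuming the clients' clocks are synchronized (for example via NTP). The function names are hypothetical and are not taken from gpmaster or getput; the sketch only illustrates handing every client the same start time and having each one wait for it.

```python
import time


def pick_start_time(lead_seconds=10, now=time.time):
    """Master side: choose a start time far enough ahead for every client to launch."""
    return int(now()) + lead_seconds


def wait_for_start(start_epoch, now=time.time, sleep=time.sleep):
    """Client side: block until the shared start time; return the seconds waited."""
    delay = max(0.0, start_epoch - now())
    if delay > 0:
        sleep(delay)
    return delay
```

A client that launches late simply waits less; one that launches after the start time does not wait at all, which is why the lead time needs to cover the slowest launch.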
gpsuite.py
• Removes complexity of running gpmaster
• Think of macros: gpsuite --suite full
– Sets of object sizes, e.g. 1k, 10k, 100k, etc.
– Numbers of threads, e.g. 1, 2, 4, 8, etc.
    • Distributes threads across multiple clients
• Some runs can take hours with a single command
• Cleans up after each run
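As an illustration of what a suite expands to, the sketch below builds the cross product of object sizes and thread counts and spreads each thread count across the clients. The helper names and the even-split policy are assumptions made for the example, not gpsuite's actual logic.

```python
from itertools import product


def split_threads(total_threads, clients):
    """Spread total_threads across clients as evenly as possible."""
    base, extra = divmod(total_threads, len(clients))
    plan = {}
    for i, client in enumerate(clients):
        count = base + (1 if i < extra else 0)
        if count:  # skip clients that get no work
            plan[client] = count
    return plan


def build_suite(sizes, thread_counts, clients):
    """One (size, total threads, per-client plan) entry per test in the suite."""
    return [(size, threads, split_threads(threads, clients))
            for size, threads in product(sizes, thread_counts)]
```

For example, build_suite(["1k", "10k", "100k"], [1, 2, 4, 8], clients) yields 12 tests, which hints at why a full suite can run for hours.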
Getput Output

Earliest versions
Inst  Start     End       Seconds  Tests  Num  MB/S  IOPS   Errs
0     13:59:20  13:59:29  8.57     put    100  0.11  11.67  0
0     13:59:29  13:59:33  4.03     get    100  0.24  24.83  0
0     13:59:33  13:59:34  1.80     del    100  0.54  55.68  0

Added latency range in later versions
Inst  Start     End       Seconds  Tests  Num  MB/S  IOPS   Latency  LatRange    Errs
0     13:59:20  13:59:29  8.57     put    100  0.11  11.67  0.085    0.02-00.22  0
0     13:59:29  13:59:33  4.03     get    100  0.24  24.83  0.040    0.04-00.05  0
0     13:59:33  13:59:34  1.80     del    100  0.54  55.68  0.018    0.01-00.05  0

Added CPU and started playing with compression in more recent versions
Inst  Start     End       Seconds  Tests  Num  MB/S  IOPS   Latency  LatRange    Errs  Procs  OSize  %CPU  Comp
0     13:59:20  13:59:29  8.57     put    100  0.11  11.67  0.085    0.02-00.22  0     1      10k    0.30  no
0     13:59:29  13:59:33  4.03     get    100  0.24  24.83  0.040    0.04-00.05  0     1      10k    0.39  no
0     13:59:33  13:59:34  1.80     del    100  0.54  55.68  0.018    0.01-00.05  0     1      10k    0.58  no

Eventually added latency distribution histogram
Latency  LatRange    Errs  Procs  OSize  0.0   0.1  0.2  0.3  0.4  0.5
0.106    0.02-00.36  0     10     10k    527   396  67   10   0    0
0.041    0.01-00.07  0     10     10k    1000  0    0    0    0    0
0.031    0.01-00.16  0     10     10k    964   36   0    0    0    0
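The columns above can be reproduced from per-operation latencies with a little arithmetic. The sketch below is an illustration, not getput's code; it assumes 1k = 1024 bytes, 1 MB = 1024 × 1024 bytes, and that each histogram column counts latencies in a 0.1 second bucket, with the last bucket catching everything larger.

```python
def summarize(latencies, object_bytes, elapsed, bucket_width=0.1, buckets=6):
    """Derive throughput, latency and histogram figures for one test."""
    num = len(latencies)
    hist = [0] * buckets
    for lat in latencies:
        # The last bucket is open-ended and collects all slower operations.
        idx = min(int(lat / bucket_width), buckets - 1)
        hist[idx] += 1
    return {
        "mb_per_sec": num * object_bytes / (1024.0 * 1024.0) / elapsed,
        "iops": num / elapsed,
        "latency": sum(latencies) / num,
        "lat_range": (min(latencies), max(latencies)),
        "histogram": hist,
    }
```

Plugging in the first row of the tables (100 PUTs of 10k objects in 8.57 seconds) gives about 11.67 IOPS and 0.11 MB/s, matching the output shown.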
Observations
• Swift multi-scaling excellent
– With multiple clients performance grows close to linearly
– With single client and multiple threads
    • Smaller objects scale very well, even with lots of threads
    • Larger objects hit either a CPU or a network wall!
• Both compression and encryption cost CPU
– Limits large object bandwidth, less so with smaller ones
– Early testing: disabling compression gives up to 2X boost for large objects
    • Similar behavior when using http instead of https
– Only just started looking at changing ciphers
Recommendation: make compression, ssl and cipher choice optional in swiftclient
Look at the network during tests
segerm@az1-nv-compute-0001:~$ ./getput.py -cc -oo -n1 -s100m -tp --comp
Inst Start End Seconds Tests Num MB/S IOPS Latency LatRange
0 15:52:15 15:52:20 5.85 put 1 17.10 0.17 5.800 5.80-05.80
segerm@az1-nv-compute-0001:~$ collectl
waiting for 1 second sample...
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
0 0 1342 1078 0 0 20 2 0 4 70 56
0 0 261 304 0 0 20 2 0 3 0 2
1 0 580 578 0 0 0 0 0 5 0 3
3 0 4697 780 0 0 0 0 135 2010 15956 11517
4 0 5859 1324 0 0 0 0 138 2345 19037 13708
4 0 5168 609 0 0 48 6 138 2354 19036 13706
4 0 5597 993 0 0 4 1 138 2351 19053 13717
4 0 5129 538 0 0 0 0 139 2366 19053 13716
3 0 4579 1070 0 0 0 0 107 1817 14554 10495
0 0 154 201 0 0 20 2 0 1 0 1
This is always true for uncompressible objects: upload speed ~= network bandwidth
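A quick arithmetic check of that claim using the numbers on this slide. It assumes that -s100m means 100 × 1024 × 1024 bytes and that collectl's KB is 1024 bytes; the variable names are only for illustration.

```python
def mb_per_sec(total_bytes, seconds):
    """Throughput in MB/s, with 1 MB taken as 1024 * 1024 bytes."""
    return total_bytes / (1024.0 * 1024.0) / seconds


# One 100m object uploaded in 5.85 seconds (from the getput output line).
upload_rate = mb_per_sec(100 * 1024 * 1024, 5.85)

# Peak KBOut sample from the collectl output, converted from KB/s to MB/s.
wire_rate = 19053 / 1024.0

# Fraction of the observed peak network rate that getput reports as upload speed.
utilization = upload_rate / wire_rate
```

The upload rate works out to roughly 17.1 MB/s against a peak wire rate of roughly 18.6 MB/s; the gap is plausibly the ramp-up and ramp-down seconds visible at either end of the collectl samples.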
Compression can be your friend too
segerm@az1-nv-compute-0001:~$ ./getput.py -cc -oo -n5 -s100m -tp --otype s --comp
Inst Start End Seconds Tests Num MB/S IOPS Latency LatRange
0 16:00:19 16:00:29 10.33 put 5 48.42 0.48 2.060 2.03-02.09
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
0 0 223 292 0 0 56 9 0 1 0 1
1 0 618 565 0 0 0 0 14 20 2 16
3 0 1380 694 0 0 0 0 14 167 605 317
4 0 1846 1194 0 0 0 0 11 165 508 304
3 1 9799 1008 0 0 12 2 173 2949 848 2949
4 1 11071 996 0 0 0 0 198 3377 607 3376
Another reason to make compression optional!
#<----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#cpu sys inter ctxsw KBRead Reads KBWrit Writes KBIn PktIn KBOut PktOut
1 0 1512 523 0 0 16 3 5 36 8 34
8 2 6377 892 0 0 0 0 658 6588 171130 117279
7 2 5488 1835 0 0 8 1 519 4933 150290 103175
6 2 8772 6113 0 0 0 0 744 8679 162089 114059
Look what the proxy is doing
3 Obj Servers
…but only for compressible objects
Let’s talk about latency
• Latency metrics originally based on averages
– Like coarse monitoring, great for trends but poor for exceptions
– Soon realized more detail was needed
• Consider the following. What does it really mean?
– Is the only problem that one entry of 0.083?
On closer inspection
• The first 4 entries don’t look too bad
• Even the bottom one isn’t that horrible
Ranges shed more light
• Even though first 4 lines have close latencies, look at their max values
• Now we know why line 5 is so bad
• Even line 6 has very high max
But even that’s not enough
• Min/Max doesn’t tell us how many outliers
• Lines 2 and 4 have almost 50 in the .5 bucket
• Line 5 has 6 PUTs >4 seconds
• Line 6 is all over the place
Example 1: Latency of 0.04 too high!
• When looking at 1k, 10k and 100k GETs, noticed IOPS for 10k were lower!
– Great reason to look at more than MB/sec
• After much digging discovered this only applied to object sizes 7888 -> 22469 bytes
– This could only have been found by running sets of tests and looking very closely at the numbers
• What’s going on here?
– We run pound on proxies to support multiple connection ports
– Proxy does fast get and passes data to pound over loopback address
– Max segsize for loopback >> network MSS
– Eventlet uses 8192 byte buffers
– Nagle algorithm: bytes > 8192 and ~<8192+MSS have delayed ACK
• Eventlet needs bigger buffers? Turn off nagle?
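For reference, turning off Nagle on a TCP socket is a one-line socket option. The sketch below uses the standard Python socket module on a plain socket; where such a change would belong in the proxy, eventlet or pound is not covered by these slides, and the helper name is hypothetical.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With TCP_NODELAY set, small trailing segments are sent immediately
    # instead of waiting for the ACK of earlier data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The trade-off is more small packets on the wire, so this is normally weighed against the alternative of sending larger buffers.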
Example 2: Latency 0.5
• Observed a number of these in small object PUTs
• Caused by a proxy timeout connecting to obj server
• Might be worth looking into ways to reduce and/or not try to re-contact a non-responsive server
Example 3: Latency 6 Secs
• These occur less frequently, but do happen
• Traced back to disk error on object server
• BUT the other 2 object servers responded in < 1sec
• Think about how many IOPS are being lost!
Might it be worth it to return after 2 successes?
Maybe at least ignore writes to that disk?
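To make the "return after 2 successes" question concrete, here is a toy sketch of a quorum write. It is not how the Swift proxy is implemented; the replica writers are hypothetical stand-in functions, and the slow one simply sleeps to imitate a replica stuck on a bad disk.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def healthy():
    """Stand-in for a replica write that succeeds immediately."""
    return "ok"


def slow_disk():
    """Stand-in for a replica write stuck behind a failing disk."""
    time.sleep(0.5)
    return "ok"


def write_with_quorum(write_fns, quorum=2):
    """Start all replica writes in parallel; return True once `quorum` succeed."""
    pool = ThreadPoolExecutor(max_workers=len(write_fns))
    futures = [pool.submit(fn) for fn in write_fns]
    successes = 0
    try:
        for fut in as_completed(futures):
            if fut.exception() is None:
                successes += 1
                if successes >= quorum:
                    return True
        return False
    finally:
        # Do not wait for stragglers; they finish in the background.
        pool.shutdown(wait=False)
```

With two healthy replicas and one slow one, the caller gets its answer as soon as the second write completes rather than waiting for the slowest, which is the IOPS gain the slide is asking about.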
So what’s next for latency?
• Investigate why some ops have even longer latencies
• Added another switch to getput: --logops
– Extended put_object() to return transaction ID
– Writes detailed log records for every operation
– Makes it possible for longer latency transactions to be traced
segerm@az1-nv-compute-0000:~$ more /tmp/getput-p-0-1363878303.log
15:05:03.522 1363878303.521659 1363878303.459080 0.062547 eb4194b73e46f52f774a63fa552755d4 o-0-1-1
15:05:03.574 1363878303.574005 1363878303.521702 0.052291 eb4194b73e46f52f774a63fa552755d4 o-0-1-2
15:05:03.627 1363878303.627218 1363878303.574032 0.053174 eb4194b73e46f52f774a63fa552755d4 o-0-1-3
15:05:03.686 1363878303.686175 1363878303.627244 0.058918 eb4194b73e46f52f774a63fa552755d4 o-0-1-4
15:05:03.747 1363878303.746874 1363878303.686201 0.060661 eb4194b73e46f52f774a63fa552755d4 o-0-1-5
15:05:03.804 1363878303.804106 1363878303.746900 0.057194 eb4194b73e46f52f774a63fa552755d4 o-0-1-6
15:05:03.866 1363878303.866148 1363878303.804133 0.061979 eb4194b73e46f52f774a63fa552755d4 o-0-1-7
15:05:03.932 1363878303.931911 1363878303.866175 0.065724 eb4194b73e46f52f774a63fa552755d4 o-0-1-8
Recommendation: GET, PUT and DEL calls should return transaction IDs
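A log like the one above lends itself to simple filtering for slow operations. The sketch below assumes six whitespace-separated fields per record; the field names are guesses from the excerpt (the second timestamp is smaller than the first by roughly the latency, so they are treated as end and start), not getput's documented format.

```python
from collections import namedtuple

# Field names are inferred from the log excerpt and may not match getput's own.
LogRecord = namedtuple("LogRecord", "clock end start latency ident obj")


def parse_logops_line(line):
    """Split one log record into typed fields."""
    clock, end, start, latency, ident, obj = line.split()
    return LogRecord(clock, float(end), float(start), float(latency), ident, obj)


def slow_ops(lines, threshold):
    """Return the records whose latency exceeds threshold seconds."""
    records = (parse_logops_line(line) for line in lines if line.strip())
    return [rec for rec in records if rec.latency > threshold]
```

Run against the excerpt with a threshold of 0.06 seconds, this picks out objects o-0-1-1, o-0-1-5, o-0-1-7 and o-0-1-8, whose identifiers could then be looked up in the server logs.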
swcmd: a nifty helper utility
• One challenge of benchmarking can be LOTs of containers and objects needing cleanup
– Can have dozens to 100s containers
– Can have Ks to 100Ks of objects
– Swift client too slow for deletes!
• Swift client utility could use some more functionality
– How about displaying numbers of objects in containers?
– Container sizes and even dates?
– The same details when listing containers
– What about parallel or even wild card listing/deletes?
• Only parallelizes for >1K objects in a container
• Uses multiprocessing; can hit 300-400 deletes/sec
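A rough sketch of the "only parallelize above a threshold" pattern. To stay self-contained it swaps in multiprocessing.pool.ThreadPool (a thread-backed pool with the same map interface) rather than the process pool swcmd is described as using, and delete_batch is a hypothetical stand-in that only counts names instead of calling Swift.

```python
from multiprocessing.pool import ThreadPool

PARALLEL_THRESHOLD = 1000  # assumed cutoff, mirroring the ">1K objects" rule


def chunk(items, n):
    """Deal items into n roughly equal lists."""
    return [items[i::n] for i in range(n)]


def delete_batch(names):
    """Hypothetical stand-in: a real version would issue one delete per name."""
    return len(names)


def delete_objects(names, delete_fn=delete_batch, workers=4):
    """Delete serially for small containers, in parallel for large ones."""
    if len(names) <= PARALLEL_THRESHOLD:
        return delete_fn(names)
    pool = ThreadPool(workers)
    try:
        return sum(pool.map(delete_fn, chunk(names, workers)))
    finally:
        pool.close()
        pool.join()
```

The threshold avoids paying pool start-up cost for containers that a single worker can empty quickly.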
Examples
swcmd ls
63482 61M 2013-03-21 16:19:12 qc-1363882747
49 4G 2013-03-09 13:13:36 vlat-1362834811
0 0 2013-03-20 22:05:06 vlat-1363817101
1 10 2013-03-15 13:58:37 xxx-0-0
1 200M 2013-03-11 12:28:16 xyxxy
2 200M 2013-03-11 12:29:01 xyzzy
2901 702M 2013-02-12 16:34:19 zzz
swcmd -p ls xyz # list containers starting with xyz
swcmd -f rc zzz # force removal of zzz even though not empty
swcmd -p pf x # force removal of ALL containers starting with x
swcmd rm xyzzy/xyzzy # remove specific object
Recommendation: add these types of features to the swift utility
Questions?