Shak larry-jeder-perf-and-tuning-summit14-part2-final

Transcript of Shak larry-jeder-perf-and-tuning-summit14-part2-final

1. Performance Analysis and Tuning, Part 2

   D. John Shakshober (Shak), Sr Consulting Engineer / Director, Performance Engineering
   Larry Woodman, Senior Consulting Engineer / Kernel VM
   Jeremy Eder, Principal Software Engineer / Performance Engineering

2. Agenda: Performance Analysis Tuning Part II

   Part I
   - RHEL evolution 5 > 6 > 7: out-of-the-box tuned for clouds ("tuned")
   - Auto_NUMA_Balance: tuned for Non-Uniform Memory Access (NUMA)
   - Cgroups / containers
   - Scalability: scheduler tunables
   - Transparent hugepages; static hugepages 4K/2MB/1GB

   Part II
   - Disk and filesystem IO: throughput-performance
   - Network performance and latency: latency-performance
   - System performance/tools: perf, tuna, systemtap, Performance Co-Pilot
   - Q & A

3. Disk I/O in RHEL

4. RHEL tuned Package

   Available profiles:
   - balanced
   - desktop
   - latency-performance
   - network-latency
   - network-throughput
   - throughput-performance
   - virtual-guest
   - virtual-host
   Current active profile: throughput-performance

5. Tuned: Profile throughput-performance

   governor=performance
   energy_perf_bias=performance
   min_perf_pct=100
   readahead=4096
   kernel.sched_min_granularity_ns = 10000000
   kernel.sched_wakeup_granularity_ns = 15000000
   vm.dirty_background_ratio = 10
   vm.swappiness=10

6. I/O Tuning: Understanding I/O Elevators

   - Deadline: new RHEL7 default for all profiles. Two queues per device,
     one for reads and one for writes; I/Os dispatched based on time spent
     in queue.
   - CFQ: used for system disks off SATA/SAS controllers. Per-process
     queue; each process queue gets a fixed time slice (based on process
     priority).
   - NOOP: used for high-end SSDs (Fusion-io etc.). FIFO with simple I/O
     merging; lowest CPU cost.

7. Iozone Performance: Effect of TUNED on EXT4/XFS/GFS

   [Charts: RHEL7 RC 3.10-111 file system in-cache and out-of-cache
   performance for ext3, ext4, xfs and gfs2; Intel I/O (iozone, geometric
   mean, 1m-4g files, 4k-1m records); throughput in MB/sec, not tuned vs
   tuned.]

8. SAS Application on Standalone Systems: Picking a RHEL File System

   - xfs: most recommended. Max file system size 100TB, max file size
     100TB. Best performing.
   - ext4: recommended. Max file system size 16TB, max file size 16TB.
   - ext3: not recommended. Max file system size 16TB, max file size 2TB.

   [Chart: SAS Mixed Analytics (RHEL6 vs RHEL7), perf 32, 2-socket Nehalem,
   8 x 48GB; total time and system time per file system, in seconds (lower
   is better).]

9. Tuning Memory: Flushing Caches

   Drop unused cache to control pagecache dynamically; frees most pagecache
   memory (file cache). If the DB uses the cache, you may notice a
   slowdown. NOTE: use for benchmark environments.

   Free pagecache:
   # sync; echo 1 > /proc/sys/vm/drop_caches
   Free slabcache:
   # sync; echo 2 > /proc/sys/vm/drop_caches
   Free pagecache and slabcache:
   # sync; echo 3 > /proc/sys/vm/drop_caches

10. Virtual Memory Manager (VM) Tunables

   Reclaim ratios:
   - /proc/sys/vm/swappiness
   - /proc/sys/vm/vfs_cache_pressure
   - /proc/sys/vm/min_free_kbytes
   Writeback parameters:
   - /proc/sys/vm/dirty_background_ratio
   - /proc/sys/vm/dirty_ratio
   Readahead parameters:
   - /sys/block/<device>/queue/read_ahead_kb

11. Per File System Flush Daemon

   [Diagram: a read()/write() in user space copies data between a user
   buffer and the kernel pagecache; a per-file-system flush daemon writes
   dirty pagecache pages out to the file system.]

12. swappiness

   Controls how aggressively the system reclaims anonymous memory:
   - Anonymous memory: swapping
   - Mapped file pages: writing if dirty, then freeing
   - System V shared memory: swapping
   Decreasing: more aggressive reclaiming of pagecache memory.
   Increasing: more aggressive swapping of anonymous memory.
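   To make these reclaim knobs concrete, a minimal sketch of checking and
   setting swappiness with sysctl; the value 10 mirrors the
   throughput-performance profile above, and the drop-in file name
   99-tuning.conf is our own hypothetical choice:

   Read the current value, then set it for the running system:
   # sysctl vm.swappiness
   # sysctl -w vm.swappiness=10
   Persist the setting across reboots via a sysctl drop-in:
   # echo 'vm.swappiness = 10' > /etc/sysctl.d/99-tuning.conf
   # sysctl -p /etc/sysctl.d/99-tuning.conf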
13. vfs_cache_pressure

   Controls how aggressively the kernel reclaims memory in slab caches.
   - Increasing causes the system to reclaim the inode cache and dentry
     cache.
   - Decreasing causes the inode cache and dentry cache to grow.

14. min_free_kbytes

   Directly controls the page reclaim watermarks in KB. Defaults are higher
   when THP is enabled.

   # echo 1024 > /proc/sys/vm/min_free_kbytes
   Node 0 DMA   free:4420kB  min:8kB    low:8kB    high:12kB
   Node 0 DMA32 free:14456kB min:1012kB low:1264kB high:1516kB

   # echo 2048 > /proc/sys/vm/min_free_kbytes
   Node 0 DMA   free:4420kB  min:20kB   low:24kB   high:28kB
   Node 0 DMA32 free:14456kB min:2024kB low:2528kB high:3036kB

15. dirty_background_ratio, dirty_background_bytes

   Controls when dirty pagecache memory starts getting written. Default is
   10%.
   - Lower: flushing starts earlier; less dirty pagecache and smaller IO
     streams.
   - Higher: flushing starts later; more dirty pagecache and larger IO
     streams.
   - dirty_background_bytes overrides the ratio when you want < 1%.

16. dirty_ratio, dirty_bytes

   Absolute limit on the percentage of dirty pagecache memory. Default is
   20%.
   - Lower means cleaner pagecache and smaller IO streams.
   - Higher means dirtier pagecache and larger IO streams.
   - dirty_bytes overrides the ratio when you want < 1%.

17. dirty_ratio and dirty_background_ratio

   100% of pagecache RAM dirty:
     flushd and write()'ing processes write dirty buffers
   dirty_ratio (20% of RAM dirty):
     processes start synchronous writes; flushd writes dirty buffers in
     background
   dirty_background_ratio (10% of RAM dirty):
     wake up flushd
   Below that: do nothing.
   0% of pagecache RAM dirty.

18. Network Performance Tuning

19. RHEL7 Networks

   - IPv4 routing cache replaced with the Forwarding Information Base:
     better scalability, determinism and security
   - Socket BUSY_POLL (aka low latency sockets)
   - 40G NIC support (the bottleneck moves back to the CPU :-))
   - VXLAN offload (for OpenStack)
   - NetworkManager: nmcli and nmtui

20. Locality of Packets

   [Diagram: NUMA node 1; streams from customers 1-4 land on socket 1,
   cores 1-4 respectively.]

21. Tuned: Profile Inheritance

   Parents: throughput-performance, latency-performance
   Children: network-throughput, network-latency, virtual-host,
   virtual-guest, balanced, desktop

22. Tuned: Profile Inheritance (throughput)

   throughput-performance:
   governor=performance
   energy_perf_bias=performance
   min_perf_pct=100
   readahead=4096
   kernel.sched_min_granularity_ns = 10000000
   kernel.sched_wakeup_granularity_ns = 15000000
   vm.dirty_background_ratio = 10
   vm.swappiness=10

   network-throughput adds:
   net.ipv4.tcp_rmem="4096 87380 16777216"
   net.ipv4.tcp_wmem="4096 16384 16777216"
   net.ipv4.udp_mem="3145728 4194304 16777216"

23. Tuned: Profile Inheritance (latency)

   latency-performance:
   force_latency=1
   governor=performance
   energy_perf_bias=performance
   min_perf_pct=100
   kernel.sched_min_granularity_ns=10000000
   vm.dirty_ratio=10
   vm.dirty_background_ratio=3
   vm.swappiness=10
   kernel.sched_migration_cost_ns=5000000

   network-latency adds:
   transparent_hugepages=never
   net.core.busy_read=50
   net.core.busy_poll=50
   net.ipv4.tcp_fastopen=3
   kernel.numa_balancing=0

24. Networking Performance: System Setup

   - Evaluate the two new tuned profiles for networking (see the sketch
     after this slide)
   - Disable unnecessary services; runlevel 3
   - Follow vendor guidelines for BIOS tuning: logical cores? power
     management? Turbo?
   - In the OS, consider:
     - Disabling the filesystem journal (SSD/memory storage)
     - Reducing writeback thresholds if your app does disk I/O
     - NIC offloads favor throughput
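   A minimal sketch of trying the two networking profiles with tuned-adm
   (the profile names are the ones listed on the inheritance slides above;
   output elided):

   List the shipped profiles and check which one is active:
   # tuned-adm list
   # tuned-adm active
   Switch between the two networking children:
   # tuned-adm profile network-latency
   # tuned-adm profile network-throughput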
25. Network Tuning: Buffer Bloat

   # ss | grep -v ssh
   State Recv-Q Send-Q  Local Address:Port  Peer Address:Port
   ESTAB 0      0       172.17.1.36:38462   172.17.1.34:12865
   ESTAB 0      3723128 172.17.1.36:58856   172.17.1.34:53491

   NIC ring buffers:
   # ethtool -g p1p1

   - 10G line rate: ~4MB queue depth
   - Matching servers

   [Chart: kernel buffer queue depth, 10Gbit TCP_STREAM; Send-Q depth in
   MB over time at 1-second intervals.]

26. Tuned: Network Throughput Boost

   [Chart: RHEL7 RC1 tuned profile comparison on 40G networking (balanced,
   throughput-performance, network-latency, network-throughput) with
   incorrect binding, correct binding, and correct binding with jumbo
   frames; results range from 19.7 Gb/s to 39.6 Gb/s.]

27-29. netsniff-ng: ifpps (progressive build)

   - Aggregates network stats on one screen
   - Can output to .csv
   - Shows pkts/sec, drops/sec, and hard/soft IRQs/sec

30. Network Tuning: Low Latency TCP

   - Set TCP_NODELAY (Nagle)
   - Experiment with ethtool offloads
   - tcp_low_latency: tiny substantive benefit found
   - Ensure kernel buffers are right-sized: use ss (Recv-Q, Send-Q)
   - Don't setsockopt unless you've really tested
   - Review old code to see if you're using setsockopt; it might be
     hurting performance

31. Network Tuning: Low Latency UDP

   Mainly about managing bursts and avoiding drops:
   - rmem_max / wmem_max
   - TX: netdev_max_backlog, txqueuelen
   - RX: netdev_max_backlog, ethtool -g, ethtool -c, netdev_budget
   - Dropwatch tool in RHEL

32. Full DynTicks (nohz_full)

33. Full DynTicks Patchset

   Patchset goal: stop interrupting userspace tasks.
   - Move timekeeping to non-latency-sensitive cores
   - If nr_running=1, the scheduler tick can avoid that core
   - Default disabled; opt in via the nohz_full cmdline option
   The kernel tick handles: timekeeping (gettimeofday), scheduler load
   balancing, and memory statistics (vmstat).

34. RHEL6 and 7 Tickless

   [Diagram: timeline of a userspace task; timer interrupts tick while the
   task runs, and ticks stop only while idle.]

35. nohz_full

   [Diagram: with nohz_full, ticks also stop while a single userspace task
   runs; tickless no longer requires idle.]

36. Busy Polling

37. SO_BUSY_POLL Socket Option

   - Socket-layer code polls the receive queue of the NIC
   - Replaces interrupts and NAPI
   - Retains full capabilities of the kernel network stack

38. BUSY_POLL Socket Option

   [Chart: netperf TCP_RR and UDP_RR transactions/sec, baseline vs
   SO_BUSY_POLL, for TCP_RR-RX/TX and UDP_RR-RX/TX.]

39. Power Management

40. Power Management: P-states and C-states

   - P-state: CPU frequency; governors, frequency scaling
   - C-state: CPU idle state; idle drivers

41. Introducing the intel_pstate P-state Driver

   - New default P-state driver in RHEL7: intel_pstate (not a module)
   - CPU governors replaced with sysfs min_perf_pct and max_perf_pct
   - Moves the Turbo knob into OS control (yay!)
   - Tuned handles most of this for you:
     - Sets min_perf_pct=100 for most profiles
     - Sets x86_energy_perf_policy=performance (same as RHEL6)

42. Impact of CPU Idle Drivers (watts per workload)

   [Chart: % power saved with RHEL7 @ C1 across workloads: kernel build,
   disk read, disk write, unpack tar.gz, active idle.]

43. Turbostat Shows P/C-states on Intel CPUs

   Default:
   pk cor CPU %c0  GHz  TSC  %c1    %c3  %c6  %c7
   0  0   0   0.24 2.93 2.88 5.72   1.32 0.00 92.72
   0  1   1   2.54 3.03 2.88 3.13   0.15 0.00 94.18
   0  2   2   2.29 3.08 2.88 1.47   0.00 0.00 96.25
   0  3   3   1.75 1.75 2.88 1.21   0.47 0.12 96.44

   latency-performance:
   pk cor CPU %c0  GHz  TSC  %c1    %c3  %c6  %c7
   0  0   0   0.00 3.30 2.90 100.00 0.00 0.00 0.00
   0  1   1   0.00 3.30 2.90 100.00 0.00 0.00 0.00
   0  2   2   0.00 3.30 2.90 100.00 0.00 0.00 0.00
   0  3   3   0.00 3.30 2.90 100.00 0.00 0.00 0.00
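   To reproduce a view like this, a minimal sketch (per the utilities
   slide later in this deck, turbostat ships in cpupowerutils on RHEL6 and
   kernel-tools on RHEL7; the 5-second window is arbitrary):

   Sample frequency and C-state residency while a command runs:
   # turbostat sleep 5
   Inspect the intel_pstate limits that tuned sets:
   # cat /sys/devices/system/cpu/intel_pstate/min_perf_pct
   # cat /sys/devices/system/cpu/intel_pstate/max_perf_pct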
44. Frequency Scaling (Turbo): Varying Load

   [Chart: average GHz with Turbo ON vs Turbo OFF, for sockets 0 and 1
   running 1 thread and 8 threads each.]

45. Analysis Tools: Performance Co-Pilot

46. Performance Co-Pilot (PCP)

   (Multi) system-level performance monitoring and management.

47-49. pmchart: Graphical Metric Plotting Tool (progressive build)

   - Can plot myriad performance statistics
   - Recording mode allows for replay, i.e. on a different system: record
     in the GUI, then
     # pmafm $recording.folio
   - Ships with many pre-cooked views, for example:
     - ApacheServers: CPU%/Net/Busy/Idle Apache servers
     - Overview: CPU%/Load/IOPS/Net/Memory

50. Performance Co-Pilot Demo Script

   Tiny script to exercise 4 food groups:
   CPU:
   # stress -t 5 -c 1
   DISK:
   # dd if=/dev/zero of=/root/2GB count=2048 bs=1M oflag=direct
   NETWORK:
   # netperf -H rhel7.lab -l 5
   MEMORY:
   # stress -t 5 --vm 1 --vm-bytes 16G

51. [Screenshot: pmchart view of CPU%, load average, IOPS, network, and
    allocated memory.]

52-55. pmcollectl Mode (progressive build)

   [Screenshots: pmcollectl views adding the CPU, IOPS, network, and
   memory subsystems.]

56. pmatop Mode

   [Screenshot: pmatop view.]

57. Tuna

58. Network Tuning: IRQ Affinity

   - Use irqbalance for the common case
   - New irqbalance automates NUMA affinity for IRQs
   - Manual IRQ pinning for the last X percent / determinism
   Move 'p1p1*' IRQs to socket 1:
   # tuna -q p1p1* -S1 -m -x
   # tuna -Q | grep p1p1

59-62. Tuna GUI Capabilities Updated for RHEL7 (progressive build)

   - Run tuning experiments in realtime
   - Save settings to a conf file (then load with the tuna CLI)
   [Screenshots: Tuna GUI.]

63. Network Tuning: IRQ Affinity (continued)

   - Use irqbalance for the common case; new irqbalance automates NUMA
     affinity for IRQs
   - Flow-steering technologies
   - Manual IRQ pinning for the last X percent / determinism: guide on the
     Red Hat Customer Portal
   Move 'p1p1*' IRQs to socket 1:
   # tuna -q p1p1* -S1 -m -x
   # tuna -Q | grep p1p1

64. Tuna for IRQs

   Move 'p1p1*' IRQs to socket 0:
   # tuna -q p1p1* -S0 -m -x
   # tuna -Q | grep p1p1
   78 p1p1-0 0 sfc
   79 p1p1-1 1 sfc
   80 p1p1-2 2 sfc
   81 p1p1-3 3 sfc
   82 p1p1-4 4 sfc

65. Tuna for Processes

   # tuna -t netserver -P
   thread                       ctxt_switches
   pid   SCHED_ rtpri affinity  voluntary nonvoluntary cmd
   13488 OTHER  0     0xfff     1         0            netserver
   # tuna -c2 -t netserver -m
   # tuna -t netserver -P
   thread                       ctxt_switches
   pid   SCHED_ rtpri affinity  voluntary nonvoluntary cmd
   13488 OTHER  0     2         1         0            netserver

66-67. Tuna for Core/Socket Isolation

   # grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status
   Cpus_allowed_list: 0-15
   # tuna -S1 -i     (tuna sets affinity of the 'init' task as well)
   # grep Cpus_allowed_list /proc/`pgrep rsyslogd`/status
   Cpus_allowed_list: 0,1,2,3,4,5,6,7

68. Analysis Tools: perf

69. perf

   Userspace tool to read CPU counters and kernel tracepoints.

70. perf list

   List counters/tracepoints available on your system.

71. perf list

   grep for something interesting, maybe to see what NUMA balance is
   doing?

72. perf top

   System-wide 'top' view of busy functions.

73-76. perf record (progressive build)

   - Record system-wide (-a)
   - A single command
   - An existing process (-p)
   - Add call-chain recording (-g)
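   Putting those flags together, a minimal sketch (the 10-second window
   and the netserver target are illustrative, not from the deck):

   System-wide with call chains for the duration of the sleep, then view
   the report:
   # perf record -a -g -- sleep 10
   # perf report
   Attach to an existing process by pid:
   # perf record -g -p $(pgrep netserver) -- sleep 10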
77-78. perf record (continued)

   - Record system-wide (-a), a single command, an existing process (-p),
     call chains (-g), and only specific events (-e)

79-80. perf report

   [Screenshots: perf report for the dd run, drilling into /dev/zero and
   oflag=direct.]

81. perf diff

   Compare 2 perf recordings.

82. perf probe (dynamic tracepoints)

   Insert a tracepoint on any function. Try 'perf probe -F' to list
   possibilities. [Screenshot: "My Probe Point".]

83. RHEL7 Performance Tuning Summary

   - Use Tuned, NumaD and Tuna in RHEL6 and RHEL7
     - Power savings mode (performance), locked (latency)
     - Transparent hugepages for anon memory (monitor it)
     - numabalance: multi-instance, consider NumaD
     - Virtualization: virtio drivers, consider SR-IOV
   - Manually tune:
     - NUMA via numactl; monitor with numastat -c pid
     - Huge pages: static hugepages for pinned shared memory
     - Managing VM: dirty ratio and swappiness tuning
     - Use cgroups for further resource management control

84. Upcoming Performance Talks

   - Performance tuning: Red Hat Enterprise Linux for databases
     Sanjay Rao, Wednesday April 16, 2:30pm
   - Automatic NUMA balancing for bare-metal workloads & KVM
     virtualization
     Rik van Riel, Wednesday April 16, 3:40pm
   - Red Hat Storage Server Performance
     Ben England, Thursday April 17, 11:00am

85. Helpful Utilities

   Supportability: redhat-support-tool, sos, kdump, perf, psmisc, strace,
   sysstat, systemtap, trace-cmd, util-linux-ng
   NUMA: hwloc, Intel PCM, numactl, numad, numatop (01.org)
   Power/Tuning: cpupowerutils (R6), kernel-tools (R7), powertop, tuna,
   tuned
   Networking: dropwatch, ethtool, netsniff-ng (EPEL6), tcpdump,
   wireshark/tshark
   Storage: blktrace, iotop, iostat

86. Helpful Links

   - Official Red Hat Documentation
   - Red Hat Low Latency Performance Tuning Guide
   - Optimizing RHEL Performance by Tuning IRQ Affinity
   - nohz_full
   - Performance Co-Pilot
   - Perf
   - How do I create my own tuned profile on RHEL7?
   - Busy Polling Whitepaper
   - Blog: http://www.breakage.org/ or @jeremyeder

87. Q & A

88. Tuned: Profile virtual-host

   throughput-performance (parent):
   governor=performance
   energy_perf_bias=performance
   min_perf_pct=100
   transparent_hugepages=always
   readahead=4096
   sched_min_granularity_ns = 10000000
   sched_wakeup_granularity_ns = 15000000
   vm.dirty_ratio = 40
   vm.dirty_background_ratio = 10
   vm.swappiness=10

   virtual-host overrides:
   vm.dirty_background_ratio = 5
   sched_migration_cost_ns = 5000000

   virtual-guest overrides:
   vm.dirty_ratio = 30
   vm.swappiness = 30

89. RHEL RHS Tuning w/ RHEV / RHEL OSP (tuned)

   - gluster volume set group virt
   - XFS: mkfs -n size=8192; mount inode64, noatime
   - RHS server: tuned-adm profile rhs-virtualization
     - Increases readahead, lowers dirty ratios
   - KVM host: tuned-adm profile virtual-host
   - Better response time: shrink the guest block device queue,
     /sys/block/vda/queue/nr_requests (16 or 8)
   - Best sequential read throughput: raise VM read-ahead,
     /sys/block/vda/queue/read_ahead_kb (4096/8192)
   (A sketch of the guest-side commands follows the charts below.)

90. Iozone Performance Comparison RHS2.1/XFS w/ RHEV

   [Chart: iozone throughput for random write, random read, sequential
   write and sequential read; out-of-the-box vs tuned rhs-virtualization.]

91. RHS FUSE vs libgfapi Integration (RHEL6.5 and RHEL7)

   [Chart: OSP 4.0 large-file sequential I/O, FUSE vs libgfapi; 4 RHS
   servers (replica 2), 4 computes, 4G file size, 64K record size; total
   throughput in MB/sec for sequential writes and reads, 1 instance and
   64 instances.]
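   As promised on slide 89, a minimal sketch of the guest-side block
   tuning (the vda device name and the values 8 and 4096 come from that
   slide; the paths assume a virtio-blk guest):

   Shrink the guest block device queue for better response time:
   # echo 8 > /sys/block/vda/queue/nr_requests
   Raise guest readahead for best sequential read throughput:
   # echo 4096 > /sys/block/vda/queue/read_ahead_kb
   Apply the matching tuned profiles:
   # tuned-adm profile virtual-host          (on the KVM host)
   # tuned-adm profile rhs-virtualization    (on the RHS server)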