Datacenter Computers
Transcript of Datacenter Computers
Datacenter Computers: modern challenges in CPU design
Dick Sites, Google Inc., February 2015
Thesis: Servers and desktops require different design emphasis
Goals
• Draw a vivid picture of a computing environment that may be foreign to your experience so far
• Expose some ongoing research problems
• Perhaps inspire contributions in this area
Analogy (S.A.T. pre-2005)
[A pictorial "A is to B as C is to D" analogy; the images are not captured in this transcript]
Datacenter Servers are Different
Move data: big and small
Real-time transactions: 1000s per second
Isolation between programs
Measurement underpinnings
Move data: big and small
Move data: big and small
• Move lots of data
  – Disk to/from RAM
  – Network to/from RAM
  – SSD to/from RAM
  – Within RAM
• Bulk data
• Short data: variable-length items
• Compress, encrypt, checksum, hash, sort
Lots of memory
• 4004: no memory
• i7: 12MB L3 cache
Little brain, LOTS of memory
• Server: 64 GB–1 TB
Drinking straw access
High-Bandwidth Cache Structure
• Hypothetical 64-CPU-thread server
[Diagram: CPU threads 0–63 across Cores 0–15; each core has its own L1 caches, groups of cores share an L2 cache, all share one L3 cache, backed by many DRAM channels]
Forcing Function
Move 16 bytes every CPU cycle
What are the direct consequences of this goal?
Move 16 bytes every CPU cycle
• Load 16B, store 16B, test, branch
  – All in one cycle = 4-way issue minimum
  – Need some 16-byte registers
• At 3.2 GHz: 50 GB/s read + 50 GB/s write
  – 100 GB/s through one L1 D-cache
• Must avoid reading a cache line before a write if it is to be fully written (else 150 GB/s)
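A back-of-envelope check of these numbers (the 3.2 GHz clock and 16-byte width are from the slides; the slides round 51.2 to 50, and so on):

```python
# Forcing function: move 16 bytes every CPU cycle.
CLOCK_HZ = 3.2e9
BYTES_PER_CYCLE = 16

read_gbs = CLOCK_HZ * BYTES_PER_CYCLE / 1e9   # 16B loaded per cycle
write_gbs = CLOCK_HZ * BYTES_PER_CYCLE / 1e9  # 16B stored per cycle
total = read_gbs + write_gbs                  # traffic through one L1 D-cache

# If a destination line is read before being fully overwritten, every
# stored line also costs a line read: roughly 50 + 50 + 50 GB/s.
total_with_rfo = total + write_gbs

print(read_gbs, total, total_with_rfo)        # ~51.2, ~102.4, ~153.6
```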
Short strings
• Parsing, words, packet headers, scatter-gather, checksums
the _ quick | brown _ _ fox   (variable-length words)
VLAN | IP hdr | TCP hdr | … | chksum   (packet fields)
Generic Move, 37 bytes
source: thequ | ickbrownfoxjumps | overthelazydog..
dest:   thequ | ickbrownfoxjumps | overthelazydog..
pieces: 5 bytes + 16 bytes + 16 bytes
(Same 37-byte move, with the 16B-aligned cache accesses marked.)
Generic Move, 37 bytes, aligned target
source: thequick | brownfoxjumpsove | rthelazydog..
dest:   thequickbr | ownfoxjumpsovert | helazydog..
pieces: 10 bytes + 16 bytes + 11 bytes
16B-aligned cache accesses; a 32B shift is helpful: LD, >>, ST
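The piece sizes above follow mechanically from the target's alignment. A sketch of that split (the function name and layout are illustrative, not from the talk): one short leading piece brings the target to a 16B boundary, then full 16-byte pieces, then a short tail.

```python
def move_pieces(dst_offset: int, length: int, align: int = 16):
    """Split a move of `length` bytes to a target at `dst_offset` within
    an aligned line into (head, full, ..., tail) piece sizes."""
    pieces = []
    head = min((-dst_offset) % align, length)  # bytes to 16B-align the target
    if head:
        pieces.append(head)
    length -= head
    while length >= align:                     # aligned full-width pieces
        pieces.append(align)
        length -= align
    if length:                                 # short tail
        pieces.append(length)
    return pieces

print(move_pieces(6, 37))    # [10, 16, 11], the aligned-target example
print(move_pieces(11, 37))   # [5, 16, 16]
```

The Load Partial / Store Partial instructions on the next slide handle exactly the head and tail pieces, leaving only full 16-byte pieces in the loop.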
Useful Instructions
• Load Partial R1, R2, R3
• Store Partial R1, R2, R3
  – Load/store the low bytes of R1 (a 16-byte register) at 0(R2); length 0..15 comes from the low 4 bits of R3; zero-pad the high bytes
  – 0(R2) can be unaligned
  – Length zero never segfaults
  – Similar to the Power architecture Move Assist instructions
• Handles all short moves
• Remaining length is always a multiple of 16
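A software emulation of the proposed Load Partial semantics (the instruction itself is hypothetical; this just models the description above):

```python
def load_partial(mem: bytes, addr: int, r3: int) -> bytes:
    """Return a 16-byte register value: the low (r3 & 0xF) bytes loaded
    from mem[addr:], high bytes zero-padded. Length 0 touches no memory,
    so it can never fault."""
    length = r3 & 0xF                  # length 0..15 from low 4 bits of R3
    data = mem[addr:addr + length]     # addr may be unaligned
    return data + b"\x00" * (16 - len(data))

reg = load_partial(b"thequickbrownfox", 3, 5)
print(reg)                             # b'quick' followed by 11 zero bytes
```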
Move 16 bytes every CPU cycle
• L1 data cache: 2 aligned LD/ST accesses per cycle: 100 GB/s, plus 100 GB/s fill
Move 16 bytes every CPU cycle
• Sustained throughout L2, L3, and RAM
• Perhaps four CPUs simultaneously
[Diagram annotations: 100 GB/s × 16 at the L1 caches, 200 GB/s × 4 at the L2 caches, 400 GB/s × 1 at L3/DRAM]
Top 20 Stream Copy Bandwidth (April 2014)
Rank  Date        Machine ID                    ncpus  COPY (GB/s)
 1.   2012.08.14  SGI_Altix_UV_2000              2048  6591
 2.   2011.04.05  SGI_Altix_UV_1000              2048  5321
 3.   2006.07.10  SGI_Altix_4700                 1024  3661
 4.   2013.03.26  Fujitsu_SPARC_M10-4S           1024  3474
 5.   2011.06.06  ScaleMP_Xeon_X6560_64B          768  1493
 6.   2004.12.22  SGI_Altix_3700_Bx2              512   906
 7.   2003.11.13  SGI_Altix_3000                  512   854
 8.   2003.10.02  NEC_SX-7                         32   876
 9.   2008.04.07  IBM_Power_595                    64   679
10.   2013.09.12  Oracle_SPARC_T5-8               128   604
11.   1999.12.07  NEC_SX-5-16A                     16   607
12.   2009.08.10  ScaleMP_XeonX5570_vSMP_16B      128   437
13.   1997.06.10  NEC_SX-4                         32   434
14.   2004.08.11  HP_AlphaServer_GS1280-1300       64   407
 --   Our Forcing Function                          1   400
15.   1996.11.21  Cray_T932_321024-3E              32   310
16.   2014.04.24  Oracle_Sun_Server_X4-4           60   221
17.   2007.04.17  Fujitsu/Sun_Enterprise_M9000    128   224
18.   2002.10.16  NEC_SX-6                          8   202
19.   2006.07.23  IBM_System_p5_595                64   186
20.   2013.09.17  Intel_XeonPhi_SE10P              61   169
https://www.cs.virginia.edu/stream/top20/Bandwidth.html
Latency
• Buy 256 GB of RAM only if you use it
  – Higher cache miss rates
  – And main memory is ~200 cycles away
Cache Hits: 1-cycle hit (animated at 4 Hz)
Cache Miss to Main Memory: 200-cycle miss (animated at 4 Hz)
What constructive work could you do during this time?
Latency
• Buy 256 GB of RAM only if you use it
  – Higher cache miss rates
  – And main memory is ~200 cycles away
• For starters, we need to prefetch about 16B * 200cy = 3.2KB to meet our forcing function; call it 4KB
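The 3.2 KB figure is a Little's-law style estimate, bytes in flight = bandwidth × latency (the 64B line size below is an assumption for illustration, not from the slide):

```python
BYTES_PER_CYCLE = 16
MISS_LATENCY_CYCLES = 200      # main memory ~200 cycles away

# Data that must be in flight to hide the full miss latency.
bytes_in_flight = BYTES_PER_CYCLE * MISS_LATENCY_CYCLES
print(bytes_in_flight)         # 3200 bytes, ~3.2 KB; call it 4 KB

# At 64B cache lines, that is 50 outstanding line fills, well beyond
# typical hardware miss buffers; hence prefetching in software.
print(bytes_in_flight // 64)   # 50
```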
Additional Considerations
• L1 cache size = associativity * page size
  – Need bigger than 4 KB pages
• A translation buffer at 256 x 4 KB covers only 1 MB of memory
  – Covers much less than the on-chip caches
  – A TB miss can consume 15% of total CPU time
  – Need bigger than 4 KB pages
• With 256 GB of RAM @4 KB: 64M pages
  – Need bigger than 4 KB pages
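The page-size arithmetic behind those bullets, as a sketch (the 2 MB alternative is an illustrative choice to show the effect of larger pages, not a number from the talk):

```python
TLB_ENTRIES = 256
RAM = 256 * 2**30                  # 256 GB server

def coverage(page_size):           # memory reachable without a TLB miss
    return TLB_ENTRIES * page_size

def pages(page_size):              # pages the OS must manage
    return RAM // page_size

print(coverage(4 * 2**10) // 2**20)   # 1   (MB covered by 4 KB pages)
print(pages(4 * 2**10) // 2**20)      # 64  (M pages at 4 KB)
print(coverage(2 * 2**20) // 2**20)   # 512 (MB covered with 2 MB pages)
```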
Move data: big and small
• It’s still the memory
• Need a coordinated design of instruction architecture and memory implementation to achieve high bandwidth with low delay
Modern challenges in CPU design
• Lots of memory
• Multi-issue CPU instructions every cycle that drive full bandwidth
• Full bandwidth all the way to RAM, not just to L1 cache
• More prefetching in software
• Bigger page size(s)
Topic: do better
Real-time transactions: 1000s per second
A Single Transaction Across ~40 Racks of ~60 Servers Each
• Each arc is a related client-server RPC (remote procedure call)
A Single Transaction RPC Tree vs Time: Client & 93 Servers
Single Transaction Tail Latency
• One slow response out of 93 parallel RPCs slows the entire transaction
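A sketch of why high fan-out amplifies tails: if each of the 93 parallel RPCs is independently slow with probability p (independence is a simplifying assumption), the whole transaction is slow unless every one of them is fast.

```python
def txn_slow_prob(p: float, fanout: int = 93) -> float:
    """Probability that at least one of `fanout` parallel RPCs is slow."""
    return 1 - (1 - p) ** fanout

# Even a 1-in-100 chance of a slow server makes most transactions slow:
print(round(txn_slow_prob(0.01), 3))    # ~0.607
print(round(txn_slow_prob(0.001), 3))   # ~0.089
```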
One Server, One User-Mode Thread vs. Time: Eleven Sequential Transactions (circa 2004)
Normal: 60-70 msec
Long: 1800 msec
One Server, Four CPUs: user/kernel transitions, every CPU, every nanosecond (Ktrace)
[Timeline: CPUs 0–3 over 200 usec]
16 CPUs, 600us, Many RPCs
CPUs 0..15
RPCs 0..39
16 CPUs, 600us, Many RPCs
THREADs 0..46
LOCKs 0..22
That is A LOT going on at once
• Let’s look at just one long-tail RPC in context
16 CPUs, 600us, one RPC
CPUs 0..15
RPCs 0..39
16 CPUs, 600us, one RPC
THREADs 0..46
LOCKs 0..22
Wakeup Detail
• Thread 19 frees a lock, sends a wakeup to waiting thread 25
• Thread 25 actually runs ~50 us later
• 50 us wakeup delay??
Wakeup Detail
• Target CPU was busy; kernel waited
[Thread and CPU timelines of the 50 us gap: CPU 9 was busy, but CPU 3 was idle]
CPU Scheduling, 2 Designs
• Re-dispatch on any idle CPU core
  – But if the idle CPU core is in deep sleep, it can take 75–100 us to wake up
• Wait to re-dispatch on the previous CPU core, to get cache hits
  – Saves squat if it could use the same L1 cache
  – Saves ~10 us if it could use the same L2 cache
  – Saves ~100 us if it could use the same L3 cache
  – Expensive if cross-socket cache refills
  – Don't wait too long…
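The two designs can be framed as a cost comparison. A minimal sketch with the slide's rough figures; the decision rule and function name are illustrative, not how any particular kernel decides.

```python
DEEP_SLEEP_WAKE_US = 100.0    # waking a deeply sleeping core: 75-100 us

def pick_core(idle_core_asleep: bool, wait_for_prev_us: float,
              cache_refill_saved_us: float) -> str:
    """Choose between dispatching now on an idle core, or waiting for the
    thread's previous core to free up and reusing its warm caches."""
    cost_idle = DEEP_SLEEP_WAKE_US if idle_core_asleep else 0.0
    # Waiting costs the wait itself, minus the cache refills it avoids.
    cost_wait = wait_for_prev_us - cache_refill_saved_us
    return "idle core" if cost_idle <= cost_wait else "previous core"

# An awake idle core beats waiting, even for an L2-warm previous core:
print(pick_core(False, 20, 10))    # idle core
# A deeply sleeping core loses to a soon-free, L3-warm previous core:
print(pick_core(True, 20, 100))    # previous core
```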
Real-time transactions: 1000s per second
• Not your father’s SPECmarks
• To understand delays, need to track simultaneous transactions across servers, CPU cores, threads, queues, locks
Modern challenges in CPU design
• A single transaction can touch thousands of servers in parallel
• The slowest parallel path dominates
• Tail latency is the enemy
  – Must control lock-holding times
  – Must control scheduler delays
  – Must control interference via shared resources
Topic: do better
Isolation between programs
Histogram: Disk Server Latency; Long Tail
Latency 99th %ile = 696 msec
Isolation of programs reduces tail latency.
Reduced tail latency = higher utilization.
Higher utilization = $$$.
Non-repeatable Tail Latency Comes from Unknown Interference
Many Sources of Interference
• Most interference comes from software
• But a bit from the hardware underpinnings
• In a shared apartment building, most interference comes from jerky neighbors
• But thin walls and bad kitchen venting can be the hardware underpinnings
Isolation issue: Cache Interference
CPU thread 0 is moving 16B/cycle flat out, filling caches, hurting threads 3, 7, 63
[Diagram: the 64-thread cache hierarchy with thread 0's lines (0s) flooding its L1, the shared L2, and the L3]
Isolation issue: Cache Interference
• CPU thread 0 is moving 16B/cycle flat out, hurting threads 3, 7, 63
Today:   000000000000000000000… (thread 0's lines fill the shared caches; neighbors like thread 7 keep only a sliver)
Desired: ~000111222333444555666777… (each thread holds roughly its fair share, e.g. ~00112233 on Core 0, ~44556677 on Core 1)
Cache Interference
• How to get there?
  – Partition by ways
    • No good if 16 threads and 8 ways
    • No good if the result is direct-mapped
    • Underutilizes the cache
  – Selective allocation
    • Give each thread a target cache size
    • Allocate lines freely if under target
    • Replace only own lines if over target
    • Allow over-budget slop to avoid underutilization
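A behavioral sketch of selective allocation (data structures are illustrative; real hardware tracks this per cache set, and the over-budget slop mentioned above is not modeled here):

```python
class SelectiveCache:
    """Each thread has a target line count; a thread over its target may
    victimize only its own lines, never a neighbor's."""
    def __init__(self, capacity, targets):
        self.capacity = capacity
        self.targets = dict(targets)            # thread id -> target lines
        self.current = {t: 0 for t in targets}  # lines each thread owns now
        self.lines = []                         # owner id per cache line

    def allocate(self, thread):
        if len(self.lines) < self.capacity:
            pass                                # free space: allocate freely
        elif self.current[thread] >= self.targets[thread]:
            self.lines.remove(thread)           # over target: evict own line
            self.current[thread] -= 1
        else:
            # Under target: evict a line of the most over-target thread.
            victim = max(self.current,
                         key=lambda t: self.current[t] - self.targets[t])
            self.lines.remove(victim)
            self.current[victim] -= 1
        self.lines.append(thread)
        self.current[thread] += 1

cache = SelectiveCache(8, {0: 4, 1: 4})
for _ in range(8):                 # thread 0 floods the whole cache...
    cache.allocate(0)
for _ in range(4):                 # ...but thread 1 claws back its share
    cache.allocate(1)
print(cache.current)               # {0: 4, 1: 4}
```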
Cache Interference
• Each thread has target:current, replace only own lines if over target
[Per-thread counters, e.g. Target 25 25 25 25 vs. Current 24 27 24 24, or Target 100 100 … vs. Current 105 99 …; the over-target thread replaces only its own lines]
Cache Interference
• Each thread has target:current, replace only own lines if over target
• Requires owner bits per cache line (expensive bits)
• Requires 64 target/current counters at L3
• Fails if the L3 is not at least 64-way associative
  – Can rarely find own lines in a set
Design Improvement
• Track ownership just by incoming paths
  – Plus a separate target for kernel accesses (K)
  – Plus a separate target for over-target accesses (OV)
• Fewer bits, 8-way assoc OK
[Per-path Target/Current counters: paths 0-3, 4-7, 8-…, plus OV and K]
Isolation between programs
• Good fences make good neighbors
• We need better hardware support for program isolation in shared memory systems
Modern challenges in CPU design
• Isolating programs from each other on a shared server is hard
• As an industry, we do it poorly
  – Shared CPU scheduling
  – Shared caches
  – Shared network links
  – Shared disks
• More hardware support needed
• More innovation needed
Measurement underpinnings
Profiles: What, not Why
• Samples of 1000s of transactions, merging results
• Pro: understanding average performance
• Blind spots: outliers, idle time
[Image: painting by Chuck Close]
Traces: Why
• Full detail of individual transactions
• Pro: understanding outlier performance
• Blind spot: tracing overhead
Histogram: Disk Server Latency; Long Tail
Latency 99th %ile = 696 msec
[Histogram: latency buckets 0–1000 msec, with numbered peaks 1–7]
A CPU profile shows the 99% common cases; the 1% long tail never shows up.
Trace: Disk Server, Event-by-Event
• Read RPC + disk time
• Write hit, hit, miss
• 700ms mixture
• 13 disks, three normal seconds:
Trace: Disk Server, 13 disks, 1.5 sec
• Phase transition to 250ms boundaries exactly
Read
Write
Read hit
Write hit
Trace: Disk Server, 13 disks, 1.5 sec
• Latencies: 250ms, 500ms, … for nine minutes
Read
Write
Read hit
Write hit
Why?
• Probably not on your guessing radar…
• Kernel throttling the CPU use of any process that is over purchased quota
• Only happened on old, slow servers
Disk Server, CPU Quota bug
• Understanding Why sped up 25% of the entire disk fleet worldwide!
  – Had been going on for three years
  – Savings paid my salary for 10 years
• Hanlon's razor: Never attribute to malice that which is adequately explained by stupidity.
• Sites’ corollary: Never attribute to stupidity that which is adequately explained by software complexity.
Measurement Underpinnings
• All performance mysteries are simple once they are understood
• “Mystery” means that the picture in your head is wrong; software engineers are singularly inept at guessing how their view differs from reality
Modern challenges in CPU design
• Need low-overhead tools to observe the dynamics of performance anomalies
  – Transaction IDs
  – RPC trees
  – Timestamped transaction begin/end
• Traces
  – CPU kernel+user, RPC, lock, thread traces
  – Disk, network, power-consumption
Topic: do better
Summary: Datacenter Servers are Different
Datacenter Servers are Different
Move data: big and small
Real-time transactions: 1000s per second
Isolation between programs
Measurement underpinnings
References
Claude Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July/October 1948.
http://cs.ucf.edu/~dcm/Teaching/COP5611-Spring2013/Shannon48-MathTheoryComm.pdf
Richard Sites, It's the Memory, Stupid!, Microprocessor Report, August 5, 1996.
http://cva.stanford.edu/classes/cs99s/papers/architects_look_to_future.pdf
Ravi Iyer, CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms, ACM International Conference on Supercomputing, 2004.
http://cs.binghamton.edu/~apatel/cache_sharing/CQoS_iyer.pdf
M. Bligh, M. Desnoyers, R. Schultz, Linux Kernel Debugging on Google-sized Clusters, Linux Symposium, 2007.
https://www.kernel.org/doc/mirror/ols2007v1.pdf#page=29
Daniel Sanchez and Christos Kozyrakis, The ZCache: Decoupling Ways and Associativity, IEEE/ACM Symp. on Microarchitecture (MICRO-43), 2010.
http://people.csail.mit.edu/sanchez/papers/2010.zcache.micro.pdf
Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag, Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google Technical Report dapper-2010-1, April 2010.
http://scholar.google.com/scholar?q=Google+Technical+Report+dapper-2010-1&btnG=&hl=en&as_sdt=0%2C5&as_vis=1
Daniel Sanchez and Christos Kozyrakis, Vantage: Scalable and Efficient Fine-Grained Cache Partitioning, Intl. Symp. on Computer Architecture (ISCA), 2011.
http://ppl.stanford.edu/papers/isca11-sanchez.pdf
Luiz André Barroso and Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2nd Edition, 2013.
http://www.morganclaypool.com/doi/pdf/10.2200/S00516ED2V01Y201306CAC024
Thank You, Questions?