Latency Trumps All
Chris Saari - twitter.com/[email protected]
Thursday, November 19, 2009
Packet Latency
Time for a packet to get between points A and B: physical distance + time queued in devices along the way
~60ms
Anytime...
... the system is waiting for data. The system is end to end:
- Human response time
- Network card buffering
- System bus/interconnect speed
- Interrupt handling
- Network stacks
- Process scheduling delays
- Application process waiting for data from memory to get to the CPU, or from disk to memory to the CPU
- Routers, modems, last-mile speeds
- Backbone speed and operating condition
- Inter-cluster/colo performance
Big Picture
[Diagram: the end-to-end system - User, CPU, Memory, Disk, Network]
Tubes?
Latency vs. Bandwidth
[Diagram: latency is measured in time; bandwidth in bits per second]
Bandwidth of a Truck Full of Tape
Latency Lags Bandwidth - David Patterson
Given the record of advances in bandwidth versus latency, the logical question is why? Here are five technical reasons and one marketing reason.

1. Moore's Law helps bandwidth more than latency. The scaling of semiconductor processes provides both faster transistors and many more on a chip. Moore's Law predicts a periodic doubling in the number of transistors per chip, due to scaling and in part to larger chips; recently, that rate has been 22-24 months [6]. Bandwidth is helped by faster transistors, more transistors, and more pins operating in parallel. The faster transistors help latency, but the larger number of transistors and the relatively longer distances on the actually larger chips limit the benefits of scaling to latency. For example, processors in Table 1 grew by more than a factor of 300 in transistors, and by more than a factor of 6 in pins, but area increased by almost a factor of 5. Since distance grows by the square root of the area, distance in Table 1 doubled.

2. Distance limits latency. Distance sets a lower bound to latency. The delay on the long word lines and bit lines are the largest part of the row access time of a DRAM. The speed of light tells us that if the other computer on the network is 300 meters away, its latency can never be less than one microsecond.

3. Bandwidth is generally easier to sell. The non-technical reason that latency lags bandwidth is the marketing of performance: it is easier to sell higher bandwidth than to sell lower latency. For example, the benefits of a 10Gbps bandwidth Ethernet are likely easier to explain to customers today than a 10-microsecond latency Ethernet, no matter which actually provides better value. One can argue that greater advances in bandwidth led to marketing techniques to sell bandwidth that in turn trained customers to desire it. No matter what the real chain of events, unquestionably higher bandwidth for processors, memories, or the networks is easier to sell today than latency. Since bandwidth sells, engineering resources tend to be thrown at bandwidth, which further tips the balance.

4. Latency helps bandwidth. Technology improvements that help latency usually also help bandwidth, but not vice versa. For example, DRAM latency determines the number of accesses per second, so lower latency means more accesses per second and hence higher bandwidth. Also, spinning disks faster reduces the rotational latency, but the read head must read data at the new faster rate as well. Thus, spinning the disk faster improves both bandwidth and rotational latency. However, increasing the linear density of bits per inch on a track helps bandwidth but offers no help to latency.

5. Bandwidth hurts latency. It is often easy to improve bandwidth at the expense of latency. Queuing theory quantifies how buffers help bandwidth but hurt latency. As a second example, adding chips to widen a memory module increases bandwidth but the higher fan-out on address lines may increase latency.

6. Operating system overhead hurts latency. A user program that wants to send a message invokes the [...]

(Communications of the ACM, October 2004, Vol. 47, No. 10)
[Figure 1. Log-log plot of bandwidth and latency milestones from Table 1 relative to the first milestone.]
[Table 2. Summary of annual improvements in latency, capacity, and bandwidth in Table 1.]
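As a quick back-of-the-envelope check on point 2 above, the snippet below (not from the deck; the 4,000 km figure is just an assumed rough cross-continent distance) turns the speed-of-light bound into numbers:

```python
# Patterson's "distance limits latency" as arithmetic: light speed alone puts
# a floor under one-way network latency, before any queuing or software
# overhead. (Signals in fibre or copper travel roughly a third slower still.)
C = 299_792_458  # metres per second, speed of light in vacuum

for metres in (300, 4_000_000):  # across a campus; roughly across a continent
    print(f"{metres:>9,} m -> at least {metres / C * 1e6:8,.1f} microseconds one way")
```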
The Problem
Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
--- don't cross this line, don't go off the motherboard! ---
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
Relative Data Access Latency
[Scale diagram, from lower to higher latency: CPU Register, L1, L2, RAM, Hard Disk, Floppy/CD-ROM, LAN, WAN]
CPU Register
CPU Register Latency - Average Human Height
L1 Cache
L2 Cache
x 6 to x 10
RAM
x 25 to x 100
Hard Drive
0.4 x equatorial circumference of Earth
x 10 M
WAN
x 100 M
0.42 x Earth to Moon Distance
To experience pain...
Mobile phone network latency is 2-10x that of wired
- iPhone 3G: 500ms ping
x 500 M
2 x Earth to Moon Distance
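The scaling on the last few slides can be sanity-checked directly. The snippet below is not from the deck; it assumes an average human height of about 1.7 m for one register access and reuses the deck's own multipliers:

```python
# Scale one CPU register access to average human height and see where the
# other latencies land on a map. Height and distances are assumptions, but
# the results roughly reproduce the deck's figures.
HEIGHT_M = 1.7
EARTH_CIRCUMFERENCE_KM = 40_075
EARTH_MOON_KM = 384_400

for name, factor in [("RAM (x100)", 100),
                     ("hard drive (x10M)", 10_000_000),
                     ("WAN (x100M)", 100_000_000),
                     ("mobile, 500ms ping (x500M)", 500_000_000)]:
    km = factor * HEIGHT_M / 1000
    print(f"{name:28s} {km:>11,.0f} km  "
          f"({km / EARTH_CIRCUMFERENCE_KM:.2f}x around the Earth, "
          f"{km / EARTH_MOON_KM:.2f}x Earth-Moon)")
```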
500ms isn’t that long...
Google SPDY
“It is designed specifically for minimizing latency through features such as multiplexed streams, request prioritization and HTTP header compression.”
Strategy Pattern: Move Data Up
Relative Data Access Latencies
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-50)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
Batching: Do it Once
Batching: Maximize Data Locality
Let’s Dig In
Relative Data Access Latencies, Fastest to Slowest
- CPU Registers (1)
- L1 Cache (1-2)
- L2 Cache (6-10)
- Main memory (25-100)
- Hard drive (1e7)
- LAN (1e7-1e8)
- WAN (1e9-2e9)
Network
If you can't Move Data Up, minimize accesses.
Souders Performance Rules
1) Make fewer HTTP requests
- Avoid going halfway to the moon whenever possible
2) Use a content delivery network
- Edge caching gets data physically closer to the user
3) Add an expires header
- Instead of going halfway to the moon (Network), climb Godzilla (RAM) or go 40% of the way around the Earth (Disk) instead
Network: Packets and Latency
Less data = fewer packets = less packet loss = less latency
Network
1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
(rules 3 and 4 are sketched below)
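Rules 3 and 4 amount to "let the client keep a copy" and "send fewer bytes". A minimal Python sketch (the server wiring is omitted; the header names are ordinary HTTP, but the one-year lifetime and the fake body are just example choices):

```python
import gzip
import time
from email.utils import formatdate

body = b"<html>... the page ...</html>" * 1000   # stand-in for a real response
compressed = gzip.compress(body)                 # rule 4: gzip components

one_year = 365 * 24 * 3600
headers = {
    "Content-Encoding": "gzip",
    # Rule 3: a far-future expiry lets repeat visits skip the network entirely
    # and serve from the browser cache (RAM or local disk instead of the WAN).
    "Cache-Control": f"public, max-age={one_year}",
    "Expires": formatdate(time.time() + one_year, usegmt=True),
}
print(f"{len(body)} bytes -> {len(compressed)} bytes gzipped")
```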
Disk: Falling off the Latency Cliff
Jim Gray, Microsoft 2006
Tape is Dead
Disk is Tape
Flash is Disk
RAM Locality is King
Strategy: Move Up: Disk to RAM
RAM gets you above the exponential latency line
- Linear cost and power consumption = $$$
Main memory (25-50) vs. hard drive (1e7)
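A minimal sketch of "move up" in Python, assuming a hypothetical load_record() that would otherwise hit the disk on every call:

```python
import functools

@functools.lru_cache(maxsize=4096)   # hot records stay in RAM after first use
def load_record(path):
    # First call pays the disk price (~1e7 on the relative scale);
    # repeat calls are served from main memory (~25-100).
    with open(path, "rb") as f:
        return f.read()
```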
Strategy: Avoidance: Bloom Filters
- Probabilistic answer to the question of whether a member is in a set
- Constant time via multiple hashes
- Constant-space bit string
- Used in BigTable, Cassandra, Squid
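A small, illustrative Bloom filter in Python (sizes and hash choice here are arbitrary, not what BigTable, Cassandra, or Squid actually use):

```python
import hashlib

class BloomFilter:
    """Probabilistic set membership: no false negatives, tunable false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)          # constant-space bit string

    def _positions(self, key):
        # Derive k positions from k independent-ish hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely not there": skip the ~1e7 disk or ~1e9 WAN trip.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add("row:12345")
print(bf.might_contain("row:12345"))   # True
print(bf.might_contain("row:99999"))   # almost always False
```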
In Memory Indexes
Haystack keeps its file system indexes in RAM
- Cut disk accesses per image from 3 to 1
Search index compression
GFS master node: prefix compression of names
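The shape of the in-memory index idea, sketched in Python; this illustrates the approach, not Facebook's Haystack code (the file name and record format are invented):

```python
# Keep the index (id -> offset, length within one big store file) entirely in
# RAM so serving an object needs exactly one disk read, with no metadata
# lookups on disk first.
index = {}  # photo_id -> (offset, length)

def append_photo(store, photo_id, data):
    store.seek(0, 2)                       # append at the end of the store file
    index[photo_id] = (store.tell(), len(data))
    store.write(data)

def read_photo(store, photo_id):
    offset, length = index[photo_id]       # RAM lookup
    store.seek(offset)                     # the single disk access
    return store.read(length)

with open("photo.store", "w+b") as store:
    append_photo(store, 42, b"...jpeg bytes...")
    print(read_photo(store, 42))
```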
Managing Gigabytes - Witten, Moffat, and Bell
SSDs
Disk vs. SSD
- I/O ops/sec: disk ~180-200 (15K RPM) or ~70-100; SSD ~10K-100K
- Seek times: disk ~3.2-7 ms; SSD ~0.05-0.085 ms
SSDs: less than 1/5th the power consumption of a spinning disk
Sequential vs. Random Disk Access
- James Hamilton
1TB Sequential Read
1TB Random Read
[Calendar graphic: the random read finishes on day 15 - roughly two weeks]
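The back-of-the-envelope behind these two slides, with assumed (not quoted) disk figures of roughly 100 MB/s sequential throughput and ~180 random 4 KB reads per second:

```python
TB = 10**12
SEQ_BYTES_PER_SEC = 100 * 10**6      # ~100 MB/s sequential (assumption)
RANDOM_IOPS, IO_SIZE = 180, 4096     # ~180 random 4 KB reads/s (assumption)

seq_hours = TB / SEQ_BYTES_PER_SEC / 3600
rand_days = (TB / IO_SIZE) / RANDOM_IOPS / 86400

print(f"1TB sequential: ~{seq_hours:.1f} hours")
print(f"1TB random:     ~{rand_days:.0f} days")   # roughly the calendar slide
```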
Strategy: Batching and Streaming
Fewer reads/writes of large contiguous chunks of data
- GFS 64MB chunks (see the sketch below)
Requires data locality
- BigTable app-specified data layout and compression
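A tiny sketch of the batching idea in Python; the 64 MB chunk size mirrors GFS, but the function itself is just an illustration:

```python
CHUNK = 64 * 1024 * 1024   # 64 MB, GFS-sized chunks

def read_in_chunks(path):
    # One large sequential read per iteration: pay the seek/latency cost once,
    # then stream contiguous data, instead of many small scattered reads.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            yield chunk
```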
The CPU
“CPU Bound”
Data in RAM, and CPU access to that data
The Memory Wall
Latency Lags Bandwidth
-Dave Patterson
Multicore Makes It Worse!
More cores accelerate the rate of divergence
- CPU performance doubled 3x over the past 5 years
- Memory performance doubled once
Evolving CPU Memory Access Designs
Intel Nehalem: integrated memory controller and new high-speed interconnect
- 40 percent shorter latency and increased bandwidth; 4-6x faster system
More CPU evolution
Intel Nehalem-EX
- 8 cores, 24MB of cache, 2 integrated memory controllers
- Ring interconnect: an on-die network designed to speed the movement of data among the caches used by each of the cores
IBM Power 7
- 32MB Level 3 cache
AMD Magny-Cours
- 12 cores, 12MB of Level 3 cache
Cache Hit Ratio
Cache Line Awareness
- Linked list: each node as a separate allocation is bad
- Hash table: reprobe on collision with a stride of 1 (toy sketch below)
- Stack allocation: the top of the stack is usually in cache; the top of the heap usually is not
- Pipeline processing: do all stages of operations on a piece of data at once vs. each stage separately
- Optimize for size: may execute faster than code optimized for speed
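To make the hash-table bullet concrete, here is a toy open-addressing table in Python that reprobes with a stride of 1: colliding keys land in adjacent slots of one flat array, so a reprobe tends to stay inside data the first probe already pulled into cache. (Pure-Python overhead hides the raw cache effect; the layout idea is the point.)

```python
class LinearProbeTable:
    """Toy open-addressing hash table: collisions reprobe with a stride of 1."""

    def __init__(self, capacity=1024):
        self.slots = [None] * capacity          # one flat, contiguous array

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)       # stride of 1: the next slot over

    def put(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return

    def get(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]

t = LinearProbeTable()
t.put("a", 1)
t.put("b", 2)
print(t.get("a"), t.get("b"), t.get("missing"))
```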
Cycles to Burn
1) Make fewer HTTP requests
2) Use a content delivery network
3) Add an expires header
4) Gzip components
- Use excess compute for compression
Datacenter
Datacenter Storage Hierarchy: a different view
A bumpy ride that has been getting bumpier over time - Jeff Dean, Google
Intra-Datacenter Round Trip
x 500,000
~500 miles, roughly NYC to Columbus, OH
Datacenter Level Systems
Facebook Cassandra
Google BigTable
memcached
Redis
Project Voldemort
Yahoo Sherpa
Sawzall / Pig
Google File System
RethinkDB
MonetDB
HBase
Facebook Haystack
Memcached Facebook Optimizations
- UDP to reduce network traffic - fewer packets
- One core saturated with network interrupt handling
  - Opportunistic polling of the network interfaces and setting interrupt coalescing thresholds aggressively - Batching
- Contention on the network device transmit queue lock, with packets added/removed from the queue one at a time
  - Changed the dequeue algorithm to batch dequeues for transmit, drop the queue lock, and then transmit the batched packets
- More lock contention fixes
- Result: 200,000 UDP requests/second with an average latency of 173 microseconds
Google BigTable
- Table contains a sequence of blocks; the block index is loaded into memory - Move Up
- Table can be completely mapped into memory - Move Up
- Bloom filters hint for data - Move Up
- Locality groups loaded in memory - Move Up, Batching
  - Clients can control compression of locality groups
- 2 levels of caching - Move Up
  - Scan cache of key/value pairs and a block cache
- Clients cache tablet server locations
  - 3 to 6 network trips if the cache is invalid - Move Up
Facebook Cassandra
- Bloom filters used for keys in files on disk - Move Up
- Sequential disk access only - Batching; append without read-ahead
- Log to memory and write to a commit log on a dedicated disk - Batching (sketched below)
- Programmer-controlled data layout for locality - Batching
- Result: 2 orders of magnitude better performance than MySQL
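A hedged sketch of the write path described above, in Python; this shows the general log-structured pattern, not Cassandra's actual implementation (file name and record format are invented):

```python
class Store:
    """Append-only commit log on disk + in-memory table: writes never seek."""

    def __init__(self, log_path):
        self.log = open(log_path, "ab", buffering=0)   # dedicated, append-only
        self.memtable = {}

    def put(self, key, value):
        record = f"{key}\t{value}\n".encode()
        self.log.write(record)         # sequential append only - Batching
        self.memtable[key] = value     # latest data stays in RAM - Move Up

    def get(self, key):
        return self.memtable.get(key)  # served from memory, not from disk

db = Store("commit.log")
db.put("user:1", "alice")
print(db.get("user:1"))
```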
Move the Compute to the Data: YQL Execute
From the Browser Perspective
Performance is bounded by 3 things:
- Fetch time
  - Unless you're bundling everything, it is a cascade of interdependent requests, at least 2 phases' worth
- Parse time
  - HTML
  - CSS
  - JavaScript
- Execution time
  - JavaScript execution
  - DOM construction and layout
  - Style application
Recap
Move Data Up
- Caching
- Compression
If you can't move all the data up:
- Indexes
- Bloom filters
Batching and Streaming
- Maximize locality
Take 2 And Call Me In The Morning
An Engineer's Guide to Bandwidth - http://developer.yahoo.net/blog/archives/2009/10/a_engineers_gui.html
High Performance Web Sites - Steve Souders
Even Faster Web Sites - Steve Souders
Managing Gigabytes: Compressing and Indexing Documents and Images - Witten, Moffat, Bell
Yahoo Query Language (YQL) - http://developer.yahoo.com/yql/