![Page 1: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/1.jpg)
Tail Latency: Beyond Queuing Theory
Kathryn S McKinley
Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini
![Page 2: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/2.jpg)
![Page 3: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/3.jpg)
Microsoft’s Quincy datacenter
![Page 4: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/4.jpg)
Servers in US datacenters
[Figure: servers in millions (0 to 20) by year, 2006 to 2020. Series: unbranded 2+ sockets, unbranded 1 socket, branded 2+ sockets, branded 1 socket.]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
![Page 5: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/5.jpg)
Electricity in US datacenters
[Figure: billion kWh per year (0 to 200) by year, 2000 to 2020. Scenarios: 2010 trend, Current trend, Better, Bigger, B+B, Best Practices, BP + Bigger. Annotation: Actual $1 to $6 billion.]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
![Page 6: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/6.jpg)
Datacenter economics quick facts*
• ~ $500,000 Cost of one datacenter
• ~ 3,000,000 US datacenters in 2016
• ~ $1.5 trillion US capital investment to date
• ~ $3,000,000,000 KW dollars / year
• ~ $30,000,000 Savings from 1% less work
• Lots more by not building a datacenter
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
![Page 7: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/7.jpg)
Improve efficiency!
![Page 8: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/8.jpg)
![Page 9: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/9.jpg)
Tail Latency Matters
Two second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
400 millisecond delay decreased searches/user by 0.59%. [Jack Brutlag, Google]
TOP PRIORITY
![Page 10: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/10.jpg)
Server architecture
[Diagram: client, aggregator, workers.]
![Page 11: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/11.jpg)
Characteristics of interactive services
[Figure: latency CDF for a latency-critical (LC) service; x-axis: latency (ms), y-axis: percentage of requests.]
• Bursty, diurnal
• CDF changes slowly
• Slowest server dictates tail
• Orders of magnitude difference between average & 99+ %ile
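A short sketch of why the slowest server dictates the tail: if a request fans out to several workers and completes only when all of them reply, the chance that at least one worker is in its own slow tail grows quickly with the fan-out. The code assumes worker latencies are independent, which is a simplifying assumption and not a claim from the slides; the function name is illustrative.

```python
def prob_request_hits_tail(num_workers: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `num_workers` independent workers
    responds slower than its own `percentile` latency."""
    return 1.0 - percentile ** num_workers


# Fan-out of 1, 10 and 100 workers (illustrative values).
for n in (1, 10, 100):
    print(n, round(prob_request_hits_tail(n), 3))
```

Under this assumption, a request that waits on 100 workers sees at least one 99th-percentile-slow worker roughly 63% of the time.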
![Page 12: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/12.jpg)
Client side observations
[Figure: latency CDF; x-axis: latency (ms), y-axis: percentage of requests. Annotation on the tail: noise?]
![Page 13: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/13.jpg)
Client side observations
[Figure: latency CDF; x-axis: latency (ms), y-axis: percentage of requests. Annotation on the tail: noise? Marked regions: 10% of requests, 5% of requests.]
Solution to noise: Replication
• All requests?
• CDF shows cost & potential
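One way to read "the CDF shows cost and potential" is that a latency threshold taken from the CDF fixes how many requests get replicated: only requests still outstanding past the threshold are reissued. The sketch below is a minimal illustration under that reading, not the authors' implementation; the function names and the latency data are made up.

```python
import math


def reissue_threshold(latencies_ms, replicate_fraction):
    """Latency above which roughly `replicate_fraction` of requests fall
    (nearest-rank percentile over the observed latencies)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil((1.0 - replicate_fraction) * len(ordered))
    return ordered[max(rank, 1) - 1]


def replicated_share(latencies_ms, threshold_ms):
    """Fraction of requests still outstanding at the threshold,
    i.e. the replication cost."""
    slow = sum(1 for x in latencies_ms if x > threshold_ms)
    return slow / len(latencies_ms)


# Made-up latencies, for illustration only.
observed = [float(i) for i in range(1, 101)]
t5 = reissue_threshold(observed, 0.05)
t10 = reissue_threshold(observed, 0.10)
print(t5, replicated_share(observed, t5))
print(t10, replicated_share(observed, t10))
```

A lower threshold replicates more requests (higher cost) but can cut into more of the tail (higher potential).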
![Page 14: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/14.jpg)
Client side observations
[Figure: latency CDF; x-axis: latency (ms), y-axis: percentage of requests. Annotation on the tail: not noise?]
![Page 15: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/15.jpg)
Roadmap
What’s in the tail?
Continuous profiling to diagnose the tail
Real problems
• Noise: replication
• Work: parallelism
• Other opportunities
Still poor utilization due to bursty diurnal workload
• Colocation for utilization without impacting tail latency
Opportunities in hardware/software codesign
![Page 16: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/16.jpg)
Simplified life of a request
[Diagram: a request passes through several workers before the response returns; each worker is a stack of application, Java VM, and worker OS / VM.]
![Page 17: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/17.jpg)
Prior state of the art
Dick Sites’s talk: https://www.youtube.com/watch?v=QBu2Ae8-8LM
![Page 18: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/18.jpg)
Dick Sites & team
• Hand instrument system
• 1% on-line budget
• Sample, but tails are rare…
• Off-line schematics
• Have insight
• Improve the system
![Page 19: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/19.jpg)
![Page 20: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/20.jpg)
Dick Sites & team
Before (✗):
• Hand instrument system
• 1% on-line budget
• Sample, but tails are rare…
• Off-line schematics
• Have insight
• Improve the system
Now:
• Automated instrumentation
• 1% on-line budget
• Continuous on-line profiling
• Off-line schematics
• Have insight
• Improve the system
• + On-line optimization
![Page 21: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/21.jpg)
Automated cycle-level on-line profiling [ISCA’15 (Top Picks HM), ATC’16]
Insight: Hardware & software generate signals (counters, tags)

                        hardware signals   software signals
performance counters           ✓                  ✓
memory locations               ✓                  ✓
![Page 22: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/22.jpg)
SHIM Design [ISCA’15 (Top Picks HM), ATC’16]
![Page 23: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/23.jpg)
Observe global state from other core
[Figure: LLC misses per cycle.]

    while (true):
        for counter in LLC misses, cycles:
            buf[i++] = readCounter(counter)
![Page 24: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/24.jpg)
Observe local state with SMT hardware
[Figure: IPC traces (0 to 4) for HT1 IPC, Core IPC, and HT2 SHIM IPC on hardware threads HT1 and HT2.]
HT1 IPC = Core IPC - HT2 SHIM IPC

    while (true):
        for counter in HT2 SHIM, Core, Cycles:
            buf[i++] = readCounter(counter);
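The subtraction HT1 IPC = Core IPC - HT2 SHIM IPC can be sketched from the deltas of three counters between two consecutive samples. This is a minimal illustration rather than SHIM's code; the `Sample` type, its field names, and the counter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    core_instructions: int  # instructions retired by the whole core (both lanes)
    shim_instructions: int  # instructions retired by the SHIM lane (HT2)
    cycles: int             # cycle counter


def ht1_ipc(prev: Sample, cur: Sample) -> float:
    """IPC of the observed lane (HT1), derived by subtracting the
    profiler's own IPC from the whole-core IPC."""
    cycles = cur.cycles - prev.cycles
    core_ipc = (cur.core_instructions - prev.core_instructions) / cycles
    shim_ipc = (cur.shim_instructions - prev.shim_instructions) / cycles
    return core_ipc - shim_ipc


# Made-up counter values, for illustration only.
a = Sample(core_instructions=1_000, shim_instructions=400, cycles=10_000)
b = Sample(core_instructions=1_300, shim_instructions=500, cycles=10_100)
print(ht1_ipc(a, b))
```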
![Page 25: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/25.jpg)
Correlate hardware & software events
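[Figure: IPC traces (0 to 4) for HT1 IPC, Core IPC, and HT2 SHIM IPC, aligned with methods A(), B(), C() (phases 1, 2, 3) on hardware threads HT1 and HT2.]

    while (true):
        for counter in HT2 SHIM, Core, cycles:
            buf[i++] = readCounter(counter);
        tid = thread on HT1
        buf[i++] = tid.method;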
![Page 26: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/26.jpg)
Fidelity
![Page 27: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/27.jpg)
Raw samples
[Figure: distribution of raw samples; x-axis: IPC (log scale, 0.01 to 1000), y-axis: % of samples (log scale, 1e-05 to 1).]
![Page 28: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/28.jpg)
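Problem: samples are not atomic
Counters: C = cycles, R = retired instructions
$\mathrm{IPC} = (R_t - R_{t-1}) / (C_t - C_{t-1})$
[Diagram: timeline with counter reads (R0, C0), (R1, C1), (R2, C2), (R3, C3) producing IPC1, IPC2, IPC3; one of the three samples is marked ✗ and two are marked ✓.]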
![Page 29: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/29.jpg)
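Solution: use clock as ground truth
$\mathrm{CPC} = (C^{e}_{t} - C^{e}_{t-1}) / (C^{s}_{t} - C^{s}_{t-1})$, this should be 1!
Each sample reads, in order: Cs_t, R_t, C_t, Ce_t (Cs0 R0 C0 Ce0, Cs1 R1 C1 Ce1, Cs2 R2 C2 Ce2, Cs3 R3 C3 Ce3)
CPC1 = 1.0 +/- 1%, CPC2 = 1.0 +/- 1%, CPC3 != 1.0 +/- 1%
[Diagram: timeline of IPC1, IPC2, IPC3 with one sample marked ✗ and two marked ✓.]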
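The clock-as-ground-truth filter can be sketched as follows: each sample brackets the counter reads with a clock read at the start (Cs) and at the end (Ce), and a sample is kept only when CPC, the ratio of end-clock delta to start-clock delta, is within 1% of 1. This is a minimal illustration of the idea on the slide, not the SHIM source; the `Read` type, the function name, and the counter values are made up.

```python
from typing import List, NamedTuple, Tuple


class Read(NamedTuple):
    cs: int  # clock read at the start of the sample
    r: int   # retired instructions
    c: int   # cycles
    ce: int  # clock read at the end of the sample


def filtered_ipc(reads: List[Read], tolerance: float = 0.01) -> List[Tuple[int, float]]:
    """Return (sample index, IPC) for samples whose CPC is within 1 +/- tolerance."""
    kept = []
    for t in range(1, len(reads)):
        prev, cur = reads[t - 1], reads[t]
        cpc = (cur.ce - prev.ce) / (cur.cs - prev.cs)
        if abs(cpc - 1.0) <= tolerance:
            ipc = (cur.r - prev.r) / (cur.c - prev.c)
            kept.append((t, ipc))
    return kept


# Made-up reads: the last sample was delayed between its start and end clock reads.
reads = [
    Read(cs=0, r=0, c=2, ce=4),
    Read(cs=1000, r=1500, c=1002, ce=1004),
    Read(cs=2000, r=2700, c=2002, ce=2004),
    Read(cs=3000, r=4200, c=3002, ce=3300),
]
print(filtered_ipc(reads))
```

The third interval is dropped because its end-clock delta (1296) disagrees with its start-clock delta (1000), which signals that the counter reads in that sample were not taken close together.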
![Page 30: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/30.jpg)
Filtering Lusearch IPC samples
[Figure: log-log panels of % of samples (1e-05 to 1) versus value (0.01 to 1000). Series: raw IPC, raw CPC, filtered IPC, filtered CPC in [0.99, 1.01].]
![Page 31: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/31.jpg)
IPC of individual methods in Lucene
[Figure: IPC (0 to 1.6) of the top 10 methods (74% of total execution time) at three sampling rates: default 1 KHz, maximum 100 KHz, SHIM 10 MHz.]
![Page 32: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/32.jpg)
Overheads from other core
[Figure: execution time normalized to without SHIM (0 to 4) when sampling method and loop IDs; series: 30 cycles, 1213 cycles.]
Overheads from write invalidations
3 MHz: 1+ order of magnitude over interrupt ‘maximum’
113 MHz: 3+ orders of magnitude over interrupt ‘maximum’
![Page 33: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/33.jpg)
Understanding Tail Latency
![Page 34: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/34.jpg)
SHIM signals
Requests
• thread ids
• request id (software configured)
• time stamps, PC
System threads
• thread ids
• time stamp, PC
![Page 35: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/35.jpg)
All requests
[Figure: latency (ms, 0 to 120) by request group, from the slowest 1% to the fastest 1%. Series: client latency, average queueing time.]
![Page 36: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/36.jpg)
Longest 200 requests
[Figure: stacked latency (ms, 0 to 120) of the top 200 requests. Components: network and networking queueing time (network & other), idle time, CPU time (CPU work), dispatch queueing time (queuing at worker). Annotations: "not noise"; "noise, bursts? queuing?".]
![Page 37: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/37.jpg)
Parallelism
Parallelism historically for throughput
Idea: Parallelism for tail latency
[Figure: 99th percentile latency (ms, 0 to 1500) versus Lucene RPS (0 to 50). Series: Sequential 99th, 4 way 99th. Annotations: improves at low load; degrades at high load.]
![Page 38: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/38.jpg)
Parallelism
Parallelism historically for throughput
Idea: Parallelism for tail latency
Insight: Long requests reveal themselves
Approach: Incrementally add parallelism to long requests (the tail) based on request progress & load
Few-to-Many Dynamic Parallelism [ASPLOS’15]
![Page 39: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/39.jpg)
Evaluation
2x8 64 bit 2.3 GHz Xeon, 64 GB
[Figure: tail latency (ms, 300 to 1500) versus requests per second (30 to 48). Series: Few to Many, Sequential. Annotations: reduce tail latency; buy fewer servers.]
![Page 40: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/40.jpg)
Queuing theory
Optimizing average latency maximizes throughput
But not the tail!
Shortening the tail reduces queuing latency
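One standard way to see why shortening the tail reduces queuing latency is the Pollaczek-Khinchine formula for an M/G/1 queue (Poisson arrivals, a single FIFO server): the mean wait is Wq = λ E[S²] / (2 (1 - ρ)) with ρ = λ E[S]. The slides do not name a queueing model, so M/G/1 is an assumption made here for illustration, and the service-time distributions below are made up. They share the same mean but differ in their tail.

```python
def mean_queue_wait(arrival_rate, service_times, probabilities):
    """Mean waiting time in queue for an M/G/1 FIFO queue
    (Pollaczek-Khinchine): Wq = lambda * E[S^2] / (2 * (1 - rho))."""
    mean = sum(p * s for s, p in zip(service_times, probabilities))
    second_moment = sum(p * s * s for s, p in zip(service_times, probabilities))
    rho = arrival_rate * mean
    if rho >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))


rate = 0.05  # requests per ms, so utilization is 0.5 for a 10 ms mean

# Same 10 ms mean service time, without and with a long tail.
no_tail = mean_queue_wait(rate, [10.0], [1.0])
long_tail = mean_queue_wait(rate, [5.0, 105.0], [0.95, 0.05])
print(no_tail, long_tail)
```

With these numbers the mean wait grows from 5 ms to about 28.75 ms even though the mean service time and the utilization are unchanged, because the second moment of the service time, which the tail dominates, appears in the numerator.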
![Page 41: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/41.jpg)
Longest 200 requests
[Figure: stacked latency (ms, 0 to 120) of the top 200 requests. Components: network and networking queueing time (network & other), idle time, CPU time (CPU work), dispatch queueing time (queuing at worker). Annotations: ✔; "noise, bursts? queuing?".]
![Page 42: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/42.jpg)
Correlate bad requests with system state
[Diagram: a request and GC threads running across CPU0, CPU1, …, CPUN.]
Use time stamps to post-process traces
[Diagram: trace for CPU 0 over time, with time stamps t0, t1, …, ti, tj, tk, tl and the thread (thx) observed at each.]
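Post-processing by time stamp can be sketched as an interval query over the per-CPU samples: given the start and end time of a slow request, list which other threads were observed on any CPU in that window. The trace layout, names, and values below are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, NamedTuple


class TraceSample(NamedTuple):
    timestamp: int  # time stamp recorded with the sample
    cpu: int        # CPU the sample was taken on
    thread: str     # thread observed on that CPU


def threads_during(trace: List[TraceSample], start: int, end: int,
                   exclude: str) -> Dict[str, int]:
    """Count samples of other threads observed on any CPU in [start, end]."""
    counts: Dict[str, int] = {}
    for s in trace:
        if start <= s.timestamp <= end and s.thread != exclude:
            counts[s.thread] = counts.get(s.thread, 0) + 1
    return counts


# Made-up trace: a request on CPU 0 overlaps with GC threads on CPUs 1 and 2.
trace = [
    TraceSample(10, 0, "request-17"),
    TraceSample(12, 1, "GC"),
    TraceSample(14, 2, "GC"),
    TraceSample(16, 0, "request-17"),
    TraceSample(30, 1, "GC"),
]
print(threads_during(trace, start=10, end=16, exclude="request-17"))
```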
![Page 43: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/43.jpg)
Recap & what’s next
SHIM continuous profiling to diagnose the tail
• Noise: replication
• Work: parallelism
• Scalability bottlenecks
Continuous monitoring suggests dynamic optimizations but… still poor utilization due to bursty diurnal workload
• Colocation
Looking forward
![Page 44: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/44.jpg)
Queuing theory
Over-provision for the maximum burst; otherwise queuing delay degrades both average and tail latency
![Page 45: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/45.jpg)
High Responsiveness, Low Utilization
Intel Xeon-D 1540 Broadwell; 1 core, no SMT; latency-critical (LC) workload: Lucene
Service Level Objective: 100 ms SLO
[Figure: latency (ms, 0 to 200) versus RPS (0 to 180) for Lucene alone. Series: 50%ile, 95%ile, 99%ile.]
[Figure: 99%ile latency (ms, 0 to 200) for Lucene alone and utilization / no SMT (0 to 2) versus RPS (0 to 180). Annotation: 34% w/ SMT, 67% no SMT.]
Luiz André Barroso, Urs Hölzle, “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”: “Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range.”
![Page 46: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/46.jpg)
Soak up Slack with Batch?
Goal: No tail latency impact
Co-running on different cores, SMT turned off [TOCS’16, EuroSys’14]: requires idle cores, in part because OS descheduling is slow
[Diagram: LC and batch on separate cores that share a cache, with idle cores between them.]
Co-running on same core in SMT lanes
[Diagram: cores with two SMT lanes each (t1, t2) and a shared cache.]
![Page 47: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/47.jpg)
SMT Co-Runner
1 core, 2 SMT lanes
Co-runners:

    IPC 1.0:   while(1);
    IPC 0.01:  while(1){ movnti(); mfence(); }

[Figure: 99%ile latency (ms, 0 to 600) versus RPS (10 to 100). Series: Lucene alone, with IPC 1.0, with IPC 0.01. The SLO is marked.]
Even IPC 0.01 violates SLO at low load!
[Figure: utilization / no SMT (0 to 2) versus RPS (10 to 100).]
Great utilization!
![Page 48: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/48.jpg)
Simultaneous Multithreading OFF
[Diagram: over time, Lucene alone uses the core's functional units, load store queue, and issue logic lanes.]
![Page 49: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/49.jpg)
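Simultaneous Multithreading ON
Active SMT lanes share critical resources
[Diagram: over time, Lucene and the IPC 0.01 co-runner occupy lane 1 and lane 2. Resources: functional units, load store queue, issue logic lanes. Sharing modes: dynamically shared, statically partitioned, round-robin shared.]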
![Page 50: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/50.jpg)
Principled Borrowing
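Batch borrows hardware when LC is idle
Batch releases hardware when LC is busy
Can we implement principled borrowing on current hardware?
[Diagram: timeline of the LC lane alternating between busy and idle, with batch running in the idle periods.]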
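The borrowing rule can be illustrated with a small tick-based model: batch runs only on ticks where the LC lane is idle, and resumes a fixed number of ticks after LC goes idle. The tick granularity, the `wake_delay` parameter, and the timeline are assumptions for illustration; the sketch models the policy only, not the hardware mechanism.

```python
def batch_timeline(lc_busy, wake_delay=1):
    """Per-tick batch state: batch runs only while the LC lane is idle,
    and only after LC has been idle for more than `wake_delay` ticks."""
    batch = []
    idle_run = 0
    for busy in lc_busy:
        idle_run = 0 if busy else idle_run + 1
        batch.append(idle_run > wake_delay)
    return batch


def utilization(lc_busy, batch):
    """Fraction of ticks in which some lane does useful work."""
    return sum(1 for l, b in zip(lc_busy, batch) if l or b) / len(lc_busy)


# Made-up LC activity: True = busy, False = idle.
lc = [True, True, False, False, False, False, True, False, False, False]
batch = batch_timeline(lc, wake_delay=1)
print(batch)
print(utilization(lc, [False] * len(lc)), utilization(lc, batch))
```

In this model batch never overlaps a busy LC tick, so LC latency is untouched while utilization rises; the cost of the rule shows up only as the idle ticks lost to `wake_delay`.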
![Page 51: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/51.jpg)
Hardware is Ready — Software is Not
[Diagram: timeline of the LC lane and the batch lane.]
Batch lane calls “mwait”
Thread sleeps, releasing hardware to OS (~2K cycles)
OS schedules batch lane with any ready job
OS supports thread sleeping, but not hardware sleeping: release SMT hardware to other lane
![Page 52: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/52.jpg)
nanonap()
Thread invoking nanonap releases SMT hardware without releasing SMT context

    per_cpu_variable: nap_flag;
    void nanonap() {
        enter_kernel();
        disable_preemption();
        my_nap_flag = this_cpu_flag(nap_flag);
        monitor(my_nap_flag);
        mwait();
        enable_preemption();
        leave_kernel();
    }

OS can interrupt & wake up thread
OS cannot schedule hardware context
![Page 53: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/53.jpg)
Elfen Scheduler
Instrument batch workloads to detect LC threads & nap
Bind latency-critical threads to LC lane
Bind batch threads to batch lane
No change to latency-critical threads
![Page 54: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/54.jpg)
Elfen Scheduler
[Diagram: timeline of the LC lane and the batch lane, marking steps 1, 2, 3 and the nanonap() call.]
1. Batch thread borrows resources, continuously checks LC lane status
2. LC starts, batch calls nanonap() to release SMT hardware resources
3. OS touches nap_flag to wake up batch thread

    /* fast path check injected into method body */
    check:
        if (!request_lane_idle) slow_path();
    slow_path() { nanonap(); }

    /* maps lane IDs to the running task */
    exposed SHIM signal: cpu_task_map
    task_switch(task T) { cpu_task_map[thiscpu] = T; }
    idle_task() {
        // wake up any waiting batch thread
        update_nap_flag_of_partner_lane();
        ......
    }
![Page 55: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/55.jpg)
Results: Borrow Idle
1 core, 2 SMT lanes
[Figure: 99%ile latency (ms, 0 to 140) versus RPS (10 to 100). Series: Lucene alone, and Lucene with each batch co-runner: antlr, bloat, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan.]
[Figure: utilization / no SMT (0 to 2) versus RPS (10 to 100). Annotation: increased utilization 10x - 1.5x.]
7 cores, 2x7 SMT lanes
[Figure: 99%ile latency (ms, 0 to 140) and utilization / no SMT (0 to 2) versus RPS (200 to 1000). Annotation: 4x - 0.19x.]
Same latency!
![Page 56: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/56.jpg)
Exciting times
![Page 57: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/57.jpg)
Hardware heterogeneity: opportunity & challenge
Processors: big, little, custom
Memory: DDR, NVM, flash, PIM paired
![Page 58: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/58.jpg)
Heterogeneous workload!
[Figure: latency CDF; x-axis: latency (ms), y-axis: percentage of requests.]
![Page 59: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/59.jpg)
Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]
Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy with a fixed power budget & variable request demand
Slow-to-Fast: sacrifice average a bit to reduce energy & tail latency
![Page 60: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/60.jpg)
![Page 61: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/61.jpg)
![Page 62: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/62.jpg)
Thank you
![Page 63: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/63.jpg)
Extras
![Page 64: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/64.jpg)
Online self scheduling
Interval0 = 0; Interval1,2 = 50, 100

|requests|   first step                second step
≤ 2          @0, parallelism = 3
3            @0, parallelism = 1       @50, parallelism = 3
4-6          @50, parallelism = 1      @100, parallelism = 3
≥ 7          @exit, parallelism = 1    @100, parallelism = 3
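The schedule on this slide can be read as a lookup from the number of requests in the system and the time a request has spent there to a degree of parallelism. The sketch below encodes that reading; it is an interpretation and not the published Few-to-Many implementation. In particular, treating "@exit" as "start once another request exits" and treating times as milliseconds since arrival are assumptions, and all names are illustrative.

```python
import math

# (max requests in system, [(start in ms, or None for "@exit", parallelism), ...])
SCHEDULE = [
    (2, [(0, 3)]),
    (3, [(0, 1), (50, 3)]),
    (6, [(50, 1), (100, 3)]),
    (math.inf, [(None, 1), (100, 3)]),
]


def parallelism(num_requests, elapsed_ms, other_request_exited=False):
    """Degree of parallelism for a request; 0 means it has not started yet."""
    steps = next(s for max_load, s in SCHEDULE if num_requests <= max_load)
    degree = 0
    for start, par in steps:
        # A None start models "@exit": the step fires when another request exits.
        started = other_request_exited if start is None else elapsed_ms >= start
        if started:
            degree = max(degree, par)
    return degree


for load, elapsed in [(1, 0), (3, 10), (3, 50), (5, 10), (5, 60), (5, 100)]:
    print(load, elapsed, parallelism(load, elapsed))
```

Read this way, a request gets more threads either because the system is lightly loaded or because it has already run long enough to reveal itself as part of the tail.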
![Page 65: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/65.jpg)
Software & hardware
Lucene open source enterprise search
• Wikipedia English 10 GB index of 33 million pages
• 10k queries from Lucene nightly tests
Bing web search with one Index Serving Node (ISN)
• 160 GB web index in SSD, 17 GB cache
• 30k Bing user queries
Hardware
• 2x8 64 bit 2.3 GHz Xeon, 64 GB, Windows
• 15 request servers, 1 core issues requests
• Target parallelism = 24 threads
![Page 66: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/66.jpg)
Policies
• Sequential
• N way: single degree of parallelism for each request
• Adaptive: select parallelism degree when request starts using system load [EUROSYS’13]
• Request Clairvoyant: parallelizes long requests by perfect prediction of tail
• FM (Few to Many): incrementally add parallelism
![Page 67: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/67.jpg)
Fixed interval: Add thread every X ms
[Figure: tail latency (ms, 300 to 1500) versus Lucene RPS (30 to 48). Series: Sequential, 4 way, Fixed interval 20 ms, Fixed interval 100 ms, Fixed interval 500 ms.]
Long intervals good at high load
Short intervals good at low load
![Page 68: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/68.jpg)
Load variation
Alternate between high & low load
FM adapts to bursts with low variance
[Figure: tail latency (ms, 0 to 1600) for Lucene across load phases Low, Low, High, High. Series: Sequential, 2 way, 4 way, FM.]
![Page 69: Tail Latency: Beyond Queuing Theory · Tail Latency: Beyond Queuing Theory Kathryn S McKinley Xi Yang, Stephen M Blackburn, Sameh Elnikety, Yuxiong He, Ricardo Bianchini](https://reader036.fdocuments.us/reader036/viewer/2022071218/604e4b58ea3c9e45d6145212/html5/thumbnails/69.jpg)
Fewer servers: Total Cost of ownership
[Figure: tail latency (ms, 300 to 1500) versus Lucene RPS (30 to 48), Sequential versus FM. Annotation: 21%.]
[Figure: tail latency versus Lucene RPS (30 to 48), Adaptive versus FM. Annotation: 9%.]