TimeThief: Leveraging Network Variability to Save Datacenter Energy in On-line Data-Intensive Applications
Balajee Vamanan (Purdue / UIC), Hamza Bin Sohail (Purdue), Jahangir Hasan (Google), T. N. Vijaykumar (Purdue)
2
Big data and OLDI applications
- Massive growth of big data: unstructured data doubles every two years; mostly Internet data, consumed by online, data-intensive (OLDI) applications. Datacenters are the critical computing platform for OLDIs.
- OLDIs are important: many popular Internet applications are OLDIs (e.g., Google Search, Facebook, Twitter). They organize and provide access to Internet data and are vital to many big companies.
3
Challenges to energy management
Traditional energy management does not work for OLDIs:
1. OLDIs are interactive, with strict service-level agreements (SLAs), so requests cannot be batched.
2. Response times and inter-arrival times are short (e.g., 200 ms and ≈ 1 ms for Web Search), so low-power modes (e.g., p-states) are not applicable.
3. OLDIs distribute data over 1000s of servers and each user query searches all servers, so the workload cannot be consolidated onto fewer servers.
Energy management for OLDIs can only slow down servers without violating SLAs.
4
Previous work
Pegasus [ISCA'14] slows down responses at low loads:
- Response time is a more accurate signal than CPU utilization.
- Shifts the latency distribution at lower loads (uses datacenter-wide metrics).
- Achieves load-proportional energy: saves energy at lower loads but does not save much at higher loads, and saves 0% at peak.
Datacenters operate at moderate-to-peak load during the day (diurnal pattern), so savings are desired at high loads.
Pegasus saves substantial energy at low loads, but the savings are limited at high loads.
5
TimeThief: contributions
1. Exploits slack from the network to slow down compute: even at peak load, 80% of flows take only 1/5th of the network budget.
2. Reshapes the response time distribution at all loads by slowing down sub-critical leaf servers for each query.
3. Leverages network signals to determine slack; does not need fine-grained clock synchronization.
4. Employs Earliest Deadline First (EDF) scheduling to decouple critical queries from slowed-down, previously-queued sub-critical queries.
TimeThief identifies sub-critical queries and exploits their slack to save 12% energy even at peak load.
6
Talk Organization
- Introduction: motivation, previous work, our contributions
- Background: OLDI architecture, network variability
- TimeThief: key ideas, slack calculation & application, EDF
- Methodology, Results
7
Background: OLDI architecture
OLDI = OnLine Data-Intensive applications, e.g., Web Search.
1. Online: deadline-bound
2. Data-Intensive: large datasets, highly distributed
Partition-aggregate (tree) structure with request-compute-reply:
- Request: root → leaf
- Compute: at the leaf nodes
- Reply: leaf → root (data)
The root waits until the deadline; dropped (late) replies affect answer quality, e.g., an SLA of 1% misses.
[Figure: partition-aggregate tree (Root → Agg 1 … Agg m → Leaf 1 … Leaf n) with request and response arrows, illustrating the tail latency problem]
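The partition-aggregate pattern on this slide can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names, the fan-out width, and the use of threads in place of real RPCs are all assumptions for illustration.

```python
# Sketch of partition-aggregate: the root fans a query out to all
# leaves, waits until the deadline, and aggregates whatever replies
# arrived in time; late replies are dropped, degrading answer quality.
from concurrent.futures import ThreadPoolExecutor, wait

DEADLINE_S = 0.200  # e.g., a 200 ms overall deadline budget

def search_leaf(leaf_id: int, query: str):
    # stand-in for a real per-leaf index lookup
    return (leaf_id, f"results for {query!r}")

def root_query(query: str, num_leaves: int = 8):
    with ThreadPoolExecutor(max_workers=num_leaves) as pool:
        futures = [pool.submit(search_leaf, i, query) for i in range(num_leaves)]
        done, not_done = wait(futures, timeout=DEADLINE_S)
        for f in not_done:       # drop replies that missed the deadline
            f.cancel()
        return [f.result() for f in done]
```

Because the root must wait for the slowest in-time leaf, one slow leaf out of thousands stretches the overall response time, which is the tail latency problem the next slide quantifies.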
8
OLDI tail latencies and SLA budgets
Tail latency problem: the overall response time is determined by the slowest leaves, so deadline budgets are based on the 99th-99.9th percentiles of individual leaves' replies for one query.
Component budgets: to optimize network and compute separately, the deadline budget is split among the nodes (compute) and the network (e.g., total = 200 ms; flow deadlines 10-20 ms).
[Figure: example split of the 200 ms deadline budget across tree levels — flow budgets of 10-20 ms, with larger compute budgets (30 ms, 80 ms) at the nodes]
9
OLDI traffic characteristics: incast
Many children respond at around the same time, causing incast, which causes packet drops and long tails. OLDI trees have large degree, and the resulting bursts are hard to absorb in buffers:
1. Datacenter switches use shallow buffers for cost.
2. Query multiplexing: the incasts of many in-flight queries collide.
3. Inevitable background traffic worsens incast.
Requests (parent → child) are also affected by reply incast (child → parent) due to root randomization (details in the paper).
10
Network variability and incast
Network variability is significant even with root randomization.
Incast leads to long tails; network budgets based on tail latency are about 6-7x the median network delay.
[Figure: histogram of flow completion time (1-22 ms) vs. percent of flows, showing a long tail caused by queuing in the network and TCP timeouts]
11
Talk Organization
- Introduction: motivation, previous work, our contributions
- Background: OLDI architecture, network variability
- TimeThief: key ideas, slack calculation & application, EDF
- Methodology, Results
12
TimeThief: key ideas
TimeThief exploits per-query, per-leaf slack (e.g., 80% of requests take only 1/5th of the budget). There are three kinds of slack:
- Request slack (before compute)
- Compute slack (future extension)
- Reply slack (after compute)
TimeThief exploits request slack:
- A mechanism to determine request slack
- Machinery to effectively exploit the slack
- EDF scheduling to shield tail (critical) requests
- Intel's Running Average Power Limit (RAPL) to set the power state
TimeThief reshapes the response time distribution using per-query slack; EDF pulls in the tail.
13
Determining request slack
Could we determine slack from timestamps at the parent and the leaf? No: clock skew ≈ slack. Instead, estimate slack from network signals:
1. Explicit Congestion Notification (ECN): the switch marks packets when buffer occupancy exceeds a threshold.
2. TCP timeouts: the sender marks retransmitted packets (lost earlier).
If a leaf sees neither ECN marks nor timeouts:
    request_slack = network_budget - median_latency
else:
    request_slack = 0
How much of this slack can be used to slow down compute? That depends on queuing (load).
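The rule above can be written out directly. This is a minimal sketch, not the authors' implementation: the budget value is taken from the methodology slide (25 ms network budget), while the median latency value is an assumption consistent with the "budget ≈ 6-7x median" observation.

```python
# Sketch: estimating per-request slack from network congestion signals.
NETWORK_BUDGET_MS = 25.0   # request/reply flow budget (methodology slide)
MEDIAN_LATENCY_MS = 4.0    # assumed median network delay (~budget / 6)

def request_slack(saw_ecn_mark: bool, saw_tcp_timeout: bool) -> float:
    """A request that saw neither ECN marks nor timeouts very likely
    finished near the median delay, so the remainder of its network
    budget is slack; any congestion signal means assume zero slack."""
    if not saw_ecn_mark and not saw_tcp_timeout:
        return NETWORK_BUDGET_MS - MEDIAN_LATENCY_MS
    return 0.0
```

Note that this needs no clock synchronization: ECN marks and retransmission flags travel with the packets themselves.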
14
Slowing down based on request slack
Two questions:
1. How much to slow down the current request? We don't know the current request's needs ahead of time, but the compute budget already accounts for the tail.
2. How to account for queueing (load)? Attenuate the slack:
    slowdown = request_slack * scale / compute_budget
where scale depends on load.
A feedback controller dynamically determines scale: it monitors the response times of completed queries every 5 seconds (more details in the paper):
    scale += 0.05 if there is 5% room
    scale -= 0.05 otherwise
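The slowdown rule and controller above can be sketched as follows. The 0.05 step and 5-second epoch are from the slide; interpreting "5% room" as the tail response time sitting at least 5% under the deadline is an assumption about the check's exact form.

```python
# Sketch of TimeThief's slowdown computation and feedback controller.
COMPUTE_BUDGET_MS = 75.0   # compute budget (methodology slide)

def slowdown(request_slack_ms: float, scale: float) -> float:
    """Fraction by which to slow this request's compute,
    attenuated by the load-dependent scale factor."""
    return request_slack_ms * scale / COMPUTE_BUDGET_MS

def update_scale(scale: float, tail_response_ms: float, deadline_ms: float) -> float:
    """Run once per 5 s epoch over completed queries' response times:
    grow scale when the tail leaves >= 5% headroom, else shrink it."""
    if tail_response_ms <= 0.95 * deadline_ms:
        return scale + 0.05
    return max(0.0, scale - 0.05)
```

At high load, queueing pushes tail response times toward the deadline, so the controller shrinks scale and requests are slowed less aggressively.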
15
Earliest-deadline-first scheduling
A sub-critical request (request_slack ≠ 0) could hurt a critical request (request_slack = 0) if the critical request is queued behind the sub-critical one. EDF decouples critical from sub-critical requests.
We compute the deadline at the leaf node as:
    deadline = compute_budget + request_slack
Note: determining slack is what enables EDF!
More details in the paper.
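The EDF queue at a leaf can be sketched with a standard binary heap. This is an illustrative sketch under the deadline formula on this slide, not the authors' scheduler; the class and method names are invented for the example.

```python
# Sketch: EDF request queue at a leaf server. Critical requests
# (zero slack) get the earliest deadlines and so bypass previously
# queued sub-critical requests.
import heapq

COMPUTE_BUDGET_MS = 75.0

class EDFQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # FIFO tie-breaker for equal deadlines

    def push(self, request, request_slack_ms: float):
        deadline = COMPUTE_BUDGET_MS + request_slack_ms
        heapq.heappush(self._heap, (deadline, self._seq, request))
        self._seq += 1

    def pop(self):
        # next request to serve: the one with the earliest deadline
        return heapq.heappop(self._heap)[2]
```

A critical request pushed after a sub-critical one still pops first, because its deadline (75 ms + 0) is earlier than the sub-critical request's (75 ms + slack).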
16
Talk Organization
- Introduction: motivation, previous work, our contributions
- Background: OLDI architecture, network variability
- TimeThief: key ideas, slack calculation & application, EDF
- Methodology, Results
17
Methodology
1. Compute (service time, power): real measurements, feasible at a small scale.
2. Network: tail effects appear only in large clusters, so simulated in ns-3.
Workload: Web Search (Search) from CloudSuite 2.0, with a search index built from Wikipedia; 3000 queries/s at peak load; 90% load with 100 threads per leaf (4 sockets x 12 cores x 2-way SMT ≈ 100).
Deadline budget: overall 125 ms; network (request/reply) 25 ms; compute 75 ms. The budgets are in line with other papers.
SLA of 1% missed deadlines.
18
Methodology (cont.)
Service time and power measurements:
- Service time measured at low load (no queueing) at the leaf server (matches other papers).
- Power measured with RAPL at one leaf server; leaf servers are i.i.d.
Network:
- Fat-tree topology: 64 racks, 16 servers/rack
- 10 Gbps links, 4 MB shared packet buffers
- 200 μs round-trip time (unloaded)
19
Service time and Power
[Figures: (left) cumulative percent of requests vs. service time (0-60 ms); (right) power saving, normalized to baseline, vs. service slowdown (1x-2x, normalized to base)]
20
Response time distribution
TimeThief reshapes distribution at all loads; saves 12% energy at the peak load
[Figure: CDF of response latency (0-140 ms) for Baseline (30%, 90% load), Pegasus (30% load), and TimeThief (30%, 90% load)]
21
Power state distribution
Even at the peak load, TimeThief slows down about 80% of the time using per-query slack
[Figure: distribution of requests across power states (1.2, 1.5, 1.8, 2.2, 2.5 GHz) for Pegasus (P) and TimeThief (T) at 90% and 30% load, Web Search]
22
Conclusion
TimeThief exploits per-query slack:
- Request slack
- Compute slack (future work)
- Reply slack (not easily predictable)
It reshapes the response time distribution at all loads and saves 12% energy at peak load. It leverages network signals to estimate slack, without requiring fine-grained clock synchronization, and employs EDF to decouple critical requests from sub-critical requests.
TimeThief converts OLDIs' performance disadvantage of the latency tail into an energy advantage!
Thank you