TimeThief: Leveraging Network Variability to Save Datacenter Energy in On-line Data-Intensive Applications
Balajee Vamanan (Purdue / UIC), Hamza Bin Sohail (Purdue), Jahangir Hasan (Google), T. N. Vijaykumar (Purdue)
2
Big data and OLDI applications
- Massive growth of big data: unstructured data doubles every two years; mostly Internet data, consumed by online, data-intensive (OLDI) applications. Datacenters are the critical computing platform for OLDIs.
- OLDIs are important: many popular Internet applications are OLDIs (e.g., Google Search, Facebook, Twitter). They organize and provide access to Internet data and are vital to many big companies.
3
Challenges to energy management
Traditional energy management does not work for OLDIs:
1. OLDIs are interactive, with strict service-level agreements (SLAs), so requests cannot be batched.
2. Response times and inter-arrival times are short (e.g., 200 ms and ≈ 1 ms for Web Search), so low-power modes (e.g., p-states) are not applicable.
3. OLDIs distribute data over 1000s of servers and each user query searches all servers, so the workload cannot be consolidated onto fewer servers.
Energy management for OLDIs can only slow down servers without violating SLAs.
4
Previous work
Pegasus [ISCA'14] slows down responses at low loads:
- Response time is a more accurate signal than CPU utilization.
- Shifts the latency distribution at lower loads (uses datacenter-wide metrics).
- Achieves load-proportional energy: saves energy at lower loads but does not save much at higher loads, and saves 0% at peak.
Datacenters operate at moderate-to-peak load during the day (diurnal pattern), so savings are desired at high loads.
Pegasus saves substantial energy at low loads, but the savings are limited at high loads.
5
TimeThief: contributions
1. Exploits slack from the network to slow down compute: even at peak load, 80% of flows take only 1/5th of the network budget.
2. Reshapes the response time distribution at all loads by slowing down sub-critical leaf servers for each query.
3. Leverages network signals to determine slack; does not need fine-grained clock synchronization.
4. Employs Earliest Deadline First (EDF) scheduling to decouple critical queries from slowed-down, previously-queued sub-critical queries.
TimeThief identifies sub-critical queries and exploits their slack to save 12% energy even at peak load.
6
Talk Organization
- Introduction: motivation, previous work, our contributions
- Background: OLDI architecture, network variability
- TimeThief: key ideas, slack calculation & application, EDF
- Methodology, Results
7
Background: OLDI architecture
OLDI = OnLine Data-Intensive applications, e.g., Web Search.
1. Online: deadline-bound
2. Data-Intensive: large datasets, highly distributed
Partition-aggregate (tree) structure with request-compute-reply:
- Request: root → leaf
- Compute: at the leaf nodes
- Reply: leaf → root (data)
The root waits until the deadline; dropped (late) replies affect answer quality, e.g., an SLA of 1% misses.
[Figure: partition-aggregate tree (Root → Agg 1 … Agg m → Leaf 1 … Leaf n) with request and response arrows, illustrating the tail latency problem]
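The partition-aggregate pattern on this slide can be sketched as follows. This is an illustrative sketch, not the authors' code; the function names, the fan-out width, and the use of threads in place of real RPCs are all assumptions for illustration.

```python
# Sketch of partition-aggregate: the root fans a query out to all
# leaves, waits until the deadline, and aggregates whatever replies
# arrived in time; late replies are dropped, degrading answer quality.
from concurrent.futures import ThreadPoolExecutor, wait

DEADLINE_S = 0.200  # e.g., a 200 ms overall deadline budget

def search_leaf(leaf_id: int, query: str):
    # stand-in for a real per-leaf index lookup
    return (leaf_id, f"results for {query!r}")

def root_query(query: str, num_leaves: int = 8):
    with ThreadPoolExecutor(max_workers=num_leaves) as pool:
        futures = [pool.submit(search_leaf, i, query) for i in range(num_leaves)]
        done, not_done = wait(futures, timeout=DEADLINE_S)
        for f in not_done:       # drop replies that missed the deadline
            f.cancel()
        return [f.result() for f in done]
```

Because the root must wait for the slowest in-time leaf, one slow leaf out of thousands stretches the overall response time, which is the tail latency problem the next slide quantifies.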
8
OLDI tail latencies and SLA budgets
Tail latency problem: the overall response time is determined by the slowest leaves, so deadline budgets are based on the 99th-99.9th percentiles of individual leaves' replies for one query.
Component budgets: to optimize network and compute separately, the deadline budget is split among the nodes (compute) and the network (e.g., total = 200 ms; flow deadlines 10-20 ms).
[Figure: example split of the 200 ms deadline budget across tree levels — flow budgets of 10-20 ms, with larger compute budgets (30 ms, 80 ms) at the nodes]
9
OLDI traffic characteristics: incast
Many children respond at around the same time, causing incast, which causes packet drops and long tails. OLDI trees have large degree, and the resulting bursts are hard to absorb in buffers:
1. Datacenter switches use shallow buffers for cost.
2. Query multiplexing: the incasts of many in-flight queries collide.
3. Inevitable background traffic worsens incast.
Requests (parent → child) are also affected by reply incast (child → parent) due to root randomization (details in the paper).
10
Network variability and incast
Network variability is significant even with root randomization.
Incast leads to long tails; network budgets based on tail latency are about 6-7x the median network delay.
[Figure: histogram of flow completion time (1-22 ms) vs. percent of flows, showing a long tail caused by queuing in the network and TCP timeouts]
11
Talk Organization
- Introduction: motivation, previous work, our contributions
- Background: OLDI architecture, network variability
- TimeThief: key ideas, slack calculation & application, EDF
- Methodology, Results
12
TimeThief: key ideas
TimeThief exploits per-query, per-leaf slack (e.g., 80% of requests take only 1/5th of the budget). There are three kinds of slack:
- Request slack (before compute)
- Compute slack (future extension)
- Reply slack (after compute)
TimeThief exploits request slack:
- A mechanism to determine request slack
- Machinery to effectively exploit the slack
- EDF scheduling to shield tail (critical) requests
- Intel's Running Average Power Limit (RAPL) to set the power state
TimeThief reshapes the response time distribution using per-query slack; EDF pulls in the tail.
13
Determining request slack
Could we determine slack from timestamps at the parent and the leaf? No: clock skew ≈ slack. Instead, estimate slack from network signals:
1. Explicit Congestion Notification (ECN): the switch marks packets when buffer occupancy exceeds a threshold.
2. TCP timeouts: the sender marks retransmitted packets (lost earlier).
If a leaf sees neither ECN marks nor timeouts:
    request_slack = network_budget - median_latency
else:
    request_slack = 0
How much of this slack can be used to slow down compute? That depends on queuing (load).
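The rule above can be written out directly. This is a minimal sketch, not the authors' implementation: the budget value is taken from the methodology slide (25 ms network budget), while the median latency value is an assumption consistent with the "budget ≈ 6-7x median" observation.

```python
# Sketch: estimating per-request slack from network congestion signals.
NETWORK_BUDGET_MS = 25.0   # request/reply flow budget (methodology slide)
MEDIAN_LATENCY_MS = 4.0    # assumed median network delay (~budget / 6)

def request_slack(saw_ecn_mark: bool, saw_tcp_timeout: bool) -> float:
    """A request that saw neither ECN marks nor timeouts very likely
    finished near the median delay, so the remainder of its network
    budget is slack; any congestion signal means assume zero slack."""
    if not saw_ecn_mark and not saw_tcp_timeout:
        return NETWORK_BUDGET_MS - MEDIAN_LATENCY_MS
    return 0.0
```

Note that this needs no clock synchronization: ECN marks and retransmission flags travel with the packets themselves.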
14
Slowing down based on request slack
Two questions:
1. How much to slow down the current request? We don't know the current request's needs ahead of time, but the compute budget already accounts for the tail.
2. How to account for queueing (load)? Attenuate the slack:
    slowdown = request_slack * scale / compute_budget
where scale depends on load.
A feedback controller dynamically determines scale: it monitors the response times of completed queries every 5 seconds (more details in the paper):
    scale += 0.05 if there is 5% room
    scale -= 0.05 otherwise
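The slowdown rule and controller above can be sketched as follows. The 0.05 step and 5-second epoch are from the slide; interpreting "5% room" as the tail response time sitting at least 5% under the deadline is an assumption about the check's exact form.

```python
# Sketch of TimeThief's slowdown computation and feedback controller.
COMPUTE_BUDGET_MS = 75.0   # compute budget (methodology slide)

def slowdown(request_slack_ms: float, scale: float) -> float:
    """Fraction by which to slow this request's compute,
    attenuated by the load-dependent scale factor."""
    return request_slack_ms * scale / COMPUTE_BUDGET_MS

def update_scale(scale: float, tail_response_ms: float, deadline_ms: float) -> float:
    """Run once per 5 s epoch over completed queries' response times:
    grow scale when the tail leaves >= 5% headroom, else shrink it."""
    if tail_response_ms <= 0.95 * deadline_ms:
        return scale + 0.05
    return max(0.0, scale - 0.05)
```

At high load, queueing pushes tail response times toward the deadline, so the controller shrinks scale and requests are slowed less aggressively.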
15
Earliest-deadline-first scheduling
A sub-critical request (request_slack ≠ 0) could hurt a critical request (request_slack = 0) if the critical request is queued behind the sub-critical one. EDF decouples critical from sub-critical requests.
We compute the deadline at the leaf node as:
    deadline = compute_budget + request_slack
Note: determining slack is what enables EDF!
More details in the paper.
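The EDF queue at a leaf can be sketched with a standard binary heap. This is an illustrative sketch under the deadline formula on this slide, not the authors' scheduler; the class and method names are invented for the example.

```python
# Sketch: EDF request queue at a leaf server. Critical requests
# (zero slack) get the earliest deadlines and so bypass previously
# queued sub-critical requests.
import heapq

COMPUTE_BUDGET_MS = 75.0

class EDFQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # FIFO tie-breaker for equal deadlines

    def push(self, request, request_slack_ms: float):
        deadline = COMPUTE_BUDGET_MS + request_slack_ms
        heapq.heappush(self._heap, (deadline, self._seq, request))
        self._seq += 1

    def pop(self):
        # next request to serve: the one with the earliest deadline
        return heapq.heappop(self._heap)[2]
```

A critical request pushed after a sub-critical one still pops first, because its deadline (75 ms + 0) is earlier than the sub-critical request's (75 ms + slack).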
16
Talk Organization
- Introduction: motivation, previous work, our contributions
- Background: OLDI architecture, network variability
- TimeThief: key ideas, slack calculation & application, EDF
- Methodology, Results
17
Methodology
1. Compute (service time, power): real measurements, feasible at a small scale.
2. Network: tail effects appear only in large clusters, so simulated in ns-3.
Workload: Web Search (Search) from CloudSuite 2.0, with a search index built from Wikipedia; 3000 queries/s at peak load; 90% load with 100 threads per leaf (4 sockets x 12 cores x 2-way SMT ≈ 100).
Deadline budget: overall 125 ms; network (request/reply) 25 ms; compute 75 ms. The budgets are in line with other papers.
SLA of 1% missed deadlines.
18
Methodology (cont.)
Service time and power measurements:
- Service time measured at low load (no queueing) at the leaf server (matches other papers).
- Power measured with RAPL at one leaf server; leaf servers are i.i.d.
Network:
- Fat-tree topology: 64 racks, 16 servers/rack
- 10 Gbps links, 4 MB shared packet buffers
- 200 μs round-trip time (unloaded)
19
Service time and Power
[Figures: (left) cumulative percent of requests vs. service time (0-60 ms); (right) power saving, normalized to baseline, vs. service slowdown (1x-2x, normalized to base)]
20
Response time distribution
TimeThief reshapes distribution at all loads; saves 12% energy at the peak load
[Figure: CDF of response latency (0-140 ms) for Baseline (30%, 90% load), Pegasus (30% load), and TimeThief (30%, 90% load)]
21
Power state distribution
Even at the peak load, TimeThief slows down about 80% of the time using per-query slack
[Figure: distribution of requests across power states (1.2, 1.5, 1.8, 2.2, 2.5 GHz) for Pegasus (P) and TimeThief (T) at 90% and 30% load, Web Search]
22
Conclusion
TimeThief exploits per-query slack:
- Request slack
- Compute slack (future work)
- Reply slack (not easily predictable)
It reshapes the response time distribution at all loads and saves 12% energy at peak load. It leverages network signals to estimate slack, without requiring fine-grained clock synchronization, and employs EDF to decouple critical requests from sub-critical requests.
TimeThief converts OLDIs' performance disadvantage of the latency tail into an energy advantage!
Thank you