CACHE SHARING WITH STRICT QOS FOR LATENCY CRITICAL...

UBIK: EFFICIENT CACHE SHARING WITH STRICT QOS FOR LATENCY CRITICAL WORKLOADS HARSHAD KASTURE, DANIEL SANCHEZ ASPLOS 2014

Transcript of CACHE SHARING WITH STRICT QOS FOR LATENCY CRITICAL...

Page 1: CACHE SHARING WITH STRICT QOS FOR LATENCY CRITICAL …people.csail.mit.edu/sanchez/papers/2014.ubik.asplos.slides.pdf · Inertia and Transient Behavior 12 Transient lengths can dominate

UBIK: EFFICIENT CACHE SHARING WITH STRICT QOS FOR LATENCY CRITICAL WORKLOADS

HARSHAD KASTURE, DANIEL SANCHEZ

ASPLOS 2014


Motivation 2

Low server utilization in datacenters is a major source of inefficiency

L. Barroso and U. Hölzle, The Case for Energy-Proportional Computing


Common Industry Practice 3

[Diagram: Core 0 to Core 5 sharing a Last Level Cache, running a Latency Critical Application]

Dedicated machines for latency-critical applications guarantee QoS


Common Industry Practice 4

[Diagram: Core 0 to Core 5 sharing a Last Level Cache, running a Latency Critical Application]

Dedicated machines for latency-critical applications guarantee QoS

Underutilization of machine resources


Colocation to Improve Utilization 5

[Diagram: Core 0 to Core 5 sharing a Last Level Cache, running Latency Critical Applications]

Can utilize spare resources by colocating batch apps


Sharing Causes Interference! 6

[Diagram: Core 0 to Core 5 sharing a Last Level Cache, running Latency Critical Applications]

Can utilize spare resources by colocating batch apps

Contention in shared resources degrades QoS


Outline 7

Introduction

Analysis of latency-critical apps

Inertia-oblivious cache management schemes

Ubik: Inertia-aware cache management

Evaluation


Understanding Latency-Critical Applications 8

Large number of backend servers participate in handling every user request

Total service time determined by tail latency behavior of backend

[Diagram: Clients, Front End servers and Back End servers in a Datacenter]
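A small illustration of why the backend tail dominates when one request fans out to many servers. It assumes, purely for illustration, that backend latencies are independent and that the front end waits for every backend; the function name and the 99th-percentile default are hypothetical and do not come from the slides.

```python
def prob_request_hits_tail(num_backends: int, percentile: float = 0.99) -> float:
    """Probability that at least one of num_backends backend calls is slower
    than the given per-backend latency percentile.

    Assumption (not from the slides): backend latencies are independent.
    """
    if num_backends < 1:
        raise ValueError("num_backends must be at least 1")
    # Each backend stays under its own percentile with probability `percentile`,
    # so all of them do with probability percentile ** num_backends.
    return 1.0 - percentile ** num_backends


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(prob_request_hits_tail(n), 3))
```

Under these assumptions, the share of user requests that see a slow backend grows quickly with the fan-out, which is why the rare slow case matters more than the average.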


Understanding Latency-Critical Applications 9

Service latency highly sensitive to changes in load
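A textbook way to see this sensitivity is an M/M/1 queue. The queueing model is an assumption brought in for illustration only; the slides do not say which model, if any, fits these applications, and the function names are hypothetical.

```python
import math


def mm1_mean_latency(service_rate: float, arrival_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda)."""
    if not 0 <= arrival_rate < service_rate:
        raise ValueError("need 0 <= arrival_rate < service_rate")
    return 1.0 / (service_rate - arrival_rate)


def mm1_latency_percentile(service_rate: float, arrival_rate: float, q: float) -> float:
    """q-th quantile (0 < q < 1) of time in system for a FCFS M/M/1 queue.

    Time in system is exponential with rate (mu - lambda), so the quantile
    is -ln(1 - q) / (mu - lambda).
    """
    if not 0 < q < 1:
        raise ValueError("q must be in (0, 1)")
    return -math.log(1.0 - q) * mm1_mean_latency(service_rate, arrival_rate)


if __name__ == "__main__":
    mu = 1000.0  # requests per second the server can handle (made-up number)
    for load in (0.5, 0.7, 0.9):
        print(load, mm1_latency_percentile(mu, load * mu, 0.95))
```

In this model, moving from 50% to 90% load multiplies both the mean and every percentile of latency by five.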


Understanding Latency-Critical Applications 10

Short bursts of activity interspersed with idle periods

Need guaranteed high performance during active periods

[Figure: activity over Time, alternating between Active and Idle periods]


Inertia and Transient Behavior 11

[Diagram: Core 0 to Core 5 sharing a Last Level Cache, with a plot of IPC over Time]


Inertia and Transient Behavior 12

Transient lengths can dominate tail latency!

Any dynamic reconfiguration scheme has to be inertia-aware

Many hardware resources exhibit inertia

branch predictors, prefetchers, memory bandwidth…

LLCs are one of the biggest sources of inertia

[Diagram: Core 0 to Core 5 sharing a Last Level Cache, with a plot of IPC over Time marking where the transient begins and ends]


Outline 13

Introduction

Analysis of latency-critical apps

Inertia-oblivious cache management schemes

Ubik: Inertia-aware cache management

Evaluation


Inertia-Oblivious Cache Management 14

[Diagram: Core 0 to Core 3 sharing a Last Level Cache (LLC), running latency-critical apps LC1 and LC2 and batch apps Batch1 and Batch2. Activity timelines alternate between Active and Idle over Time]


Unmanaged LLC (LRU Replacement) 15

[Diagram: Core 0 to Core 3 sharing a Last Level Cache (LLC), running LC1, LC2, Batch1 and Batch2. Activity timelines alternate between Active and Idle, above a plot of LLC Space over Time]

✖ Unconstrained interference results in poor tail-latency behavior


Utility Based Cache Partitioning (UCP) 16

[Diagram: Core 0 to Core 3 sharing a Last Level Cache (LLC), running LC1, LC2, Batch1 and Batch2. Activity timelines alternate between Active and Idle, above a plot of LLC Space over Time with Reconfigure points marked]

✔ High batch throughput

✖ Poor tail latency (low allocation)
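A minimal sketch of utility-driven partitioning, to show why a latency-critical app can end up with a low allocation. It uses a plain greedy rule (give each way to the app that saves the most misses), which is a simplification: UCP as published uses a lookahead algorithm to cope with non-convex miss curves. The function name and the miss curves are made up for illustration.

```python
from typing import List, Sequence


def greedy_partition(miss_curves: Sequence[Sequence[float]], total_ways: int) -> List[int]:
    """Split total_ways among apps by greedy marginal utility.

    miss_curves[a][w] is the number of misses app a suffers with w ways,
    for w = 0..total_ways. This is a simplified stand-in for UCP's
    lookahead allocation, not a reproduction of it.
    """
    for curve in miss_curves:
        if len(curve) < total_ways + 1:
            raise ValueError("each miss curve needs total_ways + 1 points")
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        # Pick the app whose misses drop the most if it gets one more way.
        best_app = max(
            range(len(miss_curves)),
            key=lambda a: miss_curves[a][alloc[a]] - miss_curves[a][alloc[a] + 1],
        )
        alloc[best_app] += 1
    return alloc


if __name__ == "__main__":
    batch = [100, 60, 40, 30, 28]          # made-up curve: many misses saved per way
    latency_critical = [50, 45, 42, 40, 39]  # made-up curve: few misses saved per way
    print(greedy_partition([batch, latency_critical], total_ways=4))
```

With these made-up curves the batch app takes three of the four ways, because the allocation optimizes total misses and is blind to which app has a latency target.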


OnOff: Efficient but Unsafe 17

[Diagram: Core 0 to Core 3 sharing a Last Level Cache (LLC), running LC1, LC2, Batch1 and Batch2. Activity timelines alternate between Active and Idle, above a plot of LLC Space over Time with Batch and Reconfigure labels]

✔ High batch throughput


Cross-Request LLC Inertia 18

Other applications qualitatively similar (see paper for details)

[Figure: Shore-MT, 2 MB LLC. LLC Access Breakdown (%) into Misses, Hits (same request) and Cross-request hits]


StaticLC: Safe but Inefficient 19

[Diagram: Core 0 to Core 3 sharing a Last Level Cache (LLC), running LC1, LC2, Batch1 and Batch2. Activity timelines alternate between Active and Idle, above a plot of LLC Space over Time with Batch and Reconfigure labels]

✔ Low tail latency (preserve LLC state)

✖ Low batch throughput (poor space utilization)


Outline 20

Introduction

Analysis of latency-critical apps

Inertia-oblivious cache management schemes

Ubik: Inertia-aware cache management

Evaluation


Ubik: Performance Guarantee 21

Performance as well as overall progress under Ubik after the deadline is identical to static partitioning

[Figure: Instructions over Time, from the point the request begins to the deadline, comparing progress with constant size against progress with Ubik]


Ubik: Overview 22

[Figure: Activity over Time and Size over Time, showing Target Size and Actual Size relative to the idle size and the nominal static size]


Ubik: Overview 23

[Figure: Activity over Time and Size over Time, showing Target Size and Actual Size relative to the idle size, nominal static size and boosted size]


Ubik: Overview 24

[Figure: Activity over Time and Size over Time, showing Target Size and Actual Size relative to the idle size, nominal static size and boosted size]


Ubik: Overview 25

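Constraint: Cycles lost while the partition is below the nominal static size should be compensated for by the cycles gained while it is above that size, before the deadline

[Figure: Activity over Time and Size over Time, showing Target Size and Actual Size relative to the idle size, nominal static size and boosted size]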


Analyzing Transients 26

Need accurate predictions for

The length of the transient from s1 to s2

Cycles lost during the transient from s1 to s2

[Figure: Size over Time. Target Size steps from s1 to s2 and Actual Size catches up over Ttransient]

[Figure: Instructions over Time. Progress with constant size (s2) versus progress with Ubik, between the points where the transient begins and ends; the gap is the Lost Performance]


Hardware Support 27

Utility monitors to measure per-application miss curves

Fine grained cache partitioning

Memory Level Parallelism (MLP) profiler

[Figure: Miss probability versus Size, with miss probabilities ps1 and ps2 marked at sizes s1 and s2]


Bounds on Transient Behavior 28

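[Figure: Size over Time. Target Size steps from s1 to s2 and Actual Size catches up over Ttransient]

[Figure: Instructions over Time. Progress with constant size (s2) versus progress with Ubik, between the points where the transient begins and ends; the gap is the Lost Performance (L)]

$$T_{transient} = \sum_{s=s_1}^{s_2-1}\left(\frac{c}{p_s} + M\right) \le (s_2 - s_1)\left(\frac{c}{p_{s_2}} + M\right)$$

$$L = \sum_{s=s_1}^{s_2-1} M\left(1 - \frac{p_{s_2}}{p_s}\right) \le M\,(s_2 - s_1)\left(1 - \frac{p_{s_2}}{p_{s_1}}\right)$$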
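A sketch of how the transient length and lost cycles could be evaluated from a measured miss curve. The reading of the symbols is an assumption, since this slide does not define them: p[s] is the miss probability at size s (in cache lines), c is the average number of cycles between LLC accesses when they hit, M is the effective miss penalty, and the partition grows by one line per miss. Function names are hypothetical.

```python
from typing import Sequence, Tuple


def transient_exact(p: Sequence[float], s1: int, s2: int, c: float, M: float) -> Tuple[float, float]:
    """Sum the per-line terms for growing a partition from s1 to s2.

    Returns (transient length in cycles, cycles lost relative to always
    having size s2). Assumes p is positive and non-increasing in size.
    """
    t_transient = sum(c / p[s] + M for s in range(s1, s2))
    lost = sum(M * (1.0 - p[s2] / p[s]) for s in range(s1, s2))
    return t_transient, lost


def transient_bounds(p: Sequence[float], s1: int, s2: int, c: float, M: float) -> Tuple[float, float]:
    """Closed-form upper bounds that only need p[s1] and p[s2]."""
    t_bound = (s2 - s1) * (c / p[s2] + M)
    lost_bound = M * (s2 - s1) * (1.0 - p[s2] / p[s1])
    return t_bound, lost_bound


if __name__ == "__main__":
    miss_prob = [0.5, 0.4, 0.3, 0.2, 0.1]  # made-up miss curve, indexed by size
    print(transient_exact(miss_prob, 1, 4, c=10.0, M=100.0))
    print(transient_bounds(miss_prob, 1, 4, c=10.0, M=100.0))
```

The bounds are looser than the sums but need only the two endpoints of the miss curve, which keeps the evaluation cheap.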


Ubik: Partition Sizing 29

Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: Size over Time up to the deadline, with a curve labeled 1]


Ubik: Partition Sizing 30

Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: Size over Time up to the deadline, with curves labeled 1 and 2]


Ubik: Partition Sizing 31

Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: Size over Time up to the deadline, with curves labeled 1, 2 and 3]


Ubik: Partition Sizing 32

Use transient analysis to identify feasible (idle size, boosted size) pairs

[Figure: Size over Time up to the deadline, with curves labeled 1, 2, 3 and 4 and an INFEASIBLE marking]


Ubik: Partition Sizing 33

Use transient analysis to identify feasible (idle size, boosted size) pairs

Choose the pair that yields the maximum batch throughput

[Figure: Size over Time up to the deadline, with curves labeled 1, 2 and 3]

See paper for details
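A rough sketch of the search over (idle size, boosted size) pairs. This is a simplified stand-in, not the paper's algorithm: it reuses worst-case transient bounds of the form shown on the "Bounds on Transient Behavior" slide, counts gains only after the partition has reached the boosted size, and ranks pairs by an assumed proxy for batch throughput (average space released to batch apps). All names, the proxy and the idle_fraction parameter are hypothetical.

```python
from typing import Sequence, Tuple


def is_feasible(p: Sequence[float], idle: int, boosted: int, nominal: int,
                deadline: float, c: float, M: float) -> bool:
    """Check that cycles lost below nominal size are repaid before the deadline.

    p[s] is the miss probability at size s, c the cycles between LLC accesses
    on hits, M the miss penalty (assumed meanings). Requires
    idle <= nominal <= boosted.
    """
    # Upper bound on cycles lost while growing from idle to nominal size.
    lost = M * (nominal - idle) * (1.0 - p[nominal] / p[idle])
    # Upper bound on the time needed to grow from idle to boosted size.
    t_fill = (boosted - idle) * (c / p[boosted] + M)
    if t_fill > deadline:
        return False
    # Conservative gain: only count accesses made at the full boosted size.
    accesses = (deadline - t_fill) / (c + p[boosted] * M)
    gained = accesses * M * (p[nominal] - p[boosted])
    return gained >= lost


def choose_pair(p: Sequence[float], nominal: int, max_size: int, deadline: float,
                c: float, M: float, idle_fraction: float) -> Tuple[int, int]:
    """Return the feasible (idle, boosted) pair with the best assumed score."""
    best, best_score = (nominal, nominal), 0.0  # static partitioning always works
    for idle in range(0, nominal + 1):
        for boosted in range(nominal, max_size + 1):
            if not is_feasible(p, idle, boosted, nominal, deadline, c, M):
                continue
            # Assumed proxy: space handed to batch while idle, minus space
            # taken from batch while boosted, weighted by time spent in each.
            score = idle_fraction * (nominal - idle) - (1.0 - idle_fraction) * (boosted - nominal)
            if score > best_score:
                best, best_score = (idle, boosted), score
    return best


if __name__ == "__main__":
    miss_prob = [0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1]  # made-up curve
    print(choose_pair(miss_prob, nominal=4, max_size=8, deadline=1e6,
                      c=10.0, M=100.0, idle_fraction=0.8))
```

With a very tight deadline nothing but the static pair (nominal, nominal) passes the check, which mirrors the guarantee that the scheme never does worse than static partitioning.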


Outline 34

Introduction

Analysis of latency-critical apps

Inertia-oblivious cache management schemes

Ubik: Inertia-aware cache management

Evaluation


Workloads 35

Five diverse latency-critical apps

xapian (search engine)

masstree (in-memory key-value store)

moses (statistical machine translation)

shore-mt (multi-threaded DBMS)

specjbb (Java middleware)

Batch applications: random mixes of SPEC CPU2006 benchmarks


Target System 36

6 OOO cores

Private L1I, L1D and L2 caches

12 MB shared LLC

400 6-app mixes: 3 latency-critical + 3 batch apps

Apps pinned to cores

[Diagram: Core 0 to Core 5, each next to an L3 bank (Bank 0 to Bank 5), running Batch1, Batch2, Batch3, LC1, LC2 and LC3]


Metrics 37

Baseline system has private LLCs

We report

Normalized tail latency

Throughput improvement for batch applications

[Diagram: Core 0 to Core 5, each with its own L3 (L3 0 to L3 5), running Batch1, Batch2, Batch3, LC1, LC2 and LC3]
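One way the normalized tail latency could be computed from per-request latencies. The 95th percentile and the nearest-rank method are assumptions chosen for illustration; the slides do not state which tail definition is used, and the function names are hypothetical.

```python
import math
from typing import Sequence


def tail_latency(latencies: Sequence[float], percentile: float = 95.0) -> float:
    """Nearest-rank percentile of a non-empty list of request latencies."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    rank = math.ceil(percentile * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def normalized_tail_latency(shared: Sequence[float], private: Sequence[float],
                            percentile: float = 95.0) -> float:
    """Tail latency under a shared LLC divided by the private-LLC baseline.

    A value of 1.0 means the tail is unchanged relative to the baseline.
    """
    return tail_latency(shared, percentile) / tail_latency(private, percentile)


if __name__ == "__main__":
    baseline = list(range(1, 101))            # made-up latencies
    colocated = [2 * x for x in baseline]     # made-up: every request twice as slow
    print(normalized_tail_latency(colocated, baseline))
```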


Results: Unmanaged LLC (LRU) 38

Higher is better


Results: UCP 39

Higher is better


Results: OnOff 40

Higher is better


Results: StaticLC 41

Higher is better


Results: Ubik 42

Higher is better


Results: Summary 43

Higher is better

[Figure legend: OnOff, UCP, Ubik, LRU, StaticLC, Private LLC]


Conclusions 44

To guarantee tail latency, dynamic resource management schemes must be inertia-aware

Ubik: Inertia-aware cache capacity management

Preserves tail of latency-critical apps

Achieves high cache space utilization for batch apps

Requires minimal additional hardware


THANKS FOR YOUR ATTENTION!

QUESTIONS?