Counter Stacks: Storage Workload Analysis via Streaming Algorithms


Counter Stacks: Storage Workload Analysis via Streaming Algorithms

Nick Harvey University of British Columbia and Coho Data

Joint work with Zachary Drudi, Stephen Ingram, Jake Wires, Andy Warfield

Caching: What data to keep in fast memory?

Fast, Low-Capacity Memory

Slow, High-Capacity Memory

Caching: Historically

Registers

RAM

Disk

Belady, 1966: FIFO, RAND, MIN

Denning, 1968: LRU

Caching: Modern

Registers, L1, L2, L3

RAM

Disk

SSD

Proxy

CDN

Associative map

LRU etc.

LRU

Consistent Hashing ...

Since 1968: CPUs are >1000x faster, but disk latency is <10x better. Cache misses are increasingly costly.

Challenge: Provisioning

How much cache should you buy to support your workload?

Challenge: Virtualization

• Modern servers are heavily virtualized.
• How should we allocate the physical cache among virtual servers to improve overall performance?
• What is the “marginal benefit” of giving a server more cache?

Understanding Workloads

• Understanding workloads better can help:
  – Administrators make provisioning decisions
  – Software make allocation decisions
• Storing a trace is costly: GBs per day
• Analyzing and distilling traces is a challenge

Hit Rate Curve

[Figure: hit rate vs. cache size (GB) for the MSR Cambridge “TS” trace under the LRU policy, annotated with “Elbow”, “Knee”, and “Working Set”; not much marginal benefit from a bigger cache]

• Fix a particular workload and caching policy.
• If the cache size is x, what would the hit rate be?
• HRCs are useful for choosing an appropriate cache size.

Hit Rate Curve

• Real-world HRCs need not be concave or smooth.
• “Marginal benefit” is meaningless.
• “Working set” is a fallacy.

[Figure: hit rate vs. cache size (GB) for the MSR Cambridge “Web” trace under the LRU policy; where is the “elbow”, the “knee”, or the “working set”?]

LRU Caching

• Policy: An LRU cache of size x always contains the x most recently requested distinct symbols.

Requests: A B C A D A B …

• If the cache size is >3, then B will still be in the cache during the second request for B.
  – The second request for B is a hit for cache size x if x > 3.
• Inclusive: Larger caches always include the contents of smaller caches.

(3 distinct symbols between the two requests for B: the “reuse distance”)
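The inclusiveness of LRU and the size threshold in this example can be checked with a short simulation. The following is a minimal Python sketch (not from the talk): it replays the trace A B C A D A B through an LRU cache of a given size and reports whether the second request for B hits.

from collections import OrderedDict

def lru_hits(trace, size):
    """Replay trace through an LRU cache with `size` slots; return a hit/miss flag per request."""
    cache = OrderedDict()              # keys ordered from least to most recently used
    hits = []
    for block in trace:
        if block in cache:
            hits.append(True)
            cache.move_to_end(block)   # block becomes the most recently used
        else:
            hits.append(False)
            cache[block] = None
            if len(cache) > size:
                cache.popitem(last=False)   # evict the least recently used block
    return hits

trace = list("ABCADAB")
for size in (3, 4):
    print(size, lru_hits(trace, size)[-1])   # the last request is the second B: miss at size 3, hit at size 4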

Mattson’s Algorithm

• Maintain an LRU cache of size n; simulate caches of all sizes x ≤ n.
• Keep a list of all blocks, sorted by most recent request time.
• The reuse distance of a request is its position in that list.
• If the distance is d, the request is a hit for all cache sizes ≥ d.
• The hit rate curve is the CDF of the reuse distances.

Requests: A B C A D A B …
List after each request: A → B A → C B A → A C B → D A C B → A D C B → B A D C
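The slide’s algorithm translates almost line for line into code. Below is a minimal Python sketch (illustrative only, not the talk’s implementation); it keeps the recency list explicitly, so each request costs O(n) time, and reports each request’s reuse distance (None for a first access) together with the resulting hit rate curve.

def mattson_reuse_distances(trace):
    """Reuse distance of each request = its 1-based position in the list of blocks
    ordered by most recent request time (None for a block never seen before)."""
    recency = []                                     # most recently requested block first
    dists = []
    for block in trace:
        if block in recency:
            dists.append(recency.index(block) + 1)   # position in the list = reuse distance
            recency.remove(block)
        else:
            dists.append(None)
        recency.insert(0, block)                     # block is now the most recent
    return dists

def hit_rate_curve(dists, max_size):
    """HRC(x) = fraction of requests that hit in a cache of size x = CDF of reuse distances."""
    m = len(dists)
    return [sum(1 for d in dists if d is not None and d <= x) / m for x in range(1, max_size + 1)]

dists = mattson_reuse_distances(list("ABCADAB"))
print(dists)                     # [None, None, None, 3, None, 2, 4]
print(hit_rate_curve(dists, 4))  # [0.0, ~0.14, ~0.29, ~0.43]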

Faster Mattson
[Bennett-Kruskal 1975, Olken 1981, Almasi et al. 2001, ...]

• Maintain a table mapping each block to the time of its last request.
• # of blocks whose last request time is ≥ t = # of distinct blocks seen since time t.
• Can compute this in O(log n) time with a balanced tree.
• Can compute the HRC in O(m log n) time.

Requests: A B C A D A B …
[Table: block → time of last request, for blocks A, B, C, D]

Space is Θ(n).

n = # blocks

m = length of trace
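One concrete way to get the logarithmic-time query above is a Fenwick (binary indexed) tree over request times: mark each block’s last request time and count the marks at positions ≥ t. This is an illustrative Python sketch, not the cited implementations; for simplicity it indexes the tree by raw request time, which uses O(m) words, whereas a balanced tree keyed only by the n live last-request times gives the O(log n) time and linear space quoted on the slides.

class Fenwick:
    """Fenwick (binary indexed) tree over positions 1..m: point update, prefix sum."""
    def __init__(self, m):
        self.tree = [0] * (m + 1)
    def add(self, i, delta):
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & (-i)
    def prefix(self, i):                     # sum over positions 1..i
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

def reuse_distances_fast(trace):
    """Reuse distances (None for first accesses) in O(log m) time per request."""
    m = len(trace)
    marked = Fenwick(m)   # 1 at every position that is currently some block's last request time
    last = {}             # block -> time (1-based) of its last request
    total = 0             # number of marked positions = number of distinct blocks seen so far
    dists = []
    for t, block in enumerate(trace, start=1):
        if block in last:
            t0 = last[block]
            # distinct blocks seen since time t0 = number of marked positions >= t0
            dists.append(total - marked.prefix(t0 - 1))
            marked.add(t0, -1)
            total -= 1
        else:
            dists.append(None)
        marked.add(t, 1)
        total += 1
        last[block] = t
    return dists

print(reuse_distances_fast(list("ABCADAB")))   # [None, None, None, 3, None, 2, 4], matching the sketch above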

Is linear space OK?

• A modern disk is 8TB, divided into 4kB blocks ⇒ n = 2B blocks.
• The problem is worse in multi-disk arrays (a 60TB JBOD) ⇒ n = 15B blocks.
• If an algorithm for improving memory usage consumes 15GB of RAM, that’s counterproductive!

Is linear space OK?

• We ran an optimized C implementation of Mattson on the MSR-Cambridge traces of 13 live servers over 1 week.
• Trace file is 20GB in size, 2.3B requests, 750M blocks (3TB).
• Processing time: 1 hour. RAM usage: 92GB.
• Lesson: Cannot afford linear space to process storage workloads.
• Question: Can we estimate HRCs in sublinear space?

Quadratic Space

Requests: A B C A D A B

[Figure: for each request time, the set of all items seen from that request onward (“items seen since the first request”, “items seen since the second request”, ...), growing as later requests arrive; the annotated requests have reuse distances 2, 3, and 1]

• Reuse distance is the size of the oldest set that grows.
• The hit rate curve is the CDF of the reuse distances.

Quadratic Space

Requests: A B C A D A B

For t = 1, …, m:
    Receive request b_t
    Find the minimum j such that b_t is not in the jth set
    Let v_j be the cardinality of the jth set
    Record a hit at reuse distance v_j
    Insert b_t into all previous sets

[Figure: the same sets as on the previous slide; for the final request B, the minimum such j is j = 3 and v_j = 3]
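A direct Python rendering of this procedure is below (an illustrative sketch; it keeps O(n) sets of O(n) size, hence quadratic space). Note that this slide’s convention counts the distinct items strictly between two accesses, so the values come out one smaller than the recency-list positions in the Mattson sketch above; the hit/miss outcome at any cache size is the same.

def quadratic_reuse_distances(trace):
    """One explicit set per start time; sets[j] holds the distinct items seen since request j+1."""
    sets = []
    dists = []                     # recorded reuse distance per request (None = first access)
    for b in trace:
        # minimum j such that b is not in the jth set (sets are nested: the oldest is the largest)
        j = next((k for k, s in enumerate(sets) if b not in s), len(sets))
        if j == 0:
            dists.append(None)     # b is not even in the oldest set: never seen before
        else:
            # v_j = cardinality of the jth set (0 if b was also the immediately preceding request)
            dists.append(len(sets[j]) if j < len(sets) else 0)
        for s in sets:             # insert b_t into all previous sets
            s.add(b)
        sets.append({b})           # start a new set at this request
    return dists

print(quadratic_reuse_distances(list("ABCADAB")))   # [None, None, None, 2, None, 1, 3]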

More Abstract Version

For t = 1, …, m:
    Receive request b_t
    Let v_j be the cardinality of the jth set
    Let δ_j be the change in the jth set’s cardinality when adding b_t
    For j = 2, …, t: record (δ_j − δ_{j−1}) hits at reuse distance v_j
    Insert b_t into all previous sets

Requests: A B C A D A B

[Figure: the same sets as above; for the final request B,
    δ_j:            0 0 1 1 1 1
    δ_j − δ_{j−1}:  0 0 1 0 0 0
so one hit is recorded at reuse distance v_j = 3]
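The point of this reformulation is that only cardinalities are needed, never membership. The Python sketch below (illustrative; still using explicit sets) records hits purely from the changes δ_j and produces the same reuse-distance histogram as the direct version above; this is what makes it possible to swap the sets for cardinality estimators.

from collections import Counter

def abstract_reuse_histogram(trace):
    """Record (delta_j - delta_{j-1}) hits at distance v_j, as in the slide's pseudocode.

    Because the sets are nested, delta_j - delta_{j-1} is 1 exactly at the oldest set that
    grows, so only cardinalities (and their changes) are ever consulted."""
    sets = []
    hist = Counter()                          # reuse distance -> number of hits
    for b in trace:
        v = [len(s) for s in sets]            # v_j, measured before inserting b
        delta = [0 if b in s else 1 for s in sets]
        for j in range(1, len(sets)):         # j = 2, ..., t in the slide's 1-based indexing
            inc = delta[j] - delta[j - 1]
            if inc:
                hist[v[j]] += inc
        for s in sets:                        # insert b_t into all previous sets
            s.add(b)
        sets.append({b})
    return hist

print(sorted(abstract_reuse_histogram(list("ABCADAB")).items()))   # [(1, 1), (2, 1), (3, 1)]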

How should we represent these sets? A hash table?

Random Set Data Structures

Operations:
                 Insert   Delete   Member?   Cardinality?   Space (in bits)
  Hash Table     Yes      Yes      Yes       Yes            Θ(n log n)
  Bloom Filter   Yes      No       Yes*      No             Θ(n)
  F0 Estimator   Yes      No       No        Yes*           O(log n)

* allowing some error
F0 estimators are also known as “HyperLogLog”, “Probabilistic Counter”, or “Distinct Element Estimator”.

Subquadratic Space

Requests: A B C A D A B

[Figure: the per-start-time sets (“items seen since the first request”, “items seen since the second request”, ...) are each replaced by an F0 estimator; every request is inserted into all of the estimators]

• Reuse distance is the size of the oldest set that grows (a cardinality query).
• The hit rate curve is the CDF of the reuse distances.

For t = 1, …, m:
    Receive request b_t
    Let v_j be the value of the jth F0 estimator
    Let δ_j be the change in the jth F0 estimator when adding b_t
    For j = 2, …, t: record (δ_j − δ_{j−1}) hits at reuse distance v_j
    Insert b_t into all of the estimators

Towards Sublinear Space

Requests: A B C A

[Figure: one F0 estimator per start time; each estimator’s contents are a superset (⊇) of the next one’s]

• Note that an earlier F0 estimator is a superset of a later one.
• Can this be leveraged to achieve sublinear space?

F0 Estimation
[Flajolet-Martin ’83, Alon-Matias-Szegedy ’99, …, Kane-Nelson-Woodruff ’10]

Operations:
• Insert(x)
• Cardinality(), with (1+ε) multiplicative error

Space: log(n)/ε² bits. Θ(ε⁻² + log n) is optimal.

[Figure: a bit matrix with log n rows and ε⁻² columns]

F0 Estimation

Operations: Insert(x), Cardinality()

Requests: A B C A D A B …

[Animation: a bit matrix with log n rows and ε⁻² columns. A uniform hash function h picks a column and a geometric hash function g picks a row; each Insert sets the bit at (g(x), h(x)). Successive slides show the matrix filling in as the requests are inserted.]

F0 Estimation

Suppose we insert n distinct elements.
• The number of 1s in a column is the max of ≈ nε² geometric random variables, so ≈ log(nε²).
• Averaging over all columns gives a concentrated estimate of log(nε²).
• Exponentiating and scaling gives a concentrated estimate of n.

Operations: Insert(x), Cardinality()
[Figure: the filled bit matrix, log n rows × ε⁻² columns]
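For concreteness, here is a small Python sketch of a probabilistic counter in this family (Flajolet-Martin / LogLog style). It is illustrative only, not the estimator analyzed by Kane-Nelson-Woodruff: a uniform hash picks one of the ε⁻² columns, a geometric hash picks a row, Insert sets that bit, and Cardinality() exponentiates the average per-column maximum set row. The bias-correction constant 0.397 (the asymptotic LogLog constant) and the use of the per-column maximum rather than the count of ones are assumptions of this sketch.

import hashlib

class F0Estimator:
    """Bit matrix with `rows` x `cols` cells; insert(x) sets the bit at (g(x), h(x)),
    where h is uniform over columns and g is geometrically distributed over rows."""
    ALPHA = 0.397                             # assumed LogLog-style bias correction

    def __init__(self, cols=64, rows=32):
        self.cols, self.rows = cols, rows
        self.bits = [[0] * cols for _ in range(rows)]

    def _cell(self, x):
        d = hashlib.blake2b(str(x).encode(), digest_size=16).digest()
        col = int.from_bytes(d[:8], "big") % self.cols           # uniform hash h
        g = int.from_bytes(d[8:], "big")
        row = min(max((g & -g).bit_length(), 1), self.rows) - 1  # geometric hash g: 1 + #trailing zero bits
        return row, col

    def insert(self, x):
        r, c = self._cell(x)
        self.bits[r][c] = 1

    def cardinality(self):
        # highest set row per column (0 if the column is still empty)
        tops = [max((r + 1 for r in range(self.rows) if self.bits[r][c]), default=0)
                for c in range(self.cols)]
        return self.ALPHA * self.cols * 2 ** (sum(tops) / self.cols)

est = F0Estimator()
for i in range(100000):
    est.insert(i)
    est.insert(i)                             # duplicates do not change the sketch
print(round(est.cardinality()))               # in the vicinity of 100000 (relative error ~ 1/sqrt(cols))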

F0 Estimation for a chain

Operations:
• Insert(x)
• Cardinality(t): estimate the # of distinct elements since the tth insert

Space: log(n)/ε² words

[Figure: a matrix with log n rows and ε⁻² columns in which each cell holds a word rather than a bit]

F0 Estimation for a chain

Operations: Insert(x), Cardinality(t)

Requests: A B C A D A B …

[Animation: the same hash functions h (uniform) and g (geometric) as before, but each cell now records the time of the most recent insert that mapped to it; successive slides show the timestamps being written as the requests are inserted.]

F0 Estimation for a chain

• The {0,1}-matrix consisting of all entries ≥ t is the same as the matrix of an F0 estimator that started at time t.
• So, for any t, we can estimate the # of distinct elements seen since time t.

Operations: Insert(x), Cardinality(t)
[Figure: the matrix of timestamps, log n rows × ε⁻² columns]
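Continuing the sketch from the plain F0 estimator above (same illustrative hashes and constant, not the paper’s exact data structure): each cell now stores the index of the most recent insert that mapped to it, and cardinality(t) applies the same estimate to the cells whose stored time is ≥ t. Feeding these cardinality(t) values into the δ-difference loop from the “more abstract version” is, in essence, how the hit rate curve is then estimated in sublinear space.

import hashlib

class ChainedF0Estimator:
    """Each cell stores the time (1-based insert index) of the most recent insert mapped to it.
    Thresholding the matrix at >= t recovers the bit matrix of an F0 estimator started at
    time t, so cardinality(t) estimates the number of distinct elements since the tth insert."""
    ALPHA = 0.397                              # assumed LogLog-style bias correction, as above

    def __init__(self, cols=64, rows=32):
        self.cols, self.rows = cols, rows
        self.cells = [[0] * cols for _ in range(rows)]   # 0 = never hit
        self.t = 0                                       # number of inserts so far

    def _cell(self, x):
        d = hashlib.blake2b(str(x).encode(), digest_size=16).digest()
        col = int.from_bytes(d[:8], "big") % self.cols            # uniform hash h
        g = int.from_bytes(d[8:], "big")
        row = min(max((g & -g).bit_length(), 1), self.rows) - 1   # geometric hash g
        return row, col

    def insert(self, x):
        self.t += 1
        r, c = self._cell(x)
        self.cells[r][c] = self.t              # remember the latest time this cell was hit

    def cardinality(self, t=1):
        """Estimated number of distinct elements among inserts t, t+1, ..., current."""
        tops = [max((r + 1 for r in range(self.rows) if self.cells[r][c] >= t), default=0)
                for c in range(self.cols)]
        return self.ALPHA * self.cols * 2 ** (sum(tops) / self.cols)

est = ChainedF0Estimator()
for i in range(200000):
    est.insert(i)                               # 200000 distinct inserts
print(round(est.cardinality(1)))                # ~200000 distinct since the 1st insert
print(round(est.cardinality(100001)))           # ~100000 distinct since the 100001st insert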

Theorem: Let n = B·W. Let C : [n] → [0,1] be the true HRC and Ĉ : [n] → [0,1] the estimated HRC. Using O(B²·log(n)·log(m)/ε²) words of space, we can guarantee

    C((j−1)·W) − ε ≤ Ĉ(j·W) ≤ C(j·W) + ε    for all j = 1, …, B.

(The ε terms are the vertical error; the shift by one bin width W is the horizontal error.)

n = # distinct blocks, m = # requests, B = # “bins”, W = width of each “bin”

[Figure: hit rate (0 to 1) vs. cache size (0 to n, divided into B bins of width W); for each bin, Ĉ(j·W) is sandwiched between C((j−1)·W) − ε and C(j·W) + ε]

Experiments: MSR-Cambridge traces of 13 live servers over 1 week

• Trace file is 20GB in size, 2.3B requests, 750M blocks

• Optimized C implementation of Mattson’s algorithm
  – Processing time: ~1 hour
  – RAM usage: ~92GB
• Java implementation of our algorithm
  – Processing time: 17 minutes (2M requests per second)
  – RAM usage: 80MB (mostly the garbage collector)

Experiments: MSR-Cambridge traces of 13 live servers over 1 week

• Trace file has m = 2.3B requests, n = 750M blocks

[Plot: heuristic vs. counter stacks]

Experiments: MSR-Cambridge traces of 13 live servers over 1 week

• Trace file has m = 585M requests, n = 62M blocks

[Plot: heuristic vs. counter stacks]

Experiments: MSR-Cambridge traces of 13 live servers over 1 week

• Trace file has m = 75M requests, n = 20M blocks

[Plot: heuristic vs. counter stacks]

Conclusions

• Workload analysis by measuring uniqueness over time.

• Notion of “working set” can be replaced by “hit rate curve”.

• Can estimate HRCs in sublinear space, quickly and accurately.

• On some real-world data sets, its accuracy is noticeably better than that of heuristics proposed in the literature.

Open Questions

• Does the algorithm use an optimal amount of space? Can it be improved to O(B·log(n)·log(m)/ε²) words of space?

• We did not discuss runtime. Can we get a runtime independent of B and ε?

• We take differences of F0 estimators by subtraction. This seems crude; is there a better approach?

• Streaming has been used in networks, databases, etc. To date, it has not been used much in storage. Potentially more uses here.