Cache-Oblivious Priority Queue and Graph Algorithm Applications
2015 Storage Developer Conference. © CloudPhysics, Inc. All Rights Reserved.
Online Cache Analysis And Its Applications For
Enterprise Storage Systems
Irfan Ahmad
Founder & CTO CloudPhysics, Inc.
System Model – Caches are Critical
[Diagram: system stack of apps / VMs, hypervisor, servers, and storage]
Cache Performance Questions
- Is this performance good? Can it be improved?
- What happens if I add / remove some cache?
- What if I add / remove workloads?
- Is there cache thrashing / pollution?
- Can I use cache to control performance or QoS?
Executive Summary
- Miss Ratio Curves (MRCs): game-changing storage tool
- CloudPhysics’ MRC algorithm = up to 10,000x improvement
- Online MRCs now practical:
  - ~20 million IO/s per core; amortized 60 ns per IO
  - High accuracy in 1 MB
  - Feasible for memory-constrained firmware, drivers
- Looking to partner with storage systems vendors
- Applications:
  - Workload-aware predictive cache sizing
  - Software-driven cache partitioning for “free” performance
  - Latency / Throughput guarantees via cache QoS
Problem & Opportunity
- Cache performance highly non-linear
- Benefit varies widely by workload
- Opportunity: dynamic cache management
  - Efficient sizing, allocation, and scheduling
  - Improve performance, isolation, QoS
- Problem: online modeling expensive
  - Too resource-intensive to be broadly practical
  - Exacerbated by increasing cache sizes
Modeling Cache Performance
[Figure: example miss ratio curve, miss ratio vs. cache size; lower is better]
- Miss Ratio Curve (MRC)
  - Performance as f(size)
  - Working set knees
  - Inform allocation policy
- Reuse distance
  - Unique intervening blocks between use and reuse
  - LRU, stack algorithms
MRC Algorithm Research
[Figure: timeline of MRC algorithms, 1965 to 2015, annotated with space and time complexity (N = total refs, M = unique refs)]
- Separate simulation per cache size
- Mattson Stack Algorithm (single pass): O(M) space, O(NM) time
- Bennett & Kruskal (balanced tree): O(N) space, O(N log N) time
- Olken (tree of unique refs): O(M) space, O(N log M) time
- Kessler, Hill & Wood (set, time sampling)
- Bryan & Conte (cluster sampling)
- RapidMRC (on-off periods)
- UMON-DSS (hw set sampling)
- PARDA (parallelism)
- Counter Stacks (probabilistic counters): O(log M) space, O(N log M) time
- SHARDS (spatial hashing): O(1) space, O(N) time
Key New Idea
- CloudPhysics MRC approximation algorithm
  - Randomized spatial sampling
  - Hashing to capture all reuses of same block
  - High performance in tiny constant footprint
  - Highly accurate MRCs
- Summary: run full algorithm, using sampled blocks
- https://www.usenix.org/conference/fast15/technical-sessions/presentation/waldspurger
Licensable Systems Implementation
- Easy to integrate with existing embedded systems
- High-performance SHARDS implementation in C
    void mrc_process_ref(MRC *mrc, LBN block);
    void mrc_get_histo(MRC *mrc, Histo *histo);
- No floating-point, no dynamic memory allocation
- Extremely low resource usage
  - Accurate MRCs in <1 MB footprint
  - Single-threaded throughput of 17-20M blocks/sec
  - Average time of mrc_process_ref() call < 60 ns
- Scaled-down simulation similarly efficient
MRCs from Customer Workloads (LRU)
[Figure: grid of per-trace LRU miss ratio curves, miss ratio (0.0 to 1.0) vs. cache size (GB), comparing SHARDS with smax = 8K against the exact MRC; lower is better. Panel labels follow.]
msr_mds (1.10%) msr_proj (0.06%) msr_src1 (0.06%) t00 (0.38%) t01 (0.05%) t02 (0.28%) t03 (0.65%)
t04 (0.28%) t05 (1.00%) t06 (0.33%) t07 (0.98%) t08 (0.04%) t09 (0.21%) t10 (0.61%)
t11 (0.65%) t12 (0.43%) t13 (0.46%) t14 (0.38%) t15 (0.10%) t16 (1.20%) t17 (0.54%)
t18 (0.08%) t19 (0.06%) t20 (0.03%) t21 (0.09%) t22 (0.04%) t23 (0.07%) t24 (0.65%)
t25 (1.20%) t26 (0.33%) t27 (0.50%) t28 (0.57%) t29 (0.12%) t30 (0.06%) t31 (0.95%)
Non-LRU Miss Ratio Curve Examples
[Figure: miss ratio curves for two non-LRU policies: ARC on the MSR-Web trace, and CLOCK-Pro on trace t04; lower is better]
Applications of Online MRC
Where are the MRCs
[Diagram: system stack of apps / VMs, hypervisor, servers, and storage]
Overview of Applications
- Without any changes to cache
  - Cache sizing
  - Cache parameter tuning
- With cache partitioning support
  - Optimize performance
  - Enforce service-level objectives
- Next-generation optimizations
  - Latency and throughput guarantee SLAs
  - AI for bending MRC curves
Application: Cache Sizing
- Online recommendations
  - Integrate SHARDS with storage controller
  - Show MRCs in storage management UI
- Customers and SEs self-service on sizing
  - Size array cache in the field, trigger upsell, etc.
  - Tune and optimize customer workloads
  - Report cache size to achieve desired latency
- Existing CloudPhysics caching analytics service
Example: MRC in UI Dashboard (mockup)
Example: MRC UI Mockup
[Figure: UI mockup chart, annotated "Higher is better"]
Application: Tune Cache Policy
- Quantify impact of parameter changes
  - Cache block size, use of sub-blocks
  - Write-through vs. write-back
  - Even replacement policy…
- Scaled-down simulation
  - Representative “microcosm” of cache behavior
  - Works for arbitrary policies and parameters
  - Explore without modifying actual production cache
- Dynamic online optimization
Application: Optimize Performance
- Improve aggregate cache performance
  - Prevent inefficient clients from wasting space
  - Allocate space based on client benefit
- Mechanism: Partition cache across clients
  - Isolate and control LUNs, VMs, tenants, etc.
  - Optimize partition sizes using MRCs
- Adapt to changing workload behavior
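One way to picture "optimize partition sizes using MRCs" is a greedy allocator that hands out cache one unit at a time to whichever client's curve predicts the largest drop in misses. The sketch below is an illustration only, not the method from the talk: the `Client` type, `partition_greedy`, and the per-unit `misses` arrays are hypothetical names, and greedy allocation is only optimal when every curve is convex, which real MRCs with working-set knees often are not.

```c
#include <stddef.h>

/* Hypothetical per-client input: misses[u] is the number of misses the
 * client's MRC predicts when it holds u cache units, for u = 0..max_units. */
typedef struct {
    const unsigned long *misses;
    size_t units;                 /* output: units granted to this client */
} Client;

/* Grant total_units one at a time to the client with the largest
 * predicted miss reduction for its next unit. */
void partition_greedy(Client *clients, size_t n,
                      size_t total_units, size_t max_units)
{
    for (size_t i = 0; i < n; i++)
        clients[i].units = 0;

    for (size_t granted = 0; granted < total_units; granted++) {
        size_t best = n;
        unsigned long best_gain = 0;
        for (size_t i = 0; i < n; i++) {
            size_t u = clients[i].units;
            if (u >= max_units)
                continue;                     /* curve has no more points */
            const unsigned long *m = clients[i].misses;
            /* Guard against a non-monotone curve to avoid underflow. */
            unsigned long gain = (m[u] > m[u + 1]) ? m[u] - m[u + 1] : 0;
            if (best == n || gain > best_gain) {
                best = i;
                best_gain = gain;
            }
        }
        if (best == n)
            break;                            /* every client is at its max */
        clients[best].units++;
    }
}
```

Re-running the allocator as fresh MRCs arrive is one simple way to follow changing workload behavior.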
[Figure: comparison against a partitioned cache, shown with per-trace miss ratio curves (miss ratio vs. cache size in GB; SHARDS smax = 8K vs. exact MRC) for traces msr_mds, msr_proj, msr_src1, t01, t06, t08, t14, t15, t18, t19, t30, and t32]
Example Partitioning Results
- Customer traces
  - 27 workload mixes
  - 8, 32, 128 GB sizes
- SHARDS partitions vs. global LRU
- Results histogram
  - Effective cache size
  - 40% larger (avg)
  - 146% larger (max)
[Figure: histogram of effective cache size increase (%) vs. frequency (# experiments)]
Application: Latency, IOps Guarantees
- Meet service-level objectives
  - Per-client latency or throughput targets
  - Use cache allocation as general QoS knob
- Same partitioning mechanism
  - Isolate and control LUNs, VMs, tenants, etc.
  - Use MRCs for sizing partitions to meet goals
- Adapt to changing workload behavior
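To make "use MRCs for sizing partitions to meet goals" concrete, here is a minimal sketch that picks the smallest cache size whose predicted mean latency meets a target. The latency model is an assumption made for illustration (one fixed hit latency, one fixed miss latency, mean latency = miss ratio × miss latency + hit ratio × hit latency); the slides do not specify a model, and `size_index_for_latency` is a hypothetical name.

```c
#include <stddef.h>

/* miss_ratio[i] is the MRC value at the i-th candidate cache size, in
 * increasing size order. Returns the smallest index whose predicted mean
 * latency is at or below target_us, or n if no candidate meets it. */
size_t size_index_for_latency(const double *miss_ratio, size_t n,
                              double hit_us, double miss_us,
                              double target_us)
{
    for (size_t i = 0; i < n; i++) {
        double m = miss_ratio[i];
        double mean_us = m * miss_us + (1.0 - m) * hit_us;
        if (mean_us <= target_us)
            return i;
    }
    return n;
}
```

With miss ratios {1.0, 0.5, 0.25, 0.125}, a 100 µs hit and a 1100 µs miss, the predicted latencies are 1100, 600, 350 and 225 µs, so a 400 µs target selects the third size.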
Example: Achieving Latency Target
[Figure: plot marking the latency target and the cache allocation that meets it; axis labels not recoverable]
Next-Generation Capabilities
- Unified monitoring and optimization
- New invention can “bend” MRC curves
- Further performance improvements
  - Significant improvement, even for single workload!
  - Bigger wins for mixed and partitioned workloads
- Ongoing, in-progress R&D
Next-Gen Curve-Bending AI
[Figure: plot marking the ideal curve; axis labels not recoverable]
Conclusion
- Miss Ratio Curves (MRCs): game-changing storage tool
- CloudPhysics’ MRC algorithm = up to 10,000x improvement
- Online MRCs now practical:
  - ~20 million IO/s per core; amortized 60 ns per IO
  - High accuracy in 1 MB
  - Feasible for memory-constrained firmware, drivers
- Looking to partner with storage systems vendors
- Applications:
  - Workload-aware predictive cache sizing & tuning
  - Software-driven cache partitioning for “free” performance
  - Latency / Throughput guarantees via cache QoS
Appendix
Example SHARDS MRCs
[Figure: SHARDS MRCs, miss ratio (0.0 to 1.0) vs. cache size (0 to 80 GB), with one curve per sample size smax = 128, 512, 2K, 8K, and 32K plus the exact MRC]
- Block I/O trace t04
  - Production VM disk
  - 69.5M refs, 5.2M unique
- Sample size smax
  - Vary from 128 to 32K
  - smax ≥ 2K very accurate
- Small constant footprint
- SHARDSadj adjustment
Generalizing to Non-LRU Policies
[Figure: example ARC miss ratio curves; axis labels and legend not recoverable]
- Sophisticated algorithms
  - ARC, LIRS, Clock-Pro, …
  - No single-pass methods!
- Scaled-down simulation
  - Hashed spatial sampling
  - Simulate each size separately
- Example ARC results
  - 100 different cache sizes
  - 0.01 MAE with R = 0.001
  - 1000× memory reduction
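The scaled-down simulation idea can be sketched as follows: keep only the hash-sampled references, feed them to a miniature cache whose size is shrunk by the same rate R, and read the miss ratio off the miniature. This is a sketch under stated assumptions, not the CloudPhysics code: a basic CLOCK cache stands in for the more sophisticated policies named above, `mix64` is a stand-in hash based on the splitmix64 finalizer, and all type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MINI_CAP 256                  /* max blocks in the miniature cache */

typedef struct {
    uint64_t lbn[MINI_CAP];
    uint8_t  ref[MINI_CAP];           /* CLOCK reference bits */
    size_t   size, used, hand;
} ClockCache;

/* Returns 1 on a hit, 0 on a miss (the block is then inserted). */
static int clock_access(ClockCache *c, uint64_t b)
{
    for (size_t i = 0; i < c->used; i++)
        if (c->lbn[i] == b) { c->ref[i] = 1; return 1; }
    if (c->used < c->size) {          /* free slot available */
        c->lbn[c->used] = b;
        c->ref[c->used++] = 0;
        return 0;
    }
    while (c->ref[c->hand]) {         /* give second chances */
        c->ref[c->hand] = 0;
        c->hand = (c->hand + 1) % c->size;
    }
    c->lbn[c->hand] = b;              /* replace the victim */
    c->ref[c->hand] = 0;
    c->hand = (c->hand + 1) % c->size;
    return 0;
}

/* Stand-in hash; the slides do not name the hash function used. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Estimate the miss ratio of a cache_blocks-sized CLOCK cache by
 * simulating a cache scaled down by R = t / p on sampled references. */
double scaled_down_miss_ratio(const uint64_t *trace, size_t n,
                              size_t cache_blocks, uint64_t t, uint64_t p)
{
    ClockCache c = { .size = (size_t)(cache_blocks * t / p) };
    if (c.size < 1) c.size = 1;
    if (c.size > MINI_CAP) c.size = MINI_CAP;

    size_t sampled = 0, misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (mix64(trace[i]) % p >= t)
            continue;                 /* unsampled block: skip */
        sampled++;
        misses += !clock_access(&c, trace[i]);
    }
    return sampled ? (double)misses / (double)sampled : 0.0;
}
```

Calling the function once per cache size of interest mirrors the "simulate each size separately" bullet; with t equal to p it degenerates to a full, unsampled simulation.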
Mattson Algorithm Example
[Figure: example reference string over blocks A, B, C, D, annotated with per-reference reuse distances (∞, 3, 7, 4, …) and hit (✓) / miss (✗) marks]
- Reuse distance
  - Unique refs since last access
  - Distance from top of LRU-ordered stack
- Hit if distance < cache size, else miss
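The stack-distance rule above can be sketched in a few lines of C. This is a deliberately naive O(NM) version using an array as the LRU stack, with a fixed illustrative bound on distinct blocks; all names (`LruStack`, `mattson_histogram`, and so on) are hypothetical. One pass builds a reuse-distance histogram, from which the miss count at any cache size follows.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_UNIQUE 1024            /* illustrative bound on distinct blocks */
#define DIST_INF   SIZE_MAX        /* reuse distance of a first access */

typedef struct {
    uint64_t stack[MAX_UNIQUE];    /* LRU order: stack[0] is most recent */
    size_t   depth;                /* distinct blocks currently tracked */
} LruStack;

/* Process one reference and return its reuse distance: the block's
 * 0-based depth in the LRU stack, or DIST_INF on a first access. */
size_t lru_stack_access(LruStack *s, uint64_t block)
{
    size_t pos = 0;
    while (pos < s->depth && s->stack[pos] != block)
        pos++;

    size_t dist = DIST_INF;
    if (pos < s->depth)
        dist = pos;                        /* found at depth pos */
    else if (s->depth < MAX_UNIQUE)
        s->depth++;                        /* first access: stack grows */
    else
        pos = MAX_UNIQUE - 1;              /* full: overwrite coldest entry */

    for (size_t i = pos; i > 0; i--)       /* shift [0, pos) down one slot */
        s->stack[i] = s->stack[i - 1];
    s->stack[0] = block;
    return dist;
}

/* Single pass: hist[d] counts references with reuse distance d;
 * *cold counts first accesses (infinite distance). */
void mattson_histogram(const uint64_t *trace, size_t n,
                       size_t hist[MAX_UNIQUE], size_t *cold)
{
    LruStack s = { .depth = 0 };
    for (size_t d = 0; d < MAX_UNIQUE; d++)
        hist[d] = 0;
    *cold = 0;
    for (size_t i = 0; i < n; i++) {
        size_t dist = lru_stack_access(&s, trace[i]);
        if (dist == DIST_INF)
            (*cold)++;
        else
            hist[dist]++;
    }
}

/* Misses for an LRU cache of c blocks: cold misses plus every reference
 * with distance >= c (a hit requires distance < c). */
size_t mattson_misses(const size_t hist[MAX_UNIQUE], size_t cold, size_t c)
{
    size_t misses = cold;
    for (size_t d = c; d < MAX_UNIQUE; d++)
        misses += hist[d];
    return misses;
}
```

Dividing `mattson_misses` by the trace length for each cache size yields the full LRU miss ratio curve from one pass over the trace.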
Spatially Hashed Sampling
For each referenced location Li:
- Randomize: Ti = hash(Li) mod P
- Sample? If Ti < T, process the reference; otherwise skip it
- T is an adjustable threshold between 0 and P: values below T are sampled, values from T up to P are unsampled
- Sampling rate R = T / P
- Subset inclusion property maintained as R is lowered
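A minimal sketch of the sampling filter follows. The modulus value and the hash are assumptions chosen for illustration (the slides name neither); `mix64` is a stand-in based on the splitmix64 finalizer, and the function names are hypothetical. Because the decision depends only on the block's hash, every reference to a given block is either always sampled or never sampled, and lowering the threshold can only shrink the sampled set.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHARDS_P (1u << 24)        /* modulus P; illustrative value */

/* Stand-in hash; any well-distributed 64-bit hash would do. */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Ti = hash(Li) mod P */
uint32_t shards_ti(uint64_t lbn)
{
    return (uint32_t)(mix64(lbn) % SHARDS_P);
}

/* Sample the reference if and only if Ti < T, giving rate R = T / P. */
bool shards_is_sampled(uint64_t lbn, uint32_t threshold)
{
    return shards_ti(lbn) < threshold;
}
```

For example, a threshold of SHARDS_P / 100 corresponds to a nominal rate of R = 0.01.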
Basic SHARDS
- Each sample statistically represents 1/R blocks
- Scale up reuse distances by the same factor

Per-reference flow for location Li:
- Randomize: Ti = hash(Li) mod P
- Sample? If Ti < T, continue; otherwise skip the reference
- Compute distance with a standard reuse distance algorithm
- Scale up: divide the distance by R
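The scale-up step can be written in integer arithmetic, in keeping with the "no floating-point" constraint mentioned earlier in the deck. In this sketch the `ShardsHisto` type, the bucket layout, and the function names are assumptions for illustration; the reuse distance passed in is assumed to have been measured over the sampled references only.

```c
#include <stdint.h>

#define NUM_BUCKETS 64             /* illustrative histogram resolution */

typedef struct {
    uint64_t P, T;                 /* sampling rate R = T / P */
    uint64_t bucket_blocks;        /* histogram bucket width, in blocks */
    uint64_t counts[NUM_BUCKETS];  /* scaled reuse-distance histogram */
} ShardsHisto;

/* Each sample stands for 1/R = P/T blocks, so a distance measured over
 * the sampled stream is multiplied by P/T. Assumes the product fits in
 * 64 bits. */
uint64_t shards_scale_distance(const ShardsHisto *h, uint64_t sampled_dist)
{
    return sampled_dist * h->P / h->T;
}

/* Record one sampled reference at its scaled distance; distances past
 * the last bucket are clamped into it. */
void shards_record(ShardsHisto *h, uint64_t sampled_dist)
{
    uint64_t b = shards_scale_distance(h, sampled_dist) / h->bucket_blocks;
    if (b >= NUM_BUCKETS)
        b = NUM_BUCKETS - 1;
    h->counts[b]++;
}
```

With P = 1000 and T = 10 (R = 0.01), a sampled distance of 5 scales to 500 blocks.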
SHARDS in Constant Space
Same per-reference flow as basic SHARDS (randomize, sample if Ti < T, compute distance with a standard reuse distance algorithm, scale up by ÷ R), plus a bounded sample set:
- Evict samples to bound set size
- Lower threshold to T = Tmax, which reduces rate R = T / P
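The bounded sample set can be sketched as below, taking Tmax to be the largest hash value currently in the set (an interpretation of the diagram, so treat it as an assumption). When the set grows past its bound, the threshold drops to Tmax and every sample at or above it is evicted. The names and the tiny bound are hypothetical, and a real implementation would use a priority queue rather than linear scans.

```c
#include <stddef.h>
#include <stdint.h>

#define S_MAX 4                    /* tiny bound for illustration only */

typedef struct { uint64_t lbn; uint32_t ti; } Sample;

typedef struct {
    Sample   set[S_MAX + 1];       /* one spare slot used before eviction */
    size_t   count;
    uint32_t threshold;            /* current T; rate R = T / P */
} SampleSet;

/* Offer a block whose hash value is ti. Returns 1 if the block is
 * tracked after the call, 0 if it was skipped or evicted. */
int sample_set_offer(SampleSet *s, uint64_t lbn, uint32_t ti)
{
    if (ti >= s->threshold)
        return 0;                              /* not sampled */
    for (size_t i = 0; i < s->count; i++)
        if (s->set[i].lbn == lbn)
            return 1;                          /* already tracked */
    s->set[s->count].lbn = lbn;
    s->set[s->count].ti = ti;
    s->count++;
    if (s->count <= S_MAX)
        return 1;

    uint32_t tmax = 0;                         /* largest hash in the set */
    for (size_t i = 0; i < s->count; i++)
        if (s->set[i].ti > tmax)
            tmax = s->set[i].ti;
    s->threshold = tmax;                       /* lower T to Tmax */

    size_t kept = 0;                           /* evict samples with ti >= T */
    for (size_t i = 0; i < s->count; i++)
        if (s->set[i].ti < tmax)
            s->set[kept++] = s->set[i];
    s->count = kept;
    return ti < tmax;
}
```

Evicted blocks would also need to be removed from the reuse-distance structure; that bookkeeping is omitted here.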
Experimental Evaluation
- Data collection
  - SaaS caching analytics
  - Remotely stream VMware vscsiStats
- 124 trace files
  - 106 week-long traces (CloudPhysics customers)
  - 12 MSR and 6 FIU traces (SNIA IOTTA)
- LRU, 16 KB block size
Dynamic Rate Adaptation
[Figure: plot accompanying the points below; axis labels not recoverable]
- Adjust sampling rate
  - Start with R = 0.1
  - Lower R as M increases
  - Shape depends on trace
- Rescale histogram counts
  - Discount evicted samples
  - Correct relative weighting
  - Scale by Rnew / Rold
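The rescaling step can be illustrated with integer thresholds: since R = T / P, the ratio Rnew / Rold equals Tnew / Told and P cancels. The function name and the plain truncating division are assumptions for this sketch; a production implementation would likely use fixed-point counts to limit rounding loss.

```c
#include <stddef.h>
#include <stdint.h>

/* After lowering the threshold from t_old to t_new, scale every existing
 * histogram count by Rnew / Rold = t_new / t_old so older samples are not
 * over-weighted relative to samples taken at the new, lower rate. */
void shards_rescale_counts(uint64_t *counts, size_t n,
                           uint64_t t_old, uint64_t t_new)
{
    for (size_t i = 0; i < n; i++)
        counts[i] = counts[i] * t_new / t_old;   /* truncates toward zero */
}
```

Halving the threshold, for instance, halves every accumulated count.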
Error Analysis
[Figure: mean absolute error (MAE, 0.00 to 0.10) vs. sample size smax (256 to 32K)]
- Mean Absolute Error (MAE)
  - | exact - approx |
  - Average over all cache sizes
- Full set of 124 traces
- Error ∝ 1 / √smax
- MAE for smax = 8K
  - 0.0027 median
  - 0.0171 worst-case
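The MAE metric as defined above is straightforward to compute offline. This sketch assumes both curves are sampled at the same n cache sizes; the function name is hypothetical.

```c
#include <stddef.h>

/* Mean absolute error between an exact and an approximate MRC, averaged
 * over the n cache sizes at which both curves are evaluated. */
double mrc_mae(const double *exact, const double *approx, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = exact[i] - approx[i];
        sum += (d < 0.0) ? -d : d;
    }
    return n ? sum / (double)n : 0.0;
}
```

For example, exact values {1.0, 0.5, 0.25, 0.25} against approximations {1.0, 0.75, 0.25, 0.0} differ by 0.25 at two of four sizes, giving an MAE of 0.125.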
Memory Footprint
[Figure: memory usage (MB, log scale from 0.1 to 100000) by trace number (0 to 120) for four configurations: baseline (unsampled), SHARDS R = 0.010, SHARDS R = 0.001, and SHARDS smax = 8K]
- Full set of 124 traces
- Sequential PARDA
- Basic SHARDS
  - Modified PARDA
  - Memory ≈ R × baseline for larger traces
- Fixed-size SHARDS
  - New space-efficient code
  - Constant 1 MB footprint
Processing Time
[Figure: CPU usage (sec, log scale from 0.01 to 100000) by trace number (0 to 120) for four configurations: baseline (unsampled), SHARDS R = 0.010, SHARDS R = 0.001, and SHARDS smax = 8K]
- Full set of 124 traces
- Sequential PARDA
- Basic SHARDS
  - Modified PARDA
  - R = 0.001 speedup 41–1029×
- Fixed-size SHARDS
  - New space-efficient code
  - Overhead for evictions
  - smax = 8K speedup 6–204×