Austere Flash Caching with Deduplication and Compression
Qiuping Wang*, Jinhong Li*, Wen Xia#, Erik Kruus^, Biplob Debnath^, Patrick P. C. Lee*
*The Chinese University of Hong Kong (CUHK)
#Harbin Institute of Technology, Shenzhen
^NEC Labs
1
Flash Caching
Ø Flash-based solid-state drives (SSDs)
• ✓ Faster than hard disk drives (HDDs)
• ✓ Better reliability
• ✗ Limited capacity and endurance
Ø Flash caching
• Accelerate HDD storage by caching frequently accessed blocks in flash
2
Deduplication and Compression
ØReduce storage and I/O overheads
Ø Deduplication (coarse-grained)
• In units of chunks (fixed- or variable-size)
• Compute fingerprint (e.g., SHA-1) from chunk content
• Reference identical (same FP) logical chunks to a physical copy
Ø Compression (fine-grained)
• In units of bytes
• Transform chunks into fewer bytes
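The two steps above can be sketched in a few lines. This is a minimal illustration (not the paper's implementation): SHA-1 fingerprints identify duplicate chunks, and only new content is compressed and stored; the store/logical names are mine.

```python
import hashlib
import zlib

def fingerprint(chunk: bytes) -> str:
    # Fingerprint a chunk by hashing its content (SHA-1, as in the slide).
    return hashlib.sha1(chunk).hexdigest()

store = {}    # FP -> one physical (compressed) copy
logical = []  # one FP per logical chunk

for chunk in [b"A" * 4096, b"B" * 4096, b"A" * 4096]:
    fp = fingerprint(chunk)
    if fp not in store:                   # deduplication: write only new content
        store[fp] = zlib.compress(chunk)  # compression: fewer bytes per chunk
    logical.append(fp)

# Three logical chunks, but only two physical compressed copies.
```

Reading back a logical chunk is `zlib.decompress(store[fp])`, which is why every duplicate reference costs only an index entry, not another physical copy.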
3
Deduplicated and Compressed Flash Cache
ØLBA: chunk address in HDD; FP: chunk fingerprint
ØCA: chunk address in flash cache (after dedup’ed + compressed)
4
[Figure: architecture of a deduplicated and compressed flash cache. RAM holds the LBA-index (LBA → FP), the FP-index (FP → CA, length), and a dirty list. Read/write I/O to the HDD is split into fixed-size chunks; after deduplication and compression, variable-size compressed chunks are stored on the SSD.]
Memory Amplification for Indexing
Ø Example: 512-GiB flash cache with 4-TiB HDD working set
Ø Conventional flash cache
• Memory overhead: 256 MiB
Ø Deduplicated and compressed flash cache
• LBA-index: 3.5 GiB
• FP-index: 512 MiB
• Memory amplification: 16×
• Can be higher
5
LBA (8 B) → CA (8 B)
LBA (8 B) → FP (20 B)
FP (20 B) → CA (8 B) + Length (4 B)
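A back-of-the-envelope check reproduces the slide's numbers from the entry sizes above, assuming the paper's default 32 KiB chunk size:

```python
GiB = 1024 ** 3
MiB = 1024 ** 2
CHUNK = 32 * 1024                      # 32 KiB chunks (the paper's default)

cache_chunks = 512 * GiB // CHUNK      # chunks resident in the flash cache
hdd_chunks = 4 * 1024 * GiB // CHUNK   # chunks in the 4-TiB HDD working set

# Conventional cache: LBA (8 B) -> CA (8 B), one entry per cached chunk.
conventional = cache_chunks * (8 + 8)
# LBA-index: LBA (8 B) -> FP (20 B), one entry per working-set chunk.
lba_index = hdd_chunks * (8 + 20)
# FP-index: FP (20 B) -> CA (8 B) + length (4 B), one entry per cached chunk.
fp_index = cache_chunks * (20 + 8 + 4)

print(conventional // MiB)                      # 256 (MiB)
print(lba_index / GiB)                          # 3.5 (GiB)
print(fp_index // MiB)                          # 512 (MiB)
print((lba_index + fp_index) // conventional)   # 16 (x amplification)
```

The amplification grows further with smaller chunks or a larger HDD working set, which is the "can be higher" caveat.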
Related Work
Ø Nitro [Li et al., ATC'14]
• First work to study deduplication and compression in flash caching
• Manage compressed data in Write-Evict Units (WEUs)
Ø CacheDedup [Li et al., FAST'16]
• Propose dedup-aware algorithms for flash caching to improve hit ratios
6
They both suffer from memory amplification!
Our Contribution
7
Ø AustereCache: a deduplicated and compressed flash cache with austere, memory-efficient management
• Bucketization
• No overhead for address mappings
• Hash chunks to storage locations
• Fixed-size compressed data management
• No tracking of compressed chunk lengths in memory
• Bucket-based cache replacement
• Cache replacement per bucket
• Count-Min Sketch [Cormode 2005] for low-memory reference counting
Ø Extensive trace-driven evaluation and prototype experiments
Bucketization
Ø Main idea
• Use hashing to partition index and cache space
• (RAM) LBA-index and FP-index
• (SSD) metadata region and data region
• Store partial keys (prefixes) in memory for memory savings
Ø Layout
• Hash entries into equal-sized buckets
• Each bucket has a fixed number of slots
8
(RAM) LBA-index and FP-index
Ø (RAM) LBA-index and FP-index
• Locate buckets with hash suffixes
• Match slots with hash prefixes
• Each slot in the FP-index corresponds to a storage location in flash
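The suffix/prefix split can be sketched as follows. This is a hypothetical illustration, not AustereCache's code: the parameters, `split_hash`, `insert`, and `lookup` are mine; only the idea (bucket chosen by hash suffix, short prefix kept in RAM, full key validated in flash) comes from the slides.

```python
import hashlib

NUM_BUCKETS = 1 << 16     # illustrative sizes, not the paper's
SLOTS_PER_BUCKET = 4
PREFIX_BITS = 16

def split_hash(key: bytes):
    # Hash the key (160-bit SHA-1): the suffix picks the bucket, and only a
    # short prefix is kept in RAM (the full key lives in flash metadata).
    h = int.from_bytes(hashlib.sha1(key).digest(), "big")
    return h % NUM_BUCKETS, h >> (160 - PREFIX_BITS)

index = [[] for _ in range(NUM_BUCKETS)]   # each bucket: a few slots

def insert(key: bytes, location: int) -> None:
    bucket, prefix = split_hash(key)
    slots = index[bucket]
    if len(slots) < SLOTS_PER_BUCKET:
        slots.append((prefix, location))   # store only the prefix in RAM

def lookup(key: bytes):
    bucket, prefix = split_hash(key)
    # A prefix match may collide; the full key stored in the flash
    # metadata region must validate the hit before it is trusted.
    return [loc for p, loc in index[bucket] if p == prefix]
```

Because no LBA→FP or FP→CA pointer table is kept, the memory cost per entry is just the prefix and flags, which is where the savings over full mappings come from.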
9
[Figure: LBA-index buckets with slots of (LBA-hash prefix, FP-hash, flag); FP-index buckets with slots of (FP-hash prefix, flag).]
(SSD) Metadata and Data Regions
Ø (SSD) Metadata region and data region
• Each slot has the full FP and a list of full LBAs in the metadata region
• For validation against prefix collisions
• Cached chunks reside in the data region
10
[Figure: metadata-region buckets with slots of (FP, list of LBAs); data-region buckets with slots holding the cached chunks.]
Fixed-size Compressed Data Management
Ø Main idea
• Slice and pad a compressed chunk into fixed-size subchunks
Ø Advantages
• Compatible with bucketization
• Store each subchunk in one slot
• Allow per-chunk management for cache replacement
11
[Figure: a 32 KiB chunk is compressed to 20 KiB, then sliced and padded into 8 KiB subchunks.]
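The slice-and-pad step is simple enough to show directly; this is a minimal sketch (the function name is mine), using the 8 KiB subchunk size from the example:

```python
SUBCHUNK = 8 * 1024   # 8 KiB subchunks, as in the example above

def slice_and_pad(compressed: bytes) -> list[bytes]:
    # Slice a variable-size compressed chunk into fixed-size subchunks,
    # zero-padding the last one so every piece fills exactly one slot.
    subchunks = []
    for off in range(0, len(compressed), SUBCHUNK):
        piece = compressed[off:off + SUBCHUNK]
        subchunks.append(piece.ljust(SUBCHUNK, b"\0"))
    return subchunks

# E.g., a chunk that compresses to ~20 KiB occupies 3 subchunks (24 KiB).
# The true compressed length needed to strip the padding on reads is kept
# in the flash metadata region, not in RAM.
```

The padding wastes some flash writes (quantified in the evaluation), but it is what lets each subchunk occupy exactly one slot and keeps compressed lengths out of memory.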
Fixed-size Compressed Data Management
Ø Layout
• One chunk occupies multiple consecutive slots
• No additional memory for compressed lengths
12
[Figure: FP-index slots (RAM) map to consecutive slots on the SSD: the metadata region stores (FP, list of LBAs, length), and the data region stores the subchunks of each chunk.]
Bucket-based Cache Replacement
Ø Main idea
• Perform cache replacement in each bucket independently
• Eliminate priority-based structures for cache decisions
13
[Figure: per-bucket replacement. LBA-index slots are ordered from recent to old; FP-index slots carry reference counters.]
• Combine recency and deduplication
• LBA-index: least-recently-used policy
• FP-index: least-referenced policy
• Weighted reference counting based on recency in LBAs
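A per-bucket eviction decision along these lines can be sketched as below. This is a hypothetical illustration: the names and the exact weighting function are mine (the slides only say references from recent LBAs weigh more), so treat it as the shape of the idea rather than the paper's formula.

```python
# Per-bucket eviction: no global priority queue; each bucket ranks only
# its own slots. FP slots are scored by weighted reference counts, where
# references from recently used LBAs contribute more.

def evict_slot(bucket_slots, lba_recency):
    # bucket_slots: list of (fp, [lbas referencing fp]) for one bucket
    # lba_recency:  LBA -> rank, 0 = most recent (per-bucket LRU order)
    def score(entry):
        _, lbas = entry
        # Illustrative weighting: a reference from LBA rank r counts 1/(1+r),
        # so recent LBAs dominate the score.
        return sum(1.0 / (1 + lba_recency.get(lba, len(lba_recency)))
                   for lba in lbas)
    return min(bucket_slots, key=score)[0]   # evict the least-referenced FP
```

Because each bucket holds only a handful of slots, scanning them on eviction is cheap, which is what lets the design drop priority-based structures entirely.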
Sketch-based Reference Counting
Ø High memory overhead for complete reference counting
• One counter for every FP-hash
Ø Count-Min Sketch [Cormode 2005]
• Fixed memory usage with provable error bounds
14
[Figure: a Count-Min Sketch of h rows × w counters; each update increments one counter per row. FP-hash count = minimum counter indexed by (i, Hᵢ(FP-hash)) over all rows i.]
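The structure is compact enough to show in full. A minimal sketch (illustrative parameters; the row hashes Hᵢ are derived here from salted BLAKE2b, which is my choice, not the paper's):

```python
import hashlib

class CountMinSketch:
    # Fixed-memory approximate counters: h rows of w counters. Each update
    # increments one counter per row; a query takes the minimum across rows,
    # which upper-bounds the true count with provable error bounds.
    def __init__(self, w=1024, h=4):
        self.w, self.h = w, h
        self.table = [[0] * w for _ in range(h)]

    def _cols(self, key: bytes):
        # Row i uses an independent hash H_i, here a salted BLAKE2b digest.
        for i in range(self.h):
            d = hashlib.blake2b(key, salt=i.to_bytes(8, "big")).digest()
            yield i, int.from_bytes(d[:8], "big") % self.w

    def add(self, key: bytes):
        for i, c in self._cols(key):
            self.table[i][c] += 1

    def count(self, key: bytes) -> int:
        return min(self.table[i][c] for i, c in self._cols(key))
```

Memory is fixed at w × h counters regardless of how many distinct FP-hashes are seen; collisions can only over-count, never under-count, which is acceptable for ranking eviction candidates.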
Evaluation
Ø Implement AustereCache as a user-space block device
• ~4.5K lines of C++ code in Linux
Ø Traces
• FIU traces: WebVM, Homes, Mail
• Synthetic traces: varying I/O dedup ratio and write-to-read ratio
• I/O dedup ratio: fraction of duplicate written chunks among all written chunks
Ø Schemes
• AustereCache: AC-D, AC-DC
• CacheDedup: CD-LRU-D, CD-ARC-D, CD-ARC-DC
15
Memory Overhead
16
Ø AC-D uses 69.9-94.9% and 70.4-94.7% less memory than CD-LRU-D and CD-ARC-D, respectively, across all traces
Ø AC-DC uses 87.0-97.0% less memory than CD-ARC-DC
(a) WebVM (b) Homes (c) Mail
Figure 5: Exp#1 (Overall memory usage). Note that the y-axes are in log scale.
new SHA-1 fingerprint for the 32 KiB chunk. Table 2 shows the basic statistics of each regenerated FIU trace on 32 KiB chunks.
The original FIU traces have no compression details. Thus, for each chunk fingerprint, we set its compressibility ratio (i.e., the ratio of raw bytes to compressed bytes) following a normal distribution with mean 2 and variance 0.25 as in [24].
Synthetic. For throughput measurement (§5.5), we build a synthetic trace generator to account for different access patterns. Each synthetic trace is configured by two parameters: (i) I/O deduplication ratio, which specifies the fraction of writes that can be removed on the write path due to deduplication; and (ii) write-to-read ratio, which specifies the ratio of writes to reads.
We generate a synthetic trace as follows. First, we randomly generate a working set by choosing arbitrary LBAs within the primary storage. Then we generate an access pattern based on the given write-to-read ratio, such that the write and read requests each follow a Zipf distribution. We derive the chunk content of each write request based on the given I/O deduplication ratio as well as the compressibility ratio as in the FIU trace generation (see above). Currently, our evaluation fixes the working set size as 128 MiB, the primary storage size as 5 GiB, and the Zipf constant as 1.0; such parameters are all configurable.
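A generator of this shape can be sketched as follows. This is an illustrative reconstruction, not the paper's tool: the function and parameter names are mine, content is modeled as abstract IDs rather than bytes, and the Zipf draw uses explicit rank weights.

```python
import random

def make_trace(n_requests=1000, n_lbas=512, write_ratio=0.7,
               dedup_ratio=0.5, zipf_s=1.0, seed=1):
    # Sketch of the synthetic generator described above. LBAs are drawn
    # from a Zipf distribution over ranks; each write duplicates an earlier
    # chunk's content with probability dedup_ratio, else is fresh content.
    rng = random.Random(seed)
    lbas = list(range(n_lbas))
    weights = [1.0 / (rank + 1) ** zipf_s for rank in range(n_lbas)]
    seen, trace, next_id = [], [], 0
    for _ in range(n_requests):
        lba = rng.choices(lbas, weights=weights)[0]
        if rng.random() < write_ratio:
            if seen and rng.random() < dedup_ratio:
                content = rng.choice(seen)      # duplicate chunk content
            else:
                content, next_id = next_id, next_id + 1
                seen.append(content)
            trace.append(("W", lba, content))
        else:
            trace.append(("R", lba, None))
    return trace
```

Varying `write_ratio` and `dedup_ratio` then reproduces the two experiment axes (write-to-read ratio and I/O dedup ratio) used in the throughput tests.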
5.2 Setup
Testbed. We conduct our experiments on a machine running Ubuntu 18.04 LTS with Linux kernel 4.15. The machine is equipped with a 10-core 2.2 GHz Intel Xeon E5-2630 v4 CPU, 32 GiB DDR4 RAM, a 1 TiB Seagate ST1000DM010-2EP1 SATA HDD as the primary storage, and a 128 GiB Intel SSDSC2BW12 SATA SSD as the flash cache.
Default setup. For both AustereCache and CacheDedup, we configure the size of the FP-index based on a fraction of the working set size (WSS) of each trace, and fix the size of the LBA-index at four times that of the FP-index. We store both the LBA-index and the FP-index in memory for high performance. For AustereCache, we set the default chunk size and subchunk size as 32 KiB and 8 KiB, respectively. For CD-ARC-DC in CacheDedup, we set the WEU size as 2 MiB (the default in [26]).
5.3 Comparative Analysis
We compare AustereCache and CacheDedup in terms of memory usage, read hit ratios, and write reduction ratios using the FIU traces.
Exp#1 (Overall memory usage). We compare the memory usage of different schemes. We vary the flash cache size from 12.5% to 100% of the WSS of each FIU trace, and configure the LBA-index and the FP-index based on our default setup (§5.2). To obtain the actual memory usage (rather than the allocated memory space for the index structures), we call malloc_trim at the end of each trace replay to return all unallocated memory from the process heap to the operating system, and check the resident set size (RSS) from /proc/self/stat as the memory usage.
Figure 5 shows that AustereCache significantly reduces memory usage compared to CacheDedup. For the non-compression schemes (i.e., AC-D, CD-LRU-D, and CD-ARC-D), AC-D uses 69.9-94.9% and 70.4-94.7% less memory than CD-LRU-D and CD-ARC-D, respectively, across all traces. For the compression schemes (i.e., AC-DC and CD-ARC-DC), AC-DC uses 87.0-97.0% less memory than CD-ARC-DC.
AustereCache achieves higher memory savings than CacheDedup in compression mode, since CD-ARC-DC needs to additionally maintain the lengths of all compressed chunks, while AC-DC eliminates such information. If we compare the memory overhead with and without compression, CD-ARC-DC incurs 78-194% more memory usage than CD-ARC-D across all traces, implying that compression comes with a high memory-usage penalty in CacheDedup. In contrast, AC-DC incurs only 2-58% more memory than AC-D.
Exp#2 (Impact of design techniques on memory savings). We study how different design techniques of AustereCache contribute to its memory savings. We mainly focus on bucketization (§3.1) and bucket-based cache replacement (§3.3); for fixed-size compressed data management (§3.2), we refer readers to Exp#1 for our analysis.
We choose CD-LRU-D of CacheDedup as our baseline and compare it with AC-D (both are non-compressed versions), and add individual techniques to see how they contribute to the memory savings of AC-D. We consider four variants:
Read Hit Ratios
17
Ø AC-D has up to a 39.2% higher read hit ratio than CD-LRU-D, and a similar read hit ratio to CD-ARC-D
Ø AC-DC has up to a 30.7% higher read hit ratio than CD-ARC-DC
(a) WebVM (b) Homes (c) Mail
Figure 7: Exp#3 (Read hit ratio).
Figure 7 shows the results. AustereCache generally achieves higher read hit ratios than the different CacheDedup algorithms. For the non-compression schemes, AC-D increases the read hit ratio of CD-LRU-D by up to 39.2%. The reason is that CD-LRU-D is only aware of request recency and fails to clean stale chunks in time (§2.4), while AustereCache favors evicting chunks with small reference counts. On the other hand, AC-D achieves similar read hit ratios to CD-ARC-D, and in particular has a higher read hit ratio (up to 13.4%) when the cache size is small in WebVM (12.5% of WSS), by keeping highly referenced chunks in cache. For the compression schemes, AC-DC has higher read hit ratios than CD-ARC-DC, by 0.5-30.7% in WebVM, 0.7-9.9% in Homes, and 0.3-6.2% in Mail. Note that CD-ARC-DC shows a lower read hit ratio than CD-ARC-D although it intuitively stores more chunks with compression, mainly because it cannot quickly evict stale chunks due to the WEU-based organization (§2.4).
Exp#4 (Write reduction ratio). We further evaluate the different schemes in terms of the write reduction ratio, defined as the fraction of reduction of bytes written to the cache due to both deduplication and compression. A high write reduction ratio implies less data written to the flash cache and hence improved performance and endurance.
Figure 8 shows the results. For the non-compression schemes, AC-D, CD-LRU-D, and CD-ARC-D show marginal differences in WebVM and Homes, while in Mail, AC-D has lower write reduction ratios than CD-LRU-D by up to 17.5%. We find that CD-LRU-D tends to keep more stale chunks in cache, thereby saving the writes that hit the stale chunks. For example, when the cache size is 12.5% of WSS in Mail, 17.1% of the write reduction in CD-LRU-D comes from the writes to stale chunks, while in WebVM and Homes, the corresponding numbers are only 3.6% and 1.1%, respectively. AC-D achieves lower write reduction ratios than CD-LRU-D, but achieves much higher read hit ratios (by up to 39.2%) by favoring the eviction of chunks with small reference counts (Exp#3).
For the compression schemes, both CD-ARC-DC and AC-DC have much higher write reduction ratios than the non-compression schemes due to compression. However, AC-DC shows a slightly lower write reduction ratio than CD-ARC-DC, by 7.7-14.5%. The reason is that AC-DC pads the last subchunk of each variable-size compressed chunk, thereby incurring extra writes. As we show later in Exp#5 (§5.4), a smaller subchunk size can reduce the padding overhead, although the memory usage also increases.
5.4 Sensitivity to Parameters
We evaluate AustereCache under different parameter settings using the FIU traces.
Exp#5 (Impact of chunk sizes and subchunk sizes). We evaluate AustereCache on different chunk sizes and subchunk sizes. We focus on the Homes trace and vary the chunk sizes and subchunk sizes as described in §5.1. For varying chunk sizes, we fix the subchunk size as one-fourth of the chunk size; for varying subchunk sizes, we fix the chunk size as 32 KiB. We focus on comparing AC-DC and CD-ARC-DC by fixing the cache size as 25% of WSS. Note that CD-ARC-DC is unaffected by the subchunk size.
Write Reduction Ratios
18
Ø AC-D is comparable to CD-LRU-D and CD-ARC-D
Ø AC-DC is slightly lower (by 7.7-14.5%) than CD-ARC-DC
• Due to padding in compressed data management
(a) WebVM (b) Homes (c) Mail
Figure 8: Exp#4 (Write reduction ratio).
AC-DC maintains its significant memory savings compared to CD-ARC-DC, by 92.8-95.3% for varying chunk sizes (Figure 9(a)) and 93.1-95.1% for varying subchunk sizes (Figure 9(b)). It also maintains higher read hit ratios than CD-ARC-DC, by 5.0-12.3% for varying chunk sizes (Figure 9(c)) and 7.9-10.4% for varying subchunk sizes (Figure 9(d)). AC-DC incurs a slightly lower write reduction ratio than CD-ARC-DC due to padding, by 10.0-14.8% for varying chunk sizes (Figure 9(e)); the results are consistent with those in Exp#4. Nevertheless, using a smaller subchunk size can mitigate the padding overhead. As shown in Figure 9(f), the write reduction ratio of AC-DC approaches that of CD-ARC-DC as the subchunk size decreases. When the subchunk size is 4 KiB, AC-DC has only a 6.2% lower write reduction ratio than CD-ARC-DC. Note that if we change the subchunk size from 8 KiB to 4 KiB, the memory usage increases from 14.5 MiB to 17.3 MiB (by 18.8%), since the number of buckets is doubled in the FP-index (while the LBA-index remains the same).
Exp#6 (Impact of LBA-index sizes). We study the impact of LBA-index sizes. We vary the LBA-index size from 1× to 8× of the FP-index size (recall that the default is 4×), and fix the cache size as 12.5% of WSS.
Figure 10 depicts the memory usage and read hit ratios; we omit the write reduction ratio as there is nearly no change for varying LBA-index sizes. When the LBA-index size increases, the memory usage increases by 17.6%, 111.5%, and 160.9% in WebVM, Homes, and Mail, respectively (Figure 10(a)), as we allocate more buckets in the LBA-index. Note that the increase in memory usage in WebVM is less
Figure 9: Exp#5 (Impact of chunk sizes and subchunk sizes): (a) memory usage vs. chunk size; (b) memory usage vs. subchunk size; (c) read hit ratio vs. chunk size; (d) read hit ratio vs. subchunk size; (e) write reduction ratio vs. chunk size; (f) write reduction ratio vs. subchunk size. We focus on the Homes trace and fix the cache size as 25% of WSS in Homes.
than those in Homes and Mail, mainly because the WSS of WebVM is small and incurs a small actual increase in the total memory usage. Also, the read hit ratio increases with
Throughput
19
Ø AC-DC has the highest throughput
• Due to its high write reduction ratio and high read hit ratio
Ø AC-D has slightly lower throughput than CD-ARC-D
• AC-D needs to access the metadata region during indexing
(a) Memory usage (b) Read hit ratio
Figure 10: Exp#6 (Impact of LBA-index sizes).
Figure 11: Exp#7 (Throughput): (a) throughput vs. I/O dedup ratio (write-to-read ratio 7:3); (b) throughput vs. write-to-read ratio (I/O dedup ratio 50%).
the LBA-index size, until the LBA-index reaches 4× of the FP-index size (Figure 10(b)). In particular, for WebVM, the read hit ratio grows from 36.7% (1×) to 70.4% (8×), while for Homes and Mail, the read hit ratios increase by only 4.3% and 5.3%, respectively. The reason is that when the LBA-index size increases, WebVM shows a higher increase in the total reference counts of the cached chunks than Homes and Mail, implying that more reads can be served by the cached chunks (i.e., higher read hit ratios).
5.5 Throughput and CPU Overhead
We measure the throughput and CPU overhead of AustereCache. We conduct the evaluation on synthetic traces with varying I/O deduplication ratios and write-to-read ratios. We focus on the write-back policy (§2.2), in which AustereCache first persists the written chunks to the flash cache and flushes the chunks to the HDD when they are evicted from the cache. We use direct I/O to remove the impact of the page cache. We report the averaged results over five runs; the standard deviations are small (less than 2.7%) and hence omitted.
Exp#7 (Throughput). We compare the throughput of AustereCache and CacheDedup using synthetic traces. We fix the cache size as 50% of the 128 MiB WSS. Both systems work in single-threaded mode.
Figures 11(a) and 11(b) show the results for varying I/O deduplication ratios (with a fixed write-to-read ratio of 7:3, which represents a write-intensive workload as in the FIU traces) and varying write-to-read ratios (with a fixed I/O deduplication ratio of 50%), respectively. For the non-compression schemes, AC-D achieves 18.5-86.6% higher throughput than CD-LRU-D for all cases except when the write-to-read ratio is 1:9 (slightly slower, by 2.3%). Compared to CD-ARC-D,
Figure 12: Exp#8 (CPU overhead). Figure 13: Exp#9 (Throughput of multi-threading).
AC-D is slower by 1.1-24.5%, since both AC-D and CD-ARC-D have similar read hit ratios and write reduction ratios (§5.3), while AC-D issues additional reads and writes to the metadata region (CD-ARC-D keeps all indexing information in memory). AC-D achieves throughput similar to CD-ARC-D when there are more duplicate chunks (i.e., under high I/O deduplication ratios). For the compression schemes, AC-DC achieves 6.8-99.6% higher throughput than CD-ARC-DC.
Overall, AC-DC achieves the highest throughput among all schemes for two reasons. First, AustereCache generally achieves higher or similar read hit ratios compared to the CacheDedup algorithms (§5.3). Second, AustereCache incorporates deduplication awareness into cache replacement by caching chunks with high reference counts, thereby absorbing more writes in the SSD and reducing writes to the slow HDD.
Exp#8 (CPU overhead). We study the CPU overhead of deduplication and compression in AustereCache along the I/O path. We measure the latencies of four computation steps: fingerprint computation, compression, index lookup, and index update. Specifically, we run the WebVM trace with a cache size of 12.5% of WSS, and collect the statistics of 100 non-duplicate write requests. We also compare their latencies with those of 32 KiB chunk write requests to the SSD and the HDD using the fio benchmark tool [2].
Figure 12 depicts the results. Fingerprint computation has the highest latency (15.5 µs) among the four steps. In total, AustereCache adds around 31.2 µs of CPU overhead. In comparison, the latencies of 32 KiB writes to the SSD and the HDD are 85 µs and 5,997 µs, respectively. Note that the CPU overhead can be suppressed via multi-threaded processing, as shown in Exp#9.
Exp#9 (Throughput of multi-threading). We evaluate the throughput gain of AustereCache when it enables multi-threading and issues concurrent requests to multiple buckets (§4). We use synthetic traces with a write-to-read ratio of 7:3, and consider I/O deduplication ratios of 50% and 80%.
Figure 13 shows the throughput versus the number of threads configured in AustereCache. As the number of threads increases, AustereCache shows a higher throughput gain under the 80% I/O deduplication ratio (from 93.8 MiB/s to 235.5 MiB/s, or 2.51×) than under the 50% I/O deduplication ratio (from 60.0 MiB/s to 124.9 MiB/s, or 2.08×). A higher I/O deduplication ratio implies less I/O to flash, and AustereCache benefits more from multi-threading in parallelizing
CPU Overhead and Multi-threading
20
Ø Latency (32 KiB chunk write)
• HDD (5,997 µs) and SSD (85 µs)
• AustereCache (31.2 µs) (fingerprinting 15.5 µs)
• Latency hidden via multi-threading
Ø Multi-threading (write-to-read ratio 7:3)
• 50% I/O dedup ratio: 2.08×
• 80% I/O dedup ratio: 2.51×
• Higher I/O dedup ratio implies less I/O to flash → more computation savings via multi-threading
(a) Memory usage (b) Read hit ratio
Figure 10: Exp#6 (Impact of LBA-index sizes).
(a) Throughput vs. I/O dedupratio (write-to-read ratio 7:3)
(b) Throughput vs. write-to-readratio (I/O dedup ratio 50%)
Figure 11: Exp#7 (Throughput).
the LBA-index size, until the LBA-index reaches 4⇥ of theFP-index size (Figure 10(b)). In particular, for WebVM, theread hit ratio grows from 36.7% (1⇥) to 70.4% (8⇥), whilefor Homes and Mail, the read hit ratios increase by only 4.3%and 5.3%, respectively. The reason is that when the LBA-index size increases, WebVM shows a higher increase in thetotal reference counts of the cached chunks than Homes andMail, implying that more reads can be served by the cachedchunks (i.e., higher read hit ratios).
5.5 Throughput and CPU Overhead

We measure the throughput and CPU overhead of AustereCache. We conduct the evaluation on synthetic traces with varying I/O deduplication ratios and write-to-read ratios. We focus on the write-back policy (§2.2), in which AustereCache first persists the written chunks to the flash cache and flushes the chunks to the HDD when they are evicted from the cache. We use direct I/O to remove the impact of the page cache. We report the averaged results over five runs; the standard deviations are small (less than 2.7%) and hence omitted.

Exp#7 (Throughput). We compare AustereCache and CacheDedup in throughput using synthetic traces. We fix the cache size at 50% of the 128 MiB WSS. Both systems work in single-threaded mode.
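The write-back behavior described above can be sketched as follows. This is an illustrative model only (hypothetical class and names; the real AustereCache persists compressed chunks to flash and uses bucket-based replacement, not a Python dict with LRU):

```python
import collections

class WriteBackCache:
    """Illustrative write-back cache: writes are persisted to the cache
    first and flushed to backing storage only upon eviction (LRU here)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store            # models the HDD
        self.cache = collections.OrderedDict()  # models the flash cache

    def write(self, lba, chunk):
        if lba in self.cache:
            self.cache.move_to_end(lba)
        self.cache[lba] = chunk                 # persist to the cache first
        if len(self.cache) > self.capacity:     # evict the LRU chunk ...
            victim, data = self.cache.popitem(last=False)
            self.backing[victim] = data         # ... flushing it to the HDD

    def read(self, lba):
        if lba in self.cache:                   # cache hit
            self.cache.move_to_end(lba)
            return self.cache[lba]
        return self.backing.get(lba)            # cache miss: go to the HDD

hdd = {}
cache = WriteBackCache(capacity=2, backing_store=hdd)
cache.write(0, b"a")
cache.write(1, b"b")
cache.write(2, b"c")   # exceeds capacity: LBA 0 is flushed to the HDD
```

Note that under this policy the HDD sees writes only on eviction, which is why reducing evictions (e.g., by caching highly referenced chunks) directly reduces HDD traffic.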
Figures 11(a) and 11(b) show the results for varying I/O deduplication ratios (with a fixed write-to-read ratio of 7:3, which represents a write-intensive workload as in the FIU traces) and varying write-to-read ratios (with a fixed I/O deduplication ratio of 50%), respectively. For the non-compression schemes, AC-D achieves 18.5-86.6% higher throughput than CD-LRU-D in all cases except when the write-to-read ratio is 1:9 (slightly slower, by 2.3%). Compared to CD-ARC-D, AC-D is slower by 1.1-24.5%, since both AC-D and CD-ARC-D have similar read hit ratios and write reduction ratios (§5.3), while AC-D issues additional reads and writes to the metadata region (CD-ARC-D keeps all indexing information in memory). AC-D achieves throughput similar to CD-ARC-D when there are more duplicate chunks (i.e., under high I/O deduplication ratios). For the compression schemes, AC-DC achieves 6.8-99.6% higher throughput than CD-ARC-DC.

[Figure 12: Exp#8 (CPU overhead). Per-step latency (µs) of fingerprint computation, compression, index lookup, and index update, compared with 32 KiB chunk writes to the SSD and the HDD.]

[Figure 13: Exp#9 (Throughput of multi-threading). Throughput (MiB/s) vs. number of threads (1-8) under 50% and 80% I/O deduplication ratios.]
Overall, AC-DC achieves the highest throughput among all schemes, for two reasons. First, AustereCache generally achieves higher or similar read hit ratios compared to the CacheDedup algorithms (§5.3). Second, AustereCache incorporates deduplication awareness into cache replacement by caching chunks with high reference counts, thereby absorbing more writes in the SSD and reducing writes to the slow HDD.

Exp#8 (CPU overhead). We study the CPU overhead of deduplication and compression in AustereCache along the I/O path. We measure the latencies of four computation steps: fingerprint computation, compression, index lookup, and index update. Specifically, we run the WebVM trace with a cache size of 12.5% of the WSS and collect the statistics of 100 non-duplicate write requests. We also compare their latencies with those of 32 KiB chunk write requests to the SSD and the HDD, using the fio benchmark tool [2].
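The per-chunk cost of the first two steps can be approximated with a simple microbenchmark. The sketch below times SHA-1 fingerprinting (as in the paper) and compression on a 32 KiB chunk; zlib is used here only as a stand-in compressor, and the index lookup/update steps are omitted since they depend on AustereCache's internal structures:

```python
import hashlib
import os
import time
import zlib

CHUNK_SIZE = 32 * 1024  # 32 KiB chunks, as in the evaluation

def time_op(op, chunk, runs=100):
    """Average per-call latency (in microseconds) over `runs` calls."""
    start = time.perf_counter()
    for _ in range(runs):
        op(chunk)
    return (time.perf_counter() - start) / runs * 1e6

chunk = os.urandom(CHUNK_SIZE)
fp_us = time_op(lambda c: hashlib.sha1(c).digest(), chunk)    # fingerprinting
comp_us = time_op(lambda c: zlib.compress(c, level=1), chunk)  # compression
print(f"fingerprint: {fp_us:.1f} us, compression: {comp_us:.1f} us")
```

Absolute numbers will differ from the paper's (hardware, compressor, and input compressibility all matter); the point is only that these CPU costs are small relative to a multi-millisecond HDD write.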
Figure 12 depicts the results. Fingerprint computation has the highest latency (15.5 µs) among the four steps. In total, AustereCache adds around 31.2 µs of CPU overhead. On the other hand, the latencies of 32 KiB writes to the SSD and the HDD are 85 µs and 5,997 µs, respectively. Note that the CPU overhead can be suppressed via multi-threaded processing, as shown in Exp#9.

Exp#9 (Throughput of multi-threading). We evaluate the throughput gain of AustereCache when it enables multi-threading and issues concurrent requests to multiple buckets (§4). We use synthetic traces with a write-to-read ratio of 7:3 and consider I/O deduplication ratios of 50% and 80%.
Figure 13 shows the throughput versus the number of threads configured in AustereCache. As the number of threads increases, AustereCache shows a higher throughput gain under the 80% I/O deduplication ratio (from 93.8 MiB/s to 235.5 MiB/s, or 2.51×) than under the 50% I/O deduplication ratio (from 60.0 MiB/s to 124.9 MiB/s, or 2.08×). A higher I/O deduplication ratio implies less I/O to flash, so AustereCache benefits more from multi-threading on parallelizing the computation.
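The bucket-level parallelism described here can be sketched as follows. This is a minimal model (hypothetical class, arbitrary bucket count): each request's fingerprint hashes to a bucket with its own lock, so requests landing in different buckets proceed on different threads without contention:

```python
import hashlib
import threading

NUM_BUCKETS = 16  # arbitrary for illustration

class BucketizedIndex:
    """Illustrative sketch: one lock per bucket, so threads serving
    requests that hash to different buckets run in parallel."""

    def __init__(self):
        self.buckets = [{} for _ in range(NUM_BUCKETS)]
        self.locks = [threading.Lock() for _ in range(NUM_BUCKETS)]

    def _bucket_of(self, fp):
        # Use a prefix (first byte) of the fingerprint to pick a bucket.
        return fp[0] % NUM_BUCKETS

    def insert(self, fp, location):
        b = self._bucket_of(fp)
        with self.locks[b]:  # contention is confined to a single bucket
            self.buckets[b][fp] = location

index = BucketizedIndex()
fps = [hashlib.sha1(bytes([i])).digest() for i in range(256)]

# Four worker threads, each handling a disjoint slice of the requests.
threads = [
    threading.Thread(target=lambda s=s: [index.insert(fp, fp[:4]) for fp in fps[s::4]])
    for s in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Fine-grained per-bucket locking is what lets throughput scale with threads when requests spread across buckets; when I/O dominates (low dedup ratio), the gain from parallelizing computation is smaller, consistent with the results above.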
Conclusion
21
ØAustereCache: memory efficiency in deduplicated and compressed flash caching via
• Bucketization
• Fixed-size compressed data management
• Bucket-based cache replacement
ØSource code: http://adslab.cse.cuhk.edu.hk/software/austerecache
Thank You!
Q & A
22