Flash-based (cloud) storage systems


Lecture 25, Aditya Akella

• BufferHash: invented in the context of network de-dup (e.g., inter-DC log transfers)

• SILT: more “traditional” key-value store

Cheap and Large CAMs for High Performance Data-Intensive Networked Systems

Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella (University of Wisconsin-Madison)

Suman Nath (Microsoft Research)

New data-intensive networked systems

Large hash tables (10s to 100s of GBs)

New data-intensive networked systems

WAN optimizers sit between a data center and a branch office, on either side of the WAN. Each keeps an on-disk object store (~4 TB) of 4 KB chunks and a large hash table (~32 GB) mapping 20-byte chunk keys to chunk pointers. The hash table must sustain high-speed (~10 K/sec) inserts and evictions, and high-speed (~10 K/sec) lookups, to keep up with a 500 Mbps link.

New data-intensive networked systems

• Other systems
  – De-duplication in storage systems (e.g., Data Domain)
  – CCN cache (Jacobson et al., CoNEXT 2009)
  – DONA directory lookup (Koponen et al., SIGCOMM 2007)

Cost-effective large hash tables: Cheap Large cAMs (CLAMs)

Candidate options

             Random reads/sec   Random writes/sec   Cost (128 GB)+
DRAM         300K               300K                $120K+
Disk         250                250                 $30+
Flash SSD    10K*               5K*                 $225+

• Disk is too slow; DRAM is too expensive (~2.5 ops/sec/$); Flash SSD suffers from slow writes
• How to deal with the slow writes of the Flash SSD?

* Derived from latencies on Intel M-18 SSD in experiments
+ Price statistics from 2008-09

CLAM design

• New data structure “BufferHash” + Flash
• Key features
  – Avoid random writes, and perform sequential writes in a batch
    • Sequential writes are 2X faster than random writes (Intel SSD)
    • Batched writes reduce the number of writes going to Flash
  – Bloom filters for optimizing lookups

BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$

Flash/SSD primer

• Random writes are expensive: avoid random page writes

• Reads and writes happen at the granularity of a flash page: I/O smaller than a page should be avoided, if possible
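To make the page-granularity point concrete, here is a minimal Python sketch, not from the talk, of a writer that buffers small writes in DRAM and only issues full, page-aligned writes to the device; the 4 KB page size and the file-backed "device" are assumptions for illustration.

```python
import os

PAGE_SIZE = 4096  # assumed flash page size, for illustration only

class PageAlignedWriter:
    """Buffer small writes in DRAM and issue only whole-page writes to the device."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        self.pending = bytearray()

    def write(self, data: bytes):
        self.pending += data
        while len(self.pending) >= PAGE_SIZE:          # flush one full page at a time
            os.write(self.fd, bytes(self.pending[:PAGE_SIZE]))
            del self.pending[:PAGE_SIZE]

    def close(self):
        if self.pending:                               # pad the final partial page
            os.write(self.fd, bytes(self.pending).ljust(PAGE_SIZE, b"\0"))
        os.close(self.fd)
```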

Conventional hash table on Flash/SSD

• Keys are likely to hash to random locations on flash, producing random writes
• SSDs: the FTL handles random writes to some extent, but the garbage collection overhead is high
• Result: ~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, << the required 10 K/s lookups and 5 K/s inserts

Conventional hash table on Flash/SSD

• Can’t assume locality in requests, so using DRAM as a cache in front of flash won’t work

Our approach: Buffering insertions
• Control the impact of random writes
• Maintain a small hash table (buffer) in memory (DRAM)
• As the in-memory buffer gets full, write it to flash (SSD)
  – We call the in-flash copy of the buffer an incarnation
• Buffer: in-memory hash table; Incarnation: in-flash hash table

Two-level memory hierarchy
• DRAM holds the in-memory buffer; flash holds the incarnation table, with incarnations ordered from the oldest to the latest
• Net hash table is: buffer + all incarnations

Lookups are impacted due to buffers
• A lookup key is first checked against the in-memory buffer, and may then require an in-flash lookup in each incarnation
• Multiple in-flash lookups per key. Can we limit it to only one?

Bloom filters for optimizing lookups
• Keep one in-memory Bloom filter per incarnation; a lookup key is checked against the Bloom filters, and flash is read only for incarnations whose filter matches
• A false positive still costs a wasted flash read, so the filters must be configured carefully
• 2 GB of Bloom filters for 32 GB of flash for a false positive rate < 0.01
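As a rough sanity check on that 2 GB figure, the standard Bloom filter sizing formula gives about 9.6 bits per entry for a 1% false-positive rate. The entry count below (32 GB of on-flash hash table at roughly 32 bytes per entry) is an assumption for illustration, not a number from the talk.

```python
import math

def bloom_bits_per_entry(p):
    """Bits per entry for a Bloom filter with false-positive rate p (optimal hash count)."""
    return -math.log(p) / (math.log(2) ** 2)

entries = (32 * 2**30) // 32                      # assumed: ~1 billion on-flash entries
total_bits = bloom_bits_per_entry(0.01) * entries
print(f"{total_bits / 8 / 2**30:.1f} GB of Bloom filters for p = 0.01")  # ~1.2 GB; 2 GB leaves headroom
```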

Update: naïve approach
• Update the key in place in its in-flash incarnation: expensive random writes
• Discard this naïve approach

Lazy updates
• Instead of updating the key in flash, simply insert (key, new value) into the in-memory buffer
• (key, old value) remains in an older incarnation
• Lookups check the latest incarnations first, so they return the new value

Eviction for streaming apps
• Eviction policies may depend on the application: LRU, FIFO, priority-based eviction, etc.
• Two BufferHash primitives
  – Full Discard: evict all items (naturally implements FIFO)
  – Partial Discard: retain a few items (priority-based eviction by retaining high-priority items)
• BufferHash is best suited for FIFO
  – Incarnations are arranged by age
  – Other useful policies are possible at some additional cost
• Details in the paper

Issues with using one buffer
• A single buffer in DRAM handles all operations and eviction policies
• High worst-case insert latency: a few seconds to flush a 1 GB buffer, during which new lookups stall

Partitioning buffers
• Partition buffers based on the first few bits of the key space
• Buffer size > page: avoid I/O smaller than a page
• Buffer size >= block: avoid random page writes
• Reduces worst-case latency
• Eviction policies apply per buffer

BufferHash: Putting it all together
• Multiple buffers in memory
• Multiple incarnations per buffer in flash
• One in-memory Bloom filter per incarnation
• Net hash table = all buffers + all incarnations
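The sketch below (Python, illustrative only; not the authors' implementation) ties these pieces together: buffers partitioned by a key prefix, flush-to-flash incarnations with one Bloom filter each, newest-to-oldest lookups, and FIFO eviction by discarding the oldest incarnation. Plain dicts stand in for the on-flash hash tables, the Bloom filter is a toy two-hash bit set, and keys are assumed to be bytes (e.g., 20-byte chunk hashes).

```python
import hashlib

class ToyBloom:
    """Toy two-hash Bloom filter standing in for the real per-incarnation filter."""
    def __init__(self, nbits=1 << 16):
        self.bits, self.mask = bytearray(nbits // 8), nbits - 1

    def _positions(self, key):
        digest = hashlib.sha1(key).digest()
        yield int.from_bytes(digest[:4], "big") & self.mask
        yield int.from_bytes(digest[4:8], "big") & self.mask

    def add(self, key):
        for i in self._positions(key):
            self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, key):
        return all(self.bits[i >> 3] & (1 << (i & 7)) for i in self._positions(key))


class BufferHashSketch:
    def __init__(self, num_partitions=16, buffer_capacity=1024, max_incarnations=16):
        # One partition = one in-memory buffer plus its incarnations and Bloom filters.
        self.parts = [{"buffer": {}, "incarnations": [], "filters": []}
                      for _ in range(num_partitions)]
        self.buffer_capacity = buffer_capacity      # stands in for the block-sized buffer
        self.max_incarnations = max_incarnations    # FIFO eviction depth

    def _partition(self, key):
        # "First few bits of the key space" pick the partition.
        return hashlib.sha1(key).digest()[0] % len(self.parts)

    def insert(self, key, value):
        part = self.parts[self._partition(key)]
        part["buffer"][key] = value
        if len(part["buffer"]) >= self.buffer_capacity:
            self._flush(part)

    def _flush(self, part):
        # Write the full buffer sequentially to flash as the latest incarnation.
        bloom = ToyBloom()
        for key in part["buffer"]:
            bloom.add(key)
        part["incarnations"].insert(0, dict(part["buffer"]))   # index 0 = latest
        part["filters"].insert(0, bloom)
        part["buffer"].clear()
        if len(part["incarnations"]) > self.max_incarnations:  # Full Discard of the oldest
            part["incarnations"].pop()
            part["filters"].pop()

    def lookup(self, key):
        part = self.parts[self._partition(key)]
        if key in part["buffer"]:                    # check the in-memory buffer first
            return part["buffer"][key]
        for bloom, incarnation in zip(part["filters"], part["incarnations"]):
            if key in bloom and key in incarnation:  # Bloom filter gates each flash read
                return incarnation[key]
        return None
```

For example, `bh = BufferHashSketch(); bh.insert(b"chunk-key", b"chunk-pointer"); bh.lookup(b"chunk-key")` returns the pointer from the buffer or, after a flush, from the matching incarnation.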

Latency analysis

• Insertion latency
  – Worst case ∝ size of buffer
  – Average case is constant for buffer > block size

• Lookup latency
  – Average case ∝ number of incarnations
  – Average case ∝ false positive rate of the Bloom filter

Parameter tuning: Total size of buffers
• Given fixed DRAM, how much should be allocated to the buffers B1 … BN?
• Total size of buffers = B1 + B2 + … + BN
• # Incarnations = (Flash size / Total buffer size)
• Lookup cost ∝ #Incarnations * False positive rate
• Total Bloom filter size = DRAM – total size of buffers, and the false positive rate increases as the Bloom filters shrink
• Too small is not optimal; too large is not optimal either. Optimal = 2 * SSD/entry
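The relationship on this slide can be written down directly. The sketch below is a back-of-the-envelope only (32-byte entries and an optimal-k Bloom filter formula are assumptions, not numbers from the slides); it captures just the lookup side of the trade-off, while the paper's full analysis also accounts for insertion cost and yields the stated optimum.

```python
import math

GB = 2 ** 30

def wasted_flash_reads_per_lookup(dram, flash, total_buffers, entry_bytes=32):
    """~ (#incarnations) x (Bloom false-positive rate); all sizes in bytes."""
    incarnations = flash / total_buffers                  # flash size / total buffer size
    bloom_bits = (dram - total_buffers) * 8               # remaining DRAM holds the Bloom filters
    bits_per_entry = bloom_bits / (flash / entry_bytes)
    fp_rate = math.exp(-bits_per_entry * math.log(2) ** 2)
    return incarnations * fp_rate

# Evaluation-style configuration: 4 GB DRAM, 32 GB flash, 2 GB given to buffers.
print(wasted_flash_reads_per_lookup(4 * GB, 32 * GB, 2 * GB))    # ~0.007
# Giving almost all DRAM to buffers starves the Bloom filters:
print(wasted_flash_reads_per_lookup(4 * GB, 32 * GB, 3.8 * GB))  # ~4 wasted flash reads per lookup
```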

Parameter tuning: Per-buffer size
• What should the size of a partitioned buffer (e.g., B1) be?
• Affects worst-case insertion latency
• Adjusted according to application requirements (128 KB – 1 block)

SILT: A Memory-Efficient, High-Performance Key-Value Store

Hyeontaek Lim, Bin Fan, David G. Andersen (Carnegie Mellon University)
Michael Kaminsky (Intel Labs)

2011-10-24

Key-Value Store
• Clients issue PUT(key, value), value = GET(key), and DELETE(key) against a key-value store cluster
• Uses: e-commerce (Amazon), web server acceleration (Memcached), data deduplication indexes, photo storage (Facebook)
• SILT goal: use much less memory than previous systems while retaining high performance

Three Metrics to Minimize
• Memory overhead = index size per entry
  – Ideally 0 (no memory overhead)
• Read amplification = flash reads per query
  – Limits query throughput; ideally 1 (no wasted flash reads)
• Write amplification = flash writes per entry
  – Limits insert throughput; also reduces flash life expectancy; must be small enough for flash to last a few years

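As a minimal illustration of how these three ratios are tallied from raw counters (the counter names here are hypothetical, not from the talk):

```python
def kv_metrics(index_bytes, entries, flash_reads, queries, flash_bytes_written, bytes_inserted):
    """Compute SILT's three metrics from accumulated counters."""
    return {
        "memory_overhead_bytes_per_entry": index_bytes / entries,     # ideally 0
        "read_amplification": flash_reads / queries,                  # ideally 1
        "write_amplification": flash_bytes_written / bytes_inserted,  # keep small for flash lifetime
    }
```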

Landscape before SILT

Figure: read amplification vs. memory overhead (bytes/entry, 0–12) for FAWN-DS, HashCache, BufferHash, FlashStore, and SkimpyStash. Each prior system trades memory overhead against read amplification. Where would SILT land?

Solution Preview: (1) Three Stores with (2) New Index Data Structures
• In memory: the SILT Sorted Index (memory efficient), the SILT Filter, and the SILT Log Index (write friendly); the data itself lives on flash
• Inserts only go to the Log
• Data are moved in the background
• Queries look up stores in sequence (from new to old)
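A condensed Python sketch of this flow (an illustration under simplifying assumptions, not the authors' code): PUTs go only to the front log store, a full log store is retired to the read-only list in the background, and GETs probe the stores from newest to oldest.

```python
class MultiStoreKV:
    """Toy model of SILT's store hierarchy: one write-friendly store in front,
    read-only stores behind it. Plain dicts stand in for the on-flash layouts
    and their compact in-memory indexes."""

    def __init__(self, log_capacity=4096):
        self.log = {}                 # LogStore: inserts only go here
        self.read_only = []           # HashStores and the SortedStore, newest first
        self.log_capacity = log_capacity

    def put(self, key, value):
        self.log[key] = value
        if len(self.log) >= self.log_capacity:
            # In the real system, this conversion/merge happens in the background.
            self.read_only.insert(0, self.log)
            self.log = {}

    def get(self, key):
        for store in [self.log, *self.read_only]:   # look up stores from new to old
            if key in store:
                return store[key]
        return None
```

Probing newest-to-oldest is what makes overwrites cheap: a newer store's value shadows older copies until a background merge reclaims them.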

LogStore: No Control over Data Layout
• Memory overhead: 6.5+ bytes/entry (SILT Log Index; a naive hashtable would need 48+ bytes/entry)
• Write amplification: 1 (inserted entries are simply appended to the on-flash log, older to newer)

SortedStore: Space-Optimized Layout
• Memory overhead: 0.4 bytes/entry (SILT Sorted Index)
• Write amplification: high; the on-flash sorted array must be rewritten, so bulk-inserts are needed to amortize the cost

Combining SortedStore and LogStore
• SortedStore: SILT Sorted Index over an on-flash sorted array
• LogStore: SILT Log Index over an on-flash log
• The log is periodically merged into the sorted array
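The merge itself is a standard sorted merge; this short sketch (not the paper's code) shows why it has to be a bulk operation: the entire sorted array is rewritten sequentially, so the cost is only worth paying once many updates have accumulated.

```python
def merge_into_sorted_store(sorted_entries, log_entries):
    """Merge (key, value) pairs from a log/hash store into an on-flash sorted array.
    sorted_entries: list of (key, value) already sorted by key.
    log_entries:    iterable of (key, value) carrying the newer values."""
    updates = dict(log_entries)
    # Newer values replace old ones while streaming through the sorted array.
    merged = [(key, updates.pop(key, value)) for key, value in sorted_entries]
    if updates:                          # keys that were not in the sorted store yet
        merged = sorted(merged + list(updates.items()))
    return merged                        # written back to flash in one sequential pass
```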

Achieving both Low Memory Overhead and Low Write Amplification
• SortedStore: low memory overhead, high write amplification
• LogStore: high memory overhead, low write amplification
• Combining them, we can achieve simultaneously: write amplification = 5.4 (about 3 years of flash life) and memory overhead = 1.3 B/entry
• With “HashStores”, memory overhead drops to 0.7 B/entry
• Overall: memory overhead 0.7 bytes/entry, read amplification 1.01, write amplification 5.4

SILT’s Design (Recap)
• SortedStore: SILT Sorted Index (memory) over an on-flash sorted array
• HashStore: SILT Filter (memory) over on-flash hashtables
• LogStore: SILT Log Index (memory) over an on-flash log
• LogStores are converted into HashStores, which are merged into the SortedStore

New Index Data Structures in SILT
• Partial-key cuckoo hashing (SILT Filter & Log Index)
  – For HashStore & LogStore
  – Compact (2.2 & 6.5 B/entry)
  – Very fast (> 1.8 M lookups/sec)
• Entropy-coded tries (SILT Sorted Index)
  – For SortedStore
  – Highly compressed (0.4 B/entry)
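Below is a simplified Python reading of partial-key cuckoo hashing (an illustration, not the paper's exact layout, which packs multi-entry buckets and short tags): DRAM keeps only a (tag, flash offset) pair per entry, where the tag identifies the entry's alternative bucket, so entries can be displaced during inserts without ever reading their full keys back from flash.

```python
import hashlib

def _bucket(key: bytes, seed: int, nbuckets: int) -> int:
    digest = hashlib.sha1(bytes([seed]) + key).digest()
    return int.from_bytes(digest[:4], "big") % nbuckets

class PartialKeyCuckooIndex:
    """Simplified partial-key cuckoo index: each slot holds (tag, offset), where the
    tag is the entry's alternative bucket and offset points at the full record on flash."""

    def __init__(self, nbuckets=1 << 12, max_kicks=128):
        self.n = nbuckets
        self.slots = [None] * nbuckets            # one slot per bucket, for brevity
        self.max_kicks = max_kicks

    def insert(self, key: bytes, offset: int) -> bool:
        bucket = _bucket(key, 1, self.n)
        tag = _bucket(key, 2, self.n)             # the other candidate bucket
        for _ in range(self.max_kicks):
            if self.slots[bucket] is None:
                self.slots[bucket] = (tag, offset)
                return True
            # Evict the resident entry; its stored tag says where it can go,
            # so no flash read of its full key is needed.
            evicted_tag, evicted_offset = self.slots[bucket]
            self.slots[bucket] = (tag, offset)
            bucket, tag, offset = evicted_tag, bucket, evicted_offset
        return False                              # too full: convert/flush in the real system

    def candidate_offsets(self, key: bytes):
        """Flash offsets worth reading for this key; the full key on flash must still
        be compared, since different keys can share a (bucket, tag) pair."""
        b1, b2 = _bucket(key, 1, self.n), _bucket(key, 2, self.n)
        for bucket, expected_tag in ((b1, b2), (b2, b1)):
            slot = self.slots[bucket]
            if slot is not None and slot[0] == expected_tag:
                yield slot[1]
```

The entropy-coded trie for the SortedStore is omitted here; its job is to map a key to its position in the on-flash sorted array using only about 0.4 bytes of DRAM per entry.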

Landscape

Figure: read amplification vs. memory overhead (bytes/entry) for FAWN-DS, HashCache, BufferHash, FlashStore, SkimpyStash, and SILT. SILT occupies the previously empty region, with both low memory overhead (~0.7 bytes/entry) and low read amplification (~1.01).

BufferHash: Backup

Outline

• Background and motivation

• Our CLAM design
  – Key operations (insert, lookup, update)
  – Eviction
  – Latency analysis and performance tuning

• Evaluation

Evaluation

• Configuration
  – 4 GB DRAM, 32 GB Intel SSD, Transcend SSD
  – 2 GB buffers, 2 GB Bloom filters, 0.01 false positive rate
  – FIFO eviction policy

BufferHash performance

• WAN optimizer workload
  – Random key lookups followed by inserts
  – Hit rate: 40%
  – Also used a workload from real packet traces
• Comparison with BerkeleyDB (traditional hash table) on the Intel SSD

Average latency   BufferHash   BerkeleyDB
Lookup (ms)       0.06         4.6
Insert (ms)       0.006        4.8

Better lookups and better inserts!

Insert performance

Figure: CDF of insert latency (ms) on the Intel SSD. 99% of BufferHash inserts take < 0.1 ms (the buffering effect), while 40% of BerkeleyDB inserts take > 5 ms because random writes are slow.

Lookup performance

Figure: CDF of lookup latency (ms) for the 40%-hit workload. 99% of BufferHash lookups take < 0.2 ms (60% of lookups don’t go to flash; the Intel SSD read latency is ~0.15 ms), while 40% of BerkeleyDB lookups take > 5 ms due to garbage collection overhead from the writes.

Performance in Ops/sec/$

• 16K lookups/sec and 160K inserts/sec

• Overall cost of $400

• 42 lookups/sec/$ and 420 inserts/sec/$
  – Orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables

Other workloads

• Varying fractions of lookups
• Results on the Transcend SSD

Average latency per operation
Lookup fraction   BufferHash   BerkeleyDB
0                 0.007 ms     18.4 ms
0.5               0.09 ms      10.3 ms
1                 0.12 ms      0.3 ms

• BufferHash is ideally suited for write-intensive workloads

Evaluation summary
• BufferHash performs orders of magnitude better in ops/sec/$ compared to traditional hashtables on DRAM (and disks)
• BufferHash is best suited for the FIFO eviction policy
  – Other policies can be supported at additional cost; details in the paper
• A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than 10 Mbps with BerkeleyDB
  – Details in the paper

Related Work

• FAWN (Vasudevan et al., SOSP 2009)
  – Cluster of wimpy nodes with flash storage
  – Each wimpy node has its hash table in DRAM
  – We target hash tables much bigger than DRAM, and low latency as well as high throughput
• HashCache (Badam et al., NSDI 2009)
  – In-memory hash table for objects stored on disk

WAN optimizer using BufferHash

• With BerkeleyDB, throughput up to 10 Mbps

• With BufferHash, throughput up to 200 Mbps with the Transcend SSD
  – 500 Mbps with the Intel SSD

• At 10 Mbps, average throughput per object improves by 65% with BufferHash

SILT Backup Slides

Evaluation

1. Various combinations of indexing schemes
2. Background operations (merge/conversion)
3. Query latency

Experiment setup:
CPU: 2.80 GHz (4 cores)
Flash drive: SATA 256 GB (48 K random 1024-byte reads/sec)
Workload size: 20-byte key, 1000-byte value, ≥ 50 M keys
Query pattern: uniformly distributed (worst case for SILT)

LogStore Alone: Too Much Memory
LogStore+SortedStore: Still Much Memory
Full SILT: Very Memory Efficient

(Workload for all three: 90% GET (50-100 M keys) + 10% PUT (50 M keys))

Small Impact from Background Operations

Workload: 90% GET (100~ M keys) + 10% PUT. Throughput stays roughly between 33 K and 40 K ops/sec while merge/conversion run in the background; an occasional dip is caused by bursty TRIM from the ext4 filesystem.

Low Query Latency

Workload: 100% GET (100 M keys). Best throughput at 16 I/O threads; median latency = 330 μs, 99.9th percentile = 1510 μs.


Conclusion

• SILT provides a memory-efficient and high-performance key-value store
  – Multi-store approach
  – Entropy-coded tries
  – Partial-key cuckoo hashing

• Full source code is available
  – https://github.com/silt/silt