HBase Sizing Guide
Transcript of HBase Sizing Guide
Sizing Your HBase Cluster
Lars George | @larsgeorge
EMEA Chief Architect @ Cloudera
2
Agenda
• Introduction
• Technical Background/Primer
• Best Practices
• Summary
3
Who I am…
Lars George [EMEA Chief Architect]
• Clouderan since October 2010
• Hadooper since mid 2007
• HBase/Whirr Committer (of Hearts)
• github.com/larsgeorge
4
“As you think, so shall you become.”
— Bruce Lee
5
Introduction
6
HBase Sizing Is...
• Making the most out of the cluster you have by...
– Understanding how HBase uses low-level resources
– Helping HBase understand your use-case by configuring it appropriately
- and/or -
– Designing the use-case to help HBase along
• Being able to gauge how many servers are needed for a given use-case
7
Technical Background
“To understand your fear is the beginning of really seeing…”
— Bruce Lee
8
HBase Dilemma
Although HBase can host many applications, they may require completely opposite features
– Events ➜ Time Series
– Entities ➜ Message Store
9
Competing Resources
• Reads and Writes compete for the same low-level resources
– Disk (HDFS) and Network I/O
– RPC Handlers and Threads
– Memory (Java Heap)
• Otherwise they exercise completely separate code paths
10
Memory Sharing
• By default every region server divides its memory (i.e. the given maximum heap) into
– 40% for in-memory stores (write ops)
– 20% (40%) for block caching (read ops)
– Remaining space (here 40% or 20%) goes towards usual Java heap usage
• Objects etc.
• Region information (HFile metadata)
• The share of memory given to each area usually needs to be tweaked per use-case (a config sketch follows below)
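The split described above maps to two region-server properties. The following is a minimal sketch (not part of the original deck) of where they live, assuming a 0.94/0.96-era HBase where the names are hbase.regionserver.global.memstore.upperLimit and hfile.block.cache.size; in practice they belong in hbase-site.xml on every region server, and setting them programmatically here is for illustration only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MemorySplit {
    public static void main(String[] args) {
        // Server-side settings shown programmatically for illustration only;
        // normally these go into hbase-site.xml on every region server.
        Configuration conf = HBaseConfiguration.create();
        conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.40f); // memstores (writes)
        conf.setFloat("hfile.block.cache.size", 0.20f);                        // block cache (reads)
        // Whatever is left (here 40%) is plain Java heap for objects and HFile metadata.
        System.out.println("memstore share   = " +
            conf.getFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f));
        System.out.println("block cache share = " +
            conf.getFloat("hfile.block.cache.size", 0.25f));
    }
}
```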
11
Writes
• The cluster size is often determined by the write performance
– Simple schema design implies writing to all regions (entities) or only one region (events)
• Log-structured merge trees:
– Store mutations in the in-memory store and the write-ahead log
– Flush out aggregated, sorted maps at a specified threshold - or - when under pressure
– Discard logs with no pending edits
– Perform regular compactions of store files
12
Writes: Flushes and Compactions
[Chart: store file sizes (0-1000 MB) over time, older to newer, showing flushes and compactions]
13
Flushes
• Every mutation call (put, delete, etc.) causes a check for a flush
• If the threshold is met, the store is flushed to disk and a compaction is scheduled
– Try to compact newly flushed files quickly
• The compaction returns - if necessary - where a region should be split
14
Compaction Storms
• Premature flushing because of the number of logs or memory pressure
– Files will be smaller than the configured flush size
• The background compactions are hard at work merging small flush files into the existing, larger store files
– Rewrite hundreds of MB over and over
15
Dependencies
• Flushes happen across all stores/column families, even if just one triggers it
• The flush size is compared to the size of all stores combined
– Many column families dilute the size
– Example: 55MB + 5MB + 4MB = 64MB combined triggers a flush, writing out tiny files for the two small families
16
Write-Ahead Log
• Currently only one per region server
– Shared across all stores (i.e. column families)
– Synchronized on file append calls
• Work being done on mitigating this
– WAL Compression
– Multithreaded WAL with Ring Buffer
– Multiple WALs per region server
➜ Start more than one region server per node?
17
Write-Ahead Log (cont.)
• Size set to 95% of the default block size
– 64MB or 128MB, but check your config!
• Keep the number of logs low to reduce recovery time
– Limit is set to 32, but can be increased
• Increase the size of the logs - and/or - increase the number of logs allowed before blocking
• Compute the number based on fill distribution and flush frequencies (see the config sketch below)
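A minimal sketch of the WAL knobs mentioned above, again shown programmatically for illustration; the property names assume a 0.94/0.96-era HBase and normally belong in hbase-site.xml on the region servers:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalSettings {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Roll a log once it reaches 95% of the (HDFS) block size.
        conf.setFloat("hbase.regionserver.logroll.multiplier", 0.95f);
        // Maximum number of WAL files kept before flushes are forced.
        conf.setInt("hbase.regionserver.maxlogs", 32);
        // Optionally use a larger block size for the WAL files themselves.
        conf.setLong("hbase.regionserver.hlog.blocksize", 128L * 1024 * 1024);
        System.out.println("WAL capacity before forced flushes ~ " +
            conf.getInt("hbase.regionserver.maxlogs", 32) * 0.95 * 128 + " MB");
    }
}
```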
18
Write-Ahead Log (cont.)
• Writes are synchronized across all stores
– A large cell in one family can stall all writes of another
– In this case the RPC handlers go binary, i.e. they either all work or all block
• The WAL can be bypassed on writes (see the client-side sketch below), but that means no real durability and no replication
– Maybe use a coprocessor to restore dependent data sets (preWALRestore)
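A minimal client-side sketch of what bypassing the WAL looks like, assuming a 0.96/0.98-era Java client API (older clients use put.setWriteToWAL(false) instead); the table and column names are made up for illustration. Skipping the WAL trades durability and replication for write throughput:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipWalExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "events"); // hypothetical table
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        // Skip the write-ahead log for this mutation: faster, but the edit is lost
        // if the region server dies before the memstore flushes, and it is not replicated.
        put.setDurability(Durability.SKIP_WAL);
        table.put(put);
        table.close();
    }
}
```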
19
Some Numbers
• Typical write performance of HDFS is 35-50MB/s
Cell Size    OPS
0.5MB        70-100
100KB        350-500
10KB         3500-5000 ??
1KB          35000-50000 ????
This is way too high in practice - contention! (See the back-of-the-envelope sketch below.)
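A hypothetical back-of-the-envelope helper (not from the deck) that reproduces the table above by simply dividing the quoted 35-50MB/s HDFS write throughput by the cell size:

```java
public class WriteOpsEstimate {
    public static void main(String[] args) {
        double lowMBs = 35, highMBs = 50;            // typical HDFS write throughput
        double[] cellKB = {512, 100, 10, 1};         // 0.5MB, 100KB, 10KB, 1KB
        for (double kb : cellKB) {
            double low = lowMBs * 1024 / kb;         // ops/s at the low end
            double high = highMBs * 1024 / kb;       // ops/s at the high end
            System.out.printf("%6.0f KB cells -> %.0f to %.0f ops/s%n", kb, low, high);
        }
    }
}
```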
20
Some More Numbers
• Under real world conditions the rate is less, more like 15MB/s or less
– Thread contention and serialization overhead cause a massive slowdown
Cell Size    OPS
0.5MB        10
100KB        100
10KB         800
1KB          6000
21
Write Performance
• There are many factors to the overall write performance of a cluster
– Key Distribution ➜ Avoid region hotspots
– Handlers ➜ Do not let them pile up too early
– Write-ahead log ➜ Bottleneck #1
– Compactions ➜ Badly tuned, they can cause ever-increasing background noise
22
Cheat Sheet
• Ensure you have enough or large enough write-ahead logs
• Ensure you do not oversubscribe available memstore space
• Ensure the flush size is set large enough, but not too large
• Check write-ahead log usage carefully
• Enable compression to store more data per node
• Tweak the compaction algorithm to peg background I/O at a predictable level (see the sketch below)
• Consider putting uneven column families in separate tables
• Check metrics carefully for block cache, memstore, and all queues
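A minimal sketch of the compaction knobs alluded to above, shown programmatically for illustration; the property names assume a 0.94/0.96-era HBase, and the values here are only a starting point, not a recommendation:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompactionTuning {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Minimum number of store files before a minor compaction is considered.
        conf.setInt("hbase.hstore.compactionThreshold", 3);
        // Block further memstore flushes once a store has this many files.
        conf.setInt("hbase.hstore.blockingStoreFiles", 10);
        // Disable time-based major compactions and trigger them manually off-peak.
        conf.setLong("hbase.hregion.majorcompaction", 0L);
        System.out.println("compaction threshold = " +
            conf.getInt("hbase.hstore.compactionThreshold", 3));
    }
}
```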
23
Example: Write to All Regions
• Java Xmx heap at 10GB
• Memstore share at 40% (default)
– 10GB heap x 0.4 = 4GB
• Desired flush size at 128MB
– 4GB / 128MB = 32 regions max!
• For a WAL size of 128MB x 0.95
– 4GB / (128MB x 0.95) = ~33 partially uncommitted logs to keep around
• Region size at 20GB
– 20GB x 32 regions = 640GB raw storage used
(a small computation sketch follows below)
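A hypothetical helper (not from the deck) that replays the arithmetic above, so the same sizing can be recomputed for other heap sizes, flush sizes, or region sizes:

```java
public class RegionSizing {
    public static void main(String[] args) {
        double heapGB = 10, memstoreShare = 0.40;          // Java heap and memstore fraction
        double flushMB = 128, walMB = 128 * 0.95;          // flush size and effective WAL size
        double regionGB = 20;                              // configured maximum region size

        double memstoreGB = heapGB * memstoreShare;               // 4 GB of memstore space
        double maxActiveRegions = memstoreGB * 1024 / flushMB;    // 32 regions
        double walsToKeep = memstoreGB * 1024 / walMB;            // ~33 logs
        double rawStorageGB = maxActiveRegions * regionGB;        // 640 GB

        System.out.printf("memstore space: %.0f GB%n", memstoreGB);
        System.out.printf("max actively written regions: %.0f%n", maxActiveRegions);
        System.out.printf("WAL files to keep: %.1f%n", walsToKeep);
        System.out.printf("raw storage addressed: %.0f GB%n", rawStorageGB);
    }
}
```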
24
Notes
• Compute memstore sizes based on number of written-to regions x flush size
• Compute number of logs to keep based on fill and flush rate
• Ultimately the capacity is driven by
– Java Heap
– Region Count and Size
– Key Distribution
25
Reads
• Locate and route the request to the appropriate region server
– Client caches location information for faster lookups
• Eliminate store files if possible using time ranges or Bloom filters (example below)
• Try the block cache; if the block is missing, load it from disk
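A minimal client-side sketch of the two pruning techniques mentioned above: enabling a row-level Bloom filter when creating a table, and restricting a read to a time range so store files outside it can be skipped. The table and family names are made up, and the API assumes a 0.96-era HBase client:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class ReadPruning {
    public static void main(String[] args) throws Exception {
        // Row-level Bloom filter: lets reads skip store files that cannot contain the row.
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
        HColumnDescriptor cf = new HColumnDescriptor("cf");
        cf.setBloomFilterType(BloomType.ROW);
        desc.addFamily(cf);
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        admin.createTable(desc);
        admin.close();

        // Time-range scan: store files whose cells all fall outside the range are skipped.
        Scan scan = new Scan();
        scan.setTimeRange(System.currentTimeMillis() - 3600_000L, System.currentTimeMillis());
        HTable table = new HTable(HBaseConfiguration.create(), "events");
        table.getScanner(scan).close();
        table.close();
    }
}
```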
26
Seeking with Bloom Filters
27
Writes: Where’s the Data at?
[Chart: store file sizes (0-1000 MB) over time, older to newer, distinguishing existing row mutations from unique row inserts]
28
Block Cache
• Use exported metrics to see the effectiveness of the block cache
– Check fill and eviction rate, as well as hit ratios ➜ random reads are not ideal
• Tweak up or down as needed, but watch overall heap usage
• You absolutely need the block cache
– Set to 10% at least for short term benefits
29
Testing: Scans
HBase scan performance
• Use available tools to test
• Determine raw and KeyValue read performance
– Raw is just bytes, while KeyValue means block parsing
• Insert data using YCSB, then compact the table
– Single region enforced
• Two test cases
– Small data: 1 column with a 1 byte value
– Large(r) data: 1 column with a 1KB value
• About same size for both in total: 15GB
30
Testing: Scans
31
Scan Row Range
• Set start and end key to limit scan size
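A minimal sketch of such a bounded scan with the Java client; the table name and row keys are made up for illustration:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedScan {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "events"); // hypothetical table
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("user123|20140101"));  // inclusive start key
        scan.setStopRow(Bytes.toBytes("user123|20140201"));   // exclusive stop key
        scan.setCaching(100); // fetch rows in batches to cut down on RPCs
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getRow()));
        }
        scanner.close();
        table.close();
    }
}
```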
32
Best Practices
“If you spend too much time thinking about a thing, you'll never get it done.”
— Bruce Lee
33
How to Plan
Advice on
• Number of nodes
• Number of disks and total disk capacity
• RAM capacity
• Region sizes and count
• Compaction tuning
34
Advice on Nodes
• Use the previous example to compute effective storage based on heap size, region count and size
– 10GB heap x 0.4 / 128MB flush size = 32 regions, x 20GB per region = 640GB, if all regions are active
– Address more storage with read-from-only regions
• Typical advice is to use more nodes with fewer, smaller disks (6 x 1TB SATA or 600GB SAS, or SSDs)
• CPU is not an issue, I/O is (even with compression)
35
Advice on Nodes
• Memory is not an issue; heap sizes are kept small because of Java garbage collection limitations
– Up to 20GB has been used
– Newer versions of Java should help
– Use an off-heap cache (see the sketch below)
• Current servers typically have 48GB+ memory
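A minimal sketch of how an off-heap block cache can be enabled, assuming a 0.96+-era HBase with the BucketCache; the property names and size semantics vary between versions, so treat this as illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapCache {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Put the L2 block cache off-heap so it does not grow the GC-managed heap.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Size of the bucket cache; depending on the HBase version this is
        // interpreted as megabytes or as a fraction of direct memory.
        conf.set("hbase.bucketcache.size", "4096");
        System.out.println("bucket cache engine = " + conf.get("hbase.bucketcache.ioengine"));
    }
}
```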
36
Advice on Tuning
• Trade off throughput against the size of single data points
– This might cause a schema redesign
• Trade off read performance against write amplification
– Advise users to understand read/write performance and background write amplification
➜ This drives the number of nodes needed!
37
Advice on Cluster Sizing
• Compute the number of nodes needed based on
– Total storage needed
– Throughput required for reads and writes
• Assume ≈15MB/s minimum for each read and write (see the sizing sketch below)
– Increasing the KeyValue sizes improves this
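A hypothetical sizing helper (not from the deck) that turns these two constraints into a node count; the per-node and workload figures are illustrative assumptions, not recommendations:

```java
public class ClusterSizing {
    public static void main(String[] args) {
        // Use-case requirements (assumed values for illustration).
        double totalStorageTB = 50;          // raw data to keep on the cluster
        double writeMBs = 400;               // required aggregate write throughput
        double readMBs = 600;                // required aggregate read throughput

        // Per-node capabilities (assumed values for illustration).
        double storagePerNodeTB = 0.64;      // e.g. 640GB effectively addressable per region server
        double throughputPerNodeMBs = 15;    // conservative per-node read or write rate

        double nodesForStorage = Math.ceil(totalStorageTB / storagePerNodeTB);
        double nodesForThroughput = Math.ceil((writeMBs + readMBs) / throughputPerNodeMBs);
        double nodes = Math.max(nodesForStorage, nodesForThroughput);

        System.out.printf("nodes for storage:    %.0f%n", nodesForStorage);
        System.out.printf("nodes for throughput: %.0f%n", nodesForThroughput);
        System.out.printf("cluster size:         %.0f nodes%n", nodes);
    }
}
```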
38
Example: Twitter Firehose
39
Example: Consume Data
40
HBase Heap Usage
• Overall addressable amount of data is driven by heap size
– Only read-from regions need space for indexes and filters
– Written-to regions also need MemStore space
• Java heap space is still limited, as garbage collections will cause pauses
– Typically up to 20GB heap
– Or invest in pause-less GC
41
Summary
“All fixed set patterns are incapable of adaptability or pliability. The truth is outside of all fixed patterns.”
— Bruce Lee
42
WHAT, BRUCE? IT DEPENDS?
43
Checklist
To plan for the size of an HBase cluster you have to:
• Know the use-case
– Read/write mix
– Expected throughput
– Retention policy
• Optimize the schema and compaction strategy
– Devise a schema that allows for only some regions being written to
• Take “known” numbers to compute cluster size
©2014 Cloudera, Inc. All rights reserved.
Thank you @larsgeorge