HBase at Xiaomi

Liang Xie / Honghua Feng

{xieliang, fenghonghua}@xiaomi.com

About Us

Honghua Feng / Liang Xie

Outline

Introduction

Latency practice

Some patches we contributed

Some ongoing patches

Q&A

About Xiaomi

Mobile internet company founded in 2010

Sold 18.7 million phones in 2013

Over $5 billion revenue in 2013

Sold 11 million phones in Q1, 2014

Hardware

Software

Internet Services

About Our HBase Team

Founded in October 2012

5 members : Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng

Resolved 130+ JIRAs so far

Our Clusters and Scenarios

15 Clusters : 9 online / 2 processing / 4 test

Scenarios : MiCloud, MiPush, MiTalk, Perf Counter

Our Latency Pain Points

Java GC

Stable page write in OS layer

Slow buffered IO (FS journal IO)

Read/Write IO contention

HBase GC Practice

Bucket cache in off-heap mode

Tuned Xmn / SurvivorRatio / MaxTenuringThreshold

Tuned PretenureSizeThreshold & replication source size

Tuned GC concurrent thread count

GC time per day : [2500, 3000]s -> [300, 600]s !!!
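
As a rough illustration of the kind of settings involved (the flag values and cache size below are placeholders, not Xiaomi's actual numbers; in practice these live in the RegionServer launch options and hbase-site.xml rather than code):

// Illustrative RegionServer JVM flags (placeholder values):
//   -Xmn1g -XX:SurvivorRatio=4 -XX:MaxTenuringThreshold=3
//   -XX:PretenureSizeThreshold=4m -XX:ConcGCThreads=8
//   -XX:MaxDirectMemorySize=16g    (head-room for the off-heap bucket cache)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffheapBucketCacheConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.bucketcache.ioengine", "offheap");  // keep the block cache off the Java heap
    conf.setInt("hbase.bucketcache.size", 8192);        // bucket cache capacity in MB (placeholder)
    System.out.println("bucket cache engine : " + conf.get("hbase.bucketcache.ioengine"));
  }
}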

Write Latency Spikes

Client-side call path :
  HBase client put -> HRegion.batchMutate -> HLog.sync -> SequenceFileLogWriter.sync
  -> DFSOutputStream.flushOrSync -> DFSOutputStream.waitForAckedSeqno   <- stuck here often!

DataNode pipeline write, in BlockReceiver.receivePacket() :
  -> receiveNextPacket
  -> mirrorPacketTo(mirrorOut)   // write the packet to the mirror
  -> out.write / flush           // write the data to local disk   <- buffered IO

Instrumentation added in HDFS-6110 showed the stalled local write was the culprit; strace results confirmed it.
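
To see where the time goes, a minimal probe (not from the slides) that times hflush() on an HDFS output stream, roughly the path HLog.sync() exercises; the file path and the 100 ms threshold are arbitrary choices:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushLatencyProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    byte[] payload = new byte[600];                // ~3 * 200 bytes, similar to the later YCSB test
    try (FSDataOutputStream out = fs.create(new Path("/tmp/hflush-probe"), true)) {
      for (int i = 0; i < 100000; i++) {
        out.write(payload);
        long start = System.nanoTime();
        out.hflush();                              // waits for the pipeline ack, like waitForAckedSeqno
        long ms = (System.nanoTime() - start) / 1000000;
        if (ms > 100) {
          System.out.println("slow hflush : " + ms + " ms at iteration " + i);
        }
      }
    }
  }
}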

Root Cause of Write Latency Spikes

write() is expected to be fast

But it is sometimes blocked by write-back (stable page write)!

Stable page write issue workaround

Workaround : change the kernel
  2.6.32-279 (RHEL 6.3) -> 2.6.32-220 (RHEL 6.2), or
  2.6.32-279 (RHEL 6.3) -> 2.6.32-358 (RHEL 6.4)

Try to avoid deploying RHEL 6.3 / CentOS 6.3 for an extremely latency-sensitive HBase cluster!

Root Cause of Write Latency Spikes

Kernel stack of the blocked write (ext4 journal, jbd2) :

  ...
  0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
  0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
  0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
  0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
  0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
  0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
  0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
  0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
  0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
  0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
  0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
  0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
  0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
  0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]

XFS on recent kernels relieves the journal IO blocking issue and is friendlier to metadata-heavy workloads such as HBase on HDFS.

Write Latency Spikes Testing

Setup : 8 YCSB threads; write 20 million rows of 3 * 200 bytes each; 3 DataNodes; kernel 3.12.17

Counted stalled write() calls that took > 100 ms

The largest write() latency on ext4 : ~600 ms !
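
A minimal sketch of that kind of measurement at the local-filesystem layer (plain java.io; the target path and sizes are placeholders):

import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class LocalWriteStallProbe {
  public static void main(String[] args) throws Exception {
    ByteBuffer buf = ByteBuffer.allocate(600);           // ~3 * 200 bytes per row, as in the test
    long maxMs = 0;
    try (FileChannel ch = new FileOutputStream("/data/ext4/stall-probe").getChannel()) {
      for (long i = 0; i < 20000000L; i++) {
        buf.clear();
        long start = System.nanoTime();
        ch.write(buf);                                   // buffered write(), normally returns quickly
        long ms = (System.nanoTime() - start) / 1000000;
        if (ms > 100) {
          System.out.println("stalled write : " + ms + " ms");
        }
        maxMs = Math.max(maxMs, ms);
      }
    }
    System.out.println("largest write() latency : " + maxMs + " ms");
  }
}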

Hedged Read (HDFS-5776)

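A minimal sketch of how hedged reads are switched on in the DFS client used by the RegionServer (configuration keys introduced by HDFS-5776; the values are placeholders and normally go into hbase-site.xml):

import org.apache.hadoop.conf.Configuration;

public class HedgedReadConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Dedicated thread pool for hedged reads; 0 (the default) leaves the feature off.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // If the first replica has not answered within this many ms, start a hedged read on another replica.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 50);
    System.out.println("hedged read pool size : " + conf.get("dfs.client.hedged.read.threadpool.size"));
  }
}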

Other Meaningful Latency Work

Long first “put” issue (HBASE-10010)

Token invalid (HDFS-5637)

Retry/timeout settings in DFSClient

Reduce write traffic? (HLog compression)

HDFS IO Priority (HADOOP-10410)
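
For the "reduce write traffic" item, WAL (HLog) compression is a single switch; a minimal sketch (the key is normally set in hbase-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalCompressionConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Dictionary-compress WAL entries to cut HLog write volume.
    conf.setBoolean("hbase.regionserver.wal.enablecompression", true);
    System.out.println("WAL compression : " + conf.getBoolean("hbase.regionserver.wal.enablecompression", false));
  }
}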

Wish List

Real-time HDFS, especially priority-related work

GC-friendly core data structures

More off-heap; Shenandoah GC

TCP / disk IO characteristics analysis

More eyes on the OS layer

Stay tuned…

Some Patches Xiaomi Contributed

New write thread model (HBASE-8755)

Reverse scan (HBASE-4811)

Per table/CF replication (HBASE-8751)

Block index key optimization (HBASE-7845)

1. New Write Thread Model

Old model (diagram) : the WriteHandler threads (256 in the diagram) each append to the local buffer, write to HDFS, and sync to HDFS themselves.

Problem : WriteHandler does everything, severe lock race!

New Write Thread Model

New model (diagram) : the WriteHandler threads only append to the local buffer; a dedicated AsyncWriter thread writes to HDFS, AsyncSyncer threads sync to HDFS, and an AsyncNotifier thread notifies the waiting writers.
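
A heavily simplified sketch of the idea (illustrative only, not the HBASE-8755 code); for brevity a single background thread plays the AsyncWriter, AsyncSyncer and AsyncNotifier roles, and the class and method names are made up:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class AsyncWalSketch {
  // Each queued future represents one appended edit waiting to become durable.
  private final BlockingQueue<CompletableFuture<Void>> pending = new LinkedBlockingQueue<>();

  public AsyncWalSketch() {
    Thread background = new Thread(() -> {
      List<CompletableFuture<Void>> batch = new ArrayList<>();
      try {
        while (true) {
          batch.add(pending.take());              // wait for at least one edit
          pending.drainTo(batch);                 // pick up everything else that piled up
          // ... AsyncWriter + AsyncSyncer role : write the batch and sync once for N edits ...
          batch.forEach(f -> f.complete(null));   // AsyncNotifier role : wake all waiting handlers
          batch.clear();
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "async-wal-sketch");
    background.setDaemon(true);
    background.start();
  }

  // Called by a WriteHandler : append an edit and block until it is "durable".
  public void append(byte[] edit) throws Exception {
    CompletableFuture<Void> done = new CompletableFuture<>();
    pending.put(done);
    done.get();                                   // the handler no longer writes or syncs itself
  }

  public static void main(String[] args) throws Exception {
    AsyncWalSketch wal = new AsyncWalSketch();
    ExecutorService handlers = Executors.newFixedThreadPool(8);
    for (int i = 0; i < 1000; i++) {
      handlers.submit(() -> { wal.append(new byte[200]); return null; });
    }
    handlers.shutdown();
    handlers.awaitTermination(1, TimeUnit.MINUTES);
    System.out.println("all edits acked");
  }
}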

New Write Thread Model

Low load : no improvement
Heavy load : huge improvement (3.5x)

2. Reverse Scan

Example : three sorted KV lists, one per scanner :

  Scanner 1 : Row2 kv2, Row3 kv1, Row3 kv3, Row4 kv2, Row4 kv5, Row5 kv2
  Scanner 2 : Row1 kv2, Row3 kv2, Row3 kv4, Row4 kv4, Row4 kv6, Row5 kv3
  Scanner 3 : Row1 kv1, Row2 kv1, Row2 kv3, Row4 kv1, Row4 kv3, Row6 kv1

Algorithm :
1. All scanners seek to their 'previous' rows (SeekBefore)
2. Figure out the next row : the max of those 'previous' rows
3. All scanners seek to the first KV of that row (SeekTo)

Performance : ~70% of forward scan
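
Client-side usage of reverse scan (standard 1.x-era client API; the table name and start row are made up for illustration):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("t1"))) {
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("Row5"));   // reverse scan walks from here toward smaller rows
      scan.setReversed(true);
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}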

3. Per Table/CF Replication

Diagram : a source cluster holding T1 (cfA, cfB) and T2 (cfX, cfY) replicates to PeerA (a full backup) and to PeerB, which only wants T2:cfX.

If PeerB creates T2 only : replication can't work!
If PeerB creates both T1 & T2 : all data gets replicated!

Need a way to specify which data to replicate!

Per Table/CF Replication

Diagram : the source still holds T1 (cfA, cfB) and T2 (cfX, cfY); PeerA receives everything, while PeerB now receives only T2:cfX.

add_peer 'PeerA', 'PeerA_ZK'
add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'

4. Block Index Key Optimization

Diagram : Block 1 ends with k1 : "ab"; Block 2 starts with k2 : "ah, hello world".

Before : 'Block 2' block index key = "ah, hello world/…"
Now : 'Block 2' block index key = "ac/…"  (a shorter key with k1 < key <= k2)

Reduces block index size

Saves seeking the previous block when the search key falls in ["ac", "ah, hello world"]

Some Ongoing Patches

Cross-table, cross-row transaction (HBASE-10999)

HLog compactor (HBASE-9873)

Adjusted delete semantic (HBASE-8721)

Coordinated compaction (HBASE-9528)

Quorum master (HBASE-10296)

1. Cross-Row Transaction : Themis

http://github.com/xiaomi/themis

Based on Google Percolator : "Large-scale Incremental Processing Using Distributed Transactions and Notifications"

Two-phase commit : strong cross-table/cross-row consistency

Global timestamp server : globally, strictly incremental timestamps

No touch to HBase internals : built on the HBase client and coprocessors

Read : 90%, Write : 23% (the same downgrade as Google Percolator)

More details : HBASE-10999

2. HLog Compactor

Diagram : HLogs 1, 2, 3 hold edits for Region 1 … Region x, each region with its own memstore and HFiles.

Region x : few writes, but they scatter across many HLogs

PeriodicMemstoreFlusher : flushes old memstores forcefully
  'flushCheckInterval' / 'flushPerChanges' : hard to configure
  results in 'tiny' HFiles
  HBASE-10499 : a problematic region can't be flushed!
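
The two knobs named above map, to my understanding, to the configuration keys below (treat the mapping and the values as assumptions); a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PeriodicFlushTuning {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // 'flushCheckInterval' : flush a memstore whose edits are older than this many ms.
    conf.setInt("hbase.regionserver.optionalcacheflushinterval", 3600000);
    // 'flushPerChanges' : flush a memstore once it has accumulated this many changes.
    conf.setLong("hbase.regionserver.flush.per.changes", 30000000L);
    System.out.println("flush per changes : " + conf.get("hbase.regionserver.flush.per.changes"));
  }
}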

HLog Compactor

Diagram : compact HLogs 1, 2, 3, 4 into a new HLog x that carries the still-needed edits (such as region x's scattered writes), then archive HLogs 1, 2, 3, 4.

Compact : HLog 1, 2, 3, 4 -> HLog x
Archive : HLog 1, 2, 3, 4

3. Adjusted Delete Semantic

Scenario 1
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA
Result : kvA can't be read out

Scenario 2
1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA
Result : kvA can be read out

Fix : "a delete can't mask KVs with a larger mvcc (i.e. put later)"
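
Scenario 1 can be reproduced with the plain client API; the sketch below assumes a 1.x-era client, an existing table 't1' with family 'f', and permission to flush:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSemanticScenario1 {
  public static void main(String[] args) throws Exception {
    byte[] row = Bytes.toBytes("rowA"), f = Bytes.toBytes("f"), q = Bytes.toBytes("q");
    long t0 = 1000L;
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("t1"));
         Admin admin = conn.getAdmin()) {
      table.put(new Put(row).addColumn(f, q, t0, Bytes.toBytes("v1")));  // 1. write kvA at t0
      Delete d = new Delete(row);
      d.addColumn(f, q, t0);                                             // 2. delete kvA at t0
      table.delete(d);
      admin.flush(TableName.valueOf("t1"));                              //    flush to hfile
      table.put(new Put(row).addColumn(f, q, t0, Bytes.toBytes("v2")));  // 3. write kvA at t0 again
      Result r = table.get(new Get(row));                                // 4. read kvA
      System.out.println("kvA found : " + !r.isEmpty());                 // Scenario 1 : not found
    }
  }
}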

4. Coordinated Compaction

Diagram : several RegionServers all compacting against the shared HDFS at once, a compaction storm!

Compaction uses a global resource (HDFS), while whether to compact is decided locally by each RegionServer!

Coordinated Compaction

Diagram : each RegionServer asks the master "Can I?" before compacting; the master answers OK or NO.

Compactions are scheduled by the master, so no compaction storm any longer.
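
The admission-control idea can be sketched with a counting semaphore standing in for the master (an illustration of the concept only, not the HBASE-9528 design; names are made up):

import java.util.concurrent.Semaphore;

public class CompactionAdmission {
  // Stand-in for the master : at most N compactions may hit the shared HDFS at once.
  private final Semaphore permits;

  public CompactionAdmission(int maxConcurrentCompactions) {
    this.permits = new Semaphore(maxConcurrentCompactions, true);
  }

  // A RegionServer asks "Can I?" and waits until the answer is OK.
  public void runCompaction(Runnable compaction) throws InterruptedException {
    permits.acquire();           // "Can I?" ... blocks while the answer is "NO"
    try {
      compaction.run();          // the actual compaction IO against the shared HDFS
    } finally {
      permits.release();         // hand the slot back
    }
  }

  public static void main(String[] args) {
    CompactionAdmission master = new CompactionAdmission(2);
    for (int i = 0; i < 5; i++) {
      final int id = i;
      new Thread(() -> {
        try {
          master.runCompaction(() -> System.out.println("RS-" + id + " compacting"));
        } catch (InterruptedException ignored) {
          Thread.currentThread().interrupt();
        }
      }).start();
    }
  }
}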

5. Quorum Master

Diagram : today, an active master (A) and a standby master sit beside a ZooKeeper ensemble (zk1, zk2, zk3); the masters and RegionServers read info/states from ZooKeeper.

When the active master serves, the standby master stays 'really' idle.
When the standby master becomes active, it needs to rebuild the in-memory state.

Quorum Master

Diagram : a quorum of masters (Master 1, 2, 3) serves the RegionServers directly, with one active (A) at a time.

Better master failover performance : no phase to rebuild in-memory state

No external (ZooKeeper) dependency

No potential consistency issue

Simpler deployment

Better restart performance for a BIG cluster (10K+ regions)

Acknowledgement

Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li
Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen

Thank You!

xieliang@xiaomi.com

fenghonghua@xiaomi.com
