HBase at Xiaomi


Page 1: HBase at Xiaomi

HBase at Xiaomi

{xieliang, fenghonghua}@xiaomi.com

Liang Xie / Honghua Feng

www.mi.com

Page 2: HBase at Xiaomi

About Us

Honghua Feng, Liang Xie

Page 3: HBase at Xiaomi


Outline

Introduction

Latency practice

Some patches we contributed

Some ongoing patches

Q&A

Page 4: HBase at Xiaomi

About Xiaomi

Mobile internet company founded in 2010

Sold 18.7 million phones in 2013

Over $5 billion revenue in 2013

Sold 11 million phones in Q1, 2014

Page 5: HBase at Xiaomi

Hardware

Page 6: HBase at Xiaomi

Software

Page 7: HBase at Xiaomi

Internet Services

Page 8: HBase at Xiaomi

About Our HBase Team

Founded in October 2012

5 members: Liang Xie, Shaohui Liu, Jianwei Cui, Liangliang He, Honghua Feng

Resolved 130+ JIRAs so far

Page 9: HBase at Xiaomi

Our Clusters and Scenarios

15 clusters: 9 online / 2 processing / 4 test

Scenarios: MiCloud, MiPush, MiTalk, Perf Counter

Page 10: HBase at Xiaomi

Our Latency Pain Points

Java GC

Stable page write in OS layer

Slow buffered IO (FS journal IO)

Read/Write IO contention

Page 11: HBase at Xiaomi

HBase GC Practice

Bucket cache with off-heap mode

Xmn/SurvivorRatio/MaxTenuringThreshold

PretenureSizeThreshold & replication source size

GC concurrent thread number

GC time per day: [2500, 3000] s -> [300, 600] s

Page 12: HBase at Xiaomi

Write Latency Spikes

HBase client put
-> HRegion.batchMutate
-> HLog.sync
-> SequenceFileLogWriter.sync
-> DFSOutputStream.flushOrSync
-> DFSOutputStream.waitForAckedSeqno   <Stuck here often!>

DataNode pipeline write, in BlockReceiver.receivePacket():
-> receiveNextPacket
-> mirrorPacketTo(mirrorOut)   // write packet to the mirror
-> out.write/flush             // write data to local disk <- buffered IO

Added instrumentation (HDFS-6110) showed the stalled write was the culprit; the strace result also confirmed it

Page 13: HBase at Xiaomi

Root Cause of Write Latency Spikes

write() is expected to be fast

But it is sometimes blocked by write-back!

Page 14: HBase at Xiaomi

Stable page write issue workaround

Workaround: move off kernel 2.6.32.279 (6.3), either

2.6.32.279 (6.3) -> 2.6.32.220 (6.2), or

2.6.32.279 (6.3) -> 2.6.32.358 (6.4)

Try to avoid deploying RHEL 6.3/CentOS 6.3 in an extremely latency-sensitive HBase cluster!

Page 15: HBase at Xiaomi

Root Cause of Write Latency Spikes

...
0xffffffffa00dc09d : do_get_write_access+0x29d/0x520 [jbd2]
0xffffffffa00dc471 : jbd2_journal_get_write_access+0x31/0x50 [jbd2]
0xffffffffa011eb78 : __ext4_journal_get_write_access+0x38/0x80 [ext4]
0xffffffffa00fa253 : ext4_reserve_inode_write+0x73/0xa0 [ext4]
0xffffffffa00fa2cc : ext4_mark_inode_dirty+0x4c/0x1d0 [ext4]
0xffffffffa00fa6c4 : ext4_generic_write_end+0xe4/0xf0 [ext4]
0xffffffffa00fdf74 : ext4_writeback_write_end+0x74/0x160 [ext4]
0xffffffff81111474 : generic_file_buffered_write+0x174/0x2a0 [kernel]
0xffffffff81112d60 : __generic_file_aio_write+0x250/0x480 [kernel]
0xffffffff81112fff : generic_file_aio_write+0x6f/0xe0 [kernel]
0xffffffffa00f3de1 : ext4_file_write+0x61/0x1e0 [ext4]
0xffffffff811762da : do_sync_write+0xfa/0x140 [kernel]
0xffffffff811765d8 : vfs_write+0xb8/0x1a0 [kernel]
0xffffffff81176fe1 : sys_write+0x51/0x90 [kernel]

XFS in the latest kernels can relieve the journal IO blocking issue and is friendlier to metadata-heavy scenarios like HBase + HDFS

Page 16: HBase at Xiaomi

Write Latency Spikes Testing

8 YCSB threads; write 20 million rows, each 3*200 bytes; 3 DN; kernel: 3.12.17

Count the stalled write() calls that take > 100 ms

The largest write() latency on ext4: ~600 ms!

Page 17: HBase at Xiaomi


Hedged Read (HDFS-5776)
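
The idea behind a hedged read can be sketched as follows: if the first replica has not answered within a threshold, the client sends the same read to another replica and takes whichever answer arrives first. This is a minimal Python sketch and not the HDFS-5776 implementation; `hedged_read`, `replicas`, `threshold_s` and the two simulated replicas are hypothetical names used only for illustration.

```python
import concurrent.futures as cf
import time


def hedged_read(replicas, threshold_s, pool):
    """Return the first result from `replicas` (hypothetical zero-argument
    callables that each read the same block from a different DataNode)."""
    pending = {pool.submit(replicas[0])}  # start with the preferred replica only
    next_replica = 1
    while True:
        done, pending = cf.wait(pending, timeout=threshold_s,
                                return_when=cf.FIRST_COMPLETED)
        if done:
            # Simplification: the first finished read wins, even if it raised.
            return next(iter(done)).result()
        if next_replica < len(replicas):
            # Threshold exceeded: hedge by asking one more replica.
            pending.add(pool.submit(replicas[next_replica]))
            next_replica += 1


def slow_replica():
    time.sleep(0.5)  # simulates a DataNode stalled on disk IO
    return "block from slow DataNode"


def fast_replica():
    return "block from fast DataNode"


with cf.ThreadPoolExecutor(max_workers=2) as pool:
    result = hedged_read([slow_replica, fast_replica], threshold_s=0.05, pool=pool)
print(result)
```

The trade-off is extra read load on the second replica in exchange for cutting off the latency tail of a stalled one.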

Page 18: HBase at Xiaomi

Other Meaningful Latency Work

Long first "put" issue (HBASE-10010)

Token invalid (HDFS-5637)

Retry/timeout setting in DFSClient

Reduce write traffic? (HLog compression)

HDFS IO Priority (HADOOP-10410)

Page 19: HBase at Xiaomi

Wish List

Real-time HDFS, esp. priority related

GC-friendly core data structures

More off-heap; Shenandoah GC

TCP/disk IO characteristic analysis

Need more eyes on the OS

Stay tuned…

Page 20: HBase at Xiaomi

Some Patches Xiaomi Contributed

New write thread model (HBASE-8755)

Reverse scan (HBASE-4811)

Per table/cf replication (HBASE-8751)

Block index key optimization (HBASE-7845)

Page 21: HBase at Xiaomi

1. New Write Thread Model

Old model:

[Figure: WriteHandler, WriteHandler, WriteHandler, …… -> Local Buffer; every WriteHandler also does "write to HDFS" and "sync to HDFS". Figure labels: 256, 256, 256]

Problem: WriteHandler does everything, severe lock race!

Page 22: HBase at Xiaomi

New Write Thread Model

New model:

[Figure: WriteHandler, WriteHandler, WriteHandler, …… -> Local Buffer; AsyncWriter: write to HDFS; AsyncSyncer: sync to HDFS; AsyncNotifier: notify writers. Figure labels: 256, 1, 1, 4]

Page 23: HBase at Xiaomi

New Write Thread Model

Low load: no improvement

Heavy load: huge improvement (3.5x)
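
The split of duties in the new model can be sketched as a small pipeline: handlers only append to a local buffer and wait, while dedicated threads write, sync and notify. The Python below is a toy illustration under stated assumptions, not the HBASE-8755 code: the class `AsyncWal`, its method names, the queues and the single syncer thread are all hypothetical, and "write to HDFS" / "sync to HDFS" are stand-ins with no real IO.

```python
import queue
import threading


class AsyncWal:
    """Toy pipeline: handlers -> local buffer -> writer -> syncer -> notifier."""

    def __init__(self):
        self.buffer = queue.Queue()     # local buffer filled by write handlers
        self.to_sync = queue.Queue()    # batches written but not yet synced
        self.to_notify = queue.Queue()  # batches synced, handlers not yet woken
        self.synced_edits = []
        self.sync_calls = 0
        for stage in (self._writer, self._syncer, self._notifier):
            threading.Thread(target=stage, daemon=True).start()

    def append_and_wait(self, edit):
        """Called by a write handler: enqueue the edit, block until it is synced."""
        done = threading.Event()
        self.buffer.put((edit, done))
        done.wait()

    def _writer(self):
        while True:
            batch = [self.buffer.get()]       # block for the first edit
            while True:                       # then drain whatever else queued up
                try:
                    batch.append(self.buffer.get_nowait())
                except queue.Empty:
                    break
            self.to_sync.put(batch)           # stands in for "write to HDFS"

    def _syncer(self):
        while True:
            batch = self.to_sync.get()
            self.sync_calls += 1              # one "sync to HDFS" covers the batch
            self.synced_edits.extend(edit for edit, _ in batch)
            self.to_notify.put(batch)

    def _notifier(self):
        while True:
            for _, done in self.to_notify.get():
                done.set()                    # wake the waiting write handler


wal = AsyncWal()
handlers = [threading.Thread(target=wal.append_and_wait, args=(f"edit-{i}",))
            for i in range(16)]
for h in handlers:
    h.start()
for h in handlers:
    h.join(timeout=5)
print(len(wal.synced_edits), "edits synced in", wal.sync_calls, "sync calls")
```

Because handlers never touch the write or sync path themselves, they contend only on the buffer, and under load several edits can share one sync.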

Page 24: HBase at Xiaomi

2. Reverse Scan

[Figure: KVs spread across three sorted groups, one per scanner]

Row2 kv2, Row3 kv1, Row3 kv3, Row4 kv2, Row4 kv5, Row5 kv2

Row1 kv2, Row3 kv2, Row3 kv4, Row4 kv4, Row4 kv6, Row5 kv3

Row1 kv1, Row2 kv1, Row2 kv3, Row4 kv1, Row4 kv3, Row6 kv1

1. All scanners seek to 'previous' rows (SeekBefore)

2. Figure out next row: max 'previous' row

3. All scanners seek to first KV of next row (SeekTo)

Performance: 70% of forward scan
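
The three steps of the reverse scan can be traced on the KVs from the figure. This is a minimal Python sketch, not the HBASE-4811 code: each scanner is modelled as a sorted list of (row, kv) tuples, the grouping of the KVs into three scanners is read off the figure, and `reverse_scan` is a hypothetical name.

```python
import bisect


def reverse_scan(scanners):
    """Yield (row, kvs) in descending row order.

    `scanners` is a list of sorted lists of (row, kv) tuples, a stand-in for
    the per-file scanners of a region."""
    current = None  # None means "positioned after the last row"
    while True:
        # Step 1: every scanner seeks to the last row strictly before `current`.
        previous_rows = []
        for kvs in scanners:
            end = len(kvs) if current is None else bisect.bisect_left(kvs, (current,))
            if end > 0:
                previous_rows.append(kvs[end - 1][0])
        if not previous_rows:
            return  # no scanner has an earlier row: the scan is finished
        # Step 2: the next row to return is the largest of those 'previous' rows.
        current = max(previous_rows)
        # Step 3: every scanner seeks to the first KV of that row and reads forward.
        row_kvs = []
        for kvs in scanners:
            i = bisect.bisect_left(kvs, (current,))
            while i < len(kvs) and kvs[i][0] == current:
                row_kvs.append(kvs[i][1])
                i += 1
        yield current, sorted(row_kvs)


scanners = [
    [("Row2", "kv2"), ("Row3", "kv1"), ("Row3", "kv3"),
     ("Row4", "kv2"), ("Row4", "kv5"), ("Row5", "kv2")],
    [("Row1", "kv2"), ("Row3", "kv2"), ("Row3", "kv4"),
     ("Row4", "kv4"), ("Row4", "kv6"), ("Row5", "kv3")],
    [("Row1", "kv1"), ("Row2", "kv1"), ("Row2", "kv3"),
     ("Row4", "kv1"), ("Row4", "kv3"), ("Row6", "kv1")],
]
rows = list(reverse_scan(scanners))
for row, kvs in rows:
    print(row, kvs)
```

Rows come out in descending order while the KVs inside each row are still read forward, which is why every row costs an extra backward seek compared with a forward scan.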

Page 25: HBase at Xiaomi

3. Per Table/CF Replication

Source: T1: cfA, cfB; T2: cfX, cfY

PeerA (backup): T1: cfA, cfB; T2: cfX, cfY

PeerB (T2:cfX): ?

PeerB creates T2 only: replication can't work!

PeerB creates T1 & T2: all data replicated!

Need a way to specify which data to replicate!

Page 26: HBase at Xiaomi

Per Table/CF Replication

Source: T1: cfA, cfB; T2: cfX, cfY

PeerA: T1: cfA, cfB; T2: cfX, cfY

PeerB (T2:cfX): T2: cfX

add_peer 'PeerA', 'PeerA_ZK'

add_peer 'PeerB', 'PeerB_ZK', 'T2:cfX'
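
How a per-peer table/CF spec such as 'T2:cfX' might decide what gets shipped can be sketched as a small filter. This Python is illustrative only and not the HBASE-8751 code: `parse_table_cfs` and `should_replicate` are hypothetical names, the ';' and ',' separators follow the notation used on the slide, and the rules for an empty spec or a table without families are assumptions.

```python
def parse_table_cfs(spec):
    """Parse 'T1:cfA,cfB; T2:cfX' into {'T1': {'cfA', 'cfB'}, 'T2': {'cfX'}}.

    Assumptions: an empty spec means "replicate everything"; a table listed
    without column families means "all families of that table"."""
    table_cfs = {}
    for part in spec.split(";"):
        part = part.strip()
        if not part:
            continue
        table, _, cfs = part.partition(":")
        table_cfs[table.strip()] = {cf.strip() for cf in cfs.split(",") if cf.strip()}
    return table_cfs


def should_replicate(table, cf, table_cfs):
    """Decide whether an edit for table/cf is shipped to a peer with this spec."""
    if not table_cfs:
        return True            # no spec: ship everything (PeerA on the slide)
    if table not in table_cfs:
        return False           # table not listed for this peer
    families = table_cfs[table]
    return not families or cf in families


peer_a = parse_table_cfs("")
peer_b = parse_table_cfs("T2:cfX")
print(should_replicate("T1", "cfA", peer_a), should_replicate("T2", "cfX", peer_b),
      should_replicate("T2", "cfY", peer_b), should_replicate("T1", "cfA", peer_b))
```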

Page 27: HBase at Xiaomi

4. Block Index Key Optimization

Block 1 ends with k1: "ab"; Block 2 starts with k2: "ah, hello world"

Before: 'Block 2' block index key = "ah, hello world/…"

Now: 'Block 2' block index key = "ac/…" (k1 < key <= k2)

Reduces block index size

Saves seeking the previous block if the search key is in ['ac', 'ah, hello world']
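
One way to pick a short index key that satisfies k1 < key <= k2 is to cut at the first differing byte. The Python below is a minimal sketch of that idea, not the HBASE-7845 implementation; `short_index_key` is a hypothetical name and the keys are treated as plain byte strings with no HBase key structure.

```python
def short_index_key(k1, k2):
    """Return a short key with k1 < key <= k2 (both bytes, assuming k1 < k2)."""
    for i in range(min(len(k1), len(k2))):
        if k1[i] != k2[i]:
            # First differing byte: bump k1's byte by one and cut the key there.
            # k1[i] < k2[i] because k1 < k2, so the result cannot exceed k2.
            return k1[:i] + bytes([k1[i] + 1])
    # k1 is a prefix of k2: this sketch simply falls back to k2 itself.
    return k2


key = short_index_key(b"ab", b"ah, hello world")
print(key)
```

With the keys from the slide the sketch yields the two-byte key "ac" instead of the fifteen-byte "ah, hello world".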

Page 28: HBase at Xiaomi

Some ongoing patches

Cross-table cross-row transaction (HBASE-10999)

HLog compactor (HBASE-9873)

Adjusted delete semantic (HBASE-8721)

Coordinated compaction (HBASE-9528)

Quorum master (HBASE-10296)

Page 29: HBase at Xiaomi

1. Cross-Row Transaction: Themis

http://github.com/xiaomi/themis

Google Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications

Two-phase commit: strong cross-table/row consistency

Global timestamp server: global, strictly increasing timestamps

No touch to HBase internals: based on HBase client and coprocessor

Read: 90%, Write: 23% (same degradation as Google Percolator)

More details: HBASE-10999

Page 30: HBase at Xiaomi

2. HLog Compactor

[Figure: HLog 1, 2, 3; Region 1, Region 2, Region x, each with Memstore and HFiles]

Region x: few writes, but scattered across many HLogs

PeriodicMemstoreFlusher: flushes old memstores forcefully

'flushCheckInterval'/'flushPerChanges': hard to configure

Results in 'tiny' HFiles

HBASE-10499: problematic region can't be flushed!

Page 31: HBase at Xiaomi

HLog Compactor

[Figure: HLog 1, 2, 3, 4; Region 1, Region 2, Region x, each with Memstore and HFiles]

Compact: HLog 1, 2, 3, 4 -> HLog x

Archive: HLog 1, 2, 3, 4
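
A rough sketch of the compaction step: the old logs are merged into one new log that keeps only the edits still needed, after which the old logs can be archived. This Python is an illustration under assumptions, not the HBASE-9873 code: `compact_hlogs`, the (region, seq_id, edit) tuple layout and the rule "keep an edit only if its sequence id is above the region's flushed sequence id" are all hypothetical.

```python
def compact_hlogs(hlogs, flushed_seq_id):
    """Merge several logs into one, dropping edits assumed to be in HFiles already.

    hlogs: list of logs, each a list of (region, seq_id, edit) tuples.
    flushed_seq_id: region -> highest sequence id already flushed."""
    kept = [entry for log in hlogs for entry in log
            if entry[1] > flushed_seq_id.get(entry[0], -1)]
    return sorted(kept, key=lambda entry: entry[1])


hlogs = [
    [("region1", 1, "a"), ("regionx", 2, "b"), ("region2", 3, "c")],  # HLog 1
    [("region1", 4, "d"), ("region2", 5, "e")],                       # HLog 2
    [("region2", 6, "f"), ("regionx", 7, "g")],                       # HLog 3
    [("region1", 8, "h")],                                            # HLog 4
]
flushed = {"region1": 8, "region2": 6}  # regionx has never been flushed
hlog_x = compact_hlogs(hlogs, flushed)
print(hlog_x)
```

Only the few edits of the rarely written region survive, so it no longer pins four logs and does not have to be flushed into a tiny HFile.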

Page 32: HBase at Xiaomi

3. Adjusted Delete Semantic

Scenario 1

1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Write kvA at t0 again
4. Read kvA

Result: kvA can't be read out

Scenario 2

1. Write kvA at t0
2. Delete kvA at t0, flush to hfile
3. Major compact
4. Write kvA at t0 again
5. Read kvA

Result: kvA can be read out

Fix: "delete can't mask kvs with larger mvcc (put later)"
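
The two scenarios and the proposed fix can be replayed in a toy model. This Python sketch is not HBase code: `Cell`, `read`, `major_compact` and the masking rule "a delete masks puts with timestamp <= its own" are simplifying assumptions, with `adjusted=True` adding the rule from the fix that a delete only masks puts with a smaller mvcc.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    kind: str        # "put" or "delete"
    ts: int          # user-visible timestamp
    mvcc: int        # write order: larger means written later
    value: str = ""


def masked(put, deletes, adjusted):
    """Old rule: any delete at ts >= put.ts masks the put.
    Adjusted rule: the delete must also have been written after the put."""
    return any(d.ts >= put.ts and (not adjusted or d.mvcc > put.mvcc)
               for d in deletes)


def read(cells, adjusted):
    """Return the visible value, or None if every put is masked."""
    deletes = [c for c in cells if c.kind == "delete"]
    visible = [c for c in cells
               if c.kind == "put" and not masked(c, deletes, adjusted)]
    if not visible:
        return None
    return max(visible, key=lambda c: (c.ts, c.mvcc)).value


def major_compact(cells, adjusted):
    """Drop delete markers together with the puts they mask."""
    deletes = [c for c in cells if c.kind == "delete"]
    return [c for c in cells
            if c.kind == "put" and not masked(c, deletes, adjusted)]


T0 = 100
history = [Cell("put", T0, 1, "v1"), Cell("delete", T0, 2)]
rewrite = Cell("put", T0, 3, "v2")  # "write kvA at t0 again"

scenario1_old = read(history + [rewrite], adjusted=False)
scenario2_old = read(major_compact(history, adjusted=False) + [rewrite], adjusted=False)
scenario1_new = read(history + [rewrite], adjusted=True)
scenario2_new = read(major_compact(history, adjusted=True) + [rewrite], adjusted=True)
print(scenario1_old, scenario2_old, scenario1_new, scenario2_new)
```

Under the old rule the answer depends on whether a major compaction happened in between; under the adjusted rule both scenarios return the later put.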

Page 33: HBase at Xiaomi

4. Coordinated Compaction

[Figure: RS, RS, RS all compacting against HDFS (global resource): compact storm!]

Compaction uses a global resource (HDFS), while whether to compact is decided locally!

Page 34: HBase at Xiaomi

Coordinated Compaction

[Figure: each RS asks the Master "Can I?"; the Master answers OK, OK, NO; RSs compact against HDFS (global resource)]

Compaction is scheduled by the master: no more compact storms
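
The "Can I?" exchange can be sketched as a master-side gatekeeper. This Python is a hedged illustration, not the HBASE-9528 design: the class `CompactionCoordinator`, its methods and the policy of a fixed cluster-wide cap on concurrent compactions are assumptions, since the slides do not say how the master decides.

```python
class CompactionCoordinator:
    """Master-side gatekeeper with a cluster-wide cap on running compactions."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self.running = set()

    def request(self, region_server):
        """A region server asks "Can I?"; returns True for OK, False for NO."""
        if region_server in self.running:
            return True
        if len(self.running) >= self.max_concurrent:
            return False
        self.running.add(region_server)
        return True

    def finished(self, region_server):
        """A region server reports that its compaction is done."""
        self.running.discard(region_server)


master = CompactionCoordinator(max_concurrent=2)
answers = [master.request(rs) for rs in ("rs1", "rs2", "rs3")]
master.finished("rs1")
retry = master.request("rs3")
print(answers, retry)
```

With a cap of two, the third region server is told NO until one of the others finishes, matching the OK / OK / NO answers in the figure.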

Page 35: HBase at Xiaomi

5. Quorum Master

[Figure: two Masters, a ZooKeeper ensemble (zk1, zk2, zk3) and RS, RS, RS; info/states are read from ZooKeeper. Figure labels: X, A, A]

When the active master serves, the standby master stays 'really' idle

When the standby master becomes active, it needs to rebuild in-memory status

Page 36: HBase at Xiaomi

Quorum Master

[Figure: Master 1, Master 2, Master 3 and RS, RS, RS. Figure labels: X, A, A]

Better master failover perf: no phase to rebuild in-memory status

No external (ZooKeeper) dependency

No potential consistency issue

Simpler deployment

Better restart perf for BIG clusters (10+K regions)

Page 37: HBase at Xiaomi

Acknowledgement

Hangjun Ye, Zesheng Wu, Peng Zhang, Xing Yong, Hao Huang, Hailei Li

Shaohui Liu, Jianwei Cui, Liangliang He, Dihao Chen