Putting Wings on the Elephant
Pritam Damania, Software Engineer, Facebook, Inc.
April 2, 2014
Agenda
1 Background
2 Major Issues in I/O path
3 Read Improvements
4 Write Improvements
5 Lessons learnt
High level Messages Architecture
[Diagram: an Application Server writes Messages to the HBase cluster and receives an Ack for each write.]
Hbase Cluster Physical Layout
▪ Multiple clusters/cells for messaging
▪ 20 servers/rack; 5 or more racks per cluster
[Diagram: five racks; each rack holds one control node, and the remaining servers each run a Region Server, Data Node, and Task Tracker. Control nodes: Rack #1: ZooKeeper Peer + HDFS Namenode; Rack #2: ZooKeeper Peer + Standby Namenode; Rack #3: ZooKeeper Peer + Job Tracker; Rack #4: ZooKeeper Peer + HBase Master; Rack #5: ZooKeeper Peer + Backup HBase Master.]
Write Path Overview
[Diagram: writes arrive at the RegionServer, are appended to the Write Ahead Log in HDFS, buffered in the Memstore, and eventually flushed to HFiles in HDFS.]
HDFS Write Pipeline
[Diagram: the Regionserver sends 64k packets down a pipeline of three Datanodes; each Datanode writes into its OS page cache (backed by disk) and forwards the packet; an Ack travels back up the pipeline.]
Read Path Overview
[Diagram: a Get arrives at the RegionServer and is served from the Memstore or from HFiles in HDFS.]
Problems in R/W Path
• Skewed disk usage
• High disk IOPS
• High p99 latency for reads and writes
Improvements in Read Path
Disk Skew
[Diagram: three Datanodes, each writing through the OS page cache to a single disk.]
• HDFS block size: 256MB
• An HDFS block resides on a single disk
• An fsync of 256MB hits a single disk
Disk Skew - Sync File Range
[Diagram: a block file written on the Linux filesystem as a stream of 64k packets; sync_file_range is issued every 1MB, with a final fsync at the end.]
▪ sync_file_range(SYNC_FILE_RANGE_WRITE)
▪ Initiates async writeback
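The pattern is easy to show in plain C (a minimal sketch of the technique, not the actual datanode code; the file name and constants are illustrative):

#define _GNU_SOURCE            /* for sync_file_range */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PACKET    (64 * 1024)          /* HDFS packet size */
#define SYNC_GAP  (1024 * 1024)        /* kick writeback every 1MB */
#define BLOCK     (256L * 1024 * 1024) /* HDFS block size */

int main(void) {
    int fd = open("blockfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char packet[PACKET];
    memset(packet, 'x', sizeof(packet));

    off_t written = 0, last_sync = 0;
    while (written < BLOCK) {
        if (write(fd, packet, PACKET) != PACKET) { perror("write"); return 1; }
        written += PACKET;
        if (written - last_sync >= SYNC_GAP) {
            /* Start async writeback of the last 1MB; do not wait for it. */
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }
    /* By now most of the block is already on its way to disk, so the
     * final fsync no longer dumps 256MB onto a single disk at once. */
    fsync(fd);
    close(fd);
    return 0;
}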
High IOPS
• Messages workload is random read
• Small preads (~4KB) on datanodes
• Two IOPS for each pread
[Diagram: a pread against a Datanode touches two files: it reads the checksum from the checksum file and the data from the block file.]
High IOPS - Inline Checksums
[Diagram: HDFS block laid out as 4096-byte data chunks, each followed immediately by its 4-byte checksum.]
• Checksums inline with data
• Single IOP for each pread
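The layout and the single-read path can be sketched in C (an illustrative file format, not the actual HDFS code; the byte order and CRC choice are assumptions):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <zlib.h>            /* crc32(); link with -lz */

#define CHUNK 4096
#define SLOT  (CHUNK + 4)    /* 4096-byte data chunk + 4-byte checksum */

/* Read chunk n of an inline-checksummed block file with ONE pread. */
int read_chunk(int fd, long n, unsigned char *out) {
    unsigned char buf[SLOT];
    if (pread(fd, buf, SLOT, (off_t)n * SLOT) != SLOT)
        return -1;                      /* short read / I/O error */

    uint32_t stored;
    memcpy(&stored, buf + CHUNK, 4);    /* checksum sits right after the data */
    if ((uint32_t)crc32(0L, buf, CHUNK) != stored)
        return -1;                      /* corruption detected */

    memcpy(out, buf, CHUNK);
    return 0;
}

With the separate checksum file, the same read needed two preads against two files; here a single disk read fetches both data and checksum.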
High IOPS - Results
[Charts: number of puts and gets above one second; average put time; average get time.]
Hbase Locality - HDFS Favored Nodes
▪ Each region’s data is kept on 3 specific datanodes
▪ On failure, locality is preserved
▪ Favored nodes are persisted at the HBase layer
[Diagram: RegionServer reading from its local Datanode.]
Hbase Locality - Solution
• Persisting the info in the NameNode is complicated
• Region directory:
▪ /*HBASE/<tablename>/<regionname>/cf1/…
▪ /*HBASE/<tablename>/<regionname>/cf2/…
• Build a histogram of block locations in the directory
• Pick the lowest-frequency location to delete, as in the sketch below
[Chart: histogram of block-replica counts per datanode D1–D6, y-axis 0 to 8000.]
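A toy sketch of that selection policy in C (hypothetical helper; it assumes the block locations under the region directory have already been listed):

#define NUM_NODES 6   /* D1..D6 */

/* block_locations: datanode index of every replica under the region dir.
 * excess_block:    datanode indices holding the over-replicated block.
 * Returns the datanode whose replica should be deleted. */
int pick_replica_to_delete(const int *block_locations, int n_total,
                           const int *excess_block, int n_excess) {
    int freq[NUM_NODES] = {0};

    /* Histogram: how often each datanode appears in this region. */
    for (int i = 0; i < n_total; i++)
        freq[block_locations[i]]++;

    /* Delete the replica on the least-frequent node: frequent nodes are
     * the region's favored nodes, rare ones are strays. */
    int victim = excess_block[0];
    for (int i = 1; i < n_excess; i++)
        if (freq[excess_block[i]] < freq[victim])
            victim = excess_block[i];
    return victim;
}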
More Improvements
• Keep fds open
• Throttle re-replication
Improvements in Write Path
Hbase WAL
[Diagram: the Regionserver sends WAL packets down the pipeline of three Datanodes; each write lands in the OS page cache only.]
• Packets never hit disk
• Yet > 1s outliers!
Instrumentation
1. Write to OS cache
2. Write to TCP buffers
3. sync_file_range(SYNC_FILE_RANGE_WRITE)
Steps 1 and 3 show outliers > 1s!
Use of strace
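A representative invocation (illustrative; attach it to the real datanode pid):
strace -f -T -e trace=write,sync_file_range -p <datanode-pid>
The -T flag prints the time spent inside each syscall, which makes the > 1s calls stand out.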
Interesting Observations
• write(2) outliers correlated with busy disk
• Reproducible by artificially stressing disk
dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000
Test Program
[Diagram: two write patterns against a file on the Linux filesystem, both calling sync_file_range every 1MB. Pattern A: page-aligned 64k writes. Pattern B: alternating 63k and 1k writes, so consecutive writes touch the same 4KB page.]
• Pattern A: no outliers!
• Pattern B: outliers reproduced!
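A sketch of such a test program in C (illustrative; run the dd command above in parallel to stress the disk):

#define _GNU_SOURCE            /* for sync_file_range */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[63 * 1024];
    memset(buf, 'x', sizeof(buf));

    off_t off = 0, last_sync = 0;
    for (int i = 0; i < 100000; i++) {
        /* Pattern B: alternate 63k and 1k writes so consecutive writes
         * share a 4KB page; the second write can block on a stable page. */
        size_t len = (i % 2 == 0) ? 63 * 1024 : 1024;
        double t0 = now();
        if (write(fd, buf, len) < 0) { perror("write"); break; }
        double dt = now() - t0;
        if (dt > 1.0)
            printf("outlier: write(2) took %.2fs at offset %lld\n",
                   dt, (long long)off);
        off += len;
        if (off - last_sync >= 1024 * 1024) {   /* sync every 1MB */
            sync_file_range(fd, last_sync, off - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = off;
        }
    }
    close(fd);
    return 0;
}

Switching len to a constant 64 * 1024 gives pattern A, and the outliers disappear.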
Some suspects
• Too many dirty pages
• Linux stable pages
• Kernel trace points revealed stable pages as the culprit
Stable Pages
[Diagram: writeback of an OS page to a persistent store, a device with integrity checking; the kernel computes a checksum and the device verifies its own.]
• If a page is modified while it is being written back, the kernel and device checksums mismatch: checksum error
• The kernel's solution – lock pages under writeback
Explanation of Write Outliers
[Diagram: a 4KB OS page is under writeback to the persistent store, triggered by sync_file_range; a concurrent WAL write to the same page is blocked until the writeback finishes.]
Solution?
Patch: http://thread.gmane.org/gmane.comp.file-systems.ext4/35561
sync_file_range?
• sync_file_range is not async once there are > 128 write requests in flight
• Solution – use a threadpool, as sketched below
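A minimal sketch of that workaround in C (illustrative; one worker thread instead of a full pool, and no queue-overflow handling):

#define _GNU_SOURCE            /* for sync_file_range */
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

#define QSIZE 1024

struct req { int fd; off_t off; off_t len; };

static struct req queue[QSIZE];
static int head, tail;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

/* Called from the write path: never touches the disk, never blocks. */
void enqueue_sync(int fd, off_t off, off_t len) {
    pthread_mutex_lock(&mu);
    queue[tail % QSIZE] = (struct req){fd, off, len};
    tail++;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
}

/* Background worker: absorbs any blocking inside sync_file_range. */
void *sync_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (head == tail)
            pthread_cond_wait(&cv, &mu);
        struct req r = queue[head % QSIZE];
        head++;
        pthread_mutex_unlock(&mu);
        sync_file_range(r.fd, r.off, r.len, SYNC_FILE_RANGE_WRITE);
    }
    return NULL;
}

The worker is started once with pthread_create, and the write path calls enqueue_sync instead of calling sync_file_range directly.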
Results
[Chart: p99 write latency to the OS cache (in ms).]
Per request profiling
• Entire profile of client requests
• Full profile of pipeline writes
• Full profile of preads
• Lots of visibility!
Interesting Profiles
• In-memory operations > 1s
• No Java GC
• Correlated with a busy root disk
• Reproducible by stressing the root disk
Investigation
• Use lsof
• /tmp/hsperfdata_hadoop/<pid> suspicious
• Disable using -XX:-UsePerfData
• Stalls disappeared !
• -XX:-UsePerfData breaks jps, jstack
• Mount /tmp/hsperfdata_hadoop/ on tmpfs
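One way to do that (illustrative; the size option is an assumption):
mount -t tmpfs -o size=64m tmpfs /tmp/hsperfdata_hadoop
The perf data then lives in memory, so jps and jstack keep working while writes to it can no longer stall on a busy root disk.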
Result
[Chart: p99 WAL write latency (in ms).]
Lessons learnt
• Instrumentation is key
• Per request profiling is very useful
• Understanding of the Linux kernel and filesystems is important
Acknowledgements
▪ Hairong Kuang
▪ Siying Dong
▪ Kumar Sundararajan
▪ Binu John
▪ Dikang Gu
▪ Paul Tuckfield
▪ Arjen Roodselaar
▪ Matthew Byng-Maddick
▪ Liyin Tang
FB Hadoop code
• https://github.com/facebook/hadoop-20
Questions?
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.