Putting Wings on the Elephant



Putting wings on the Elephant!

Pritam Damania, Software Engineer
April 2, 2014

Agenda

1. Background
2. Major Issues in the I/O path
3. Read Improvements
4. Write Improvements
5. Lessons learnt

High Level Messages Architecture

(Diagram: the Application Server writes Messages to the HBase cluster and receives an Ack; reads fetch Messages back from HBase.)

HBase Cluster Physical Layout

▪ Multiple clusters/cells for messaging
▪ 20 servers per rack; 5 or more racks per cluster
▪ Rack #1: ZooKeeper Peer + HDFS Namenode, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #2: ZooKeeper Peer + Standby Namenode, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #3: ZooKeeper Peer + Job Tracker, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #4: ZooKeeper Peer + HBase Master, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #5: ZooKeeper Peer + Backup HBase Master, plus 19x Region Server / Data Node / Task Tracker

Write Path Overview

(Diagram: a write goes to the RegionServer, which appends it to the Write Ahead Log in HDFS and inserts it into the in-memory Memstore; the Memstore is later flushed to HFiles in HDFS.)

HDFS Write Pipeline

(Diagram: the RegionServer streams 64k packets through a pipeline of three Datanodes; each Datanode writes the packet into its OS page cache, backed by disk, and an Ack flows back up the pipeline.)

Read Path Overview

(Diagram: a Get is served by the RegionServer from the Memstore and from HFiles stored in HDFS.)

Problems in R/W Path

• Skewed disk usage

• High disk IOPS

• High p99 latency for reads and writes

Improvements in Read Path

Disk Skew

(Diagram: each Datanode writes through its OS page cache to a single disk.)

• HDFS block size: 256MB
• Each HDFS block resides on a single disk
• An fsync of 256MB hits a single disk

Disk Skew - Sync File Range

(Diagram: the block file is written to the Linux filesystem as a stream of 64k packets; sync_file_range is issued every 1MB, with a final fsync when the block is closed.)

▪ sync_file_range(SYNC_FILE_RANGE_WRITE)

▪ Initiates an async write (see the sketch below)
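Below is a minimal C sketch of the idea (the real datanode code is Java, and the file name here is illustrative): write the block in 64k packets, ask the kernel to start writing each completed 1MB range back asynchronously with sync_file_range, and fsync only once at the end, so the 256MB block never has to be flushed to one disk in a single burst.

```c
/* Sketch: stream a block file in 64KB packets and trigger async
 * writeback every 1MB with sync_file_range, so the final fsync has
 * little left to flush. Illustrative only; the real datanode is Java. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PACKET   (64 * 1024)          /* HDFS pipeline packet size   */
#define SYNC_WIN (1024 * 1024)        /* writeback window: 1MB       */
#define BLOCK    (256 * 1024 * 1024)  /* HDFS block size: 256MB      */

int main(void) {
    int fd = open("blk_sketch", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char packet[PACKET];
    memset(packet, 'x', sizeof(packet));

    off_t written = 0, last_sync = 0;
    while (written < BLOCK) {
        if (write(fd, packet, PACKET) != PACKET) { perror("write"); return 1; }
        written += PACKET;

        /* Every 1MB, ask the kernel to start writing that range back
         * asynchronously; the call itself does not wait for the I/O. */
        if (written - last_sync >= SYNC_WIN) {
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }

    /* Block is complete: one fsync, but most pages are already on disk. */
    fsync(fd);
    close(fd);
    return 0;
}
```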

High IOPS

• The Messages workload is random read

• Small preads (~4KB) on datanodes

• Two IOPS for each pread: the Datanode reads the checksum from the separate checksum file and the data from the block file

High IOPS - Inline Checksums

(Diagram: within the HDFS block, each 4096-byte data chunk is followed by its 4-byte checksum.)

• Checksums stored inline with the data

• Single IOP for each pread (see the sketch below)
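To make the layout concrete, here is a toy C sketch (not the HDFS implementation; the function names and slot layout are assumptions for illustration) of reading one chunk when its 4-byte checksum is stored inline right after the 4096 data bytes: a single pread covers both, instead of one read on the block file plus one on the checksum file. HDFS actually uses CRC32/CRC32C; the checksum function below is a stand-in.

```c
/* Toy sketch of an inline-checksum chunk read: data and its checksum
 * live next to each other in the block file, so one pread covers both. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096              /* data bytes per chunk           */
#define CSUM  4                 /* checksum bytes per chunk       */
#define SLOT  (CHUNK + CSUM)    /* on-disk size of one chunk slot */

/* Stand-in checksum; HDFS really uses CRC32/CRC32C. */
static uint32_t toy_checksum(const unsigned char *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

/* Read chunk `idx` of an inline-checksum block file with ONE pread. */
int read_chunk(int fd, long idx, unsigned char *out /* CHUNK bytes */) {
    unsigned char slot[SLOT];
    off_t off = (off_t)idx * SLOT;

    if (pread(fd, slot, SLOT, off) != SLOT)
        return -1;                            /* short read or error     */

    uint32_t stored;
    memcpy(&stored, slot + CHUNK, CSUM);      /* checksum follows data   */
    if (stored != toy_checksum(slot, CHUNK))
        return -1;                            /* corruption detected     */

    memcpy(out, slot, CHUNK);
    return 0;
}
```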

High IOPS - Results

(Graphs: number of Puts and Gets above one second, Put average time, Get average time.)

HBase Locality - HDFS Favored Nodes

▪ Each region's data is placed on 3 specific datanodes

▪ Locality is preserved on failure

▪ Favored nodes are persisted at the HBase layer

(Diagram: RegionServer co-located with its local Datanode.)

HBase Locality - Solution

• Persisting the info in the NameNode is complicated

• Region directory:
  ▪ /*HBASE/<tablename>/<regionname>/cf1/…
  ▪ /*HBASE/<tablename>/<regionname>/cf2/…

• Build a histogram of block locations in the directory

• Pick the lowest-frequency location to delete (see the sketch below)

(Bar chart: block count per Datanode D1–D6, y-axis 0 to 8000.)
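A toy C illustration of the histogram idea, using entirely hypothetical block locations (this is not the HBase code): count how often each datanode holds a replica of the region's blocks, then delete the replica sitting on the least-used datanode, which keeps the region's data concentrated on its favored nodes.

```c
/* Toy sketch (hypothetical data): build a histogram of replica
 * locations for a region's blocks, then for an over-replicated block
 * pick the replica on the least-used datanode to delete. */
#include <stdio.h>

#define DATANODES 6   /* D1..D6, as in the bar chart above */
#define REPLICAS  4   /* an over-replicated block: 4 copies of a 3-replica block */

int main(void) {
    /* Replica locations (datanode index) of each block under the
     * region directory -- made-up numbers for illustration. */
    int region_blocks[][3] = {
        {1, 2, 3}, {1, 2, 3}, {1, 3, 5}, {1, 2, 5}, {2, 3, 5},
    };
    int nblocks = sizeof(region_blocks) / sizeof(region_blocks[0]);

    /* 1. Histogram: how often each datanode holds this region's data. */
    int freq[DATANODES] = {0};
    for (int b = 0; b < nblocks; b++)
        for (int r = 0; r < 3; r++)
            freq[region_blocks[b][r]]++;

    /* 2. For an over-replicated block, delete the replica on the
     * datanode that appears least often in the histogram, so locality
     * stays concentrated on the favored nodes. */
    int extra[REPLICAS] = {1, 2, 3, 4};   /* datanodes holding the 4 copies */
    int victim = extra[0];
    for (int i = 1; i < REPLICAS; i++)
        if (freq[extra[i]] < freq[victim])
            victim = extra[i];

    printf("delete the replica on datanode D%d\n", victim + 1);
    return 0;
}
```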

More Improvements

• Keep fds open

• Throttle re-replication

Improvements in Write Path

HBase WAL

(Diagram: the RegionServer writes WAL packets through the pipeline of three Datanodes; each write lands in the OS page cache, not on disk.)

• Packets never hit disk

• > 1s outliers!

Instrumentation

1. Write to OS cache

2. Write to TCP buffers

3. sync_file_range(SYNC_FILE_RANGE_WRITE)

Outliers > 1s observed in steps 1 and 3!

Use of strace

Interesting Observations

• write(2) outliers correlated with busy disk

• Reproducible by artificially stressing disk

# Saturate a single data disk with large synchronous writes
dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000

Test Program

(Diagram 1: the file is written to the Linux filesystem as page-aligned 64k chunks, with sync_file_range every 1MB → No outliers!)

(Diagram 2: the file is written as 63k + 1k chunks, so consecutive writes share a 4k page, with sync_file_range every 1MB → Outliers reproduced! See the sketch below.)
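The following is an illustrative C reconstruction of such a test program (not the original; the file name, sizes, and SPLIT switch are assumptions): it writes the file in fixed-size chunks, issues sync_file_range every 1MB, and times every write(2), flagging anything over one second. Switching SPLIT from page-aligned 64k writes to the 63k + 1k pattern is what reproduces the stalls, since the 1k write touches a page that may already be under writeback.

```c
/* Sketch of the write-outlier test: time each write(2) while
 * sync_file_range runs every 1MB. Set SPLIT to 0 for page-aligned
 * 64KB writes (no outliers) or 1 for 63KB + 1KB writes (outliers,
 * because the 1KB write hits a page that can be under writeback). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define SPLIT    1
#define TOTAL_MB 512

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static void timed_write(int fd, const char *buf, size_t len) {
    double t0 = now_ms();
    if (write(fd, buf, len) != (ssize_t)len) perror("write");
    double dt = now_ms() - t0;
    if (dt > 1000.0)                               /* flag >1s outliers */
        printf("outlier: %zu-byte write took %.0f ms\n", len, dt);
}

int main(void) {
    int fd = open("wal_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[64 * 1024];
    memset(buf, 'x', sizeof(buf));

    off_t written = 0, last_sync = 0;
    for (int i = 0; i < TOTAL_MB * 16; i++) {      /* 16 x 64KB per MB  */
        if (SPLIT) {                               /* 63KB then 1KB     */
            timed_write(fd, buf, 63 * 1024);
            timed_write(fd, buf, 1 * 1024);
        } else {                                   /* one aligned 64KB  */
            timed_write(fd, buf, 64 * 1024);
        }
        written += 64 * 1024;

        if (written - last_sync >= 1024 * 1024) {  /* sync every 1MB    */
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }
    close(fd);
    return 0;
}
```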

Some suspects

• Too many dirty pages

• Linux stable pages

• Kernel trace points revealed stable pages as the culprit

Stable Pages

(Diagram: during writeback, the kernel computes a checksum of the OS page and the device with integrity checking verifies it against its own device checksum.)

• Checksum error if a page is modified while it is being written back

• Solution – lock pages that are under writeback (stable pages)

Explanation of Write Outliers

(Diagram: a WAL write targets a 4k OS page that is already under writeback (sync_file_range) to the persistent store; with stable pages, the WAL write is blocked until the writeback completes.)

Solution ?

Patch : http://thread.gmane.org/gmane.comp.file-systems.ext4/35561

sync_file_range ?

• sync_file_range is not async once there are > 128 outstanding write requests (the call can block)

• Solution – issue sync_file_range from a threadpool (see the sketch below)
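A hedged C sketch of the workaround (the actual fix lives in the Java datanode; the queue here is a deliberately simplified one-slot version and all names are illustrative): the write path hands the (offset, length) range to a background thread and returns immediately, so even when sync_file_range blocks behind a deep disk queue it never stalls the foreground WAL write.

```c
/* Sketch: move sync_file_range off the write path onto a background
 * thread so a blocking call never stalls the foreground writer.
 * One-slot "queue": an unserviced hint is simply overwritten by the
 * next one, acceptable for writeback hints. Compile with -pthread. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

struct sync_req { int fd; off_t off; off_t len; bool stop; };

static struct sync_req pending;
static bool have_req = false;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

/* Background worker: performs the (possibly blocking) syscall. */
static void *sync_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (!have_req)
            pthread_cond_wait(&cv, &mu);
        struct sync_req req = pending;
        have_req = false;
        pthread_mutex_unlock(&mu);

        if (req.stop)
            return NULL;
        sync_file_range(req.fd, req.off, req.len, SYNC_FILE_RANGE_WRITE);
    }
}

/* Called from the write path: record the hint and return immediately. */
static void async_sync_range(int fd, off_t off, off_t len, bool stop) {
    pthread_mutex_lock(&mu);
    pending = (struct sync_req){ fd, off, len, stop };
    have_req = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, sync_worker, NULL);

    int fd = open("wal_sketch", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    static char buf[1024 * 1024];
    memset(buf, 'x', sizeof(buf));

    if (write(fd, buf, sizeof(buf)) < 0) perror("write");  /* foreground WAL write  */
    async_sync_range(fd, 0, sizeof(buf), false);           /* hint, without waiting */

    async_sync_range(fd, 0, 0, true);                      /* ask the worker to exit */
    pthread_join(worker, NULL);
    close(fd);
    return 0;
}
```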

Results

(Graph: p99 write latency to the OS cache, in ms.)

Per Request Profiling

• Entire profile of client requests

• Full profile of each pipeline write

• Full profile of each pread

• A lot of visibility!

Interesting Profiles

• In memory operations >1s• No Java GC• Co-related with busy root disk• Reproducible by stressing root

disk

Investigation

• Use lsof

• /tmp/hsperfdata_hadoop/<pid> suspicious

• Disable using -XX:-UsePerfData

• Stalls disappeared !

• -XX:-UsePerfData breaks jps, jstack

• Mount /tmp/hsperfdata_hadoop/ on tmpfs

Result

(Graph: p99 WAL write latency, in ms.)

Lessons learnt

• Instrumentation is key

• Per request profiling is very useful

• Understanding the Linux kernel and filesystem is important

Acknowledgements

▪ Hairong Kuang

▪ Siying Dong

▪ Kumar Sundararajan

▪ Binu John

▪ Dikang Gu

▪ Paul Tuckfield

▪ Arjen Roodselaar

▪ Matthew Byng-Maddick

▪ Liyin Tang

FB Hadoop code

• https://github.com/facebook/hadoop-20

Questions ?
