Putting Wings on the Elephant



Putting wings on the Elephant!

Pritam Damania, Software Engineer
April 2, 2014

Agenda

1. Background
2. Major Issues in the I/O path
3. Read Improvements
4. Write Improvements
5. Lessons learnt

High Level Messages Architecture

(Diagram: the Application Server writes Messages to the HBase cluster and receives an Ack; reads fetch Messages back from HBase.)

HBase Cluster Physical Layout

▪ Multiple clusters/cells for messaging
▪ 20 servers per rack; 5 or more racks per cluster
▪ Rack #1: ZooKeeper Peer + HDFS Namenode, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #2: ZooKeeper Peer + Standby Namenode, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #3: ZooKeeper Peer + Job Tracker, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #4: ZooKeeper Peer + HBase Master, plus 19x Region Server / Data Node / Task Tracker
▪ Rack #5: ZooKeeper Peer + Backup HBase Master, plus 19x Region Server / Data Node / Task Tracker

Write Path Overview

(Diagram: a write goes to the RegionServer, which appends it to the Write Ahead Log in HDFS and inserts it into the in-memory Memstore; the Memstore is later flushed to HFiles in HDFS.)

HDFS Write Pipeline

(Diagram: the RegionServer streams 64k packets through a pipeline of three Datanodes; each Datanode writes the packet into its OS page cache, backed by disk, and an Ack flows back up the pipeline.)

Read Path Overview

(Diagram: a Get is served by the RegionServer from the Memstore and from HFiles stored in HDFS.)

Problems in R/W Path

• Skewed disk usage

• High disk IOPS

• High p99 latency for reads and writes

Improvements in Read Path

Disk Skew

(Diagram: each Datanode writes through its OS page cache to a single disk.)

• HDFS block size: 256MB
• Each HDFS block resides on a single disk
• An fsync of 256MB hits a single disk

Disk Skew - Sync File Range

(Diagram: the block file is written to the Linux filesystem as a stream of 64k packets; sync_file_range is issued every 1MB, with a final fsync when the block is closed.)

▪ sync_file_range(SYNC_FILE_RANGE_WRITE)

▪ Initiates an async write (see the sketch below)
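Below is a minimal C sketch of the idea (the real datanode code is Java, and the file name here is illustrative): write the block in 64k packets, ask the kernel to start writing each completed 1MB range back asynchronously with sync_file_range, and fsync only once at the end, so the 256MB block never has to be flushed to one disk in a single burst.

```c
/* Sketch: stream a block file in 64KB packets and trigger async
 * writeback every 1MB with sync_file_range, so the final fsync has
 * little left to flush. Illustrative only; the real datanode is Java. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PACKET   (64 * 1024)          /* HDFS pipeline packet size   */
#define SYNC_WIN (1024 * 1024)        /* writeback window: 1MB       */
#define BLOCK    (256 * 1024 * 1024)  /* HDFS block size: 256MB      */

int main(void) {
    int fd = open("blk_sketch", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char packet[PACKET];
    memset(packet, 'x', sizeof(packet));

    off_t written = 0, last_sync = 0;
    while (written < BLOCK) {
        if (write(fd, packet, PACKET) != PACKET) { perror("write"); return 1; }
        written += PACKET;

        /* Every 1MB, ask the kernel to start writing that range back
         * asynchronously; the call itself does not wait for the I/O. */
        if (written - last_sync >= SYNC_WIN) {
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }

    /* Block is complete: one fsync, but most pages are already on disk. */
    fsync(fd);
    close(fd);
    return 0;
}
```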

High IOPS

• The Messages workload is random read

• Small preads (~4KB) on datanodes

• Two IOPS for each pread: the Datanode reads the checksum from the separate checksum file and the data from the block file

High IOPS - Inline Checksums

(Diagram: within the HDFS block, each 4096-byte data chunk is followed by its 4-byte checksum.)

• Checksums stored inline with the data

• Single IOP for each pread (see the sketch below)
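To make the layout concrete, here is a toy C sketch (not the HDFS implementation; the function names and slot layout are assumptions for illustration) of reading one chunk when its 4-byte checksum is stored inline right after the 4096 data bytes: a single pread covers both, instead of one read on the block file plus one on the checksum file. HDFS actually uses CRC32/CRC32C; the checksum function below is a stand-in.

```c
/* Toy sketch of an inline-checksum chunk read: data and its checksum
 * live next to each other in the block file, so one pread covers both. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define CHUNK 4096              /* data bytes per chunk           */
#define CSUM  4                 /* checksum bytes per chunk       */
#define SLOT  (CHUNK + CSUM)    /* on-disk size of one chunk slot */

/* Stand-in checksum; HDFS really uses CRC32/CRC32C. */
static uint32_t toy_checksum(const unsigned char *buf, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + buf[i];
    return sum;
}

/* Read chunk `idx` of an inline-checksum block file with ONE pread. */
int read_chunk(int fd, long idx, unsigned char *out /* CHUNK bytes */) {
    unsigned char slot[SLOT];
    off_t off = (off_t)idx * SLOT;

    if (pread(fd, slot, SLOT, off) != SLOT)
        return -1;                            /* short read or error     */

    uint32_t stored;
    memcpy(&stored, slot + CHUNK, CSUM);      /* checksum follows data   */
    if (stored != toy_checksum(slot, CHUNK))
        return -1;                            /* corruption detected     */

    memcpy(out, slot, CHUNK);
    return 0;
}
```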

High IOPS - Results

(Graphs: number of Puts and Gets above one second, Put average time, Get average time.)

HBase Locality - HDFS Favored Nodes

▪ Each region's data is placed on 3 specific datanodes

▪ Locality is preserved on failure

▪ Favored nodes are persisted at the HBase layer

(Diagram: RegionServer co-located with its local Datanode.)

HBase Locality - Solution

• Persisting the info in the NameNode is complicated

• Region directory:
  ▪ /*HBASE/<tablename>/<regionname>/cf1/…
  ▪ /*HBASE/<tablename>/<regionname>/cf2/…

• Build a histogram of block locations in the directory

• Pick the lowest-frequency location to delete (see the sketch below)

(Bar chart: block count per Datanode D1–D6, y-axis 0 to 8000.)
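A toy C illustration of the histogram idea, using entirely hypothetical block locations (this is not the HBase code): count how often each datanode holds a replica of the region's blocks, then delete the replica sitting on the least-used datanode, which keeps the region's data concentrated on its favored nodes.

```c
/* Toy sketch (hypothetical data): build a histogram of replica
 * locations for a region's blocks, then for an over-replicated block
 * pick the replica on the least-used datanode to delete. */
#include <stdio.h>

#define DATANODES 6   /* D1..D6, as in the bar chart above */
#define REPLICAS  4   /* an over-replicated block: 4 copies of a 3-replica block */

int main(void) {
    /* Replica locations (datanode index) of each block under the
     * region directory -- made-up numbers for illustration. */
    int region_blocks[][3] = {
        {1, 2, 3}, {1, 2, 3}, {1, 3, 5}, {1, 2, 5}, {2, 3, 5},
    };
    int nblocks = sizeof(region_blocks) / sizeof(region_blocks[0]);

    /* 1. Histogram: how often each datanode holds this region's data. */
    int freq[DATANODES] = {0};
    for (int b = 0; b < nblocks; b++)
        for (int r = 0; r < 3; r++)
            freq[region_blocks[b][r]]++;

    /* 2. For an over-replicated block, delete the replica on the
     * datanode that appears least often in the histogram, so locality
     * stays concentrated on the favored nodes. */
    int extra[REPLICAS] = {1, 2, 3, 4};   /* datanodes holding the 4 copies */
    int victim = extra[0];
    for (int i = 1; i < REPLICAS; i++)
        if (freq[extra[i]] < freq[victim])
            victim = extra[i];

    printf("delete the replica on datanode D%d\n", victim + 1);
    return 0;
}
```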

More Improvements

• Keep fds open

• Throttle re-replication

Improvements in Write Path

HBase WAL

(Diagram: the RegionServer writes WAL packets through the pipeline of three Datanodes; each write lands in the OS page cache, not on disk.)

• Packets never hit disk

• > 1s outliers!

Instrumentation

1. Write to OS cache

2. Write to TCP buffers

3. sync_file_range(SYNC_FILE_RANGE_WRITE)

Outliers > 1s observed in steps 1 and 3!

Use of strace

Interesting Observations

• write(2) outliers correlated with busy disk

• Reproducible by artificially stressing disk

# Saturate a single data disk with large synchronous writes
dd oflag=sync,dsync if=/dev/zero of=/mnt/d7/test/tempfile bs=256M count=1000

Test Program

(Diagram 1: the file is written to the Linux filesystem as page-aligned 64k chunks, with sync_file_range every 1MB → No outliers!)

(Diagram 2: the file is written as 63k + 1k chunks, so consecutive writes share a 4k page, with sync_file_range every 1MB → Outliers reproduced! See the sketch below.)
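The following is an illustrative C reconstruction of such a test program (not the original; the file name, sizes, and SPLIT switch are assumptions): it writes the file in fixed-size chunks, issues sync_file_range every 1MB, and times every write(2), flagging anything over one second. Switching SPLIT from page-aligned 64k writes to the 63k + 1k pattern is what reproduces the stalls, since the 1k write touches a page that may already be under writeback.

```c
/* Sketch of the write-outlier test: time each write(2) while
 * sync_file_range runs every 1MB. Set SPLIT to 0 for page-aligned
 * 64KB writes (no outliers) or 1 for 63KB + 1KB writes (outliers,
 * because the 1KB write hits a page that can be under writeback). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define SPLIT    1
#define TOTAL_MB 512

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

static void timed_write(int fd, const char *buf, size_t len) {
    double t0 = now_ms();
    if (write(fd, buf, len) != (ssize_t)len) perror("write");
    double dt = now_ms() - t0;
    if (dt > 1000.0)                               /* flag >1s outliers */
        printf("outlier: %zu-byte write took %.0f ms\n", len, dt);
}

int main(void) {
    int fd = open("wal_test", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[64 * 1024];
    memset(buf, 'x', sizeof(buf));

    off_t written = 0, last_sync = 0;
    for (int i = 0; i < TOTAL_MB * 16; i++) {      /* 16 x 64KB per MB  */
        if (SPLIT) {                               /* 63KB then 1KB     */
            timed_write(fd, buf, 63 * 1024);
            timed_write(fd, buf, 1 * 1024);
        } else {                                   /* one aligned 64KB  */
            timed_write(fd, buf, 64 * 1024);
        }
        written += 64 * 1024;

        if (written - last_sync >= 1024 * 1024) {  /* sync every 1MB    */
            sync_file_range(fd, last_sync, written - last_sync,
                            SYNC_FILE_RANGE_WRITE);
            last_sync = written;
        }
    }
    close(fd);
    return 0;
}
```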

Some suspects

• Too many dirty pages

• Linux stable pages

• Kernel trace points revealed stable pages as the culprit

Stable Pages

(Diagram: during writeback, the kernel computes a checksum of the OS page and the device with integrity checking verifies it against its own device checksum.)

• Checksum error if a page is modified while it is being written back

• Solution – lock pages that are under writeback (stable pages)

Explanation of Write Outliers

(Diagram: a WAL write targets a 4k OS page that is already under writeback (sync_file_range) to the persistent store; with stable pages, the WAL write is blocked until the writeback completes.)

Solution ?

Patch : http://thread.gmane.org/gmane.comp.file-systems.ext4/35561

sync_file_range ?

• sync_file_range is not async once there are > 128 outstanding write requests (the call can block)

• Solution – issue sync_file_range from a threadpool (see the sketch below)
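A hedged C sketch of the workaround (the actual fix lives in the Java datanode; the queue here is a deliberately simplified one-slot version and all names are illustrative): the write path hands the (offset, length) range to a background thread and returns immediately, so even when sync_file_range blocks behind a deep disk queue it never stalls the foreground WAL write.

```c
/* Sketch: move sync_file_range off the write path onto a background
 * thread so a blocking call never stalls the foreground writer.
 * One-slot "queue": an unserviced hint is simply overwritten by the
 * next one, acceptable for writeback hints. Compile with -pthread. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

struct sync_req { int fd; off_t off; off_t len; bool stop; };

static struct sync_req pending;
static bool have_req = false;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

/* Background worker: performs the (possibly blocking) syscall. */
static void *sync_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mu);
        while (!have_req)
            pthread_cond_wait(&cv, &mu);
        struct sync_req req = pending;
        have_req = false;
        pthread_mutex_unlock(&mu);

        if (req.stop)
            return NULL;
        sync_file_range(req.fd, req.off, req.len, SYNC_FILE_RANGE_WRITE);
    }
}

/* Called from the write path: record the hint and return immediately. */
static void async_sync_range(int fd, off_t off, off_t len, bool stop) {
    pthread_mutex_lock(&mu);
    pending = (struct sync_req){ fd, off, len, stop };
    have_req = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);
}

int main(void) {
    pthread_t worker;
    pthread_create(&worker, NULL, sync_worker, NULL);

    int fd = open("wal_sketch", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    static char buf[1024 * 1024];
    memset(buf, 'x', sizeof(buf));

    if (write(fd, buf, sizeof(buf)) < 0) perror("write");  /* foreground WAL write  */
    async_sync_range(fd, 0, sizeof(buf), false);           /* hint, without waiting */

    async_sync_range(fd, 0, 0, true);                      /* ask the worker to exit */
    pthread_join(worker, NULL);
    close(fd);
    return 0;
}
```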

Results

(Graph: p99 write latency to the OS cache, in ms.)

Per Request Profiling

• Entire profile of client requests

• Full profile of each pipeline write

• Full profile of each pread

• A lot of visibility!

Interesting Profiles

• In memory operations >1s• No Java GC• Co-related with busy root disk• Reproducible by stressing root

disk

Investigation

• Use lsof

• /tmp/hsperfdata_hadoop/<pid> suspicious

• Disable using -XX:-UsePerfData

• Stalls disappeared !

• -XX:-UsePerfData breaks jps, jstack

• Mount /tmp/hsperfdata_hadoop/ on tmpfs

Result

(Graph: p99 WAL write latency, in ms.)

Lessons learnt

• Instrumentation is key

• Per request profiling is very useful

• Understanding the Linux kernel and filesystem is important

Acknowledgements

▪ Hairong Kuang

▪ Siying Dong

▪ Kumar Sundararajan

▪ Binu John

▪ Dikang Gu

▪ Paul Tuckfield

▪ Arjen Roodselaar

▪ Matthew Byng-Maddick

▪ Liyin Tang

FB Hadoop code

• https://github.com/facebook/hadoop-20

Questions ?
