N. S. Islam, X. Lu, M. W. Rahman, J. Jose,
H. Wang & D. K. Panda
Network-Based Computing Laboratory Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters
WBDB 2012
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Big Data Technology
• Apache Hadoop is a popular Big Data technology
– Provides a framework for large-scale, distributed data storage and processing
• Hadoop is an open-source implementation of the MapReduce programming model
• The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop and of the Hadoop database, HBase
[Diagram: Hadoop Framework — MapReduce and HBase layered on HDFS]
Open Standard InfiniBand Networking Technology
• Introduced in October 2000
• High-Performance Data Transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple Operations
– Send/Recv
– RDMA Read/Write
– Atomic Operations (unique to InfiniBand)
• High-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ….
• Around 45% of TOP500 systems use InfiniBand
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and initial MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
– More than 142,000 downloads from the OSU site directly
– Empowering many TOP500 clusters
• 7th-ranked 204,900-core cluster (Stampede) at TACC
• 14th-ranked 125,980-core cluster (Pleiades) at NASA
• 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
• and many others
– Available with software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede (9 PFlop) system
One-way Latency: MPI over IB
[Charts: Small Message Latency and Large Message Latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR; measured small-message latencies of 0.99, 1.09, 1.56, 1.64, 1.66 and 1.82 us across the configurations]
DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch
Bandwidth: MPI over IB
[Charts: Unidirectional Bandwidth (peaks of 1706 to 12485 MBytes/sec) and Bidirectional Bandwidth (peaks of 3341 to 21025 MBytes/sec) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR; same platform configurations as the latency slide]
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Sockets were not designed for high performance
– Stream semantics often mismatch the needs of upper layers (Memcached, HBase, Hadoop)
– Zero-copy is not available for non-blocking sockets
[Diagram: Current Design — Application → Sockets → 1/10 GigE Network; Enhanced Designs — Application → Accelerated Sockets → Verbs / Hardware Offload → 10 GigE or InfiniBand; Our Approach — Application → OSU Design → Verbs Interface → 10 GigE or InfiniBand]
Hadoop Distributed File System (HDFS)
• Adopted by many reputable organizations, e.g. Facebook and Yahoo!
• Highly reliable and fault-tolerant through replication
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform independence and portability
• Uses Java sockets for communication
[Figure: HDFS Architecture]
HDFS over InfiniBand
[Diagram: Applications → HDFS (Write and other operations) → either the Java Socket Interface over 1/10 GigE or IPoIB networks, or the OSU Design via the Java Native Interface (JNI) → IB Verbs → InfiniBand]
Enables high-performance RDMA communication while supporting the traditional socket interface
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,
High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
RDMA-based Design for Native HDFS-IB: Gain in Communication Times in HDFS
• Cluster with 32 HDD DataNodes (one disk per node)
– 30% improvement in communication time over IPoIB (32Gbps)
– 87% improvement in communication time over 1GigE
[Chart: Communication Time (s) for 2, 4, 6, 8 and 10 GB file sizes over 1GigE, IPoIB (QDR) and OSU-IB (QDR); OSU-IB reduces communication time by 30% relative to IPoIB]
HDFS Performance Factors
• Performance of HDFS is determined by:
– Factors related to storage and network configurations
– Controllable parameters (block-size, packet-size)
– Data access pattern
• To achieve optimal performance these factors need
to be tuned based on cluster and workload
characteristics
• A benchmark tool suite to evaluate HDFS
performance metrics in different configurations is
important for tuning
HDFS Benchmarks
• The most popular benchmark for evaluating HDFS I/O performance is TestDFSIO
– Involves the MapReduce framework
• There is a lack of a standardized benchmark suite to evaluate the performance of standalone HDFS
• Such a benchmark suite is
– Useful for evaluating HDFS performance under different network and storage configurations on modern clusters
– Relevant for applications that use native HDFS (e.g. HBase) instead of going through the MapReduce layer
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Problem Statement
• Can we design and develop a micro-benchmark suite for
evaluating I/O performance of standalone HDFS?
• Can we provide a set of benchmarks to evaluate the latency and throughput of different HDFS operations such as read, write, and mixed (read and write) workloads?
• Can we equip the benchmarks with options to set different
HDFS parameters dynamically?
• What will be the performance of HDFS operations when
evaluated using this benchmark suite on modern clusters?
• Similar to OSU Micro-Benchmarks for MPI and PGAS (OMB)
– http://mvapich.cse.ohio-state.edu
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Design Considerations
• HDFS has three main operations:
– Sequential Write
– Sequential Read
– Random Read
• HDFS performance is measured by the latency and
throughput of these operations
• Performance is influenced by underlying network,
storage, HDFS configuration parameters and data
access patterns
Design Considerations
• Interplay of Network, Storage and Cluster Configuration:
– Faster interconnects and/or protocols can enhance HDFS performance
– The number and type (HDD, SSD, or a combination) of underlying storage devices have an impact on performance
– Optimal values of HDFS block-size, packet-size, and file I/O buffer size may vary depending on cluster configuration
• Our benchmark suite
– Focuses on three kinds of data access patterns: sequential, random, and mixed (read and write)
– Has options to set HDFS configuration parameters dynamically
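To make the random access pattern concrete, here is a minimal sketch (not code from the suite, whose implementation uses the HDFS Java API): a random read that jumps ahead by a configurable seek interval, in the spirit of the RRL benchmark. A local file stands in for an HDFS input stream, and the function name and defaults are hypothetical.

```python
import os
import time

def random_read(path, read_size=4096, seek_interval=500):
    # Read read_size-byte chunks at offsets spaced seek_interval chunks
    # apart, mimicking an RRL-style seek-interval option.
    # NOTE: local file used as a stand-in for an HDFS input stream.
    total = 0
    with open(path, "rb") as f:
        file_size = os.fstat(f.fileno()).st_size
        offset = 0
        start = time.time()
        while offset < file_size:
            f.seek(offset)                      # jump ahead: random access
            total += len(f.read(read_size))
            offset += read_size * seek_interval
        elapsed = time.time() - start
    return total, elapsed
```

With seek_interval=1 the same loop degenerates into a sequential read, which is why a single benchmark can cover both SRL and RRL.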
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Benchmark Suite
• Five different benchmarks:
– Sequential Write Latency (SWL)
– Sequential or Random Read Latency (SRL or RRL)
– Sequential Write Throughput (SWT)
– Sequential Read Throughput (SRT)
– Sequential Read-Write Throughput (SRWT)
• All the benchmarks use the HDFS API for Read and Write
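As an illustration of the measurement loop behind a latency benchmark, here is a minimal sketch in the spirit of SWL: a timed sequential write of a file in packet-size chunks. The local file system stands in for the HDFS client API (the actual suite goes through HDFS's Java API), and the function name and defaults are hypothetical.

```python
import time

def sequential_write_latency(path, file_size, packet_size=64 * 1024):
    # Time a sequential write of file_size bytes in packet_size chunks,
    # mirroring an SWL-style test. The local file system stands in for
    # an HDFS output stream here.
    payload = b"x" * packet_size
    written = 0
    start = time.time()
    with open(path, "wb") as out:
        while written < file_size:
            chunk = min(packet_size, file_size - written)
            out.write(payload[:chunk])
            written += chunk
    return time.time() - start
```

Exposing packet_size (and, in the real suite, the HDFS block size and buffer size) as a parameter is what allows the same loop to probe different configurations.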
Benchmark Parameter List
Benchmark | File Name | File Size | HDFS Parameter | Readers | Writers | Random/Sequential Read | Seek Interval
SWL       | √         | √         | √              |         |         |                        |
SRL/RRL   | √         | √         | √              |         |         | √                      | √ (RRL)
SWT       |           | √         | √              |         | √       |                        |
SRT       |           | √         | √              | √       |         |                        |
SRWT      |           | √         | √              | √       | √       |                        |
The benchmark suite calculates statistics such as the Min, Max and Average of latency and throughput
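The reported statistics can be sketched as follows, assuming each reader/writer records its elapsed time for a fixed amount of data. Aggregate throughput is computed here as total data divided by the slowest client's time, which is one reasonable convention rather than necessarily the suite's exact formula; the function name is hypothetical.

```python
def summarize(per_client_times, bytes_per_client):
    # Aggregate per-reader/per-writer results into Min, Max and Average
    # latency (seconds) plus an aggregate throughput figure (MBps).
    n = len(per_client_times)
    stats = {
        "min_s": min(per_client_times),
        "max_s": max(per_client_times),
        "avg_s": sum(per_client_times) / n,
    }
    # Convention assumed here: total data moved / longest client time.
    total_mb = n * bytes_per_client / (1024 * 1024)
    stats["throughput_MBps"] = total_mb / stats["max_s"]
    return stats
```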
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Experimental Setup
• Hardware
– Intel Westmere Cluster
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs, 12 GB main memory, 160 GB hard disk
• Network: 1GigE and IPoIB (32Gbps)
• Software
– Hadoop 0.20.2 and Sun Java SDK 1.7.
Sequential Write Latency (SWL)
[Charts: File Write Latency (s) vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 DataNodes, right: 32 DataNodes]
• For 10 GB file size
– Write latency for IPoIB (32Gbps) with 4 DataNodes: 131.26s
– Write latency for IPoIB (32Gbps) with 32 DataNodes: 64.95s
– A larger number of DataNodes reduces write latency by lessening the I/O bottleneck
Sequential and Random Read Latency (SRL/RRL)
[Charts: Sequential Read Latency (SRL) and Random Read Latency (RRL) in seconds vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps), 4 DataNodes]
• For 10 GB file size
– Sequential Read latency for IPoIB (32Gbps): 40.79s
– Random Read latency for IPoIB (32Gbps): 44.8s (seek size = 500)
Sequential Write Throughput (SWT with 4 Writers)
[Charts: Write Throughput (MBps) with 4 Writers vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 DataNodes, right: 32 DataNodes]
• For 10 GB file size
– Write throughput for IPoIB (32Gbps) with 4 DataNodes: 85.7MBps
– Write throughput for IPoIB (32Gbps) with 32 DataNodes: 429MBps
– A larger number of DataNodes helps maintain the throughput by lessening the I/O bottleneck
Sequential Write Throughput (SWT with 8 Writers)
[Charts: Write Throughput (MBps) with 8 Writers vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 DataNodes, right: 32 DataNodes]
• For 10 GB file size
– Write throughput for IPoIB (32Gbps) with 4 DataNodes: 90MBps
– Write throughput for IPoIB (32Gbps) with 32 DataNodes: 841.2MBps
– An increased number of writers increases the throughput
Sequential Read Throughput (SRT)
[Charts: Read Throughput (MBps) vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 Readers, right: 8 Readers, both with 4 DataNodes]
• For 10 GB file size
– Read throughput for IPoIB (32Gbps) with 4 readers: 931MBps
– Read throughput for IPoIB (32Gbps) with 8 readers: 1902MBps
– An increased number of readers increases the throughput
Evaluation of Mixed Workload (SRWT with 4 Readers, 4 Writers)
[Charts: Latency (s) and Throughput (MBps) for Read and Write over 1GigE and IPoIB (32Gbps) vs. file size (1 to 10 GB), with 4 Readers and 4 Writers in 4 DataNodes]
• For 10 GB file size
– Read latency for IPoIB (32Gbps) with 4 readers: 30.13s
– Write latency for IPoIB (32Gbps) with 4 writers: 102s
– An increased number of readers/writers reduces the latency compared to SRL/SWL
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Conclusion and Future Work
• Design, development and implementation of a micro-benchmark suite
– Evaluates the performance of standalone HDFS
• Flexible infrastructure for the benchmarks to set HDFS configuration parameters dynamically
• Performance evaluations with our benchmarks over different interconnects on modern clusters
• The benchmark suite is helpful for designing and evaluating applications that invoke HDFS directly, without involving the MapReduce layer
• Will be made available to the Big Data community via an open-source release
Thank You!
{islamn, luxi, rahmanmd, jose, wangh, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/