Transcript of "A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters"
Page 1: A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, H. Wang & D. K. Panda

Network-Based Computing Laboratory, Department of Computer Science and Engineering

The Ohio State University, Columbus, OH, USA

Page 2: Outline

• Introduction and Motivation

• Problem Statement

• Design Considerations

• Benchmark Suite

• Performance Evaluation

• Conclusion & Future work


Page 3: Big Data Technology

• Apache Hadoop is a popular Big Data technology

– Provides a framework for large-scale, distributed data storage and processing

• Hadoop is an open-source implementation of the MapReduce programming model

• The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop and of the Hadoop database, HBase

(Figure: the Hadoop framework, with MapReduce and HBase running on top of HDFS)


Page 4: Open Standard InfiniBand Networking Technology

• Introduced in October 2000

• High-performance data transfer

– Interprocessor communication and I/O

– Low latency (< 1.0 us), high bandwidth (up to 12.5 GBytes/sec, i.e., 100 Gbps), and low CPU utilization (5-10%)

• Flexibility for LAN and WAN communication

• Multiple transport services

– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram

– Provides flexibility to develop upper layers

• Multiple operations

– Send/Recv

– RDMA Read/Write

– Atomic operations (unique to IB): enable high-performance, scalable implementations of distributed locks, semaphores, and collective communication operations

• Leading to big changes in the design of HPC clusters, file systems, cloud computing systems, grid computing systems, and more

• Around 45% of TOP500 systems use InfiniBand


Page 5: MVAPICH2/MVAPICH2-X Software

• High-performance, open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and initial MPI-3.0), available since 2002

– MVAPICH2-X (MPI + PGAS), available since 2012

– Used by more than 2,000 organizations (HPC centers, industry, and universities) in 70 countries

– More than 142,000 downloads directly from the OSU site

– Empowering many TOP500 clusters

• 7th-ranked 204,900-core cluster (Stampede) at TACC

• 14th-ranked 125,980-core cluster (Pleiades) at NASA

• 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology

• and many others

– Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Partner in the U.S. NSF-TACC Stampede (9 PFlop) system

Page 6: One-way Latency: MPI over IB

(Figure: two panels, "Small Message Latency" and "Large Message Latency", plotting latency (us) against message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR, and MVAPICH2-Mellanox-ConnectIB-DualFDR. Labeled small-message latencies: 0.99, 1.09, 1.56, 1.64, 1.66, and 1.82 us.)

Platforms: DDR, QDR on 2.4 GHz quad-core (Westmere) Intel with PCIe Gen2 and IB switch; FDR on 2.6 GHz octa-core (Sandy Bridge) Intel with PCIe Gen3 and IB switch; ConnectIB-Dual FDR on 2.6 GHz octa-core (Sandy Bridge) Intel with PCIe Gen3 and IB switch.

Page 7: Bandwidth: MPI over IB

(Figure: two panels, "Unidirectional Bandwidth" and "Bidirectional Bandwidth", plotting bandwidth (MBytes/sec) against message size (bytes) for the same six MVAPICH/MVAPICH2 configurations. Labeled unidirectional bandwidths: 1706, 1917, 3280, 3385, 6343, and 12485 MBytes/sec; labeled bidirectional bandwidths: 3341, 3704, 4407, 6521, 11643, and 21025 MBytes/sec.)

Platforms: DDR, QDR on 2.4 GHz quad-core (Westmere) Intel with PCIe Gen2 and IB switch; FDR on 2.6 GHz octa-core (Sandy Bridge) Intel with PCIe Gen3 and IB switch; ConnectIB-Dual FDR on 2.6 GHz octa-core (Sandy Bridge) Intel with PCIe Gen3 and IB switch.

Page 8: Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

(Figure: three communication stacks. Current design: Application over sockets on a 1/10 GigE network. Enhanced designs: Application over accelerated sockets, with verbs/hardware offload, on 10 GigE or InfiniBand. Our approach: Application over the OSU Design, using the verbs interface, on 10 GigE or InfiniBand.)

• Sockets were not designed for high performance

– Their stream semantics often mismatch the needs of upper layers (Memcached, HBase, Hadoop)

– Zero-copy is not available for non-blocking sockets

Page 9: Hadoop Distributed File System (HDFS)

• Adopted by many reputed organizations, e.g., Facebook and Yahoo!

• Highly reliable and fault-tolerant through replication

• NameNode: stores the file system namespace

• DataNode: stores data blocks

• Developed in Java for platform independence and portability

• Uses Java sockets for communication

(Figure: HDFS architecture)

Page 10: HDFS over InfiniBand

(Figure: hybrid design. Applications run on HDFS; the write path goes through the OSU Design via the Java Native Interface (JNI) to IB verbs over InfiniBand, while other operations use the Java socket interface over a 1/10 GigE or IPoIB network.)

• Enables high-performance RDMA communication while supporting the traditional socket interface

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov. 2012

Page 11: RDMA-based Design for Native HDFS-IB: Gain in Communication Times in HDFS

• Cluster with 32 HDD DataNodes (one disk per node)

– 30% improvement in communication time over IPoIB (32 Gbps)

– 87% improvement in communication time over 1GigE

(Figure: HDFS communication time (s) vs. file size (2-10 GB) for 1GigE, IPoIB (QDR), and OSU-IB (QDR), annotated "Reduced by 30%".)

Page 12: HDFS Performance Factors

• The performance of HDFS is determined by:

– Factors related to the storage and network configuration

– Controllable parameters (block size, packet size)

– The data access pattern

• To achieve optimal performance, these factors need to be tuned to the cluster and workload characteristics

• A benchmark suite that can evaluate HDFS performance metrics under different configurations is therefore important for tuning

Page 13: HDFS Benchmarks

• The most popular benchmark for evaluating HDFS I/O performance is TestDFSIO

– It involves the MapReduce framework

• There is no standardized benchmark suite for evaluating the performance of standalone HDFS

• Such a benchmark suite would be

– Useful for evaluating HDFS performance under different network and storage configurations on modern clusters

– Relevant for applications that use native HDFS (e.g., HBase) instead of going through the MapReduce layer

Page 14: Outline

• Introduction and Motivation

• Problem Statement

• Design Considerations

• Benchmark Suite

• Performance Evaluation

• Conclusion & Future work


Page 15: Problem Statement

• Can we design and develop a micro-benchmark suite for evaluating the I/O performance of standalone HDFS?

• Can we provide a set of benchmarks to evaluate the latency and throughput of different HDFS operations, such as read, write, and mixed (read and write) workloads?

• Can we equip the benchmarks with options to set different HDFS parameters dynamically?

• What is the performance of HDFS operations when evaluated with this benchmark suite on modern clusters?

• Similar in spirit to the OSU Micro-Benchmarks (OMB) for MPI and PGAS

– http://mvapich.cse.ohio-state.edu

Page 16: Outline

• Introduction and Motivation

• Problem Statement

• Design Considerations

• Benchmark Suite

• Performance Evaluation

• Conclusion & Future work


Page 17: Design Considerations

• HDFS has three main operations:

– Sequential Write

– Sequential Read

– Random Read

• HDFS performance is measured by the latency and throughput of these operations

• Performance is influenced by the underlying network, the storage, HDFS configuration parameters, and data access patterns

Page 18: Design Considerations (continued)

• Interplay of network, storage, and cluster configuration:

– Faster interconnects and/or protocols can enhance HDFS performance

– The number and type (HDD, SSD, or a combination) of underlying storage devices affect performance

– The optimal values of the HDFS block size, packet size, and file I/O buffer size may vary with the cluster configuration

• Our benchmark suite

– Focuses on three kinds of data access patterns: sequential, random, and mixed (read and write)

– Has options to set HDFS configuration parameters dynamically (see the sketch below)
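As a concrete illustration of setting HDFS configuration parameters dynamically, here is a minimal Java sketch against the standard Hadoop Configuration/FileSystem API. The property keys (dfs.block.size, dfs.write.packet.size, io.file.buffer.size) are the Hadoop 0.20-era names for the parameters named on this slide; the values and the class name are illustrative, not taken from the suite.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DynamicHdfsConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
        // Override HDFS parameters programmatically; example values only.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // HDFS block size: 128 MB
        conf.setInt("dfs.write.packet.size", 128 * 1024);   // HDFS packet size: 128 KB
        conf.setInt("io.file.buffer.size", 1024 * 1024);    // file I/O buffer: 1 MB

        // Any FileSystem obtained from this Configuration sees the overrides,
        // so a benchmark can sweep parameter values without editing XML files.
        // Assumes fs.default.name points at the HDFS NameNode.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("block size = " + conf.getLong("dfs.block.size", -1));
        fs.close();
    }
}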

Page 19: Outline

• Introduction and Motivation

• Problem Statement

• Design Considerations

• Benchmark Suite

• Performance Evaluation

• Conclusion & Future work


Page 20: Benchmark Suite

• Five different benchmarks:

– Sequential Write Latency (SWL)

– Sequential or Random Read Latency (SRL or RRL)

– Sequential Write Throughput (SWT)

– Sequential Read Throughput (SRT)

– Sequential Read-Write Throughput (SRWT)

• All the benchmarks use the HDFS API for read and write (a write-path sketch follows below)
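To make the use of the HDFS API concrete, here is a minimal sketch of an SWL-style measurement, assuming the public org.apache.hadoop.fs API; the class name, path, and buffer size are illustrative and not the suite's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialWriteLatency {
    public static void main(String[] args) throws Exception {
        Path file = new Path(args[0]);              // e.g., /bench/swl.dat (illustrative)
        long fileSize = Long.parseLong(args[1]);    // total bytes to write

        FileSystem fs = FileSystem.get(new Configuration());
        byte[] buf = new byte[4 * 1024 * 1024];     // 4 MB write buffer (arbitrary)

        long start = System.nanoTime();
        FSDataOutputStream out = fs.create(file, true);
        for (long written = 0; written < fileSize; written += buf.length) {
            out.write(buf, 0, (int) Math.min(buf.length, fileSize - written));
        }
        out.close();                                // blocks until the pipeline drains
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("SWL: wrote %d bytes in %.2f s%n", fileSize, seconds);
    }
}

The latency reported is the wall-clock time from create() to close(), since close() waits for the replication pipeline to acknowledge all packets.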

Page 21: Benchmark Parameter List

Benchmark | File Name | File Size | HDFS Parameter | Readers | Writers | Random/Sequential Read | Seek Interval
----------|-----------|-----------|----------------|---------|---------|------------------------|--------------
SWL       | √         | √         | √              |         |         |                        |
SRL/RRL   | √         | √         | √              |         |         | √                      | √ (RRL only)
SWT       |           | √         | √              |         | √       |                        |
SRT       |           | √         | √              | √       |         |                        |
SRWT      |           | √         | √              | √       | √       |                        |

The benchmark suite reports statistics such as the minimum, maximum, and average of latency and throughput. (A random-read sketch driven by the seek parameter follows below.)
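For the random-read path, the sketch below uses FSDataInputStream's positioned readFully() to issue reads at random offsets. Interpreting the seek parameter as "number of random reads" is an assumption; the slides name a seek value of 500 without defining its semantics.

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadLatency {
    public static void main(String[] args) throws Exception {
        Path file = new Path(args[0]);             // file written beforehand (e.g., by SWL)
        int numReads = Integer.parseInt(args[1]);  // e.g., 500; interpretation assumed
        int readSize = 4 * 1024 * 1024;            // bytes per random read (arbitrary)

        FileSystem fs = FileSystem.get(new Configuration());
        long fileLen = fs.getFileStatus(file).getLen();
        byte[] buf = new byte[readSize];
        Random rand = new Random(42);              // fixed seed for repeatability

        long start = System.nanoTime();
        FSDataInputStream in = fs.open(file);
        for (int i = 0; i < numReads; i++) {
            long pos = (long) (rand.nextDouble() * Math.max(1, fileLen - readSize));
            in.readFully(pos, buf);                // positioned read at a random offset
        }
        in.close();
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("RRL: %d random %d-byte reads in %.2f s%n",
                          numReads, readSize, seconds);
    }
}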

Page 22: Outline

• Introduction and Motivation

• Problem Statement

• Design Considerations

• Benchmark Suite

• Performance Evaluation

• Conclusion & Future work


Page 23: Experimental Setup

• Hardware

– Intel Westmere cluster

• Each node has 8 processor cores on two Intel Xeon 2.67 GHz quad-core CPUs, 12 GB main memory, and a 160 GB hard disk

• Network: 1GigE and IPoIB (32 Gbps)

• Software

– Hadoop 0.20.2 and Sun Java SDK 1.7

Page 24: Sequential Write Latency (SWL)

(Figure: file write latency (s) vs. file size (1-10 GB) for 1GigE and IPoIB (32 Gbps); left panel with 4 DataNodes, right panel with 32 DataNodes.)

• For a 10 GB file

– Write latency for IPoIB (32 Gbps) with 4 DataNodes: 131.26 s

– Write latency for IPoIB (32 Gbps) with 32 DataNodes: 64.95 s

– A larger number of DataNodes reduces write latency due to a reduced I/O bottleneck

Page 25: Sequential and Random Read Latency (SRL/RRL)

(Figure: read latency (s) vs. file size (1-10 GB) for 1GigE and IPoIB (32 Gbps) with 4 DataNodes; left panel Sequential Read Latency (SRL), right panel Random Read Latency (RRL).)

• For a 10 GB file

– Sequential read latency for IPoIB (32 Gbps): 40.79 s

– Random read latency for IPoIB (32 Gbps): 44.8 s (seek size = 500)

Page 26: Sequential Write Throughput (SWT with 4 Writers)

(Figure: write throughput (MBps) vs. file size (1-10 GB) for 1GigE and IPoIB (32 Gbps); left panel 4 writers on 4 DataNodes, right panel 4 writers on 32 DataNodes.)

• For a 10 GB file

– Write throughput for IPoIB (32 Gbps) with 4 DataNodes: 85.7 MBps

– Write throughput for IPoIB (32 Gbps) with 32 DataNodes: 429 MBps

– A larger number of DataNodes helps sustain throughput due to a reduced I/O bottleneck (a sketch of an SWT-style measurement follows below)
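For reference, here is a minimal sketch of how an SWT-style aggregate throughput figure can be obtained with concurrent writer threads. It is an illustration over the public HDFS API, not the suite's implementation, and the aggregation rule (total bytes divided by total wall-clock time) is an assumption.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialWriteThroughput {
    public static void main(String[] args) throws Exception {
        final int writers = Integer.parseInt(args[0]);        // e.g., 4 or 8
        final long bytesPerWriter = Long.parseLong(args[1]);  // bytes each writer writes
        final FileSystem fs = FileSystem.get(new Configuration());

        List<Thread> threads = new ArrayList<Thread>();
        long start = System.nanoTime();
        for (int i = 0; i < writers; i++) {
            final Path file = new Path("/bench/swt-" + i + ".dat"); // illustrative path
            Thread t = new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[4 * 1024 * 1024];
                    try {
                        FSDataOutputStream out = fs.create(file, true);
                        for (long w = 0; w < bytesPerWriter; w += buf.length) {
                            out.write(buf, 0, (int) Math.min(buf.length, bytesPerWriter - w));
                        }
                        out.close();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();       // wait for all writers to finish
        double seconds = (System.nanoTime() - start) / 1e9;
        double mbps = (writers * bytesPerWriter) / (1024.0 * 1024.0) / seconds;
        System.out.printf("SWT: aggregate write throughput %.1f MBps%n", mbps);
    }
}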

Page 27: Sequential Write Throughput (SWT with 8 Writers)

(Figure: write throughput (MBps) vs. file size (1-10 GB) for 1GigE and IPoIB (32 Gbps); left panel 8 writers on 4 DataNodes, right panel 8 writers on 32 DataNodes.)

• For a 10 GB file

– Write throughput for IPoIB (32 Gbps) with 4 DataNodes: 90 MBps

– Write throughput for IPoIB (32 Gbps) with 32 DataNodes: 841.2 MBps

– An increased number of writers increases the aggregate throughput

Page 28: Sequential Read Throughput (SRT)

(Figure: read throughput (MBps) vs. file size (1-10 GB) for 1GigE and IPoIB (32 Gbps); left panel 4 readers on 4 DataNodes, right panel 8 readers on 4 DataNodes.)

• For a 10 GB file

– Read throughput for IPoIB (32 Gbps) with 4 readers: 931 MBps

– Read throughput for IPoIB (32 Gbps) with 8 readers: 1902 MBps

– An increased number of readers increases the aggregate throughput

Page 29: Evaluation of Mixed Workload (SRWT with 4 Readers, 4 Writers)

(Figure: left panel, per-operation latency (s); right panel, throughput (MBps); both vs. file size (1-10 GB) for reads and writes over 1GigE and IPoIB (32 Gbps), with 4 readers and 4 writers on 4 DataNodes.)

• For a 10 GB file

– Read latency for IPoIB (32 Gbps) with 4 readers: 30.13 s

– Write latency for IPoIB (32 Gbps) with 4 writers: 102 s

– With more concurrent readers and writers, the per-operation latencies are lower than in the SRL/SWL runs

Page 30: Outline

• Introduction and Motivation

• Problem Statement

• Design Considerations

• Benchmark Suite

• Performance Evaluation

• Conclusion & Future work


Page 31: Conclusion and Future Work

• Design, development, and implementation of a micro-benchmark suite

– Evaluates the performance of standalone HDFS

• Flexible infrastructure that lets the benchmarks set HDFS configuration parameters dynamically

• Performance evaluations with our benchmarks over different interconnects on modern clusters

• The benchmark suite is helpful for designing and evaluating applications that invoke HDFS directly, without involving the MapReduce layer

• Will be made available to the Big Data community as an open-source release

Page 32: Thank You!

{islamn, luxi, rahmanmd, jose, wangh, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/