Can High-Performance Interconnects Benefit Hadoop Distributed File System?
Sayantan Sur, Hao Wang, Jian Huang, Xiangyong Ouyang, D. K. Panda
Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, USA
Introduction
• MapReduce – a scalable model for processing petabytes of data
• Hadoop MapReduce framework widely adopted
  – Hadoop Distributed File System (HDFS) provides the core storage, distribution and fault-tolerance features
  – Designed with Gigabit Ethernet and sockets in mind
• The field of High-Performance Computing (HPC) has adopted advanced interconnects
  – Low latency, high bandwidth
  – Low CPU cycle requirement
  – InfiniBand and 10 Gigabit Ethernet are two examples
• Solid State Drives (SSDs) provide improved I/O characteristics
• Can HDFS benefit from these two emerging technologies?
Typical HPC and Cloud Computing Deployments
• HPC system design is interconnect-centric
• Cloud computing environments have complex software stacks and have historically relied on sockets and Ethernet
[Diagram: a typical HPC deployment – a compute cluster on a high-performance interconnect with dedicated meta-data and I/O servers – contrasted with a typical cloud deployment – physical machines hosting VMs behind a GigE frontend, each with local storage, exposed through a virtual FS]
HDFS Architecture
• HDFS is a distributed user-level file system
• Provides fault tolerance by replicating data blocks
• Block size is typically 64 MB, and each block is replicated three times (possibly in different racks); see the sketch below
• A dedicated NameNode stores information on the data blocks
• DataNodes store the data blocks themselves; MapReduce computation tasks are scheduled on them
• A dedicated JobTracker tracks jobs (and failures)
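As a concrete illustration of these parameters, here is a minimal sketch (not from the slides) of writing a file through the HDFS Java API with a 64 MB block size and a replication factor of three; the path, buffer size and data volume are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the NameNode
        Path file = new Path("/tmp/example.dat");   // illustrative path
        // create(path, overwrite, bufferSize, replication, blockSize):
        // 64 MB blocks, each replicated three times, as described above
        FSDataOutputStream out =
                fs.create(file, true, 4096, (short) 3, 64L * 1024 * 1024);
        byte[] buf = new byte[4096];
        for (int i = 0; i < 1024; i++) {
            out.write(buf);                         // stream 4 MB to the DataNode pipeline
        }
        out.close();                                // completes the replication pipeline
        fs.close();
    }
}
```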
InfiniBand and 10 Gigabit Ethernet
• InfiniBand is an industry-standard packet-switched network
• Has been increasingly adopted in HPC systems
• Offers user-level networking with OS-bypass (verbs)
• 10 Gigabit Ethernet is the follow-on to Gigabit Ethernet
• Provides user-level networking with OS-bypass (iWARP)
• Some vendors have accelerated TCP/IP by putting it on the network card (hardware offload)
• Convergence: it is possible to use both through the OpenFabrics networking stack
  – Same software, different interconnects
InfiniBand Efficiency in Top500 List
[Figure: Computer Cluster Efficiency Comparison – efficiency (%) across the Top 500 systems, with series for IB-CPU, IB-GPU/Cell, GigE, 10GigE, IBM-BlueGene and Cray]
Outline
• Introduction
• Problem Statement
• Modern Interconnects and Protocols
• Experimental Results & Analysis
• Conclusions & Future Work
Problem Statement
• Can High-Performance Interconnects help HDFS performance significantly?
• How much benefit is possible without any software modifications?
• Can emerging SSD technology complement the performance advantages of high-performance interconnects?
Modern Interconnects and Protocols
[Diagram: the protocol stacks compared in this work, from the application interface down to the adapter and switch, summarized below]
• 1/10 GigE: sockets interface, TCP/IP implemented in kernel space (Ethernet driver), 1/10 GigE adapter and Ethernet switch
• 10GigE-TOE: sockets interface, TCP/IP offloaded to the network card, 10 GigE adapter and switch
• IPoIB: sockets interface, TCP/IP in kernel space, InfiniBand adapter and switch
• SDP: sockets interface, RDMA in user space, InfiniBand adapter and switch
• Verbs: verbs interface, RDMA in user space, InfiniBand adapter and switch
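In practice this convergence means an unmodified sockets program runs over GigE, 10GigE or IPoIB simply by connecting to an address on the corresponding interface, and over SDP typically by preloading a socket-switching library such as OFED's libsdp. A minimal sketch (the address and port below are hypothetical, not from the slides):

```java
// Minimal sketch: unmodified Java sockets code that can run over 1GigE,
// 10GigE or IPoIB depending only on which interface the address belongs to
// (and over SDP when a library such as libsdp is preloaded). The address
// 192.168.10.1 and port 9999 are hypothetical.
import java.io.OutputStream;
import java.net.Socket;

public class SocketsOverAnyInterconnect {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket("192.168.10.1", 9999); // e.g. the server's IPoIB address
        try {
            OutputStream out = s.getOutputStream();
            out.write("same software, different interconnects".getBytes());
            out.flush();
        } finally {
            s.close();
        }
    }
}
```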
MVAPICH and MVAPICH2 Software
• High-performance MPI library for IB and HSE
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
  – Used by more than 1,300 organizations in 60 countries
  – More than 46,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 11th-ranked 81,920-core cluster (Pleiades) at NASA
    • 15th-ranked 62,976-core cluster (Ranger) at TACC
  – Available with the software stacks of many IB, HSE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
  – http://mvapich.cse.ohio-state.edu
One-way Latency: MPI over IB
[Figure: small message latency (left) and large message latency (right) in us vs. message size in bytes, for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR-PCIe2 and MVAPICH-ConnectX-DDR; annotated small-message latencies are 1.54, 1.60, 1.96 and 2.17 us. All numbers taken on 2.4 GHz quad-core (Nehalem) Intel with IB switch.]
Bandwidth: MPI over IB
[Figure: unidirectional bandwidth (left; annotated peaks of 1553.2, 1901.1, 2665.6 and 3023.7 MillionBytes/sec) and bidirectional bandwidth (right; annotated peaks of 2990.1, 3244.1, 3642.8 and 5835.7 MillionBytes/sec) vs. message size in bytes, for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR-PCIe2 and MVAPICH-ConnectX-DDR. All numbers taken on 2.4 GHz quad-core (Nehalem) Intel with IB switch.]
Experimental Setup
• Hadoop 0.20.2
• Sun/Oracle Java 1.6.0
• Intel Xeon 2.33 GHz quad-core CPUs
• 6 GB main memory, 250 GB hard disk
• Intel X25-E 64 GB SSD
• Mellanox MT25208 DDR (16 Gbps) InfiniBand HCA
• Chelsio T320 10 GigE adapter
• We dedicate one node as the NameNode and another as the JobTracker
• We vary the number of DataNodes over 2, 4 and 8
Microbenchmark-Level Evaluation
• Sockets-level ping-pong bandwidth test
• The client sends data to the server; the server receives the data and sends an ack back (sketched below)
• Java performance depends on the use of NIO direct buffers (allocateDirect)
• The C and Java versions of the benchmark have similar performance
• HDFS does not use directly allocated buffers or NIO on the DataNode
[Figures: ping-pong bandwidth with the C version (left) and the Java version (right)]
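A minimal client-side sketch of such a ping-pong test in Java NIO, assuming a server that reads the full payload and replies with a one-byte ack (host argument, port, message size and iteration count are illustrative); replacing allocateDirect() with allocate() reproduces the slower heap-buffer path the slide alludes to:

```java
import java.io.EOFException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class PingPongClient {
    public static void main(String[] args) throws Exception {
        final int size = 1 << 20;                          // 1 MB payload (illustrative)
        final int iters = 100;
        SocketChannel ch = SocketChannel.open(new InetSocketAddress(args[0], 9999));
        // Direct buffers let the JVM hand memory straight to the kernel,
        // avoiding the extra heap copy the slide's NIO observation refers to.
        ByteBuffer msg = ByteBuffer.allocateDirect(size);
        ByteBuffer ack = ByteBuffer.allocateDirect(1);
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            msg.clear();
            while (msg.hasRemaining()) ch.write(msg);      // send the payload
            ack.clear();
            while (ack.hasRemaining()) {                   // wait for the 1-byte ack
                if (ch.read(ack) < 0) throw new EOFException("server closed");
            }
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("Bandwidth: %.1f MB/s%n", (size / 1048576.0) * iters / secs);
        ch.close();
    }
}
```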
DFS IO Write Performance
• DFS IO, included in Hadoop, measures sequential-access throughput (invocation sketched below)
• Two map tasks each write to a file of increasing size (1-10 GB)
• Eight DataNodes for HDD (left), four DataNodes for SSD (right)
• Significant improvement with IPoIB, SDP and 10GigE
• With SSD, the performance improvement is almost seven- to eight-fold!
• SSD benefits are not seen without a high-performance interconnect!
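For reference, this benchmark ships with Hadoop 0.20.2 in the test jar as org.apache.hadoop.fs.TestDFSIO and is usually launched via `hadoop jar`; a hedged sketch of driving one point of the sweep above programmatically (file size is given in MB):

```java
// Hedged sketch: invoking the DFS IO write benchmark bundled with Hadoop
// 0.20.2 (org.apache.hadoop.fs.TestDFSIO in the test jar). The arguments
// mirror the setup above: two files (map tasks), here at the 1 GB point
// of the 1-10 GB sweep.
public class RunDfsioWrite {
    public static void main(String[] args) throws Exception {
        org.apache.hadoop.fs.TestDFSIO.main(new String[] {
            "-write", "-nrFiles", "2", "-fileSize", "1024"  // 2 files, 1024 MB each
        });
    }
}
```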
RandomWriter Performance
• Each map task generates 1 GB of random binary data and writes it to HDFS (sketched below)
• 30% improvement for HDD using 8 DataNodes (IPoIB, SDP, 10GigE)
• SSD improves execution time by 50% with 1GigE for two DataNodes
• However, with four DataNodes, the benefits are not observed unless IPoIB, SDP or 10GigE is used
• IPoIB, SDP and 10GigE can improve performance by 59% on four DataNodes
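A hedged sketch of what such a map task does, in the spirit of Hadoop's RandomWriter example; this is a simplified illustration, not the actual RandomWriter source:

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified illustration of a RandomWriter-style map task: it ignores its
// input and emits ~1 GB of random bytes, which a map-only job persists to HDFS.
public class RandomWriteMapper
        extends Mapper<Object, Object, BytesWritable, BytesWritable> {
    @Override
    protected void map(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        Random random = new Random();
        final long target = 1L << 30;                 // 1 GB per map task
        byte[] buf = new byte[64 * 1024];             // 64 KB chunks (illustrative)
        for (long written = 0; written < target; written += buf.length) {
            random.nextBytes(buf);
            context.write(new BytesWritable(buf), new BytesWritable());
        }
    }
}
```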
Sort Benchmark
• Sort: the baseline benchmark for Hadoop
• Bound by disk I/O bandwidth in the sort phase, but bound by network performance in the reduce phase
• SSD improves performance by 28% using 1GigE with two DataNodes
• Benefit of 48% on four DataNodes using SDP, IPoIB or 10GigE
Conclusions and Future Work
• High-performance interconnects can be used to boost the performance of HDFS workloads
• Benefits are observed when an SSD is used instead of a hard disk, in combination with a fast network
• Disk I/O is still a bottleneck in HDFS
• Using SSDs alone cannot improve performance; they must be coupled with a high-performance interconnect
• The HDFS design can be improved to use lower-level communication (verbs) to further leverage advanced networks
• We are currently looking at more workloads and at HBase performance
Thank You!
{surs, wangh, huangjia, ouyangx, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/