N. S. Islam, X. Lu, M. W. Rahman, J. Jose,
H. Wang & D. K. Panda
Network-Based Computing Laboratory Department of Computer Science and Engineering
The Ohio State University, Columbus, OH, USA
A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters
WBDB 2012
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Big Data Technology
• Apache Hadoop is a popular Big Data technology
– Provides a framework for large-scale, distributed data storage and processing
• Hadoop is an open-source implementation of the MapReduce programming model
• The Hadoop Distributed File System (HDFS) (http://hadoop.apache.org/) is the underlying file system of Hadoop and of the Hadoop database, HBase
[Diagram: Hadoop Framework — MapReduce and HBase layered on HDFS]
Open Standard InfiniBand Networking Technology
• Introduced in October 2000
• High-Performance Data Transfer
– Interprocessor communication and I/O
– Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Flexibility for LAN and WAN communication
• Multiple Transport Services
– Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
– Provides flexibility to develop upper layers
• Multiple Operations
– Send/Recv
– RDMA Read/Write
– Atomic Operations (unique to InfiniBand)
• High-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, ….
• Around 45% of TOP500 systems use InfiniBand
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and initial MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
– More than 142,000 downloads from the OSU site directly
– Empowering many TOP500 clusters
• 7th-ranked 204,900-core cluster (Stampede) at TACC
• 14th-ranked 125,980-core cluster (Pleiades) at NASA
• 17th-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
• and many others
– Available with software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede (9 PFlop) system
One-way Latency: MPI over IB
[Charts: Small Message Latency and Large Message Latency (us) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR; measured small-message latencies of 0.99, 1.09, 1.56, 1.64, 1.66 and 1.82 us across the configurations]
DDR, QDR: 2.4 GHz Quad-core (Westmere) Intel, PCI Gen2, with IB switch; FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch; ConnectIB-Dual FDR: 2.6 GHz Octa-core (Sandybridge) Intel, PCI Gen3, with IB switch
Bandwidth: MPI over IB
[Charts: Unidirectional Bandwidth (peaks of 1706 to 12485 MBytes/sec) and Bidirectional Bandwidth (peaks of 3341 to 21025 MBytes/sec) vs. message size (bytes) for MVAPICH-Qlogic-DDR, MVAPICH-Qlogic-QDR, MVAPICH-ConnectX-DDR, MVAPICH-ConnectX2-PCIe2-QDR, MVAPICH-ConnectX3-PCIe3-FDR and MVAPICH2-Mellanox-ConnectIB-DualFDR; same platform configurations as the latency slide]
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Sockets were not designed for high performance
– Stream semantics often mismatch the needs of upper layers (Memcached, HBase, Hadoop)
– Zero-copy is not available for non-blocking sockets
[Diagram: Current Design — Application → Sockets → 1/10 GigE Network; Enhanced Designs — Application → Accelerated Sockets → Verbs / Hardware Offload → 10 GigE or InfiniBand; Our Approach — Application → OSU Design → Verbs Interface → 10 GigE or InfiniBand]
Hadoop Distributed File System (HDFS)
• Adopted by many reputable organizations, e.g. Facebook and Yahoo!
• Highly reliable and fault-tolerant through replication
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform independence and portability
• Uses Java sockets for communication
[Figure: HDFS Architecture]
HDFS over InfiniBand
[Diagram: Applications → HDFS (Write and other operations) → either the Java Socket Interface over 1/10 GigE or IPoIB networks, or the OSU Design via the Java Native Interface (JNI) → IB Verbs → InfiniBand]
Enables high-performance RDMA communication while supporting the traditional socket interface
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,
High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012
RDMA-based Design for Native HDFS-IB: Gain in Communication Times in HDFS
• Cluster with 32 HDD DataNodes (one disk per node)
– 30% improvement in communication time over IPoIB (32Gbps)
– 87% improvement in communication time over 1GigE
[Chart: Communication Time (s) for 2, 4, 6, 8 and 10 GB file sizes over 1GigE, IPoIB (QDR) and OSU-IB (QDR); OSU-IB reduces communication time by 30% relative to IPoIB]
HDFS Performance Factors
• Performance of HDFS is determined by:
– Factors related to storage and network configurations
– Controllable parameters (block-size, packet-size)
– Data access pattern
• To achieve optimal performance these factors need
to be tuned based on cluster and workload
characteristics
• A benchmark tool suite to evaluate HDFS
performance metrics in different configurations is
important for tuning
HDFS Benchmarks
• The most popular benchmark for evaluating HDFS I/O performance is TestDFSIO
– Involves the MapReduce framework
• There is a lack of a standardized benchmark suite to evaluate the performance of standalone HDFS
• Such a benchmark suite is
– Useful for evaluating HDFS performance under different network and storage configurations on modern clusters
– Relevant for applications that use native HDFS (e.g. HBase) instead of going through the MapReduce layer
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Problem Statement
• Can we design and develop a micro-benchmark suite for
evaluating I/O performance of standalone HDFS?
• Can we provide a set of benchmarks to evaluate the latency and throughput of different HDFS operations such as read, write, and mixed (read and write) workloads?
• Can we equip the benchmarks with options to set different
HDFS parameters dynamically?
• What will be the performance of HDFS operations when
evaluated using this benchmark suite on modern clusters?
• Similar to OSU Micro-Benchmarks for MPI and PGAS (OMB)
– http://mvapich.cse.ohio-state.edu
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Design Considerations
• HDFS has three main operations:
– Sequential Write
– Sequential Read
– Random Read
• HDFS performance is measured by the latency and
throughput of these operations
• Performance is influenced by underlying network,
storage, HDFS configuration parameters and data
access patterns
Design Considerations
• Interplay of Network, Storage and Cluster Configuration:
– Faster interconnects and/or protocols can enhance HDFS performance
– The number and type (HDD, SSD, or a combination) of underlying storage devices have an impact on performance
– Optimal values of HDFS block-size, packet-size, and file I/O buffer size may vary depending on cluster configuration
• Our benchmark suite
– Focuses on three kinds of data access patterns: sequential, random, and mixed (read and write)
– Has options to set HDFS configuration parameters dynamically
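To make the random access pattern concrete, here is a minimal sketch (not code from the suite, whose implementation uses the HDFS Java API): a random read that jumps ahead by a configurable seek interval, in the spirit of the RRL benchmark. A local file stands in for an HDFS input stream, and the function name and defaults are hypothetical.

```python
import os
import time

def random_read(path, read_size=4096, seek_interval=500):
    # Read read_size-byte chunks at offsets spaced seek_interval chunks
    # apart, mimicking an RRL-style seek-interval option.
    # NOTE: local file used as a stand-in for an HDFS input stream.
    total = 0
    with open(path, "rb") as f:
        file_size = os.fstat(f.fileno()).st_size
        offset = 0
        start = time.time()
        while offset < file_size:
            f.seek(offset)                      # jump ahead: random access
            total += len(f.read(read_size))
            offset += read_size * seek_interval
        elapsed = time.time() - start
    return total, elapsed
```

With seek_interval=1 the same loop degenerates into a sequential read, which is why a single benchmark can cover both SRL and RRL.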
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Benchmark Suite
• Five different benchmarks:
– Sequential Write Latency (SWL)
– Sequential or Random Read Latency (SRL or RRL)
– Sequential Write Throughput (SWT)
– Sequential Read Throughput (SRT)
– Sequential Read-Write Throughput (SRWT)
• All the benchmarks use the HDFS API for Read and Write
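As an illustration of the measurement loop behind a latency benchmark, here is a minimal sketch in the spirit of SWL: a timed sequential write of a file in packet-size chunks. The local file system stands in for the HDFS client API (the actual suite goes through HDFS's Java API), and the function name and defaults are hypothetical.

```python
import time

def sequential_write_latency(path, file_size, packet_size=64 * 1024):
    # Time a sequential write of file_size bytes in packet_size chunks,
    # mirroring an SWL-style test. The local file system stands in for
    # an HDFS output stream here.
    payload = b"x" * packet_size
    written = 0
    start = time.time()
    with open(path, "wb") as out:
        while written < file_size:
            chunk = min(packet_size, file_size - written)
            out.write(payload[:chunk])
            written += chunk
    return time.time() - start
```

Exposing packet_size (and, in the real suite, the HDFS block size and buffer size) as a parameter is what allows the same loop to probe different configurations.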
Benchmark Parameter List
Benchmark | File Name | File Size | HDFS Parameter | Readers | Writers | Random/Sequential Read | Seek Interval
SWL       | √         | √         | √              |         |         |                        |
SRL/RRL   | √         | √         | √              |         |         | √                      | √ (RRL)
SWT       |           | √         | √              |         | √       |                        |
SRT       |           | √         | √              | √       |         |                        |
SRWT      |           | √         | √              | √       | √       |                        |
The benchmark suite calculates statistics such as the Min, Max and Average of latency and throughput
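The reported statistics can be sketched as follows, assuming each reader/writer records its elapsed time for a fixed amount of data. Aggregate throughput is computed here as total data divided by the slowest client's time, which is one reasonable convention rather than necessarily the suite's exact formula; the function name is hypothetical.

```python
def summarize(per_client_times, bytes_per_client):
    # Aggregate per-reader/per-writer results into Min, Max and Average
    # latency (seconds) plus an aggregate throughput figure (MBps).
    n = len(per_client_times)
    stats = {
        "min_s": min(per_client_times),
        "max_s": max(per_client_times),
        "avg_s": sum(per_client_times) / n,
    }
    # Convention assumed here: total data moved / longest client time.
    total_mb = n * bytes_per_client / (1024 * 1024)
    stats["throughput_MBps"] = total_mb / stats["max_s"]
    return stats
```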
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Experimental Setup
• Hardware
– Intel Westmere Cluster
• Each node has 8 processor cores on 2 Intel Xeon 2.67 GHz Quad-core CPUs, 12 GB main memory, 160 GB hard disk
• Network: 1GigE and IPoIB (32Gbps)
• Software
– Hadoop 0.20.2 and Sun Java SDK 1.7.
Sequential Write Latency (SWL)
[Charts: File Write Latency (s) vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 DataNodes, right: 32 DataNodes]
• For 10 GB file size
– Write latency for IPoIB (32Gbps) with 4 DataNodes: 131.26s
– Write latency for IPoIB (32Gbps) with 32 DataNodes: 64.95s
– A larger number of DataNodes reduces write latency by lessening the I/O bottleneck
Sequential and Random Read Latency (SRL/RRL)
[Charts: Sequential Read Latency (SRL) and Random Read Latency (RRL) in seconds vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps), 4 DataNodes]
• For 10 GB file size
– Sequential Read latency for IPoIB (32Gbps): 40.79s
– Random Read latency for IPoIB (32Gbps): 44.8s (seek size = 500)
Sequential Write Throughput (SWT with 4 Writers)
[Charts: Write Throughput (MBps) with 4 Writers vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 DataNodes, right: 32 DataNodes]
• For 10 GB file size
– Write throughput for IPoIB (32Gbps) with 4 DataNodes: 85.7MBps
– Write throughput for IPoIB (32Gbps) with 32 DataNodes: 429MBps
– A larger number of DataNodes helps maintain the throughput by lessening the I/O bottleneck
Sequential Write Throughput (SWT with 8 Writers)
[Charts: Write Throughput (MBps) with 8 Writers vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 DataNodes, right: 32 DataNodes]
• For 10 GB file size
– Write throughput for IPoIB (32Gbps) with 4 DataNodes: 90MBps
– Write throughput for IPoIB (32Gbps) with 32 DataNodes: 841.2MBps
– An increased number of writers increases the throughput
Sequential Read Throughput (SRT)
[Charts: Read Throughput (MBps) vs. file size (1 to 10 GB) for 1GigE and IPoIB (32Gbps); left: 4 Readers, right: 8 Readers, both with 4 DataNodes]
• For 10 GB file size
– Read throughput for IPoIB (32Gbps) with 4 readers: 931MBps
– Read throughput for IPoIB (32Gbps) with 8 readers: 1902MBps
– An increased number of readers increases the throughput
Evaluation of Mixed Workload (SRWT with 4 Readers, 4 Writers)
[Charts: Latency (s) and Throughput (MBps) for Read and Write over 1GigE and IPoIB (32Gbps) vs. file size (1 to 10 GB), with 4 Readers and 4 Writers in 4 DataNodes]
• For 10 GB file size
– Read latency for IPoIB (32Gbps) with 4 readers: 30.13s
– Write latency for IPoIB (32Gbps) with 4 writers: 102s
– An increased number of readers/writers reduces the latency compared to SRL/SWL
Outline
• Introduction and Motivation
• Problem Statement
• Design Considerations
• Benchmark Suite
• Performance Evaluation
• Conclusion & Future work
Conclusion and Future Work
• Design, development and implementation of a micro-benchmark suite
– Evaluates the performance of standalone HDFS
• Flexible infrastructure for the benchmarks to set HDFS configuration parameters dynamically
• Performance evaluations with our benchmarks over different interconnects on modern clusters
• The benchmark suite is helpful for designing and evaluating applications that invoke HDFS directly, without involving the MapReduce layer
• Will be made available to the Big Data community via an open-source release
Thank You!
{islamn, luxi, rahmanmd, jose, wangh, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/