Performance Optimization of Hadoop Using InfiniBand RDMA
Dhabaleswar K. Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at ISC Big Data (October 2014)
Data Management and Processing in Modern Datacenters
• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g., MySQL), HBase
– Back-end data analytics (Offline)
• Hadoop: HDFS, MapReduce, Spark (this talk)
Overview of Apache Hadoop Architecture
• Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data analytics
• Hadoop Common utilities (RPC, etc.), HDFS, MapReduce, YARN
• http://hadoop.apache.org
• Hadoop 1.x stack: MapReduce (cluster resource management & data processing) on top of HDFS and Hadoop Common/Core (RPC, ..)
• Hadoop 2.x stack: MapReduce and other data-processing models on top of YARN (cluster resource management & job scheduling), HDFS, and Hadoop Common/Core (RPC, ..)
HDFS and MapReduce in Apache Hadoop
• HDFS: primary storage of Hadoop; highly reliable and fault-tolerant
– NameNode stores the file system namespace
– DataNodes store data blocks
• MapReduce: computing framework of Hadoop; highly scalable
– Map tasks read data from HDFS, operate on it, and write the intermediate data to local disk
– Reduce tasks get data via the shuffle, operate on it, and write output to HDFS
• Adopted by many reputed organizations
– e.g., Facebook, Yahoo!
• Developed in Java for platform independence and portability
• Uses sockets/HTTP for communication (a minimal MapReduce sketch follows below)
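To make the map/shuffle/reduce flow above concrete, here is a minimal, hedged word-count sketch using the standard Hadoop 2.x Java API. It is only an illustration of the data path described on the slide (map output to local disk, shuffle to reducers, final output to HDFS); the class and path names are generic, not taken from the talk.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job: map tasks read HDFS blocks and emit
// intermediate (word, 1) pairs to local disk; reduce tasks fetch them
// via the shuffle and write the final counts back to HDFS.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // intermediate data, shuffled later
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // final output goes to HDFS
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```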
Spark Architecture Overview
• An in-memory data-processing framework
– Iterative machine learning jobs
– Interactive data analytics
– Scala-based implementation
– Runs standalone or on YARN or Mesos
• Scalable and communication-intensive
– Wide dependencies between Resilient Distributed Datasets (RDDs)
– MapReduce-like shuffle operations to repartition RDDs
– Sockets-based communication (a hedged shuffle example follows below)
• http://spark.apache.org
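As a small, hedged illustration of the wide dependencies mentioned above, the sketch below uses Spark's Java API: groupByKey() repartitions the RDD with a MapReduce-like shuffle over the network. The data and application name are illustrative, not from the talk.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Minimal sketch of a wide dependency in Spark's Java API: groupByKey()
// repartitions the RDD, shuffling values across the cluster.
public class GroupByExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("GroupByExample");
    JavaSparkContext sc = new JavaSparkContext(conf);

    List<Tuple2<String, Integer>> data = Arrays.asList(
        new Tuple2<>("a", 1), new Tuple2<>("b", 2),
        new Tuple2<>("a", 3), new Tuple2<>("b", 4));

    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(data, 4);

    // Wide dependency: every output partition may depend on every input
    // partition, so the values are moved over the network (the shuffle).
    JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey();

    for (Tuple2<String, Iterable<Integer>> t : grouped.collect()) {
      System.out.println(t._1() + " -> " + t._2());
    }
    sc.stop();
  }
}
```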
Growth in High End Computing (HEC)
• High End Computing (HEC) is growing dramatically
– High-Performance Computing
– Big Data Computing
• Technology advancement
– Multi-core/many-core technologies and accelerators
– Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
– Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM)
– Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)
• Example systems: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10)
Large-scale InfiniBand Installations
• 223 IB clusters (44.3%) in the June 2014 Top500 list (http://www.top500.org)
• Installations in the Top 50 (25 systems):
– 519,640 cores (Stampede) at TACC (7th)
– 62,640 cores (HPC2) in Italy (11th)
– 147,456 cores (SuperMUC) in Germany (12th)
– 76,032 cores (Tsubame 2.5) at Japan/GSIC (13th)
– 194,616 cores (Cascade) at PNNL (15th)
– 110,400 cores (Pangea) at France/Total (16th)
– 96,192 cores (Pleiades) at NASA/Ames (21st)
– 73,584 cores (Spirit) at USA/Air Force (24th)
– 77,184 cores (Curie thin nodes) at France/CEA (26th)
– 65,320 cores (iDataPlex DX360M4) at Germany/Max-Planck (27th)
– 120,640 cores (Nebulae) at China/NSCS (28th)
– 72,288 cores (Yellowstone) at NCAR (29th)
– 70,560 cores (Helios) at Japan/IFERC (30th)
– 138,368 cores (Tera-100) at France/CEA (35th)
– 222,072 cores (QUARTETTO) in Japan (37th)
– 53,504 cores (PRIMERGY) in Australia (38th)
– 77,520 cores (Conte) at Purdue University (39th)
– 44,520 cores (Spruce A) at AWE in UK (40th)
– 48,896 cores (MareNostrum) at Spain/BSC (41st)
– and many more!
Interconnects and Protocols in the OpenFabrics Stack
[Figure: for each interconnect option, the application/middleware interface, protocol, adapter, and switch are shown.]
– 1/10/40 GigE: Sockets interface, kernel-space TCP/IP, Ethernet adapter, Ethernet switch
– 10/40 GigE-TOE: Sockets interface, hardware-offloaded TCP/IP, Ethernet adapter, Ethernet switch
– IPoIB: Sockets interface, kernel-space IPoIB, InfiniBand adapter, InfiniBand switch
– RSockets: Sockets interface, user-space RSockets, InfiniBand adapter, InfiniBand switch
– SDP: Sockets interface, RDMA-based SDP, InfiniBand adapter, InfiniBand switch
– iWARP: Verbs interface, user-space iWARP over TCP/IP, iWARP adapter, Ethernet switch
– RoCE: Verbs interface, user-space RDMA, RoCE adapter, Ethernet switch
– IB Native: Verbs interface, user-space RDMA, InfiniBand adapter, InfiniBand switch
Presentation Outline
• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
– Case studies with HDFS, MapReduce, and Spark
– RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
Wide Adoption of RDMA Technology
• Message Passing Interface (MPI) for HPC
• Parallel File Systems
– Lustre
– GPFS
• Delivering excellent performance:
– < 1.0 microsec latency
– 100 Gbps bandwidth
– 5-10% CPU utilization
• Delivering excellent scalability
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Support for GPGPUs and MIC
– Used by more than 2,225 organizations in 73 countries
– More than 223,000 downloads from the OSU site directly
– Empowering many TOP500 clusters
• 7th, 519,640-core (Stampede) at TACC
• 13th, 74,358-core (Tsubame 2.5) at Tokyo Institute of Technology
• 23rd, 96,192-core (Pleiades) at NASA and many others . . .
– Available with software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Partner in the U.S. NSF-TACC Stampede system
Latency and Bandwidth: MPI over IB with MVAPICH2
[Figure: small-message latency and unidirectional bandwidth vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR. Small-message latencies range from 0.99 to 1.82 microseconds across the adapters; peak unidirectional bandwidths range from roughly 1,706 to 12,810 MBytes/sec.]
• DDR, QDR: 2.4 GHz quad-core (Westmere) Intel, PCI Gen2, with IB switch
• FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch
• ConnectIB Dual-FDR: 2.6 GHz octa-core (SandyBridge) Intel, PCI Gen3, with IB switch
• ConnectIB Dual-FDR: 2.8 GHz deca-core (IvyBridge) Intel, PCI Gen3, with IB switch
Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: applications and sockets-based programming models run on Big Data middleware (HDFS, MapReduce, HBase, Spark, and Memcached); a communication and I/O library connects the middleware to commodity computing system architectures (multi- and many-core architectures and accelerators), networking technologies (InfiniBand, 1/10/40GigE, and intelligent NICs), and storage technologies (HDD and SSD). The library's concerns include other protocols beyond sockets, point-to-point communication, threaded models and synchronization, QoS, fault tolerance, I/O and file systems, virtualization, and benchmarks.]
Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?
• Current design: applications use sockets over a 1/10 GigE network
• Sockets were not designed for high performance
– Stream semantics often mismatch the needs of the upper layers
– Zero-copy is not available for non-blocking sockets
• Our approach: applications use the OSU design over the Verbs interface on 10 GigE or InfiniBand
• Two strategies
– RDMA-based MapReduce/Spark over RDMA-based HDFS
– RDMA-based MapReduce over Lustre
Presentation Outline
• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
– Case studies with HDFS, MapReduce, and Spark
– RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
The High-Performance Big Data (HiBD) Project
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• RDMA for Apache HBase and Spark
• http://hibd.cse.ohio-state.edu
• User base: 76 organizations from 13 countries
RDMA for Apache Hadoop 1.x/2.x Distributions
• High-performance design of Hadoop over RDMA-enabled interconnects
– High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
– Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based transports (Ethernet and InfiniBand with IPoIB)
• Current releases: 0.9.9 (for Hadoop 1.x) and 0.9.1 (for Hadoop 2.x)
– Based on Apache Hadoop 1.2.1/2.4.1
– Compliant with Apache Hadoop 1.2.1/2.4.1 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, and FDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• Different file systems with disks and SSDs
– http://hibd.cse.ohio-state.edu
Presentation Outline
• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
– Case studies with HDFS, MapReduce, and Spark
– RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
Acceleration Case Studies and In-Depth Performance Evaluation
• RDMA-based designs and performance evaluation
– HDFS
– MapReduce
– Spark
Design Overview of HDFS with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges the Java-based HDFS with the communication library written in native code (a hedged JNI sketch follows below)
[Figure: applications use HDFS; the write path goes through the OSU design and the Java Native Interface (JNI) to verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while other operations use the Java socket interface over 1/10 GigE or IPoIB networks.]
• Design features
– RDMA-based HDFS write
– RDMA-based HDFS replication
– Parallel replication support
– On-demand connection setup
– InfiniBand/RoCE support
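Since the design above hinges on a JNI bridge between Java-based HDFS code and a native communication library, here is a minimal, hypothetical sketch of what such a bridge can look like in Java. The class, method, library, and endpoint names are illustrative only and are not taken from the OSU implementation.

```java
import java.nio.ByteBuffer;

// Hypothetical JNI bridge: Java-side declarations for a native
// RDMA communication library (names are illustrative only).
public final class NativeRdmaBridge {

  static {
    // Loads libnativerdma.so (or .dylib/.dll) from java.library.path.
    System.loadLibrary("nativerdma");
  }

  // Establish a connection to a remote DataNode; returns an opaque handle.
  public native long connect(String host, int port);

  // Post a write of the given direct buffer over the RDMA connection.
  public native int writeBlock(long connectionHandle, ByteBuffer directBuffer,
                               int offset, int length);

  // Tear down the connection and release native resources.
  public native void close(long connectionHandle);

  public static void main(String[] args) {
    NativeRdmaBridge bridge = new NativeRdmaBridge();
    long conn = bridge.connect("datanode-01", 50075);       // illustrative endpoint
    ByteBuffer buf = ByteBuffer.allocateDirect(128 * 1024); // off-heap, visible to native code
    buf.put(new byte[128 * 1024]).flip();
    bridge.writeBlock(conn, buf, 0, buf.remaining());
    bridge.close(conn);
  }
}
```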
Communication Times in HDFS
[Figure: HDFS communication time (sec) for 2-10 GB file sizes with 10GigE, IPoIB (QDR), and OSU-IB (QDR); OSU-IB reduces communication time by about 30%.]
• Cluster with HDD DataNodes
– 30% improvement in communication time over IPoIB (QDR)
– 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes
N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, "High Performance RDMA-Based Design of HDFS over InfiniBand," Supercomputing (SC), Nov. 2012.
N. Islam, X. Lu, W. Rahman, and D. K. Panda, "SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS," HPDC '14, June 2014.
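The SOR-HDFS work cited above is built around SEDA (staged event-driven architecture). As a rough, hedged illustration of the general SEDA idea only, and not of the SOR-HDFS code itself, a stage couples an incoming event queue with a small thread pool and hands its results to the next stage's queue, so successive stages can overlap:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Generic SEDA-style stage: events arrive on a queue, a bounded pool of
// worker threads processes them, and results are forwarded to the next
// stage's queue so the stages can overlap their work.
public class SedaStage<I, O> {

  public interface Handler<I, O> {
    O handle(I event) throws Exception;
  }

  private final BlockingQueue<I> inbox = new LinkedBlockingQueue<>();
  private final ExecutorService workers;
  private final Handler<I, O> handler;
  private final BlockingQueue<O> nextStageInbox;

  public SedaStage(int threads, Handler<I, O> handler,
                   BlockingQueue<O> nextStageInbox) {
    this.workers = Executors.newFixedThreadPool(threads);
    this.handler = handler;
    this.nextStageInbox = nextStageInbox;
    for (int i = 0; i < threads; i++) {
      workers.submit(this::runLoop);
    }
  }

  public void submit(I event) throws InterruptedException {
    inbox.put(event);
  }

  private void runLoop() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        I event = inbox.take();             // wait for the next event
        O result = handler.handle(event);   // stage-specific processing
        nextStageInbox.put(result);         // hand off to the next stage
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public void shutdown() {
    workers.shutdownNow();
  }
}
```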
Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede
[Figure: aggregated throughput (MBps) and execution time (sec) for 64-256 GB file sizes with IPoIB (FDR) vs. OSU-IB (FDR); throughput increases by 64% and latency drops by 37% at 256 GB.]
• Cluster with 64 DataNodes (1K cores), single HDD per node
– 64% improvement in throughput over IPoIB (FDR) for 256 GB file size
– 37% improvement in latency over IPoIB (FDR) for 256 GB file size
Acceleration Case Studies and In-Depth Performance Evaluation
• RDMA-based designs and performance evaluation
– HDFS
– MapReduce
– Spark
Design Overview of MapReduce with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges the Java-based MapReduce framework with the communication library written in native code
[Figure: applications run on MapReduce (JobTracker and TaskTrackers running Map and Reduce tasks); the OSU design routes the shuffle through the Java Native Interface (JNI) to verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE), while the Java socket interface is used over 1/10 GigE or IPoIB networks.]
• Design features
– RDMA-based shuffle
– Prefetching and caching of map output
– Efficient shuffle algorithms
– In-memory merge
– On-demand shuffle adjustment
– Advanced overlapping of map, shuffle, and merge, and of shuffle, merge, and reduce
– On-demand connection setup (a hedged sketch follows below)
– InfiniBand/RoCE support
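On-demand connection setup, as listed above, generally means creating a connection to a peer only when the first transfer to that peer is needed and then reusing it, instead of setting up all-to-all connections up front. A minimal, hypothetical Java sketch of that pattern follows; the class and peer names are illustrative, not from the OSU design.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical on-demand connection cache: a connection to a peer is
// established lazily on first use and reused for later transfers.
public class OnDemandConnectionCache {

  /** Placeholder for a connection handle to a remote peer. */
  public static class Connection {
    final String peer;
    Connection(String peer) { this.peer = peer; }
  }

  private final ConcurrentMap<String, Connection> connections =
      new ConcurrentHashMap<>();

  /** Returns an existing connection to the peer, creating one if absent. */
  public Connection getOrConnect(String peer) {
    return connections.computeIfAbsent(peer, p -> {
      // Expensive setup (e.g., queue-pair creation and handshake) happens
      // only once per peer, and only when the peer is first contacted.
      return new Connection(p);
    });
  }

  public static void main(String[] args) {
    OnDemandConnectionCache cache = new OnDemandConnectionCache();
    Connection c1 = cache.getOrConnect("reducer-07");  // created on demand
    Connection c2 = cache.getOrConnect("reducer-07");  // reused
    System.out.println("Same connection reused: " + (c1 == c2));
  }
}
```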
Advanced Overlapping among Different Phases
• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases, compared to other approaches
– Efficient shuffle algorithms
– Dynamic and efficient switching
– On-demand shuffle adjustment
[Figure: task timelines for the default architecture, enhanced overlapping, and advanced overlapping, showing progressively more overlap among the map, shuffle, merge, and reduce phases.]
M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, "HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects," ICS, June 2014.
Performance Evaluation of Sort and TeraSort
[Figure: job execution time (sec) for Sort on the OSU cluster (60-240 GB on 16-64 nodes; IPoIB, UDA-IB, and OSU-IB over QDR) and for TeraSort on TACC Stampede (80-320 GB on 16-64 nodes; IPoIB, UDA-IB, and OSU-IB over FDR).]
• For a 240 GB Sort on 64 nodes (512 cores)
– 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For a 320 GB TeraSort on 64 nodes (1K cores)
– 38% improvement over IPoIB (FDR) with HDD used for HDFS
Evaluations using the PUMA Workload
[Figure: normalized execution time for AdjList (30 GB), SelfJoin (80 GB), SeqCount (30 GB), WordCount (30 GB), and InvertIndex (30 GB) with 10GigE, IPoIB (QDR), and OSU-IB (QDR).]
• 50% improvement in SelfJoin over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count (SeqCount) over IPoIB (QDR) for 30 GB data size
Acceleration Case Studies and In-Depth Performance Evaluation
• RDMA-based designs and performance evaluation
– HDFS
– MapReduce
– Spark
Design Overview of Spark with RDMA
• Enables high-performance RDMA communication while supporting the traditional socket interface
• A JNI layer bridges the Scala-based Spark framework with the communication library written in native code
• Design features
– RDMA-based shuffle
– SEDA-based plugins
– Dynamic connection management and sharing
– Non-blocking and out-of-order data transfer
– Off-JVM-heap buffer management (a hedged sketch follows below)
– InfiniBand/RoCE support
X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, "Accelerating Spark with RDMA for Big Data Processing: Early Experiences," Int'l Symposium on High Performance Interconnects (HotI '14), August 2014.
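Off-JVM-heap buffer management, listed above, typically relies on direct ByteBuffers so that a native transport can access transfer buffers without extra copies and without pressuring the garbage collector. The following is a minimal, hypothetical illustration of such a buffer pool, not the OSU implementation:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical off-heap buffer pool: direct ByteBuffers live outside the
// JVM heap, so a native (e.g., RDMA) transport can read and write them
// without copies and without creating garbage-collection pressure.
public class OffHeapBufferPool {

  private final BlockingQueue<ByteBuffer> freeList;

  public OffHeapBufferPool(int buffers, int bufferSizeBytes) {
    this.freeList = new ArrayBlockingQueue<>(buffers);
    for (int i = 0; i < buffers; i++) {
      // allocateDirect places the buffer outside the Java heap.
      freeList.add(ByteBuffer.allocateDirect(bufferSizeBytes));
    }
  }

  /** Borrow a buffer, blocking until one becomes available. */
  public ByteBuffer acquire() throws InterruptedException {
    ByteBuffer buf = freeList.take();
    buf.clear();
    return buf;
  }

  /** Return a buffer to the pool after the transfer has completed. */
  public void release(ByteBuffer buf) {
    freeList.offer(buf);
  }

  public static void main(String[] args) throws InterruptedException {
    OffHeapBufferPool pool = new OffHeapBufferPool(8, 1 << 20); // 8 x 1 MB
    ByteBuffer buf = pool.acquire();
    buf.putLong(42L);          // stage some shuffle data off-heap
    pool.release(buf);
    System.out.println("Buffer is direct: " + buf.isDirect());
  }
}
```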
Preliminary Results of the Spark-RDMA Design: GroupBy
[Figure: GroupBy time (sec) vs. data size for 10GigE, IPoIB, and RDMA; left, 4 HDD nodes with 32 cores (4-10 GB); right, 8 HDD nodes with 64 cores (8-20 GB).]
• Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
– 18% improvement over IPoIB (QDR) for 10 GB data size
• Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
– 20% improvement over IPoIB (QDR) for 20 GB data size
Presentation Outline
• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
– Case studies with HDFS, MapReduce, and Spark
– RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
Optimized Apache Hadoop over Parallel File Systems
[Figure: compute nodes run the TaskTracker with Map and Reduce tasks and a Lustre client; a separate Lustre setup provides the metadata servers and object storage servers.]
• HPC cluster deployment
– Hybrid topological solution of a Beowulf architecture with separate I/O nodes
– Lean compute nodes with a light OS, more memory, and small local storage
– Sub-cluster of dedicated I/O nodes with parallel file systems such as Lustre
• MapReduce over Lustre (two configurations, illustrated by the sketch below)
– Local disk is used as the intermediate data directory
– Lustre is used as the intermediate data directory
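The two configurations above differ only in where the intermediate (shuffle) data is placed. In stock Hadoop this is controlled by the mapreduce.cluster.local.dir property; the following hedged sketch switches it between a node-local disk and a Lustre mount. The paths are illustrative, YARN deployments may also rely on yarn.nodemanager.local-dirs, and the OSU RDMA packages may expose their own settings.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hedged sketch: point MapReduce intermediate (shuffle) data either at a
// node-local disk or at a Lustre mount. Property names and paths are
// illustrative and can vary by Hadoop version and deployment.
public class IntermediateDirConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    boolean useLustre = args.length > 0 && args[0].equals("lustre");
    if (useLustre) {
      // Configuration 2: Lustre as the intermediate data directory.
      conf.set("mapreduce.cluster.local.dir", "/lustre/scratch/mapred/local");
    } else {
      // Configuration 1: node-local disk as the intermediate data directory.
      conf.set("mapreduce.cluster.local.dir", "/local/ssd/mapred/local");
    }

    Job job = Job.getInstance(conf, "sort-over-lustre-sketch");
    System.out.println("Intermediate dir: "
        + job.getConfiguration().get("mapreduce.cluster.local.dir"));
  }
}
```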
Case Study: Performance Improvement of RDMA-MapReduce over Lustre on TACC-Stampede (local disk as the intermediate data directory)
[Figure: job execution time (sec) for Sort with IPoIB (FDR) vs. OSU-IB (FDR); left, 300-500 GB on 64 nodes; right, 20-640 GB as the cluster scales from 4 to 128 nodes.]
• For a 500 GB Sort on 64 nodes
– 44% improvement over IPoIB (FDR)
• For a 640 GB Sort on 128 nodes
– 48% improvement over IPoIB (FDR)
M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, "MapReduce over Lustre: Can RDMA-based Approach Benefit?," Euro-Par, August 2014.
Case Study: Performance Improvement of RDMA-MapReduce over Lustre on TACC-Stampede (Lustre as the intermediate data directory)
[Figure: job execution time (sec) for Sort with IPoIB (FDR) vs. OSU-IB (FDR); left, 80-160 GB on 16 nodes; right, 80-320 GB as the cluster scales from 8 to 32 nodes.]
• For a 160 GB Sort on 16 nodes
– 35% improvement over IPoIB (FDR)
• For a 320 GB Sort on 32 nodes
– 33% improvement over IPoIB (FDR)
• Can more optimizations be achieved by leveraging more features of Lustre?
Presentation Outline
• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
– Case studies with HDFS, MapReduce, and Spark
– RDMA-based MapReduce on HPC Clusters with Lustre
• Ongoing and Future Activities
• Conclusion and Q&A
Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges
[Figure: the earlier architecture diagram revisited, with the communication and I/O library between the Big Data middleware (HDFS, MapReduce, HBase, Spark, and Memcached), its applications, application-level benchmarks, and micro-benchmarks, and the underlying networking (InfiniBand, 1/10/40GigE, intelligent NICs), storage (HDD and SSD), and multi-/many-core platforms with accelerators. RDMA protocols now appear alongside point-to-point communication, threaded models and synchronization, QoS, fault tolerance, I/O and file systems, virtualization, and benchmarks, and the question of upper-level changes is raised.]
Preliminary Results of the MEM-HDFS Design
[Figure: RandomWriter execution time (sec) at cluster:data sizes of 8:20 GB, 16:40 GB, and 32:80 GB, and TestDFSIO aggregated throughput (MBps) at 8:40 GB, 16:80 GB, and 32:160 GB, comparing Default-HDFS with MEM-HDFS; throughput increases by 2.6x and execution time drops by 38%.]
• TestDFSIO on TACC Stampede
– 2.6x improvement over Default-HDFS for 160 GB file size on 32 nodes
• RandomWriter on TACC Stampede
– 38% improvement over Default-HDFS for 80 GB file size on 32 nodes
N. Islam, X. Lu, W. Rahman, R. Rajachandrasekar, and D. K. Panda, "MEM-HDFS: In-Memory I/O and Replication for HDFS with Memcached: Early Experiences," IEEE Big Data '14, October 2014.
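MEM-HDFS keeps and replicates HDFS data in Memcached servers. As a hedged illustration of the basic key-value interface such a design builds on, the sketch below uses the spymemcached Java client; the server addresses, port, and key names are illustrative, and this is not the MEM-HDFS code.

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

// Hedged illustration of the Memcached key-value interface that an
// in-memory HDFS design can build on: blocks are cached and replicated
// as key-value pairs instead of being written to local disks.
public class BlockCacheExample {
  public static void main(String[] args) throws Exception {
    // Connect to two (illustrative) memcached servers for replication.
    MemcachedClient primary =
        new MemcachedClient(new InetSocketAddress("mc-node-01", 11211));
    MemcachedClient replica =
        new MemcachedClient(new InetSocketAddress("mc-node-02", 11211));

    String blockKey = "blk_1073741825";            // illustrative block id
    byte[] blockData = new byte[64 * 1024];        // stand-in for block contents

    // Write the block to both servers (no expiry) and wait for completion.
    primary.set(blockKey, 0, blockData).get();
    replica.set(blockKey, 0, blockData).get();

    // Read it back from the primary.
    byte[] cached = (byte[]) primary.get(blockKey);
    System.out.println("Cached block size: " + cached.length);

    primary.shutdown();
    replica.shutdown();
  }
}
```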
Future Plans of the OSU High-Performance Big Data (HiBD) Project
• In-memory I/O and replication for HDFS with Memcached
• Upcoming releases of RDMA-enhanced packages will support
– Hadoop 2.x MapReduce & RPC
– Spark
– HBase
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
– HDFS
– MapReduce
– RPC
• Advanced designs with upper-level changes and optimizations, e.g., MEM-HDFS
Concluding Remarks
• Discussed challenges in accelerating Hadoop and Spark
• Presented initial designs to take advantage of InfiniBand/RDMA for Hadoop and Spark
• Results are promising
• Many other open issues still need to be solved
• These designs will enable the Big Data management and processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
Thank You!
The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
The MVAPICH2/MVAPICH2-X Project http://mvapich.cse.ohio-state.edu/