
Designing Convergent HPC and BigData Software Stacks: An Overview of the HiBD Project
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
A Talk at the HPCAC Stanford Conference (Feb '19)

• Big Data has changed the way people understand and harness the power of data, in both business and research domains
• Big Data has become one of the most important elements in business analytics
• Big Data and High Performance Computing (HPC) are converging to meet large-scale data processing challenges
• Running High Performance Data Analysis (HPDA) workloads in the cloud is gaining popularity
• According to the latest OpenStack survey, 27% of cloud deployments are running HPDA workloads
Introduction to Big Data Analytics and Trends
http://www.coolinfographics.com/blog/tag/data?currentPage=3
http://www.climatecentral.org/news/white-house-brings-together-big-data-and-climate-change-17194

Big Velocity – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 7.5% from 2016 and now represents 3.7 billion people.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g. MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Data Management and Processing on Modern Clusters

[Figure: DLoBD pipeline – (1) Prepare datasets @Scale, (2) Deep Learning @Scale, (3) Non-deep-learning analytics @Scale, (4) Apply ML model @Scale]
• Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
  – Easily build a powerful data analytics pipeline
    • E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
  – Better data locality
  – Efficient resource sharing and cost effectiveness
Deep Learning over Big Data (DLoBD)

• Scientific Data Management, Analysis, and Visualization
• Application examples
– Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data Intensive Tasks
  – Run large-scale simulations on supercomputers
  – Dump data on parallel storage systems
  – Collect experimental / observational data
  – Move experimental / observational data to analysis sites
  – Visual analytics – help understand data visually
Not Only in Internet Services - Big Data in Scientific Domains

Interconnects and Protocols in the OpenFabrics Stack
[Figure: options beneath the Application / Middleware interface; each column shows interface, protocol, adapter, and switch]
• 1/10/40/100 GigE – Sockets over kernel-space TCP/IP (Ethernet driver), Ethernet adapter, Ethernet switch
• IPoIB – Sockets over kernel-space IPoIB, InfiniBand adapter, InfiniBand switch
• 10/40/100 GigE-TOE – Sockets over hardware-offloaded TCP/IP, Ethernet adapter, Ethernet switch
• RSockets – Sockets over user-space RSockets (RDMA), InfiniBand adapter, InfiniBand switch
• SDP – Sockets over SDP (RDMA), InfiniBand adapter, InfiniBand switch
• iWARP – Sockets over user-space TCP/IP (RDMA), iWARP adapter, Ethernet switch
• RoCE – Verbs over user-space RDMA, RoCE adapter, Ethernet switch
• IB Native – Verbs over user-space RDMA, InfiniBand adapter, InfiniBand switch

Emerging NVMe-over-Fabrics
• Remote access to flash with NVMe over the network
• RDMA fabric is of most importance
  – Low latency makes remote access feasible
  – 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
• Low latency overhead compared to local I/O
[Figure: NVMf architecture – NVMe I/O submission/completion queues mapped onto RDMA send/receive queues across the RDMA fabric]

Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
Convergence of HPC, Big Data, and Deep Learning!

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications?
Bring HPC, Big Data processing, and Deep Learning into a “convergent trajectory”!
• What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g., Hadoop, Spark)?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data and Deep Learning applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters?

Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?


Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: software stack, top to bottom]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark, gRPC/TensorFlow, and Memcached) – upper-level changes?
• Programming Models (Sockets) – RDMA?
• Communication and I/O Library: Point-to-Point Communication, Threaded Models and Synchronization, QoS & Fault Tolerance, Performance Tuning, I/O and File Systems, Virtualization (SR-IOV), Benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE, and Intelligent NICs); Storage Technologies (HDD, SSD, NVM, and NVMe-SSD); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)

• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 300 organizations from 35 countries
• More than 29,100 downloads from the project site
The High-Performance Big Data (HiBD) Project
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker

HiBD Release Timeline and Downloads
[Chart: cumulative number of downloads (0–30,000) over the project timeline, annotated with release milestones including RDMA-Hadoop 1.x (0.9.0, 0.9.8, 0.9.9), RDMA-Hadoop 2.x (0.9.1, 0.9.6, 0.9.7, 1.0.0, 1.3.0), RDMA-Spark (0.9.1, 0.9.4, 0.9.5), RDMA-HBase (0.9.1), and RDMA-Memcached (0.9.1 & OHB 0.7.1, 0.9.4, 0.9.6 & OHB 0.9.3)]

• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and
RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Support for OpenPOWER, Singularity, and Docker
• Current release: 1.3.5
– Based on Apache Hadoop 2.8.0
– Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• Different file systems with disks and SSDs and Lustre
RDMA for Apache Hadoop 2.x Distribution
http://hibd.cse.ohio-state.edu
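Because the package is compliant with the standard Apache Hadoop 2.8.0 APIs, existing HDFS clients run unchanged on top of the RDMA-enhanced design. The following is a minimal sketch using only the stock Hadoop FileSystem API (the output path is a hypothetical example); any RDMA-specific settings live in the cluster's core-site.xml/hdfs-site.xml rather than in application code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Standard client-side configuration; nothing HiBD-specific is needed here.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hibd-demo.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from an unmodified HDFS client");
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```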

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x

• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
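Since RDMA for Apache Spark tracks the stock Spark 2.x API, a shuffle-heavy job needs no source changes to use the RDMA-based shuffle engine. The sketch below (standard Spark Java API; class and application names are illustrative) shows the kind of workload where the accelerated path matters: groupByKey forces an all-to-all shuffle.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ShuffleDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Build a synthetic key-value dataset with many keys repeated across partitions.
        List<Tuple2<Integer, Integer>> data = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            data.add(new Tuple2<>(i % 1024, i));
        }
        JavaPairRDD<Integer, Integer> pairs = sc.parallelizePairs(data, 64);

        // groupByKey triggers an all-to-all shuffle; with RDMA for Apache Spark the
        // block transfers behind this stage go over the RDMA-based shuffle design.
        long groups = pairs.groupByKey().count();
        System.out.println("groups = " + groups);

        sc.stop();
    }
}
```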

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure


• High-Performance Design of Apache Kafka over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for Kafka
– Compliant with Apache Kafka 1.0.0 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache Kafka 1.0.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache Kafka Distribution
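Because the distribution is compliant with the Apache Kafka 1.0.0 APIs, a standard Kafka producer can drive an RDMA-Kafka broker without modification. A minimal sketch with the stock Kafka Java client follows; the broker address and topic name are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point this at an RDMA-Kafka (or stock Kafka) broker; address is hypothetical.
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Fire a stream of small records, the pattern used by producer benchmarks.
            for (int i = 0; i < 100_000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "payload-" + i));
            }
            producer.flush();
        }
    }
}
```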

• High-Performance Design of HBase over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for HBase
– Compliant with Apache HBase 1.1.2 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache HBase 1.1.2
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache HBase Distribution
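Compliance with the Apache HBase 1.1.2 APIs means the usual client code path is unchanged. Below is a minimal put/get sketch with the stock HBase 1.x Java client; the table name ("usertable") and column family ("cf") are assumptions and must already exist on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("usertable"))) {
            // Write one cell ...
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field0"), Bytes.toBytes("value-1"));
            table.put(put);

            // ... and read it back.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"))));
        }
    }
}
```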

• High-Performance Design of Memcached over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and
libMemcached components
– High performance design of SSD-Assisted Hybrid Memory
– Non-Blocking Libmemcached Set/Get API extensions
– Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with
IPoIB)
• Current release: 0.9.6
– Based on Memcached 1.5.3 and libMemcached 1.0.18
– Compliant with libMemcached APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• SSD
– http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
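The RDMA and hybrid-memory features are exposed through the enhanced libMemcached C library, but the server also keeps the traditional sockets-based support, so for illustration here is a minimal set/get sketch with the third-party spymemcached Java client (an assumption: spymemcached is not part of HiBD, and the host/port are placeholders). Over sockets this exercises only the protocol-compatible path, not the native-IB/RoCE transport.

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class MemcachedDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a memcached-protocol endpoint (hypothetical host and default port).
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("memcached-server", 11211));

        // Store a value with a 1-hour expiration, then read it back.
        client.set("greeting", 3600, "hello from a stock memcached client").get();
        Object value = client.get("greeting");
        System.out.println(value);

        client.shutdown();
    }
}
```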

• Micro-benchmarks for Hadoop Distributed File System (HDFS)
– Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency
(RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT)
Benchmark
– Support benchmarking of
• Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH)
HDFS
• Micro-benchmarks for Memcached
– Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency Benchmark, Hybrid
Memory Latency Benchmark
– Yahoo! Cloud Serving Benchmark (YCSB) Extension for RDMA-Memcached
• Micro-benchmarks for HBase
– Get Latency Benchmark, Put Latency Benchmark
• Micro-benchmarks for Spark
– GroupBy, SortBy
• Current release: 0.9.3
• http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS, Memcached, HBase, and Spark
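As a rough illustration of what the HDFS micro-benchmarks measure, the sketch below times a sequential write in the spirit of the SWL benchmark. This is not the OHB code itself; the file size, buffer size, and output path are arbitrary choices for the example, and it uses only the stock Hadoop FileSystem API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeqWriteLatency {
    public static void main(String[] args) throws Exception {
        long fileSizeBytes = 256L * 1024 * 1024;        // 256 MB test file (arbitrary)
        byte[] buffer = new byte[4 * 1024 * 1024];      // 4 MB write buffer (arbitrary)

        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/benchmarks/swl-test.dat");  // hypothetical output path

        long start = System.nanoTime();
        try (FSDataOutputStream out = fs.create(path, true)) {
            for (long written = 0; written < fileSizeBytes; written += buffer.length) {
                out.write(buffer);
            }
            out.hsync();                                // flush to the pipeline before stopping the clock
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Sequential write of " + fileSizeBytes + " bytes took " + elapsedMs + " ms");

        fs.delete(path, false);
        fs.close();
    }
}
```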

• High-Performance Designs with RDMA, In-memory, SSD, Parallel Filesystems
– HDFS
– MapReduce
– Spark
– Hadoop RPC and HBase
– Memcached
– Kafka
Basic Acceleration Case Studies and In-Depth Performance Evaluation

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen on OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps
• RandomWriter: execution time reduced by 3x over IPoIB for 80–160 GB file sizes
• TeraGen: execution time reduced by 4x over IPoIB for 80–240 GB file sizes
[Charts: execution time (s) vs. data size (GB) for RandomWriter (80/120/160 GB) and TeraGen (80/160/240 GB), IPoIB (EDR) vs. OSU-IB (EDR)]

• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank
32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time
[Charts: HiBench PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, on 32 and 64 worker nodes; RDMA reduces total time by 37% and 43%, respectively]

– RDMA-Accelerated Communication for Memcached Get/Set
– Hybrid ‘RAM+SSD’ slab management for higher data retention
– Non-blocking API extensions: memcached_(iset/iget/bset/bget/test/wait)
• Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
• Ability to exploit communication/computation overlap
• Optional buffer re-use guarantees
– Adaptive slab manager with different I/O schemes for higher throughput.
Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016
[Figure: client–server architecture – libMemcached library and RDMA-enhanced communication library on the client; RDMA-enhanced communication library and hybrid slab manager (RAM+SSD) on the server; blocking and non-blocking API flows for requests and replies]
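The non-blocking iset/iget/bset/bget extensions above belong to the RDMA-enhanced libMemcached C API, so they are not shown here directly. As a rough conceptual analogue of the communication/computation overlap they enable, the sketch below uses the third-party spymemcached Java client's asynchronous get (an assumption, not part of HiBD; the host, port, and key are placeholders): the request is issued, useful work proceeds, and the client blocks only when the value is actually needed.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import net.spy.memcached.MemcachedClient;
import net.spy.memcached.internal.GetFuture;

public class OverlapSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("memcached-server", 11211));

        // Issue the request without waiting for the reply ...
        GetFuture<Object> pending = client.asyncGet("hot-key");

        // ... overlap it with useful computation on the client ...
        long acc = 0;
        for (int i = 0; i < 10_000_000; i++) {
            acc += i;
        }

        // ... and only block when the value is actually needed.
        Object value = pending.get(500, TimeUnit.MILLISECONDS);
        System.out.println("computed " + acc + ", fetched " + value);

        client.shutdown();
    }
}
```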

– Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
  • >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
  • >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
– Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
Performance Evaluation with Non-Blocking Memcached API
[Chart: average latency (us) of Set/Get for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b; latency broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory)]
Legend: H = Hybrid Memcached over SATA SSD; Opt = Adaptive slab manager; Block = Default blocking API; NonB-i = Non-blocking iset/iget API; NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee

RDMA-Kafka: High-Performance Message Broker for Streaming Workloads
Kafka Producer Benchmark (1 Broker, 4 Producers)
• Experiments run on the OSU-RI2 cluster: 2.4 GHz, 28 cores, InfiniBand EDR, 512 GB RAM, 400 GB SSD
  – Up to 98% improvement in latency compared to IPoIB
  – Up to 7x increase in throughput over IPoIB
[Charts: producer throughput (MB/s) and latency (ms) vs. record size (100–100,000), RDMA vs. IPoIB]
H. Javed, X. Lu, and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka, BDCAT ’17

RDMA-Kafka: Yahoo! Streaming Benchmark
• Measures the RTT of a message: sent by a producer, processed, and its ack received
• 8 producers emitting a total of 120,000 records/sec
  – Processed by 4 consumers
• Commit time = 1000 ms => stepwise curve
• Frieda outperforms Kafka, with a reduction of about 31% in p99 latency
M. H. Javed, X. Lu, and D. K. Panda, Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing, IEEE Cluster 2018.

Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: software stack, top to bottom]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark, gRPC/TensorFlow, and Memcached)
• Programming Models (Sockets) – RDMA Protocols
• Communication and I/O Library: Point-to-Point Communication, Threaded Models and Synchronization, QoS & Fault Tolerance, Performance Tuning, I/O and File Systems, Virtualization (SR-IOV), Benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE, and Intelligent NICs); Storage Technologies (HDD, SSD, NVM, and NVMe-SSD); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

Design Overview of NVM and RDMA-aware HDFS (NVFS)
• Design Features
  – RDMA over NVM
  – HDFS I/O with NVM: Block Access and Memory Access
  – Hybrid design: NVM with SSD as hybrid storage for HDFS I/O
  – Co-design with Spark and HBase (cost-effectiveness, use-case)
[Figure: NVM- and RDMA-aware HDFS (NVFS) DataNode – RDMA sender/receiver and RDMA replicator serving the DFSClient, with NVFS-BlkIO and NVFS-MemIO writers/readers over NVM and SSDs; co-design hooks for Hadoop MapReduce, Spark, and HBase applications and benchmarks]
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, International Conference on Supercomputing (ICS), June 2016

Evaluation with Hadoop MapReduce
• TestDFSIO on SDSC Comet (32 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 1.2x over HDFS
• TestDFSIO on OSU Nowlab (4 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 2x over HDFS
[Charts: TestDFSIO average throughput (MBps) for Write and Read on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes); HDFS (56Gbps) vs. NVFS-BlkIO (56Gbps) vs. NVFS-MemIO (56Gbps)]

QoS-aware SPDK Design for NVMe-SSD
• Synthetic application scenarios with different QoS requirements
  – Comparison using SPDK with Weighted Round Robin (WRR) NVMe arbitration
• Near-desired job bandwidth ratios
• Stable and consistent bandwidth
[Charts: (left) bandwidth (MB/s) over time for high- and medium-priority jobs under SPDK WRR vs. the OSU design (Scenario 1); (right) job bandwidth ratio for Scenarios 2–5 under SPDK-WRR, the OSU design, and the desired ratio]
S. Gugnani, X. Lu, and D. K. Panda, Analyzing, Modeling, and Provisioning QoS for NVMe SSDs, UCC 2018.

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

• Replication is a popular FT technique
– M additional puts for replication factor M+1
– Storage overhead of M+1
• Alternative storage efficient FT technique: Online Erasure Coding
– E.g., MDS codes such as Reed-Solomon Codes RS(K,M)
– K data and M parity chunks distributed to K+M (=N) nodes
– Recover data from any K fragments
– Storage overhead of N/K (K<N)
Online Erasure Coding for Resilient In-Memory KV
[Figure: data layout under replication vs. erasure coding (Reed-Solomon code)]
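To make the overhead comparison concrete, here is a small worked sketch (class and method names are illustrative): 3-way replication (M = 2 extra copies) and RS(3,2) can both survive the loss of any two fragments, but their storage costs differ.

```java
public class StorageOverhead {
    // Replication factor M+1 stores M extra full copies of every byte.
    static double replicationOverhead(int m) {
        return m + 1.0;
    }

    // RS(K,M) stores K data and M parity chunks on N = K + M nodes, i.e. N/K bytes per byte.
    static double erasureOverhead(int k, int m) {
        return (k + m) / (double) k;
    }

    public static void main(String[] args) {
        // Both configurations tolerate the loss of any 2 nodes.
        System.out.printf("Replication (M=2): %.2fx%n", replicationOverhead(2));  // 3.00x
        System.out.printf("RS(K=3, M=2):      %.2fx%n", erasureOverhead(3, 2));   // 1.67x
    }
}
```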

Performance Evaluations
• 150 clients on 10 nodes of the SDSC Comet cluster over a 5-node RDMA-Memcached cluster
• Performance improvements observed for key-value pair sizes >= 16 KB
  – Era-CE-CD gains ~2.3x over Async-RDMA-Rep for YCSB 50:50 and ~2.6x for YCSB 95:5
  – Era-SE-CD performs similar to Async-Rep for YCSB 50:50 and gains ~2.5x for YCSB 95:5
[Chart: aggregate throughput (MBps) for 10/20/40 GB Hadoop I/O workloads – Lustre-Direct, Boldio_Async-Rep=3, Boldio_Era(3,2)-CE-CD, and Boldio_Era(3,2)-SE-CD]
• 8 node Hadoop cluster on Intel Westmere QDR Cluster over 5-node RDMA-Memcached Cluster
• Key-Value Store-based Burst-Buffer System over Lustre (Boldio) for Hadoop I/O workloads.
• Performs on par with Async Replication scheme while enabling 50% gain in memory efficiency
D. Shankar, X. Lu, D. K. Panda, High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads, ICDCS, 2017
[Chart: average latency (us) for 1K/4K/16K value sizes under YCSB-A and YCSB-B – Memc-IPoIB-NoRep, Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD]

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

Using HiBD Packages for Deep Learning on Existing HPC Infrastructure

Architecture Overview of TensorFlowOnSpark
• Developed by Yahoo!
• Infrastructure to run TensorFlow in a distributed
manner
• Utilizes Spark for distributing computation
• Can use RDMA instead of TCP based gRPC over
high-performance interconnects
• Bypasses Spark to ingest data directly from HDFS
• Minimal changes required to run default
TensorFlow programs over TensorFlowOnSpark
• Linear scalability with more workers
Courtesy: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Challenges of Deep Learning over Big Data (DLoBD)
  ▪ Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  ▪ What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD Stacks
  ▪ CaffeOnSpark, TensorFlowOnSpark, and BigDL
  ▪ IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
  ▪ Performance, accuracy, scalability, and resource utilization
  ▪ RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
[Chart: epoch time (secs) and accuracy (%) vs. epoch number (1–18), IPoIB vs. RDMA; 2.6x speedup for RDMA at similar accuracy]

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

• Challenges
– Existing designs in Hadoop are not virtualization-aware
– No support for automatic topology detection
• Design
– Automatic Topology Detection using
MapReduce-based utility
• Requires no user input
• Can detect topology changes during
runtime without affecting running jobs
– Virtualization and topology-aware
communication through map task scheduling and
YARN container allocation policy extensions
Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand
S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating
Hadoop on SR-IOV-enabled Clouds, CloudCom’16, December 2016
[Charts: execution time of Hadoop benchmarks (Sort, WordCount, PageRank at 40 GB and 60 GB) and Hadoop applications (CloudBurst, Self-join in default and distributed modes), RDMA-Hadoop vs. Hadoop-Virt; execution time reduced by up to 55% and 34%]

• Upcoming Releases of RDMA-enhanced Packages will support
– Upgrades to the latest versions of Hadoop (3.0) and Spark
• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support
– MapReduce, Hadoop RPC, and gRPC
• Advanced designs with upper-level changes and optimizations
– Boldio (Burst Buffer over Lustre for Big Data I/O Acceleration)
– Efficient Indexing
On-going and Future Plans of OSU High Performance Big Data (HiBD) Project

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Commercial Support for HiBD, MVAPICH2 and HiDL Libraries

• Discussed challenges in accelerating Big Data middleware with HPC
technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA
for HDFS, Spark, Memcached, and Kafka
• RDMA can benefit DL workloads, as shown by our AR-gRPC and RDMA-TensorFlow designs
• Many other open issues need to be solved
• Will enable the Big Data and AI community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
• Looking forward to collaboration with the community
Concluding Remarks

Funding Acknowledgments
Funding Support by
Equipment Support by

Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Asst. Professor
– X. Lu
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– M. Haupt (B.S.)
– N. Sarkauskas (B.S.)
– A. Yeretzian (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni

The 5th International Workshop on High-Performance Big Data and Cloud Computing (HPBDC)
In conjunction with IPDPS’19, Rio de Janeiro, Brazil, Monday, May 20th, 2019
HPBDC 2018 was held in conjunction with IPDPS’18
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
Important Dates:
  Abstract submission: January 15, 2019
  Paper submission: February 15, 2019
  Acceptance notification: March 1, 2019
  Camera-ready deadline: March 15, 2019

http://www.cse.ohio-state.edu/~panda
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/