
Designing Convergent HPC and BigData Software Stacks: An Overview of the HiBD Project
Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
A Talk at the HPCAC Stanford Conference (Feb '19)

• Big Data has changed the way people understand and harness the power of data, in both business and research domains
• Big Data has become one of the most important elements in business analytics
• Big Data and High Performance Computing (HPC) are converging to meet large-scale data processing challenges
• Running High Performance Data Analysis (HPDA) workloads in the cloud is gaining popularity
• According to the latest OpenStack survey, 27% of cloud deployments are running HPDA workloads
Introduction to Big Data Analytics and Trends
http://www.coolinfographics.com/blog/tag/data?currentPage=3
http://www.climatecentral.org/news/white-house-brings-together-big-data-and-climate-change-17194

Big Velocity – How Much Data Is Generated Every Minute on the Internet?
The global Internet population grew 7.5% from 2016 and now represents 3.7 billion people.
Courtesy: https://www.domo.com/blog/data-never-sleeps-5/

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
– Front-end data accessing and serving (Online)
• Memcached + DB (e.g. MySQL), HBase
– Back-end data analytics (Offline)
• HDFS, MapReduce, Spark
Data Management and Processing on Modern Clusters

[Figure: DLoBD pipeline – (1) Prepare datasets @Scale, (2) Deep Learning @Scale, (3) Non-deep-learning analytics @Scale, (4) Apply ML model @Scale]
• Deep Learning over Big Data (DLoBD) is one of the most efficient analytics paradigms
• More and more deep learning tools and libraries (e.g., Caffe, TensorFlow) are starting to run over big data stacks, such as Apache Hadoop and Spark
• Benefits of the DLoBD approach
  – Easily build a powerful data analytics pipeline
    • E.g., Flickr DL/ML Pipeline, “How Deep Learning Powers Flickr”, http://bit.ly/1KIDfof
  – Better data locality
  – Efficient resource sharing and cost effectiveness
Deep Learning over Big Data (DLoBD)

• Scientific Data Management, Analysis, and Visualization
• Application examples
– Climate modeling
– Combustion
– Fusion
– Astrophysics
– Bioinformatics
• Data Intensive Tasks
  – Run large-scale simulations on supercomputers
  – Dump data on parallel storage systems
  – Collect experimental / observational data
  – Move experimental / observational data to analysis sites
  – Visual analytics – help understand data visually
Not Only in Internet Services - Big Data in Scientific Domains

Interconnects and Protocols in the OpenFabrics Stack
[Figure: options beneath the Application / Middleware interface; each column shows interface, protocol, adapter, and switch]
• 1/10/40/100 GigE – Sockets over kernel-space TCP/IP (Ethernet driver), Ethernet adapter, Ethernet switch
• IPoIB – Sockets over kernel-space IPoIB, InfiniBand adapter, InfiniBand switch
• 10/40/100 GigE-TOE – Sockets over hardware-offloaded TCP/IP, Ethernet adapter, Ethernet switch
• RSockets – Sockets over user-space RSockets (RDMA), InfiniBand adapter, InfiniBand switch
• SDP – Sockets over SDP (RDMA), InfiniBand adapter, InfiniBand switch
• iWARP – Sockets over user-space TCP/IP (RDMA), iWARP adapter, Ethernet switch
• RoCE – Verbs over user-space RDMA, RoCE adapter, Ethernet switch
• IB Native – Verbs over user-space RDMA, InfiniBand adapter, InfiniBand switch

Emerging NVMe-over-Fabrics
• Remote access to flash with NVMe over the network
• RDMA fabric is of most importance
  – Low latency makes remote access feasible
  – 1-to-1 mapping of NVMe I/O queues to RDMA send/recv queues
• Low latency overhead compared to local I/O
[Figure: NVMf architecture – NVMe I/O submission/completion queues mapped onto RDMA send/receive queues across the RDMA fabric]

Increasing Usage of HPC, Big Data and Deep Learning
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)
• HPC (MPI, RDMA, Lustre, etc.)
Convergence of HPC, Big Data, and Deep Learning!

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data and Deep Learning Applications?
Bring HPC, Big Data processing, and Deep Learning into a “convergent trajectory”!
• What are the major bottlenecks in current Big Data processing and Deep Learning middleware (e.g., Hadoop, Spark)?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing and Deep Learning?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data and Deep Learning applications?
• How much performance benefit can be achieved through enhanced designs?
• How to design benchmarks for evaluating the performance of Big Data and Deep Learning middleware on HPC clusters?

Can We Run Big Data and Deep Learning Jobs on Existing HPC Infrastructure?


Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: software stack, top to bottom]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark, gRPC/TensorFlow, and Memcached) – upper-level changes?
• Programming Models (Sockets) – RDMA?
• Communication and I/O Library: Point-to-Point Communication, Threaded Models and Synchronization, QoS & Fault Tolerance, Performance Tuning, I/O and File Systems, Virtualization (SR-IOV), Benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE, and Intelligent NICs); Storage Technologies (HDD, SSD, NVM, and NVMe-SSD); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)

• RDMA for Apache Spark
• RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
– Plugins for Apache, Hortonworks (HDP) and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Kafka
• RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• OSU HiBD-Benchmarks (OHB)
– HDFS, Memcached, HBase, and Spark Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 300 organizations from 35 countries
• More than 29,100 downloads from the project site
The High-Performance Big Data (HiBD) Project
• Available for InfiniBand and RoCE; also runs on Ethernet
• Available for x86 and OpenPOWER
• Support for Singularity and Docker

HiBD Release Timeline and Downloads
[Chart: cumulative number of downloads (0–30,000) over the project timeline, annotated with release milestones including RDMA-Hadoop 1.x (0.9.0, 0.9.8, 0.9.9), RDMA-Hadoop 2.x (0.9.1, 0.9.6, 0.9.7, 1.0.0, 1.3.0), RDMA-Spark (0.9.1, 0.9.4, 0.9.5), RDMA-HBase (0.9.1), and RDMA-Memcached (0.9.1 & OHB 0.7.1, 0.9.4, 0.9.6 & OHB 0.9.3)]

• High-Performance Design of Hadoop over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and
RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High performance design of MapReduce over Lustre
– Memcached-based burst buffer for MapReduce over Lustre-integrated HDFS (HHH-L-BB mode)
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH and HDP
– Support for OpenPOWER, Singularity, and Docker
• Current release: 1.3.5
– Based on Apache Hadoop 2.8.0
– Compliant with Apache Hadoop 2.8.0, HDP 2.5.0.3 and CDH 5.8.2 APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• Different file systems with disks and SSDs and Lustre
RDMA for Apache Hadoop 2.x Distribution
http://hibd.cse.ohio-state.edu
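Because the package is compliant with the standard Apache Hadoop 2.8.0 APIs, existing HDFS clients run unchanged on top of the RDMA-enhanced design. The following is a minimal sketch using only the stock Hadoop FileSystem API (the output path is a hypothetical example); any RDMA-specific settings live in the cluster's core-site.xml/hdfs-site.xml rather than in application code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Standard client-side configuration; nothing HiBD-specific is needed here.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hibd-demo.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello from an unmodified HDFS client");
        }
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```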

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault tolerance as well as performance. This mode is enabled by default in the package.
• HHH-M: A high-performance in-memory based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.
• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.
• HHH-L-BB: This mode deploys a Memcached-based burst buffer system to reduce the bandwidth bottleneck of shared file system access. The burst buffer design is hosted by Memcached servers, each of which has a local SSD.
• MapReduce over Lustre, with/without local disks: Besides the HDFS-based solutions, this package also provides support to run MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.
• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in the different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).
Different Modes of RDMA for Apache Hadoop 2.x

• High-Performance Design of Spark over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Spark
– RDMA-based data shuffle and SEDA-based shuffle architecture
– Non-blocking and chunk-based data transfer
– Off-JVM-heap buffer management
– Support for OpenPOWER
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.5
– Based on Apache Spark 2.1.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms (x86, POWER)
• RAM disks, SSDs, and HDD
– http://hibd.cse.ohio-state.edu
RDMA for Apache Spark Distribution
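Since RDMA for Apache Spark tracks the stock Spark 2.x API, a shuffle-heavy job needs no source changes to use the RDMA-based shuffle engine. The sketch below (standard Spark Java API; class and application names are illustrative) shows the kind of workload where the accelerated path matters: groupByKey forces an all-to-all shuffle.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ShuffleDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ShuffleDemo");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Build a synthetic key-value dataset with many keys repeated across partitions.
        List<Tuple2<Integer, Integer>> data = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            data.add(new Tuple2<>(i % 1024, i));
        }
        JavaPairRDD<Integer, Integer> pairs = sc.parallelizePairs(data, 64);

        // groupByKey triggers an all-to-all shuffle; with RDMA for Apache Spark the
        // block transfers behind this stage go over the RDMA-based shuffle design.
        long groups = pairs.groupByKey().count();
        System.out.println("groups = " + groups);

        sc.stop();
    }
}
```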

Using HiBD Packages for Big Data Processing on Existing HPC Infrastructure


• High-Performance Design of Apache Kafka over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for Kafka
– Compliant with Apache Kafka 1.0.0 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache Kafka 1.0.0
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache Kafka Distribution
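Because the distribution is compliant with the Apache Kafka 1.0.0 APIs, a standard Kafka producer can drive an RDMA-Kafka broker without modification. A minimal sketch with the stock Kafka Java client follows; the broker address and topic name are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point this at an RDMA-Kafka (or stock Kafka) broker; address is hypothetical.
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Fire a stream of small records, the pattern used by producer benchmarks.
            for (int i = 0; i < 100_000; i++) {
                producer.send(new ProducerRecord<>("demo-topic", Integer.toString(i), "payload-" + i));
            }
            producer.flush();
        }
    }
}
```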

• High-Performance Design of HBase over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level
for HBase
– Compliant with Apache HBase 1.1.2 APIs and applications
– On-demand connection setup
– Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
– Based on Apache HBase 1.1.2
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
– http://hibd.cse.ohio-state.edu
RDMA for Apache HBase Distribution
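Compliance with the Apache HBase 1.1.2 APIs means the usual client code path is unchanged. Below is a minimal put/get sketch with the stock HBase 1.x Java client; the table name ("usertable") and column family ("cf") are assumptions and must already exist on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("usertable"))) {
            // Write one cell ...
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field0"), Bytes.toBytes("value-1"));
            table.put(put);

            // ... and read it back.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"))));
        }
    }
}
```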

• High-Performance Design of Memcached over RDMA-enabled Interconnects
– High performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs-level for Memcached and
libMemcached components
– High performance design of SSD-Assisted Hybrid Memory
– Non-Blocking Libmemcached Set/Get API extensions
– Support for burst-buffer mode in Lustre-integrated design of HDFS in RDMA for Apache Hadoop-2.x
– Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with
IPoIB)
• Current release: 0.9.6
– Based on Memcached 1.5.3 and libMemcached 1.0.18
– Compliant with libMemcached APIs and applications
– Tested with
• Mellanox InfiniBand adapters (DDR, QDR, FDR, and EDR)
• RoCE support with Mellanox adapters
• Various multi-core platforms
• SSD
– http://hibd.cse.ohio-state.edu
RDMA for Memcached Distribution
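The RDMA and hybrid-memory features are exposed through the enhanced libMemcached C library, but the server also keeps the traditional sockets-based support, so for illustration here is a minimal set/get sketch with the third-party spymemcached Java client (an assumption: spymemcached is not part of HiBD, and the host/port are placeholders). Over sockets this exercises only the protocol-compatible path, not the native-IB/RoCE transport.

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class MemcachedDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a memcached-protocol endpoint (hypothetical host and default port).
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("memcached-server", 11211));

        // Store a value with a 1-hour expiration, then read it back.
        client.set("greeting", 3600, "hello from a stock memcached client").get();
        Object value = client.get("greeting");
        System.out.println(value);

        client.shutdown();
    }
}
```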

• Micro-benchmarks for Hadoop Distributed File System (HDFS)
– Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency
(RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT)
Benchmark
– Support benchmarking of
• Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH)
HDFS
• Micro-benchmarks for Memcached
– Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark, Non-Blocking API Latency Benchmark, Hybrid
Memory Latency Benchmark
– Yahoo! Cloud Serving Benchmark (YCSB) Extension for RDMA-Memcached
• Micro-benchmarks for HBase
– Get Latency Benchmark, Put Latency Benchmark
• Micro-benchmarks for Spark
– GroupBy, SortBy
• Current release: 0.9.3
• http://hibd.cse.ohio-state.edu
OSU HiBD Micro-Benchmark (OHB) Suite – HDFS, Memcached, HBase, and Spark
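As a rough illustration of what the HDFS micro-benchmarks measure, the sketch below times a sequential write in the spirit of the SWL benchmark. This is not the OHB code itself; the file size, buffer size, and output path are arbitrary choices for the example, and it uses only the stock Hadoop FileSystem API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeqWriteLatency {
    public static void main(String[] args) throws Exception {
        long fileSizeBytes = 256L * 1024 * 1024;        // 256 MB test file (arbitrary)
        byte[] buffer = new byte[4 * 1024 * 1024];      // 4 MB write buffer (arbitrary)

        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/benchmarks/swl-test.dat");  // hypothetical output path

        long start = System.nanoTime();
        try (FSDataOutputStream out = fs.create(path, true)) {
            for (long written = 0; written < fileSizeBytes; written += buffer.length) {
                out.write(buffer);
            }
            out.hsync();                                // flush to the pipeline before stopping the clock
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Sequential write of " + fileSizeBytes + " bytes took " + elapsedMs + " ms");

        fs.delete(path, false);
        fs.close();
    }
}
```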

• High-Performance Designs with RDMA, In-memory, SSD, Parallel Filesystems
– HDFS
– MapReduce
– Spark
– Hadoop RPC and HBase
– Memcached
– Kafka
Basic Acceleration Case Studies and In-Depth Performance Evaluation

Performance Numbers of RDMA for Apache Hadoop 2.x – RandomWriter & TeraGen on OSU-RI2 (EDR)
Cluster with 8 nodes and a total of 64 maps
• RandomWriter: execution time reduced by 3x over IPoIB for 80–160 GB file sizes
• TeraGen: execution time reduced by 4x over IPoIB for 80–240 GB file sizes
[Charts: execution time (s) vs. data size (GB) for RandomWriter (80/120/160 GB) and TeraGen (80/160/240 GB), IPoIB (EDR) vs. OSU-IB (EDR)]

• InfiniBand FDR, SSD, 32/64 Worker Nodes, 768/1536 Cores, (768/1536M 768/1536R)
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node.
– 32 nodes/768 cores: Total time reduced by 37% over IPoIB (56Gbps)
– 64 nodes/1536 cores: Total time reduced by 43% over IPoIB (56Gbps)
Performance Evaluation of RDMA-Spark on SDSC Comet – HiBench PageRank
32 Worker Nodes, 768 cores, PageRank Total Time 64 Worker Nodes, 1536 cores, PageRank Total Time
[Charts: HiBench PageRank total time (sec) for Huge, BigData, and Gigantic data sizes, IPoIB vs. RDMA, on 32 and 64 worker nodes; RDMA reduces total time by 37% and 43%, respectively]

– RDMA-Accelerated Communication for Memcached Get/Set
– Hybrid ‘RAM+SSD’ slab management for higher data retention
– Non-blocking API extensions: memcached_(iset/iget/bset/bget/test/wait)
• Achieve near in-memory speeds while hiding bottlenecks of network and SSD I/O
• Ability to exploit communication/computation overlap
• Optional buffer re-use guarantees
– Adaptive slab manager with different I/O schemes for higher throughput.
Accelerating Hybrid Memcached with RDMA, Non-blocking Extensions and SSDs
D. Shankar, X. Lu, N. S. Islam, M. W. Rahman, and D. K. Panda, High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and SSDs: Non-blocking Extensions, Designs, and Benefits, IPDPS, May 2016
[Figure: client–server architecture – libMemcached library and RDMA-enhanced communication library on the client; RDMA-enhanced communication library and hybrid slab manager (RAM+SSD) on the server; blocking and non-blocking API flows for requests and replies]
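The non-blocking iset/iget/bset/bget extensions above belong to the RDMA-enhanced libMemcached C API, so they are not shown here directly. As a rough conceptual analogue of the communication/computation overlap they enable, the sketch below uses the third-party spymemcached Java client's asynchronous get (an assumption, not part of HiBD; the host, port, and key are placeholders): the request is issued, useful work proceeds, and the client blocks only when the value is actually needed.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import net.spy.memcached.MemcachedClient;
import net.spy.memcached.internal.GetFuture;

public class OverlapSketch {
    public static void main(String[] args) throws Exception {
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("memcached-server", 11211));

        // Issue the request without waiting for the reply ...
        GetFuture<Object> pending = client.asyncGet("hot-key");

        // ... overlap it with useful computation on the client ...
        long acc = 0;
        for (int i = 0; i < 10_000_000; i++) {
            acc += i;
        }

        // ... and only block when the value is actually needed.
        Object value = pending.get(500, TimeUnit.MILLISECONDS);
        System.out.println("computed " + acc + ", fetched " + value);

        client.shutdown();
    }
}
```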

– Data does not fit in memory: non-blocking Memcached Set/Get API extensions can achieve
  • >16x latency improvement vs. blocking API over RDMA-Hybrid/RDMA-Mem w/ penalty
  • >2.5x throughput improvement vs. blocking API over default/optimized RDMA-Hybrid
– Data fits in memory: non-blocking extensions perform similar to RDMA-Mem/RDMA-Hybrid and give >3.6x improvement over IPoIB-Mem
Performance Evaluation with Non-Blocking Memcached API
[Chart: average latency (us) of Set/Get for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b; latency broken down into miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (w/ SSD write on out-of-memory)]
Legend: H = Hybrid Memcached over SATA SSD; Opt = Adaptive slab manager; Block = Default blocking API; NonB-i = Non-blocking iset/iget API; NonB-b = Non-blocking bset/bget API w/ buffer re-use guarantee

RDMA-Kafka: High-Performance Message Broker for Streaming Workloads
Kafka Producer Benchmark (1 Broker, 4 Producers)
• Experiments run on the OSU-RI2 cluster: 2.4 GHz, 28 cores, InfiniBand EDR, 512 GB RAM, 400 GB SSD
  – Up to 98% improvement in latency compared to IPoIB
  – Up to 7x increase in throughput over IPoIB
[Charts: producer throughput (MB/s) and latency (ms) vs. record size (100–100,000), RDMA vs. IPoIB]
H. Javed, X. Lu, and D. K. Panda, Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka, BDCAT ’17

RDMA-Kafka: Yahoo! Streaming Benchmark
• Measures the RTT of a message: sent by a producer, processed, and its ack received
• 8 producers emitting a total of 120,000 records/sec
  – Processed by 4 consumers
• Commit time = 1000 ms => stepwise curve
• Frieda outperforms Kafka, with a reduction of about 31% in p99 latency
M. H. Javed, X. Lu, and D. K. Panda, Cutting the Tail: Designing High Performance Message Brokers to Reduce Tail Latencies in Stream Processing, IEEE Cluster 2018.

Designing Communication and I/O Libraries for Big Data Systems: Challenges
[Figure: software stack, top to bottom]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark, gRPC/TensorFlow, and Memcached)
• Programming Models (Sockets) – RDMA Protocols
• Communication and I/O Library: Point-to-Point Communication, Threaded Models and Synchronization, QoS & Fault Tolerance, Performance Tuning, I/O and File Systems, Virtualization (SR-IOV), Benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE, and Intelligent NICs); Storage Technologies (HDD, SSD, NVM, and NVMe-SSD); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators)

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

Design Overview of NVM and RDMA-aware HDFS (NVFS)
• Design Features
  – RDMA over NVM
  – HDFS I/O with NVM: Block Access and Memory Access
  – Hybrid design: NVM with SSD as hybrid storage for HDFS I/O
  – Co-design with Spark and HBase (cost-effectiveness, use-case)
[Figure: NVM- and RDMA-aware HDFS (NVFS) DataNode – RDMA sender/receiver and RDMA replicator serving the DFSClient, with NVFS-BlkIO and NVFS-MemIO writers/readers over NVM and SSDs; co-design hooks for Hadoop MapReduce, Spark, and HBase applications and benchmarks]
N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, International Conference on Supercomputing (ICS), June 2016

Evaluation with Hadoop MapReduce
• TestDFSIO on SDSC Comet (32 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 1.2x over HDFS
• TestDFSIO on OSU Nowlab (4 nodes)
  – Write: NVFS-MemIO gains 4x over HDFS
  – Read: NVFS-MemIO gains 2x over HDFS
[Charts: TestDFSIO average throughput (MBps) for Write and Read on SDSC Comet (32 nodes) and OSU Nowlab (4 nodes); HDFS (56Gbps) vs. NVFS-BlkIO (56Gbps) vs. NVFS-MemIO (56Gbps)]

QoS-aware SPDK Design for NVMe-SSD
• Synthetic application scenarios with different QoS requirements
  – Comparison using SPDK with Weighted Round Robin (WRR) NVMe arbitration
• Near-desired job bandwidth ratios
• Stable and consistent bandwidth
[Charts: (left) bandwidth (MB/s) over time for high- and medium-priority jobs under SPDK WRR vs. the OSU design (Scenario 1); (right) job bandwidth ratio for Scenarios 2–5 under SPDK-WRR, the OSU design, and the desired ratio]
S. Gugnani, X. Lu, and D. K. Panda, Analyzing, Modeling, and Provisioning QoS for NVMe SSDs, UCC 2018.

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

• Replication is a popular FT technique
– M additional puts for replication factor M+1
– Storage overhead of M+1
• Alternative storage efficient FT technique: Online Erasure Coding
– E.g., MDS codes such as Reed-Solomon Codes RS(K,M)
– K data and M parity chunks distributed to K+M (=N) nodes
– Recover data from any K fragments
– Storage overhead of N/K (K<N)
Online Erasure Coding for Resilient In-Memory KV
[Figure: data layout under replication vs. erasure coding (Reed-Solomon code)]
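To make the overhead comparison concrete, here is a small worked sketch (class and method names are illustrative): 3-way replication (M = 2 extra copies) and RS(3,2) can both survive the loss of any two fragments, but their storage costs differ.

```java
public class StorageOverhead {
    // Replication factor M+1 stores M extra full copies of every byte.
    static double replicationOverhead(int m) {
        return m + 1.0;
    }

    // RS(K,M) stores K data and M parity chunks on N = K + M nodes, i.e. N/K bytes per byte.
    static double erasureOverhead(int k, int m) {
        return (k + m) / (double) k;
    }

    public static void main(String[] args) {
        // Both configurations tolerate the loss of any 2 nodes.
        System.out.printf("Replication (M=2): %.2fx%n", replicationOverhead(2));  // 3.00x
        System.out.printf("RS(K=3, M=2):      %.2fx%n", erasureOverhead(3, 2));   // 1.67x
    }
}
```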

Performance Evaluations
• 150 clients on 10 nodes of the SDSC Comet cluster over a 5-node RDMA-Memcached cluster
• Performance improvements observed for key-value pair sizes >= 16 KB
  – Era-CE-CD gains ~2.3x over Async-RDMA-Rep for YCSB 50:50 and ~2.6x for YCSB 95:5
  – Era-SE-CD performs similar to Async-Rep for YCSB 50:50 and gains ~2.5x for YCSB 95:5
[Chart: aggregate throughput (MBps) for 10/20/40 GB Hadoop I/O workloads – Lustre-Direct, Boldio_Async-Rep=3, Boldio_Era(3,2)-CE-CD, and Boldio_Era(3,2)-SE-CD]
• 8 node Hadoop cluster on Intel Westmere QDR Cluster over 5-node RDMA-Memcached Cluster
• Key-Value Store-based Burst-Buffer System over Lustre (Boldio) for Hadoop I/O workloads.
• Performs on par with Async Replication scheme while enabling 50% gain in memory efficiency
D. Shankar, X. Lu, D. K. Panda, High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads, ICDCS, 2017
[Chart: average latency (us) for 1K/4K/16K value sizes under YCSB-A and YCSB-B – Memc-IPoIB-NoRep, Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD]

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

Using HiBD Packages for Deep Learning on Existing HPC Infrastructure

Architecture Overview of TensorFlowOnSpark
• Developed by Yahoo!
• Infrastructure to run TensorFlow in a distributed
manner
• Utilizes Spark for distributing computation
• Can use RDMA instead of TCP based gRPC over
high-performance interconnects
• Bypasses Spark to ingest data directly from HDFS
• Minimal changes required to run default
TensorFlow programs over TensorFlowOnSpark
• Linear scalability with more workers
Courtesy: http://yahoohadoop.tumblr.com/post/157196317141/open-sourcing-tensorflowonspark-distributed-deep

X. Lu, H. Shi, M. H. Javed, R. Biswas, and D. K. Panda, Characterizing Deep Learning over Big Data (DLoBD) Stacks on RDMA-capable Networks, HotI 2017.
High-Performance Deep Learning over Big Data (DLoBD) Stacks
• Challenges of Deep Learning over Big Data (DLoBD)
  ▪ Can RDMA-based designs in DLoBD stacks improve performance, scalability, and resource utilization on high-performance interconnects, GPUs, and multi-core CPUs?
  ▪ What are the performance characteristics of representative DLoBD stacks on RDMA networks?
• Characterization of DLoBD Stacks
  ▪ CaffeOnSpark, TensorFlowOnSpark, and BigDL
  ▪ IPoIB vs. RDMA; in-band vs. out-of-band communication; CPU vs. GPU; etc.
  ▪ Performance, accuracy, scalability, and resource utilization
  ▪ RDMA-based DLoBD stacks (e.g., BigDL over RDMA-Spark) can achieve 2.6x speedup compared to the IPoIB-based scheme, while maintaining similar accuracy
[Chart: epoch time (secs) and accuracy (%) vs. epoch number (1–18), IPoIB vs. RDMA; 2.6x speedup for RDMA at similar accuracy]

• Advanced Designs
– Hadoop with NVRAM
– Online Erasure Coding for Key-Value Store
– Deep Learning Tools (such as TensorFlow, BigDL) over RDMA-enabled Hadoop and Spark
– Big Data Processing over HPC Cloud
Advanced Designs and Future Activities for Accelerating Big Data Applications

• Challenges
– Existing designs in Hadoop are not virtualization-aware
– No support for automatic topology detection
• Design
– Automatic Topology Detection using
MapReduce-based utility
• Requires no user input
• Can detect topology changes during
runtime without affecting running jobs
– Virtualization and topology-aware
communication through map task scheduling and
YARN container allocation policy extensions
Virtualization-aware and Automatic Topology Detection Schemes in Hadoop on InfiniBand
S. Gugnani, X. Lu, and D. K. Panda, Designing Virtualization-aware and Automatic Topology Detection Schemes for Accelerating
Hadoop on SR-IOV-enabled Clouds, CloudCom’16, December 2016
[Charts: execution time of Hadoop benchmarks (Sort, WordCount, PageRank at 40 GB and 60 GB) and Hadoop applications (CloudBurst, Self-join in default and distributed modes), RDMA-Hadoop vs. Hadoop-Virt; execution time reduced by up to 55% and 34%]

• Upcoming Releases of RDMA-enhanced Packages will support
– Upgrades to the latest versions of Hadoop (3.0) and Spark
• Upcoming Releases of OSU HiBD Micro-Benchmarks (OHB) will support
– MapReduce, Hadoop RPC, and gRPC
• Advanced designs with upper-level changes and optimizations
– Boldio (Burst Buffer over Lustre for Big Data I/O Acceleration)
– Efficient Indexing
On-going and Future Plans of OSU High Performance Big Data (HiBD) Project

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
– Help and guidance with installation of the library
– Platform-specific optimizations and tuning
– Timely support for operational issues encountered with the library
– Web portal interface to submit issues and track their progress
– Advanced debugging techniques
– Application-specific optimizations and tuning
– Obtaining guidelines on best practices
– Periodic information on major fixes and updates
– Information on major releases
– Help with upgrading to the latest release
– Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) for the last two years
Commercial Support for HiBD, MVAPICH2 and HiDL Libraries

• Discussed challenges in accelerating Big Data middleware with HPC
technologies
• Presented basic and advanced designs to take advantage of InfiniBand/RDMA
for HDFS, Spark, Memcached, and Kafka
• RDMA can benefit DL workloads, as shown by our AR-gRPC and RDMA-TensorFlow designs
• Many other open issues need to be solved
• Will enable the Big Data and AI community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner
• Looking forward to collaboration with the community
Concluding Remarks

Funding Acknowledgments
Funding Support by
Equipment Support by

Personnel Acknowledgments
Current Students (Graduate)
– A. Awan (Ph.D.)
– M. Bayatpour (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– S. Guganani (Ph.D.)
Past Students
– A. Augustine (M.S.)
– P. Balaji (Ph.D.)
– R. Biswas (M.S.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
– J. Zhang (Ph.D.)
Past Research Scientist
– K. Hamidouche
– S. Sur
Past Post-Docs
– D. Banerjee
– X. Besseron
– H.-W. Jin
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– K. Kulkarni (M.S.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– M. Li (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– J. Hashmi (Ph.D.)
– A. Jain (Ph.D.)
– K. S. Khorassani (Ph.D.)
– P. Kousha (Ph.D.)
– D. Shankar (Ph.D.)
– J. Lin
– M. Luo
– E. Mancini
Current Research Asst. Professor
– X. Lu
Past Programmers
– D. Bureddy
– J. Perkins
Current Research Specialist
– J. Smith
– S. Marcarelli
– J. Vienne
– H. Wang
Current Post-doc
– A. Ruhela
– K. Manian
Current Students (Undergraduate)
– V. Gangal (B.S.)
– M. Haupt (B.S.)
– N. Sarkauskas (B.S.)
– A. Yeretzian (B.S.)
Past Research Specialist
– M. Arnold
Current Research Scientist
– H. Subramoni

The 5th International Workshop on High-Performance Big Data and Cloud Computing (HPBDC)
In conjunction with IPDPS’19, Rio de Janeiro, Brazil, Monday, May 20th, 2019
HPBDC 2018 was held in conjunction with IPDPS’18
http://web.cse.ohio-state.edu/~luxi/hpbdc2018
HPBDC 2017 was held in conjunction with IPDPS’17
http://web.cse.ohio-state.edu/~luxi/hpbdc2017
HPBDC 2016 was held in conjunction with IPDPS’16
http://web.cse.ohio-state.edu/~luxi/hpbdc2016
Important Dates:
  Abstract submission: January 15, 2019
  Paper submission: February 15, 2019
  Acceptance notification: March 1, 2019
  Camera-ready deadline: March 15, 2019

http://www.cse.ohio-state.edu/~panda
Thank You!
Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/