Accelerating Big Data Processing with Hadoop, Spark...

Accelerating Big Data Processing with Hadoop, Spark and Memcached. Dhabaleswar K. (DK) Panda, The Ohio State University. Email: [email protected], http://www.cse.ohio-state.edu/~panda. Talk at HPC Advisory Council Switzerland Conference (Mar '15).

Transcript of Accelerating Big Data Processing with Hadoop, Spark...

Page 1:

Accelerating Big Data Processing with Hadoop, Spark and Memcached

Dhabaleswar K. (DK) Panda, The Ohio State University

E-mail: [email protected], http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Switzerland Conference (Mar '15)


Page 2:

Introduction to Big Data Applications and Analytics

• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety (Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)
• 5V's of Big Data: 3V + Value, Veracity

Page 3:

Data Management and Processing on Modern Clusters

• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
  – Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (Offline): HDFS, MapReduce, Spark

Figure: Internet traffic flows into a front-end tier of Web Servers backed by Memcached + DB (MySQL) and NoSQL DB (HBase) for data accessing and serving, and into a back-end tier of HDFS, MapReduce, and Spark for data analytics apps/jobs.

Page 4:

Overview of Apache Hadoop Architecture

• Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
• Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
• http://hadoop.apache.org

Hadoop 1.x stack: MapReduce (Cluster Resource Management & Data Processing) over the Hadoop Distributed File System (HDFS) and Hadoop Common/Core (RPC, ..).

Hadoop 2.x stack: MapReduce and other models (Data Processing) over YARN (Cluster Resource Management & Job Scheduling), the Hadoop Distributed File System (HDFS), and Hadoop Common/Core (RPC, ..). A minimal MapReduce word-count example follows below.
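To make the MapReduce programming model on this stack concrete, here is a small word-count job written against the standard Apache Hadoop 2.x Java API. It is an illustrative sketch added to the transcript, not part of the original slides.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: map emits (word, 1), reduce sums the counts.
public class WordCount {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (tok.isEmpty()) continue;
        word.set(tok);
        ctx.write(word, ONE);          // shuffled to reducers by key
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Under YARN (Hadoop 2.x), this same job is scheduled by the ResourceManager rather than the Hadoop 1.x JobTracker, without any change to the application code.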

Page 5:

Spark Architecture Overview

• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs (see the sketch after this slide)
  – Sockets-based communication

http://spark.apache.org

Figure: a Spark driver (SparkContext) coordinates with a Master (and Zookeeper) to schedule tasks on Worker nodes, which read from and write to HDFS.
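The shuffle operations listed above are what the RDMA design discussed later targets. As an illustration (not from the talk), the sketch below uses Spark's Java API to key an RDD and repartition it with groupByKey, a wide dependency that triggers a MapReduce-like shuffle; the input path and key extraction are assumptions made for this example.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// groupByKey triggers a MapReduce-like shuffle that repartitions the RDD.
public class GroupByExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("GroupByExample");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // hdfs:///data/input is a placeholder path for the example
    JavaRDD<String> lines = sc.textFile("hdfs:///data/input");

    // Key each record by its first token; values are the full lines.
    JavaPairRDD<String, String> keyed = lines.mapToPair(
        line -> new Tuple2<>(line.split("\\s+")[0], line));

    // Wide dependency: all values for a key are shuffled to one partition.
    JavaPairRDD<String, Iterable<String>> grouped = keyed.groupByKey();

    System.out.println("Distinct keys: " + grouped.count());
    sc.stop();
  }
}
```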

Page 6:

Memcached Architecture

• Three-layer architecture of Web 2.0
  – Web Servers, Memcached Servers, Database Servers
• Distributed caching layer
  – Allows aggregating spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls (a cache-aside sketch follows this slide)
• Scalable model, but typical usage is very network intensive

Figure: web frontend servers (Memcached clients) connect over the Internet and high-performance networks to Memcached servers and database servers; each server node has main memory, CPUs, SSD, and HDD.
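The "cache database queries" usage described above is the classic cache-aside pattern. The sketch below illustrates it with the spymemcached Java client and JDBC; the host names, table, key layout, and expiry time are assumptions made for this example and are not part of the talk.

```java
import java.net.InetSocketAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import net.spy.memcached.MemcachedClient;

// Cache-aside read: check Memcached first, fall back to MySQL on a miss,
// then populate the cache so subsequent reads are served from memory.
public class CacheAsideRead {
  public static String getUserName(MemcachedClient cache, Connection db, int userId)
      throws Exception {
    String key = "user:" + userId + ":name";      // assumed key layout
    Object cached = cache.get(key);
    if (cached != null) {
      return (String) cached;                     // cache hit
    }
    try (PreparedStatement ps =
             db.prepareStatement("SELECT name FROM users WHERE id = ?")) {
      ps.setInt(1, userId);
      try (ResultSet rs = ps.executeQuery()) {
        String name = rs.next() ? rs.getString(1) : null;
        if (name != null) {
          cache.set(key, 300, name);              // cache for 300 seconds
        }
        return name;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    MemcachedClient cache =
        new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
    Connection db =
        DriverManager.getConnection("jdbc:mysql://db-host/app", "user", "pass");
    System.out.println(getUserName(cache, db, 42));
    cache.shutdown();
    db.close();
  }
}
```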

Page 7:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the RDMA-Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 8:

Drivers for Modern HPC Clusters

• High End Computing (HEC) is growing dramatically
  – High Performance Computing
  – Big Data Computing
• Technology advancement
  – Multi-core/many-core technologies and accelerators
  – Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  – Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM)
  – Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

Page 9:

Interconnects and Protocols in the OpenFabrics Stack

Figure: application/middleware interface, protocol, adapter, and switch for each stack:
– 1/10/40 GigE: Sockets over kernel-space TCP/IP with the Ethernet driver, Ethernet adapter, Ethernet switch
– 10/40 GigE-TOE: Sockets over hardware-offloaded TCP/IP, Ethernet adapter, Ethernet switch
– IPoIB: Sockets over IPoIB, InfiniBand adapter, InfiniBand switch
– RSockets: Sockets over user-space RSockets, InfiniBand adapter, InfiniBand switch
– SDP: Sockets over user-space RDMA (SDP), InfiniBand adapter, InfiniBand switch
– iWARP: Sockets over user-space TCP/IP, iWARP adapter, Ethernet switch
– RoCE: Sockets over user-space RDMA, RoCE adapter, Ethernet switch
– IB Native: Verbs over user-space RDMA, InfiniBand adapter, InfiniBand switch

Page 10:

Wide Adoption of RDMA Technology

• Message Passing Interface (MPI) for HPC
• Parallel file systems
  – Lustre
  – GPFS
• Delivering excellent performance:
  – < 1.0 microsecond latency
  – 100 Gbps bandwidth
  – 5-10% CPU utilization
• Delivering excellent scalability

Page 11:

Challenges in Designing Communication and I/O Libraries for Big Data Systems

Figure: applications run over Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), which today uses sockets as its programming model (other protocols? upper-level changes?). Beneath it sits a communication and I/O library covering point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, the RDMA protocol, and benchmarks, built on networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).

Page 12:

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

Current design: Application → Sockets → 1/10 GigE network.

• Sockets not designed for high performance
  – Stream semantics are often a mismatch for upper layers
  – Zero-copy not available for non-blocking sockets

Our approach: Application → OSU design → Verbs interface → 10 GigE or InfiniBand.

Page 13:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the RDMA-Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 14:

Overview of the HiBD Project and Releases

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
• User base: 95 organizations from 18 countries
• More than 2,900 downloads
• RDMA for Apache HBase and Spark will be available in the near future

Page 15:

RDMA for Apache Hadoop 2.x Distribution

• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Enhanced HDFS with in-memory and heterogeneous storage
  – High-performance design of MapReduce over Lustre
  – Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.6
  – Based on Apache Hadoop 2.6.0
  – Compliant with Apache Hadoop 2.6.0 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs, and Lustre
  – http://hibd.cse.ohio-state.edu

Page 16:

RDMA for Memcached Distribution

• High-performance design of Memcached over RDMA-enabled interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the Memcached and libMemcached components
  – Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
  – High-performance design of SSD-assisted hybrid memory
• Current release: 0.9.3
  – Based on Memcached 1.4.22 and libMemcached 1.0.18
  – Compliant with libMemcached APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • SSD
  – http://hibd.cse.ohio-state.edu

Page 17:

OSU HiBD Micro-Benchmark (OHB) Suite - Memcached

• Released in OHB 0.7.1 (ohb_memlat)
• Evaluates the performance of stand-alone Memcached
• Three different micro-benchmarks
  – SET micro-benchmark: micro-benchmark for Memcached set operations
  – GET micro-benchmark: micro-benchmark for Memcached get operations
  – MIX micro-benchmark: micro-benchmark for a mix of Memcached set/get operations (Read:Write ratio is 90:10)
• Calculates the average latency of Memcached operations (a sketch of the measurement loop follows this slide)
• Can measure throughput in transactions per second
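OHB itself is built on libMemcached in C. Purely to illustrate the measurement idea behind ohb_memlat, the following Java sketch (using the spymemcached client, with an assumed host, value size, and iteration count) times SET and GET operations and reports their average latency.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import net.spy.memcached.MemcachedClient;

// Average SET/GET latency for one value size, in the spirit of ohb_memlat.
// (OHB is a libMemcached/C benchmark; this is only an illustrative sketch.)
public class MemLatencySketch {
  public static void main(String[] args) throws Exception {
    MemcachedClient client =
        new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
    int iterations = 10000, valueSize = 1024;
    byte[] value = new byte[valueSize];
    Arrays.fill(value, (byte) 'x');

    double setUs = 0, getUs = 0;
    for (int i = 0; i < iterations; i++) {
      String key = "ohb:" + i;
      long t0 = System.nanoTime();
      client.set(key, 0, value).get();        // wait for the SET to complete
      long t1 = System.nanoTime();
      client.get(key);                        // synchronous GET
      long t2 = System.nanoTime();
      setUs += (t1 - t0) / 1000.0;
      getUs += (t2 - t1) / 1000.0;
    }
    System.out.printf("avg SET latency: %.2f us%n", setUs / iterations);
    System.out.printf("avg GET latency: %.2f us%n", getUs / iterations);
    client.shutdown();
  }
}
```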

Page 18:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 19:

Acceleration Case Studies and In-Depth Performance Evaluation

• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – Spark

Page 20:

Design Overview of HDFS with RDMA

• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based HDFS with the communication library written in native code (a hypothetical bridge sketch follows this slide)
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support

Figure: applications use HDFS; in the OSU design, writes go through the Java Native Interface (JNI) to verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..), while other operations use the Java socket interface over 1/10 GigE or IPoIB networks.
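The JNI bridge mentioned above can be pictured with a small sketch. The class, native methods, and library name below are hypothetical and do not correspond to the actual RDMA-for-Apache-Hadoop sources; they only illustrate how Java code can hand direct buffers to a native communication library.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a JNI bridge between Java HDFS code and a native
// RDMA communication library; names do not correspond to the real package.
public class RdmaBridgeSketch {
  static {
    System.loadLibrary("rdmabridge");   // loads librdmabridge.so (native side)
  }

  // Implemented in C against the verbs API on the native side.
  private native long connect(String host, int port);
  private native int writeBlock(long connHandle, ByteBuffer directBuf, int len);
  private native void close(long connHandle);

  public void sendBlock(String dataNodeHost, int port, byte[] block) {
    // Direct buffers avoid an extra copy when handed to native code.
    ByteBuffer buf = ByteBuffer.allocateDirect(block.length);
    buf.put(block).flip();
    long conn = connect(dataNodeHost, port);
    try {
      writeBlock(conn, buf, block.length);
    } finally {
      close(conn);
    }
  }
}
```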

Page 21:

Communication Times in HDFS

• Cluster with HDD DataNodes
  – 30% improvement in communication time over IPoIB (QDR)
  – 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes

Figure: communication time (s) vs. file size (2-10 GB) for 10GigE, IPoIB (QDR), and OSU-IB (QDR); OSU-IB reduces communication time by 30%.

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014

Page 22:

Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)

• Design features
  – Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
  – Eviction/promotion based on data usage pattern
  – Hybrid replication
  – Lustre-integrated mode: Lustre-based fault-tolerance

Figure: applications sit on top of Triple-H, which applies data placement policies, eviction/promotion, and hybrid replication across RAM disk, SSD, HDD, and Lustre.

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015

Page 23:

Enhanced HDFS with In-Memory and Heterogeneous Storage: Three Modes

• HHH (default): heterogeneous storage devices with hybrid replication schemes
  – I/O operations over RAM disk, SSD, and HDD
  – Hybrid replication (in-memory and persistent storage)
  – Better fault-tolerance as well as performance
• HHH-M: high-performance in-memory I/O operations
  – Memory replication (in-memory only, with lazy persistence)
  – As much performance benefit as possible
• HHH-L: Lustre-integrated
  – Takes advantage of the Lustre available in HPC clusters
  – Lustre-based fault-tolerance (no HDFS replication)
  – Reduced local storage space usage

Page 24:

Performance Improvement on TACC Stampede (HHH)

• For 160GB TestDFSIO on 32 nodes
  – Write throughput: 7x improvement over IPoIB (FDR)
  – Read throughput: 2x improvement over IPoIB (FDR)
• For 120GB RandomWriter on 32 nodes
  – 3x improvement over IPoIB (QDR)

Figure (TestDFSIO): total throughput (MBps) for write and read with IPoIB (FDR) vs. OSU-IB (FDR); write increased by 7x, read by 2x. Figure (RandomWriter): execution time (s) for cluster size:data size of 8:30, 16:60, and 32:120 with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 3x.

Page 25:

Performance Improvement on SDSC Gordon (HHH-L)

• For 60GB Sort on 8 nodes
  – 24% improvement over default HDFS
  – 54% improvement over Lustre
  – 33% storage space saving compared to default HDFS

Figure (Sort): execution time (s) for 20, 40, and 60 GB with HDFS-IPoIB (QDR), Lustre-IPoIB (QDR), and OSU-IB (QDR); reduced by 54%.

Storage space for 60GB Sort: HDFS-IPoIB (QDR): 360 GB; Lustre-IPoIB (QDR): 120 GB; OSU-IB (QDR): 240 GB.

Page 26:

Evaluation with PUMA and CloudBurst (HHH-L/HHH)

• PUMA on OSU RI
  – SequenceCount with HHH-L: 17% benefit over Lustre, 8% over HDFS
  – Grep with HHH: 29.5% benefit over Lustre, 13.2% over HDFS
• CloudBurst on TACC Stampede
  – With HHH: 19% improvement over HDFS

Figure (PUMA): execution time (s) for SequenceCount and Grep with HDFS-IPoIB (QDR), Lustre-IPoIB (QDR), and OSU-IB (QDR); reduced by 17% and 29.5%, respectively. Figure (CloudBurst): HDFS-IPoIB (FDR) 60.24 s vs. OSU-IB (FDR) 48.3 s.

Page 27:

Acceleration Case Studies and In-Depth Performance Evaluation

• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – Spark

Page 28:

Design Overview of MapReduce with RDMA

• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based MapReduce with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support

Figure: applications drive MapReduce (Job Tracker, Task Tracker, Map, Reduce); the OSU design routes the shuffle through the Java Native Interface (JNI) to verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..), while the Java socket interface is retained over 1/10 GigE or IPoIB networks.

Page 29:

Advanced Overlapping among Different Phases

• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases, compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment

Figure: default architecture vs. enhanced overlapping vs. advanced overlapping of the map, shuffle, merge, and reduce phases.

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

Page 30:

Performance Evaluation of Sort and TeraSort

• For 240GB Sort on 64 nodes (512 cores)
  – 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For 320GB TeraSort on 64 nodes (1K cores)
  – 38% improvement over IPoIB (FDR) with HDD used for HDFS

Figure (Sort on the OSU cluster): job execution time (s) for 60/120/240 GB on 16/32/64 nodes with IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR); reduced by 40%. Figure (TeraSort on TACC Stampede): job execution time (s) for 80/160/320 GB on 16/32/64 nodes with IPoIB (FDR), UDA-IB (FDR), and OSU-IB (FDR); reduced by 38%.

Page 31:

Evaluations using PUMA Workload

• 50% improvement in SelfJoin over IPoIB (QDR) for 80 GB data size
• 49% improvement in SequenceCount over IPoIB (QDR) for 30 GB data size

Figure: normalized execution time for AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB) with 10GigE, IPoIB (QDR), and OSU-IB (QDR); SelfJoin reduced by 50% and SeqCount by 49%.

Page 32:

Optimize Hadoop YARN MapReduce over Parallel File Systems

• HPC cluster deployment
  – Hybrid topological solution of a Beowulf architecture with separate I/O nodes
  – Lean compute nodes with a light OS; more memory space; small local storage
  – Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre
• MapReduce over Lustre
  – Local disk is used as the intermediate data directory
  – Lustre is used as the intermediate data directory

Figure (Lustre setup): compute nodes run the App Master, Map, and Reduce tasks with a Lustre client; metadata servers and object storage servers provide the parallel file system.

Page 33:

Design Overview of Shuffle Strategies for MapReduce over Lustre

• Design features
  – Two shuffle approaches: Lustre-read-based shuffle and RDMA-based shuffle
  – Hybrid shuffle algorithm to take benefit from both shuffle approaches
  – Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation (a policy sketch follows this slide)
  – In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design

Figure: Map 1/2/3 write intermediate data to Lustre; Reduce 1 and Reduce 2 fetch it via Lustre read or RDMA, followed by in-memory merge/sort and reduce.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
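A minimal sketch of the dynamic adaptation idea described above, assuming the policy compares a running average of profiled Lustre read times against an estimated RDMA transfer time for each shuffle request; the class, smoothing factor, and decision rule are hypothetical and are not the published design.

```java
// Hypothetical policy sketch: pick a shuffle path per request by comparing a
// running profile of Lustre read time against an estimated RDMA transfer time.
public class HybridShuffleChooser {
  enum Path { LUSTRE_READ, RDMA }

  private double avgLustreReadUs = 0;   // exponentially weighted moving average
  private final double alpha = 0.2;     // smoothing factor (assumed)

  public void recordLustreRead(double observedUs) {
    avgLustreReadUs = (avgLustreReadUs == 0)
        ? observedUs
        : alpha * observedUs + (1 - alpha) * avgLustreReadUs;
  }

  public Path choose(long bytes, double rdmaBandwidthMBps) {
    // Rough estimate of what RDMA would cost for this shuffle request.
    double rdmaUs = (bytes / (rdmaBandwidthMBps * 1e6)) * 1e6;
    return (avgLustreReadUs > 0 && avgLustreReadUs < rdmaUs)
        ? Path.LUSTRE_READ : Path.RDMA;
  }
}
```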

Page 34:

Performance Improvement of MapReduce over Lustre on TACC-Stampede

• Local disk is used as the intermediate data directory
• For 500GB Sort on 64 nodes
  – 44% improvement over IPoIB (FDR)
• For 640GB Sort on 128 nodes
  – 48% improvement over IPoIB (FDR)

Figure: job execution time (s) for 300/400/500 GB Sort with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 44%. Figure (scale-out): job execution time (s) for 20 GB to 640 GB Sort on 4 to 128 nodes with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 48%.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014

Page 35:

Case Study: Performance Improvement of MapReduce over Lustre on SDSC-Gordon

• Lustre is used as the intermediate data directory
• For 80GB Sort on 8 nodes
  – 34% improvement over IPoIB (QDR)
• For 120GB TeraSort on 16 nodes
  – 25% improvement over IPoIB (QDR)

Figure (Sort): job execution time (s) for 40/60/80 GB with IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34%. Figure (TeraSort): job execution time (s) for 40/80/120 GB with the same configurations; reduced by 25%.

Page 36:

Acceleration Case Studies and In-Depth Performance Evaluation

• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – Spark

Page 37:

Design Overview of Spark with RDMA

• Design features
  – RDMA-based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking and out-of-order data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with the communication library written in native code

Figure: Spark applications (Scala/Java/Python) run tasks through the BlockManager/BlockFetcherIterator; the shuffle is served by the default Java NIO shuffle server/fetcher, the optional Netty shuffle server/fetcher, or the plug-in RDMA shuffle server/fetcher. The socket paths use 1/10 GigE or IPoIB (QDR/FDR) networks, while the RDMA-based shuffle engine (Java/JNI) uses native InfiniBand (QDR/FDR).

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014

Page 38:

Preliminary Results of Spark-RDMA Design: GroupBy

• Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
  – 18% improvement over IPoIB (QDR) for 10GB data size
• Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
  – 20% improvement over IPoIB (QDR) for 20GB data size

Figure (4 HDD nodes, GroupBy with 32 cores): GroupBy time (sec) for 4-10 GB with 10GigE, IPoIB, and RDMA; reduced by 18%. Figure (8 HDD nodes, GroupBy with 64 cores): GroupBy time (sec) for 8-20 GB; reduced by 20%.

Page 39:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 40:

Performance Benefits: TestDFSIO on TACC Stampede (HHH)

Cluster with 32 nodes with HDD, TestDFSIO with a total of 128 maps.

• For throughput: 6-7.8x improvement over IPoIB for 80-120 GB file size
• For latency: 2.5-3x improvement over IPoIB for 80-120 GB file size

Figure (throughput): total throughput (MBps) for 80/100/120 GB with IPoIB (FDR) vs. OSU-IB (FDR); increased by up to 7.8x. Figure (latency): execution time (s) for the same sizes; reduced by up to 2.5x.

Page 41:

Performance Benefits: RandomWriter & TeraGen on TACC-Stampede (HHH)

Cluster with 32 nodes with a total of 128 maps.

• RandomWriter: 3-4x improvement over IPoIB for 80-120 GB file size
• TeraGen: 4-5x improvement over IPoIB for 80-120 GB file size

Figure (RandomWriter): execution time (s) for 80/100/120 GB with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 3x. Figure (TeraGen): execution time (s) for the same sizes; reduced by 4x.

Page 42:

Performance Benefits: Sort & TeraSort on TACC-Stampede (HHH)

Cluster configurations: 32 nodes with a total of 128 maps and 64 reduces, and 32 nodes with a total of 128 maps and 57 reduces.

• Sort with a single HDD per node: 40-52% improvement over IPoIB for 80-120 GB data
• TeraSort with a single HDD per node: 42-44% improvement over IPoIB for 80-120 GB data

Figure (Sort): execution time (s) for 80/100/120 GB with IPoIB (FDR) vs. OSU-IB (FDR); reduced by up to 52%. Figure (TeraSort): execution time (s) for the same sizes; reduced by up to 44%.

Page 43:

Performance Benefits: TestDFSIO on TACC Stampede and SDSC Gordon (HHH-M)

• TestDFSIO write on TACC Stampede (32 nodes with HDD, a total of 128 maps): 28x improvement over IPoIB for 60-100 GB file size
• TestDFSIO write on SDSC Gordon (16 nodes with SSD, a total of 64 maps): 6x improvement over IPoIB for 60-100 GB file size

Figure (TACC Stampede): total throughput (MBps) for 60/80/100 GB with IPoIB (FDR) vs. OSU-IB (FDR); increased by 28x. Figure (SDSC Gordon): total throughput (MBps) for the same sizes with IPoIB (QDR) vs. OSU-IB (QDR); increased by 6x.

Page 44:

Performance Benefits: TestDFSIO and Sort on SDSC-Gordon (HHH-L)

Cluster with 16 nodes.

• TestDFSIO for 80GB data size
  – Write: 9x improvement over HDFS
  – Read: 29% improvement over Lustre
• Sort
  – Up to 28% improvement over HDFS
  – Up to 50% improvement over Lustre

Figure (TestDFSIO): total throughput (MBps) for write and read with HDFS, Lustre, and OSU-IB; write increased by 9x, read by 29%. Figure (Sort): execution time (s) for 40/60/80 GB with HDFS, Lustre, and OSU-IB; reduced by 50%.

Page 45:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 46:

Memcached-RDMA Design

• Server and client perform a negotiation protocol
  – The master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• Native IB-verbs-level design and evaluation with
  – Server: Memcached (http://memcached.org)
  – Client: libmemcached (http://libmemcached.org)
  – Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)

Figure: sockets clients are handed to sockets worker threads and RDMA clients to verbs worker threads by the master thread; all worker threads operate on shared data (memory slabs, items, ...).

Page 47:

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel Sandy Bridge cluster, IB: FDR).

• Memcached GET latency
  – 4 bytes: OSU-IB 2.84 us, IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us, IPoIB 123.42 us
  – Latency reduced by nearly 20x
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec
  – Nearly 2x improvement in throughput

Figure (GET latency): time (us) vs. message size (1 byte to 4 KB) for OSU-IB (FDR) and IPoIB (FDR). Figure (throughput): thousands of transactions per second (TPS) vs. number of clients (16 to 4080).

Page 48:

Micro-benchmark Evaluation for OLDP Workloads

• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
  – improve query latency by up to 66% over IPoIB (32Gbps)
  – improve throughput by up to 69% over IPoIB (32Gbps)

Figure (latency): latency (sec) vs. number of clients (64-400) for Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps); reduced by 66%. Figure (throughput): throughput (Kq/s) vs. number of clients for the same two configurations.

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS'15

Page 49:

Performance Benefits on SDSC-Gordon: OHB Latency & Throughput Micro-Benchmarks

• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
• Memcached-RDMA can
  – improve query latency by up to 70% over IPoIB (32Gbps)
  – improve throughput by up to 2x over IPoIB (32Gbps)
  – no overhead in using hybrid mode when all data can fit in memory

Figure (latency): average latency (us) vs. message size (bytes) for IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hyb (32Gbps). Figure (throughput): throughput (million trans/sec) vs. number of clients (64-1024) for IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); up to 2x improvement.

Page 50:

Performance Benefits on OSU-RI-SSD: OHB Micro-benchmark for Hybrid Memcached

ohb_memhybrid, uniform access pattern, single client and single server with 64MB.

• Success rate of in-memory vs. hybrid SSD-memory for different spill factors
  – 100% success rate for the hybrid design, while that of pure in-memory degrades
• Average latency with penalty for in-memory vs. hybrid SSD-assisted mode for a spill factor of 1.5
  – Up to 53% improvement over in-memory, with a server miss penalty as low as 1.5 ms

Figure (success rate): success rate vs. spill factor (1-1.45) for RDMA-Mem-Uniform and RDMA-Hybrid. Figure (latency): latency with penalty (us) vs. message size (512 bytes to 128 KB) for RDMA-Mem (32Gbps) and RDMA-Hybrid (32Gbps); up to 53% improvement.

Page 51:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 52:

HBase-RDMA Design Overview

• JNI layer bridges Java-based HBase with the communication library written in native code
• Enables high-performance RDMA communication, while supporting the traditional socket interface

Figure: applications drive HBase; the OSU-IB design goes through the Java Native Interface (JNI) to IB verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..), while the Java socket interface is retained over 1/10 GigE or IPoIB networks.

Page 53:

HBase: YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Serving Benchmark; a Get/Put sketch follows this slide)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients

Figure: read latency and write latency (us) vs. number of clients (8-128) for 10GigE, IPoIB (QDR), and OSU-IB (QDR).

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
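For reference, a YCSB-style single-row update and read against HBase looks like the sketch below, written against the classic HBase Java client API of that era; the table name "usertable", column family, row key, and field are assumptions made for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Single-row Put (write) and Get (read) with per-operation latency timing.
public class HBaseReadWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "usertable");   // assumed table name

    // Put: write one field of a row.
    Put put = new Put(Bytes.toBytes("user1000"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("field0"), Bytes.toBytes("value0"));
    long t0 = System.nanoTime();
    table.put(put);
    System.out.printf("Put latency: %.1f us%n", (System.nanoTime() - t0) / 1e3);

    // Get: read the same field back.
    Get get = new Get(Bytes.toBytes("user1000"));
    get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("field0"));
    t0 = System.nanoTime();
    Result r = table.get(get);
    System.out.printf("Get latency: %.1f us%n", (System.nanoTime() - t0) / 1e3);
    System.out.println(
        Bytes.toString(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("field0"))));

    table.close();
  }
}
```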

Page 54:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 55:

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

Figure: the same layered stack as before: applications over Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), a sockets programming model (other protocols? upper-level changes?), a communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, RDMA protocol, benchmarks), and the underlying networking technologies, commodity computing system architectures, and storage technologies.

Page 56:

Are the Current Benchmarks Sufficient for Big Data Management and Processing?

• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

Page 57:

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack

Page 58:

Challenges in Benchmarking of RDMA-based Designs

Figure: in the same layered stack, current benchmarks exist only at the applications level, while no benchmarks exist for the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, RDMA protocols, benchmarks) beneath the Big Data middleware; the open question is how to correlate the two.

Page 59:

Iterative Process: Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications

Figure: in the layered stack, applications-level benchmarks sit above the Big Data middleware and micro-benchmarks target the communication and I/O library; benchmarking next-generation systems is an iterative process between the two levels.

Page 60:

•  Evaluate  the  performance  of  standalone  HDFS    

•  Five  different  benchmarks  –  SequenFal  Write  Latency  (SWL)  

–  SequenFal  or  Random  Read  Latency  (SRL  or  RRL)  

–  SequenFal  Write  Throughput  (SWT)  

–  SequenFal  Read  Throughput  (SRT)  

–  SequenFal  Read-­‐Write  Throughput  (SRWT)  

HPCAC  Switzerland  Conference  (Mar  '15)    

OSU  HiBD  Micro-­‐Benchmark  (OHB)  Suite  -­‐  HDFS  

Benchmark | File Name | File Size | HDFS Parameter | Readers | Writers | Random/Sequential Read | Seek Interval
SWL       | √         | √         | √              |         |         |                        |
SRL/RRL   | √         | √         | √              |         |         | √                      | √ (RRL)
SWT       |           | √         | √              |         | √       |                        |
SRT       |           | √         | √              | √       |         |                        |
SRWT      |           | √         | √              | √       | √       |                        |

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012.
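As a concrete illustration of what an SWL-style test measures, the sketch below times a single sequential write of a given size through the standard Hadoop FileSystem API. This is a minimal sketch only: the file name, file size, and buffer size are assumed defaults, whereas the actual OHB benchmark exposes these (and the HDFS parameters) as options.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSequentialWriteLatencySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical defaults; a real benchmark would take these as parameters
        String fileName = args.length > 0 ? args[0] : "/benchmarks/ohb_swl.dat";
        long fileSizeMB = args.length > 1 ? Long.parseLong(args[1]) : 1024;
        int bufSize = 4 * 1024 * 1024;                 // 4 MB write buffer

        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[bufSize];

        long start = System.nanoTime();
        FSDataOutputStream out = fs.create(new Path(fileName), true /* overwrite */);
        long remaining = fileSizeMB * 1024 * 1024;
        while (remaining > 0) {
            int chunk = (int) Math.min(buf.length, remaining);
            out.write(buf, 0, chunk);                  // sequential writes through the HDFS client
            remaining -= chunk;
        }
        out.hsync();                                   // flush through the DataNode pipeline
        out.close();
        double sec = (System.nanoTime() - start) / 1e9;

        System.out.printf("Wrote %d MB in %.2f s (%.2f MB/s)%n",
                          fileSizeMB, sec, fileSizeMB / sec);
        fs.close();
    }
}
```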

 



OSU HiBD Micro-Benchmark (OHB) Suite - RPC

•  Two different micro-benchmarks to evaluate the performance of standalone Hadoop RPC
   –  Latency: Single Server, Single Client (see the sketch below)
   –  Throughput: Single Server, Multiple Clients
•  A simple script framework for job launching and resource monitoring
•  Calculates statistics like Min, Max, Average
•  Network configuration, tunable parameters, data type, CPU utilization

Component  | Network Address | Port | Data Type | Min Msg Size | Max Msg Size | No. of Iterations | Handlers | Verbose
lat_client | √               | √    | √         | √            | √            | √                 |          | √
lat_server | √               | √    |           |              |              |                   | √        | √

Component  | Network Address | Port | Data Type | Min Msg Size | Max Msg Size | No. of Iterations | No. of Clients | Handlers | Verbose
thr_client | √               | √    | √         | √            | √            | √                 |                |          | √
thr_server | √               | √    | √         |              |              |                   | √              | √        | √

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013.
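To make the latency methodology concrete (single server, single client, min/avg/max statistics over many iterations), the sketch below runs an echo service over Hadoop RPC and times the round trips. It is an illustrative sketch, not the OHB code: the EchoProtocol interface, payload type, port, handler count, and iteration counts are all assumptions, and it uses Hadoop's Writable-based RPC engine for simplicity.

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.WritableRpcEngine;

public class RpcLatencySketch {
    // Hypothetical echo protocol; OHB defines its own protocol and payload types
    public interface EchoProtocol {
        long versionID = 1L;                      // version checked by the Writable RPC engine
        LongWritable echo(LongWritable value);
    }

    // Trivial server-side implementation: return whatever was received
    public static class EchoImpl implements EchoProtocol {
        public LongWritable echo(LongWritable value) { return value; }
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 1 ? args[1] : "localhost";
        int port = 12345, warmup = 100, iters = 1000;

        Configuration conf = new Configuration();
        RPC.setProtocolEngine(conf, EchoProtocol.class, WritableRpcEngine.class);

        if (args.length > 0 && args[0].equals("server")) {
            RPC.Server server = new RPC.Builder(conf)
                .setProtocol(EchoProtocol.class)
                .setInstance(new EchoImpl())
                .setBindAddress("0.0.0.0")
                .setPort(port)
                .setNumHandlers(4)                // server-side handler threads
                .build();
            server.start();
            server.join();                        // serve until killed
        } else {
            EchoProtocol proxy = RPC.getProxy(EchoProtocol.class, EchoProtocol.versionID,
                                              new InetSocketAddress(host, port), conf);
            double min = Double.MAX_VALUE, max = 0, sum = 0;
            for (int i = 0; i < warmup + iters; i++) {
                long t0 = System.nanoTime();
                proxy.echo(new LongWritable(i));  // one RPC round trip
                double us = (System.nanoTime() - t0) / 1000.0;
                if (i >= warmup) {                // discard warm-up iterations
                    min = Math.min(min, us);
                    max = Math.max(max, us);
                    sum += us;
                }
            }
            System.out.printf("RPC latency (us): min=%.2f avg=%.2f max=%.2f%n",
                              min, sum / iters, max);
            RPC.stopProxy(proxy);
        }
    }
}
```

Run one instance with the argument server, then run the client against that host; the reported min/avg/max correspond to the kind of statistics the OHB RPC benchmark collects.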



OSU HiBD Micro-Benchmark (OHB) Suite - MapReduce

•  Evaluate the performance of stand-alone MapReduce
•  Does not require or involve HDFS or any other distributed file system
•  Considers various factors that influence the data shuffling phase
   –  underlying network configuration, number of map and reduce tasks, intermediate shuffle data pattern, shuffle data size, etc.
•  Three different micro-benchmarks based on intermediate shuffle data patterns (illustrated in the sketch below)
   –  MR-AVG micro-benchmark: intermediate data is evenly distributed among reduce tasks
   –  MR-RAND micro-benchmark: intermediate data is pseudo-randomly distributed among reduce tasks
   –  MR-SKEW micro-benchmark: intermediate data is unevenly distributed among reduce tasks

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014).
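One way to see what the three shuffle patterns mean in practice is through custom partitioners in an ordinary MapReduce job: the partitioner decides which reduce task receives each intermediate record, so it directly controls whether the shuffle is balanced or skewed. The classes below are an illustrative sketch under that assumption; they are not how the OHB MapReduce micro-benchmarks are implemented (those drive the shuffle of stand-alone MapReduce, without HDFS).

```java
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// MR-AVG-style: intermediate data spread evenly across reduce tasks
class AvgPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // same scheme as Hadoop's default HashPartitioner: even for uniform keys
        return (key.get() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// MR-RAND-style: intermediate data assigned pseudo-randomly to reduce tasks
class RandPartitioner extends Partitioner<IntWritable, Text> {
    private final Random rand = new Random(42);   // fixed seed keeps runs repeatable
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        return rand.nextInt(numReduceTasks);
    }
}

// MR-SKEW-style: intermediate data deliberately unbalanced (about half lands on reducer 0)
class SkewPartitioner extends Partitioner<IntWritable, Text> {
    private final Random rand = new Random(42);
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        if (numReduceTasks == 1 || rand.nextBoolean()) {
            return 0;                             // ~50% of the shuffle volume
        }
        return 1 + rand.nextInt(numReduceTasks - 1);
    }
}
```

A job would select one of these with job.setPartitionerClass(SkewPartitioner.class); with uniformly generated keys, AvgPartitioner approximates MR-AVG, RandPartitioner MR-RAND, and SkewPartitioner MR-SKEW.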



Future Plans of OSU High Performance Big Data Project

•  Upcoming releases of RDMA-enhanced packages will support
   –  Spark
   –  HBase
   –  Plugin-based designs
•  Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
   –  HDFS
   –  MapReduce
   –  RPC
•  Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
•  Advanced designs with upper-level changes and optimizations



Concluding Remarks

•  Presented an overview of Big Data processing middleware
•  Discussed challenges in accelerating Big Data middleware
•  Presented initial designs to take advantage of InfiniBand/RDMA for Hadoop, Spark, Memcached, and HBase
•  Presented challenges in designing benchmarks
•  Results are promising
•  Many other open issues need to be solved
•  Will enable the Big Data processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner



Personnel  Acknowledgments  


Current Students
–  A. Awan (Ph.D.)
–  A. Bhat (M.S.)
–  S. Chakraborthy (Ph.D.)
–  C.-H. Chu (Ph.D.)
–  N. Islam (Ph.D.)
–  M. Li (Ph.D.)
–  M. Rahman (Ph.D.)
–  D. Shankar (Ph.D.)
–  A. Venkatesh (Ph.D.)
–  J. Zhang (Ph.D.)

Current Senior Research Associates
–  K. Hamidouche
–  X. Lu

Current Post-Doc
–  J. Lin

Current Programmer
–  J. Perkins

Current Research Specialist
–  M. Arnold

Past Students
–  P. Balaji (Ph.D.)
–  S. Bhagvat (M.S.)
–  D. Buntinas (Ph.D.)
–  L. Chai (Ph.D.)
–  B. Chandrasekharan (M.S.)
–  N. Dandapanthula (M.S.)
–  V. Dhanraj (M.S.)
–  T. Gangadharappa (M.S.)
–  K. Gopalakrishnan (M.S.)
–  W. Huang (Ph.D.)
–  W. Jiang (M.S.)
–  J. Jose (Ph.D.)
–  K. Kandalla (Ph.D.)
–  S. Kini (M.S.)
–  M. Koop (Ph.D.)
–  S. Krishnamoorthy (M.S.)
–  R. Kumar (M.S.)
–  P. Lai (M.S.)
–  J. Liu (Ph.D.)
–  M. Luo (Ph.D.)
–  A. Mamidala (Ph.D.)
–  G. Marsh (M.S.)
–  V. Meshram (M.S.)
–  S. Naravula (Ph.D.)
–  R. Noronha (Ph.D.)
–  X. Ouyang (Ph.D.)
–  S. Pai (M.S.)
–  S. Potluri (Ph.D.)
–  R. Rajachandrasekar (Ph.D.)
–  G. Santhanaraman (Ph.D.)
–  A. Singh (Ph.D.)
–  J. Sridhar (M.S.)
–  H. Subramoni (Ph.D.)
–  S. Sur (Ph.D.)
–  K. Vaidyanathan (Ph.D.)
–  A. Vishnu (Ph.D.)
–  J. Wu (Ph.D.)
–  W. Yu (Ph.D.)

Past Post-Docs
–  X. Besseron
–  H.-W. Jin
–  M. Luo
–  E. Mancini
–  S. Marcarelli
–  J. Vienne
–  H. Wang

Past Research Scientist
–  S. Sur

Past Programmers
–  D. Bureddy
–  H. Subramoni



[email protected]­‐state.edu  

HPCAC  Switzerland  Conference  (Mar  '15)    

Thank  You!  

The  High-­‐Performance  Big  Data  Project  h<p://hibd.cse.ohio-­‐state.edu/

66  

Network-­‐Based  CompuFng  Laboratory  h<p://nowlab.cse.ohio-­‐state.edu/

The  MVAPICH2/MVAPICH2-­‐X  Project  h<p://mvapich.cse.ohio-­‐state.edu/



Call For Participation

International Workshop on High-Performance Big Data Computing (HPBDC 2015)

In conjunction with the International Conference on Distributed Computing Systems (ICDCS 2015)
Hilton Downtown, Columbus, Ohio, USA, Monday, June 29th, 2015

http://web.cse.ohio-state.edu/~luxi/hpbdc2015


Multiple Positions Available in My Group

•  Looking for bright and enthusiastic personnel to join as
   –  Post-Doctoral Researchers
   –  PhD Students
   –  Hadoop/Big Data Programmer/Software Engineer
   –  MPI Programmer/Software Engineer
•  If interested, please contact me at this conference and/or send an e-mail to [email protected]
