Accelerating Big Data Processing with Hadoop, Spark...

Accelerating Big Data Processing with Hadoop, Spark and Memcached. Dhabaleswar K. (DK) Panda, The Ohio State University. Email: [email protected], http://www.cse.ohio-state.edu/~panda. Talk at HPC Advisory Council Switzerland Conference (Mar '15).

Transcript of Accelerating Big Data Processing with Hadoop, Spark...

Page 1:

Accelerating Big Data Processing with Hadoop, Spark and Memcached

Dhabaleswar K. (DK) Panda, The Ohio State University

E-mail: [email protected], http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Switzerland Conference (Mar '15)


Page 2:

Introduction to Big Data Applications and Analytics

• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
• Commonly accepted 3V's of Big Data: Volume, Velocity, Variety (Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf)
• 5V's of Big Data: 3V + Value, Veracity

Page 3:

Data Management and Processing on Modern Clusters

• Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
  – Front-end data accessing and serving (Online): Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (Offline): HDFS, MapReduce, Spark

Figure: Internet traffic flows into a front-end tier of Web Servers backed by Memcached + DB (MySQL) and NoSQL DB (HBase) for data accessing and serving, and into a back-end tier of HDFS, MapReduce, and Spark for data analytics apps/jobs.

Page 4:

Overview of Apache Hadoop Architecture

• Open-source implementation of Google MapReduce, GFS, and BigTable for Big Data Analytics
• Hadoop Common Utilities (RPC, etc.), HDFS, MapReduce, YARN
• http://hadoop.apache.org

Hadoop 1.x stack: MapReduce (Cluster Resource Management & Data Processing) over the Hadoop Distributed File System (HDFS) and Hadoop Common/Core (RPC, ..).

Hadoop 2.x stack: MapReduce and other models (Data Processing) over YARN (Cluster Resource Management & Job Scheduling), the Hadoop Distributed File System (HDFS), and Hadoop Common/Core (RPC, ..). A minimal MapReduce word-count example follows below.
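To make the MapReduce programming model on this stack concrete, here is a small word-count job written against the standard Apache Hadoop 2.x Java API. It is an illustrative sketch added to the transcript, not part of the original slides.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: map emits (word, 1), reduce sums the counts.
public class WordCount {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (tok.isEmpty()) continue;
        word.set(tok);
        ctx.write(word, ONE);          // shuffled to reducers by key
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Under YARN (Hadoop 2.x), this same job is scheduled by the ResourceManager rather than the Hadoop 1.x JobTracker, without any change to the application code.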

Page 5:

Spark Architecture Overview

• An in-memory data-processing framework
  – Iterative machine learning jobs
  – Interactive data analytics
  – Scala-based implementation
  – Standalone, YARN, Mesos
• Scalable and communication intensive
  – Wide dependencies between Resilient Distributed Datasets (RDDs)
  – MapReduce-like shuffle operations to repartition RDDs (see the sketch after this slide)
  – Sockets-based communication

http://spark.apache.org

Figure: a Spark driver (SparkContext) coordinates with a Master (and Zookeeper) to schedule tasks on Worker nodes, which read from and write to HDFS.
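The shuffle operations listed above are what the RDMA design discussed later targets. As an illustration (not from the talk), the sketch below uses Spark's Java API to key an RDD and repartition it with groupByKey, a wide dependency that triggers a MapReduce-like shuffle; the input path and key extraction are assumptions made for this example.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// groupByKey triggers a MapReduce-like shuffle that repartitions the RDD.
public class GroupByExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("GroupByExample");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // hdfs:///data/input is a placeholder path for the example
    JavaRDD<String> lines = sc.textFile("hdfs:///data/input");

    // Key each record by its first token; values are the full lines.
    JavaPairRDD<String, String> keyed = lines.mapToPair(
        line -> new Tuple2<>(line.split("\\s+")[0], line));

    // Wide dependency: all values for a key are shuffled to one partition.
    JavaPairRDD<String, Iterable<String>> grouped = keyed.groupByKey();

    System.out.println("Distinct keys: " + grouped.count());
    sc.stop();
  }
}
```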

Page 6:

Memcached Architecture

• Three-layer architecture of Web 2.0
  – Web Servers, Memcached Servers, Database Servers
• Distributed caching layer
  – Allows aggregating spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls (a cache-aside sketch follows this slide)
• Scalable model, but typical usage is very network intensive

Figure: web frontend servers (Memcached clients) connect over the Internet and high-performance networks to Memcached servers and database servers; each server node has main memory, CPUs, SSD, and HDD.
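The "cache database queries" usage described above is the classic cache-aside pattern. The sketch below illustrates it with the spymemcached Java client and JDBC; the host names, table, key layout, and expiry time are assumptions made for this example and are not part of the talk.

```java
import java.net.InetSocketAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import net.spy.memcached.MemcachedClient;

// Cache-aside read: check Memcached first, fall back to MySQL on a miss,
// then populate the cache so subsequent reads are served from memory.
public class CacheAsideRead {
  public static String getUserName(MemcachedClient cache, Connection db, int userId)
      throws Exception {
    String key = "user:" + userId + ":name";      // assumed key layout
    Object cached = cache.get(key);
    if (cached != null) {
      return (String) cached;                     // cache hit
    }
    try (PreparedStatement ps =
             db.prepareStatement("SELECT name FROM users WHERE id = ?")) {
      ps.setInt(1, userId);
      try (ResultSet rs = ps.executeQuery()) {
        String name = rs.next() ? rs.getString(1) : null;
        if (name != null) {
          cache.set(key, 300, name);              // cache for 300 seconds
        }
        return name;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    MemcachedClient cache =
        new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
    Connection db =
        DriverManager.getConnection("jdbc:mysql://db-host/app", "user", "pass");
    System.out.println(getUserName(cache, db, 42));
    cache.shutdown();
    db.close();
  }
}
```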

Page 7:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the RDMA-Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 8:

Drivers for Modern HPC Clusters

• High End Computing (HEC) is growing dramatically
  – High Performance Computing
  – Big Data Computing
• Technology advancement
  – Multi-core/many-core technologies and accelerators
  – Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  – Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM)
  – Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

Page 9:

Interconnects and Protocols in the OpenFabrics Stack

Figure: application/middleware interface, protocol, adapter, and switch for each stack:
– 1/10/40 GigE: Sockets over kernel-space TCP/IP with the Ethernet driver, Ethernet adapter, Ethernet switch
– 10/40 GigE-TOE: Sockets over hardware-offloaded TCP/IP, Ethernet adapter, Ethernet switch
– IPoIB: Sockets over IPoIB, InfiniBand adapter, InfiniBand switch
– RSockets: Sockets over user-space RSockets, InfiniBand adapter, InfiniBand switch
– SDP: Sockets over user-space RDMA (SDP), InfiniBand adapter, InfiniBand switch
– iWARP: Sockets over user-space TCP/IP, iWARP adapter, Ethernet switch
– RoCE: Sockets over user-space RDMA, RoCE adapter, Ethernet switch
– IB Native: Verbs over user-space RDMA, InfiniBand adapter, InfiniBand switch

Page 10:

Wide Adoption of RDMA Technology

• Message Passing Interface (MPI) for HPC
• Parallel file systems
  – Lustre
  – GPFS
• Delivering excellent performance:
  – < 1.0 microsecond latency
  – 100 Gbps bandwidth
  – 5-10% CPU utilization
• Delivering excellent scalability

Page 11:

Challenges in Designing Communication and I/O Libraries for Big Data Systems

Figure: applications run over Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), which today uses sockets as its programming model (other protocols? upper-level changes?). Beneath it sits a communication and I/O library covering point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, the RDMA protocol, and benchmarks, built on networking technologies (InfiniBand, 1/10/40GigE and intelligent NICs), commodity computing system architectures (multi- and many-core architectures and accelerators), and storage technologies (HDD and SSD).

Page 12:

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

Current design: Application → Sockets → 1/10 GigE network.

• Sockets not designed for high performance
  – Stream semantics are often a mismatch for upper layers
  – Zero-copy not available for non-blocking sockets

Our approach: Application → OSU design → Verbs interface → 10 GigE or InfiniBand.

Page 13:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the RDMA-Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 14:

Overview of the HiBD Project and Releases

• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
• http://hibd.cse.ohio-state.edu
• User base: 95 organizations from 18 countries
• More than 2,900 downloads
• RDMA for Apache HBase and Spark will be available in the near future

Page 15:

RDMA for Apache Hadoop 2.x Distribution

• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Enhanced HDFS with in-memory and heterogeneous storage
  – High-performance design of MapReduce over Lustre
  – Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.6
  – Based on Apache Hadoop 2.6.0
  – Compliant with Apache Hadoop 2.6.0 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs, and Lustre
  – http://hibd.cse.ohio-state.edu

Page 16:

RDMA for Memcached Distribution

• High-performance design of Memcached over RDMA-enabled interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the Memcached and libMemcached components
  – Easily configurable for native InfiniBand, RoCE, and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
  – High-performance design of SSD-assisted hybrid memory
• Current release: 0.9.3
  – Based on Memcached 1.4.22 and libMemcached 1.0.18
  – Compliant with libMemcached APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • SSD
  – http://hibd.cse.ohio-state.edu

Page 17:

OSU HiBD Micro-Benchmark (OHB) Suite - Memcached

• Released in OHB 0.7.1 (ohb_memlat)
• Evaluates the performance of stand-alone Memcached
• Three different micro-benchmarks
  – SET micro-benchmark: micro-benchmark for Memcached set operations
  – GET micro-benchmark: micro-benchmark for Memcached get operations
  – MIX micro-benchmark: micro-benchmark for a mix of Memcached set/get operations (Read:Write ratio is 90:10)
• Calculates the average latency of Memcached operations (a sketch of the measurement loop follows this slide)
• Can measure throughput in transactions per second
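OHB itself is built on libMemcached in C. Purely to illustrate the measurement idea behind ohb_memlat, the following Java sketch (using the spymemcached client, with an assumed host, value size, and iteration count) times SET and GET operations and reports their average latency.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import net.spy.memcached.MemcachedClient;

// Average SET/GET latency for one value size, in the spirit of ohb_memlat.
// (OHB is a libMemcached/C benchmark; this is only an illustrative sketch.)
public class MemLatencySketch {
  public static void main(String[] args) throws Exception {
    MemcachedClient client =
        new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
    int iterations = 10000, valueSize = 1024;
    byte[] value = new byte[valueSize];
    Arrays.fill(value, (byte) 'x');

    double setUs = 0, getUs = 0;
    for (int i = 0; i < iterations; i++) {
      String key = "ohb:" + i;
      long t0 = System.nanoTime();
      client.set(key, 0, value).get();        // wait for the SET to complete
      long t1 = System.nanoTime();
      client.get(key);                        // synchronous GET
      long t2 = System.nanoTime();
      setUs += (t1 - t0) / 1000.0;
      getUs += (t2 - t1) / 1000.0;
    }
    System.out.printf("avg SET latency: %.2f us%n", setUs / iterations);
    System.out.printf("avg GET latency: %.2f us%n", getUs / iterations);
    client.shutdown();
  }
}
```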

Page 18:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 19:

Acceleration Case Studies and In-Depth Performance Evaluation

• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – Spark

Page 20:

Design Overview of HDFS with RDMA

• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based HDFS with the communication library written in native code (a hypothetical bridge sketch follows this slide)
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support

Figure: applications use HDFS; in the OSU design, writes go through the Java Native Interface (JNI) to verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..), while other operations use the Java socket interface over 1/10 GigE or IPoIB networks.
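The JNI bridge mentioned above can be pictured with a small sketch. The class, native methods, and library name below are hypothetical and do not correspond to the actual RDMA-for-Apache-Hadoop sources; they only illustrate how Java code can hand direct buffers to a native communication library.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a JNI bridge between Java HDFS code and a native
// RDMA communication library; names do not correspond to the real package.
public class RdmaBridgeSketch {
  static {
    System.loadLibrary("rdmabridge");   // loads librdmabridge.so (native side)
  }

  // Implemented in C against the verbs API on the native side.
  private native long connect(String host, int port);
  private native int writeBlock(long connHandle, ByteBuffer directBuf, int len);
  private native void close(long connHandle);

  public void sendBlock(String dataNodeHost, int port, byte[] block) {
    // Direct buffers avoid an extra copy when handed to native code.
    ByteBuffer buf = ByteBuffer.allocateDirect(block.length);
    buf.put(block).flip();
    long conn = connect(dataNodeHost, port);
    try {
      writeBlock(conn, buf, block.length);
    } finally {
      close(conn);
    }
  }
}
```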

Page 21:

Communication Times in HDFS

• Cluster with HDD DataNodes
  – 30% improvement in communication time over IPoIB (QDR)
  – 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes

Figure: communication time (s) vs. file size (2-10 GB) for 10GigE, IPoIB (QDR), and OSU-IB (QDR); OSU-IB reduces communication time by 30%.

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014

Page 22:

Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)

• Design features
  – Three modes: Default (HHH), In-Memory (HHH-M), Lustre-Integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices (RAM, SSD, HDD, Lustre)
  – Eviction/promotion based on data usage pattern
  – Hybrid replication
  – Lustre-integrated mode: Lustre-based fault-tolerance

Figure: applications sit on top of Triple-H, which applies data placement policies, eviction/promotion, and hybrid replication across RAM disk, SSD, HDD, and Lustre.

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid '15, May 2015

Page 23:

Enhanced HDFS with In-Memory and Heterogeneous Storage: Three Modes

• HHH (default): heterogeneous storage devices with hybrid replication schemes
  – I/O operations over RAM disk, SSD, and HDD
  – Hybrid replication (in-memory and persistent storage)
  – Better fault-tolerance as well as performance
• HHH-M: high-performance in-memory I/O operations
  – Memory replication (in-memory only, with lazy persistence)
  – As much performance benefit as possible
• HHH-L: Lustre-integrated
  – Takes advantage of the Lustre available in HPC clusters
  – Lustre-based fault-tolerance (no HDFS replication)
  – Reduced local storage space usage

Page 24:

Performance Improvement on TACC Stampede (HHH)

• For 160GB TestDFSIO on 32 nodes
  – Write throughput: 7x improvement over IPoIB (FDR)
  – Read throughput: 2x improvement over IPoIB (FDR)
• For 120GB RandomWriter on 32 nodes
  – 3x improvement over IPoIB (QDR)

Figure (TestDFSIO): total throughput (MBps) for write and read with IPoIB (FDR) vs. OSU-IB (FDR); write increased by 7x, read by 2x. Figure (RandomWriter): execution time (s) for cluster size:data size of 8:30, 16:60, and 32:120 with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 3x.

Page 25:

Performance Improvement on SDSC Gordon (HHH-L)

• For 60GB Sort on 8 nodes
  – 24% improvement over default HDFS
  – 54% improvement over Lustre
  – 33% storage space saving compared to default HDFS

Figure (Sort): execution time (s) for 20, 40, and 60 GB with HDFS-IPoIB (QDR), Lustre-IPoIB (QDR), and OSU-IB (QDR); reduced by 54%.

Storage space for 60GB Sort: HDFS-IPoIB (QDR): 360 GB; Lustre-IPoIB (QDR): 120 GB; OSU-IB (QDR): 240 GB.

Page 26:

Evaluation with PUMA and CloudBurst (HHH-L/HHH)

• PUMA on OSU RI
  – SequenceCount with HHH-L: 17% benefit over Lustre, 8% over HDFS
  – Grep with HHH: 29.5% benefit over Lustre, 13.2% over HDFS
• CloudBurst on TACC Stampede
  – With HHH: 19% improvement over HDFS

Figure (PUMA): execution time (s) for SequenceCount and Grep with HDFS-IPoIB (QDR), Lustre-IPoIB (QDR), and OSU-IB (QDR); reduced by 17% and 29.5%, respectively. Figure (CloudBurst): HDFS-IPoIB (FDR) 60.24 s vs. OSU-IB (FDR) 48.3 s.

Page 27:

Acceleration Case Studies and In-Depth Performance Evaluation

• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – Spark

Page 28:

Design Overview of MapReduce with RDMA

• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Java-based MapReduce with the communication library written in native code
• Design features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support

Figure: applications drive MapReduce (Job Tracker, Task Tracker, Map, Reduce); the OSU design routes the shuffle through the Java Native Interface (JNI) to verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..), while the Java socket interface is retained over 1/10 GigE or IPoIB networks.

Page 29:

Advanced Overlapping among Different Phases

• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases, compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment

Figure: default architecture vs. enhanced overlapping vs. advanced overlapping of the map, shuffle, merge, and reduce phases.

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

Page 30:

Performance Evaluation of Sort and TeraSort

• For 240GB Sort on 64 nodes (512 cores)
  – 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For 320GB TeraSort on 64 nodes (1K cores)
  – 38% improvement over IPoIB (FDR) with HDD used for HDFS

Figure (Sort on the OSU cluster): job execution time (s) for 60/120/240 GB on 16/32/64 nodes with IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR); reduced by 40%. Figure (TeraSort on TACC Stampede): job execution time (s) for 80/160/320 GB on 16/32/64 nodes with IPoIB (FDR), UDA-IB (FDR), and OSU-IB (FDR); reduced by 38%.

Page 31:

Evaluations using PUMA Workload

• 50% improvement in SelfJoin over IPoIB (QDR) for 80 GB data size
• 49% improvement in SequenceCount over IPoIB (QDR) for 30 GB data size

Figure: normalized execution time for AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB) with 10GigE, IPoIB (QDR), and OSU-IB (QDR); SelfJoin reduced by 50% and SeqCount by 49%.

Page 32:

Optimize Hadoop YARN MapReduce over Parallel File Systems

• HPC cluster deployment
  – Hybrid topological solution of a Beowulf architecture with separate I/O nodes
  – Lean compute nodes with a light OS; more memory space; small local storage
  – Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre
• MapReduce over Lustre
  – Local disk is used as the intermediate data directory
  – Lustre is used as the intermediate data directory

Figure (Lustre setup): compute nodes run the App Master, Map, and Reduce tasks with a Lustre client; metadata servers and object storage servers provide the parallel file system.

Page 33:

Design Overview of Shuffle Strategies for MapReduce over Lustre

• Design features
  – Two shuffle approaches: Lustre-read-based shuffle and RDMA-based shuffle
  – Hybrid shuffle algorithm to take benefit from both shuffle approaches
  – Dynamically adapts to the better shuffle approach for each shuffle request, based on profiling values for each Lustre read operation (a policy sketch follows this slide)
  – In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design

Figure: Map 1/2/3 write intermediate data to Lustre; Reduce 1 and Reduce 2 fetch it via Lustre read or RDMA, followed by in-memory merge/sort and reduce.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015
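A minimal sketch of the dynamic adaptation idea described above, assuming the policy compares a running average of profiled Lustre read times against an estimated RDMA transfer time for each shuffle request; the class, smoothing factor, and decision rule are hypothetical and are not the published design.

```java
// Hypothetical policy sketch: pick a shuffle path per request by comparing a
// running profile of Lustre read time against an estimated RDMA transfer time.
public class HybridShuffleChooser {
  enum Path { LUSTRE_READ, RDMA }

  private double avgLustreReadUs = 0;   // exponentially weighted moving average
  private final double alpha = 0.2;     // smoothing factor (assumed)

  public void recordLustreRead(double observedUs) {
    avgLustreReadUs = (avgLustreReadUs == 0)
        ? observedUs
        : alpha * observedUs + (1 - alpha) * avgLustreReadUs;
  }

  public Path choose(long bytes, double rdmaBandwidthMBps) {
    // Rough estimate of what RDMA would cost for this shuffle request.
    double rdmaUs = (bytes / (rdmaBandwidthMBps * 1e6)) * 1e6;
    return (avgLustreReadUs > 0 && avgLustreReadUs < rdmaUs)
        ? Path.LUSTRE_READ : Path.RDMA;
  }
}
```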

Page 34:

Performance Improvement of MapReduce over Lustre on TACC-Stampede

• Local disk is used as the intermediate data directory
• For 500GB Sort on 64 nodes
  – 44% improvement over IPoIB (FDR)
• For 640GB Sort on 128 nodes
  – 48% improvement over IPoIB (FDR)

Figure: job execution time (s) for 300/400/500 GB Sort with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 44%. Figure (scale-out): job execution time (s) for 20 GB to 640 GB Sort on 4 to 128 nodes with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 48%.

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014

Page 35:

Case Study: Performance Improvement of MapReduce over Lustre on SDSC-Gordon

• Lustre is used as the intermediate data directory
• For 80GB Sort on 8 nodes
  – 34% improvement over IPoIB (QDR)
• For 120GB TeraSort on 16 nodes
  – 25% improvement over IPoIB (QDR)

Figure (Sort): job execution time (s) for 40/60/80 GB with IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR); reduced by 34%. Figure (TeraSort): job execution time (s) for 40/80/120 GB with the same configurations; reduced by 25%.

Page 36:

Acceleration Case Studies and In-Depth Performance Evaluation

• RDMA-based designs and performance evaluation
  – HDFS
  – MapReduce
  – Spark

Page 37:

Design Overview of Spark with RDMA

• Design features
  – RDMA-based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking and out-of-order data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication, while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with the communication library written in native code

Figure: Spark applications (Scala/Java/Python) run tasks through the BlockManager/BlockFetcherIterator; the shuffle is served by the default Java NIO shuffle server/fetcher, the optional Netty shuffle server/fetcher, or the plug-in RDMA shuffle server/fetcher. The socket paths use 1/10 GigE or IPoIB (QDR/FDR) networks, while the RDMA-based shuffle engine (Java/JNI) uses native InfiniBand (QDR/FDR).

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014

Page 38:

Preliminary Results of Spark-RDMA Design: GroupBy

• Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
  – 18% improvement over IPoIB (QDR) for 10GB data size
• Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
  – 20% improvement over IPoIB (QDR) for 20GB data size

Figure (4 HDD nodes, GroupBy with 32 cores): GroupBy time (sec) for 4-10 GB with 10GigE, IPoIB, and RDMA; reduced by 18%. Figure (8 HDD nodes, GroupBy with 64 cores): GroupBy time (sec) for 8-20 GB; reduced by 20%.

Page 39:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 40:

Performance Benefits: TestDFSIO on TACC Stampede (HHH)

Cluster with 32 nodes with HDD, TestDFSIO with a total of 128 maps.

• For throughput: 6-7.8x improvement over IPoIB for 80-120 GB file size
• For latency: 2.5-3x improvement over IPoIB for 80-120 GB file size

Figure (throughput): total throughput (MBps) for 80/100/120 GB with IPoIB (FDR) vs. OSU-IB (FDR); increased by up to 7.8x. Figure (latency): execution time (s) for the same sizes; reduced by up to 2.5x.

Page 41:

Performance Benefits: RandomWriter & TeraGen on TACC-Stampede (HHH)

Cluster with 32 nodes with a total of 128 maps.

• RandomWriter: 3-4x improvement over IPoIB for 80-120 GB file size
• TeraGen: 4-5x improvement over IPoIB for 80-120 GB file size

Figure (RandomWriter): execution time (s) for 80/100/120 GB with IPoIB (FDR) vs. OSU-IB (FDR); reduced by 3x. Figure (TeraGen): execution time (s) for the same sizes; reduced by 4x.

Page 42:

Performance Benefits: Sort & TeraSort on TACC-Stampede (HHH)

Cluster configurations: 32 nodes with a total of 128 maps and 64 reduces, and 32 nodes with a total of 128 maps and 57 reduces.

• Sort with a single HDD per node: 40-52% improvement over IPoIB for 80-120 GB data
• TeraSort with a single HDD per node: 42-44% improvement over IPoIB for 80-120 GB data

Figure (Sort): execution time (s) for 80/100/120 GB with IPoIB (FDR) vs. OSU-IB (FDR); reduced by up to 52%. Figure (TeraSort): execution time (s) for the same sizes; reduced by up to 44%.

Page 43:

Performance Benefits: TestDFSIO on TACC Stampede and SDSC Gordon (HHH-M)

• TestDFSIO write on TACC Stampede (32 nodes with HDD, a total of 128 maps): 28x improvement over IPoIB for 60-100 GB file size
• TestDFSIO write on SDSC Gordon (16 nodes with SSD, a total of 64 maps): 6x improvement over IPoIB for 60-100 GB file size

Figure (TACC Stampede): total throughput (MBps) for 60/80/100 GB with IPoIB (FDR) vs. OSU-IB (FDR); increased by 28x. Figure (SDSC Gordon): total throughput (MBps) for the same sizes with IPoIB (QDR) vs. OSU-IB (QDR); increased by 6x.

Page 44:

Performance Benefits: TestDFSIO and Sort on SDSC-Gordon (HHH-L)

Cluster with 16 nodes.

• TestDFSIO for 80GB data size
  – Write: 9x improvement over HDFS
  – Read: 29% improvement over Lustre
• Sort
  – Up to 28% improvement over HDFS
  – Up to 50% improvement over Lustre

Figure (TestDFSIO): total throughput (MBps) for write and read with HDFS, Lustre, and OSU-IB; write increased by 9x, read by 29%. Figure (Sort): execution time (s) for 40/60/80 GB with HDFS, Lustre, and OSU-IB; reduced by 50%.

Page 45:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 46:

Memcached-RDMA Design

• Server and client perform a negotiation protocol
  – The master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads
• Native IB-verbs-level design and evaluation with
  – Server: Memcached (http://memcached.org)
  – Client: libmemcached (http://libmemcached.org)
  – Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)

Figure: sockets clients are handed to sockets worker threads and RDMA clients to verbs worker threads by the master thread; all worker threads operate on shared data (memory slabs, items, ...).

Page 47:

Memcached Performance (FDR Interconnect)

Experiments on TACC Stampede (Intel Sandy Bridge cluster, IB: FDR).

• Memcached GET latency
  – 4 bytes: OSU-IB 2.84 us, IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us, IPoIB 123.42 us
  – Latency reduced by nearly 20x
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec
  – Nearly 2x improvement in throughput

Figure (GET latency): time (us) vs. message size (1 byte to 4 KB) for OSU-IB (FDR) and IPoIB (FDR). Figure (throughput): thousands of transactions per second (TPS) vs. number of clients (16 to 4080).

Page 48:

Micro-benchmark Evaluation for OLDP Workloads

• Illustration with a Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
• Memcached-RDMA can
  – improve query latency by up to 66% over IPoIB (32Gbps)
  – improve throughput by up to 69% over IPoIB (32Gbps)

Figure (latency): latency (sec) vs. number of clients (64-400) for Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps); reduced by 66%. Figure (throughput): throughput (Kq/s) vs. number of clients for the same two configurations.

D. Shankar, X. Lu, J. Jose, M. W. Rahman, N. Islam, and D. K. Panda, Can RDMA Benefit On-Line Data Processing Workloads with Memcached and MySQL, ISPASS'15

Page 49:

Performance Benefits on SDSC-Gordon: OHB Latency & Throughput Micro-Benchmarks

• ohb_memlat & ohb_memthr latency & throughput micro-benchmarks
• Memcached-RDMA can
  – improve query latency by up to 70% over IPoIB (32Gbps)
  – improve throughput by up to 2x over IPoIB (32Gbps)
  – no overhead in using hybrid mode when all data can fit in memory

Figure (latency): average latency (us) vs. message size (bytes) for IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hyb (32Gbps). Figure (throughput): throughput (million trans/sec) vs. number of clients (64-1024) for IPoIB (32Gbps), RDMA-Mem (32Gbps), and RDMA-Hybrid (32Gbps); up to 2x improvement.

Page 50:

Performance Benefits on OSU-RI-SSD: OHB Micro-benchmark for Hybrid Memcached

ohb_memhybrid, uniform access pattern, single client and single server with 64MB.

• Success rate of in-memory vs. hybrid SSD-memory for different spill factors
  – 100% success rate for the hybrid design, while that of pure in-memory degrades
• Average latency with penalty for in-memory vs. hybrid SSD-assisted mode for a spill factor of 1.5
  – Up to 53% improvement over in-memory, with a server miss penalty as low as 1.5 ms

Figure (success rate): success rate vs. spill factor (1-1.45) for RDMA-Mem-Uniform and RDMA-Hybrid. Figure (latency): latency with penalty (us) vs. message size (512 bytes to 128 KB) for RDMA-Mem (32Gbps) and RDMA-Hybrid (32Gbps); up to 53% improvement.

Page 51:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 52:

HBase-RDMA Design Overview

• JNI layer bridges Java-based HBase with the communication library written in native code
• Enables high-performance RDMA communication, while supporting the traditional socket interface

Figure: applications drive HBase; the OSU-IB design goes through the Java Native Interface (JNI) to IB verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ..), while the Java socket interface is retained over 1/10 GigE or IPoIB networks.

Page 53:

HBase: YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Serving Benchmark; a Get/Put sketch follows this slide)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients

Figure: read latency and write latency (us) vs. number of clients (8-128) for 10GigE, IPoIB (QDR), and OSU-IB (QDR).

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12
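For reference, a YCSB-style single-row update and read against HBase looks like the sketch below, written against the classic HBase Java client API of that era; the table name "usertable", column family, row key, and field are assumptions made for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Single-row Put (write) and Get (read) with per-operation latency timing.
public class HBaseReadWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "usertable");   // assumed table name

    // Put: write one field of a row.
    Put put = new Put(Bytes.toBytes("user1000"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("field0"), Bytes.toBytes("value0"));
    long t0 = System.nanoTime();
    table.put(put);
    System.out.printf("Put latency: %.1f us%n", (System.nanoTime() - t0) / 1e3);

    // Get: read the same field back.
    Get get = new Get(Bytes.toBytes("user1000"));
    get.addColumn(Bytes.toBytes("f"), Bytes.toBytes("field0"));
    t0 = System.nanoTime();
    Result r = table.get(get);
    System.out.printf("Get latency: %.1f us%n", (System.nanoTime() - t0) / 1e3);
    System.out.println(
        Bytes.toString(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("field0"))));

    table.close();
  }
}
```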

Page 54:

Presentation Outline

• Challenges for Accelerating Big Data Processing
• The High-Performance Big Data (HiBD) Project
• RDMA-based designs for Apache Hadoop and Spark
  – Case studies with HDFS, MapReduce, and Spark
  – Sample performance numbers for the Hadoop 2.x 0.9.6 release
• RDMA-based designs for Memcached and HBase
  – RDMA-based Memcached with SSD-assisted Hybrid Memory
  – RDMA-based HBase
• Challenges in Designing Benchmarks for Big Data Processing
  – OSU HiBD Benchmarks
• Conclusion and Q&A

Page 55:

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

Figure: the same layered stack as before: applications over Big Data middleware (HDFS, MapReduce, HBase, Spark and Memcached), a sockets programming model (other protocols? upper-level changes?), a communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, RDMA protocol, benchmarks), and the underlying networking technologies, commodity computing system architectures, and storage technologies.

Page 56:

Are the Current Benchmarks Sufficient for Big Data Management and Processing?

• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

Page 57:

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack

Page 58:

Challenges in Benchmarking of RDMA-based Designs

Figure: in the same layered stack, current benchmarks exist only at the applications level, while no benchmarks exist for the communication and I/O library (point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance, RDMA protocols, benchmarks) beneath the Big Data middleware; the open question is how to correlate the two.

Page 59:

Iterative Process: Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications

Figure: in the layered stack, applications-level benchmarks sit above the Big Data middleware and micro-benchmarks target the communication and I/O library; benchmarking next-generation systems is an iterative process between the two levels.

Page 60:

•  Evaluate  the  performance  of  standalone  HDFS    

•  Five  different  benchmarks  –  SequenFal  Write  Latency  (SWL)  

–  SequenFal  or  Random  Read  Latency  (SRL  or  RRL)  

–  SequenFal  Write  Throughput  (SWT)  

–  SequenFal  Read  Throughput  (SRT)  

–  SequenFal  Read-­‐Write  Throughput  (SRWT)  

HPCAC  Switzerland  Conference  (Mar  '15)    

OSU  HiBD  Micro-­‐Benchmark  (OHB)  Suite  -­‐  HDFS  

Benchmark | File Name | File Size | HDFS Parameter | Readers | Writers | Random/Sequential Read | Seek Interval
SWL       | √         | √         | √              |         |         |                        |
SRL/RRL   | √         | √         | √              |         |         | √                      | √ (RRL)
SWT       |           | √         | √              |         | √       |                        |
SRT       |           | √         | √              | √       |         |                        |
SRWT      |           | √         | √              | √       | √       |                        |

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012.
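As a concrete illustration of what an SWL-style test measures, the sketch below times a single sequential write of a given size through the standard Hadoop FileSystem API. This is a minimal sketch only: the file name, file size, and buffer size are assumed defaults, whereas the actual OHB benchmark exposes these (and the HDFS parameters) as options.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSequentialWriteLatencySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical defaults; a real benchmark would take these as parameters
        String fileName = args.length > 0 ? args[0] : "/benchmarks/ohb_swl.dat";
        long fileSizeMB = args.length > 1 ? Long.parseLong(args[1]) : 1024;
        int bufSize = 4 * 1024 * 1024;                 // 4 MB write buffer

        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[bufSize];

        long start = System.nanoTime();
        FSDataOutputStream out = fs.create(new Path(fileName), true /* overwrite */);
        long remaining = fileSizeMB * 1024 * 1024;
        while (remaining > 0) {
            int chunk = (int) Math.min(buf.length, remaining);
            out.write(buf, 0, chunk);                  // sequential writes through the HDFS client
            remaining -= chunk;
        }
        out.hsync();                                   // flush through the DataNode pipeline
        out.close();
        double sec = (System.nanoTime() - start) / 1e9;

        System.out.printf("Wrote %d MB in %.2f s (%.2f MB/s)%n",
                          fileSizeMB, sec, fileSizeMB / sec);
        fs.close();
    }
}
```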

 



OSU HiBD Micro-Benchmark (OHB) Suite - RPC

•  Two different micro-benchmarks to evaluate the performance of standalone Hadoop RPC
   –  Latency: Single Server, Single Client (see the sketch below)
   –  Throughput: Single Server, Multiple Clients
•  A simple script framework for job launching and resource monitoring
•  Calculates statistics like Min, Max, Average
•  Network configuration, tunable parameters, data type, CPU utilization

Component  | Network Address | Port | Data Type | Min Msg Size | Max Msg Size | No. of Iterations | Handlers | Verbose
lat_client | √               | √    | √         | √            | √            | √                 |          | √
lat_server | √               | √    |           |              |              |                   | √        | √

Component  | Network Address | Port | Data Type | Min Msg Size | Max Msg Size | No. of Iterations | No. of Clients | Handlers | Verbose
thr_client | √               | √    | √         | √            | √            | √                 |                |          | √
thr_server | √               | √    | √         |              |              |                   | √              | √        | √

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013.
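To make the latency methodology concrete (single server, single client, min/avg/max statistics over many iterations), the sketch below runs an echo service over Hadoop RPC and times the round trips. It is an illustrative sketch, not the OHB code: the EchoProtocol interface, payload type, port, handler count, and iteration counts are all assumptions, and it uses Hadoop's Writable-based RPC engine for simplicity.

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.WritableRpcEngine;

public class RpcLatencySketch {
    // Hypothetical echo protocol; OHB defines its own protocol and payload types
    public interface EchoProtocol {
        long versionID = 1L;                      // version checked by the Writable RPC engine
        LongWritable echo(LongWritable value);
    }

    // Trivial server-side implementation: return whatever was received
    public static class EchoImpl implements EchoProtocol {
        public LongWritable echo(LongWritable value) { return value; }
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 1 ? args[1] : "localhost";
        int port = 12345, warmup = 100, iters = 1000;

        Configuration conf = new Configuration();
        RPC.setProtocolEngine(conf, EchoProtocol.class, WritableRpcEngine.class);

        if (args.length > 0 && args[0].equals("server")) {
            RPC.Server server = new RPC.Builder(conf)
                .setProtocol(EchoProtocol.class)
                .setInstance(new EchoImpl())
                .setBindAddress("0.0.0.0")
                .setPort(port)
                .setNumHandlers(4)                // server-side handler threads
                .build();
            server.start();
            server.join();                        // serve until killed
        } else {
            EchoProtocol proxy = RPC.getProxy(EchoProtocol.class, EchoProtocol.versionID,
                                              new InetSocketAddress(host, port), conf);
            double min = Double.MAX_VALUE, max = 0, sum = 0;
            for (int i = 0; i < warmup + iters; i++) {
                long t0 = System.nanoTime();
                proxy.echo(new LongWritable(i));  // one RPC round trip
                double us = (System.nanoTime() - t0) / 1000.0;
                if (i >= warmup) {                // discard warm-up iterations
                    min = Math.min(min, us);
                    max = Math.max(max, us);
                    sum += us;
                }
            }
            System.out.printf("RPC latency (us): min=%.2f avg=%.2f max=%.2f%n",
                              min, sum / iters, max);
            RPC.stopProxy(proxy);
        }
    }
}
```

Run one instance with the argument server, then run the client against that host; the reported min/avg/max correspond to the kind of statistics the OHB RPC benchmark collects.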



OSU HiBD Micro-Benchmark (OHB) Suite - MapReduce

•  Evaluate the performance of stand-alone MapReduce
•  Does not require or involve HDFS or any other distributed file system
•  Considers various factors that influence the data shuffling phase
   –  underlying network configuration, number of map and reduce tasks, intermediate shuffle data pattern, shuffle data size, etc.
•  Three different micro-benchmarks based on intermediate shuffle data patterns (illustrated in the sketch below)
   –  MR-AVG micro-benchmark: intermediate data is evenly distributed among reduce tasks
   –  MR-RAND micro-benchmark: intermediate data is pseudo-randomly distributed among reduce tasks
   –  MR-SKEW micro-benchmark: intermediate data is unevenly distributed among reduce tasks

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014).
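One way to see what the three shuffle patterns mean in practice is through custom partitioners in an ordinary MapReduce job: the partitioner decides which reduce task receives each intermediate record, so it directly controls whether the shuffle is balanced or skewed. The classes below are an illustrative sketch under that assumption; they are not how the OHB MapReduce micro-benchmarks are implemented (those drive the shuffle of stand-alone MapReduce, without HDFS).

```java
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// MR-AVG-style: intermediate data spread evenly across reduce tasks
class AvgPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        // same scheme as Hadoop's default HashPartitioner: even for uniform keys
        return (key.get() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// MR-RAND-style: intermediate data assigned pseudo-randomly to reduce tasks
class RandPartitioner extends Partitioner<IntWritable, Text> {
    private final Random rand = new Random(42);   // fixed seed keeps runs repeatable
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        return rand.nextInt(numReduceTasks);
    }
}

// MR-SKEW-style: intermediate data deliberately unbalanced (about half lands on reducer 0)
class SkewPartitioner extends Partitioner<IntWritable, Text> {
    private final Random rand = new Random(42);
    @Override
    public int getPartition(IntWritable key, Text value, int numReduceTasks) {
        if (numReduceTasks == 1 || rand.nextBoolean()) {
            return 0;                             // ~50% of the shuffle volume
        }
        return 1 + rand.nextInt(numReduceTasks - 1);
    }
}
```

A job would select one of these with job.setPartitionerClass(SkewPartitioner.class); with uniformly generated keys, AvgPartitioner approximates MR-AVG, RandPartitioner MR-RAND, and SkewPartitioner MR-SKEW.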



Future Plans of OSU High Performance Big Data Project

•  Upcoming releases of RDMA-enhanced packages will support
   –  Spark
   –  HBase
   –  Plugin-based designs
•  Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
   –  HDFS
   –  MapReduce
   –  RPC
•  Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
•  Advanced designs with upper-level changes and optimizations



Concluding Remarks

•  Presented an overview of Big Data processing middleware
•  Discussed challenges in accelerating Big Data middleware
•  Presented initial designs to take advantage of InfiniBand/RDMA for Hadoop, Spark, Memcached, and HBase
•  Presented challenges in designing benchmarks
•  Results are promising
•  Many other open issues need to be solved
•  Will enable the Big Data processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner



Personnel  Acknowledgments  


Current Students
–  A. Awan (Ph.D.)
–  A. Bhat (M.S.)
–  S. Chakraborthy (Ph.D.)
–  C.-H. Chu (Ph.D.)
–  N. Islam (Ph.D.)
–  M. Li (Ph.D.)
–  M. Rahman (Ph.D.)
–  D. Shankar (Ph.D.)
–  A. Venkatesh (Ph.D.)
–  J. Zhang (Ph.D.)

Current Senior Research Associates
–  K. Hamidouche
–  X. Lu

Current Post-Doc
–  J. Lin

Current Programmer
–  J. Perkins

Current Research Specialist
–  M. Arnold

Past Students
–  P. Balaji (Ph.D.)
–  S. Bhagvat (M.S.)
–  D. Buntinas (Ph.D.)
–  L. Chai (Ph.D.)
–  B. Chandrasekharan (M.S.)
–  N. Dandapanthula (M.S.)
–  V. Dhanraj (M.S.)
–  T. Gangadharappa (M.S.)
–  K. Gopalakrishnan (M.S.)
–  W. Huang (Ph.D.)
–  W. Jiang (M.S.)
–  J. Jose (Ph.D.)
–  K. Kandalla (Ph.D.)
–  S. Kini (M.S.)
–  M. Koop (Ph.D.)
–  S. Krishnamoorthy (M.S.)
–  R. Kumar (M.S.)
–  P. Lai (M.S.)
–  J. Liu (Ph.D.)
–  M. Luo (Ph.D.)
–  A. Mamidala (Ph.D.)
–  G. Marsh (M.S.)
–  V. Meshram (M.S.)
–  S. Naravula (Ph.D.)
–  R. Noronha (Ph.D.)
–  X. Ouyang (Ph.D.)
–  S. Pai (M.S.)
–  S. Potluri (Ph.D.)
–  R. Rajachandrasekar (Ph.D.)
–  G. Santhanaraman (Ph.D.)
–  A. Singh (Ph.D.)
–  J. Sridhar (M.S.)
–  H. Subramoni (Ph.D.)
–  S. Sur (Ph.D.)
–  K. Vaidyanathan (Ph.D.)
–  A. Vishnu (Ph.D.)
–  J. Wu (Ph.D.)
–  W. Yu (Ph.D.)

Past Post-Docs
–  X. Besseron
–  H.-W. Jin
–  M. Luo
–  E. Mancini
–  S. Marcarelli
–  J. Vienne
–  H. Wang

Past Research Scientist
–  S. Sur

Past Programmers
–  D. Bureddy
–  H. Subramoni



[email protected]­‐state.edu  

HPCAC  Switzerland  Conference  (Mar  '15)    

Thank  You!  

The  High-­‐Performance  Big  Data  Project  h<p://hibd.cse.ohio-­‐state.edu/

66  

Network-­‐Based  CompuFng  Laboratory  h<p://nowlab.cse.ohio-­‐state.edu/

The  MVAPICH2/MVAPICH2-­‐X  Project  h<p://mvapich.cse.ohio-­‐state.edu/



Call For Participation

International Workshop on High-Performance Big Data Computing (HPBDC 2015)

In conjunction with the International Conference on Distributed Computing Systems (ICDCS 2015)
Hilton Downtown, Columbus, Ohio, USA, Monday, June 29th, 2015

http://web.cse.ohio-state.edu/~luxi/hpbdc2015


Multiple Positions Available in My Group

•  Looking for bright and enthusiastic personnel to join as
   –  Post-Doctoral Researchers
   –  PhD Students
   –  Hadoop/Big Data Programmer/Software Engineer
   –  MPI Programmer/Software Engineer
•  If interested, please contact me at this conference and/or send an e-mail to [email protected]
