Page 1:

The HDF Group

www.hdfgroup.org

Use of a New I/O Stack for Extreme-scale Systems in Scientific Applications

Michael Breitenfeld, Neil Fortner, Jerome Soumagne (The HDF Group)

Mohamad Chaarawi (Intel); Quincey Koziol (Lawrence Berkeley National Laboratory)

11/14/2016

Collaborators: Intel and LBNL
PDSW-DISC 2016 - WIP

Page 2:

ESSIO Storage Architecture


• Compute Node NVRAM
  • Hot data
  • High valence & velocity
  • Brute-force, ad-hoc analysis
  • Extreme scale-out
  • Full fabric bandwidth: O(1 PB/s) → O(10 PB/s)
  • Extremely low fabric & NVRAM latency
  • Extreme fine grain
  • New programming models
• I/O Node NVRAM/SSD
  • Semi-hot data / staging buffer
  • Fractional fabric bandwidth: O(10 TB/s) → O(100 TB/s)
• Parallel Filesystem NVRAM/SSD/Disk
  • Site-wide shared warm storage
  • SAN limited: O(1 TB/s) → O(10 TB/s)

[Diagram: Compute Cluster, where Compute Nodes (NVRAM) connect over a Compute Fabric to I/O Nodes (NVRAM, SSD), which connect over a Site-wide Storage Fabric to the Parallel Filesystem (NVRAM, SSD, Disk)]

Page 3:

FastForward/ESSIO Transactional Stack


• A transaction consists of a set of updates to a container (container ≈ file)
• Updates are added to a transaction, not made directly to a container
• Updates include additions, deletions, and modifications
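The transaction semantics above can be sketched in a few lines. This is a minimal, self-contained model for illustration only; the class and method names are hypothetical and are not the actual FastForward/ESSIO API.

```python
# Minimal model of transactional updates to a container (container ≈ file).
# Updates accumulate in the transaction and only reach the container on commit.

class Container:
    """A container holding named objects (stand-in for a file)."""
    def __init__(self):
        self.objects = {}

class Transaction:
    """Collects additions, modifications, and deletions; applies them on commit."""
    def __init__(self, container):
        self.container = container
        self.updates = []  # list of (op, key, value) tuples

    def add(self, key, value):
        self.updates.append(("add", key, value))

    def modify(self, key, value):
        self.updates.append(("modify", key, value))

    def delete(self, key):
        self.updates.append(("delete", key, None))

    def commit(self):
        # Apply the whole update set; nothing was visible before this point.
        for op, key, value in self.updates:
            if op == "delete":
                self.container.objects.pop(key, None)
            else:
                self.container.objects[key] = value
        self.updates = []

c = Container()
t = Transaction(c)
t.add("/dset1", [1, 2, 3])
assert c.objects == {}  # updates are invisible until the transaction commits
t.commit()
assert c.objects == {"/dset1": [1, 2, 3]}
```

The key point the sketch captures is that the application never mutates the container directly; it only describes updates, which the stack applies as a unit.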

Page 4:

HACC — Overview


HACC (Hardware/Hybrid Accelerated Cosmology Code): an N-body cosmology code framework in which a typical simulation of the universe demands extreme-scale simulation capabilities

Primary data model:
• 9 arrays at the full scale of the application
  • Position in 3-D, velocity in 3-D, simulation info, science data
• Additional metadata augmenting provenance, etc.

Page 5:

HACC — Data Model: Currently

Application creates custom binary files



Page 6:

HACC — Data Model: ESSIO

Application creates custom binary files


• All application metadata stored in HDF5 container
• HDF5 format is self-describing, using groups, datasets and attributes
• Any visualization or analysis process can be used to investigate science results
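What "self-describing" buys HACC can be sketched with plain dictionaries standing in for HDF5 objects. The group, dataset, and attribute names below are illustrative only, not HACC's actual schema; the point is that a reader can discover the layout by walking the container, with no outside knowledge of the file format.

```python
# Self-describing layout sketch: groups contain datasets, and both groups
# and the root can carry attributes (names here are hypothetical).
container = {
    "attrs": {"code": "HACC", "step": 42},  # provenance-style metadata
    "groups": {
        "particles": {
            "attrs": {"units": "Mpc/h"},
            "datasets": {
                # position and velocity in 3-D plus per-particle science data
                "x": [0.0], "y": [0.0], "z": [0.0],
                "vx": [0.0], "vy": [0.0], "vz": [0.0],
                "phi": [0.0],
            },
        },
    },
}

def describe(node, path="/"):
    """Walk the container and list every object path with its attributes."""
    entries = [(path, dict(node.get("attrs", {})))]
    for gname, group in node.get("groups", {}).items():
        entries.append((path + gname, dict(group.get("attrs", {}))))
        for dname in group.get("datasets", {}):
            entries.append((path + gname + "/" + dname, {}))
    return entries

paths = [p for p, _ in describe(container)]
assert "/particles/vx" in paths
```

A visualization or analysis tool can run the same kind of walk over a real HDF5 file, which is why no knowledge of HACC's internal binary layout is needed.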

Page 7:

HACC — Data Resiliency: Currently


☑ Application stores and verifies checksums from memory to the file and back


Page 8:

HACC — Data Resiliency: ESSIO

☑ Application stores and verifies checksums from memory to the file and back


• Each process calculates and passes the checksum of its local array section to HDF5
• HDF5 optionally verifies the buffer, and passes the checksum with the data down the stack
• Checksum verified for every data buffer operation from HDF5 to storage and back
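The end-to-end checksum flow above can be sketched with a CRC32 from the standard library. The layer names and the `checked_write` helper are illustrative, not the actual stack components or API; the point is that the checksum travels with the data and every hop re-verifies it before passing the pair down.

```python
import zlib

def checked_write(data: bytes, checksum: int, layers):
    """Pass (data, checksum) down through each layer; every hop verifies
    the buffer before forwarding it toward storage."""
    for layer in layers:
        if zlib.crc32(data) != checksum:
            raise IOError("corruption detected entering layer: " + layer)
    return data  # reached storage with the checksum intact

buf = b"local array section"
cs = zlib.crc32(buf)  # computed by the application process
stored = checked_write(buf, cs, ["HDF5", "IOD", "DAOS"])
assert stored == buf
```

The same verification runs in reverse on read, so silent corruption anywhere between memory and storage is caught at the hop where it happened rather than at the application.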

Page 9:

HACC — Fault Tolerance: Currently

Application retries I/O until completed



Page 10:

HACC — Fault Tolerance: ESSIO

Application retries I/O until completed


• Each process writes all checkpoint data to a transaction
• Transaction is committed to storage, possibly asynchronously
• If asynchronous, application can test/wait to guarantee data is persistent
• Future work: replay event stack on error
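The asynchronous commit with test/wait described above can be sketched with a background thread. The class and method names are hypothetical stand-ins, not the actual ESSIO calls; what matters is the shape of the protocol: the commit proceeds off the critical path, and the application checks or blocks on completion only when it needs the persistence guarantee.

```python
import threading

class AsyncCommit:
    """Illustrative async transaction commit with non-blocking test
    and blocking wait, mirroring the test/wait model described above."""
    def __init__(self):
        self._done = threading.Event()

    def start(self, apply_fn):
        # Commit runs in the background while the app keeps computing.
        def run():
            apply_fn()
            self._done.set()
        threading.Thread(target=run).start()

    def test(self):
        """Non-blocking: has the transaction reached persistent storage?"""
        return self._done.is_set()

    def wait(self):
        """Block until the data is guaranteed persistent."""
        self._done.wait()

storage = []
commit = AsyncCommit()
commit.start(lambda: storage.append("checkpoint-0"))
commit.wait()  # the application only pays for persistence when it asks for it
assert commit.test() and storage == ["checkpoint-0"]
```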

Page 11:

High-level HDF5 ESSIO stack libraries


Objective: have the high-level I/O code manage the transaction requests and isolate the application code from the ESSIO stack

Ported two high-level HDF5-based I/O libraries:
(1) NetCDF: a set of software libraries used to facilitate the creation, access, and sharing of array-oriented scientific data in self-describing, machine-independent data formats
(2) Parallel I/O (PIO): a high-level I/O library that uses NetCDF as its backend

Page 12:

PIO/NetCDF: ESSIO


• Global stack variables are passed as arguments to NetCDF and from the application
• Stack parameters are controlled from within PIO

[Diagram, Nth time step: Application (ACME) calls PIO "Write Array"; the stack is initialized (read context id, version number, event stack id, transaction id); netCDF writes data to the stack via multiple netCDF APIs, and increments and automatically handles FF variables]
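The division of labor above, where PIO owns the stack state and the application never sees it, can be sketched as a thin wrapper. All names here (`PIOFile`, `write_array`, the id fields) are hypothetical stand-ins for the real PIO/netCDF interfaces; the sketch only shows how the FF variables can be threaded through every write without application involvement.

```python
class PIOFile:
    """Illustrative wrapper: stack parameters live inside PIO, and each
    array write increments and forwards them automatically."""
    def __init__(self):
        # stack parameters controlled from within PIO, not the application
        self.read_context_id = 0
        self.version = 0
        self.transaction_id = 0
        self.event_stack = []
        self.log = []  # stand-in for data actually written via netCDF APIs

    def write_array(self, name, data):
        self.transaction_id += 1            # FF variables handled automatically
        self.event_stack.append(("write", name))
        # stand-in for the multiple netCDF API calls made per array write
        self.log.append((self.transaction_id, name, data))

f = PIOFile()
f.write_array("temperature", [1.0, 2.0])
f.write_array("pressure", [3.0])
assert [tid for tid, _, _ in f.log] == [1, 2]  # ids advanced without the app's help
```

This is exactly the isolation the slide's objective calls for: the application issues ordinary write calls, and the transaction bookkeeping never leaks above the PIO layer.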

Page 13:

Future Work


New superset of DAOS: DAOS-M, a Distributed Persistent Memory Class Storage Model

• DAOS-M server will access memory-class storage using a Persistent Memory programming model that directly utilizes load-store access to NVRAM DIMMs

• Extends the current DAOS API to support key-value objects natively

Port to and benchmark on DAOS-M:
(1) Legion Programming System (not presented here)
(2) NetCDF
