Designing File Systems for NVMe - eResearch...



Page 1: Designing File Systems for NVMe - eResearch 2019conference.eresearch.edu.au/wp-content/uploads/... · Data explosion More complex algorithms IT archiving mandates Inadequate I/O capability

Dell - Internal Use - Confidential

Designing Parallel File Systems With Scalable NVMe Building Blocks

Matt Wallis

High Performance Computing and Artificial Intelligence Specialist

Dell EMC


© Copyright 2019 Dell Inc.

Agenda

• How NVMe is Accelerating Science Workloads

• Risk Management Approaches

• Dell EMC Architectural Approaches

– CSIRO - BeeGFS Buddy Mirroring

– A Medical Research Institute - BeeGFS BeeOND

– Cambridge University – Lustre Data Accelerator

– Imperial College London – ArcaStream and Excelero

• Results


The very definition of HPC is expanding

Blazing Fast Speed

Accessibility and flexibility

Simulation and Modelling: computationally intensive workloads as ‘traditional HPC’

• Computer-aided design (CAD/CAM/CAE)

• Weather forecasting

• Oil exploration

High Performance Data Analytics: analyzing big data for rapid insights, real-time results, predictive analytics

• Personalized medicine

• Fraud detection

• Signal processing

Artificial Intelligence

• Deep learning

• Machine learning


Evolving HPC workload demands are shifting

• Data explosion

• More complex algorithms

• IT archiving mandates

Inadequate I/O capability can severely degrade overall cluster performance

1. Data is rarely touched after 2–4 weeks, yet you may be required to keep it for up to 20–30 years

Storage administrators & researchers are challenged with the growing burden of managing and monitoring complex storage systems


DEFINING A RISK VERSUS REWARD STRATEGY

Axis labels: Performance, Persistence

Criteria:

• Performance: scalable performance, tunable for workload

• High Availability, Backup, Data Protection: able to fulfill compliance requirements, protect important data

• Data Sharing and Accessibility: pre and post processing, analytics, desktop access

• Management & Integration: management functionality and support; connections with other tools

Storage tiers compared: Ephemeral ‘scratch’ and Traditional ‘scratch’

Ratings shown on the slide (cell positions not recoverable from the transcript):

• Critical requirement

• Check point, parallel access

• Not the critical feature

• Uptime important, but no special backup requirements

• Important, but not the critical feature


DELL EMC NVMe BUILDING BLOCK with POWEREDGE R740XD

Multi-Vector Cooling: multi-vector cooling delivers correct air flow to each PCIe slot

Innovative Design

Up to 24 Intel P46100 NVMe or Optane NVMe drives, with 64 PCIe lanes remaining for networking and peripherals

Highest Performance

Cyber Built upon the Intel Xeon Scalable Processor range

Intelligent automation

New OpenManage™ Enterprise console delivers crystal clear reporting & full lifecycle automation

Based on Dell Internal Analyses 03/01/2017.

THE BEDROCK OF THE MODERN SUPERCOMPUTER

Optimized for high throughput scale out performance


BeeGFS ARCHITECTURE BUDDY MIRROR ON POWEREDGE STORAGE SERVERS

[Diagram: compute nodes connect to storage servers, scale out over Eth, IB or OPA. Storage servers are grouped as Buddy Mirrors. A Management Server sits on a private management LAN. Inset: primary and secondary HDD targets with write and read paths.]

**Combine Metadata function within storage servers in separate RAID pool
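The primary/secondary write and read labels in the diagram can be illustrated with a toy model. This is a minimal sketch, assuming clients talk to the primary target and the primary forwards each write to its buddy; the `Target` and `BuddyGroup` names are hypothetical, and the sketch leaves out resynchronisation and everything else a real BeeGFS deployment does.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """One storage target; `chunks` stands in for data on disk."""
    name: str
    online: bool = True
    chunks: dict = field(default_factory=dict)


@dataclass
class BuddyGroup:
    """Toy mirror pair: one primary and one secondary target (hypothetical model)."""
    primary: Target
    secondary: Target

    def write(self, chunk_id: str, data: bytes) -> None:
        self._failover_if_needed()
        self.primary.chunks[chunk_id] = data
        # The primary forwards the write to its buddy when the buddy is reachable.
        if self.secondary.online:
            self.secondary.chunks[chunk_id] = data

    def read(self, chunk_id: str) -> bytes:
        self._failover_if_needed()
        return self.primary.chunks[chunk_id]

    def _failover_if_needed(self) -> None:
        # If the primary is down, promote the secondary by swapping roles.
        if not self.primary.online:
            if not self.secondary.online:
                raise RuntimeError("both targets in the buddy group are offline")
            self.primary, self.secondary = self.secondary, self.primary


group = BuddyGroup(Target("server-a"), Target("server-b"))
group.write("chunk-0", b"checkpoint")
group.primary.online = False      # simulate losing server-a
print(group.read("chunk-0"))      # served by server-b after failover
```

The cost of this protection is that every chunk is stored twice, so usable capacity is roughly half of raw capacity.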


BUDDY MIRROR ON POWEREDGE


BeeGFS On Demand (BeeOND)

• On Demand Scratch File System

• Simple implementation, low barriers to entry

• Runs on the same nodes your jobs do

• Failures take out both the job and its file system

• Performance and capacity depend on the number of nodes requested

• Can make debugging hard if debug files are deleted when the job fails
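The point that performance and capacity depend on the number of nodes requested can be made concrete with a little arithmetic. The following is a rough sketch, assuming ideal linear scaling; the per-node figures and the `NodeSpec` and `beeond_estimate` names are illustrative assumptions, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical per-node NVMe resources; replace with measured values."""
    capacity_tb: float
    bandwidth_gbs: float


def beeond_estimate(nodes: int, spec: NodeSpec) -> tuple[float, float]:
    """Return (capacity in TB, ideal aggregate bandwidth in GB/s) for a job."""
    if nodes < 1:
        raise ValueError("a job needs at least one node")
    # Both quantities grow with the node count; real systems scale less than ideally.
    return nodes * spec.capacity_tb, nodes * spec.bandwidth_gbs


spec = NodeSpec(capacity_tb=1.6, bandwidth_gbs=3.0)  # illustrative numbers only
for n in (4, 16, 64):
    cap, bw = beeond_estimate(n, spec)
    print(f"{n:3d} nodes: {cap:6.1f} TB, up to {bw:6.1f} GB/s")
```

A small job therefore gets a small, slow scratch file system, which is worth checking before requesting nodes for an I/O-heavy run.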


BeeGFS On Demand (BeeOND) ARCHITECTURE

[Diagram: compute nodes connected over Eth, IB or OPA provide Per Job BeeGFS Client Storage, alongside separate Persistent Storage.]
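The split between per-job storage and persistent storage implies a stage-in, compute, stage-out cycle. Below is a minimal sketch of that cycle using ordinary temporary directories in place of a real BeeOND mount; `run_with_scratch` and `example_job` are hypothetical helpers, and staging out in a `finally` block is one assumed way to keep debug files when a job fails.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_with_scratch(persistent: Path, job: Callable[[Path], None]) -> None:
    """Stage in, run the job on per-job scratch, stage out, then tear down."""
    scratch = Path(tempfile.mkdtemp(prefix="job-scratch-"))
    try:
        # Stage in: copy inputs from persistent storage to per-job scratch.
        shutil.copytree(persistent / "input", scratch / "input")
        (scratch / "output").mkdir()
        job(scratch)
    finally:
        # Stage out even if the job failed, so debug files survive teardown.
        if (scratch / "output").exists():
            shutil.copytree(scratch / "output", persistent / "output",
                            dirs_exist_ok=True)
        shutil.rmtree(scratch)  # the scratch space disappears with the job


def example_job(scratch: Path) -> None:
    text = (scratch / "input" / "data.txt").read_text()
    (scratch / "output" / "result.txt").write_text(text.upper())


persistent = Path(tempfile.mkdtemp(prefix="persistent-"))
(persistent / "input").mkdir()
(persistent / "input" / "data.txt").write_text("hello")
run_with_scratch(persistent, example_job)
print((persistent / "output" / "result.txt").read_text())
```

Anything the job leaves outside the staged-out directory is lost when the scratch space is removed.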


CAMBRIDGE DATA ACCELERATOR

[Diagram: compute nodes connect to storage servers, scale out over Eth, IB or OPA. Lemur Movers act as Parallel Data Movers. A Management Server sits on a private management LAN.]


CAMBRIDGE DATA ACCELERATOR SLURM PLUG IN
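The slide itself carries only a title, so the following is a hedged illustration rather than the configuration shown in the talk. It assumes the plug-in is driven through Slurm's burst buffer support with DataWarp-style `#DW` directives; the capacity, paths, node count and the `burst_buffer_script` helper are all hypothetical, and the exact directive syntax should be checked against the site's Slurm and Data Accelerator documentation.

```python
def burst_buffer_script(capacity: str, stage_in: str, stage_out: str,
                        command: str) -> str:
    """Build a batch script that requests a per-job striped scratch buffer."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --nodes=2",
        # Request a per-job buffer of the given size (assumed directive syntax).
        f"#DW jobdw capacity={capacity} access_mode=striped type=scratch",
        # Copy inputs into the buffer before the job starts.
        f"#DW stage_in source={stage_in} "
        "destination=$DW_JOB_STRIPED/input type=directory",
        # Copy results back to the parallel file system after the job ends.
        "#DW stage_out source=$DW_JOB_STRIPED/output "
        f"destination={stage_out} type=directory",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


script = burst_buffer_script(
    capacity="1TiB",
    stage_in="/lustre/project/input",      # hypothetical path
    stage_out="/lustre/project/output",    # hypothetical path
    command="srun ./simulate --workdir $DW_JOB_STRIPED",
)
print(script)
```

The generated text would be submitted with `sbatch`; the scheduler, not the job, is then responsible for creating the buffer and moving the data.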


ARCASTREAM - PIXSTOR

[Diagram: compute nodes connect to storage servers, scale out over Eth, IB or OPA. A Windows client connects through a CIFS Gateway. A Management Server sits on a private management LAN. A file is shown as FILE STRIPE ONE, FILE STRIPE TWO, FILE STRIPE THREE and FILE STRIPE FOUR.]
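The four file stripes in the diagram suggest a file spread across several storage servers. As a minimal sketch of the general idea, assuming simple round-robin placement of fixed-size chunks (the chunk size, server names and `stripe_layout` function are assumptions, and PixStor's real layout policy may differ):

```python
def stripe_layout(file_size: int, chunk_size: int,
                  servers: list[str]) -> list[tuple[int, int, str]]:
    """Return (offset, length, server) for each chunk, assigned round-robin."""
    if chunk_size <= 0 or not servers:
        raise ValueError("need a positive chunk size and at least one server")
    layout = []
    for index, offset in enumerate(range(0, file_size, chunk_size)):
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        layout.append((offset, length, servers[index % len(servers)]))
    return layout


servers = ["stripe-one", "stripe-two", "stripe-three", "stripe-four"]
for offset, length, server in stripe_layout(10 * 2**20, 2**20, servers):
    print(f"offset {offset:>9d}  length {length:>8d}  -> {server}")
```

Because consecutive chunks land on different servers, a large sequential read or write can draw on all of them at once.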


Cryo-EM

4 hours on disk

10 minutes on NVMe


Atmospheric Modelling

1 hour 16 minutes on disk

22 minutes on NVMe


Computer Vision Deep Learning

• 3 weeks on disk

• 1 week on NVMe

• Enabled multi-node scaling
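The three results (Cryo-EM, atmospheric modelling, computer vision deep learning) are easier to compare as speedup factors. A short worked calculation, using only the run times quoted on the slides and converting everything to minutes; the `speedup` helper is introduced here for illustration:

```python
def speedup(disk_minutes: float, nvme_minutes: float) -> float:
    """Ratio of the run time on disk to the run time on NVMe."""
    return disk_minutes / nvme_minutes


WEEK = 7 * 24 * 60  # minutes in a week

results_minutes = {
    "Cryo-EM": (4 * 60, 10),                              # 4 hours vs 10 minutes
    "Atmospheric modelling": (76, 22),                    # 1 h 16 min vs 22 min
    "Computer vision deep learning": (3 * WEEK, 1 * WEEK),
}

for name, (disk, nvme) in results_minutes.items():
    print(f"{name}: {speedup(disk, nvme):.1f}x faster on NVMe")
```

The spread, from about 3x to 24x, indicates how strongly each workload was bound by storage I/O in the first place.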


ANNOUNCING THE WORLD'S FASTEST STORAGE


WORK DIRECTLY WITH HPC AND AI EXPERTS

The Dell EMC HPC and AI Innovation Lab designs and builds solutions, tests and explores new technology, shares performance results and best practices

“Our lab is staffed by engineers with advanced degrees and many years of industry experience in domains such as mechanical engineering and bioinformatics. We also have engineers with computer science backgrounds, providing expertise in file systems, interconnects and HPC management tools.”

- Onur Celebioglu, HPC Engineering Director and head of the HPC and AI Innovation Lab

Typical HPC and AI Innovation Lab projects

• Technology comparison

• System parameter sweeps

• GPU test comparison

• Efficiency tuning

• HPC network evaluation

• HPC storage system optimization

• Proof of concept studies

• Vertical Solutions


IN PARTNERSHIP WITH