This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671500.
Percipient StorAGe for Exascale Data Centric Computing: Computing for the Exascale
Shaun de Witt Culham Centre for Fusion Energy, UK
2nd Technical Meeting on Fusion Data Processing, Validation and Analysis - June 2nd 2017
Per-cip-i-ent
adj. Having the power of perceiving, especially perceiving keenly and readily.
n. One that perceives.
Storage cannot keep up with Compute!
• Far too much data
• Far too much energy spent moving data
• How to use new storage devices is unclear
Opportunity: Big Data Analytics and Extreme Computing overlap
Storage Problems at Extreme Scale
SAGE Project Goal
• Big (Massive!) Data Analysis
– Avoid data movements
– Manage & process extremely large data sets
• Extreme Computing
– Changing I/O needs
– HDDs cannot keep up
Need Exascale Data Centric Computing Systems: Big Data Extreme Computing (BDEC) systems.
SAGE validates a BDEC system which can ingest, store, process and manage extreme amounts of data.
• Provide/validate a novel storage architecture
– Object Storage with a very flexible API, driving:
• a multi-tiered hierarchical storage system
• with integrated compute capability
• Purpose: increase overall scientific throughput!
– Co-designed with use cases
– Integrated with ecosystem tools
• Provide a roadmap of component technologies for achieving Extreme Scale
– Including Programming Models and Access Methods
• European excellence in the area of Exascale Data Centric Computing by targeting:
– HPC & Big Data technology influencers
– Scientific communities & infrastructure/wider markets
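The "object storage with a very flexible API" idea above can be illustrated with a minimal sketch. All class and method names here are hypothetical illustrations of the concept; the real Mero interface is the Clovis API.

```python
# Minimal sketch of a flexible object-store API in the spirit described
# above: objects are addressed by id and carry free-form metadata that
# can be queried. (Hypothetical names, not the real Mero/Clovis API.)
class ObjectStore:
    def __init__(self):
        self._objects = {}   # object id -> bytes
        self._attrs = {}     # object id -> free-form key/value metadata

    def put(self, oid, data, **attrs):
        """Store an object together with arbitrary metadata."""
        self._objects[oid] = bytes(data)
        self._attrs[oid] = dict(attrs)

    def get(self, oid):
        return self._objects[oid]

    def find(self, **query):
        """Return ids of objects whose metadata matches the query."""
        return [oid for oid, a in self._attrs.items()
                if all(a.get(k) == v for k, v in query.items())]

store = ObjectStore()
store.put("shot-42/density", b"\x00\x01", tier="nvram", diagnostic="thomson")
store.put("shot-42/temp", b"\x02\x03", tier="disk", diagnostic="thomson")
print(store.find(diagnostic="thomson"))
```

The flexibility lies in the metadata: unlike a POSIX file, an object can be tagged and retrieved by any scientific or system property, which is what lets higher layers (tiering, containers, analytics) co-design against the store.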
SAGE Overall Objectives
• Tracking European Exascale and HPC objectives very closely
– Very active participation in ETP4HPC
– Strategic Research Agenda (SRA) goals – SRA2 (http://www.etp4hpc.eu/en/sra.html)
• Continue to be extremely well aligned
• Driving future H2020 projects (FETHPC2, ESDs, etc.)
• Synergistic with worldwide initiatives
– ECP (Exascale Computing Project): https://exascaleproject.org/
– BDEC (Big Data Extreme Compute): http://www.exascale.org/bdec/
SAGE Relevance to Fusion
• ITER Data Analysis
– Prompt analysis, pre-emptive caching, …
• Engineering and Modelling
– Note: PPPL is already working on Exascale development with XGC* – opportunities for collaboration?
* http://www.pppl.gov/news/2017/02/advanced-fusion-code-led-pppl-selected-participate-early-science-programs-three-new-0
Applications
• Primary Goal
– Demonstrate use cases & co-design the system
• Methodology
– Obtain requirements from:
• various use cases
• detailed profiling supported by tools
– Feed requirements back to the platform (“Co-Design”)
Co-Design Requirements
Automated Application Characterization w/ Tools
• CCFE fusion energy applications
– ALF: Analytics of Log-files for Fusion
– Spectre: providing near real time feedback on plasma
– Finite element analysis using ParaFEM
• Savu – Tomography reconstruction and processing pipeline
• Ray – Distributed assembly of metagenome
• JURASSIC – Fast radiative transfer model simulation code
• iPIC3D – Particle-in-Cell code for simulations of space plasma
• NEST – Simulator for spiking neural network models
• Angelia – Benchmarking framework for Apache Flink
Applications
• Goal
– Build the data centric computing platform
• Methodology
– Advanced object storage
– New NVRAM technologies in the I/O stack
– Ability for I/O to accept computation
• incl. memory as part of the storage tiers
– API for massive data ingest and extreme I/O
– Commodity server & computing components in the I/O stack
Percipient Storage Overview
Tiered Object Storage
HSM design
• Data organization is based on the Mero composite layout
• Data migration decisions come from:
– user access
– a knowledge base filled from all Mero information (hints, events, …)
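The tier-migration decision described above can be sketched as a small policy that combines explicit hints from the knowledge base with access recency. The tier names and thresholds below are illustrative assumptions, not SAGE's actual policy.

```python
# Sketch of an HSM tier-migration decision: hints from the knowledge
# base take priority; otherwise the object is demoted down the tiers as
# its last access ages. (Illustrative thresholds only.)
import time

TIERS = ["nvram", "ssd", "disk", "archive"]  # fastest -> slowest

def choose_tier(last_access, hints, now=None):
    """Pick a tier for an object from hints and access recency."""
    now = time.time() if now is None else now
    if "pin_tier" in hints:          # explicit hint wins
        return hints["pin_tier"]
    age = now - last_access
    if age < 60:
        return "nvram"               # hot: accessed within the last minute
    if age < 3600:
        return "ssd"
    if age < 86400:
        return "disk"
    return "archive"

now = 1_000_000.0
print(choose_tier(now - 10, {}, now))                      # recent access
print(choose_tier(now - 100_000, {}, now))                 # cold data
print(choose_tier(now - 100_000, {"pin_tier": "nvram"}, now))  # hint wins
```

A real HSM would also weigh capacity pressure and migration cost, but the core shape (hints override a recency heuristic) is the same.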
Containers
• Logical grouping of objects by:
– System properties: access latency, resilience, …
– Object properties: formats, access mechanism, …
– Scientific properties: location, time, diagnostic, …
– Event based: ‘shot’, earthquake, hurricane, …
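A container in this sense is simply a named logical grouping over object properties, which can be sketched as a predicate over tagged objects. This is an illustrative toy model; Mero containers are richer than this.

```python
# Sketch of container-style logical grouping: objects carry property
# dictionaries, and a "container" is the set of objects satisfying a
# predicate over those properties. (Toy model only.)
objects = [
    {"id": "o1", "diagnostic": "thomson", "shot": 42, "latency": "low"},
    {"id": "o2", "diagnostic": "bolometer", "shot": 42, "latency": "high"},
    {"id": "o3", "diagnostic": "thomson", "shot": 43, "latency": "low"},
]

def container(predicate):
    """A logical container: ids of all objects satisfying the predicate."""
    return sorted(o["id"] for o in objects if predicate(o))

shot42 = container(lambda o: o["shot"] == 42)        # event-based grouping
fast = container(lambda o: o["latency"] == "low")    # system-property grouping
print(shot42, fast)
```

The same object can belong to many containers at once (a ‘shot’, a diagnostic, a latency class), which is exactly what hierarchical directory trees make awkward.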
MPI and PGAS
Status:
• Global partitioned address space for hierarchical storage
– Realized via MPI windows allocated on storage
– Can substitute MPI I/O, eliminating the distinction between programming interfaces for memory and storage
• Implemented in PMPI and MPICH (available as open source on GitHub)
• Uses MPI “hints”
• No change to the MPI standard
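The key idea, a single programming interface for memory and storage, can be illustrated with `mmap`: a byte region that is simultaneously ordinary memory and a persistent file. This is only an analogy; the SAGE work realizes it with real MPI windows allocated on storage, not `mmap`.

```python
# Analogy for an "MPI window allocated on storage": a file-backed mmap
# is addressed like memory but persists like storage, so the same loads
# and stores serve both roles. (Analogy only, not the SAGE code.)
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "window.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)              # back the "window" with storage

with open(path, "r+b") as f:
    win = mmap.mmap(f.fileno(), 4096)    # map storage into the address space
    win[0:5] = b"hello"                  # plain in-memory writes...
    win.flush()                          # ...made durable on demand
    win.close()

with open(path, "rb") as f:              # the data survived in the file
    data = f.read(5)
print(data)
```

In the MPI realization, `MPI_Put`/`MPI_Get` on such a window replace explicit MPI-IO read/write calls, which is why no change to the MPI standard is needed.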
SAGE Feature Analysis (doi:10.1016/j.fusengdes.2017.03.113)
• iPIC3D
– PGAS I/O
– Function shipping for data analysis
• JURASSIC
– Function off-loading or run-time system for:
• data pre-processing (data extraction)
• data compression/decompression
– Asynchronous I/O with data staging using a semi-persistent cache
• NEST
– Data post-processing (data analysis) using a run-time system
• Savu
– Native HDF5 support
– Native block store with a Ceph-like interface
– Function off-loading for slicing of data regions using a Python interface
• Spectre
– Apache Flink for parallelising FFT and buffered streaming of data to storage
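The function-shipping and off-loading pattern used by several of these applications can be sketched as follows: instead of moving a large object to the client, a named function is executed next to the data and only the small result crosses the network. All names below are hypothetical; the project exposes this via a Python interface on the storage side.

```python
# Sketch of function shipping / off-loading: registered functions run
# storage-side against the named object, so only results move over the
# network. (Illustrative model, not the actual SAGE interface.)
STORAGE = {"field-dump": list(range(1_000))}   # large data held storage-side

REGISTRY = {
    "slice": lambda data, lo, hi: data[lo:hi],     # extract a data region
    "mean":  lambda data: sum(data) / len(data),   # reduce in place
}

def ship(oid, fn_name, *args):
    """Run a registered function next to the data; return only the result."""
    return REGISTRY[fn_name](STORAGE[oid], *args)

print(ship("field-dump", "slice", 10, 13))   # a small slice, not 1000 items
print(ship("field-dump", "mean"))            # a single number comes back
```

This is the energy argument from the opening slides in miniature: computing a mean storage-side moves one float instead of the whole dump.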
pNFS Services on Object Storage
• pNFS is a standard parallel file system protocol
• Separates metadata access from data access
• The POSIX namespace is stored in the Mero KV store
• File data are stored in Mero objects
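The metadata/data split described above can be sketched with two maps: a key-value namespace (the metadata server's view) and an object store (the data servers' view). All names here are hypothetical stand-ins for the Mero KV store and Mero objects.

```python
# Sketch of the pNFS split: the POSIX namespace lives in a KV store,
# file contents live in separate objects, and a read is one metadata
# lookup followed by direct data access. (Toy model only.)
namespace = {}   # KV store: POSIX path -> object id (metadata path)
objects = {}     # object store: object id -> file data (data path)

def create(path, data):
    oid = f"obj-{len(objects)}"
    objects[oid] = data          # data goes to the object store
    namespace[path] = oid        # only the mapping goes to the KV store
    return oid

def read(path):
    return objects[namespace[path]]   # lookup, then direct object access

create("/pulse/42/density.h5", b"densities")
print(read("/pulse/42/density.h5"))
```

Because the namespace lookup and the bulk transfer hit different services, many clients can stream data in parallel without serializing on the metadata server.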
[Figure: clients reach a metadata server backed by the Mero KV store via the pNFS protocol, and Mero objects via the Mero protocol, over the network]
Extreme Resiliency for Applications
Distributed Transactions
• Groups of storage (including I/O) operations that are atomic in the face of certain failures, known as allowed failures
• Allowed failures:
– transient network failures
– node crash and restart
• Distributed Transaction Manager (DTM)
– creates transactions
– controls transactions
• Example actions:
– scatter-gather write of data into a Mero object
– scatter-gather read of data from a Mero object
– creation of a new sub-directory
– renaming of a file
– writing of a data unit reconstructed from parity blocks into a spare unit
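The transaction semantics above (all operations apply, or after a crash they are all rolled back) can be sketched with an undo log. This is a toy single-node model of the concept, not Mero's distributed DTM.

```python
# Sketch of transactional atomicity via an undo log: writes record the
# old value first; commit discards the log; recovery after a crash
# replays the log backwards, undoing the partial transaction.
class Store:
    def __init__(self):
        self.data = {}
        self.undo = []           # undo log for the open transaction

    def txn_write(self, key, value):
        self.undo.append((key, self.data.get(key)))  # log old value first
        self.data[key] = value

    def commit(self):
        self.undo.clear()        # all writes are now permanent

    def recover(self):
        """After a crash: roll back every uncommitted write."""
        for key, old in reversed(self.undo):
            if old is None:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.undo.clear()

s = Store()
s.txn_write("a", 1); s.txn_write("b", 2); s.commit()   # committed transaction
s.txn_write("a", 99); s.txn_write("c", 3)              # crash before commit...
s.recover()                                            # ...rolls both back
print(s.data)
```

A distributed DTM must additionally coordinate the log across nodes so that a node crash-and-restart (an allowed failure) leaves every replica at the same committed state.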
Integration, Demonstration
• Goal
– Hardware definition, integration and demonstration
• Methodology
– Design and bring-up of SAGE hardware
• Seagate hardware
• Atos hardware
– Integration of all the software components
• Jülich Supercomputing Centre (JSC)
– Demonstrate use cases
• Extrapolate performance to Exascale
• Study other object stores vis-à-vis Mero
SAGE Hardware Prototype: built, shipped and integrated at JSC
SAGE is extremely well aligned to the broader goals for Europe in the area of Storage, I/O and Energy Efficiency:
• M-BIO-1: Tightly coupled storage-class memory I/O systems demo
• M-BIO-3: Multi-tiered heterogeneous storage system demo
• M-BIO-5: Big data analytics tools developed for HPC use
• M-BIO-6: ‘Active Storage’ capability demonstrated
• M-BIO-8: Extreme scale multi-tier data management tools available
• M-ARCH-3: New compute nodes and storage architecture use NVRAM
• M-ENER-X: Addresses energy goals by avoiding data movements
– ~100x more energy to move data than to compute on it!
• M-ENER-FT-10: Application survival on unreliable hardware
Alignment with European Goals [ ETP4HPC SRA]
Expected Impacts & Innovation
• Commercial and market impacts (storage, systems & tools)
• Key inputs into European road-mapping
• Key inputs into “data intensive” research programs
• Primary European storage platform for Extreme Scale (applicability: big science and BDEC)
Acknowledgements
Sai Narasimhamurthy (Seagate)
Dirk Pleiter (Forschungszentrum Jülich)
Stefano Markidis (Kungliga Tekniska Högskolan)
Questions? [email protected]
Services, Systemware & Tools
• Goal
– Explore tools and services on top of Mero
• Methodology
– “HSM” methods to automatically move data across tiers
– “pNFS” parallel file system access on Mero
– Scale-out object storage integrity checking service provision
– Allinea performance analysis tools provision
Status:
• HSM PoC implementation ready
• pNFS PoC implementation ready
• Performance analysis tools framework ready
• Completed scoping/architecture of data integrity checking
Mero Extreme Scale Features & NVRAM
• Goal
– Mero object storage platform development (with the Clovis API)
– Evaluate NVRAM options
• Methodology
– Co-design extreme scale object store components
– Study NVRAM technologies, incl. emulation
Status:
• Concept & architecture of key Mero Exascale components
• NVRAM state-of-the-art studies
• Low-level system software/emulation of NVRAM
Programming Models and Analytics
• Goal
– Explore usage of SAGE by programming models, runtimes and data analytics solutions
• Methodology
– Usage of SAGE through MPI and PGAS
• Adapt MPI-IO for SAGE
• Adapt PGAS for SAGE
– Runtimes for SAGE
• Pre/post processing
• Volume rendering
• Exploit the caching hierarchy
– Data analytics methods on top of Clovis
• Apache Flink over Clovis, looking beyond Hadoop
• Exploit NVRAM as an extension of memory
Status:
• PGAS for SAGE proof of concept
• Following up from MPI-IO gap analysis
• Runtimes proof of concept
• Detailed architecture of data analytics
SAGE Project Ambition
SAGE to lay the foundation for a European storage platform to be #1 at Extreme Scale
M9 Review Recommendations
[Recommendation 1] – As part of the validation process, we recommend the consortium to include criteria to evaluate the performance of the proposed storage system. These criteria will serve to monitor internally the progress of the project and will not entail additional obligations towards the European Commission.
o WP5 discussion
o A Performance Evaluation Working Group (PEWG) was set up
o Methods to continuously track the performance of the SAGE system
[Recommendation 2] – As part of the dissemination activities concerning future work, we recommend the consortium to better advertise the SAGE project to the scientific community, for instance through the publication of scientific papers. In order to facilitate user engagement, we also suggest exploring the possibility to give access to the prototype to users outside the consortium.
o More focus on open publications (WP6 discussion)
o Initiation of activity to provide access to the prototype (WP5 discussion)
Definition of Terms
• Exascale/Extreme Computing
– Computing at an Exaflop and beyond
• Object Storage
– Grouping data into user-defined “objects”; no particular notion of grouping in hierarchical trees
• Data Centric Computing
– Computing that depends on and generates lots of data
– Classical HPC was mainly about simulations; data was secondary
• Parallel File Systems
– Popular paradigm for accessing storage in HPC, by parallelizing I/O to the storage subsystem
• pNFS
– A type of parallel file system
• NVM/NVRAM
– Non-Volatile Memory
• MPI
– Popular programming model in HPC; MPI-IO is its I/O library
WP6: Dissemination, Exploitation…
• Goal
– Dissemination, exploitation & collaboration
• Methodology
– Disseminate SAGE through events, conferences, publications and talks
– Exploitation through exploring market opportunities and expanding European IP
– Collaboration with other European and international projects
• Seeking influence on de-facto standard methods
Status:
• Continued website updates
• Continued publications
• Continued social media
• Continued participation in key events & talks
• Press releases and press coverage
• Discussions with potential users of SAGE technology