Garching 2007-07-18 ScicomP 13
High Performance Global File Systems Easy Data Management in Supercomputer Grids
Andreas Schott (schott@rzg.mpg.de)
Overview
Motivation / Choices
GPFS / MC-GPFS
DEISA’s Implementation and Status
Motivation for Global File Systems
Advantages
• Simple access
• Standard commands
• No special data preparation
• No re-writing of jobs and binaries
• Everything everywhere at any time
Issues
• Network stability
• Latency
• Performance
• Availability
Available Choices
• (Open)AFS
• GFS
• PVFS
• OCFS
• NFS
• NFS4
• Lustre
• MC-GPFS
General Concepts of MC-GPFS
MC-GPFS = Multiple Cluster General Parallel File System
available for all HPC architectures in DEISA
servers available for AIX and Linux
Principal Structure
distributed – shared – striped
kernel add-on for file system
block oriented data transfer
Features achieved
shared and high performance access
safe and secure data
high administrative flexibility
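The striped, block-oriented layout named above can be illustrated with a minimal Python sketch. It is not the GPFS allocation algorithm: it assumes plain round-robin placement of fixed-size blocks, and the function names `stripe_blocks` and `reassemble` are hypothetical.

```python
def stripe_blocks(data: bytes, block_size: int, num_disks: int) -> list:
    """Split data into fixed-size blocks and deal them round-robin onto disks.

    Returns one list of blocks per disk (assumed placement, for illustration).
    """
    if block_size <= 0 or num_disks <= 0:
        raise ValueError("block_size and num_disks must be positive")
    disks = [[] for _ in range(num_disks)]
    for index, offset in enumerate(range(0, len(data), block_size)):
        # Block number `index` goes to disk `index % num_disks`.
        disks[index % num_disks].append(data[offset:offset + block_size])
    return disks


def reassemble(disks: list) -> bytes:
    """Read the blocks back in the same round-robin order."""
    blocks = []
    rows = max((len(disk) for disk in disks), default=0)
    for row in range(rows):
        for disk in disks:
            if row < len(disk):
                blocks.append(disk[row])
    return b"".join(blocks)


if __name__ == "__main__":
    payload = bytes(range(10))
    layout = stripe_blocks(payload, block_size=3, num_disks=2)
    print(layout)
    print(reassemble(layout) == payload)
```

Because consecutive blocks land on different disks, a large sequential read can be served by several disks at once, which is the kind of parallelism the slide refers to.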
General Concepts of MC-GPFS
Technical Aspects
each site with its own servers possible
local disk space locally administered
scalability and high performance access by inherent parallelism
easily extensible
file consistency by sophisticated token management
high recoverability and increased data availability
simplified storage management
storage pools, file sets
simplified administration
globally acting commands
General Concepts of MC-GPFS
Security Aspects
separate network communication for administration possible
remote security
authenticated remote access for servers
mount and/or data with SSL-keys
easy root-mapping
easy no-suid functionality
userid mapping for remote access via interfaces
General Concepts of MC-GPFS
Access and Availability
transparent access
no special data transfer commands required
global visibility inside DEISA
extended access rights
communication without a single point of failure
delegated locking and other communication
Summary of MC-GPFS
Local and Remote High Performance Access
high parallelism in data and file access
very large file and file system support
High Availability
each site with its own servers
redundant access path
simply extensible and scalable
striped data
parallel access path
Advantages of GPFS (admin)
• Easy Management
• Easy Extensibility
• High Performance
• Security Features
• Add-On Features like HSM Functionality
Advantages of GPFS (user)
• Standard Access Methods
Transparent Access
• Data globally visible
No special actions for data transfer required
• Simplicity
• Extended Access Right Features
• Add-On Features like HSM Functionality
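The "standard access methods" point can be sketched as follows: on a globally mounted file system, moving a result file is an ordinary copy, with no dedicated transfer tool. This is a minimal illustration only. The helper `stage_result` is hypothetical, and a temporary directory stands in for the global mount point, which is an assumption made so the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def stage_result(source: Path, destination_dir: Path) -> Path:
    """Copy a file with ordinary file operations and return the new path."""
    destination_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, destination_dir / source.name))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: "scratch" is local, "global_fs" stands in
        # for a directory on the globally visible file system.
        local = Path(root) / "scratch" / "result.dat"
        local.parent.mkdir(parents=True)
        local.write_bytes(b"simulation output")
        copied = stage_result(local, Path(root) / "global_fs" / "data")
        print(copied.read_bytes() == b"simulation output")
```

The same code works unchanged whether the destination is a local disk or a remote site's file system, which is what "transparent access" means for the user.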
Local GPFS File Servers
[Diagram, top to bottom: Network; File Servers 1 ... N; FC-Switch; Disk Systems 1 ... M]
Local GPFS Access
[Diagram: Network; File Servers 1 ... N; FC-Switch; Disk Systems 1 ... M; Compute Servers 1 ... N. Two variants are labelled: Separate Clusters and One Cluster]
Remote GPFS Access
[Diagram: Site A and Site B, each with its own Network, File Servers 1 ... N, FC-Switch, Disk Systems 1 ... M and Compute Servers 1 ... N; a WAN lies between the two site networks]
Aims of DEISA
Providing HPC resources to the Scientific Community
Offering an add-on value to local facilities
optimal hardware selection
easy usability
transparent data access
Achievement of these Aims
common network structure
using internal features of job schedulers
additional middleware for easy access (e.g. UNICORE)
global file system in a network of trust
MC-LoadLeveler in DEISA
Implementation
• Environment Variables for DATA
• Modules
• Local Home Directories
• Job Movement (Filters)
Caveats
• Path Unification
• Treatment of HSM
• Data Availability
Pre- and Post-processing
[Diagram labels: CINECA user; job; Gateway CINECA; NJS CINECA (IBM P5) with IDB and UUDB]
Batch systems shown: AIX LL-MC, AIX LL, Super-UX NQS II, Linux LSF, Linux PBS Pro, Linux LL
Source: johannes.reetz@rzg.mpg.de
[Diagram labels: CINECA user; job; one Gateway and one NJS (with IDB and UUDB) per site]
Gateways: BSC, CINECA, CSC, ECMWF, FZJ, HLRS, HPCX, IDRIS, LRZ, RZG, SARA
NJS platforms:
BSC: IBM PPC
CINECA: IBM P5
CSC: IBM P4
ECMWF: IBM P5
FZJ: IBM P4
HLRS: NEC SX8
HPCX: IBM P5
IDRIS: IBM P4
LRZ: SGI ALTIX
RZG: IBM P4
SARA: SGI ALTIX
Batch systems shown: AIX LL-MC, AIX LL, Super-UX NQS II, Linux LSF, Linux PBS Pro, Linux LL
Source: johannes.reetz@rzg.mpg.de
Globus Installation at RZG
[Diagram labels]
Network zones: internet, DMZ, intranet
GLOBUS client tools (D-GRID user): grid-proxy-init, globusrun-ws, globus-url-copy, gsissh
Hosts and components: grid gateway, gg.rzg.mpg.de, postgres DB, RFT, IO node, p5io3.rzg.mpg.de, high performance switch, GPFS, LRMS (master), head node, disk system
Operating systems: Linux, AIX
Legend:
LRMS (node hosting the LoadLeveler master)
Local Resource Management System (IBM LoadLeveler)
head node (e.g., for code development and testing)
gsisshd 2222
LRMS client
full DEISA CPE available
Cluster compute nodes (IBM P5)
grid gateway (job submission host)
gridftp frontend 2811 (user mode)
gridftp backend (root)
globus container 8443 (fork), LRMS client
DMZ firewall inbound ports (8443, 20000-25000)
GPFS available
grid-mapfile: (DN → D-GRID username)
Source: Johannes.Reetz@rzg.mpg.de
GPFS Configuration in DEISA
Each AIX-site provides its own server
Some non-AIX-sites will provide servers based on Linux
RZG hosts disk space for non-AIX-sites without servers
RZG provides HSM-functionality on GPFS
locally, disk space performs like local disk space
total of more than 30 TB
wide area network connection with 10 Gbit/s (mostly)
remotely, disk space is no longer limited by the network
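To put the 10 Gbit/s figure in perspective, the ideal transfer time for a given data volume follows from simple arithmetic. The sketch below is an illustration under stated assumptions: the 1 TB volume is an arbitrary example, the function name `transfer_time_seconds` is hypothetical, and protocol overhead, latency and disk limits are ignored unless an `efficiency` factor below 1.0 is supplied.

```python
def transfer_time_seconds(size_bytes: float,
                          link_gbit_per_s: float,
                          efficiency: float = 1.0) -> float:
    """Ideal time to move size_bytes over a link of the given speed.

    efficiency is the assumed usable fraction of the nominal bandwidth.
    """
    if size_bytes < 0 or link_gbit_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid size, link speed or efficiency")
    bits = size_bytes * 8
    return bits / (link_gbit_per_s * 1e9 * efficiency)


if __name__ == "__main__":
    one_terabyte = 1e12  # decimal terabyte, assumed example volume
    for speed in (1, 10):
        seconds = transfer_time_seconds(one_terabyte, speed)
        print(f"{speed} Gbit/s: {seconds:.0f} s")
```

Under these idealised assumptions, 1 TB needs 8000 s at 1 Gbit/s and 800 s at 10 Gbit/s. Real throughput depends on the file servers, the disks and the WAN path.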
DEISA „proof of concept“ phase
[Network diagram labels: DFN, RENATER, GARR, GÉANT; 1 Gb/s links. Legend: Premium IP, IP Priority, LSPs]
Evolution of GPFS in DEISA
October 2004
RZG (DE): Power4, AIX
FZJ (DE): Power4, AIX
IDRIS (FR): Power4, AIX
CINECA (IT): Power5, AIX
DEISA – TeraGrid Connection
Super Computing 2005
[Network diagram labels]
DEISA sites: FZJ (Jülich), RZG (Munich), Cineca (Bologna), IDRIS (Orsay)
NRENs: DFN (NREN Germany), GARR (NREN Italy), RENATER (NREN France)
Networks and locations: GÉANT, Internet2/Abilene, TeraGrid, SDSC, Chicago, New York, Amsterdam, Frankfurt, Milano, Paris
Link types: 1 Gb/s Premium IP, 1 Gb/s LSP, 10 Gb/s, 30-40 Gb/s
Source: R.Niederberger@fz-juelich.de
DEISA 1 Gb/s network infrastructure
RENATER
FUNET
SURFnet
DFN
GARR
UKERNA
RedIris
GÉANT LSPs
Evolution of GPFS in DEISA
July 2006
RZG (DE): Power4, AIX
FZJ (DE): Power4, AIX
IDRIS (FR): Power4, AIX
CINECA (IT): Power5, AIX
BSC (ES): PowerPC, Linux
CSC (FI): Power4, AIX
SARA (NL): SGI-Altix, Linux
Upgrade of Multiple Cluster GPFS
Problems with GPFS 2.3
Initial MC-functionality not inherently integrated
Each-to-Any communication required
Limitation of participating nodes
Advantages of GPFS 3.1
Better Multi-Cluster Support
Better Encapsulation by possible use of private addresses
Higher Independence between sites
Higher Stability
Better Performance
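The cost of "each-to-any" communication can be made concrete with a small counting sketch. It assumes that each-to-any means every participating node must be able to reach every other node directly. The node counts are invented for illustration, and the function names are hypothetical.

```python
from itertools import combinations


def full_mesh_pairs(nodes: int) -> int:
    """Number of distinct node pairs when every node talks to every other."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


def cross_site_pairs(nodes_per_site: list) -> int:
    """Number of node pairs whose two members sit at different sites."""
    return sum(a * b for a, b in combinations(nodes_per_site, 2))


if __name__ == "__main__":
    sites = [4, 4, 4]  # illustrative node counts, not DEISA figures
    print(full_mesh_pairs(sum(sites)))   # 66 pairs in total
    print(cross_site_pairs(sites))       # 48 of them cross a site boundary
```

The pair count grows quadratically with the number of nodes, which is one way to read the limitation on participating nodes listed for GPFS 2.3. Hiding nodes behind private addresses, as listed for GPFS 3.1, reduces how many of these paths have to cross site boundaries.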
Evolution of GPFS in DEISA
February 2007
RZG (DE): Power4, AIX
FZJ (DE): Power4, AIX
IDRIS (FR): Power4, AIX
CINECA (IT): Power5, AIX
LRZ (DE): SGI-Altix, Linux
CSC (FI): Power4, AIX
ECMWF (GB): Power5+, AIX
Status of Multiple Cluster GPFS
Site    File-server  Storage  Compute-CPUs              TFlops  Memory
CINECA  2            2 TB     480 Power5 (1.9 GHz)      2.6     1152 GB
FZJ     2            4 TB     1288 Power4 (1.7 GHz)     8.9     5152 GB
IDRIS   2            2 TB     1024 Power4 (1.3 GHz)     6.7     3136 GB
RZG     2            10 TB    928 Power4 (1.3 GHz)      4.6     2368 GB
CSC     2            2 TB     512 Power4 (1.1 GHz)      2.2     672 GB
LRZ     (RZG)        0 TB     9728 Montecito (1.6 GHz)  62.3    39064 GB
ECMWF   2            1 TB     2640 Power5+ (1.9 GHz)    20.1    2250 GB
DEISA – Network (estimated Q3 / 2007)
[Network diagram labels]
NRENs: SURFnet, UKERNA, FUNET, RedIris, GARR, RENATER, DFN
Backbone: GÉANT2; DFN/GÉANT Frankfurt
Link speeds shown: 1 Gb/s and 10 Gb/s
Legend: Dedicated 10 Gb/s wavelength; Dedicated 10 Gb/s wavelength (potential); 1 Gb/s LSP; GÉANT LSP
Source: ralph.niederberger@fz-juelich.de
Evolution of GPFS in DEISA
RZG (DE): Power4, AIX
FZJ (DE): Power4, AIX
IDRIS (FR): Power4, AIX
CINECA (IT): Power5, AIX
LRZ (DE): SGI-Altix, Linux
BSC (ES): PowerPC, Linux
HLRS (DE): NEC-SX8, Super-UX
CSC (FI): Power4, AIX
SARA (NL): SGI-Altix, Linux
EPCC (GB): Power4, AIX
ECMWF (GB): Power5+, AIX
CSC (FI): Cray XT4, Linux
SARA (NL): Power5, Linux
/deisa/<site>/home/<group>/<user>
/deisa/<site>/data/<group>/<user>
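The path convention above lends itself to a small helper. The following Python sketch is only an illustration: the functions `deisa_path` and `parse_deisa_path` are hypothetical, and the site, group and user values in the demo are invented placeholders, not real DEISA names.

```python
from pathlib import PurePosixPath

AREAS = ("home", "data")  # the two areas named in the path convention


def deisa_path(site: str, area: str, group: str, user: str) -> PurePosixPath:
    """Build /deisa/<site>/<area>/<group>/<user> from its components."""
    if area not in AREAS:
        raise ValueError(f"area must be one of {AREAS}")
    for part in (site, group, user):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return PurePosixPath("/deisa") / site / area / group / user


def parse_deisa_path(path: str) -> dict:
    """Split a path of the form above back into its components."""
    parts = PurePosixPath(path).parts
    # Expected shape: ('/', 'deisa', site, area, group, user)
    if len(parts) != 6 or parts[:2] != ("/", "deisa") or parts[3] not in AREAS:
        raise ValueError(f"not a recognised path: {path!r}")
    return {"site": parts[2], "area": parts[3],
            "group": parts[4], "user": parts[5]}


if __name__ == "__main__":
    # Placeholder values, chosen only for the demonstration.
    path = deisa_path("siteA", "data", "group01", "alice")
    print(path)
    print(parse_deisa_path(str(path)))
```

Because the site name is part of the path, the same absolute path identifies the same directory from every site, which is the "path unification" concern listed among the caveats.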