Agenda
• HLRS
• The hardware parts of the Cray XE6
Node (processor, interconnect)
Packaging
• XE6 Software
CLE (Cray Linux Environment) ESM, CCM
CCE (Cray Compiler Environment)
Other Compiler Environments (brief)
• How to submit a job
Leading Edge HPC infrastructure in Germany
HLRS is one of the three national
supercomputing centers in Germany
responsible for engineering and
industrial simulation
The national supercomputing
centers are working together in the
Gauss Centre for Supercomputing (GCS)
GCS is the means to contribute to
the Partnership for Advanced
Computing in Europe (PRACE)
The BMBF project petaGCS is the
source for 50% of funding for
investment and operation
The remaining 50% are provided by
the Ministry of Science, Research
and the Arts Baden-Württemberg
3
Hermit Phase1 Step1 (2011)
System Design Overview
4
[Diagram: system components around the Cray XE6]
External login servers
Pre- & postprocessing servers
Remote visualization servers
Fast local storage
Parallel filesystem (Lustre)
HLRS-wide shared NAS home space
Other HLRS servers
Storage and archive
Hermit Phase 1 Step 1b: GPGPU add-on
Hermit Phase 1 Step 2 (2013)
Phase 1 Step 1 Overview
5
Configuration:
Peak Performance ~ 1PF
38 racks with 96 nodes each
96 service nodes and 3552 compute nodes
Each compute node will have 2 sockets
AMD Interlagos @ 2.3 GHz, 16 cores each,
leading to 113,664 cores
Nodes with 32GB and 64GB memory reflecting different user needs
2.7PB storage capacity @ ~ 150GB/s IO bandwidth
External Access Nodes, Pre- & Postprocessing Nodes, Remote
Visualization Nodes
~2MW maximal power consumption
Support for ISV codes, depending on the application, under ESM
(Extreme Scalability Mode, "native") or CCM (Cluster Compatibility
Mode)
CRAY <-> HLRS Collaboration
CRAY and HLRS have set up a
Cray development center in
Stuttgart with on site staff for
production & joint research
Work closely with the users to
port and optimize for the big
installation Step 1 in Q3/2011
Joint development and definition of
the details of Phase 1 Step 2, based
on
results of joint optimization and
scaling efforts on Phase 1 Step 1
experiences with accelerators on
Step 1b
Target is tailored Step2 system for
HLRS’ industrial and academic users
6
HLRS-CRAY Collaboration
2010: Cray XE6 Phase 1 Step 0 test system
Q3/2011: Cray XE6 "Hermit1", Phase 1 Step 1, ~1 PF peak
2012: Update of "Hermit1" with 32 Cray XK6 nodes
2013: Cray Cascade "Hermit2", Phase 1 Step 2, ~4-5 PF peak
A glimpse on Phase 1 Step 2 – Hermit 2
7
Step 2 will run in parallel to Step 1 realizing an integrated system
Goal is to maintain similar software stack for both installation steps
Expected architectural changes
Next generation interconnect
Newest generation of CPUs
Partially relying on accelerators
Updated storage infrastructure
Additional external servers
Significantly increased sustained application performance
Scheduled for Autumn 2013
Overall peak performance of complete Phase 1 will be >5PF
Specification is driven by agreed sustained application performance and
not peak performance, though
Differentiation Strategy
Unique resources in terms of
size and architecture (tier-0
systems)
High level of expertise in
emerging technologies /
Technology Watch (e.g. WP9
prototypes)
Consultancy in
Porting/Optimisation
Layer integration
Unification of access
models, application
procedures across the tier-x
Co-Development
Collaboration approaches
aiming to shorten time to
market in Software,
Hardware,
Solutions/Architecture
Training and Consulting
Lower the barrier to exploit
available resources
Support selection of most
appropriate system for the
problem size (tier-x)
Training of trainers
8
Three year EU-funded
collaborative project, 13
partners, €12 million costs, €8.5
million funding
Collaborative Research into
Exascale Systemware, Tools
and Applications
Project coordinator: EPCC at
The University of Edinburgh
CRESTA has a very strong
focus on exascale software
challenges
Uses a co-design model of
applications with exascale
potential interacting with
systemware and tools activities
The hardware partner is Cray
Applications represent broad
spectrum from science and
engineering (OpenFOAM!)
CRESTA will compare and
contrast incremental and
disruptive solutions to Exascale
challenges
9
• Towards EXascale ApplicaTions (TEXT)
• EU-funded CP & CSA in FP7-Infrastructures-2010-2
• 4 HPC Centers, 4 Universities, 1 Industrial partner
• Centered around the StarSS programming model from BSC:
  #pragma css task input(v1, v2, len) output(v3)
  void vadd (float *v1, float *v2, float *v3, int len)
  { ... }
• Project Goal: Apply StarSS to a set of applications
• Hybrid parallelization using StarSS and MPI for:
  BEST / LBC Lattice Boltzmann codes (Fortran)
  LS1-Mardyn MD code (C++)
The Cray XE6 node
We are concentrating on the XE6 installed at HLRS, which is called
'hermit1'.
There are other XE6 models using different processors and different
interconnect topologies, which we don't cover in this presentation.
We start by introducing the node parts (processors used, interconnect, ...)
and show how they are packaged.
Processor
The new Opteron 6200 Series (Interlagos)
Hermit uses the Model 6276 (2.3 GHz)
● Interlagos is composed of a number of Bulldozer core “modules”
● A core module has shared and dedicated components
● There are two independent integer cores and a shared, 256-bit FP resource
● A single Integer Core can make use of the entire FP resource with 256-bit AVX instructions
● This architecture is very flexible, and can be applied effectively to a variety of workloads and problems
● The L1 data cache is 16 KB, the L2 cache is 2 MB and the L3 cache is 8 MB
Interlagos Processor Architecture
[Figure: Bulldozer core module block diagram. Dedicated components per integer core: an integer scheduler, four pipelines and an L1 DCache. Shared at the module level: fetch, decode, the FP scheduler with two 128-bit FMAC units, and the L2 cache. Shared at the chip level: the L3 cache and northbridge.]
Why share components?
By letting the cores share parts, the performance of processors can be
increased while keeping the complexity of the processors (#transistors) down
compared to simply adding more cores.
In this example we are going from 1 to 2 'instances' (1 to 2 cores, 1 to 2 integer cores, ...).
The following table shows this gain by comparing different strategies. The
numbers quoted were found in a ht4u article:
http://ht4u.net/reviews/2011/amd_bulldozer_fx_prozessoren/index8.php

                                          Multi Core   Multi Threading   Hyper Threading
Hardware overhead                         2x           ~1.2x             <1.05x
Performance gain                          max 2x       max 1.8x          max 1.25x
Performance gain vs. hardware overhead    1            1.5               1.2
● An Orochi die consists of 4 Bulldozer modules
● An 8MB Level 3 Cache and memory controller is shared among the 4 modules
● Two Orochi die make up a single Interlagos processor
● The HLRS machine runs at 2.3 GHz
● Cores can run at faster clock speeds depending on the workload running on the part
[Figure: Orochi die — 4 Bulldozer modules, shared 8 MB L3 cache, integrated memory controller and integrated northbridge controller.]
[Figure: XE6 node — two Interlagos packages (four Orochi dies); each die holds 4 Bulldozer modules with their L1/L2 caches and a shared 8 MB L3 cache and drives two DDR3 channels (8 channels per node); the dies are fully connected with HT3 links, and one HT3 link goes to the Gemini.]
XE6 Node Details – 32-core Interlagos
● 2 Multi-Chip Modules, 4 Opteron Dies: ~300 Gflops
● 8 Channels of DDR3 Bandwidth to 8 DIMMs: ~105 GB/s
● 32 Computational Cores, 32 MB of L3 cache
● Dies are fully connected with HT3
Gemini
The Cray interconnect
Cray Gemini ASIC (application-specific integrated circuit)
Supports 2 Nodes per ASIC
3D Torus network
XT5/XT6 systems field upgradable
Scales to over 100,000 network endpoints
Link Level Reliability and Adaptive Routing
Advanced Resiliency Features
Advanced features
MPI – millions of messages / second
One-sided MPI
UPC, Coarray FORTRAN, Shmem, Global Arrays
Atomic memory operations
Gemini
[Figure: Gemini ASIC — two NICs (NIC 0 and NIC 1), each attached to a node via HyperTransport 3, connected through a Netlink block to a 48-port YARC router.]
Gemini NIC Design
● Fast memory access (FMA)
  ● Mechanism for most MPI transfers, involves the processor
  ● Supports tens of millions of MPI requests per second
● Block transfer engine (BTE)
  ● Supports asynchronous block transfers between local and remote memory, in either direction
  ● For large MPI transfers that happen in the background
● Hardware pipeline maximizes issue rate
● HyperTransport 3 host interface
● Hardware translation of user ranks and addresses
● AMO cache
● Network bandwidth dynamically shared between NICs
[Figure: Gemini NIC internal block diagram — HT3 cave, FMA, BTE, CQ, AMO, NAT/RAT, ORB/HARB/TARB, SSID, netlink and router tiles.]
• Cray MPI uses MPICH2 distribution from Argonne
CH3 device Nemesis: multi-method device with a highly optimized shared memory sub-method
• MPI device for Gemini based on
User level Gemini Network Interface (uGNI)
Distributed Memory Applications (DMAPP) library
• FMA (Fast Memory Access)
In general used for small transfers
FMA transfers are lower latency
• BTE (Block Transfer Engine)
BTE transfers take longer to start but can transfer large amount of data without CPU involvement
• AMOs provide a fast synchronization method for collectives
AMO=Atomic Memory Operations
Gemini Software
20
Gemini PGAS Features
● PGAS = Partitioned Global Address Space
● Globally addressable memory provides efficient support for UPC, Co-array Fortran, SHMEM
● Pipelined global loads and stores allow fast execution of irregular communication patterns
● Atomic memory operations provide a fast synchronization method for one-sided communication
● Cray DMAPP application interface
  ● The Cray Programming Environment targets this directly
  ● Available for 3rd party tools (check docs.cray.com for the API)
Gemini Performance Highlights
● MPI latency of 1.4 µsec (3X improvement over SeaStar)
● MPI message rate of 9M/sec (20X improvement over SeaStar)
● Injection bandwidths in excess of 6 GB/sec (3X improvement over SeaStar)
● Cray SHMEM put rate of 25M/sec
● Scattered/indexed put rates of 60-90M/sec
● Low latencies are maintained across the whole system with cores sending non-local messages (HPCC natural+random ring)
MPI Latency at Scale
[Chart: MPI latency (microseconds) vs. number of processes (up to ~50,000); curves for Nehalem + IB and Westmere + IB, natural and random ring, and small and large Cray XE6, natural and random ring.]
● Gemini MPI bandwidth exceeds 5 GB/sec
MPI Bandwidth
[Chart: MPI bandwidth (MBytes/sec) vs. message size (bytes), for a single message and for multiple messages.]
Each Gemini supports 2 XE6 Compute Nodes
• Built around the Gemini Interconnect
• Each Gemini ASIC provides 2 NICs enabling it to connect 2 dual-socket nodes
[Figure: two dual-socket nodes per Gemini, connected into the 3D torus (X, Y, Z dimensions).]
The Cray XE6 packaging
Compute blade on the XE6
• Configuration
4 compute nodes per compute blade
Each compute node has 2 Opteron sockets
Each socket hosts a Magny-Cours or Interlagos MCM for a total of up to 128 compute cores per blade
32 DDR3 Memory DIMMS + 32 DDR3 Memory channels
2 Gemini ASICs
L0 Blade management processor
• Runs Cray Linux Environment (CLE)
Linux-based operating system
designed to run large, complex applications and scale efficiently to hundreds of thousands of processor cores
27
Cray XE6 Compute Blade (4 nodes)
[Photo: XE6 compute blade — four nodes (Node 0-3), Gemini mezzanine card, AMD Opterons with heatsinks, memory DIMMs, voltage regulators, and the connector into the backplane; annotations show the message flow from Node 0 to Node 2 and from Node 1 to Node 0.]
Service nodes on the XE
• Overview
• Run full Linux (SuSE SLES 11)
• 4 nodes per service blade
• Boot node
• first XE6 node to be booted: boots all other components
• IO nodes
• Run Lustre processes (OST, MDT)
• SDB node
• hosts MySQL database
• processors, allocation, accounting, PBS information
• Login nodes
• User login and code preparation activities: compile, launch
• Partition allocation: ALPS (Application Level Placement Scheduler)
29
XK6 Nodes
Cray's GPU nodes.
HLRS has a small system with 16 XK6 nodes (Tesla) now; the current
plan is to grow it to 32 nodes (Kepler) by the end of the year.
Currently the nodes and the software are being tested in a TDS. Expect the
nodes to join Hermit in the June/July timeframe.
Cray XK6 Node
[Figure: Cray XK6 node attached to the 3D torus (X, Y, Z) — high-radix YARC router with adaptive routing, 168 GB/sec capacity, 10 12X Gemini channels; each Gemini acts like two nodes on the 3D torus.]
XE6 Node Characteristics
Number of cores:           32 (Interlagos)
Peak performance:          2 x IL-16 (2.3 GHz) = 295 Gflops/sec
Memory size:               32 GB or 64 GB per node
Memory bandwidth (peak):   104.5 GB/sec
XK6 Compute Node Characteristics
Host processor:            AMD Opteron 6200 Series (Interlagos)
Host processor peak:       147 Gflops
Tesla X2090 cores:         448
Tesla X2090 peak:          600+ Gflops
Host memory:               16 or 32 GB, 1600 MHz DDR3
Tesla X2090 memory:        6 GB GDDR5, 170 GB/sec
Interconnect:              Gemini high-speed interconnect
Upgradeable to the Kepler many-core processor
XK6 Compute Blade
[Photo: Cray XK6 compute blade + NVIDIA Tesla X2090 + Cray Gemini interconnect]
Cray XK6 supercomputer HPCwire readers: “Top 5 New Products or Technologies to Watch”
• NVIDIA Fermi 2090 GPU: 20% better performance than the 2070
  compute: 448 → 512 cores; 1.15 → 1.30 GHz clock
  memory: 6 GB; 150 → 178 GB/s bandwidth
  Upgradable to Kepler in 2012
• AMD Series 6200 Interlagos CPU (16 cores)
• Cray Gemini interconnect high bandwidth/low latency scalability
HPCwire editors: “Best HPC Interconnect Product or Technology”
• Fully integrated/optimised/supported hardware and full software stack (including libraries)
Also supports Cray Cluster Compatibility Mode for ISV applications
• Fully blendable with Cray XE6 product line HPCwire readers: “Best HPC Server Product or Technology”
• Fully upgradeable from Cray XT/XE systems
"Accelerating the Way to Better Science"
33
Cray hybrids in future Top500
ORNL Titan: 200 cabinets of Cray XK6
NCSA Blue Waters: 235 cabinets of Cray XE6 + 30 cabinets of Cray XK6
34
• The most important hurdle for widespread adoption of accelerated computing is programming difficulty
  Need a single programming model that is portable across machine types, and also forward scalable in time
  Portable expression of heterogeneity and multi-level parallelism
  Programming model and optimization should not be significantly different for "accelerated" nodes and multi-core x86 processors
  Allow users to maintain a single code base
• Cray's approach to ease of use for accelerator programming is to provide a tightly coupled high-level programming environment with compilers, libraries, and tools that hide the complexity of the system
  Focus on integration and differentiation
  Target ease of use with extended functionality and increased automation
• Ease of use is possible with
  A compiler making it feasible for users to write applications in Fortran, C, C++ (OpenACC)
  Tools to help users port and optimize for accelerators
  Auto-tuned scientific libraries
Cray Vision for Accelerated Computing
35
XE6 Cabinets
• An XE6 cabinet contains 3 cages (aka chassis)
• A cage contains 8 blades
• A compute blade contains
8 sockets
2 Gemini interconnects
Memory
L0 controller
VRMs
No moving parts
• One blower at the bottom
XE6 configuration details
37
Cray XE6 Packaging
[Figure: Cray XE6 cabinet packaging — enclosure with a blower at the bottom, PDUs, and 3 chassis each holding 8 blades (compute or XIO); XIO blades cable to the GigE backbone and to IB or FC disk; the 3D torus interconnect runs from the backplane and over head to the next row.]
Topology: 16 cabinets correspond to an 8 x 6 x 16 torus; each chassis has a 1 x 2 x 8 topology; HLRS has a 19 x 6 x 16 topology.
External Services
• esFS Provides globally shared data between multiple systems
Cray XE systems and others
Provides access to other file systems
DVS is used to project Panasas or StorNext to the compute nodes
• esLogin Increased availability of data and system services to users
An enhanced user environment
larger memory, swap space, and more horsepower
Dell 905, 4 socket, quad-core and 128 GB of memory
• esDM More options for data management and data protection
40
Why External Services for Cray Systems
To address customer requirements: More flexible user access
More options for data management, data protection
Leverage commodity components in customer-specific implementations
Provide faster access to new devices and technologies
Repeatable solutions that remain open to custom configuration
Enable each solution to be used, scaled, and configured independently
esFS esLogin esDM
41
Scalable Software Architecture
Microkernel on Compute nodes, full featured Linux on Service nodes.
Service PEs specialize by function
Software Architecture eliminates OS “Jitter”
Software Architecture enables reproducible run times
[Figure: the system is split into a service partition (specialized Linux nodes) and a compute partition]
Scalable Software Architecture: CLE
43
Trimming OS – Standard Linux Server
Linux Kernel
Portmap
sshd
slpd
nscd
resmgrd
powersaved
cupsd
kdm
cron mingetty(s)
qmgr master
pickup
ndbd
…
init
klogd
44
Linux on a Diet – CNL
Linux Kernel
ALPS client
syslogd
Lustre Client init
klogd
45
FTQ Plot of Stock SuSE (most daemons removed)
[FTQ plot: count (~27,550-28,350) vs. time (0-3 seconds) for stock SuSE with most daemons removed]
FTQ plot of CNL
[FTQ plot: count (~27,550-28,350) vs. time (0-3 seconds) for CNL, on the same axes]
Cray Software Ecosystem
CrayPat, Cray Apprentice2, Cray Scientific Libraries, DVS
CLE4, An Adaptive Linux OS designed specifically for HPC
• No compromise scalability
• Low-Noise Kernel for scalability
• Native Comm. & Optimized MPI
• Application-specific performance tuning and scaling
ESM – Extreme Scalability Mode
• No compromise compatibility
• Fully standard x86/Linux
• Standardized Communication Layer
• Out-of-the-box ISV Installation
• ISV applications simply install and run
CCM –Cluster Compatibility Mode
49
Cluster Compatibility Mode: Overview
• Provides the runtime environment on compute nodes expected by ISV applications
• Associated with specific batch queues
• Dynamically allocates and configures compute nodes at job start
Nodes are not permanently dedicated to CCM
Any compute node can be used
Allocated like any other batch job (on demand)
• MPI and Third party MPI runs over InfiniBand or TCP/IP over HSN
• Supports standard services: ssh, rsh, nscd, ldap
• Complete root file system on the compute nodes
built on top of the Dynamic Shared Libraries (DSL) environment
Under CCM, everything the application can “see” is identical to a standard Linux
cluster: Linux OS, x86 processor, and MPI
50
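The following is a minimal batch sketch of how an ISV application might be launched under CCM; the queue name (ccm), the ccm module and the ccmrun launcher are assumptions about the local setup and should be checked against the Hermit documentation.

  #!/bin/bash
  #PBS -N ccm_job
  #PBS -q ccm                  # assumed name of the CCM batch queue
  #PBS -l mppwidth=64
  #PBS -l mppnppn=32
  #PBS -l walltime=00:30:00
  cd $PBS_O_WORKDIR
  module load ccm              # assumed: provides the ccmrun launcher
  # ccmrun provisions the allocated compute nodes as a standard Linux
  # cluster and starts the ISV application there instead of aprun
  ccmrun ./isv_application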
CCM IAA
Look just like InfiniBand to third-party MPIs.
Emulate IB characteristics not in the spec.
Be invisible to the user.
Be invisible in performance profiles.
51
Cray XE I/O architecture
• All I/O is offloaded to service nodes
• Lustre
High performance parallel I/O file system
Direct data transfer between compute nodes and files
• DVS
Virtualization service
Allows compute nodes to access NFS mounted on service node
Applications must execute on file systems mounted on compute nodes
• No local disks
• /tmp is a MEMORY file system, on each login node
52
Scaling Shared Libraries with DVS
[Figure: diskless compute nodes 0..N each mount /dvs; requests travel over the Cray interconnect to DVS server nodes, which mount the NFS shared libraries.]
Requests for shared libraries (.so files) are routed through DVS servers
Provides similar functionality as NFS, but scales to 1000s of compute nodes
Central point of administration for shared libraries
DVS servers can be "re-purposed" compute nodes
DSL : Dynamic shared libraries
54
• Benefit: root file system environment available to applications
• Shared root from SIO nodes will be available on compute nodes
• Standard libraries / tools will be in the standard places
• Able to deliver customer-provided root file system to compute nodes
• Programming environment supports static and dynamic linking
• Performance impact negligible, due to scalable implementation
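As a rough illustration (not taken from these slides), a dynamically linked build and a run against the DSL root file system might look as follows; the -dynamic wrapper flag and the CRAY_ROOTFS variable are assumptions to verify in the local CLE documentation.

  # build with dynamic linking through the Cray compiler wrapper
  ftn -dynamic -o myprog.exe myprog.f90
  # in the batch script: request the DSL root file system on the compute
  # nodes so the shared libraries are resolved at run time
  export CRAY_ROOTFS=DSL
  aprun -n 64 ./myprog.exe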
The Cray Programming Environment Overview
The Cray Programming Environment Vision
• It is the role of the Programming Environment to close the gap between observed performance and peak performance
Help users achieve highest possible performance from the hardware
• The Cray Programming Environment is addressing the issues of scale and complexity of high end HPC systems with:
Increased automation
Ease of use
Hiding the system complexity
Extended functionality
Focus on scalability
Improved Reliability
Strong academic collaborations
Close interaction with users
For feedback targeting functionality enhancements
56
Cray Programming Environment Distribution
Focus on Differentiation and Productivity
Programming Languages: Fortran, C, C++, Chapel, Python
Compilers: Cray Compiling Environment (CCE), PGI, GNU
Programming models:
  Distributed Memory (Cray MPT): MPI, SHMEM
  PGAS & Global View: UPC (CCE), CAF (CCE), Chapel
  Shared Memory: OpenMP 3.0 (CCE, PGI)
I/O Libraries: NetCDF, HDF5
Optimized Scientific Libraries: LAPACK, ScaLAPACK, BLAS (libgoto), Iterative Refinement Toolkit, Cray Adaptive FFTs (CRAFFT), FFTW, Cray PETSc (with CASK), Cray Trilinos (with CASK)
Tools:
  Environment setup: Modules
  Debuggers: DDT, gdb
  Debugging support: Fast Track Debugger (CCE with DDT), Abnormal Termination Processing, STAT, Cray Comparative Debugger (under development)
  Performance analysis: CrayPat, Cray Apprentice2
(The distribution mixes Cray-developed software, licensed ISV software, 3rd-party packages, and Cray added value to 3rd-party packages.)
57
CCE: The Cray Compilation Environment
• Cray technology focused on scientific applications
  Takes advantage of automatic vectorization
  Takes advantage of automatic shared-memory parallelization
• Standard-conforming languages and programming models
  Fortran 2003 standard compliant, with F2008 features already available
  C++98/2003 compliant
  OpenMP 3.0 compliant, working on OpenMP 3.1 and OpenMP 4.0
• OpenMP and automatic multithreading fully integrated
  Share the same runtime and resource pool
  Aggressive loop restructuring and scalar optimization done in the presence of OpenMP
  Consistent interface for managing OpenMP and automatic multithreading
• PGAS languages (UPC & Fortran coarrays) fully optimized and integrated into the compiler
  UPC 1.2 and Fortran 2008 coarray support
  No preprocessor involved
  Target the network appropriately
58
• MPI
Implementation based on MPICH2 from ANL
Optimized Remote Memory Access (one-sided) fully supported including passive RMA
Full MPI-2 support with the exception of Dynamic process management (MPI_Comm_spawn)
MPI3 Forum active participant
• Cray SHMEM
Fully optimized Cray SHMEM library supported; the Cray XE implementation is close to the T3E model
Cray MPI & Cray SHMEM
59
• From performance measurement to performance analysis
• Assist the user with application performance analysis and optimization
Help user identify important and meaningful information from potentially massive data sets
Help user identify problem areas instead of just reporting data
Bring optimization knowledge to a wider set of users
• Focus on ease of use and intuitive user interfaces
Automatic program instrumentation
Automatic analysis
• Target scalability issues in all areas of tool development
Cray Performance Analysis Tools
60
• Systems with hundreds of thousands of threads of execution need a new debugging paradigm
Innovative techniques for productivity and scalability
Scalable solutions based on MRNet from the University of Wisconsin
STAT - Stack Trace Analysis Tool
Scalable generation of a single, merged, stack backtrace tree
• running at 216K back-end processes
ATP - Abnormal Termination Processing
Scalable analysis of a sick application, delivering a STAT tree and a minimal, comprehensive, core file set.
Fast Track Debugging
Debugging optimized applications
Added to Allinea's DDT 2.6 (June 2010)
Comparative debugging
A data-centric paradigm instead of the traditional control-centric paradigm
Collaboration with Monash University and University of Wisconsin for scalability
Support for traditional debugging mechanism TotalView, DDT, and gdb (TotalView is not on Hermit)
Debuggers on Cray Systems
61
Scientific libraries – functional view
FFT: FFTW, CRAFFT
Sparse: Trilinos, PETSc, CASK
Dense: BLAS, LAPACK, ScaLAPACK, IRT
Scientific libraries – package view
FFTW: fftw-2.1.5, fftw
PETSc: petsc, petsc-complex, CASK (petsc)
Trilinos: trilinos 10.8.3.0, CASK (trilinos)
LibSci: BLAS, LAPACK, ScaLAPACK, IRT, CRAFFT
How to access and use the software
• The Cray XE system uses modules in the user environment to support multiple software versions and to create integrated software packages
As new versions of the supported software and associated man pages become available, they are added automatically to the Programming Environment, while earlier versions are retained to support legacy applications
You can use the default version of an application, or you can choose another version by using Modules system commands
Environment Setup
65
• How can we get the appropriate compiler, tools, and libraries?
  The modules tool is used to handle different versions of packages
  e.g.: module load compiler_v1
  e.g.: module swap compiler_v1 compiler_v2
  e.g.: module load perftools
• Modules take care of changing PATH, MANPATH, LM_LICENSE_FILE, ... in the environment
  Modules also provide a simple mechanism for updating certain environment variables, such as PATH, MANPATH, and LD_LIBRARY_PATH
  In general, you should make use of the modules system rather than embedding specific directory paths into your startup files, makefiles, and scripts.
The module tool on the Cray XE
66
The PrgEnv-X module
• PrgEnv-X is the 'basic' module for all XE6 users
  X = cray, pgi, gnu, intel [pathscale]
• With PrgEnv-X you decide which compiler you want to use, and all needed modules (math libs, MPI, ...) are loaded automatically
• Modules not loaded by default can be loaded at any time, e.g. perftools for performance analysis (see the sketch below)
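A minimal sketch of a typical session, using module and wrapper names shown elsewhere in these slides (PrgEnv-cray, PrgEnv-gnu, perftools, ftn); the source file name is only an illustration.

  # see which programming environment is loaded (PrgEnv-cray is the default)
  module list
  # switch to the GNU programming environment
  module swap PrgEnv-cray PrgEnv-gnu
  # load the performance tools on top of it
  module load perftools
  # the ftn wrapper now drives the GNU Fortran compiler with matching MPI and libsci
  ftn -o myprog.exe myprog.f90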
module list
eslogin002:~> module list
Currently Loaded Modulefiles:
1) modules/3.2.6.6 13) xe-sysroot/4.0.36
2) xtpe-network-gemini 14) rca/1.0.0-2.0400.30002.5.75.gem
3) xtpe-interlagos 15) xt-asyncpe/5.07
4) cce/8.0.2 16) atp/1.4.2
5) acml/4.4.0 17) PrgEnv-cray/4.0.36
6) xt-libsci/11.0.05 18) xt-mpich2/5.4.3
7) udreg/2.3.1-1.0400.3911.5.13.gem 19) eswrap/1.0.9
8) ugni/2.3-1.0400.4127.5.20.gem 20) torque/2.5.9
9) pmi/3.0.0-1.0000.8661.28.2807.gem 21) moab/6.1.5.s1992
10) dmapp/3.2.1-1.0400.3965.10.63.gem 22) system/ws_tools
11) gni-headers/2.1-1.0400.4156.6.1.gem 23) system/hlrs-defaults
12) xpmem/0.1-2.0400.30792.5.6.gem
PrgEnv-cray is the default on Hermit
68
@eslogin002:~> module show xtpe-interlagos
-------------------------------------------------------------------
/opt/cray/xt-asyncpe/default/modulefiles/xtpe-interlagos:
conflict xtpe-barcelona
conflict xtpe-quadcore
conflict xtpe-shanghai
conflict xtpe-istanbul
conflict xtpe-interlagos-cu
conflict xtpe-mc8
conflict xtpe-mc12
conflict xtpe-xeon
prepend-path PE_PRODUCT_LIST XTPE_INTERLAGOS
setenv XTPE_INTERLAGOS_ENABLED ON
setenv CRAY_CPU_TARGET interlagos
setenv INTEL_PRE_COMPILE_OPTS -msse3
setenv PATHSCALE_PRE_COMPILE_OPTS -march=barcelona
-------------------------------------------------------------------
What is xtpe-interlagos?
I should build for the right compute-node
architecture.
It’d probably be a really bad idea to load two architectures at once.
Oh yeah, let’s link in the tuned math libraries for this architecture too.
69
Useful module commands
• Which modules are loaded?
module list
• Load software
module load perftools
• Change programming environment
module swap PrgEnv-cray PrgEnv-gnu
• Change software version
module swap cce cce/7.4.4
• Check which version are available
module avail cce
70
Which Software Versions Are Available?
hpcnicho@eslogin002:~> module avail perftools
--------------------------- /opt/cray/modulefiles --------------------------
perftools/5.2.0 perftools/5.2.3 perftools/5.3.0(default)
hpcnicho@eslogin002:~> module avail cce
---------------------------- /opt/modulefiles --------------------------------
cce/7.3.3 cce/7.4.2 cce/8.0.0 cce/8.0.0.137
cce/8.0.2(default) cce/7.3.4 cce/7.4.4 cce/8.0.0.129
cce/8.0.1
71
What Happens When I Load a Module?
hpcnicho@eslogin002:~> module show perftools
-------------------------------------------------------------------
/opt/cray/modulefiles/perftools/5.3.0:
setenv PERFTOOLS_VERSION 5.3.0
conflict x2-craypat
conflict craypat
conflict xt-craypat
conflict apprentice2
module load rca
setenv CHPL_CG_CPP_LINES 1
setenv PDGCS_LLVM_DISABLE_FP_ELIM 1
setenv PAT_REPORT_PRUNE_NAME
_cray$mt_start_,__cray_hwpc_,f_cray_hwpc_,cstart,__pat_,pat_region_,PAT_,OMP.slave_loop,slave_entry,_new_slave
_entry,__libc_start_main,_start,__start,start_thread,__wrap_,UPC_ADIO_,_upc_,upc_,__caf_,__pgas_
module-whatis Perftools - the Performance Tools module sets up environments for CrayPat, Apprentice2 and
PAPI
prepend-path PATH /opt/cray/perftools/5.3.0/bin
prepend-path MANPATH /opt/cray/perftools/5.3.0/man
setenv CRAYPAT_LICENSE_FILE /opt/cray/perftools/craypat.lic
prepend-path CRAYLMD_LICENSE_FILE /opt/cray/perftools/craypat.lic
setenv CRAYPAT_ROOT /opt/cray/perftools/5.3.0
setenv CRAYPAT_INCLUDE_OPTS $($CRAYPAT_ROOT/sbin/pat-opts INCLUDE)
setenv CRAYPAT_PRE_LINK_OPTS $($CRAYPAT_ROOT/sbin/pat-opts PRE_LINK)
setenv CRAYPAT_POST_LINK_OPTS $($CRAYPAT_ROOT/sbin/pat-opts POST_LINK)
setenv CRAYPAT_PRE_COMPILE_OPTS $($CRAYPAT_ROOT/sbin/pat-opts PRE_COMPILE)
setenv CRAYPAT_POST_COMPILE_OPTS $($CRAYPAT_ROOT/sbin/pat-opts POST_COMPILE)
setenv CRAYPAT_ROOT_FOR_EVAL /opt/cray/perftools/$PERFTOOLS_VERSION
module load papi/4.2.0
setenv APP2_STATE 5.3.0
setenv JH_HELPSET /opt/cray/perftools/5.3.0/help/app2help.jar
setenv JH_VIEWER /opt/cray/perftools/5.3.0/help/jh2_0_05/demos/bin/hsviewer.jar
prepend-path CRAY_LD_LIBRARY_PATH /opt/cray/perftools/5.3.0/lib
append-path CLASSPATH /opt/cray/perftools/5.3.0/help/jh2_0_05/javahelp
append-path PE_PRODUCT_LIST PERFTOOLS
append-path PE_PRODUCT_LIST CRAYPAT
------------------------------------------------------------------- 72
Release Notes
hpcnicho@eslogin002:~> module help cce/8.0.2
----------- Module Specific Help for 'cce/8.0.2' ------------------
The modulefile, cce, defines the system paths and environment
variables needed to run the Cray Compile Environment.
Type "module avail cce" to see if other versions of this product
are available on this system. Use "module switch" to change versions.
Cray Compiling Environment 8.0.2 (CCE 8.0.2)
============================================
Purpose:
--------
The CCE 8.0.2 update provides bugfixes to the CCE 8.0.1 release for Cray XE
systems.
Bugs fixed in 8.0.2 are:
779483 Runtime error with Cray Fortran compiler cce/7.4.4
780053 Illegal folding of optional argument test into a merge
780346 Internal compiler error with crayftn when enabling full debugging
779573 Fortran function pointer issue
Note:
-----
Support for CCE on Cray XT systems will continue to be provided with
updates to the CCE 7.4 release. The CCE 8.0 release branch is
supported on the Cray XE and XK systems only.
Dependencies:
-------------
The CCE 8.0.2 release is supported on Cray XE systems that run on the Cray
Linux Environment (CLE) operating system, version 3.1 and later and on the
Cray XK systems that run the Cray Linux Environment 4.0 UP01 and later.
73
Release Notes (cont.)
CCE 8.0.2 requires that gcc/4.4.4 be installed. GCC 4.4.4 does not need to
be a default GNU environment.
Cray Performance Measurement and Analysis Tools dependency:
- cce/8.0.0 or later compiles using -h profile_generate require
perftools/5.3.0 in order to provide loop work estimates.
- perftools/5.3.0 is required to support the PGAS (UPC, CAF) runtime
library changes made in CCE 8.0
The Cray Compiling Environment 8.0.2 update requires the following supporting
asynchronous software products:
Cray Compiler Drivers (xt-asyncpe) 5.04 or later
GNU GCC 4.4.4 must be installed, but is not required to be the default GCC
PMI 2.1.4 or later
Cray Scientific Libraries (LibSci) 11.0.00 or later
The Cray Compiling Environment 8.0.2 update requires the following minimum
version if these products are used:
PETSc 3.1.05 or later
hdf5-netcdf 1.8 (HDF5 1.85 and netcdf 4.1.1)
MPT 5.2.3 or later
acml 4.4.0 or later. To use acml 5.0, gcc 4.6.1 must be installed.
Cray Performance Measurement and Analysis Tools 5.3.0
74
• You use compiler driver commands to launch all Cray XE compilers (ftn, cc, and CC)
• The syntax for the compiler driver is:
cc | CC | ftn [Cray_options | PGI_options | GNU_options] files [-lhugetlbfs]
• For example, to use any Fortran compiler (CCE, PGI, GNU) to compile prog1.f90
Use this command: % ftn prog1.f90
• The compiler drivers check which PrgEnv-X module is loaded
Using the Compiler Driver Commands
75
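For illustration only (file names are placeholders), the wrappers are used like ordinary compiler commands, so a makefile simply points its compiler variables at them:

  # compile and link a mixed C/Fortran program with whatever PrgEnv is loaded
  cc  -c solver_io.c
  ftn -c solver.f90
  ftn -o solver.exe solver.o solver_io.o
  # check which underlying compiler the wrapper currently drives
  ftn -V           # PGI and CCE
  ftn --version    # GNU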
The Cray Compilation Environment
This is the default on hermit
CCE Technology Sources
[Diagram: Fortran and C/C++ sources feed their front ends, then interprocedural analysis and optimization/parallelization, and finally the X86 and Cray XK (PTX) code generators produce the object file.]
C and C++ front end supplied by Edison Design Group, with Cray-developed code for extensions and interface support
Fortran front end: Cray Inc. compiler technology, supporting Fortran 2003 plus portions of 2008 (CAF), OpenMP, and Cray-specific programming support
Interprocedural analysis: aggressive inlining and interprocedural optimization, including cross-file
Optimization and parallelization: automatic vectorization and SMP; automatic restructuring for memory usage; OpenMP, UPC and CAF expansion and optimization; heterogeneous target data transfer, parallelization, and optimization; scalar and vector optimization
X86 code generation from LLVM, with additional Cray-developed optimizations and interface support
Cray XK (PTX) code generation derived from the Cray X2 code generator
77
• Compliance with ANSI/ISO Fortran 2003; Fortran 2008 (full compliance targeted for 2012)
  Fortran 2008 coarrays
  Submodules
  Block construct
  Contiguous attribute
  ALLOCATE enhancements (MOLD=, shape from SOURCE/MOLD)
  Intrinsic assignment for polymorphic variables
  Most of the new intrinsic functions
  ISO_Fortran_Env module enhancements
• Compliance with ANSI/ISO C99 and ANSI/ISO C++ 2003 (except the export keyword for templates)
  Support for Kernighan & Ritchie C
  C/C++ enhancements/changes:
    Updated to GCC version 4.4.4 compatibility
    C++ supports the ISO 1998 Standard Template Library (STL) headers
    Upgraded the C and C++ front end to EDG version 4.1
      With this update CCE can better handle modern C++ applications
      Periodic synchronization with the latest sources and bug fixes
      Better support for non-standard GNU language extensions
      The new EDG C and C++ front end more strictly enforces the standards
  UPC 1.2 support
78
CCE Main Features
• AMD Interlagos support, including AVX, FMA, and XOP instructions
• X86/NVIDIA compiler and library development (ongoing “beta” release)
• Support for MPI 2.2
• Full OpenMP 3.0 support
Automatic multithreading integrated with OpenMP
Atomic construct extensions
taskyield construct
firstprivate clause accepts intent(in) and constant objects
• Support for hybrid programming using MPI across node and OpenMP within the node
• Support for IEEE floating-point arithmetic and IEEE file formats
• Cray performance tools and debugger support
• Program Library
• CCE 8.0 was released in December 2011
The full release overview can be found at: http://docs.cray.com/books/S-5212-74/
79
CCE Main Features (cont.)
• C-based UPC and Fortran Coarray are PGAS language extensions, not stand-alone languages
• A subset of Fortran coarray collectives were added for CCE
Although they are not yet part of the official language – they are too useful to be delayed
• Significant improvements were made to the automatic use of blocked network transfers, including:
Automatic conversion of multiple single-word accesses into blocked accesses
Improved capabilities for pattern matching to hand-optimized library routines, including messages stating what might be inhibiting the conversion
• UPC and Fortran coarrays support up to 2,147,483,647 threads within a single application
We actually did hit the previous limit of 65,535!
UPC and Fortran Coarray Features
80
• The Program Library (PL) feature allows the user to specify a repository of compiler information for an application build
This repository provides the framework for future productivity features such as Whole program static error detection
Incremental recompilation
Provide support for the future Cray interactive whole program performance analysis and tuning assistant Reveal
• Two command line options control the Program Library functionality
-h pl=<PL_path> specifies the repository; ftn -hpl=./PL.1 tells the compiler to either update the Program Library "./PL.1"
if it exists, or create it if it does not exist.
<PL_path> should specify a single location to be used for the entire application build. If a makefile changes directories during a build, an absolute path might be necessary.
-h wp enables whole-program mode
Whole-Program Compilation
81
• Whole-program mode (-hwp) requires a program library (-hpl=) and both options must be specified on all compilation command lines as well as on the link line.
  The compiler frontend is invoked for the compilation (-c) command lines
  The compiler backend (inliner, optimizer, code generator) is invoked for all source files when the link line is specified.
  While -hwp might have a negative effect on overall compile time due to increased inlining, it is usually a compile-time shift: the -c compilations become quite fast and the time spent on the link step increases.
  Setting the environment variable NPROC to a number greater than 1 instructs the compiler to invoke NPROC backend processes concurrently. The backend invocations are independent of each other, and setting NPROC to a level appropriate for the host build machine can improve compile time.
• Whole-program mode (-hwp) allows the inliner to see all inline candidates in the application. This option makes cross-file inlining automatic
  Removes the need for -h ipafrom=
  Inlining heuristics are still controlled by -h/-O ipan
Whole-Program Compilation (cont)
82
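A minimal build sketch using the Program Library and whole-program mode as described above; the file names, path and NPROC value are only illustrative.

  # every compile and the final link use the same -hpl and -hwp options
  export NPROC=4                          # run 4 backend processes in parallel
  ftn -hpl=/tmp/build/PL.1 -hwp -c a.f90
  ftn -hpl=/tmp/build/PL.1 -hwp -c b.f90
  # the backend (inlining, optimization, code generation) runs at link time
  ftn -hpl=/tmp/build/PL.1 -hwp -o app.exe a.o b.o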
• Use the default optimization level
  It is the equivalent of most other compilers' -O3 or -fast
  It is also our most thoroughly tested configuration
• Use -O3,fp3 (or -O3 -hfp3, or some variation)
  -O3 only gives you slightly more than -O2
  We also test this thoroughly
  -hfp3 gives you a lot more floating point optimization, especially 32-bit
• If an application is intolerant of floating point reassociation, try a lower -hfp number: try -hfp1 first, use -hfp0 only if absolutely necessary
  Might be needed for tests that require strict IEEE conformance
  Or applications that have 'validated' results from a different compiler
  Interlagos FMA usage is aggressive at -hfp2 and -hfp3, limited at -hfp1, and disabled at -hfp0
• Do not use -Oipa5, -Oaggress, and so on; higher numbers are not always correlated with better performance
Recommended CCE Compilation Options
83
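For illustration (the source file name is a placeholder), the recommendations above translate into command lines such as:

  # recommended starting point with CCE; -rm also writes the loopmark listing
  ftn -O3 -hfp3 -rm -c solver.f90
  # fall back if results are sensitive to floating point reassociation
  ftn -O3 -hfp1 -c solver.f90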
• We recommend using –O3 –hfp3 if the application runs cleanly with these options
• -hfp3 primarily improves 32-bit floating point performance on the X86
• A partial list of what happens at –hfp3 is: Use of fast 32-bit inline division, reciprocal, square root, and reciprocal
square root algorithms (with some loss of precision)
Use of a fast 32-bit inline complex absolute value algorithm
Starting with CCE 8.0, more aggressive reassociation (pre-8.0 –hfp2 behavior)
Various assumptions about floating point trap safety
Somewhat more aggressive about NaN assumptions
Assumes standard-compliant Fortran exponentiation (x**y)
What Exactly Does –hfp3 Do?
84
• Overall Options
-ra creates a listing file with optimization info
-rm produces a source listing with loopmark information
• Preprocessor Options
-eZ runs the preprocessor on Fortran files
-F enables macro expansion throughout the source file
• Optimisation Options
-O2 optimal flags [ enabled by default ]
-O3 aggressive optimization
-O ipa<n> inlining, n=0-5
Cray compiler flags
85
• Language Options
-f free process Fortran source using freeform
-s real64 treat REAL variables as 64-bit
-s integer64 treat INTEGER variables as 64-bit
• Parallelization Options
-O omp Recognize OpenMP directives [default ]
-O thread<n> n=0-3, aggressive parallelization, default n=2
Cray compiler flags
=> man crayftn http://docs.cray.com/cgi-bin/craydoc.cgi?mode=View;id=S-3901-71;idx=books_search;this_sort=;q=3901;type=books;title=Cray%20Fortran%20Reference%20Manual
86
• OpenMP is ON by default
Optimizations controlled by –Othread#
To shut off use –Othread0 or –xomp or –hnoomp
• Autothreading is NOT on by default;
-hautothread to turn on
Modernized version of Cray X1 streaming capability
Interacts with OpenMP directives
• If you do not want to use OpenMP and have OMP directives in the code, make sure to shut off OpenMP at compile time
OpenMP
87
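A short illustration of the defaults described above (program names are placeholders):

  # OpenMP recognition is on by default with CCE
  ftn -o omp_app.exe omp_app.f90
  # build the same source with OpenMP directives switched off
  ftn -hnoomp -o serial_app.exe omp_app.f90
  # opt in to automatic multithreading in addition to OpenMP
  ftn -hautothread -o auto_app.exe omp_app.f90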
• Cray compiler supports a full and growing set of directives and pragmas
!dir$ concurrent
!dir$ ivdep
!dir$ interchange
!dir$ unroll
!dir$ loop_info [max_trips] [cache_na] ... Many more
!dir$ blockable
man directives
man loop_info
CCE Directives
88
Loopmark/Compiler Feedback
• ftn –rm … or cc –hlist=m …
• The compiler will generate a '.lst' file with an annotated listing of your source code, with letters indicating important optimizations
89
• The compiler can generate a filename.lst file.
  It contains an annotated listing of your source code with letters indicating important optimizations
90
Loopmark: Compiler Feedback
%%% L o o p m a r k   L e g e n d %%%
Primary Loop Type           Modifiers
A - Pattern matched         a - vector atomic memory operation
C - Collapsed               b - blocked
D - Deleted                 f - fused
E - Cloned                  i - interchanged
I - Inlined                 m - streamed but not partitioned
M - Multithreaded           p - conditional, partial and/or computed
P - Parallel/Tasked         r - unrolled
V - Vectorized              s - shortloop
W - Unwound                 t - array syntax temp used
                            w - unwound
• ftn –rm … or cc –hlist=m …
91
Example: Cray loopmark messages for Resid
29. b-------< do i3=2,n3-1
30. b b-----< do i2=2,n2-1
31. b b Vr--< do i1=1,n1
32. b b Vr u1(i1) = u(i1,i2-1,i3) + u(i1,i2+1,i3)
33. b b Vr * + u(i1,i2,i3-1) + u(i1,i2,i3+1)
34. b b Vr u2(i1) = u(i1,i2-1,i3-1) + u(i1,i2+1,i3-1)
35. b b Vr * + u(i1,i2-1,i3+1) + u(i1,i2+1,i3+1)
36. b b Vr--> enddo
37. b b Vr--< do i1=2,n1-1
38. b b Vr r(i1,i2,i3) = v(i1,i2,i3)
39. b b Vr * - a(0) * u(i1,i2,i3)
40. b b Vr * - a(2) * ( u2(i1) + u1(i1-1) + u1(i1+1) )
41. b b Vr * - a(3) * ( u2(i1-1) + u2(i1+1) )
42. b b Vr--> enddo
43. b b-----> enddo
44. b-------> enddo
Example: Cray loopmark messages for Resid (cont)
ftn-6289 ftn: VECTOR File = resid.f, Line = 29
  A loop starting at line 29 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 29
  A loop starting at line 29 was blocked with block size 4.
ftn-6289 ftn: VECTOR File = resid.f, Line = 30
  A loop starting at line 30 was not vectorized because a recurrence was found on "U1" between lines 32 and 38.
ftn-6049 ftn: SCALAR File = resid.f, Line = 30
  A loop starting at line 30 was blocked with block size 4.
ftn-6005 ftn: SCALAR File = resid.f, Line = 31
  A loop starting at line 31 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 31
  A loop starting at line 31 was vectorized.
ftn-6005 ftn: SCALAR File = resid.f, Line = 37
  A loop starting at line 37 was unrolled 4 times.
ftn-6204 ftn: VECTOR File = resid.f, Line = 37
  A loop starting at line 37 was vectorized.
92
• The cc(1), CC(1), and ftn(1) man pages contain information about the compiler driver commands
• The pgcc(1), pgCC(1), and pgf95(1) man pages contain descriptions of the PGI compiler command options
• The craycc(1), crayCC(1), and crayftn(1) man pages contain descriptions of the Cray compiler command options
• The gcc(1), g++(1), and gfortran(1) man pages contain descriptions of the GNU compiler command options
• To verify that you are using the correct version of a compiler, use: -V option on a cc, CC, or ftn command with PGI and CCE --version option on a cc, CC, or ftn command with GNU
Compiler man Pages
93
• One rounding for the FMA as a whole, rather than two (one for multiply and one for addition)
• That sounds like a minor difference, but these differences can accumulate
• In our internal testing, most of the differences were manually approved by examining them and deciding the FMA-based results were within an acceptable range
• Actual applications – at least some of them – appear to be less forgiving
• There is no hardware way to obtain the exact same result between FMAs and individual multiplications and additions
… but the performance difference means we really do need to use them
• Some level of FMA control is provided by CCE –hfp options
-hfp0: No FMA generation (but also disables a lot of other stuff)
-hfp1: Generate FMAs, but not across user parentheses
-hfp2,3: Aggressive FMA generation
94
Impact of Fused Multiply-Add (FMA) on Application Results
Cray and PGI compiler flags
Feature                     PGI                   Cray
Listing                     -Mlist                -ra
Diagnostic                  -Minfo -Mneginfo      (produced by -ra)
Free format                 -Mfree                -f free
Preprocessing               -Mpreprocess          -eZ -F
Suggested optimization      -fast                 (default)
Aggressive optimization     -Mipa=fast,inline     -O3,fp3
Variable sizes              -r8 -i8               -s real64 -s integer64
Byte swap                   -byteswapio           -h byteswapio
OpenMP recognition          -mp=nonuma            (default)
Automatic parallelization   -Mconcur              -h autothread
95
• GNU (PrgEnv-gnu)
  Suggested options: -O3 -ffast-math -funroll-loops
  Compiler feedback: -ftree-vectorizer-verbose=2
  OpenMP: -fopenmp
  Man pages: gcc, gfortran, g++
• Intel (PrgEnv-intel)
  Suggested options: -O3
  Aggressive options: -ffast-math -funroll-loops -msse3 -ftree-vectorize
  OpenMP: -openmp
  Careful: an extra control thread is spawned, which causes issues when pinning threads to cores. Try aprun -cc [none|numa_node] instead of -cc cpu
  Man pages: ifort, icc
Other programming environments
96
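As a sketch (the file name is a placeholder), switching the programming environment and applying the suggested GNU options from above looks like:

  # switch from the default Cray environment to GNU
  module swap PrgEnv-cray PrgEnv-gnu
  # the same ftn wrapper now drives gfortran
  ftn -O3 -ffast-math -funroll-loops -fopenmp -o app.exe app.f90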
• Compiling on a Linux service node
• Generating an executable for a CLE compute node
• Do not use pgf90, pgcc, gcc, g++, ..., unless you want a Linux executable for the service node
Use ftn, cc, or CC instead
Cross Compiling Environment
97
Running an application on the Cray XE6
• ALPS : Application Level Placement Scheduler
• aprun is the ALPS application launcher
It must be used to run application on the XE compute nodes
If aprun is not used, the application is launched on the Mom node (and will most likely fail)
aprun man page contains several useful examples
At least 3 important parameters to control:
  The total number of PEs: -n
  The number of PEs per node: -N
  The number of OpenMP threads: -d (more precisely: the 'stride' between 2 PEs in a node)
Running an application on the Cray XE ALPS + aprun
99
Some Definitions
• ALPS is always used for scheduling a job on the compute nodes. It does not care about the programming model you used, so we need a few general definitions:
  PE: Processing Element
    Basically a Unix process; can be an MPI task, CAF image, UPC thread, ...
  Numa_node
    The cores and memory on a node with 'flat' memory access; basically one of the 4 dies of the Opteron and its directly attached memory.
  Thread
    A thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different PEs do not share these resources. Most likely you will use OpenMP threads.
100
• Assuming an XE6 IL16 system (32 cores per node)
• Pure MPI application, using all the available cores in a node
  $ aprun -n 32 ./a.out
• Pure MPI application, using only 1 core per node
  32 MPI tasks, 32 nodes with 32*32 cores allocated
  Can be done to increase the available memory per MPI task
  $ aprun -N 1 -n 32 -d 32 ./a.out   (we'll talk about the need for -d 32 later)
• Hybrid MPI/OpenMP application, 4 MPI ranks per node
  32 MPI tasks, 8 OpenMP threads each
  Need to set OMP_NUM_THREADS:
  $ export OMP_NUM_THREADS=8
  $ aprun -n 32 -N 4 -d $OMP_NUM_THREADS ./a.out
Running an application on the Cray XE6 some basic examples
101
• CNL can dynamically distribute work by allowing PEs and threads to migrate from one CPU to another within a node
• In some cases, moving PEs or threads from CPU to CPU increases cache and translation lookaside buffer (TLB) misses and therefore reduces performance
• CPU affinity options enable binding a PE or thread to a particular CPU or a subset of CPUs on a node
• aprun CPU affinity options (see man aprun)
  Default setting: -cc cpu   (PEs are bound to a specific core, depending on the -d setting)
  Binding PEs to a specific NUMA node: -cc numa_node   (PEs are not bound to a specific core but cannot 'leave' their numa_node)
  No binding: -cc none
  Own binding: -cc 0,4,3,2,1,16,18,31,9,...
aprun CPU Affinity control
102
• Cray XE6 systems use dual-socket compute nodes with 4 dies
Each die (8 cores) is considered a NUMA-node
• Remote-NUMA-node memory references can adversely affect performance. Even if your PEs and threads are bound to a specific numa_node, the memory used does not have to be 'local'
• aprun memory affinity options (see man aprun)
  Suggested setting is -ss: a PE can only allocate memory local to its assigned NUMA node. If this is not possible, your application will crash.
Memory affinity control
103
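Putting the two options together, a typical hybrid launch with core binding and local memory allocation might look like this sketch (executable name, PE and thread counts are placeholders):

  export OMP_NUM_THREADS=8
  # one MPI rank per NUMA node (die), 8 OpenMP threads each,
  # ranks pinned with -cc cpu and memory kept local with -ss
  aprun -n 16 -N 4 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out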
Running an application on the Cray XT - MPMD
• aprun supports MPMD – Multiple Program Multiple Data
• Launching several executables in the same MPI_COMM_WORLD:
  $ aprun -n 128 exe1 : -n 64 exe2 : -n 64 exe3
• Note: each executable needs a dedicated node; exe1 and exe2 cannot share a node. Example: the following command needs 3 nodes
  $ aprun -n 1 exe1 : -n 1 exe2 : -n 1 exe3
• Use a script to start several serial jobs on a node:
  $ aprun -a xt -n 1 -cc none -d32 script.sh
  > cat script.sh
./exe1&
./exe2&
./exe3&
wait
>
104
● In this mode, an MPI task is pinned to each integer core
● Implications
● Each core has exclusive access to an integer scheduler, integer pipelines and L1 Dcache
● The 256-bit FP unit and the L2 Cache is shared between the two cores
● 256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy
● When to use
● Code is highly scalable to a large number of MPI ranks
● Code can run with 1 GB per core memory footprint (or 2 GB on 64 GB node)
● Code is not well vectorized
How to use the Interlagos 1/3: one MPI rank on each integer core
[Figure: Bulldozer module with MPI Task 0 on Int Core 0 and MPI Task 1 on Int Core 1; fetch/decode, the FP scheduler with its two 128-bit FMACs and the L2 cache are shared components.]
● In this mode, only one integer core is used per core pair
● Implications
● This core has exclusive access to the 256-bit FP unit and is capable of 8 FP results per clock cycle
● The core has twice the memory capacity and memory bandwidth in this mode
● The L2 cache is effectively twice as large
● The peak of the chip is not reduced
● When to use
● Code is highly vectorized and makes use of AVX instructions
● Code needs more memory per MPI rank
How to use the Interlagos 2/3: wide AVX mode
[Figure: Bulldozer module with only Int Core 0 active; the second integer core, its scheduler, pipelines and L1 DCache are idle, while the active core owns the whole 256-bit FP unit and the L2 cache.]
● In this mode, an MPI task is pinned to a core pair
● OpenMP is used to run a thread on each integer core
● Implications
● Each OpenMP thread has exclusive access to an integer scheduler, integer pipelines and L1 Dcache
● The 256-bit FP unit and the L2 Cache is shared between the two threads
● 256-bit AVX instructions are dynamically executed as two 128-bit instructions if the 2nd FP unit is busy
● When to use
● Code needs a large amount of memory per MPI rank
● Code has OpenMP parallelism exposed in each MPI rank
How to use the Interlagos 3/3: 2-way OpenMP mode
[Figure: Bulldozer module with one MPI task per module; OpenMP Thread 0 runs on Int Core 0 and OpenMP Thread 1 on Int Core 1; the FP scheduler, the two 128-bit FMACs and the L2 cache are shared components.]
Aprun: cpu_lists for each PE
• CLE was updated to allow threads and processing elements more flexibility in placement. This is ideal for processor architectures whose cores share resources that they may have to wait to utilize. Separating cpu_lists by colons (:) allows the user to specify the cores used by processing elements and their child processes or threads. Essentially, this provides the user more granularity to specify cpu_lists for each processing element. Here is an example with 3 threads per PE:
  aprun -n 4 -N 4 -cc 1,3,5:7,9,11:13,15,17:19,21,23
• Note: this feature will be modified in CLE 4.0.UP03, but this option will still be valid.
108
Running a batch application with Torque
• The number of required nodes and cores is determined by the parameters specified in the job header #PBS -l mppwidth=256
#PBS -l mppnppn=4
This example uses 256/4=64 nodes
• The job is submitted by the qsub command
• At the end of the execution output and error files are returned to submission directory
• PBS environment variable: $PBS_O_WORKDIR Set to the directory from which the job has been submitted Default is $HOME
• man qsub for env. variables
109
Other Torque options
• #PBS -N job_name
the job name is used to determine the name of job output and error files
• #PBS -l walltime=hh:mm:ss
Maximum job elapsed time
should be indicated whenever possible: this allows Torque to determine the best scheduling strategy
• #PBS -j oe
job error and output files are merged in a single file
• #PBS -q queue
request execution on a specific queue
110
Torque and aprun
Torque                   aprun
-l mppwidth=$PE          -n $PE           number of PEs to start
-l mppdepth=$threads     -d $threads      #threads per PE
-l mppnppn=$N            -N $N            #PEs per node
<none>                   -S $S            #PEs per numa_node
-l mem=$size             -m $size[h|hs]   per-PE required memory
111
• -B will provide aprun with the Torque settings for -n, -N, -d and -m
  aprun -B ./a.out
• Using -S can produce problems if you are not asking for a full node.
  If possible, ALPS will only give you access to part of a node if the Torque
  settings allow this. The following will fail:
  #PBS -lmppwidth=4    ! not asking for a full node
  aprun -n4 -S1 ...    ! trying to run on every die
• The solution is to ask for a full node, even if aprun doesn't use all of it (see the sketch below)
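A minimal sketch of this workaround, assuming one PE per die on a single 32-core node:

  #PBS -l mppwidth=32      # request the full node
  #PBS -l mppnppn=32
  cd $PBS_O_WORKDIR
  # 4 PEs, one per numa_node; the remaining cores stay idle
  aprun -n 4 -S 1 ./a.out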
Core specialization
• System ‘noise’ on compute nodes may significantly degrade scalability for some applications
• Core Specialization can mitigate this problem
1 core per node will be dedicated for system work (service core)
As many system interrupts as possible will be forced to execute on the service core
The application will not run on the service core
• Use aprun -r to get core specialization
  $ aprun -r 1 -n 100 a.out
• apcount is provided to compute the total number of cores required
  $ qsub -l mppwidth=$(apcount -r 1 1024 16) job
  aprun -n 1024 -r 1 a.out
112
Core Specialization and MPI progress
Typical HPC application threads tend to run hot, i.e. they don’t typically make calls that result in yielding of the core on which they are scheduled
Because of this, MPI progress threads need to have at least one core of a compute unit available per node for efficient handling of interrupts received from Gemini
Core Specialization provides a convenient way to partition cores on a node between hot application threads, and cool system service daemon threads as well as MPI progress threads.
MPI Asynchronous Progress – enabling
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
aprun -r 1 ...
113
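Combining the settings above into a job script fragment (the PE count is illustrative):

  # reserve one core per node for system work and the MPI progress threads
  export MPICH_NEMESIS_ASYNC_PROGRESS=1
  export MPICH_MAX_THREAD_SAFETY=multiple
  aprun -r 1 -n 1024 ./a.out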
Running a batch application with Torque
• The number of required nodes can be specified in the job header
• The job is submitted by the qsub command
• At the end of the execution, output and error files are returned to the submission directory
• Environment variables are inherited by #PBS -V
• The job starts in the home directory. $PBS_O_WORKDIR contains the directory from which the job has been submitted
Hybrid MPI + OpenMP
#!/bin/bash
#PBS -N hybrid
#PBS -lwalltime=00:10:00
#PBS -lmppwidth=128
#PBS -lmppnppn=8
#PBS -lmppdepth=4
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=4
aprun -n128 -d4 -N8 a.out
114
Starting an interactive session with Torque
• An interactive job can be started by the –I argument
That is <capital-i>
• Example: allocate 64 cores and export the environment variables to the job (-V)
  $ qsub -I -V -lmppwidth=64 -lmppnppn=32
• This will give you a new prompt in your shell from which you can use aprun directly. Note that you are running on a MOM node (a shared resource) when not using aprun
115
Watching a launched job on the Cray XE
• xtnodestat
Shows XE nodes allocation and aprun processes
Both interactive and batch jobs
• apstat
Shows aprun processes status
apstat overview
apstat –a[ apid ] info about all the applications or a specific one
apstat –n info about the status of the nodes
• Batch qstat command
shows batch jobs
116
Accounting at HLRS
• Accounting is done by examining the Torque log files and is based on the Unix group id a user belongs to
  Normally the user doesn't have to do anything
• If a user is involved in several projects, he has to select the correct one by setting the group id in the batch script:
  #PBS -W group_list=<group name>
117
Lustre filesystem at HLRS
• In order to use Lustre at HLRS, you have to create a "workspace"
• HLRS provides a tool suite to create and manage workspaces
• To allocate a workspace: ws_allocate <name> <duration>
• To list your workspaces: ws_list
• After <duration>, the workspace is deleted. You can extend the <duration> 3 times.
• https://wickie.hlrs.de/platforms/index.php/Workspace_mechanism
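A small usage sketch, assuming ws_allocate prints the workspace path on stdout (check the wiki page above for the exact behaviour); names and durations are placeholders.

  # create a workspace for 30 days and remember its path
  WS=$(ws_allocate myrun 30)
  # stage input data and run from the fast Lustre workspace
  cp input.dat $WS/
  cd $WS
  aprun -n 256 $HOME/bin/a.out
  # list existing workspaces and their expiry dates
  ws_list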
Starting 512 MPI tasks (PEs)
#PBS -N MPIjob
#PBS -l mppwidth=512
#PBS -l mppnppn=32
#PBS -l walltime=01:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
export MPICH_ENV_DISPLAY=1
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
aprun -n 512 -cc cpu -ss ./a.out
119
Starting a hybrid job: single node, 8 MPI tasks, each with 4 threads
#PBS -N hybrid
#PBS -l mppwidth=8
#PBS -l mppnppn=8
#PBS -l mppdepth=4
#PBS -l walltime=01:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
export MPICH_ENV_DISPLAY=1
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
export OMP_NUM_THREADS=4
aprun -n8 -N8 -d $OMP_NUM_THREADS -cc cpu -ss ./a.out
120
Starting an MPMD job on a non-default project id, using 1 master and 16 workers, each worker with 8 threads
#PBS -N hybrid
#PBS -l mppnppn=32
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -W group_list=My_Project
cd $PBS_O_WORKDIR
export MPICH_ENV_DISPLAY=1
export MALLOC_MMAP_MAX_=0
export MALLOC_TRIM_THRESHOLD_=-1
export OMP_NUM_THREADS=8
id   # Unix command 'id', to check the group id
aprun -n1 -d32 -N1 ./master.exe : \
      -n 16 -N4 -d $OMP_NUM_THREADS -cc cpu -ss ./worker.exe
121
Starting an MPI job on two nodes using only every second integer core
#PBS -N hybrid
#PBS -l mppwidth=32
#PBS -l mppnppn=16
#PBS -l mppdepth=2
#PBS -l walltime=01:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
export MPICH_ENV_DISPLAY=1
aprun -n32 -N16 -d 2 -cc cpu -ss ./a.out
122
Starting a hybrid job on two nodes using only every second integer core
#PBS -N hybrid
#PBS -l mppwidth=32
#PBS -l mppnppn=16
#PBS -l mppdepth=2
#PBS -l walltime=01:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
export MPICH_ENV_DISPLAY=1
export OMP_NUM_THREADS=2
aprun -n32 -N16 -d $OMP_NUM_THREADS \
      -cc 0,2:4,6:8,10:12,14:16,18:20,22:24,26:28,30 -ss ./a.out
123
• HLRS wiki
https://wickie.hlrs.de/platforms/index.php/Cray_XE6
• Cray docs site
http://docs.cray.com
• Starting point for Cray XE info
http://docs.cray.com/cgi-bin/craydoc.cgi?mode=SiteMap;f=xe_sitemap
• Twitter ?!?
http://twitter.com/craydocs
Documentation
124
End
125