A24: DAY 1 PART1: MACHINES Ana Lucia Varbanescu, UvA [email protected]
Plan (ambitious) • Introduction: About HPC • Part I : Machines & parallelism • Part II : Applications & parallelism • Part III : Performance
Scientific Paradigms • Traditionally:
• Do theory, design, and analysis on paper • Build systems to test hypotheses
• Limitations: • Too difficult, expensive, slow, dangerous
• The Computational Science paradigm: • Use (high performance) computing systems to simulate the
phenomenon • based on known physical laws and numerical methods. • analyze the output data to *understand* the phenomenon
Some challenging computations • Science
• Global climate • Astrophysics • Biology: genome analysis, protein folding • Earthquake and Tsunami prediction
• Engineering • Car-crash simulation • Semiconductor design • Structural analysis • Oil and gas exploration • Aerodynamics • …
Some challenging computations • Business
• Financial and economic modeling • Transaction processing • Web search and indexing engines
• Defense and Security • Nuclear weapons • Cryptography
• Movie animations • Hair* (compare), fur, cloth • Sounds
*Lena Petrovic et al., “Volumetric Methods for Simulation and Rendering of Hair”
Global Climate Modeling Problem • Problem: compute
• F (latitude, longitude, elevation, time) = weather = (temperature, pressure, humidity, wind velocity)
• Approach*: • Discretize the domain, e.g., a grid with measurement points every N km • Devise a model to predict climate at time t+∆t given climate at time t
*Randall, D. A. et al., “Climate modeling with spherical geodesic grids”
Global Climate Modeling Problem • Model the fluid flow in the atmosphere
• Solve the resulting Navier-Stokes equations • ~100 Flops per grid point at 1 min. timestep • Earth surface is approximately 5.1x10^8 km^2
• Grid patches of 1 km^2 => 5.1x10^8 points, i.e., ~5x10^10 Flops per time step for the whole Earth
• Computational requirements: • Real-time: 5x10^10 Flops in 60 seconds ≈ 0.8 GFlop/s • Weather prediction (7 days in 24 hours): 5.6 GFlop/s • Climate prediction (100 years in 30 days): 1 TFlop/s • Scenario analysis (100 scenarios): 100 TFlop/s
The Effect of Grid Resolution • Daily precipitation during winter time: models with different resolutions
[Figure: the same precipitation field modeled at 300 km, 100 km, 75 km, and 50 km resolution]
Global Climate Modeling Problem • Double the grid resolution: 4 x computation (2D) • Add B elevation levels: ~ B x (higher for 3D) • Adequate couplings for submodels
• ocean, land, atmosphere, sea-ice
• Requires thousands of TFlop/s • Current (2014) top performance: ~33,000 TFlop/s (33 PFlop/s) on LINPACK
• and LINPACK performance is NOT application performance
• What can we do? • We use coarser and less accurate models • We build bigger/better/faster systems • We design new parallel algorithms
Take home message [1] • High Performance Computing is required to make
computationally and/or data-intensive problems feasible in time.
• HPC main metric = performance
• Compared “against”: • Real-time apps => worst-case execution guarantees • Embedded/mobile apps => power consumption • Marketing => best-case execution time
Take home message [2] • HPC goal = maximize the performance of a given
application on a given platform
• HPC = f(Machine, Application, Parallelism) • HPC machines
• Super-chips, super-computers, grids, clouds • HPC applications
• Lots of old and new application fields, simulations, predictions, models, ...
• HPC parallelism • Multiple layers
HPC pulse • TOP500 Project*
• The 500 most powerful computers in the world
• Benchmark: Rmax of LINPACK • Solve the Ax=b linear system
• dense problem • matrix A is random
• Dominated by dense matrix-matrix multiply
• Metric: FLOP/s • Computational throughput: number of floating point operations per second
• Updated twice a year: latest is Nov 2014
*Read more: www.top500.org
Classify Parallel Architectures • Shared Memory
• Homogeneous compute nodes • Shared address space
• Distributed Memory • Homogeneous compute nodes • Local (disjoint) address spaces
• Hybrids • Heterogeneous compute nodes • Mixed address space(s)
Shared memory machines • All processors have access to all memory locations
• Uniform access time: UMA • Non-uniform access times: NUMA
• Interconnections (networks) • Bus/buses/ring(s) • Meshes (all-to-all) • Cross-bars
Distributed memory machines • Every processor has its own local address space • Non-local data can only be accessed through a request to
the owning processor • Access times (may) depend on distance => non-uniform
Virtual Shared Memory • Virtual global address space
• Memories are physically distributed • Local access remains faster than remote access
• NUMA • All processors can access all memory spaces • Hardware (or software) hides the memory distribution
Hybrids • Various combinations between shared and distributed
memory spaces. • More flexible in terms of coherence and consistency • More complex to program
Major issues • Shared Memory model
• Scalability problems (interconnect) • Programming challenge: RD/WR Conflicts
• Distributed Memory model • Data distribution is mandatory • Programming challenge: remote accesses, consistency
• Virtual Shared Memory model • Significant virtualization overhead • Easier programming
• Hybrid models • Local/remote data more difficult to trace
Examples • Multi-core CPUs ?
• Shared memory with respect to system memory • Hybrid when taking caches into account
• Clusters ? • Distributed memory • Could be shared if middleware for virtual shared space is provided
• Supercomputers ? • Usually hybrid
• GPUs ? • Architectures with GPUs?
• Distributed for traditional, off-chip GPUs • Shared for new APUs
Intra-Processor parallelism • Low-level parallelism
• Hidden from programmers • Should be automatically addressed by compilers • Low-level languages do expose it, if needed
• An example: Vector / SIMD operation
• Memory hierarchies • Caches and non-caches alike
SIMD (vector operations) • Scalar processing
• Traditional • One operation produces one result
• Vector processing • With SSE, SSE2 … , AVX • One operation produces 4
results
[Figure: scalar add produces one result per operation (C = A + B); the SIMD add produces four (C1..C4 = A1..A4 + B1..B4) in a single operation]
SSE/SSE2/…/SSE4 • Assembly instructions
• 16 (or more) registers • C/C++ intrinsics = “macros” to work on variables

float data[N];
for (i = 0; i < N; i++) data[i] = (float)i;
// vector: load first 4 elements, then next 4
__m128 myVector0 = _mm_load_ps(data);
__m128 myVector1 = _mm_load_ps(data + 4);
// vector: add => 4 FLOPs
__m128 myVector2 = _mm_add_ps(myVector0, myVector1);
// vector: _MM_SHUFFLE(x,y,z,t): t,z pick lanes from operand 1 (low half),
//         y,x pick lanes from operand 2 (high half)
__m128 myVector3 = _mm_shuffle_ps(myVector0, myVector1,
                                  _MM_SHUFFLE(2, 3, 0, 1));
Vector addition with SSE

float A[N], B[N], C[N];
for (i = 0; i < N; i++)
    C[i] = A[i] + B[i];

• Step 1: loop unrolling

for (i = 0; i < N; i += 4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}

• Step 2: use of intrinsics

for (i = 0; i < N; i += 4) {
    __m128 vecA = _mm_load_ps(A + i);
    __m128 vecB = _mm_load_ps(B + i);
    __m128 vecC = _mm_add_ps(vecA, vecB);
    _mm_store_ps(C + i, vecC);
}
SSE vs. AVX • SSE/SSEn is Intel-specific • AVX is an extension of SSE
• To be supported by more vendors • Currently: AMD and Intel
• AVX fact sheet: • Larger registers: 256b instead of 128b • Supports legacy instructions (lower 128b) • Adds support for 3-operand instructions
Memory performance • Flat memory model
• All accesses = same latency • Memory latency improves more slowly than processor speed
The gap grows ~50% per year
… which means we wait longer for any access to the (DRAM) memory!
Solutions • Improve latency
• Technology • Memory hierarchies
• Make better use of bandwidth • Bandwidth increases 3x faster than latency! • Collect/gather/coalesce multiple memory accesses
• Overlap computation and communication • Prefetching • Default/automated (low-level) • Software-managed (aka, programmer implemented …)
Memory hierarchies • Hierarchical memory model
• Several memory spaces • Large size, low cost, high latency – main memory • Small size, high cost, low latency – caches / registers
• Main idea: Bring some of the data closer to the processor • Smaller latency => faster access • Smaller capacity => not all data fits!
• Who can benefit? • Applications with locality in their data accesses
• Spatial locality • Temporal locality
Memory hierarchies (cont’d) • Limitations
• Size: no space for every memory address • Organization: what gets loaded & where ? • Policies: who’s in, who’s out, when, why?
• Performance • Hit = access found data in fast memory => low latency • Miss = data not in fast memory => high latency + penalty • Metric: hit ratio (H) = the fraction of accesses that hit • Performance gain: ?
• T_nocache = N * T_main_memory • T_cache = N*(1-H)*(T_main_memory + penalty) + N*H*T_cache_hit
Why do memory hierarchies work? • Temporal locality
• RD(x), RD(x), RD(x): • Main_RD, Cache_hit, Cache_hit • Gain: Cache latency is better !
• Spatial locality • RD(x),RD(x+1),RD(x+2):
• Main_RD, Cache_hit, Cache_hit • Gain: Cache latency is better + better bandwidth for coalesced
operations RD(x,x+1,x+2, … )
Memory performance in practice • Theoretical:
• Peak latency & bandwidth are given • … but in theoretical optimal conditions
• Practice: • Microbenchmarking of the memory system
• E.g.: Membench • … or build your own benchmark
Multi-core CPUs
• Architecture • Few large cores • (Integrated GPUs) • Vector units
• Streaming SIMD Extensions (SSE) • Advanced Vector Extensions (AVX)
• Stand-alone • Memory
• Shared, multi-layered • Per-core caches + shared caches
• Programming • Multi-threading • OS Scheduler
Parallelism
• Core-level parallelism ~ task/data parallelism (coarse) • 4-12 powerful cores
• Hardware hyperthreading (2x) • Local caches • Symmetrical or asymmetrical threading model • Implemented by programmer
• SIMD parallelism = data parallelism (fine) • 4 SP / 2 DP floating point operations per instruction (8/4 with 256-bit AVX)
• 256-bit vectors • Run same instruction on different data • Sensitive to divergence
• NOT the same instruction => performance loss • Implemented by programmer OR compiler
Programming models
• Pthreads + intrinsics • TBB – Threading Building Blocks
• Threading library
• OpenCL • To be discussed …
• OpenMP • Traditional parallel library • High-level, pragma-based
• Cilk • Simple divide-and-conquer model
(Level of abstraction increases)
GPGPU History
• Current generation: NVIDIA Kepler • 7.1B transistors • More cores, more parallelism, more performance
[Timeline 1995-2010: RIVA 128 (3M transistors), GeForce 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), “Fermi” (3B)]
GPGPU History
• Use Graphics primitives for HPC • Ikonas [England 1978] • Pixel Machine [Potmesil & Hoffert 1989] • Pixel-Planes 5 [Rhoades, et al. 1992]
• Programmable shaders, around 1998 • DirectX / OpenGL • Map application onto graphics domain!
• GPGPU • Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...
(NVIDIA) GPUs
• Architecture • Many (100s) slim cores • Sets of (32 or 192) cores grouped into “multiprocessors” with shared memory
• SM(X) = streaming multiprocessor(s)
• Work as accelerators • Memory
• Shared L2 cache • Per-core caches + shared caches • Off-chip global memory
• Programming • Symmetric multi-threading • Hardware scheduler
Parallelism
• Data parallelism (fine-grain) • Restricted forms of task parallelism possible with newest
generation of NVIDIA GPUs
• SIMT (Single Instruction Multiple Thread) execution • Many threads execute concurrently
• Same instruction • Different data elements • HW automatically handles divergence
• Not same as SIMD because of multiple register sets, addresses, and flow paths*
• Hardware multithreading • HW resource allocation & thread scheduling
• Excess of threads to hide latency • Context switching is (basically) free
*http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html
Larrabee
• GPU based on x86 architecture • Hardware multithreading • Wide SIMD
• Achieved 1 TFlop/s sustained application performance (SC’09)
• Cancelled • Dec’09
Intel Xeon Phi (=MIC) • First product: “Knights Corner”
• Accelerator or stand-alone • ±60 Pentium1-like cores • 512-bit SIMD
• Memory • L1-cache per core (32KB I$ + 32KB D$) • Very large unified L2-cache (512KB/core, ~30MB/chip) • At least 8GB of GDDR5
• Programming • Multi-threading
• Traditional models: OpenMP, MPI, Cilk, TBB, parallel libraries + OpenCL • OS thread scheduler
• User-control via affinity
Parallelism
• 3 operation modes • Stand-alone • Offload • Hybrid
• Core-level parallelism ~ task parallelism (coarse) • ~60 cores, 4x hyperthreading
• SIMD parallelism = data parallelism (fine) • Fine-grained parallelism • 16 SP-Flop, 16 int-ops, 8 DP-Flop / cycle • 512-bit vector extensions
• No support for MMX, SSE => not backward compatible
Typical examples • Clusters
• See DAS4
• Super-computers • Half of the TOP500
• Grids & clouds • See Grid5000 • See Amazon EC2