Exascale: Why It is Different
Barbara Chapman, University of Houston
High Performance Computing and Tools Group, http://www.cs.uh.edu/~hpctools
P2S2 Workshop at ICPP, Taipei, Taiwan, September 13, 2011
Petascale is a Global Reality
K computer: 68,544 SPARC64 VIIIfx processors, Tofu interconnect, Linux-based enhanced OS; produced by Fujitsu
Tianhe-1A: 7,168 Fermi GPUs and 14,336 CPUs; it would require more than 50,000 CPUs and twice as much floor space to deliver the same performance using CPUs alone
Jaguar: 224,256 x86-based AMD Opteron processor cores; each compute node features two Opterons with 12 cores and 16 GB of shared memory
Nebulae: 4,640 Nvidia Tesla GPUs, 9,280 Intel X5650-based CPUs
Tsubame: 4,200 GPUs
Exascale Systems: The Race is On
Town Hall Meetings, April–June 2007
Scientific Grand Challenges Workshops, November 2008 – October 2009: Climate Science (11/08), High Energy Physics (12/08), Nuclear Physics (1/09), Fusion Energy (3/09), Nuclear Energy (5/09, with NE), Biology (8/09), Material Science and Chemistry (8/09), National Security (10/09, with NNSA)
Cross-cutting workshops: Architecture and Technology (12/09); Architecture, Applied Mathematics and Computer Science (2/10)
Meetings with industry (8/09, 11/09)
External panels: ASCAC Exascale Charge, Trivelpiece Panel
Peak performance is 10^18 floating point operations per second
DOE Exascale Exploration: Findings
Exascale is essential in important areas: climate, combustion, nuclear reactors, fusion, stockpile stewardship, materials, astrophysics, …; also electricity production, alternative fuels, fuel efficiency
Systems need to be both usable and affordable: they must deliver an exascale level of performance to applications, and the power budget cannot exceed today's petascale power consumption by more than a factor of 3
Applications need to exhibit strong scaling, or weak scaling that does not increase memory requirements
Need to understand the R&D implications (HW and SW)
What is the roadmap?
IESP: International Exascale Software Project
International effort to specify a research agenda that will lead to exascale capabilities: academia, labs, agencies, industry
Requires open international collaboration and a significant contribution of open-source software
Revolution vs. evolution?
Produced a detailed roadmap; focused meetings to determine R&D needs
IESP: Exascale Systems
Given budget constraints, current predictions focus on two alternative designs:
– Huge number of lightweight processors, e.g. 1 million chips at 1,000 cores/chip = 1 billion threads of execution
– Hybrid processors, e.g. a 1.0 GHz processor with 10,000 FPUs/socket and 100,000 sockets/system = 1 billion threads of execution
See http://www.exascale.org/
Platforms expected around 2018
Exascale: Anticipated Architectural Changes
Massive (ca. 4,000×) increase in concurrency, mostly within the compute node
Balance between compute power and memory changes significantly: roughly 500× the compute power of today's 2 PF hardware, but only 30× the memory, so bytes per flop fall by more than an order of magnitude
Memory access time lags further behind
But Wait, There’s More…
Need a breakthrough in power efficiency: impact on all of hardware, cooling, and system software; power-awareness in applications and the runtime
Need a new design of I/O systems: reduced relative I/O bandwidth and file-system scaling
Must improve system resilience and manage component failure: checkpoint/restart won't work
DOE’s Co-design Model
Integrate decisions and explore trade-offs across hardware and software
Simultaneous redesign of the architecture and the application software
Requires significant interactions, incl. with hardware vendors, and accurate simulators
Programming model and system software will be key to success
Initial set of projects funded in key areas
Hot Off The Press
"Exaflop supercomputer receives full funding from Senate appropriators", September 11, 2011, 11:45pm ET, by David Perera
An Energy Department effort to create a supercomputer three orders of magnitude more powerful than today's most powerful computer (an exascale computer) would receive $126 million during the coming federal fiscal year under a Senate Appropriations Committee markup of the DOE spending bill.
The Senate Appropriations Committee voted Sept. 7 to approve the amount as part of the fiscal 2012 energy and water appropriations bill; fiscal 2011 ends on Sept. 30. The $126 million is the same as requested earlier this year in the White House budget request.
In a report accompanying the bill, Senate appropriators say Energy plans to deploy the first exascale system in 2018, despite challenges.
Heterogeneous High-Performance System
Each node has multiple CPU cores, and some of the nodes are equipped with additional computational accelerators, such as GPUs.
www.olcf.ornl.gov/wp-content/uploads/.../Exascale-ASCR-Analysis.pdf
Many More Cores
Biggest change is within the node: some full-featured cores, many low-power cores, specialized cores
Technology is rapidly evolving; the cores will be integrated on one chip, the easiest way to get power efficiency and high performance
Global memory, but a low amount of memory per core
Coherency domains, networks on chip
[Block diagram of a heterogeneous SoC: ARM® Cortex™-A8 CPU; C64x+™ DSP and video accelerators (3525/3530 only); POWERVR SGX™ graphics (3515/3530 only); L3/L4 interconnect; peripherals; program/data storage; serial interfaces; display subsystem; connectivity; camera I/F]
Example: Nvidia Hardware
Echelon, a "radical and rapid evolution of GPUs for exascale systems" (SC10), expected in 2013–2014: ~1024 stream cores and ~8 latency-optimized CPU cores on a single chip; ~20 TFLOPS in a shared-memory system; 25× the performance of Fermi
Plan is to integrate this in an exascale chip
Programming Challenges
Scale and structure of parallelism: hierarchical parallelism, many more cores, more nodes in clusters, heterogeneous nodes
New methods and models to match the architecture: exploit intranode concurrency and heterogeneity; adapt to reduced memory size; drastically reduce data motion
Resilience in algorithms; fault tolerance in applications
Need to run multi-executable jobs: coupled computations; postprocessing of data while in memory
Uncertainty quantification to clarify applicability of results
Programming Models? Today’s Scenario

// Fragment from the end of main() in a hybrid MPI + OpenMP + CUDA
// example; devCount, initDevice(), kernel, and processor_name are
// defined earlier in the full source (see URL below).

    // Run one OpenMP thread per device per MPI node
    #pragma omp parallel num_threads(devCount)
    {
        if (initDevice()) {
            // Block and grid dimensions
            dim3 dimBlock(12, 12);
            kernel<<<1, dimBlock>>>();
            cudaThreadExit();
        } else {
            printf("Device error on %s\n", processor_name);
        }
    }
    MPI_Finalize();
    return 0;
}
www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf
Exascale Programming Models
Programming models are the biggest "worry factor" for application developers
The MPI-everywhere model is no longer viable; hybrid MPI+OpenMP is already in use in HPC (a minimal sketch follows)
Need to explore new approaches and adapt existing APIs
Exascale models and their implementations must take account of: scale and levels of parallelism; potential coherency domains and heterogeneity; the need to reduce power consumption; resource allocation and management; legacy code, libraries, and interoperability; resilience in algorithms and alternatives to checkpointing
Work on programming models must begin now!
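For orientation (not from the slides), the hybrid style mentioned above pairs MPI for communication between nodes with OpenMP threads within each node; a minimal sketch of how the two are typically initialized together:

// Hybrid MPI+OpenMP: MPI ranks across nodes, OpenMP threads within a node.
// Build with an MPI compiler wrapper and OpenMP enabled, e.g. mpicc -fopenmp.
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided, rank;

    // MPI_THREAD_FUNNELED: only the master thread will make MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d: thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}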
Programming Model: Major Requirements
Portable expression of scalable parallelism, across exascale platforms and intermediate systems
Uniformity: one model across a variety of resources; across the node, across the machine?
Locality: for performance and power, at all levels of the hierarchy
Asynchrony: minimize delays, but trade off against locality
[Diagram: a node containing generic cores and specialized cores, linked by control and data transfers]
Delivering The Programming Model
Between nodes, MPI with enhancements might work, though it needs more help for fault tolerance and improved scalability
Within nodes there are too many models and no complete solution: OpenMP, PGAS, CUDA, and OpenCL are all potential starting points
In a layered system, migrate code to the most appropriate level and develop at the most suitable level, giving an incremental path for existing codes
Timing of programming model delivery is critical: it must be in place when the machines arrive, and earlier for development of new codes
A Layered Programming Approach
Applications in computational science, data informatics, and information technology map onto a layered stack over heterogeneous hardware:
Very-high level: efficient, deterministic, declarative languages with restricted expressiveness (a DSL, maybe?)
High level: parallel programming languages (OpenMP, PGAS, APGAS)
Low level: low-level APIs (MPI, pthreads, OpenCL, Verilog)
Very low level: machine code, assembly
OpenMP Evolution Toward Exascale
The OpenMP language committee is actively working toward the expression of locality and heterogeneity, and on improving the task model to enhance asynchrony. Open questions (a sketch of where this led follows):
How to identify code that should run on a certain kind of core?
How to share data between host cores and other devices?
How to minimize data motion?
How to support a diversity of cores?
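For orientation beyond the talk itself: the first three questions were eventually addressed by the target and map constructs standardized in OpenMP 4.0. A minimal sketch of offloading with explicit data motion:

// OpenMP 4.0 target offload: the target region runs on an attached
// device; map(to:) copies inputs to the device, map(from:) copies
// the result back, so the programmer states data motion explicitly.
void vadd(const double *a, const double *b, double *c, int n) {
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}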
[Diagram repeated from above: generic cores and specialized cores linked by control and data transfers]
OpenMP 4.0 Attempts To Target a Range of Acceleration Configurations
Dedicated hardware for specific function(s), attached to a master processor
Multiple types or levels of parallelism: process level, thread level, ILP/SIMD (see the sketch after the diagram below)
May not support a full C/C++ or Fortran compiler; may lack a stack or interrupts, may limit control flow and types
[Diagram of configurations: a master processor with banks of DSPs; a master with an accelerator using a nonstandard programming model; a master with a massively parallel accelerator; a master with attached ACC units]
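The levels of parallelism listed above can be stacked in a single OpenMP 4.0 combined construct; an illustrative sketch (not from the slides):

// One combined construct spanning three levels of parallelism:
//   teams        -> coarse-grained groups (e.g. accelerator blocks)
//   parallel for -> threads within each team
//   simd         -> vector lanes (ILP/SIMD) within each thread
void scale(double *x, int n, double s) {
    #pragma omp target teams distribute parallel for simd map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}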
Increase Locality, Reduce Power
void foo(double A[], double B[], double C[], int nrows, int ncols) {
  #pragma omp data_region acc_copyout(C), host_shared(A,B)
  {
    #pragma omp acc_region
    for (int i = 0; i < nrows; ++i)
      for (int j = 0; j < ncols; j += NLANES)
        for (int k = 0; k < NLANES; ++k) {
          int index = (i * ncols) + j + k;
          C[index] = A[index] + B[index];
        }
    // end accelerator region
    print2d(A, nrows, ncols);
    print2d(B, nrows, ncols);
    Transpose(C);   // calls function w/ another accelerator construct
  } // end data_region
  print2d(C, nrows, ncols);
}

void Transpose(double X[], int nrows, int ncols) {
  #pragma omp acc_region acc_copy(X), acc_present(X)
  { … }
}
OpenMP Locality Research
Locations := affinity regions: coordinate data layout and work
A collection of locations represents the execution environment
Map data and threads to a location; distribute data across locations
Align computations with their data's location, or map them
Location is inherited unless a task is explicitly migrated
Research on Locality in OpenMP
Implementation in the OpenUH compiler shows good performance:
– Eliminates unnecessary thread migrations;
– Maximizes locality of data operations;
– Affinity support for heterogeneous systems.

int main(int argc, char *argv[]) {
    double A[N];
    …
    #pragma omp distribute(BLOCK: A) location(0:1)
    …
    #pragma omp parallel for OnLoc(A[i])
    for (i = 0; i < N; i++) {
        foo(A[i]);
    }
    …
}
Enabling Asynchronous Computation
Directed acyclic graph (DAG): each node represents a task and each edge represents an inter-task dependence
A task can begin execution only if all its predecessors have completed execution
Should the user express this directly? The compiler can generate the tasks and the graph (at least partially) to enhance performance
What is the "right" size of task?
OpenMP Research: Enhancing Tasks for Asynchrony
Implicitly create the DAG by specifying the data inputs and outputs of tasks (see the sketch below):
A task that produces data may need to wait for any previous child task that reads or writes the same locations
A task that consumes data may need to wait for any previous child task that writes the same locations
Task weights (priorities); groups of tasks and synchronization on groups
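This idea later appeared in standard OpenMP as the depend clause (OpenMP 4.0); a minimal sketch of how declared inputs and outputs imply the task graph, with no explicit synchronization between the tasks:

#include <stdio.h>

int main(void) {
    double a = 0.0, b = 0.0, c = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)   // producer of a
        a = 1.0;

        #pragma omp task depend(out: b)   // producer of b
        b = 2.0;

        // Consumer: the runtime derives DAG edges from both
        // producers to this task; no taskwait is needed.
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;
    }   // implicit barrier: all tasks have completed here
    printf("c = %f\n", c);
    return 0;
}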
Asynchronous OpenMP Execution
T.-H. Weng, B. Chapman: Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution. Proc. HIPS-7, 2002
OpenUH “All Task” Execution Model
Replace fork-join with a dataflow execution model: reduce synchronization, increase concurrency
The compiler transforms OpenMP code into a collection of tasks plus a task graph representing the dependences between them
Array region analysis derives the read/write requests for the data each task needs
Map tasks to compute resources through task characterization using cost modeling
Enable locality-aware scheduling and load balancing (a hand-written illustration follows)
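Purely as illustration (hand-written, not compiler output), the kind of decomposition described above turns a fork-join pipeline into chunked tasks whose declared array regions let each consumer start as soon as its own chunk is ready:

#define N     1024
#define CHUNK 256

// Two pipeline stages over one chunk (bodies are placeholders).
static void stage1(double *x, int n) { for (int k = 0; k < n; ++k) x[k] = k; }
static void stage2(double *x, int n) { for (int k = 0; k < n; ++k) x[k] *= 2.0; }

void pipeline(double A[N]) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i += CHUNK) {
        // The first element of each chunk stands in for the whole
        // array region the task reads or writes.
        #pragma omp task depend(out: A[i])
        stage1(&A[i], CHUNK);

        #pragma omp task depend(in: A[i])
        stage2(&A[i], CHUNK);   // runs once its chunk is produced
    }
    // stage2 on chunk 0 may overlap stage1 on chunk 1: dataflow, not fork-join
}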
Runtime Is Critical for Performance
Runtime support must:
Adapt the workload and data to the environment
Respond, dynamically and continuously, to changes caused by application characteristics, power, faults, and system noise
Provide feedback on application behavior
Provide uniform support for multiple programming models on heterogeneous platforms
Facilitate interoperability and (dynamic) mapping
Lessons Learned from Prior Work
Only a tight integration of application-provided metadata and an architecture description lets the runtime system make appropriate decisions
Good analysis of thread affinity and data locality
Task reordering; hardware-aware selective information gathering
[Diagram: the OpenUH compiler, built on the Open64 compiler infrastructure. Source code with OpenMP directives passes through the frontends (C/C++, Fortran 90, OpenMP), OMP_PRELOWER (OpenMP preprocessing), LOWER_MP (OpenMP transformation), IPA (interprocedural analyzer), LNO (loop nest optimizer), and WOPT (global scalar optimizer); code is emitted either by CG (Itanium, Opteron, Pentium) or via WHIRL2C/WHIRL2F (IR-to-source option) and a native compiler, and object files are linked into executables with runtime library calls to the OpenMP runtime library. The runtime library supplies runtime and static feedback information, enabling dynamic feedback for compiler optimizations and reporting free compute resources.]
Compiler’s Runtime Must Adapt
Light-weight performance data collection
Dynamic optimization
Interoperation with external tools and schedulers
Needs very low-level interfaces to facilitate its implementation (a hypothetical sketch follows this slide)
[Diagram: a collector tool registers events with the OpenMP runtime library and receives event callbacks from the running OpenMP application]
The Multicore Association is defining such interfaces
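To make the register/callback pattern in the diagram concrete, here is a hypothetical sketch; every name and signature below is illustrative, not an actual collector or Multicore Association API:

// Hypothetical collector-style tool interface (illustrative only).
typedef enum {
    EV_PARALLEL_BEGIN,
    EV_PARALLEL_END,
    EV_TASK_BEGIN
} rt_event_t;

typedef void (*rt_event_cb)(rt_event_t ev, void *event_data);

// Stub standing in for the runtime's registration entry point.
static int rt_register_event(rt_event_t ev, rt_event_cb cb) {
    (void)ev; (void)cb;   // a real runtime would record the callback
    return 0;
}

// Tool-side callback: must stay light-weight (timestamp, counter, sample).
static void on_event(rt_event_t ev, void *event_data) {
    (void)ev; (void)event_data;
}

void tool_init(void) {
    rt_register_event(EV_PARALLEL_BEGIN, on_event);
    rt_register_event(EV_PARALLEL_END, on_event);
}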
Interactions Across System Stack
More interactions are needed to share information, both to support application development and tuning and to increase execution efficiency
The runtime needs architectural information and smart monitoring
The application developer needs feedback, and so does the dynamic optimizer
[Diagram: compiler-assisted instrumentation workflow: IPA performs inlining analysis and selective instrumentation, an instrumentation phase drives source-to-source transformations, and optimization logs are produced]
Oscar Hernandez, Haoqiang Jin, Barbara Chapman. Compiler Support for Efficient Instrumentation. In Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 661-668, 2007.
Automating the Tuning Process
K. Huck, O. Hernandez, V. Bui, S. Chandrasekaran, B. Chapman, A. Malony, L. McInnes, B. Norris. Capturing Performance Knowledge for Automated Analysis. Supercomputing 2008
Code Migration Tools
From structural information to sophisticated adaptation support
Changes in data structures, large-grain and fine-grained control flow
Variety of transformations for kernel granularity and memory optimization
[Figure legend: red = high similarity, blue = low similarity]
Summary
Projected exascale hardware requires us to rethink the programming model and its execution support
Intra-node concurrency is fine-grained and heterogeneous
Memory is scarce and power is expensive; data locality has never been so important
Opportunity for new programming models
The runtime continues to grow in importance for ensuring good performance
Need to migrate large apps: where are the tools?