LLNL-PRES-653431
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Moving Complex Apps To Take Advantage of Complex Hardware
Salishan
Ian Karlin
4/24/2014
ASC Codes Last Many HW Generations
Tuning large complex applications for each hardware generation is impractical:
• Performance
• Productivity
• Code base size
Solutions must be general, adaptable to the future, and maintainable
Power Efficiency is Driving Computer Designs
What drives these designs?
• Handheld mobile device weight and battery life
• Exascale power goals
• Power cost
Lower power reduces performance and reliability
[Chart: Power vs. frequency for Intel Ivy Bridge]
Reliability of Systems Will Decrease
Chips operating near threshold voltage encounter:
• More transient errors
• More hard errors
Checkpoint restart is our current reliability mechanism
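The checkpoint/restart mechanism can be sketched in a few lines. This is a toy illustration, not any production code's scheme: the file name, the `State` layout, and the checkpoint interval are all hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical simulation state; real applications checkpoint far more data.
struct State { int step = 0; double value = 0.0; };

// Try to restore state from a prior checkpoint; returns false on a first run.
inline bool restore(const std::string& path, State& s) {
    std::ifstream in(path, std::ios::binary);
    return bool(in.read(reinterpret_cast<char*>(&s), sizeof s));
}

// Persist the current state so a failed run can resume from here.
inline void checkpoint(const std::string& path, const State& s) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(&s), sizeof s);
}

// Run (or resume) a toy timestep loop, checkpointing every `interval` steps.
inline State run(const std::string& path, int total_steps, int interval) {
    State s;
    restore(path, s);                      // no-op if no checkpoint exists
    while (s.step < total_steps) {
        s.value += 0.5;                    // stand-in for real physics work
        ++s.step;
        if (s.step % interval == 0) checkpoint(path, s);
    }
    return s;
}
```

The cost of this mechanism grows with state size and failure rate, which is why decreasing reliability puts pressure on it.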
Advancing Capability at Reduced Power Requires More Complexity
• Complex power-saving features
• SIMD and SIMT
• Multi-level memory systems
• Heterogeneous systems
[Diagram: heterogeneous node with a multi-core CPU and a GPU, each with in-package memory, backed by off-package memory and NVRAM]
Exploiting these features is difficult
Currently, We Do Not Use These Features
• No production GPU or Xeon Phi code; GPU and Xeon Phi optimizations are different
• No production codes explicitly manage on-node data motion
• Less than 10% of our FLOPs use SIMD units, even with the best compilers; architecture-dependent data layouts may hinder the compiler
Mechanisms are needed to isolate architecture-specific code
What Would Continuing on Today’s Path Look Like?
• We add directives to existing codes where portable
• Multi-level memory is handled by the OS or runtime, or used as a cache
• We continue to get little SIMD and probably somewhat better SIMT parallelism
Overall performance improvement is incremental at best
Can We Get Today’s Codes Where We Need To Be Tomorrow?
• Are our algorithms well suited for future machines?
• Can we rewrite our data structures to match future machines?
We will address these questions in the next few slides
We Can Manage Locality and Reduce Data Motion
• Loop fusion: make each operator a single sweep over a mesh
• Data structure reorganization
• Reduce mallocs or use better libraries
[Chart: LULESH on BG/Q]
However, better implementations only get us 2-3x
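The loop-fusion idea can be illustrated with a toy pair of operators. The kernels below are stand-ins, not LULESH's actual sweeps: fusing them means the intermediate value stays in a register instead of streaming through memory twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two full passes over the mesh arrays; the temporary array b
// is written out by operator 1 and read back by operator 2.
inline std::vector<double> unfused(const std::vector<double>& a, double dt) {
    std::vector<double> b(a.size()), c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) b[i] = 2.0 * a[i]; // operator 1
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = b[i] * dt;  // operator 2
    return c;
}

// Fused: one sweep; the intermediate never touches main memory.
inline std::vector<double> fused(const std::vector<double>& a, double dt) {
    std::vector<double> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = (2.0 * a[i]) * dt;
    return c;
}
```

The fused version does the same arithmetic but roughly halves the memory traffic, which is what matters on bandwidth-bound hardware.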
We Can Reduce Serial Sections
• Throughput-optimized processors execute serial sections slowly
• Design codes with limited serial sections
• Better runtime support is needed to reduce serial overhead: OpenMP, malloc libraries
Use a latency-optimized processor for what remains
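A quick Amdahl's-law calculation shows why serial sections matter so much at scale. The core count and parallel fraction below are illustrative, not measurements from any code.

```cpp
// Amdahl's law: speedup = 1 / ((1 - p) + p / n) for parallel fraction p
// on n cores. Even a 5% serial section caps speedup near 20x regardless
// of how many throughput-optimized cores are available.
inline double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

With p = 0.95 on 1024 cores the speedup is only about 19.6x, which is why shrinking serial sections, and running what remains on a latency-optimized core, both pay off.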
We Can Vectorize Better
• More parallelism exists in current algorithms than we exploit today
• Code changes are required to express parallelism more clearly
• SIMT, or SIMD with HW gather/scatter, are easier to exploit
[Chart: LULESH on Sandy Bridge]
Bandwidth constraints will eventually limit us
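One concrete code change that expresses parallelism more clearly is switching from an array-of-structs layout to a struct-of-arrays layout. The field names below are illustrative; the point is unit-stride access the compiler can map onto SIMD lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct NodeAoS { double x, y, z; };

// Array-of-structs: consecutive x values are 24 bytes apart, forcing
// strided loads that are hard to vectorize without hardware gather.
inline double sum_x_aos(const std::vector<NodeAoS>& nodes) {
    double s = 0.0;
    for (std::size_t i = 0; i < nodes.size(); ++i) s += nodes[i].x;
    return s;
}

struct NodesSoA { std::vector<double> x, y, z; };

// Struct-of-arrays: x is contiguous, giving the compiler unit-stride
// loads it can turn into SIMD instructions.
inline double sum_x_soa(const NodesSoA& nodes) {
    double s = 0.0;
    for (std::size_t i = 0; i < nodes.x.size(); ++i) s += nodes.x[i];
    return s;
}
```

The two versions compute the same result; only the data layout, and hence the generated vector code, differs. This is also an example of a layout choice that is architecture dependent, which is why it wants to live behind an abstraction.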
However, There Are Fundamental Data Motion Requirements
Many of today’s apps need 0.5-2 bytes for every FLOP performed.
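A kernel's bytes-per-FLOP requirement can be estimated by hand. The sketch below works it for a triad-style update `a[i] = b[i] + s * c[i]`: 2 FLOPs against 24 bytes of memory traffic (read b and c, write a, 8 bytes each). That raw streaming kernel sits well above the 0.5-2 B/F whole-application average, which reflects cache reuse across kernels; the numbers here are a back-of-envelope illustration.

```cpp
// Arithmetic-intensity bookkeeping: bytes of memory traffic divided by
// floating-point operations. For the triad, bytes_moved = 3 * 8 = 24
// per element and flops = 2 (one multiply, one add).
inline double bytes_per_flop(double bytes_moved, double flops) {
    return bytes_moved / flops;
}
```

A machine's deliverable B/F (memory bandwidth divided by peak FLOP rate) can be compared directly against this number to see whether a kernel will be bandwidth bound.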
Future Machines Cannot Move Data Fast Enough For Current Algorithms to Exploit All Resources
[Chart: excess FLOPs]
High-order Algorithms Can Bridge the Gap
• More FLOPs per byte
• Small dense operations
• More accurate
• Potentially more robust, with better symmetry preservation
[Chart: B to F requirement vs. algorithmic order]
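The "more FLOPs per byte" claim can be made concrete with the small dense operations high-order methods produce. A back-of-envelope for an n-by-n matrix multiply (n here is a stand-in for the dense block size, not a specific element order): 2n^3 FLOPs against 3n^2 * 8 bytes, so the bytes-per-FLOP requirement falls as 12/n.

```cpp
// Bytes-per-FLOP requirement of a dense n x n matrix multiply:
//   FLOPs = 2 n^3 (n^3 multiply-adds)
//   bytes = 3 n^2 * 8 (read A and B, write C, doubles)
// so B/F = 12 / n, dropping as the dense blocks grow.
inline double matmul_bytes_per_flop(double n) {
    double flops = 2.0 * n * n * n;
    double bytes = 3.0 * n * n * 8.0;
    return bytes / flops;
}
```

At n = 8 the requirement is already down to 1.5 bytes per FLOP, and by n = 24 it reaches 0.5, at the low end of what today's applications need, which is how raising the order moves algorithms toward what future machines can actually deliver.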
They Present New Questions
• How do you use the FLOPs efficiently?
• What does high-order accuracy mean when there is a shock?
• Can you couple all the physics we need at high order?
We are working to answer these, but whether we use new algorithms or our current ones, there is a pervasive challenge…
How to Retarget Large Applications in a Manageable Way?
Are Optimizations Portable Across Architectures?
Mechanisms are needed to isolate non-portable optimizations
You Saw One Approach Earlier This Week
RAJA, Kokkos and Thrust allow portable abstractions in today’s codes
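The RAJA/Kokkos style can be reduced to a minimal sketch: loop bodies are written once as lambdas, and a policy type isolates the architecture-specific iteration. This is a heavy simplification of the real libraries' interfaces, with only a hypothetical sequential policy shown; real RAJA and Kokkos supply OpenMP, CUDA, and SIMD backends behind the same pattern.

```cpp
#include <cstddef>

struct seq_exec {};    // hypothetical sequential policy tag
// An OpenMP or GPU policy would add an overload dispatching to that backend.

// The loop body is architecture-neutral; only the policy overload changes
// when the code is retargeted.
template <typename Body>
void forall(seq_exec, std::size_t n, Body body) {
    for (std::size_t i = 0; i < n; ++i) body(i);
}

// Usage sketch: forall(seq_exec{}, n, [&](std::size_t i) { a[i] += b[i]; });
// Retargeting means swapping the policy tag, not rewriting the body.
```

This is exactly the isolation mechanism the earlier slides call for: the non-portable optimization lives in the policy, not in the physics code.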
There Are Other Attractive Research Approaches For The Future
• Charm++
• Liszt
Ultimately We Need to Make Performance a First Class Citizen
[Diagram: algorithms (today’s and high-order), programming models (RAJA), and architectures as coupled design concerns]