LLNL-PRES-653431 This work was performed under the auspices of the U.S. Department of Energy by...

LLNL-PRES-653431

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC

Moving Complex Apps To Take Advantage of Complex Hardware

Salishan

Ian Karlin

4/24/2014


ASC Codes Last Many HW Generations

Tuning large complex applications for each hardware generation is impractical
• Performance
• Productivity
• Code Base Size

Solutions must be general, adaptable to the future and maintainable


Power Efficiency is Driving Computer Designs

What drives these designs?

Handheld mobile device weight and battery life

Exascale power goals

Power cost

Lower power reduces performance and reliability

Power vs. frequency for Intel Ivy Bridge


Reliability of Systems Will Decrease

Chips operating near threshold voltage encounter
• More transient errors
• More hard errors

Checkpoint restart is our current reliability mechanism


Advancing Capability at Reduced Power Requires More Complexity

Complex power saving features

SIMD and SIMT

Multi-Level memory systems

Heterogeneous systems

[Figure: node architecture diagram with labels Multi-Core CPU, GPU, Processing, In-Package Memory, Memory, NVRAM]

Exploiting these features is difficult


Currently, We Do Not Use These Features

No production GPU or Xeon Phi code
• GPU and Xeon Phi optimizations are different

No production codes explicitly manage on-node data motion

Less than 10% of our FLOPs use SIMD units, even with the best compilers
• Architecture dependent data layouts may hinder the compiler

Mechanisms are needed to isolate architecture specific code


What Would Continuing on Today’s Path Look Like?

We add directives to existing codes where portable

Multi-level memory handled by OS, runtime or used as a cache

We continue to get little SIMD and probably a bit better SIMT parallelism

Overall performance improvement is incremental at best


Can We Get Today’s Codes Where We Need To Be Tomorrow?

Are our algorithms well suited for future machines?

Can we rewrite our data structures to match future machines?

We will address these questions in the next few slides


We Can Manage Locality and Reduce Data Motion

Loop fusion
• Make each operator a single sweep over a mesh

Data structure reorganization

Reduce mallocs or use better libraries

[Figure: LULESH BG/Q]

However, better implementations only get us 2-3x
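As an illustration of the loop fusion bullet, here is a minimal sketch. It is not taken from LULESH; the function names, array names and the two toy operators are hypothetical. The unfused version sweeps the mesh twice and writes a temporary array to memory between the sweeps. The fused version makes one sweep and keeps the intermediate value in a register, which removes one array's worth of memory traffic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two sweeps over the mesh. The temporary array `work` is written
// to memory by the first sweep and read back by the second.
std::vector<double> update_unfused(const std::vector<double>& pressure,
                                   const std::vector<double>& volume) {
    assert(pressure.size() == volume.size());
    const std::size_t n = pressure.size();
    std::vector<double> work(n), energy(n);
    for (std::size_t i = 0; i < n; ++i) {
        work[i] = pressure[i] * volume[i];   // sweep 1
    }
    for (std::size_t i = 0; i < n; ++i) {
        energy[i] = 0.5 * work[i] + 1.0;     // sweep 2
    }
    return energy;
}

// Fused: a single sweep. The intermediate value never leaves a register,
// and the temporary array is not allocated at all.
std::vector<double> update_fused(const std::vector<double>& pressure,
                                 const std::vector<double>& volume) {
    assert(pressure.size() == volume.size());
    const std::size_t n = pressure.size();
    std::vector<double> energy(n);
    for (std::size_t i = 0; i < n; ++i) {
        const double work = pressure[i] * volume[i];
        energy[i] = 0.5 * work + 1.0;
    }
    return energy;
}
```

Both functions perform the same arithmetic in the same order, so they return identical results. Dropping the temporary also removes one allocation, which ties in with the "reduce mallocs" bullet.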


We Can Reduce Serial Sections

Throughput optimized processors execute serial sections slowly

Design codes with limited serial sections

Better runtime support is needed to reduce serial overhead
• OpenMP
• Malloc Libraries

Use latency optimized processor for what remains
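The cost of serial sections can be made concrete with an Amdahl's law style estimate. This model is not on the slide, and every number below is illustrative only. Assume a fraction $p$ of the work parallelizes perfectly over $N$ throughput optimized cores, and that one such core runs $s$ times slower than a latency optimized core.

```latex
% Time is normalized so that one latency optimized core takes T = 1.
% p: parallel fraction, N: number of throughput cores,
% s: assumed slowdown of one throughput core relative to a latency core.
\begin{align*}
T_{\text{throughput only}} &= s\,(1-p) + \frac{s\,p}{N} \\
T_{\text{hybrid}} &= (1-p) + \frac{s\,p}{N} \\
\intertext{With the illustrative values $p = 0.95$, $N = 100$, $s = 4$:}
T_{\text{throughput only}} &= 0.2 + 0.038 = 0.238 \quad (\text{speedup} \approx 4.2) \\
T_{\text{hybrid}} &= 0.05 + 0.038 = 0.088 \quad (\text{speedup} \approx 11.4)
\end{align*}
```

Under these assumptions the serial 5% dominates the runtime in both cases, which is why both shrinking serial sections and running what remains on a latency optimized processor matter.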


We Can Vectorize Better

More parallelism exists in current algorithms than we exploit today

Code changes are required to express parallelism more clearly

SIMT or SIMD with HW Gather/Scatter are easier to exploit

[Figure: LULESH Sandy Bridge]

Bandwidth constraints will eventually limit us
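One example of a code change that expresses parallelism more clearly is turning a scatter loop into a gather loop. The sketch below uses a hypothetical 1D mesh in which element `e` sits between nodes `e` and `e + 1`; the function names are made up for illustration and this is not code from LULESH. In the scatter form, consecutive iterations write the same node, so the loop carries a dependence. In the gather form, each iteration writes only its own node, so iterations are independent and a compiler (optionally helped by a directive such as OpenMP's `#pragma omp simd`) has a much easier job.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatter form: each element adds half of its force to its two nodes.
// Iterations e and e + 1 both update node e + 1, so the loop carries a
// dependence and cannot be vectorized naively.
std::vector<double> nodal_force_scatter(const std::vector<double>& elem_force) {
    const std::size_t num_elems = elem_force.size();
    assert(num_elems >= 1);
    std::vector<double> node_force(num_elems + 1, 0.0);
    for (std::size_t e = 0; e < num_elems; ++e) {
        node_force[e]     += 0.5 * elem_force[e];
        node_force[e + 1] += 0.5 * elem_force[e];
    }
    return node_force;
}

// Gather form: each node reads from its neighboring elements and writes
// only its own entry, so the interior loop has independent iterations.
std::vector<double> nodal_force_gather(const std::vector<double>& elem_force) {
    const std::size_t num_elems = elem_force.size();
    assert(num_elems >= 1);
    std::vector<double> node_force(num_elems + 1, 0.0);
    // Boundary nodes touch a single element each.
    node_force[0] = 0.5 * elem_force[0];
    node_force[num_elems] = 0.5 * elem_force[num_elems - 1];
    // Interior nodes touch two elements; no iteration writes another's entry.
    for (std::size_t n = 1; n < num_elems; ++n) {
        node_force[n] = 0.5 * elem_force[n - 1] + 0.5 * elem_force[n];
    }
    return node_force;
}
```

On an unstructured mesh the gather form needs an indirection array from nodes to elements, which is where hardware gather support helps.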


However, There Are Fundamental Data Motion Requirements

Many of today’s apps need 0.5-2 bytes for every FLOP performed.


Future Machines Can Not Move Data Fast Enough For Current Algorithms to Exploit All Resources

Excess FLOPs
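To make the bytes-per-FLOP requirement concrete, the sketch below applies a simple bandwidth bound: if an application must move `bytes_per_flop` bytes for every FLOP, memory bandwidth alone caps it at `bandwidth / bytes_per_flop` FLOP/s. The function names are hypothetical, and the machine numbers in the usage note are invented to make the arithmetic easy to follow; they do not describe any real system.

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on the sustained FLOP rate (GFLOP/s) for an application that
// needs `bytes_per_flop` bytes of memory traffic per FLOP, on a machine with
// the given peak FLOP rate (GFLOP/s) and memory bandwidth (GB/s).
double attainable_gflops(double peak_gflops,
                         double bandwidth_gbytes_per_s,
                         double bytes_per_flop) {
    assert(bytes_per_flop > 0.0);
    const double bandwidth_bound = bandwidth_gbytes_per_s / bytes_per_flop;
    return std::min(peak_gflops, bandwidth_bound);
}

// Fraction of the peak FLOP rate that the application cannot use because
// it is waiting on memory (the "excess FLOPs").
double excess_flops_fraction(double peak_gflops,
                             double bandwidth_gbytes_per_s,
                             double bytes_per_flop) {
    assert(peak_gflops > 0.0);
    const double usable =
        attainable_gflops(peak_gflops, bandwidth_gbytes_per_s, bytes_per_flop);
    return 1.0 - usable / peak_gflops;
}
```

For a hypothetical node with 1000 GFLOP/s peak and 100 GB/s of bandwidth, an application at 0.5 bytes per FLOP is capped at 200 GFLOP/s and one at 2 bytes per FLOP at 50 GFLOP/s, leaving roughly 80% and 95% of the FLOPs unused.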


High-order Algorithms Can Bridge the Gap

More FLOPs per byte

Small dense operations

More accurate

Potentially more robust and better symmetry preservation

[Figure: B to F Requirement vs. Algorithmic Order]
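One way to see why small dense operations lower the bytes-per-FLOP requirement is to compare a streaming vector update with a dense matrix product. This is a generic illustration, not the analysis behind the B to F figure. It assumes double precision (8 bytes per value) and that each operand is moved to or from memory exactly once.

```latex
% Vector update y <- a*x + y on vectors of length n:
% 2n FLOPs; x read, y read, y written.
\[
\frac{\text{bytes}}{\text{FLOP}} = \frac{3 \cdot 8\,n}{2\,n} = 12
\quad \text{(independent of } n\text{)}
\]
% Dense product C = A B with n x n matrices:
% about 2n^3 FLOPs; A read, B read, C written.
\[
\frac{\text{bytes}}{\text{FLOP}} \approx \frac{3 \cdot 8\,n^2}{2\,n^3} = \frac{12}{n}
\]
% Examples: n = 8 gives 1.5 bytes per FLOP, n = 24 gives 0.5 bytes per FLOP.
```

Under these assumptions the dense operation's requirement falls as the block size grows, while the streaming update stays fixed.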


They Present New Questions

How do you use the FLOPs efficiently?

What does high-order accuracy mean when there is a shock?

Can you couple all the physics we need at high order?

We are working to answer these, but whether we use new algorithms or our current ones, there is a pervasive challenge…


How to Retarget Large Applications in a Manageable Way?


Are Optimizations Portable Across Architectures?

Mechanisms are needed to isolate non-portable optimizations
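A minimal sketch of one such mechanism follows: the loop body is written once as a lambda, and the traversal order is selected by a policy tag. All names here (`sequential_policy`, `blocked_policy`, `for_each_index`, `scale`) are hypothetical and chosen for illustration. This is not the API of RAJA, Kokkos or Thrust, only the general idea of separating what a loop computes from how it is executed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical execution policy tags. A real abstraction layer would
// define its own set (for example for threads, SIMD or GPUs).
struct sequential_policy {};
struct blocked_policy { std::size_t block_size; };

// Plain traversal of [begin, end).
template <typename Body>
void for_each_index(sequential_policy, std::size_t begin, std::size_t end,
                    Body body) {
    for (std::size_t i = begin; i < end; ++i) {
        body(i);
    }
}

// Traversal in fixed size blocks, standing in for an architecture
// specific schedule that lives outside the application code.
template <typename Body>
void for_each_index(blocked_policy policy, std::size_t begin, std::size_t end,
                    Body body) {
    assert(policy.block_size > 0);
    for (std::size_t block = begin; block < end; block += policy.block_size) {
        const std::size_t stop = std::min(block + policy.block_size, end);
        for (std::size_t i = block; i < stop; ++i) {
            body(i);
        }
    }
}

// Application code: the loop body is written once and does not change
// when the policy does.
template <typename Policy>
void scale(Policy policy, std::vector<double>& x, double a) {
    for_each_index(policy, 0, x.size(), [&](std::size_t i) { x[i] *= a; });
}
```

Calling `scale(sequential_policy{}, x, 2.0)` and `scale(blocked_policy{2}, x, 2.0)` gives the same result; only the traversal differs, so retargeting means changing the policy rather than the physics code.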


You Saw One Approach Earlier This Week

RAJA, Kokkos and Thrust allow portable abstractions in today’s codes

There Are Other Attractive Research Approaches For The Future

Charm++

Liszt


Ultimately We Need to Make Performance a First Class Citizen

[Figure: diagram with labels Algorithms, Programming Models, Architectures, RAJA, Today’s, High Order]
