LLNL-PRES-653431 This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC Moving Complex Apps To Take Advantage of Complex Hardware Salishan Ian Karlin 4/24/2014


Page 1:

Moving Complex Apps To Take Advantage of Complex Hardware

Salishan

Ian Karlin

4/24/2014

Page 2:

ASC Codes Last Many HW Generations

Tuning large complex applications for each hardware generation is impractical
• Performance
• Productivity
• Code Base Size

Solutions must be general, adaptable to the future and maintainable

Page 3:

Power Efficiency is Driving Computer Designs

What drives these designs?

Handheld mobile device weight and battery life

Exascale power goals

Power cost

Lower power reduces performance and reliability

[Figure: Power vs. frequency for Intel Ivy Bridge]

Page 4:

Reliability of Systems Will Decrease

Chips operating near threshold voltage encounter
• More transient errors
• More hard errors

Checkpoint restart is our current reliability mechanism

Page 5:

Advancing Capability at Reduced Power Requires More Complexity

Complex power saving features

SIMD and SIMT

Multi-Level memory systems

Heterogeneous systems

[Diagram labels: Multi-Core CPU, GPU, Processing, Memory, In-Package Memory, NVRAM]

Exploiting these features is difficult

Page 6:

Currently, We Do Not Use These Features

No production GPU or Xeon Phi code
• GPU and Xeon Phi optimizations are different

No production codes explicitly manage on-node data motion

Less than 10% of our FLOPs use SIMD units, even with the best compilers
• Architecture dependent data layouts may hinder the compiler

Mechanisms are needed to isolate architecture specific code

Page 7:

What Would Continuing on Today’s Path Look Like?

We add directives to existing codes where portable

Multi-level memory handled by OS, runtime or used as a cache

We continue to get little SIMD and probably a bit better SIMT parallelism

Overall performance improvement is incremental at best

Page 8:

Can We Get Today’s Codes Where We Need To Be Tomorrow?

Are our algorithms well suited for future machines?

Can we rewrite our data structures to match future machines?

We will address these questions in the next few slides

Page 9:

We Can Manage Locality and Reduce Data Motion

Loop fusion
• Make each operator a single sweep over a mesh

Data structure reorganization

Reduce mallocs or use better libraries

[Figure: LULESH BG/Q]

However, better implementations only get us 2-3x
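The loop fusion and malloc-reduction bullets on this slide can be illustrated with a minimal sketch. It is not taken from LULESH or any production code: the function names, the force/mass/velocity update and the array names are assumptions chosen for illustration. The unfused version sweeps the mesh twice and allocates a temporary array on every call; the fused version makes a single sweep and keeps the intermediate value in a local variable.

```cpp
#include <cstddef>
#include <vector>

// Unfused (hypothetical example): two sweeps over the mesh. The temporary
// `accel` array is allocated on every call, written in the first sweep and
// read back in the second, which adds memory traffic.
void update_unfused(const std::vector<double>& force,
                    const std::vector<double>& mass,
                    std::vector<double>& velocity, double dt) {
    const std::size_t n = velocity.size();
    std::vector<double> accel(n);
    for (std::size_t i = 0; i < n; ++i)
        accel[i] = force[i] / mass[i];
    for (std::size_t i = 0; i < n; ++i)
        velocity[i] += dt * accel[i];
}

// Fused: one sweep, no temporary array. The intermediate acceleration
// lives in a local variable instead of going through memory.
void update_fused(const std::vector<double>& force,
                  const std::vector<double>& mass,
                  std::vector<double>& velocity, double dt) {
    const std::size_t n = velocity.size();
    for (std::size_t i = 0; i < n; ++i) {
        const double accel = force[i] / mass[i];
        velocity[i] += dt * accel;
    }
}
```

Both functions compute the same update; only the number of passes over memory and the per-call allocation differ.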

Page 10:

We Can Reduce Serial Sections

Throughput optimized processors execute serial sections slowly

Design codes with limited serial sections

Better runtime support is needed to reduce serial overhead
• OpenMP
• Malloc Libraries

Use latency optimized processor for what remains
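Why serial sections matter so much on throughput optimized processors can be seen with a small Amdahl's-law style model. This is a sketch under stated assumptions, not a measurement: the function name `run_time`, the `serial_slowdown` parameter and the assumption that parallel work scales perfectly across workers are all introduced here for illustration.

```cpp
// Hypothetical model: run time normalized so that the whole program takes
// 1.0 on a single latency optimized core.
//   serial_fraction : share of the work that cannot be parallelized
//   workers         : number of cores sharing the parallel part
//   serial_slowdown : how many times slower the serial part runs on a
//                     throughput optimized core (1.0 = latency optimized)
// Assumes the parallel part scales perfectly, which real codes do not.
double run_time(double serial_fraction, double workers, double serial_slowdown) {
    const double serial_time = serial_fraction * serial_slowdown;
    const double parallel_time = (1.0 - serial_fraction) / workers;
    return serial_time + parallel_time;
}
```

In this model the serial term does not shrink as workers are added, so running what remains of the serial work on a latency optimized core (a `serial_slowdown` of 1.0) lowers the floor on run time.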

Page 11:

We Can Vectorize Better

More parallelism exists in current algorithms than we exploit today

Code changes are required to express parallelism more clearly

SIMT or SIMD with HW Gather/Scatter are easier to exploit

[Figure: LULESH Sandy Bridge]

Bandwidth constraints will eventually limit us
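One common kind of code change that expresses data parallelism more clearly is moving from an array-of-structures layout to a structure-of-arrays layout. The sketch below is illustrative only: the `NodeAoS` and `NodesSoA` types and the position update are assumptions, not code from the talk, and whether a given compiler vectorizes either loop depends on the compiler and its flags.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures (hypothetical type): the x values of consecutive
// nodes sit 32 bytes apart, so a SIMD unit needs strided or gathered loads.
struct NodeAoS { double x, y, z, mass; };

void advance_x_aos(std::vector<NodeAoS>& nodes,
                   const std::vector<double>& vx, double dt) {
    for (std::size_t i = 0; i < nodes.size(); ++i)
        nodes[i].x += dt * vx[i];
}

// Structure-of-arrays (hypothetical type): each field is contiguous, so the
// loop below reads and writes unit-stride data.
struct NodesSoA { std::vector<double> x, y, z, mass; };

void advance_x_soa(NodesSoA& nodes,
                   const std::vector<double>& vx, double dt) {
    double* x = nodes.x.data();
    const double* v = vx.data();
    const std::size_t n = nodes.x.size();
    for (std::size_t i = 0; i < n; ++i)
        x[i] += dt * v[i];
}
```

Hardware gather/scatter makes the first form easier to exploit than it would otherwise be, but the second form still moves fewer unused bytes per update.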

Page 12:

However, There Are Fundamental Data Motion Requirements

Many of today’s apps need 0.5-2 bytes for every FLOP performed.
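The consequence of a fixed bytes-per-FLOP requirement can be sketched with a simple bandwidth-bound estimate. The function name and the machine ratios used with it are hypothetical; the model assumes performance is limited only by memory bandwidth and ignores caches, latency and overlap.

```cpp
#include <algorithm>

// Fraction of peak FLOP rate an application could sustain if memory
// bandwidth were the only limit (a simplifying assumption).
//   machine_bytes_per_flop : bandwidth / peak FLOP rate of the machine
//   app_bytes_per_flop     : bytes the application must move per FLOP
double attainable_fraction(double machine_bytes_per_flop,
                           double app_bytes_per_flop) {
    return std::min(1.0, machine_bytes_per_flop / app_bytes_per_flop);
}
```

For example, an application needing 1 byte per FLOP on a hypothetical machine that supplies 0.25 bytes per FLOP could use at most a quarter of the peak FLOP rate under this model.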

Page 13:

Future Machines Can Not Move Data Fast Enough For Current Algorithms to Exploit All Resources

Excess FLOPs

Page 14:

High-order Algorithms Can Bridge the Gap

More FLOPs per byte

Small dense operations

More accurate

Potentially more robust and better symmetry preservation

[Figure: B to F Requirement vs. Algorithmic Order]
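A toy model shows why small dense operations lower the bytes-per-FLOP requirement as the order grows. It is not the data behind the figure on this slide. It assumes a one-dimensional element with order + 1 degrees of freedom, double precision, and a single dense operator matrix that is shared by all elements and stays in cache, so only the per-element input and output vectors move through memory. The function name is hypothetical.

```cpp
// Toy model (assumptions in the text above): bytes moved per FLOP when a
// cached dense n x n operator is applied to each element's n values.
double bytes_per_flop(int order) {
    const double n = order + 1.0;        // degrees of freedom per element
    const double bytes = 2.0 * n * 8.0;  // read n doubles, write n doubles
    const double flops = 2.0 * n * n;    // dense matrix-vector product
    return bytes / flops;                // equals 8 / n
}
```

Under these assumptions the requirement falls from 4 bytes per FLOP at order 1 to 1 byte per FLOP at order 7.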

Page 15:

They Present New Questions

How do you use the FLOPs efficiently?

What does high-order accuracy mean when there is a shock?

Can you couple all the physics we need at high-order?

We are working to answer these, but whether we use new algorithms or our current ones there is a pervasive challenge…

Page 16:

How to Retarget Large Applications in a Manageable Way?

Page 17:

Are Optimizations Portable Across Architectures?

Mechanisms are needed to isolate non-portable optimizations

Page 18:

You Saw One Approach Earlier This Week

RAJA, Kokkos and Thrust allow portable abstractions in today’s codes

There Are Other Attractive Research Approaches For The Future

Charm++

Liszt
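The general idea behind the portable loop abstractions named on this slide can be sketched in plain C++. This is not the API of RAJA, Kokkos or Thrust: `forall`, `sequential_policy`, `tiled_policy` and `scale` are hypothetical names invented for this illustration. The loop body is written once, and a policy type selects how the iteration space is traversed, so architecture specific choices stay out of the application code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical execution policies; a real library would also offer
// threaded, SIMD or GPU variants.
struct sequential_policy {};
struct tiled_policy { std::size_t tile = 4; };

// Plain in-order traversal.
template <typename Body>
void forall(sequential_policy, std::size_t begin, std::size_t end, Body body) {
    for (std::size_t i = begin; i < end; ++i) body(i);
}

// Same iteration space, visited tile by tile.
template <typename Body>
void forall(tiled_policy p, std::size_t begin, std::size_t end, Body body) {
    for (std::size_t t = begin; t < end; t += p.tile) {
        const std::size_t stop = std::min(t + p.tile, end);
        for (std::size_t i = t; i < stop; ++i) body(i);
    }
}

// Application code: the loop body does not change when the policy does.
template <typename Policy>
void scale(Policy policy, std::vector<double>& v, double s) {
    double* data = v.data();
    forall(policy, 0, v.size(), [=](std::size_t i) { data[i] *= s; });
}
```

Retargeting then means changing the policy argument, for example from `sequential_policy{}` to `tiled_policy{}`, rather than rewriting each loop.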

Page 19:

Ultimately We Need to Make Performance a First Class Citizen

[Diagram labels: Algorithms, Programming Models, Architectures, RAJA, Today’s, High Order]