What hardware accelerators are you using/evaluating?
Cells in a Roadrunner configuration
◦ 8-way SPE threads w/ local memory; DMA & vector unit programming issues, but tremendous flexibility (a minimal SPE sketch follows this list)
◦ Fast (25.6 GB/s) & large memory (4 GB or larger)
◦ Augmented C language; also C++ & now Fortran; GNU & XL variants; OpenMP is new; OpenCL is being prototyped
◦ Opterons can run the bulk of code not needing acceleration; Cell-only clusters are possible
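To make the SPE programming model above concrete, here is a minimal, illustrative SPE-side sketch, assuming the Cell SDK's spu_mfcio.h / spu_intrinsics.h interfaces; the buffer size, the use of argp as the block's effective address, and the 2x+1 kernel are invented for this example.

/* SPE-side sketch: pull a block of floats from main memory via DMA, scale it
 * with the 4-wide SPU vector unit, and push the result back. */
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

#define N 2048                                 /* floats per block (8 KB) */
static float buf[N] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    unsigned int tag = 0;
    int i;

    /* Blocking DMA in: argp carries the effective address of the block. */
    mfc_get(buf, argp, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* SIMD compute: y = 2*x + 1, four floats per SPU instruction. */
    vector float *v = (vector float *)buf;
    vector float two = spu_splats(2.0f);
    vector float one = spu_splats(1.0f);
    for (i = 0; i < N / 4; i++)
        v[i] = spu_madd(v[i], two, one);

    /* Blocking DMA out, wait for completion, then exit. */
    mfc_put(buf, argp, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    return 0;
}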
What hardware accelerators are you using/evaluating? Several years ago…
◦ GPUs (pre CUDA & Tesla): Brook & Scout (a LANL data-parallel language); no 32-bit at the time; limited memory; everything must be cast as a data-parallel problem; no ECC memory; insufficient parity/ECC protection of data paths and logic (others at LANL are still working in this area, including Tesla & CUDA)
◦ ClearSpeed (several years ago): the earliest ClearSpeeds, before the Advance families; augmented C language; 96 SIMD PEs; everything is done as long SIMD data-parallel work, in sync; low power
◦ FPGAs (HDL, several years ago): programming is hard – very hard; logic space limited the number of 64-bit ops; fast SRAM but small; external DRAM of modest size but no faster than the CPUs; one algorithm at a time, so significant impact to use for multi-physics; low power
Describe the applications that you are porting to accelerators.
◦ MD (materials), laser-plasma PIC, IMC X-ray (particle) transport, GROMACS, n-body universe & galaxies, DNS turbulence & supernovae, HIV genealogy, nanowire long-time-scale MD
◦ Ocean circulation, wildfires, discrete social simulations, clouds & rain, influenza spread, plasma turbulence, plasma sheaths, fluid instabilities
My personal observations:
◦ Particle methods are generally easiest
◦ Codes with good characteristics have: a few computationally intense “algorithms”; pre-existing or obvious “fine-grain” parallel work units; C language versus Fortran or highly OO C++
Describe the kinds of speed-ups you are seeing (provide the basis for the comparison).
◦ 5x to 10x over a single Opteron core for memory-bandwidth-intensive code running at 5%-10% of peak
◦ 10x to 25x on particle methods, searches, etc.
How does it compare to scaling out (i.e., just using more x86 processors)? What are the bottlenecks to further performance improvements?
◦ Scale-out via more sockets is better – BUT: scaling efficiencies are already a problem for several LANL applications running at 4,000 to 10,000 cores, and scaling out LANL-sized machines means $$$ for hardware, space, & power
◦ Scaling out by multi-core is not a clear winner
◦ Memory BW and cache architectures often limit performance, which Cells mostly get around
◦ Memory BW per core is decreasing at an “inverse Moore’s law” rate!
Describe the programming effort required to make use of the accelerator.
◦ ½ to 1 man-year to “convert” a code, mostly dealing with data structures and threaded-parallelism designs
◦ The lack of debugging & similar tools is like the earliest days of parallel computing (LANL was a leader then as well – remember the early PVM Ethernet workstation “carpet” clusters in the mid-80’s, before MPPs)
◦ We like to see 1-2 programming experts (PhD-level or equivalent) assigned to forefront-science code projects, which have 1 to 4+ physics experts (PhD-level)
Amortization
◦ Ready for the future – codes and skilled programmers. We expect our dual-level (MPI + threads) & SIMD-vectorization techniques used for Roadrunner to pay off on future multi-core and many-core chips as well (a sketch of the dual-level pattern follows this list).
◦ It’s not just about running codes this year. Others will have to work through new forms of parallelism soon.
◦ We can do science now that isn’t possible with most other machines
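A minimal sketch of the dual-level (MPI + threads) pattern mentioned above, using OpenMP for the on-node threading; the array sizes and the triad-style kernel are placeholders for this example, not the Roadrunner codes themselves.

/* Dual-level pattern: MPI ranks across nodes, OpenMP threads within each
 * rank, and a loop body a compiler can SIMD-vectorize. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Coarse grain: each rank owns its slice of the problem.
     * Fine grain: threads (and SIMD lanes) split the local loop. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 2.0 * c[i];

    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++)
        local += a[i];

    /* The inter-node step stays in MPI, exactly as in a flat-MPI code. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum over %d ranks = %g\n", nranks, global);

    MPI_Finalize();
    return 0;
}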
Compare accelerator cost to scaling-out cost.
◦ A commodity-processor-only machine would have cost 2x what Roadrunner did in 2006-2007 (~$80M more)
◦ It would have used 2x or more power (~$1M per MW)
◦ Significantly larger node counts cause scaling & reliability issues
◦ Accelerators or heterogeneous chips should be greener
Ease of use issues
◦ Newer Cell programming techniques (ALF, OpenMP) could make this easier.
◦ A Cell cluster would be easier, but the PPE is really, really slow for non-SPU-accelerated code segments.
◦ Not for the faint of heart, but Top20 machines never are.
What is the future direction of hardware-based accelerators?
◦ Domain-specific libraries can make them far more useful in those specific areas
◦ Some may appear on Intel QPI or AMD HT
◦ Specialized cores will show up within commodity microprocessors – ignore them or use them
◦ GPU-based systems will have to adopt ECC & parity protection
◦ Convey appears to have the most viable FPGA approach (FPGA as a compiler-managed co-processor)
Software futures?
◦ OpenCL looks promising but doesn’t address programming the specialized accelerator devices themselves
◦ The uber-auto-wizard-compiler will never come
◦ Heterogeneous compilers may come
◦ Debuggers & tools may come
What are your thoughts on what the vendors need to do to ensure wider acceptance of accelerators?
◦ Create next-generation versions and sell them as mainstream products
Typical steps to port an algorithm to the Cell:
◦ Compile & run on the PowerPC PPE; identify & isolate the algorithm & data to run parallel on the 8 “remote” SPEs; compile a scalar version of the algorithm on the SPE
◦ Add SPE thread process control
◦ Add DMAs: use “blocking” DMAs at this stage just for functionality; worry about data alignments
◦ First on a single SPE, then on 8 SPEs
◦ Optimize SPE code: SIMD, branches & merges; add asynchronous double/triple buffering of DMAs
◦ For Roadrunner, connect to the rest of the code on the Opteron via DaCS and a “message relay”
(Sketches of the SPE thread launch and a double-buffered DMA loop follow this list.)
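For the SPE thread process control step, a hedged PPE-side sketch assuming the Cell SDK's libspe2 API: one pthread per SPE, each loading an embedded SPE program and running it on that SPE's slice of the data. The names spe_kernel, NSPE, and the work layout are placeholders invented for the example.

/* PPE-side sketch: spawn 8 SPE threads, pass each its work block as argp. */
#include <libspe2.h>
#include <pthread.h>
#include <stdio.h>

#define NSPE 8
extern spe_program_handle_t spe_kernel;   /* embedded SPE binary (hypothetical) */

static void *run_spe(void *arg)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    spe_program_load(ctx, &spe_kernel);
    /* spe_context_run blocks until the SPE program exits; arg is the
     * effective address of this SPE's work block. */
    spe_context_run(ctx, &entry, 0, arg, NULL, NULL);
    spe_context_destroy(ctx);
    return NULL;
}

int main(void)
{
    static float work[NSPE][4096] __attribute__((aligned(128)));
    pthread_t tid[NSPE];

    for (int i = 0; i < NSPE; i++)        /* 8-way parallel SPE threads */
        pthread_create(&tid[i], NULL, run_spe, work[i]);
    for (int i = 0; i < NSPE; i++)
        pthread_join(tid[i], NULL);

    printf("all SPE threads completed\n");
    return 0;
}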
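And for the asynchronous double-buffering step, an SPE-side sketch assuming spu_mfcio.h: a fenced prefetch keeps the next block's DMA in flight while the current block is computed. BLK, NBLK, and compute() are placeholders for this example.

/* SPE-side double buffering: overlap DMA of block i+1 with compute on block i. */
#include <spu_mfcio.h>

#define BLK   4096                       /* bytes per block                  */
#define NBLK  64                         /* blocks to stream through the SPE */

static char buf[2][BLK] __attribute__((aligned(128)));

static void compute(char *b, int n)      /* placeholder for the real kernel  */
{
    for (int j = 0; j < n; j++)
        b[j] ^= 1;
}

static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();           /* blocks until this tag's DMAs finish */
}

void process(unsigned long long ea)      /* ea = start of the data in main memory */
{
    int cur = 0;

    mfc_get(buf[0], ea, BLK, 0, 0, 0);   /* prime the pipeline: fetch block 0 */

    for (int i = 0; i < NBLK; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < NBLK)
            /* fenced get on the other buffer's tag: ordered after any
             * still-pending put from that buffer, so outgoing data is safe */
            mfc_getf(buf[nxt], ea + (unsigned long long)(i + 1) * BLK,
                     BLK, nxt, 0, 0);

        wait_tag(cur);                   /* block i has arrived in buf[cur]  */
        compute(buf[cur], BLK);          /* overlaps with the prefetch above */
        mfc_put(buf[cur], ea + (unsigned long long)i * BLK, BLK, cur, 0, 0);

        cur = nxt;
    }
    wait_tag(0);                         /* drain any remaining DMAs */
    wait_tag(1);
}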
Roadrunner is more than a petascale supercomputer for today’s use
◦ It provides a balanced platform to explore new algorithm designs and programming models, and to refresh developer skills
LANL has been an early adopter of transformational technology*:
◦ 1970s: HPC is scalar – LANL adopts vector (Cray 1 w/ no OS)
◦ 1980s: HPC is vector – LANL adopts data parallel (big CM-2)
◦ 2000s: HPC is multi-core clusters – LANL adopts hybrid (Roadrunner)
*Credit to Scott Pakin, CCS-1, for this list idea
[Figure: Roadrunner hybrid data flow. Left to right: Node (Opteron) with node memory, connected to the cluster via MPI and to the Cell over the PCIe link via DaCS; serial PPC processor with Cell memory; parallel SPE processors (8-way) with local memories, fed by DMA.]
(1) Host launches Cell code (non-accelerated code runs on the Opteron before this point)
(2) Host data pushed/pulled to Cell
(3) Cell spawns parallel threads on the SPEs
(4) Until done, each SPE DMA multi-buffers Cell data into local memory, computes within its local memory buffers, and DMA multi-buffers data back to Cell memory (8-way parallel)
(5a) Updated data pushed/pulled to Host; (5b) simultaneously, the node may need to push/pull more data to/from the Cell & to/from the cluster via MPI, or could be available for concurrent work during this time
(6) Cell code completed; parallel threads completed; non-accelerated code resumes on the host
How much can be automated in compilers or languages?
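As a rough illustration of the flow in the figure, here is a hypothetical host-side outline; the relay_* helpers are stand-ins for the DaCS "message relay" layer on Roadrunner (stubbed so the sketch compiles) and are not real DaCS API calls.

/* Host-side (Opteron) outline of steps (1)-(6) above. */
#include <stddef.h>

static void relay_launch_cell_code(const char *kernel) { (void)kernel; }        /* (1) */
static void relay_push(const void *buf, size_t n)      { (void)buf; (void)n; }  /* (2) */
static void relay_pull(void *buf, size_t n)            { (void)buf; (void)n; }  /* (5a) */
static int  relay_cell_done(void)                      { return 1; }            /* poll for (6) */

void accelerated_phase(double *field, size_t n)
{
    relay_launch_cell_code("spe_kernel");   /* (1) host launches Cell code        */
    relay_push(field, n * sizeof(double));  /* (2) host data pushed to the Cell   */

    /* (3)-(4) run on the Cell: the PPE spawns 8 SPE threads and each SPE
     * multi-buffers data through its local store via DMA until done.             */

    while (!relay_cell_done()) {
        /* (5b) simultaneously, the Opteron can push/pull more data, do
         * cluster-level MPI, or perform other concurrent work.                   */
    }

    relay_pull(field, n * sizeof(double));  /* (5a) updated data pulled to host   */
    /* (6) Cell code completed; non-accelerated code resumes on the Opteron.      */
}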