SKA Science Data
Processor Update
John Taylor
High Performance Computing and
Research Computing Service
University of Cambridge
SKA Science Data Processor Consortium
SDP, ISC Frankfurt, 21st June 2016
Overview
• SDP Context and Scope
• SDP Requirements and Challenges
• Regional Centres
• Overview of Radio Interferometry Processing
• Mapping the current architecture onto present-day hardware
  – Current "Hardware Costed Concept"
• Current prototyping activity and Next Steps
NB: Where possible I have indicated which document from the delta-PDR redacted set I am referring to.
Slide 2 note: "I will cover aspects of RCs" (Jeremy Coles, 17/06/2016)
SCIENCE DATA PROCESSOR
CONTEXT
SKA Context Diagram
These are off-site! (in Perth & Cape Town)
SDP Scope SKA Phase 1
Ref. SKA-TEL-SDP-0000001 SDP Preliminary Architecture Design P Alexander et al
SDP Key Performance Requirements: SKA Phase 1
Data Processor
• High Performance: ~100 PetaFLOPS
• Data Intensive: ~100 PetaBytes/observation (job)
• Partially real-time: ~10 s response time
• Partially iterative: ~10 iterations/job (~6 hours)
Data Preservation
• High Volume & High Growth Rate: ~100 PetaBytes/year
• Infrequent Access: ~a few times/year maximum
Delivery System
• Data Distribution: ~100 PetaBytes/year from Cape Town & Perth to the rest of the world
• Data Discovery: visualisation of 100k × 100k × 100k voxel cubes
[Context diagram: the SDP (Data Processor, Data Preservation, Delivery System) with its Local Monitoring & Control, interfacing to the Telescope Manager, CSP and the Observatory; indicative data rates of ~1 TByte/s, ~10 GByte/s and ~200 Gbit/s (TBC).]
SDP Overview
• So the SDP is much more than just another HPC system
• It needs to:
  – Achieve high performance on key scientific algorithms in the multi-PFLOPS regime
    • HPC technologies are critical
  – Collect, manage, store and deliver vast amounts of data as viable products
    • Big Data => variety, velocity, volume, veracity => value
  – Combine a real-time and iterative execution environment and provide feedback at various cadences to other elements of the telescope
    • High Performance Data Analytics
  – Operate 365 days a year
    • High availability; accommodate failure via software, as in modern hyperscale environments
  – Be extensible and scalable
    • Provide a modern ecosystem to accommodate new algorithm development and upgrades
SDP Challenges
• Power efficiency
  – The current (US) Exascale roadmap indicates 20-25 MW for an ExaFLOP by 2023; I recently saw 30 MW quoted somewhere too!
  – Aurora system: 180 (450) PFLOPS in 13 MW
• Cost
  – Are our assumptions correct? How will growth rates pan out (processor, memory, networking and storage)?
• Complexity of hardware and software
• Scalability and nature of software
  – Hardware roadmaps
  – Demonstrated software scaling is uncertain
• Extensibility, scalability, maintainability
  – SKA1 is the first "milestone": significant expansion is expected in the 2020s
  – 50-year observatory lifetime
KEY CHARACTERISTICS OF
RADIO INTERFEROMETRY IMAGE
PROCESSING
Key Characteristics of SKA Data Processing
• Very large data volumes: all data are processed in each observation
• Noisy data
• Sparse and incomplete sampling: corrected for by deconvolution using iterative algorithms (~10 iterations)
• Corrupted measurements: corrected by jointly solving for the sky brightness distribution and for the slowly changing corruption effects using iterative algorithms
• Multiple dimensions of data parallelism: loosely coupled tasks; a large degree of parallelism is inherently available
KEY ARCHITECTURAL
CONSIDERATIONS AND MAPPING
TO CURRENT HARDWARE
SDP Functional Breakdown
Ref. SKA-TEL-SDP-00000013 SDP Preliminary Architecture Design P Alexander et al
Imaging Component
Image Processing Model
Ref. SKA-TEL-SDP-0000018 SDP Data Processor Platform Design C. Broekema
[Diagram: image-processing data flow. Correlator output lands in the UV data store. UV processors perform RFI excision and phase rotation, subtract the current sky model from the visibilities using the current calibration model, and grid the UV data (e.g. W-projection). Imaging processors image the gridded data, deconvolve the imaged data in the minor cycle to update the current sky model, and solve for telescope and image-plane calibration to update the calibration model. The major cycle iterates these steps to produce astronomical-quality data.]
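To make the major/minor-cycle control flow above concrete, here is a minimal toy sketch (my illustration, not the SDP implementation). The operators are deliberately trivial stand-ins: identity gridding, a delta-function PSF, and a single scalar gain as the calibration model.

```python
import numpy as np

def degrid(sky_model, cal_model):
    # Predict model visibilities from the current sky and calibration models.
    return cal_model * np.fft.fft2(sky_model)

def dirty_image(residual_vis):
    # "Grid" (identity here) and FFT residual visibilities to a residual image.
    return np.real(np.fft.ifft2(residual_vis))

def solve_calibration(vis, sky_model):
    # Toy calibration solve: fit a single real gain against the model visibilities.
    model_vis = np.fft.fft2(sky_model)
    return (np.vdot(model_vis, vis) / np.vdot(model_vis, model_vis)).real

# "Correlator" output: one point source observed through a 1.3x gain error.
true_sky = np.zeros((64, 64))
true_sky[32, 32] = 1.0
vis = 1.3 * np.fft.fft2(true_sky)

sky_model = np.zeros_like(true_sky)
cal_model = 1.0

for major in range(10):                              # ~10 major cycles per job
    residual = dirty_image(vis - degrid(sky_model, cal_model))
    for _ in range(20):                              # minor cycle: CLEAN with a delta PSF
        peak = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
        flux = 0.1 * residual[peak]
        sky_model[peak] += flux
        residual[peak] -= flux
    cal_model = solve_calibration(vis, sky_model)    # update the calibration model

print(f"peak model flux {sky_model[32, 32]:.2f}, gain estimate {cal_model:.2f}")
```

Because this toy problem has no absolute flux reference, the recovered flux and gain are individually degenerate (only their product is constrained), which is the usual behaviour of self-calibration.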
Imaging and Fast Imaging in more detail
Drop Islands
Ref. SKA-TEL-SDP-0000015 SDP Execution Framework A. Wicenec et al
Compute Island/Node Concept
Ref. SKA-TEL-SDP-0000018 SDP Data Processor Platform Design C. Broekema
Compute Island
Current Hardware
Costed Concept
SDP Networking
Receive function: SKA1-Mid 74 and SKA1-Low 58 × 100 GbE connections (80% occupancy); see the cross-check sketch below.
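As a rough cross-check (my arithmetic; only the link counts and the 80% occupancy figure come from the slide), the implied ingest rates are:

```python
# Rough cross-check of the receive-function ingest rate implied by the figures above.
LINK_GBPS = 100                      # 100 GbE links
OCCUPANCY = 0.8                      # 80% occupancy, from the slide

for telescope, links in {"SKA1-Mid": 74, "SKA1-Low": 58}.items():
    gbit_s = links * LINK_GBPS * OCCUPANCY
    print(f"{telescope}: {gbit_s:.0f} Gbit/s ≈ {gbit_s / 8 / 1000:.2f} TByte/s ingest")
# SKA1-Mid ≈ 0.74 TByte/s and SKA1-Low ≈ 0.58 TByte/s, broadly consistent with the
# ~1 TByte/s order of magnitude in the context diagram.
```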
SDP Hardware Concept
SDP PROCESSING
REQUIREMENTS
Key topics on kernels
• Both gridding and FFT currently appear limited by memory bandwidth
  – Roofline: 2 bytes of memory transfer per DP FLOP
• Can we improve implementations past the 2 bytes/FLOP roofline?
• What is the most energy- and cost-effective way to buy memory bandwidth?
  – Can we program such a system?
• Do we need faceting?
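To put a number on the roofline limit, a minimal sketch (my arithmetic; the K40 peak and memory-bandwidth figures are vendor numbers, and the K40 bandwidth is also quoted later in this deck):

```python
# Roofline sketch: with ~2 bytes of memory traffic per double-precision FLOP,
# the achievable FLOP/s is bandwidth-bound well below peak.
BYTES_PER_FLOP = 2.0            # memory traffic per DP FLOP for gridding/FFT (slide figure)
peak_flops = 1.43e12            # K40 peak DP FLOP/s (vendor figure, assumption)
mem_bw = 288e9                  # K40 memory bandwidth, byte/s

bandwidth_bound = mem_bw / BYTES_PER_FLOP          # FLOP/s sustainable from memory
print(f"bandwidth-bound rate: {bandwidth_bound / 1e9:.0f} GFLOP/s "
      f"({100 * bandwidth_bound / peak_flops:.0f}% of peak)")
# => ~144 GFLOP/s, roughly 10% of peak: the motivation for buying bandwidth, not FLOPS.
```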
Design Equations
N_cu: the number of compute units, defined as units with very high bandwidth to working memory and shared-memory parallelism
C_peak: the peak FLOPS capability of a compute unit
C_max: the maximum FLOPS capability that a compute unit delivers in practice
R_bw,max: the maximum memory bandwidth of each compute unit to its main, high-throughput working memory
R_bw,I/O,max: the maximum I/O bandwidth of each compute unit to the buffer
M_cu,work: size of the working memory of a compute unit; this is the memory whose bandwidth is described by R_bw,max
M_cu,pool: slower working memory to which working grids etc. are swapped out when not being actively worked on; for accelerator-based systems this could be DRAM on the main board or, eventually, new high-throughput NVRAM technology
M_cu,buf: size of the buffer attached to each compute unit (or its share of the data-island local buffer)
Ref. SKA-TEL-SDP-0000040 Parametric models of SDP compute Requirements R. Bolton et al
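A hypothetical sizing sketch tying a subset of these parameters together. The numerical inputs are placeholders, not SDP-costed values, and C_max is modelled as bandwidth-limited in the same spirit as the efficiency memo cited later in this deck (SKA-TEL-SDP-0000086).

```python
# Hypothetical sizing sketch using the design parameters defined above.
# All numbers are placeholders for illustration, not SDP figures.
def compute_units_needed(total_flops, c_peak, r_bw_max, rho_op):
    """Estimate N_cu from a required sustained FLOP/s rate.

    C_max is modelled as bandwidth-limited: min(C_peak, rho_op * R_bw,max)."""
    c_max = min(c_peak, rho_op * r_bw_max)
    return total_flops / c_max, c_max

# Placeholder inputs: 100 PFLOP/s sustained, a 5 TFLOP/s accelerator with
# 720 GByte/s working-memory bandwidth, and rho_op ~ 0.6 FLOP/byte.
n_cu, c_max = compute_units_needed(100e15, 5e12, 720e9, 0.6)
print(f"C_max ≈ {c_max / 1e9:.0f} GFLOP/s per unit -> N_cu ≈ {n_cu:,.0f}")
```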
Where Next
• Prototyping around networking, memory and processing technologies
• Explore new algorithms which may ameliorate the memory-bandwidth bottleneck
• Understanding of how the Execution Framework will work
  – Critical for the data-driven architecture
• Understanding of how Control and Management will work, and of the appropriate middleware
  – Critical for integration with TM
• Role of the Open Architecture Lab
EXPLORING EFFICIENCY
Experimental Evidence
• Using memory bandwidth to determine efficiency
• ξ_comp = C_max / C_peak
• C_max = 120 GFLOP/s (gridding)
• C_max = 226 GFLOP/s (FFT)
• R_bw = 288 GByte/s (NVIDIA K40)
• ρ_op ≈ 0.6 FLOP/Byte
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
Estimating Future Performance
• NVIDIA Pascal
• C_peak ≈ 5 TFLOP/s
• R_bw = 720 GByte/s (1 TByte/s predicted)
• Using ρ_op ≈ 0.6 FLOP/Byte gives:
  ⇒ C_max ≈ 430 GFLOP/s
  ⇒ ξ_comp = C_max / C_peak ≈ 0.09
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
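The arithmetic behind these two slides can be reproduced directly. A small sketch (my reconstruction; it assumes, as an illustration, that ρ_op is taken as the average of the two measured kernels, which yields the ~0.6 FLOP/Byte quoted above):

```python
# Reproduce the efficiency arithmetic from the two slides above (SKA-TEL-SDP-0000086).
# The K40 measurements calibrate rho_op; the Pascal figures are then projected from it.
k40_bw = 288e9                                 # K40 memory bandwidth, byte/s
measured = {"gridding": 120e9, "fft": 226e9}   # measured C_max on the K40, FLOP/s

rho_op = sum(measured.values()) / len(measured) / k40_bw   # ~0.6 FLOP/byte
pascal_peak, pascal_bw = 5e12, 720e9                       # ~5 TFLOP/s, 720 GByte/s
pascal_cmax = rho_op * pascal_bw
xi_comp = pascal_cmax / pascal_peak

print(f"rho_op ≈ {rho_op:.2f} FLOP/byte")
print(f"projected C_max ≈ {pascal_cmax / 1e9:.0f} GFLOP/s, xi_comp ≈ {xi_comp:.2f}")
# ≈ 0.60 FLOP/byte, ≈ 432 GFLOP/s and xi_comp ≈ 0.09, matching the ~430 GFLOP/s
# and ~9% efficiency quoted above.
```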
Estimating Future Processor Count
• Future HPC systems using High Bandwidth Memory (3D stacking)
• A system delivering 100 PFLOP/s at ρ_op ≈ 0.6 FLOP/Byte implies an aggregate R_bw ≈ 170 PByte/s
• For HBM2, each stack provides 256 GByte/s, implying 7 × 10^5 individual stacks, or 120 × 10^3 processor/accelerator packages at 6 stacks per processor
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
Estimating Future Power Requirements
• Power requirements
  – Each HBM stack is estimated at 6 pJ/bit
  – For a system with aggregate R_bw = 170 PByte/s => 8.2 MW for memory traffic
  – => ~25 MW for the full system
  – The Aurora system is currently 180 PFLOP/s in around 13 MW
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
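The sizing chain on the last two slides can likewise be reproduced (my arithmetic; the inputs are the figures quoted above, and the slides round slightly differently):

```python
# Reproduce the system-sizing arithmetic from the two slides above (SKA-TEL-SDP-0000086).
RHO_OP = 0.6                    # FLOP/byte, from the efficiency estimate
SYSTEM_FLOPS = 100e15           # ~100 PFLOP/s sustained
HBM2_STACK_BW = 256e9           # byte/s per HBM2 stack
STACKS_PER_PACKAGE = 6
HBM_ENERGY_PER_BIT = 6e-12      # J/bit per HBM stack access (estimate from the slide)

agg_bw = SYSTEM_FLOPS / RHO_OP                    # aggregate memory bandwidth, byte/s
stacks = agg_bw / HBM2_STACK_BW
packages = stacks / STACKS_PER_PACKAGE
memory_power = agg_bw * 8 * HBM_ENERGY_PER_BIT    # W, memory subsystem only

print(f"aggregate bandwidth ≈ {agg_bw / 1e15:.0f} PByte/s")
print(f"≈ {stacks:.1e} HBM2 stacks, ≈ {packages:.1e} processor packages")
print(f"memory power ≈ {memory_power / 1e6:.1f} MW")
# The slides round these to ~170 PByte/s, ~7e5 stacks, ~1.2e5 packages and 8.2 MW,
# and then estimate ~25 MW for the full system.
```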
Open Architecture Lab
Open Architecture Lab
• OAL provides a service function to SDP to support horizontal and vertical
prototyping activities as a means to address risk-reduction w.r.t SDP
Product Tree
• Currently, horizontal prototyping is being conducted by a number of separate prototyping activities, e.g. an integration prototype using, where possible, SKA-SA MeerKAT
• OAL is a distributed function across SDP and thus requires effective
communication to inform the consortium on activities and avoid duplication.
• Vertical prototyping is focused on specific COTS technologies w.r.t. the Product Tree analysis; current activity has been moderated by the incipient PT analysis
• Prototyping activities have centred around equipment in HPCS Cambridge
and ICRAR along with specific Industry engagement where appropriate.
Candidate technology elements becoming “clearer” for 2023
SDP Product Tree
Processor Platform
Product Tree Analysis Process Overview
Product tree / design
Select Candidate Solutions
Assess Risk
Select CTEs
Assess TRL
Prototype/test plan
Requirements
1.0 Introduction
1.1 Background and Strategic Fit
1.2 Context
1.3 Behaviour
1.4 Interfaces
2.0 Requirements
2.1 Performance Requirements
2.2 Functional Requirements
2.3 Cost
2.4 Schedule
3.0 Select Candidate Solutions
3.1 Architectural Drivers
3.2 Candidate Solutions
3.3 Concept Selection Table
3.4 Risk Assessment Table
3.5 Select Preferred Option(s)
4.0 Critical Technology Element Selection
5.0 Technology Readiness Level Assessment
6.0 List of TBDs
7.0 Prioritised Prototyping Test Plan
8.0 Not Doing / Not Considered
OAL Vertical Prototyping Activities
Candidate compute architectures to address computational kernels and imaging
pipelines:
• Many-core accelerators (e.g. GPGPU, FPGA and Xeon Phi)
• Accelerated Processing Units (APUs) comprising a CPU and GPU in one package
• Low-power SoC technologies (e.g. ARM, Atom)
Storage Solutions addressing pseudo real-time buffering of the visibility buffer and the
archive:
• Enterprise-level vs Commodity Disks (e.g. SAS vs. SATA)
• DRAM and Non-volatile (NVRAM) storage
• Parallel file systems (e.g. Lustre)
• Object-based storage (e.g. SWIFT, CEPH)
High performance networks addressing bulk-data transport and potential low-latency
interconnect
• Infiniband and other “proprietary” networking
• High Speed Ethernet
• Software Defined Networks
System Level Software (middleware)
Operations – service levels, system maintenance process, lifecycle management
Open Architecture Lab
• Focus on key technology pinch-points: processor, storage, networking and data flow
  – The many-core/accelerator model is seen as the most viable route, albeit efficiency may be low (~10%); work to follow (mostly x86, NVIDIA, POWER and ARM)
  – NVRAM: initial work on CASA using SSDs (2-3× speed-up over local storage); this should be extended to track NVRAM technology, both local and over fabric
  – Networking: tracking high-arity networking silicon and exploring QoS and SDN for combined networking
  – Data flow: use of Wilkes in Cambridge