SKA Science Data
Processor Update
John Taylor
High Performance Computing and
Research Computing Service
University of Cambridge
SKA Science Data Processor Consortium
SDP, ISC Frankfurt, 21st June 2016
Overview
• SDP Context and Scope
• SDP Requirements and Challenges
• Regional Centres
• Overview of Radio Interferometry Processing
• Mapping the current architecture onto present-day hardware
  – Current "Hardware Costed Concept"
• Current prototyping activity and Next Steps
NB: Where possible I have indicated which document from the delta-PDR redacted set I am referring to.
Slide 2 note: "I will cover aspects of RCs" (Jeremy Coles, 17/06/2016)
SCIENCE DATA PROCESSOR
CONTEXT
SKA Context Diagram
These are off-site! (in Perth & Cape Town)
SDP Scope SKA Phase 1
Ref. SKA-TEL-SDP-0000001 SDP Preliminary Architecture Design P Alexander et al
SDP Key Performance Requirements: SKA Phase 1
Data Processor
• High Performance: ~100 PetaFLOPS
• Data Intensive: ~100 PetaBytes/observation (job)
• Partially real-time: ~10 s response time
• Partially iterative: ~10 iterations/job (~6 hours)
Data Preservation
• High Volume & High Growth Rate: ~100 PetaBytes/year
• Infrequent Access: ~a few times/year maximum
Delivery System
• Data Distribution: ~100 PetaBytes/year from Cape Town & Perth to the rest of the world
• Data Discovery: visualisation of 100k × 100k × 100k voxel cubes
[Context diagram: the SDP (Data Processor, Data Preservation, Delivery System) with its Local Monitoring & Control, interfacing to the Telescope Manager, CSP and the Observatory; indicative data rates of ~1 TByte/s, ~10 GByte/s and ~200 Gbit/s (TBC).]
SDP Overview
• So the SDP is much more than just another HPC system
• It needs to:
  – Achieve high performance on key scientific algorithms in the multi-PFLOPS regime
    • HPC technologies are critical
  – Collect, manage, store and deliver vast amounts of data as viable products
    • Big Data => variety, velocity, volume, veracity => value
  – Combine a real-time and iterative execution environment and provide feedback at various cadences to other elements of the telescope
    • High Performance Data Analytics
  – Operate 365 days a year
    • High availability; accommodate failure via software, as in modern hyperscale environments
  – Be extensible and scalable
    • Provide a modern ecosystem to accommodate new algorithm development and upgrades
SDP Challenges
• Power efficiency
  – The current (US) Exascale roadmap indicates 20-25 MW for an ExaFLOP by 2023; I recently saw 30 MW quoted somewhere too!
  – Aurora system: 180 (450) PFLOPS in 13 MW
• Cost
  – Are our assumptions correct? How will growth rates pan out (processor, memory, networking and storage)?
• Complexity of hardware and software
• Scalability and nature of software
  – Hardware roadmaps
  – Demonstrated software scaling is uncertain
• Extensibility, scalability, maintainability
  – SKA1 is the first "milestone": significant expansion is expected in the 2020s
  – 50-year observatory lifetime
KEY CHARACTERISTICS OF
RADIO INTERFEROMETRY IMAGE
PROCESSING
Key Characteristics of SKA Data Processing
• Very large data volumes: all data are processed in each observation
• Noisy data
• Sparse and incomplete sampling: corrected for by deconvolution using iterative algorithms (~10 iterations)
• Corrupted measurements: corrected by jointly solving for the sky brightness distribution and for the slowly changing corruption effects using iterative algorithms
• Multiple dimensions of data parallelism: loosely coupled tasks; a large degree of parallelism is inherently available
KEY ARCHITECTURAL
CONSIDERATIONS AND MAPPING
TO CURRENT HARDWARE
SDP Functional Breakdown
Ref. SKA-TEL-SDP-00000013 SDP Preliminary Architecture Design P Alexander et al
Imaging Component
Image Processing Model
Ref. SKA-TEL-SDP-0000018 SDP Data Processor Platform Design C. Broekema
[Diagram: image-processing data flow. Correlator output lands in the UV data store. UV processors perform RFI excision and phase rotation, subtract the current sky model from the visibilities using the current calibration model, and grid the UV data (e.g. W-projection). Imaging processors image the gridded data, deconvolve the imaged data in the minor cycle to update the current sky model, and solve for telescope and image-plane calibration to update the calibration model. The major cycle iterates these steps to produce astronomical-quality data.]
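To make the major/minor-cycle control flow above concrete, here is a minimal toy sketch (my illustration, not the SDP implementation). The operators are deliberately trivial stand-ins: identity gridding, a delta-function PSF, and a single scalar gain as the calibration model.

```python
import numpy as np

def degrid(sky_model, cal_model):
    # Predict model visibilities from the current sky and calibration models.
    return cal_model * np.fft.fft2(sky_model)

def dirty_image(residual_vis):
    # "Grid" (identity here) and FFT residual visibilities to a residual image.
    return np.real(np.fft.ifft2(residual_vis))

def solve_calibration(vis, sky_model):
    # Toy calibration solve: fit a single real gain against the model visibilities.
    model_vis = np.fft.fft2(sky_model)
    return (np.vdot(model_vis, vis) / np.vdot(model_vis, model_vis)).real

# "Correlator" output: one point source observed through a 1.3x gain error.
true_sky = np.zeros((64, 64))
true_sky[32, 32] = 1.0
vis = 1.3 * np.fft.fft2(true_sky)

sky_model = np.zeros_like(true_sky)
cal_model = 1.0

for major in range(10):                              # ~10 major cycles per job
    residual = dirty_image(vis - degrid(sky_model, cal_model))
    for _ in range(20):                              # minor cycle: CLEAN with a delta PSF
        peak = np.unravel_index(np.argmax(np.abs(residual)), residual.shape)
        flux = 0.1 * residual[peak]
        sky_model[peak] += flux
        residual[peak] -= flux
    cal_model = solve_calibration(vis, sky_model)    # update the calibration model

print(f"peak model flux {sky_model[32, 32]:.2f}, gain estimate {cal_model:.2f}")
```

Because this toy problem has no absolute flux reference, the recovered flux and gain are individually degenerate (only their product is constrained), which is the usual behaviour of self-calibration.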
Imaging and Fast Imaging in more detail
Drop Islands
Ref. SKA-TEL-SDP-0000015 SDP Execution Framework A. Wicenec et al
Compute Island/Node Concept
Ref. SKA-TEL-SDP-0000018 SDP Data Processor Platform Design C. Broekema
Compute Island
Current Hardware
Costed Concept
SDP Networking
Receive function: SKA1-Mid 74 and SKA1-Low 58 × 100 GbE connections (80% occupancy); see the cross-check sketch below.
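As a rough cross-check (my arithmetic; only the link counts and the 80% occupancy figure come from the slide), the implied ingest rates are:

```python
# Rough cross-check of the receive-function ingest rate implied by the figures above.
LINK_GBPS = 100                      # 100 GbE links
OCCUPANCY = 0.8                      # 80% occupancy, from the slide

for telescope, links in {"SKA1-Mid": 74, "SKA1-Low": 58}.items():
    gbit_s = links * LINK_GBPS * OCCUPANCY
    print(f"{telescope}: {gbit_s:.0f} Gbit/s ≈ {gbit_s / 8 / 1000:.2f} TByte/s ingest")
# SKA1-Mid ≈ 0.74 TByte/s and SKA1-Low ≈ 0.58 TByte/s, broadly consistent with the
# ~1 TByte/s order of magnitude in the context diagram.
```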
SDP Hardware Concept
SDP PROCESSING
REQUIREMENTS
Key topics on kernels
• Both gridding and FFT currently appear limited by memory bandwidth
  – Roofline: 2 bytes of memory transfer per DP FLOP
• Can we improve implementations past the 2 bytes/FLOP roofline?
• What is the most energy- and cost-effective way to buy memory bandwidth?
  – Can we program such a system?
• Do we need faceting?
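To put a number on the roofline limit, a minimal sketch (my arithmetic; the K40 peak and memory-bandwidth figures are vendor numbers, and the K40 bandwidth is also quoted later in this deck):

```python
# Roofline sketch: with ~2 bytes of memory traffic per double-precision FLOP,
# the achievable FLOP/s is bandwidth-bound well below peak.
BYTES_PER_FLOP = 2.0            # memory traffic per DP FLOP for gridding/FFT (slide figure)
peak_flops = 1.43e12            # K40 peak DP FLOP/s (vendor figure, assumption)
mem_bw = 288e9                  # K40 memory bandwidth, byte/s

bandwidth_bound = mem_bw / BYTES_PER_FLOP          # FLOP/s sustainable from memory
print(f"bandwidth-bound rate: {bandwidth_bound / 1e9:.0f} GFLOP/s "
      f"({100 * bandwidth_bound / peak_flops:.0f}% of peak)")
# => ~144 GFLOP/s, roughly 10% of peak: the motivation for buying bandwidth, not FLOPS.
```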
Design Equations
N_cu: the number of compute units, defined as units with very high bandwidth to working memory and shared-memory parallelism
C_peak: the peak FLOPS capability of a compute unit
C_max: the maximum FLOPS capability that a compute unit delivers in practice
R_bw,max: the maximum memory bandwidth of each compute unit to its main, high-throughput working memory
R_bw,I/O,max: the maximum I/O bandwidth of each compute unit to the buffer
M_cu,work: size of the working memory of a compute unit; this is the memory whose bandwidth is described by R_bw,max
M_cu,pool: slower working memory to which working grids etc. are swapped out when not being actively worked on; for accelerator-based systems this could be DRAM on the main board or, eventually, new high-throughput NVRAM technology
M_cu,buf: size of the buffer attached to each compute unit (or its share of the data-island local buffer)
Ref. SKA-TEL-SDP-0000040 Parametric models of SDP compute Requirements R. Bolton et al
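A hypothetical sizing sketch tying a subset of these parameters together. The numerical inputs are placeholders, not SDP-costed values, and C_max is modelled as bandwidth-limited in the same spirit as the efficiency memo cited later in this deck (SKA-TEL-SDP-0000086).

```python
# Hypothetical sizing sketch using the design parameters defined above.
# All numbers are placeholders for illustration, not SDP figures.
def compute_units_needed(total_flops, c_peak, r_bw_max, rho_op):
    """Estimate N_cu from a required sustained FLOP/s rate.

    C_max is modelled as bandwidth-limited: min(C_peak, rho_op * R_bw,max)."""
    c_max = min(c_peak, rho_op * r_bw_max)
    return total_flops / c_max, c_max

# Placeholder inputs: 100 PFLOP/s sustained, a 5 TFLOP/s accelerator with
# 720 GByte/s working-memory bandwidth, and rho_op ~ 0.6 FLOP/byte.
n_cu, c_max = compute_units_needed(100e15, 5e12, 720e9, 0.6)
print(f"C_max ≈ {c_max / 1e9:.0f} GFLOP/s per unit -> N_cu ≈ {n_cu:,.0f}")
```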
Where Next
• Prototyping around networking, memory and processing technologies
• Explore new algorithms which may ameliorate the memory-bandwidth bottleneck
• Understanding of how the Execution Framework will work
  – Critical for the data-driven architecture
• Understanding of how Control and Management will work, and of the appropriate middleware
  – Critical for integration with TM
• Role of the Open Architecture Lab
EXPLORING EFFICIENCY
Experimental Evidence
• Using memory bandwidth to determine efficiency
• ξ_comp = C_max / C_peak
• C_max = 120 GFLOP/s (gridding)
• C_max = 226 GFLOP/s (FFT)
• R_bw = 288 GByte/s (NVIDIA K40)
• ρ_op ≈ 0.6 FLOP/Byte
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
Estimating Future Performance
• NVIDIA Pascal
• C_peak ≈ 5 TFLOP/s
• R_bw = 720 GByte/s (1 TByte/s predicted)
• Using ρ_op ≈ 0.6 FLOP/Byte gives:
  ⇒ C_max ≈ 430 GFLOP/s
  ⇒ ξ_comp = C_max / C_peak ≈ 0.09
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
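The arithmetic behind these two slides can be reproduced directly. A small sketch (my reconstruction; it assumes, as an illustration, that ρ_op is taken as the average of the two measured kernels, which yields the ~0.6 FLOP/Byte quoted above):

```python
# Reproduce the efficiency arithmetic from the two slides above (SKA-TEL-SDP-0000086).
# The K40 measurements calibrate rho_op; the Pascal figures are then projected from it.
k40_bw = 288e9                                 # K40 memory bandwidth, byte/s
measured = {"gridding": 120e9, "fft": 226e9}   # measured C_max on the K40, FLOP/s

rho_op = sum(measured.values()) / len(measured) / k40_bw   # ~0.6 FLOP/byte
pascal_peak, pascal_bw = 5e12, 720e9                       # ~5 TFLOP/s, 720 GByte/s
pascal_cmax = rho_op * pascal_bw
xi_comp = pascal_cmax / pascal_peak

print(f"rho_op ≈ {rho_op:.2f} FLOP/byte")
print(f"projected C_max ≈ {pascal_cmax / 1e9:.0f} GFLOP/s, xi_comp ≈ {xi_comp:.2f}")
# ≈ 0.60 FLOP/byte, ≈ 432 GFLOP/s and xi_comp ≈ 0.09, matching the ~430 GFLOP/s
# and ~9% efficiency quoted above.
```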
Estimating Future Processor Count
• Future HPC systems using High Bandwidth Memory (3D stacking)
• A system delivering 100 PFLOP/s at ρ_op ≈ 0.6 FLOP/Byte implies an aggregate R_bw ≈ 170 PByte/s
• For HBM2, each stack provides 256 GByte/s, implying 7 × 10^5 individual stacks, or 120 × 10^3 processor/accelerator packages at 6 stacks per processor
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
Estimating Future Power Requirements
• Power requirements
  – Each HBM stack is estimated at 6 pJ/bit
  – For a system with aggregate R_bw = 170 PByte/s => 8.2 MW for memory traffic
  – => ~25 MW for the full system
  – The Aurora system is currently 180 PFLOP/s in around 13 MW
Ref. SKA-TEL-SDP-0000086 SDP Memo: Estimating the SDP Computational Efficiency, B. Nikolic
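The sizing chain on the last two slides can likewise be reproduced (my arithmetic; the inputs are the figures quoted above, and the slides round slightly differently):

```python
# Reproduce the system-sizing arithmetic from the two slides above (SKA-TEL-SDP-0000086).
RHO_OP = 0.6                    # FLOP/byte, from the efficiency estimate
SYSTEM_FLOPS = 100e15           # ~100 PFLOP/s sustained
HBM2_STACK_BW = 256e9           # byte/s per HBM2 stack
STACKS_PER_PACKAGE = 6
HBM_ENERGY_PER_BIT = 6e-12      # J/bit per HBM stack access (estimate from the slide)

agg_bw = SYSTEM_FLOPS / RHO_OP                    # aggregate memory bandwidth, byte/s
stacks = agg_bw / HBM2_STACK_BW
packages = stacks / STACKS_PER_PACKAGE
memory_power = agg_bw * 8 * HBM_ENERGY_PER_BIT    # W, memory subsystem only

print(f"aggregate bandwidth ≈ {agg_bw / 1e15:.0f} PByte/s")
print(f"≈ {stacks:.1e} HBM2 stacks, ≈ {packages:.1e} processor packages")
print(f"memory power ≈ {memory_power / 1e6:.1f} MW")
# The slides round these to ~170 PByte/s, ~7e5 stacks, ~1.2e5 packages and 8.2 MW,
# and then estimate ~25 MW for the full system.
```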
Open Architecture Lab
Open Architecture Lab
• OAL provides a service function to SDP to support horizontal and vertical
prototyping activities as a means to address risk-reduction w.r.t SDP
Product Tree
• Currently, horizontal prototyping is being conducted by a number of separate prototyping activities, e.g. an integration prototype using, where possible, SKA-SA MeerKAT
• OAL is a distributed function across SDP and thus requires effective
communication to inform the consortium on activities and avoid duplication.
• Vertical prototyping is focused on specific COTS technologies w.r.t. the Product Tree analysis; current activity has been moderated by the incipient PT analysis
• Prototyping activities have centred around equipment in HPCS Cambridge
and ICRAR along with specific Industry engagement where appropriate.
Candidate technology elements becoming “clearer” for 2023
SDP Product Tree
Processor Platform
Product Tree Analysis Process Overview
Product tree / design
Select Candidate Solutions
Assess Risk
Select CTEs
Assess TRL
Prototype/test plan
Requirements
1.0 Introduction
1.1 Background and Strategic Fit
1.2 Context
1.3 Behaviour
1.4 Interfaces
2.0 Requirements
2.1 Performance Requirements
2.2 Functional Requirements
2.3 Cost
2.4 Schedule
3.0 Select Candidate Solutions
3.1 Architectural Drivers
3.2 Candidate Solutions
3.3 Concept Selection Table
3.4 Risk Assessment Table
3.5 Select Preferred Option(s)
4.0 Critical Technology Element Selection
5.0 Technology Readiness Level Assessment
6.0 List of TBDs
7.0 Prioritised Prototyping Test Plan
8.0 Not Doing / Not Considered
OAL Vertical Prototyping Activities
Candidate compute architectures to address computational kernels and imaging
pipelines:
• Many-core accelerators (e.g. GPGPU, FPGA and Xeon Phi)
• Accelerated Processing Units (APUs) comprising a CPU and GPU in one package
• Low-power SoC technologies (e.g. ARM, Atom)
Storage Solutions addressing pseudo real-time buffering of the visibility buffer and the
archive:
• Enterprise-level vs Commodity Disks (e.g. SAS vs. SATA)
• DRAM and Non-volatile (NVRAM) storage
• Parallel file systems (e.g. Lustre)
• Object-based storage (e.g. SWIFT, CEPH)
High performance networks addressing bulk-data transport and potential low-latency
interconnect
• Infiniband and other “proprietary” networking
• High Speed Ethernet
• Software Defined Networks
System Level Software (middleware)
Operations – service levels, system maintenance process, lifecycle management
Open Architecture Lab
• Focus on key technology pinch-points: processor, storage, networking and data flow
  – The many-core/accelerator model is seen as the most viable route, albeit efficiency may be low (~10%); work to follow (mostly x86, NVIDIA, POWER and ARM)
  – NVRAM: initial work on CASA using SSDs (2-3× speed-up over local storage); this should be extended to track NVRAM technology, both local and over fabric
  – Networking: tracking high-arity networking silicon and exploring QoS and SDN for combined networking
  – Data flow: use of Wilkes in Cambridge