PDR.02.01 Compute Platform Element Subsystem Design

Document number: SKA-TEL-SDP-0000018
Context: COMP
Revision: 1.0
Author: P.C. Broekema
Release Date: 2015-02-09
Document Classification: Unrestricted
Status: Final
Name: Chris Broekema
Designation: COMP Team Lead
Affiliation: ASTRON
Signature & Date: P.C. Broekema (Feb 9, 2015)

Name: Paul Alexander
Designation: SDP Project Lead
Affiliation: University of Cambridge
Signature & Date: Paul Alexander (Feb 9, 2015)

Version   Date of Issue   Prepared by       Comments
1.0       2015-02-09      P. C. Broekema

ORGANISATION DETAILS
Name: Science Data Processor Consortium
Table of Contents

List of Figures
List of Tables
References
    Applicable documents
    Reference documents
Introduction
    Purpose of this document
    Scope of this document
    Assumptions made in this document
    Functional decomposition
SDP requirements and constraints
    Computational requirements dictated by science objectives
    Constraints
        Capital constraints
        Power constraints
    L1 and L2 requirements
SDP architecture
    Architectural design principles
    Top-level architecture
    Data flow model
    Compute Island
    SDP scaling
    Science Archive
Computational efficiency
Roll-out schedule
Data transport
    Data transport bandwidth requirements
    Top-level network architecture
    Bulk data transport network design
    Ingest Processing
    High-performance, low-latency interconnect architecture
    Management, monitoring and control network
    Data reordering
        In-network reordering
        Intra-island reordering
        Inter-island reordering
    Software-defined networking
    Combining Bulk Data Network with Low-latency Network
Compute node model
    From performance model to design characteristics
    Baseline model - current-day technology
Storage model
    Intermediate buffer
    Science Archive
    Mirror science archive
Software stack
    Operating system
    Middleware
        Messaging layer
        Logging system
        Platform management system
        System optimisation
    Archive HSM Software
    Application development environment and software development kit
    Scheduler
SDP infrastructure
Data delivery platform hardware
LMC system hardware architecture
Suitability and scalability of the architecture
Sub-element risks
Requirement traceability
List of Figures

Figure 1: SDP Hardware compute platform product sub-tree
Figure 2: SDP Software compute platform product sub-tree [AD10]
Figure 3: The Platform Management function is the only function assigned to the compute platform.
Figure 4: SDP compute platform context diagram.
Figure 5: Top-level overview of the SKA Science Data Processor functions.
Figure 6: Top level Logical Data Flow Diagram for the SDP pipelines. Shown are dependencies and interactions between the different pipelines.
Figure 7: The SDP data flow.
Figure 8: The SDP Compute Island concept, showing the various components in an island.
Figure 9: SDP scaling. Each telescope SDP consists of a number of Compute Islands that are built up from a number of Compute Nodes.
Figure 10: SDP compute distribution for the three SKA telescopes.
Figure 11: cuFFT performance on Nvidia Tesla K40c [RD12].
Figure 12: Performance analysis of John Romein's gridding algorithm, from [RD10].
Figure 13: The public Nvidia roadmap up to 2016 [RD13].
Figure 14: Preliminary timeline for SDP construction for the three telescopes [AD12].
Figure 15: Top-level SDP network design.
Figure 16: A potential SDP compute node model implementation using current-day hardware.
Figure 17: Double buffering in an SDP Compute Island.
Figure 18: The SDP software compute platform middleware product sub-tree.
Figure 19: Overview of the SDP Hierarchical Storage Manager.

List of Tables

Table 1: Energy budgets for the three telescope SDPs.
Table 2: Computational FFT efficiency on both CPU and GPU, from [RD09].
Table 3: Hardware specifications of the platforms used in the gridding analysis presented in [RD10].
Table 4: Input data transport bandwidth requirements for the three telescopes.
Table 5: Performance requirements for the SKA1 baseline, including baseline-dependent averaging.
Table 6: Performance requirements per achieved TFLOPS.
Table 7: SKA1 node characteristics, assuming 700 GFLOPS achieved computational capacity.
References
Applicable documents
The following documents are applicable to the extent stated herein. In the event of conflict
between the contents of the applicable documents and this document, the applicable
documents shall take precedence.
Reference Number Reference
PDR.01 / [AD01] SKA.TEL.SDP-0000002 - SKA preliminary SDP architecture and System Description
PDR.05 / [AD02] SKA-TEL-SDP-0000003 – SDP Performance Models – PDR.05
PDR.02 / [AD03] Sub-element design: Data Delivery
PDR.02 / [AD04] Sub-element design document: LMC
PDR.04 / [AD05] Interface Requirements (Ext ICDs) - PDR.04
[AD06] SKA-TEL.SDP.SE-TEL.CSP.SE-ICD-001 SKA1 Interface Control document SDP to CSP
[AD07] SKA-TEL.SADT.SE-TEL.SDP.SE-ICD-001 Interface Control document SADT to SDP
PDR.01.01 / [AD08] SKA-TEL-SDP-0000014 ASSUMPTIONS AND NON-CONFORMANCE
[AD09] SKA-TEL-SKO-0000035 SKA1 POWER BUDGET
PDR.03 / [AD10] Requirements Analysis & Allocations
PDR11 / [AD11] Preliminary Element Integrated Logistics Support Plan
PDR.08 / [AD12] PRELIMINARY PLAN FOR CONSTRUCTION
[AD13] SKA-TEL.SDP.SE-TEL.INFRA.SE-ICD-001 SKA1 Interface Control document SDP to INFRA-AUS and INFRA-SA
[AD14] SKA-TEL-SDP-0000027 Pipelines Element Subsystem Design
[AD15] SKA-TEL-SDP-0000054, SKA-TEL-SDP-0000053 Prototyping and Development Plans
[AD16] SKA-TEL-SDP-0000028 Parametric Modelling of the ingest pipeline
Reference documents
The following documents are referenced in this document. In the event of conflict between the
contents of the referenced documents and this document, this document shall take
precedence.
Reference Number Reference
[RD01] SKA-TEL-SDP-0000007-glossary-PDR16
[RD02] SKA-TEL-SDP-COMP-MEMO-010 Networking in LOFAR and how a software-defined network may improve robustness and flexibility
[RD03] SKA-TEL-CSP-0000113 SKA-TEL.CSP.CBF.SUR Sub-element Prototype Test Report (ProtoTestReport-SUR) – Part 1 Data Compression
[RD04] SKA-TEL-SDP-0000019 Compute platform: Hardware alternatives and developments
[RD05] SKA-TEL-SDP-0000020 Compute platform: Software stack developments and considerations
[RD06] SKA-TEL-SDP-0000021 Improving sensor network robustness and flexibility using software-defined networks
[RD07] SKA-TEL-SDP-0000022 Compute platform: Standardisation
[RD08] Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems - Amar Phanishayee et al. http://www.cs.cmu.edu/~dga/papers/incast-fast2008
[RD09] FFT Analysis - Stefano Salvini
[RD10] An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs - John W. Romein
[RD11] SKA SDP Performance Model - https://github.com/SKA-ScienceDataProcessor/sdp-par-model
[RD12] https://developer.nvidia.com/cuFFT (23-01-2015)
[RD13] http://www.anandtech.com/show/7900/nvidia-updates-gpu-roadmap-unveils-pascal-architecture-for-2016 (23-01-2015)
[RD14] http://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures
[RD15] http://en.wikipedia.org/wiki/Skylake_%28microarchitecture%29
[RD16] http://en.wikipedia.org/wiki/DDR4_SDRAM
[RD17] http://www.extremetech.com/extreme/171678-intel-unveils-72-core-x86-knights-landing-cpu-for-exascale-supercomputing
[RD18] http://goparallel.sourceforge.net/intel-reveals-details-of-next-gen-xeon-phis/
[RD19] http://www.infinibandta.org/content/pages.php?pg=technology_overview
[RD20] http://www.ieee802.org/3/
[RD21] http://en.wikipedia.org/wiki/Phase-change_memory
[RD22] http://en.wikipedia.org/wiki/Memristor
[RD23] http://investors.micron.com/releasedetail.cfm?ReleaseID=692563
[RD24] http://ark.intel.com/products/75272/Intel-Xeon-Processor-E5-2660-v2-25M-Cache-2_20-GHz
[RD25] http://www.nvidia.com/object/tesla-servers.html
[RD26] http://www.hgst.com/solid-state-storage/enterprise-ssd/sas-ssd/ultrastar-ssd1600mr
[RD27] http://www.intel.com/content/www/us/en/network-adapters/converged-network-adapters/ethernet-x520.html
[RD28] http://www.mellanox.com/page/products_dyn?product_family=119&mtag=connectx_3_vpi
[RD29] https://perf.wiki.kernel.org/index.php/Main_Page
[RD30] https://www.docker.com/
[RD31] http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771442.pdf
[RD32] SKA-TEL-SDP-0000046 Costs: Basis of Estimate
Introduction
Purpose of this document
This document is part of the preliminary design review (PDR) documentation for the Square Kilometre Array Science Data Processor (SKA SDP) element. It provides the compute platform
sub-system design which includes all hardware and all software required to efficiently use and
develop for that hardware (i.e. operating systems, middleware, deployment software and
development environment).
In terms of the SDP product tree, this document describes the hardware compute platform (C.1)
and the software compute platform (C.2), both of which are shown below.
Figure 1: SDP Hardware compute platform product sub-tree
Figure 2: SDP Software compute platform product sub-tree [AD10]
This document consists of two major components. In the first few chapters we specify the top-level compute platform architecture; in the second half of the document we verify the validity of this architecture by describing a baseline implementation using current-day hardware.
Scope of this document
This document implements the high-level architectural design presented in [AD01]. As a basis
for a detailed design, it uses the performance models and their derived design equations that
are elaborated in [AD02]. Several supporting documents accompany this document, describing
in more detail software-defined networks and the application of these in a radio telescope
[RD06], alternative hardware implementations and expected future developments [RD04],
software considerations and developments [RD05] as well as standardisation [RD07].
Assumptions made in this document
This document follows many of the assumptions made in [AD08], the most important of which
are:
● No drastic measures are taken to ensure availability. The assumption is that the
availability budget is unrealistic and will change.
● For SDP sizing we assume 25% efficiency and continued Moore’s law scaling for
compute, storage, memory, and so on (arguably Kryder’s law, Koomey’s law, etc.).
● We assume that the initial reordering of data for ingest/flagging can largely be done in-
network.
● There are no L1 or L2 requirements for SDP to stay within a given energy, capital, or
operational budget. We assume the guidance given by the SKAO is in fact a
requirement. Note that SDP has deliberately been left out of the SKA1 power budget
[AD09].
This document introduces an architecture that is capable of fulfilling the requirements of the
SKA SDP. At the time of writing, the SDP performance models indicate that the current baseline
design exceeds the allocated power and capital budgets. It is very likely that instead the SDP
will have to be built to a capital or energy budget. We therefore introduce a scalable and flexible
architecture that can accommodate changes in budget or capability that may be required to
meet such constraints.
Functional decomposition
The vast majority of the compute platform comprises services provided to major functional
components covered by other parts of the system. The only function that the compute platform
provides is platform management, which includes platform state management and deployment
(see the figure below). Platform management contains functions to deploy the large number of
nodes in the SDP and provides platform state information to be included in the observation meta
data.
Figure 3: The Platform Management function is the only function assigned to the compute
platform.
SDP requirements and constraints
Computational requirements dictated by science objectives
While the total computational load is important, the required capacity of the SDP is better expressed as a set of ratios between its components: for each byte of input data, n double-precision floating-point operations are needed, requiring m MB of memory. The scale of the Science Data
Processor is determined by the total computational capacity required, while the ratios mentioned
above impact the design and selection of the components that make up the SDP. The scaling
considerations are discussed in more detail below.
Constraints
Capital constraints
We currently have no capital budget assigned for the three telescope SDPs, but once one is assigned it is important to note that it will be shared between hardware, software (both domain-specific and not), and staff. Although in our current cost model the vast majority of the SDP budget is spent on hardware, this is not a realistic scenario, as explained in [AD08]. Based on best practices, it is expected that less than 50% of this budget will be spent on hardware, with most of the remainder consumed by software development and staff.
Power constraints
The SKAO has proposed an electrical power cap for each of the three telescope SDPs (e-mail communication, 14 Aug 2014), shown in the table below. Two figures are given for each: a likely
power limit and a “best case” power limit. The “likely” power limit is based on an overall power
budget that can be considered realistic given the current state of the design. The “best case”
power limit is the absolute highest power that the SDP will have available if all current unknowns
are replaced by the most favourable assumptions. These power limits are those measured at
the building entrance, i.e. including cooling, losses, auxiliaries, etc. It is important to note that many of the auxiliary components that consume energy, such as cooling, are outside the scope of the Science Data Processor. These are instead the responsibility of the INFRA consortium.
                              SKA1 Mid    SKA1 Low    SKA1 Survey
Likely Power Limit (MW)       2.5         0.75        2
Best Case Power Limit (MW)    5           1.5         4

Table 1: Energy budgets for the three telescope SDPs.
L1 and L2 requirements
The L1 and L2 requirements assigned to the compute platform are listed at the end of this
document. Traceability is shown by referencing the relevant sections of this document or where
applicable, other PDR documents.
SDP architecture
The figure below shows the compute platform context, including both the hardware and
software.
Figure 4: SDP compute platform context diagram.
Architectural design principles
To achieve the scalability required in our system, we adopt a highly modular design approach.
The intention is that the Science Data Processor concept should hold for any scale, within
reasonable limits. Scaling is discussed in more detail below, in a dedicated section.
The primary requirement on the SDP is to perform a specific job: turn correlator products into
science-ready data. While a number of different observation modes have been defined, these
follow broadly the same processing model. Unlike a conventional HPC system, which needs to support a wide range of applications with different requirements, we can adopt a highly workload-optimised system design approach, tailoring the SDP design to our specific application.
This is expected to enable a more efficient use of the system and reduce both capital and
operational costs. This tailored approach is driving the design decisions made in this document.
Since the SDP is defined by its data flow, we adopt this data flow, and in particular the efficient
and affordable way to handle that data flow, as our primary design consideration. To ensure
sufficient parallelism we allow significant reordering of data.
Top-level architecture
The SDP is required to perform a number of distinct processing steps in order to produce high
quality science data from raw CSP data. It performs these steps under the auspices of TM, which interacts with the LMC component to steer the computation and data flow. This document provides a potential realisation of an SDP computational environment based on parametric
modelling which has been used to produce a “Costed Hardware Concept” [RD32]. Where
appropriate we suggest alternatives to the solutions described here which will be analysed
through to CDR.
The main stages of the SDP are illustrated in the figure 5 below. Five distinct processing stages
are identified, namely:
● Ingest
● Buffer
● Pipeline process
● Archive
● Delivery
Figure 5: Top-level overview of the SKA Science Data Processor functions.
The processing to be performed by the SDP is defined by a series of pipelines [AD14]. Each Pipeline consists of multiple Components, and Components may be part of multiple Pipelines. The implementation of a Component may potentially differ depending on the type of Pipeline. Different Pipelines may run in series and/or in parallel, and the output from one Pipeline may serve as input for another (see Figure 6). This allows, for example, Commensal Observations or External Calibration Observations, where Calibration solutions from one Observation are applied to another.
The interface between different components of the data processing pipelines will always go through the data layer and can take two physical forms:
● Non-streaming (using disk buffers), e.g. for the continuum pipeline and the spectral line pipeline.
● Streaming, e.g. for the ingest pipeline and the fast transient pipeline.
The components will be unaware of this difference, because the data layer will abstract it away. During the Data Flow setup stage (using directives from the pipelines), it is determined which form of communication will be used. Thereafter the processing software is agnostic to it.
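As an illustration of this abstraction, the sketch below shows how a component might consume data chunks through a common interface, regardless of whether the data layer is backed by a disk buffer or by a stream from an upstream pipeline. The class and method names are purely illustrative assumptions, not part of the SDP data-layer design.

    from abc import ABC, abstractmethod
    from typing import Iterator

    class DataSource(ABC):
        """Illustrative data-layer interface hiding streaming vs. buffered transport."""
        @abstractmethod
        def chunks(self) -> Iterator[bytes]:
            ...

    class DiskBufferSource(DataSource):
        """Non-streaming form: reads chunks previously written to the buffer."""
        def __init__(self, path: str, chunk_size: int = 1 << 20):
            self.path, self.chunk_size = path, chunk_size
        def chunks(self) -> Iterator[bytes]:
            with open(self.path, "rb") as f:
                while block := f.read(self.chunk_size):
                    yield block

    class StreamSource(DataSource):
        """Streaming form: chunks arrive from an upstream pipeline in real time."""
        def __init__(self, queue):
            self.queue = queue
        def chunks(self) -> Iterator[bytes]:
            while (block := self.queue.get()) is not None:
                yield block

    def pipeline_component(source: DataSource) -> int:
        """A component only sees DataSource; the data-flow setup chooses the implementation."""
        return sum(len(block) for block in source.chunks())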
The Figure below shows the interaction and dependencies between the various pipelines that are currently foreseen. These pipelines are defined in the sections below. In order to keep the diagrams clear, the interaction with Local Monitoring and Control (LMC) is not shown. In principle, however, each pipeline component will have a bi-directional interaction with LMC for Controlling and Monitoring the component. Also, Data Quality components are not shown. In practice Data Quality will be part of every Pipeline, where the measure of Data Quality as defined by the required metrics is fed back into LMC.
Figure 6: Top level Logical Data Flow Diagram for the SDP pipelines. Shown are dependencies
and interactions between the different pipelines.
From Figure 6 we can see that (a simple dependency sketch follows this list):
● The Ingest Pipeline (distinct from ingest per se) takes the uv-data from CSP and metadata from TM and delivers visibility data in various resolutions, depending on the Pipeline that follows up on it. This means that potentially the same data is written multiple times in different resolutions.
● The Spectral Line Pipeline will always run AFTER the Continuum Pipeline, because of the dependency on the Calibration solutions.
● The Slow Transients Pipeline is defined such that it runs AFTER the Real-time Calibration Pipeline, and again after the Ingest Pipeline.
● Science Analysis is not yet further detailed, but consists of Components like Source Finding, RM-Synthesis, Stacking, etc.
● The Ingest Pipeline, the Real-time Calibration Pipeline, and the Fast Imaging (for Slow Transients) Pipeline all have to run in real-time.
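The ordering constraints above can be captured in a simple dependency map. The sketch below is purely illustrative; the pipeline names and the completeness of the dependencies are assumptions read off the bullets and Figure 6, not a definitive SDP data structure.

    # Illustrative only: pipeline ordering constraints from Figure 6, expressed as a
    # mapping from each pipeline to the pipelines whose output it depends on.
    PIPELINE_DEPENDENCIES = {
        "ingest": [],
        "real_time_calibration": ["ingest"],
        "fast_imaging": ["ingest", "real_time_calibration"],
        "continuum": ["ingest"],
        "spectral_line": ["continuum"],          # needs the continuum calibration solutions
        "slow_transients": ["ingest", "real_time_calibration"],
        "science_analysis": ["continuum", "spectral_line"],  # source finding, RM-synthesis, ...
    }

    # Pipelines with a hard real-time requirement.
    REAL_TIME_PIPELINES = {"ingest", "real_time_calibration", "fast_imaging"}

    def runnable(done: set) -> list:
        """Return the pipelines whose dependencies have all completed."""
        return [p for p, deps in PIPELINE_DEPENDENCIES.items()
                if p not in done and all(d in done for d in deps)]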
The properties of the pipelines in terms of their computational needs and data requirements will drive the characteristics of the SDP. In general, we observe that the SDP workload is highly parallel in nature. Experience with precursor and pathfinder instruments has shown that using frequency as the primary parallelisation dimension results in a highly independent, embarrassingly parallel system for the vast majority of applications. This observation forms the basis for our workload-optimised system design, although we do, as mentioned above, indicate where further analysis may lead to alternative solutions or refinements. In particular further analysis of the pipelines will provide information on the trade-offs between memory size and performance, interconnect performance and dimension in terms of ingest-to-buffer vs. compute node-to-compute node together with floating-point performance and power. Such considerations will form part of the prototyping and development plan [AD15] through to CDR.
Data flow model
Moving data costs significant amounts of energy. We therefore design the SDP to minimise the
(inherently large) flow of data. Data flow is directed so that all subsequent processing requires
little or no additional (long-range) communication.
Data from the CSP is handled by the SDP switch stack at ingest. The switch stack will be an
interconnected, but probably not fully non-blocking Ethernet system, and will distribute data to
the SDP Compute Islands (see below). The switch stack increases the SDP’s capital cost but
adds flexibility and resilience, since we can route data around failed nodes, should the need
arise. The switch stack also allows in-network reordering of data, which is an essential
component of our architectural design principles. Finally, the switch infrastructure supports Software-Defined Networking, described in more detail below and in [RD02], which offers considerable flexibility in managing the data flow.
Each SDP Compute Island is a collection of one or more highly interconnected nodes. An island is capable of handling end-to-end processing of a chunk of data, without having to
communicate with neighbours.
Figure 7: The SDP data flow.
Re-distribution of the data streaming from the CSP is the responsibility of the Data Flow
Manager (described in [AD04]). On a high level, Compute Islands can be seen as subscribing to
data flows from CSP correlator entities. Every “entity” produces a number of data streams, each
representing a fixed chunk of uvw-space. Each Compute Island is responsible for a (potentially
different) subset of uvw-space by subscribing to these CSP streams, as directed by the Data
Flow Manager.
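The subscription model can be illustrated with a minimal sketch in which the Data Flow Manager assigns each Compute Island a contiguous block of CSP stream identifiers. The stream count, island count and assignment policy below are hypothetical placeholders, not defined interfaces.

    # Hypothetical illustration of the Data Flow Manager assigning CSP streams to islands.
    # Stream identifiers and chunk granularity are assumptions, not defined interfaces.
    N_ISLANDS = 1024
    N_STREAMS = 65536          # assumed number of CSP output streams, for illustration

    def assign_streams(n_streams: int, n_islands: int) -> dict:
        """Contiguous assignment: each island subscribes to a block of streams so
        that subsequent processing stays local to that island."""
        per_island = -(-n_streams // n_islands)   # ceiling division
        return {island: range(island * per_island,
                              min((island + 1) * per_island, n_streams))
                for island in range(n_islands)}

    subscriptions = assign_streams(N_STREAMS, N_ISLANDS)
    # e.g. Compute Island 0 subscribes to streams 0..63 in this toy example.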
The figure above provides a schematic view of the SDP data flow. While it shows switches at
the ingress and egress points of the Compute Islands, these may not be physically distinct
components. A cost-saving measure may be to have one shared network handling both ingress
and egress. Although the highly unidirectional nature of our data flow does allow this, a shared
infrastructure may cause performance issues. This in turn may cause the data flow from CSP to
drop packets.
Compute Island
A Compute Island (which is an overloaded term in HPC, and may need renaming) is the basic
replicable unit in the SKA SDP. A Compute Island is a self-contained, independent collection of
compute nodes.
Ideally, a Compute Island only processes data that is contained in the island itself. Some
applications, such as multi-frequency synthesis, require a number of gathers to be performed
before end-products can be combined. The initial analysis of the continuum pipeline, for example [AD14], indicates that inter-island traffic will need to be supported. For this purpose the Compute Islands will be interconnected using a high-bandwidth interconnect, orthogonal to the ingest-to-buffer network. The tolerance of this network to over-subscription, and hence the possible reductions in cost and complexity, is currently under review.
As mentioned before, a Compute Island consists of several interconnected compute nodes.
Each Compute Island has associated infrastructure and facilities such as shared file systems,
management network and master node(s). This makes each Compute Island largely
independent of the rest of the system. The size of the SDP can be expressed by the number of
Compute Islands it contains - a parameter that is freely scalable due to the Compute Islands’
independent nature. Most of the infrastructure will be similar between the three SDPs, but it is
conceivable that the size of an island (e.g. the number of compute nodes within an island) or the
compute node design itself differs between SDPs. This could be the case when the desired
compute to I/O ratios differ between the three telescopes.
Within a Compute Island, a fully non-blocking interconnect, with a per node bandwidth far in
excess of the per-node ingest rate, is provided. This is primarily used for reordering data
between processing steps, ideally within a single island. The same interconnect facilitates
communication between islands for inter-island reordering or global processing, but in this case
bandwidth will be much more limited and end-to-end transfers may require several hops. The
total bisectional bandwidth and over-subscription of the global interconnect network may easily
become a cost-driver. Therefore a careful analysis of the requirements and an effort towards
minimising global data transport will continue to be a design priority.
The file system and/or storage model used by the islands is yet to be determined. A small,
island-wide, parallel file system, or single file system node, are among the options. The buffer
storage spread over the island nodes may also be exposed as a single unified file system in
some way yet to be determined.
The figure below shows an overview of the Compute Island concept. Note that although a
Compute Island is represented by a single rack of hardware in this figure, this is only illustrative.
The actual size of the Compute Island may span multiple racks, or be limited to a fraction of a
rack, depending on various parameters discussed in more detail in the section on scaling.
Figure 8: The SDP Compute Island concept, showing the various components in an island.
While operational concerns drive a desire for a high degree of standardisation of Compute
Islands and components, the self-contained nature of the Compute Islands allows for partial
upgrades. A heterogeneous SDP, with Compute Islands of various ages and potentially
specialised configurations, is also possible. However, the efficient utilisation of such a system
may require additional effort on the part of LMC and the scheduler. The optimal combination of
flexibility and standardisation will need to be determined on the road to CDR.
SDP scaling
While the total useful capacity of the Science Data Processor depends on many components,
we identify three defining characteristics that we will use to scale the system:
● Total capacity
● Capacity per Compute Island
● Characteristics per node
The total capacity, i.e. the aggregate peak performance (Rpeak) expressed in PFLOPS, is defined by the number of Compute Islands that make up the Science Data Processor and the capacity per
Compute Island. While this number is a useful way to express the size of the system, its
usefulness is limited since it does not take computational efficiency into account. Ideally, the
total capacity of the system would be defined by the science or system requirements, but
considering the constraints discussed above, it is more likely that total capacity will be defined
by the available budgets (energy, capital or operational).
Figure 9: SDP scaling. Each telescope SDP consists of a number of Compute Islands that are built up from a number of Compute Nodes.
Capacity per Compute Island is defined by the number of nodes per island and the
characteristics of these nodes. This capacity is expressed in terms of peak computational
capacity, i.e. TFLOPS, but it is likely that computational capacity will not drive the sizing of the
Compute Islands. Island capacity is instead defined by the most demanding application that requires the high-capacity island interconnect, in terms of required memory, network bandwidth, or compute capacity. Our current analysis of the ingest pipeline [AD02] shows that a "frequency group" of between 64 and 256 frequency channels is needed for RFI flagging, giving an upper bound on the total number of Compute Islands per telescope (1000 to 4000, depending on the eventual number of frequency channels in a frequency group).
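The arithmetic behind this bound can be sketched as follows. The total channel count used here (about 2^18 = 262,144 channels) is an illustrative assumption chosen to reproduce the 1000-4000 range quoted above, not a figure from the baseline design.

    # Illustrative sketch: upper bound on the number of Compute Islands set by the
    # RFI-flagging "frequency group" size. The total channel count is an assumption.
    TOTAL_CHANNELS = 2 ** 18          # ~262k channels, illustrative value only

    for group_size in (64, 128, 256):
        max_islands = TOTAL_CHANNELS // group_size
        print(f"frequency group of {group_size:3d} channels -> "
              f"at most ~{max_islands} Compute Islands")

    # A group of 256 channels gives ~1024 islands; a group of 64 channels gives ~4096,
    # consistent with the 1000-4000 range quoted above.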
The basic building block of a Compute Island is the compute node. The characteristics of these
nodes are defined by the design equations in [AD02] but within these bounds a vast number of
valid node designs can be identified. Considering the timeframe of the SDP roll-out, which
extends well beyond the available industry roadmaps, the node definition is perhaps the least
well understood component of the SDP design. The SDP parametric model defines a number of
ratio rules that describe suitable node designs. Within the bounds of these rules, cost,
energy efficiency and maintainability are considerations that may be used to select optimal node
implementations.
There is one key requirement that a compute node needs to satisfy: if used to ingest data, only
a very small percentage of that data may be lost. In other words, these nodes need to be scaled
such that they comfortably satisfy the ingest real-time requirements, and a sufficient number of
these nodes need to be available to receive all CSP data.
One interesting consideration is whether or not all three SDPs will be standardised on a single
node design. Answering this question requires an interesting trade-off between the
standardisation of components on the one hand, and workload optimisation of those same
components on the other hand. Operational costs, in particular energy versus deployment and
maintenance cost, will also play a key role in this decision. It is clear that this decision cannot be
made until more information is available on the likely technology options available for nodes.
Science Archive
The processed data products are forwarded from the Compute Islands to the local Science
Archive. The Science Archive is part of the SDP and is the end point for SDP data-products.
The primary design goals for the Science Archive are:
● Provide secure storage for data products for the telescope life time
● Facilitate distribution of science data products to Regional Centres
The Science Archive acts as an interface to the wider science community by distributing the
science data to potential Regional Science Centres and by providing access to the data via a
number of interfaces and APIs.
Considering the long design lifetime of the SKA instrument, careful analysis of the total cost of ownership of the various archive technologies is critical. Work on this is ongoing. Possible
architectures span the entire range from a carefully balanced tiered mix of fast to slow storage
media (solid state, spinning disk and tape), to a disk-only solution. It is important to note that
there is no stringent requirement on the security or safety of the data, although the lifetime
requirement on the archive is 50 years [SKA1-SYS_REQ-2363] and the SDP is required to
maintain a mirror [SKA1-SYS_REQ-2350]. For the purposes of this document we assume that
the combination of the Science Archive and Regional Science Centres fulfils the role of the
Science Archive mirror.
At this stage we do not consider the Science Archive a high-risk item. Industry is heavily
focussed on Big Data, and archive sites of the sizes required for SKA1 are already feasible.
Later on in this document we will discuss various implementation options for the Science
Archive in more detail.
Computational efficiency
The required aggregate capacity of the Science Data Processor depends on three factors:
1. Input bandwidth (bytes/s)
2. Total computational intensity of the pipelines (FLOP/byte)
3. Computational efficiency (% of Rpeak)
The input bandwidth is given in the baseline design. Computational intensity can be estimated
using the performance models in [AD02]. Other considerations, like mixed precision
computations, in which double precision and single precision operations are intermingled, also
have an impact on computational intensity. Of these factors, computational efficiency is
arguably the most difficult to estimate.
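These three factors combine into a simple sizing relation, sketched below. The bandwidth, intensity and efficiency values are placeholders for illustration only, not results from [AD02].

    # Sketch of the sizing relation: required peak capacity (Rpeak) follows from input
    # bandwidth, computational intensity and achieved efficiency. Numbers are placeholders.
    input_bandwidth = 5e12        # bytes/s, illustrative
    intensity       = 1e4         # FLOP per input byte, illustrative
    efficiency      = 0.25        # fraction of Rpeak actually achieved (assumed below)

    required_sustained = input_bandwidth * intensity         # FLOP/s that must be delivered
    required_rpeak     = required_sustained / efficiency      # peak FLOP/s to be procured

    print(f"required sustained capacity: {required_sustained / 1e15:.1f} PFLOPS")
    print(f"required peak capacity     : {required_rpeak / 1e15:.1f} PFLOPS")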
Computational efficiency depends on many factors, such as:
● choice of algorithm
● target platform
● implementation
● data access patterns
● programmer talent
Many of these are platform- (i.e. hardware-) dependent, which makes it difficult, if not
impossible, to model or estimate computational efficiency for future hardware generations.
It is, however, possible to estimate the required number of floating point operations for the
various SDP tasks. Several modelling efforts, culminating in [AD02] and [RD11], have resulted
in a good overview of the computational requirements for the SDP, in terms of sustained
performance. In order to translate these into a hardware architecture, we estimate the
computational efficiency of the most intensive components, for the hardware we can expect to
procure based on current day numbers.
It is important to note that the discussion below is highly speculative. The expected procurement
period is well beyond the timeframe of industry roadmaps, and we can only speculate on the
available hardware. Instead we concentrate on currently available hardware and the bottlenecks
we can identify in these. Based on this, and the expected developments in terms of compute
characteristics, we estimate the computational efficiency of current-day hardware and
extrapolate it to the SKA1 roll-out timeframe.
Figure 10: SDP compute distribution for the three SKA telescopes.
Figure 10, taken from the parametric model as implemented in iPython [RD11], shows that the
vast majority of the SDP compute requirement is taken up by gridding and FFTs. The most
power-efficient way to compute either of these today is on accelerator hardware, so we will
concentrate our analysis on these.
Fast Fourier transforms on Nvidia Tesla GPUs will most likely use cuFFT. Performance numbers for this library are publicly available and are summarised in Figure 11. The Tesla K40c shown in this graph has a peak performance of 4290 GFLOPS single precision and 1430 GFLOPS double precision. This shows that the maximum achieved computational efficiency, as a percentage of peak performance, is currently 16.3% for single precision floating point numbers and a slightly higher 19% for double precision. However, this applies to small transform sizes. The SDP will most likely use transform sizes of the order of 2^13 to 2^16, reducing the efficiency to 9% and 12%, respectively.
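The efficiency figures above follow directly from the published cuFFT throughput and the card's peak rating. In the sketch below, the achieved-GFLOPS values are approximate numbers consistent with Figure 11 and the percentages quoted above, and should be treated as such.

    # Approximate efficiency of cuFFT on a Tesla K40c: achieved GFLOPS (approximate,
    # read off Figure 11) divided by the card's peak GFLOPS.
    PEAK_SP, PEAK_DP = 4290.0, 1430.0    # GFLOPS, Tesla K40c

    def efficiency(achieved_gflops: float, peak_gflops: float) -> float:
        return achieved_gflops / peak_gflops

    # Small transforms (best case) vs. SDP-sized transforms (~2^13 to 2^16 points).
    print(f"SP, small transforms: {efficiency(700, PEAK_SP):.1%}")   # ~16%
    print(f"DP, small transforms: {efficiency(270, PEAK_DP):.1%}")   # ~19%
    print(f"SP, SDP-sized       : {efficiency(390, PEAK_SP):.1%}")   # ~9%
    print(f"DP, SDP-sized       : {efficiency(170, PEAK_DP):.1%}")   # ~12%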
The distinct dip in performance for large transforms is due to the data no longer fitting in the fast
local memory. There is a continuing trend in hardware development to increase the amount of
fast local memory (caches) to bridge the widening memory bandwidth gap. This means that
ever larger transforms will fit in fast local memory, widening the distinct peak in efficiency shown
today and potentially improving efficiency for the transform sizes required by the SDP.
Figure 11: cuFFT performance on Nvidia Tesla K40c [RD12].
Furthermore, it is expected that 3D stacking of memory will dramatically increase memory
bandwidth in the next two generations of hardware, resulting in a corresponding increase in
computational efficiency. This is, of course, speculation and developments need to be closely
tracked to establish the actual computational efficiency.
There are more detailed analysis results available on FFT performance, using both GPU and
CPU hardware [RD09]. The observed computational efficiencies are summarised below. The
results agree with the analysis presented above, although the high end (16-19% of Rpeak) of the
efficiency range for double precision is not achieved in this experimental setup.
                                       Single Precision   Double Precision
CPU efficiency (multithreaded)         8 – 15 %           8 – 15 %
GPU efficiency (data on GPU)           <10 – 15 %         10 – 15 %
GPU efficiency (incl. data transfer)   ~1 %               ~1 %

Table 2: Computational FFT efficiency on both CPU and GPU, from [RD09].
The most computationally efficient gridding implementation we know of today was developed by
John Romein [RD10]. [RD10] presents detailed performance measurements, including several
on GPUs. In this paper, achieved performance is measured in giga grid-point additions per second; one giga grid-point addition per second corresponds to 8 GigaFLOPS, since each complex multiply-add requires four real multiplications and four real additions.
Figure 12: Performance analysis of John Romein's gridding algorithm, from [RD10].
For relatively small convolution matrix sizes (32x32), the maximum achieved efficiency of the
algorithm is around 23% (Nvidia GTX680). This increases to approximately 25% for larger
convolution matrix sizes on AMD Radeon HD7970. For completeness, the salient hardware
specifications are shown in Table 3.
                   Rpeak (GFLOPS)   Memory bw (GB/s)   Max Power (Watt)
Nvidia GTX680      3090             192                195
AMD HD7970         3789             264                230
2x Intel E5-2680   343              102                260

Table 3: Hardware specifications of the platforms used in the gridding analysis presented in [RD10].
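A short sketch of the conversion from the paper's metric to computational efficiency. The grid-point rates below are approximate values back-derived from the quoted efficiencies and the peak figures in Table 3, not exact numbers from [RD10].

    # Convert giga grid-point additions per second (the metric used in [RD10]) to GFLOPS
    # and efficiency. The achieved rates are approximate, back-derived from the quoted
    # efficiencies and the peak figures in Table 3.
    FLOPS_PER_GRIDPOINT = 8     # complex multiply-add: 4 real mults + 4 real adds

    def gridding_efficiency(ggpa_per_s: float, peak_gflops: float) -> float:
        return ggpa_per_s * FLOPS_PER_GRIDPOINT / peak_gflops

    print(f"GTX680, 32x32 kernels : {gridding_efficiency(89, 3090):.0%}")    # ~23%
    print(f"HD7970, larger kernels: {gridding_efficiency(118, 3789):.0%}")   # ~25%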
Figure 13: The public Nvidia roadmap up to 2016 [RD13].
In March 2014, Nvidia publicly announced its roadmap leading up to 2016, including the "Pascal" architecture shown in Figure 13. The Pascal cards are expected to have 3D-stacked memory, with 1 terabyte/s of bandwidth and the same power consumption per bit transferred as today. The peak FLOPS capability of these cards has not been explicitly announced, but it can be inferred from the graph in Figure 13. This shows that Pascal will achieve about 2.5 times higher FLOPS/W than the Kepler family of GPUs. Since for similar packaging the total power envelope must remain constant, and since peak Kepler performance is around 1.3 TFLOPS (e.g. the Kepler K20X), this implies that the peak performance of Pascal will be around 3.3 TFLOPS.
This estimate gives a memory bandwidth to peak FLOP ratio for Pascal in 2016 of 0.30 bytes /
FLOP in contrast to about 0.19 bytes / FLOP in the current generation K20x. Our analysis
indicates that both of the algorithms mentioned above (FFT and gridding) are mainly memory-
bandwidth bound. The relative increase in memory bandwidth per peak FLOP indicates a
modest corresponding expected increase in computational efficiency for that generation.
However, this is a single revolutionary step forward in memory bandwidth per FLOP, without a
further improvement in sight. How these developments continue post-Pascal is unclear.
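The extrapolation above can be reproduced with a few lines of arithmetic. The Kepler figures are the document's own (plus the K20X's roughly 250 GB/s of memory bandwidth); the Pascal values are inferences from the public roadmap rather than announced specifications.

    # Reproduce the Pascal extrapolation: peak from the claimed 2.5x FLOPS/W gain at a
    # constant power envelope, and the resulting bytes-per-FLOP ratio. Values are
    # roadmap-based inferences, not announced specifications.
    kepler_peak_tflops  = 1.3          # e.g. Tesla K20X
    kepler_mem_bw_tbs   = 0.25         # ~250 GB/s on the K20X
    flops_per_watt_gain = 2.5          # read off the public roadmap (Figure 13)

    pascal_peak_tflops = kepler_peak_tflops * flops_per_watt_gain    # ~3.3 TFLOPS
    pascal_mem_bw_tbs  = 1.0                                         # announced 1 TB/s

    print(f"Kepler: {kepler_mem_bw_tbs / kepler_peak_tflops:.2f} bytes/FLOP")   # ~0.19
    print(f"Pascal: {pascal_mem_bw_tbs / pascal_peak_tflops:.2f} bytes/FLOP")   # ~0.30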
For costing and sizing we currently assume an overall computational efficiency of 25% of peak
performance, which based on the analysis above is optimistic. As mentioned above, this
number is highly speculative and may change based on further prototyping and research.
Roll-out schedule
The SDP roll-out schedules are determined by three major considerations:
1. Roll-out of compute equipment should be delayed as much as possible, so that we can take advantage of Moore's law and maximise the operational usefulness of the procured equipment.
2. Roll-out should follow the just-in-time principle, such that integration of equipment
coincides with Array Releases (AR) of the various instruments.
3. Compute requirements increase dramatically with increased baseline length, which
means that the full-scale Science Data Processor will not be required until very late in
the roll-out schedule.
The SDP preliminary plan for construction [AD12] describes these considerations in more detail.
Figure 14: Preliminary timeline for SDP construction for the three telescopes [AD12].
Figure 14 shows the SDP roll-out timeline from the SDP preliminary plan for construction. This document ignores the milli- and centi-SDP, as the scale of these is more or less trivial and they are not required to be maximally efficient in terms of FLOPS or energy. Instead, these precursors only support commissioning of elements and early science as (cost-)effectively as possible.
The roll-out of the full-scale SDP will most likely happen in the final stages of the SKA system
build-up, during the first half of 2021. These systems must be fully integrated during the second
half of 2021, ready for the fifth and sixth array releases (AR5 and AR6 in Figure 14) of SKA1-MID and SKA1-LOW and the fourth array release of SKA1-SURVEY.
Based on this roll-out schedule, we can conclude that the SDP will use technology that will only
become available after 2020. This timeframe is well beyond established roadmaps, and puts us
into the era where only technology concepts are available.
While this makes our detailed design highly speculative, we feel this is not a risk at this stage of
the project. The high-level concepts do not change and our high-level architecture is capable of
supporting the expected technological changes.
To demonstrate the feasibility of the design, we introduce a baseline implementation below that
is based on current day technology. Evolutionary scaling to SKA1 timeframes shows one
possible implementation option for SKA1. This evolutionary development is unlikely to occur, but
it gives us a solid basis for costing and shows one valid implementation of our design.
The supporting material discusses various hardware developments that are expected to occur
[RD04]. All of these are expected to be capable of fulfilling the SDP requirements, although
some may be more efficient than others. Selection of the optimal node architecture will have to
wait until more information is available on the possible hardware solutions, but also until more
detailed performance models are available. However, it is important to stress that this does not
impact our high-level design.
In particular, the independence of the island implementation is important, since we are trying to
design an architecture with a potential life-span of fifty years. It is impossible to predict what sort
of computational resources will become available during the lifetime of the instrument; we
therefore have to provide a high-level architecture that is capable of supporting a wide range of
technology options and is ideally agnostic to the eventual detailed implementation.
Data transport
The SDP is a data-throughput machine, and its data flow is intended to drive the design of the architecture.
The data transport system is therefore an extremely important part of the system design. We
separate the data transport system into three different, physically separate, networks, each with
their own requirements:
● Bulk data transport
● Island data transport
● Management, monitoring and control
The bulk data transport network may be two physically separated networks, used to receive
data from the CSP and export data to the Regional Science Centres.
Within each Compute Island, a high-performance, low-latency network is provided to facilitate the reordering of data. We separate the highly predictable, static bulk data network from the more dynamically loaded island interconnect in order to ensure the real-time performance of the SDP ingest.
Data transport bandwidth requirements
The table below shows the input bandwidth requirements for the three receiver types, based on
the updated baseline design. The number of required input ports is estimated using a protocol overhead of 2% [AD06]. Based on operational experience we limit occupancy per port to 90% to
ensure no packets are dropped and the receiving node can achieve real-time performance. The
total number of top-level network ports is at least double this number.
Instrument     Raw data rate (TB/s)   Estimated number of ports (40 GbE)   Estimated number of ports (100 GbE)
SKA1-low       9.1                    ~2030                                ~810
SKA1-mid       4.21                   ~940                                 ~380
SKA1-survey    5.81                   ~1300                                ~520

Table 4: Input data transport bandwidth requirements for the three telescopes.
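The port estimates in Table 4 follow from the raw data rate, the 2% protocol overhead and the 90% occupancy limit. The short sketch below reproduces the calculation; small differences from Table 4 are due to rounding and the exact overhead assumptions used in the original estimate.

    import math

    # Reproduce the Table 4 port estimates from raw rate, protocol overhead and the
    # 90% occupancy limit. Results differ slightly from Table 4 due to rounding.
    OVERHEAD  = 1.02    # 2% protocol overhead [AD06]
    OCCUPANCY = 0.90    # maximum port occupancy

    def ports_needed(raw_rate_tbytes_s: float, link_gbit_s: float) -> int:
        usable_gbit_s = link_gbit_s * OCCUPANCY
        required_gbit_s = raw_rate_tbytes_s * 8 * 1000 * OVERHEAD
        return math.ceil(required_gbit_s / usable_gbit_s)

    for name, rate in (("SKA1-low", 9.1), ("SKA1-mid", 4.21), ("SKA1-survey", 5.81)):
        print(f"{name}: ~{ports_needed(rate, 40)} x 40 GbE or ~{ports_needed(rate, 100)} x 100 GbE")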
Compression of input data is being investigated within the CSP consortium, which may lead to a
reduction of the required number of input ports by as much as 30% [RD03]. This work is still
ongoing, and needs to carefully consider the required compute capacity and development time
against cost savings, both in terms of capital investment and energy consumption in the data
transport.
Top-level network architecture
The top-level SDP network architecture is shown in Figure 15. A three-stage over-subscribed switch stack receives data from the CSP, through the Ingest Layer, and delivers it to the appropriate Compute Island. A second switch stack, which may share hardware with the
input switches, is responsible for connecting the storage components within each compute
island into a single virtual science archive and the distribution of science data to the world
beyond the Science Data Processor.
Figure 15: Top-level SDP network design.
Bulk data transport network design
The bulk data transport network is responsible for three distinct steps:
● Ingress (i.e. receive data from CSP, as per [AD06] )
● Egress (i.e. move science-ready data to Regional Science Centres)
● Science Archive (i.e. interconnect storage components in the Compute Islands)
All of these streams are Ethernet based, although the egress data stream is not formally part of
the project and has no formal ICD.
The ingress data stream from CSP can be described as:
● A continuous data stream
● UDP/IP on IEEE 802.3 Ethernet frames
● Maximum Transmission Unit (MTU) as large as possible while still maintaining
compatibility with COTS networking equipment (jumbo frames)
The bulk data network must be able to:
● receive a high-bandwidth, but fairly static, uni-directional data flow
● forward data to the Compute Islands
● receive this data without losing or dropping packets (latency and out-of-order packets
are acceptable)
In addition, the system is designed around the concept of a software-defined network
infrastructure, so support for this is also essential.
Data from the CSP is expected to arrive via long-haul fibre using Dense Wavelength-division
multiplexing (DWDM), therefore the bulk data transfer network also needs to support large
numbers of optics, conforming to IEEE P802.3bm [AD07].
The high data rates and the continuous nature of the data flow, coupled with the stringent
requirement on data loss, mean that relatively large switch buffers will be required. While this
has been observed in pathfinder instruments, in particular LOFAR, it has not yet been fully
investigated. It is interesting to note that similar effects occur in more conventional applications
as well [RD08].
The egress data stream, which ties the storage components in each Compute Island together
into the virtual Science Archive and transports science-ready data to the Regional Science
Centres, is less well defined. This data stream can be characterised as:
● lower bandwidth
● a fairly static traffic pattern, although less so than ingress
● direct export to the Regional Science Centres
● reliable protocols
● not quite uni-directional, but still highly imbalanced
While in Figure 14 the ingress and egress data networks are drawn as separate entities,
prototyping will have to show if these two highly unidirectional data streams can co-exist in a
single network without loss of performance or data. The unreliable nature of the ingress data
stream makes this not immediately obvious, but significant cost savings may be achieved.
Ingest Processing
The purpose of the ingest pipeline (see Figure 2) is to receive data from the CSP element,
merge it with metadata from TM and to apply conditioning functions prior to integration over time
and frequency before sending it to a number of other pipelines downstream. Currently it is
envisaged that this pipeline forms part of the local SDP function. However, as data movement is
always a high cost, consideration is being given to reducing the overall ingest rate by applying
baseline-dependent averaging [AD16]. The details of the implementation in the SDP are under
review, but, if co-located with the CSP, this technique could favourably impact the size of the
bulk data transport and buffer.
High-performance, low-latency interconnect architecture
Within each Compute Island, a high-performance, low-latency interconnect is available. This
network is used to reorder data between ingest and buffer. This interconnect will be fully
non-blocking, and have a per-node bandwidth that is much higher than the per-node input
bandwidth. This high degree of over-dimensioning is a deliberate design decision to facilitate
extensive reordering.
The Compute Island networks are themselves interconnected as well, albeit with a certain
degree of over subscription to be determined when the detailed design is considered.
Alternatively, islands may be interconnected using a ring or n-dimensional torus structure with
similar results. This interconnection of islands allows for limited global data reordering,
depending on requirements, although it is the intention that this is avoided as much as possible.
The total inter-island bandwidth depends greatly on the necessity and expected characteristics
of global processing and reordering.
Management, monitoring and control network
A dedicated network will be available for management, monitoring and control. We will not
design this network in detail at this stage, considering the modest requirements in terms of
bandwidth and latency. It is considered likely that basic landed-on-motherboard hardware will be
sufficient for this purpose. Similarly, a simple and cheap switch infrastructure is expected to fulfil
the requirements for this role.
This network will be the interface with Local Monitoring and Control, and through LMC to the
SKA Telescope Manager. The external (SDP-TM) network definitions and requirements are still
TBD, but will be described in [AD08]. The following components will be connected through this
network, most likely sharing physical hardware:
● Island management network
● Island Lights-out-manager network
● Network out-of-band control network.
Data reordering
To maximise available data parallelism, we intend to allow significant data reordering at the
SDP ingest. In this section we analyse the possible implications of reordering. The main goal is
to establish if reordering of data, in any dimension, is feasible in-network. If this is not the case,
a fully non-blocking interconnect, covering the entirety of the Science Data Processor, may be
needed to allow reordering in any dimension.
There are three possible reordering grades. They are discussed in the following sections, from
lowest to highest cost options (in terms of capital investment, required resources and energy
consumption).
In-network reordering
At least three possible hardware configurations may support an in-network data reordering at
SDP ingest:
● A single very large fully non-blocking switch per SDP (up to several thousand ports)
○ very expensive
○ will probably have advanced features which we don't require (i.e. this will be a layer 3
router, which is not necessarily what we require)
● Interconnected set of smaller switches, possibly over-subscribed to some degree.
Several topologies are possible:
○ Fat tree structure
○ Dragonfly
○ Torus (n-dimensional)
● Reordering in transit (in-network) between CSP and SDP
○ Using a software-defined network, we may be able to dynamically re-configure an
otherwise static Ethernet network, allowing more flexible and extensive reordering
than is possible with the previous two options.
○ Independent of architecture choice, the switch firmware needs to support some form of
software-defined networking protocol (e.g. OpenFlow).
○ This option is somewhat orthogonal to the previous two, since it is not really a hardware
configuration; it still requires either of the two options mentioned above to run on.
The current ICD with CSP [AD06] states that the data shall be packetised into UDP/IP jumbo
frames of 9000 bytes. Each visibility will be two 32-bit single-precision floating-point numbers.
There will be four cross-polarisations. The full dimensionality of the CSP output is:
N_beam × N_channel × N_vis × N_pol
A fully polarised visibility takes up 256 bits. Up to 280 of these visibilities fit into a jumbo frame.
Everything beyond the UV-plane in a packet is routable in the network (see section 2.1.1.6 of
[AD06]), although we are assuming that routing consecutive correlator dumps to different
destinations is more challenging (though not impossible, and possibly useful for round-robin
scheduling).
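The packet capacity quoted above can be verified with a short calculation; this is a sketch based only on the figures in the text and does not model the exact CSP packet header layout.

```python
# How many fully polarised visibilities fit into a 9000-byte jumbo frame?
BITS_PER_COMPONENT = 32      # single-precision floating point number
COMPONENTS_PER_VIS = 2       # complex visibility: real and imaginary parts
POLARISATIONS = 4            # four cross-polarisations

bits_per_full_vis = BITS_PER_COMPONENT * COMPONENTS_PER_VIS * POLARISATIONS  # 256 bits
bytes_per_full_vis = bits_per_full_vis // 8                                  # 32 bytes

MTU_BYTES = 9000             # jumbo frame size from the ICD
print(MTU_BYTES // bytes_per_full_vis)  # 281, leaving room for headers -> "up to 280"
```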
Intra-island reordering
Not all data can be reordered in-network. For this purpose, a low-latency, high-bandwidth
interconnect is provided in each Compute Island to allow data reordering within the islands. It is
expected that the bisection bandwidth available in this intra-island network will greatly exceed
the per-island input bandwidth from the correlator.
This network is also required for the reordering of data after ingest. Our analysis of the ingest
and subsequent pipelines shows that an intra-island reordering is required between these
components [AD02]. Ingest requires a number of frequency channels (a frequency group) for a
single baseline, while subsequent pipelines require all baselines for a single frequency channel.
While this is a significant task, it can be kept within a Compute Island, provided that:
1. a frequency group is kept within a single island -- satisfied by the Compute Island scaling
2. subsets of the visibility hierarchy can be routed to individual nodes to maintain
parallelism -- satisfied by [AD06].
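The reordering described above is essentially a transpose from a per-baseline frequency-group layout at ingest to a per-channel all-baseline layout for the downstream pipelines. The sketch below illustrates this; the group sizes and identifiers are toy values, not ICD quantities.

```python
# Illustrative intra-island "corner turn": ingest holds a frequency group per
# baseline; downstream pipelines need all baselines for a single channel.
# Group sizes and identifiers below are placeholders, not ICD values.
from collections import defaultdict

N_BASELINES = 4          # toy numbers, for illustration only
CHANNELS_IN_GROUP = 3    # one frequency group handled by this island

# Ingest-side layout: data[baseline] -> list of (channel, visibility) pairs
ingest_layout = {
    b: [(c, f"vis[b={b},c={c}]") for c in range(CHANNELS_IN_GROUP)]
    for b in range(N_BASELINES)
}

# Pipeline-side layout after reordering: data[channel] -> list of (baseline, visibility)
pipeline_layout = defaultdict(list)
for baseline, samples in ingest_layout.items():
    for channel, vis in samples:
        pipeline_layout[channel].append((baseline, vis))

for channel, items in sorted(pipeline_layout.items()):
    print(channel, items)
```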
Inter-island reordering
The low-latency, high-bandwidth intra-island interconnects are themselves interconnected into a
single inter-island interconnect. This network is available for a final stage reordering of data that
cannot be accommodated by the previous two stages. Point-to-point communication may
require several hops (up to nine in a Fat-Tree) to reach the destination.
At present, we have only costed Fat-Tree topologies with varying degrees of pollarding (over-
subscription). The level of over-subscription is a design decision to be considered later. We
currently assume that most data reordering can be achieved with a combination of in-transit and
intra-island shuffling, and that consequently the over-subscription rate in the current architecture
is very high. However, as mentioned above, further analysis is required to determine the
appropriate level of over-subscription that can be tolerated.
Software-defined networking
Experience with Ethernet-based precursor instruments, such as LOFAR, has shown that such
infrastructures are static and fairly difficult to maintain. The classic split between network and
compute systems, in design, procurement, and maintenance, does not fit well in our data-flow
driven design philosophy. Since the data flow is the defining characteristic of the SKA Science
Data Processor, network and compute systems must both be considered integral parts of one
and the same system.
In addition to this, a classic Ethernet-based network imposes a very strong coupling between
sending and receiving peers, in this case the CSP-based correlator, and the SDP ingest. Any
change in the data flow needs to be carefully negotiated between sender and receiver, which
may be hundreds of kilometres apart.
We propose to build a software-defined network infrastructure, which will become an integral
part of the SDP workflow, and will fall under the direct control of the Data Flow Manager. This
means that the network is no longer a static piece of infrastructure, but may dynamically change
configurations to suit the work-flow requirements. Such a software-defined network also allows
an effective decoupling of sending and receiving nodes. In this model, the sending peers
effectively send to a virtual receiving node, which may or may not physically exist. Receiving
nodes subscribe to data flows from the CSP, as directed by the data flow manager. A network
controller handles the physical data flow by modifying Ethernet headers in transit to match
receiving peers: a classic publish-subscribe model, implemented in a network.
This is a novel approach to building a sensor network that needs to be prototyped. A more in-
depth discussion on the relative merits is given in [RD02].
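To make the publish-subscribe idea concrete, the sketch below models a controller that installs header-rewrite rules when a node subscribes to a CSP flow. This is a conceptual illustration only; the rule format, match fields and addresses are placeholder assumptions and do not represent a specific SDN controller implementation.

```python
# Conceptual sketch of the proposed publish-subscribe data flow model.
# The CSP sends to a fixed virtual destination; the network controller
# rewrites destination headers in transit to reach the subscribed node.
# All names and structures here are illustrative placeholders.

class NetworkController:
    def __init__(self):
        self.subscriptions = {}   # flow id -> (MAC, IP) of the receiving node
        self.flow_table = []      # installed header-rewrite rules

    def subscribe(self, flow_id, node_mac, node_ip):
        """A receiving node, as directed by the Data Flow Manager, subscribes to a CSP flow."""
        self.subscriptions[flow_id] = (node_mac, node_ip)
        self._install_rewrite_rule(flow_id, node_mac, node_ip)

    def _install_rewrite_rule(self, flow_id, node_mac, node_ip):
        # In a real deployment this would be a flow rule pushed to the switches
        # (e.g. via OpenFlow); here it is simply recorded as a match/action pair.
        rule = {
            "match": {"udp_dst_port": 9000 + flow_id},   # placeholder match field
            "actions": {"set_eth_dst": node_mac, "set_ip_dst": node_ip, "output": "island_port"},
        }
        self.flow_table.append(rule)

controller = NetworkController()
controller.subscribe(flow_id=3, node_mac="02:00:00:00:00:03", node_ip="10.0.0.3")
print(controller.flow_table)
```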
Combining Bulk Data Network with Low-latency Network
Currently the SDP relies on two distinct, orthogonal networks, supporting on the one hand
uni-directional Ethernet traffic from ingest, and on the other a bi-directional low-latency network
for extensive re-ordering or gathering of data within specific pipelines. The combination of these
network functions in a single unified network has not, up to now, been considered, given the
lack of QoS capabilities for the bulk data network and its real-time requirement. Mixing traffic
patterns may well lead to significant unpredictability of the SDP as a whole, as well as affecting
the overall availability of the system and its resilience. In view of the discussion above on ingest
processing, in which data rates will be reduced and, depending on implementation, the protocol
from ingest could be changed, an analysis should be performed to understand whether a
certain degree of network sharing can be tolerated to further reduce cost.
Compute node model
To provide a solid basis for costing, we define a detailed potential compute node design, based
on current-day technology extrapolated to the SKA1 timeframe. We emphasise that this does
not describe the final Compute Island implementation. As mentioned before, the SDP
architecture should be mostly implementation agnostic. To validate this claim, we will show a
baseline compute node model in some detail, based on today’s technology extrapolated to 2017
and beyond. This model is used for costing, since it is the only solution we have accurate data
for.
In addition, we show that several other technologies, from low-power alternatives leveraging the
mobile and internet of things revolution, to reconfigurable systems with workload-optimised
accelerators, can also be used to implement Compute Islands. Since many of these
technologies are only available in concept form, these are described in less detail, and no
costing is done.
From performance model to design characteristics
Based on the performance models described in [AD02], we slightly rewrite the design equations
to give a ratio of various key components per achieved unit of double precision compute
capacity.
                        SKA1_low     SKA1_mid     SKA1_survey
Compute requirement     25 PFLOPS    52 PFLOPS    72 PFLOPS
Buffer                  240 PB       30 PB        90 PB
Input bandwidth         9.1 TB/s     4.21 TB/s    5.81 TB/s
Table 5: Performance requirements for the SKA1 baseline, including baseline-dependent
averaging.
                            SKA1_low     SKA1_mid     SKA1_survey
Buffer / TFLOPS             9.6 TB       0.58 TB      1.25 TB
Input bandwidth / TFLOPS    2.91 Gb/s    0.65 Gb/s    0.65 Gb/s
Table 6: Performance requirements per achieved TFLOPS.
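The per-TFLOPS figures in Table 6 follow directly from Table 5; the short check below reproduces them (up to rounding).

```python
# Derive Table 6 (requirements per achieved TFLOPS) from Table 5.
requirements = {
    #              compute (PFLOPS), buffer (PB), input bandwidth (TB/s)
    "SKA1_low":    (25, 240, 9.1),
    "SKA1_mid":    (52, 30, 4.21),
    "SKA1_survey": (72, 90, 5.81),
}

for telescope, (pflops, buffer_pb, input_tb_s) in requirements.items():
    tflops = pflops * 1000                                 # achieved TFLOPS
    buffer_tb_per_tflops = buffer_pb * 1000 / tflops       # PB -> TB
    input_gbps_per_tflops = input_tb_s * 8000 / tflops     # TB/s -> Gb/s
    print(f"{telescope}: {buffer_tb_per_tflops:.2f} TB/TFLOPS, "
          f"{input_gbps_per_tflops:.2f} Gb/s/TFLOPS")
```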
Baseline model - current-day technology
The processing model within a Compute Island is similar to the one currently employed in the
new GPU-based LOFAR correlator, with the addition of a large amount of buffer storage to
facilitate iterative calibration and imaging. This system design is also used in the Wilkes cluster
at the University of Cambridge.
Taking a modernised and extended LOFAR correlator system with two K40 GPUs as a basis,
and assuming that the Nvidia K40 GPU achieves an Rmax of 350 GFLOPS double precision
(25% of 1.43 TFLOPS Rpeak), we require per node:
                          SKA1_low     SKA1_mid     SKA1_survey
DRAM ([AD02])             370 GB       1 TB         500 GB
Working memory ([AD02])   8 GB         8 GB         1.2 GB
Buffer                    13.7 TB      0.82 TB      1.79 TB
Input bandwidth           4.16 Gb/s    0.93 Gb/s    0.92 Gb/s
Table 7: SKA1 node characteristics, assuming 700 GFLOPS achieved computational capacity.
A possible compute node design, based on current technologies, would be (see Figure 16):
● Dual Intel Xeon E5-2660v2 CPU (10 cores @2.2GHz each) [RD24]
● 1024 GB DDR3 main memory @1866MHz
● 2x Nvidia Tesla K40 accelerator [RD25]
○ PCIe v3 x16; 12 GB GDDR5; 4.29 TFLOPS peak SP; 1.43 TFLOPS peak DP
● Intel X520 10 GbE Ethernet NIC (PCIe v2 x8) [RD27]
● Mellanox ConnectX-3 FDR Infiniband HCA (PCIe v3 x8) [RD28]
● HGST Ultrastar SSD1600MR 1.6TB Enterprise MLC SSD [RD26]
● SKA1_low only: 4-6x 3 TB Western Digital RED WD30EFRX [RD31]
Most of the chosen components have alternative options, and the list above should not be
seen as anything more than an illustration that suitable SKA1 SDP nodes can be built using
components available today.
Note that the SSD chosen is rated for two Drive Writes per Day (DWPD) for five years, which is
just within the expected usage for our buffer (double-buffered, six-hour observations). A more
detailed analysis of both the endurance of SSDs and our expected usage is essential.
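A rough endurance check under these assumptions (double-buffered, six-hour observations, each observation writing one buffer region once) is sketched below; real write patterns will differ, which is why the more detailed analysis remains essential.

```python
# Rough SSD endurance check for the double-buffered observation scheme.
# Assumption: each six-hour observation fills one of the two buffer regions once.
HOURS_PER_DAY = 24
OBSERVATION_HOURS = 6
observations_per_day = HOURS_PER_DAY / OBSERVATION_HOURS       # 4 observations per day

# Each observation writes half the device (one region of a double buffer).
device_fraction_written_per_observation = 0.5
drive_writes_per_day = observations_per_day * device_fraction_written_per_observation

print(drive_writes_per_day)   # 2.0 -> exactly at the 2 DWPD rating, i.e. "just within"
```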
The amounts of both main memory and buffer are subject to further analysis, as is required
memory bandwidth (although limited memory bandwidth is implicitly included in the efficiency
percentage), but the system configuration above is readily available. Further storage, solid state
or spinning disk, may be added, since devices with higher capacity are readily available in the
market and bandwidth requirements are relatively modest. Both the Ethernet and Infiniband
networks are heavily over-dimensioned in this design, which is an intentional decision to
facilitate extensive data reordering.
Figure 16: A potential SDP compute node model implementation using current-day hardware.
Depending on the timescales involved, we can assume that the SKA1 SDP will use technology
available on the market in 2020 or later. When extrapolating to these timeframes, we are of
course limited to information made available by industry. The information mentioned below is all
publicly available, but only extends to about 2016. Likewise, Figure 13 presents schematics of
the public Nvidia roadmap, which shows that they are at least confident of increased
performance per Watt up until 2016, although it is interesting to note that previous versions of
the roadmap showed achieved double precision GFLOPS/Watt, not normalised SGEMM/Watt
(SGEMM is a BLAS-provided matrix multiply-add operation). If we take a look at the expected
future development of the components mentioned above up until ~2016, we find the following:
● Intel Skylake or Cannonlake based CPU [RD14][RD15] (note that these are not formally
announced):
○ Mostly evolutionary development, with at least one, perhaps two changes in
micro-architecture and production process.
● DDR4 main memory [RD16]:
○ Currently available and just entering mass market, expected to increase in clock
rate.
● Nvidia Pascal or Volta based accelerators [RD13]:
○ Although mostly evolutionary developments, we do expect the introduction of
NVlink to give a significant boost to the device-host bandwidth.
○ In addition, the introduction of 3D-stacked memory on these devices will
dramatically increase memory bandwidth per FLOP.
● Next generation Intel Xeon Phi (Knights Landing, Knights Hill) [RD17] [RD18]:
○ Based on new micro-architecture, using modified Atom cores, expected in 2015
○ Evolutionary developments after this
● Interconnect based on HDR or NDR Infiniband or similar [RD19]:
○ Evolutionary development of currently available technology
○ Many alternative technologies currently under development
● 100 GbE or 40 GbE NIC, depending on market availability and cost per port [RD20]:
○ Mostly available, expected to significantly reduce in cost per port
○ This much bandwidth may not be necessary; 10 GbE or 25 GbE could be used instead
○ Consider using whatever NIC industry lands on the motherboard
● Solid state storage based on phase change memory [RD21] or memristor [RD22]
technology or similar connected through PCIe or a specialised memory bus:
○ Both memristor and PCM are in prototype stage (although memristor production
has apparently started [RD23], no products have appeared)
○ Spinning disk may be cheaper per PB, but need to consider operational costs
(energy, replacing broken disks, continuous rebuilding)
○ Flash-based storage is a viable alternative, should new solid state storage
technologies not be available or remain too expensive
○ The endurance of NAND flash is an issue; an analysis of endurance
requirements compared to the expected buffer usage will need to be carried out.
Technology development post-2017 becomes much more uncertain. For the purposes of our
initial costing, we assume that Moore’s law will continue to hold for the foreseeable future.
Indications from industry seem to show that the number of transistors per unit of die area will
indeed continue to rise at least until ~2020. There is a risk that this increase will not translate
into easily achieved additional performance. For this reason, a very high contingency has been
added to the hardware costing model. It is important to note that this risk is well understood by
industry.
This extrapolation does assume that the high-level structure of a node, in particular the device-
host model, does not change. In other words, we still have a host processor, supporting a highly
specialised accelerator. This is by no means certain. Indeed, the recent release of hybrid
CPU/GPU packages with unified memory, such as AMD's Kaveri-based APUs and
Nvidia's Tegra K1, seem to indicate that hybrid Systems-on-Chip (SoC) are a definite possibility.
Likewise, while NVlink offers a PCIe-like programming model, it seems likely that Nvidia is
aiming more for a very high bandwidth mezzanine-like connector, or even a socketable solution.
Whatever the case may be, it seems that the device-host model that we know today will be
replaced by something else before SKA1 becomes operational.
For our system design, this is most likely a positive development. The current device-host model
significantly limits the bandwidth to the accelerator. In addition, the explicit communication of
data to the accelerator is tedious, and the limited space on the add-in boards limits the amount
of memory available to the accelerator.
Storage model
Conceptually there are three storage systems in the Science Data Processor: the high-
performance intermediate buffer, the Science Archive and the mirror archive.
Intermediate buffer
To facilitate iterative calibration and imaging algorithms, SDP requires a buffer to store
observations before they can be processed. This buffer conveniently also marks the boundary
between the near real-time and more conventional batch processing. Since an entire completed
observation is required for calibration and imaging, the buffer will conceptually double-buffer the
data: while one observation is running on (and stored into) one buffer region, the previous
observation is being processed using a second buffer region (see Figure 17). It is not expected
that this buffer will store data for extended periods. The buffer will be configured using Compute
Islands and Data Objects to store data and then start the batch processing. The buffer also permits
the straightforward implementation of (re-)processing data from the archive by allowing data to
be moved from the archive to the buffer and the use of the same processing architecture.
Figure 17: Double buffering in an SDP Compute Island.
To facilitate iterative imaging and calibration, each node will require a significant amount of
storage to buffer intermediate data. This buffer is likely to be local to each node, although buffer
capacity may be exposed to other nodes within an island. No technology choice is made, but for
costing we consider three options: spinning disks, solid state (non-volatile) storage, and DRAM.
The high-performance buffer storage may consist of a combination of:
● DRAM
● Solid state storage
● Spinning disk storage
Science Archive
The SDP Science Archive can be characterised by the following features:
● receives science-ready products from the Compute Islands
● interfaces with the Regional Science Centres
● provides API-driven access to the users of the data
● needs to provide data security for an archive lifetime of fifty years
Many technologies are available to provide these functions, ranging from a “sea of disks” to
conventional physically distinct and tiered storage solutions. It is important to note that classic
SAN-based storage solutions are designed for applications with much higher data security
requirements, with associated high costs. Our lack of requirements in this area means we have
some flexibility.
The Big Data revolution, and, perhaps more importantly for our application, the advent of what
Jim Gray termed the fourth paradigm: the era of Data-Intensive Scientific Discovery, has also
given rise to a host of technologies that allow massive data stores to be built cost-effectively.
Where traditional storage technologies often require capital investments into raw media (disks),
Big Data or cloud storage technologies are often much cheaper. This class of storage,
characterised by massive quantities of low-cost and (relatively) low-performance hardware,
derives its performance from software, rather than hardware. This is exemplified by the
approach to data security: where traditional storage relies on parity calculations in dedicated
hardware and N+x redundancy of data, Big Data or cloud storage systems simply duplicate
data, with the number of duplicates depending on the requirement on data security. While this
obviously adds additional required storage capacity, the total cost of ownership of such, much
simpler, systems may be lower.
For our application, this simplicity, coupled with the massively parallel nature of such storage
systems, provides additional advantages.
Moving storage system complexity to software potentially allows highly efficient system designs
to be implemented. We could envision the Science Archive storage hardware integrated in the
Compute Islands. Exporting of science-ready data now stays within an island, significantly
reducing the data transport distance. At the egress point of the SDP, the physically separated
storage pools are unified in software. It is currently unclear whether cluster object store
systems such as Ceph are capable of providing such functionality. This is currently
being investigated (see also http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern for
CERN's experience with Ceph).
By integrating Science Archive hardware into the Compute Island, we also align the
replacement cycles of these hardware components, simplifying operations. Note that the
duplication of data in such systems removes the need to migrate data to new archive systems,
provided the duplicates are stored on nodes with a different replacement date and data is re-
duplicated to new hardware.
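The replacement-cycle constraint on replica placement can be illustrated as follows; the replication factor, node names and generation attributes are placeholder assumptions, not a proposed policy.

```python
# Illustrative replica placement: keep duplicates on nodes belonging to
# different hardware generations (replacement dates), so that a generation can
# be retired and data re-duplicated without losing all copies at once.
# Node names, generations and the replication factor are placeholders.
nodes = [
    {"name": "island03-node12", "generation": 2020},
    {"name": "island07-node04", "generation": 2020},
    {"name": "island15-node09", "generation": 2023},
    {"name": "island21-node01", "generation": 2026},
]

REPLICAS = 2

def place_replicas(nodes, replicas):
    """Pick one node per hardware generation until the replica count is met."""
    chosen, used_generations = [], set()
    for node in sorted(nodes, key=lambda n: n["generation"]):
        if node["generation"] not in used_generations:
            chosen.append(node["name"])
            used_generations.add(node["generation"])
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough distinct hardware generations for the replica count")

print(place_replicas(nodes, REPLICAS))
```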
Mirror science archive
For the physical backup of the science data, which is required by L1 requirement
SKA1-SYS_REQ-2350, we adopt the cloud model. Science data will be duplicated among Science
Centres, which are presumed to be in secure locations off-site from the SDP. Taking this
approach, rather than designing a dedicated mirror facility, means that any mirrored data will
itself be useful, and not just be cold data.
Software stack
Operating system
The operating system forms the basis of the software stack and is the interface with the
hardware compute platform. The operating system needs to support all hardware conceivably
deployed in the Science Data Processor, be extremely scalable and, as experience with
precursor and pathfinder experiments has shown, highly tunable. We also intend to expose
information from hardware performance counters to LMC, so the OS needs to support user
space access for those as well.
Linux is the dominant operating system today, both in high-performance computing and in radio
astronomy, and this matches well with our requirements for the SKA1. Developments in
exascale operating systems will be tracked for suitability, although it should be mentioned that
most of these are based on Linux as well.
Middleware
The SDP middleware contains the software that provides services and APIs to the software
components of the other SDP work packages. In general, this middleware acts as the interface
between the hardware and the rest of the SDP, with the notable exception of LMC that has a
direct link with the Lights-out-manager to allow startup and shutdown from cold or broken state.
It is notable that the middleware layer may end up being extremely thin, if the containerisation
concept is taken to the extreme and the data layer interface is a full-fledged operating system
container image running on a bare-bones compute platform core OS.
Figure 18: The SDP software compute platform middleware product subtree.
Messaging layer
The messaging layer provides communication services to the upper software layers. Several
communication protocols and methods are to be supported, having differing characteristics in
terms of (energy) cost, latency, programming model, reliability and throughput:
● Reliable messaging bus to facilitate communication between components, similar to an
Enterprise Service Bus.
● Bulk data transport services within an island
● Bulk data transport services between islands
● Bulk data transport services into SDP from CSP, UDP/IP over Ethernet
● Bulk data transport services from SDP to Regional Science Centres
A lot of experience has been gained in the pathfinder and precursor instruments with a variety of
messaging systems, ranging from raw Ethernet sockets to ZeroMQ, ICE and various flavours of
MPI.
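As a purely illustrative example of one such messaging building block (not a technology selection), the sketch below shows a ZeroMQ publish-subscribe pair of the kind used in some pathfinder systems; the endpoint and topic are placeholders.

```python
# Minimal ZeroMQ publish-subscribe sketch, illustrating one possible building
# block for the reliable messaging bus. Endpoint and topic are placeholders.
import time
import zmq

ENDPOINT = "tcp://127.0.0.1:5556"   # placeholder address

context = zmq.Context()

publisher = context.socket(zmq.PUB)
publisher.bind(ENDPOINT)

subscriber = context.socket(zmq.SUB)
subscriber.connect(ENDPOINT)
subscriber.setsockopt_string(zmq.SUBSCRIBE, "monitoring")   # subscribe to a topic

# In a real system publisher and subscriber live in different components;
# here a short sleep lets the subscription propagate before sending.
time.sleep(0.5)
publisher.send_multipart([b"monitoring", b"node42: temperature=35C"])
topic, payload = subscriber.recv_multipart()
print(topic, payload)
```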
The containerised interface between COMP and DATA allows DATA the option to provide its
own middleware layer per container, using just the kernel of the hosting platform. Alternatively,
the container can be limited to just the application and associated libraries, using the
middleware layer provided by the platform. Either model will work, but a number of middleware
services mentioned above may be integrated into the DATA containerised application.
Logging system
The logging system will collect, aggregate and analyse logs from all SDP components. This is to
be a hierarchical system, where node logs are aggregated on the island level with a subset of
these, for instance only messages from a particular severity upwards, communicated to a
central log store for analysis and dissemination. Machine learning algorithms may be employed
to model the system and predict failure states before they occur, which can be used by the
scheduler component described later on to estimate the availability of unreliable components.
In addition, the logging system is responsible for collecting and handling any events that occur
in the system.
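A minimal sketch of the island-level severity filtering described above, using Python's standard logging module; the handler targets and the WARNING threshold are placeholder assumptions.

```python
# Island-level log aggregation sketch: everything is kept locally, but only
# records at WARNING or above are forwarded to the central log store.
# Handler targets and the severity threshold are placeholder assumptions.
import logging

island_log = logging.getLogger("island07")
island_log.setLevel(logging.DEBUG)

# Local aggregation: keep all node logs on the island.
local_handler = logging.FileHandler("island07-nodes.log")
local_handler.setLevel(logging.DEBUG)

# Central store: only a subset (WARNING and above) is forwarded.
central_handler = logging.FileHandler("central-store.log")   # stand-in for a remote handler
central_handler.setLevel(logging.WARNING)

island_log.addHandler(local_handler)
island_log.addHandler(central_handler)

island_log.info("node12: buffer write completed")        # stays on the island
island_log.warning("node12: SSD wear at 85% of rating")  # also forwarded centrally
```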
Platform management system
Although the eventual scale of the SDP is to be determined, it is clear that a highly automated
platform management system is essential for its efficient operation. Although the SDP itself is
rather unique in its requirements, it borrows many of its characteristics from HPC and cloud
systems. Since cloud providers routinely operate data centres of the scale we envision for the
SDP, we intend to heavily rely on existing cloud platform management solutions. The highly
modular nature of the SDP design makes this feasible.
Provisioning and deployment of software will be based on containerised images, using for
instance Docker [RD30], a light-weight and powerful open source container virtualisation
technology, simplifying the efforts required to keep software consistent over a large number of
nodes considerably. The relatively small size of (application) containers would allow the entire
container used for processing to be attached to the Science Archive as a piece of meta data.
Whether this is useful is still under consideration, but the detailed state and versioning of the
software stack needs to be exposed to the application and added to the meta data in some way.
These containers are the primary interface with the data layer.
It is interesting to note that high data rates involved in the SDP may require operating system
level optimisations that fall outside the scope of the Linux containers used to deploy our
applications. While this is a challenging issue, it is expected that the optimisations involved will
be system wide and static over all observation modes.
Apart from the provisioning and deployment of hardware and software, the platform
management system also provides system health monitoring information to the LMC. This
covers the range from processor load and memory capacity used, to temperatures and energy
consumed at component level. We intend to leverage the current trend of heavily instrumenting
processors and exposing these tools to the programmer via the kernel [RD29].
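The sketch below illustrates the kind of node-level health sample a platform management agent could forward to LMC. It assumes the third-party psutil package purely for illustration; the actual monitoring agent and metric set are not yet selected.

```python
# Sketch of a node health sample that a platform management agent might
# forward to LMC. Uses the third-party psutil package as an illustration only.
import json
import psutil

def sample_node_health() -> dict:
    """Collect a small set of health metrics for this node."""
    memory = psutil.virtual_memory()
    sample = {
        "cpu_load_percent": psutil.cpu_percent(interval=1.0),
        "memory_used_percent": memory.percent,
        "memory_total_bytes": memory.total,
    }
    # Temperature sensors are only exposed on some platforms (e.g. Linux).
    temperatures = getattr(psutil, "sensors_temperatures", lambda: {})()
    if temperatures:
        sample["temperatures_celsius"] = {
            name: [entry.current for entry in entries]
            for name, entries in temperatures.items()
        }
    return sample

print(json.dumps(sample_node_health(), indent=2))
```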
System optimisation
While not strictly part of the software stack, experience with the precursor and pathfinder
instruments has shown that optimisation of various components of the software stack, and of
the system in general, is essential to achieve optimal performance. This requires both highly
specific tooling to monitor system-wide performance and instrumentation of our specific
application and system, and highly skilled and specialised people with extensive knowledge of
the intricacies of the SDP system. This task will initially explore open-source simulation
and behavioural modelling tools and their suitability for the SDP. These environments
will be refined during operation. There will be a close relation between this work and the
modelling work in the logging system component, although with a different specific goal.
Archive HSM Software
The archive Hierarchical Storage Manager (HSM) software automates the vertical movement of
data across various storage tiers. High-performance storage, with associated high energy
consumption is, in our design, specific to the Compute Island, while the lower performance tiers
are associated with a global namespace, although the hardware may be co-located with the
Compute Islands to minimise data transport distances.
Figure 19: Overview of the SDP Hierarchical Storage Manager.
HSM functionality may either be integrated into the Data layer, or be part of an integrated
storage platform to be selected and evaluated. Whatever the case may be, this part of the
software stack is under very active development in both industry and academia and we expect
to be able to follow rather than set the trend.
We do need to address the non-uniform nature of our storage tiers, where high-performance
storage resources have a distinct locality associated with them in terms of Compute Islands.
While there is no direct requirement to integrate these high-performance devices into the HSM,
this may well be an efficient way to reduce programmer overhead.
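A toy sketch of the vertical data movement an HSM performs is shown below; the tier names, the age threshold and the catalogue structure are placeholder assumptions.

```python
# Toy sketch of HSM-style vertical data movement: datasets not accessed
# recently are demoted from the island-local high-performance tier to the
# global lower-performance tier. Thresholds and tier names are placeholders.
import time

DEMOTION_AGE_SECONDS = 7 * 24 * 3600    # demote after a week without access

catalogue = [
    {"name": "obs-2015-02-01", "tier": "island-ssd", "last_access": time.time() - 10 * 24 * 3600},
    {"name": "obs-2015-02-08", "tier": "island-ssd", "last_access": time.time() - 3600},
]

def demote_stale(catalogue, now=None):
    """Move stale datasets from the island tier to the global archive tier."""
    now = now or time.time()
    for entry in catalogue:
        if entry["tier"] == "island-ssd" and now - entry["last_access"] > DEMOTION_AGE_SECONDS:
            entry["tier"] = "global-archive"
    return catalogue

for entry in demote_stale(catalogue):
    print(entry["name"], "->", entry["tier"])
```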
Application development environment and software development kit
The highly distributed nature of the project, as well as the complexity of the system we are trying
to build, necessitates a strict and formalised software development methodology. A test-driven
approach, based on the agile and scrum principles, has been successfully used in LOFAR, and
a similar approach will have to be used during the SKA1 development period. To support this,
the development environment must provide the necessary tools and hardware, including:
● automatic test and integration toolkits
● issue tracking
● code repository
● representative test and development systems
● support for early roll-out and version tracking
● release, containerisation and packaging of software products.
This also requires a strict software development policy, but this is outside the scope of this
document.
The adoption of containers as the de-facto distribution method should allow for both an easy
and convenient way to distribute a standardised development platform, including base libraries,
and a way to roll out versions of code quickly on a standardised base operating system without
having to worry about library incompatibilities.
Scheduler
The Science Data Processor Scheduler is responsible for the interface between Local
Monitoring and Control and the compute platform. It is responsible for allocating hardware
resources to jobs that need to be carried out on the platform, for working around failed or
unstable hardware and for taking into account external factors to adjust the rate at which the
system can operate, in particular due to thermal and/or energy constraints. In addition, the
scheduler will provide LMC resource requirement estimates upon request, used for coarse-
grained scheduling of observations, based on hardware availability. These estimates may be
based on timed calibration runs of the standard pipelines.
The scheduler design assumes that:
● We can modify an existing open source high-performance computing scheduler,
● The functionality is largely shared between it and the Local Monitoring and Control
component it interfaces with.
While we are confident that existing modular schedulers, like SLURM, will develop sufficiently
for us to be able to modify them successfully, there are some requirements that are unique to
our application, e.g. the ability to estimate required resources and runtime beforehand, based
on a-priori knowledge.
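The a-priori estimation requirement could, for example, be met by scaling timed calibration runs of the standard pipelines. The sketch below assumes a simple linear scaling model with placeholder calibration figures; it is not a proposed estimator.

```python
# Sketch of an a-priori resource estimate based on a timed calibration run.
# Assumes, purely for illustration, that pipeline runtime scales linearly with
# the amount of buffered data; the calibration figures are placeholders.
calibration_runs = {
    # pipeline: (data processed in calibration run [TB], measured runtime [s], nodes used)
    "continuum-imaging": (10.0, 1800.0, 16),
}

def estimate(pipeline: str, data_tb: float, nodes: int) -> float:
    """Estimated runtime in seconds for a given data volume and node count."""
    cal_tb, cal_runtime, cal_nodes = calibration_runs[pipeline]
    node_seconds_per_tb = cal_runtime * cal_nodes / cal_tb
    return node_seconds_per_tb * data_tb / nodes

# Example: a 240 TB observation processed on 64 nodes.
print(f"{estimate('continuum-imaging', 240.0, 64) / 3600:.1f} hours")
```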
SDP infrastructure
The compute platform interfaces with the Local Infrastructure component, which is responsible
for energy provisioning to the hardware components, delivery of a conditioned cooling medium
(either air or fluid), local routing of cabling and rack space. Local infrastructure provides metrics
on consumed power per unit (rack or outlet), temperature and similar quantities to LMC.
SDP infrastructure interfaces with the infrastructure consortium, which provides the bulk energy
delivery, the building, and cooling solutions [AD13].
Data delivery platform hardware
While the data delivery platform hardware is nominally part of the hardware compute platform,
the requirements for this are described in the data delivery sub-element design document
[AD03].
LMC system hardware architecture
Like the data delivery, the local monitoring and control hardware is nominally part of the
compute platform. The LMC sub-element design document [AD04] provides some requirements
on hardware reliability and failover, but in general the hardware design is expected to be
straightforward, and we do not provide any more details here. More detail will be added at CDR.
Suitability and scalability of the architecture
While the concept of the Compute Island is obviously extremely scalable, it has scaling limits
that we have not yet adequately explored. The size of the Compute Island is limited by the
affordability of the fully non-blocking interconnect on one extreme and the storage capacity per
node on the other end of the spectrum. In terms of SDP system scaling, there is an obvious limit
in the bulk data network, since it scales superlinearly with the number of Compute Islands. A
more detailed analysis of the scaling limitations of this concept will be carried out on the road to
CDR.
In addition, the switch infrastructure needs to be carefully analysed for fault-tolerance. There is
a significant cost associated with redundancy in the network, but a single switch failure may
cause a sizable chunk of observational data to be lost. Careful analysis and design may allow
the impact of such a loss to be minimised.
While we currently see no observational modes that do not work with the current SDP compute
platform design, we have only carried out data flow analysis of individual components or
pipelines. On the road to CDR we intend to do a more system-wide analysis of the data flow,
which should adequately prove the suitability of the Compute Island concept for the SKA SDP.
Finally, a careful system-level data flow analysis needs to explore the trade-off between
hardware capital investment in data communication for reordering versus science results. It may
be possible to significantly reduce hardware cost with a limited science impact by using less
than optimal data distributions for some of the pipeline components.
Sub-element risks
● The COMP element design is optimised for imaging within Compute Islands. It may not be
as suitable for:
a. Real-time calibration
b. Multi-scale, multi-frequency synthesis (may require SDP-wide communication)
c. Global solver
d. Any other observations that do not fit within a single island
Mitigation: increase island size or add additional bandwidth between the islands.
● Technology developments difficult to predict.
Mitigation: prototyping of cutting edge hardware with an emphasis on exploring
component characteristics rather than pure performance analysis.
● A late roll-out of the full SDP may impact the software development timeline.
Mitigation: on the one hand the milli- and centi-SDP implementations; on the other, we may
also use general-purpose HPC facilities for scaling experiments. In general, our
embarrassingly parallel applications should ease scaling issues.
Requirement traceability
ID                   Name                                       Trace (section title)
SDP_REQ-301          SDP ingest data rate                       SDP scaling; Bulk data transport network design
SDP_REQ-372          Early science processing capability        Roll-out schedule
SDP_REQ-375          SDP platform management                    Platform management system
SDP_REQ-376          Platform management interface to LMC       Scheduler; Platform management system
SDP_REQ-377          System health monitoring                   Detailed design at CDR
SDP_REQ-378          Deployment system                          Platform management system
SDP_REQ-379          Scheduler                                  Scheduler
SDP_REQ-380          Scheduler Interface                        Scheduler
SDP_REQ-381          Scheduler input                            Scheduler
SDP_REQ-382          Component system consistency               Platform management system
SDP_REQ-597          Component system state information         Platform management system
SKA1-SYS_REQ-2425    SADT to SDP interface                      Bulk data transport network design
SKA1-SYS_REQ-2657    Processing capability                      Roll-out schedule
SKA1-SYS_REQ-2566    Materials list                             SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2567    Hazardous Materials list                   SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2568    Parts list                                 SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2569    Process list                               SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2570    Parts availability                         SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2571    Long lead time items                       SDP_REQ-363 [AD10]
SKA1-SYS_REQ-2572    Material environmental rule compliance     SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2573    Serial number                              SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2574    Drawing numbers                            SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2575    Marking method                             SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2576    Electronically readable or scannable ID    SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2577    Package part number marking                SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2578    Package serial number marking              SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2579    Hazard warning marking                     SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2580    LRU electrostatic warnings                 SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2581    Packaging electrostatic warnings           SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2583    Cable identification                       SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2584    Connector plates                           SDP_REQ-361 [AD10]
SKA1-SYS_REQ-2711    Component obsolescence plan                ILS plan [AD11]
SKA1-SYS_REQ-2716    Telescope availability                     Non-conformant [AD08]; SDP_REQ-195 [AD10]
SKA1-SYS_REQ-2718    Availability budgets                       Non-conformant [AD08]; SDP_REQ-195 [AD10]