ska-sdp.orgska-sdp.org/sites/.../ska...architectureoverview.docx · Web viewLFAA is responsible for...


MCCS ARCHITECTURE OVERVIEW

Document number:          SKA-TEL-LFAA-0600050
Context:                  DRE
Revision:                 02
Author:                   A. Magro, A. DeMarco
Date:                     2019-02-12
Document Classification:  FOR PROJECT USE ONLY
Status:                   Released

               Name                Designation                  Affiliation  Signature
Authored by:   A. Magro            Subject Matter Expert        AADC         Date:
Owned by:      M. Waterson         Domain Specialist            SKAO         Date:
Approved by:   P. Gibbs            Engineering Project Manager  SKAO         Date:
Released by:   J. G. Bij de Vaate  Consortium Lead              AADC         Date:

DOCUMENT HISTORY

Revision  Date Of Issue  Engineering Change Number  Comments
A         2018-06-04     -                          Draft Template version released within consortium
01        2018-10-31                                First Release
02        2019-02-12                                Implemented CDR panel OARs: LFAA Element CDR_OAR_MCCS Architecture Overview, OARs: 2-4, 7

DOCUMENT SOFTWARE

                Package  Version    Filename
Wordprocessor   MsWord   Word 2016  document.docx
Block diagrams
Other

ORGANISATION DETAILS

Name:                Aperture Array Design and Construction Consortium
Registered Address:  ASTRON
                     Oude Hoogeveensedijk 4
                     7991 PD Dwingeloo
                     The Netherlands
                     +31 (0)521 595100
Fax:                 +31 (0)521 595101
Website:             www.skatelescope.org/lfaa/

Copyright

Document owner:      Aperture Array Design and Construction Consortium

This document is written for internal use in the SKA project.


TABLE OF CONTENTS

1 INTRODUCTION
  1.1 Purpose of the document
  1.2 Scope of the document
  1.3 Intended Audience
  1.4 Document Overview
  1.5 Document Tree

2 REFERENCES
  2.1 Applicable documents
  2.2 Reference documents

3 MCCS ARCHITECTURE OVERVIEW
  3.1 Telescope Overview
  3.2 LFAA Overview
  3.3 Role of MCCS in LFAA
  3.4 Main MCCS Responsibilities
  3.5 MCCS Top-Level Static Decomposition Diagram
  3.6 Interfaces
    3.6.1 External Entities
    3.6.2 Level 4 and Level 5 Components
    3.6.3 External Interfaces
    3.6.4 Internal Interfaces

4 OPERATIONAL CONCEPTS
    4.1.1 Operational Environment
      4.1.1.1 Operations
      4.1.1.2 Maintenance
      4.1.1.3 Operator Role
    4.1.2 Support Environment
      4.1.2.1 On-site Maintainer role
      4.1.2.2 Off-site Maintainer role
      4.1.2.3 Remote support
    4.1.3 States and Modes

5 MCCS SOFTWARE OVERVIEW
  5.1 Overview of Software Architecture
  5.2 Software Component List
  5.3 Software-Hardware Mapping
  5.4 Software Life Cycle
    5.4.1 Agile Release Trains
    5.4.2 SAFe Implementation Overview
    5.4.3 Essential SAFe
      5.4.3.1 Software Development Process During Construction Iterations
      5.4.3.2 The Test-First Approach to Construction
  5.5 Commissioning

6 MCCS PHYSICAL OVERVIEW
  6.1 Compute Server
  6.2 Network
  6.3 Rack Assembly

7 SCENARIOS
  7.1 Application of power
  7.2 Transition to Low Power Mode
  7.3 Transition to Off-line
    7.3.1 Controlled shutdown
    7.3.2 Uncontrolled shutdown
  7.4 Set up and Start Observation
  7.5 Calibration
  7.6 Stop Observing
  7.7 MCCS Failures
  7.8 Software Upgrades
    7.8.1 Software upgrades
    7.8.2 BIOS updates
    7.8.3 LRU firmware updates


LIST OF FIGURES

Figure 1-1 SKA1 LFAA Element Documentation Tree
Figure 3-1 SKA1 Telescope Overview
Figure 3-2 SKA1_Low Functional Diagram
Figure 3-3. LFAA overall architecture
Figure 3-4. LFAA observation organization
Figure 3-5. MCCS top-level static decomposition
Figure 3-6. LFAA L3 context diagram
Figure 3-7. MCCS - Field interface
Figure 3-8. MCCS - SPS interface
Figure 4-1 MCCS Sub-Element top-level context diagram showing all external interfaces
Figure 4-2. Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are mandatory for each hardware and software component
Figure 5-1. LFAA overall software architecture overview
Figure 5-2. LFAA observation management overview
Figure 5-3. LFAA local monitoring and control overview
Figure 5-4. TANGO control structure
Figure 5-5. Software module decomposition diagram
Figure 5-6. Mapping between array and software components
Figure 5-7: Essential SAFe configuration
Figure 5-8: Software development process during a construction iteration
Figure 5-9: Test-first development approach
Figure 5-10: Testing during construction iterations
Figure 6-1. Network links between MCCS and external entities
Figure 6-2. MCCS network diagram
Figure 6-3. MCCS rack assembly

LIST OF TABLES

Table 3-1. LFAA numbers
Table 3-2. External interfaces
Table 3-3. L2 interfaces to other LFAA sub-elements
Table 4-1. MCCS states and modes
Table 5-1. Link between hardware components as described in software and the physical components as defined in the PBS
Table 5-2. List of elements in the Architecture System Overview
Table 5-3. Relationships between major elements in the Architecture System Overview
Table 6-1. MCCS compute server configuration


LIST OF ABBREVIATIONS

AADC       Aperture Array Design and Construction Consortium
AAVS       Aperture Array Verification System
ADC        Analog to Digital Converter
AD-n       nth document in the list of Applicable Documents
APIU       Antenna Power Interface Unit
AIV        Assembly Integration and Verification
BIOS       Basic Input/Output System
CDR        Critical Design Review
CI         Configuration Item
CMB        Cabinet Management Board
COTS       Commercial Off The Shelf
CPF        Central Processing Facility
CM         Configuration Manager
CPU        Central Processing Unit
CSP        Central Signal Processing
DAQ        Data Acquisition
DDD        Detailed Design Document
DMS        Document/Data Management System
ECP        Engineering Change Proposal
EMI        Electro Magnetic Interference
FN         Field Node
FoV        Field of View
FPGA       Field Programmable Gate Array
GPU        Graphics Processing Unit
HW         Hardware
ICD        Interface Control Document
INFRAAUS   Infrastructure Australia
ISO        International Organisation for Standardisation
LFAA       Low Frequency Aperture Array
LFAA-DN    Low Frequency Aperture Array – Data Network
LMC        Local Monitoring and Control
FQDN       Fully Qualified Device Name
LNA        Low Noise Amplifier
LRU        Line Replaceable Unit
MCCS       Monitor, Control and Calibration Subsystem
MRO        Murchison Radio-astronomy Observatory
MWA        Murchison Widefield Array
PBS        Product Breakdown Structure
PPS        Pulse Per Second
QA         Quality Assurance
RD-n       nth document in the list of Reference Documents
RAM        Random Access Memory
RMS        Root Mean Square
RF         Radio Frequency
RFI        Radio Frequency Interference
RFoF       Radio Frequency signal over Fibre
RPF        Remote Processing Facility
SAD        Software Architecture Document
SaDT       Signal and Data Transport
SDP        Science Data Processor
SKA        Square Kilometre Array
SKA-LOW    SKA low frequency part of the full telescope
SKAO       SKA Office
S/N        Signal to Noise
SPS        Signal Processing Subsystem
SRMB       Sub-Rack Management Board
SSD        Solid State Drive
SW         Software
TANGO      TAco Next Generation Objects
TCP-IP     Transmission Control Protocol – Internet Protocol
TBC        To Be Confirmed
TBD        To Be Determined
TM         Telescope Management
TPM        Tile Processor Module
UPS        Uninterruptible Power Supply
WBS        Work Breakdown Structure
WP         Work Package


1 Introduction

1.1 Purpose of the document

The purpose of this document is to describe the architecture of the Monitor, Control and Calibration Sub-system (MCCS) for the Low Frequency Aperture Array (LFAA) of SKA Phase 1. It references detailed design documents for the hardware and network setup, as well as a software architecture document describing the software system that will run on the MCCS. Combined, these determine the operational concept, cost, power, equipment space, reliability, availability and maintainability of the MCCS.

This document should be read after the LFAA Architectural Design and Analysis Document [AD3].

1.2 Scope of the document

This document describes how the LFAA MCCS architecture can meet the requirements within the SKA LFAA Signal Processing Requirement Specification. It is meant to provide a high-level overview of the MCCS architecture, including an overview of the software system. The Software Architecture Document (SAD) [RD7] discusses the software architecture in greater detail. Where applicable, the SAD references this document to reduce duplication of content, so that this document acts as an introduction to the SAD.

The level of detail in this document is sufficient to:

1. Define interfaces with other SKA Elements and LFAA Sub-elements.
2. Establish a reasonable baseline design at reasonably low perceived risk.
3. Estimate time, effort and cost to deliver the functionality specified in the LFAA Signal Processing Sub-Element Requirements Specification [AD7].

In other words, the LFAA Sub-Element design is defined in enough detail to reduce the risk of effort, time and cost overruns in the Construction Phase.

The current release (100% version) will support the Critical Design Review for the LFAA Element. The level of detail is enough to have high confidence in the referenced design being compliant and able to be constructed with low risk. This Architecture Design Document (ADD), with references to supporting information and data, will provide a design artefact to support the Construction Phase activities.

1.3 Intended Audience

This document is expected to be used by the LFAA Element Consortium Engineering and Management Team, the SKAO System Engineering Team and the SKAO LFAA Project Manager. It is also expected to be read by the external CDR review panel.


1.4 Document Overview

This document follows a template that was agreed to between the SKAO and the LFAA Consortium. It covers the key contents called out in the LFAA SOW [AD8].

Detailed information is contained in reference documents.

1.5 Document Tree

The overall document tree for the LFAA Element is shown in Figure 1-1. Level 1 (L1) is the SKA System (telescope) level, L2 is the LFAA Element level and L3 is the LFAA sub-element level (where MCCS resides).

[Figure 1-1 diagram: LFAA CIDL Tree, Rev 1.a, June 06, 2018. Documents are arranged by level (L1, L2, L3) and grouped under Planning, Verification, Specifications, Design and Costing.]

Figure 1-1 SKA1 LFAA Element Documentation Tree


2 References

2.1 Applicable documents

The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.

[AD1] SKA-1 System Baseline design, SKA-TEL-SKO-0000002 Issue 01
[AD2] Roll-out Plan for SKA1 Low, SKA-TEL-AIV-4410001 Issue 05
[AD3] LFAA Architectural Design Document, SKA-TEL-LFAA-0200028
[AD4] SKA1 TM to LFAA ICD, 100-000000-028, Issue 02
[AD5] SKA1 LFAA to INFRA AUS ICD, 100-000000-003, Issue 03
[AD6] SKA1 SADT to LFAA ICD, 100-000000-026, Issue 04
[AD7] SKA1 LFAA SPS Sub-Element Requirements Specification, SKA-TEL-LFAA-0400014
[AD8] SKA1 LFAA Element Statement of Work

2.2 Reference documents

The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document shall take precedence.

[RD1] SKA1 Control System Guidelines, 000-000000-010, Issue 01
[RD2] LFAA Internal Interface Control Document SKA-TEL-LFAA-0200030, Issue 01
[RD3] CISPR 22 Information technology equipment - Radio disturbance characteristics - Limits and methods of measurement R2014
[RD4] CISPR 24 Information technology equipment - Immunity characteristics - Limits and methods of measurement 2010
[RD5] CISPR 32 Electromagnetic compatibility of multimedia equipment - Emission requirements 2015
[RD6] CISPR 35 Electromagnetic compatibility of multimedia equipment - Immunity requirements
[RD7] MCCS Software Architecture Document, SKA-TEL-LFAA-0600052
[RD8] SPS Detailed Design Document, SKA-TEL-LFAA-0500035
[RD9] MCCS Detailed Design Document, SKA-TEL-LFAA-0600051
[RD10] MCCS Assembly Verification and Test Plan, SKA-TEL-LFAA-0600053
[RD11] SAFe principles: https://www.scaledagileframework.com/safe-lean-agile-principles/
[RD12] Essential SAFe: https://www.scaledagileframework.com/essential-safe/


3 MCCS Architecture Overview

3.1 Telescope Overview

Figure 3-2 shows the major SKA1 Observatory entities: SKA1-Low in Australia, SKA1-Mid in South Africa and the SKA Global Headquarters in the UK. The thick flow-lines show the unidirectional transport of large amounts of digitised data from the antennas to the Central Processing Facilities (CPF) on the sites, and from the CPFs to the Science Data Processor (SDP) and Archive facilities. The thin blue dash-dot lines show the bidirectional transport of system monitor and control data.

The SKA1-Low telescope array includes 512 stations, each consisting of 256 dual-polarisation log-periodic antennas. The stations are distributed over a distance of 65 km, with the greatest density of stations in the central core. The Central Processing Facility is located on site, and the SDP and archive are located in Perth. Additionally, each station can be divided into a number of smaller sub-stations at reduced bandwidth.
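As a quick sanity check on the scale implied by these numbers, the short sketch below multiplies out the station and antenna counts. The station and antenna figures come from the paragraph above; the variable names, and the assumption of one signal path per polarisation, are illustrative only.

```python
# Figures quoted in the text for the SKA1-Low array.
STATIONS = 512
ANTENNAS_PER_STATION = 256
# Assumption for illustration: a dual-polarisation antenna yields
# one signal path per polarisation.
POLARISATIONS_PER_ANTENNA = 2

total_antennas = STATIONS * ANTENNAS_PER_STATION
total_signal_paths = total_antennas * POLARISATIONS_PER_ANTENNA

print(f"antennas: {total_antennas}")          # antennas: 131072
print(f"signal paths: {total_signal_paths}")  # signal paths: 262144
```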

A more detailed schematic of the SKA1-Low telescope, extracted from the SKA1 System Baseline V3 Description (in preparation), is shown in Figure 3-3. This figure shows the major SKA1-Low signal flow components, as well as the areas of consortia responsibility (red boxes) and the key technologies needed to implement the components. The green dashed line shows the bi-directional flow of monitor, control and operational data, and the orange dot-dashed line shows the distribution of synchronisation and timing signals.

Figure 3-2 SKA1 Telescope Overview

A schematic of the SKA1_Low Telescope, extracted from the Baseline Design [AD1], is shown below including the LFAA Element, product [101-000000].


SKA1-Low operates concurrently in imaging and non-imaging modes, with between 1 and 16+ sub-arrays operating at the same time. Each sub-array is programmable as a separate conceptual telescope in terms of antenna pointing, band selection and the setting of configurable imaging and non-imaging parameters. The only things that are not shared between sub-arrays are observation time, communications links and some processing resources.
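One way to picture a sub-array as a "separate conceptual telescope" is as an independent configuration record. The sketch below is purely illustrative: the class, its field names, the pointing convention and the example values are all hypothetical and are not taken from the MCCS or TM interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SubArrayConfig:
    """Hypothetical per-sub-array settings; all field names are illustrative."""
    subarray_id: int
    pointing_deg: Tuple[float, float]  # assumed (azimuth, elevation) convention
    band_mhz: Tuple[float, float]      # (low, high) edges of the selected band
    imaging: Dict[str, object] = field(default_factory=dict)
    non_imaging: Dict[str, object] = field(default_factory=dict)

    def __post_init__(self) -> None:
        low, high = self.band_mhz
        if not low < high:
            raise ValueError("band_mhz must be (low, high) with low < high")


# Two sub-arrays, each with its own pointing, band and parameters.
# The values below are made up for the example.
sub_a = SubArrayConfig(1, (120.0, 60.0), (50.0, 200.0), imaging={"enabled": True})
sub_b = SubArrayConfig(2, (300.0, 45.0), (100.0, 350.0), non_imaging={"enabled": True})
```

Because each record carries its own parameter dictionaries, changing the settings of one sub-array leaves the other untouched, which mirrors the independence described above.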

[Figure 3-3 diagram: functional blocks include the Low-Frequency Aperture Array Stations (core and outer antenna station arrays), the Central Processing Facility, the Science Data Processing Facility, the Telescope Manager and the Observatory Clock System, connected by data transport links and by synchronisation and timing distribution.]

Figure 3-3 SKA1_Low Functional Diagram

3.2 LFAA Overview

The LFAA is primarily a hardware-centric element, such that hardware configuration, monitoring and control is a central feature and architectural driver. The physical architecture is defined in Figure 3-4 and the system consists of the following major components:

1. Stations, consisting of Field Nodes, Antenna Power Interface Unit(s) and meshes
2. Digital System, consisting of:
   a. Signal Processing Subsystem (SPS)
   b. SPS Network
3. Monitor, Control and Calibration Sub-system (MCCS), including the MCCS network

LFAA is responsible for the reception and digitization of low frequency band (50 MHz to at least 350 MHz) signals transmitted from astronomical objects. The architecture is built around a high-speed switched network which is controlled by MCCS in a centralized and highly configurable system. The SPS provides the infrastructure required to support the signal conditioning, digitization and processing functions of the TPM. It consists of cabinets with internal cooling, power and clock distribution, each receiving a 10 MHz reference and a 1 PPS signal from the synchronization and timing (SAT) system, which are distributed to each TPM. Each cabinet also includes the first-level data switches (i.e. those directly connected to the TPMs), which allow tile beams to be formed by summing the signals from sixteen antennas, followed by station beams formed by summing the tile beams within a single station. Beamforming is performed within TPMs.

Figure 3-4. LFAA overall architecture

TPMs are the primary component responsible for the processing of signals. They are located within the processing facilities (CPF shown in diagram) and will be housed within Signal Processing Sub-system (SPS) cabinets. Each TPM receives the analogue RF-over-fiber optical signals from 16 dual-polarization antennas and converts them back to electrical RF signals. These signals are then filtered to limit the frequency bandwidth, amplified, digitized and channelized into ~1 MHz coarse frequency streams (a total of 512 coarse frequency channels). Calibration coefficients are applied to each frequency channel, whilst beamforming delays are applied to each antenna per beam. The partial beam stream is sent to a digital switch to generate station beams. To generate a station beam, the outputs of 16 TPMs are combined by making use of one of the data switches.
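The two-stage summation described above (antennas into a tile beam, tile beams into a station beam) can be illustrated with a minimal Python sketch for a single coarse channel and a single polarization. This is not the TPM firmware: the function names, the unit calibration gains, the example delays and the phase sign convention are all assumptions made for illustration.

```python
import cmath

ANTENNAS_PER_TILE = 16   # from the text: one TPM serves 16 antennas
TILES_PER_STATION = 16   # from the text: 16 TPMs are combined per station beam


def steering_weight(freq_hz, delay_s):
    """Phase-only weight that removes a delay at one channel frequency.

    Assumed convention: a signal arriving delay_s late carries the phase
    exp(-2j*pi*f*delay), so the weight is its complex conjugate.
    """
    return cmath.exp(2j * cmath.pi * freq_hz * delay_s)


def tile_beam(samples, cal_gains, delays_s, freq_hz):
    """Weighted sum over the antennas of one tile (the partial beam)."""
    return sum(
        s * g * steering_weight(freq_hz, d)
        for s, g, d in zip(samples, cal_gains, delays_s)
    )


def station_beam(tile_beams):
    """Sum of the partial beams from all tiles of one station."""
    return sum(tile_beams)


def demo():
    """Beamform a unit-amplitude plane wave across a whole station."""
    freq_hz = 150e6
    n = ANTENNAS_PER_TILE * TILES_PER_STATION
    # Hypothetical per-antenna delays: 0, 0.1 ns, 0.2 ns, ...
    delays = [i * 1e-10 for i in range(n)]
    # The plane wave as seen by each antenna (one channelised sample each).
    samples = [cmath.exp(-2j * cmath.pi * freq_hz * d) for d in delays]
    gains = [1.0 + 0j] * n  # assumed ideal calibration
    tiles = []
    for k in range(0, n, ANTENNAS_PER_TILE):
        end = k + ANTENNAS_PER_TILE
        tiles.append(tile_beam(samples[k:end], gains[k:end], delays[k:end], freq_hz))
    return station_beam(tiles)
```

With perfectly matched weights the 256 unit-amplitude signals add coherently, so `demo()` returns a value close to 256. Splitting the sum into tiles does not change the result, which is why the partial beams can be combined at the switch level.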

The data network is a standard high-speed (40Gb or 100Gb) network which will transport the various data streams, i.e. control and monitoring information as well as signal data. This network contributes to the provision of connectivity between the TPMs, the MCCS and the Low correlator beam former (CBF-LOW), involving long haul links from the TPMs that are in the Remote Processing Facilities (RPF).

There will be a total of 256 SPS cabinets, each containing four sub-racks, such that each cabinet is responsible for two stations. Each sub-rack contains a Sub-rack Management board which distributes power, 1Gb network, 10 MHz and PPS signals. A cabinet-wide management unit is responsible for distributing these signals to the sub-racks. A single 100 Gb switch connects the TPMs to the LFAA Network. The Sub-rack Management board also acts as a proxy for monitoring and controlling APIUs. MCCS cabinets host at least 16 high-performance servers (and one or two additional servers for redundancy), such that each server is responsible for at most eight stations. MCCS cabinets also host a number of 100Gb switches to connect the SPS racks to the MCCS servers. Table 3-1 provides a summary of the number of components in the LFAA and how they are spread across cabinets (refer to [RD8] for a detailed SPS cabinet design).

Table 3-1. LFAA numbers

Parameter                           Value
Total number of antennas            131072
Total number of stations            512
Antennas per station                256
Antennas per TPM                    16
TPMs per station                    16
Total number of TPMs                8192
Signals per TPM                     32
Frequency channels                  512
Maximum beams per station           8
Total number of SPS cabinets        256
Sub-racks per cabinet               4
TPMs per sub-rack                   8
100Gb switches per MCCS cabinet     2
Total number of MCCS cabinets       4
Servers per cabinet                 17 (+ 0/1)
SDP-LFAA link speed                 100Gb/s
TM-LFAA link speed                  1Gb/s
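The figures in Table 3-1 are mutually dependent, and a short Python sketch can make the arithmetic explicit. The constant and function names are illustrative only. The 16 active servers per cabinet assumes that one of the 17 installed servers is a spare, as stated in Section 3.6.2, and the factor of two signals per antenna assumes one signal per polarization.

```python
# Input values taken from Table 3-1 and the surrounding text.
STATIONS = 512
ANTENNAS_PER_STATION = 256
ANTENNAS_PER_TPM = 16
POLARISATIONS = 2                 # assumed: one signal per polarisation
SPS_CABINETS = 256
SUBRACKS_PER_CABINET = 4
TPMS_PER_SUBRACK = 8
MCCS_CABINETS = 4
ACTIVE_SERVERS_PER_CABINET = 16   # assumed: 17 installed, one kept as a spare
MAX_STATIONS_PER_SERVER = 8


def derived_counts():
    """Recompute the derived quantities of Table 3-1 from the inputs above."""
    tpms_per_station = ANTENNAS_PER_STATION // ANTENNAS_PER_TPM
    return {
        "total_antennas": STATIONS * ANTENNAS_PER_STATION,
        "tpms_per_station": tpms_per_station,
        "total_tpms": STATIONS * tpms_per_station,
        # The same TPM total, counted by where the boards are housed.
        "tpms_by_cabinet": SPS_CABINETS * SUBRACKS_PER_CABINET * TPMS_PER_SUBRACK,
        "signals_per_tpm": ANTENNAS_PER_TPM * POLARISATIONS,
        "stations_per_cabinet": STATIONS // SPS_CABINETS,
        # Stations that the active MCCS servers can look after in total.
        "station_capacity": MCCS_CABINETS
        * ACTIVE_SERVERS_PER_CABINET
        * MAX_STATIONS_PER_SERVER,
    }
```

Counting TPMs per station and per cabinet gives the same total, and the active server count is just sufficient to cover all 512 stations at eight stations per server.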

3.3 Role of MCCS in LFAA

The MCCS performs the local monitoring, control and calibration functions for the stations and supporting products. It receives commands and reports the LFAA status to TM. It comprises a compute cluster (hardware resources composed of off-the-shelf high-performance servers), local power and cooling distribution, a local network and job management software to support the LFAA monitor and control functions. The MCCS is connected to both the SPS and LFAA Network. It also calculates the beamforming and calibration coefficients. The MCCS controls the TPMs, the M&C and data networks, as well as supporting hardware in the cabinets. It is also responsible for implementing the transient buffer and transmitting the buffer, when instructed, to SDP via a dedicated 100Gb link.

3.4 Main MCCS Responsibilities

The two primary responsibilities of the MCCS sub-system are to:
1. Create and monitor observations, including calibration and buffering of beamformed data for transient detection
2. Provide monitoring and control capability for all the hardware and software components

The software architecture for the LFAA is primarily driven by these responsibilities, whilst the sizing of the MCCS hardware is defined by the resource requirements for calibration, transient buffers and supporting operations. Observation management is the primary use case for MCCS and defines the primary functional requirements for the software system, whilst most of the remaining requirements can be seen as features and specifications required to make sure that the primary use case remains online, available, working properly and meets the science cases to which the LFAA should cater. The functional requirements which the MCCS should provide can be summarized as follows:

- Create and manage observations, where an observation consists of one subarray containing multiple stations, which in turn can be composed of multiple sub-stations
- Perform calibration, pointing and bandpass flattening coefficient calculation for running observations
- Manage TPMs, including downloading firmware, initialising and synchronising the boards and firmware, and updating required coefficients and configurations throughout the lifetime of an observation
- Provide a transient buffer such that, when triggered by TM, buffered station beams can be forwarded to SDP
- Expose maintenance functionality for fault finding, mitigation and correction
- Monitor and control TPMs, antennas and other hardware and software components, and provide a reporting mechanism for generating reports
- Provide a logging mechanism and store logs for a period of time, where said logs should be queryable by external parties
- Raise alarms and events to inform internal and external entities of state and other changes of LFAA components
- Routinely perform status and diagnostic checks
- Provide an inventory database where labelled hardware components and cables are stored, to be able to easily localise issues within the CPF and RPFs
- Interact with external entities, including TM, SDP, CSP, operators, engineers and hardware and software deployers

Observation creation and management, with the associated need to control TPMs, calibrate the arrays and buffer station beams, is the main driving factor of the architecture and also defines the minimal performance requirements for sizing the MCCS hardware. The need to monitor all hardware and software devices, including the need to have an alarm and notification system, led to the adoption of TANGO by the SKA community as the primary control system for the SKA. Most of the purely LMC-related requirements are met by properly integrating TANGO within the architecture.

Figure 3-5. LFAA observation organization

The primary use case of the LFAA is to generate station beams. Observation organization is shown in Figure 3-5 and described below:

- A group of 16 Antennas (connected to a TPM) is called a Tile
- A Subarray is a set of Stations grouped together for a single observation scheduling block. A Station is composed of 256 antennas (distributed across 16 Tiles). The LFAA uses the concept of a Sub-array to conform with the SKA control guidelines, for grouping related Tiles and storing Sub-array related metadata. There is no Sub-array specific operation performed in the signal chain.
- The number of Subarrays which can be defined is configurable (there is no fixed limit). This document assumes a maximum of 16 Subarrays; however, this can be changed
- A sub-station is defined as a specific instance of a station beam in which a subset of the antennas does not contribute to the beam (a weight of 0 is applied to these antennas)
- Each Station can generate up to 8 Station Beams
- The Antennas within each Station need to be calibrated (gain and phase calibration). This is performed on the MCCS servers. The calibration cycle is 10 minutes. During these 10 minutes, coarse frequency channels (from the channels in the Station Beams) are calibrated in a round-robin fashion, such that each channel is calibrated in ~1 second
- For each Station Beam, given a pointing polynomial, the delay and delay rate per antenna need to be calculated so that pointing coefficients can be generated. Delays and delay rates per antenna are calculated on the MCCS servers, whilst pointing coefficients per antenna/channel (given the delay and delay rate) are calculated on the TPMs.
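The split described in the last item (delay and delay rate computed on the MCCS servers, per-channel pointing coefficients derived on the TPMs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the MCCS or TPM implementation: the linear delay model about a reference time, the phase sign convention and all function names are hypothetical.

```python
import cmath


def delay_at(delay_s, delay_rate, t_s, t_ref_s=0.0):
    """Assumed linear model: delay at the reference time plus rate times elapsed time.

    delay_rate is expressed in seconds of delay per second of elapsed time.
    """
    return delay_s + delay_rate * (t_s - t_ref_s)


def pointing_coefficient(freq_hz, delay_s, delay_rate, t_s, t_ref_s=0.0):
    """Unit-magnitude phase coefficient for one antenna and one channel.

    Assumed sign convention: the coefficient cancels the phase
    -2*pi*f*tau that a delay tau imposes at frequency f.
    """
    tau = delay_at(delay_s, delay_rate, t_s, t_ref_s)
    return cmath.exp(2j * cmath.pi * freq_hz * tau)


def coefficients_for_antenna(channel_freqs_hz, delay_s, delay_rate, t_s):
    """One coefficient per coarse channel, all derived from a single (delay, rate) pair."""
    return [
        pointing_coefficient(f, delay_s, delay_rate, t_s)
        for f in channel_freqs_hz
    ]
```

Under this model only two numbers per antenna and beam need to be sent to the TPM, whereas the per-channel coefficients, of which there are many more, are generated locally.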

Figure 3-6. MCCS top-level static decomposition

3.5 MCCS Top-Level Static Decomposition Diagram

MCCS Compute Processor and MCCS Software are the main components of the MCCS. A brief architectural description of the software component is presented in Section 5, while an in-depth analysis and description is provided in [RD7]. The MCCS Compute Processor component is composed of four almost-identical MCCS cabinets deployed in the CPF. Each cabinet hosts at least 16 high performance servers which are connected to SPS via a data network and interconnected both through the same data network and through a dedicated monitoring and control network. In two of the four cabinets an LMC head node (one master, one shadow) hosts the core LMC software and manages both MCCS and SPS. Figure 3-6 shows the top-level static decomposition of MCCS.

3.6 Interfaces

This section describes the external entities to MCCS, the level 4 and level 5 components composing MCCS, as well as all internal and external interfaces to MCCS.

3.6.1 External Entities

There are no external entities in the MCCS static decomposition diagram. Verification and maintenance support equipment is not described in detail in this DDD.

3.6.2 Level 4 and Level 5 Components

Level 4 decomposition has only three elements:
- 4 MCCS Compute Processors, which house the High-Performance Computing Units together with data network LRUs which connect the servers together as well as provide connections to SPS
- MCCS software, which encapsulates all the LMC and supporting software infrastructure for MCCS
- LMC infrastructure hardware, comprising one master node and a shadow master node which is used as a failover in the event that the master node becomes compromised

An MCCS Compute Processor is composed of:
- The cabinet chassis, holding all other hardware components
- 17 high performance computing units, one of which is a spare (kept in lower power mode until needed)
- Four 100Gb 32-port Ethernet switches, implementing a single 100Gb Ethernet network for science and LMC data
- One 1 Gbps 32-port Ethernet switch for control and management across MCCS
- The AC distribution system, distributing power to all Level 4 components under CMB control
- Required cabling
- Additionally, two of the MCCS Compute Processors contain one LMC Infrastructure node together with an associated UPS

The MCCS software is logically partitioned into several L5 components:
- Local Monitor and Control TANGO Framework, which encapsulates most of the LMC functionality
- Management Software Module, which manages the hardware and software configuration of all LFAA
- Graphical User Interface, which provides an engineering user interface for use in commissioning, testing and maintenance
- Data Acquisition Software Module, which is responsible for acquiring LMC data transmitted by SPS, used for calibration, transient buffering and diagnostics
- Pointing Software, which computes the delay and delay rates per antenna for a given station/sub-station configuration
- Calibration Software, which runs the calibration algorithm and generates calibration coefficients that are transmitted to SPS
- Diagnostic Software, which monitors the state of the LFAA (both hardware and software), including FieldNode diagnostics, calibration diagnostics and network diagnostics. Some of these diagnostics can be performed within the associated TANGO devices; others require more processing power, in which case they are run as standalone applications.

3.6.3 External Interfaces

The external interfaces between MCCS and other elements are listed in Table 3-2 and shown in Figure 3-7, whilst the external interfaces between MCCS and other LFAA sub-elements are listed in Table 3-3 and shown in Figure 3-8 and Figure 3-9. The external interfaces are defined in [AD4]/[AD5]/[AD6] whilst the internal interfaces are defined in [RD2], and MCCS intends to be compliant with them.

Table 3-2. External interfaces

External Entity   Interface ID         Leading Organization   Key Data or Message flows
TM                S1L.TM_LFAA.001      TM                     Overall LFAA monitoring and control functionality
SDP               S1L.SDP_LFAA.002     SDP                    Transient Buffer
SDP               S1L.SPA_LFAA.001     SDP                    Global sky model updates
SaDT              S1L.SADT_LFAA.007    SaDT                   Monitor and Control and NTP – physical link
SaDT              S1L.SADT_LFAA.009    SaDT                   Transient buffer data – physical link
INAU              S1L.LFAA_INAU.005    LFAA                   Rack power
INAU              S1L.LFAA_INAU.008    LFAA                   Rack cooling
INAU              S1L.LFAA_INAU.009    LFAA                   Floor space

Table 3-3. L2 interfaces to other LFAA sub-elements

LFAA Entity   Interface ID        Key Data or Message flows
SPS           S1L.MCCS_SPS.001    Physical links between CPF and MCCS
SPS           S1L.MCCS_SPS.002    Physical links between RPFs and MCCS
SPS           S1L.MCCS_SPS.003    Calibration, transient data exchange between SPS and MCCS
SPS           S1L.MCCS_SPS.004    LMC data exchange between SPS TPMs and MCCS
SPS           S1L.MCCS_SPS.005    LMC data exchange between SPS CMBs and MCCS
SPS           S1L.MCCS_SPS.006    LMC data exchange between SPS SRMBs and MCCS
SPS           S1L.MCCS_SPS.007    LMC data exchange between SPS Network and MCCS
Field Node    S1L.MCCS_FN.001     LMC data exchange between FN and MCCS

3.6.4 Internal Interfaces

The physical interfaces within MCCS are those required for:
- Distribution of power from the rack power supplies to the PDU and subsequently to the rack equipment (via the UPS in case of the head and shadow node)
- 1Gb and 100Gb network connectivity.

Interfaces between software components are described in the MCCS Software Architecture Document [RD7].

Figure 3-7. LFAA L3 context diagram

Figure 3-8. MCCS - Field interface

Figure 3-9. MCCS - SPS interface

4 Operational Concepts

Figure 4-10 MCCS Sub-Element top-level context diagram showing all external interfaces

4.1.1 Operational Environment

The screened Central Processing Facility (CPF) will house the MCCS equipment as well as other surrounding/support equipment such as that used for SaDT timing and networks, the CSP correlator and LFAA SPS. The CPF is an RFI-shielded facility supporting liquid cooling. This facility has some level of ESD protection and has HVAC filters to prevent dust accumulation on equipment. Notwithstanding the RFI-shielded facility, LFAA LRUs, including those comprising MCCS, are individually required to meet CISPR-22/32 Class A [RD3]/[RD5] radiated and conducted emissions levels. Additionally, MCCS LRUs must meet CISPR 24/35 [RD4]/[RD6] Class A radiated and conducted susceptibility levels or equivalent.

4.1.1.1 Operations

During normal operations MCCS is controlled via the interface with TM [AD4]. MCCS implements a high-level interface which allows TM to control and monitor MCCS as a single instrument. A single point of access is provided for housekeeping commands such as power-up, power-down, and state and mode transitions. Monitoring and error reporting are subscription-based; all parameters that may be of interest to TM and operations in general, including the rolled-up overall operational state and health, are available for subscription. In addition, the MCCS interface provides introspection, i.e. allows an authorized client to ‘discover’ and access parameters and commands implemented by the lower level components when required to support diagnostics and maintenance.

Signal processing functions are controlled via sub-arrays and scans. MCCS supports the configuration and monitoring of sub-arrays i.e. provides high-level commands [AD4] that TM can use to sub-divide the Low Telescope into up to 16 sub-arrays and operate each sub-array independently. MCCS exposes sub-arrays as top-level entities and makes provision for TM to assign antennae to sub-arrays and select signal processing functions to be performed per sub-array. A scan is defined as a time interval during which a sub-array’s configuration does not change. During normal operations, MCCS accesses a sub-array directly to assign antennae, select signal processing functions, as well as start and stop the scan (i.e. start and stop signal processing).

4.1.1.2 Maintenance

The MCCS Sub-element is an IT system and therefore requires typical data centre/IT system maintenance. Maintenance personnel will accordingly be typical IT-support personnel, and normally only such personnel are expected to be required at the site itself. Maintenance is described further in [RD9] Section 14.

4.1.1.3 Operator Role

Apart from maintenance activities, LFAA (and thus MCCS) is remotely controlled via TM and ultimately by an operator within the TM environment. The “operator”/maintainer role and how it relates to MCCS is described in more detail in [RD9] Section 14.

4.1.2 Support Environment

Support for MCCS will be provided both on-site (i.e. Boolardy) and off-site (i.e. at the SKA1_Low Telescope support facility at Geraldton and/or in/near Perth), as well as remotely. On-site and off-site support is described in more detail in [RD9] Section 14.

4.1.2.1 On-site Maintainer role

The MCCS on-site maintainer needs a technical hardware support background as described in [RD9] Section 14 to execute the required maintenance tasks. This maintainer’s primary objective is to detect and isolate faulty LRUs (corrective maintenance) and to remove and replace these to restore the MCCS functionality. The maintainer’s secondary objective is to determine what maintenance needs to be scheduled (predicted and preventative maintenance) and to coordinate and perform the required tasks when scheduled.

4.1.2.2 Off-site Maintainer role

The off-site maintainer needs software/hardware technical support background to perform second line LRU repairs, configuration, and verification as described in [RD9] Section 14. The off-site maintainer is located at the SKA1_Low Telescope support facility. The off-site maintainer removes and replaces selected SRUs to repair LRUs, configures the repaired COTS LRUs, and tests all repaired equipment in a representative environment to verify that they are fully operational. Once this is confirmed, LRUs are returned to the on-site or close-to-on-site spares store.

4.1.2.3 Remote support

MCCS maintenance and support personnel will remotely connect to the CPF over the SKAO communication network to read equipment status, review equipment log files and access MCCS long term monitoring data that is stored in the TM Engineering Data Archive (EDA), to help isolate faults. Off-site support is generally more specialized, and hence is preferred to detect and isolate faults. Off-site support will be provided where possible to assist the Telescope on-site operations personnel to diagnose faulty LRU equipment and problems with telescope functionality (firmware and software). The general rule is that anything that can be done remotely should be done remotely, with on-site capability and assistance where such capability is useful (such as GUIs on consoles to track down problems, as described in [RD7] Section 7). See [RD9] Section 14 for more information on remote support.

4.1.3 States and Modes

The MCCS implementation of states and modes is compliant with the SKA Control System Guidelines document [RD1]. Per these guidelines, MCCS implements and reports the standard set of SKA state and mode indicators for SPS, individual sub-arrays and MCCS itself. MCCS monitors state and mode transitions and based on the status reported by LFAA sub-systems derives overall LFAA state and mode indicators. For more detailed information on how states and modes are implemented in the MCCS software architecture refer to [RD7].

Table 4-4 lists the states and modes for a sub-array. The states and modes are applicable to all hardware, software and logical components, although it is not mandatory that all states and modes are applied to each component. Figure 4-11 shows the state transition diagram as derived from [RD1].

Table 4-4. MCCS states and modes

Attribute / Range / Description and comments

adminMode (read-write)
    Set by an outside authority (operations via TM and MCCS).
    ONLINE        The sub-array can be used for scientific observing.
    MAINTENANCE   The sub-array is not to be used for scientific observing but can be used for testing and commissioning.
    OFFLINE       The sub-array is not to be used at all.
    NOT_FITTED    Set by operations to suppress alarm generation.

opState (read-only)
    MCCS intelligently rolls-up the operational state of all components used by the sub-array and reports the overall operational state for the sub-array.
    INIT          The sub-array is being initialized.
    OFF           The sub-array is ‘empty’; no receptors have been assigned to the sub-array.
    ON            At least one receptor has been allocated to the sub-array; the sub-array is ready to accept a scan configuration.
    ALARM         The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
    DISABLE       The sub-array is administratively disabled (adminMode=OFFLINE or NOT_FITTED); basic monitor and control functionality is available, but signal processing functionality is not available.
    FAULT         An unrecoverable fault has been detected. The sub-array is not available for use; maintainer/operator intervention is required.
    UNKNOWN       The sub-array is unresponsive, e.g. due to loss of communication.

healthState (read-only)
    MCCS intelligently rolls-up attribute quality factors, states, and other indicators for all components and capabilities used by the sub-array and reports the overall sub-array healthState.
    OK
    DEGRADED
    FAILED

obsState (read-only)
    The sub-array Observing State indicates status related to scan configuration and execution.
    IDLE          The sub-array is not processing input data and is not generating output products. When a sub-array is IDLE, SCAN ID=0.
    CONFIGURING   Transient state entered when a command to re-configure the sub-array is received. The sub-array leaves this state when re-configuration is completed.
    READY         The sub-array enters READY when re-configuration has been completed.
    SCANNING      The sub-array is processing input data and generating output products.
    ABORTED       The sub-array transitions to this state when a command ‘abort scan’ is received. In this state re-configuration, delay tracking, and any other on-going processing functions are stopped.
    FAULT         An unrecoverable error that requires operator intervention has been detected.
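A few of the rules in Table 4-4 can be expressed directly in code. The sketch below is illustrative Python and does not use the TANGO API. The enumerations mirror the attribute ranges in the table; `op_state` applies only the rules stated there for DISABLE, OFF and ON; and the worst-of roll-up in `rolled_up_health` is an assumption, since the table does not specify how MCCS combines component health.

```python
from enum import Enum, IntEnum


class AdminMode(Enum):
    ONLINE = "ONLINE"
    MAINTENANCE = "MAINTENANCE"
    OFFLINE = "OFFLINE"
    NOT_FITTED = "NOT_FITTED"


class OpState(Enum):
    INIT = "INIT"
    OFF = "OFF"
    ON = "ON"
    ALARM = "ALARM"
    DISABLE = "DISABLE"
    FAULT = "FAULT"
    UNKNOWN = "UNKNOWN"


class HealthState(IntEnum):
    # Ordered so that a larger value means worse health (assumed ordering).
    OK = 0
    DEGRADED = 1
    FAILED = 2


def op_state(admin_mode, receptors_assigned):
    """Apply the DISABLE, OFF and ON rules of Table 4-4.

    INIT, ALARM, FAULT and UNKNOWN depend on runtime conditions that are
    not modelled in this sketch.
    """
    if admin_mode in (AdminMode.OFFLINE, AdminMode.NOT_FITTED):
        return OpState.DISABLE
    return OpState.ON if receptors_assigned > 0 else OpState.OFF


def rolled_up_health(component_health):
    """Assumed roll-up rule: a sub-array is as healthy as its worst component."""
    return max(component_health, default=HealthState.OK)
```

For example, `op_state(AdminMode.ONLINE, 0)` yields `OpState.OFF`, matching the "empty" sub-array described in the table, and any administratively disabled sub-array reports `OpState.DISABLE` regardless of its receptors.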

Figure 4-11. Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are mandatory for each hardware and software component.

5 MCCS Software Overview

5.1 Overview of Software Architecture

The software infrastructure of the LFAA must cater for the responsibilities specified above, with a focus on telescope monitoring and control, and observation management. Additionally, the architecture must meet the non-functional requirements listed in [RD9] Section 3.3. A high-level description of the LFAA software architecture is shown in Figure 5-12. This diagram separates components which are within the software architecture context from those which are considered external (here the Telescope Manager and hardware devices). Note that not all software components are shown here, to avoid clutter. The architecture itself is separated into four sub-systems which communicate with each other over the TANGO bus. This separation is purely logical since almost all software components are implemented as TANGO devices (or have an associated TANGO device). These sub-systems are:

Hardware Devices: Each monitorable and/or controllable hardware device in the LFAA has an associated TANGO device through which all operations are performed. These include TPMs, antennas, APIUs, switches, rack management units and servers. Note that an antenna cannot be monitored and controlled directly; these operations have to go through the APIU and TPM to which the antenna is connected.

Figure 5-12. LFAA overall software architecture overview

MCCS physical devices are represented by green boxes in Figure 5-12. Table 5-5 shows the relationship between these hardware devices and the physical devices listed in the LFAA PBS. Note that in some instances a hardware device is mapped to multiple physical devices, in which case the hardware device can interact with each physical device separately or through a controlling management device. For example, the Sub Rack can control and monitor power and signal distribution, presenting a rolled-up status (although direct device access is still permitted). The detailed design of these devices, in terms of monitoring and control functionality, is not yet finalised.

Table 5-5. Link between hardware components as described in software and the physical components as defined in the PBS

Component in Figure | Physical Component in PBS | PBS #
CMB | SPS Cabinet (Cabinet Chassis, AC Power distribution, Cooling System, Cabinet Management Board); MCCS Cabinet (Cabinet Chassis, AC Power distribution, Cooling System, UPS) | 95, 105, 101, 109, 106; 120, 105, 101, 109, 133
SRMB | TPM Sub-rack; AC DC Power Supply; Sub-Rack Management Board; TPM Sub-rack | 128, 138, 158, 162
APIU | Antenna Power Interface Unit (as a single entity) | 103
Antenna | Antenna (through APIU and TPM) | 139
TPM | TPM | 161
Switch | 100G Ethernet Switch (SPS, MCCS); 1 Gb Ethernet Switch (MCCS) | 98, 99, 129
MCCS Server | MCCS High Performance Computing Units; LMC Head Node | 121, 130

Observation Management: Observation creation and management is a complex task which requires the interaction of most of the software components shown in Figure 5-12. The observation management sub-system contains the software components which are unique to this functionality, essentially the TANGO devices which manage subarrays, stations, station beams and transient buffers. This sub-system includes the calibration, pointing, DAQ and transient buffer processes, and is described in greater detail in Figure 5-13.

Cluster Management: The MCCS will be composed of at least 64 high-performance servers, each housing several GPUs. These numbers are based on the estimated bandwidth, memory and compute power required to calibrate and buffer (transient buffer) all the stations in LFAA. Each server is responsible for at most eight stations, such that each GPU can calibrate two. Cabinet TANGO devices (and those of the hardware within each cabinet) and observation-related components are partitioned across the cluster and deployed on their associated server. Distributed storage is assumed, such that there is no central point of failure. A cluster manager and a storage manager will be used to administer these resources, as well as to allow the TANGO control systems and observation components to submit jobs on the cluster.
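The cluster sizing quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch and not part of the MCCS software: the constant and function names are illustrative only, and the station count of 512 is the number of Station devices quoted elsewhere in this section.

```python
import math

# Figures quoted in this section; the names are illustrative only.
TOTAL_STATIONS = 512          # Station devices created at start-up
MAX_STATIONS_PER_SERVER = 8   # "at most eight stations" per server
STATIONS_PER_GPU = 2          # "each GPU can calibrate two"


def min_servers(stations: int, per_server: int) -> int:
    """Smallest number of servers able to host all stations."""
    return math.ceil(stations / per_server)


def gpus_per_server(per_server: int, per_gpu: int) -> int:
    """GPUs needed in one server so that every hosted station is calibrated."""
    return math.ceil(per_server / per_gpu)


servers = min_servers(TOTAL_STATIONS, MAX_STATIONS_PER_SERVER)
gpus = gpus_per_server(MAX_STATIONS_PER_SERVER, STATIONS_PER_GPU)
print(f"servers needed: {servers}, GPUs per server: {gpus}")
```

Under these figures the lower bound on the number of servers matches the "at least 64" quoted in the text.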

Figure 5-13. LFAA observation management overview

Monitoring and Control: This subsystem contains all the elements defined in the SKA monitor and control guidelines, including the LFAA Master device (which is the root of the TANGO hierarchy and the main communication point for the LFAA), logging and alarm handling, as well as the TelState device which is managed by TM.

The interaction of observation-related devices is shown in Figure 5-13, while the following provides a high-level step-by-step description of what happens during observation creation and management (certain steps are omitted here; the full sequence is detailed in [RD7] Section 5.2):

1. When the system is started, 16 Subarrays and 512 Stations are created, each unassigned. For each Station, eight Station Beam devices and one Transient Buffer device are instantiated. These remain idle until they are required for an observation.

2. At any point, TM can send an observation configuration command to a Subarray. Assuming all resources are available, Tiles are grouped into Stations, and the Stations are associated with the Subarray. If the stations were already initialised for a prior scan (such that all required SPS and MCCS resources are not in low-power mode and already configured and calibrated), then the process skips directly to step 4. Subarray configuration includes the following operations:


a. MCCS will transition all required Field Nodes, SPS and MCCS resources from low-power mode to the Ready state. The time it takes to do so depends on the time required to stabilise the SPS racks (network switches take some time to switch on, and the cooling system needs to stabilise).

b. When ready, the stations are initialised. TPMs are programmed and initialised (if required) and signal processing starts. The beamforming chain does not need to be initialised at this point.

c. Each Station submits a DAQ, Calibration and Bandpass job to the Cluster Manager, which instantiates them. The jobs are provided with the TANGO FQDN of the creator (the station) such that a proxy can be created. These jobs are initialised and wait for incoming calibration spigot and diagnostic data from the station’s TPMs.

d. The Calibration process loads the previous gain and phase coefficients for this station (if any).

e. The calibration cycle is started by instructing the TPMs to send LMC data to the MCCS server. The DAQ process reads this stream and generates the correlation matrix, which is dumped to disk. The Calibration process reads this and computes the phase and gain coefficients for one frequency channel at a time. These coefficients are written to the Station device, which downloads them to the TPMs. A frequency channel is calibrated every second in this manner.

f. Device-specific checks are performed, and any required alarms are created.

3. Once the system is fully calibrated (this can take one to two calibration cycles), TM is notified that configuration is complete.

4. TM sends the full subarray configuration and MCCS performs the final configuration (note that this step should be compliant with SKA1-LFAA_MCCS_REQ-19):

a. The beamforming chain is configured on the TPMs (the station beams are not transmitted to CSP at this point).

b. Each Station Beam and Transient Buffer device creates a Pointing and Transient Buffer process (respectively).

5. TM sends the initial beam pointing polynomials, which are distributed to the respective Station Beams. The pointing processes calculate the required delays and delay rates per antenna and download them to the TPMs. The delays and delay rates are then updated periodically.

6. TM sends the start observation command to the subarray:

a. TPMs are instructed to send the generated station beam(s) to CSP.

b. TPMs start transmitting the quantised station beam, which is received by the Transient Buffer process and stored in the internal buffer. If triggered by TM, the required section of the buffer is transmitted to SDP (see [RD7] Section 5.2.9).

c. Diagnostic operations are performed routinely.

7. At any point, TM can update the beam pointing polynomials.

8. At any point, TM can read attributes from the devices contributing to the observation.

9. At any point, TM can issue a command on the subarray which changes its state. These include abort and stop. The stop command stops the transmission of calibrated station beams to CSP. The abort command will, in addition, result in the de-configuration of the components. When stopping, the processes are terminated but the station/subarray configuration remains as is.
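The command sequence in the steps above can be summarised as a small state machine. The sketch below is illustrative only: the state names (IDLE, CONFIGURED, SCANNING), the command strings and the SubarraySketch class are hypothetical and are not taken from the SKA control model or from the TANGO API. Allowing abort from the configured state is also an assumption, based on the statement that TM can issue state-changing commands at any point.

```python
from enum import Enum, auto


class ObsState(Enum):
    """Hypothetical state names, chosen for this sketch only."""
    IDLE = auto()        # subarray exists, no stations assigned (step 1)
    CONFIGURED = auto()  # stations assigned, calibrated and pointed (steps 2 to 5)
    SCANNING = auto()    # station beams are being sent to CSP (step 6)


# (current state, command) -> next state, following the numbered steps.
TRANSITIONS = {
    (ObsState.IDLE, "configure"): ObsState.CONFIGURED,
    (ObsState.CONFIGURED, "start"): ObsState.SCANNING,
    (ObsState.SCANNING, "stop"): ObsState.CONFIGURED,  # configuration is kept
    (ObsState.SCANNING, "abort"): ObsState.IDLE,       # components de-configured
    (ObsState.CONFIGURED, "abort"): ObsState.IDLE,     # assumption, see lead-in
}


class SubarraySketch:
    """Tracks the observation state of one subarray (illustrative only)."""

    def __init__(self) -> None:
        self.state = ObsState.IDLE

    def command(self, name: str) -> ObsState:
        """Apply a command, rejecting it if not allowed in the current state."""
        key = (self.state, name)
        if key not in TRANSITIONS:
            raise ValueError(f"{name!r} not allowed in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state


subarray = SubarraySketch()
for cmd in ("configure", "start", "stop"):
    subarray.command(cmd)
print(subarray.state.name)
```

The distinction between stop and abort is carried entirely by the transition table: stop returns to the configured state, while abort returns to the unassigned state.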


Figure 5-14 shows additional elements in the monitoring and control subsystem which are exclusive to LFAA (not covered in the SKA monitoring and control guidelines), which include:

- Beam Model Device, which can provide beam metrics for a particular azimuth and elevation

- Inventory Database and associated TANGO device. The LFAA must keep track of all hardware components and their serial numbers, which might be requested by TM. This database is also used during fault finding

- Command Line Interface and Graphical User Interface, which are clients to the LFAA LMC used by external users and operators

The unified state of the telescope can be collected via the LFAA Master; however, the TelState device can also be communicated with directly (as with all other TANGO devices on the TANGO bus) to investigate the state of groups of devices, or of individual ones.

Figure 5-14. LFAA local monitoring and control overview

Figure 5-15 shows the TANGO control hierarchy for the LFAA. Four types of TANGO devices are shown: Green represents TANGO devices which are associated with a hardware component, Yellow represents TANGO devices which are observation-related (logical devices representing observational entities), Red represents TANGO devices which interface with third party software, and Blue represents TANGO devices which support the TANGO infrastructure or are required for the overall monitoring and control of the system, including devices specified in the SKA control guidelines [RD1]. The connections between devices in the diagram show relationships and multiplicities, with the LFAA Master being the root of the hierarchy tree. Note that element-level devices (Alarm Handler, TelState, Element Logger) are functionally independent, providing different types of aggregation and functionality to TM or the other Elements.


Figure 5-15. TANGO control structure

5.2 Software Component List

Table 5-6 provides a short description of each entity in the figures described above, whilst Table 5-7 describes some of the relations between these components.

Table 5-6. List of elements in the Architecture System Overview

# | Name | Type | Multiplicity | Description
1 | Graphical User Interface | SW | 1 | A graphical interface through which users can locally access parts of the LMC, mainly to support maintenance and debugging
2 | Command Line Interface | SW | 1 | A wrapper around the LFAA Master which allows external libraries and clients to perform actions and request information
3 | Configuration Database | DB | 1 | A central store for the configuration required to load and run the LMC
4 | Log Storage | DB | 1 | Storage for generated logs
5 | LFAA Master | SW | 1 | The LFAA Master device, which orchestrates all the operations of the LMC and acts as the communication point with external entities, particularly TM
6 | Element Logger | SW | 1 | The LMC device which handles element logging functionality
7 | Inventory Database | DB | 1 | A database containing the list of hardware devices and cables, including their location within the CPF and how they are interconnected
8 | Inventory Device | SW | 1 | A TANGO device which interfaces with the Inventory database
9 | Subarray | SW | 16 | Creates, monitors and controls a subarray (a collection of station devices) when and as instructed by TM
10 | Station | SW | 1..512 | Creates, monitors and controls a logical station
11 | Station Beam | SW | 1..8 per station | Controls the pointing functionality for a station beam
12 | Beam Model Device | SW | 1 | TANGO device which contains the beam pointing model for an antenna and station
13 | TelState Device | SW | 1 | TANGO device which mirrors the TelState device in Telescope Manager
14 | Cluster Manager | SW/HW | 1 | TANGO device which interfaces with the cluster manager for monitoring, control and execution of jobs
15 | Transient Buffer | SW | 1 per station | TANGO device which controls the transient buffer process and processes triggers
16 | Transient Buffer Process | SW | 1 per station | Process which takes care of the transient buffer for a station
17 | DAQ Process | SW | 1 per station | Process which enables the reception and storage of data from TPMs
18 | Pointing Process | SW | 1..8 per station | Process which calculates the pointing coefficients for station beams
19 | Calibration Process | SW | 1 per station | Process which performs station calibration
20 | Bandpass Process | SW | 1 per station | Process which calculates the scaling factors for flattening the bandpass and runs diagnostics based on the antenna bandpass
21 | Cabinet Device | SW | 256 | TANGO device which monitors and controls the Cabinet Management Boards in SPS cabinets
22 | Sub-Rack Device | SW | 512 | TANGO device which monitors and controls the Sub Rack Management Boards in SPS cabinets
23 | MCCS Server | HW | 64 | Physical high-performance server, making up the MCCS
24 | Switch | HW | 512+20 | Physical network switch, composing the LFAA-DN
25 | Switch Device | SW | 512+20 | TANGO device for monitoring and controlling switches
26 | TPM | HW | 8192 | Physical TPM which hosts the digital signal processing chain
27 | Tile | SW | 8192 | TANGO device for monitoring and controlling a TPM
28 | Antenna | HW | 131072 | Physical antenna
29 | Antenna Device | SW | 131072 | TANGO device which monitors an antenna
30 | APIU | HW | 2048 | Physical APIU, which powers and monitors antennas
31 | APIU Device | SW | 2048 | TANGO device which monitors and controls an APIU
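The multiplicities in Table 5-6 imply a set of fixed ratios between device types, which can be used as a consistency check on the table. The sketch below is illustrative only: the dictionary and the ratio helper are hypothetical, the maximum value is used where the table gives a range (for example 1..512 stations), and the ratios are meaningful only under the assumption that devices are distributed evenly.

```python
# Multiplicities copied from Table 5-6 (maximum value where a range is given).
MULTIPLICITY = {
    "Station": 512,
    "Tile": 8192,
    "Antenna Device": 131072,
    "APIU Device": 2048,
    "MCCS Server": 64,
}


def ratio(many: str, one: str) -> int:
    """Return how many `many` entries exist per `one` entry.

    Raises ValueError if the counts do not divide evenly, which would
    point to an inconsistency between two rows of the table.
    """
    quotient, remainder = divmod(MULTIPLICITY[many], MULTIPLICITY[one])
    if remainder:
        raise ValueError(f"{many} count is not a multiple of {one} count")
    return quotient


PAIRS = [
    ("Tile", "Station"),
    ("Antenna Device", "Tile"),
    ("Antenna Device", "Station"),
    ("Antenna Device", "APIU Device"),
    ("Station", "MCCS Server"),
]

for many, one in PAIRS:
    print(f"{many} per {one}: {ratio(many, one)}")
```

All five pairs divide evenly, so the multiplicity column is internally consistent for these device types.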


Table 5-7. Relationships between major elements in the Architecture System Overview

# | Component A | Component B | Relationship Description
1 | Graphical User Interface | LFAA Master | The GUI uses the LFAA Master’s API to provide local users and personnel with access to the LMC
2 | Command Line Interface | LFAA Master | The CLI uses the LFAA Master’s API to allow local and remote access to the LMC
3 | LFAA Master | Configuration Database | The LFAA Master uses the Configuration Database for initializing the LMC and keeping track of configuration changes
4 | LFAA Master | Inventory Device | LFAA Master provides high level information and actions from/on the Inventory Database
5 | Inventory Device | Inventory Database | The Inventory Device manages and updates the Inventory Database
6 | Element Logger Device | Log Store | Element Logger Device receives logs from TANGO devices and stores them in the Log Store
7 | LFAA Master | Beam Model Device | LFAA Master provides beam metrics for a given Azimuth and Elevation when requested by TM
8 | TANGO Device | Alarm Handler | Alarms defined on TANGO devices are captured and processed by the Alarm Handler
9 | TANGO Device | TelState | TANGO devices can read the overall state of the telescope from the TelState device
10 | TANGO Device | Element Logger | Logs generated by TANGO devices are forwarded to the Element Logger for filtering and storage
11 | Station and Station Beam | Cluster Manager | The Station and Station Beam devices submit jobs to the Cluster Manager
12 | Subarray Device | Station Device | Subarray devices create, monitor and control a Station Device for each station in the subarray
13 | Station Device | Station Beam Device | The Station Device creates a Station Beam Device for each station beam. Each Station Beam will have an associated pointing process
14 | Station Device | Cluster Manager Device | Each Station Device submits DAQ, Calibration and Transient Buffer jobs to the Cluster Manager via the Cluster Manager Device
15 | Station Device | Transient Buffer Device | Each Station Device creates a Transient Buffer Device which keeps track of the transient buffer for that station and responds to triggers, sending the buffered data to SDP
16 | Transient Buffer Process | Transient Buffer | The Transient Buffer Device launches a Transient Buffer Process and processes triggers
17 | DAQ Process | Station | DAQ Process notifies the associated Station when a new file has been written to disk
18 | Calibration Process | Station | Calibration Process updates the calibration coefficients being used by the associated station
19 | Bandpass Process | Station | Bandpass Process calculates bandpass flattening factors and performs diagnostics on antenna bandpass
20 | Pointing Process | Station Beam | Pointing Process updates the delay and delay rates being used by the associated Station Beam
21 | Cluster Manager | MCCS Server | The Cluster Manager Device communicates with the Cluster Manager, allowing the rest of the LMC to submit jobs and monitor the state of running jobs
22 | Storage Manager | MCCS Server | The Storage Manager manages the disk space allocated to the distributed storage on MCCS servers
23 | Cabinet Device | Rack Management Board | The Cabinet Device monitors the cabinet environment (such as temperature) by interfacing with the rack management board
24 | Server Device | Server | The Server Device monitors the state of a server
25 | Switch Device | Switch | The Switch Device monitors the state of a switch, including statistics per port
26 | Cabinet Device | CMB | Monitors and controls the Cabinet Management Board
27 | Sub-Rack Device | SRMB | Monitors and controls the Sub Rack Management Board
28 | TPM Device | TPM | The TPM Device monitors and controls a TPM, including programming and initializing it, and allows the LMC to control the running firmware
29 | Antenna Device | Antenna | The Antenna Device monitors the state of an Antenna
30 | APIU Device | APIU | The APIU Device monitors and controls an APIU, including the ability to read out antenna power and shut off the antenna if required

Figure 5-16 shows a high-level module decomposition diagram which groups several of the software components described in this section into modules. It also shows the system services and software which are required to run the system (the System Services module), which are described in [RD9]. The Hardware TANGO Devices and System Service TANGO Devices represent all the TANGO devices which interface with hardware devices (in MCCS and SPS) and software services (including the cluster manager, storage manager, node provisioner, and so on).

5.3 Software-Hardware Mapping

Figure 5-17 shows the mapping between a subset of the array and some of the software components shown in Figure 5-13 for a specific observation setup. In this case a subarray is configured to contain three stations, with one of them containing two sub-stations. For the software architecture, a sub-station is defined as a specific instance of a station beam in which a subset of the antennas does not contribute to the beam (a weight of 0 is applied to these antennas), such that a Sub-Station TANGO device is not required. Sub-stations are therefore defined through appropriate configuration of the station beams. Each station has 256 antennas, which are connected to 16 TPMs. Each TPM has an associated Tile TANGO Device instance through which all interactions with the TPM (and hence control of the antennas and beams) are performed. A Station Device instance is associated with each station, and the required number of Station Beam Device instances are then configured for the station. Since station beams (and sub-stations) are pointed independently, delay calculation (pointing) is performed at the station beam level, whilst calibration is performed at the station level (these relationships are shown in Figure 5-13). The Station instances are then grouped and associated with a single Subarray TANGO Device instance, through which observation control is coordinated by TM.
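The definition of a sub-station as a station beam with zero weights on the excluded antennas can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name substation_weights is hypothetical, antennas are indexed from zero, and a plain 1.0/0.0 mask stands in for the actual beamforming weights, whose format is not specified here.

```python
ANTENNAS_PER_STATION = 256  # from the text above


def substation_weights(active_antennas, n_antennas=ANTENNAS_PER_STATION):
    """Per-antenna weights for a station beam restricted to a sub-station.

    Antennas listed in `active_antennas` keep a weight of 1.0; all others
    receive 0.0 and therefore do not contribute to the beam.
    """
    active = set(active_antennas)
    out_of_range = sorted(a for a in active if not 0 <= a < n_antennas)
    if out_of_range:
        raise ValueError(f"antenna indices out of range: {out_of_range}")
    return [1.0 if i in active else 0.0 for i in range(n_antennas)]


# Two sub-stations carved out of one station: two station beams, two masks.
# The half-and-half split is an arbitrary choice for illustration.
beam_a = substation_weights(range(0, 128))
beam_b = substation_weights(range(128, 256))
print(sum(beam_a), sum(beam_b))
```

Because each sub-station is only a different weight vector on an ordinary station beam, no additional device type is needed, which is the point made in the paragraph above.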


Figure 5-16. Software module decomposition diagram


Figure 5-17. Mapping between array and software components

5.4 Software Life Cycle

The Scaled Agile Framework, also known as SAFe, is an enterprise-scale development methodology developed by Scaled Agile, Inc. SAFe combines Lean and Agile principles within a templated framework. The main principles of SAFe interweave systems thinking and fast incremental development based on small and regular milestones within those increments. A summary of these principles can be found in [RD11].

5.4.1 Agile Release Trains

An Agile Release Train, or ART, is a fundamental concept within the Scaled Agile Framework. The ART is the primary value delivery method of SAFe. An Agile Team is a small group of individuals focused on defining, building, and testing solutions within a short time frame. An ART is a self-organizing, long-lived group of Agile Teams, whose purpose is to plan, commit, and execute solutions together. System development will have backlog items assigned in logical groupings and worked on across increments within stipulated periods of time (e.g. a few weeks per increment).

5.4.2 SAFe Implementation Overview

Given the size and scope of SAFe, proper implementation can be daunting, especially at the outset. A full explanation of SAFe implementation is beyond the scope of this document (more detailed information is available on the official website), so only a brief overview of implementation is given here:

1. Train Implementers: Due to the scope and challenge involved in adopting SAFe, most organizations will need a combination of internal and external mentors and coaches. These people should be capable of easily teaching and delivering SAFe techniques to others throughout the organization.

2. Train Executives, Managers, and Leaders: The initial batch of Implementers should first focus on training all executives, managers, and leaders. Once these fundamental team members understand the Lean-Agile mindset, core SAFe principles, and implementation techniques, the process will become much smoother for the entire organization.

3. Train Teams: Individuals should initially be organized into Agile Teams, who can then all be trained on the various Lean, Agile, and SAFe principles.

4. Launch Agile Release Trains: Finally, once the organization has been properly trained, it’s time to group Agile Teams together into ARTs, and then generate models for objective planning, program execution, program increment planning, and all the other components required for a successful Agile Release Train.

5.4.3 Essential SAFe

The essential basic configuration of the SAFe framework is shown in Figure 5-18 and provides all the elements necessary to have a complete SAFe system. Rather than focus on explaining the SAFe framework, we shall focus on particular elements within this framework, which require some discussion.

Figure 5-18: Essential SAFe configuration.

The software development process will employ the following key principles – adapted in general from the SAFe framework:

1. Collaborating closely with both the stakeholders and with other developers, adding valuable feedback and collaboration.

2. Implementing functionality in priority order – the requirements will be developed based on array assembly prioritisations – and these might change along the way.

3. Analysing and designing - The individual requirements are analysed by model storming on a just-in-time (JIT) basis for a few minutes before spending several hours or days implementing the requirement.

4. Ensuring quality – Use coding conventions, development guidelines and constant refactoring for quality.

5. Regularly delivering working solutions - At the end of each development cycle/iteration there will be a partial, working solution for demonstration/analysis.

6. Testing – Perform a significant amount of testing throughout construction.


For more detail on the framework, refer to [RD12].

5.4.3.1 Software Development Process During Construction Iterations

During construction iterations developers will incrementally deliver high-quality working software which meets the changing needs of the system as overviewed in Figure 5-19.

Figure 5-19: Software development process during a construction iteration.

5.4.3.2 The Test-First Approach to Construction

The test-first approach to software development is shown in Figure 5-20. The full testing regime for MCCS is detailed in [RD10].


Figure 5-20: Test-first development approach.

Within the context of development iterations in an Agile approach, this test-first approach is encompassed within iterations as shown in Figure 5-21.
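As an illustration of the test-first approach, the sketch below writes the tests before the code they exercise, using Python's standard unittest module. The unit under test, partition_stations, is a hypothetical helper invented for this example (it groups station identifiers so that no server receives more than eight); it is not part of the MCCS design.

```python
import unittest


# Step 1: write the tests first. They fail until partition_stations exists
# and behaves as specified.
class TestPartitionStations(unittest.TestCase):
    def test_at_most_eight_stations_per_server(self):
        groups = partition_stations(range(20), max_per_server=8)
        self.assertTrue(all(len(group) <= 8 for group in groups))

    def test_every_station_assigned_exactly_once(self):
        groups = partition_stations(range(20), max_per_server=8)
        flat = [station for group in groups for station in group]
        self.assertEqual(sorted(flat), list(range(20)))


# Step 2: write just enough code to make the tests pass.
def partition_stations(station_ids, max_per_server):
    """Split station ids into consecutive groups of at most max_per_server."""
    ids = list(station_ids)
    return [ids[i:i + max_per_server] for i in range(0, len(ids), max_per_server)]


# Step 3: run the tests; refactor only while they stay green.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPartitionStations)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Within an iteration, this write-test, implement, run cycle would be repeated for each small increment of functionality.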

5.5 Commissioning

Project commissioning is the process of assuring that all systems and components of the project are designed, installed, tested, operated, and maintained according to the operational requirements of the stakeholders. A commissioning process may be applied not only to new projects but also to existing units and systems subject to updates, refactoring, etc.


Figure 5-21: Testing during construction iterations.

In practice, the commissioning process comprises the integrated application of a set of engineering techniques and procedures to check, inspect and test every operational component of the project, from individual functions, such as instruments and equipment, up to complex amalgamations such as modules, software subsystems and systems.

Commissioning activities, in the broader sense, are applicable to all phases of the project, from the basic and detailed design, procurement, construction and assembly, until the final handover of the unit to the owner, including sometimes an assisted operation phase.

The testing procedures and acceptance process for all sub-units of the system, as well as for the integrated system working as a single element, are detailed in the MCCS Assembly Verification and Test Plan [RD10]. The commissioning procedure is made up of:

- Functional tests
- Non-functional tests
- A testing cycle for each test
- Regression testing
- A qualification process
- An acceptance process

It is assumed that the commissioning process for MCCS will form part of a wider commissioning procedure. There are various completion and commissioning tools which can be utilised for this purpose. With regards to MCCS, the commissioning process will support the AIV Element roll-out plan [AD2]. The commissioning process will be split to cater for:

1. Full system commissioning

2. Hardware commissioning

3. Software/Code commissioning

More details of this split can be seen in the MCCS Detailed Design Document [RD9].


6 MCCS Physical Overview

MCCS is essentially a compute cluster, requiring enough compute processing power, network bandwidth and memory space to run the MCCS software. Compute processing power is dominated by the correlation and calibration processes, network bandwidth is dominated by the transmission of calibration spigots from SPS to MCCS, while memory space is dominated by the fast transient buffer. The compute servers are distributed across 4 MCCS cabinets which, apart from the compute servers themselves, contain the required number of network switches to transport SPS LMC data from SPS to MCCS, interconnect the compute servers and transmit the fast transient buffer to SDP. The following sections analyse the compute, network and cabinet requirements, and describe the software necessary for these components to function properly.

6.1 Compute Server

There are a total of 68 compute servers in MCCS, distributed across 4 racks (including one spare per rack). Each compute server is responsible for 8 stations. [RD9] Section 4.1 provides an analysis of the compute, network and memory requirements for a single server, summarised below, resulting in the compute server configuration listed in Table 6-8:

- 4 high performance GPUs to run the correlation and calibration related processes
- One 100Gb interface for receiving data from 8 stations (64 TPMs)
- At least 1.5 TB RAM, primarily dominated by the space required to store the transient buffers for eight stations
- About 80 CPU cores

Table 6-8. MCCS compute server configuration

Item | Quantity | Minimum Specification
Chassis | 1 | 1U, min 2x SATA, dual 1Gb Ethernet, 2 kW redundant power supply, NVLink support
CPU | 2 | 20-cores, 2 GHz minimum
GPU | 4 | NVIDIA P100 with NVLink or equivalent
RAM | 12 | 128GB 2666MHz DDR4
1 Gb interfaces | 1 | On chassis
100 Gb interfaces | 1 | Mellanox 100-Gb ConnectX-5 with 1 QSFP, or equivalent
SSDs | 2 | 1 TB 2.5” SATA 6.0 Gb/s
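As a quick cross-check, the figures quoted in this section can be recomputed with a few lines of Python. This is only an arithmetic sketch: the variable names are illustrative and not part of the MCCS design, and the station total is simply derived from the quoted per-server figure.

```python
# Figures quoted in Section 6.1 and Table 6-8 (variable names are illustrative).
TOTAL_SERVERS = 68         # compute servers, including spares
RACKS = 4
SPARES_PER_RACK = 1
STATIONS_PER_SERVER = 8
DIMMS_PER_SERVER = 12      # RAM quantity in Table 6-8
DIMM_SIZE_GB = 128

servers_per_rack = TOTAL_SERVERS // RACKS
active_servers = TOTAL_SERVERS - RACKS * SPARES_PER_RACK
stations_served = active_servers * STATIONS_PER_SERVER
ram_per_server_gb = DIMMS_PER_SERVER * DIMM_SIZE_GB

print(servers_per_rack, active_servers, stations_served, ram_per_server_gb)
```

The RAM row works out to 1536 GB per server, which matches the "at least 1.5 TB" requirement listed above.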

Two additional servers are included which act as the master and shadow master nodes of the MCCS cluster, on which the core LMC functionality, hardware configuration database, maintenance support tools, graphical user interface and other high level software components will operate. These servers will also be responsible for configuring all of LFAA and for interacting with TM. The shadow node takes over when the master node is compromised.

6.2 Network

MCCS is connected to external entities and other LFAA Sub-elements through network links, shown in Figure 6-22. Communication with SPS uses one 100Gb link for each SPS cabinet in the RPFs and one 100Gb link for each group of two SPS cabinets in the CPF, totalling 110 100Gb links. Communication with TM goes through a 1Gb link, of which there are two for redundancy. The transient buffer is transmitted to SDP via a 100 Gb link provided by SaDT.


Figure 6-22. Network links between MCCS and external entities

These connections need to be distributed across the four racks which host MCCS. Core SPS cabinets have one 100 Gbps link per two cabinets to MCCS, RPFs within 25km have one 100 Gbps link to MCCS, and RPFs which are farther away than 25km use DWDM through a muxponder, multiplexed to 100 Gbps to MCCS. Apart from the 100G network, there is a separate 1G network local to MCCS which is used for monitoring and control, and which acts as a back-up in case the 100G network goes offline. MCCS is also responsible for the configuration, management and control of all the network and network components within LFAA, including the data network which forms the backbone of SPS, as well as all external network links provided by SaDT.

Figure 6-23 shows the network diagram for a single MCCS rack. Compute servers are grouped into two groups, each connected to a separate 32-port 100Gb network switch. Each 100Gb network switch ingests 14 SPS links, except for the bottom switches of the first and last racks, which each ingest 13 SPS links (totalling 110 links). A single 32-port 1Gb network switch is required to interconnect all hardware devices within an MCCS rack, with enough free ports for creating a full 1G mesh with the rest of the racks. Links to TM and SDP are also shown; however, these are not present in all racks. The TM links are connected to the 1G switch in the central two racks, whilst the SDP links can be connected to any of the racks. The head/shadow nodes are also located in the central two racks, each requiring two 1Gb links for redundancy. Note that in the diagram, links without multiplicity mean that there is a single link. The MCCS layout is described in detail in [RD9] Section 4.
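The SPS link distribution described above can be checked with a short Python sketch. The rack indices and the "top"/"bottom" position labels are assumptions made for illustration; only the link counts come from the text.

```python
RACKS = 4
DEFAULT_LINKS = 14   # SPS links ingested by most 100Gb switches
REDUCED_LINKS = 13   # bottom switch of the first and last rack

def sps_links_per_switch():
    """Return {(rack, position): SPS link count} for every 100Gb switch."""
    links = {}
    for rack in range(RACKS):
        for position in ("top", "bottom"):
            # Only the bottom switch of the first and last rack is reduced.
            reduced = position == "bottom" and rack in (0, RACKS - 1)
            links[(rack, position)] = REDUCED_LINKS if reduced else DEFAULT_LINKS
    return links

links = sps_links_per_switch()
total_links = sum(links.values())
print(total_links)
```

Six switches with 14 links and two with 13 give the 110 links quoted in Section 6.2.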


Figure 6-23. MCCS network diagram

6.3 Rack Assembly

The cabinet design is presented in Figure 6-24, with each cabinet containing:
• 16 compute servers and one spare compute server
• 2 100 Gb switches
• 1 1Gb switch
• For two of the racks an additional server is required to act as a master/shadow node
• For the racks containing the master/shadow node, a UPS is included

The head/shadow node and the 1Gb switch connecting the head/shadow node to TM are connected to the UPS, such that if a power failure arises then MCCS can inform TM and perform emergency shutdown operations, ensuring that the system will be capable of going back online when power is restored. Since the head/shadow servers will be low-power servers (when compared to the compute servers), a standard rack UPS should be able to provide enough up time for the head/shadow node to perform these operations.


Figure 6-24. MCCS rack assembly


7 Scenarios

This section describes a selection of example scenarios in detail. The selection is considered to represent many similar operational scenarios. The scenarios documented in the following subsections are:

1. Application of power to MCCS, including power sequencing
2. Transitioning to low power mode
3. Power down sequencing, that is transitioning to offline mode
4. Observation configuration and start
5. Calibration of LFAA and how this can detect failed antennas
6. Stopping an observation
7. Detection of MCCS failures, including redundancy for continued operations and how failures are reported, replaced and detected by the software
8. Software, BIOS and LRU firmware update

7.1 Application of power

When power is applied to MCCS, a boot-up sequence of minimal hardware and software components occurs. This will transition the operational state of these components from Unknown or Offline to Ready:

1. Power is applied to one of the racks which has a master or shadow node
2. The master and shadow nodes are configured to boot up on power, such that they will boot up and load the operating system. The 1Gb network switch also powers up when power is applied, such that MCCS can then directly access the Rack power supply.
3. An LMC bootstrap mechanism is run automatically at start-up which loads:
   a. The bare-metal provisioning software
   b. The distributed storage management software
   c. The TANGO database
   d. The TANGO starter
   e. The LFAA Element Master (root TANGO device)
   f. The Software Configuration database (in the case that this is an actual database and needs to be loaded)
4. The LFAA Element Master will then start up the rest of the core LMC system by reading the required configuration, and communication with TM is established
5. At this point the MCCS head node is powered on. Action from TM is required to power the rest of the MCCS as well as SPS. Once TM issues this command the power-on continues
6. The master/shadow starts powering on the rest of the MCCS hardware one rack at a time:
   a. Rack power is enabled (the ones which do not host the head/shadow node)
   b. Power to the data switches is enabled
   c. Power to the compute nodes is enabled (compute nodes are not configured to start up when power is applied)
   d. For each compute node, the LMC Element Master instructs the bare metal provisioning software to power it on. Nodes are powered sequentially, with time between each node TBD. The provisioning software loads an operating system image on the compute node, which in turn goes through the boot up process. Nodes are then ready for software provisioning


7. The LFAA Element Master provisions the required containers on each compute node to start the TANGO devices for monitoring and controlling the associated SPS hardware components, as well as the distributed storage and other required services

8. MCCS is now ready to accept and start observation configurations
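The rack-by-rack ordering of step 6 can be sketched as a generator of ordered actions. This is a minimal illustration, not the MCCS implementation: the action names, the rack indices and the choice of racks 1 and 2 as the head/shadow racks are assumptions, and the inter-node delay (TBD in the text) is not modelled.

```python
def power_on_steps(racks, head_racks, nodes_per_rack):
    """Yield the step-6 power-on actions in order, one rack at a time."""
    for rack in racks:
        if rack not in head_racks:
            # Racks hosting the head/shadow node already have rack power.
            yield ("enable_rack_power", rack)
        yield ("enable_data_switch_power", rack)
        yield ("enable_compute_node_power", rack)
        for node in range(nodes_per_rack):
            # Nodes are powered and provisioned sequentially (delay is TBD).
            yield ("provision_and_boot", rack, node)

# Hypothetical layout: four racks, head/shadow nodes in the central two.
steps = list(power_on_steps(racks=[0, 1, 2, 3], head_racks={1, 2}, nodes_per_rack=17))
print(len(steps), steps[0])
```

Expressing the sequence as data makes the ordering easy to inspect or replay in a test before any hardware is involved.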

7.2 Transition to Low Power Mode

Compute servers in MCCS cannot be turned off since they host TANGO devices which monitor and control SPS equipment. Equipment in SPS which is in low power mode still needs to be monitored (sensor and health status can still be accessed). The master and shadow nodes do not have a low power mode. Low-power mode for MCCS can be described as follows:

• A compute server can only be in low-power mode if all associated SPS equipment is in low-power mode (not being used for observations or part of a maintenance subarray)
• Network switches cannot be switched to low-power mode; however, they generally have power saving features which can be used to reduce their power consumption

A compute server in low power mode translates to the following operations being performed:
• Switch off GPUs or set their power management configuration to the minimal power consumption setting (depends on available GPU settings; PCI devices can also be disabled with appropriate kernel modules)
• Set all CPU cores to low power mode. CPU cores can also be disabled through appropriate Linux configurations

The power consumption of network switches depends on the network traffic, so they will automatically consume less power. Additionally, unused ports on the switch are disabled such that they do not consume any power.

During observation configuration, if a compute server in low power mode is required, the required GPUs and CPU cores are re-enabled or switched to normal power configuration.
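The eligibility rule for compute servers can be written as a one-line predicate. This is a sketch only: the mode strings and the function name are hypothetical, and how the modes of the associated SPS equipment are obtained is outside its scope.

```python
def can_enter_low_power(sps_equipment_modes):
    """Return True only if every associated SPS item is in low-power mode."""
    return all(mode == "low_power" for mode in sps_equipment_modes)

idle_server = can_enter_low_power(["low_power", "low_power"])
busy_server = can_enter_low_power(["low_power", "on"])
print(idle_server, busy_server)
```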

7.3 Transition to Off-line

MCCS must support the ability to shut down the entire sub-element. This may be related to maintenance, power saving measures, power supply emergency, etc. MCCS will support two types of shutdown:

• Controlled: orderly shutdown of servers and equipment
• Uncontrolled: immediate removal of power to running equipment

7.3.1 Controlled shutdown

To transition to off-line (controlled power shutdown) the MCCS head node will:
• Terminate all running observations (through the appropriate Devices, which will in turn terminate all running compute processes on the compute nodes)
• Terminate the LMC control hierarchy for SPS (keeping the LMC core running up to this point)
• Instruct the node provisioning system to shut down all compute servers
• Disable rack power to all racks except for the rack containing the head node
• Disable power to switches in the racks containing the master or shadow node
• Shut itself down


Note that power to the main rack must be switched off manually (or through the building management system) if required.
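The controlled shutdown order can be sketched as a function that builds an ordered plan. The action names and rack indices are hypothetical, and the sketch assumes that both the master and shadow racks keep rack power until the end, which is one reading of the steps above.

```python
def controlled_shutdown_plan(racks, head_racks):
    """Return the ordered actions for a controlled MCCS shutdown."""
    plan = [
        "terminate_observations",
        "terminate_sps_lmc_hierarchy",
        "shutdown_compute_servers",
    ]
    # Racks without a master/shadow node lose rack power entirely.
    plan += [f"disable_rack_power:{r}" for r in racks if r not in head_racks]
    # In master/shadow racks only the switches are powered off.
    plan += [f"disable_switch_power:{r}" for r in racks if r in head_racks]
    plan.append("shutdown_head_node")
    return plan

plan = controlled_shutdown_plan(racks=[0, 1, 2, 3], head_racks={1, 2})
print(plan)
```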

7.3.2 Uncontrolled shutdown

In the event of a power emergency, MCCS may be instructed to perform an uncontrolled shutdown, whereby the equipment is turned off as quickly as possible. In this situation the head node will send the shutdown signal to all the compute nodes (through the node provisioning system, regardless of what processing is being performed). When all the compute nodes are powered down (in the order of a few seconds), the head node will disable rack power to all racks and shut itself down.

In the event of power failure, where MCCS power is lost, all of MCCS will go offline except for the head and shadow nodes and the 1 Gb switches in the central 2 racks, which are connected to a UPS. This permits the head node to perform a proper shutdown and inform TM that MCCS lost power and is going to shut down. The latter assumes that all intermediary switches between MCCS and TM are still powered. If the entire CPF loses power then TM will be unreachable.

7.4 Set up and Start Observation

Observation setup is described in detail in [RD7] Section 5.2.1 and summarised in this document, Section 5.1. When the start observation command is received the following steps are performed:

1. TM sends the start scan command to the Subarray
2. The Subarray calls the start command on all associated Stations in parallel
3. The Station finalizes configuration on the Tiles. This includes:
   a. Setting the CSP ingest node IP, MAC and port as the destination parameters for the final Tile in the chain
   b. Instructing the Tiles to start transmission of data
4. Once all Tiles are configured, the Station returns a reply to the Subarray
5. The Subarray in turn waits for all Stations to finalize their configuration and returns a reply to TM once configuration is finished
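The fan-out in steps 2 to 5 (start all Stations in parallel, then wait for every reply) can be illustrated with Python's asyncio. This is a sketch under stated assumptions: `start_station` and `start_subarray` are hypothetical stand-ins, not TANGO device commands, and the Tile configuration itself is replaced by a placeholder.

```python
import asyncio

async def start_station(name, tiles):
    # Placeholder for finalising Tile configuration and starting transmission.
    await asyncio.sleep(0)
    return (name, len(tiles))

async def start_subarray(stations):
    # Start all Stations concurrently and wait until every one has replied.
    replies = await asyncio.gather(
        *(start_station(name, tiles) for name, tiles in stations.items())
    )
    return dict(replies)

# Hypothetical subarray with two Stations.
stations = {"station-1": ["tile-1", "tile-2"], "station-2": ["tile-3"]}
result = asyncio.run(start_subarray(stations))
print(result)
```

The Subarray only replies to TM after `gather` has collected a reply from every Station, which mirrors step 5.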

At this point signals are being processed and station beams are being sent to CSP. Throughout the observation, calibration and pointing coefficients are calculated and updated, and control data from the Tiles is received and processed accordingly.

7.5 Calibration

The calibration process, as well as the diagnostics which can be performed on the generated calibration solutions, are described in [RD7] Sections 5.2.6 and 5.2.7 and summarised below:

• Raw channel data needs to be transmitted by all the TPMs forming part of a station. This is used for calibration (and diagnostics) and is not transmitted to CSP. This data is directed towards an MCCS compute node, assigned during initialization, on which a DAQ process is running.
• The DAQ process reads in this data and buffers it for correlation. This data stream amounts to ~6.4Gbps.
• Once all the time samples for a frequency channel are received (that is, the stream switches to a new frequency channel), the buffer is marked as ready and copied to GPU memory.


• The GPU correlator computes the auto and cross correlation of the data and integrates the entire buffer to a single correlation matrix.
• The correlation matrix for the current frequency channel is saved to disk.
• Once the file is written, the Calibration process is notified.
• Assuming a standard calibration algorithm implementation, the difference between the sky model and acquired visibilities is minimized, generating a set of coefficients which describe the difference between the two.
• The generated coefficients are sent to the Station device.
• The Station device then distributes the calibration coefficients to its Tiles, which download them on the TPMs.
• The Tile devices also distribute the calibration coefficients to the respective Antenna devices (not shown), where they are archived for diagnostic purposes. These coefficients are kept in the LFAA archive for several days.
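The correlation step can be illustrated with a small CPU-only sketch. The MCCS correlator runs on GPUs and its implementation is not described here; the function below is a plain-Python stand-in that shows the underlying arithmetic, where each matrix entry is the time-integrated product of one antenna's samples with the complex conjugate of another's.

```python
def correlate(samples):
    """Integrate a buffer of time samples into one correlation matrix.

    samples[t][a] is the complex voltage of antenna a at time t.
    corr[i][j] = sum over t of samples[t][i] * conj(samples[t][j]);
    the diagonal holds auto-correlations, the rest cross-correlations.
    """
    n_ant = len(samples[0])
    corr = [[0j] * n_ant for _ in range(n_ant)]
    for frame in samples:
        for i in range(n_ant):
            for j in range(n_ant):
                corr[i][j] += frame[i] * frame[j].conjugate()
    return corr

# Two time samples from two antennas (made-up values).
buffer = [[1 + 1j, 2 + 0j], [0 + 1j, 1 - 1j]]
matrix = correlate(buffer)
print(matrix)
```

The result is Hermitian, so in practice only one triangle of the matrix needs to be computed and stored.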

Sanity checks and diagnostics on the generated calibration solutions are also performed to ensure that the system is stable and to detect misbehaving devices. These checks include:

• Compare the calibration solutions for each antenna against each other to detect outlier antennas (for example by computing the RMS and evaluating antennas against this RMS)
• Check how calibration solutions are evolving in time to check system stability (for example by seeing how the antenna RMS varies)
• Identify noisy frequency channels (RFI)
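One possible form of the first check is sketched below: each antenna's gain amplitude is compared with the array mean, and antennas deviating by more than a multiple of the RMS deviation are flagged. The use of gain amplitudes, the threshold value and the function name are assumptions for illustration; the document does not specify the statistic.

```python
import math

def outlier_antennas(gain_amplitudes, threshold=1.5):
    """Return indices of antennas deviating from the mean by > threshold x RMS."""
    n = len(gain_amplitudes)
    mean = sum(gain_amplitudes) / n
    # RMS of the deviations from the mean across all antennas.
    rms = math.sqrt(sum((g - mean) ** 2 for g in gain_amplitudes) / n)
    return [i for i, g in enumerate(gain_amplitudes) if abs(g - mean) > threshold * rms]

# Made-up amplitudes: the last antenna is clearly off.
flagged = outlier_antennas([1.0, 1.1, 0.9, 1.0, 3.0])
print(flagged)
```

Tracking the same RMS value over successive calibration cycles would give a simple handle on the stability check in the second bullet.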

7.6 Stop Observing

At any point TM can issue a stop or abort command on a running subarray:
• Stop: The current observation is stopped, and the observation is moved back to the READY state. Data output to CSP is stopped. Jobs and Tiles are left configured so that, if the next observation requires the same parameters, the devices do not have to be re-configured
• Abort: Abort moves the subarray to the ABORT state. The possible state changes from this state are to the CONFIGURING and IDLE states, which means that all resources can be freed up (to be re-used later). Output to CSP is first stopped to avoid invalid data being transmitted while aborting the observation. All running jobs are terminated (through the initiating device via the Cluster Manager device). Tiles are de-configured (but not put in low-power mode). Station, Station Beam and Tile devices are unassigned.

When the subarray receives a reset command while in the READY state, the same operations as abort above are performed. Additionally, the Tiles are de-programmed and placed in low-power mode. This also happens when the command is received whilst in the FAULT state.
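The relationship between the three commands can be summarised as a lookup from command to ordered actions. This is a sketch only: the action names and the "SCANNING" state label are hypothetical, and the resulting subarray states are not modelled.

```python
STOP_ACTIONS = ["stop_csp_output"]
ABORT_ACTIONS = [
    "stop_csp_output",      # first, to avoid transmitting invalid data
    "terminate_jobs",
    "deconfigure_tiles",
    "unassign_devices",
]
# Reset performs the abort operations, then de-programs the Tiles.
RESET_ACTIONS = ABORT_ACTIONS + ["deprogram_tiles", "tiles_to_low_power"]

def actions_for(command, state):
    """Return the ordered actions for a command issued in a given state."""
    if command == "stop":
        return list(STOP_ACTIONS)
    if command == "abort":
        return list(ABORT_ACTIONS)
    if command == "reset" and state in ("READY", "FAULT"):
        return list(RESET_ACTIONS)
    raise ValueError(f"{command!r} is not handled in state {state!r}")

reset_actions = actions_for("reset", "FAULT")
print(reset_actions)
```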

7.7 MCCS Failures

MCCS can suffer failures at any point during observation configuration, whilst running or whilst in low power mode. For hardware failures, the hardware is switched off, its status is changed to FAULTY and TM is notified. MCCS software has a hardware configuration database which, apart from storing the configuration of all hardware components, contains their location within the CPF and RPF to help maintenance personnel quickly locate the equipment. The following equipment can become faulty (for several reasons):


• Compute node, in which case the spare server in the rack takes over the operations of this compute node. A total of four spare compute nodes are always present in MCCS, such that up to four nodes can become faulty. If more than four nodes are faulty or offline then the associated resources (stations) cannot be used until the faulty nodes are replaced.
• 100G switch, in which case the incoming signal from SPS passing through this switch will be blocked (there is no redundancy for high bandwidth data between SPS and MCCS). LMC communication can be re-routed through other switches. Within MCCS there are several redundant links between switches such that LMC traffic can be re-routed if a switch is faulty or offline
• 1G switch, in which case rack devices which are controlled through the 1 Gb network (such as switches and rack power) will become unreachable. LMC communication between compute nodes can still be routed through the 100Gb network.
• Master node, in which case the shadow node will take over all operations performed by the master node. If the shadow node also becomes faulty then all MCCS and SPS will be unreachable by TM, and the LMC system will become unavailable.
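The compute node failover rule can be sketched as follows. The sketch assumes that a spare can only take over for a node in its own rack, which is one reading of the text; the data shapes and names are hypothetical.

```python
def assign_spares(faulty_nodes, spare_by_rack):
    """Map each faulty (rack, node) to its rack's spare, if still free."""
    free = dict(spare_by_rack)
    takeover, unserved = {}, []
    for rack, node in faulty_nodes:
        spare = free.pop(rack, None)
        if spare is None:
            # No spare left in this rack: the node's stations are unavailable.
            unserved.append((rack, node))
        else:
            takeover[(rack, node)] = spare
    return takeover, unserved

spares = {0: "spare-0", 1: "spare-1", 2: "spare-2", 3: "spare-3"}
takeover, unserved = assign_spares([(0, 3), (0, 5), (2, 1)], spares)
print(takeover, unserved)
```

Under this assumption a second failure in the same rack already leaves stations unserved, even though spares remain elsewhere in MCCS.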

When faulty LRUs are replaced, maintenance personnel must update the hardware configuration database with the device’s new IP address (through the provided tools). This is not required for compute servers since the provisioning software can automatically detect new nodes once they are physically powered up.

Software failures are handled entirely by the LMC software system. TANGO devices and system services are automatically restarted after a crash. It is assumed that all software running in MCCS will have been thoroughly tested during the verification stage (that is, it goes through the software test cycle).

7.8 Software Upgrades

Software upgrades and updates will happen through the commissioning phases of the SKA, as well as during its long lifetime. Upgrades should happen with minimal disruption of service, that is, with minimal impact on the capabilities of the telescope. These upgrades can be split into three types:

7.8.1 Software upgrades

This refers to all software running on MCCS, including the OS and other system software, management software, third party software and bespoke software developed for MCCS. The way in which these are updated depends on whether they are running on a compute node or a cluster/head node.

Upgrading software on the master node

When a software upgrade is required on the head nodes, the shadow node is updated first (since it only mirrors the functionality of the master node, no disruption is caused). Once the upgrade is complete several tests are performed to verify that the upgrade process was successful and that all required functionality is still available. Once ready, the shadow node takes over control from the master (becomes the master), which is in turn upgraded in the same manner. This can be used to update the operating system, system libraries and services and the TANGO core system.
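The shadow-first procedure can be sketched as below. The `upgrade` and `verify` callables are hypothetical hooks standing in for the real upgrade and test steps, and the node names are made up.

```python
def upgrade_head_nodes(master, shadow, upgrade, verify):
    """Upgrade the shadow first, swap roles, then upgrade the old master.

    Returns the (master, shadow) pair after the procedure.
    """
    upgrade(shadow)
    if not verify(shadow):
        raise RuntimeError(f"upgrade of {shadow} failed verification")
    # The upgraded node takes over control; the old master becomes shadow.
    master, shadow = shadow, master
    upgrade(shadow)
    if not verify(shadow):
        raise RuntimeError(f"upgrade of {shadow} failed verification")
    return master, shadow

upgraded = []
roles = upgrade_head_nodes("node-a", "node-b", upgraded.append, lambda node: True)
print(roles, upgraded)
```

If verification of the first upgrade fails, the original master is never touched, so control stays on a known-good node.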

Upgrading software on compute nodes

The OS images for compute nodes are stored locally on the master node. These images can be updated and versioned independently of what is running on compute nodes. This scheme is used to update the operating system and system libraries and services. When a compute node is rebooted the new OS image will be used to load the compute node. This can either be performed during a scheduled maintenance time where all the MCCS compute nodes are rebooted, or as a staged system where compute nodes are rebooted when they are not in use (with their running system offloaded to the spare servers, where the spare servers are upgraded first).

Updates to observation-related software (such as the calibration and correlation algorithms) result in new versions of the binaries, which are stored on the master node. When a new observation is defined it will simply use the new (or any required) version of the software. These programs are launched in containers on the compute nodes, so no system updates are required.

TANGO devices running on the compute nodes are also launched in containers, such that the same scheme as that for observation-related software can be used. In this case the new version of the TANGO device is first launched, and when it’s running the older version of the device is stopped.

7.8.2 BIOS updates

It is inevitable that BIOS updates will become available during the lifetime of the MCCS servers. They are generally installed through software provided by the manufacturer and will require the node to be rebooted. For the master and shadow nodes the same scheme as above can be used, where operations are taken over by one server whilst the other runs the BIOS update program. For updating the BIOS of a compute node (which must not be performing any observation-related functionality), all TANGO devices running on the node are offloaded to a spare server, after which the BIOS update program is run. When complete, the TANGO devices are set to run again on the updated node.

7.8.3 LRU firmware updates

Additional hardware in MCCS and SPS will need firmware and software updates, including network switches and power supply units. These updates are generally performed by manufacturer-provided software. The device will be offline during this update, which will result in down-time.
