
Name, Designation, Affiliation (signature and date fields blank in source):

Authored by:   A. Magro, Subject Matter Expert, AADC
Owned by:      M. Waterson, AA Domain Specialist, SKAO
Approved by:   P. Gibbs, Engineering Project Manager, SKAO
Released by:   J. G. Bij de Vaate, Consortium Lead, AADC

SOFTWARE ARCHITECTURE DOCUMENT

Document number: SKA-TEL-LFAA-0600052
Context: DRE
Revision: 02
Author: A. Magro, A. DeMarco
Date: 2019-02-12
Document Classification: FOR PROJECT USE ONLY
Status: Released

DOCUMENT HISTORY


Revision   Date of Issue   Engineering Change Number   Comments
A          2017-08-23      -                           Template version released within consortium
B          2018-10-24                                  First round of revisions
01         2018-10-31                                  First Release
02         2019-02-12                                  Implemented CDR panel OARs: LFAA Element CDR_OAR_MCCS Software Architecture Document, OARs 10, 11, 12, 14-17, 20-24, 26, 27-39

DOCUMENT SOFTWARE

                 Package   Version     Filename
Wordprocessor    MsWord    Word 2016   document.docx
Block diagrams
Other

ORGANISATION DETAILS

Name: Aperture Array Design and Construction Consortium
Registered Address: ASTRON, Oude Hoogeveensedijk 4, 7991 PD Dwingeloo, The Netherlands
Tel. +31 (0)521 595100
Fax. +31 (0)521 595101
Website: www.skatelescope.org/lfaa/

Copyright
Document owner: Aperture Array Design and Construction Consortium
This document is written for internal use in the SKA project.


TABLE OF CONTENTS

1 REFERENCES ..... 10
  1.1 Applicable documents ..... 10
  1.2 Reference documents ..... 10
2 LFAA SOFTWARE ARCHITECTURE DOCUMENTATION – BEYOND VIEWS ..... 11
  2.1 Purpose and Scope of the SAD ..... 11
  2.2 How the SAD is Organized ..... 11
  2.3 Stakeholder Representation ..... 11
  2.4 View Definitions ..... 13
  2.5 How a View is Documented ..... 16
3 SYSTEM OVERVIEW ..... 18
  3.1 Requirements and Architecture ..... 18
  3.2 Quality Attribute Requirements ..... 21
  3.3 Architectural Approaches ..... 23
4 LMC INFRASTRUCTURE VIEW ..... 24
  4.1 Notation ..... 25
  4.2 Context Diagram ..... 26
  4.3 Primary Presentation ..... 26
    4.3.1 External Interfaces ..... 26
    4.3.2 Internal Interfaces ..... 29
    4.3.3 Interface Security ..... 30
  4.4 Element Catalog ..... 30
    4.4.1 Element Attributes ..... 33
  4.5 Element Behaviour ..... 34
    4.5.1 State Transition ..... 34
    4.5.2 Reporting Behaviour ..... 36
    4.5.3 Alarm Behaviour ..... 37
      4.5.3.1 General Alarms for all LFAA Devices ..... 40
    4.5.4 Event Behaviour ..... 41
    4.5.5 Logging Behaviour ..... 42
    4.5.6 Archiving Behaviour ..... 43
    4.5.7 General Exception Handling Flow ..... 44
    4.5.8 Job Device States and Modes ..... 46
    4.5.9 Device Caching ..... 47
    4.5.10 Multiple TANGO Database Servers ..... 47
    4.5.11 Multiple TANGO Hosts ..... 48
  4.6 LMC External Interface Element Catalog ..... 49
    4.6.1 LFAA Master Device ..... 49
      4.6.1.1 Class Diagram ..... 49
      4.6.1.2 Element Behaviour ..... 49
    4.6.2 Subarray Device ..... 52
      4.6.2.1 Class Diagram ..... 52
      4.6.2.2 Element Behaviour ..... 52
  4.7 Rationale ..... 52
5 OBSERVATION MANAGEMENT VIEW ..... 53
  5.1 Primary Presentation ..... 55
    5.1.1 Element Catalogue, Properties and Relationships ..... 56


  5.2 Element Behaviour ..... 61
    5.2.1 Observation state machine ..... 62
    5.2.2 Observation configuration ..... 63
    5.2.3 Subarray Control ..... 69
    5.2.4 Antenna Equalization ..... 70
    5.2.5 Pointing ..... 71
    5.2.6 Calibration ..... 72
    5.2.7 Calibration Diagnostics ..... 76
    5.2.8 Bandpass Flattening and Monitoring ..... 76
    5.2.9 Fast Transient Buffer ..... 77
    5.2.10 Subarray Monitoring ..... 79
    5.2.11 Local Sky Model update ..... 80
  5.3 Variability Guide ..... 80
    5.3.1 Subarray and Station Configuration ..... 81
    5.3.2 Processing Algorithms ..... 81
  5.4 Rationale ..... 81
  5.5 Related Views ..... 82
6 MONITORING AND CONTROL VIEW ..... 83
  6.1 Context Diagram ..... 83
  6.2 Primary Presentation ..... 85
  6.3 Element Catalog ..... 85
    6.3.1 Antenna Device ..... 86
      6.3.1.1 Class Diagram ..... 86
      6.3.1.2 Element Behaviour ..... 87
    6.3.2 APIU ..... 91
      6.3.2.1 Class Diagram ..... 91
      6.3.2.2 Element Behaviour ..... 91
    6.3.3 Tile ..... 95
      6.3.3.1 Class Diagram ..... 95
      6.3.3.2 Element Behaviour ..... 95
    6.3.4 Station ..... 99
      6.3.4.1 Class Diagram ..... 99
      6.3.4.2 Element Behaviour ..... 99
    6.3.5 Station Beam ..... 104
      6.3.5.1 Class Diagram ..... 104
      6.3.5.2 Element Behaviour ..... 104
    6.3.6 Transient Buffer ..... 107
      6.3.6.1 Class Diagram ..... 107
      6.3.6.2 Element Behaviour ..... 107
    6.3.7 Specific Job Devices ..... 109
      6.3.7.1 Class Diagrams ..... 109
    6.3.8 Server ..... 110
      6.3.8.1 Class Diagram ..... 110
      6.3.8.2 Element Behaviour ..... 110
    6.3.9 Cabinet ..... 112
      6.3.9.1 Class Diagram ..... 112
      6.3.9.2 Element Behaviour ..... 112
      6.3.9.3 SPS Cabinet and MCCS Cabinet Devices ..... 114
    6.3.10 Sub Rack Management Board ..... 115
      6.3.10.1 Class Diagram ..... 115
      6.3.10.2 Element Behaviour ..... 116
    6.3.11 Switch ..... 118
      6.3.11.1 Class Diagram ..... 118
      6.3.11.2 Element Behaviour ..... 118
    6.3.12 Cluster Manager ..... 122
      6.3.12.1 Class Diagram ..... 122
      6.3.12.2 Element Behaviour ..... 122
    6.3.13 Storage Manager ..... 126
      6.3.13.1 Class Diagram ..... 126
      6.3.13.2 Element Behaviour ..... 126
  6.4 Variability Guide ..... 129
  6.5 Rationale ..... 129
  6.6 Related Views ..... 129
7 HARDWARE CONFIGURATION MANAGEMENT VIEW ..... 130
  7.1 Primary Presentation ..... 131
    7.1.1 Element Catalogue, Properties and Relationships ..... 132
  7.2 Element Behaviour ..... 134
    7.2.1 Adding new hardware devices ..... 134
    7.2.2 Replacing hardware devices ..... 136
    7.2.3 Database querying ..... 136
  7.3 Related Views ..... 137
8 MAINTENANCE SUPPORT VIEW ..... 138
  8.1 Context Diagram ..... 138
  8.2 Primary Presentation ..... 139
  8.3 Element Catalog ..... 139
    8.3.1 Graphical User Interface (GUI) ..... 139
      8.3.1.1 Element Interface ..... 140
      8.3.1.2 Element Behaviour ..... 140
      8.3.1.3 Element Properties ..... 142
    8.3.2 Command Line Interface (CLI) ..... 142
      8.3.2.1 TANGO Framework ..... 143
      8.3.2.2 Engineering Scripts ..... 143
    8.3.3 Integration with TM EDA ..... 144
  8.4 Variability Guide ..... 144
    8.4.1 Graphical User Interface (GUI) ..... 144
    8.4.2 Command Line Interface (CLI) ..... 144
  8.5 Rationale ..... 144
  8.6 Related Views ..... 145
APPENDIX A – SOFTWARE REQUIREMENTS ..... 146
APPENDIX B – LIST OF STAKEHOLDERS ..... 154


LIST OF FIGURES

Figure 4-1: Component and Connector, high level context diagram ..... 24
Figure 4-2: Colour-coded notation for component and connector diagrams ..... 25
Figure 4-3: Context diagram for the main use cases for LMC infrastructure ..... 26
Figure 4-4: Primary presentation component and connector diagram ..... 27
Figure 4-5: Primary presentation component and connector diagram. This shows the main internal interfaces involved in LMC Infrastructure ..... 29
Figure 4-6: Base classes class diagram ..... 31
Figure 4-7: Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are mandatory ..... 35
Figure 4-8: Activity diagram for LFAA Master report generation ..... 36
Figure 4-9: Abstract attribute based alarm quality behaviour for TANGO devices ..... 37
Figure 4-10: Element Alarm Handler activity diagram ..... 38
Figure 4-11: Alarm notification sequence time constraint ..... 39
Figure 4-12: Sequence diagram for event information flow across entities ..... 41
Figure 4-13: Log message sequence diagram ..... 42
Figure 4-14: Attribute archiving sequence diagram ..... 44
Figure 4-15: LFAA LMC exception handling flow ..... 45
Figure 4-16: Activity to construct a DeviceProxy client given multiple TANGO hosts deployed for control system replication and resilience ..... 48
Figure 4-17: Class diagram for the LFAA Master device (inherits from LFAAGroupDevice) ..... 49
Figure 4-18: Activity diagram showing the high-level process for the LFAA Master device to check for compute resource availability ..... 51
Figure 4-19: Class diagram for the LFAA Subarray device (inherits from LFAAGroupDevice) ..... 52
Figure 5-1: Observation management context diagram ..... 53
Figure 5-2: Observation management use case diagram ..... 54
Figure 5-3: Observation management primary presentation ..... 55
Figure 5-4: Class diagram with properties relevant to observation management ..... 56
Figure 5-5: Observation state diagram ..... 62
Figure 5-6: Observation creation activity diagram ..... 63
Figure 5-7: Resource availability checks activity diagram ..... 64
Figure 5-8: Station creation sequence diagram ..... 65
Figure 5-9: Station formation activity diagram ..... 66
Figure 5-10: Station Beam configuration activity diagram ..... 67
Figure 5-11: Job creation sequence diagram ..... 68
Figure 5-12: Observation start activity diagram ..... 70
Figure 5-13: Pointing sequence diagram ..... 72
Figure 5-14: Calibration overview sequence diagram ..... 73
Figure 5-15: Calibration procedure timing diagram (UML) for one frequency channel ..... 74
Figure 5-16: Calibration timing diagram showing how frequency channels are processed in parallel ..... 75
Figure 5-17: Calibration process information flow diagram ..... 75
Figure 5-18: Bandpass monitoring and flattening activity diagram ..... 77
Figure 5-19: High level transient buffer sequence diagram ..... 78
Figure 6-1: Monitoring and control elements have a narrow interface defined by the TANGO framework. Within this framework, there are several primary use-cases required for monitoring and control ..... 83
Figure 6-2: Unified activity for TANGO clients during run-time of the LFAA LMC system ..... 84
Figure 6-3: Components defined for hardware devices, collectively forming a hierarchy of monitoring and control functionality all the way up to the LFAA Master device ..... 85
Figure 6-4: Antenna device class diagram (inherits from LFAADevice) ..... 86
Figure 6-5: Antenna RMS and bandpass check activity diagram ..... 89
Figure 6-6: Activity diagram for antenna check for UNKNOWN opState in parent controllers ..... 90
Figure 6-7: APIU device class diagram (inherits from LFAAGroupDevice) ..... 91
Figure 6-8: APIU detection of antenna faults ..... 93
Figure 6-9: APIU responsiveness check ..... 94
Figure 6-10: Tile class diagram (inherits from LFAAGroupDevice) ..... 95
Figure 6-11: Tile detection of faults on connected antennas ..... 97
Figure 6-12: Tile responsiveness check ..... 98
Figure 6-13: Station device class diagram (inherits from LFAAGroupDevice) ..... 99
Figure 6-14: Station check for health on associated tiles ..... 101
Figure 6-15: Basic checks during job submission ..... 102
Figure 6-16: Activity to check for state of station beams ..... 103
Figure 6-17: Class diagram for a station beam device (inherits from LFAAGroupDevice) ..... 104
Figure 6-18: Activity for station beam device checking for associated tile health ..... 106
Figure 6-19: Class diagram for a Transient Buffer device (inherits from LFAADevice) ..... 107
Figure 6-20: Class diagrams for various job devices (all inherit from JobDevice) ..... 109
Figure 6-21: Class diagram for a server device (inherits from LFAADevice) ..... 110
Figure 6-22: Class diagram for a Cabinet device (inherits from LFAAGroupDevice) ..... 112
Figure 6-23: Sequence diagram for cabinets monitoring reachability to all devices encased in the cabinet ..... 114
Figure 6-24: SPS and MCCS Cabinet device class diagrams (both inherit from Cabinet Device) ..... 115
Figure 6-25: Subrack Management Board device class diagram (inherits from LFAAGroupDevice) ..... 115
Figure 6-26: Switch device class diagram (inherits from LFAADevice) ..... 118
Figure 6-27: Activity diagram for switch checking port health ..... 121
Figure 6-28: Class diagram for cluster manager device (inherits from LFAAGroupDevice) ..... 122
Figure 6-29: Activity diagram for checking state of all servers which serve as shadow master nodes ..... 125
Figure 6-30: Class diagram for storage manager device (inherits from LFAADevice) ..... 126
Figure 6-31: Activity diagram for storage manager device to summarize the state of all storage volumes ..... 128
Figure 7-1: Hardware Configuration Management context diagram ..... 130
Figure 7-2: Hardware Configuration Management Primary Presentation ..... 131
Figure 7-3: Hardware configuration database entry types ..... 132
Figure 7-4: Adding a new device activity diagram ..... 135
Figure 8-1: Primary use-cases for maintenance support arising from the LFAA MCCS software system ..... 138
Figure 8-2: LFAA local monitoring and control overview ..... 139
Figure 8-3: REST API to TANGO – Request-Reply Flow ..... 141
Figure 8-4: Publish-subscribe from TANGO to HTTP GUI via Websockets ..... 142


LIST OF TABLES

Table 4-1: TM to Element information flow description ..... 25
Table 4-2: Information flow between all major LMC infrastructure components and TM ..... 28
Table 4-3: Information flow between all major LMC infrastructure components ..... 29
Table 4-4: Class type for different LMC infrastructure components ..... 32
Table 4-5: Base class attribute descriptions ..... 33
Table 4-6: Base class method descriptions ..... 34
Table 4-7: Common alarms across all LFAA devices ..... 40
Table 4-8: Attribute actions taken for different exception and error type combinations ..... 46
Table 5-1: Context diagram information flow description ..... 53
Table 5-2: Element catalog for components in the primary presentation ..... 57
Table 5-3: Element attributes for components in the primary presentation ..... 58
Table 5-4: Element relationships for components in the primary presentation ..... 60
Table 5-5: Observation states ..... 62
Table 5-6: Subarray control commands which can be called on a running Subarray ..... 69
Table 5-7: Potential issues which can arise during subarray configuration ..... 79
Table 5-8: Potential issues which can arise during a running observation ..... 80
Table 7-1: Generic Device property list ..... 133
Table 7-2: Monitorable device property list ..... 133
Table 7-3: Antenna property list ..... 134
Table 7-4: Station property list ..... 134
Table 7-5: Generic item property list ..... 134
Table A-0-1: SKA L1 requirements which require support from LFAA Software ..... 146


LIST OF ABBREVIATIONS

AADC ..... Aperture Array Design and Construction Consortium
AAVS ..... Aperture Array Verification System
ADC ..... Analog to Digital Converter
Ad-N ..... Nth document in the list of Applicable Documents
AIV ..... Assembly, Integration and Verification
CCB ..... Configuration Control Board
CDR ..... Critical Design Review
CI ..... Configuration Item
COTS ..... Commercial Off The Shelf
CPF ..... Central Processing Facility
CM ..... Configuration Manager
CW ..... Continuous Wave
EMI ..... Electromagnetic Interference
FoV ..... Field of View
FPGA ..... Field Programmable Gate Array
HW ..... Hardware
ICD ..... Interface Control Document
ISO ..... International Organisation for Standardisation
LFAA ..... Low Frequency Aperture Array
LFAA-DN ..... Low Frequency Aperture Array – Data Network
LNA ..... Low Noise Amplifier
LMC ..... Local Monitoring and Control
LOFAR ..... Low Frequency Array
MCCS ..... Monitor, Control and Calibration Servers
QA ..... Quality Assurance
RD-N ..... Nth document in the list of Reference Documents
RF ..... Radio Frequency
RFI ..... Radio Frequency Interference
RFoF ..... Radio Frequency signal over Fibre
RPF ..... Remote Processing Facility
RMS ..... Root Mean Square
SaDT ..... Signal and Data Transport
SDP ..... Science Data Processor
SFDR ..... Spurious Free Dynamic Range
SKA ..... Square Kilometre Array
SKA-LOW ..... SKA low frequency part of the full telescope
SKAO ..... SKA Office
S/N ..... Signal to Noise
SW ..... Software
TCP-IP ..... Transmission Control Protocol – Internet Protocol
TBC ..... To Be Confirmed
TBD ..... To Be Determined
TBS ..... To Be Supplied
TM ..... Telescope Management
TPM ..... Tile Processor Module
WBS ..... Work Breakdown Structure
WP ..... Work Package


1 References

1.1 Applicable documents

The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.

[AD1] SKA-1 System Baseline Design, SKA-TEL-SKO-0000002, Issue 01
[AD2] SKA TM to LFAA ICD, 100-000000-028, Issue 2
[AD3] SKA SDP to LFAA ICD, 100-000000-033, Issue 1
[AD4] MCCS Architecture Overview, SKA-TEL-LFAA-0600050

1.2 Reference documents

The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document shall take precedence.

[RD1] SKA Control System Guidelines (CS_Guidelines), 000-000000-010, Issue 01A
[RD2] AAVS1 Software Demonstrator Design Report, SKA-TEL-LFAA-0600054
[RD3] MCCS Detailed Design Document, SKA-TEL-LFAA-0600051
[RD4] LFAA Architectural Design Document, SKA-TEL-LFAA-0200028
[RD5] LFAA Internal Interface Control Document, SKA-TEL-LFAA-0200030, Issue G


2 LFAA Software Architecture Documentation – Beyond Views

This document describes the software architecture for the Low Frequency Aperture Array (LFAA), which will run on the Monitor, Control and Calibration Subsystem (MCCS).

2.1 Purpose and Scope of the SAD

This SAD specifies the software architecture for the LFAA MCCS software package. All information regarding the software architecture may be found in this document, although some of it is incorporated by reference to other documents.

What is included in this document?
o Software architecture for the LFAA MCCS software package
o Relevant information on the pre-selected software frameworks
o SPEAD data packet format definitions
o Software requirements
o Firmware-software interaction description

What is not included in this document?
o Software implementation specifics

2.2 How the SAD is Organized

This SAD is organized as follows:

o Section 1: Reference Materials – provides citations and information about the reference documents.
o Section 2: Documentation Roadmap – lists and outlines the contents of the overall documentation package and explains how stakeholder concerns can be addressed by the individual parts.
o Section 3: Architecture Background – gives a broad system overview of the purposes and functionality of the MCCS software package and the significant responsibilities of this system. The architectural requirements, stemming from key quality attributes of the system, are described here in order to provide a wide context for defining the view documents referenced by this document.
o Sections 4-8: provide several views of the system and describe them in detail.
o Appendix A: Software requirements.
o Appendix B: List of stakeholders.

2.3 Stakeholder Representation

The purpose of this documentation pack is threefold. First, to satisfy stakeholders that the MCCS architecture meets the functional requirements and will provide a system with the desired quality attributes that successfully meets the business goals. Second, to provide the information required to enable detailed design and to shape the supporting systems on which MCCS depends. Finally, to provide reference material on the architecture of the MCCS systems to facilitate modification and upgrades post-construction.

The MCCS software architecture documentation is presented via Component and Connector views (C&C views) and Module views. C&C views illustrate the runtime connections between an MCCS process and its environment, which consists of other interacting components, data stores, and external components and actors. The C&C views depict the transportation of information between processing units, which transform the data and pass it on to other processing units. Module views depict the dependency relationships between modules, the units of source code from which an MCCS application is built. The module views tell developers what they are and are not allowed to use when implementing their part of the system. If a depiction of application structure is required to demonstrate how functional requirements or quality attributes are met, module views are provided to expose the composition of the application.


The different view types used in this SAD, and the information they represent, are summarized below. These views are used consistently to model how the system handles stakeholder concerns and requirements.

C&C View
  Element types:
  - Components: the principal computational elements and data stores that are present at runtime.
  - Ports: the points at which a component may interact with other components and the environment.
  - Connectors: a communication channel / interaction pathway established between two components via compatible ports.
  Relation types:
  - Attachment: component ports are associated with connector roles, such as service provider or service consumer, to yield a graph of components and connectors and the function that each component plays during interactions via that connector.
  - Delegation: as the runtime composition of applications is exposed, port behaviour can be revealed as delegated to sub-components within the containing structure.
  Property types:
  - Functionality: describes the runtime function of each component.
  - Behaviour: the behaviour of a component and the expected sequence of operations during component-to-component interactions.

Module View
  Element types:
  - Modules: the units of software that implement logic and a set of behaviours in order to fulfil a set of responsibilities.
  Relation types:
  - Depends-on: defines a dependency relationship between two modules. The dependency may specify compile-time usage restrictions and run-time usage limitations.
  - Uses: defines a specific dependency relationship whereby one module depends on another to satisfy its own requirements.
  Property types:
  - Responsibility: describes what the module does.
  - Visibility: the visibility of a module to other modules can be described where it is important to the architecture.

Data Model View
  Element types:
  - Data entity: the objects used to hold information and state in the system.
  Relation types:
  - Association: presents the logical associations between data entities.
  - Composition: indicates which data entities are logically contained within another.
  - Generalisation/specialisation: indicates an is-a relationship between data entities, stating that one data entity is an extension or specialisation of a base nature or type.


Appendix B – List of Stakeholders defines the stakeholders whose needs and views are addressed in this documentation pack. The tables list the role of each stakeholder, what each stakeholder seeks to obtain from the software system, and the documentation techniques most relevant to that stakeholder.

2.4 View Definitions

The following tables combine the requirements from multiple stakeholders into specific views, showing how the stakeholder requirements inform the views that need to be put into an architecture for the LFAA MCCS software. A view is therefore defined as a combination of appropriately grouped requirements. The sources for the following views are [AD1] and [AD2].

LMC Infrastructure
Abstract: This view covers the essentials of the LMC infrastructure across all elements of the telescope, as applied to LFAA MCCS.
Concerns to be addressed:
1. States and Modes
2. Alarms
3. Logging
4. Events
5. Reporting
6. Control and Monitoring Interfaces
7. Data Generation

Observation Management
Concerns to be addressed:
1. Calibration
2. Pointing
3. Creation and Teardown
4. Monitoring
5. Transient Buffer
6. Inputs and Outputs

Hardware Monitoring and Control
Concerns to be addressed:
1. Tile Processing Module
2. Cabinet Management
3. Switches
4. Subrack Management Module
5. Server Module
6. States and Fault Control
7. Availability and Scaling

Hardware Configuration Management
Concerns to be addressed:
1. Inventory Management
2. Configuration Database

Maintenance SupportDocument No.:Revision:Date:

Error: Reference source not foundError: Reference source not foundError: Reference source not found

Error: Reference source not foundAuthor: A. Magro, et al.

Page 13 of 156

Page 14: ska-sdp.orgska-sdp.org/.../ska-tel-lfaa-0600052-02_softwarearchitecturedocument.d… · Web viewska-sdp.org

Concerns to be addressed:
1. Remote Operations
2. Diagnosis Metadata
3. Maintenance Operations

Software Management
Concerns to be addressed:
1. Software Updates
2. Deployment
3. Testing and Verification

For every stakeholder, the following table defines the main views of concern. Views not listed for a stakeholder are not necessarily of no interest to that stakeholder. The level of information required by each stakeholder from each view is then presented in the Stakeholder/View matrix which follows the table.

Stakeholder: View(s) that apply to that class of stakeholder's concerns

Telescope Manager User: LMC Infrastructure; Observation Management; Hardware Configuration Management; Hardware Monitoring and Control; Maintenance Support
Analyst: LMC Infrastructure; Observation Management; Hardware Monitoring and Control; Software Management; Maintenance Support
Architect: LMC Infrastructure; Observation Management; Hardware Monitoring and Control; Software Management; Maintenance Support
Systems Engineer: LMC Infrastructure; Observation Management; Hardware Configuration Management; Hardware Monitoring and Control; Software Management; Maintenance Support
Integration and Test Engineer: LMC Infrastructure; Observation Management; Hardware Monitoring and Control; Software Management; Maintenance Support
MCCS Software Developer: LMC Infrastructure; Observation Management; Hardware Monitoring and Control; Software Management
MCCS Maintainer: Software Management; Maintenance Support
GUI Operator: LMC Infrastructure; Maintenance Support; Observation Management
Software Deployer: LMC Infrastructure; Software Management
Hardware Deployer: Hardware Configuration Management; Hardware Monitoring and Control
Science Data Processor User: Observation Management
Designer: LMC Infrastructure; Observation Management; Hardware Configuration Management; Hardware Monitoring and Control; Software Management; Maintenance Support
TPM Software Developer: Hardware Monitoring and Control; Observation Management
Database and Data Store Administrator: Hardware Monitoring and Control; Hardware Configuration Management
Network Administrator: Hardware Configuration Management
User (CLI/GUI Maintainer): Maintenance Support; LMC Infrastructure; Observation Management; Hardware Monitoring and Control


The View/Stakeholder matrix is as follows:

[Stakeholder/View matrix: rows are the six views (LMC Infrastructure, Observation Management, Hardware Configuration Management, Hardware Monitoring and Control, Maintenance Support, Software Management) and columns are the stakeholders listed above; the colour-coded cells indicating the level of information each stakeholder requires from each view did not survive transcription.]

Legend:
- Stakeholder requiring detailed information from view
- Stakeholder requiring overview information from view
- Stakeholder requiring some information only

2.5 How a View is Documented

Sections 4-8 of this SAD describe each view listed in Section 2.4. Each view is documented with several subsections:

1. Name of view

2. Special notation for the view, if applicable

3. Context diagram - This section provides a context diagram showing the context of the part of the system represented by this view. It also designates the view's scope with a distinguished symbol and shows interactions with external entities in the vocabulary of the view.

4. Primary presentation - This section presents the elements, and the relations among them, that populate this view packet, using an appropriate language, notation, or tool-based representation.

5. Element Catalog - Whereas the primary presentation shows the important elements and relations of the view packet, this section provides additional information needed to complete the architectural picture.


6. Element Behaviour - This section specifies any significant behaviour of elements or groups of interacting elements shown in the primary presentation.

7. Rationale - This section provides references for related documentation that determines, in part, the material presented in the view.


3 System Overview

This section provides an overview of the significant requirements which drove the architecture for the LFAA LMC and software infrastructure, together with an overview of the architecture and an analysis of how this architecture meets the requirements and the associated quality attributes.

The MCCS performs the local monitoring, control and calibration functions for the stations and supporting products. It receives commands from TM and reports the LFAA status to TM. It comprises a compute cluster (hardware resources composed of off-the-shelf high-performance servers), local power and cooling distribution, a local network, and the job management software to support the LFAA monitor and control functions. The MCCS is connected to both the SPS and the LFAA-DN. It also calculates the beamforming and calibration coefficients. The MCCS controls the TPMs, the M&C and data networks, as well as supporting hardware in the cabinets. It is also responsible for implementing the transient buffer and transmitting the buffer, when instructed, to SDP via a dedicated 100Gb link.

The two primary responsibilities of the MCCS sub-system are:

1. Creation and monitoring of observations, including calibration and buffering beamformed data for transient detection
2. Provision of monitoring and control capability for all the hardware and software components

The software architecture for the LFAA is primarily driven by these responsibilities, whilst the sizing of the MCCS hardware is defined by the resource requirements for calibration, transient buffers and supporting operations. Refer to [AD4] for an overview of the MCCS role in the LFAA, the main MCCS responsibilities, MCCS functional requirements, and overview of the software architecture and how the software maps to the hardware.

3.1 Requirements and Architecture

Appendix X lists the high-level software requirements for the LFAA software system, whilst the table below highlights requirements which are architecture drivers.

The table columns are: Requirements, Summary, and ASR Description.

LFAA_MCCS_REQ-19 - Maximum time for mode transition: The maximum transition time between different observations is defined to be less than 30 seconds. This includes the time required to change between observations, that is, to configure the beamforming chain and update the pointing devices. This does not include the time required to transition all required SPS and MCCS hardware components from low-power mode to online. To meet this requirement, most of the required operations will have to be performed in parallel.

LFAA_MCCS_REQ-33, LFAA_MCCS_REQ-179, LFAA_MCCS_REQ-220, LFAA_MCCS_REQ-142, LFAA_MCCS_REQ-283, LFAA_MCCS_REQ-160, LFAA_MCCS_REQ-277, LFAA_MCCS_REQ-248, LFAA_MCCS_REQ-280, LFAA_MCCS_REQ-36, LFAA_MCCS_REQ-37 - General monitoring and control functionality: These requirements specify the features required from the control system, including the creation and handling of alarms, logging from all elements, reporting mechanisms organized in a hierarchical manner, and state/mode handling. These requirements have led to the adoption of TANGO by the SKA, which is in turn adopted by the LFAA.


LFAA_MCCS_REQ-239, LFAA_MCCS_REQ-16, LFAA_MCCS_REQ-39 - Global Sky Model: The LFAA must contain a local sky model to be used for calibration. Updates to the sky model are acquired from SDP after an update notification is received. This introduces an additional interface to the LFAA software system.

LFAA_MCCS_REQ-156 - Data Acquisition: Control data from the TPMs needs to be captured by the MCCS servers to be used for correlation and diagnostics. This data rate can be quite high (about 800MB/s for correlation), such that the DAQ system has to be optimally implemented and deployed appropriately on the MCCS.

LFAA_MCCS_REQ-159 - Bandpass Flattening: In order to calculate bandpass flattening coefficients, the bandpass has to be monitored by the MCCS during an observation. This operation is separate from calibration and requires a different and separate set of coefficients, although a similar setup can be used.

LFAA_MCCS_REQ-14, LFAA_MCCS_REQ-15, LFAA_MCCS_REQ-17, LFAA_MCCS_REQ-157 - Correlation and Calibration: These requirements define the sizing of the computational resources in the MCCS, since they are by far the most computationally intensive operations which have to be performed. For a given calibration cycle of 10 minutes, and assuming that all usable coarse frequency channels need to be calibrated, a single channel needs to be correlated and calibrated every ~1.5s, thus requiring accelerators (GPUs). An analysis of the computational and hardware requirements for calibration is provided in the Observation Management view.

LFAA_MCCS_REQ-146, LFAA_MCCS_REQ-147 - Synchronized coefficient download: Synchronized operation across TPMs within a station is critical for generating properly aligned, calibrated and beamformed data. This requires significant timing checks when communicating with TPMs, as well as a stable system for getting the time.

LFAA_MCCS_REQ-9, LFAA_MCCS_REQ-11, LFAA_MCCS_REQ-279, LFAA_MCCS_REQ-292 - Transient Buffer: In order to constantly buffer the station beams for all stations, the last TPM in each station chain has to constantly send the (quantized) station beams to the MCCS, where a ring-buffer of considerable size (~900s) needs to be constantly updated. To keep this in RAM, about 1.5 TB of RAM per server is required. When a transient trigger is received, parts of these buffers must be sent to SDP without over-saturating the SDP-LFAA link. After calibration, the transient buffer is the most demanding feature (in networking and storage terms).

LFAA_MCCS_REQ-2, LFAA_MCCS_REQ-20, LFAA_MCCS_REQ-21, LFAA_MCCS_REQ-23, LFAA_MCCS_REQ-30, LFAA_MCCS_REQ-291 - Station beam configurability: The LFAA should be capable of forming stations (256 antennas) as well as sub-stations (up to 2048). For each station, a number of independent station beams need to be configured. This defines the degree to which the architecture must be configurable and flexible, and the number of devices which must be defined and monitored/controlled.

LFAA_MCCS_REQ-206, LFAA_MCCS_REQ-207 - Availability: In order to achieve an operational availability of 95% (software-wise), a number of fault-mitigation processes have to be in place to make sure that if a software fault occurs the system can recover properly and quickly. Redundant hardware resources have to be in place to switch over to in case of hardware faults.

LFAA_MCCS_REQ-166, LFAA_MCCS_REQ-167, LFAA_MCCS_REQ-172, LFAA_MCCS_REQ-210 - Software management: The LFAA must be capable of running specific software/algorithm versions for different observations, potentially simultaneously, as well as be capable of rolling out software updates with minimal effect on the running state of the system.

LFAA_MCCS_REQ-168, LFAA_MCCS_REQ-169 - Client interfaces: A local GUI as well as an engineering interface need to be integrated with the software infrastructure to allow local and remote operators access. The entry point for these components will be the same as that for TM (the LFAA end of the TM-LFAA ICD) so as to provide a uniform view to all external entities. Through TANGO, the rest of the system can then be queried easily.

LFAA_MCCS_REQ-162 - Network Configuration: The LFAA hardware setup involves a significantly large network which must be set up, controlled and monitored. Stations are formed by appropriately chaining up TPMs via the LFAA-DN. Switches need to be configured properly so that network traffic is appropriately balanced.

LFAA_MCCS_REQ-9, LFAA_MCCS_REQ-154, LFAA_MCCS_REQ-17, LFAA_MCCS_REQ-157 - Cluster management: The MCCS is a small high-performance cluster on which the LMC infrastructure will run. Calibration, DAQ, pointing and the transient buffer will also reside here. A cluster management system is required to monitor the servers, whilst an accompanying job system is required to easily submit and distribute processing across the nodes. Additionally, the DAQ and other jobs communicate via files, such that a storage management system is required to manage storage (which is distributed across the nodes so as not to have a single point of failure).

LFAA_MCCS_REQ-35 - TM-LFAA Interface: The LFAA can be thought of as offering a service to TM, such that its architecture and functionality are dependent on the operations which TM needs to perform. Several requirements are based on the content of the TM-LFAA ICD, and the software architecture needs to reflect this.


3.2 Quality Attribute Requirements

Based on the LFAA MCCS requirements, the following key quality attributes were identified for the system; the table below shows how these inform the software architecture described in this document.

[Quality Attribute / View / Significant Requirement matrix. The view columns are: LMC Infrastructure; Observation Configuration; Hardware Configuration Management; Monitoring and Control; Fault Tolerance and Availability; Maintenance Support; Software Management. The significant requirements for each quality attribute are listed below.]

Performance:
- The switching time between telescope observation modes shall take less than 30 seconds. Latency for alarms to be signalled to the operator once a threshold is crossed shall not be more than 1 second.
- A single coarse frequency channel must be calibrated every ~1.5s, such that the entire bandwidth is calibrated every 10 minutes.
- For each station a ~900s transient buffer needs to be kept in memory; when triggered by TM, sections of these buffers need to be sent to SDP.

Availability (Reliability and Recovery):
- Once transitioned to a safe state, the system shall remain in the designated safe state until commanded otherwise.
- LFAA shall have an operational availability of at least 95%.

Modifiability:
- Reconfiguration of the entire array when beamforming parameters are changed. LFAA can be modified to satisfy various configurations of stations and station beams. Allow for TM to configure station beams as required for the observation. Upon reception of a new observation, the telescope is configured and the required workflows are created.
- Hardware is reconfigured for each new observation when required. Possibility to dynamically add or remove compute nodes as required.

Usability:
- Tunnelling capability for engineering access. Support remote operations such as power-up, power-down, restart, diagnostics.
- Commands are accepted and executed, with data processed, for each subarray independently of and concurrently with all others.
- LFAA needs to process up to 8 independent beams from each station within a subarray, each with potentially independent pointings.
- Accessibility on a per user/role basis.
- Web-based access for MCCS operations.

Interoperability:
- States and modes reporting structure defined in the SKA Control System Guidelines. Logging aggregated to external entities in an agreed schema. Reporting structure across the system is defined and standardized with other systems by ICD. TM-LFAA interface based on the TANGO protocol. All modules requiring synchronized telescope network time shall comply with the NTP v4 standard.
- Transmission of information to/from TM, e.g. coefficients in an agreed format. Data files generated in observations have a designated format with corresponding libraries to read/write to/from them. Configuration/capability schema agreement. Standardized transient buffer format and event interface. Download of algorithms via interoperable containers or similar.
- Standard inventory part numbers and serials. Standard electronically readable or scannable IDs, cable IDs, connected plate IDs.
- Interoperability of metrics and attributes provided through the TANGO interface. Hierarchical summarization of groups/levels of hardware devices.

Security:
- Configuration possible only within parameters allowable by SKA_LOW. Validation of observation parameters and command sequences.
- Comprehensive set of roll-out rules/alarms for equipment safety and smooth operation.
- Tolerance and availability should not take precedence over established alarm and warning procedures.
- User database, role-based functionality, role-based privileges, remote access.


3.3 Architectural Approaches

This section documents the key architectural decisions that apply to more than one view, along with the rationale for each decision. Decisions and choices made in specific areas which do not affect others are documented in the Views relating to that area. These decisions and their rationale are summarised below.

Adoption of TANGO as the control system

The TANGO control system was adopted by the SKA as the control system to be used for interaction between TM and all other elements in the SKA, such that all these interfaces must be designed with TANGO. A set of guidelines was produced [AD1] to which all these interfaces, and other parts of the system which use TANGO, should adhere. In LFAA the decision was taken to use TANGO for monitoring and control of all hardware devices. The LFAA is composed of on the order of 150,000 monitorable devices, which requires a mature and stable control system. All the required functionality, including the setting up of alarms, generation of events, logging, archiving and others, is available in TANGO out of the box.

TANGO device hierarchy through grouping

The TANGO device hierarchy described in this document maps to the physical composition of the LFAA (including the Field Nodes, SPS and MCCS). A well-defined hierarchy allows for highly structured monitoring and control, especially when generating status reports and for drill-down purposes. This hierarchy also allows for simplified fault finding and maintenance, since if an error occurs anywhere within the hierarchy tree then all devices beneath that node can be taken offline whilst the fault is corrected.

TANGO device distribution across MCCS nodes

Running all the TANGO device instances on a small number of centralised servers can result in low availability if one of these servers goes offline due to a fault or for maintenance. As described in [AD4], MCCS will be composed of at least 64 compute nodes, 4 spare compute nodes, 1 head node and one shadow head node. The head node will be responsible for running the core TANGO system, as well as hosting all databases and the link to TM. Each compute node is logically associated with several stations. The TANGO devices which monitor and control all devices associated with a station, as well as the processing jobs required to support generation of the station's beam and transient buffer, will run on the same server. All these software instances will run in containers, such that in the case of a server failing, all containers running on it can easily be relaunched on a spare server.

Use of standalone processes for compute-intensive operations

This architecture includes three types of software modules: TANGO devices interfacing with hardware devices or other entities using TANGO, TANGO devices which interface with third-party software such as cluster and storage managers, and standalone processes for compute-intensive operations. The latter are software components which require significant computational resources, including correlation, calibration and data acquisition. These components are scheduled to run on compute nodes through the job submission system and interact with the control system by creating proxies to their associated TANGO devices. The alternative would be to have these processes be TANGO devices themselves; however, this is inefficient, as described in [RD2]. The main reason for this is that dynamic creation of TANGO devices is not a best practice in TANGO, and having these processes instantiated from the start would require that processing starts based on attribute writes or commands, which proved to be inefficient in the AAVS1 prototype. The architecture described in this document will be compared to the one prototyped for AAVS1, providing results and issues which have arisen (and led to a change in architectural design).
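As an illustration of this decision, the sketch below (a minimal sketch assuming PyTango; the device name, the "progress" attribute and the calibrate_channel stub are hypothetical placeholders, not part of the ICD) shows a standalone calibration process, launched by the job submission system, interacting with the control system through a proxy to its associated job device:

import tango

def calibrate_channel(channel: int) -> None:
    pass  # placeholder for the compute-intensive (GPU) correlation/calibration work

def run_calibration(device_name: str = "low/mccs/caljob_01") -> None:
    job = tango.DeviceProxy(device_name)          # proxy to the associated JobDevice
    for channel in range(384):                    # iterate over coarse channels
        calibrate_channel(channel)
        job.write_attribute("progress", channel)  # hypothetical progress attribute

if __name__ == "__main__":
    run_calibration()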


4 LMC Infrastructure View

This view describes the details of the LMC infrastructure in reference to the standards set in the LMC Guidelines document [RD1]. It is expected that most of the material expressed in this view applies to LMC units in other elements. An effort has been made to comply as much as possible with the vision set out in the LMC Guidelines, with the understanding that some aspects require LFAA-specific treatment. This viewpoint describes several items in relation to the LMC infrastructure of LFAA MCCS, mainly:

States and modes of the LFAA MCCS devices
The alarms mechanism and procedure
The events mechanism and procedure
The logging mechanism and data flow
The archiving mechanism and data flow
Reporting
Control and monitoring interfaces
Data generation and data flow

Figure 4-1: Component and Connector, high level context diagram.

The component and connector layout in Figure 4-1 shows the main interface points between the Telescope Manager (TM) and the LFAA LMC system. Communication across these two major elements happens based on:

1. Telescope Model updates
2. LFAA master control
3. Logging functionality and forwarding
4. Alarm handling and forwarding


The information flow is summarized in Table 4-1.

Table 4-1: TM to Element information flow description.
1. TM → LFAA - Telescope model download: The contents of this model are defined in [AD2].
2. LFAA → TM - Telescope model updates.
3. TM → LFAA - Commands and requests: The LFAA master is the single point of control of an element by TM.
4. TM → LFAA - Logs and log events: TM can request to set up the logging redirections required and what type of logging should be done.
5. LFAA → TM - Alarms: LFAA generates alarm events based on pre-defined rules and feeds this information to TM when alarms are triggered.
6. LFAA → TM - Raw log data: Based on a requested log configuration, LFAA sends logs generated within the system to the central logging interface at TM.

4.1 Notation

Throughout this document, the UML standard is used for notation. Some modifications have been made to help clarity, in particular to component and connector views. These modifications, related to colour-coding of ports, are explained in Figure 4-2.

Figure 4-2: Colour-coded notation for component and connector diagrams.


4.2 Context Diagram

Figure 4-3 shows the primary use cases of the LMC infrastructure. The use cases are limited by the LMC framework itself, and as such reflect the basic functionality provided by the TANGO framework, on which the LMC system is constructed. Most of the LMC functionality can be considered in reference to a TANGO device server providing functionality to different TANGO clients.

Figure 4-3: Context diagram for the main use cases for LMC infrastructure.

4.3 Primary Presentation

The primary presentation is split to detail external interfaces from LFAA MCCS to TM, and internal interfaces within LFAA LMC.

4.3.1 External Interfaces

The main interfaces for the LMC Infrastructure view can be opened up into a primary presentation showing the main ports of communication across the interfaces between TM and LFAA. This is shown in Figure 4-4. In particular, this figure shows the main external interfaces involved in LMC Infrastructure. Each facility should have a similar set of elements – an element master, an element alarm handler, an element logger, an element log store, a facility configuration database, a telescope state device if necessary, and the various LMC components required. A breakdown of the information flow for this primary presentation is given in Table 4-2.


Figure 4-4: Primary presentation component and connector diagram.


Table 4-2: Information flow between all major LMC infrastructure components and TM.
1. Leaf Node → LFAA Master - Command and attribute read/write requests: A request is made by the TM leaf node device responsible for LFAA, and the response is sent back.
2. Subarray Node → Subarray components - Command and attribute read/write requests: A request is made by the TM subarray node device responsible for a LFAA subarray, and the response is sent back.
3. Subarray components → Subarray Node - Events: Any events generated by subarrays which are subscribed to by the TM subarray nodes are sent over.
4. LFAA Master → Leaf Node - Events: Any events generated by the LMC which are subscribed to by the TM leaf node are sent over.
5. LFAA Master → Central Archiver - Attribute data: Any attributes that TM wishes to archive from the LFAA Master are sent over at predefined periods.
6. Central Alarms Handler → Element Alarm Handler - Command and attribute read/write requests: The element alarm handler can run specific alarm-related commands on request by TM.
7. Element Alarm Handler → Central Alarms Handler - Alarm events: Alarm events generated by the LFAA LMC system are sent towards the Central Alarms Handler.
8. Central Logger → Element Logger - Command and attribute read/write requests: The element logger can run specific logging-related commands on request by TM.
9. Element Logger → Central Logger - Logging events: Logging events generated by the LFAA LMC system are sent towards the Central Logger.
10. Element Log Store → Central Log Store - Raw log data: The central log store will keep a record of all logs generated by all elements of the SKA. The local store can maintain a full local copy of these logs, or a limited time window of logs.
11. TM Telescope State (per element) → Element Telescope State - Telescope model data: The element telescope state is subscribed to updates of the global telescope state.
12. Element Telescope State → TM Telescope State (per element) - Telescope model data: The TM telescope state is subscribed to updates of the local telescope state held by LFAA.


4.3.2 Internal Interfaces

To satisfy the requests and data flow towards the external interfaces, all elements within the LFAA LMC system will need a unified way of servicing execution and data requests. Although the TANGO-based infrastructure provides a fully peer-to-peer form of communication, the salient connections for internal interfaces within the LFAA LMC element are highlighted in Figure 4-5.

Figure 4-5: Primary presentation component and connector diagram. This shows the main internal interfaces involved in LMC Infrastructure.

Information flow between the various internal LMC infrastructure components is detailed in Table 4-3.

Table 4-3: Information flow between all major LMC infrastructure components.
1. LFAA Master → Element Telescope State - Command and attribute read/write requests: The LFAA master is the only internal device that should control the element telescope state device.
2. LFAA Master → Element Alarm Handler - Command and attribute read/write requests: The LFAA master is the only internal device that should control the element alarm handler.
3. Element Logger → LFAA Master - Command and attribute read/write requests: The element logger, receiving instructions from the central logger, can configure log subscriptions on the LFAA master.
4. Element Logger → Component Devices incl. Subarrays - Command and attribute read/write requests: The element logger, receiving instructions from the central logger, can configure log subscriptions on the component devices.
5. LFAA Master → Component Devices incl. Subarrays - Command and attribute read/write requests: The LFAA master can directly control all other component devices.
6. LFAA Master → Element Alarm Handler - Alarms: The element alarm handler subscribes to any alarm events coming from the LFAA master.
7. LFAA Master → Element Logger - Logs: The element logger receives any logs coming from the LFAA master.
8. Component Devices incl. Subarrays → LFAA Master - Events: The LFAA master can subscribe to any events generated by the components of the system.
9. Component Devices incl. Subarrays → Element Alarm Handler - Alarms: The element alarm handler subscribes to any alarm events coming from any component device.
10. Component Devices incl. Subarrays → Element Logger - Logs: The element logger receives any logs coming from any component device.
11. Element Alarm Handler → Element Logger - Logs: The element logger receives any logs coming from the alarm handler.
12. Element Telescope State → Element Alarm Handler - Alarms: The element alarm handler subscribes to any alarm events coming from the element telescope state.
13. Element Telescope State → Element Logger - Logs: The element logger receives any logs coming from the element telescope state.
14. Element Logger → Log Store - Logs: The log store receives all raw log data piped by the element logger.
15. Component Devices incl. Subarrays → Element Telescope State - Attributes: The telescope state device is aware of all telescope changes by subscribing to attribute changes on all element devices.

4.3.3 Interface Security

External access to the LFAA LMC system will be limited to the TM-LFAA interfaces defined in the TM-LFAA ICD. These limitations are a responsibility of the external interface and are expected to revolve around a setup of user databases, user roles, role-based functionality and privileges. Access limitations will be defined by SKA policies and settings (see comments in [RD4]); however, details are TBD.

4.4 Element Catalog

All elements for LMC infrastructure have a direct connection to the TANGO framework and are mostly representable as TANGO elements. The elements are based on an inheritance model of base class behaviour. An overview of these base classes is shown in Figure 4-6.


Figure 4-6: Base classes class diagram.


A summary of these LMC Infrastructure base classes is as follows:

1. TangoDevice: The lower-most class, and what the TANGO framework provides as a base for all devices. The attributes for this device are defined in the TANGO documentation.
2. TangoLoggerDevice: A device that implements the TANGO logging interface. The TANGO logging interface defines a logger device that has a log() method, which is used to define custom logging behaviour. This class inherits from the TangoDevice class.
3. SKADevice: An SKA-wide base class envisaged in [RD1]. The rationale and attributes are defined in detail in [RD1]. All devices developed for every SKA element must be built by inheriting from this class.
4. ElementAlarmHandlerDevice: A class based on the Elettra alarm device mechanism. This device must adhere to the alarm standards which the SKA will conform to.
5. LFAADevice: A class which serves as the base device for all device classes for LFAA. This device class inherits directly from the SKADevice class. This class is a placeholder for any attributes and commands that can be expected to be available across all LFAA devices for LMC purposes.
6. LFAAGroupDevice: A class which serves as a base for devices grouped in some meaningful way. The TANGO API provides ad-hoc creation of Group device proxies; however, such a proxy is only available within the scope of its creation and is therefore stateless. The idea of a group device is to have an actual TANGO device representing a group of devices, or possibly an aggregation of devices. This class is expected to harmonize the way groups of devices are interacted with within the LFAA system.
7. JobDevice: A class which serves as a base device for jobs, e.g. the calibration job, pointing job, transient buffer job etc. It provides unified job control and monitoring services.

Given these base classes, Table 4-4 defines which LMC infrastructure component maps on to which type of class.

Table 4-4: Class type for different LMC infrastructure components.
LFAA Master: LFAAGroupDevice
Element Alarm Handler: ElementAlarmHandlerDevice
Element Logger: TangoDevice
Component Devices incl. Subarrays: LFAAGroupDevice / LFAADevice
Element Telescope State: SKADevice
Log Store: n/a
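A minimal sketch of how this hierarchy could be expressed with the PyTango high-level API is shown below; the class, attribute and command names follow this document, while the method bodies and dimensions are illustrative stubs only:

from tango import DevState
from tango.server import Device, attribute, command

class SKADevice(Device):
    """SKA-wide base class envisaged in [RD1] (attributes omitted here)."""

class LFAADevice(SKADevice):
    """Base device for all LFAA devices, carrying the common LMC flags."""
    isHardwareDevice = attribute(dtype=bool)
    diagMode = attribute(dtype=bool)

    def init_device(self):
        super().init_device()
        self._diag_mode = False
        self.set_state(DevState.INIT)

    def read_isHardwareDevice(self):
        return False                      # overridden by hardware-wrapping devices

    def read_diagMode(self):
        return self._diag_mode

class LFAAGroupDevice(LFAADevice):
    """An actual TANGO device representing a group of devices."""
    memberList = attribute(dtype=(str,), max_dim_x=512)

    def init_device(self):
        super().init_device()
        self._members = []                # TANGO addresses of member devices

    def read_memberList(self):
        return self._members

    @command(dtype_in=str)
    def AddMember(self, address):
        self._members.append(address)     # register a member device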


4.4.1 Element Attributes

A full description of attributes for the LFAADevice and LFAAGroupDevice classes is provided in Table 4-5.

Table 4-5: Base class attribute descriptions.

LFAADevice:
- isHardwareDevice: A flag which lets interfacing components know whether this device wraps actual hardware, or whether the device is a higher-level software device. This allows filtering for hardware devices, for which specific care must be taken.
- diagMode: A flag to signal whether the device is running in diagnostic mode or not. This flag can also be used internally to trigger specific diagnostic metric collection when required.
- calledUndefinedDevice: A run-time flag to indicate that the name of the device is not defined in the database.
- calledDeadServer: A run-time flag to indicate that a particular device server is hung/dead.
- detectedDeadDatabase: A run-time flag to indicate that the TANGO database seems dead/unresponsive/not running.
- calledNonrunningDevice: A run-time flag to indicate that a device is defined in the database, but is not yet running.
- callTimeout: A run-time flag to indicate that a timeout has occurred.
- callCommFailed: A run-time flag to indicate that a client/server communication path has failed.
- invalidAsynId: A run-time flag to indicate a wrong ID for an asynchronous call.
- calledInexistentCallback: A run-time flag to indicate a reference to a non-existent callback.
- requestIdMismatch: A run-time flag to indicate a wrong ID when requesting an asynchronous result.
- expectedReplyNotReady: A run-time flag to indicate that an asynchronous reply was not ready when queried.
- experiencedSubscriptionFailure: A run-time flag to indicate that the server was unable to create a subscription - possible causes are wrong arguments, wrong parameter type, or a subscription that already exists.
- invalidEventId: A run-time flag to indicate that an invalid or expired event ID was used.

LFAAGroupDevice:
- memberList: A list of TANGO addresses of the devices composing this group.
- memberStates: An aggregated list of TANGO states for each member in the group.


A full description of methods for the LFAADevice, LFAAGroupDevice and JobDevice classes is provided in Table 4-6.

Table 4-6: Base class method descriptions.

LFAADevice:
- ExceptionCallback(): In case exceptions are not specifically handled, a default callback can be defined here. Exceptions encountered with no particular handling defined will cause this callback to be executed.
- DefaultAlarmOnCallback(): In the case an alarm rule involving this device has been crossed, and there is no particular callback configured for the rule, default alarm ON callback behaviour can be defined here.
- DefaultAlarmOffCallback(): In the case an alarm rule involving this device has gone back to normal, and there is no particular callback configured for the rule, default alarm OFF callback behaviour can be defined here.
- GetFullReport(): This method provides a full report of attributes and values, commands available, and state information for the device in a predetermined format, to be given to end users as requested.
- GetCommandReport(): This method provides a report on commands available to clients.
- GetAttributeReport(): This method provides a report on attributes and values for the device.
- ConstructDeviceProxyAddress(): A method that facilitates the construction of a TANGO client (DeviceProxy) when multiple TANGO hosts are present for control system resilience.

LFAAGroupDevice:
- AddMember(): Registers a device as a member of this composite group.
- RemoveMember(): De-registers a device as a member of this composite group.
- RunCommand(): A wrapper around running commands on a group proxy for this group of devices.

JobDevice:
- StartJob(): Starts the job.
- StopJob(): Terminates the job.
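A minimal sketch of how RunCommand() could wrap a TANGO group proxy is given below; tango.Group is the standard PyTango group API, while the group name, member addresses and error handling are illustrative:

import tango

def run_command(group_name, member_addresses, command_name):
    grp = tango.Group(group_name)              # ad-hoc, stateless group proxy
    for address in member_addresses:
        grp.add(address)                       # e.g. "low/tile/0001"
    replies = grp.command_inout(command_name)  # run the command on all members
    for reply in replies:
        if reply.has_failed():
            print(f"{reply.dev_name()}: command failed")
        else:
            print(f"{reply.dev_name()}: {reply.get_data()}")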

4.5 Element Behaviour

This section details the behavioural aspects of the various elements involved in the LMC infrastructure for LFAA.

4.5.1 State Transition

Figure 4-7 describes the state transition system for all software elements derived from the SKADevice class.


Figure 4-7: Derived state transition diagram for all TANGO devices in SKA LMC. Not all states are mandatory.


4.5.2 Reporting Behaviour

The activity diagram in Figure 4-8 describes the process that the LMC LFAA master device performs to generate an element-wide report. The generation of a main report involves a recursive call to generate reports from all subcomponents. If a subcomponent is a group device which has other subcomponents, then sub-report generation is in turn invoked recursively. This way, the system does not just generate a linear report of all devices in the system, but includes the hierarchy of system devices.

Figure 4-8: Activity diagram for LFAA Master report generation.


It is not always necessary to build an overall report picture via the LFAA Master device every time, however. Every device can be queried for an individual report. If the device itself is a group device, and therefore contains sub-members, then the member reports are also generated recursively.
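A minimal sketch of this recursive report generation is shown below, assuming each device exposes the GetFullReport() command and group devices expose the memberList attribute described earlier; the traversal itself is illustrative:

import tango

def full_report(device_name):
    dev = tango.DeviceProxy(device_name)
    report = {"device": device_name,
              "report": dev.command_inout("GetFullReport")}
    try:
        members = dev.read_attribute("memberList").value or []
    except tango.DevFailed:
        members = []                                        # leaf device: no members
    report["members"] = [full_report(m) for m in members]   # recurse into groups
    return report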

4.5.3 Alarm Behaviour

Figure 4-9 shows the basic behaviour TANGO devices apply to the quality property of attributes which cross predefined alarm thresholds. All attributes in all devices follow this behaviour.

Figure 4-9: Abstract attribute based alarm quality behaviour for TANGO devices.

At a higher (element) level, alarm conditions can be represented as attributes in alarm handler devices. The evaluation of a complex alarm rule, therefore, is summarized as an attribute, and that attribute behaves in the same way as shown in Figure 4-9. At a more systematic level, the alarm handler device makes use of this mechanism in a more general way to generate the right event and alarm notifications, as shown in the activity diagram in Figure 4-10. The alarm handler device keeps a constant, parallel check on all the required attributes involved in alarm conditions. If any alarm condition threshold is crossed, the quality is set to ALARM in the standard TANGO way. When this happens, the element alarm handler can send the required alarm notifications to the LFAA master and/or the central alarms handler at TM.

Additionally, when an alarm condition transitions to an alarm state or back to a valid state, the element alarms handler can call predetermined callbacks (commands on specific devices).
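The threshold mechanism itself is the standard TANGO one; a minimal sketch of configuring it from a client (the device and attribute names and threshold values are illustrative) is:

import tango

ap = tango.AttributeProxy("low/tile/0001/board_temperature")
cfg = ap.get_config()                  # current attribute configuration
cfg.alarms.max_warning = "60"          # quality becomes ATTR_WARNING above 60
cfg.alarms.max_alarm = "70"            # quality becomes ATTR_ALARM above 70
ap.set_config(cfg)

value = ap.read()
if value.quality == tango.AttrQuality.ATTR_ALARM:
    print("attribute crossed an alarm threshold")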


Figure 4-10: Element Alarm Handler activity diagram.


The alarm mechanism infrastructure must adhere to a timing constraint based on LFAA MCCS requirements. This architectural driver is represented in the sequence diagram in Figure 4-11. Notifications to the central alarms handler and the LFAA master device are sent asynchronously from each other. The period between detecting a triggered alarm condition and sending a notification is constrained to a one-second window. If the central alarms handler is subscribed to a particular alarm handler attribute, then a relay notification is also sent to the central alarms handler. This allows the central alarms handler to pick and choose which particular element alarms it wants to subscribe to. Furthermore, based on the general TANGO architecture, the central alarms handler could subscribe to event information directly from any TANGO device.

In addition to the alarm handler devices configured for alarm conditions on specific attributes, each alarm generated is logged to the element logger. Whilst this functionality is not strictly necessary, since alarm information (plus any derived information, such as a cross-match of other variables) can already be retrieved from the EDA, the system log can help present a more linear lifetime view of the system. If this functionality is eventually not needed, it can easily be removed from this sequence.

Figure 4-11: Alarm notification sequence time constraint.


4.5.3.1 General Alarms for all LFAA Devices

A number of alarms are set on all LFAA devices, and these alarms map onto a number of possible exceptions that can be caught in each device. For each type of exception (explained later in 4.5.7), an attribute reflecting the error type is set to TRUE. This allows the device to monitor runtime errors whose primary cause could be the device implementation itself. This setup also allows callbacks for particular alarms to be written for each device, for each of these exception-based alarm conditions. Each device will have a specific alarm callback implementation. Each alarm, irrespective of the source, is logged to the LFAA logging mechanism as described earlier, including all the available information in the alarm trace. However, these attributes allow for immediate monitoring and automated action to be taken.

Table 4-7: Common alarms across all LFAA devices.
1. calledUndefinedDevice = TRUE: The name of the device is not defined in the database.
2. calledDeadServer = TRUE: A particular device server is hung/dead.
3. detectedDeadDatabase = TRUE: The TANGO database seems dead/unresponsive/not running.
4. calledNonrunningDevice = TRUE: A device is defined in the database, but is not yet running.
5. callTimeout = TRUE: A timeout has occurred.
6. callCommFailed = TRUE: A client/server communication path has failed.
7. invalidAsynId = TRUE: Wrong ID for an asynchronous call.
8. calledInexistentCallback = TRUE: Reference to a non-existent callback.
9. requestIdMismatch = TRUE: Wrong ID for requesting an asynchronous result.
10. expectedReplyNotReady = TRUE: Asynchronous reply not ready when queried.
11. experiencedSubscriptionFailure = TRUE: Unable to create a subscription - could be wrong arguments, wrong parameter type, or a subscription that already exists.
12. invalidEventId = TRUE: An invalid or expired event ID was used.


4.5.4 Event Behaviour

Figure 4-12: Sequence diagram for event information flow across entities.

Figure 4-12 shows a sequence diagram of the information flow for events across the LMC infrastructure. Whenever a component publishes an event, the event is forwarded to all subscribed devices. For each event received, a callback mechanism may be invoked. In general, the callback mechanism can make use of any or all of the following options:

1. Push the event information to the Telescope Manager
2. Push the event information to the Central Logger
3. Push the event information to the Central Archiver

The receiving devices can also have their respective callbacks executed.
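A minimal sketch of this subscription mechanism using standard PyTango change events is shown below; the device and attribute names are illustrative, and the forwarding actions listed above would live inside the callback:

import tango

def on_event(event):
    if event.err:
        print("event error:", event.errors)
        return
    # A subscriber could push this information to TM, the Central Logger
    # or the Central Archiver at this point.
    print(event.attr_value.name, event.attr_value.value)

dev = tango.DeviceProxy("low/station/001")
sub_id = dev.subscribe_event("healthState", tango.EventType.CHANGE_EVENT, on_event)
# ... later, when no longer interested ...
dev.unsubscribe_event(sub_id)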


4.5.5 Logging Behaviour

The flow of information for log messages is demonstrated in Figure 4-13. All TANGO devices can log their behaviour at various log levels. The logs are sent to the designated log target within the element. At the very least, each element will have an element log target. This element log target is configured directly by the TM central logger. Instructions on log levels for specific devices are sent from the central logger to the element logger, and on to the respective devices.

The element logger implements the TANGO logging interface, which will, amongst other things, have the log information sent to the element log storage as syslog-compatible logs. Log viewers in general can connect directly to the element logger, which implements a standardized LogConsumer interface. This interface connects directly to the log storage for consuming/reading logs for log viewing.
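A minimal sketch of an element logger device is given below; the exact consumer command signature is defined by the TANGO logging interface, so the command name and the store-forwarding stub shown here are assumptions:

from tango.server import Device, command

class ElementLogger(Device):
    @command(dtype_in=(str,))
    def Log(self, record):
        # TANGO log records arrive as an array of strings
        # (timestamp, level, source device, message, ...).
        self._forward_to_store(record)

    def _forward_to_store(self, record):
        pass  # e.g. write a syslog-compatible line to the element log store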

Figure 4-13: Log message sequence diagram.


Optionally, the central logger at TM will be interested in any or all logs generated by the various elements. In this case, log messages are also forwarded to the central logger:

1. by the element logger itself;
2. by the component device generating the log; or
3. by syslog forwarding from the element log store to the central log store.

Generally, it is expected that element log stores do not keep a full history of logs, but only a specified time window. The central log store, however, will have much longer-term log storage.

4.5.6 Archiving Behaviour

Figure 4-14 describes the information flow for attribute archiving for TANGO-based devices. [RD1] uses the term 'archiving' in the context of collecting monitoring information (attribute values) with the intention of storing the information in an archive. Archived data may be used for diagnostic purposes, trending, maintenance prediction etc. LFAA makes provision for multiple clients to subscribe to any type of update (periodic, on change, and on thresholds); it is up to TM to select the information to be stored in the central archive. Whilst LFAA will internally be able to archive via the same archiving behaviour, it makes no provision for very long-term archiving.

Archiving is implemented using the TANGO design patterns, as follows:

Per-attribute archiving configuration is set by the TANGO Device itself; for each TANGO Device the attributes to be archived are identified during design, construction and commissioning; the archive event thresholds and period are set for each attribute that requires monitoring and archiving.

The Central Archiver, implemented by TM, accesses the LFAA TANGO Facility Database to identify and subscribe for the LFAA attribute archive events on each of the devices in the LFAA Facility Config Database.

LFAA responsibilities related to archiving are:

Expose all LFAA parameters that should be stored in the TM Engineering Database as attributes of the TANGO Devices registered with the LFAA Facility Config Database

Configure attribute properties, such as thresholds for generation of WARNINGs and ALARMs.

Configure attribute properties related to event generation, including: cadence for periodical reporting, report on change, increment/decrement to be reported, etc.

Note: the TANGO Device attribute properties, such as the absolute or relative change that triggers archiving and the cadence for periodic archiving, are part of the TANGO Device configuration. The standard practice is to populate (refresh) the archive at a slow rate and, in addition, to archive the value when a significant change is detected.
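A minimal sketch of setting these per-attribute archive-event properties from a client is shown below (the device and attribute names, thresholds and cadence are illustrative):

import tango

ap = tango.AttributeProxy("low/tile/0001/board_temperature")
cfg = ap.get_config()
cfg.events.arch_event.archive_abs_change = "0.5"  # archive when the value changes by >= 0.5
cfg.events.arch_event.archive_period = "60000"    # and at least once per minute (ms)
ap.set_config(cfg)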


Figure 4-14: Attribute archiving sequence diagram.

Architecturally, therefore, archiving can happen either at the element archiver or independently at the central archiver. Archived values are picked up by the element archive database or the central archive database, respectively.

4.5.7 General Exception Handling Flow

The general procedure for handling exceptions in LFAA LMC is shown in Figure 4-15. The exception types thrown by the TANGO framework are:

1. DevFailed
2. ConnectionFailed
3. CommunicationFailed
4. WrongNameSyntax
5. NonDbDevice
6. WrongData
7. NonSupportedFeature
8. AsynCall
9. AsynReplyNotArrived
10. EventSystemFailed
11. NamedDevFailedList
12. DeviceUnlocked


Figure 4-15: LFAA LMC exception handling flow

When an exception is caught, a tuple of three values gives information about the exception that is currently being handled. The values returned are (type, value, traceback). When one of the TANGO exceptions is caught, the type will be the class name of the exception (DevFailed, etc.) and the value a tuple of dictionary-like objects, each containing the following kinds of key-value pairs:

1. reason: a string describing the error type (more readable than the associated error code)
2. desc: a string describing in plain text the reason of the error
3. origin: a string giving the name of the (C++ API) method which threw the exception
4. severity: one of the strings WARN, ERR, PANIC, giving the severity level of the error

Depending on the exception type, there is a list of possible combinations and causes for the exception, determined by the TANGO system. This information is used by all devices inheriting from LFAADevice to marshal the appropriate action to take, and to provide an informative aggregation of the cause of the exception. The actions to be taken for the most important types of exceptions are summarized in Table 4-8.


Table 4-8: Attribute actions taken for different exception and error type combinations.

ConnectionFailed:
- DB_DeviceNotDefined (DevError info: the name of the device not defined in the database) → called_undefined_device = TRUE
- API_CommandFailed (DevError info: the device and command name) → if the device exists: called_dead_server = TRUE; else: called_undefined_device = TRUE
- API_CantConnectToDevice (DevError info: the device name) → called_dead_server = TRUE
- API_CorbaException (DevError info: the name of the CORBA exception, its reason, its locality, its completed flag and its minor code) → called_dead_server = TRUE
- API_CantConnectToDatabase (DevError info: the database server host and its port number) → detected_dead_database = TRUE
- API_DeviceNotExported (DevError info: the device name) → called_nonrunning_device = TRUE

CommunicationFailed:
- API_DeviceTimedOut (DevError info: the time-out value and device name) → call_timeout = TRUE
- API_CommunicationFailed (DevError info: the device and command name) → call_comm_failed = TRUE

AsynCall:
- API_BadAsynPollId → invalid_asyn_id = TRUE
- API_BadAsyn → called_inexistent_callback = TRUE
- API_BadAsynReqType → request_id_mismatch = TRUE

AsynReplyNotArrived:
- (any error type) → expected_reply_not_ready = TRUE

EventSystemFailed:
- API_NotificationServiceFailed → experienced_subscription_failure = TRUE
- API_EventNotFound → invalid_event_id = TRUE
- API_InvalidArgs → experienced_subscription_failure = TRUE
- API_MethodArgument → experienced_subscription_failure = TRUE
- API_DSFailedRegisteringEvent → experienced_subscription_failure = TRUE
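A minimal sketch of how a device inheriting from LFAADevice could marshal one of these exception types into the run-time flags of Table 4-8 is shown below; the flag names follow the table, while the helper itself is illustrative:

import tango

def handle_connection_failed(device, exc: tango.ConnectionFailed) -> None:
    reason = exc.args[0].reason               # DevError: reason/desc/origin/severity
    if reason == "DB_DeviceNotDefined":
        device.called_undefined_device = True
    elif reason in ("API_CantConnectToDevice", "API_CorbaException"):
        device.called_dead_server = True
    elif reason == "API_CantConnectToDatabase":
        device.detected_dead_database = True
    elif reason == "API_DeviceNotExported":
        device.called_nonrunning_device = True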

4.5.8 Job Device States and Modes

The following is a description of the states and modes that apply to any job device.

| Attribute | Range | Description and comments |
| adminMode (read-write) | | Set by an outside authority (Observatory operations via TM). |
| | ONLINE | The job can be used for processing during scientific observing. |
| | MAINTENANCE | The job is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the server is set to MAINTENANCE mode. |
| | OFFLINE | The job is not used at all; when adminMode = OFFLINE, the operational state = DISABLE. |
| | NOT_FITTED | Set by operations to suppress alarm generation. |
| opState (read-only) | | LFAA intelligently rolls up the operational state of all components used by the server and reports the overall operational state for the server. |
| | INIT | The job is being initialized. A check for the necessary daemons and services is required to make sure work can be submitted to this server. |
| | OFF | The job is turned off. |
| | ON | The job is turned on. |
| | ALARM | The quality factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available. |
| | DISABLE | The job is administratively disabled (adminMode = OFFLINE, NOT_FITTED or RESERVE); basic monitoring and control functionality is available, but no heavy operations can be performed. |
| | FAULT | An unrecoverable fault has been detected. The job is not available for use; maintainer/operator intervention is required. |
| | UNKNOWN | The job is unresponsive. |
| healthState (read-only) | OK, DEGRADED, FAILED | The LFAA intelligently rolls up attribute quality factors, states and other indicators for all components used by the server and reports the overall job healthState. |
| obsState (read-only) | | The job observing state indicates status related to scan configuration and execution. |
| | IDLE | The job is not processing input data and is not generating output products. |
| | CONFIGURING | Transient state entered when a command to e.g. restart the job is received. The transient buffer job leaves this state when re-configuration is completed. |
| | READY | The job enters READY when re-configuration has been completed and the transient buffer is ready to do data processing. |
| | SCANNING | The job has running processes doing data processing. |
| | ABORTED | The job transitions to this state when an 'abort scan' command is received. In this state, re-configuration and any other on-going processing functions are stopped. |
| | FAULT | An unrecoverable error that requires operator intervention has been detected. |
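As a minimal illustration, the observing-state values in the table above can be represented as a Python enumeration; the class name is an assumption of this sketch, not part of the architecture.

    from enum import Enum

    class JobObsState(Enum):
        """Job observing states, mirroring the table above."""
        IDLE = "IDLE"
        CONFIGURING = "CONFIGURING"
        READY = "READY"
        SCANNING = "SCANNING"
        ABORTED = "ABORTED"
        FAULT = "FAULT"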

4.5.9 Device Caching

As per the guidelines in [RD1], devices will support and implement attribute value caching in ring buffers. A device that is queried for an attribute value more frequently than the underlying hardware is actually polled can then return the last known value from the cache; the attribute history available depends on the size of the ring buffer. In this way, attribute polling and attribute value querying are kept separate.
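The sketch below illustrates this caching scheme with a simple Python ring buffer; a real device would hook this into the TANGO polling mechanism, and the class and method names are hypothetical.

    import time
    from collections import deque

    class AttributeCache:
        """Keep the last N polled values; reads never touch the hardware."""

        def __init__(self, depth: int = 64):
            self._ring = deque(maxlen=depth)  # oldest entries are discarded

        def record(self, value) -> None:
            """Called by the polling loop at the hardware polling period."""
            self._ring.append((time.time(), value))

        def last_known(self):
            """Called by fast clients; returns (timestamp, value) or None."""
            return self._ring[-1] if self._ring else None

        def history(self, n: int) -> list:
            """Return up to n of the most recent (timestamp, value) pairs."""
            return list(self._ring)[-n:]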

4.5.10 Multiple TANGO Database Servers

To run multiple TANGO database device servers against a single underlying database, the environment variable is set up as follows:

TANGO_HOST=<host_1>:<port_1>,<host_2>:<port_2>,<host_3>:<port_3>

These TANGO_HOST entries should reflect the hosts in the shadow master pool (shadow_master_pool_node_ids) of the ClusterManager device.

Each TANGO database device server listens on a different port and/or host but connects to the same TANGO database, and the multiple addresses are specified in the same TANGO_HOST. A TANGO client tries each address in sequence and connects to the first one that replies. This connection procedure is performed automatically by the TANGO framework.


4.5.11 Multiple TANGO Hosts

The TANGO_HOST environment variable refers to the host:port address of the TANGO database where a device is defined. Support for multiple TANGO_HOSTs means that a single client can communicate with devices from several TANGO_HOST databases by specifying the TANGO_HOST in the device name. For example, a client can talk to (i.e. create a DeviceProxy for) the following devices:

//host1:port1/my/device/1
//host2:port2/my/device/2

The client is then connected to multiple TANGO_HOSTs. If the TANGO_HOST is not specified in the name, the default TANGO_HOST is used. This feature is available in the TANGO framework to improve the reliability of a large TANGO control system. Within the LFAA "facility" there can be multiple TANGO hosts on different servers for resilience purposes. For this procedure to be transparent, client (DeviceProxy) creation should be automated to try alternative hosts in case one host is down. The list of hosts should match the entries in the shadow master pool. This process is shown in the activity in Figure 4-16.

Figure 4-16: Activity to construct a DeviceProxy client given multiple TANGO hosts deployed for control system replication and resilience.
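A minimal sketch of this activity, assuming PyTango, is given below: DeviceProxy creation walks the shadow master pool and returns the first host that responds. The pool contents and device name are hypothetical.

    import tango

    # Should mirror shadow_master_pool_node_ids in the ClusterManager device
    SHADOW_MASTER_POOL = ["host1:10000", "host2:10000", "host3:10000"]

    def make_proxy(device: str) -> tango.DeviceProxy:
        """Construct a DeviceProxy, falling back to alternative TANGO hosts."""
        last_error = None
        for host in SHADOW_MASTER_POOL:
            try:
                proxy = tango.DeviceProxy(f"tango://{host}/{device}")
                proxy.ping()  # verify that the host actually responds
                return proxy
            except tango.DevFailed as exc:
                last_error = exc  # host down; try the next one in the pool
        raise RuntimeError("no TANGO host in the pool responded") from last_error

    # proxy = make_proxy("my/device/1")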


4.6 LMC External Interface Element Catalog

4.6.1 LFAA Master Device

4.6.1.1 Class Diagram

Figure 4-17: Class diagram for the LFAA Master device (inherits from LFAAGroupDevice)

4.6.1.2 Element Behaviour

States and Modes

The states and modes for the LFAA Master device are detailed in [AD2].

Alarms

| # | Alarm Condition | Description |
| 1 | State = FAULT | |
| 2 | degraded_percentage > MAX | The acceptable degradation level is TBD |
| 3 | Any(member_states) = FAULT | If any of the member states is in FAULT, raise an alarm |


Activity - Checking for Capability and Compute Resource Availability

The observation management view goes into detail on many aspects of observation requests and processing. One process of particular importance at the hardware/resource level is the checking of both capability and compute resource availability. Its execution will depend on the off-the-shelf compute cluster software deployed; however, there is a generic flow by which resource availability is ascertained. At a high level, this activity is described in Figure 4-18. Resource allocation on the MCCS cluster can fail in the event of hardware or software failures.

LFAA Capabilities must be allocated to a Sub-Array before they can be configured and used in an observation. The resource allocation process allows LFAA to check the allocation and health status of all requested Capabilities and their supporting components, and to confirm that sufficient compute resources are available to support calibration and observations for the targeted Sub-Array.

LFAA Capabilities are allocated at the level of Sub-Arrays and can be assigned to at most one sub-array at any point in time. Consequently, a Sub-Array must release its allocated Capabilities before they can be reassigned to another Sub-Array. Allocation to a Sub-Array persists beyond scan and scheduling block boundaries; allocated Capabilities stay assigned to a Sub-Array until TM sends a command to release all resources to the host LFAA Sub-Array device, which causes the Sub-Array to release its allocated Capabilities back into the pool of unassigned resources. Resource and Capability allocation is accomplished by TM invoking an allocate command on the LFAA LMC Master TANGO device. This command accepts a JSON document as an argument, the format of which is described in [AD2]. A hypothetical example of such a call is sketched below.
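The fragment below is a hypothetical illustration of such a call, assuming PyTango; the actual JSON schema is defined in [AD2], so the field and device names shown here are assumptions rather than the real interface.

    import json
    import tango

    # Hypothetical allocation request; see [AD2] for the real schema
    allocation_request = {
        "subarray_id": 1,
        "stations": [1, 2, 3, 4],           # station IDs to allocate
        "station_beams": {"1": 2, "2": 2},  # beams requested per station
    }

    master = tango.DeviceProxy("lfaa/lmc/master")  # device name is illustrative
    reply = master.command_inout("allocate", json.dumps(allocation_request))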


Figure 4-18: Activity diagram showing the high-level process for the LFAA Master device to check for compute resource availability.


4.6.2 Subarray Device

4.6.2.1 Class Diagram

Figure 4-19: Class diagram for the LFAA Subarray device (inherits from LFAAGroupDevice)

4.6.2.2 Element Behaviour

States and Modes

The states and modes for the LFAA Subarray device are detailed in [AD2].

Alarms

| # | Alarm Condition | Description |
| 1 | State = FAULT | |
| 2 | degraded_percentage > MAX | The acceptable degradation level is TBD |
| 3 | Any(member_states) = FAULT | If any of the member states is in FAULT, raise an alarm |

4.7 Rationale

The behaviour and architecture described in this view are based on the rationale for a harmonized SKA-wide LMC infrastructure as described in [RD1].


5 Observation Management View

Observation management, that is, the ability to create and monitor observations, is the primary requirement of the LFAA software system (as well as of LFAA in general), and is therefore the primary driver for the architecture. Figure 5-20 shows the context diagram for observation management, that is, the relationship of LFAA with external entities. The information which flows between entities is described in Table 5-9.

Figure 5-20. Observation management context diagram

Table 5-9. Context diagram information flow description

| # | From | To | Information Flow | Description |
| 1 | TM | LFAA | TelState download | Subscribe to and receive updates to the TelState device |
| 2 | LFAA | TM | TelState upload | Publish any updates to the TelState device |
| 3 | TM | LFAA | Transient trigger | Trigger which instructs LFAA to transmit the transient buffer for a particular subarray to SDP |
| 4 | LFAA | TM | Subarray status | Status update notifications |
| 5 | TM | LFAA | GSM updates notification | Notification which instructs LFAA to get the latest GSM from SDP |
| 6 | TM | LFAA | Subarray config | The subarray configuration for creating a new observation |
| | TM | LFAA | Resource allocation | Request for availability of resources for configuring a subarray |
| | TM | LFAA | Subarray control | Commands to control a subarray, including start, stop and abort |
| 7 | LFAA | SDP | Transient buffer | The transient buffer for a particular subarray, sent when triggered by TM |
| 8 | SDP | LFAA | GSM updates | Updates to the GSM |
| 9 | LFAA | CSP | Station beams | The beamformed signals |
| 10 | TM | SDP | Subarray config | The subarray configuration for creating a new observation |
| 11 | SDP | TM | Transient trigger | When SDP detects a pulse it issues a trigger so that LFAA can be notified |
| 12 | SDP | TM | GSM updates notification | When SDP makes changes to the GSM it notifies TM so that LFAA can be notified |
| 13 | TM | CSP | Subarray config | The subarray configuration for creating a new observation |

Figure 5-21 lists the primary LFAA use cases for observation management involving TM, CSP and SDP.

Figure 5-21. Observation management use case diagram


5.1 Primary Presentation

Figure 5-22. Observation management primary presentation


Figure 5-22 shows the primary presentation (component-connector diagram) for observation management. This is essentially a high-level description of the LFAA components as expanded from Figure 5-20, with information flows redirected to the internal component which processes or generates the information. The interactions between these components for specific actions (such as observation creation and calibration) are described in greater detail in the sections below. Figure 5-23 shows a class diagram for the components presented above, listing properties which are of interest to this viewpoint. The connections between classes represent usage (not, for example, inheritance or dependencies).

Figure 5-23. Class diagram with properties relevant to observation management

5.1.1 Element Catalogue, Properties and Relationships

Table 5-10 and Table 5-11 describe the elements in the primary presentation and the associated properties listed in the class diagram. Note that the properties listed here are only those relevant to observation management and are not an exhaustive list of properties for each device. All devices have the standard LMC properties as specified in [RD1], including state, which specifies the current state of the device. Table 5-12 describes the relationships between the components in the primary presentation and describes some of the information flowing between components.


Table 5-10. Element catalog for components in the primary presentation

| # | Device | Multiplicity | Description |
| 1 | Element Master | 1 | Primary monitoring and control component. Receives the initial resource allocation request to determine whether enough resources are available to configure a subarray. Instructs the Local Sky Model to get updates from SDP when a GSM update notification is received. Instructs the Transient Buffer device to send the transient buffer to SDP when triggered. Is notified of all updates during the creation and runtime of observations and can provide summarised and detailed reports to TM |
| 2 | Local Sky Model | 1 | Contains the current copy of the GSM and is used by the calibration job to generate the model sky |
| 3 | TelState | 1 | Contains the current state and configuration of the telescope, including the observation configuration and state |
| 4 | Subarray | 1 per observation, 16 total | Represents a grouping of stations forming a subarray as configured by TM. Receives and processes the subarray configuration as well as subarray control commands. A scheduling block can configure only one subarray at a time. Contains a group proxy of all the stations forming the subarray |
| 5 | Station | 1..512 per subarray, 512 total | A device representing a group of Tiles which generate a station beam; part of a subarray. Responsible for synchronising Tiles and forming the beamforming ring, and for creating the calibration and DAQ jobs through the Cluster Manager as well as the Transient Buffer and Station Beam devices. Sends status updates to the associated subarray |
| 6 | Station Beam | 1..8 per station, 4096 total | Takes care of the pointing for a specific station beam. Creates the pointing job through the Cluster Manager. Sends status updates to the associated subarray |
| 7 | Cluster Manager | 1 | Responsible for creating and monitoring the calibration, pointing, DAQ and transient buffer jobs |
| 8 | Storage Manager | 1 | Responsible for providing a directory where the DAQ can store cross-correlation files to be used by the Calibration Job, as well as other temporary files for diagnostics |
| 9 | Tile | 1..16 per station, 8192 total | Part of a station; the Tiles ultimately generate the station beams. For each observation a Tile needs to be programmed, initialised and synchronised with all the Tiles forming the station. During an observation, pointing and calibration coefficients are written to the firmware and must be applied in sync across the station. Instructs the TPM to send channelized and other data to the DAQ for use in calibration and diagnosis. Monitors the board throughout the observation |
| 10 | Antenna | 16 per Tile | Monitors the antenna RMS to check stability and acts as a proxy for antenna-specific coefficients (which are forwarded to the Tile for application) |
| 11 | Transient Buffer | 1 per station | A TANGO device which manages the transient buffer for a particular station. Processes the trigger sent by TM, through the Element Master, and instructs the Transient Buffer job to send its current buffer to SDP |
| 12 | Transient Buffer Job | 1 per station | Buffers the station data for a particular station and sends this buffer to SDP when triggered |
| 13 | DAQ Job | 1 per station | Receives channelized and other data from the Tiles forming a station, correlates the data and generates cross-correlation files which are written to disk |
| 14 | Calibration Job | 1 per station | Reads the channelized data files generated by the DAQ job and runs a calibration algorithm to compute coefficients which are applied by the Tile device. Uses the Local Sky Model to create a model sky to which the computed visibilities are compared |
| 15 | Pointing Job | 1 per station | Computes the pointing coefficients for a station beam to point to the provided azimuth and elevation. Uses antenna and station positions as read from the inventory database. Forwards the calculated coefficients to the Antenna device, which instructs the Tile devices to write the coefficients to the running firmware |
| 16 | Bandpass Job | 1 per station | Reads integrated channel spectra generated by TPMs and acquired by the DAQ process and computes the scaling factors required to flatten all channels in all antennas. Also performs bandpass-based diagnostics |

Table 5-11. Element attributes for components in the primary presentation

| Device | Attribute | Description |
| Element Master | subarrays | Group proxy with all subarrays currently running within the LFAA LMC |
| Subarray Device | subarrayId | The subarray identifier |
| | stations | A group proxy with all the stations forming the subarray |
| Station Device | stationId | The station identifier |
| | tpms | Group proxy with all Tiles forming the station |
| | antennas | Group proxy with all Antennas forming the station |
| | subarrayId | ID of associated subarray |
| | calibrationJobId | The job ID for the calibration job submitted by the station |
| | daqJobId | The job ID for the DAQ job submitted by the station |
| | transientBuffer | Proxy to the Transient Buffer device created by the station |
| | dataDirectory | Parent directory for all files generated by the station |
| Station Beam Device | stationId | ID of associated station |
| | beamId | The beam identifier |
| | startChannel | Start channel of the channel group to use for generating the beam |
| | bandwidth | Bandwidth (number of channels) to use for generating the beam |
| | updateRate | The update rate in Hertz to use when updating pointing coefficients |
| | isLocked | Flag specifying whether the beam is locked to a target |
| | azimuth | Azimuth to point to (published to the associated pointing job when changed) |
| | elevation | Elevation to point to (published to the associated pointing job when changed) |
| Tile Device | tpmId | Global Tile identifier |
| | logicalTpmId | Logical (within station) Tile identifier |
| | stationId | ID of associated station |
| | subarrayId | ID of associated subarray |
| | antennas | Group proxy to antennas connected to the Tile |
| | ipAddress | LMC address (and global identifier) of the Tile |
| | lmcIpAddress | LMC IP address to (and from) which LMC data will flow |
| | lmcPort | LMC port to (and from) which LMC data will flow |
| | fortyGbDestinationIps | 40Gb destination IPs for all 40Gb ports on the Tile (source automatically set during initialization) |
| | fortyGbDestinationMacs | 40Gb destination MACs for all 40Gb ports on the Tile (source automatically set during initialization) |
| | fortyGbDestinationPorts | 40Gb destination ports for all 40Gb ports on the Tile (source automatically set during initialization) |
| | cspDestinationIp | CSP ingest node IP address for the station beam (used if the Tile is the last one in the beamforming chain) |
| | cspDestinationMac | CSP ingest node MAC address for the station beam (used if the Tile is the last one in the beamforming chain) |
| | cspDestinationPort | CSP ingest node port for the station beam (used if the Tile is the last one in the beamforming chain) |
| | firmwareName | Name and identifier of currently running firmware |
| | firmwareVersion | Version of currently running firmware |
| Antenna Device | antennaId | Global antenna identifier |
| | logicalAntennaId | Local (within Tile) antenna identifier |
| | tpmId | Global Tile ID to which the antenna is connected |
| | gain | The gain set for the antenna |
| | rms | The measured RMS of the antenna (monitored) |
| | delays | Delay for each beam to be applied during the next pointing update cycle (archived) |
| | delayRates | Delay rate for each beam to be applied during the next pointing update (archived) |
| | calibrationCoefficient | Calibration coefficient to be applied for the next frequency channel in the calibration cycle (archived) |
| | bandpassCoefficient | Bandpass coefficient to apply during the next calibration cycle to flatten the antenna's bandpass (archived) |
| | fieldNodeLongitude | Longitude of the field node (centre) to which the antenna is associated |
| | fieldNodeLatitude | Latitude of the field node (centre) to which the antenna is associated |
| | xDisplacement | Horizontal displacement in metres from the field node centre |
| | yDisplacement | Vertical displacement in metres from the field node centre |
| | altitude | Antenna altitude in metres |
| Storage Manager Device | totalDiskSpace | Total disk space in the cluster |
| | availableDiskSpace | Total available disk space in the cluster |
| | parentDirectory | Mounted global storage directory |
| Cluster Manager Device | servers | Group proxy to all compute servers in MCCS |
| Server Device | nodeId | Global server identifier |
| | oneGbIpAddress | Server LMC address |
| | fortyGbIpAddresses | 40Gb IP addresses for all 40Gb interfaces on the server |
| | fortyGbMacAddresses | 40Gb MAC addresses for all 40Gb interfaces on the server |
| Local Sky Model Device | sourceCatalog | Astronomical source catalogue |
| | lastUpdateTime | Timestamp at which the catalogue was last updated |
| | catalogueVersion | Current catalogue version |
| TelState Device | elementsState | States of all elements in the telescope |
| | observationStates | A list of running subarrays together with their configuration and state |
| | algorithms | Algorithms to use for pointing and calibration |
| | algorithmVersion | Algorithm versions to use for pointing and calibration |
| DAQ Job | stationId | ID of associated station |
| | nodeId | ID of server on which the job is running |
| | gpuId | ID of GPU on which the correlator is running |
| | networkInterface | 40Gb interface to which the DAQ is attached |
| | dataDirectory | Directory where files are written |
| | daqMode | DAQ mode (see DAQ element behaviour) |
| | integrationTime | Integration time for the correlator |
| | startChannel | Start channel ID to correlate |
| | nofChannels | Number of channels, starting from startChannel, to correlate |
| Transient Buffer Device | stationId | ID of associated station |
| | transientBufferJobId | The ID for the Transient Buffer job submitted by the Transient Buffer device |
| Transient Buffer Job | stationId | ID of associated station |
| | nodeId | ID of server on which the job is running |
| | sdpDestinationIp | SDP ingest node IP address for the associated station |
| | sdpDestinationPort | SDP ingest port for the associated station |
| Calibration Job | stationId | ID of associated station |
| | antennas | Group proxy with all antennas composing the associated station |
| | nodeId | ID of server on which the job is running |
| | dataDirectory | Directory from where files are read |
| | calibrationCoefficients | Calculated calibration coefficients, published to the respective antenna devices |
| | bandpassCoefficients | Calculated bandpass coefficients, published to the respective antenna devices |
| Pointing Job | stationId | ID of associated station |
| | antennas | Group proxy of all antennas composing the associated station |
| | nodeId | ID of server on which the job is running |
| | beamId | ID of associated beam |
| | azimuth | Beam azimuth to point to |
| | elevation | Beam elevation to point to |
| Bandpass Job | stationId | ID of associated station |
| | nodeId | ID of server on which the job is running |
| | period | Cadence of updating scaling factors |
| | integrationTime | Integration time of integrated channel data |

Table 5-12. Element relationships for components in the primary presentation

| # | Component A | Component B | Relationship Description |
| 1 | Element Master | Subarray | Once configured and running, the subarray provides the Element Master with status updates |
| 2 | Element Master | Local Sky Model | When the Element Master receives a GSM update notification, it instructs the Local Sky Model to download the changes from SDP |
| 3 | Subarray | Station | The subarray creates the required station devices with the required configuration and initialises them. In turn, the station provides it with status updates. The subarray also forwards transient triggers from the Element Master |
| 4 | Station | Station Beam | The station creates the required station beam devices and initialises them. In turn, the station beam provides the station with status updates |
| 5 | Station | Tile | The station programs, initialises and synchronises all the Tiles forming the station, including setting the beamforming configuration. The Tiles send status updates to the station throughout an observation |
| 6 | Station | Transient Buffer | The station creates a transient buffer device, which monitors and controls the transient buffer for that station. When the station receives a transient trigger, it notifies the transient buffer device |
| 7 | Station | Cluster Manager | The station submits a DAQ and a calibration job to the cluster manager, specifying on which nodes the jobs should run as well as some additional configuration. The cluster manager schedules and runs the jobs, returning a job ID which can be used to monitor the jobs |
| 8 | Calibration/DAQ Job | Station | Once the calibration and DAQ jobs are scheduled and running, they provide the station with status updates (through specific command calls) |
| 9 | Station Beam | Cluster Manager | Like #7, but submits a pointing job |
| 10 | Station Beam | Pointing Job | Like #8 |
| 11 | Cluster Manager | All jobs | Interfaces with the cluster management software to schedule, submit and monitor jobs as required by the station, station beam and transient buffer devices |
| 12 | Transient Buffer | Transient Buffer Job | Like #7, but submits a transient buffer job |
| 13 | Transient Buffer | Transient Buffer Job | Like #8. Also, when a transient trigger is received, it sends the current buffer to SDP as instructed by the associated station |
| 14 | DAQ Job | Storage Manager | Stores cross-correlation and other files through the storage manager |
| 15 | Calibration Job | Storage Manager | Reads cross-correlation and other files generated by the DAQ jobs through the storage manager |
| 16 | TelState | All jobs | The TelState acts as a configuration and state database which all jobs can query to get the required information |
| 17 | Antenna | Tile | The antenna device acts as a placeholder for calibration, bandpass and pointing coefficients. The Tile, when instructed by the station, applies these coefficients in sync to the running firmware. The antenna device also monitors the RMS of the associated antenna |
| 18 | Calibration Job | Antenna | The calibration job computes calibration and bandpass coefficients throughout the calibration cycle and writes them to the appropriate attributes in the antenna device |
| 19 | Pointing Job | Antenna | Like #18, but for pointing coefficients |
| 20 | Bandpass Job | Antenna | Like #18, but for bandpass scaling factors |
| 21 | Calibration Job | Local Sky Model | The calibration process uses the source catalogue in the Local Sky Model to generate a model sky |

5.2 Element Behaviour

The following sections define element behaviour from different perspectives and use cases throughout the life cycle of an observation, including observation creation, monitoring and tear-down, as well as specific functionality such as pointing and calibration. They are split into separate sections for readability. Figure 5-25 depicts the main stages which need to be performed for configuring a subarray (here referred to as observation creation), and for monitoring and stopping the observation. Note that components with an activity sign represent parts of the activity diagram which are described with more detailed diagrams further on in this section.


5.2.1 Observation state machine

Table 5-13. Observation states

| State | Description |
| IDLE | The subarray is not processing data and is not generating output. It is not associated with any Stations or Tiles. Tiles are in low-power mode. |
| CONFIGURING | Transient state entered when a command to configure the Subarray is received. Once finished, there is an automatic transition to the READY state. |
| READY | Subarray configuration is complete. Stations, Tiles and Station Beams are assigned, Tiles are ready to process signals and jobs are ready to start processing. |
| OBSERVING | The Subarray is generating calibrated, pointed station beams. Any parameters that require updating during the observation are being updated. |
| ABORTED | The Subarray transitions to this state when an abort command is received. In this state Tiles are switched to low-power mode and jobs are terminated. Subarray allocations are cleared and made available for the next configuration. |
| FAULT | An unrecoverable error that requires operator intervention has been detected. |

Figure 5-24 shows the Subarray device state diagram. These states and their transitions are defined in Table 5-13.

Figure 5-24. Observation state diagram
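The transition structure can be captured as a simple table. The sketch below is illustrative only: the transition set is inferred from Table 5-13, the state diagram and the command descriptions in Section 5.2.3, and is not normative.

    # Allowed subarray state transitions (inferred, illustrative only)
    ALLOWED_TRANSITIONS = {
        "IDLE":        {"CONFIGURING"},
        "CONFIGURING": {"READY", "FAULT"},
        "READY":       {"OBSERVING", "ABORTED", "IDLE", "FAULT"},
        "OBSERVING":   {"READY", "ABORTED", "FAULT"},
        "ABORTED":     {"CONFIGURING", "IDLE"},
        "FAULT":       {"IDLE"},
    }

    def transition(state: str, target: str) -> str:
        """Return the new state, rejecting illegal transitions."""
        if target not in ALLOWED_TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {target}")
        return target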


5.2.2 Observation configuration

Observation configuration involves almost all the elements shown in the primary presentation, as well as most activities in Figure 5-25. It involves the creation of the Station and Station Beam TANGO devices, as well as the submission, instantiation and execution of all jobs. Note that deployment details, such as how jobs and device servers will be distributed over compute servers for load balancing, minimal latency and fault tolerance, can be found in [RD3].

Figure 5-25. Observation creation activity diagram

Figure 5-26 describes all the checks required to make sure that enough resources are available for running the requested observation. In cases where some resources are missing, an "insufficient resources" reply is generated. The resources which need to be available are listed below (a sketch of these checks follows the list):

- An unassigned Subarray device
- The required number of unassigned station devices (the total number of stations does not exceed 512)
- The required antennas are not in use, in maintenance mode, or being configured for a different subarray configuration
- The associated Tiles are not in use, in maintenance mode, or part of another subarray
- All links and support hardware from the antennas and Tiles, as well as the cabinets holding the boards, are online and working. This includes:
  - The APIUs powering and controlling the antennas (this is a redundant check since the antenna checks would already have been performed)
  - Network switches involved in the beamforming chain for each station
  - Network switches connecting the cabinets involved to the MCCS cabinets
  - PSU and rack management board per cabinet if required (could be required in extreme cases where power consumption needs to be strictly controlled)
- Compute resources are available for calibration, DAQ, bandpass, pointing and the transient buffer. The LFAA Master can estimate the required resources for a given subarray configuration and will interact with the Cluster Manager device to check that these resources are available and can be used. See [AD4] for more details on the mapping between software and hardware in LFAA
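The sketch below expresses these checks in code. It is illustrative only: the predicate functions are hypothetical stand-ins for queries on the relevant TANGO devices and the Cluster Manager, stubbed here so that the example is self-contained.

    from typing import Optional

    # Stub predicates standing in for queries on the relevant TANGO devices
    def unassigned_subarray_available() -> bool: return True
    def free_station_count() -> int: return 512
    def antennas_and_tiles_free(station_id: int) -> bool: return True
    def support_hardware_online(station_id: int) -> bool: return True
    def cluster_has_capacity(config: dict) -> bool: return True

    def check_resources(config: dict) -> Optional[str]:
        """Return None if the subarray can be configured, else a reason."""
        if not unassigned_subarray_available():
            return "insufficient resources: no unassigned Subarray device"
        if free_station_count() < len(config["stations"]):
            return "insufficient resources: not enough unassigned stations"
        for station_id in config["stations"]:
            if not antennas_and_tiles_free(station_id):
                return f"insufficient resources: station {station_id} in use"
            if not support_hardware_online(station_id):  # APIUs, switches, PSUs
                return f"insufficient resources: station {station_id} hardware"
        if not cluster_has_capacity(config):  # calibration, DAQ, pointing, ...
            return "insufficient resources: compute capacity"
        return None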


Figure 5-26. Resource availability checks activity diagram

After the checks are performed, TM is notified whether the resources to configure and run the subarray are available. If the resources are available, the subarray ID and FQDN are provided to TM so that it can then issue a configuration request to the subarray directly. Figure 5-27 describes the high-level steps required to create a station, which are described in greater detail in subsequent diagrams. The steps include:

1. The required station devices are assigned to the subarray. Note that operations on different stations are performed concurrently, as shown in Figure 5-25.
2. The subarray sends the configure command to each station, which starts the initialisation procedure. The station gets the required configuration parameters from the parameters provided in this call, as well as from the TelState device.
3. The station programs, initialises and synchronises all the Tiles allocated to that station in parallel, as described in greater detail in Figure 5-28.
4. The station creates the DAQ, Calibration and Bandpass jobs by submitting a request to the Cluster Manager device. Figure 5-30 describes general job submission and creation within LFAA. Bandpass job creation is not shown in the figure to reduce clutter; however, it is the same operation as that for the calibration and DAQ jobs.
5. The station sends a configure command to the associated Transient Buffer device.
6. The Transient Buffer device creates the Transient Buffer Job by submitting a request to the Cluster Manager. (Note that the creation of the Calibration Job, the creation of the DAQ jobs and the configuration of the Transient Buffer device [steps 4 and 5/6] are performed in parallel.)
7. Once everything is configured, the calibration cycle starts. Note that at this point signal processing has already started on the Tiles.
8. Once the stations are calibrated, the subarray then waits for the start scan command from TM.


Figure 5-27. Station creation sequence diagram

Figure 5-28 shows the sequence of actions which need to be performed on the Tiles to program, initialise and synchronise them, as well as to form the beamforming chain to create a station. Note that all Tiles within a station are programmed with the same firmware, the only difference being that the last Tile in the chain is configured to send the generated station beams using the SPEAD protocol defined in the LFAA-CSP ICD. For more details on Tile configuration, see [RD2]. The steps included in this diagram are:

1. The Subarray calls the initialise_station() command on the Station, providing the required parameters to set up the station.
2. The station device programs and initialises all the Tiles in parallel. It first gets the currently loaded firmware on the Tile. If there is one and it matches the one needed for the observation, the Tile is not re-programmed; otherwise the firmware is loaded onto the Tile.
3. The station calls the initialise() command on the Tile.
4. The Subarray calls the form_station() command on the Station. This instructs the Station to form the beamforming chain to generate the station beams.
5. The Station gets the 40Gb configuration from all the Tiles, which is statically set.


Figure 5-28. Station formation activity diagram


6. When the Tile is initialised, it sends a reply to the Station, which in turn sends a reply to the Subarray notifying it that all Tiles are ready.
7. Each Tile's 40Gb lane destination IP, MAC and port are then set as the next Tile's (in the chain) respective lane's source IP, MAC and port. This is performed for all Tiles except the last one in the chain, whose parameters will be set to the associated CSP ingest node's source IP, MAC and port when the command to start the observation is received. Once finished, the Station sends a reply to the Subarray.
8. The Subarray calls configure_beamformer() on the Station, which in turn applies the configuration to the Tiles. This instructs the firmware on how many station beams will be formed and which coarse frequency channels belong to which beams. Pointing-related functionality is performed by the Station Beam devices.
9. The next step is to first synchronise the FPGAs on each Tile, and then synchronise the Tiles across the station. This step is presented at a very high level here: the Station first waits for a PPS cycle and then performs the required actions to attempt synchronisation. If this fails, the steps are repeated until either the FPGAs and Tiles are synchronised, or a pre-set loop limit is reached, in which case synchronisation fails (which can be caused by, for example, a malfunctioning PPS, malfunctioning devices on Tiles, high network latencies due to loose cables or malfunctioning switch ports, and so on).
10. A synchronisation report (delays from sync for all Tiles) is composed and returned.

Once the Tiles are configured, the Calibration and DAQ jobs are created as per Figure 5-27, and then the Station Beam devices are assigned to their respective stations. Each Station Beam device creates a Pointing Job through the cluster manager, as shown in Figure 5-29.

Figure 5-29. Station Beam configuration activity diagram


Figure 5-30. Job creation sequence diagram

The fact that jobs are created through the cluster has been mentioned in relation to previous diagrams. Figure 5-30 describes the steps required to submit and create a job. Here the Station device is shown as the component initiating the job creation; however, this could be any other TANGO device. Observation Job is a placeholder for any job which needs to be created:

1. The initiating device calls the check_resources() command on the Cluster Manager device to check whether there are enough resources available to launch the job.
2. The Cluster Manager replies with a list of nodes on which the job can run.
3. If resources are available, the initiating device calls the submit_job command on the Cluster Manager, providing the node_id on which to run the job and the station_id to associate the job with the station.
4. The Cluster Manager submits the job to the requested node, providing the station_id as a parameter.
5. Once the job is instantiated, it gets the station configuration from the creating device, as well as the algorithm and version to use.
6. Once the job is configured, it creates a proxy to the initiating device and sends a job_configured event.

A sketch of this handshake is given after the notes below. When a job is running, it can update the initiating device through the created proxy by calling a specific command (there is a command per job type). Some additional notes on specific devices:

- The DAQ and Transient Buffer jobs need to be bound to a 100Gb interface to receive data from TPMs. This can be a physical or virtual interface. The interface IP, MAC and port need to be communicated back to the station so that these values can be set on the Tiles when the observation is started.
- The Calibration and DAQ jobs need a data directory from which to read and write data, which is read by the jobs themselves through a station proxy.
- The Calibration, Transient Buffer and DAQ jobs start running as soon as they are configured, waiting for incoming data (DAQ) or correlation files (Calibration) to process. This data flow starts when the observation is started.
- The Pointing Jobs need delay polynomials to calculate the delay and delay rate per antenna for each station beam. The polynomials are supplied by TM periodically.
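The sketch below illustrates this handshake, assuming PyTango. The check_resources and submit_job command names follow the text; the device name, argument encoding and reply format are assumptions of this example.

    import json
    import tango

    cluster = tango.DeviceProxy("lfaa/lmc/cluster_manager")  # illustrative name

    def create_job(station_id: int, job_type: str) -> str:
        """Submit a job for a station and return the job ID (steps 1-4)."""
        # Steps 1-2: ask the Cluster Manager for nodes able to host the job
        nodes = json.loads(cluster.command_inout("check_resources", job_type))
        if not nodes:
            raise RuntimeError(f"insufficient compute resources for {job_type}")
        # Steps 3-4: submit to a suitable node, tagged with the station ID
        job_id = cluster.command_inout(
            "submit_job",
            json.dumps({"node_id": nodes[0],
                        "station_id": station_id,
                        "type": job_type}))
        # Steps 5-6 (the job reads its configuration and emits a
        # job_configured event to the initiating device) happen in the job.
        return job_id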

At this point the subarray is configured; however, it must first be calibrated. A full calibration cycle might need to be performed before the stations are ready to generate well-calibrated beams. Once calibrated, the array is ready, but the TPMs are not yet generating station beams (they are transmitting calibration spigots).


Once all the stations are calibrated TM can call the start command on the subarray, as described in the following section.

5.2.3 Subarray Control

Once the Subarray is configured, it is ready to start the observation when commanded by TM. This is shown in Figure 5-31. It should be noted that TM will periodically update the beam tracking polynomials for each station beam. This must be performed at least once before the observation is started, otherwise the pointing jobs will simply point all the beams to zenith. The following steps are performed when the Subarray receives the start command:

1. TM sends the start command to the Subarray.
2. The Subarray calls the start command on all associated Stations in parallel.
3. Each Station finalizes the configuration on its Tiles. This includes:
   a. Setting the CSP ingest node IP, MAC and port as the destination parameters for the final Tile in the chain
   b. Configuring the beamformer on all Tiles, essentially defining the station beams which need to be produced
   c. Instructing the Tiles to start transmission of data
4. Once all Tiles are configured, the Station returns a reply to the Subarray.
5. The Subarray in turn waits for all Stations to finalize their configuration and returns a reply to TM once configuration is finished.

At this point signals are being processed and station beams are being sent to CSP. Throughout the observation, calibration and pointing coefficients are calculated and updated, and control data from the Tiles is received and processed accordingly. At any point TM can issue subarray control commands which affect the lifecycle of the observation, as listed below:

Table 5-14. Subarray control commands which can be called on a running Subarray

| Command | Effect on running observation |
| stop | The current observation is stopped, and the observation moves back to the READY state. Data output to CSP is stopped. Jobs and Tiles are left configured so that if the next observation requires the same parameters the devices do not have to be re-configured. |
| abort | Abort moves the subarray to the ABORTED state. The possible state changes from this are to the CONFIGURING and IDLE states, which means that all resources can be freed up (to be re-used later). Output to CSP is first stopped to avoid invalid data being transmitted while aborting the observation. All running jobs are terminated (through the initiating device via the Cluster Manager device). Tiles are de-configured (but not put in low-power mode). Station, Station Beam and Tile devices are unassigned. |

When the subarray receives a reset command while in the READY state, the same operations as abort above are performed. Additionally, the Tiles are de-programmed and placed in low-power mode. This also happens when the command is received whilst in the FAULT state. At any point in time errors can occur which can degrade the quality of the observation, as described in Section 5.2.10.


Figure 5-31. Observation start activity diagram

5.2.4 Antenna Equalization

One of the operations performed by the Tile during initialization is antenna equalization, where an attenuation/gain factor is calculated for each antenna such that the RMS levels of all antennas within a station are within an acceptable range. The calculated factors are written to the appropriate registers in the running firmware and should remain valid for the duration of an observation. A very simple antenna equalization procedure is described below (performed for each antenna):

1. Get the current antenna RMS value as read from the ADC.
2. Calculate the difference between the read value and the ideal value (provided as input, depending on the optimal signal strength required by the firmware).
3. Transform this difference to decibels (take the logarithm to base 10 and multiply by 10). This is required since the attenuation and gain levels are measured in decibels.
4. Write the computed attenuation/gain to the appropriate TPM register for the antenna.


This procedure can detect malfunctioning antennas in cases where the computed attenuation/gain is outside the value range accepted by the TPM. The average power is set whenever a station is initialised and checked at the configuration of a new observation; if the power is outside a settable bound, the gains are re-equalized. A sketch of this procedure is given below.
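A minimal sketch of the equalization computation follows; the per-antenna "difference" of step 2 is interpreted here as the ratio of ideal to measured RMS expressed in decibels, and the TPM's accepted attenuation range is an assumed placeholder.

    import math

    TPM_ATTENUATION_RANGE = (-6.0, 6.0)  # dB; placeholder limits

    def equalize_antenna(measured_rms: float, ideal_rms: float):
        """Return the attenuation/gain (dB) bringing the antenna RMS to the
        ideal level, or None if outside the range accepted by the TPM
        (flagging a potentially malfunctioning antenna)."""
        # Steps 1-3: compare measured RMS to the ideal level, in decibels
        adjustment_db = 10.0 * math.log10(ideal_rms / measured_rms)
        low, high = TPM_ATTENUATION_RANGE
        if not low <= adjustment_db <= high:
            return None  # antenna flagged as potentially faulty
        # Step 4: the caller writes adjustment_db to the TPM register
        return adjustment_db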

5.2.5 Pointing

Pointing refers to the calculation of beamforming coefficients which, when applied electronically by the TPMs, steer the beam to the desired location. This location is provided by TM as a Sky Coordinate set, which is composed of:

1. Activation time: the time in UTC at which LFAA should start applying the polynomial (milliseconds since the UNIX epoch)
2. Azimuth position: 0th-order azimuth coefficient
3. Elevation position: 0th-order elevation coefficient
4. Tracking speed: 1st-order azimuth coefficient (optional)

The Pointing process is responsible for periodically calculating the delay (based on the required azimuth and elevation) and delay rate (based on the tracking speed) for each antenna contributing to a station beam, with the update rate provided by TM during Subarray configuration. Figure 5-32 shows the steps required to update a station beam's pointing:

1. TM issues an updated polynomial (not shown) to the Subarray.
2. The Subarray forwards the polynomial to the Station Beam, which informs the associated Pointing process of an updated Coordinate set.
3. The Pointing process reads the required antenna locations from the respective Antenna devices (if not already cached), and the Coordinate set from the Station Beam device.
4. The Pointing process computes the delay and delay rate per antenna, using the station centre as the phase centre. The TPM firmware calculates the pointing coefficients for each coarse frequency channel once the delays are downloaded.
5. The Pointing process publishes the computed delays and delay rates to the respective antennas, and informs the Station Beam device that the computation has finished.
6. The Station Beam device informs the Station that a new set of delays is available for download (the Station device is responsible for downloading the delays to the associated Tiles in a synchronous manner, which is why the Station Beam device has to go through the Station device).
7. The Station downloads the delays to the respective TPMs.
8. Steps 4-7 are performed periodically, with the period defined by TM.
9. When an updated Coordinate set is received, steps 1-8 are performed.

This scheme allows for different observing modes. For example, by omitting the tracking speed, the station beam remains fixed on a specific location in the sky, as required by ECP-150005. Periodically updating the Sky Coordinate (including a tracking speed) allows for tracking celestial objects (including objects within the Solar System) and man-made objects (satellites). A sketch of the per-antenna delay calculation is given below.
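The sketch below shows a simple geometric form of the per-antenna delay and delay-rate calculation; a production pointing job would use full astrometric conversions, so this is illustrative only. It assumes an east-north displacement (x, y) of the antenna from the station centre (the phase centre).

    import math

    C = 299792458.0  # speed of light, m/s

    def antenna_delay(x_m: float, y_m: float,
                      az_rad: float, el_rad: float) -> float:
        """Plane-wave delay (s) for an antenna displaced (x east, y north)
        from the station centre, for a beam towards (azimuth, elevation)."""
        return (x_m * math.sin(az_rad) + y_m * math.cos(az_rad)) \
            * math.cos(el_rad) / C

    def antenna_delay_rate(x_m: float, y_m: float, az_rad: float,
                           el_rad: float, az_rate_rad_s: float) -> float:
        """First-order delay rate from the 1st-order azimuth coefficient."""
        return (x_m * math.cos(az_rad) - y_m * math.sin(az_rad)) \
            * math.cos(el_rad) * az_rate_rad_s / C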


Figure 5-32. Pointing sequence diagram

5.2.6 Calibration

Instrumentally and environmentally induced gain and phase offsets must be corrected so that antennas can be beamformed accurately. This is the job of the Calibration process. The array is calibrated one coarse frequency channel at a time (only the usable frequency channels need to be calibrated), and the calibration cycle is 10 minutes, meaning that each frequency channel must be re-calibrated every 600 seconds. Each channel therefore needs to be calibrated in ~1.5 s (such that all 384 channels can be calibrated in 10 minutes). The following lists the high-level operations which need to be performed to calibrate a single channel, as also explained in Figure 5-33.


Figure 5-33. Calibration overview sequence diagram

1. The Local Sky Model is read by the Calibration process and used to generate the model sky for this calibration cycle (not shown in Figure 5-33). This is performed in parallel whilst raw channel data is being received, correlated and calibrated, since the start and end times of the received time samples for every channel can be accurately estimated (assuming an appropriate time window is used).
2. Raw channel data needs to be transmitted by all the TPMs forming part of a station. This data is used for calibration (and diagnostics) and is not transmitted to CSP. It is directed towards an MCCS node, assigned during initialization, on which a DAQ process is running.
3. The DAQ process reads in this data and buffers it for correlation. This data stream amounts to ~6.4 Gbps.
4. Once all the time samples for a frequency channel are received (that is, the stream switches to a new frequency channel), the buffer is marked as ready and copied to GPU memory.
5. The GPU correlator computes the auto- and cross-correlations of the data and integrates the entire buffer to a single correlation matrix.
6. The correlation matrix is saved to disk. A different file is generated for every frequency channel.
7. Once the file is written, the Calibration process is notified (it monitors the directory for new files).
8. Assuming a standard calibration algorithm implementation, the difference between the sky model and the acquired visibilities is minimized, generating a set of coefficients which describe the difference between the two.
9. The generated coefficients are sent to the Station device.


10. The Station device then distributes the calibration coefficients to its Tiles, which download them to the TPMs.
11. The Tile devices also distribute the calibration coefficients to the respective Antenna devices (not shown), where they are archived for diagnostic purposes. These coefficients are kept in the LFAA archive for several days.
12. TM also archives the generated calibration coefficients (not shown).

The above description is an overview of the actions performed during calibration. It should be noted that there is no direct communication between the Tile device, the DAQ process and the Calibration process. When a Tile is instructed to start sending channelized data, it forwards this instruction to its associated TPM, which starts sending out a continuous data stream. In the meantime, the DAQ process is already initialised and waiting for incoming data on a network interface. When packets arrive, they are buffered for processing, and eventually the data is correlated and saved to disk. During this time, the Calibration process is already initialized (and generating the model sky) and waits for new correlation matrices to be written to disk.
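A minimal NumPy sketch of the correlation step (step 5 above) is given below; a production implementation would run on the GPU, and the array shapes are assumptions of this example.

    import numpy as np

    def correlate(buffer: np.ndarray) -> np.ndarray:
        """Integrate a buffer of channelized voltages for one coarse channel,
        shape (n_antennas, n_samples), into a single (n_antennas, n_antennas)
        auto/cross-correlation matrix."""
        return buffer @ buffer.conj().T / buffer.shape[1]

    # Example: 256 signal chains, 16384 time samples for one channel
    # rng = np.random.default_rng()
    # x = rng.standard_normal((256, 16384)) + 1j * rng.standard_normal((256, 16384))
    # R = correlate(x)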

Figure 5-34. Calibration procedure timing diagram (UML) for one frequency channel

Figure 5-34 shows the interaction between the main components of the calibration procedure, as well as the timing constraints of the various actions performed. Note that only the operations for a single frequency channel are shown. The only action in this figure which is applicable to all channels is the command from the Station to the Tile to transmit channelized data (since only one command is required to send data for all frequency channels). In relation to this, Figure 5-35 shows how multiple frequency channels can be processed in parallel. Note that in the DAQ, packet reception, buffering and correlation are performed in parallel through different threads.


Figure 5-35. Calibration timing diagram showing how frequency channels are processed in parallel

The figures above provide timing constraints for the calibration algorithm adopted. Note that the actual calibration algorithm used to calibrate the LFAA is out of scope for this document. This sub-section provides the pipeline in which the algorithm should be embedded (the Calibration process), as well as its timing constraints. Essentially, the implementation of the calibration algorithm should be able to calibrate a station within the integration time of the correlation data (~1.5 s in Figure 5-34). This includes the time required to generate the model visibilities for each channel every 10 minutes, although this can be performed in a separate processing thread. This does not limit the amount of processing resources which can be consumed by these two processes for generating the calibration coefficients, within the limits imposed by [RD3].

Figure 5-36. Calibration process information flow diagram


Figure 5-36 shows how several software components provide the information required as inputs to the model sky generator and calibration algorithm, and how the generated output flows back to these components. The model sky generator and calibration algorithm are shown in a bounded, shaded box, which represents an abstracted view of the internal composition of the Calibration process. The Beam Model and Antenna Positions only need to be generated/provided once per observation configuration, since they depend on the array shape and antennas used. The Sky Model needs to be updated periodically, at a rate determined by the pointing scheme (static or tracking), beam width and sky rotation. Measured visibilities and calibration coefficients are generated once every ~1.5 s, for one channel at a time. The calculated gain and phase solutions might not be of good quality (as defined by the calibration diagnostics below); in this case it might be better to use older calibration solutions.

5.2.7 Calibration Diagnostics

The behaviour of the antenna calibration coefficients is a very efficient diagnostic tool for monitoring the health and performance of the individual signal chains, and therefore of stations. Simple statistics will be calculated on each set of antenna coefficients during the calibration cycle and compared against expected values taken from the previous cycle and archived data. Values which are beyond limits (based on expected behaviour) can be flagged; antennas which continue to behave unexpectedly can be removed (gains set to zero), and TM is informed of the updated availability. Some local environmental conditions, including transient RFI, can result in the calibration algorithm failing to converge. In this case the prudent action is to extrapolate the gain coefficients from the previous values or a suitable model, rather than potentially introducing an error in the data sent onwards; however, this should be under the control of the observation configuration and reported as part of the data health statistics.

The list below presents a non-exhaustive list of tests which can be performed on the calculated gain coefficients:

- X and Y phase and gain values for any single antenna should vary slowly and similarly over time; those which differ from the previous value by more than a threshold amount can be flagged.

- The phase values should follow an increasing slope with frequency and can be compared to a linear model. Those that diverge beyond a limit can be flagged.

- The phase coefficients should be correlated with the calculated delay from the station phase center, and this can be predicted.

- Amplitude coefficients of zero indicate a faulty antenna signal chain; comparing with geographic location and TPM assignment localises groups of faults to LRUs (for example, low signal on both X and Y indicates an RFoF fault, a group of 16 would indicate a power supply or TPM fault, etc.).

Failing calibrations will be reported upwards through the associated control devices for rolling up into health reports sent back to TM. The subarray device will have calibration-related attributes defined, such that TM can be notified if the calibration quality degrades. Alarms can also be defined on these attributes.

5.2.8 Bandpass Flattening and Monitoring

The station beams generated by LFAA must be flattened, such that the bandpass is within 1.5 dB. To flatten the antenna bandpass, a scaling factor per frequency channel and polarization needs to be calculated; these factors are then applied in firmware, such that the station beam is generated using the flattened bandpasses. The scaling factors depend on the current shape of the bandpass, which can change over time, so this calculation needs to be performed routinely. Channelized LMC data is therefore required to determine the bandpass shape. Integrated channelized data (see [RD2]) can be used for this purpose. TPMs can be instructed to send integrated spectra for every antenna and polarization, with a user-defined integration time. These integrated spectra can be used for:

- Bandpass monitoring: The shape of the bandpass, and its stability in time, provides a useful diagnostic for antenna behaviour, which can give an indication of whether an antenna is about to reach a Faulty or undefined state

- Bandpass flattening: Scaling factors can be computed to flatten the bandpass


Both bandpass flattening and monitoring require some level of parametrisation of the bandpass shape. This can be achieved in multiple ways, including (in order of increasing processing requirements):

- Averaging over small subsets of frequency channels

- n-order polynomial fitting (where n depends on the antenna bandpass model)

- Either of the above procedures with channel masks (supplied by TM) to remove noisy channels which can skew the averages and fits. Removed channels can be replaced with the average of adjacent channels or by an interpolation algorithm (a sketch of the polynomial approach follows this list)
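As an illustration, a mask-aware polynomial parametrisation and the resulting scaling factors could look as follows; the polynomial order, the flattening target level and the example mask are assumptions made for the sketch, not prescribed values:

    import numpy as np

    def bandpass_scaling(spectrum, channel_mask=None, order=5):
        # Fit a smooth model to an integrated spectrum (one antenna, one
        # polarization) and return per-channel scaling factors that flatten it
        channels = np.arange(spectrum.size)
        good = np.ones(spectrum.size, dtype=bool)
        if channel_mask is not None:
            good[channel_mask] = False       # drop noisy channels flagged by TM
        coeffs = np.polyfit(channels[good], spectrum[good], order)
        model = np.polyval(coeffs, channels)  # parametrised bandpass shape
        target = np.median(model)             # flatten towards the median level
        return target / model                 # per-channel scaling factors

    # Example: a sloped bandpass with an RFI spike in channel 100
    spectrum = np.linspace(1.0, 2.0, 384)
    spectrum[100] *= 50
    factors = bandpass_scaling(spectrum, channel_mask=[100])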

Figure 5-37 shows the high-level activity required for bandpass monitoring and flattening, described in the steps below (note that this is performed independently per station):

1. When the station is configured, the TPMs are instructed to send integrated spectra with a user-defined integration time, and the DAQ is instructed to receive these spectra (not shown)

2. When the TPMs send a new set of integrated spectra, they are received by the DAQ process, which informs the Bandpass process

3. The Bandpass process reads the new spectra and parametrises the bandpass of each antenna and polarization. This can be performed in parallel for each antenna, as shown

4. These parameters can be used to check whether the bandpass is healthy, for example by comparing them with the previous parameters to check for sudden changes, and with other antennas to check for outliers (see the sketch after this list)

5. If the bandpass is deemed unhealthy, the antenna status is changed to Faulty or Unknown, such that TM is notified and can decide whether to keep the antenna in the station

6. Otherwise, the scaling factor for each frequency channel is calculated

7. Once all antennas are processed, the new scaling factors are provided to the Tile device (which distributes them to the appropriate Antenna devices for archiving, not shown here)

8. The Tile device downloads the updated scaling factors to the TPM, where they are applied
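A minimal sketch of the health check in step 4, assuming the fitted polynomial coefficients are the compared parameters; the thresholds are illustrative assumptions:

    import numpy as np

    def bandpass_healthy(params, previous_params, station_params,
                         change_limit=0.2, outlier_sigma=5.0):
        # params          -- fitted parameters for this antenna (1-D array)
        # previous_params -- parameters from the previous integration
        # station_params  -- parameters of all antennas in the station (2-D array)

        # Sudden change with respect to the previous parametrisation
        change = (np.linalg.norm(params - previous_params)
                  / np.linalg.norm(previous_params))
        if change > change_limit:
            return False
        # Outlier with respect to the rest of the station
        mean = station_params.mean(axis=0)
        std = station_params.std(axis=0) + 1e-12
        if np.any(np.abs(params - mean) / std > outlier_sigma):
            return False
        return True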

Figure 5-37. Bandpass monitoring and flattening activity diagram

5.2.9 Fast Transient Buffer

The transient buffer is used to capture transient events. CSP can detect transients but, due to their nature and to the processing time necessary for detection, it will trigger a transient capture with a significant delay (several hundred seconds) after the event has been received by the telescope. A buffer is thus required to keep track of received raw data, and to retrieve it when an interesting event has been detected. As the collected raw data volume is extremely large, data is buffered only for beamformed samples, for a limited bandwidth, and with a limited bit resolution. This buffer is located on the MCCS server, which instantiates the required buffer per station during observation configuration. When a trigger is received from TM, a subset of this buffer is sent to SDP. The following lists the steps required for receiving, triggering and transmitting the transient buffer (shown in Figure 5-38):

- TM will set the transient buffer parameters during observation configuration. A subset of the stations and coarse channels that are processed for beamforming are selected for transient buffer storage. The destination SDP ingest node address parameters are provided by TM as well

- SPS re-quantizes the channelized samples being sent to the transient buffer as 2+2 bits per complex sample, and formats them in independent SPEAD packets. This operation is performed in the last TPM in the beamforming chain. One transient buffer packet is generated for each CSP packet

- SPS sends these packets to a circular buffer on the MCCS servers, with a total buffer space sufficient to hold the whole required time interval

- CSP identifies a potential transient event and signals it to TM together with the relevant parameters. Alternatively, TM can accept a transient event trigger from an external source

- When a transient is detected, TM triggers MCCS and sends the required parameters (which include the start time and stop time for each of the frequency channels)

- When the stop time is reached, the SPS stops generating transient buffer packets. The stored interval is kept in the buffer (not overwritten)

- MCCS retrieves the relevant portion of the buffer and sends it to SDP for archiving

- When enough buffer space is available for a second transient to be captured, SPS resumes sending packets to the circular buffer

Figure 5-38. High level transient buffer sequence diagram

Based on the assumptions below, each station generates 1.422 Gbps of data for the transient buffer. The minimum total buffer space required for 900 s of data (excluding metadata) is 160 GB per station, or 80 TB for all stations. The data to be transmitted for the required 510 s is ~45.5 TB, which is transmitted directly to SDP over a dedicated 100 Gbps link (the short calculation following the list of assumptions reproduces these figures). Assumptions:


- At most 150 MHz of data per station and polarization needs to be stored in the buffer

- Total segment stored in the buffer is 900 s long

- No double buffering is required (after a transient is captured, there will be some time in which the buffer is downloaded to SDP, and during which events cannot be stored)

- Data format is at least 2 bits per sample, complex samples, two polarizations. Alternatively, bandwidth and/or number of stations to buffer can be substituted for higher bit resolution, so long as the total bandwidth into the MCCS server is not exceeded

- Samples refer to channelized oversampled data, with oversampling factor 32/27 ≈ 1.185

- Samples from all 512 stations are captured
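The quoted figures follow directly from these assumptions; the short calculation below reproduces them (the ~45.5 TB transmission figure corresponds to the same calculation under slightly different rounding and unit conventions):

    bandwidth_hz = 150e6              # per station and polarization
    oversampling = 32 / 27            # channelized, oversampled data
    bits_per_complex_sample = 4       # 2+2 bits (real + imaginary)
    polarizations = 2
    stations = 512

    rate_bps = bandwidth_hz * oversampling * bits_per_complex_sample * polarizations
    print(rate_bps / 1e9)                  # 1.422 Gbps per station

    buffer_bytes = rate_bps * 900 / 8      # 900 s segment
    print(buffer_bytes / 1e9)              # 160 GB per station
    print(buffer_bytes * stations / 1e12)  # ~81.9 TB for all 512 stations (~80 TB)

    transmit_bytes = rate_bps * 510 / 8 * stations
    print(transmit_bytes / 1e12)           # ~46.4 TB for the required 510 s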

During observation configuration, a transient buffer instance is created for each station. Packets containing transient buffer samples are sent to one of the MCCS servers (the one on which the processing for the station the packets belong to is performed, see RD-X Section X for server-station mapping). The Transient Buffer Process receives these packets and places them in order in its internal circular buffer. When a transient trigger is received from TM, the ring buffer is frozen, and a pseudo-file is assembled to be transmitted to SDP. Packets are sent in an FTP-like format, with metadata provided as a separate file or embedded with the samples. This metadata includes:

- Information contained in the SPEAD header, identifying the stations and channels being transmitted, the reference UTC time and other metadata common to all samples

- For each transmitted channel, the associated beam and physical frequency

- For each transmitted channel, the initial time and duration (if different among channels) that is being transmitted

- For each block of 2048 samples (as defined by SPS), the rescaling value used for 2-bit (or more) quantization. A rescaling value of zero is used to signal a missing or corrupted block

The current SDP-LFAA ICD [AD3] specifies that a separate file is sent for each station beam and channel. Station and channel ID, timestamp and sample count are specified in the file name. Other information (beam ID, physical frequency, physical time) can be retrieved from TM, which initially configured the observation. The rescaling factor could be embedded in the sample stream. The FTP-like format allows easy handling of the transmission speed to exploit the 100 Gbps bandwidth of the LFAA to SDP link, which is distributed across the servers participating in the transient buffer transmission.
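A minimal sketch of the per-station circular buffer behaviour described above; the packet interface, freeze semantics and capacity handling are illustrative assumptions rather than the MCCS implementation:

    import collections
    import threading

    class TransientRingBuffer:
        # Per-station circular buffer: in-order packets in, frozen slice out

        def __init__(self, capacity_packets):
            self._buffer = collections.deque(maxlen=capacity_packets)
            self._lock = threading.Lock()
            self._frozen = False

        def add_packet(self, timestamp, payload):
            with self._lock:
                if not self._frozen:          # drop input while frozen
                    self._buffer.append((timestamp, payload))

        def freeze_and_extract(self, start_time, stop_time):
            # On a TM trigger, freeze the buffer and return the requested slice
            with self._lock:
                self._frozen = True
                return [p for t, p in self._buffer if start_time <= t <= stop_time]

        def resume(self):
            # Called once enough buffer space is available again
            with self._lock:
                self._frozen = False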

5.2.10 Subarray Monitoring

During subarray configuration, and whilst the subarray is running, several faults can arise. These faults include hardware components going into a Faulty or Unknown state, software components going into a Faulty or other invalid state (such as process crashes or invalid outputs), and loss of synchronisation and/or timing. The effects of these issues depend on when the issue arises and how much it affects the running observation. It is assumed that TM will decide what happens to an observation in this case (aborting, stopping or continuing the observation). Faults which can potentially damage hardware equipment will be mitigated locally, where the hardware can automatically switch itself off (as is the case for the APIU and TPM) or the monitoring software instructs the hardware component to switch off or switch to a low power mode.

Table 5-15 and Table 5-16 provide a non-exhaustive list of issues which can arise during subarray configuration and during a running observation. Note that these tables do not go into detail on how these issues may arise (see [RD2] for specific details) or what attempts are performed to correct them (such as trying to re-connect to a hardware component with which communication was lost). The effects induced by the listed issues are also discussed, including how they would affect an observation being configured or running.

Table 5-15. Potential issues which can arise during subarray configuration

Issue: TPM firmware could not be loaded on a specific TPM
Effect: The TPM has a specific issue which must be investigated; it cannot form part of a station until the issue is addressed

Issue: One TPM could not be initialised and configured
Effect: The TPM has a specific issue which must be investigated; it cannot form part of a station until the issue is addressed

Issue: Could not form beamforming chain
Effect: A networking (interface on TPM or interconnecting switch) or firmware issue is preventing the chaining of TPMs. Network tests need to be performed, otherwise the station cannot be used as a whole

Issue: Could not schedule job on MCCS cluster
Effect: An MCCS server can malfunction after resource allocation and prior to configuration. If re-scheduling is not possible, then the observation cannot be configured

Table 5-16. Potential issues which can arise during a running observation

Issue: Antenna status changed to Faulty or Unknown
Effect: Antenna output might be invalid

Issue: TPM status changed to Faulty or Unknown (lost communication)
Effect: The TPM might be working fine while intermediary communication has an issue. If communication is the problem, the station beam is still valid but calibration and pointing coefficients cannot be updated; if the TPM itself is faulty, the station beam might be invalid

Issue: SPS device involved in station goes to Faulty or Unknown
Effect: Depending on the SPS device, several TPMs can become unreachable. Station beam is undefined

Issue: LMC data stream from a single or multiple TPMs is lost
Effect: Can be a network, TPM or server issue. If LMC data cannot be received from a TPM, then it cannot be calibrated. If the TPM is the last one in the beamforming chain, the transient buffer data will not be received either

Issue: Lost communication or LMC data stream from multiple Tiles
Effect: Sub-rack, cabinet-level or MCCS-level issue; the station cannot be fully calibrated and the station beam is undefined

Issue: Synchronization error while downloading calibration or pointing coefficients
Effect: Results in the station beam being undefined for a small period (the time taken to apply all calibration or pointing coefficients)

Issue: Calibration solutions invalid or incorrect
Effect: The station beam is undefined for the frequency channels whose calibration solutions are invalid or incorrect

Issue: Server hosting station goes to Faulty or Unknown state
Effect: The station cannot be calibrated or pointed, and the transient buffer cannot be populated. Potentially, associated TANGO devices will go offline until restarted. The station beam will be undefined or invalid

The fault handling philosophy in MCCS is that any detected faults are exposed to higher layers of the TANGO hierarchy (up to LFAA Master and TM). These are distilled into a health state which can be explored by drill-down as needed.
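As an illustration of this roll-up, the sketch below distils child health states into a parent health state; the OK/DEGRADED/FAILED scale follows the healthState attributes defined in Section 6, while the aggregation rule itself is an assumption:

    from enum import IntEnum

    class HealthState(IntEnum):
        OK = 0
        DEGRADED = 1
        FAILED = 2

    def rolled_up_health(child_states, failed_fraction_limit=0.5):
        # Return the parent healthState given the children's states
        if not child_states:
            return HealthState.OK
        failed = sum(1 for s in child_states if s == HealthState.FAILED)
        if failed / len(child_states) > failed_fraction_limit:
            return HealthState.FAILED       # too many children have failed
        if failed or any(s == HealthState.DEGRADED for s in child_states):
            return HealthState.DEGRADED     # any problem degrades the parent
        return HealthState.OK

    # Example: one failed tile out of sixteen degrades the station
    states = [HealthState.OK] * 15 + [HealthState.FAILED]
    assert rolled_up_health(states) == HealthState.DEGRADED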

5.2.11 Local Sky Model update

The SDP-LFAA ICD [AD3] describes the interface through which LFAA can get a copy of the global sky model from SDP. The data needed from the GSM by LFAA still need to be defined, as well as the requirements (latency, cadence, volume, etc.) that LFAA places on these items for station calibration. SDP will provide an exposed API through which LFAA can pull a subset of the GSM dataset.

5.3 Variability Guide

There are several points in the Observation Management architecture that are designed to support variability, allowing for future modification or alteration depending upon need. The key to supporting this variability is the logical breakdown of the major components into sub-components, with clear interfaces between them. The following variability mechanisms are foreseen:


5.3.1 Subarray and Station Configuration

The SKA requirements define the composition of stations and sub-stations and the number of subarrays:

- A station is composed of exactly 256 antennas
- A sub-station is composed of a subset of antennas within a station
- There will be a maximum of 8 station/sub-station beams from each station
- There will be at least 16 subarrays

These numbers define several aspects of the physical composition of LFAA and of other sub-Elements of SKA1-Low. The architecture presented in this section meets these requirements, but does not impose a limit on these numbers. Ignoring hardware limitations, the architecture allows for:

- Stations composed of any number of antennas (which may or may not be in groups of 16, since this grouping is imposed by the TPM and not by the software)
- Any reasonable number of station/sub-station beams
- Any number of subarrays

However, the above variability does not come for free, and appropriate configuration and hardware resources must be set up.

5.3.2 Processing Algorithms

The observation management architecture makes a distinction between M&C components (TANGO devices) and components which perform some form of processing, such as calibration and pointing (Processes). A Process is not a TANGO device; it is simply an implementation of an algorithm (which can be in any language) which is wrapped in Python and instantiates a proxy to the TANGO device which submitted the process to the cluster manager. This allows for a high degree of variability when it comes to attaching processing components to an observation:

- The algorithmic implementation can change without affecting any part of the architecture, as long as the external interface (through the device proxy) is adhered to

- The system can have implementations of different algorithms (or different versions of the same algorithm) and, if appropriate configuration is performed, the Station or Station Beam can select which one to submit for a given observation

- Additional Processes can be easily defined and submitted for processing. Minor changes would be required to update the TANGO device associated with the Process, but no architectural change is required
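A minimal sketch of such a Process wrapper is shown below, assuming PyTango's DeviceProxy; the device name, attribute name and command name used here are illustrative assumptions, not the defined wrapper interface:

    # A plain Python program (a cluster job, not a TANGO device) wrapping an
    # algorithm and talking back to the TANGO device which submitted it
    import sys
    from tango import DeviceProxy

    def run_algorithm(config):
        # Placeholder for the wrapped algorithm (any language under the hood)
        return {"status": "OK"}

    def main(device_name):
        # Proxy to the submitting TANGO device (e.g. a Station device)
        device = DeviceProxy(device_name)
        config = device.read_attribute("jobConfiguration").value  # assumed attribute
        result = run_algorithm(config)
        # Report completion back through the proxy (assumed command name)
        device.command_inout("ReportJobResult", str(result))

    if __name__ == "__main__":
        main(sys.argv[1])   # e.g. "low-mccs/station/001" (assumed naming)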

5.4 Rationale

The architectural drivers for the observation management view are to:

- Form stations and monitor up to 8 independent station beams
- Calibrate stations every 10 minutes
- Capture control data to be used for correlation and diagnostics
- Calculate bandpass flattening coefficients to be applied by SPS to produce a flattened station beam
- Constantly buffer the station beams from all stations and transmit them to SDP when triggered

In terms of SEI software qualities, the architectural drivers for the observation management architecture are performance, availability and modifiability, which led to the design of the described architecture. The following discussions, posed as questions, describe the rationale for several choices made in this architecture.

Why are there no sub-station devices?
A sub-station is defined as a station in which some antennas do not participate in forming the station beam (they have a coefficient of 0, effectively setting their signal to 0). Therefore, a sub-station can be regarded as a subset of a station. Since multiple sub-station (and station) beams can be defined on the same station, all using the same set of antennas and processed on the same hardware (since TPMs are partitioned across stations as well), all station and sub-station configuration can be handled by one Station device, instantiating a Station Beam device per station/sub-station beam.

Why are processes not TANGO devices?
In this architecture, TANGO devices are associated with hardware or logical components whose primary role is to be configured, managed and monitored. A process is inherently different in that, even though it requires some configuration, management and control, its primary role is to perform computation. Computation-heavy processes should not be TANGO devices, since the TANGO device model is not optimised for such processes. Therefore, processes are regarded as cluster jobs which are monitored through standard cluster management software. The TANGO device which generates a process is also responsible for managing it.

Why is there a Transient Buffer TANGO device and a Transient Buffer process, and not just a single process managed by the Station TANGO device?
A Transient Buffer TANGO device was included to offload transient buffer related operations from the Station device and to reduce the latency from when TM sends the trigger to the start of transient buffer transmission (TM can directly instruct the device to transmit the buffer). The transmission of the transient buffer may take up to one hour, so having a separate device which manages the buffer implementation guarantees that, if the Station device malfunctions for other reasons (being a more complex device), buffer transmission can continue.

Why are files used between the DAQ and other processes?
The DAQ receives LMC data from TPMs and generates files which are saved in distributed storage. Devices or processes which need to use this data can then request these files from the storage manager. For real-time operations (such as generating the correlation matrix and using it to run the calibration algorithm) this introduces a latency, which is however minimal compared to the timescales over which these operations need to be performed (especially since these files are small). Using the storage manager for storing LMC files is advantageous because:

- The TANGO database is not overloaded with data, and only stores data which is eventually archived by TM (or used for short-term debugging in LFAA)

- Files can be used offline (they can be copied to external storage without having to extract data from the TANGO database)

5.5 Related Views

- The LMC Infrastructure View provides the notation, state machine, and LFAA TANGO device behaviour, all of which are used and assumed in this view

- The Monitoring and Control View provides details on how the hardware components of SPS and MCCS are represented as TANGO devices, and details the monitoring points and commands available, some of which are referred to in this view


6 Monitoring and Control View

6.1 Context Diagram

Most of the interaction required for monitoring and controlling any device wrapped in a TANGO framework can be summarized by the context diagram in Figure 6-39, based on the stakeholders of the system during three phases: system design and development, system running, and system maintenance. This context diagram maps the primary use cases of monitoring and control that are essential to the various stakeholders. Based on these, this view defines the major elements involved in the monitoring and control functionality of the system (aside from observation running, which is covered in a separate view), and details the behaviour of these elements.

Figure 6-39: Monitoring and control elements have a narrow interface defined by the TANGO framework. Within this framework, there are several primary use-cases required for monitoring and control.


Furthermore, within the context of having most monitoring and control functionality wrapped in a TANGO server, the activity of a client of the monitoring and control system is unified. This unified activity is summarized in Figure 6-40.

Figure 6-40: Unified activity for TANGO clients during run-time of the LFAA LMC system.


6.2 Primary Presentation

To be able to provide the essential monitoring and control functionality of the hardware devices in the system, which then bubbles up to more complex monitoring and control functionality at an observation/instrument level, this view defines the composition of TANGO device elements, which map directly to hardware elements. This is shown in Figure 6-41.

Figure 6-41: Components defined for hardware devices, collectively forming a hierarchy of monitoring and control functionality all the way up to the LFAA Master device.

6.3 Element Catalog

This subsection presents all the elements of the system (mostly TANGO devices) that provide the basis of monitoring and control behaviour for the LFAA LMC system to run according to requirements. The elements presented here start from the TANGO devices that interact with hardware components, up to the devices that describe station and station-beam monitoring and control. In most cases, for every TANGO device, the following are defined:

- A class diagram with properties and commands
- A description of element behaviour with one or more of:
  o States and modes of the device
  o Alarms pertaining to the device
  o Essential commands required
  o Activity diagrams to define special flows required to be implemented for the device


6.3.1 Antenna Device

6.3.1.1 Class Diagram

Figure 6-42: Antenna device class diagram (inherits from LFAADevice)

The Antenna Device represents the TANGO interface to an antenna and inherits all the functionality of the LFAADevice base class. This device mainly provides antenna metadata and location, monitoring point attributes, calibration and pointing coefficient storage, and commands to power the antenna on/off (via communication with the APIU device, which holds power control over antennas). In particular, it also maintains a mapping between the global antenna ID, the logical antenna ID for the TPM it is connected to, and the logical antenna ID for the APIU it is powered by.

Since the physical antenna is not monitorable, the Antenna TANGO device attributes are set by:

- Reading the required information from other TANGO devices (primarily the Tile and APIU devices associated with a specific antenna)
- Running jobs (such as calibration and pointing), which update the state of the antenna either during or between observations
- The monitoring and control system in general

The Antenna device needs the associated APIU device for reading metrics and to act on power commands (antenna power commands defined in the Antenna device act as a proxy to commands defined in the APIU device). The APIU device does not need information from the Antenna device. However, there is a cyclic dependency between the Antenna and Tile devices: the Antenna needs to gather metric information from the Tile device, while the Tile device might need to read attributes from the Antenna device. Additionally, the Tile device can write attributes in the Antenna device (but the Antenna device does not write attributes in the Tile device). No commands are issued between the two devices. The Antenna state depends on both the associated APIU and Tile states: if either of them is offline or faulty, the Antenna state will be changed appropriately.

In summary, the Antenna device reads metrics from both the Tile and APIU devices, and forwards commands (acting as a proxy for external commands) to the APIU device, while the APIU device does not read any information from the Antenna device.

6.3.1.2 Element Behaviour

States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
- ONLINE: The Antenna can be used for scientific observing.
- MAINTENANCE: The Antenna is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the Antenna is set to MAINTENANCE mode.
- OFFLINE: The Antenna is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
- NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA reports the operational state for the Antenna.
- INIT: N/A
- OFF: The Antenna is not powered.
- ON: The Antenna is powered and not disabled.
- ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
- DISABLE: The Antenna is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); only basic monitor and control functionality is available.
- FAULT: An unrecoverable fault has been detected. The Antenna is not available for use; maintainer/operator intervention is required.
- UNKNOWN: The Antenna is unresponsive, e.g., due to loss of communication.

healthState (read-only): The overall Antenna healthState.
- OK
- DEGRADED
- FAILED

obsState (read-only): The Antenna Observing State indicates status related to scan configuration and execution.
- IDLE: The Antenna is not generating output products.
- CONFIGURING: N/A
- READY: The Antenna enters READY when the antenna is ready to start generating a signal. Any calibration parameters that require updates are being updated.
- SCANNING: The antenna is outputting a signal.
- PAUSED: N/A
- ABORTED: N/A
- FAULT: An unrecoverable error that requires operator intervention has been detected.

Alarms

1. State = FAULT: An unrecoverable fault occurred, and the operator needs to be notified
2. HealthState = FAILED: Indicates a failure on both polarizations
3. HealthState = DEGRADED: Wrong bandpasses or bad RMS on one polarization
4. obsState = FAULT: An unrecoverable fault occurred, and the operator needs to be notified

Commands

1. PowerOn(logical_apiu_antenna_id, apiu_id): A command to power on the Antenna. The Antenna cannot self-power on, and therefore uses its internal link to the appropriate APIU to instruct it to turn on the antenna.

2. PowerOff(logical_apiu_antenna_id, apiu_id): A command to power off the Antenna. The Antenna cannot self-power off, and therefore uses its internal link to the appropriate APIU to instruct it to turn off the antenna.
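A minimal PyTango sketch of this proxying is shown below; the device naming scheme and the surrounding class details are illustrative assumptions, while the APIU commands are those listed in Section 6.3.2:

    from tango import DeviceProxy
    from tango.server import Device, command

    class Antenna(Device):

        @command(dtype_in=(int,))   # [logical_apiu_antenna_id, apiu_id]
        def PowerOn(self, args):
            logical_id, apiu_id = args
            # The Antenna cannot self-power; forward the request to the APIU
            apiu = DeviceProxy(f"low-mccs/apiu/{apiu_id:03d}")  # assumed naming
            apiu.command_inout("PowerUpAntenna", logical_id)

        @command(dtype_in=(int,))
        def PowerOff(self, args):
            logical_id, apiu_id = args
            apiu = DeviceProxy(f"low-mccs/apiu/{apiu_id:03d}")
            apiu.command_inout("PowerDownAntenna", logical_id)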

Activity - Antenna RMS Check

The activity in Figure 6-43 describes how the Antenna device checks for issues related to the RMS of the signals received by the antenna. The health state of the antenna depends on whether one or both signal polarizations have bad RMS values.


Figure 6-43: Antenna RMS and bandpass check activity diagram

Activity - Antenna Check for UNKNOWN opState in Parent Controllers

The activity in Figure 6-44 shows how the Antenna device maintains a check on the state changes of its parent Tile and APIU devices. When both of these states are detected as UNKNOWN, it essentially means that the LFAA control system has lost direct control over the physical antenna, in which case the Antenna device should report an UNKNOWN state as well.


Figure 6-44: Activity diagram for antenna check for UNKNOWN opState in parent controllers.


6.3.2 APIU

6.3.2.1 Class Diagram

Figure 6-45: APIU device class diagram (inherits from LFAAGroupDevice)

The APIU Device represents the TANGO interface to an APIU unit and inherits all the functionality of the LFAAGroupDevice base class. This device is a group class since it operates a set of antennas – the logical IDs of which are stored internally. The APIU unit mainly has commands to power up/down individual antennas, or to power up/down the entire APIU unit.

6.3.2.2 Element Behaviour

States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
- ONLINE: The APIU can be used for scientific observing.
- MAINTENANCE: The APIU is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the APIU is set to MAINTENANCE mode.
- OFFLINE: The APIU is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
- NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA reports the operational state for the APIU.
- INIT: N/A
- OFF: The APIU is not powered.
- ON: The APIU is powered and not disabled.
- ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
- DISABLE: The APIU is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); only basic monitor and control functionality is available.
- FAULT: An unrecoverable fault has been detected. The APIU is not available for use; maintainer/operator intervention is required.
- UNKNOWN: The APIU is unresponsive, e.g., due to loss of communication.

healthState (read-only): The overall APIU healthState.
- OK
- DEGRADED
- FAILED

obsState (read-only): The APIU Observing State indicates status related to scan configuration and execution.
- IDLE: N/A
- CONFIGURING: N/A
- READY: N/A
- SCANNING: N/A
- PAUSED: N/A
- ABORTED: N/A
- FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands

1. PowerUpAntenna(logicalAntennaId): Powers up the specified antenna connected to the APIU
2. PowerDownAntenna(logicalAntennaId): Powers down the specified antenna connected to the APIU
3. PowerUp(): Powers up the APIU
4. PowerDown(): Powers down the APIU

Alarms

1. State = FAULT: An unrecoverable error that requires operator intervention has been detected.
2. HealthState = DEGRADED: Antenna fault detected
3. State = UNKNOWN: APIU unresponsive
4. degradedPercentage >= MAX: If the amount of degradation is no longer acceptable, trigger an ALARM.
5. isAlive = FALSE: The device may be in good working condition, but somehow is currently not "alive", e.g. not pinging back


Activity - APIU Detection of Antenna Faults

The activity in Figure 6-46 describes how an APIU checks on the states of all antenna devices connected to it and, based on the state information, determines whether any antenna is currently in an ALARM state. If so, the APIU is marked as degraded. The APIU also aggregates this information to keep an updated metric of how degraded the APIU operation is.
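An illustrative sketch of this aggregated degradation metric (applicable analogously to the Tile and Station devices) is given below; the threshold, the state encoding and the helper function are assumptions based on the alarm table above:

    MAX_DEGRADED_PERCENTAGE = 50.0   # assumed limit for "no longer acceptable"

    def update_degradation(antenna_states):
        # Return (degradedPercentage, alarm) from the states of all antennas
        if not antenna_states:
            return 0.0, False
        degraded = sum(1 for state in antenna_states if state == "ALARM")
        percentage = 100.0 * degraded / len(antenna_states)
        return percentage, percentage >= MAX_DEGRADED_PERCENTAGE

    # Example: 3 of 8 antennas in ALARM -> 37.5% degraded, no ALARM raised yet
    print(update_degradation(["ON"] * 5 + ["ALARM"] * 3))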

Figure 6-46: APIU detection of antenna faults


6.3.3 Tile

6.3.3.1 Class Diagram

Figure 6-47: Tile class diagram (inherits from LFAAGroupDevice)

The Tile Device represents the TANGO interface to a Tile (TPM) unit and inherits all the functionality of the LFAAGroupDevice base class. This device is a group class since it operates a set of antennas – the logical IDs of which are stored internally. The Tile commands are not listed here, since they have already been defined in the Tile API, but they are present in the device concept. The attributes cover a number of configurable aspects of the Tile, mostly related to data transmission and routing, as well as health monitoring points.

6.3.3.2 Element Behaviour

States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
- ONLINE: The tile can be used for scientific observing.
- MAINTENANCE: The tile is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the tile is set to MAINTENANCE mode.
- OFFLINE: The tile is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
- NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components used by the tile and reports the overall operational state for the tile.
- INIT: The tile is being initialized.
- OFF: The tile is 'empty'; no antennas have been assigned to the tile.
- ON: At least one antenna has been allocated to the tile.
- ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
- DISABLE: The tile is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but beam-forming capabilities are not available.
- FAULT: An unrecoverable fault has been detected. The tile is not available for use; maintainer/operator intervention is required.
- UNKNOWN: The tile is unresponsive, e.g., due to loss of communication.

healthState (read-only): The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components used by the tile and reports the overall tile healthState.
- OK
- DEGRADED
- FAILED

obsState (read-only): The tile Observing State indicates status related to scan configuration and execution.
- IDLE: The tile is not processing input data and is not generating output products.
- CONFIGURING: Transient state entered when a command to re-configure the tile is received. The tile leaves this state when re-configuration is completed.
- READY: The tile enters READY when re-configuration has been completed; scan configuration is complete; the tile is calibrated and ready to generate output data products. The parameters that require updates during the scan are being updated.
- SCANNING: The tile is generating tile beam output products.
- ABORTED: The tile transitions to this state when a command 'abort scan' is received. In this state re-configuration, and any other on-going processing functions, are stopped.
- FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands

All commands for a Tile are defined in the Tile API [RD5].

Alarms

1. State = FAULT
2. temperatureBoard > MAX: Max temperature TBD
3. temperatureFpga1 > MAX: Max temperature TBD
4. temperatureFpga2 > MAX: Max temperature TBD
5. voltage > MAX: Max voltage TBD
6. current > MAX: Max current TBD
7. HealthState = DEGRADED: Tile fault detected
8. State = UNKNOWN: Tile connectivity lost
9. flagXXXX = TRUE: Any monitoring flags exposed by the TPM can be monitored for fault detection, and more importantly diagnosis
10. degradedPercentage >= MAX: If the amount of degradation is no longer acceptable, trigger an ALARM.
11. obsState = FAULT: An unrecoverable error that requires operator intervention has been detected.

Activity - Tile Detection of Antenna Faults

The activity in Figure 6-48 describes how a Tile checks on the states of all antenna devices connected to it and, based on the state information, determines whether any antenna is currently in an ALARM state. If so, the tile is marked as degraded. The tile also aggregates this information to keep an updated metric of how degraded the tile operation is.

Figure 6-48: Tile detection of faults on connected antennas.


Activity - Tile Responsiveness Check

The activity diagram in Figure 6-49 describes the basic process of a response check on the Tile device. The responsiveness of the Tile needs to be periodically checked, since the control of the connected antennas is done directly via the Tile (and APIU) devices associated with particular antennas. Failure to receive a response within a 5 second period will result in a timeout that will change the state of the Tile device to UNKNOWN.

In general, such an explicit check is not entirely required, and is demonstrated here for completeness. Any responsiveness issues will be detected whenever a read attribute operation on the tile is performed, and the device state can easily be updated in case there is no response.
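An illustrative sketch of such a periodic check, assuming the PyTango client API; the device name and polling period are assumptions:

    import time
    from tango import DeviceProxy, DevState

    TIMEOUT_MS = 5000          # 5 second response window, as described above
    POLL_PERIOD = 10.0         # assumed polling period in seconds

    def watch_tile(device_name="low-mccs/tile/0001"):
        tile = DeviceProxy(device_name)
        tile.set_timeout_millis(TIMEOUT_MS)
        while True:
            try:
                tile.ping()    # round-trip to the device server
            except Exception:
                # No response within 5 s: mark the Tile as UNKNOWN
                print(f"{device_name} unresponsive, state -> {DevState.UNKNOWN}")
            time.sleep(POLL_PERIOD)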

Figure 6-49: Tile responsiveness check


6.3.4 Station

6.3.4.1 Class Diagram

Figure 6-50: Station device class diagram (inherits from LFAAGroupDevice)

The Station Device represents the TANGO interface to the logical construct of a Station unit and inherits all the functionality of the LFAAGroupDevice base class. This device is a group class since it operates a set of Tile devices, which are members of the group device. The main tasks of the Station device are to set up the station, configure the jobs that the station will run, apply pointing and calibration operations to the tiles grouped in the station, and maintain overall aggregate health information for the same tiles.

6.3.4.2 Element Behaviour

States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
- ONLINE: The station can be used for scientific observing.
- MAINTENANCE: The station is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the station is set to MAINTENANCE mode.
- OFFLINE: The station is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
- NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components used by the station and reports the overall operational state for the station.
- INIT: The station is being initialized.
- OFF: The station is 'empty'; no Tiles have been assigned to the station.
- ON: At least one Tile has been allocated to the station; the station may (with a correctly configured Station Beam) be used to generate data products.
- ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
- DISABLE: The station is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but beam-forming capabilities are not available.
- FAULT: An unrecoverable fault has been detected. The station is not available for use; maintainer/operator intervention is required.
- UNKNOWN: The station is unresponsive, e.g., due to loss of communication.

healthState (read-only): The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components and capabilities used by the station and reports the overall station healthState.
- OK
- DEGRADED
- FAILED

obsState (read-only): The station Observing State indicates status related to scan configuration and execution.
- IDLE: The station is not processing input data and is not generating output products.
- CONFIGURING: Transient state entered when a command to re-configure the station is received. The station leaves this state when re-configuration is completed.
- READY: The station enters READY when re-configuration has been completed; scan configuration is complete; the station is calibrated and ready to generate output data products. The parameters that require updates during the scan are being updated.
- SCANNING: The station is generating station beam output products.
- ABORTED: The station transitions to this state when a command 'abort scan' is received. In this state re-configuration, and any other on-going processing functions, are stopped.
- FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands

1. CheckTileHealth(): A polled command (or polled attribute) process to report on the proper running health of all assigned tiles.
2. Configure(): Configures the station. This can be used to reconfigure the current station.
3. CreateStation(): Sets up the chain of tiles in a station – all tiles will therefore be programmed, initialized and synced.
4. ConfigureCalibrationJob(): Sets up the calibration job for this station.
5. SubmitCalibrationJob(): Submits the calibration job for this station.
6. ConfigureDaqJob(): Sets up the DAQ job for this station.
7. SubmitDaqJob(): Submits the DAQ job for this station.
8. ConfigureTransientBufferJob(): Sets up the transient buffer job for this station.
9. SubmitTransientBufferJob(): Submits the transient buffer job for this station.
10. CheckAntennaBandpass(): Calls on the station to check the bandpass of a particular antenna in one of the tiles forming the station.

Alarms

1. State = FAULT: An unrecoverable error that requires operator intervention has been detected – see the All-Tile Health Check Activity in Figure 6-51.
2. HealthState = DEGRADED: A number of tiles are not working well.
3. State = UNKNOWN: Control over the station is lost.
4. degradedPercentage >= MAX: If the amount of degradation is no longer acceptable, trigger an ALARM.
5. isCalibrated = FALSE: If the calibration cycle failed, raise an alarm.
6. obsState = ABORTED: Raise an alarm if the observation process of this station is aborted.

Activity - All-Tile Health Check for Good Station Operation

The activity diagram in Figure 6-51 describes how a station maintains monitoring over all its member tiles. This check is part of the logic implemented in the station device STATE attribute, and combines aspects of opState and healthState. If any of these states on any member tile is not reporting the expected values of INIT/ON/OK, then the station device is considered to have a degraded state, at which point a metric of how degraded the station is can be calculated.


Figure 6-51: Station check for health on associated tiles


Activity - Submission of Jobs

The activity in Figure 6-52 describes a basic check (this can be extended to more than one required check) performed when a station device receives a job submission signal. The station has to be verified to be configured in the first place, with all tiles reporting good health. If these basic conditions are not met, the station is taken off the observation by setting its obsState to ABORTED.
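A minimal sketch of this check is given below; the attribute names mirror those used in the tables above, while the helper structures themselves are illustrative:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TileState:
        healthState: str = "OK"

    @dataclass
    class StationState:
        isConfigured: bool = False
        obsState: str = "IDLE"
        tiles: List[TileState] = field(default_factory=list)

    def submit_job(station, submit):
        # Basic checks from Figure 6-52 before submitting a job
        healthy = all(t.healthState == "OK" for t in station.tiles)
        if station.isConfigured and healthy:
            submit()
        else:
            station.obsState = "ABORTED"  # take the station off the observation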

Figure 6-52: Basic checks during job submission


Activity - All-Station Beams Health Check for Good Station Operation

The activity in Figure 6-53 describes how a station device monitors the states of the station beams associated with the station. Based on the information received, the healthState of the station itself is set to DEGRADED if any problems are detected, with a metric of how degraded the station is kept updated.

Figure 6-53: Activity to check for state of station beams.


6.3.5 Station Beam

6.3.5.1 Class Diagram

Figure 6-54: Class diagram for a station beam device (inherits from LFAAGroupDevice)

The StationBeam Device represents the TANGO interface to the logical construct of a Station Beam and inherits all the functionality of the LFAAGroupDevice base class. This device is a group class since it operates a set of Tile devices, which are members of the group device. The main tasks of the StationBeam device are to configure the beam, and monitor the tile health for the beam configuration.

6.3.5.2 Element Behaviour

States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
- ONLINE: The Station Beam can be used for scientific observing.
- MAINTENANCE: The Station Beam is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the Station Beam is set to MAINTENANCE mode.
- OFFLINE: The Station Beam is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
- NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components used by the Station Beam and reports the overall operational state for the Station Beam.
- INIT: The Station Beam is being initialized.
- OFF: The Station Beam is inactive.
- ON: The Station Beam is active.
- ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
- DISABLE: The Station Beam is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but beam-forming capabilities are not available.
- FAULT: An unrecoverable fault has been detected. The Station Beam is not available for use; maintainer/operator intervention is required.
- UNKNOWN: The Station Beam is unresponsive, e.g., due to loss of communication.

healthState (read-only): The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components and capabilities used by the Station Beam and reports the overall Station Beam healthState.
- OK
- DEGRADED
- FAILED

obsState (read-only): The Station Beam Observing State indicates status related to scan configuration and execution.
- IDLE: The Station Beam is not processing input data and is not generating output products.
- CONFIGURING: Transient state entered when a command to re-configure the Station Beam is received. The Station Beam leaves this state when re-configuration is complete and the required pointing and calibration parameters are being received.
- READY: The Station Beam enters READY when re-configuration has been completed; scan configuration is complete; the Station Beam is calibrated, locked on target, and ready to generate output data products. The parameters that require updates during the scan are being updated.
- SCANNING: The Station Beam is generating output products. The parameters that require updates during the scan are being updated.
- PAUSED: When a Sub-Array is paused, the Station Beam transitions to obsState=PAUSED and stops generation of output products. The Station Beam configuration remains as-is. Resuming observations causes the Station Beam to transition to SCANNING and start generating output products again.
- ABORTED: The Station Beam transitions to this state when a command 'abort scan' is received by its parent Sub-Array. In this state re-configuration, and any other on-going processing functions, are stopped.
- FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands
1. CheckTileHealth(): A polled command (or polled attribute) that reports on the running health of all assigned tiles.
2. Configure(): Configures the station beam; this can also be used to reconfigure the current station beam.
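As an illustration only, the following minimal PyTango-style sketch shows one possible shape for these two commands on a StationBeam device. The tile bookkeeping (tile_proxies), the healthState enumeration value and the JSON configuration argument are assumptions for the sketch, not part of the documented interface.

    from tango import DevState
    from tango.server import Device, command

    class StationBeam(Device):
        """Minimal sketch of the StationBeam TANGO device commands."""

        def init_device(self):
            super().init_device()
            self.tile_proxies = []          # DeviceProxy objects for member tiles (assumed)
            self.set_state(DevState.OFF)

        @command(dtype_out=float)
        def CheckTileHealth(self):
            # Polled command: return the fraction of assigned tiles reporting OK.
            if not self.tile_proxies:
                return 0.0
            ok = sum(1 for t in self.tile_proxies
                     if t.read_attribute("healthState").value == 0)  # 0 == OK (assumed)
            return ok / len(self.tile_proxies)

        @command(dtype_in=str)
        def Configure(self, beam_config):
            # (Re)configure the station beam; beam_config is assumed to be a
            # JSON string carrying pointing/channel parameters for member tiles.
            self.set_state(DevState.ON)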

Alarms
1. State = FAULT: An unrecoverable error that requires operator intervention has been detected.
2. healthState = DEGRADED: One or more tiles are not functioning correctly.
3. State = UNKNOWN: Control over the station beam has been lost.
4. degradedPercentage >= MAX: If the amount of degradation is no longer acceptable, trigger an ALARM.
5. obsState = ABORTED: Raise an alarm if the observation process of this station beam is aborted.


Activity - Station Beam Check for All-Tiles Health
The activity diagram in Figure 6-55 describes how a station beam device monitors the states of the tiles associated with the station beam. Based on the information received, the healthState of the station beam itself is set to DEGRADED if any problems are detected, and a metric quantifying the degradation of the station beam is kept updated.

Figure 6-55: Activity for station beam device checking for associated tile health
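A minimal sketch of the rollup logic in Figure 6-55, assuming a simple three-valued health enumeration and a percentage metric; the enumeration values and function names are illustrative.

    from enum import IntEnum

    class HealthState(IntEnum):
        OK = 0
        DEGRADED = 1
        FAILED = 2

    def rollup_tile_health(tile_states):
        """Roll member tile states up into (beam healthState, degradedPercentage)."""
        if not tile_states:
            return HealthState.FAILED, 100.0
        bad = sum(1 for s in tile_states if s != HealthState.OK)
        degraded_percentage = 100.0 * bad / len(tile_states)
        if bad == len(tile_states):
            return HealthState.FAILED, degraded_percentage
        return (HealthState.DEGRADED if bad else HealthState.OK), degraded_percentage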


6.3.6 Transient Buffer

6.3.6.1 Class Diagram

Figure 6-56: Class diagram for a Transient Buffer device (inherits from LFAADevice)

The Transient Buffer device represents the TANGO interface to the transient buffer that is associated with a transient buffer job and its configuration.

6.3.6.2 Element Behaviour

States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
  ONLINE: The transient buffer can be used for processing during scientific observing.
  MAINTENANCE: The transient buffer is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the transient buffer is set to MAINTENANCE mode.
  OFFLINE: The transient buffer is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
  NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components used by the transient buffer and reports the overall operational state for the transient buffer.
  INIT: The transient buffer is being initialized. A check for necessary daemons and services is required to make sure work can be submitted to it.
  OFF: The transient buffer is turned off.
  ON: The transient buffer is turned on.
  ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
  DISABLE: The transient buffer is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but heavy operations are not.
  FAULT: An unrecoverable fault has been detected. The transient buffer is not available for use; maintainer/operator intervention is required.
  UNKNOWN: The transient buffer is unresponsive.

healthState (read-only): OK, DEGRADED or FAILED. The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components used by the transient buffer and reports the overall transient buffer healthState.

obsState (read-only): The transient buffer Observing State indicates status related to scan configuration and execution.
  IDLE: The transient buffer is not processing input data and is not generating output products.
  CONFIGURING: Transient state entered when a command to e.g. restart services/jobs is received. The transient buffer leaves this state when re-configuration is completed.
  READY: The transient buffer enters READY when re-configuration has been completed and the transient buffer is ready to do data processing.
  SCANNING: The transient buffer has running processes doing data processing.
  ABORTED: The transient buffer transitions to this state when an 'abort scan' command is received. In this state re-configuration and any other on-going processing functions are stopped.
  FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands

Alarms


6.3.7 Specific Job Devices

6.3.7.1 Class Diagrams

Figure 6-57: Class diagrams for various job devices (all inherit from JobDevice)

All the particular Job Devices are monitoring and control wrappers around the specific job processes running on the MCCS nodes. They all inherit base-class functionality from JobDevice and, in general, report what is required for the specific job. Jobs are associated with a particular station and a particular node. General job commands such as starting or terminating a job are all contained in the JobDevice class.
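To make the shape of this base class concrete, here is a minimal PyTango-style sketch; the attribute names, the resource-manager calls indicated in comments, and the state handling are illustrative assumptions rather than the documented API.

    from tango import DevState
    from tango.server import Device, attribute, command

    class JobDevice(Device):
        """Sketch of the common base class for all specific job devices."""

        def init_device(self):
            super().init_device()
            self._station_id = -1     # station this job is associated with (assumed)
            self._node_id = -1        # MCCS node the job runs on (assumed)
            self.set_state(DevState.OFF)

        @attribute(dtype=int)
        def stationId(self):
            return self._station_id

        @attribute(dtype=int)
        def nodeId(self):
            return self._node_id

        @command(dtype_in=str)
        def StartJob(self, job_config):
            # Submit the job process to the node, e.g. via the resource manager
            # (resource_manager.submit(job_config) in a real implementation).
            self.set_state(DevState.ON)

        @command
        def TerminateJob(self):
            # Stop the running job process via the resource manager.
            self.set_state(DevState.OFF)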


6.3.8 Server
6.3.8.1 Class Diagram

Figure 6-58: Class diagram for a server device (inherits from LFAADevice)

The Server device describes the monitoring points and basic commands required for all MCCS compute nodes. In particular, it keeps track of memory, CPU, GPU, and storage use. Some servers will be designated as master nodes and others as slave nodes. A command interface to operate MCCS-specific services is also present.

6.3.8.2 Element Behaviour
States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
  ONLINE: The server can be used for processing during scientific observing.
  MAINTENANCE: The server is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the server is set to MAINTENANCE mode.
  OFFLINE: The server is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
  NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components used by the server and reports the overall operational state for the server.
  INIT: The server is being initialized. A check for necessary daemons and services is required to make sure work can be submitted to this server.
  OFF: The server is turned off.
  ON: The server is turned on.
  ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
  DISABLE: The server is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but heavy operations are not.
  FAULT: An unrecoverable fault has been detected. The server is not available for use; maintainer/operator intervention is required.
  UNKNOWN: The server is unresponsive, e.g., due to loss of communication.

healthState (read-only): OK, DEGRADED or FAILED. The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components used by the server and reports the overall server healthState.

obsState (read-only): The server Observing State indicates status related to scan configuration and execution.
  IDLE: The server is not processing input data and is not generating output products.
  CONFIGURING: Transient state entered when a command to e.g. restart services/daemons is received. The server leaves this state when re-configuration is completed.
  READY: The server enters READY when re-configuration has been completed and the server is ready to do data processing.
  SCANNING: The server has running processes doing data processing.
  ABORTED: The server transitions to this state when an 'abort scan' command is received. In this state re-configuration and any other on-going processing functions are stopped.
  FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands
1. StartService(serviceId): Starts a service on this server. This is possibly just a wrapper around a cluster manager call to start the service on this server, rather than a direct subsystem call.
2. StopService(serviceId): Stops a service on this server; possibly a wrapper around a cluster manager call to stop the service, rather than a direct subsystem call.
3. RestartService(serviceId): Restarts a service on this server; possibly a wrapper around a cluster manager call to restart the service, rather than a direct subsystem call.
4. SwitchToLowPowerMode(): Switches the server to low power mode.
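As a sketch of this wrapper approach, assuming a hypothetical cluster-manager client object (cluster_api) whose methods do the actual work:

    def start_service(cluster_api, server_id, service_id):
        # Delegate to the cluster manager rather than calling the subsystem directly.
        return cluster_api.start_service(host=server_id, service=service_id)

    def stop_service(cluster_api, server_id, service_id):
        return cluster_api.stop_service(host=server_id, service=service_id)

    def restart_service(cluster_api, server_id, service_id):
        # Restart expressed as stop followed by start on the same host.
        stop_service(cluster_api, server_id, service_id)
        return start_service(cluster_api, server_id, service_id)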

Alarms
1. State = FAULT
2. cpuNTemp > MAX: Maximum temperature for a CPU is TBD.
3. gpuNTemp > MAX: Maximum temperature for a GPU is TBD.
4. systemMemFreeMb < MIN: Minimum amount of free memory for a server is TBD.
5. gpuNMemFreeMb < MIN: Minimum amount of free memory for a GPU is TBD.
6. isAlive = FALSE: Periodically request a reply from the server IP to test whether the server is still alive and discoverable. A timeout here changes the device state to UNKNOWN. (This can possibly be done differently, e.g. via a call to the cluster manager to test whether the server is alive.)
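One possible realisation of the isAlive check is a simple ICMP probe of the server's management IP, marking the device UNKNOWN on timeout. The use of the system ping utility (with Linux-style flags) is an assumption; a cluster-manager query could be used instead, as noted above.

    import subprocess

    def is_alive(ip_address, timeout_s=2):
        """Return True if the host answers a single ICMP echo within timeout_s."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), ip_address],  # Linux ping flags
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0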

6.3.9 Cabinet
6.3.9.1 Class Diagram

Figure 6-59: Class diagram for a Cabinet device (inherits from LFAAGroupDevice)

A cabinet is a collection of devices housed in the various cabinet units, and this is reflected in a class of type LFAAGroupDevice. In addition to the monitoring of various attributes, most notably cabinet temperature and power consumption, a number of important commands are required for cabinet groups.

6.3.9.2 Element Behaviour
States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
  ONLINE: The cabinet can be used for control and monitoring of encased devices.
  MAINTENANCE: The cabinet is not used, probably undergoing hardware maintenance. If a cabinet is in maintenance, then encased devices are unreachable.
  OFFLINE: The cabinet is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
  NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components encased by the cabinet and reports the overall operational state for the cabinet.
  INIT: The cabinet is being powered up, which in turn means encased devices will start to power up.
  OFF: The cabinet is powered down.
  ON: The cabinet is powered up.
  ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
  DISABLE: The cabinet is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but heavy operations are not.
  FAULT: An unrecoverable fault has been detected. The cabinet is not available for use; maintainer/operator intervention is required.
  UNKNOWN: The cabinet is unresponsive, e.g., due to loss of communication.

healthState (read-only): OK, DEGRADED or FAILED. The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components used by the cabinet and reports the overall cabinet healthState.

obsState (read-only): The cabinet Observing State indicates status related to scan configuration and execution.
  IDLE: The cabinet is not currently involved in processing input data, and none of the encased components are generating output products.
  CONFIGURING: Transient state entered when the cabinet is being powered up, if used at all.
  READY: The cabinet enters READY when re-configuration has been completed and the devices encased by the cabinet are all ready to do data processing.
  SCANNING: The cabinet has some devices which are involved in data processing.
  PAUSED: n/a
  ABORTED: n/a
  FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands
1. PowerUp(): A power-up involves the powering up of all devices encased within the cabinet.
2. PowerDown(): A power-down involves the powering down of all devices encased within the cabinet.
3. SwitchToLowPowerMode(): Switches the entire cabinet and its devices to low power mode.
4. DeviceConnectivityCheck(): A command to communicate with all individual devices encased within the cabinet. This command can therefore give vital information on which particular devices are unreachable. For example, if all communication attempts return, then all devices within the cabinet are reachable; if only subrack units are unreachable, then this information can be captured.

Alarms
1. State = FAULT
2. any(MemberStates) = FAULT, ALARM
3. any(MemberStates) = UNKNOWN: In case of unreachable components.
4. rackTemperature > MAX: Maximum temperature TBD.
5. powerConsumption > MAX: Maximum power consumption TBD.


Sequence - Cabinet Encased Devices Reachability Test
The sequence diagram in Figure 6-60 shows how a cabinet has the responsibility to test the reachability of all the devices encased in it. These are devices of different types, and the implementation of the DeviceConnectivityCheck() method caters for information from the different encased devices. If a timeout period elapses with no reply from a specific device, then the state of that member is set to UNKNOWN.

Figure 6-60: Sequence diagram for cabinets monitoring reachability to all devices encased in the cabinet
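A minimal sketch of such a connectivity check over the cabinet's member devices, using TANGO proxy pings; the member bookkeeping and the return format are assumptions.

    import tango

    def device_connectivity_check(member_names):
        """Return {device_name: True/False} reachability for encased devices."""
        reachable = {}
        for name in member_names:
            try:
                proxy = tango.DeviceProxy(name)
                proxy.ping()               # raises on timeout / unreachable device
                reachable[name] = True
            except tango.DevFailed:
                reachable[name] = False    # member state would be set to UNKNOWN
        return reachable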

6.3.9.3 SPS Cabinet and MCCS Cabinet Devices

The SPS Cabinet and MCCS Cabinet devices are both subclasses of the Cabinet device and inherit all the attributes and behaviour. At a system level, the differences will be mostly related to the types of member devices (Cabinet device being an LFAA Group Device) associated with the cabinets.

Conceptually, these cabinet devices will wrap around the various members associated with them, and could be instructed to invoke particular commands on a particular device encased in the cabinet.


Figure 6-61: SPS and MCCS Cabinet device class diagrams (both inherit from Cabinet Device)

The full API and description of the cabinet commands can be found in the LFAA Internal Interface Control Document. [RD5]


6.3.10 Sub Rack Management Board

6.3.10.1 Class Diagram

Figure 6-62: Subrack Management Board device class diagram (inherits from LFAAGroupDevice)

The SubrackMgmtBoardDevice is responsible for monitoring and controlling subrack boards. It is inherently an LFAAGroupDevice, as there are a number of associated member devices forming a subrack (tiles, switch). Its main responsibilities are monitoring subrack temperatures, powering the unit up and down, and switching to low power mode.

6.3.10.2 Element Behaviour

States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
  ONLINE: The subrack can be used for scientific observing.
  MAINTENANCE: The subrack is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the subrack is set to MAINTENANCE mode.
  OFFLINE: The subrack is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
  NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA reports the operational state for the subrack.
  INIT: The subrack is being initialised; the subrack device checks for when this process is complete.
  OFF: The subrack is turned off.
  ON: The subrack is turned on and has been initialised.
  ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
  DISABLE: The subrack is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); only basic monitor and control functionality is available.
  FAULT: An unrecoverable fault has been detected. The subrack is not available for use; maintainer/operator intervention is required.
  UNKNOWN: The subrack is unresponsive, e.g., due to loss of communication.

healthState (read-only): OK, DEGRADED or FAILED. The overall subrack healthState.

obsState (read-only): The subrack Observing State indicates status related to scan configuration and execution.
  IDLE: The subrack is not being used for LFAA observations.
  CONFIGURING: N/A
  READY: The subrack enters READY when all subrack devices are ready to participate in an observation.
  SCANNING: The subrack is currently in use for an observation.
  PAUSED: N/A
  ABORTED: N/A
  FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands
1. PowerOnTpm(tpmId): Powers on a TPM connected to the subrack.
2. PowerOffTpm(tpmId): Powers off a TPM connected to the subrack.
3. PowerOn(): Powers on the subrack.
4. PowerOff(): Powers off the subrack.
5. SwitchToLowPowerMode(): Switches the entire subrack to low power mode.
6. SwitchTpmToLowPowerMode(): Switches a connected TPM to low power mode.
7. GetCoolingInformation(): Returns the input and output cold plate temperatures and flow speed, together with the temperature and speed of the air inside the cabinet.
8. ConfigureTpms(tpmIds, firmware): Programs the specified TPMs with the specified firmware and performs initial configuration (such as configuring and starting the PLL, powering on the ADUs and starting signal acquisition).
9. GetSynchronisationInformation(): Returns the status of the PLL and 10 MHz signals (through the lock status of the PLL).
10. GetNetworkInformation(): Returns the status of the network switch on the board, including switch port status and packet counters.
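A sketch of the ConfigureTpms() flow described above; the Tpm client class and its method names are illustrative stand-ins for the real TPM access library, not a documented API.

    def configure_tpms(tpms, firmware_path):
        """Program and initialise each TPM.

        tpms: iterable of TPM client objects (assumed interface);
        firmware_path: path to the firmware bitstream to load.
        """
        for tpm in tpms:
            tpm.program(firmware_path)      # load the firmware onto the FPGA
            tpm.configure_pll()             # configure and start the PLL
            tpm.power_on_adus()             # power on the ADUs
            tpm.start_acquisition()         # start signal acquisition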

Alarms
1. State = FAULT


6.3.11 Switch
6.3.11.1 Class Diagram

Figure 6-63: Switch device class diagram (inherits from LFAADevice)

A set of switches are responsible for routing most of the data transfers occurring in/out of LFAA. To this end, the device representing a switch will have attributes to reflect the various IP/port combinations, in particular the IP/port for LMC use, as well as the IP addresses for ingress/outgress data transfers. For every port housed by the switch, monitoring points will include the number of packets coming into the port (ingress), the number of packets moving out from the port (outgress), and the number of packet errors per port.

6.3.11.2 Element Behaviour
States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
  ONLINE: The switch can be used for scientific observing.
  MAINTENANCE: The switch is not used for scientific observing but can be used for testing and commissioning. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the switch is set to MAINTENANCE mode.
  OFFLINE: The switch is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
  NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA reports the operational state for the switch.
  INIT: The switch is being initialised; the switch control API is checked for when this process is complete.
  OFF: The switch is turned off.
  ON: The switch is turned on and has been initialized.
  ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
  DISABLE: The switch is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); only basic monitor and control functionality is available.
  FAULT: An unrecoverable fault has been detected. The switch is not available for use; maintainer/operator intervention is required.
  UNKNOWN: The switch is unresponsive, e.g., due to loss of communication.

healthState (read-only): OK, DEGRADED or FAILED. The overall switch healthState, based on factors like port packet states, port communication availability, etc.

obsState (read-only): The switch Observing State indicates status related to scan configuration and execution.
  IDLE: The switch is not being used for LFAA observation data transfer.
  CONFIGURING: Clearing statistics before reporting as ready for a new observation session.
  READY: The switch enters READY when all ports required by the observation are ready to start processing inward/outward data packets.
  SCANNING: The switch is currently in use for observation data transfer/routing.
  PAUSED: N/A
  ABORTED: N/A
  FAULT: An unrecoverable error that requires operator intervention has been detected.

Commands
1. ClearStats(): A command to clear packet statistics for this switch. This will reset all counts for all ports to zero. This method will be particularly useful at the start of an observation.
2. PortCommsCheck(): Performs a single check of internal switch communication to all ports. It is expected that the switch API provides a relevant call to perform this test. This command can therefore be used for fault finding or diagnosis purposes. Alternatively, it can be set as a polled command so that internal communication checks are performed periodically. The result can be reflected in the port_health attribute, with UNKNOWN states where reachability is a problem.
3. SetVlanId: For VLAN configuration, sets the VLAN id for this switch.
4. PowerOff(): Powers off the switch.
5. Reboot(): Reboots the switch.
6. GetPortStatistics(port): Gets port statistics.
7. ResetPortStatistics(port): Resets port statistics.
8. SetStaticRoute(macAddress, port): Sets a static route to macAddress on port.
9. ClearStaticRoute(macAddress, port): Clears the static route to macAddress on port.
10. PersistSwitchConfiguration(): Saves the current switch configuration to disk so that it can be applied on reboot.
11. GetPowerSupplyInformation(): Returns information about the system power supplies.
12. ConfigureSwitchManagement(): Configures the management port of the switch.
13. ConfigureNtp(ntpServerAddress): Configures the NTP service in the switch by synchronizing it with the provided NTP server.
14. UpgradeSystemSoftware(softwareLocation): Upgrades the system software with the software image specified at the provided location. This requires a switch restart.
15. UpgradeSystemFirmware(firmwareLocation): Upgrades the system firmware with the firmware image specified at the provided location. This requires a switch restart.

Alarms
1. State = FAULT
2. switchTemperature > MAX: Maximum temperature TBD.
3. obsState == SCANNING && NOT (portXIngress * 0.9 < portYOutgress < portXIngress * 1.1): The ingress/outgress rate for paired ports is expected to maintain a steady data rate during an observation. If the outgress rate deviates by more than +/- 10% from what the ingress (sender) port sends, an alarm is fired. This alarm is active only when the switch is in the scanning state.
4. obsState == SCANNING && packetErrors > MAX: Maximum packet errors per port TBD. This alarm is active only when the switch is in the scanning state.
5. obsState == SCANNING && packetErrors relChange > CHANGE_THRESHOLD: An alarm is set on the relative change of packet errors per port. A sudden jump in errors may indicate something wrong at the port or switch level. Change levels TBD. This alarm is active only when the switch is in the scanning state.
6. any(portHealth) = UNKNOWN, FAULT
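An illustrative evaluation of the paired-port rate alarm above: fire when the outgress rate of the paired port deviates more than +/- 10% from the ingress rate, and only while scanning. Function and parameter names are assumptions.

    def port_rate_alarm(ingress_rate, outgress_rate, scanning, tolerance=0.10):
        """Return True if the paired-port rate alarm should fire."""
        if not scanning or ingress_rate <= 0:
            return False
        # Alarm on deviation beyond the +/- tolerance band around the ingress rate.
        return abs(outgress_rate - ingress_rate) > tolerance * ingress_rate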


Activity - Port Health Activity Diagram
The activity diagram in Figure 6-64 describes the process of a switch checking all ports for their health. This will most probably be done by calling the relevant switch API and parsing the response. If timeouts occur when communicating with the switch API, it is assumed that the switch is, at least temporarily, in an UNKNOWN state.

Figure 6-64: Activity diagram for switch checking port health


6.3.12 Cluster Manager
6.3.12.1 Class Diagram

Figure 6-65: Class diagram for cluster manager device (inherits from LFAAGroupDevice)

The ClusterManager device inherits from the LFAAGroupDevice and is a representation of the entire set of nodes forming the compute cluster for MCCS. It is expected that this device will not necessarily communicate directly with the actual nodes, but will obtain the required attribute data from a cluster management system (through an API). However, this device can also be implemented to communicate with the Server devices directly if needed. Besides monitoring points, this device is responsible for communicating with a resource manager for the cluster to start/stop/submit jobs and monitor their progress.


6.3.12.2 Element Behaviour
States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
  ONLINE: The cluster manager can be used during scientific observing.
  MAINTENANCE: The cluster manager is not used for scientific observing but can probably offer limited functionality if the service is turned on and being maintained by an administrator. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the cluster manager is set to MAINTENANCE mode.
  OFFLINE: The cluster manager is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
  NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components used by the cluster and reports the overall operational state for the cluster manager.
  INIT: The cluster manager service is being initialized. A check for necessary daemons and services is required to make sure this manager is able to receive instructions.
  OFF: The cluster manager service is turned off.
  ON: The cluster manager service is turned on.
  ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
  DISABLE: The cluster manager service is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but operations are limited (API dependent).
  FAULT: An unrecoverable fault has been detected. The cluster manager service is not available for use; maintainer/operator intervention is required.
  UNKNOWN: The cluster manager service is unresponsive, e.g., due to loss of communication, complete master node failures, etc.

healthState (read-only): OK, DEGRADED or FAILED. The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components managed by this service, probably by querying the cluster manager API, and an intelligent rollup is maintained in the healthState.

obsState (read-only): The cluster manager service Observing State indicates status related to participation in scan configuration and execution.
  IDLE: The cluster itself is not processing input data and is not generating output products.
  CONFIGURING: Transient state entered when a command to e.g. restart services/daemons is received. The cluster manager service leaves this state when re-configuration is completed.
  READY: The cluster manager service enters READY when re-configuration has been completed and the cluster is ready to do data processing.
  SCANNING: The cluster service reports nodes which have running processes doing data processing.
  ABORTED: The cluster manager service transitions to this state when an 'abort scan' command is received. It transitions to IDLE as soon as activity related to the scan is aborted.
  FAULT: An unrecoverable error that requires operator intervention has been detected.

Alarms
1. State = FAULT
2. memoryAvail < MIN %: Minimum value TBD, as a percentage of memory_used/memory_total.
3. jobsFailed > 0: We need to know immediately if a submitted job has failed.
4. jobsUnreachable > 0: We need to know immediately if a job is unreachable.
5. masterDiskUsed > MAX: Maximum amount of used space by the cluster is TBD.
6. masterMemUsed > MAX: Maximum amount of used memory for the cluster is TBD.
7. masterGpusUsed > MAX: Maximum number of GPUs that can be used by the cluster is TBD.
8. masterCpusUsed > MAX: Maximum number of CPUs that can be used by the cluster is TBD.
9. nodesInUse > MAX: Maximum number of nodes that can be actively working on jobs is TBD.
10. any(shadowMasterPoolStatus) = UNKNOWN/FAULT: We need to know if a shadow master is unreachable or faulty, so that it is not selected to take over from the current master.

Commands
1. StartJob(jobId): Command to start a particular job.
2. StopJob(jobId): Command to stop a particular job.
3. SubmitJob(jobConfig): Command to submit a job to the queue.
4. GetJobStatus(jobId): Polls the current status of a job.
5. ClearJobStats(): Used to reset all job counters; useful at the start of a new observation.
6. PingMasterPool(): Pings all nodes in the shadow master pool to maintain the status of each.
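A thin-wrapper sketch of the job commands above around a generic resource-manager API. The resource_manager client object and its submit/status methods are assumptions; a real deployment would call the chosen cluster manager's own API.

    class ClusterJobInterface:
        """Sketch of the ClusterManager job commands over an assumed API."""

        def __init__(self, resource_manager):
            self.rm = resource_manager
            self.job_stats = {"submitted": 0, "failed": 0}

        def submit_job(self, job_config):
            # SubmitJob(jobConfig): queue the job and track a counter.
            job_id = self.rm.submit(job_config)
            self.job_stats["submitted"] += 1
            return job_id

        def get_job_status(self, job_id):
            # GetJobStatus(jobId): poll the resource manager for progress.
            return self.rm.status(job_id)

        def clear_job_stats(self):
            # ClearJobStats(): reset counters at the start of a new observation.
            self.job_stats = {"submitted": 0, "failed": 0}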


Activity - Monitoring Shadow Master Pool States
One of the reliability mechanisms for a cluster system is to have a number of shadow masters which can replace the current master should it become faulty or go offline. For this reason, the cluster manager device aggregates state information for all nodes marked as shadow masters. This information is required for the election of a new master (whichever election algorithm is selected). It also provides a high-level view of the health of the shadow masters for LMC purposes. This behaviour is shown in the activity diagram in Figure 6-66.

Figure 6-66: Activity diagram for checking state of all servers which serve as shadow master nodes.

Activity - Master Node Election
The cluster will implement one of the many algorithms available for master node election. There are various ways in which this can be done; however, the basic principle for many algorithms is the same. Some established algorithms, the first of which is sketched below, are:
- The Bully algorithm
- The Paxos algorithm
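A compact sketch of the Bully principle: the highest-ranked reachable node wins the election. The reachable() callback stands in for the real liveness test (e.g. the shadow-master pings of Figure 6-66).

    def bully_election(node_ids, reachable):
        """Elect the highest-ranked reachable node as the new master."""
        alive = [n for n in node_ids if reachable(n)]
        return max(alive) if alive else None

    # Example: nodes 1-5 with node 5 offline -> node 4 becomes master.
    assert bully_election([1, 2, 3, 4, 5], lambda n: n != 5) == 4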


6.3.13 Storage Manager
6.3.13.1 Class Diagram

Figure 6-67: Class diagram for storage manager device (inherits from LFAADevice)

The Storage Manager device will wrap around the distributed storage management system employed for the MCCS cluster. It will mainly aggregate attribute data for memory/storage use, keep track of mounted volumes and their availability, and provide operations to create/destroy/configure volumes and their replication method.

6.3.13.2 Element Behaviour
States and Modes

adminMode (read-write): Set by an outside authority (the Observatory operations via TM).
  ONLINE: The storage manager can be used during scientific observing.
  MAINTENANCE: The storage manager is not used for scientific observing but can probably offer limited functionality if the service is turned on and being maintained by an administrator. The LFAA is not aware of the higher observation goals and does not enforce this restriction; the LFAA executes commands received from TM. However, some test modes may be available only when the storage manager is set to MAINTENANCE mode.
  OFFLINE: The storage manager is not used at all; when adminMode=OFFLINE, the operational state=DISABLE.
  NOT_FITTED: Set by operations to suppress alarm generation.

opState (read-only): LFAA intelligently rolls up the operational state of all components used by the storage manager and reports its overall operational state.
  INIT: The storage manager service is being initialized. A check for necessary daemons and services is required to make sure this manager is able to receive instructions.
  OFF: The storage manager service is turned off.
  ON: The storage manager service is turned on.
  ALARM: The Quality Factor for at least one attribute is outside the pre-defined ALARM limits. Some or all functionality may not be available.
  DISABLE: The storage manager service is administratively disabled (adminMode=OFFLINE, NOT_FITTED, or RESERVE); basic monitor and control functionality is available, but operations are limited (API dependent).
  FAULT: An unrecoverable fault has been detected. The storage manager service is not available for use; maintainer/operator intervention is required.
  UNKNOWN: The storage manager service is unresponsive, e.g., due to loss of communication, the master node which interacts with the storage manager being down, etc.

healthState (read-only): OK, DEGRADED or FAILED. The LFAA intelligently rolls up attribute quality factors, states, and other indicators for all components managed by this service, probably by querying the storage manager API, and an intelligent rollup is maintained in the healthState.

obsState (read-only): The storage manager service Observing State indicates status related to participation in scan configuration and execution.
  IDLE: The storage manager is not reporting any reading/writing of observation data.
  CONFIGURING: Transient state entered when a command to e.g. restart services/daemons is received. The storage manager service leaves this state when re-configuration is completed.
  READY: The storage manager service enters READY when re-configuration has been completed and the storage system is ready for data processing.
  SCANNING: The storage service reports heavy reads/writes on any or all storage nodes during an observation.
  PAUSED: n/a
  ABORTED: n/a
  FAULT: An unrecoverable error that requires operator intervention has been detected.

Alarms
1. State = FAULT
2. storageAvail < MIN: Minimum amount of storage TBD.
3. memoryAvail < MIN %: Minimum value TBD, as a percentage of memory_used/memory_total.
4. healthState = DEGRADED or FAILED: Occurs when volume connectivity fails but there is still enough storage available (DEGRADED), or when there is not enough space for an observation (FAILED).

Commands
1. CreateVolume(): Instructs the storage manager to create a new volume.
2. DestroyVolume(volId): Instructs the storage manager to delete a volume.
3. FormatVolume(volId): Instructs the storage manager to format an existing volume.
4. ConfigureReplication(volId, config): Sets up a particular replication configuration for an existing volume.
5. GetObsDirectory(obsId): Returns the absolute path to the root directory containing data for a particular observation.
6. CheckVolumeConnectivity(): A diagnostic procedure to check whether all systems hosting the volumes are physically reachable via IP.


Activity - Volume Connectivity Checking
Volumes in a cluster storage system are usually addressable by the IP address of the volume's host, as part of the fully qualified domain name of the volume. If the host is unreachable, then so is the physical volume itself. By maintaining periodic connectivity checks to these nodes, the storage manager device summarizes whether the storage system as a whole is degraded. Additionally, a threshold of acceptable storage capacity can be defined and, based on querying the resource management system for the LFAA storage, the storage manager is marked as FAILED if the minimum requirements are not met. This behaviour is described in the activity diagram in Figure 6-68.

Figure 6-68: Activity diagram for storage manager device to summarize the state of all storage volumes.
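A minimal sketch of this check: probe each volume host over the network and roll the result, together with a capacity threshold, into an overall state. The host/port list, the TCP probe, and the threshold parameter are illustrative assumptions.

    import socket

    def check_volume_hosts(volume_hosts):
        """Return the subset of (host, port) pairs that accept a TCP connection."""
        reachable = []
        for host, port in volume_hosts:
            try:
                with socket.create_connection((host, port), timeout=2):
                    reachable.append((host, port))
            except OSError:
                pass  # host unreachable -> its volume is unreachable too
        return reachable

    def storage_health(available_bytes, min_required_bytes, all_volumes_ok):
        """Summarize overall storage state from capacity and connectivity."""
        if available_bytes < min_required_bytes:
            return "FAILED"       # minimum requirements for an observation not met
        return "OK" if all_volumes_ok else "DEGRADED"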


6.4 Variability Guide

The variability expected for this view lies in the eventual replacement of hardware (e.g. switches) with new hardware over time. This should not be problematic as long as the interface, i.e. the attributes/commands of a particular device, does not change. The internal implementation can of course be changed.

6.5 Rationale

This view describes monitoring and control drivers for the specific hardware/software components making up the MCCS system. The major infrastructural decisions are made in the LMC Infrastructure View, and this section therefore follows directly from it. The rationale for which commands/attributes are available for particular devices derives from how the TANGO paradigm is best utilized when applied to the MCCS requirements. Even so, it is expected that changes to these can and will occur over time; such changes can all be accommodated so long as they comply with the general TANGO paradigm and guidelines, as well as with the LMC Infrastructure in place.

6.6 Related Views

The LMC Infrastructure View provides the notation, state machine, and base devices for LFAA TANGO MCCS devices, all of which are used and assumed in this view.


7 Hardware Configuration Management View

The LFAA LMC system will contain a relational database that records the hardware configuration of the system as deployed. This will enable traceability of all the hardware in the system, as well as of the connectivity between different hardware elements. Moreover, the continuous running and maintenance of system hardware means that some components will be replaced by others, and therefore this database will allow for update operations after deployment. Additionally, this database will be used for system diagnostics by the LMC. Figure 7-69 shows the context diagram, including external users and the actions that can be performed on the hardware configuration database. The top-level information to be included in the database comprises:

- Global identifiers for all MCCS, SPS and Field Node monitorable hardware devices, including sub-components and interconnecting cables
- Connections between components, including intermediary cables
- Component geographic localisation information (such as whether a component is in the CPF or an RPF, the rack number, and where within the rack the component is)
- A maintenance log for each device
- The network configuration for each device
- Software and firmware versions

Figure 7-69: Hardware Configuration Management context diagram.


7.1 Primary Presentation

The high-level device hierarchy should be similar to the PBS for LFAA and reflect the hierarchy of the deployed hardware. Figure 7-70 presents the device hierarchy as adapted from the PBS. It shows how the hardware components are associated with each other and the multiplicities of each element. Only monitorable devices are shown in this diagram, and low-level LRUs (below L5, such as GPUs and LNAs) are not shown. The device hierarchy can be extended to include lower-level devices. Elements in blue represent device groupings; they are not physical devices but rather high-level groupings of various components.

Figure 7-70. Hardware Configuration Management Primary Presentation


7.1.1 Element Catalogue, Properties and Relationships

The contents of the hardware configuration database can be logically grouped into four types:

- Generic Device: a high-level description of a hardware component
- Monitorable Device: a hardware device which can be accessed by MCCS, and which therefore includes additional properties such as networking information and location within cabinets (the APIU is a special case in that it is a monitorable item but does not reside in a rack)
- Antenna: a device which is not located within the CPF or RPFs and requires additional properties such as its location within a station. This table on its own represents the antenna table which can be used for calibration. An associated station table is also included so that station information does not have to be replicated in all antenna entries
- Cable: cables interconnect hardware components, and there is a large variety of them in LFAA. For the purposes of the hardware configuration database, the serial number and the list of hardware devices interconnected by each cable are sufficient

Figure 7-71. Hardware configuration database entry types


Table 7-17, Table 7-18, Table 7-19, Table 7-20 and Table 7-21 describe the properties specified in Figure 7-71. Note that the Monitorable Device and Antenna types also include the properties in Generic Device (they 'inherit' those properties).

Table 7-17. Generic Device property list
global_id: A global identifier within the table.
device_type: The type of device. The list of device types is specified as an enumeration and this property should specify one of these.
device_id: The device-specific identifier within the device type. The two combined form a unique device identifier; for example, a hardware component with device_type TPM and device_id 32 refers to one specific hardware component.
parent_id: Points to the global_id of the parent device within the hardware device hierarchy. This can be used to re-create the hierarchy shown in Figure 7-70.
serial_number: Serial number of the device, which should be visible on the device itself.
cable_ids: Points to a number of ids within the Cable table (can be set to NULL where no cables are attached). This is a list since multiple cables can be connected to the device.
installation_date: The installation date.
last_maintenance_date: The date when maintenance was last performed on the device.
maintenance_log: A text field used as a maintenance log. Every time a maintenance operation is performed on the device, the maintenance log should be updated accordingly.
configuration: Device-specific configuration which is used by the software infrastructure to configure the device on startup (or when transitioning to READY from low-power mode). This takes the form of a key-value field where the key represents a configurable parameter and the value defines the value of that parameter. This field can also contain additional parameters to extend the functionality of the database. For instance, cabinet entries can also include a parameter representing whether the cabinet is in the CPF or one of the RPFs (a processing facility ID) and where within the facility the cabinet is located (or a cabinet number representing its location).

Table 7-18. Monitorable Device property list
cabinet_id: The global_id of the cabinet device in which the device is physically located.
location_in_cabinet: The physical location within the hosting cabinet, including the rack unit, whether it is at the back or front, and any other additional information.
fw_version: Device firmware version, if any.
sw_version: Device embedded software version, if any.
mac_address: MAC address of the monitoring interface.
ip_address: IP address of the monitoring interface (this would be assigned during deployment, as IPs are assigned by MCCS).


Table 7-19. Antenna property list
station_id: Identifier of the station to which the antenna belongs.
x_displacement: X displacement in meters from the station centre.
y_displacement: Y displacement in meters from the station centre.

Table 7-20. Station property list
station_id: Station identifier.
station_latitude: Latitude of the station.
station_longitude: Longitude of the station.
station_altitude: Altitude of the station.

Table 7-21. Cable property list
global_id: A global identifier within the table.
cable_type: The type of cable. The list of cable types is specified as an enumeration and this property should specify one of these.
serial_number: Serial number of the cable, which should be visible on the cable itself.
device_ids: Points to a number of ids within the Device tables. This is a list since multiple devices can be connected to the cable.
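To make the property lists above concrete, here is a minimal relational sketch using SQLite. The column types, the JSON encoding of the key-value configuration field, and the normalisation of the cable_ids/device_ids lists into a link table are assumptions; the production schema may differ.

    import sqlite3

    conn = sqlite3.connect("hw_config.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS device (
        global_id             INTEGER PRIMARY KEY,
        device_type           TEXT NOT NULL,      -- enumerated device type name
        device_id             INTEGER NOT NULL,   -- unique within device_type
        parent_id             INTEGER REFERENCES device(global_id),
        serial_number         TEXT,
        installation_date     TEXT,
        last_maintenance_date TEXT,
        maintenance_log       TEXT,
        configuration         TEXT                -- key-value pairs as JSON text
    );
    CREATE TABLE IF NOT EXISTS monitorable_device (
        global_id           INTEGER PRIMARY KEY REFERENCES device(global_id),
        cabinet_id          INTEGER REFERENCES device(global_id),
        location_in_cabinet TEXT,
        fw_version          TEXT,
        sw_version          TEXT,
        mac_address         TEXT,
        ip_address          TEXT
    );
    CREATE TABLE IF NOT EXISTS cable (
        global_id     INTEGER PRIMARY KEY,
        cable_type    TEXT NOT NULL,
        serial_number TEXT
    );
    -- cable_ids / device_ids lists normalised into a link table
    CREATE TABLE IF NOT EXISTS cable_connection (
        cable_id  INTEGER REFERENCES cable(global_id),
        device_id INTEGER REFERENCES device(global_id)
    );
    """)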

7.2 Element Behaviour

The following sections define element behaviour from different perspectives and use cases with regard to hardware configuration management, most notably how the database can be populated, how hardware can be located within the processing facilities and in the field, and how hardware devices can be replaced.

It is assumed that hardware maintainers and deployers will have access to a tool, both in the processing facilities and in the field, which provides access to the hardware configuration database and facilitates the operations described below. This tool would be part of the GUI (Section 8.4.1) and would also be available through the CLI (Section 8.4.2). Additionally, since all hardware devices and cables will have identification labels, most probably including a barcode representing the device, a barcode scanner can be used to insert serial numbers. This would decrease the time required to register new devices and reduce text-input errors (through human error). Since network access will be available both in the processing facilities and in the field (through the APIUs), a portable device can be used both for data entry and for scanning identification codes. The following element behaviour assumes that such a tool is available.

7.2.1 Adding new hardware devices

During deployment, each new hardware device which is installed must be added to the hardware configuration database. This process should be as simple as possible. Figure 7-72 shows the activity diagram which describes how a hardware deployer, using a specialised tool, can add new entries to the database. Although 'scan' is used to insert the serial number into the database, these values can also be input manually,


however this would increase the time required to do so. The activity diagram differentiates between four types of devices:

- A cable, which just needs a scan of the identification number. Connections are then added by associating the cable with a hardware device
- Cabinets, which do not have physical parent devices and require a cabinet location (or cabinet identifier)
- Antennas, which are deployed in the field. Here it is assumed that the antenna location will be inserted manually; however, it is also possible to integrate a device which calculates the geodetic coordinates of where the antenna is placed so that the location can be inserted automatically
- All other hardware devices, which are generally contained within a cabinet

Figure 7-72. Adding a new device activity diagram

The steps required to add a new hardware device to the configuration database are described below:

1. The hardware device is installed (this step can also be performed after inserting the device information in the database, depending on how easily reachable the identification code is once the device is installed)

2. The device identification code is scanned


3. The flow then forks depending on the type of device:
   a. For a cable, no further action is required
   b. For an antenna, it is associated with a station by entering a station ID, and then the antenna's displacement from the station centre is input
   c. For a cabinet, the cabinet location is specified (a processing facility identifier and a cabinet identifier within that processing facility)
   d. For all other hardware devices, the parent device identification code is scanned, such that the device is associated with its parent (as shown in Figure 7-70). For this type of device, the parent device will generally be a cabinet. The only device which is monitorable but does not reside in a cabinet is the APIU; in this case, the parent_id field should be skipped and the station_id is added to the configuration field. For a device inside a cabinet, the location within the cabinet is then inserted. If the device is monitorable, then the MAC address should be included as well (not all devices need the MAC address to be input during installation; the addresses of hot-swap devices such as TPMs can be included automatically by the LMC)

4. For each cable to which the device is connected, the cable identifier is scanned. In this step the device_id is also added to the cable's device_ids field. This step needs a verification stage, since many cables may need to be connected to the device (such as for TPMs and APIUs). For each cable scanned, the end-points of the cable, that is, the hardware components connected to the other end of the cable, if any, are displayed so that the deployer can confirm that the correct cables are inserted (into the appropriate slot)

These steps should result in all the table fields for the installed device being populated. Some of these are populated automatically when a scan is performed; for example, scanning the hardware device itself will populate the global_id, device_id, serial_number and installation date. Additional notes can be included in the maintenance log. The configuration field can then be populated at a later stage to include information required by the LMC to operate the device.
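
To make the registration step concrete, the following is a minimal sketch, assuming an SQLite-backed configuration database and hypothetical table and column names (Device, serial_number, parent_id, and so on); the deployed tool would target the actual hardware configuration database and its schema.

    import sqlite3
    from datetime import date

    def register_device(db, serial_number, device_type, parent_serial=None, location=None):
        """Insert a newly scanned device into the (hypothetical) Device table."""
        cur = db.cursor()
        parent_id = None
        if parent_serial is not None:
            # Look up the parent (usually a cabinet) scanned after the device itself
            cur.execute("SELECT device_id FROM Device WHERE serial_number = ?", (parent_serial,))
            row = cur.fetchone()
            parent_id = row[0] if row else None
        cur.execute(
            "INSERT INTO Device (serial_number, device_type, parent_id, location, installation_date) "
            "VALUES (?, ?, ?, ?, ?)",
            (serial_number, device_type, parent_id, location, date.today().isoformat()),
        )
        db.commit()
        return cur.lastrowid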

7.2.2 Replacing hardware devices

All maintenance operations performed on the device should be recorded in the maintenance log field of the device, in which case the last_maintenance_date will be automatically updated. When replacing a hardware device, a new record is created with all entries copied from the original hardware device entry. The original record is moved to a historical database. The following steps are required:

- Select the appropriate function in the GUI (or CLI), which informs the tool
- Scan the hardware device to be replaced
- Scan the new hardware device
- If required, update any of the fields (such as replacing the MAC address)

When scanning a device to be replaced, only the identification code is replaced, since it is assumed that all other fields (except for the MAC address and potentially some entries in the configuration field) will be applicable to the new device. This also holds for cables and antennas.
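
A correspondingly minimal sketch of the replacement step, under the same assumptions as above (SQLite backend, hypothetical Device and Device_history tables): the original record is copied to the historical store, and the live record keeps every field except the identification code and, where needed, the MAC address.

    def replace_device(db, old_serial, new_serial, new_mac=None):
        """Archive the old record and re-key it for the replacement device."""
        cur = db.cursor()
        # Move a copy of the original record to the historical table
        cur.execute("INSERT INTO Device_history SELECT * FROM Device WHERE serial_number = ?",
                    (old_serial,))
        # Re-key the live record; all other fields are assumed to apply to the new device
        cur.execute("UPDATE Device SET serial_number = ? WHERE serial_number = ?",
                    (new_serial, old_serial))
        if new_mac is not None:
            cur.execute("UPDATE Device SET mac_address = ? WHERE serial_number = ?",
                        (new_mac, new_serial))
        db.commit()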

7.2.3 Database querying

Even though the schema described above is a relatively simple one, all the following queries can be performed, which should provide maintenance personnel with information to help in fault finding (this is just a sample list of the types of operations which can be performed; a sketch of one such query follows the list):

- Create an antenna map for a number of stations. This can be combined with the health status of each antenna, from the LMC system, to generate a map showing all the antennas and highlighting ones which are in a faulty or alarm state

- Generate a cabinet device map, similar to the above query, where all the components of a rack are displayed, with health status from the LMC system, highlighting devices which are in a faulty or alarm state

- Get all devices connected to a specific cable

- Select two devices and get all the interconnecting devices and cables, if any

- Check maintenance logs, sort by last maintenance date, and other maintenance-related queries
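
As an illustration, a query of the third kind (all devices connected to a specific cable) might look as follows, again assuming the hypothetical SQLite schema, with the cable's device_ids realised as a link table:

    def devices_on_cable(db, cable_serial):
        """Return serial numbers and types of all devices connected to a given cable."""
        cur = db.cursor()
        cur.execute(
            "SELECT d.serial_number, d.device_type "
            "FROM Cable c "
            "JOIN Cable_Device cd ON cd.cable_id = c.global_id "  # link table for device_ids
            "JOIN Device d ON d.device_id = cd.device_id "
            "WHERE c.serial_number = ?",
            (cable_serial,),
        )
        return cur.fetchall()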

7.3 Related Views

The Monitoring and Control View provides details on how the hardware components of SPS, MCCS and Field Node are represented as TANGO devices, and details the monitoring points and commands available, some of which are referred to in this view.

The Maintenance Support View describes the maintenance-related interfaces available which, when combined with the hardware configuration database, provide a full set of utilities for maintenance support.


8 Maintenance Support View

8.1 Context Diagram

Figure 8-73 shows the primary use cases for maintenance support functionality in the LFAA MCCS software system. These use cases can be split into rough categories:

1. Remote operations
   a. Remote diagnostics
   b. Remote powering up/down and restarting of hardware/software elements

2. Metadata for diagnosis
   a. A log of all software behaviour in the system

3. Maintenance operations
   a. Fault diagnosis
   b. Error detection
   c. Remote debugging
   d. Default actions on error

Figure 8-73: Primary use-cases for maintenance support arising from the LFAA MCCS software system.


8.2 Primary Presentation

In the architectural overview of this software architecture document, the high-level presentation in Figure 8-74 was given. This primary presentation defines all the logical components which make up the interfaces required for maintenance support.

Figure 8-74: LFAA local monitoring and control overview

At a low level, every component wrapped as a TANGO device will be able to provide detailed logs of all operations as required. These logs can also be split into various log levels, and it is assumed that detailed logs for maintenance purposes will be set up with a "DEBUG" log level. This means that errors and faults are also logged, given the logging level hierarchy provided by the TANGO control system.
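
For example, raising a device's log level to DEBUG and directing its output to a file could be done from a PyTANGO client roughly as follows (the device name is hypothetical; 5 is the TANGO DEBUG level):

    import tango

    # Hypothetical MCCS device name
    tile = tango.DeviceProxy("low-mccs/tile/0001")

    # TANGO logging levels: 0=OFF, 1=FATAL, 2=ERROR, 3=WARN, 4=INFO, 5=DEBUG
    tile.set_logging_level(5)

    # Also send the device's log output to a file target
    tile.add_logging_target("file::/var/log/mccs/tile0001.log")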

The metadata required for diagnosis can be collected from all monitoring points/attributes defined in all devices. Moreover, these values can be archived internally in order to provide a trace of all attribute behaviour over a required timescale.

With regards to maintenance operations, all operations required on particular hardware/software devices should be available as attributes/commands on the TANGO device for the element in question. If, for example, a particular device supports a power-restart, then this functionality should be wrapped into a TANGO command for the particular device to perform that power-restart. Once that wrapper is available, remote operation of the command is also available; this is handled by the TANGO framework.
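
As a minimal sketch of such a wrapper, a device written with the PyTANGO high-level API could expose a power-restart as a TANGO command; the device class name and the underlying driver call are placeholders, not the actual MCCS implementation.

    from tango import DevState
    from tango.server import Device, command

    class ApiuDevice(Device):
        """Illustrative TANGO device wrapping a hardware unit supporting power-restart."""

        def init_device(self):
            super().init_device()
            self.set_state(DevState.ON)

        @command
        def PowerRestart(self):
            self.info_stream("Power-restarting the unit")
            self.set_state(DevState.INIT)
            # ... the real hardware driver's power-cycle call would go here ...
            self.set_state(DevState.ON)

    if __name__ == "__main__":
        ApiuDevice.run_server()

Once the device is registered in the TANGO database, the command can be invoked remotely through a DeviceProxy, so remote maintenance operation comes for free from the framework.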

8.3 Element Catalog

8.3.1 Graphical User Interface (GUI)

Users will interact with a web-based GUI via a browser. The front end will be responsible for accepting user requests, interpreting them, and routing them to the LFAA LMC Master devices, via a service layer that translates GUI requests to TANGO requests. Replies from the TANGO subsystem are then passed back to the user via a service layer that passes on output from the LFAA LMC system to the web interface.

The main idea, compared to other more traditional server-side architectures, is to build the server as a set of stateless reusable REST services and, from a Model-View-Controller (MVC) perspective, to take the controller out of the back-end and move it into the browser. The TANGO consortium is currently developing a REST API for development of web applications to control a TANGO subsystem, and it is envisioned that the LFAA LMC GUI system will make use of this framework. Requests and replies can inherently be passed on as messages, for instance in JavaScript Object Notation (JSON), between client and server.

Whilst a formally correct and exhaustive REST API cannot be defined until the architecture of the TANGO REST framework is fully known (see http://tango-rest-api.readthedocs.io/en/latest/), some general concepts of how this GUI framework operates can be described.

8.3.1.1 Element Interface

The dynamicity of the TANGO framework must be reflected in the interfacing layer, which exposes all the available functionality to third-party clients, in this case the LFAA LMC GUI. The TANGO REST API will be hosted in a webserver which exposes a number of URLs, each of which results in an action being performed on the TANGO framework. REST over HTTP is used to communicate with the web server. The list of URLs is assumed to be generated dynamically based on the devices/commands/attributes available in the LFAA LMC system. Some early access code from the TANGO REST API work demonstrates that the URLs are designed in such a way as to make it easy to drill down, or filter, components and capabilities by specifying IDs, types and other filtering options. This emulates the nature of TANGO as representing a control system by a hierarchy of devices.

REST stands for Representational State Transfer. It relies on a stateless, client-server, cacheable communication protocol (primarily HTTP). It is an architectural style for designing network applications. The primary aim is that instead of using complex mechanisms such as CORBA, RPC or SOAP to connect machines, simple HTTP is used to make calls. RESTful applications use HTTP requests to post data (create and/or update), read data (for example, to make queries) and delete data. Thus, REST uses HTTP for all four CRUD (Create/Read/Update/Delete) operations. These operations are performed through the following HTTP requests:

- GET – Query an entity for information or data
- POST – Issue a command which changes the state of an entity (for example, to create an observation or write an attribute value)
- PATCH – Update the state of a created entity (for example, to stop an observation)
- DELETE – Delete an entity (for example, unsubscribe from receiving an event, which will delete the appropriate entry)

The TANGO REST API capabilities will be utilised primarily to provide the following functionality around the core monitoring and control capabilities:

Title                Description
Observations         Used to start and monitor the status of an observation
Components           Used to get a list of available components, together with their capabilities, and perform actions on these components
Events               Used to get an ordered list of events
Alarms               Used to get a list of alarms
Event Subscriptions  Set up and tear down event subscriptions
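
By way of illustration only (the exact URL scheme depends on the TANGO REST API version finally deployed; the host, port, database server and device names below are hypothetical), reading an attribute and invoking a command over the REST interface could look like this:

    import requests

    BASE = "http://lmc-server:8080/tango/rest/rc4"  # hypothetical host and API version

    # Read an attribute value from a (hypothetical) MCCS device
    r = requests.get(BASE + "/hosts/databaseds/10000/devices/"
                            "low-mccs/tile/0001/attributes/temperature/value")
    print(r.json())

    # Execute a command on the same device (argument encoding is illustrative)
    r = requests.put(BASE + "/hosts/databaseds/10000/devices/"
                            "low-mccs/tile/0001/commands/PowerRestart",
                     json={"input": None})
    print(r.status_code)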

8.3.1.2 Element Behaviour

It is important to highlight the two main message exchange patterns between the GUI client and the LFAA LMC control system in a client/server architecture:

1. Request-Response: the most RESTful approach, a CRUD interface to access/create/modify/delete control operations (commands and attributes) in TANGO servers. This process is described in the activity diagram in Figure 8-75.


2. Publish-Subscribe: functionality offered by the TANGO ecosystem, but not directly by a REST API (unless using blocking calls). The control system should be able to send asynchronous notifications to the client as soon as possible. This is done via a WebSockets interface over HTTP. This process is described in the sequence diagram in Figure 8-76.
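
A client-side sketch of the second pattern, using the Python websockets package; the endpoint URL and the subscription message format are hypothetical, since they depend on the REST/WebSockets gateway finally adopted:

    import asyncio
    import json
    import websockets

    async def listen_for_events(uri):
        """Subscribe to attribute change events and print them as they arrive."""
        async with websockets.connect(uri) as ws:
            # Hypothetical subscription message
            await ws.send(json.dumps({
                "action": "subscribe",
                "device": "low-mccs/tile/0001",
                "attribute": "temperature",
            }))
            async for message in ws:
                print("event:", json.loads(message))

    asyncio.run(listen_for_events("ws://lmc-server:8080/events"))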


Figure 8-75: REST API to TANGO – Request-Reply Flow



Figure 8-76: Publish-subscribe from TANGO to HTTP GUI via Websockets

8.3.1.3 Element Properties

The definition of element properties for RESTful access to the monitoring and control subsystem can be split into a series of URLs that reflect the hierarchy of the system and the operations that can be performed on TANGO devices in general. The description of this implementation can be found at: http://tango-controls.readthedocs.io/en/latest/development/advanced/rest-api.html

8.3.2 Command Line Interface (CLI)

The maintenance support CLI interfaces will run on user computers that have access to the LFAA monitor and control network. Network access, authorization and authentication are provided by the Observatory and are not controlled by LFAA. It is assumed that the LFAA engineering interfaces will be able to access the LFAA TANGO Facility Database from the LFAA Facility (from local equipment connected with a keyboard/mouse and a screen) and remotely (from the SKA control rooms and other authorized facilities and computers). CLI interfaces should in general provide the following functionality:

1. Basic control & monitoring
2. LFAA-specific tools for setup and configuration
3. Debugging, testing and diagnostics
4. Health monitoring
5. Alarm management
6. Direct access to monitoring data by external operators (engineers) in case of TM failure
7. Interfaces for non-TANGO components, possibly via a tunneling mechanism

8.3.2.1 TANGO Framework

As far as the TANGO subsystem is concerned, LFAA LMC will make use of a ready-developed CLI to all TANGO functionality via the iTANGO tool (see http://pythonhosted.org/itango/). Essentially, iTANGO is a layer on top of the standard IPython environment, providing TANGO-specific functionality. This tool can cater for:

1. Basic monitoring and control
2. Health monitoring
3. Direct access to monitoring data by external operators (engineers) in case of TM failure
4. Partial alarm management
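
An illustrative iTANGO session covering the first two items; the device and attribute names are hypothetical, and the exact helpers available depend on the iTANGO version:

    # Launched with the `itango` command, which starts IPython with TANGO helpers
    In [1]: tile = Device("low-mccs/tile/0001")   # iTANGO alias for DeviceProxy

    In [2]: tile.state()
    Out[2]: tango._tango.DevState.ON

    In [3]: tile.read_attribute("temperature").value
    Out[3]: 41.5                                  # illustrative reading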

8.3.2.2 Engineering Scripts

A number of scripts can be devised to help automate processes for:

1. System setup and configuration
2. Alarm management
3. Debugging and diagnostics
4. Interfaces for non-TANGO components, possibly via a tunneling mechanism
5. CLI-based tests for REST API functionality

System Setup and Configuration

The initial setting up of TANGO devices for the system, and their configuration in a database with the appropriate default attribute values, can be performed by an automated engineering script which can:

1. Tear down the current LMC configuration, clearing the TANGO database
2. Create a new LMC configuration from scratch
3. Update the current LMC configuration

The prototype developed for AAVS does the above by parsing a JSON configuration file, which contains entries for all the device servers that need to be loaded and the devices that need to be created, and bootstraps the execution of the device servers. Devices are populated with predefined property values. This tool allows for very easy modification and automation of the TANGO components within the LMC system.
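
A condensed sketch of that bootstrap logic, using the PyTANGO Database API; the configuration layout shown is hypothetical and does not reproduce the actual AAVS file format:

    import json
    import tango

    def configure_lmc(config_path):
        """Register the devices described in a JSON configuration file in the TANGO DB."""
        with open(config_path) as f:
            config = json.load(f)

        db = tango.Database()
        for server in config["servers"]:              # hypothetical layout
            for dev in server["devices"]:
                info = tango.DbDevInfo()
                info.name = dev["name"]               # e.g. "low-mccs/tile/0001"
                info._class = dev["class"]            # e.g. "MccsTile"
                info.server = server["instance"]      # e.g. "MccsTile/01"
                db.add_device(info)
                # Populate predefined property values, if any
                if dev.get("properties"):
                    db.put_device_property(dev["name"], dev["properties"])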

Alarm Management

LFAA will make use of the auxiliary functionality currently being added to the alarm systems proposed as extensions to the Elettra alarm handler. In essence, the system will allow complex alarm formulae to be defined in a text/JSON file, which is then parsed by the LMC alarm system. For each alarm condition/rule, attributes are created on the alarm handler device, and the appropriate attribute subscriptions on the respective devices are made.
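
The rule format below is purely hypothetical (the actual syntax will follow the Elettra alarm handler extensions); the sketch only illustrates the subscription step performed for each parsed rule, using the standard PyTANGO event API:

    import tango

    # Hypothetical parsed alarm rule: one formula over one or more device attributes
    rule = {
        "name": "tile_overtemp",
        "attributes": ["low-mccs/tile/0001/temperature"],
        "formula": "value > 60.0",
    }

    def on_change(event):
        # The real handler would re-evaluate the alarm formula here
        if not event.err:
            print(rule["name"], "->", event.attr_value.value)

    for full_attr in rule["attributes"]:
        device_name, attr_name = full_attr.rsplit("/", 1)
        proxy = tango.DeviceProxy(device_name)
        proxy.subscribe_event(attr_name, tango.EventType.CHANGE_EVENT, on_change)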

Debugging and Diagnostics

LFAA will make use of a number of client scripts which have been pre-programmed to poll specific diagnostic attributes from a number of devices and collect a report of these diagnostics.
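
A minimal sketch of such a polling script; the device and attribute names are hypothetical:

    import tango

    DEVICES = ["low-mccs/tile/0001", "low-mccs/tile/0002"]      # hypothetical devices
    DIAGNOSTICS = ["temperature", "voltage", "fpga0_status"]    # hypothetical attributes

    def collect_report():
        """Poll diagnostic attributes from each device and return a simple report."""
        report = {}
        for name in DEVICES:
            proxy = tango.DeviceProxy(name)
            readings = {}
            for attr in DIAGNOSTICS:
                try:
                    readings[attr] = proxy.read_attribute(attr).value
                except tango.DevFailed:
                    readings[attr] = "UNREADABLE"
            report[name] = readings
        return report

    if __name__ == "__main__":
        for device, readings in collect_report().items():
            print(device, readings)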

Interfaces for non-TANGO Components, Possibly via a Tunnelling Mechanism

The LFAA LMC architecture will try to avoid direct tunnelling to any components unless absolutely necessary. Most identified hardware and software components have accessible APIs which can be wrapped in TANGO, in which case no tunnelling is required, as LMC operations can all be done via the TANGO control system. It remains to be evaluated what other non-TANGO mechanisms are required.

CLI-based Tests for REST API Functionality

Calls to the REST API can bypass the GUI and be run via scripts, e.g. using curl. This allows a suite of calls and parameters to be tested and verified in an automated fashion, without requiring active users on the GUI.


8.3.3 Integration with TM EDA

TM has a requirement to archive all engineering data and control parameters in a central telescope archive, referred to as the Engineering Data Archive (EDA). Thus, the SKA-LOW telescope facility is in charge of archiving all the attributes provided by the devices in the Element. Elements may deploy element-level archiving for internal use, if required; one use case would be to address element standalone operation.

The Element archives are completely separate from the Telescope archive and can be managed by the Element with no regard to the EDA. SKA archiving will be based on the TANGO HDB++ archiving system. For local element archiving the architecture comprises:

- A TANGO archiving setup with a MariaDB backend
- One HDB++ ConfigurationManager device
- A number of HDB++ EventSubscriber devices

For central Telescope archiving a Cassandra backend is foreseen, with at least:

- One dedicated HDB++ ConfigurationManager device; more than one ConfigurationManager device can be deployed whenever complexity, architectural or practical reasons require it
- At least one dedicated EventSubscriber device for each Element; additional EventSubscribers may be deployed for telescope central archiving of each Element

This is a case where TM accesses, as a subscriber, element TANGO devices at lower levels of the Element hierarchy. For scalability and performance reasons it is generally suggested to avoid the bottleneck of a single point of contact at the LFAA Master device level, and instead to exploit point-to-point TANGO capabilities, specifically the publish/subscribe pattern.

8.4 Variability Guide

8.4.1 Graphical User Interface (GUI)

There is currently no full visibility of the extent of the GUI for MCCS. A GUI prototype was developed for AAVS1, and some aspects of this will inform the requirements for the MCCS GUI. It is expected that the definition, scope and feature list of the GUI system will keep expanding during the lifetime of the interface. It is also expected that, with newly introduced web-GUI technology stacks, a total rewrite of this GUI interface will occur at opportune stages of the project. However, the GUI will always be serviced by the underlying control system API – currently in the form of a RESTful API.

8.4.2 Command Line Interface (CLI)

The TANGO framework provides a set of CLI tools. It is expected that these tools will be enhanced along the timeline of the TANGO project; these changes are released by the TANGO community, commensurate with the TANGO version in use. With regards to engineering scripts developed for the MCCS system, the variability will depend on changing requirements and feature enhancements over time. These, however, act outside of the MCCS system itself and should not have any impact on the core system.

8.5 Rationale

The rationale behind the Maintenance Support View is to set out the principles of a set of tools that allow independent operation of the LFAA system, as well as allowing for a number of demonstrable tests and vertical operational analysis.


8.6 Related Views

This view presents interfaces to the entire LMC system and is therefore related to all views described in this document.


Appendix A – Software Requirements

Table A-0-22 lists the MCCS requirements from RD-X for which software is required for MCCS to be compliant. These requirements drive the architecture presented in this document. The views which cater for each requirement are also listed.

Table A-0-22. SKA L1 requirements which require support from LFAA software. Each entry gives the requirement ID and title, its description, and the views which cater for it.

SKA1-FLD-4659 – Functional Requirements

SKA1-FLD-4633 – Signal Processing

LFAA_MCCS_REQ-126 – RFI flagging: When commanded by TM, MCCS shall accept a list of frequency channels, amplitude levels and integration periods for which the "RFI flagged" state of the station-beam data packets shall be set, and transmit this to the SPS (TPMs). Two special cases shall be permitted: if the level threshold is set to a value of 0 (TBC), the beam data packet shall always be flagged regardless of level; if the level is set to the maximum word value (TBC), RFI flagging shall be disabled for this channel/beam. (Views: Observation Management)

LFAA_MCCS_REQ-118 – Spectral channels: MCCS shall independently control and calibrate a fixed number of frequency channels conforming to the SPS coarse channelization filter design parameters as described in the LFAA internal ICD. (Views: Observation Management)

LFAA_MCCS_REQ-97 – Instantaneous bandwidth: The MCCS shall provide resources sufficient to permit control, monitoring and calibration processing of no less than 300 MHz of aggregate bandwidth per polarisation. The desired bandwidth, specified by TM as sets of discrete coarse channels individually allocated to beams, shall be processed continually for the duration of the given observation period or until commanded to stop. (Views: Observation Management)

LFAA_MCCS_REQ-103 – Channelisation frequency channel amplitude response: MCCS shall, in a format conforming to the LFAA-CSP ICD, make available via TM the correction coefficients applied to coarse channel levels to normalize their amplitude response. (Views: Observation Management)

SKA1-FLD-4630 – Synchronisation and Timing

LFAA_MCCS_REQ-12 – Network Time Protocol (NTP): All MCCS client devices and applications that require synchronised telescope network time shall comply with the Network Time Protocol version 4 standard, RFC 5905. (Views: LMC Infrastructure)

LFAA_MCCS_REQ-171 – Provision of NTP services: MCCS shall provide a local NTP service referenced to the SaDT system-level time service for use by all devices physically connected to the LFAA data and management networks. (Views: LMC Infrastructure)

LFAA_MCCS_REQ-101 – Synchronous Time Stamping: MCCS shall command SPS in each station to synchronize the internal time stamping to a 1PPS signal transition. (Views: Observation Management)


LFAA_MCCS_REQ-147 – Pointing synchronization: MCCS shall compute and transmit pointing delay and delay-rate coefficients for all station beams in a sub-array to the TPMs involved, in advance of a triggering action which will cause the parameters to be applied simultaneously to all antennas in the station/sub-stations. (Views: Observation Management)

LFAA_MCCS_REQ-146 – Calibration synchronization: MCCS shall compute and transmit calibration coefficients per frequency channel for all station beams in a sub-array to the TPMs involved, in advance of a triggering action which will cause the parameters to be applied simultaneously to all antennas in the station/sub-stations. (Views: Observation Management)

SKA1-FLD-4629 – Calibration

LFAA_MCCS_REQ-14 – SKA1_Low Glass Box Calibration: MCCS shall store and, when commanded, shall provide the necessary information to TM such that TM can reconstruct or restore calibration and pointing coefficients. (Views: Observation Management)

LFAA_MCCS_REQ-15 – Calibration transfer: MCCS shall calculate and forward antenna calibration coefficients for all frequencies, corrected for zenith pointing, independent of observation configuration (channel selection or pointing command), such that changes in frequency band selection or pointing direction commands do not require recalibration. (Views: Observation Management)

LFAA_MCCS_REQ-16 – Global sky model: The MCCS shall use a Local Sky Model in order to generate calibration coefficients. This model shall be updated using a subset of the Global Sky Model pulled from SDP, according to the SDP-LFAA ICD. (Views: Observation Management)

LFAA_MCCS_REQ-17 – Real time calibration: MCCS shall implement on-line station beam calibration such that all active coarse frequency channels are calibrated at least once every 10 minutes. (Views: Observation Management)

LFAA_MCCS_REQ-18 – Absolute flux density scale: LFAA shall contribute to the calibration of SKA1_Low in order to achieve an absolute flux density scale with an accuracy of better than 5% across the band. (Views: Observation Management)

LFAA_MCCS_REQ-89 – Aperture Array DDE: The LFAA shall have direction dependent models for the station beams for each station, with an accuracy of 35 dB at the half-power points, to be used in calibration and imaging. (Views: Observation Management)

LFAA_MCCS_REQ-104 – Normalisation of station gains: MCCS shall match station beam power as a function of frequency across stations within a margin of TBD %. (Views: Observation Management)

LFAA_MCCS_REQ-106 – Cross polarisation purity: MCCS shall, when commanded, compute and forward polarization compensation coefficients to SPS for each antenna (x-y pair) and frequency channel according to a TBD algorithm. These coefficients shall be archived as part of the system state. (Views: Observation Management)

LFAA_MCCS_REQ-157 – Correlation: MCCS shall correlate channelized data from all antennas in a station/sub-station and generate cross-correlation matrices to be used for calibration. (Views: Observation Management)

LFAA_MCCS_REQ-159 – Bandpass flattening: MCCS shall calculate coefficients to flatten the bandpass to within 1.5 dB. Coefficient updates should happen periodically. (Views: Observation Management)

LFAA_MCCS_REQ-105 – Station beam stability: MCCS shall implement calibration calculations to correct per-antenna amplitude and phase response at each frequency channel, with an update rate of less than 600 seconds, according to an algorithm and accuracy conforming to the (TBD) calibration error budget. (Views: Observation Management)

LFAA_MCCS_REQ-123 – Signal-to-noise ratio: LFAA shall have a signal-to-noise ratio of at least 98% (TBC) compared to ideal analogue processing for the same inputs. (Views: Observation Management)


SKA1-FLD-4643 – Beam Forming and Pointing

LFAA_MCCS_REQ-32 – Beam pointing model: MCCS shall implement a model which translates from topocentric Az, El (Sky Coordinate set supplied by TM) to the delay and delay rate per antenna required to steer the beam. This model must include any known imperfections in the geometry of the station (e.g. orientation) and any other effects that can be reproducibly corrected in software. Parameters of the model will be stored by and downloaded from TM, but the calculation happens in MCCS. (Views: Observation Management)

LFAA_MCCS_REQ-99 – Beam pointing accuracy: MCCS will command delay and delay rates for individual antennas to SPS with an accuracy of better than 1.70 ps and a rate sufficient for a linear approximation to be valid. (Views: Observation Management)

LFAA_MCCS_REQ-281 – Beam pointing angle range: MCCS shall accept commands to steer and form beams at all possible azimuth and elevation angles. (Views: Observation Management)

LFAA_MCCS_REQ-148 – Beam pointing control coordinates: MCCS shall receive individual Sky Coordinate Sets from TM, per station beamforming instance, with an update rate of not more than 10 Hz, in accordance with the TM to LFAA ICD. (Views: Observation Management)

LFAA_MCCS_REQ-98 – Multiple beam capability: MCCS shall control and monitor up to 8 beams (dual polarization) from each station within a sub-array, which can be independently pointed. (Views: Observation Management)

LFAA_MCCS_REQ-282 – Multiple beam widths: MCCS shall control and monitor beams that have different frequency and channel selections independent of each other (where independence allows identical, overlapping or non-overlapping selections). The independence allows each one of the beams to have a non-contiguous bandwidth. (Views: Observation Management)

SKA1-FLD-4634 – Transient Capture

LFAA_MCCS_REQ-9 – Transient buffer size: MCCS, when configured, shall store digitized beamformed voltage data, with 2-bit or better sampling, for at least 150 MHz of continuous or non-continuous frequency range within the observed frequency range, in both polarizations, from a configurable subset of up to all of the station/sub-station beams, covering at least 900 seconds. (Views: Observation Management)

LFAA_MCCS_REQ-10 – LFAA to CSP latency: The LFAA shall have a latency of at most 1 second (TBC) from the time that a signal arrives at the antenna to the time when the beamformed signal is forwarded to CSP for further processing. (Views: Observation Management)

LFAA_MCCS_REQ-11 – Transient buffer transfer to SDP: When commanded by TM, MCCS shall transfer transient buffer data to SDP via SADT with a data rate of 80 Gb/s (in accordance with the SDP to LFAA ICD), independently for each sub-array, and according to the configuration set by TM (in accordance with the TM to LFAA ICD). (Views: Observation Management)

LFAA_MCCS_REQ-120 – Transient capture consecutive triggers: MCCS shall restart buffering beam data into the transient buffer at most 45 minutes after the last trigger received from TM. (Views: Observation Management)

LFAA_MCCS_REQ-121 – Transient capture: TM to LFAA latency: MCCS shall have a latency of no more than 1 second from the time it receives a command to dump the transient buffer from TM until the time it starts transmitting the buffered data to SDP. (Views: Observation Management)

LFAA_MCCS_REQ-279 – Transient buffer configuration: As configured by TM, the MCCS shall in turn configure the SPS to send station beams for buffering to MCCS by specifying: frequency window (per TPM, continuous or not); 8-bit or re-sampling to 2 or 4-bit; number and identification of station/sub-station beams. (Views: Observation Management)

LFAA_MCCS_REQ-292 – Pause sending of transient data: MCCS shall command SPS to pause sending beam data if MCCS is not able to buffer it (e.g. while the buffer is frozen during read-out), and thereafter command SPS to resume, in accordance with the LFAA Internal ICD. (Views: Observation Management)

SKA1-FLD-4644 – Observation Configuration

LFAA_MCCS_REQ-2 – Maximum number of stations: MCCS shall provide configuration, control and monitoring of up to 512 stations (each consisting of 256 dual polarised antenna signal chains). (Views: Observation Management)

LFAA_MCCS_REQ-19 – Mode transition: The MCCS shall complete all internal reconfiguration to support any observing mode changes in less than 30 seconds, assuming that the system is already initialised (a full calibration cycle has been performed). (Views: Observation Management)

LFAA_MCCS_REQ-20 – Sub-arraying: MCCS, when commanded, shall assign station resources as independent groups (sub-arrays) that can be configured and operated independently of each other, as described in the TM-LFAA ICD. (Views: Observation Management)

LFAA_MCCS_REQ-21 – Subarray membership: Any LFAA beam shall be assigned independently to one sub-array at a time. (Views: Observation Management)

LFAA_MCCS_REQ-22 – Subarray granularity: MCCS shall support sub-arrays containing an integer number of stations between 0 (none) and all (512). (Views: Observation Management)

LFAA_MCCS_REQ-23 – Subarray independence: MCCS shall configure, monitor and control each sub-array independently of, and concurrently with, all others. (Views: Observation Management)

LFAA_MCCS_REQ-30 – Subarray scheduling block set-up time: On receiving a subarray configuration request from TM, MCCS shall configure LFAA resources to be ready for an observation in less than a TBD subset of 30 seconds. (Views: Observation Management)

LFAA_MCCS_REQ-154 – Software & Firmware Management: MCCS shall provide the capability to store, manage and transmit software, OS and firmware images to SPS components as commanded, to support configuration, maintenance and upgrade of these subsystems. (Views: Observation Management)

LFAA_MCCS_REQ-142 – Configuring reporting interface: When TM requests the MCCS to configure the monitoring of points, alarms and events, the level of reporting shall comply with the information logs. (Views: Observation Management)

LFAA_MCCS_REQ-290 – Station membership: Each antenna shall be assigned to only one station or sub-station at a time. (Views: Observation Management)

LFAA_MCCS_REQ-291 – Support for sub-stations: When configuring an observation, MCCS shall accept from TM, and transfer to SPS, for each beam, a per-antenna gain map (256x2) which is used to set the weight (complex gain coefficient) of antennas which do not contribute to a sub-station beam to zero. (Views: Observation Management)

SKA1-FLD-4631 – Monitoring and Control

LFAA_MCCS_REQ-283 – Monitoring and control: The control and monitoring structure of MCCS shall be in accordance with the guidelines defined in the SKA1 Control System Guidelines. (Views: LMC Infrastructure)

LFAA_MCCS_REQ-158 – Per-antenna average RF power: MCCS shall be capable of reading, from the SPS, the RMS value for each antenna. (Views: Monitoring and Control)


LFAA_MCCS_REQ-160 – Hardware usage: MCCS shall monitor hardware usage and report appropriate statistics of all servers and switches within LFAA. (Views: Monitoring and Control)

LFAA_MCCS_REQ-220 – Monitor and report operational state: MCCS shall monitor its operational state and make the information available upon request, in line with the software standard. (Views: Monitoring and Control)

LFAA_MCCS_REQ-162 – Network M&C: MCCS shall provide monitoring and control functions for LFAA network components, consistent with system-level network policies [TBD]. This includes subnet and IP assignment, traffic balancing, status/health reporting, and beamformed data stream monitoring through appropriate packet statistics. (Views: Monitoring and Control)

LFAA_MCCS_REQ-156 – Data acquisition: Control data from TPM shall be retrieved by MCCS software and dumped to disk or processed where required. (Views: Observation Management)

LFAA_MCCS_REQ-26 – Subarray station failure flagging: When performing observations, MCCS shall detect and report to TM failed stations immediately after detection of the failure. (Views: Monitoring and Control)

LFAA_MCCS_REQ-277 – Failure and health reporting: The MCCS shall identify all failures and report the health status of all SPS and MCCS LRUs to the Telescope Manager (TM). (Views: LMC Infrastructure)

LFAA_MCCS_REQ-248 – Equipment shutdown on failure: MCCS shall autonomously detect and shut down affected equipment, in accordance with the LFAA to INFRA ICD, when any of the following conditions occur: an over-current condition occurs in a rack; an over-temperature condition occurs in an LRU. (Views: Monitoring and Control, LMC Infrastructure)

LFAA_MCCS_REQ-33 – Alarm latency: Latency from the time that MCCS detects that a measurement has crossed an alarm set-point until it reports this alarm to TM shall be no more than 0.2 seconds. (Views: Monitoring and Control)

LFAA_MCCS_REQ-280 – Low power mode - MCCS: MCCS hardware components, except for start-up and safety-critical components (i.e. Head/Ghost server and networking components), shall support a Low-Power mode which reduces their power consumption to less than 5% of nominal operating power. MCCS hardware components shall continue to communicate with the start-up and safety-critical components while in Low-Power mode. (Views: Monitoring and Control, LMC Infrastructure)

LFAA_MCCS_REQ-36 – Low power mode - LFAA components: MCCS shall provide control and monitoring of LFAA components supporting Low-Power mode and initiate transitions to or from this state when commanded by TM. (Views: Monitoring and Control)

LFAA_MCCS_REQ-37 – Low power mode on power application: On start-up, MCCS shall enter the Low-Power mode until commanded by TM to transition to a specific state or mode. (Views: Monitoring and Control)

LFAA_MCCS_REQ-170 – Remote power-up and power-down: MCCS shall implement remote power-up and power-down functionality. As commanded by TM, power-up should be staged across the CPF to avoid overloading the power generators. (Views: Monitoring and Control)

LFAA_MCCS_REQ-180 – Fail safe state: MCCS equipment that would otherwise present a safety hazard when subjected to an unplanned loss of main electrical power or main control function shall enter a designated fail-safe state. (Views: LMC Infrastructure)

LFAA_MCCS_REQ-181 – Fail safe warnings: Where transitioning to a designated fail-safe state represents a hazard, components of the MCCS shall issue continued warnings for the duration of the transition. (Views: LMC Infrastructure)

LFAA_MCCS_REQ-182 – Fail safe recovery: Once a transition to a designated fail-safe state is triggered, the MCCS shall complete the transition and remain in the designated fail-safe state until commanded otherwise. (Views: LMC Infrastructure)

LFAA_MCCS_REQ-239 – User access control: MCCS shall provide hardware and communication access to any authenticated user. Note: this means that user authentication for access to MCCS is provided by MCCS, with TM owning the user database. (Views: LMC Infrastructure)

SKA1-FLD-4586 – Fault testing

SKA1-FLD-4587 – Off-line Diagnostic Tests

LFAA_MCCS_REQ-211 – MCCS off-line built-in self-test capability: When commanded and with the MCCS Administrative Mode set to OFFLINE, MCCS shall perform diagnostic tests to detect and isolate LRU and sub-element level faults. (Views: LMC Infrastructure, Monitoring and Control)

LFAA_MCCS_REQ-212 – MCCS off-line fault detection performance: When commanded and with the MCCS Administrative Mode set to OFFLINE, the MCCS built-in diagnostic self-test capability shall detect and report TBD% of all critical sub-element failures. (Views: LMC Infrastructure, Monitoring and Control)

LFAA_MCCS_REQ-213 – MCCS off-line fault isolation performance: When commanded and with the MCCS Administrative Mode set to OFFLINE, MCCS shall isolate TBD% of all failures down to LRU level. (Views: LMC Infrastructure, Monitoring and Control)

LFAA_MCCS_REQ-214 – MCCS off-line communications fault detection: When commanded and with the MCCS Administrative Mode set to OFFLINE, MCCS shall detect and report TBD% of sub-element LRU to LRU and LRU to external interface communication path faults. (Views: LMC Infrastructure, Monitoring and Control)

LFAA_MCCS_REQ-215 – MCCS off-line memory and calculation fault detection: When commanded and with the MCCS Administrative Mode set to OFFLINE, MCCS shall detect and report the following faults: stuck or incorrect memory cells, whether a direct fault or manifested as such; calculation faults that may lead to incorrect data products, whether a direct fault or manifested as such. (Views: LMC Infrastructure, Monitoring and Control)

SKA1-FLD-4588 – On-line Diagnostic Tests

LFAA_MCCS_REQ-216 – MCCS on-line built-in self-test capability: While the MCCS Administrative Mode is set to ONLINE or MAINTENANCE, MCCS shall perform diagnostic tests to detect and isolate LRU and sub-element faults. (Views: Maintenance Support)

LFAA_MCCS_REQ-217 – MCCS on-line fault detection performance: While the MCCS Administrative Mode is set to ONLINE or MAINTENANCE, MCCS shall detect and report TBD% of all Critical Failures. (Views: Maintenance Support)

LFAA_MCCS_REQ-218 – MCCS on-line fault isolation performance: While the MCCS Administrative Mode is set to ONLINE or MAINTENANCE, the MCCS shall isolate more than TBD% of all failures down to LRU level. (Views: Maintenance Support)

LFAA_MCCS_REQ-219 – MCCS on-line communications fault detection: While the MCCS Administrative Mode is set to ONLINE or MAINTENANCE, MCCS shall detect all utilized sub-element LRU to LRU and LRU to external LRU communication path faults with a detection probability of at least TBD%. (Views: Maintenance Support)

SKA1-FLD-4645 – Configuration Management

LFAA_MCCS_REQ-91 – Local database: The MCCS shall maintain a database of identification parameters for all hardware entities which is synchronized with a Global Configuration database. The information shall include IDs and physical location coordinates (to be used for pointing calculations) of Field Nodes and their constituent antennas, as well as connection-path mapping and other information needed for maintenance. (Views: Hardware Configuration Management)

LFAA_MCCS_REQ-166 – Software versioning: It shall be possible to instruct LFAA products to use a specific version of a software component and firmware bit file. (Views: Observation Management)

LFAA_MCCS_REQ-167 – Software updates - disruption to observation: When software component updates are available, MCCS shall deploy these updates to the running system with minimal disruption to running observations (avoiding restarts where possible). (Views: LMC Infrastructure)

SKA1-FLD-4345 – Internal interfaces

LFAA_MCCS_REQ-129 – MCCS to Field Node: The interface between MCCS and Field Node shall be compliant with the interface definitions listed in the LFAA Internal Interface Control Document, document number. (Views: Monitoring and Control)

LFAA_MCCS_REQ-274 – MCCS to SPS: The interface between MCCS and SPS shall be compliant with the interface definitions listed in the LFAA Internal Interface Control Document, document number. (Views: Monitoring and Control)

SKA1-FLD-4594 – External interfaces

SKA1-FLD-4648 – To external

LFAA_MCCS_REQ-169 – Engineering interface: A command-line (tunneling) interface to be used locally and remotely shall be provided by MCCS. (Views: Maintenance Support)

LFAA_MCCS_REQ-168 – Web-based interface: A web-based interface to be used locally and remotely shall be provided by MCCS. (Views: Maintenance Support)

LFAA_MCCS_REQ-35 – TM to LFAA interface: MCCS shall be compliant with the interface definitions listed in the TM to LFAA Interface Control Document, document number 100-000000-028. (Views: All)

LFAA_MCCS_REQ-39 – SDP to LFAA interface: MCCS shall be compliant with the interface definitions listed in the SDP to LFAA Interface Control Document, document number 100-000000-033, for the following sub-interfaces: I.S1L.SDP_LFAA.001; I.S1L.SDP_LFAA.002. (Views: Observation Management)

SKA1-FLD-4345 – Software and Firmware Standards

LFAA_MCCS_REQ-172 – MCCS software and firmware quality: MCCS software and hardware description language related deliverables shall comply with the "Fundamental SKA Software and Hardware Description Language Standards". (Views: All)

SKA1-FLD-4575 – Fail Safe Design

LFAA_MCCS_REQ-179 – MCCS non-propagation of failures: MCCS equipment hardware failures and software errors shall be safe from creating hazardous conditions in interfacing elements and sub-elements. (Views: Monitoring and Control)


SKA1-FLD-4584 – Availability

LFAA_MCCS_REQ-206 – MCCS Availability: The MCCS shall have an Inherent Availability of more than 99.99%. (Views: All)

LFAA_MCCS_REQ-207 – MCCS Operationally Capable: MCCS shall remain Operationally Capable when any one processing LRU fails. (Views: All)

SKA1-FLD-4585 – Maintainability

LFAA_MCCS_REQ-210 – MCCS software update and maintenance down time: The MCCS shall be designed not to require software and hardware maintenance down time in excess of 1 hour per year (during steady state operations). (Views: Maintenance Support)


Appendix B – List of Stakeholders

Telescope Manager User
Role: The Telescope Manager is the overarching user of all the elements of the telescope, including, but not limited to, the LFAA. The Telescope Manager User is therefore one of the ultimate end users of the system, but is not the only user (as there are multiple levels of users across each element).
Concerns to be Addressed:
1. ICDs are implemented as expected
2. LFAA can deliver on the requirements it was designed to perform
3. LFAA conforms to specific telescope-wide architectural decisions

Analyst
Role: Responsible for analysing the architecture to make sure it meets certain critical quality attribute requirements. Analysts will be specialised; for instance, performance analysts or security analysts may have well-defined priorities to address.
Concerns to be Addressed:
1. Must have an overall understanding of the architecture
2. Must be able to assess and test the various aspects of the architecture against set quality attributes

Systems Engineer
Role: Responsible for systems design and development of systems or system components in which software plays a role. The systems engineer will therefore be involved (to some level) in most of the MCCS and firmware software engineering process.
Concerns to be Addressed:
1. The process of setting and maintaining software standards
2. The process of setting and maintaining architecture standards
3. The documentation process
4. Assuring that the system environment provided for the software is sufficient

Science Data Processor (SDP) User
Role: The Science Data Processor will be responsible for relaying Global Sky Model data requests. In this regard, the mechanism designed for LFAA to retrieve sky model data is of interest to SDP.
Concerns to be Addressed:
1. Access and volume requests are acceptable
2. Data integrity of the GSM is maintained
3. Regulating permissions of access to the GSM from LFAA

GUI Operator
Role: LFAA is deployed with an internal GUI interface to the MCCS system. The GUI operators are responsible for managing the system from this GUI interface and therefore will have a huge degree of control over how MCCS performs and operates.
Concerns to be Addressed:
1. The MCCS architecture has to provide the operator with the facilities required by them
2. The GUI system must provide the right information in a proper structure, and the architecture should not impede or limit this process


User and Maintainer of CLI/GUI Interfaces
Role: Responsible for fixing bugs and providing enhancements to the system throughout its life (including adaptation of the system for uses not originally envisioned).
Concerns to be Addressed:
1. Architecture must allow for some flexibility
2. Needs an awareness of possible future adaptations
3. Understanding the ramifications of changes to CLI/GUI software components
4. Understanding the ramifications of changes to the platform which will have an effect on the CLI/GUI systems

MCCS Software Developer
Role: Responsible for the development of specific elements according to designs, requirements and the software architecture.
Concerns to be Addressed:
1. Understand inviolable constraints and exploitable freedoms on development activities

TPM Software Developer
Role: Responsible for the development of TPM-specific elements according to designs, requirements and the software architecture.
Concerns to be Addressed:
1. Understand inviolable constraints and exploitable freedoms on development activities
2. Provide support for higher-level software integration at control system and access layer levels
3. Reduce tight coupling where possible

Integration and Test Engineer
Role: Responsible for taking individual components and integrating them, according to the architecture and system designs. Also responsible for the independent testing and verification of the same components against the formal requirements and architecture.
Concerns to be Addressed:
1. Has to have a component-level understanding of most aspects of the system

Architect
Role: Responsible for the development of the architecture and its documentation. Focus and responsibility is on the system.
Concerns to be Addressed:
1. Negotiating and making trade-offs among competing requirements and design approaches
2. Maintaining an architectural view of specific design decisions
3. Providing evidence that the architecture satisfies the requirements

Designer
Role: Responsible for systems and/or software design downstream of the architecture, applying the architecture to meet the specific requirements of the parts for which they are responsible.
Concerns to be Addressed:
1. Resolving resource contention
2. Establishing performance and other kinds of runtime resource consumption budgets
3. Understanding how their part will communicate and interact with other parts of the system


Hardware Deployer
Role: Responsible for accepting the hardware from specification and deploying it, making it operational, and fulfilling its allocated function.
Concerns to be Addressed:
1. Understanding and implementing the hardware layout and labelling standards set by the architecture references

Software Deployer
Role: Responsible for accepting the software from the master repository specification and deploying it, making it operational, and fulfilling its allocated function.
Concerns to be Addressed:
1. Understanding and implementing the software deployment procedure and system standards set by the architecture references

Database and Data Storage Designer
Role: Involved in many aspects of the data stores, including database design, data analysis, data modelling and optimization, to ensure the information handling needs of the system are achieved.
Concerns to be Addressed:
1. Determining the storage requirements of the data acquisition system
2. Understanding the database structure of the control system, and understanding the connection to the central LMC database system maintained by the Telescope Manager

Network Administrator
Role: Responsible for the design and development of the data transportation networks necessary to fulfil the communication needs of the system.
Concerns to be Addressed:
1. Determining network loads during various use profiles and understanding uses of the network
2. Verifying that the appropriate network switching infrastructure is available for the architectural requirements, for cases such as data transmitter/receiver addresses

MCCS Maintainer
Role: Responsible for fixing bugs and providing enhancements to the system throughout its life (including adaptation of the system for uses not originally envisioned).
Concerns to be Addressed:
1. Architecture must allow for some flexibility
2. Needs an awareness of possible future adaptations
3. Understanding the ramifications of changes to specific software components

Database and Data Store Administrator
Role: Involved in many aspects of data stores, networked file systems, database design, volume management, replication, data modelling and optimization, installation of the required software platform, and monitoring and administration of database security.
Concerns to be Addressed:
1. Data type and volume needs
2. Resilience requirements
3. Architecture-based hard requirements
4. Standards of any interfacing systems
