SDP Memo 43: Pulsar Timing Failure Analysis

Document Number: SDP Memo 43
Document Type: MEMO
Revision: C1
Author: R. J. Lyon, L. Levin, B. W. Stappers
Release Date: 2018-04-17
Document Classification: Unrestricted
Status: Draft

Lead Author: R. J. Lyon
Designation: SDP.PIP.NIP Member
Affiliation: University of Manchester
Signature & Date: (17/04/2018)



SDP Memo Disclaimer

The SDP memos are designed to allow the quick recording of investigations and research done by members of the SDP. They are also designed to raise questions about parts of the SDP design or SDP process. The contents of a memo may be the opinion of the author, not the whole of the SDP.

Revisions

Revision  Date of issue       Prepared by  Comments
C         February 26th 2018  Robert Lyon  Initial version of the document.
C1        April 17th 2018     Robert Lyon  Updates made given feedback from Lorita Christelis.

Updated Tables 5 and 6, replaced some text that was incorrectly repeated.

Altered Table 4, making it clear that FM.SDP.PST.103 can also be mitigated via rerouteing data to functioning hardware.

Added a new mode, FM.SDP.PST.117, to account for a hardware failure in the archive system.

Altered Table 5, making a grammatical change to FM.SDP.PST.108 in the mitigation column (no change to meaning).

Section 5.2, indicated that a rack control failure can occur due to the failure of a top of rack switch.

Section 5.2, indicated that a loss of control due to failure of the SDP management system is unlikely to be caused by a loss of connectivity. There will likely be a network topology that ensures a connection is always available, though perhaps with reduced bandwidth.


Section 5.2 now points to the SDP execution control component [RD9] (see section 2.1.1 in that external document) as a possible source of failure.

Modified FM.SDP.PST.205 and FM.SDP.PST.206 in Table 7. Added an additional mitigation strategy, which involves decoupling control and monitoring in the SDP Execution Control Component.

Altered Table 11, making it clear that FM.SDP.PST.224 has the potential to critically impact science outputs, rather than catastrophically degrade output. Also updated the severity range and the criticality score. This is because the failure mode can be mitigated so long as science data is retained in a buffer and not discarded until successfully persisted in the archive.

Added new tables to the Appendix that describe FMECA Detection methods.

The following changes have been made to the requirements in Table 25:

SDP REQ-33 has a new description.
SDP REQ-50 has since been deleted.
SDP REQ-147 and SDP REQ-148 have since been deleted.
SDP REQ-281 has a new description.
SDP REQ-546 has a typo correction.
SDP REQ-552 has since been deleted.
SDP REQ-763 has a new description.
SDP REQ-764 has a new description.


Table of Contents

List of figures
List of tables
List of abbreviations
Summary
1 Scope
2 Process
3 Terms & Definitions
4 Assumptions
4.1 Hardware
4.2 Architecture
4.3 Control
4.4 Communications
4.5 Execution Framework
4.6 Science Software & Processing
4.7 Data
4.8 Pulsar Timing Modes
4.9 Likelihood & Probability
5 Failure Modes
5.1 Hardware Induced Failures
5.2 Control & Communication Failures
5.3 Data Failures
5.4 Software/Algorithm Failures
6 Summary
A FMECA Detection methods
B FMECA Results
C Applicable Requirements


List of Figures

1 Level 2 functional flow diagram for the SDP.
2 SDP Hardware Block Diagram.
3 High level diagram showing the assumed architectural data flow.
4 Conceptual data model for timing data.
5 Activity diagram for the pulsar timing pipeline.


List of Tables

1 Severity codes applying to failure modes.
2 Likelihood codes applying to failure modes.
3 Summary of the main SDP.PST software components.
4 Hardware induced failure modes 1-6.
5 Hardware induced failure modes 7-11.
6 Hardware induced failure modes 12-16.
7 Control and Communication failure modes 1-6.
8 Control and Communication failure modes 7-14.
9 Control and Communication failure modes 14-19.
10 Control and Communication failure modes 20-23.
11 Control and Communication failure modes 24-28.
12 Control and Communication failure modes 29-33.
13 Control and Communication failure modes 34-36.
14 Data failure modes 1-6.
15 Data failure modes 7-11.
16 Data failure modes 12-17.
17 Software/Algorithm failure modes 1-9.
18 Software/Algorithm failure modes 9-14.
19 Summary of the detection methods for each of the failure modes discussed in this document (Part 1).
20 Summary of the detection methods for each of the failure modes discussed in this document (Part 2).
21 Summary of the detection methods for each of the failure modes discussed in this document (Part 3).
22 Summary of the criticality scores for each of the failure modes discussed in this document (Part 1).
23 Summary of the criticality scores for each of the failure modes discussed in this document (Part 2).
24 Summary of the criticality scores for each of the failure modes discussed in this document (Part 3).
25 Level 2 SDP requirements relevant to the failure mode analysis.


List of abbreviations

CSP      Central Signal Processor
COTS     Commercial-off-the-Shelf
DSD      Dynamic Spectra Data
EMI      Electromagnetic Interference
FTP      File Transfer Protocol
HPC      High Performance Computing
ICD      Interface Control Document
IM       Interstellar Medium
LMC      Local Monitor and Control
NIC      Network Interface Card
NIP      Non-imaging Processing
PSRFITS  Pulsar Flexible Image Transport System
PST      Pulsar Timing Sub-element
PTD      Pulsar Timing Data
QA       Quality Assurance
SDP      Science Data Processor
SFMECA   Software Failure Mode, Effects and Criticality Analysis
TM       Telescope Manager
TOA      Time-of-Arrivals
TOR      Top of Rack


Summary

This document describes a Software Failure Mode, Effects and Criticality Analysis (SFMECA) for the pulsar timing pipeline sub-element (PST) of the Science Data Processor (SDP). The analysis has been done at the architectural level, and represents an initial attempt to study the failure modes of the timing pipeline. This work forms the output of sprint task: TSK-2140.

Applicable Documents

The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.

Reference Number  Document Number  Reference
AD1               100-000000-002   SKA1 LOW SDP - CSP INTERFACE CONTROL DOCUMENT
AD2               300-000000-002   SKA1 MID SDP - CSP INTERFACE CONTROL DOCUMENT
AD3               100-000000-029   SKA1 INTERFACE CONTROL DOCUMENT SDP TO TM LOW
AD4               300-000000-029   SKA1 INTERFACE CONTROL DOCUMENT SDP TO TM MID

Reference Documents

The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document takes precedence.

Reference Number  Document Number      Reference
RD1               SKA-TEL-SDP-0000018  PDR.02.01 Compute Platform Element Subsystem Design
RD2               SKA-TEL-SDP-0000027  SDP Pipelines Design
RD3               SKA-TEL-SDP-0000033  SDP L2 requirements specification (L1 Rev 11).
RD4                                    Zhu, Y. M., “Software Failure Mode and Effects Analysis”, Springer, 2017, doi:10.1007/978-3-319-65103-3_2.
RD5                                    Stadler, J. J. and Seidl, N. J., “Software failure modes and effects analysis”, Reliability and Maintainability Symposium (RAMS), 2013, doi:10.1109/RAMS.2013.6517710.
RD6                                    Stamatis, D. H., “Failure mode and effect analysis: FMEA from theory to execution”, Milwaukee, Wisc.: ASQ Quality Press, 2003.
RD7               SDP Memo 40          Lyon, R. J., Levin, L. and Stappers, B. W., “PSRFITS Overview for NIP”.
RD8                                    Lyon, R. J., “CSP to SDP NIP Data Rates & Data Models (version 1.1)”, doi:10.5281/zenodo.836715.
RD9               SKA-TEL-SDP-0000013  Wortmann, P. et al., “SDP Operational System Component and Connector View”.


Figure 1: Level 2 functional flow diagram for the SDP. The blue shaded components are those studied as part of the failure analysis. The flow diagram is based upon a figure produced by the SDP consortium (author unknown).

1 Scope

The scope of this work is confined to the blue shaded components of the SDP level 2 functional flow diagram in Figure 1. This includes the pulsar timing receive and pulsar timing functions, hereafter collectively referred to as the SDP.PST. The analysis presented here is only concerned with the identification and analysis of SDP.PST software failure modes at an architectural level. The analysis is applicable to both SKA Low and Mid. It includes failure modes arising from internal and external software (and their interfaces), firmware, interfaces to Commercial-off-the-Shelf (COTS) equipment, and interfaces to free/open source software. Whilst hardware failure modes are not in scope, in some cases hardware will be discussed when equipment failures, faults, or defects precipitate software failures/errors. As the SDP design is not complete, hardware, software and architectural assumptions are made to both enable and constrain the analysis. These assumptions are summarised in Section 4, whilst the methodology employed is summarised and justified in Section 2. Finally, note that the pulsar and transient search Non-Imaging Processing (NIP) pipeline failure modes will be considered elsewhere.


2 Process

Whilst software failure mode analyses have been undertaken for some time, there is currently no universal SFMECA standard. To proceed it is necessary to tailor approaches borrowed from the software engineering literature. Thus we reviewed the literature [RD4,RD5,RD6] for relevant work. Following this review we designed a process we believe to be conducive to producing a reproducible, principled, detailed analysis. This is an initial attempt to systematise the SFMECA so that it can be reviewed and critiqued, and we hope that our process can be improved upon via appropriate feedback. Any such feedback will be incorporated into future SFMECA analyses, e.g. those yet to be done for the pulsar and transient search pipelines.

The following steps form the analysis process employed in this work:

1. Define the scope - This involves determining i) which part of the system is being investigated, ii) which views apply (e.g. functional, interface, algorithmic, maintenance, usability, security), iii) which elements to study (e.g. hardware, software).

2. Information gathering - Gather documents relevant to the analysis, e.g. if taking a functional view then requirements documents are relevant. This is because failures lead to functional requirements not being met. Interfaces may need to be studied, along with the system functionality at a higher level. This also involves studying which types of analysis can be applied - an SFMECA process designed for medical software will have different strengths and weaknesses compared to one written for military applications. Thus it’s important to find the right approach.

3. Tailor the analysis - Based on the information gathered, tailor the analysis to the problem at hand. In this case, we need not consider hardware failure modes, thus we can omit these from the analysis.

4. Research failure modes - Enumerate all the possible failure modes and sources of error. Then begin categorising these according to the chosen view.

5. Analyse - For each mode found determine,

• the root cause of the failure mode.

• the local effect at the software component level (e.g. FFT doesn’t work correctly).

• the sub-system effect. For example the effect on the pulsar timing pipeline sub-system.

• the system effect and how this relates to system requirements (e.g. if pulsar timing fails, what does this mean for SDP, and the wider SKA?).

6. Mitigate - For each failure mode identified, attempt to devise a mitigation strategy which prevents the failure or mitigates its effects. If no mitigation is possible, then preventative measures should be described.

7. Severity & Likelihood - Determine how severe each failure mode is with respect to the system requirements, and how likely it is for such a failure mode to occur.

8. Summarise - Produce a critical item list describing all the possible failure modes.

These steps need not be rigidly undertaken. However they are useful for guiding the analysis process. Note these steps are described in more detail elsewhere [RD4,RD5,RD6].
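The outputs of steps 5 to 8 can be pictured as one record per failure mode, collected into a critical item list. The following is a minimal sketch only: the class name, field names and the example entry (including its identifier) are hypothetical and are not taken from the tables in this memo.

```python
from dataclasses import dataclass, field


@dataclass
class FailureModeRecord:
    """One row of the analysis; all field names are illustrative assumptions."""
    identifier: str          # hypothetical ID, not one of the memo's failure modes
    root_cause: str          # step 5: root cause of the failure mode
    local_effect: str        # step 5: effect at the software component level
    subsystem_effect: str    # step 5: effect on the pulsar timing pipeline
    system_effect: str       # step 5: effect on the SDP and wider system
    mitigations: list = field(default_factory=list)  # step 6
    severity: int = 1        # step 7: level 1 to 5
    likelihood: int = 1      # step 7: level 1 to 5


def critical_item_list(records):
    """Step 8: one summary line per failure mode, ordered by identifier."""
    lines = []
    for rec in sorted(records, key=lambda r: r.identifier):
        mitigation = "; ".join(rec.mitigations) or "none (describe preventative measures)"
        lines.append(
            f"{rec.identifier}: {rec.root_cause} -> {rec.system_effect} "
            f"[severity {rec.severity}, likelihood {rec.likelihood}] "
            f"mitigation: {mitigation}"
        )
    return lines


example = FailureModeRecord(
    identifier="FM.EXAMPLE.001",  # hypothetical entry for illustration only
    root_cause="compute node loses power",
    local_effect="component stops mid-scan",
    subsystem_effect="timing pipeline stalls for one beam",
    system_effect="science output delayed",
    mitigations=["reroute data to a redundant node"],
    severity=2,
    likelihood=2,
)
for line in critical_item_list([example]):
    print(line)
```

Keeping the four effect levels as separate fields mirrors step 5, so a reviewer can check each level independently.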


Table 1: Severity codes applying to failure modes.

Level  Code          Description
1      Minor         Normal availability retained by preventative / mitigation action.
2      Marginal      Near normal availability retained via preventative / mitigation action.
3      Significant   Operating between degraded and normal.
4      Critical      Operating in degraded mode.
5      Catastrophic  Functionality unavailable.

Table 2: Likelihood codes applying to failure modes.

Level  Code                 Description
1      Extremely unlikely   < 0.1%
2      Remote               0.1 to 1%
3      Occasional           1 to 10%
4      Reasonably probable  10 to 20%
5      Frequent             > 20%

3 Terms & Definitions

Before proceeding we define some terms which should make our analysis easier to interpret. Firstly we define the severity codes (Table 1) and likelihood codes (Table 2) that will be used. These are used to determine a criticality level for each failure mode. The criticality score can be determined via a simple calculation where the Criticality Score = Severity × Likelihood.
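The calculation can be sketched in a few lines. This is an illustrative sketch rather than tooling used for the analysis: the function names are hypothetical, and because Table 2 does not say which band owns a boundary value (e.g. exactly 1%), assigning it to the lower band is an assumption.

```python
# Severity codes (Table 1) and likelihood codes (Table 2).
SEVERITY = {1: "Minor", 2: "Marginal", 3: "Significant", 4: "Critical", 5: "Catastrophic"}
LIKELIHOOD = {1: "Extremely unlikely", 2: "Remote", 3: "Occasional",
              4: "Reasonably probable", 5: "Frequent"}


def likelihood_level(probability_percent):
    """Map a probability in percent to a Table 2 level.

    Boundary values are placed in the lower band (an assumption).
    """
    if probability_percent < 0.1:
        return 1
    if probability_percent <= 1:
        return 2
    if probability_percent <= 10:
        return 3
    if probability_percent <= 20:
        return 4
    return 5


def criticality_score(severity, likelihood):
    """Criticality Score = Severity x Likelihood, both on the 1 to 5 scales."""
    if severity not in SEVERITY or likelihood not in LIKELIHOOD:
        raise ValueError("severity and likelihood must be levels 1 to 5")
    return severity * likelihood


# A 'Critical' (4) failure judged 'Occasional' (3) scores 12 out of a possible 25.
print(criticality_score(4, likelihood_level(5.0)))
```

With both scales running from 1 to 5, scores range from 1 to 25.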

Next we define the key terms as we understand them.

• Failure Mode - Means/process via which software can contribute to a system failure.

• Effect - Behaviour resulting from the failure mode.

• Error - Discrepancy between a computed, observed, or measured value and the true, specified or theoretically correct value or condition.

• Defect - Manifestation of an error arising from the software requirements, design or code.

• Fault - Defect that has resulted in one or more failures.

• Scan - Basic observational unit.


4 Assumptions

4.1 Hardware

The SDP.PST pipeline is assumed to execute upon standard COTS equipment that complies with the SKA’s EMI, power, maintenance and cooling standards. This applies to racks, routers & switches, compute nodes and individual (internal) compute node components (processors, memory, accelerator cards, storage disks, NICs, power supplies, cooling components etc). The hardware is assumed to be housed in a suitable location providing appropriate power, cooling and climate control facilities. Figure 2 depicts our hardware assumptions. An abstracted rack configuration is presented in a), and an abstracted SDP compute node in b).

Figure 2: Simplified hardware block diagram describing SDP racks (a) and a diagram depicting an abstracted SDP compute node (b). Figure based upon diagrams originally produced by L. Christelis and P. C. Broekema, as part of their SDP work.

4.2 Architecture

The SDP will be an energy efficient yet extremely powerful High Performance Computing (HPC) system. We assume it consists of one or more ‘compute islands’. Each compute island is an independent scalable compute unit¹ [RD1] containing one or more racks as shown in Figure 2 a). Each rack can in turn contain one or more compute/data storage nodes, where a compute/data storage node is a typical COTS server as shown in Figure 2 b). In addition to COTS servers, each rack is presumed to contain industry standard networking and storage hardware.

¹ Compute islands are defined in JIRA, see Archive 390: https://jira.ska-sdp.org/browse/ARCHIVE-390.


Based on these assumptions we describe an abstracted architecture used to guide our analysis. It is summarised in the architectural data flow diagram shown in Figure 3. We assume that,

• each rack, and each compute node within it, is connected to a control system via a ‘management’ Ethernet switch. The control system is responsible for provisioning resources within the rack, monitoring their use, troubleshooting etc.

• each rack has a separate Ethernet switch dedicated to handling the ingest/transmission of all other data (e.g. science data, sky models, and metadata). Each compute node is connected to this switch, allowing data to be received from the CSP, and sent to the preservation system as appropriate.

• rack power and cooling is monitored via the management system.

• compute islands, the Telescope Manager (TM), the Central Signal Processor (CSP) and the preservation system are connected via suitable network interfaces and equipment.

• there will be redundant compute nodes, data storage nodes, and communication links which will help mitigate the impact of hardware failures.

• for our analysis we can treat the TM, CSP and preservation systems as black boxes interacting with our pipeline components. Thus any failure modes related to their use can only occur at any applicable common interfaces.

Figure 3: High level diagram showing the assumed architectural data flow. Figure based upon diagram first presented in [RD1].


4.3 Control

We treat the control system as a single abstracted entity interacting with the SDP.PST, and SDP hardware. For this analysis it is irrelevant if control is provided by the TM (e.g. [AD3,AD4]), LMC (Execution Control) or direct human interaction so long as,

• the control system initiates, pauses, and restarts scan processing as appropriate.

• the control system monitors both hardware and software states allowing the efficient management of resources.

• the control system can receive and correctly process information requests from the SDP.PST or the SDP.

• the control system can deliver information to the SDP.PST or the SDP. This includes details of the processing to be performed, associated metadata, sky models, pulsar ephemerides, standard pulsar profiles, RFI masks, calibration strategies and other relevant information.

• the control system can process and correctly act upon error messages/warnings sent by the SDP.PST or the SDP.

• the control system has some inherent redundancy making failures of the control system extremely unlikely.

• the control system can operate autonomously during scan processing, and take remedial action where/when appropriate according to any error messages received. This includes, for example, automatically compensating for hardware failures at the node level.

4.4 Communications

As per the CSP to SDP Interface Control Documents [AD1,AD2], we assume data is transmitted to the SDP via FTP (RFC 959). The communication interface is assumed to be bi-directional, although the data flow is uni-directional in practice (from CSP to SDP). Pulsar timing data transmitted via this protocol is sent one temporal sub-integration at a time², typically every 10 seconds, although sub-integration data could be sent by the CSP at any interval between 1 and 60 seconds. Finally the sub-integration data is sent in the PSRFITS format [RD7].
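Because the sending interval is bounded, the number of sub-integrations a scan should deliver can be estimated in advance, which gives the receiver a simple completeness check. The sketch below is illustrative only: the function name and the 30 minute scan length are hypothetical, and it assumes a partial final sub-integration is not sent.

```python
import math

MIN_INTERVAL_S = 1        # shortest sub-integration interval stated above
MAX_INTERVAL_S = 60       # longest sub-integration interval stated above
TYPICAL_INTERVAL_S = 10   # typical interval stated above


def expected_sub_integrations(scan_duration_s, interval_s=TYPICAL_INTERVAL_S):
    """Number of complete sub-integrations expected over one scan.

    Assumes a partial final sub-integration is not transmitted.
    """
    if not MIN_INTERVAL_S <= interval_s <= MAX_INTERVAL_S:
        raise ValueError("interval must lie between 1 and 60 seconds")
    return math.floor(scan_duration_s / interval_s)


# Hypothetical 30 minute scan at the typical 10 second interval: 180 sub-integrations.
print(expected_sub_integrations(30 * 60))
```

A receiver could compare this count against the number of files that actually arrived in order to flag missing data.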

4.5 Execution Framework

The execution framework is responsible for executing software components, providing them with hardware resources (memory, CPU time etc), monitoring their status/resource use, and restarting them upon failure. The framework treats available hardware resources as a pool, thus processing steps executed one after another need not be situated on the same physical hardware. It is the responsibility of the execution framework to correctly route data from one software component to another, if executed on different hardware. Finally the execution framework interacts with the control system and is situated on each and every SDP node.

² Defined more clearly in Section 4.7.


4.6 Science Software & Processing

Science software is expected to comprise both custom tools developed by the SDP consortia, and open source community algorithms. In either case, these will operate within the constraints of the execution framework, and interface with its error reporting system, so that errors can propagate from all software components to the TM.

Pulsar timing processing proceeds in a mostly linear fashion, with some data aggregation/buffering required in places. It is entirely possible for the processing to be done across multiple racks and/or compute islands. However it is better for data from the same beam to be processed on the same physical compute node.
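One simple way to keep each beam on a single node is a fixed beam-to-node mapping decided before the scan starts. The sketch below is an assumption made for illustration, not the scheduling policy of the execution framework; the function name and node names are hypothetical, and the 16 beams correspond to the maximum given in Section 4.8.

```python
def assign_beams_to_nodes(beam_ids, node_names):
    """Give every beam exactly one node, spreading beams round-robin."""
    if not node_names:
        raise ValueError("at least one compute node is required")
    return {beam: node_names[i % len(node_names)]
            for i, beam in enumerate(sorted(beam_ids))}


# 16 tied-array beams spread across four hypothetical nodes.
placement = assign_beams_to_nodes(range(16), ["node-1", "node-2", "node-3", "node-4"])
print(placement[0], placement[5], placement[15])  # node-1 node-2 node-4
```

Because the mapping depends only on the beam identifier and the node list, every sub-integration from a given beam lands on the same node for the whole scan.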

4.7 Data

The CSP produces ‘detected’ data. This is data that has been i) channelised, ii) fully corrected for dispersion in the Interstellar Medium (IM), iii) folded at the known pulsar period, and iv) partially calibrated. The resulting time, phase, frequency and polarisation data is sent to the SDP.PST as a matrix (also called a data cube). The matrix dimensions are determined by parameters chosen within CSP. These include the number of frequency channels $N_{\mathrm{chan}}$, the number of phase bins $N_{\mathrm{bin}}$, the number of temporal sub-integrations $N_{\mathrm{sub}}$, and the number of polarisations $N_{\mathrm{pol}}$. The size of the matrix in bits is given by,

$$N_{\mathrm{chan}} \times N_{\mathrm{bin}} \times N_{\mathrm{sub}} \times N_{\mathrm{pol}} \times N_{\mathrm{bit}}, \qquad (1)$$

where $N_{\mathrm{bit}}$ is the number of bits per sample in the matrix. The possible values for these parameters are constrained elsewhere [AD1,AD2]. The data cube is not sent alone. It is accompanied by attributes and metadata. We describe the complete data product that contains all this information as Pulsar Timing Data (PTD). This is described at the conceptual level in Figure 4 and summarised elsewhere [RD8].
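Equation (1) is straightforward to evaluate. The sketch below is illustrative: the function name is hypothetical and the parameter values are invented for the example, since the permitted ranges are set in [AD1,AD2] and not reproduced here.

```python
def data_cube_bits(n_chan, n_bin, n_sub, n_pol, n_bit):
    """Size of the data cube in bits, following Equation (1)."""
    for name, value in (("n_chan", n_chan), ("n_bin", n_bin), ("n_sub", n_sub),
                        ("n_pol", n_pol), ("n_bit", n_bit)):
        if value <= 0:
            raise ValueError(f"{name} must be positive")
    return n_chan * n_bin * n_sub * n_pol * n_bit


# Hypothetical parameters: 4096 channels, 2048 phase bins, 1 sub-integration,
# 4 polarisations and 32 bits per sample.
bits = data_cube_bits(4096, 2048, 1, 4, 32)
print(bits, "bits =", bits // 8 // 1024**2, "MiB")  # 1073741824 bits = 128 MiB
```

This covers the data cube only; the attributes and metadata that accompany it in the PTD add to the total.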

Figure 4: Conceptual data model for timing data. Entities shown: Timing Receive, Sub-arrays, Observation, Timing Data (PTD), Metadata and Data Cube. Attributes shown: Pulsar ID, Beam ID, Scan ID, Scheduling Block ID, Program Block ID, observation metadata, TBD metadata, TBD heuristics and the n-D matrix.


Table 3: Summary of the main SDP.PST software components.

Identifier      Name                  Description
SDP.PST.SC001   Command QA            Evaluates the quality and correctness of received commands.
SDP.PST.SC002   Parameter QA          Evaluates the quality and correctness of received parameters.
SDP.PST.SC003   Data QA               Evaluates the quality and correctness of received data, and data produced within the pipeline.
SDP.PST.SC004   Alert                 Generates, formats and transmits alert messages. This includes scientific and hardware/software related alerts/warnings.
SDP.PST.SC005   Timing Receive        Monitors and controls the ingest of data from the CSP.
SDP.PST.SC006   Remove RFI            Removes parts of the received data affected by RFI.
SDP.PST.SC007   Calibrate             Calibrates for flux and polarisation.
SDP.PST.SC008   Average               Produces partly averaged data cubes for data processing steps that require higher S/N values rather than high resolution. Sends averaged products to the preservation system.
SDP.PST.SC009   TOA Determination     Determine pulse TOAs by cross correlating the current observation with a pulsar-specific standard profile supplied externally. Generates 1 TOA per sub-integration and frequency channel.
SDP.PST.SC010   Compute Residuals     Uses a timing model to compute expected pulse TOA. Compares the expected & observed TOA, and generates timing residuals as the difference between them.
SDP.PST.SC011   Update Timing Model   Update the timing model for the observed pulsar.
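The TOA Determination and Compute Residuals components (SDP.PST.SC009 and SDP.PST.SC010) can be illustrated with a deliberately simplified sketch. It finds the whole-bin circular shift that maximises the cross-correlation between an observed profile and a standard profile, converts that shift to a time offset, and forms a residual. All function names, the whole-bin resolution, the observed-minus-expected sign convention and the example values are assumptions made for illustration; this is not the algorithm SDP.PST will use, and a real implementation would need sub-bin precision.

```python
import math


def best_phase_shift(profile, template):
    """Circular shift (in whole bins) of `template` that best matches `profile`."""
    n = len(profile)
    if n == 0 or n != len(template):
        raise ValueError("profile and template must be non-empty and equal in length")

    def correlation(shift):
        # Circular cross-correlation of the two profiles at the given lag.
        return sum(profile[(i + shift) % n] * template[i] for i in range(n))

    return max(range(n), key=correlation)


def toa_from_shift(t_ref_s, shift_bins, n_bin, period_s):
    """TOA: reference epoch plus the measured phase offset converted to seconds."""
    return t_ref_s + (shift_bins / n_bin) * period_s


def residual_s(observed_toa_s, predicted_toa_s):
    """Timing residual, here taken as observed minus expected TOA (assumed convention)."""
    return observed_toa_s - predicted_toa_s


# Hypothetical example: a Gaussian standard profile, and an observed profile
# that is the same pulse delayed by 5 phase bins.
n_bin = 64
period_s = 0.0064
template = [math.exp(-0.5 * ((i - 20) / 2.0) ** 2) for i in range(n_bin)]
observed = [template[(i - 5) % n_bin] for i in range(n_bin)]

shift = best_phase_shift(observed, template)
toa_s = toa_from_shift(t_ref_s=0.0, shift_bins=shift, n_bin=n_bin, period_s=period_s)
print(shift, toa_s, residual_s(toa_s, predicted_toa_s=0.0004))
```

In the pipeline described by Table 3 this calculation would be repeated for every sub-integration and frequency channel, since one TOA is generated for each.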

4.8 Pulsar Timing Modes

A maximum of 16 tied-array beams are available for use when in pulsar timing mode. Each beam can independently observe a different pulsar, thus up to 16 pulsars can be studied per scan. It is the responsibility of the CSP to produce data products that can be used by the SDP to perform high precision timing.

The SDP.PST executes multiple processing steps. The first involves RFI mitigation followed by a detailed flux and polarisation calibration. A number of intermediate ‘averaged’ data products are then generated, which provide different representations of the data. These are sent to the preservation archive. The pulse Times-of-Arrival (TOAs) are then determined, and the timing residuals computed. These are used to update the timing model for the observed pulsar following appropriate Quality Assurance (QA) checks. Any significant changes in pulse arrival times should raise an alert, as such a change is of scientific interest. The generalised pipeline steps are summarised in Figure 5, whilst Table 3 summarises the main SDP.PST components.

Note that all software components must be fault tolerant. To achieve this the timing pipeline must be capable of operating in two distinct modes:

• Standard mode - here communications are consistent, all data sources are accessible, all data sent and received is correctly formatted and valid, and data is successfully passed between SDP.PST software components without impediment (e.g. delays).

• Default mode - in the event of any error causing i) a disturbance in communications, ii) command parameters or metadata to become corrupted/invalid, iii) data formatting errors/corruption, iv) algorithmic/hardware malfunctions, v) a failure in control, or vi) any other unforeseeable error; the timing pipeline should enter a default mode. This mode prioritises the preservation of valuable science data, and may skip some/all processing steps as required.
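The choice between the two modes can be sketched as a simple health check. The class, field and function names below are hypothetical, introduced only to mirror conditions i) to vi) in the list above; how each condition is actually detected is not specified in this memo.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    STANDARD = "standard"
    DEFAULT = "default"


@dataclass
class PipelineStatus:
    """Hypothetical health flags, one per error class listed for default mode."""
    communications_ok: bool = True   # i) no disturbance in communications
    parameters_valid: bool = True    # ii) commands, parameters and metadata are valid
    data_valid: bool = True          # iii) no data formatting errors or corruption
    processing_ok: bool = True       # iv) no algorithmic or hardware malfunction
    control_ok: bool = True          # v) no failure in control
    unforeseen_error: bool = False   # vi) any other unforeseeable error


def select_mode(status: PipelineStatus) -> Mode:
    """Return STANDARD only when every check passes; otherwise fall back to DEFAULT."""
    healthy = (
        status.communications_ok
        and status.parameters_valid
        and status.data_valid
        and status.processing_ok
        and status.control_ok
        and not status.unforeseen_error
    )
    return Mode.STANDARD if healthy else Mode.DEFAULT


print(select_mode(PipelineStatus()))                        # Mode.STANDARD
print(select_mode(PipelineStatus(parameters_valid=False)))  # Mode.DEFAULT
```

Treating default mode as the fallback for any failed check, rather than enumerating failure combinations, matches the intent that the pipeline should preserve science data even under errors that were not foreseen.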


Figure 5: Activity diagram showing the processing steps in the pulsar timing pipeline. [Figure not reproduced. Recoverable activity groups and labels: Command & Parameter Checks (Obtain Processing Commands / Parameters [from TM], QA Commands / Parameters, Valid?, Report Problem [to TM]); Data Acquisition (Timing Receive, Ingest Data [get sub-int data from CSP], Buffer Data, More data?, Get sky models, ephemerides, standard model…, RFI Masks, Calibration solutions); Pulsar Timing Processing (Valid?, Remove RFI, Calibrate, Average, Send data to Archive, Determine TOAs, Send TOAs to QA System, Update Timing Model?, Evaluate Model Changes, Send Model to Archive, Follow up?, Generate Alert, end pulsar timing scan processing). The key identifies control/data flow, fork/join, activity start, activity end, processing activity, decision node and decision activity symbols.]


4.9 Likelihood & Probability

The likelihood and probability estimates provided by this analysis represent best guesses based on empirical experience. Whilst this is not ideal, there is no data available that can be used to facilitate a more rigorous analysis of failure rates and consequences.

5 Failure Modes

We consider three main sources of failure. These are addressed in separate sections for clarity. In each case the priority is to preserve science data whenever possible, even when extreme errors are encountered. This is because science data, even when damaged or corrupted, has utility.

5.1 Hardware Induced Failures

There are many possible causes for a hardware induced failure. These can occur before and during timing processing. To keep the analysis at a high level, we consider the following hardware failures and treat them as equivalent:

• failures resulting from a mechanical defect (e.g. system fan or hard drive mechanical failure).

• power or cooling failures necessitating system shut-down.

• failures caused by incorrect system configuration (e.g. BIOS errors).

• failures caused by firmware or operating system errors.

• electronics failures in hardware components (memory, CPU, motherboard etc.).

A number of failure modes related to hardware errors are listed in Tables 4, 5 and 6 below. For simplicity only scenarios where inherent redundancy fails are presented (i.e. a worst case scenario). This is because enumerating all possible failure scenarios and their combinations is out of scope for our high level analysis.


Table 4: Hardware induced failure modes 1-6.

FM.SDP.PST.101
  Function: Timing Receive
  FM Description: The SDP hardware responsible for ingesting data from the CSP encounters a hardware failure at the node level.
  Local Effect: An ingest failure results in data loss at the sub-integration data level or for an individual beam, and delays the processing.
  Sub-system Effect: Pulsar timing analysis less effective, some science data lost.
  System Effect: Operational reliability and efficiency are degraded. Integrity of Science Data is compromised due to data loss.
  Mitigation: Ensure the regular maintenance of ingest nodes, and prevent their use when exhibiting behaviours symptomatic of an impending hardware failure. Where possible immediately compensate for the error by repeating the ingest with operational hardware.
  Severity: Minor
  Likelihood: Occasional

FM.SDP.PST.102
  Function: Timing Receive
  FM Description: The SDP hardware responsible for ingesting data from the CSP encounters a hardware failure at the rack level.
  Local Effect: An ingest failure results in significant data loss for one or more beams, and significantly delays the processing.
  Sub-system Effect: Pulsar timing analysis significantly compromised, moderate science data lost.
  System Effect: Operational reliability and efficiency are degraded. Integrity of Science Data is significantly compromised.
  Mitigation: Same as above.
  Severity: Critical
  Likelihood: Remote

FM.SDP.PST.103
  Function: Timing Receive
  FM Description: The SDP hardware responsible for ingesting data from the CSP encounters a hardware failure impacting the data ingest island.
  Local Effect: Without the capacity to buffer data sent by the CSP, an ingest failure at the data island level results in the loss of scan data for all beams.
  Sub-system Effect: Pulsar timing analysis not possible, all science data lost.
  System Effect: Operational reliability and efficiency are degraded. No science possible.
  Mitigation: Attempt to reroute the data received from CSP to available correctly functioning hardware resources. The mitigation strategies from above also apply here.
  Severity: Catastrophic
  Likelihood: Extremely unlikely

FM.SDP.PST.104
  Function: QA Commands / Parameters
  FM Description: The hardware executing the code that checks the correctness and validity of commands/parameters fails.
  Local Effect: Without valid commands or parameters the pipeline must enter default mode which leads to sub-optimal processing.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Efficiency degraded, minor impact on science outputs.
  Mitigation: Operate in default mode, thereby ensuring the science data is still processed and preserved in the appropriate data archive. The data must be flagged to show it has been subjected to default mode processing.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.105
  Function: Remove RFI
  FM Description: The hardware executing the RFI mitigation code fails.
  Local Effect: The signal-to-noise ratio of the detected pulse will be lower without RFI mitigation.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor impact on science outputs.
  Mitigation: Add a flag to the data making it clear that RFI mitigation is yet to be performed, and proceed to the next step so that pipeline processing does not halt and no data lost.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.106
  Function: Calibrate
  FM Description: The hardware executing the calibration code fails.
  Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor impact on science outputs.
  Mitigation: Add a flag to the data making it clear that calibration is yet to be performed, and proceed to the next step so that pipeline processing does not halt and no data lost.
  Severity: Marginal
  Likelihood: Remote


Table 5: Hardware induced failure modes 7-11.

FM.SDP.PST.107
  Function: Average
  FM Description: The hardware executing the code responsible for producing averaged data products fails.
  Local Effect: Inability to produce intermediate output data products.
  Sub-system Effect: Pulsar timing analysis unaffected. Data products useful for post-processing analysis lost.
  System Effect: No impact on science outputs so long as primary data product is stored. Intermediate data products can be recreated via post-processing.
  Mitigation: Add a flag to the data making it clear that averaging is yet to be performed, and proceed to the next step so that pipeline processing does not halt and no data lost.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.108
  Function: Archive Average Products
  FM Description: The hardware executing the code that archives multiple intermediate averaged data products, and the data cube, fails.
  Local Effect: Storage of primary and averaged data products fails.
  Sub-system Effect: Pulsar timing pipeline fails to persist primary science data.
  System Effect: Catastrophic impact on science outputs.
  Mitigation: It is imperative that the primary data product of the timing pipeline, the data cube, is persisted. Thus this step must be re-run upon failure until the primary data product at a minimum is stored. This may hold up processing, thus may require the buffering of data from any subsequent scans.
  Severity: Critical
  Likelihood: Remote

FM.SDP.PST.109
  Function: Determine TOAs
  FM Description: The hardware executing the code that determines pulse TOAs fails.
  Local Effect: Pulse arrival times cannot be computed.
  Sub-system Effect: Pulsar timing pipeline cannot measure pulse arrival times, compute residuals, and update timing models. Timing pipeline also fails to trigger alerts for profile changes of scientific interest.
  System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
  Mitigation: Add a flag to the data making it clear that the TOAs could not be determined. Proceed to archive the data so that pipeline processing does not halt and no data lost.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.110
  Function: Archive TOAs
  FM Description: The hardware executing the code that sends the computed TOAs to the archive fails.
  Local Effect: Failure to store TOAs.
  Sub-system Effect: Pulsar timing pipeline fails to archive useful science data.
  System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
  Mitigation: Continue retrying to archive the TOAs until some timeout period TBD has elapsed. If the TOAs cannot be archived, add a flag to the data indicating this, and proceed to the next step.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.111
  Function: Generate Residuals
  FM Description: The hardware executing the code that generates timing residuals fails.
  Local Effect: Failure to generate timing residuals.
  Sub-system Effect: Pulsar timing pipeline cannot detect scientifically interesting profile changes. This prevents rapid follow-up.
  System Effect: Minor impact on science outputs. Residuals can be computed via post-processing if necessary.
  Mitigation: Add a flag to the data indicating that the residuals could not be computed. Proceed to archive the data so that pipeline processing does not halt and no data lost.
  Severity: Marginal
  Likelihood: Remote
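The mitigation given for FM.SDP.PST.110 (retry archiving until a timeout elapses, then flag the data and move on) can be sketched as follows. The function name, the flag string, the use of OSError to signal an archive failure, and the injectable `clock` and `sleep` parameters are all assumptions made for illustration; the memo leaves the timeout period TBD and does not define an archive interface.

```python
import time


def archive_with_retry(archive, product, flags, timeout_s, retry_interval_s=1.0,
                       flag="NOT_ARCHIVED", clock=time.monotonic, sleep=time.sleep):
    """Retry `archive(product)` until it succeeds or `timeout_s` would be exceeded.

    Returns True on success. On timeout, adds `flag` to `flags` and returns False
    so that the caller can proceed to the next pipeline step instead of halting.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            archive(product)
            return True
        except OSError:
            # Give up if waiting for another attempt would pass the deadline.
            if clock() + retry_interval_s > deadline:
                flags.add(flag)
                return False
            sleep(retry_interval_s)


# Hypothetical archive that fails twice before succeeding.
attempts = []


def flaky_archive(product):
    attempts.append(product)
    if len(attempts) < 3:
        raise OSError("archive temporarily unavailable")


flags = set()
ok = archive_with_retry(flaky_archive, "toas", flags, timeout_s=10.0, retry_interval_s=0.0)
print(ok, len(attempts), flags)
```

Returning a status and recording a flag, rather than raising, reflects the stated priority that pipeline processing should not halt when a non-primary product cannot be stored.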


Table 6: Hardware induced failure modes 12-16.

FM.SDP.PST.112
  Function: QA Residuals
  FM Description: The hardware executing the code that evaluates the quality of the residuals fails.
  Local Effect: Poor quality residuals propagated through pipeline.
  Sub-system Effect: Pulsar timing pipeline continues processing with poor residuals.
  System Effect: Minor impact on science outputs. Residuals can be computed via post-processing if necessary.
  Mitigation: Proceed to the next step so that pipeline processing does not halt and no data lost. Append a flag to the data indicating that the residuals require a QA analysis.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.113
  Function: Update Timing Model
  FM Description: The hardware executing the code that updates the timing model fails.
  Local Effect: Timing model not updated.
  Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
  System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
  Mitigation: Add a flag to the data making it clear that the timing model has not been updated, and proceed to archive the data so that pipeline processing does not halt and no data lost.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.114
  Function: Archive Timing Model
  FM Description: The hardware executing the code that archives the timing model fails.
  Local Effect: Timing model not archived.
  Sub-system Effect: Pulsar timing pipeline cannot carry out its primary purpose, to automatically update timing models.
  System Effect: Minor impact on science outputs. Timing model can be recomputed via post-processing if necessary.
  Mitigation: Continue retrying to archive the timing model until some time-out period TBD has elapsed. If model not archived, add a flag to the data indicating this, and proceed to the data archival step.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.115
  Function: Evaluate Model Changes
  FM Description: The hardware executing the code that evaluates changes to the timing model fails.
  Local Effect: Inability to detect significant profile changes.
  Sub-system Effect: Pulsar timing pipeline cannot detect scientifically significant pulse profile changes (e.g. glitches or mode changes).
  System Effect: Minor to marginal impact on science outputs. Failure to evaluate prevents rapid follow-up. Data can be post-processed allowing belated evaluation.
  Mitigation: Add a flag to the data making it clear the model has not been evaluated for change, and proceed to the data archival step so that pipeline processing does not halt and no data lost.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.116
  Function: Generate Alert
  FM Description: The hardware executing the code that generates alerts fails.
  Local Effect: Alerts not generated.
  Sub-system Effect: Pulsar timing pipeline cannot alert TM or the community to scientifically interesting events.
  System Effect: Minor to marginal impact on science outputs.
  Mitigation: Add a flag to the data making it clear that the data requires follow-up analysis. Continue to attempt to generate an alert until some time-out period TBD has elapsed.
  Severity: Marginal
  Likelihood: Remote

FM.SDP.PST.117
  Function: All archiving functions
  FM Description: The hardware archiving pulsar timing data (data cubes, residuals, TOAs, metadata or timing models) fails.
  Local Effect: Science data not persisted.
  Sub-system Effect: Pulsar timing pipeline completes processing however science data is lost.
  System Effect: Marginal to Critical impact on science outputs.
  Mitigation: If archiving fails due to a hardware error, causing data loss, the observation must be rescheduled and repeated.
  Severity: Marginal to Critical
  Likelihood: Remote


5.2 Control & Communication Failures

Control failures can occur in a variety of ways. For example,

• control can be lost at the node level, due to the failure of a node management daemon or controlling SDP process.

• control can be lost at the rack or compute island level, similarly to above. This could be caused, for example, by a Top of Rack (TOR) switch failure.

• control can be lost/degraded due to a failure of the SDP management system. This can be caused by either software or hardware failures/errors. Whilst a connection to the management system will likely always be available due to the network topology used, bandwidth could be reduced.

• control can be lost due to a problem with the telescope manager, or the LMC. Note that within SDP the LMC is known as the execution control system [RD9] (see section 2.1.1 in the external document).

• control can fail due to communication errors. This could be caused by, for example, the failure of networking hardware, a network security intrusion, or the corruption of network traffic due to software problems (e.g. in firmware).

• control can fail due to use of inappropriate commands, and/or human error.

While there are many possible control failure scenarios, we consider only high level failures for brevity.

Clearly communication failures can cause many of the control issues outlined above. However, communication problems can also affect SDP processing, and these possibilities are considered separately. Communication failures occur due to,

• the corruption of data packets.

• networking hardware failures, or hardware failures at the node level (e.g. at the NICs).

• software errors in processing components which corrupt or invalidate communication.

• incompatible communication protocols or data types.

A number of failure modes related to control and communications are listed in Tables 7 through 13 below. For simplicity only scenarios where inherent redundancy fails are presented (i.e. a worst case scenario).


Table 7: Control and Communication failure modes 1-6.

FM.SDP.PST.201
  Function: ALL
  FM Description: Control of a timing pipeline component is temporarily lost.
  Local Effect: Timing pipeline component cannot be controlled or monitored externally.
  Sub-system Effect: Failure to correctly control and monitor pulsar timing processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
  Mitigation: Allow all timing pipeline components to operate autonomously in either a default or a standard mode. Attempt to confirm/re-establish control after the completion of each scan. Raise an alarm.
  Severity: Minor
  Likelihood: Remote

FM.SDP.PST.202
  Function: ALL
  FM Description: Control of a timing pipeline component is lost for a period of time that exceeds a scan length.
  Local Effect: Timing pipeline component cannot be controlled or monitored externally.
  Sub-system Effect: Failure to correctly control and monitor the timing processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
  Mitigation: Complete processing of data obtained during the previous/current scan so long as commands are valid, raise an alarm, then await instruction from TM.
  Severity: Minor
  Likelihood: Remote

FM.SDP.PST.203
  Function: ALL
  FM Description: Control parameters given to a timing pipeline component are incorrectly formatted or invalid.
  Local Effect: Timing pipeline component incorrectly processes data.
  Sub-system Effect: Timing pipeline component cannot correctly process the data ingested from CSP causing data loss / sub-optimal processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data compromised.
  Mitigation: Automatically detect incorrect parameters and autonomously enter default mode to prevent the loss of science data. Raise an alarm.
  Severity: Minor
  Likelihood: Remote

FM.SDP.PST.204
  Function: ALL
  FM Description: Control commands given to the timing pipeline component are invalid or incorrectly formatted.
  Local Effect: Timing pipeline component incorrectly processes data.
  Sub-system Effect: Timing pipeline component cannot correctly process the data ingested from CSP causing data loss / sub-optimal processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data compromised.
  Mitigation: Automatically detect incorrect commands and autonomously enter default mode to prevent the loss of science data. Raise an alarm.
  Severity: Minor
  Likelihood: Remote

FM.SDP.PST.205
  Function: ALL
  FM Description: No monitor or control signals transmitted or received from outside of the SDP.
  Local Effect: Timing pipeline component cannot be controlled or monitored externally.
  Sub-system Effect: Failure to correctly control and monitor the timing processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
  Mitigation: Redundant software monitor/control network. Allow timing pipeline to operate autonomously in default mode in the event of control failure. De-couple control and monitoring within the Execution Control Component.
  Severity: Minor
  Likelihood: Remote

FM.SDP.PST.206
  Function: ALL
  FM Description: No monitor or control signals transmitted or received temporarily inside of the SDP.
  Local Effect: Timing pipeline component cannot be controlled internally.
  Sub-system Effect: Failure to correctly control and monitor the timing processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
  Mitigation: Redundant software monitor/control network. Allow timing pipeline to operate autonomously in default mode in the event of control failure. De-couple control and monitoring within the Execution Control Component.
  Severity: Minor
  Likelihood: Remote


Table 8: Control and Communication failure modes 7-14.

FM.SDP.PST.207
  Function: ALL
  FM Description: No monitor or control signals transmitted or received temporarily inside of the SDP, for a period of time that exceeds a scan length.
  Local Effect: Timing pipeline component cannot be controlled internally.
  Sub-system Effect: Failure to correctly control and monitor the timing processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
  Mitigation: Redundant software monitor/control network. Complete processing of data obtained during the previous/current scan so long as commands are valid, raise an alarm, then await instruction from TM.
  Severity: Minor
  Likelihood: Remote

FM.SDP.PST.208
  Function: ALL
  FM Description: Missing or corrupt monitor and control packets.
  Local Effect: Unable to reliably monitor or control pipeline components.
  Sub-system Effect: Failure to correctly control and monitor the timing processing.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
  Mitigation: Allow timing pipeline to operate autonomously in default mode in the event of control failure.
  Severity: Significant
  Likelihood: Remote

FM.SDP.PST.209
  Function: ALL
  FM Description: Routing and transmission of data within SDP fails due to missing or corrupt data packets.
  Local Effect: Data not transmitted.
  Sub-system Effect: Pulsar timing analysis not possible.
  System Effect: Operational reliability and efficiency are degraded. All science data lost.
  Mitigation: Resilience of routing. In the event of a communications failure, allow the timing pipeline to operate autonomously in a default mode that prioritizes saving the science data.
  Severity: Significant
  Likelihood: Remote

FM.SDP.PST.210
  Function: ALL
  FM Description: Routing and transmission of data within SDP temporarily fails due to network errors or failures.
  Local Effect: Data not transmitted.
  Sub-system Effect: Pulsar timing analysis not possible.
  System Effect: Operational reliability and efficiency are degraded. All science data lost.
  Mitigation: Redundant data network. Resilience of routing.
  Severity: Significant
  Likelihood: Remote

FM.SDP.PST.211
  Function: ALL
  FM Description: Routing and transmission of data within SDP fails due to network errors or failures, for a period of time that exceeds a scan length.
  Local Effect: Data not transmitted.
  Sub-system Effect: Pulsar timing analysis not possible.
  System Effect: Operational reliability and efficiency are degraded. All science data lost.
  Mitigation: Redundant data network. Resilience of routing.
  Severity: Catastrophic
  Likelihood: Extremely unlikely

FM.SDP.PST.212
  Function: ALL
  FM Description: Compound routing/communication errors occurring at different locations within SDP.
  Local Effect: Data not transmitted.
  Sub-system Effect: Pulsar timing analysis not possible.
  System Effect: Operational reliability and efficiency are degraded. All science data lost.
  Mitigation: Redundant data network. Resilience of routing. Cease processing and await TM instruction.
  Severity: Catastrophic
  Likelihood: Extremely unlikely

FM.SDP.PST.213
  Function: Timing Receive
  FM Description: Control parameters sent to the timing receive component are corrupted via packet loss or some other communication error.
  Local Effect: Timing receive incorrectly processes received data.
  Sub-system Effect: Timing receive cannot correctly ingest the data from CSP causing data loss.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data compromised.
  Mitigation: Automatically detect incorrect parameters and autonomously enter default mode to prevent the loss of science data.
  Severity: Minor
  Likelihood: Occasional


Table 9: Control and Communication failure modes 14-19.

Function: Timing Receive (FM.SDP.PST.214)
FM Description: Routing and transmission of data from the CSP fails due to too many missing or corrupt data packets.
Local Effect: No pulsar timing data received.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Resilience of routing. The capacity to request that data be re-sent.
Severity: Catastrophic
Likelihood: Remote

Function: Timing Receive (FM.SDP.PST.215)
FM Description: Routing and transmission of data from the CSP temporarily fails due to network communication failures.
Local Effect: No pulsar timing data received.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Re-establish connectivity, and if possible request scan data be resent from CSP.
Severity: Catastrophic
Likelihood: Remote

Function: Timing Receive (FM.SDP.PST.216)
FM Description: Data received from the CSP is marginally corrupted via packet loss or some other communication error.
Local Effect: Timing receive processes partly corrupted data.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Science data loses some of its utility.
Mitigation: Monitor proportion of data subject to corruption. Continue to function normally so long as less than 20% TBC of the data is corrupted. If more than 20% TBC is corrupted raise an alarm, but continue to function and annotate the processed data with a flag indicating that its utility is significantly degraded.
Severity: Marginal
Likelihood: Occasional

Function: Timing Receive (FM.SDP.PST.217)
FM Description: Timing receive temporarily loses connectivity with downstream SDP components.
Local Effect: Timing receive cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output not produced.
Mitigation: Send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote

Function: Timing Receive (FM.SDP.PST.218)
FM Description: Timing receive loses all connectivity with downstream SDP components for a period of time longer than a scan duration.
Local Effect: Timing receive cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output not produced.
Mitigation: Resilience of routing.
Severity: Catastrophic
Likelihood: Extremely unlikely

Function: Timing Receive / Ingest (FM.SDP.PST.219)
FM Description: Failure to ingest received data in a timely fashion, causing a data backlog which cannot be cached.
Local Effect: Data does not enter the pipeline quickly enough to complete timing processing in the allotted time.
Sub-system Effect: Pulsar timing analysis incomplete.
System Effect: Scientific outputs degraded. Science data loses some of its utility, some data loss.
Mitigation: Resilience of routing, automatic load balancing to prevent resource contention and processing delays.
Severity: Marginal
Likelihood: Remote
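The FM.SDP.PST.216 mitigation (monitor the proportion of corrupted data, carry on below a threshold, raise an alarm and annotate the output above it) could be sketched roughly as follows. This is an illustrative sketch only: the class and method names, and the choice to count corruption per packet, are assumptions not taken from the memo, and the 20% threshold is the memo's TBC value rather than a confirmed figure.

```python
from dataclasses import dataclass

# Hypothetical constant: the memo quotes 20% as TBC (to be confirmed).
CORRUPTION_THRESHOLD = 0.20


@dataclass
class CorruptionMonitor:
    """Tracks the fraction of received packets that arrived corrupted."""
    total: int = 0
    corrupted: int = 0

    def record(self, is_corrupted: bool) -> None:
        """Count one received packet and whether it was corrupted."""
        self.total += 1
        if is_corrupted:
            self.corrupted += 1

    @property
    def fraction(self) -> float:
        return self.corrupted / self.total if self.total else 0.0

    def status(self) -> dict:
        """Processing always continues; only the alarm and flag change."""
        degraded = self.fraction > CORRUPTION_THRESHOLD
        return {
            "continue_processing": True,
            "raise_alarm": degraded,
            "utility_degraded_flag": degraded,
            "corrupted_fraction": self.fraction,
        }
```

Note that the pipeline keeps running in both cases; the threshold only decides whether the output is annotated as significantly degraded.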


Table 10: Control and Communication failure modes 20-23.

Function: Timing Receive / Ingest (FM.SDP.PST.220)
FM Description: Sub-integration data impacted by packet loss when using FTP (as data sent 1 sub-int at a time).
Local Effect: Timing receive processes partly corrupted data. Mitigation strategy incurs computational overhead.
Sub-system Effect: Reduced effectiveness of pulsar timing analysis.
System Effect: Minor degradation to science outputs.
Mitigation: Request that data be resent. If resend impossible, add a zeroed sub-int in place of the corrupted sub-int. Update cumulative tracking of lost sub-ints and sub-int samples. If cumulative data loss more than 20% TBC then too much signal has been lost and an alarm must be raised. Tag the data so the proportion of lost sub-ints is recorded.
Severity: Scan dependent. Fractional loss is important. Severity ranges from minor to critical due to cumulative effects.
Likelihood: Occasional

Function: Remove RFI (FM.SDP.PST.221)
FM Description: Remove RFI function temporarily loses connectivity with downstream SDP components.
Local Effect: Remove RFI function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output degraded.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. If retry fails, send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote

Function: Calibrate (FM.SDP.PST.222)
FM Description: Calibrate function temporarily loses connectivity with downstream SDP components.
Local Effect: Calibrate function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output degraded.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. If retry fails, send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote

Function: Archive Average Products (FM.SDP.PST.223)
FM Description: Average function temporarily loses connectivity with downstream SDP components.
Local Effect: Average function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output not produced.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. If retry fails, send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote
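The mitigation shared by FM.SDP.PST.221 to 223 (retry until a time-out elapses, then send the data to preservation unprocessed, flag it, and generate an alert) might look something like the sketch below. The function name, the three callables and the retry interval are hypothetical placeholders; the memo leaves the time-out period TBD and does not define these interfaces.

```python
import time
from typing import Callable


def send_with_fallback(
    send: Callable[[bytes], bool],
    preserve_unprocessed: Callable[[bytes, dict], None],
    alert: Callable[[str], None],
    data: bytes,
    timeout_s: float,
    retry_interval_s: float = 0.1,
) -> bool:
    """Retry `send` until it succeeds or `timeout_s` elapses.

    On time-out the data is handed to the preservation system unprocessed,
    flagged for follow-up post-processing, and an alert is generated.
    Returns True if the data went downstream, False if the fallback ran.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if send(data):
            return True
        if time.monotonic() >= deadline:
            break
        time.sleep(retry_interval_s)
    # Fallback path: preserve first so the science data is not lost.
    preserve_unprocessed(data, {"needs_post_processing": True})
    alert("downstream send timed out; data preserved unprocessed")
    return False
```

A monotonic clock is used for the deadline so that wall-clock adjustments cannot shorten or extend the retry window.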


Table 11: Control and Communication failure modes 24-28.

Function: Archive Average Products (FM.SDP.PST.224)
FM Description: Connectivity with the preservation system is temporarily lost, preventing storage of averaged data products and the primary data cube.
Local Effect: Storage of primary and averaged data products fails.
Sub-system Effect: Pulsar timing pipeline fails to persist primary science data.
System Effect: Potential for critical impact on science outputs.
Mitigation: It is imperative for the primary data product of the timing pipeline, the data cube, to be persisted. Thus this step must be re-run upon failure until the primary data product at a minimum is stored, or until some time-out period TBD has elapsed. Generate an alert. If the data is retained in a buffer and not discarded until science outputs are persisted, then the severity is reduced to marginal.
Severity: Marginal to Critical
Likelihood: Remote

Function: Archive Average Products (FM.SDP.PST.225)
FM Description: Archive Average Products function temporarily loses connectivity with downstream SDP components.
Local Effect: Archive Average Products function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Some scientific output not produced.
Mitigation: Generate an alert, and prepare for next scan (no further processing possible). Flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote

Function: Determine TOAs (FM.SDP.PST.226)
FM Description: Determine TOAs function temporarily loses connectivity with downstream SDP components.
Local Effect: Determine TOAs function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Some scientific output not produced.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. Generate an alert if data is not sent, and prepare for the next scan (no further processing possible). Flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote

Function: Archive TOAs (FM.SDP.PST.227)
FM Description: Connectivity with the preservation system is temporarily lost, preventing the storage of TOAs.
Local Effect: Failure to store TOAs.
Sub-system Effect: Pulsar timing pipeline fails to archive useful science data.
System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
Mitigation: Continue to attempt to archive the TOAs until some time-out period TBD has elapsed. If archiving fails, add a flag to the data making it clear that the TOAs have not been archived. Proceed to the next step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

Function: Archive TOAs (FM.SDP.PST.228)
FM Description: Archive TOAs function temporarily loses connectivity with downstream SDP components.
Local Effect: Archive TOAs function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Some scientific output not produced.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. Generate an alert if data is not sent. Flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote
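FM.SDP.PST.224 notes that severity drops to marginal if data is retained in a buffer and not discarded until the science outputs are persisted. One minimal way to express that idea is sketched below; the class name and the `persist` callable (assumed to return True only once storage is confirmed) are hypothetical, and a real buffer would also need a capacity limit and the TBD time-out.

```python
from collections import deque
from typing import Callable, Deque


class RetainUntilPersisted:
    """Holds data products in memory until a persist call confirms storage."""

    def __init__(self, persist: Callable[[bytes], bool]) -> None:
        self._persist = persist              # returns True only once stored
        self._pending: Deque[bytes] = deque()

    def submit(self, product: bytes) -> None:
        """Queue a product (e.g. the primary data cube) for persistence."""
        self._pending.append(product)

    def flush(self) -> int:
        """Try to persist pending products in order; stop at first failure.

        Returns the number of products still waiting in the buffer.
        """
        while self._pending:
            if not self._persist(self._pending[0]):
                break                        # keep the product; retry later
            self._pending.popleft()          # discard only after success
        return len(self._pending)
```

Calling `flush()` again after connectivity returns re-attempts the same products in their original order, so the primary data product is never dropped on a transient failure.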


Table 12: Control and Communication failure modes 29-33.

Function: Generate Residuals (FM.SDP.PST.229)
FM Description: Generate Residuals function temporarily loses connectivity with downstream SDP components.
Local Effect: Generate Residuals function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Some scientific output not produced.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. If the data is not sent, generate an alert. Then prepare for the next scan (no further processing possible). Flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote

Function: Send Residuals to QA System (FM.SDP.PST.230)
FM Description: Send Residuals to QA System function temporarily loses connectivity with downstream SDP components.
Local Effect: Send Residuals to QA System function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis quality reduced.
System Effect: Quality of science output affected.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. Generate an alert, and flag the data for residual QA, and move to the next processing step.
Severity: Minor
Likelihood: Remote

Function: Update Timing Model (FM.SDP.PST.231)
FM Description: Timing model for the pulsar being observed cannot be obtained externally.
Local Effect: Timing model not updated.
Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
Mitigation: Continue to attempt to obtain the timing model until some time-out period TBD has elapsed. If unavailable on retry, add a flag to the data indicating this. Proceed to the next step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

Function: Evaluate Model Changes (FM.SDP.PST.232)
FM Description: Evaluate Model Changes function temporarily loses connectivity with downstream SDP components.
Local Effect: Evaluate Model Changes function cannot pass data through the timing pipeline.
Sub-system Effect: Cannot generate alerts based on changes in a timing profile.
System Effect: Quality of science output affected.
Mitigation: Generate an alert, and flag the data for model change analysis post-processing. Then proceed to archive the timing model.
Severity: Minor
Likelihood: Remote

Function: Archive Timing Model (FM.SDP.PST.233)
FM Description: Connectivity with the preservation system is temporarily lost, preventing storage of the updated timing model.
Local Effect: Timing model not sent to the archive/preservation system. Timing model not archived.
Sub-system Effect: Pipeline fails to automatically update timing models.
System Effect: Minor impact on science outputs. Timing model can be recomputed via post-processing if necessary.
Mitigation: Retry sending the data until some timeout period TBD has elapsed. If the data is not sent, raise an alarm. Add a flag to the data making it clear the timing model has not been persisted. Proceed to the next step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote
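Several mitigations in this table (for example FM.SDP.PST.231 and FM.SDP.PST.233) follow a "flag and proceed" pattern, so that a failing step does not halt the pipeline. A rough sketch of that pattern is given below, assuming each step is a function from a scan record to an updated scan record; the `run_pipeline` name, the dictionary-based record and the flag text are illustrative assumptions, not the pipeline's actual interfaces.

```python
from typing import Callable, List, Tuple

# A step is a (name, function) pair; the function maps a scan record to a
# new scan record, or raises if it cannot complete.
Step = Tuple[str, Callable[[dict], dict]]


def run_pipeline(steps: List[Step], scan: dict) -> dict:
    """Run each step in order; on failure, flag the scan and carry on."""
    flags = list(scan.get("flags", []))
    for name, step in steps:
        try:
            scan = step(scan)
        except Exception as exc:  # sketch only: real code would catch narrower errors
            # Record why the step was skipped so post-processing can redo it.
            flags.append(f"{name}: flagged for post-processing ({exc})")
    return {**scan, "flags": flags}
```

For instance, if a hypothetical `update_timing_model` step raises because the model cannot be obtained, later steps such as archiving still run and the returned record carries one flag naming the skipped step.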


Table 13: Control and Communication failure modes 34-36.

Function: Update Timing Model (FM.SDP.PST.234)
FM Description: Update Timing Model function temporarily loses connectivity with downstream SDP components.
Local Effect: Update Timing Model function cannot pass data through the timing pipeline.
Sub-system Effect: Timing model not updated.
System Effect: Automatic update of timing models fails.
Mitigation: Generate an alert, and flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote

Function: Generate Alert (FM.SDP.PST.235)
FM Description: Connectivity with the alert system is temporarily lost, preventing rapid follow-up.
Local Effect: Alerts not generated.
Sub-system Effect: Pulsar timing pipeline cannot alert TM or the research community to scientifically interesting events.
System Effect: Minor to marginal impact on science outputs.
Mitigation: Add a flag to the data making it clear that the data requires follow-up analysis. Continue to attempt to generate an alert until some time-out period TBD has elapsed.
Severity: Marginal
Likelihood: Remote

Function: ALL - Metadata Acquisition (FM.SDP.PST.236)
FM Description: Connectivity with the system/s responsible for managing and supplying metadata is temporarily lost. This impacts the acquisition of sky models, RFI masks, calibration strategies, pulsar ephemerides, Standard Profiles and timing models.
Local Effect: Data required for processing not available, causing processing steps to be missed.
Sub-system Effect: Pulsar timing pipeline unable to run correctly.
System Effect: Minor to marginal impact on science outputs.
Mitigation: Retry obtaining the required metadata until some time-out period TBD has elapsed. If metadata still unavailable, generate an alert. Add a flag to the data making it clear that the data requires follow-up analysis. Proceed to the next processing step where possible in default mode.
Severity: Marginal
Likelihood: Remote


5.3 Data Failures

Data failures arise when data is incorrectly formatted, contains invalid values, or is not provided when expected. Formatting and validity issues typically arise through software errors and incorrectly implemented interfaces. Such errors can also occur due to communication issues (e.g. packet loss) or memory problems (e.g. bit flips) that cause data corruption.

Data problems can also arise when using external databases. Data requested from an external resource can become corrupted during transfer or through data mismanagement. As the pulsar timing pipeline requires external data to function (e.g. pulsar ephemerides), such errors are plausible.
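The three classes of data failure described above (incorrect format, invalid values, corruption in transit or memory) can each be checked before data enters a processing step. The sketch below is a minimal illustration under stated assumptions: the memo does not prescribe a checksum scheme, so the use of CRC-32 here, the `validate_subint` name and the flat list-of-floats layout are all hypothetical.

```python
import math
import zlib
from typing import List, Sequence


def validate_subint(samples: Sequence[float], expected_len: int,
                    payload: bytes, expected_crc: int) -> List[str]:
    """Return a list of problems found; an empty list means the data passed."""
    problems = []
    # Format check: the sub-integration must have the agreed number of samples.
    if len(samples) != expected_len:
        problems.append("format: unexpected number of samples")
    # Validity check: NaN and infinities are never legitimate sample values.
    if any(not math.isfinite(x) for x in samples):
        problems.append("values: NaN or infinite sample present")
    # Corruption check: compare a CRC-32 of the raw bytes with the sender's.
    if zlib.crc32(payload) != expected_crc:
        problems.append("corruption: checksum mismatch")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller decide between the mitigations discussed in the tables that follow, such as requesting a resend or substituting a zeroed sub-integration.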

A number of failure modes related to data are listed in Tables 14 to 16.


Table 14: Data failure modes 1-6.

Function: Timing Receive / Ingest (FM.SDP.PST.301)
FM Description: Sub-integration data incorrectly formatted / contains invalid values.
Local Effect: Timing receive processes partly corrupted or incorrectly formatted data. Mitigation strategy incurs computational overhead.
Sub-system Effect: Reduced effectiveness of pulsar timing analysis.
System Effect: Minor degradation to science outputs.
Mitigation: Add a zeroed sub-int in place of incorrectly formatted or invalid sub-integration (or sub-int data point). Update cumulative tracking of lost sub-ints and sub-int samples. If cumulative data loss more than 20% TBC then too much signal has been lost and an alarm must be raised. Tag the data so the proportion of lost sub-ints is recorded.
Severity: Scan dependent. Fractional loss is important. Severity ranges from minor to critical due to cumulative effects.
Likelihood: Remote

Function: Remove RFI (FM.SDP.PST.302)
FM Description: No RFI mask provided.
Local Effect: Cannot remove/mitigate RFI. The signal-to-noise ratio of the detected pulse will be lowered.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor to Marginal impact on science outputs.
Mitigation: Add a flag to the data making it clear that RFI mitigation is yet to be performed, and proceed to the next processing step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Extremely Unlikely

Function: Remove RFI (FM.SDP.PST.303)
FM Description: Invalid/corrupt RFI mask provided to the RFI mitigation component.
Local Effect: Cannot remove/mitigate RFI. The signal-to-noise ratio of the detected pulse will be lowered.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor to Marginal impact on science outputs.
Mitigation: Same as FM.SDP.PST.302.
Severity: Marginal
Likelihood: Remote

Function: Remove RFI (FM.SDP.PST.304)
FM Description: Inappropriate RFI mask provided to the RFI mitigation component.
Local Effect: The signal-to-noise ratio of the detected pulse will be lower without RFI mitigation.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor to Marginal impact on science outputs.
Mitigation: Undo the mitigation step and add a flag to the data making it clear that RFI mitigation is yet to be performed. Must also explain that the applied mask failed to increase the signal-to-noise ratio.
Severity: Minor
Likelihood: Occasional

Function: Calibrate (FM.SDP.PST.305)
FM Description: No calibration solution provided.
Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor impact on science outputs.
Mitigation: Add a flag to the data making it clear that calibration is yet to be performed, and proceed to the next processing step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

Function: Calibrate (FM.SDP.PST.306)
FM Description: Invalid/corrupt calibration solution provided.
Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor impact on science outputs.
Mitigation: Same as FM.SDP.PST.305.
Severity: Marginal
Likelihood: Remote
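The FM.SDP.PST.301 mitigation (substitute a zeroed sub-int for each invalid one, track cumulative loss, alarm above 20% TBC, and tag the data with the lost proportion) can be sketched as below. This is an assumption-laden illustration: sub-integrations are modelled as plain lists of floats with `None` standing for a missing or rejected sub-int, loss is counted per sub-int rather than per sample, and the function and field names are hypothetical.

```python
from typing import List, Optional, Tuple

LOSS_THRESHOLD = 0.20  # the memo's 20% figure is TBC


def assemble_scan(subints: List[Optional[List[float]]],
                  n_samples: int) -> Tuple[List[List[float]], dict]:
    """Replace missing or malformed sub-ints with zeros.

    Returns the repaired scan and a tag recording the lost fraction.
    """
    repaired: List[List[float]] = []
    lost = 0
    for subint in subints:
        if subint is None or len(subint) != n_samples:
            repaired.append([0.0] * n_samples)   # zeroed placeholder
            lost += 1
        else:
            repaired.append(list(subint))
    fraction = lost / len(subints) if subints else 0.0
    tag = {
        "lost_subints": lost,
        "lost_fraction": fraction,
        "alarm": fraction > LOSS_THRESHOLD,      # too much signal lost
    }
    return repaired, tag
```

Keeping the zeroed placeholders preserves the time alignment of the remaining sub-ints, while the tag lets downstream analysis weigh the scan by how much signal was actually received.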


Table 15: Data failure modes 7-11.

Function: Calibrate (FM.SDP.PST.307)
FM Description: Inappropriate calibration solution provided.
Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor impact on science outputs.
Mitigation: Undo the calibration step and add a flag to the data making it clear that calibration is yet to be performed. Must also explain that the applied strategy failed to increase the signal-to-noise ratio.
Severity: Minor
Likelihood: Occasional

Function: Average (FM.SDP.PST.308)
FM Description: No specifications provided for the required averaged data products.
Local Effect: Inability to produce intermediate output data products.
Sub-system Effect: Pulsar timing analysis unaffected. Only data products useful for post-processing analysis are lost.
System Effect: No impact on science outputs so long as the primary data product is stored. Intermediate data products can be recreated via post-processing.
Mitigation: Send the primary data cube to the preservation archive, along with some default averaged data products.
Severity: Minor
Likelihood: Occasional

Function: Archive Average Products (FM.SDP.PST.309)
FM Description: Averaged data products incorrectly formatted / contain invalid values due to software error.
Local Effect: Averaged data products are not persisted. Cannot send invalid or corrupted data to the preservation archive.
Sub-system Effect: Pulsar timing analysis unaffected. Only data products useful for post-processing analyses are lost.
System Effect: No impact on science outputs so long as the primary data product is stored. Intermediate data products can be recreated via post-processing.
Mitigation: Send the primary data cube to the preservation archive. Flag that average data products were invalid and need recreating. Raise an alarm.
Severity: Minor
Likelihood: Remote

Function: Determine TOAs (FM.SDP.PST.310)
FM Description: No standard profile provided.
Local Effect: No TOAs determined.
Sub-system Effect: Pulsar timing analysis incomplete.
System Effect: Minor to Marginal impact on science outputs.
Mitigation: Retry obtaining the standard profile until some time-out period TBD has elapsed. If none available, enter default mode and send the data to the preservation archive. Annotate the data and flag for reprocessing. Prepare to process the next scan (cannot proceed with timing processing without the standard profile). Raise an alarm.
Severity: Marginal
Likelihood: Remote

Function: Determine TOAs (FM.SDP.PST.311)
FM Description: Invalid/corrupt standard profile provided.
Local Effect: No TOAs determined.
Sub-system Effect: Pulsar timing analysis incomplete.
System Effect: Minor to Marginal impact on science outputs.
Mitigation: Retry obtaining the standard profile until some time-out period TBD has elapsed. If none available, enter default mode and send the data to the preservation archive. Annotate the data and flag for reprocessing. Prepare to process the next scan (cannot proceed without the standard profile). Raise an alarm.
Severity: Marginal
Likelihood: Remote


Table 16: Data failure modes 12-17.

Function: Archive TOAs (FM.SDP.PST.312)
FM Description: Invalid/corrupt computed TOAs due to software errors.
Local Effect: Failure to store TOAs. Cannot send invalid or corrupted data to the preservation archive.
Sub-system Effect: Pulsar timing pipeline fails to archive useful science data.
System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
Mitigation: Raise an alarm, and flag the data indicating that TOAs need to be computed during post-processing.
Severity: Marginal
Likelihood: Remote

Function: Generate Residuals (FM.SDP.PST.313)
FM Description: Invalid/corrupt computed TOAs due to software errors.
Local Effect: Cannot compute residuals from invalid/corrupt TOAs.
Sub-system Effect: Pulsar timing pipeline fails to compute residuals for the observed pulsar. Cannot detect scientifically interesting profile changes.
System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
Mitigation: Raise an alarm, and flag the data indicating that TOAs and residuals need to be computed during post-processing.
Severity: Marginal
Likelihood: Remote

Function: QA Residuals (FM.SDP.PST.314)
FM Description: Invalid/corrupt residuals provided, unable to assess their quality.
Local Effect: Cannot continue processing.
Sub-system Effect: Pulsar timing pipeline halts.
System Effect: Minor impact on science outputs. Residuals can be computed via post-processing if necessary.
Mitigation: Raise an alarm, and flag the data indicating that residuals need to be computed during post-processing.
Severity: Marginal
Likelihood: Remote

Function: Update Timing Model (FM.SDP.PST.315)
FM Description: Invalid/corrupt residuals provided, unable to update the timing model.
Local Effect: Timing model not updated.
Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
Mitigation: Raise an alarm, and flag the data indicating that residuals need to be computed during post-processing.
Severity: Marginal
Likelihood: Remote

Function: Update Timing Model / Archive Timing Model (FM.SDP.PST.316)
FM Description: Invalid/corrupt timing model provided, unable to update.
Local Effect: Timing model not updated.
Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
Mitigation: Raise an alarm, and flag the data indicating that the timing model needs to be updated during post-processing.
Severity: Marginal
Likelihood: Remote


5.4 Software/Algorithm Failures

Software and algorithms can fail for a variety of reasons. The causes range from bugs inadvertently introduced at the software design stage, to bugs accidentally coded during implementation. Aside from bugs, software can also fail when:

• non-deterministic algorithms do not complete on certain types of input data.

• software/algorithm logic is incorrectly coded preventing loops from terminating.

• numerical precision is incorrectly handled, causing sub-optimal performance or failure.

• incorrect data types are used when handling numerical data causing precision errors.

• errors in parallelism cause data to be incorrectly processed, for example, via memory access errors.

• runtime is slow, causing failures at the system level (due to delay).

• the implementation is sub-optimal, similarly causing failures at the system level (due to resource contention).

• security vulnerabilities are exploited by attackers.

A number of failure modes related to software/algorithms are listed in Table 17 and Table 18.
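The numerical precision and data type causes listed above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an epoch is held as seconds since MJD 0 in a single 64-bit float; the function names and the split day/seconds representation are hypothetical and are not taken from the timing pipeline.

```python
import math

SECONDS_PER_DAY = 86_400


def resolution_ns(value_seconds: float) -> float:
    """Gap between adjacent float64 values near value_seconds, in nanoseconds."""
    return math.ulp(value_seconds) * 1e9


def single_float_epoch(mjd_day: int, seconds_into_day: float) -> float:
    """Hypothetical naive representation: one float64 holding seconds since MJD 0."""
    return mjd_day * SECONDS_PER_DAY + seconds_into_day


def split_epoch(mjd_day: int, seconds_into_day: float) -> tuple[int, float]:
    """Hypothetical alternative: integer day plus float64 seconds within that day."""
    return mjd_day, seconds_into_day


naive = single_float_epoch(58000, 43200.0)
day, seconds = split_epoch(58000, 43200.0)
naive_resolution = resolution_ns(naive)      # spacing of the single-float form
split_resolution = resolution_ns(seconds)    # spacing of the within-day part
```

Under these assumptions the single-float epoch cannot resolve differences much below a microsecond, whereas the within-day part of the split form resolves well below a nanosecond. This is the kind of data type choice that can silently introduce precision errors.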


Table 17: Software/Algorithm failure modes 1-9.

Function: ALL (FM.SDP.PST.401)
FM Description: Function, process or application throws an arithmetic error (divide by zero, arithmetic overflow or underflow, loss of precision).
Local Effect: Function fails to complete execution.
Sub-system Effect: Timing pipeline fails to complete an assigned task.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
Mitigation: Execution framework re-runs function/algorithm again with input data. Error report logged and fed to software development team for investigation.
Severity: Critical
Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.402)
FM Description: Function, process or application encountered a logic error (infinite loops or infinite recursion, loop counter errors, array index out of bounds exception).
Local Effect: Same as for FM.SDP.PST.401.
Sub-system Effect: Same as for FM.SDP.PST.401.
System Effect: Same as for FM.SDP.PST.401.
Mitigation: Same as for FM.SDP.PST.401.
Severity: Critical
Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.403)
FM Description: Function, process or application encountered a resource error (null pointer, access violations, resource leaks, buffer overflow, use-after-free error).
Local Effect: Same as for FM.SDP.PST.401.
Sub-system Effect: Same as for FM.SDP.PST.401.
System Effect: Same as for FM.SDP.PST.401.
Mitigation: Same as for FM.SDP.PST.401.
Severity: Critical
Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.404)
FM Description: Function, process or application encountered a multi-threading error.
Local Effect: Same as for FM.SDP.PST.401.
Sub-system Effect: Same as for FM.SDP.PST.401.
System Effect: Same as for FM.SDP.PST.401.
Mitigation: Same as for FM.SDP.PST.401.
Severity: Critical
Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.405)
FM Description: Function, process or application encountered an interface error.
Local Effect: Same as for FM.SDP.PST.401.
Sub-system Effect: Same as for FM.SDP.PST.401.
System Effect: Same as for FM.SDP.PST.401.
Mitigation: Same as for FM.SDP.PST.401.
Severity: Critical
Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.406)
FM Description: Non-deterministic data dependent function does not terminate in allotted time.
Local Effect: Function fails to complete execution.
Sub-system Effect: Timing pipeline fails to complete an assigned task.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
Mitigation: Monitor processing progress, and force early termination if function/algorithm not converging. Tag the processed data with a note explaining how the processing was curtailed. Generate a log entry explaining how to reproduce the error mode.
Severity: Critical
Likelihood: Remote

Function: ALL (FM.SDP.PST.407)
FM Description: Function, process or application does not respond to commands in a timely manner.
Local Effect: Components can't be configured correctly.
Sub-system Effect: Timing pipeline can complete execution, but possibly with sub-optimal configuration, e.g. default mode.
System Effect: Integrity of science data could be compromised.
Mitigation: Re-start the function/process prior to the next scan. Generate a log entry describing the error state and steps to reproduce.
Severity: Critical
Likelihood: Remote

Function: ALL (FM.SDP.PST.408)
FM Description: Runtime exceeds allotted time.
Local Effect: Processing backlog created.
Sub-system Effect: Places additional load on processing resources.
System Effect: Integrity of science data could be compromised if some data cannot be processed.
Mitigation: If runtime begins to increase, automatically load balance to provide additional resources.
Severity: Minor
Likelihood: Occasional
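The mitigation described for FM.SDP.PST.406 (monitor progress, force early termination, and tag the output with how processing was curtailed) could be sketched as follows. This is a minimal illustration only: `FitOutcome`, `run_with_deadline` and their parameters are hypothetical names, and the convergence test (change between successive iterates below a tolerance) is an assumption rather than the pipeline's actual criterion.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FitOutcome:
    value: float
    iterations: int
    converged: bool
    notes: List[str] = field(default_factory=list)  # tags attached to the output


def run_with_deadline(step: Callable[[float], float], start: float,
                      tolerance: float, max_seconds: float,
                      max_iterations: int = 10_000) -> FitOutcome:
    """Iterate step() until successive values agree within tolerance, or stop early."""
    deadline = time.monotonic() + max_seconds
    value = start
    for iteration in range(1, max_iterations + 1):
        new_value = step(value)
        if abs(new_value - value) < tolerance:
            return FitOutcome(new_value, iteration, True)
        value = new_value
        if time.monotonic() > deadline:
            note = f"curtailed after {iteration} iterations: time limit of {max_seconds} s exceeded"
            return FitOutcome(value, iteration, False, [note])
    note = f"curtailed after {max_iterations} iterations: iteration limit reached"
    return FitOutcome(value, max_iterations, False, [note])
```

A check of this kind only runs between iterations. A single step that never returns would still need the process-level monitoring provided by the operating system or execution framework.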


Table 18: Software/Algorithm failure modes 9-14.

Function: ALL (FM.SDP.PST.409)
FM Description: Communication time-out caused by network connectivity issues.
Local Effect: Inability to process data.
Sub-system Effect: Timing pipeline fails to complete an assigned task.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
Mitigation: Retry obtaining the data until some time-out period TBD has elapsed. Generate an alert.
Severity: Minor
Likelihood: Occasional

Function: ALL (FM.SDP.PST.410)
FM Description: Function, process or application becomes unresponsive.
Local Effect: Inability to process data.
Sub-system Effect: Timing pipeline fails to complete an assigned task.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
Mitigation: Re-start the function immediately. Generate a log entry describing the error state and steps to reproduce.
Severity: Minor
Likelihood: Remote

Function: ALL (FM.SDP.PST.411)
FM Description: Error checking procedures fail in the executing application or function.
Local Effect: Inability to process data.
Sub-system Effect: Timing pipeline fails to complete an assigned task.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
Mitigation: Re-start the function immediately. Generate a log entry describing the error state and steps to reproduce.
Severity: Minor
Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.412)
FM Description: Security breaches and intrusions occurring during normal execution.
Local Effect: Inability to process data.
Sub-system Effect: Timing pipeline fails to complete an assigned task.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
Mitigation: Terminate all functions and processes and generate an alert.
Severity: Minor
Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.413)
FM Description: In memory errors caused by bit flips or power surges corrupting executing code.
Local Effect: Inability to process data.
Sub-system Effect: Timing pipeline fails to complete an assigned task.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
Mitigation: Re-start the function immediately. Generate a log entry describing the error state and steps to reproduce. Generate an alert.
Severity: Minor
Likelihood: Extremely unlikely
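The retry mitigation listed for FM.SDP.PST.409 could look like the sketch below. The time-out period is TBD in the table, so `timeout_seconds` is a placeholder; `fetch_with_retry`, `pause_seconds` and the `alert` callback are hypothetical names introduced only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(fetch: Callable[[], T], timeout_seconds: float,
                     pause_seconds: float = 0.1,
                     alert: Callable[[str], None] = print) -> Optional[T]:
    """Call fetch() until it succeeds or timeout_seconds has elapsed."""
    deadline = time.monotonic() + timeout_seconds
    attempts = 0
    while True:
        attempts += 1
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as error:
            last_error = error
        # Give up if another pause would take us past the deadline.
        if time.monotonic() + pause_seconds > deadline:
            alert(f"data not obtained after {attempts} attempts: {last_error!r}")
            return None
        time.sleep(pause_seconds)
```

Only connectivity-type exceptions are retried here. Any other exception propagates to the caller, so that logic errors are not hidden behind repeated attempts.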


6 Summary

In this document we have summarised pulsar timing pipeline failure modes at a high level of abstraction. Numerous failure types have been identified and contextualised according to a number of key assumptions. Our next steps will be to improve upon this work following feedback from our SDP colleagues, and to incorporate those improvements into analyses of pulsar and transient search pipeline failure modes.


A FMECA Detection methods

Tables 19 to 21 summarise the detection methods for each failure mode.

Table 19: Summary of the detection methods for each of the failure modes discussed in this document (Part 1).

Failure Mode Code | Detection Method
FM.SDP.PST.101 | Monitor the health status of the SDP data ingest nodes, and monitor network connectivity status.
FM.SDP.PST.102 | Same as for FM.SDP.PST.101.
FM.SDP.PST.103 | Same as for FM.SDP.PST.101.
FM.SDP.PST.104 | Monitor the health status of the SDP compute nodes, and monitor network connectivity status.
FM.SDP.PST.105 | Same as for FM.SDP.PST.104.
FM.SDP.PST.106 | Same as for FM.SDP.PST.104.
FM.SDP.PST.107 | Same as for FM.SDP.PST.104.
FM.SDP.PST.108 | Same as for FM.SDP.PST.104.
FM.SDP.PST.109 | Same as for FM.SDP.PST.104.
FM.SDP.PST.110 | Same as for FM.SDP.PST.104.
FM.SDP.PST.111 | Same as for FM.SDP.PST.104.
FM.SDP.PST.112 | Same as for FM.SDP.PST.104.
FM.SDP.PST.113 | Same as for FM.SDP.PST.104.
FM.SDP.PST.114 | Same as for FM.SDP.PST.104.
FM.SDP.PST.115 | Same as for FM.SDP.PST.104.
FM.SDP.PST.116 | Same as for FM.SDP.PST.104.
FM.SDP.PST.117 | Same as for FM.SDP.PST.104.
FM.SDP.PST.201 | Monitor the health status of software modules, and monitor network connectivity status.
FM.SDP.PST.202 | Monitor the health status of software modules, and monitor network connectivity status.
FM.SDP.PST.203 | QA of control parameters sent between TM/LMC and the timing pipeline components.
FM.SDP.PST.204 | QA of control commands sent between TM/LMC and the timing pipeline components.
FM.SDP.PST.205 | Active monitoring of software components and the communication network between them.
FM.SDP.PST.206 | Active monitoring of software components and the communication network between them.
FM.SDP.PST.207 | Active monitoring of software components and the communication network between them.
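Health-status monitoring of ingest and compute nodes, as listed above, is often built on periodic heartbeats. The sketch below assumes that approach; `HeartbeatMonitor`, the node names and the silence threshold are hypothetical and do not describe the SDP monitoring interface.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Marks a node unhealthy when no heartbeat has arrived within a time window."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._last_seen: Dict[str, float] = {}

    def beat(self, node: str, now: float) -> None:
        """Record that a heartbeat from node arrived at time now (seconds)."""
        self._last_seen[node] = now

    def unhealthy(self, now: float) -> List[str]:
        """Return the nodes whose last heartbeat is older than the allowed silence."""
        return sorted(node for node, seen in self._last_seen.items()
                      if now - seen > self.max_silence_seconds)
```

Passing the current time in explicitly, rather than reading a clock inside the class, keeps the check deterministic and easy to test.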


Table 20: Summary of the detection methods for each of the failure modes discussed in this document (Part 2).

Failure Mode Code | Detection Method
FM.SDP.PST.208 | Active monitoring of software components and the communication network between them.
FM.SDP.PST.209 | Active monitoring of data processing hardware.
FM.SDP.PST.210 | Active monitoring of data processing hardware.
FM.SDP.PST.211 | Active monitoring of data processing hardware.
FM.SDP.PST.212 | Active monitoring of data processing hardware.
FM.SDP.PST.213 | QA of control parameters sent between TM/LMC and the timing pipeline component.
FM.SDP.PST.214 | Active monitoring of data processing hardware.
FM.SDP.PST.215 | Monitor network connectivity status.
FM.SDP.PST.216 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.217 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.218 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.219 | Monitor the processing load placed upon data ingest nodes, and monitor network connectivity status.
FM.SDP.PST.220 | Monitor cumulative sub-integration loss for each beam per scan.
FM.SDP.PST.221 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.222 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.223 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.224 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.225 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.226 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.227 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.228 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.229 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.230 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.231 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.232 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.233 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.234 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.235 | Monitor network connectivity status and QA of received data.
FM.SDP.PST.236 | Monitor network connectivity status and QA of received data.
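The detection method for FM.SDP.PST.220 (cumulative sub-integration loss for each beam per scan) amounts to simple bookkeeping. A minimal sketch follows, assuming each expected sub-integration is reported as either received or lost; the class name, the scan identifier format and the alert threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SubIntegrationLossMonitor:
    """Tracks expected versus received sub-integrations per (scan, beam)."""

    def __init__(self, alert_fraction: float) -> None:
        self.alert_fraction = alert_fraction
        self._expected: Dict[Tuple[str, int], int] = defaultdict(int)
        self._received: Dict[Tuple[str, int], int] = defaultdict(int)

    def record(self, scan_id: str, beam: int, received: bool) -> None:
        """Register one expected sub-integration and whether it actually arrived."""
        key = (scan_id, beam)
        self._expected[key] += 1
        if received:
            self._received[key] += 1

    def loss_fraction(self, scan_id: str, beam: int) -> float:
        """Cumulative fraction of sub-integrations lost so far for this beam and scan."""
        key = (scan_id, beam)
        expected = self._expected.get(key, 0)
        if expected == 0:
            return 0.0
        return 1.0 - self._received.get(key, 0) / expected

    def beams_over_threshold(self, scan_id: str) -> List[int]:
        """Beams in this scan whose cumulative loss exceeds the alert fraction."""
        return sorted(beam for (scan, beam) in self._expected
                      if scan == scan_id
                      and self.loss_fraction(scan, beam) > self.alert_fraction)
```

Because the loss is cumulative, a beam that loses a few sub-integrations early in a scan can drop back below the threshold as more data arrive. Which fractional loss actually matters is scan dependent.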


Table 21: Summary of the detection methods for each of the failure modes discussed in this document (Part 3).

Failure Mode Code | Detection Method
FM.SDP.PST.301 | QA the data received to ensure it is formatted correctly and contains valid data values.
FM.SDP.PST.302 | Check that the RFI mask is valid.
FM.SDP.PST.303 | Check that the RFI mask is valid.
FM.SDP.PST.304 | Check the signal-to-noise ratio of the detected pulse increases post RFI mitigation.
FM.SDP.PST.305 | Check for valid calibration strategy.
FM.SDP.PST.306 | Check for valid calibration strategy.
FM.SDP.PST.307 | Check the signal-to-noise ratio of the detected pulse increases post calibration.
FM.SDP.PST.308 | Check for valid configuration.
FM.SDP.PST.309 | QA the format and values of the averaged data products.
FM.SDP.PST.310 | QA the standard profile.
FM.SDP.PST.311 | QA the standard profile.
FM.SDP.PST.312 | QA the computed TOAs.
FM.SDP.PST.313 | QA the computed TOAs.
FM.SDP.PST.314 | QA the residuals.
FM.SDP.PST.315 | QA the residuals.
FM.SDP.PST.316 | QA the residuals.
FM.SDP.PST.401 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.402 | Same as for FM.SDP.PST.401.
FM.SDP.PST.403 | Same as for FM.SDP.PST.401.
FM.SDP.PST.404 | Same as for FM.SDP.PST.401.
FM.SDP.PST.405 | Same as for FM.SDP.PST.401.
FM.SDP.PST.406 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.407 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.408 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.409 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.410 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.411 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.412 | Process monitoring at the operating system / execution framework level.
FM.SDP.PST.413 | Process monitoring at the operating system / execution framework level.
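The signal-to-noise checks listed for FM.SDP.PST.304 and FM.SDP.PST.307 compare a pulse profile before and after a processing step. The sketch below assumes one simple S/N definition (peak on-pulse amplitude above the off-pulse mean, in units of the off-pulse standard deviation) and a known on-pulse bin range; both are illustrative assumptions, and the function names are hypothetical.

```python
import statistics
from typing import Sequence


def pulse_snr(profile: Sequence[float], on_pulse: range) -> float:
    """Peak on-pulse amplitude above the off-pulse mean, in off-pulse standard deviations."""
    off = [value for index, value in enumerate(profile) if index not in on_pulse]
    noise = statistics.pstdev(off)
    if noise == 0.0:
        raise ValueError("off-pulse region has zero variance")
    peak = max(profile[index] for index in on_pulse)
    return (peak - statistics.fmean(off)) / noise


def snr_increased(before: Sequence[float], after: Sequence[float],
                  on_pulse: range) -> bool:
    """True when the processing step raised the S/N of the detected pulse."""
    return pulse_snr(after, on_pulse) > pulse_snr(before, on_pulse)
```

For example, a profile with an interference spike in the off-pulse region has an inflated noise estimate, so removing the spike should raise the S/N under this definition.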


B FMECA Results

Tables 22 to 24 summarise the results of our analysis.

Table 22: Summary of the criticality scores for each of the failure modes discussed in this document (Part 1).

Failure Mode Code | Severity | Probability | Score
FM.SDP.PST.101 | Minor | Occasional | 3
FM.SDP.PST.102 | Critical | Remote | 8
FM.SDP.PST.103 | Catastrophic | Extremely unlikely | 5
FM.SDP.PST.104 | Marginal | Remote | 4
FM.SDP.PST.105 | Marginal | Remote | 4
FM.SDP.PST.106 | Marginal | Remote | 4
FM.SDP.PST.107 | Marginal | Remote | 4
FM.SDP.PST.108 | Critical | Remote | 8
FM.SDP.PST.109 | Marginal | Remote | 4
FM.SDP.PST.110 | Marginal | Remote | 4
FM.SDP.PST.111 | Marginal | Remote | 4
FM.SDP.PST.112 | Marginal | Remote | 4
FM.SDP.PST.113 | Marginal | Remote | 4
FM.SDP.PST.114 | Marginal | Remote | 4
FM.SDP.PST.115 | Marginal | Remote | 4
FM.SDP.PST.116 | Marginal | Remote | 4
FM.SDP.PST.117 | Marginal to Critical | Remote | 4 to 8
FM.SDP.PST.201 | Minor | Remote | 2
FM.SDP.PST.202 | Minor | Remote | 2
FM.SDP.PST.203 | Minor | Remote | 2
FM.SDP.PST.204 | Minor | Remote | 2
FM.SDP.PST.205 | Minor | Remote | 2
FM.SDP.PST.206 | Minor | Remote | 2
FM.SDP.PST.207 | Minor | Remote | 2
FM.SDP.PST.208 | Significant | Remote | 6
FM.SDP.PST.209 | Significant | Remote | 6
FM.SDP.PST.210 | Significant | Remote | 6
FM.SDP.PST.211 | Catastrophic | Extremely unlikely | 5
FM.SDP.PST.212 | Catastrophic | Extremely unlikely | 5
FM.SDP.PST.213 | Minor | Occasional | 3
FM.SDP.PST.214 | Catastrophic | Remote | 6


Table 23: Summary of the criticality scores for each of the failure modes discussed in this document (Part 2).

Failure Mode Code | Severity | Probability | Score
FM.SDP.PST.215 | Catastrophic | Remote | 10
FM.SDP.PST.216 | Marginal | Occasional | 6
FM.SDP.PST.217 | Marginal | Remote | 4
FM.SDP.PST.218 | Catastrophic | Extremely unlikely | 4
FM.SDP.PST.219 | Marginal | Remote | 4
FM.SDP.PST.220 | Scan dependent. Fractional loss is important. Severity ranges from minor to critical due to cumulative effects. | Occasional | 2 to 8
FM.SDP.PST.221 | Marginal | Remote | 4
FM.SDP.PST.222 | Marginal | Remote | 4
FM.SDP.PST.223 | Marginal | Remote | 4
FM.SDP.PST.224 | Marginal to Critical | Remote | 4 to 8
FM.SDP.PST.225 | Minor | Remote | 2
FM.SDP.PST.226 | Minor | Remote | 2
FM.SDP.PST.227 | Marginal | Remote | 4
FM.SDP.PST.228 | Minor | Remote | 2
FM.SDP.PST.229 | Minor | Remote | 2
FM.SDP.PST.230 | Minor | Remote | 2
FM.SDP.PST.231 | Marginal | Remote | 4
FM.SDP.PST.232 | Minor | Remote | 2
FM.SDP.PST.233 | Marginal | Remote | 4
FM.SDP.PST.234 | Minor | Remote | 2
FM.SDP.PST.235 | Marginal | Remote | 4
FM.SDP.PST.236 | Marginal | Remote | 4
FM.SDP.PST.315 | Marginal | Remote | 4
FM.SDP.PST.316 | Marginal | Remote | 4
FM.SDP.PST.301 | Scan dependent. Fractional loss is important. Severity ranges from minor to critical due to cumulative effects. | Remote | 2 to 8
FM.SDP.PST.302 | Marginal | Extremely unlikely | 4
FM.SDP.PST.303 | Marginal | Remote | 4
FM.SDP.PST.304 | Minor | Occasional | 4
FM.SDP.PST.305 | Marginal | Remote | 4
FM.SDP.PST.306 | Marginal | Remote | 4
FM.SDP.PST.307 | Minor | Occasional | 3
FM.SDP.PST.308 | Minor | Occasional | 4
FM.SDP.PST.309 | Minor | Remote | 4
FM.SDP.PST.310 | Marginal | Remote | 4
FM.SDP.PST.311 | Marginal | Remote | 4
FM.SDP.PST.312 | Marginal | Remote | 4
FM.SDP.PST.313 | Marginal | Remote | 4


Table 24: Summary of the criticality scores for each of the failure modes discussed in this document (Part 3).

Failure Mode Code | Severity | Probability | Score
FM.SDP.PST.314 | Marginal | Remote | 4
FM.SDP.PST.315 | Marginal | Remote | 4
FM.SDP.PST.316 | Marginal | Remote | 4
FM.SDP.PST.401 | Critical | Extremely unlikely | 4
FM.SDP.PST.402 | Critical | Extremely unlikely | 4
FM.SDP.PST.403 | Critical | Extremely unlikely | 4
FM.SDP.PST.404 | Critical | Extremely unlikely | 4
FM.SDP.PST.405 | Critical | Extremely unlikely | 4
FM.SDP.PST.406 | Critical | Remote | 8
FM.SDP.PST.407 | Critical | Remote | 8
FM.SDP.PST.408 | Minor | Occasional | 3
FM.SDP.PST.409 | Minor | Occasional | 3
FM.SDP.PST.410 | Minor | Remote | 2
FM.SDP.PST.411 | Minor | Extremely unlikely | 1
FM.SDP.PST.412 | Minor | Extremely unlikely | 1
FM.SDP.PST.413 | Minor | Extremely unlikely | 1
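The tabulated scores are consistent with a product of ordinal ranks, for example Critical with Remote giving 8 and Minor with Occasional giving 3. The sketch below assumes that reading. The rank values are inferred from such rows rather than quoted from the memo's scoring definition, only the categories that appear in these tables are included, and the function names are hypothetical.

```python
from typing import Tuple

# Ordinal ranks inferred from the tabulated rows (an assumption, see above).
SEVERITY_RANK = {
    "Minor": 1,
    "Marginal": 2,
    "Significant": 3,
    "Critical": 4,
    "Catastrophic": 5,
}
LIKELIHOOD_RANK = {
    "Extremely unlikely": 1,
    "Remote": 2,
    "Occasional": 3,
}


def criticality(severity: str, likelihood: str) -> int:
    """Criticality score as severity rank multiplied by likelihood rank."""
    return SEVERITY_RANK[severity] * LIKELIHOOD_RANK[likelihood]


def criticality_range(lowest: str, highest: str, likelihood: str) -> Tuple[int, int]:
    """Score range for a failure mode whose severity spans two categories."""
    return criticality(lowest, likelihood), criticality(highest, likelihood)
```

A row such as "Marginal to Critical, Remote" then maps to the range 4 to 8 via `criticality_range("Marginal", "Critical", "Remote")`.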

C Applicable Requirements

We currently do not have access to Innoslate, thus these requirements may not be up-to-date.

Table 25: Level 2 SDP requirements relevant to the failure mode analysis.

Requirement ID | Name | Description
SDP REQ-30 | Graceful degradation | The failure of a single component should not cause the SDP to become unavailable.
SDP REQ-33 | Flagging control | The SDP shall flag data according to a pre-selected RFI Mask.
SDP REQ-52 | Failsafe | The SDP shall actively ensure that internal failures do not result in a hazardous situation to the systems and personnel with which it interfaces.
SDP REQ-133 | Pulsar Search Post Processing | SDP shall be capable of operating in a pulsar search mode, concurrently with continuum imaging mode, single pulse transient search mode and pulsar timing mode, within the same subarray.
SDP REQ-276 | Data Product Provenance | The SDP shall create and maintain provenance links between science data products and observing projects and proposals.


SDP REQ-281 | Protection against data loss | The SDP shall protect the preserved science data products against data loss and malicious or accidental modification.
SDP REQ-285 | Accessibility | The SDP shall enable per user access to SDP resources (hardware and software) using the Authentication and Authorisation facilities provided by the SKA (as per EN 50600-2-5. Data centre facilities and infrastructures. Part 2-5. Security systems).
SDP REQ-450 | SDP standard pipeline products | The SDP shall produce processing logs and quality assessment logs for all pipelines. These should be traceable to the originating Schedule Blocks.
SDP REQ-470 | Receive Data | The SDP shall receive the observed data from CSP in compliance with the SDP-CSP ICD 100-000000-002 and 300-000000-002.
SDP REQ-472 | Handle Missing Data | The SDP shall be capable of handling missing data packets coming from CSP in such a way that it minimises the scientific impact of the lost data.
SDP REQ-476 | Flag RFI | The SDP shall be capable of automatically flagging known and unknown RFI using algorithms as applied in the AOFlagger.
SDP REQ-477 | Excise RFI | The SDP shall be capable of automatically excising known and unknown RFI.
SDP REQ-478 | Detect RFI | The SDP shall be capable of detecting data that is corrupted by RFI.
SDP REQ-479 | Remove Sources | The SDP shall be capable of removing strong sources at the highest time and frequency resolution.
SDP REQ-480 | Integrate Data | The SDP shall be capable of integrating data in time and/or frequency.
SDP REQ-524 | Pulsar Timing Input | SDP shall be capable of receiving pulsar timing data and dynamic spectrum data in accordance with the SDP-CSP Interface Control Document (100-000000-002 and 300-000000-002).
SDP REQ-527 | Pulsar Search Data Input | The SDP shall be capable of receiving pulsar periodicity search data in accordance with the SDP-CSP Interface Control Document (100-000000-002 and 300-000000-002).


SDP REQ-529 | Pulsar Timing Precision | When provided with a suitable template, signal-to-noise and pulsar parameters, SDP shall be able to measure the arrival time of a pulse with a precision of 5 ns.
SDP REQ-530 | Pulsar Timing ToA Determination | SDP shall be capable of determining the time of arrival of a pulse from pulsar timing data.
SDP REQ-532 | Single pulse Transient Post Processing | SDP shall be capable of operating in a single pulse transient search mode, concurrently with continuum imaging mode and pulsar search mode and pulsar timing mode, within the same subarray.
SDP REQ-534 | Pulsar Timing Data Preparation | SDP shall be capable of performing data pre-processing (adding the sub-integrations from each pulsar together into one data file) on pulsar timing data.
SDP REQ-539 | Non-imaging Transient Input | SDP shall be capable of receiving single pulse transient search data in accordance with the SDP-CSP Interface Control Document (100-000000-002 and 300-000000-002).
SDP REQ-542 | Pulsar Timing Error Estimation | SDP shall be able to estimate the uncertainty in the arrival time of a pulse to better than 5%.
SDP REQ-543 | Pulsar Timing Systematic Error | SDP shall not add more than 5 ns systematic error in the time-of-arrival determination.
SDP REQ-544 | Single pulse Transient Alerts | SDP shall provide preliminary alerts for the detection of fast (single pulse) transient events within 10 s of the data containing that event arriving at the SDP.
SDP REQ-546 | Single pulse Transient Search Output | SDP shall output a single ranked list of single pulse transient candidates (with durations greater than 50 µsec) from each observation.
SDP REQ-558 | Pulsar Search Output | SDP shall output a single ranked list of pulsar periodicity candidates from each observation.
SDP REQ-565 | Pulsar Timing Model Fitting | SDP shall be capable of fitting a pulsar timing model to pulsar times of arrival.
SDP REQ-640 | Single Pulse data preparation performance | While receiving single pulse transient search data the SDP shall prepare the data for processing within 100 milliseconds.


SDP REQ-641 | Transient Buffer Receive Mid | The SKA1 MID SDP shall start recording Transient Buffer data no later than 60 seconds from the time that the highest frequency component of a transient signal arrives at the telescope.
SDP REQ-642 | Transient Buffer Receive Low | The SKA1 LOW SDP shall start recording Transient Buffer data no later than 900 seconds from the time that the highest frequency component of a transient signal arrives at the telescope.
SDP REQ-643 | Transient Buffer Receive | The SDP shall receive Transient Buffer data from the CSP for the purpose of archiving the transient buffer data.
SDP REQ-644 | Pulsar timing compute performance | When performing pulsar timing the SDP shall have at least sufficient performance to execute an algorithm of comparable complexity to using PSRCHIVE (for processing PSRFITS fits files and producing pulsar arrival times) and TEMPO2 (for computing time residuals and updating timing models).
SDP REQ-645 | Pulsar timing quantity | When performing pulsar timing processing the SDP shall be able to process data from 16 pulsars concurrently with SKA1 MID constrained to a net, on sky, bandwidth of 20 GHz per polarisation.
SDP REQ-646 | Single Pulse search compute performance | When performing single pulse transient search the SDP shall have at least sufficient performance to execute an algorithm of comparable complexity to using Pulsar Feature Lab (for heuristics), Gaussian Hellinger Very Fast Decision Tree (classification) and Sigproc Gtools (TBC-043) (for coincidence tests).
SDP REQ-647 | Single pulse reception rate | While performing single pulse transient search the SDP shall be able to receive one candidate per beam every 1 second (TBC-044).
SDP REQ-648 | Pulsar search compute performance | When performing pulsar search the SDP shall have at least sufficient performance to execute an algorithm of comparable complexity to using Pulsar Feature Lab (for heuristics), Gaussian Hellinger Very Fast Decision Tree (classification) and Sigproc Gtools (TBC-045) (for coincidence tests).


SDP REQ-649 | Pulsar search performance | While performing pulsar search the SDP shall be able to process a maximum of 1000 candidates per beam.
SDP REQ-653 | Flag invalid data | The SDP shall flag invalid data (NaN or Inf) and data invalid according to metadata.
SDP REQ-706 | Delivery latency | The SDP shall start delivering any science data product, regardless of physical location, within 10 minutes (for a 1 TB science product) (TBC-077) of receiving a retrieval request for a science data product.
SDP REQ-722 | TM command acknowledgement latency | The SDP shall acknowledge receipt of commands from TM within 1 s.
SDP REQ-731 | Science events | The SDP shall send events to the TM for the following activities: detection of an imaging transient; detection of a single pulse transient.
SDP REQ-763 | SDP Critical failure identification | The SDP shall identify more than 99% of all critical failures and report them to TM.
SDP REQ-764 | SDP Isolation of critical failures | The SDP shall isolate 95% of all critical failures and report them to TM.
SDP REQ-786 | Dynamic Spectrum data product | The SDP when commanded shall receive and store a high time resolution dynamic spectrum data product (time-frequency-polarisation).
SDP REQ-787 | Dynamic spectrum sub-array support | The SDP, when configured in dynamic spectrum mode, shall receive and store dynamic spectrum mode data for a total of up to 16 dual polarisation beams (with SKA1 Mid constrained to a net, on sky, bandwidth of 20 GHz per polarisation) from one to sixteen subarrays, independently and concurrently.
SDP REQ-807 | Dynamic Spectrum Mode Data Preparation | SDP shall perform data pre-processing (aggregating sub-integrations from a scan into a single file) for dynamic spectrum mode data for SKA1 Low and SKA1 Mid.
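As an illustration of SDP REQ-653 (flagging NaN or Inf values and data marked invalid by metadata), a minimal sketch is given below. The flat list of samples and the optional per-sample metadata mask are assumptions made for the example, and `flag_invalid` is a hypothetical name.

```python
import math
from typing import List, Optional, Sequence


def flag_invalid(samples: Sequence[float],
                 metadata_invalid: Optional[Sequence[bool]] = None) -> List[bool]:
    """Return one flag per sample: True where the sample should not be used.

    A sample is flagged when it is NaN or infinite, or when the optional
    metadata mask marks it as invalid.
    """
    if metadata_invalid is None:
        metadata_invalid = [False] * len(samples)
    if len(metadata_invalid) != len(samples):
        raise ValueError("metadata mask must have one entry per sample")
    return [(not math.isfinite(value)) or bool(masked)
            for value, masked in zip(samples, metadata_invalid)]
```

Returning flags rather than dropping samples keeps the data shape unchanged, so later steps can decide how to treat the flagged values.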
