SDP Memo 43: Pulsar Timing Failure Analysis
Document Number: SDP Memo 43
Document Type: MEMO
Revision: C1
Author: R. J. Lyon, L. Levin, B. W. Stappers
Release Date: 2018-04-17
Document Classification: Unrestricted
Status: Draft
Lead Author: R. J. Lyon
Designation: SDP.PIP.NIP Member
Affiliation: University of Manchester
Signature & Date: (17/04/2018)
SDP Memo Disclaimer
The SDP memos are designed to allow the quick recording of investigations and research done by members of the SDP. They are also designed to raise questions about parts of the SDP design or SDP process. The contents of a memo may be the opinion of the author, not the whole of the SDP.
Revisions
Revision C, issued February 26th 2018, prepared by Robert Lyon:
• Initial version of the document.

Revision C1, issued April 17th 2018, prepared by Robert Lyon. Updates made given feedback from Lorita Christelis:
• Updated Tables 5 and 6, replacing some text that was incorrectly repeated.
• Altered Table 4, making it clear that FM.SDP.PST.103 can also be mitigated via rerouting data to functioning hardware.
• Added a new mode, FM.SDP.PST.117, to account for a hardware failure in the archive system.
• Altered Table 5, making a grammatical change to FM.SDP.PST.108 in the mitigation column (no change to meaning).
• Section 5.2: indicated that a rack control failure can occur due to the failure of a top of rack switch.
• Section 5.2: indicated that a loss of control due to failure of the SDP management system is unlikely to be caused by a loss of connectivity. There will likely be a network topology that ensures a connection is always available, though perhaps with reduced bandwidth.
• Section 5.2 now points to the SDP execution control component [RD9] (see section 2.1.1 in that external document) as a possible source of failure.
• Modified FM.SDP.PST.205 and FM.SDP.PST.206 in Table 7. Added an additional mitigation strategy, which involves decoupling control and monitoring in the SDP Execution Control Component.
• Altered Table 11, making it clear that FM.SDP.PST.224 has the potential to critically impact science outputs, rather than catastrophically degrade output. Also updated the severity range and the criticality score. This is because the failure mode can be mitigated so long as science data is retained in a buffer and not discarded until successfully persisted in the archive.
• Added new tables to the Appendix that describe FMECA detection methods.
• The following changes have been made to the requirements in Table 25:
  SDP REQ-33 has a new description.
  SDP REQ-50 has since been deleted.
  SDP REQ-147 and SDP REQ-148 have since been deleted.
  SDP REQ-281 has a new description.
  SDP REQ-546 has a typo correction.
  SDP REQ-552 has since been deleted.
  SDP REQ-763 has a new description.
  SDP REQ-764 has a new description.
Table of Contents

List of figures
List of tables
List of abbreviations
Summary
1 Scope
2 Process
3 Terms & Definitions
4 Assumptions
  4.1 Hardware
  4.2 Architecture
  4.3 Control
  4.4 Communications
  4.5 Execution Framework
  4.6 Science Software & Processing
  4.7 Data
  4.8 Pulsar Timing Modes
  4.9 Likelihood & Probability
5 Failure Modes
  5.1 Hardware Induced Failures
  5.2 Control & Communication Failures
  5.3 Data Failures
  5.4 Software/Algorithm Failures
6 Summary
A FMECA Detection methods
B FMECA Results
C Applicable Requirements
List of Figures
1 Level 2 functional flow diagram for the SDP.
2 SDP Hardware Block Diagram.
3 High level diagram showing the assumed architectural data flow.
4 Conceptual data model for timing data.
5 Activity diagram for the pulsar timing pipeline.
List of Tables
1 Severity codes applying to failure modes.
2 Likelihood codes applying to failure modes.
3 Summary of the main SDP.PST software components.
4 Hardware induced failure modes 1-6.
5 Hardware induced failure modes 7-11.
6 Hardware induced failure modes 12-16.
7 Control and Communication failure modes 1-6.
8 Control and Communication failure modes 7-14.
9 Control and Communication failure modes 14-19.
10 Control and Communication failure modes 20-23.
11 Control and Communication failure modes 24-28.
12 Control and Communication failure modes 29-33.
13 Control and Communication failure modes 34-36.
14 Data failure modes 1-6.
15 Data failure modes 7-11.
16 Data failure modes 12-17.
17 Software/Algorithm failure modes 1-9.
18 Software/Algorithm failure modes 9-14.
19 Summary of the detection methods for each of the failure modes discussed in this document (Part 1).
20 Summary of the detection methods for each of the failure modes discussed in this document (Part 2).
21 Summary of the detection methods for each of the failure modes discussed in this document (Part 3).
22 Summary of the criticality scores for each of the failure modes discussed in this document (Part 1).
23 Summary of the criticality scores for each of the failure modes discussed in this document (Part 2).
24 Summary of the criticality scores for each of the failure modes discussed in this document (Part 3).
25 Level 2 SDP requirements relevant to the failure mode analysis.
List of abbreviations
CSP      Central Signal Processor
COTS     Commercial-off-the-Shelf
DSD      Dynamic Spectra Data
EMI      Electromagnetic Interference
FTP      File Transfer Protocol
HPC      High Performance Computing
ICD      Interface Control Document
IM       Interstellar Medium
LMC      Local Monitor and Control
NIC      Network Interface Card
NIP      Non-imaging Processing
PSRFITS  Pulsar Flexible Image Transport System
PST      Pulsar Timing Sub-element
PTD      Pulsar Timing Data
QA       Quality Assurance
SDP      Science Data Processor
SFMECA   Software Failure Mode, Effects and Criticality Analysis
TM       Telescope Manager
TOA      Time-of-Arrival
TOR      Top of Rack
Summary
This document describes a Software Failure Mode, Effects and Criticality Analysis (SFMECA) for the pulsar timing pipeline sub-element (PST) of the Science Data Processor (SDP). The analysis has been done at the architectural level, and represents an initial attempt to study the failure modes of the timing pipeline. This work forms the output of sprint task: TSK-2140.
Applicable Documents
The following documents are applicable to the extent stated herein. In the event of conflict between the contents of the applicable documents and this document, the applicable documents shall take precedence.
Reference Number  Document Number  Reference
AD1               100-000000-002   SKA1 LOW SDP - CSP INTERFACE CONTROL DOCUMENT
AD2               300-000000-002   SKA1 MID SDP - CSP INTERFACE CONTROL DOCUMENT
AD3               100-000000-029   SKA1 INTERFACE CONTROL DOCUMENT SDP TO TM LOW
AD4               300-000000-029   SKA1 INTERFACE CONTROL DOCUMENT SDP TO TM MID
Reference Documents
The following documents are referenced in this document. In the event of conflict between the contents of the referenced documents and this document, this document takes precedence.
Reference Number  Document Number      Reference
RD1               SKA-TEL-SDP-0000018  PDR.02.01 Compute Platform Element Subsystem Design.
RD2               SKA-TEL-SDP-0000027  SDP Pipelines Design.
RD3               SKA-TEL-SDP-0000033  SDP L2 requirements specification (L1 Rev 11).
RD4               -                    Zhu, Y. M., “Software Failure Mode and Effects Analysis”, Springer, 2017, doi:10.1007/978-3-319-65103-3_2.
RD5               -                    Stadler, J. J. and Seidl, N. J., “Software failure modes and effects analysis”, Reliability and Maintainability Symposium (RAMS), 2013, doi:10.1109/RAMS.2013.6517710.
RD6               -                    Stamatis, D. H., “Failure mode and effect analysis: FMEA from theory to execution”, Milwaukee, Wisc.: ASQ Quality Press, 2003.
RD7               SDP Memo 40          Lyon, R. J., Levin, L. and Stappers, B. W., “PSRFITS Overview for NIP”.
RD8               -                    Lyon, R. J., “CSP to SDP NIP Data Rates & Data Models (version 1.1)”, doi:10.5281/zenodo.836715.
RD9               SKA-TEL-SDP-0000013  Wortmann, P. et al., “SDP Operational System Component and Connector View”.
[Figure 1 diagram omitted. Functions shown include the receive, pre-processing, buffering, staging, processing, master control, QA, persistence, preservation and delivery functions of the SDP, among them Timing receive and Timing processing. The key distinguishes science data, sky model, local telescope model, transient event and telescope manager flows, functions producing QA metrics, and functions using the data lifecycle manager.]

Figure 1: Level 2 functional flow diagram for the SDP. The blue shaded components are those studied as part of the failure analysis. The flow diagram is based upon a figure produced by the SDP consortium (author unknown).
1 Scope
The scope of this work is confined to the blue shaded components of the SDP level 2 functional flow diagram in Figure 1. This includes the pulsar timing receive and pulsar timing functions, hereafter collectively referred to as the SDP.PST. The analysis presented here is only concerned with the identification and analysis of SDP.PST software failure modes at an architectural level. The analysis is applicable to both SKA Low and Mid. It includes failure modes arising from internal and external software (and their interfaces), firmware, interfaces to Commercial-off-the-Shelf (COTS) equipment, and interfaces to free/open source software. Whilst hardware failure modes are not in scope, in some cases hardware will be discussed when equipment failures, faults, or defects precipitate software failures/errors. As the SDP design is not complete, hardware, software and architectural assumptions are made to both enable and constrain the analysis. These assumptions are summarised in Section 4, whilst the methodology employed is summarised and justified in Section 2. Finally, note that the pulsar and transient search Non-Imaging Processing (NIP) pipeline failure modes will be considered elsewhere.
2 Process
Whilst software failure mode analyses have been undertaken for some time, there is currently no universal SFMECA standard. To proceed it is necessary to tailor approaches borrowed from the software engineering literature. Thus we reviewed the literature [RD4,RD5,RD6] for relevant work. Following this review we designed a process we believe to be conducive to producing a reproducible, principled, detailed analysis. This is an initial attempt to systematise the SFMECA so that it can be reviewed and critiqued, and we hope that our process can be improved upon via appropriate feedback. Any such feedback will be incorporated into future SFMECA analyses, e.g. those yet to be done for the pulsar and transient search pipelines.
The following steps form the analysis process employed in this work:
1. Define the scope - This involves determining i) which part of the system is being investigated, ii) which views apply (e.g. functional, interface, algorithmic, maintenance, usability, security), iii) which elements to study (e.g. hardware, software).

2. Information gathering - Gather documents relevant to the analysis, e.g. if taking a functional view then requirements documents are relevant. This is because failures lead to functional requirements not being met. Interfaces may need to be studied, along with the system functionality at a higher level. This also involves studying which types of analysis can be applied: an SFMECA process designed for medical software will have different strengths and weaknesses compared to one written for military applications. Thus it is important to find the right approach.

3. Tailor the analysis - Based on the information gathered, tailor the analysis to the problem at hand. In this case, we need not consider hardware failure modes, thus we can omit these from the analysis.

4. Research failure modes - Enumerate all the possible failure modes and sources of error. Then begin categorising these according to the chosen view.

5. Analyse - For each mode found, determine:

• the root cause of the failure mode.

• the local effect at the software component level (e.g. FFT does not work correctly).

• the sub-system effect. For example the effect on the pulsar timing pipeline sub-system.

• the system effect and how this relates to system requirements (e.g. if pulsar timing fails, what does this mean for SDP, and the wider SKA?).

6. Mitigate - For each failure mode identified, attempt to devise a mitigation strategy which prevents the failure or mitigates its effects. If no mitigation is possible, then preventative measures should be described.

7. Severity & Likelihood - Determine how severe each failure mode is with respect to the system requirements, and how likely it is for such a failure mode to occur.

8. Summarise - Produce a critical item list describing all the possible failure modes.

These steps need not be rigidly undertaken. However they are useful for guiding the analysis process. Note these steps are described in more detail elsewhere [RD4,RD5,RD6].
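As an illustration of what steps 5 to 8 produce, the sketch below holds one analysed failure mode as a record and orders a set of records into a critical item list. It is a minimal sketch only: the class name, field names, the example identifiers and the choice to order by severity multiplied by likelihood are assumptions made for illustration, not part of the memo's process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureModeRecord:
    """One analysed failure mode (all field names are illustrative)."""
    identifier: str        # hypothetical label, e.g. "FM.EXAMPLE.001"
    root_cause: str        # step 5: root cause of the failure mode
    local_effect: str      # step 5: effect at the software component level
    subsystem_effect: str  # step 5: effect on the pulsar timing pipeline
    system_effect: str     # step 5: effect on the SDP and the wider SKA
    severity: int          # step 7: level 1 to 5
    likelihood: int        # step 7: level 1 to 5
    mitigation: str = ""   # step 6: empty if no mitigation was found
    prevention: str = ""   # step 6: expected when mitigation is empty

    def is_complete(self) -> bool:
        """True when every analysis field is filled and step 6 is addressed."""
        analysed = all([self.root_cause, self.local_effect,
                        self.subsystem_effect, self.system_effect])
        return analysed and bool(self.mitigation or self.prevention)


def critical_item_list(records: List[FailureModeRecord]) -> List[FailureModeRecord]:
    """Step 8: list every record, highest severity x likelihood first (assumed ordering)."""
    return sorted(records, key=lambda r: r.severity * r.likelihood, reverse=True)


# Hypothetical records, invented purely to exercise the sketch.
example_a = FailureModeRecord(
    "FM.EXAMPLE.001", "disk fault", "write fails", "scan data not stored",
    "science product missing", severity=4, likelihood=2,
    mitigation="reroute to functioning hardware")
example_b = FailureModeRecord(
    "FM.EXAMPLE.002", "bad parameter", "step rejected", "scan delayed",
    "reduced availability", severity=2, likelihood=3,
    prevention="validate parameters on receipt")
ordered = critical_item_list([example_b, example_a])
```

Keeping the four effect levels as separate fields mirrors the bullet list under step 5, so a missing level is easy to spot before a record enters the critical item list.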
Table 1: Severity codes applying to failure modes.

Level  Code          Description
1      Minor         Normal availability retained by preventative / mitigation action.
2      Marginal      Near normal availability retained via preventative / mitigation action.
3      Significant   Operating between degraded and normal.
4      Critical      Operating in degraded mode.
5      Catastrophic  Functionality unavailable.

Table 2: Likelihood codes applying to failure modes.

Level  Code                 Description
1      Extremely unlikely   < 0.1%
2      Remote               0.1 to 1%
3      Occasional           1 to 10%
4      Reasonably probable  10 to 20%
5      Frequent             > 20%
3 Terms & Definitions
Before proceeding we define some terms which should make our analysis easier to interpret. Firstly we define the severity codes (Table 1) and likelihood codes (Table 2) that will be used. These are used to determine a criticality level for each failure mode. The criticality score is determined via a simple calculation: Criticality Score = Severity × Likelihood.
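The criticality calculation can be sketched in a few lines. The level names below are taken from Tables 1 and 2; the function name and the input validation are assumptions added for illustration.

```python
# Level names as listed in Table 1 (severity) and Table 2 (likelihood).
SEVERITY = {1: "Minor", 2: "Marginal", 3: "Significant",
            4: "Critical", 5: "Catastrophic"}
LIKELIHOOD = {1: "Extremely unlikely", 2: "Remote", 3: "Occasional",
              4: "Reasonably probable", 5: "Frequent"}


def criticality_score(severity: int, likelihood: int) -> int:
    """Criticality Score = Severity x Likelihood, for levels 1 to 5."""
    if severity not in SEVERITY:
        raise ValueError(f"unknown severity level: {severity}")
    if likelihood not in LIKELIHOOD:
        raise ValueError(f"unknown likelihood level: {likelihood}")
    return severity * likelihood


# A Critical (4) failure mode that is Remote (2) scores 4 x 2 = 8.
score = criticality_score(4, 2)
```

With both scales running from 1 to 5, scores fall between 1 and 25.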
Next we define the key terms as we understand them.
• Failure Mode - Means/process via which software can contribute to a system failure.
• Effect - Behaviour resulting from the failure mode.
• Error - Discrepancy between a computed, observed, or measured value and the true, specified or theoretically correct value or condition.

• Defect - Manifestation of an error arising from the software requirements, design or code.
• Fault - Defect that has resulted in one or more failures.
• Scan - Basic observational unit.
4 Assumptions
4.1 Hardware
The SDP.PST pipeline is assumed to execute upon standard COTS equipment that complies with the SKA’s EMI, power, maintenance and cooling standards. This applies to racks, routers & switches, compute nodes and individual (internal) compute node components (processors, memory, accelerator cards, storage disks, NICs, power supplies, cooling components etc). The hardware is assumed to be housed in a suitable location providing appropriate power, cooling and climate control facilities. Figure 2 depicts our hardware assumptions. An abstracted rack configuration is presented in a), and an abstracted SDP compute node in b).
[Figure 2 diagram omitted. Panel a) shows compute racks 1 to n, each containing nodes 1 to m, an Ethernet switch, a control & management switch, an in-rack power supply (PDU) and a cooling system, together with TM / control & monitoring, the preservation system, data ingested from CSP & metadata, and an external power supply. The key distinguishes science data, power and management data. Panel b) shows a compute node with CPUs, memory, accelerators 1 to n, processing units (PU), disks, network interface cards (NIC), a host channel adapter (HCA), and power / cooling.]

Figure 2: Simplified hardware block diagram describing SDP racks (a) and a diagram depicting an abstracted SDP compute node (b). Figure based upon diagrams originally produced by L. Christelis and P. C. Broekema, as part of their SDP work.
4.2 Architecture
The SDP will be an energy efficient yet extremely powerful High Performance Computing (HPC) system. We assume it consists of one or more ‘compute islands’. Each compute island is an independent scalable compute unit [RD1] (see footnote 1) containing one or more racks as shown in Figure 2 a). Each rack can in turn contain one or more compute/data storage nodes, where a compute/data storage node is a typical COTS server as shown in Figure 2 b). In addition to COTS servers, each rack is presumed to contain industry standard networking and storage hardware.

Footnote 1: Compute islands are defined in JIRA, see Archive 390: https://jira.ska-sdp.org/browse/ARCHIVE-390.
Based on these assumptions we describe an abstracted architecture used to guide our analysis. It is summarised in the architectural data flow diagram shown in Figure 3. We assume that,

• each rack, and each compute node within it, is connected to a control system via a ‘management’ Ethernet switch. The control system is responsible for provisioning resources within the rack, monitoring their use, troubleshooting etc.

• each rack has a separate Ethernet switch dedicated to handling the ingest/transmission of all other data (e.g. science data, sky models, and metadata). Each compute node is connected to this switch, allowing data to be received from the CSP, and sent to the preservation system as appropriate.

• rack power and cooling are monitored via the management system.

• compute islands, the Telescope Manager (TM), the Central Signal Processor (CSP) and the preservation system are connected via suitable network interfaces and equipment.

• there will be redundant compute nodes, data storage nodes, and communication links which will help mitigate the impact of hardware failures.

• for our analysis we can treat the TM, CSP and preservation systems as black boxes interacting with our pipeline components. Thus any failure modes related to their use can only occur at any applicable common interfaces.
[Figure 3 diagram omitted. Elements shown are the CSP, the Telescope Manager (TM data), the network (LAN or WAN), and within the SDP a data ingest island, compute islands 1 to n and the preservation system. Each island is made up of racks 1 to n containing nodes 1 to m, disks, an Ethernet switch, a management switch, a PDU and a cooling system. Regional centres and backup centres are shown alongside the labels "SDP data products" and "Metadata, Sky models etc.". The key distinguishes science data and management data.]

Figure 3: High level diagram showing the assumed architectural data flow. Figure based upon diagram first presented in [RD1].
4.3 Control
We treat the control system as a single abstracted entity interacting with the SDP.PST and SDP hardware. For this analysis it is irrelevant if control is provided by the TM (e.g. [AD3,AD4]), LMC (Execution Control) or direct human interaction so long as,

• the control system initiates, pauses, and restarts scan processing as appropriate.

• the control system monitors both hardware and software states allowing the efficient management of resources.

• the control system can receive and correctly process information requests from the SDP.PST or the SDP.

• the control system can deliver information to the SDP.PST or the SDP. This includes details of the processing to be performed, associated metadata, sky models, pulsar ephemerides, standard pulsar profiles, RFI masks, calibration strategies and other relevant information.

• the control system can process and correctly act upon error messages/warnings sent by the SDP.PST or the SDP.

• the control system has some inherent redundancy making failures of the control system extremely unlikely.

• the control system can operate autonomously during scan processing, and take remedial action where/when appropriate according to any error messages received. This includes, for example, automatically compensating for hardware failures at the node level.
4.4 Communications
As per the CSP to SDP Interface Control Documents [AD1,AD2], we assume data is transmitted to the SDP via FTP (RFC 959). The communication interface is assumed to be bi-directional, although the data flow is uni-directional in practice (from CSP to SDP). Pulsar timing data transmitted via this protocol is sent one temporal sub-integration (defined more clearly in Section 4.7) at a time, typically every 10 seconds, though sub-integration data could be sent by CSP at any interval between 1 and 60 seconds. Finally, the sub-integration data is sent in the PSRFITS format [RD7].

4.5 Execution Framework

The execution framework is responsible for executing software components, providing them with hardware resources (memory, CPU time etc), monitoring their status/resource use, and restarting them upon failure. The framework treats available hardware resources as a pool, thus processing steps executed one after another need not be situated on the same physical hardware. It is the responsibility of the execution framework to correctly route data from one software component to another, if executed on different hardware. Finally, the execution framework interacts with the control system and is situated on each and every SDP node.
4.6 Science Software & Processing
Science software is expected to comprise both custom tools developed by the SDP consortia, and open source community algorithms. In either case, these will operate within the constraints of the execution framework, and interface with its error reporting system, so that errors can propagate from all software components to the TM.

Pulsar timing processing proceeds in a mostly linear fashion, with some data aggregation/buffering required in places. It is entirely possible for the processing to be done across multiple racks and/or compute islands. However it is better for data from the same beam to be processed on the same physical compute node.
4.7 Data
The CSP produces ‘detected’ data. This is data that has been i) channelised, ii) fully corrected for dispersion in the Interstellar Medium (IM), iii) folded at the known pulsar period, and iv) partially calibrated. The resulting time, phase, frequency and polarisation data is sent to the SDP.PST as a matrix (also called a data cube). The matrix dimensions are determined by parameters chosen within CSP. These include the number of frequency channels $N_{\mathrm{chan}}$, the number of phase bins $N_{\mathrm{bin}}$, the number of temporal sub-integrations $N_{\mathrm{sub}}$, and the number of polarisations $N_{\mathrm{pol}}$. The size of the matrix in bits is given by,

$$N_{\mathrm{chan}} \times N_{\mathrm{bin}} \times N_{\mathrm{sub}} \times N_{\mathrm{pol}} \times N_{\mathrm{bit}}, \tag{1}$$

where $N_{\mathrm{bit}}$ is the number of bits per sample in the matrix. The possible values for these parameters are constrained elsewhere [AD1,AD2]. The data cube is not sent alone. It is accompanied by attributes and metadata. We describe the complete data product that contains all this information as Pulsar Timing Data (PTD). This is described at the conceptual level in Figure 4 and summarised elsewhere [RD8].
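As a worked instance of Equation (1), the sketch below multiplies the five parameters to give the data cube size in bits. The function name and the parameter values are illustrative assumptions only; the permitted values are those constrained in [AD1,AD2] and are not reproduced here.

```python
def data_cube_size_bits(n_chan: int, n_bin: int, n_sub: int,
                        n_pol: int, n_bit: int) -> int:
    """Size of the data cube in bits, following Equation (1)."""
    dims = (n_chan, n_bin, n_sub, n_pol, n_bit)
    if any(d <= 0 for d in dims):
        raise ValueError("all parameters must be positive integers")
    size = 1
    for d in dims:
        size *= d
    return size


# Illustrative values only (not taken from the interface documents):
# one sub-integration of 4096 channels, 2048 phase bins,
# 4 polarisations and 32 bits per sample.
size_bits = data_cube_size_bits(n_chan=4096, n_bin=2048, n_sub=1,
                                n_pol=4, n_bit=32)
size_mib = size_bits / 8 / 2**20  # bits -> bytes -> MiB
```

For these example values the cube is 2^30 bits, or 128 MiB, per sub-integration and beam; the metadata that accompanies the cube in the PTD is not counted.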
[Figure 4 diagram omitted. It contains a conceptual model and a logical model relating Timing Receive, Observation, Sub-arrays, Timing Data, PTD, Metadata and Data Cube (an n-D matrix), with attributes including Pulsar ID, Beam ID, Scan ID, Scheduling Block ID, Program Block ID, observation metadata, TBD heuristics and TBD metadata. The key covers entities, weak entities, identifying and non-identifying relationships, and cardinality.]

Figure 4: Conceptual data model for timing data.
Table 3: Summary of the main SDP.PST software components.
Identifier      Name                 Description
SDP.PST.SC001   Command QA           Evaluates the quality and correctness of received commands.
SDP.PST.SC002   Parameter QA         Evaluates the quality and correctness of received parameters.
SDP.PST.SC003   Data QA              Evaluates the quality and correctness of received data, and data produced within the pipeline.
SDP.PST.SC004   Alert                Generates, formats and transmits alert messages. This includes scientific and hardware/software related alerts/warnings.
SDP.PST.SC005   Timing Receive       Monitors and controls the ingest of data from the CSP.
SDP.PST.SC006   Remove RFI           Removes parts of the received data affected by RFI.
SDP.PST.SC007   Calibrate            Calibrates for flux and polarisation.
SDP.PST.SC008   Average              Produces partly averaged data cubes for data processing steps that require higher S/N values rather than high resolution. Sends averaged products to the preservation system.
SDP.PST.SC009   TOA Determination    Determines pulse TOAs by cross correlating the current observation with a pulsar-specific standard profile supplied externally. Generates 1 TOA per sub-integration and frequency channel.
SDP.PST.SC010   Compute Residuals    Uses a timing model to compute the expected pulse TOA. Compares the expected & observed TOA, and generates timing residuals as the difference between them.
SDP.PST.SC011   Update Timing Model  Updates the timing model for the observed pulsar.
4.8 Pulsar Timing Modes
A maximum of 16 tied-array beams are available for use when in pulsar timing mode. Each beam can independently observe a different pulsar, thus 16 pulsars can be studied per scan. It is the responsibility of the CSP to produce data products that can be used by the SDP to perform high precision timing.

The SDP.PST executes multiple processing steps. The first involves RFI mitigation followed by a detailed flux and polarisation calibration. A number of intermediate ‘averaged’ data products are then generated that provide different representations of the data. These are sent to the preservation archive. The pulse Times-of-Arrival (TOAs) are then determined, and the timing residuals computed. These are used to update the timing model for the observed pulsar following appropriate Quality Assurance (QA) checks. Any significant changes in pulse arrival times should raise an alert, as such a change is of scientific interest. The generalised pipeline steps are summarised in Figure 5, whilst Table 3 summarises the main SDP.PST components.
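The TOA determination and residual steps can be illustrated with a deliberately simplified sketch. It assumes a single folded profile and finds the whole-bin circular shift that maximises the cross correlation with a standard profile; production timing software typically measures sub-bin shifts, which is not attempted here. The function names, the example profiles, and the observed-minus-expected sign convention are assumptions made for illustration.

```python
from typing import Sequence


def phase_shift_bins(profile: Sequence[float], template: Sequence[float]) -> int:
    """Return the circular shift (whole bins) that best aligns template with profile."""
    n_bin = len(profile)
    if n_bin == 0 or n_bin != len(template):
        raise ValueError("profile and template must be non-empty and equal in length")

    def correlation(shift: int) -> float:
        # Circular cross correlation evaluated at one trial shift.
        return sum(profile[(i + shift) % n_bin] * template[i]
                   for i in range(n_bin))

    return max(range(n_bin), key=correlation)


def timing_residual(observed_toa_s: float, expected_toa_s: float) -> float:
    """Residual as observed minus expected TOA (sign convention assumed)."""
    return observed_toa_s - expected_toa_s


# Hypothetical 8-bin standard profile, and an observation delayed by 3 bins.
template = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 0.0, 0.0]
observed = template[-3:] + template[:-3]
shift = phase_shift_bins(observed, template)  # 3 for this example
```

The measured shift, expressed as a fraction of the pulse period, would then be combined with the sub-integration reference time to form a TOA, and the residual follows from comparing that TOA with the timing model prediction.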
Note that all software components must be fault tolerant. To achieve this the timing pipeline must be capable of operating in two distinct modes:

• Standard mode - here communications are consistent, all data sources are accessible, all data sent and received is correctly formatted and valid, and data is successfully passed between SDP.PST software components without impediment (e.g. delays).

• Default mode - in the event of any error causing i) a disturbance in communications, ii) command parameters or metadata to become corrupted/invalid, iii) data formatting errors/corruption, iv) algorithmic/hardware malfunctions, v) a failure in control, or vi) any other unforeseeable error; the timing pipeline should enter a default mode. This mode prioritises the preservation of valuable science data, and may skip some/all processing steps as required.
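The two modes above can be sketched as a simple selection rule: any active error from classes i) to vi) moves the pipeline into default mode. The labels, step names and functions below are hypothetical, and skipping every processing step in default mode is only one of the behaviours the text permits.

```python
from enum import Enum
from typing import Iterable, List


class PipelineMode(Enum):
    STANDARD = "standard"
    DEFAULT = "default"


# Hypothetical labels for error classes i) to vi) in the list above.
ERROR_CLASSES = frozenset({
    "communications", "parameters_or_metadata", "data_format",
    "algorithm_or_hardware", "control", "unforeseen",
})

# Hypothetical step names, loosely following the pipeline description.
PROCESSING_STEPS = ["remove_rfi", "calibrate", "average",
                    "determine_toas", "compute_residuals",
                    "update_timing_model"]


def select_mode(active_errors: Iterable[str]) -> PipelineMode:
    """Enter default mode if any error is active, otherwise standard mode."""
    errors = set(active_errors)
    unknown = errors - ERROR_CLASSES
    if unknown:
        # An unrecognised error is treated as class vi), unforeseeable.
        errors = (errors - unknown) | {"unforeseen"}
    return PipelineMode.DEFAULT if errors else PipelineMode.STANDARD


def plan_steps(mode: PipelineMode) -> List[str]:
    """Default mode keeps only data preservation (one permitted choice)."""
    if mode is PipelineMode.DEFAULT:
        return ["preserve_science_data"]
    return PROCESSING_STEPS + ["preserve_science_data"]
```

Treating unrecognised errors as class vi) keeps the rule conservative: anything the pipeline cannot classify still triggers the mode that protects science data.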
[Figure 5 diagram omitted. Activities shown include: obtain processing commands/parameters (from TM); QA commands/parameters; get sky models, ephemerides, standard model, RFI masks and calibration solutions; Timing Receive ingest of sub-integration data from CSP; buffer data; remove RFI; calibrate; average; send data to archive; determine TOAs; send TOAs to QA system; evaluate model changes; update timing model; send model to archive; generate alert; report problem (to TM). Decision nodes include "Valid?", "More data?", "Update Timing Model?" and "Follow up?". The key covers control/data flow, fork/join, activity start, activity end, processing activity and decision node.]

Figure 5: Activity diagram showing the processing steps in the pulsar timing pipeline.
4.9 Likelihood & Probability
The likelihood and probability estimates provided by this analysis represent best guesses based on empirical experience. Whilst this is not ideal, there is no data available that can be used to facilitate a more rigorous analysis of failure rates and consequences.
5 Failure Modes
We consider three main sources of failure. These are addressed in separate sections for clarity. In each case the priority is to preserve science data whenever possible, even when extreme errors are encountered. This is because science data, even when damaged or corrupted, has utility.
5.1 Hardware Induced Failures
There are many possible causes for a hardware induced failure. These can occur before and during timing processing. To keep the analysis at a high-level, we consider the following hardware failures and treat them as equivalent:
• failures resulting from a mechanical defect (e.g. system fan or hard drive mechanical failure).
• power or cooling failures necessitating system shut-down.
• failures caused by incorrect system configuration (e.g. BIOS errors).
• failures caused by firmware or operating system errors.
• electronics failures in hardware components (memory, CPU, motherboard etc.).
A number of failure modes related to hardware errors are listed in Tables 4, 5 and 6 below. For simplicity only scenarios where inherent redundancy fails are presented (i.e. a worst case scenario). This is because enumerating all possible failure scenarios and their combinations is out of scope for our high level analysis.
Table 4: Hardware induced failure modes 1-6.

FM.SDP.PST.101
Function: Timing Receive
FM Description: The SDP hardware responsible for ingesting data from the CSP encounters a hardware failure at the node level.
Local Effect: An ingest failure results in data loss at the sub-integration data level or for an individual beam, and delays the processing.
Sub-system Effect: Pulsar timing analysis less effective, some science data lost.
System Effect: Operational reliability and efficiency are degraded. Integrity of Science Data is compromised due to data loss.
Mitigation: Ensure the regular maintenance of ingest nodes, and prevent their use when exhibiting behaviours symptomatic of an impending hardware failure. Where possible immediately compensate for the error by repeating the ingest with operational hardware.
Severity: Minor
Likelihood: Occasional

FM.SDP.PST.102
Function: Timing Receive
FM Description: The SDP hardware responsible for ingesting data from the CSP encounters a hardware failure at the rack level.
Local Effect: An ingest failure results in significant data loss for one or more beams, and significantly delays the processing.
Sub-system Effect: Pulsar timing analysis significantly compromised, moderate science data lost.
System Effect: Operational reliability and efficiency are degraded. Integrity of Science Data is significantly compromised.
Mitigation: Same as above.
Severity: Critical
Likelihood: Remote

FM.SDP.PST.103
Function: Timing Receive
FM Description: The SDP hardware responsible for ingesting data from the CSP encounters a hardware failure impacting the data ingest island.
Local Effect: Without the capacity to buffer data sent by the CSP, an ingest failure at the data island level results in the loss of scan data for all beams.
Sub-system Effect: Pulsar timing analysis not possible, all science data lost.
System Effect: Operational reliability and efficiency are degraded. No science possible.
Mitigation: Attempt to reroute the data received from CSP to available correctly functioning hardware resources. The mitigation strategies from above also apply here.
Severity: Catastrophic
Likelihood: Extremely unlikely

FM.SDP.PST.104
Function: QA Commands/Parameters
FM Description: The hardware executing the code that checks the correctness and validity of commands/parameters fails.
Local Effect: Without valid commands or parameters the pipeline must enter default mode which leads to sub-optimal processing.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Efficiency degraded, minor impact on science outputs.
Mitigation: Operate in default mode, thereby ensuring the science data is still processed and preserved in the appropriate data archive. The data must be flagged to show it has been subjected to default mode processing.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.105
Function: Remove RFI
FM Description: The hardware executing the RFI mitigation code fails.
Local Effect: The signal-to-noise ratio of the detected pulse will be lower without RFI mitigation.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor impact on science outputs.
Mitigation: Add a flag to the data making it clear that RFI mitigation is yet to be performed, and proceed to the next step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.106
Function: Calibrate
FM Description: The hardware executing the calibration code fails.
Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Minor impact on science outputs.
Mitigation: Add a flag to the data making it clear that calibration is yet to be performed, and proceed to the next step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote
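The "add a flag and proceed" mitigation given for the Remove RFI and Calibrate failure modes (FM.SDP.PST.105 and 106) can be illustrated with a small Python sketch. All names here (`run_steps`, `remove_rfi`, `calibrate`) are hypothetical, and a raised exception stands in for a detected hardware failure; a real pipeline would need a more specific failure signal.

```python
from typing import Any, Callable, List, Tuple

# A step is a (name, function) pair; both are illustrative placeholders.
Step = Tuple[str, Callable[[Any], Any]]


def run_steps(payload: Any, steps: List[Step]) -> Tuple[Any, List[str]]:
    """Run each step in order; on failure, record a flag and carry on."""
    flags: List[str] = []
    for name, func in steps:
        try:
            payload = func(payload)
        except Exception as exc:  # stand-in for a detected hardware failure
            # The payload is left as it was before the failed step.
            flags.append(f"{name} not performed: {exc}")
    return payload, flags


def remove_rfi(scan: dict) -> dict:
    raise RuntimeError("node failure")  # simulate the failing hardware


def calibrate(scan: dict) -> dict:
    return {**scan, "calibrated": True}


scan, flags = run_steps(
    {"id": 1}, [("remove_rfi", remove_rfi), ("calibrate", calibrate)]
)
```

In this run the RFI step fails, the data is passed on unchanged to calibration, and `flags` records which step is still to be performed, so processing does not halt and nothing is discarded.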
Table 5: Hardware induced failure modes 7-11.

FM.SDP.PST.107
Function: Average
FM Description: The hardware executing the code responsible for producing averaged data products fails.
Local Effect: Inability to produce intermediate output data products.
Sub-system Effect: Pulsar timing analysis unaffected. Data products useful for post-processing analysis lost.
System Effect: No impact on science outputs so long as primary data product is stored. Intermediate data products can be recreated via post-processing.
Mitigation: Add a flag to the data making it clear that averaging is yet to be performed, and proceed to the next step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.108
Function: Archive Average Products
FM Description: The hardware executing the code that archives multiple intermediate averaged data products, and the data cube, fails.
Local Effect: Storage of primary and averaged data products fails.
Sub-system Effect: Pulsar timing pipeline fails to persist primary science data.
System Effect: Catastrophic impact on science outputs.
Mitigation: It is imperative that the primary data product of the timing pipeline, the data cube, is persisted. Thus this step must be re-run upon failure until the primary data product at a minimum is stored. This may hold up processing, thus may require the buffering of data from any subsequent scans.
Severity: Critical
Likelihood: Remote

FM.SDP.PST.109
Function: Determine TOAs
FM Description: The hardware executing the code that determines pulse TOAs fails.
Local Effect: Pulse arrival times cannot be computed.
Sub-system Effect: Pulsar timing pipeline cannot measure pulse arrival times, compute residuals, and update timing models. Timing pipeline also fails to trigger alerts for profile changes of scientific interest.
System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
Mitigation: Add a flag to the data making it clear that the TOAs could not be determined. Proceed to archive the data so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.110
Function: Archive TOAs
FM Description: The hardware executing the code that sends the computed TOAs to the archive fails.
Local Effect: Failure to store TOAs.
Sub-system Effect: Pulsar timing pipeline fails to archive useful science data.
System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
Mitigation: Continue retrying to archive the TOAs until some timeout period TBD has elapsed. If the TOAs cannot be archived, add a flag to the data indicating this, and proceed to the next step.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.111
Function: Generate Residuals
FM Description: The hardware executing the code that generates timing residuals fails.
Local Effect: Failure to generate timing residuals.
Sub-system Effect: Pulsar timing pipeline cannot detect scientifically interesting profile changes. This prevents rapid follow-up.
System Effect: Minor impact on science outputs. Residuals can be computed via post-processing if necessary.
Mitigation: Add a flag to the data indicating that the residuals could not be computed. Proceed to archive the data so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote
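The retry-then-flag mitigation described for Archive TOAs (FM.SDP.PST.110) might look like the Python sketch below. The time-out is TBD in the table, so `timeout_s` and `interval_s` are placeholder parameters, and `retry_until_timeout`, `archive_toas` and the `send` callable are hypothetical names. `OSError` is assumed here to be how an archive failure surfaces.

```python
import time
from typing import Callable, List


def retry_until_timeout(
    action: Callable[[], None], timeout_s: float, interval_s: float = 1.0
) -> bool:
    """Call action until it succeeds or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            action()
            return True
        except OSError:  # assumed failure signal from the archive
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


def archive_toas(
    toas: List[float],
    send: Callable[[List[float]], None],
    flags: List[str],
    timeout_s: float,        # TBD in the source table; placeholder only
    interval_s: float = 1.0,
) -> None:
    """Try to archive the TOAs; flag the data if the time-out is reached."""
    if not retry_until_timeout(lambda: send(toas), timeout_s, interval_s):
        flags.append("TOAs not archived")
```

The caller proceeds to the next step whether or not archiving succeeded; only the flag differs. The first attempt is always made, even with a zero time-out.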
Table 6: Hardware induced failure modes 12-17.

FM.SDP.PST.112
Function: QA Residuals
FM Description: The hardware executing the code that evaluates the quality of the residuals fails.
Local Effect: Poor quality residuals propagated through pipeline.
Sub-system Effect: Pulsar timing pipeline continues processing with poor residuals.
System Effect: Minor impact on science outputs. Residuals can be computed via post-processing if necessary.
Mitigation: Proceed to the next step so that pipeline processing does not halt and no data lost. Append a flag to the data indicating that the residuals require a QA analysis.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.113
Function: Update Timing Model
FM Description: The hardware executing the code that updates the timing model fails.
Local Effect: Timing model not updated.
Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
Mitigation: Add a flag to the data making it clear that the timing model has not been updated, and proceed to archive the data so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.114
Function: Archive Timing Model
FM Description: The hardware executing the code that archives the timing model fails.
Local Effect: Timing model not archived.
Sub-system Effect: Pulsar timing pipeline cannot carry out its primary purpose, to automatically update timing models.
System Effect: Minor impact on science outputs. Timing model can be recomputed via post-processing if necessary.
Mitigation: Continue retrying to archive the timing model until some time-out period TBD has elapsed. If model not archived, add a flag to the data indicating this, and proceed to the data archival step.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.115
Function: Evaluate Model Changes
FM Description: The hardware executing the code that evaluates changes to the timing model fails.
Local Effect: Inability to detect significant profile changes.
Sub-system Effect: Pulsar timing pipeline cannot detect scientifically significant pulse profile changes (e.g. glitches or mode changes).
System Effect: Minor to marginal impact on science outputs. Failure to evaluate prevents rapid follow-up. Data can be post-processed allowing belated evaluation.
Mitigation: Add a flag to the data making it clear the model has not been evaluated for change, and proceed to the data archival step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.116
Function: Generate Alert
FM Description: The hardware executing the code that generates alerts fails.
Local Effect: Alerts not generated.
Sub-system Effect: Pulsar timing pipeline cannot alert TM or the community to scientifically interesting events.
System Effect: Minor to marginal impact on science outputs.
Mitigation: Add a flag to the data making it clear that the data requires follow-up analysis. Continue to attempt to generate an alert until some time-out period TBD has elapsed.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.117
Function: All archiving functions
FM Description: The hardware archiving pulsar timing data (data cubes, residuals, TOAs, meta data or timing models) fails.
Local Effect: Science data not persisted.
Sub-system Effect: Pulsar timing pipeline completes processing however science data is lost.
System Effect: Marginal to Critical impact on science outputs.
Mitigation: If archiving fails due to a hardware error, causing data loss, the observation must be rescheduled and repeated.
Severity: Marginal to Critical
Likelihood: Remote
5.2 Control & Communication Failures
Control failures can occur in a variety of ways. For example,
• control can be lost at the node level, due to the failure of a node management daemon or controlling SDP process.
• control can be lost at the rack or compute island level, similarly to above. This could be caused, for example, via a Top of Rack (TOR) switch failure.
• control can be lost/degraded due to a failure of the SDP management system. This can be caused by either software or hardware failures/errors. Whilst a connection to the management system will likely always be available due to the network topology used, bandwidth could be reduced.
• control can be lost due to a problem with the telescope manager, or the LMC. Note the LMC is known as the execution control system [RD9] (see section 2.1.1 in the external document) in SDP.
• control can fail due to communication errors. This could be caused by, for example, the failure of networking hardware, a network security intrusion, or the corruption of network traffic due to software problems (e.g. in firmware).
• control can fail due to use of inappropriate commands, and/or human error.
While there are many possible control failure scenarios, we consider only high level failures for brevity.
Clearly communication failures can cause many of the control issues outlined above. However, communication problems can also affect SDP processing, and these possibilities are considered separately. Communication failures occur due to:
• the corruption of data packets.
• networking hardware failures, or hardware failures at the node level (e.g. at the NICs).
• software errors in processing components which corrupt or invalidate communication.
• incompatible communication protocols or data types.
A number of failure modes related to control and communications are listed in Tables 7 through 13 below. For simplicity only scenarios where inherent redundancy fails are presented (i.e. a worst case scenario).
Table 7: Control and Communication failure modes 1-6.

FM.SDP.PST.201
Function: ALL
FM Description: Control of a timing pipeline component is temporarily lost.
Local Effect: Timing pipeline component cannot be controlled or monitored externally.
Sub-system Effect: Failure to correctly control and monitor pulsar timing processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
Mitigation: Allow all timing pipeline components to operate autonomously in either a default or a standard mode. Attempt to confirm/re-establish control after the completion of each scan. Raise an alarm.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.202
Function: ALL
FM Description: Control of a timing pipeline component is lost for a period of time that exceeds a scan length.
Local Effect: Timing pipeline component cannot be controlled or monitored externally.
Sub-system Effect: Failure to correctly control and monitor the timing processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
Mitigation: Complete processing of data obtained during the previous/current scan so long as commands are valid, raise an alarm, then await instruction from TM.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.203
Function: ALL
FM Description: Control parameters given to a timing pipeline component are incorrectly formatted or invalid.
Local Effect: Timing pipeline component incorrectly processes data.
Sub-system Effect: Timing pipeline component cannot correctly process the data ingested from CSP causing data loss / sub-optimal processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data compromised.
Mitigation: Automatically detect incorrect parameters and autonomously enter default mode to prevent the loss of science data. Raise an alarm.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.204
Function: ALL
FM Description: Control commands given to the timing pipeline component are invalid or incorrectly formatted.
Local Effect: Timing pipeline component incorrectly processes data.
Sub-system Effect: Timing pipeline component cannot correctly process the data ingested from CSP causing data loss / sub-optimal processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data compromised.
Mitigation: Automatically detect incorrect commands and autonomously enter default mode to prevent the loss of science data. Raise an alarm.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.205
Function: ALL
FM Description: No monitor or control signals transmitted or received from outside of the SDP.
Local Effect: Timing pipeline component cannot be controlled or monitored externally.
Sub-system Effect: Failure to correctly control and monitor the timing processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
Mitigation: Redundant software monitor/control network. Allow timing pipeline to operate autonomously in default mode in the event of control failure. De-couple control and monitoring within the Execution Control Component.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.206
Function: ALL
FM Description: No monitor or control signals transmitted or received temporarily inside of the SDP.
Local Effect: Timing pipeline component cannot be controlled internally.
Sub-system Effect: Failure to correctly control and monitor the timing processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
Mitigation: Redundant software monitor/control network. Allow timing pipeline to operate autonomously in default mode in the event of control failure. De-couple control and monitoring within the Execution Control Component.
Severity: Minor
Likelihood: Remote
Table 8: Control and Communication failure modes 7-13.

FM.SDP.PST.207
Function: ALL
FM Description: No monitor or control signals transmitted or received temporarily inside of the SDP, for a period of time that exceeds a scan length.
Local Effect: Timing pipeline component cannot be controlled internally.
Sub-system Effect: Failure to correctly control and monitor the timing processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
Mitigation: Redundant software monitor/control network. Complete processing of data obtained during the previous/current scan so long as commands are valid, raise an alarm, then await instruction from TM.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.208
Function: ALL
FM Description: Missing or corrupt monitor and control packets.
Local Effect: Unable to reliably monitor or control pipeline components.
Sub-system Effect: Failure to correctly control and monitor the timing processing.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised if processing conducted incorrectly.
Mitigation: Allow timing pipeline to operate autonomously in default mode in the event of control failure.
Severity: Significant
Likelihood: Remote

FM.SDP.PST.209
Function: ALL
FM Description: Routing and transmission of data within SDP fails due to missing or corrupt data packets.
Local Effect: Data not transmitted.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Resilience of routing. In the event of a communications failure, allow the timing pipeline to operate autonomously in a default mode that prioritizes saving the science data.
Severity: Significant
Likelihood: Remote

FM.SDP.PST.210
Function: ALL
FM Description: Routing and transmission of data within SDP temporarily fails due to network errors or failures.
Local Effect: Data not transmitted.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Redundant data network. Resilience of routing.
Severity: Significant
Likelihood: Remote

FM.SDP.PST.211
Function: ALL
FM Description: Routing and transmission of data within SDP fails due to network errors or failures, for a period of time that exceeds a scan length.
Local Effect: Data not transmitted.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Redundant data network. Resilience of routing.
Severity: Catastrophic
Likelihood: Extremely unlikely

FM.SDP.PST.212
Function: ALL
FM Description: Compound routing/communication errors occurring at different locations within SDP.
Local Effect: Data not transmitted.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Redundant data network. Resilience of routing. Cease processing and await TM instruction.
Severity: Catastrophic
Likelihood: Extremely unlikely

FM.SDP.PST.213
Function: Timing Receive
FM Description: Control parameters sent to the timing receive component are corrupted via packet loss or some other communication error.
Local Effect: Timing receive incorrectly processes received data.
Sub-system Effect: Timing receive cannot correctly ingest the data from CSP causing data loss.
System Effect: Operational reliability and efficiency are degraded. Integrity of science data compromised.
Mitigation: Automatically detect incorrect parameters and autonomously enter default mode to prevent the loss of science data.
Severity: Minor
Likelihood: Occasional
Table 9: Control and Communication failure modes 14-19.

FM.SDP.PST.214
Function: Timing Receive
FM Description: Routing and transmission of data from the CSP fails due to too many missing or corrupt data packets.
Local Effect: No pulsar timing data received.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Resilience of routing. The capacity to request that data be re-sent.
Severity: Catastrophic
Likelihood: Remote

FM.SDP.PST.215
Function: Timing Receive
FM Description: Routing and transmission of data from the CSP temporarily fails due to network communication failures.
Local Effect: No pulsar timing data received.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Operational reliability and efficiency are degraded. All science data lost.
Mitigation: Re-establish connectivity, and if possible request scan data be resent from CSP.
Severity: Catastrophic
Likelihood: Remote

FM.SDP.PST.216
Function: Timing Receive
FM Description: Data received from the CSP is marginally corrupted via packet loss or some other communication error.
Local Effect: Timing receive processes partly corrupted data.
Sub-system Effect: Pulsar timing analysis less effective.
System Effect: Science data loses some of its utility.
Mitigation: Monitor proportion of data subject to corruption. Continue to function normally so long as less than 20% TBC of the data is corrupted. If more than 20% TBC is corrupted raise an alarm, but continue to function and annotate the processed data with a flag indicating that its utility is significantly degraded.
Severity: Marginal
Likelihood: Occasional

FM.SDP.PST.217
Function: Timing Receive
FM Description: Timing receive temporarily loses connectivity with downstream SDP components.
Local Effect: Timing receive cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output not produced.
Mitigation: Send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.218
Function: Timing Receive
FM Description: Timing receive loses all connectivity with downstream SDP components for a period of time longer than a scan duration.
Local Effect: Timing receive cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output not produced.
Mitigation: Resilience of routing.
Severity: Catastrophic
Likelihood: Extremely unlikely

FM.SDP.PST.219
Function: Timing Receive / Ingest
FM Description: Failure to ingest received data in a timely fashion, causing a data backlog which cannot be cached.
Local Effect: Data does not enter the pipeline quickly enough to complete timing processing in the allotted time.
Sub-system Effect: Pulsar timing analysis incomplete.
System Effect: Scientific outputs degraded. Science data loses some of its utility, some data loss.
Mitigation: Resilience of routing, automatic load balancing to prevent resource contention and processing delays.
Severity: Marginal
Likelihood: Remote
Table 10: Control and Communication failure modes 20-23.

FM.SDP.PST.220
Function: Timing Receive / Ingest
FM Description: Sub-integration data impacted by packet loss when using FTP (as data sent 1 sub-int at a time).
Local Effect: Timing receive processes partly corrupted data. Mitigation strategy incurs computational overhead.
Sub-system Effect: Reduced effectiveness of pulsar timing analysis.
System Effect: Minor degradation to science outputs.
Mitigation: Request that data be resent. If resend impossible, add a zeroed sub-int in place of the corrupted sub-int. Update cumulative tracking of lost sub-ints and sub-int samples. If cumulative data loss more than 20% TBC then too much signal has been lost and an alarm must be raised. Tag the data so the proportion of lost sub-ints is recorded.
Severity: Scan dependent. Fractional loss is important. Severity ranges from minor to critical due to cumulative effects.
Likelihood: Occasional

FM.SDP.PST.221
Function: Remove RFI
FM Description: Remove RFI function temporarily loses connectivity with downstream SDP components.
Local Effect: Remove RFI function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output degraded.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. If retry fails, send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.222
Function: Calibrate
FM Description: Calibrate function temporarily loses connectivity with downstream SDP components.
Local Effect: Calibrate function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output degraded.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. If retry fails, send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.223
Function: Archive Average Products
FM Description: Average function temporarily loses connectivity with downstream SDP components.
Local Effect: Average function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Scientific output not produced.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. If retry fails, send the science data to the preservation system without processing to prevent data loss. Flag the data as requiring follow-up post-processing. Generate an alert.
Severity: Marginal
Likelihood: Remote
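The cumulative loss tracking described for FM.SDP.PST.220 can be sketched in Python as follows. The 20% figure is marked TBC in the table, so `loss_threshold` is a provisional default. The class name `SubIntTracker`, its fields, and the use of `None` to mean "corrupted and could not be resent" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SubIntTracker:
    """Hypothetical tracker for lost sub-integrations within one scan."""
    loss_threshold: float = 0.20  # "20% TBC" in the table; provisional
    total: int = 0
    lost: int = 0
    alarm_raised: bool = False

    @property
    def lost_fraction(self) -> float:
        """Proportion of sub-ints lost so far; recorded with the data."""
        return self.lost / self.total if self.total else 0.0

    def ingest(self, subint: Optional[List[float]], n_samples: int) -> List[float]:
        """Accept one sub-int; None means corrupted with no resend possible."""
        self.total += 1
        if subint is None:
            self.lost += 1
            subint = [0.0] * n_samples  # zeroed sub-int in place of the lost one
        if self.lost_fraction > self.loss_threshold:
            self.alarm_raised = True    # too much signal has been lost
        return subint
```

Because the fraction is evaluated after every sub-int, a loss early in a scan can trip the alarm before much data has arrived. Whether the check should wait for a minimum number of sub-ints is a design choice the table does not settle.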
Table 11: Control and Communication failure modes 24-28.

FM.SDP.PST.224
Function: Archive Average Products
FM Description: Connectivity with the preservation system is temporarily lost, preventing storage of averaged data products and the primary data cube.
Local Effect: Storage of primary and averaged data products fails.
Sub-system Effect: Pulsar timing pipeline fails to persist primary science data.
System Effect: Potential for critical impact on science outputs.
Mitigation: It is imperative for the primary data product of the timing pipeline, the data cube, to be persisted. Thus this step must be re-run upon failure until the primary data product at a minimum is stored, or until some time-out period TBD has elapsed. Generate an alert. If the data is retained in a buffer and not discarded until science outputs are persisted, then the severity is reduced to marginal.
Severity: Marginal to Critical
Likelihood: Remote

FM.SDP.PST.225
Function: Archive Average Products
FM Description: Archive Average Products function temporarily loses connectivity with downstream SDP components.
Local Effect: Archive Average Products function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Some scientific output not produced.
Mitigation: Generate an alert, and prepare for next scan (no further processing possible). Flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.226
Function: Determine TOAs
FM Description: Determine TOAs function temporarily loses connectivity with downstream SDP components.
Local Effect: Determine TOAs function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Some scientific output not produced.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. Generate an alert if data is not sent, and prepare for the next scan (no further processing possible). Flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote

FM.SDP.PST.227
Function: Archive TOAs
FM Description: Connectivity with the preservation system is temporarily lost, preventing the storage of TOAs.
Local Effect: Failure to store TOAs.
Sub-system Effect: Pulsar timing pipeline fails to archive useful science data.
System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
Mitigation: Continue to attempt to archive the TOAs until some time-out period TBD has elapsed. If archiving fails, add a flag to the data making it clear that the TOAs have not been archived. Proceed to the next step so that pipeline processing does not halt and no data lost.
Severity: Marginal
Likelihood: Remote

FM.SDP.PST.228
Function: Archive TOAs
FM Description: Archive TOAs function temporarily loses connectivity with downstream SDP components.
Local Effect: Archive TOAs function cannot pass data through the timing pipeline.
Sub-system Effect: Pulsar timing analysis not possible.
System Effect: Some scientific output not produced.
Mitigation: Retry sending the data until some time-out period TBD has elapsed. Generate an alert if data is not sent. Flag the data for follow-up post processing.
Severity: Minor
Likelihood: Remote
Table 12: Control and Communication failure modes 29-33.

Function: Generate Residuals (FM.SDP.PST.229)
  FM Description: Generate Residuals function temporarily loses connectivity with downstream SDP components.
  Local Effect: Generate Residuals function cannot pass data through the timing pipeline.
  Sub-system Effect: Pulsar timing analysis not possible.
  System Effect: Some scientific output not produced.
  Mitigation: Retry sending the data until some time-out period TBD has elapsed. If the data is not sent, generate an alert. Then prepare for the next scan (no further processing possible). Flag the data for follow-up post processing.
  Severity: Minor
  Likelihood: Remote

Function: Send Residuals to QA System (FM.SDP.PST.230)
  FM Description: Send Residuals to QA System function temporarily loses connectivity with downstream SDP components.
  Local Effect: Send Residuals to QA System function cannot pass data through the timing pipeline.
  Sub-system Effect: Pulsar timing analysis quality reduced.
  System Effect: Quality of science output affected.
  Mitigation: Retry sending the data until some time-out period TBD has elapsed. Generate an alert, and flag the data for residual QA, and move to the next processing step.
  Severity: Minor
  Likelihood: Remote

Function: Update Timing Model (FM.SDP.PST.231)
  FM Description: Timing model for the pulsar being observed cannot be obtained externally.
  Local Effect: Timing model not updated.
  Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
  System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
  Mitigation: Continue to attempt to obtain the timing model until some time-out period TBD has elapsed. If unavailable on retry, add a flag to the data indicating this. Proceed to the next step so that pipeline processing does not halt and no data is lost.
  Severity: Marginal
  Likelihood: Remote

Function: Evaluate Model Changes (FM.SDP.PST.232)
  FM Description: Evaluate Model Changes function temporarily loses connectivity with downstream SDP components.
  Local Effect: Evaluate Model Changes function cannot pass data through the timing pipeline.
  Sub-system Effect: Cannot generate alerts based on changes in a timing profile.
  System Effect: Quality of science output affected.
  Mitigation: Generate an alert, and flag the data for model change analysis post-processing. Then proceed to archive the timing model.
  Severity: Minor
  Likelihood: Remote

Function: Archive Timing Model (FM.SDP.PST.233)
  FM Description: Connectivity with the preservation system is temporarily lost, preventing storage of the updated timing model.
  Local Effect: Timing model not sent to the archive/preservation system.
  Sub-system Effect: Timing model not archived. Pipeline fails to automatically update timing models.
  System Effect: Minor impact on science outputs. Timing model can be recomputed via post-processing if necessary.
  Mitigation: Retry sending the data until some timeout period TBD has elapsed. If the data is not sent, raise an alarm. Add a flag to the data making it clear the timing model has not been persisted. Proceed to the next step so that pipeline processing does not halt and no data is lost.
  Severity: Marginal
  Likelihood: Remote
Table 13: Control and Communication failure modes 34-36.

Function: Update Timing Model (FM.SDP.PST.234)
  FM Description: Update Timing Model function temporarily loses connectivity with downstream SDP components.
  Local Effect: Update Timing Model function cannot pass data through the timing pipeline.
  Sub-system Effect: Timing model not updated.
  System Effect: Automatic update of timing models fails.
  Mitigation: Generate an alert, and flag the data for follow-up post processing.
  Severity: Minor
  Likelihood: Remote

Function: Generate Alert (FM.SDP.PST.235)
  FM Description: Connectivity with the alert system is temporarily lost, preventing rapid follow-up.
  Local Effect: Alerts not generated.
  Sub-system Effect: Pulsar timing pipeline cannot alert TM or the research community to scientifically interesting events.
  System Effect: Minor to marginal impact on science outputs.
  Mitigation: Add a flag to the data making it clear that the data requires follow-up analysis. Continue to attempt to generate an alert until some time-out period TBD has elapsed.
  Severity: Marginal
  Likelihood: Remote

Function: ALL - Metadata Acquisition (FM.SDP.PST.236)
  FM Description: Connectivity with the system/s responsible for managing and supplying metadata is temporarily lost. This impacts the acquisition of sky models, RFI masks, calibration strategies, pulsar ephemerides, Standard Profiles and timing models.
  Local Effect: Data required for processing not available, causing processing steps to be missed.
  Sub-system Effect: Pulsar timing pipeline unable to run correctly.
  System Effect: Minor to marginal impact on science outputs.
  Mitigation: Retry obtaining the required metadata until some time-out period TBD has elapsed. If metadata still unavailable, generate an alert. Add a flag to the data making it clear that the data requires follow-up analysis. Proceed to the next processing step where possible in default mode.
  Severity: Marginal
  Likelihood: Remote
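
Many of the control and communication mitigations above share one pattern: retry an operation until a time-out period (TBD in this memo) has elapsed, then generate an alert, flag the data, and let the pipeline proceed. A minimal Python sketch of that pattern follows. The names `retry_until_timeout` and `archive_with_mitigation`, the `send` callable, the flag and alert strings, and all time-out values are hypothetical placeholders, not part of any SDP interface.

```python
import time
from typing import Callable, List


def retry_until_timeout(send: Callable[[], bool],
                        timeout_s: float,
                        interval_s: float = 0.1) -> bool:
    """Call `send` repeatedly until it succeeds or `timeout_s` elapses.

    Returns True if `send` succeeded, False if the time-out expired.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if send():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def archive_with_mitigation(send: Callable[[], bool],
                            flags: List[str],
                            alerts: List[str],
                            timeout_s: float = 1.0) -> bool:
    """Hypothetical mitigation: retry, then alert and flag on failure."""
    if retry_until_timeout(send, timeout_s, interval_s=0.01):
        return True
    alerts.append("archive time-out")  # generate an alert
    flags.append("NOT_ARCHIVED")       # flag the data for follow-up
    return False                       # caller proceeds to the next step
```

Returning a status rather than raising an exception reflects the stated aim that pipeline processing does not halt when an archive or downstream component is unreachable.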
5.3 Data Failures
Data failures arise when data is incorrectly formatted, contains invalid values, or is not provided when expected. Formatting and validity issues typically arise through software errors and incorrectly implemented interfaces. It is also possible for such errors to occur due to communication issues (e.g. packet loss), or memory problems (e.g. bit flips) that can cause data corruption.
Data problems can also arise when using external databases. It is possible for data requested of an external resource to become corrupted during transfer, or through data mismanagement. As the pulsar timing pipeline requires external data to function (e.g. pulsar ephemerides), such errors are plausible.
A number of failure modes related to data are listed in Tables 14 through 16.
Table 14: Data failure modes 1-6.

Function: Timing Receive / Ingest (FM.SDP.PST.301)
  FM Description: Sub-integration data incorrectly formatted / contains invalid values.
  Local Effect: Timing receive processes partly corrupted or incorrectly formatted data. Mitigation strategy incurs computational overhead.
  Sub-system Effect: Reduced effectiveness of pulsar timing analysis.
  System Effect: Minor degradation to science outputs.
  Mitigation: Add a zeroed sub-int in place of incorrectly formatted or invalid sub-integration (or sub-int data point). Update cumulative tracking of lost sub-ints and sub-int samples. If cumulative data loss is more than 20% (TBC), then too much signal has been lost and an alarm must be raised. Tag the data so the proportion of lost sub-ints is recorded.
  Severity: Scan dependent. Fractional loss is important. Severity ranges from minor to critical due to cumulative effects.
  Likelihood: Remote

Function: Remove RFI (FM.SDP.PST.302)
  FM Description: No RFI mask provided.
  Local Effect: Cannot remove/mitigate RFI. The signal-to-noise ratio of the detected pulse will be lowered.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor to Marginal impact on science outputs.
  Mitigation: Add a flag to the data making it clear that RFI mitigation is yet to be performed, and proceed to the next processing step so that pipeline processing does not halt and no data is lost.
  Severity: Marginal
  Likelihood: Extremely Unlikely

Function: Remove RFI (FM.SDP.PST.303)
  FM Description: Invalid/corrupt RFI mask provided to the RFI mitigation component.
  Local Effect: Cannot remove/mitigate RFI. The signal-to-noise ratio of the detected pulse will be lowered.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor to Marginal impact on science outputs.
  Mitigation: Same as FM.SDP.PST.302.
  Severity: Marginal
  Likelihood: Remote

Function: Remove RFI (FM.SDP.PST.304)
  FM Description: Inappropriate RFI mask provided to the RFI mitigation component.
  Local Effect: The signal-to-noise ratio of the detected pulse will be lower without RFI mitigation.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor to Marginal impact on science outputs.
  Mitigation: Undo the mitigation step and add a flag to the data making it clear that RFI mitigation is yet to be performed. Must also explain that the applied mask failed to increase the signal-to-noise ratio.
  Severity: Minor
  Likelihood: Occasional

Function: Calibrate (FM.SDP.PST.305)
  FM Description: No calibration solution provided.
  Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor impact on science outputs.
  Mitigation: Add a flag to the data making it clear that calibration is yet to be performed, and proceed to the next processing step so that pipeline processing does not halt and no data is lost.
  Severity: Marginal
  Likelihood: Remote

Function: Calibrate (FM.SDP.PST.306)
  FM Description: Invalid/corrupt calibration solution provided.
  Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor impact on science outputs.
  Mitigation: Same as FM.SDP.PST.305.
  Severity: Marginal
  Likelihood: Remote
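
The mitigation for FM.SDP.PST.301 (substitute a zeroed sub-int for an invalid one, track cumulative loss, and raise an alarm above a threshold) can be sketched in Python as below. This is an illustration under stated assumptions only: a sub-integration is modelled as a list of floats, "valid" is taken to mean the expected length with all values finite, and only whole sub-ints are tracked (the memo also mentions tracking individual sub-int samples). The names `is_valid` and `ingest_scan` are hypothetical, and the 20% threshold is marked TBC in the memo.

```python
import math
from typing import List, Optional, Tuple

LOSS_THRESHOLD = 0.20  # 20% is TBC in the memo; placeholder value


def is_valid(sub_int: Optional[List[float]], n_samples: int) -> bool:
    """Assumed validity check: expected length and all values finite."""
    return (sub_int is not None
            and len(sub_int) == n_samples
            and all(math.isfinite(v) for v in sub_int))


def ingest_scan(sub_ints: List[Optional[List[float]]],
                n_samples: int) -> Tuple[List[List[float]], float, bool]:
    """Replace invalid sub-ints with zeros and track cumulative loss.

    Returns (cleaned sub-ints, fraction of sub-ints lost, alarm raised).
    """
    cleaned: List[List[float]] = []
    lost = 0
    for sub_int in sub_ints:
        if is_valid(sub_int, n_samples):
            cleaned.append(list(sub_int))
        else:
            cleaned.append([0.0] * n_samples)  # zeroed sub-int in its place
            lost += 1
    fraction_lost = lost / len(sub_ints) if sub_ints else 0.0
    alarm = fraction_lost > LOSS_THRESHOLD
    return cleaned, fraction_lost, alarm
```

The returned fraction is what would be tagged onto the data so that the proportion of lost sub-ints is recorded alongside the scan.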
Table 15: Data failure modes 7-11.

Function: Calibrate (FM.SDP.PST.307)
  FM Description: Inappropriate calibration solution provided.
  Local Effect: The signal-to-noise ratio of the detected pulse will be lower without calibration.
  Sub-system Effect: Pulsar timing analysis less effective.
  System Effect: Minor impact on science outputs.
  Mitigation: Undo the calibration step and add a flag to the data making it clear that calibration is yet to be performed. Must also explain that the applied strategy failed to increase the signal-to-noise ratio.
  Severity: Minor
  Likelihood: Occasional

Function: Average (FM.SDP.PST.308)
  FM Description: No specifications provided for the required averaged data products.
  Local Effect: Inability to produce intermediate output data products.
  Sub-system Effect: Pulsar timing analysis unaffected. Only data products useful for post-processing analysis are lost.
  System Effect: No impact on science outputs so long as the primary data product is stored. Intermediate data products can be recreated via post-processing.
  Mitigation: Send the primary data cube to the preservation archive, along with some default averaged data products.
  Severity: Minor
  Likelihood: Occasional

Function: Archive Average Products (FM.SDP.PST.309)
  FM Description: Averaged data products incorrectly formatted / contain invalid values due to software error.
  Local Effect: Averaged data products are not persisted. Cannot send invalid or corrupted data to the preservation archive.
  Sub-system Effect: Pulsar timing analysis unaffected. Only data products useful for post-processing analyses are lost.
  System Effect: No impact on science outputs so long as the primary data product is stored. Intermediate data products can be recreated via post-processing.
  Mitigation: Send the primary data cube to the preservation archive. Flag that average data products were invalid and need recreating. Raise an alarm.
  Severity: Minor
  Likelihood: Remote

Function: Determine TOAs (FM.SDP.PST.310)
  FM Description: No standard profile provided.
  Local Effect: No TOAs determined.
  Sub-system Effect: Pulsar timing analysis incomplete.
  System Effect: Minor to Marginal impact on science outputs.
  Mitigation: Retry obtaining the standard profile until some time-out period TBD has elapsed. If none available, enter default mode and send the data to the preservation archive. Annotate the data and flag for reprocessing. Prepare to process the next scan (cannot proceed with timing processing without the standard profile). Raise an alarm.
  Severity: Marginal
  Likelihood: Remote

Function: Determine TOAs (FM.SDP.PST.311)
  FM Description: Invalid/corrupt standard profile provided.
  Local Effect: No TOAs determined.
  Sub-system Effect: Pulsar timing analysis incomplete.
  System Effect: Minor to Marginal impact on science outputs.
  Mitigation: Retry obtaining the standard profile until some time-out period TBD has elapsed. If none available, enter default mode and send the data to the preservation archive. Annotate the data and flag for reprocessing. Prepare to process the next scan (cannot proceed without the standard profile). Raise an alarm.
  Severity: Marginal
  Likelihood: Remote
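
The mitigations for FM.SDP.PST.304 and FM.SDP.PST.307 both depend on checking whether a processing step (an RFI mask or a calibration solution) increased the signal-to-noise ratio of the detected pulse, and undoing the step if it did not. The Python sketch below illustrates that check. The signal-to-noise definition used here (peak on-pulse value minus the off-pulse mean, divided by the off-pulse standard deviation) is an assumption chosen for illustration, as are the names `snr` and `apply_if_beneficial` and the flag strings; the memo does not specify any of them.

```python
import statistics
from typing import Callable, List


def snr(profile: List[float], on_pulse: slice) -> float:
    """Assumed S/N: (on-pulse peak - off-pulse mean) / off-pulse std dev."""
    on = profile[on_pulse]
    off = profile[:on_pulse.start] + profile[on_pulse.stop:]
    return (max(on) - statistics.mean(off)) / statistics.stdev(off)


def apply_if_beneficial(profile: List[float],
                        step: Callable[[List[float]], List[float]],
                        on_pulse: slice,
                        flags: List[str],
                        flag_name: str) -> List[float]:
    """Apply `step`; keep the result only if the S/N increased.

    Otherwise return the original profile (the step is undone) and
    record a flag explaining that the step failed to increase the S/N.
    """
    before = snr(profile, on_pulse)
    candidate = step(list(profile))
    if snr(candidate, on_pulse) > before:
        return candidate
    flags.append(flag_name)
    return profile
```

Working on a copy of the profile keeps the "undo" trivial: the original data is simply returned unchanged together with the explanatory flag.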
Table 16: Data failure modes 12-17.

Function: Archive TOAs (FM.SDP.PST.312)
  FM Description: Invalid/corrupt computed TOAs due to software errors.
  Local Effect: Failure to store TOAs. Cannot send invalid or corrupted data to the preservation archive.
  Sub-system Effect: Pulsar timing pipeline fails to archive useful science data.
  System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
  Mitigation: Raise an alarm, and flag the data indicating that TOAs need to be computed during post-processing.
  Severity: Marginal
  Likelihood: Remote

Function: Generate Residuals (FM.SDP.PST.313)
  FM Description: Invalid/corrupt computed TOAs due to software errors.
  Local Effect: Cannot compute residuals from invalid/corrupt TOAs.
  Sub-system Effect: Pulsar timing pipeline fails to compute residuals for the observed pulsar. Cannot detect scientifically interesting profile changes.
  System Effect: Minor impact on science outputs. TOAs can be computed via post-processing if necessary.
  Mitigation: Raise an alarm, and flag the data indicating that TOAs and residuals need to be computed during post-processing.
  Severity: Marginal
  Likelihood: Remote

Function: QA Residuals (FM.SDP.PST.314)
  FM Description: Invalid/corrupt residuals provided, unable to assess their quality.
  Local Effect: Cannot continue processing.
  Sub-system Effect: Pulsar timing pipeline halts.
  System Effect: Minor impact on science outputs. Residuals can be computed via post-processing if necessary.
  Mitigation: Raise an alarm, and flag the data indicating that residuals need to be computed during post-processing.
  Severity: Marginal
  Likelihood: Remote

Function: Update Timing Model (FM.SDP.PST.315)
  FM Description: Invalid/corrupt residuals provided, unable to update the timing model.
  Local Effect: Timing model not updated.
  Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
  System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
  Mitigation: Raise an alarm, and flag the data indicating that residuals need to be computed during post-processing.
  Severity: Marginal
  Likelihood: Remote

Function: Update Timing Model / Archive Timing Model (FM.SDP.PST.316)
  FM Description: Invalid/corrupt timing model provided, unable to update.
  Local Effect: Timing model not updated.
  Sub-system Effect: Pulsar timing pipeline cannot proceed with further processing steps.
  System Effect: Minor impact on science outputs. Timing model can be updated via post-processing if necessary.
  Mitigation: Raise an alarm, and flag the data indicating that the timing model needs to be updated during post-processing.
  Severity: Marginal
  Likelihood: Remote
5.4 Software/Algorithm Failures
Software and algorithms can fail for a variety of reasons. The causes range from bugs inadvertently introduced at the software design stage, to bugs accidentally coded during implementation. Aside from bugs, software can also fail when:
• non-deterministic algorithms do not complete on certain types of input data.
• software/algorithm logic is incorrectly coded, preventing loops from terminating.
• numerical precision is incorrectly handled, causing sub-optimal performance or failure.
• incorrect data types are used when handling numerical data, causing precision errors.
• errors in parallelism cause data to be incorrectly processed, for example, via memory access errors.
• runtime is slow, causing failures at the system level (due to delay).
• the implementation is similarly sub-optimal, causing failures at the system level (due to resource contention).
• security vulnerabilities are exploited by attackers.
A number of failure modes related to software/algorithms are listed in Tables 17 and 18.
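
The points above on numerical precision and data types can be made concrete with a short Python sketch. It round-trips an illustrative Modified Julian Date through a 32-bit float and measures the resulting timing error, then reports the spacing of 64-bit floats at the same magnitude. The MJD value and the helper name `to_float32` are arbitrary choices for illustration and are not taken from the pipeline design; `math.ulp` requires Python 3.9 or later.

```python
import math
import struct

SECONDS_PER_DAY = 86400.0


def to_float32(x: float) -> float:
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


mjd = 58225.123456789  # illustrative time stamp in days

# Stored as a 32-bit float, the value snaps to the nearest multiple of 1/256 day.
mjd32 = to_float32(mjd)
error32_s = abs(mjd32 - mjd) * SECONDS_PER_DAY

# Spacing between adjacent 64-bit floats at this magnitude, in seconds.
resolution64_s = math.ulp(mjd) * SECONDS_PER_DAY

print(f"float32 value: {mjd32!r}, error: {error32_s:.1f} s")
print(f"float64 resolution: {resolution64_s:.2e} s")
```

With these inputs the 32-bit representation is wrong by more than two minutes, while the 64-bit spacing is a little under a microsecond, which is why the choice of data type for time values matters in a timing context.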
Table 17: Software/Algorithm failure modes 1-9.

Function: ALL (FM.SDP.PST.401)
  FM Description: Function, process or application throws an arithmetic error (divide by zero, arithmetic overflow or underflow, loss of precision).
  Local Effect: Function fails to complete execution.
  Sub-system Effect: Timing pipeline fails to complete an assigned task.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
  Mitigation: Execution framework re-runs function/algorithm again with input data. Error report logged and fed to software development team for investigation.
  Severity: Critical
  Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.402)
  FM Description: Function, process or application encountered a logic error (infinite loops or infinite recursion, loop counter errors, array index out of bounds exception).
  Local Effect: Same as for FM.SDP.PST.401.
  Sub-system Effect: Same as for FM.SDP.PST.401.
  System Effect: Same as for FM.SDP.PST.401.
  Mitigation: Same as for FM.SDP.PST.401.
  Severity: Critical
  Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.403)
  FM Description: Function, process or application encountered a resource error (Null pointer, access violations, resource leaks, buffer overflow, use-after-free error).
  Local Effect: Same as for FM.SDP.PST.401.
  Sub-system Effect: Same as for FM.SDP.PST.401.
  System Effect: Same as for FM.SDP.PST.401.
  Mitigation: Same as for FM.SDP.PST.401.
  Severity: Critical
  Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.404)
  FM Description: Function, process or application encountered a multi-threading error.
  Local Effect: Same as for FM.SDP.PST.401.
  Sub-system Effect: Same as for FM.SDP.PST.401.
  System Effect: Same as for FM.SDP.PST.401.
  Mitigation: Same as for FM.SDP.PST.401.
  Severity: Critical
  Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.405)
  FM Description: Function, process or application encountered an interface error.
  Local Effect: Same as for FM.SDP.PST.401.
  Sub-system Effect: Same as for FM.SDP.PST.401.
  System Effect: Same as for FM.SDP.PST.401.
  Mitigation: Same as for FM.SDP.PST.401.
  Severity: Critical
  Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.406)
  FM Description: Non-deterministic data dependent function does not terminate in allotted time.
  Local Effect: Function fails to complete execution.
  Sub-system Effect: Timing pipeline fails to complete an assigned task.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
  Mitigation: Monitor processing progress, and force early termination if function/algorithm not converging. Tag the processed data with a note explaining how the processing was curtailed. Generate a log entry explaining how to reproduce the error mode.
  Severity: Critical
  Likelihood: Remote

Function: ALL (FM.SDP.PST.407)
  FM Description: Function, process or application does not respond to commands in a timely manner.
  Local Effect: Components can't be configured correctly.
  Sub-system Effect: Timing pipeline can complete execution, but possibly with sub-optimal configuration, e.g. default mode.
  System Effect: Integrity of science data could be compromised.
  Mitigation: Re-start the function/process prior to the next scan. Generate a log entry describing the error state and steps to reproduce.
  Severity: Critical
  Likelihood: Remote

Function: ALL (FM.SDP.PST.408)
  FM Description: Runtime exceeds allotted time.
  Local Effect: Processing backlog created.
  Sub-system Effect: Places additional load on processing resources.
  System Effect: Integrity of science data could be compromised if some data cannot be processed.
  Mitigation: If runtime begins to increase, automatically load balance to provide additional resources.
  Severity: Minor
  Likelihood: Occasional
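
The mitigation for FM.SDP.PST.406 (monitor progress, force early termination if the algorithm is not converging, and tag the output with a note explaining how processing was curtailed) can be illustrated with the Python sketch below. It uses a cooperative check inside an iterative loop; a real deployment may instead need process-level termination by the execution framework. The names `FitResult` and `run_with_budget`, the tolerance, and the time budget are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class FitResult:
    value: float
    iterations: int
    converged: bool
    note: str  # explains how processing was curtailed, empty if converged


def run_with_budget(step: Callable[[float], float],
                    x0: float,
                    tol: float,
                    budget_s: float,
                    max_iter: int = 1_000_000) -> FitResult:
    """Iterate `step` until convergence, a time budget, or an iteration cap."""
    deadline = time.monotonic() + budget_s
    x = x0
    for i in range(1, max_iter + 1):
        x_new = step(x)
        if abs(x_new - x) < tol:
            return FitResult(x_new, i, True, "")
        x = x_new
        if time.monotonic() >= deadline:
            note = f"curtailed after {i} iterations: time budget of {budget_s} s exceeded"
            return FitResult(x, i, False, note)
    note = f"curtailed after {max_iter} iterations: iteration limit reached"
    return FitResult(x, max_iter, False, note)
```

The `note` field is what would be attached to the processed data and written to the log, so that a curtailed run can be identified and reproduced later.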
Table 18: Software/Algorithm failure modes 9-14.

Function: ALL (FM.SDP.PST.409)
  FM Description: Communication time-out caused by network connectivity issues.
  Local Effect: Inability to process data.
  Sub-system Effect: Timing pipeline fails to complete an assigned task.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
  Mitigation: Retry obtaining the data until some time-out period TBD has elapsed. Generate an alert.
  Severity: Minor
  Likelihood: Occasional

Function: ALL (FM.SDP.PST.410)
  FM Description: Function, process or application becomes unresponsive.
  Local Effect: Inability to process data.
  Sub-system Effect: Timing pipeline fails to complete an assigned task.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
  Mitigation: Re-start the function immediately. Generate a log entry describing the error state and steps to reproduce.
  Severity: Minor
  Likelihood: Remote

Function: ALL (FM.SDP.PST.411)
  FM Description: Error checking procedures fail in the executing application or function.
  Local Effect: Inability to process data.
  Sub-system Effect: Timing pipeline fails to complete an assigned task.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
  Mitigation: Re-start the function immediately. Generate a log entry describing the error state and steps to reproduce.
  Severity: Minor
  Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.412)
  FM Description: Security breaches and intrusions occurring during normal execution.
  Local Effect: Inability to process data.
  Sub-system Effect: Timing pipeline fails to complete an assigned task.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
  Mitigation: Terminate all functions and processes and generate an alert.
  Severity: Minor
  Likelihood: Extremely unlikely

Function: ALL (FM.SDP.PST.413)
  FM Description: In-memory errors caused by bit flips or power surges corrupting executing code.
  Local Effect: Inability to process data.
  Sub-system Effect: Timing pipeline fails to complete an assigned task.
  System Effect: Operational reliability and efficiency are degraded. Integrity of science data could be compromised.
  Mitigation: Re-start the function immediately. Generate a log entry describing the error state and steps to reproduce. Generate an alert.
  Severity: Minor
  Likelihood: Extremely unlikely
6 Summary
In this document we have summarised pulsar timing pipeline failure modes at a high level of abstraction. Numerous failure types have been identified and contextualised according to a number of key assumptions. Our next steps will be to improve upon this work following feedback from our SDP colleagues, and incorporate those improvements into analyses of pulsar and transient search pipeline failure modes.
A FMECA Detection methods
Tables 19 through 21 summarise the detection methods for each failure mode.
Table 19: Summary of the detection methods for each of the failure modes discussed in this document (Part 1).

Failure Mode Code   Detection Method
FM.SDP.PST.101      Monitor the health status of the SDP data ingest nodes, and monitor network connectivity status.
FM.SDP.PST.102      Same as for FM.SDP.PST.101.
FM.SDP.PST.103      Same as for FM.SDP.PST.101.
FM.SDP.PST.104      Monitor the health status of the SDP compute nodes, and monitor network connectivity status.
FM.SDP.PST.105      Same as for FM.SDP.PST.104.
FM.SDP.PST.106      Same as for FM.SDP.PST.104.
FM.SDP.PST.107      Same as for FM.SDP.PST.104.
FM.SDP.PST.108      Same as for FM.SDP.PST.104.
FM.SDP.PST.109      Same as for FM.SDP.PST.104.
FM.SDP.PST.110      Same as for FM.SDP.PST.104.
FM.SDP.PST.111      Same as for FM.SDP.PST.104.
FM.SDP.PST.112      Same as for FM.SDP.PST.104.
FM.SDP.PST.113      Same as for FM.SDP.PST.104.
FM.SDP.PST.114      Same as for FM.SDP.PST.104.
FM.SDP.PST.115      Same as for FM.SDP.PST.104.
FM.SDP.PST.116      Same as for FM.SDP.PST.104.
FM.SDP.PST.117      Same as for FM.SDP.PST.104.
FM.SDP.PST.201      Monitor the health status of software modules, and monitor network connectivity status.
FM.SDP.PST.202      Monitor the health status of software modules, and monitor network connectivity status.
FM.SDP.PST.203      QA of control parameters sent between TM/LMC and the timing pipeline components.
FM.SDP.PST.204      QA of control commands sent between TM/LMC and the timing pipeline components.
FM.SDP.PST.205      Active monitoring of software components and the communication network between them.
FM.SDP.PST.206      Active monitoring of software components and the communication network between them.
FM.SDP.PST.207      Active monitoring of software components and the communication network between them.
Table 20: Summary of the detection methods for each of the failure modes discussed in this document (Part 2).

Failure Mode Code   Detection Method
FM.SDP.PST.208      Active monitoring of software components and the communication network between them.
FM.SDP.PST.209      Active monitoring of data processing hardware.
FM.SDP.PST.210      Active monitoring of data processing hardware.
FM.SDP.PST.211      Active monitoring of data processing hardware.
FM.SDP.PST.212      Active monitoring of data processing hardware.
FM.SDP.PST.213      QA of control parameters sent between TM/LMC and the timing pipeline component.
FM.SDP.PST.214      Active monitoring of data processing hardware.
FM.SDP.PST.215      Monitor network connectivity status.
FM.SDP.PST.216      Monitor network connectivity status and QA of received data.
FM.SDP.PST.217      Monitor network connectivity status and QA of received data.
FM.SDP.PST.218      Monitor network connectivity status and QA of received data.
FM.SDP.PST.219      Monitor the processing load placed upon data ingest nodes, and monitor network connectivity status.
FM.SDP.PST.220      Monitor cumulative sub-integration loss for each beam per scan.
FM.SDP.PST.221      Monitor network connectivity status and QA of received data.
FM.SDP.PST.222      Monitor network connectivity status and QA of received data.
FM.SDP.PST.223      Monitor network connectivity status and QA of received data.
FM.SDP.PST.224      Monitor network connectivity status and QA of received data.
FM.SDP.PST.225      Monitor network connectivity status and QA of received data.
FM.SDP.PST.226      Monitor network connectivity status and QA of received data.
FM.SDP.PST.227      Monitor network connectivity status and QA of received data.
FM.SDP.PST.228      Monitor network connectivity status and QA of received data.
FM.SDP.PST.229      Monitor network connectivity status and QA of received data.
FM.SDP.PST.230      Monitor network connectivity status and QA of received data.
FM.SDP.PST.231      Monitor network connectivity status and QA of received data.
FM.SDP.PST.232      Monitor network connectivity status and QA of received data.
FM.SDP.PST.233      Monitor network connectivity status and QA of received data.
FM.SDP.PST.234      Monitor network connectivity status and QA of received data.
FM.SDP.PST.235      Monitor network connectivity status and QA of received data.
FM.SDP.PST.236      Monitor network connectivity status and QA of received data.
Table 21: Summary of the detection methods for each of the failure modes discussed in this document (Part 3).

Failure Mode Code   Detection Method
FM.SDP.PST.301      QA the data received to ensure it is formatted correctly and contains valid data values.
FM.SDP.PST.302      Check that the RFI mask is valid.
FM.SDP.PST.303      Check that the RFI mask is valid.
FM.SDP.PST.304      Check the signal-to-noise ratio of the detected pulse increases post RFI mitigation.
FM.SDP.PST.305      Check for valid calibration strategy.
FM.SDP.PST.306      Check for valid calibration strategy.
FM.SDP.PST.307      Check the signal-to-noise ratio of the detected pulse increases post calibration.
FM.SDP.PST.308      Check for valid configuration.
FM.SDP.PST.309      QA the format and values of the averaged data products.
FM.SDP.PST.310      QA the standard profile.
FM.SDP.PST.311      QA the standard profile.
FM.SDP.PST.312      QA the computed TOAs.
FM.SDP.PST.313      QA the computed TOAs.
FM.SDP.PST.314      QA the residuals.
FM.SDP.PST.315      QA the residuals.
FM.SDP.PST.316      QA the residuals.
FM.SDP.PST.401      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.402      Same as for FM.SDP.PST.401.
FM.SDP.PST.403      Same as for FM.SDP.PST.401.
FM.SDP.PST.404      Same as for FM.SDP.PST.401.
FM.SDP.PST.405      Same as for FM.SDP.PST.401.
FM.SDP.PST.406      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.407      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.408      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.409      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.410      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.411      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.412      Process monitoring at the operating system / execution framework level.
FM.SDP.PST.413      Process monitoring at the operating system / execution framework level.
B FMECA Results
Tables 22 through 24 summarise the results of our analysis.
Table 22: Summary of the criticality scores for each of the failure modes discussed in this document (Part 1).

Failure Mode Code   Severity               Probability          Score
FM.SDP.PST.101      Minor                  Occasional           3
FM.SDP.PST.102      Critical               Remote               8
FM.SDP.PST.103      Catastrophic           Extremely unlikely   5
FM.SDP.PST.104      Marginal               Remote               4
FM.SDP.PST.105      Marginal               Remote               4
FM.SDP.PST.106      Marginal               Remote               4
FM.SDP.PST.107      Marginal               Remote               4
FM.SDP.PST.108      Critical               Remote               8
FM.SDP.PST.109      Marginal               Remote               4
FM.SDP.PST.110      Marginal               Remote               4
FM.SDP.PST.111      Marginal               Remote               4
FM.SDP.PST.112      Marginal               Remote               4
FM.SDP.PST.113      Marginal               Remote               4
FM.SDP.PST.114      Marginal               Remote               4
FM.SDP.PST.115      Marginal               Remote               4
FM.SDP.PST.116      Marginal               Remote               4
FM.SDP.PST.117      Marginal to Critical   Remote               4 to 8
FM.SDP.PST.201      Minor                  Remote               2
FM.SDP.PST.202      Minor                  Remote               2
FM.SDP.PST.203      Minor                  Remote               2
FM.SDP.PST.204      Minor                  Remote               2
FM.SDP.PST.205      Minor                  Remote               2
FM.SDP.PST.206      Minor                  Remote               2
FM.SDP.PST.207      Minor                  Remote               2
FM.SDP.PST.208      Significant            Remote               6
FM.SDP.PST.209      Significant            Remote               6
FM.SDP.PST.210      Significant            Remote               6
FM.SDP.PST.211      Catastrophic           Extremely unlikely   5
FM.SDP.PST.212      Catastrophic           Extremely unlikely   5
FM.SDP.PST.213      Minor                  Occasional           3
FM.SDP.PST.214      Catastrophic           Remote               6
Table 23: Summary of the criticality scores for each of the failure modes discussed in this document (Part 2).

Failure Mode Code   Severity                          Probability          Score
FM.SDP.PST.215      Catastrophic                      Remote               10
FM.SDP.PST.216      Marginal                          Occasional           6
FM.SDP.PST.217      Marginal                          Remote               4
FM.SDP.PST.218      Catastrophic                      Extremely unlikely   4
FM.SDP.PST.219      Marginal                          Remote               4
FM.SDP.PST.220      Scan dependent. Fractional loss   Occasional           2 to 8
                    is important. Severity ranges
                    from minor to critical due to
                    cumulative effects.
FM.SDP.PST.221      Marginal                          Remote               4
FM.SDP.PST.222      Marginal                          Remote               4
FM.SDP.PST.223      Marginal                          Remote               4
FM.SDP.PST.224      Marginal to Critical              Remote               4 to 8
FM.SDP.PST.225      Minor                             Remote               2
FM.SDP.PST.226      Minor                             Remote               2
FM.SDP.PST.227      Marginal                          Remote               4
FM.SDP.PST.228      Minor                             Remote               2
FM.SDP.PST.229      Minor                             Remote               2
FM.SDP.PST.230      Minor                             Remote               2
FM.SDP.PST.231      Marginal                          Remote               4
FM.SDP.PST.232      Minor                             Remote               2
FM.SDP.PST.233      Marginal                          Remote               4
FM.SDP.PST.234      Minor                             Remote               2
FM.SDP.PST.235      Marginal                          Remote               4
FM.SDP.PST.236      Marginal                          Remote               4
FM.SDP.PST.301      Scan dependent. Fractional loss   Remote               2 to 8
                    is important. Severity ranges
                    from minor to critical due to
                    cumulative effects.
FM.SDP.PST.302      Marginal                          Extremely Unlikely   4
FM.SDP.PST.303      Marginal                          Remote               4
FM.SDP.PST.304      Minor                             Occasional           4
FM.SDP.PST.305      Marginal                          Remote               4
FM.SDP.PST.306      Marginal                          Remote               4
FM.SDP.PST.307      Minor                             Occasional           3
FM.SDP.PST.308      Minor                             Occasional           4
FM.SDP.PST.309      Minor                             Remote               4
FM.SDP.PST.310      Marginal                          Remote               4
FM.SDP.PST.311      Marginal                          Remote               4
FM.SDP.PST.312      Marginal                          Remote               4
FM.SDP.PST.313      Marginal                          Remote               4
Table 24: Summary of the criticality scores for each of the failure modes discussed in this document (Part 3).
Failure Mode Code    Severity    Probability    Score
FM.SDP.PST.314    Marginal    Remote    4
FM.SDP.PST.315    Marginal    Remote    4
FM.SDP.PST.316    Marginal    Remote    4
FM.SDP.PST.401    Critical    Extremely unlikely    4
FM.SDP.PST.402    Critical    Extremely unlikely    4
FM.SDP.PST.403    Critical    Extremely unlikely    4
FM.SDP.PST.404    Critical    Extremely unlikely    4
FM.SDP.PST.405    Critical    Extremely unlikely    4
FM.SDP.PST.406    Critical    Remote    8
FM.SDP.PST.407    Minor    Occasional    3
FM.SDP.PST.408    Minor    Occasional    3
FM.SDP.PST.409    Minor    Remote    2
FM.SDP.PST.410    Minor    Extremely unlikely    1
FM.SDP.PST.411    Minor    Extremely unlikely    1
FM.SDP.PST.412    Minor    Extremely unlikely    1
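Most of the scores in the summary tables above are consistent with a simple product of ordinal ranks for severity and probability. The sketch below assumes that mapping (Minor = 1 through Catastrophic = 5; Extremely unlikely = 1, Remote = 2, Occasional = 3). The mapping is inferred from the tabulated rows and is not taken from the scoring definition, so the definition given with the criticality analysis takes precedence. The function and variable names are illustrative only. The sample rows are copied from the tables.

```python
# ASSUMPTION: ordinal ranks inferred from the tabulated rows, not from the
# memo's scoring definition. Names here are hypothetical.
SEVERITY_RANK = {
    "minor": 1,
    "marginal": 2,
    "significant": 3,
    "critical": 4,
    "catastrophic": 5,
}
PROBABILITY_RANK = {
    "extremely unlikely": 1,
    "remote": 2,
    "occasional": 3,
}


def criticality_score(severity, probability):
    """Return the assumed score: severity rank times probability rank."""
    return SEVERITY_RANK[severity.lower()] * PROBABILITY_RANK[probability.lower()]


def inconsistent_rows(rows):
    """Return the codes whose tabulated score differs from the assumed product."""
    return [
        code
        for code, severity, probability, score in rows
        if criticality_score(severity, probability) != score
    ]


# A few rows copied from the summary tables.
ROWS = [
    ("FM.SDP.PST.101", "Minor", "Occasional", 3),
    ("FM.SDP.PST.102", "Critical", "Remote", 8),
    ("FM.SDP.PST.103", "Catastrophic", "Extremely unlikely", 5),
    ("FM.SDP.PST.208", "Significant", "Remote", 6),
    ("FM.SDP.PST.214", "Catastrophic", "Remote", 6),
]

print(inconsistent_rows(ROWS))
```

With these sample rows only FM.SDP.PST.214 is reported. Whether that reflects a deliberate adjustment or a transcription slip cannot be decided from the table alone; the check only surfaces the row for review.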
C Applicable Requirements
We currently do not have access to Innoslate, so the requirements listed here may not be up to date.
Table 25: Level 2 SDP requirements relevant to the failure mode analysis.
Requirement ID    Name    Description
SDP REQ-30    Graceful degradation    The failure of a single component should not cause the SDP to become unavailable.
SDP REQ-33    Flagging control    The SDP shall flag data according to a pre-selected RFI Mask.
SDP REQ-52    Failsafe    The SDP shall actively ensure that internal failures do not result in a hazardous situation to the systems and personnel with which it interfaces.
SDP REQ-133    Pulsar Search Post Processing    SDP shall be capable of operating in a pulsar search mode, concurrently with continuum imaging mode, single pulse transient search mode and pulsar timing mode, within the same subarray.
SDP REQ-276    Data Product Provenance    The SDP shall create and maintain provenance links between science data products and observing projects and proposals.
SDP REQ-281    Protection against data loss    The SDP shall protect the preserved science data products against data loss and malicious or accidental modification.
SDP REQ-285    Accessibility    The SDP shall enable per user access to SDP resources (hardware and software) using the Authentication and Authorisation facilities provided by the SKA (as per EN 50600-2-5. Data centre facilities and infrastructures. Part 2-5. Security systems).
SDP REQ-450    SDP standard pipeline products    The SDP shall produce processing logs and quality assessment logs for all pipelines. These should be traceable to the originating Schedule Blocks.
SDP REQ-470    Receive Data    The SDP shall receive the observed data from CSP in compliance with the SDP-CSP ICD 100-000000-002 and 300-000000-002.
SDP REQ-472    Handle Missing Data    The SDP shall be capable of handling missing data packets coming from CSP in such a way that it minimises the scientific impact of the lost data.
SDP REQ-476    Flag RFI    The SDP shall be capable of automatically flagging known and unknown RFI using algorithms as applied in the AOFlagger.
SDP REQ-477    Excise RFI    The SDP shall be capable of automatically excising known and unknown RFI.
SDP REQ-478    Detect RFI    The SDP shall be capable of detecting data that is corrupted by RFI.
SDP REQ-479    Remove Sources    The SDP shall be capable of removing strong sources at the highest time and frequency resolution.
SDP REQ-480    Integrate Data    The SDP shall be capable of integrating data in time and/or frequency.
SDP REQ-524    Pulsar Timing Input    SDP shall be capable of receiving pulsar timing data and dynamic spectrum data in accordance with the SDP-CSP Interface Control Document (100-000000-002 and 300-000000-002).
SDP REQ-527    Pulsar Search Data Input    The SDP shall be capable of receiving pulsar periodicity search data in accordance with the SDP-CSP Interface Control Document (100-000000-002 and 300-000000-002).
SDP REQ-529    Pulsar Timing Precision    When provided with a suitable template, signal-to-noise and pulsar parameters, SDP shall be able to measure the arrival time of a pulse with a precision of 5 ns.
SDP REQ-530    Pulsar Timing ToA Determination    SDP shall be capable of determining the time of arrival of a pulse from pulsar timing data.
SDP REQ-532    Single pulse Transient Post Processing    SDP shall be capable of operating in a single pulse transient search mode, concurrently with continuum imaging mode and pulsar search mode and pulsar timing mode, within the same subarray.
SDP REQ-534    Pulsar Timing Data Preparation    SDP shall be capable of performing data pre-processing (adding the sub-integrations from each pulsar together into one data file) on pulsar timing data.
SDP REQ-539    Non-imaging Transient Input    SDP shall be capable of receiving single pulse transient search data in accordance with the SDP-CSP Interface Control Document (100-000000-002 and 300-000000-002).
SDP REQ-542    Pulsar Timing Error Estimation    SDP shall be able to estimate the uncertainty in the arrival time of a pulse to better than 5%.
SDP REQ-543    Pulsar Timing Systematic Error    SDP shall not add more than 5 ns systematic error in the time-of-arrival determination.
SDP REQ-544    Single pulse Transient Alerts    SDP shall provide preliminary alerts for the detection of fast (single pulse) transient events within 10 s of the data containing that event arriving at the SDP.
SDP REQ-546    Single pulse Transient Search Output    SDP shall output a single ranked list of single pulse transient candidates (with durations greater than 50 µs) from each observation.
SDP REQ-558    Pulsar Search Output    SDP shall output a single ranked list of pulsar periodicity candidates from each observation.
SDP REQ-565    Pulsar Timing Model Fitting    SDP shall be capable of fitting a pulsar timing model to pulsar times of arrival.
SDP REQ-640    Single Pulse data preparation performance    While receiving single pulse transient search data the SDP shall prepare the data for processing within 100 milliseconds.
SDP REQ-641    Transient Buffer Receive Mid    The SKA1 MID SDP shall start recording Transient Buffer data no later than 60 seconds from the time that the highest frequency component of a transient signal arrives at the telescope.
SDP REQ-642    Transient Buffer Receive Low    The SKA1 LOW SDP shall start recording Transient Buffer data no later than 900 seconds from the time that the highest frequency component of a transient signal arrives at the telescope.
SDP REQ-643    Transient Buffer Receive    The SDP shall receive Transient Buffer data from the CSP for the purpose of archiving the transient buffer data.
SDP REQ-644    Pulsar timing compute performance    When performing pulsar timing the SDP shall have at least sufficient performance to execute an algorithm of comparable complexity to using PSRCHIVE (for processing PSRFITS fits files and producing pulsar arrival times) and TEMPO2 (for computing time residuals and updating timing models).
SDP REQ-645    Pulsar timing quantity    When performing pulsar timing processing the SDP shall be able to process data from 16 pulsars concurrently with SKA1 MID constrained to a net, on sky, bandwidth of 20 GHz per polarisation.
SDP REQ-646    Single Pulse search compute performance    When performing single pulse transient search the SDP shall have at least sufficient performance to execute an algorithm of comparable complexity to using Pulsar Feature Lab (for heuristics), Gaussian Hellinger Very Fast Decision Tree (classification) and Sigproc Gtools (TBC-043) (for coincidence tests).
SDP REQ-647    Single pulse reception rate    While performing single pulse transient search the SDP shall be able to receive one candidate per beam every 1 second (TBC-044).
SDP REQ-648    Pulsar search compute performance    When performing pulsar search the SDP shall have at least sufficient performance to execute an algorithm of comparable complexity to using Pulsar Feature Lab (for heuristics), Gaussian Hellinger Very Fast Decision Tree (classification) and Sigproc Gtools (TBC-045) (for coincidence tests).
SDP REQ-649    Pulsar search performance    While performing pulsar search the SDP shall be able to process a maximum of 1000 candidates per beam.
SDP REQ-653    Flag invalid data    The SDP shall flag invalid data (NaN or Inf) and data invalid according to metadata.
SDP REQ-706    Delivery latency    The SDP shall start delivering any science data product, regardless of physical location, within 10 minutes (for a 1 TB science product) (TBC-077) of receiving a retrieval request for a science data product.
SDP REQ-722    TM command acknowledgement latency    The SDP shall acknowledge receipt of commands from TM within 1 s.
SDP REQ-731    Science events    The SDP shall send events to the TM for the following activities: detection of an imaging transient; detection of a single pulse transient.
SDP REQ-763    SDP Critical failure identification    The SDP shall identify more than 99% of all critical failures and report them to TM.
SDP REQ-764    SDP Isolation of critical failures    The SDP shall isolate 95% of all critical failures and report them to TM.
SDP REQ-786    Dynamic Spectrum data product    The SDP when commanded shall receive and store a high time resolution dynamic spectrum data product (time-frequency-polarisation).
SDP REQ-787    Dynamic spectrum sub-array support    The SDP, when configured in dynamic spectrum mode, shall receive and store dynamic spectrum mode data for a total of up to 16 dual polarisation beams (with SKA1 Mid constrained to a net, on sky, bandwidth of 20 GHz per polarisation) from one to sixteen subarrays, independently and concurrently.
SDP REQ-807    Dynamic Spectrum Mode Data Preparation    SDP shall perform data pre-processing (aggregating sub-integrations from a scan into a single file) for dynamic spectrum mode data for SKA1 Low and SKA1 Mid.
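As an illustration of SDP REQ-653 in the table above (flagging NaN or Inf values and data marked invalid by metadata), a minimal sketch follows. It assumes samples arrive as a flat sequence of floats and that the metadata supplies a set of invalid sample indices; the function name and both parameters are hypothetical and do not correspond to any SDP interface.

```python
import math


def flag_invalid(samples, metadata_invalid=None):
    """Return one boolean per sample, True where the sample should be flagged.

    ASSUMPTION: `metadata_invalid` is a collection of sample indices that the
    metadata declares invalid. This layout is illustrative only.
    """
    bad_indices = set(metadata_invalid or ())
    return [
        (not math.isfinite(value)) or (index in bad_indices)
        for index, value in enumerate(samples)
    ]


# Sample 1 is NaN, sample 2 is Inf, sample 3 is marked invalid by metadata.
flags = flag_invalid([1.0, float("nan"), float("inf"), 2.0], metadata_invalid={3})
print(flags)
```

Flags are returned alongside the data and no values are discarded, so any later excision step can decide how to treat the flagged samples.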