Fault diagnosis of SPC switching systems based on structure and signalling


C.R. Shashidhar and F.P. Coakley

C.R. Shashidhar is, and F.P. Coakley was formerly, with the Department of Electrical Engineering, University of Essex, Wivenhoe Park, Colchester, Essex CO4 3SQ, England. F.P. Coakley is now with the Department of Electronic & Electrical Engineering, University of Surrey, Guildford GU2 5XH, England.

Abstract: The paper presents a new approach to fault diagnosis of stored program controlled (SPC) switching systems. It uses the functional behaviour of SPC systems to model faults. The method uses circuit and structure descriptions and call processing programs to generate automatically the data required for fault analysis, and it avoids fault simulation.

1 Introduction

Fault diagnosis of stored program controlled (SPC) switching systems assumes greater importance as the size and complexity of such systems grow. Despite advances in technology, components are bound to fail. Statistical observations have shown that around 40-50% of failure modes in SPC systems are due to component failures [1]. To prevent system degradation due to component failures it is necessary to locate and repair faults.

Many methods have been adopted to localise a hardware failure in SPC systems to a field replaceable unit (FRU) [2]. Traditional methods use various forms of fault dictionaries to isolate the fault. However, many of these methods use fault simulation techniques to generate the dictionaries, which require considerable computation. Fault dictionaries rely on the single 'stuck-at' fault model. Construction of fault dictionaries with multiple-fault situations is generally computationally impracticable. Even if the fault dictionary is constructed, it has to be updated many times owing to design changes. Also, only a small percentage of pre-computed faults may be observed in practice, and therefore considerable effort will be wasted in producing such a large dictionary.

The approach of Abramovici and Menon [3] avoids fault simulation but requires true-value simulation and has to be ruled out for complex systems such as SPC. The approach of Chu [4] uses the connectivity of circuits to derive fault dictionaries. However, this method does not make use of fault modelling based on system behaviour and is thus system dependent. Recently, many papers [5-7] have proposed artificial intelligence techniques for fault diagnosis. These methods use fault-symptom pairs for fault isolation, knowledge of which has to be acquired from the past experience of practitioners and experts. This approach will be unsuitable if each system is different and undergoes continuous modifications as in SPC systems.

In contrast to the dictionary approach, the functional testing approach is adopted by some switching systems [8, 9]. The strategy in this case is to adopt functional tests for each printed circuit board (PCB). A sequence of tests is conducted and faulty cards are isolated based on the results of these tests. In practice, it is difficult to associate tests with a particular PCB. It is more natural to associate a test with a function. Hardware modifications will inevitably lead to the design of new functional tests. The new tests have to be designed not only for the PCB which has been modified, but also for all PCBs which are interconnected to a modified PCB.

In this paper we present another approach to fault diagnosis that avoids simulation. It uses the hierarchical structure of SPC systems and their standardised signalling protocol. In contrast to other methods it has the following advantages:

(i) It uses fault modelling based on the signalling protocol and is more general than single stuck-at fault modelling.

(ii) It uses system structure as a parameter, and changes in system hardware can be incorporated without much effort.

(iii) Most of the data required for this method can be generated from documents such as circuit component lists and call processing programs.

This method attempts to use the behaviour of the system as seen by the subscriber to diagnose the faults. It does not require any specialised test equipment. It is suitable for small exchanges and private branch exchanges and can be easily implemented by a small diagnosis co-processor attached to the exchange.

Fig. 1 SPC system structure (blocks: control processor, switch network, interface circuits, lines to subscribers)

2 SPC system structure

A typical switching system block schematic diagram is shown in Fig. 1. It comprises a control processor, switching network and interface circuits. The central processor can be either a single or multiple processor, depending on exchange configuration and design. A digital switching network is now standard and is a combination of time and space switches. A single SPC system controls many subscribers, ranging from 10 to 100 000. Because of reliability and availability considerations many hardware units are duplicated, and the system will be structured such that a single failure affects a limited number of subscribers. The hierarchical structure of the exchange is shown in Fig. 2. At the top of the hierarchy are call control processors, which are assisted by slave processors for performing routine functions such as scanning, signalling and switching. These in turn drive buffers/multiplexers and line interface circuits.

The tree structure is inherent in the interface portions of the exchanges even when complex redundancy schemes are incorporated. The redundancy scheme is not implemented down to individual subscriber level because of cost considerations. Therefore groups of subscribers have to be provided with hardware elements which are in turn duplicated.

Although many types of architecture have been implemented, the protocol between the subscribers and the system has remained standard. This is depicted in Table 1.

Table 1: Signalling protocol

Subscriber actions           System responses
Lift handset                 dial tone
Dial digits                  busy tone/engage tone
                             ring back tone/NU tone
                             ringing to called subscriber
Called subscriber responds   set up speech path
Release                      remove/clear power or set up paths

The following characteristics distinguish SPC systems from other digital systems:

(i) Continuous design modifications and updates are implemented throughout the life cycle of the system.

(ii) Many FRUs are identical, nondigital and use LSI and VLSI components.

(iii) As in LSI/VLSI, few observation points are available and their inclusion is dictated by design and cost considerations.

3 Fault modelling for SPC systems

Conventional test generation techniques have modelled faults as 'stuck-at' fault situations. Recently, faults in a microprocessor have been successfully modelled at instruction level [10]. However, for systems such as SPC there seems to be no fault modelling that reflects the general behaviour. We propose the following fault modelling for the system based on signalling protocol (for our modelling we consider a signal as either a tone or speech or any voltage/current change as seen by a subscriber or a terminal):

(i) type 1: absence of a particular signal to a subscriber, for example absence of ring back, dial tone, NU tone etc.

(ii) type 2: presence of an additional signal along with the expected signal, for example speech along with tones, overheard conversation along with the expected one etc.

(iii) type 3: presence of an unexpected signal instead of the expected signal, for example hearing ring back tone instead of dial tone etc.

(iv) type 4: presence of a distorted signal.

Fig. 2 Hierarchical structure (levels: maintenance control, call control processors, scan, buffers/multiplexers, interface circuits)

Since the intended behaviour of the system is characterised by the relation of signals and subscriber it is reasonable to assume that the above set of fault models is general. The observations for the above fault situations can be obtained either from a test call generator, available in most exchanges, or from subscriber complaints.
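The four fault types lend themselves to a small classifier. The following is a minimal sketch, not part of the original method: the names `FaultType` and `classify`, and the representation of an observation as a set of signal names, are assumptions made for illustration. Distortion is passed in as a flag because it cannot be inferred from signal names alone.

```python
from enum import Enum


class FaultType(Enum):
    ABSENT = 1      # type 1: expected signal missing
    ADDITIONAL = 2  # type 2: extra signal alongside the expected one
    UNEXPECTED = 3  # type 3: another signal instead of the expected one
    DISTORTED = 4   # type 4: expected signal present but distorted


def classify(expected, observed, distorted=False):
    """Return the FaultType for one observation, or None if no fault is seen.

    expected  -- name of the signal the subscriber should receive
    observed  -- collection of signal names actually received
    distorted -- True if the expected signal arrived but was distorted
    """
    observed = set(observed)
    extras = observed - {expected}
    if expected not in observed:
        # Nothing at all is type 1; something else in its place is type 3.
        return FaultType.UNEXPECTED if extras else FaultType.ABSENT
    if extras:
        return FaultType.ADDITIONAL
    return FaultType.DISTORTED if distorted else None
```

For example, `classify("dial tone", {"ring back tone"})` yields `FaultType.UNEXPECTED`, matching the type-3 example in the list above.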

For analysis purposes it is assumed that a hard core, typically a maintenance processor, is functional and that there exists only one component failure at a time. The design of SPC systems is based on the presence of a single fault that can be repaired within a specified time, during which another component failure is generally not expected. Since the method is aimed at locating component failures, repeatable test conditions can be obtained.

4 System descriptions

It is difficult to represent the entire system behaviour at a single level. Structure, connectivity and functional descriptions are required to analyse the behaviour of the system.

4.1 Structure description

For any diagnosis to be useful it is necessary to identify the fault to an FRU. Thus it is necessary to have data regarding the FRUs in a system and their interrelations. In SPC systems it is useful to have FRUs related to the subscribers they control directly or indirectly. This information can be derived from system configuration data and is presented in Table 2 for a typical exchange. This information is utilised by a fault analyser routine when the faults mentioned in Section 3 are referred to a group of subscribers.

Table 2: Structure description

A typical structure: Monarch-120

Card name               Number of cards of this type   Number of subscribers this card controls
CPU                     1                              120
Signalling input unit   1                              120
Shelf multiplex         5                              32
Line interface unit     20                             4
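One way a fault analyser might use the structure description is sketched below. This is an illustrative assumption rather than the authors' routine: the names `STRUCTURE` and `candidate_card_types` are hypothetical, and the rule applied (under the single-fault hypothesis, a fault seen by n subscribers must lie in a card that controls at least n subscribers) is one reading of how the table is used.

```python
# Card name -> (number of cards of this type, subscribers each card controls),
# with the values of Table 2.
STRUCTURE = {
    "CPU": (1, 120),
    "Signalling input unit": (1, 120),
    "Shelf multiplex": (5, 32),
    "Line interface unit": (20, 4),
}


def candidate_card_types(affected_subscribers, structure=STRUCTURE):
    """Card types that control at least `affected_subscribers` subscribers.

    Assumption: with only one faulty FRU, a fault observed by a group of
    n subscribers can only lie in a card that controls n or more of them.
    """
    return [name for name, (_, controlled) in structure.items()
            if controlled >= affected_subscribers]
```

With these figures, a fault reported by a group of 20 subscribers leaves the CPU, the signalling input unit and the shelf multiplex as candidate card types, while a fault seen by one subscriber leaves all four.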

Fig. 3 Functional description of circuit schematic diagram (blocks: address buffer, data buffer, decoder, memory, PISO, counter, clock; signals: cs, memory in, memory out, t1, t2, out)

Description:

data -> memory_in when (cs = 0) and (address = xxxx)
memory_out -> PISO_in when (t2+)
PISO_out -> out when (t1+)

4.2 Functional description

The complete behaviour of any hardware unit is due to the interaction of various signals and components. The functional behaviour can be described by using these signals and their timings in relation to their associated signals. A typical circuit block schematic diagram and its functional description are shown in Fig. 3. The circuit converts the data written into its memory from parallel to serial format.
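A functional description of this kind can be read mechanically. The sketch below assumes each rule has the textual form `source -> destination when condition`, which is our reading of the description accompanying Fig. 3; the names `RULE`, `parse_rules` and `FIG3` are hypothetical.

```python
import re

# Assumed rule syntax: "<source> -> <destination> when <condition>"
RULE = re.compile(r"^\s*(\w+)\s*->\s*(\w+)\s+when\s+(.+?)\s*$")


def parse_rules(text):
    """Parse rule lines into (source, destination, condition) tuples."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        match = RULE.match(line)
        if match is None:
            raise ValueError(f"unrecognised rule: {line!r}")
        rules.append(match.groups())
    return rules


# The three rules of Fig. 3, as we read them.
FIG3 = """
data -> memory_in when (cs = 0) and (address = xxxx)
memory_out -> PISO_in when (t2+)
PISO_out -> out when (t1+)
"""

rules = parse_rules(FIG3)
```

Each tuple records which signal drives which, and under what condition, which is the information the fault analysis needs in order to follow a signal through a card.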

Table 3: Connectivity description

The table contains the following entries:
name of the card
identifier to signify the beginning of the list of components
component name: list of signal names connected to this component
identifier to signify the end of data for this card

card1
ICS
Ic1: sig1, sig2, sig3
Ic2: sig10, sig4
end

card2
ICS
Ic1: sig1, sig5
Ic2: sig3, sig5, sig11
Ic3: sig2, sig12, sig3
end

4.3 Connectivity description

FRUs are composed of different components physically connected. The interconnection data are useful for diagnosis since the components corresponding to a signal can be derived from such details. A typical data format for representing this knowledge is shown in Table 3.
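The connectivity format is simple enough to parse directly and to invert into a signal-to-component index. The following sketch is illustrative only: the function names are hypothetical, and the begin marker `ICS` and the component names `Ic1`, `Ic2`, `Ic3` are our reading of the entries in Table 3.

```python
def parse_connectivity(text, begin_marker="ICS", end_marker="end"):
    """Return {card: {component: [signals]}} from a Table 3 style listing."""
    cards, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == begin_marker:
            continue  # blank line or start-of-component-list marker
        if line == end_marker:
            current = None  # end of data for this card
        elif current is None:
            current = cards.setdefault(line, {})  # a new card name
        else:
            name, _, signals = line.partition(":")
            current[name.strip()] = [s.strip() for s in signals.split(",")
                                     if s.strip()]
    return cards


def components_by_signal(cards):
    """Invert the listing: signal -> set of (card, component) pairs."""
    index = {}
    for card, components in cards.items():
        for component, signals in components.items():
            for signal in signals:
                index.setdefault(signal, set()).add((card, component))
    return index


TABLE3 = """
card1
ICS
Ic1: sig1, sig2, sig3
Ic2: sig10, sig4
end
card2
ICS
Ic1: sig1, sig5
Ic2: sig3, sig5, sig11
Ic3: sig2, sig12, sig3
end
"""

cards = parse_connectivity(TABLE3)
index = components_by_signal(cards)
```

Here `index["sig5"]` names the two components on card2 that touch that signal, which is the lookup the diagnosis relies on.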

5 Fault analysis method

Each FRU in an SPC system controls one or more subscribers, directly or indirectly. For example, the line interface circuits directly control the subscriber behaviour, whereas the scanning and signalling processor assists the subscriber to set up a connection. Therefore an FRU can be split into different hardware sections:

(i) a section that controls a single subscriber (SCSS)

(ii) a section that controls a group of subscribers (SCGS).

The components belonging to the above sections can be found from the connectivity and functional descriptions and from the part of the call processing programs that interfaces with hardware units. The call control program supplies information about the signals that affect subscribers (these are address and data values). A simple algorithm finds all the signals that affect the subscriber from the functional description and computes the related components from the connectivity descriptions of the cards. By recursively using the above algorithm for groups of subscribers, the components belonging to the above sections can be derived. A similar procedure is adopted to compute the list of components responsible for generating any signal.
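The paper does not spell out the search algorithm, so the following is only a sketch of one plausible form: a breadth-first backward trace over signal dependencies followed by a lookup in the connectivity data. The inputs `depends_on` (signal to the signals it is derived from) and `connectivity` (component to attached signals), and all example names, are assumptions.

```python
from collections import deque


def contributing_signals(target, depends_on):
    """Backward trace: the target plus every signal it is derived from."""
    seen, queue = {target}, deque([target])
    while queue:
        signal = queue.popleft()
        for source in depends_on.get(signal, ()):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


def components_for_signal(target, depends_on, connectivity):
    """Components attached to any signal on the path that produces `target`."""
    signals = contributing_signals(target, depends_on)
    return {component for component, attached in connectivity.items()
            if signals & set(attached)}


# Hypothetical data loosely modelled on the parallel-to-serial circuit.
depends_on = {
    "out": {"piso_out"},
    "piso_out": {"memory_out", "t1"},
    "memory_out": {"data", "cs"},
}
connectivity = {
    "memory": ["data", "cs", "memory_out"],
    "piso": ["memory_out", "t1", "piso_out"],
    "decoder": ["address", "cs"],
    "lamp_driver": ["lamp"],
}

suspects = components_for_signal("out", depends_on, connectivity)
```

In this example the lamp driver is excluded because none of its signals lies on the path to `out`, while the decoder is included through the shared `cs` signal.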

5.1 Type-1 faults (absence of a signal)

For a given signal, the signal path (the list of FRUs responsible for generating that signal) is found from the connectivity descriptions using the methods mentioned above. Based on the assumption that at any time only a single FRU can be faulty, the fault can affect either:

(i) a group of subscribers

(ii) a single subscriber.

In case (i) we propose that the fault should be only in a part that controls a group of subscribers. Any other proposal will contradict our basic assumption of a single faulty FRU and the observation that the fault has affected the service to a group of subscribers. In case (ii) we ascribe the fault to the list of components that are responsible for generating that particular signal for a single subscriber.

Thus we assume that it is possible to obtain the data corresponding to each subscriber. These can be derived from a test line card which has access to all the subscriber interface circuits, or by observation points provided in the line interface circuits. These methods fail if these types of feedback are not provided to the maintenance processor. Such cases may be few; if they occur the designer would be advised to include the necessary feedback.
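The type-1 rule amounts to choosing which section of each FRU on the signal path to suspect. A minimal sketch follows, assuming the signal path is already available as a mapping from FRU to its SCGS and SCSS component sets; the function name and the example component names are hypothetical.

```python
def type1_suspects(signal_path, affects_group):
    """Suspect components for an absent signal, per FRU.

    signal_path   -- {fru: {"SCGS": components, "SCSS": components}} for the
                     FRUs that generate the missing signal
    affects_group -- True if a group of subscribers lost the signal,
                     False if only one subscriber did
    """
    section = "SCGS" if affects_group else "SCSS"
    return {fru: set(parts[section])
            for fru, parts in signal_path.items() if parts[section]}


# Hypothetical signal path for dial tone.
path = {
    "Line interface unit": {"SCGS": {"codec_clock"}, "SCSS": {"slic_3"}},
    "Shelf multiplex": {"SCGS": {"mux"}, "SCSS": set()},
}

group_suspects = type1_suspects(path, affects_group=True)
single_suspects = type1_suspects(path, affects_group=False)
```

FRUs with nothing in the selected section drop out of the result, so a single-subscriber fault here points only at the line interface unit.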


5.2 Type-2 faults (presence of an additional signal A along with the expected signal E)

The reasons for the fault could be:

(i) a wrong signal A being sent out

(ii) a correct signal E sent along with it.

An analysis for signal A similar to a type-1 fault would generate a list of suspected components. However, case (ii) suggests that all components concerned with signal E might be correct, but that the components not common to both the signals could still be faulty. Therefore the final list of suspected components consists of suspected faulty components from signal A which are not in the correct list of signal E.
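In set terms the type-2 rule is a difference. The sketch below uses hypothetical component names and assumes the component lists for signals A and E have already been computed.

```python
def type2_suspects(components_a, components_e):
    """Components behind the additional signal A that are not used by E.

    Components shared with the correctly received signal E are taken to be
    fault-free, so only the remainder of A's list is suspected.
    """
    return set(components_a) - set(components_e)


# Hypothetical lists: speech is overheard (A) along with dial tone (E).
speech_path = {"switch_matrix", "codec", "mux"}
dial_tone_path = {"tone_generator", "codec", "mux"}

suspects = type2_suspects(speech_path, dial_tone_path)
```

Here only the switch matrix remains suspect, since the codec and the multiplexer also carry the dial tone that arrived correctly.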

5.3 Type-3 faults (presence of an unexpected signal U and absence of an expected signal E)

The analysis is similar to type-1 faults, but with both signal E and signal U being wrong. Thus the final list is the union of the lists generated by both the signals.
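The type-3 rule is the corresponding union. As before this is a sketch with hypothetical component names, assuming the two component lists are already known.

```python
def type3_suspects(components_e, components_u):
    """Union of the components behind the missing signal E and the
    unexpected signal U, since both signals are wrong."""
    return set(components_e) | set(components_u)


# Hypothetical lists: ring back tone (U) is heard instead of dial tone (E).
dial_tone_path = {"dial_tone_generator", "tone_selector"}
ring_back_path = {"ring_back_generator", "tone_selector"}

suspects = type3_suspects(dial_tone_path, ring_back_path)
```

Unlike the type-2 case, a shared component such as the tone selector stays in the list, because neither signal is known to be correct.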

5.4 Type-4 faults (distorted signals)

The analysis is similar to type-1 faults with the exception that all the components responsible for controlling the analogue parts of the exchange will be in the suspect list. Since this type of fault is usually observed as speech distortions by the subscriber, it is possible to trace the fault to the data manipulating part rather than the control part.

6 Fault isolation

Fig. 4 Block schematic diagram of the fault isolation scheme (blocks: symptom generator, symptoms, sentence analyser, fault analyser, data generator; data: exchange structure, component connectivity, functional description, call process programs, FRU list, signal path; outputs: list of suspected components, diagnosability figures)

The block schematic diagram of the scheme is shown in Fig. 4. The fault situations are described in short cryptic English sentences. A sentence analyser checks for correct syntax and also interrogates the technician about the attributes of the fault. The attributes describe whether the fault affects a single subscriber or a group of subscribers and also the range of subscribers affected. The fault analyser takes the correct sentences and attributes, and then using the techniques described in preceding Sections generates the list of suspect components. It also accepts any fault-negating sentences that the technician may provide to narrow down the fault lists.
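The narrowing effect of fault-negating sentences can be sketched as repeated set subtraction. This is an assumption about how the fault analyser combines statements, not a description of the authors' program; it treats every component used by a signal reported as working as fault-free, and the names below are hypothetical.

```python
def narrow(suspects, exonerated_sets):
    """Remove components exonerated by signals reported as working.

    suspects        -- components derived from the fault-asserting sentence
    exonerated_sets -- one component set per fault-negating sentence, each
                       holding the components used by a signal that works
    """
    remaining = set(suspects)
    for exonerated in exonerated_sets:
        remaining -= set(exonerated)
    return remaining


# One fault-asserting sentence and two fault-negating sentences.
no_dial_tone = {"tone_generator", "mux", "line_interface"}
ring_back_ok = {"tone_generator"}
speech_ok = {"mux", "codec"}

remaining = narrow(no_dial_tone, [ring_back_ok, speech_ok])
```

In this example three statements reduce the list to the line interface alone.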

As shown in Fig. 4, the data required for analysis are automatically generated by the data generator using the system descriptions. These data are generated off-line and down-loaded into the analysing processor. They consist of a list of FRUs for each telephonic signal, such as ring tone, dial tone etc. For each of these signals the hardware-related call processing programs specify the address and data values used for controlling the signal. A path searching algorithm computes the components that are used when a particular signal is sent to a subscriber. The components associated are derived using the functional description of an FRU and the connectivity of the components. When hardware changes are implemented by a designer, the changes are reflected as changes in connectivity and functional descriptions. These changes are automatically sensed by the data generator program and new lists are down-loaded into the analysing processor.

7 Diagnostic resolution (DR)

The average number of cards that are to be replaced in fault situations can be obtained by giving a list of fault situations to the above program. Two figures of merit have been defined for this purpose:

$$\text{symptom resolution (SR)} = \frac{\text{number of FRUs suspected to be faulty}}{\text{number of FRUs that are involved in the generation of that particular signal}}$$

$$\text{exchange resolution (ER)} = \frac{\text{number of FRUs suspected to be faulty}}{\text{total number of FRUs in the exchange}}$$
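Both figures of merit are plain ratios, as the short sketch below shows. The function names and FRU labels are hypothetical; the values returned are fractions, to be multiplied by 100 for percentages.

```python
def symptom_resolution(suspected_frus, frus_on_signal_path):
    """SR: suspected FRUs as a fraction of the FRUs generating the signal."""
    if not frus_on_signal_path:
        raise ValueError("signal path contains no FRUs")
    return len(suspected_frus) / len(frus_on_signal_path)


def exchange_resolution(suspected_frus, total_frus):
    """ER: suspected FRUs as a fraction of all FRUs in the exchange."""
    if total_frus <= 0:
        raise ValueError("total number of FRUs must be positive")
    return len(suspected_frus) / total_frus


# One suspected FRU on a five-FRU signal path, in an exchange of 25 FRUs.
suspected = {"LIU-7"}
signal_path = {"CPU", "SIU", "MUX-2", "LIU-7", "TONE"}

sr = symptom_resolution(suspected, signal_path)
er = exchange_resolution(suspected, 25)
```

A single suspected FRU in a 25-FRU exchange gives an ER of 1/25, that is 4%, and lower values of both figures mean sharper isolation.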

The above two figures of merit give an indication of the efficiency of the analysis method and the system diagnosibility. Since the signals between the exchange and the subscriber are standard, the fault situations can be enumerated, and performing the analysis for different exchanges would give a measure of the diagnosibility of their structure. SR would also aid the designer to introduce observation points so that SR might be improved to the set standards.

The above procedure was applied to an experimental PABX [11], available in the Department of Electrical Engineering at the University of Essex, which contains a total of 25 FRUs. Table 4 presents some of the results obtained for a particular class of signals. SR and ER both improve with the number of positive assertions. Thus it is possible to diagnose the fault to a single FRU (corresponding to an ER of 4%) if at least three symptom statements (one fault assertive and two fault absent) are given. Also, the SR value increases for faults that affect a group of subscribers. This is because the number of FRUs controlling a group of subscribers increases and there is a strong interrelationship among them. Thus a fault in any one of them could affect a group of subscribers. Therefore, to increase the diagnosibility it is necessary to improve the autonomy of the FRUs or increase the observation points to isolate the fault.

In many cases it is not possible to diagnose faults down to a single FRU, and this is reflected in the SR and ER figures. This is due to the structural deficiencies of the system. System diagnosibility, like testability, has to be taken into account during design phases. SR and ER will highlight such deficiencies. Testability reflects the ease with which an embedded component can be tested. Diagnosibility determines how well a particular fault can be isolated given its associated symptoms. Often a testable component cannot be diagnosed owing to structural deficiencies.

Table 4: Symptom resolution and exchange resolution results

Total number of cards in the exchange = 25

Fault situation                                    Number of fault symptom statements             SR      ER
Type-1 fault affecting a single subscriber         1                                              20%     9%
                                                   2 (one fault assertive and one fault absent)   12%     8%
Type-1 fault affecting a group of 20 subscribers   1                                              23%     9.5%
                                                   3                                              6.5%    4%

8 Computational effort

The effort to produce a list of suspected components depends on the analysis routine and the database generator. It is not possible to give a closed-form expression for the effort. However, the complexity of the analysis routine is $O(N_s (4N_f)^2)$, where $N_s$ refers to the number of symptom sentences in which a fault is described and $N_f$ is the average number of FRUs in a particular telephonic signal path. The factor 4 arises because each FRU is split into four sections: a common part controlling a group of subscribers, an individual part controlling an individual subscriber, and the data (speech) manipulating and control (signal) parts of an exchange.
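As a rough illustration of the analysis bound, read as $O(N_s (4N_f)^2)$, take three symptom sentences and five FRUs on the signal path. These values are chosen only for illustration and are not measurements from the paper:

$$N_s (4N_f)^2 = 3 \times (4 \times 5)^2 = 3 \times 400 = 1200$$

so the on-line analysis would need on the order of a thousand elementary steps, up to a constant factor, which is consistent with the claim that a small co-processor suffices.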

The database can be generated off-line, and the data loaded to the on-line fault analyser processor. The computational complexity depends on the effort involved in producing a list of all ICs for each exchange signal and the effort required to compute the list of FRUs in each signal path together with the list of components in each of the subparts of an FRU. The complexity of the routine is $O(N_t N_f (N_c)^3)$, where $N_t$ is the number of telephonic signals, $N_f$ is the number of FRUs and $N_c$ is the average number of ICs in an FRU. The cube term is due to the path search algorithm used to compute the number of components for a particular signal.

9 Conclusion

A new approach to fault diagnosis has been proposed that uses system structure as a parameter. It is aimed at locating hardware failures. It cannot be claimed to be an exhaustive method capable of resolving all problem cases. System failures occur owing to hardware and software faults. On many occasions a software fault might simulate a hardware fault. The concept of a repairable unit in software does not exist, as software faults are due to design errors. Also, intermittent faults cause difficulties for which no structured approach is available.

The fault diagnosis methods adopted in this paper are well known. However, instead of arbitrary tests, the fault classification used eases analysis and aids automation. The linking of call processing programs, connectivity diagrams and functional descriptions will assist in automating a large part of exchange software. It also provides a standard way of comparing and improving diagnosibility of SPC systems. Therefore we envisage the following advantages:

(i) Since the database required for this method can be generated automatically from circuit descriptions, the maintenance personnel will not be required to know the structure when changes are implemented.

(ii) There will be feedback to the designer and to the user about diagnosibility of the system.

(iii) The proposed method allows the field technician to concentrate on other problems, rather than requiring him to know the intricate structure of the exchange.

10 Acknowledgments

We gratefully acknowledge P.E. Jones for his support and help in many different ways.

11 References

1 DAVIS, E.A., and GILOTH, P.K.: 'No. 4 ESS: performance objectives and service experience', Bell Syst. Tech. J., 1981, 60, pp. 1203-1234

2 KEENE, R.G., LIND, G.R., MILSTEAD, R.M., and ROHN, B.: 'Centralised automatic trouble locating and analysis system'. International Switching Symposium, Montreal, Canada, 1981, Session 13C3

3 ABRAMOVICI, M., MENON, P.R., and MILLER, D.T.: 'Critical path tracing - an alternative to fault simulation'. Proceedings of 20th IEEE/ACM Design Automation Conference, 1983, pp. 214-219

4 CHU, N.N.Y.: 'Generation of circuit pack fault patterns based on diagnostic and circuit connectivity'. IEEE Total Systems Reliability Symposium, 1983, pp. 130-135

5 DAVIS, R., SHROBE, H., and HAMSCHER, W.: 'Diagnosis based on description of structure and function'. AAAI-82, pp. 137-142

6 GENESERETH, M.R.: 'Diagnosis using hierarchical design models'. AAAI-82, pp. 278-283

7 HARTLEY, R.T.: 'CRIB - Computer fault finding through knowledge engineering', IEEE Computer, 1984, 17, pp. 76-84

8 LAGER, J.P., BROCHET, B., and CAUHAPE, M.: 'Fault detection and processing in MT time division-system'. Proceedings of International Switching Symposium, Paris, 1979, pp. 193-200

9 GARWOOD, G.J.: 'Fault diagnosis in path finder SPC exchange'. Proceedings of International Switching Symposium, Paris, 1979, pp. 178-185

10 THATTE, S.M., and ABRAHAM, J.A.: 'Test generation for general microprocessor architectures'. Proceedings of IEEE Fault Tolerant Computing Symposium, 1979, pp. 203-210

11 'The Monarch 120 call connect system'. Part 1/Part 2, British Telecom Technical Description Manuals

Software & Microsystems, Vol. 4, No. 2, April 1985