A Large-Scale Industrial Case Study on Architecture-based Software Reliability Analysis

© ABB Group April 9, 2023


Heiko Koziolek, Bastian Schlich, Carlos Bilich, ABB Corporate Research, 2010-11-01

Architecture-based Software Reliability Analysis (ABSRA)

What?

Typical questions of software architects concerning reliability

“What is the reliability (probability of failure-free operation) of my system?”

“How do individual components contribute to system reliability?”

“Which architectural alternative is best for reliability?”

“Where should I introduce fault-tolerance mechanisms?”

“How should I distribute my limited testing effort among components?”

Additional questions from ABB

“How much more reliable is a new architecture than its predecessor?”

“Does ABSRA work on large-scale systems?”


Architecture-based Software Reliability Analysis (ABSRA)

How?


[Figure: software components, control flow, and component reliabilities (R=0.995, R=0.982, R=0.937) are combined and transformed into a Markov model; solving the model yields the predicted system reliability R = 0.9923, which feeds back into improving the architecture.]

Related work: existing empirical studies


“… very little effort has been devoted to the validation of architecture-based software reliability techniques.”

[Gokhale2007, IEEE Transactions on Dependable and Secure Computing, Vol. 4, No. 1]

Source                      Name     Year  Lang.  LOC          # Components
[Gokhale2004, Perf. Eval.]  SHARPE   1998  C      35,000       30
[Goseva2001, ISSRE]         ESA      2001  C      10,000       3
[Goseva2005, ISSRE]         GCC      2005  C      350,000      13
[Wang2005, JSS]             SMS      2006  C/C++  13,000       15
[Goseva2006, ISSRE]         IDN      2006  C      11,000       6
Our paper                   ABB      2010  C++    >3,000,000   8 (>100)

System under study: Process control system


System under study: Process control system (topology)


[Figure: system topology showing the plant/office network, network isolation device, firewall, Internet, remote workplaces, redundant network, workplaces, controllers, servers, fieldbus, and remote I/O and field devices]

System under study: Process control system (subsystems within the servers)


Which steps are required for ABSRA?

Estimate component failure probabilities

Estimate transition probabilities

Construct the Markov model

Exploit the results


Estimate component failure probabilities: existing methods

Code metrics [Nagappan2006]

• Validity debated

Reliability growth modeling [IEEE Std 1633-2008]

• Requires component failure reports

Random/statistical testing [Miller1992]

• Does not scale; difficult to apply to individual components

Fault injection [Gokhale2004]

• Does not determine the current reliability

Explicit failure modeling [Cheung2008]

• Accuracy unknown


Reliability growth modeling: general principle


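Littlewood/Verrall model: the time between the (i-1)-th and i-th failure is exponentially distributed with rate λ_i, and λ_i itself is gamma-distributed with shape α and parameter ψ(i):

$$g(\lambda_i, \alpha, \psi(i)) = \frac{[\psi(i)]^{\alpha}\,\lambda_i^{\alpha-1}\,\exp\!\big(-\psi(i)\,\lambda_i\big)}{\Gamma(\alpha)}, \qquad \lambda_i > 0$$

An increasing ψ(i) lowers the expected failure rate α/ψ(i), which is how the model expresses reliability growth.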
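To make the model concrete, here is a minimal Python sketch. In the Littlewood/Verrall model the rate λ_i of the exponentially distributed i-th inter-failure time is gamma-distributed with shape α and parameter ψ(i). The sketch assumes the linear form ψ(i) = β0 + β1·i (one form used with this model; the slides do not state which form was fitted) and uses made-up parameter values. Averaging the exponential survival probability exp(-λ_i·t) over the gamma density gives R_i(t) = (ψ(i)/(ψ(i)+t))^α, and for α > 1 the mean time to the i-th failure is ψ(i)/(α-1).

```python
import math


def psi(i: int, beta0: float, beta1: float) -> float:
    """Hypothetical linear form psi(i) = beta0 + beta1 * i."""
    return beta0 + beta1 * i


def gamma_density(lam: float, alpha: float, psi_i: float) -> float:
    """Gamma prior g(lambda_i, alpha, psi(i)) of the i-th failure rate."""
    if lam <= 0.0:
        return 0.0
    return psi_i ** alpha * lam ** (alpha - 1) * math.exp(-psi_i * lam) / math.gamma(alpha)


def reliability(t: float, i: int, alpha: float, beta0: float, beta1: float) -> float:
    """P(T_i > t): exp(-lambda_i * t) averaged over the gamma prior."""
    p = psi(i, beta0, beta1)
    return (p / (p + t)) ** alpha


def mean_time_to_failure(i: int, alpha: float, beta0: float, beta1: float) -> float:
    """E[T_i] = psi(i) / (alpha - 1), finite only for alpha > 1."""
    if alpha <= 1.0:
        return math.inf
    return psi(i, beta0, beta1) / (alpha - 1)


if __name__ == "__main__":
    alpha, beta0, beta1 = 2.5, 10.0, 4.0  # illustrative values, not ABB data
    for i in (1, 10, 50):
        print(i, reliability(24.0, i, alpha, beta0, beta1),
              mean_time_to_failure(i, alpha, beta0, beta1))
```

Because ψ(i) grows with i, later inter-failure times get longer and R_i(t) for a fixed mission time t increases.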

Reliability growth modeling: using the Littlewood/Verrall model on one subsystem


Filtered subsystem bug list and release dates

Curve fitting in CASRE 3.0 (http://www.openchannelsoftware.com/projects/CASRE_3.0/)
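The slides fit the model with CASRE. The sketch below is not CASRE's algorithm; it is a crude stand-in that maximizes the Littlewood/Verrall likelihood over a parameter grid. It assumes the bug list has already been converted into inter-failure times in hours and assumes linear ψ(i) = β0 + β1·i. Integrating the exponential density over the gamma prior gives each inter-failure time the marginal density α·ψ(i)^α / (t_i + ψ(i))^(α+1). The data and grid values are invented for illustration.

```python
import itertools
import math

# Hypothetical inter-failure times in hours (invented, not ABB data).
INTERFAILURE_HOURS = [3.0, 5.0, 4.0, 9.0, 12.0, 10.0, 18.0, 25.0, 22.0, 40.0]
ALPHAS = [0.25 * k for k in range(2, 41)]  # 0.5 .. 10.0
BETA0S = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0]
BETA1S = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]


def lv_log_likelihood(times, alpha, beta0, beta1):
    """Log-likelihood of inter-failure times under Littlewood/Verrall with
    psi(i) = beta0 + beta1 * i; each t_i has density
    alpha * psi(i)**alpha / (t_i + psi(i))**(alpha + 1)."""
    if alpha <= 0.0 or beta0 <= 0.0 or beta1 < 0.0:
        return -math.inf
    ll = 0.0
    for i, t in enumerate(times, start=1):
        p = beta0 + beta1 * i
        ll += math.log(alpha) + alpha * math.log(p) - (alpha + 1) * math.log(t + p)
    return ll


def fit_grid(times, alphas, beta0s, beta1s):
    """Return the grid point (alpha, beta0, beta1) with the highest likelihood."""
    best = max(itertools.product(alphas, beta0s, beta1s),
               key=lambda params: lv_log_likelihood(times, *params))
    return best, lv_log_likelihood(times, *best)


if __name__ == "__main__":
    (alpha, beta0, beta1), ll = fit_grid(INTERFAILURE_HOURS, ALPHAS, BETA0S, BETA1S)
    print(f"alpha={alpha}, beta0={beta0}, beta1={beta1}, log-likelihood={ll:.3f}")
```

A real fit would use a numerical optimizer and goodness-of-fit checks; the grid only keeps the example short and free of dependencies.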

Reliability growth modeling: result

R1 = ...

R2 = ...

R3 = ...

R4 = ...

R5 = ...

R6 = ...

R7 = ...

R8 = ...







Estimate component transition probabilities: existing methods

Exploiting design documents [Gokhale2007]

• Only static dependencies in SW architecture

Profiling [Goseva2005]

• Complicated filtering of data required

Manual code instrumentation

• Can be time-consuming


Estimate component transition probabilities: profiling with proprietary tools


Set up and ran the system

Example trace from profiling

Self-coded script
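The self-coded script that turns profiling traces into transition probabilities is not described further. The sketch below assumes a hypothetical subsystem-level trace format with one "Caller -> Callee" pair per line, and estimates P[i][j] as the fraction of i's outgoing transitions that go to j. The subsystem names are made up.

```python
from collections import Counter, defaultdict


def parse_trace(lines):
    """Yield (caller, callee) pairs from lines of the hypothetical form
    'CallerSubsystem -> CalleeSubsystem'; blank lines and '#' comments are skipped."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        caller, callee = (part.strip() for part in line.split("->"))
        yield caller, callee


def transition_probabilities(pairs):
    """Estimate P[i][j] = count(i -> j) / count(i -> any)."""
    counts = defaultdict(Counter)
    for caller, callee in pairs:
        counts[caller][callee] += 1
    return {
        caller: {callee: n / sum(targets.values()) for callee, n in targets.items()}
        for caller, targets in counts.items()
    }


if __name__ == "__main__":
    trace = """
    # hypothetical subsystem-level trace
    Workplace -> Server
    Server -> Historian
    Server -> Controller
    Server -> Historian
    Historian -> Server
    """.splitlines()
    probabilities = transition_probabilities(parse_trace(trace))
    print(probabilities["Server"])  # Historian ~0.667, Controller ~0.333
```

Each row of the result sums to 1 and can serve as one row of a Markov model's transition matrix; with real profiler output, most of the effort goes into filtering the trace down to meaningful subsystem-level transitions.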







Construct the Markov model: existing state-based methods

[Littlewood1979]

[Cheung1980]

[Laprie1984]

[Kubat1989]

[Gokhale1998]

[Ledoux1999]

[Gokhale1998-2]


[Goseva-Popstojanova2001]

Cheung model: adding failure and end states, computing reliability


[Cheung1980]
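Cheung's model treats control flow as a discrete-time Markov chain over components: control moves from component i to component j with probability P[i][j], but only if i executes correctly (probability R_i), so Q[i][j] = R_i·P[i][j]. With a single entry component and an exit component n, the system reliability is R = S[1][n]·R_n with S = (I - Q)^(-1); this equals the probability of absorbing in the end state rather than the failure state. A minimal pure-Python sketch follows, reusing the three component reliabilities from the "How?" slide with invented transition probabilities, so the number it prints says nothing about the system on the slides. Indices are 0-based: component 0 is the entry, the last component is the exit.

```python
RELIABILITIES = [0.995, 0.982, 0.937]  # component reliabilities from the "How?" slide
TRANSITIONS = [  # hypothetical P[i][j]; component 0 = entry, component 2 = exit
    [0.0, 0.6, 0.4],
    [0.3, 0.0, 0.7],
    [0.0, 0.0, 0.0],  # the exit component has no outgoing transitions
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def cheung_reliability(rel, p):
    """R = S[0][n-1] * R_{n-1} with S = (I - Q)^-1 and Q[i][j] = R_i * P[i][j]."""
    n = len(rel)
    i_minus_q = [[(1.0 if i == j else 0.0) - rel[i] * p[i][j] for j in range(n)]
                 for i in range(n)]
    # Solving (I - Q) x = e_{n-1} yields the last column of S, so x[0] = S[0][n-1].
    x = solve(i_minus_q, [0.0] * (n - 1) + [1.0])
    return x[0] * rel[n - 1]


if __name__ == "__main__":
    print(f"predicted system reliability: {cheung_reliability(RELIABILITIES, TRANSITIONS):.4f}")
```

For a purely sequential chain (each component always calls the next) the result reduces to the product of the component reliabilities; loops and branches weight each component by how often it is visited.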







Exploit the results: possibilities

Estimate system reliability [Cheung1980]

• Hard to validate against the reliability experienced by customers

Conduct sensitivity analysis [Gokhale2002]

• Study system reliability for varying component failure rates

Assess costs of bugs [Cheung1980]

• Quantify the effect of an error in a component

Evaluate design alternatives [Goseva2001]

• Values for new components need to be guessed

Allocate test budgets efficiently [Pietrantuono2010]

• Test critical components more often


Sensitivity analysis: impact of varying subsystem failure rates


PRISM model checker: http://www.prismmodelchecker.org/
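The sketch below does not use PRISM. It computes Cheung-style system reliability by fixed-point iteration h = Q·h + b, where h_i is the probability of reaching successful termination from component i, Q[i][j] = R_i·P[i][j], and b is nonzero only at the exit component. It then sweeps one component's failure probability while keeping the others fixed. All component reliabilities and transition probabilities are hypothetical.

```python
RELIABILITIES = [0.995, 0.982, 0.937]  # hypothetical component reliabilities
TRANSITIONS = [  # hypothetical P[i][j]; component 0 = entry, component 2 = exit
    [0.0, 0.6, 0.4],
    [0.3, 0.0, 0.7],
    [0.0, 0.0, 0.0],  # the exit row must be all zeros
]


def system_reliability(rel, p, tol=1e-12, max_iter=100_000):
    """Probability of reaching successful termination from component 0."""
    n = len(rel)
    h = [0.0] * n
    for _ in range(max_iter):
        new = [
            rel[i] * sum(p[i][j] * h[j] for j in range(n))
            + (rel[i] if i == n - 1 else 0.0)  # exit component terminates
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, h)) < tol:
            return new[0]
        h = new
    raise RuntimeError("fixed-point iteration did not converge")


def sweep(rel, p, k, failure_probs):
    """System reliability as component k's failure probability varies."""
    results = []
    for f in failure_probs:
        varied = list(rel)
        varied[k] = 1.0 - f
        results.append((f, system_reliability(varied, p)))
    return results


if __name__ == "__main__":
    for f, r in sweep(RELIABILITIES, TRANSITIONS, 1, [0.0, 0.01, 0.05, 0.1]):
        print(f"failure prob. of component 1 = {f:.2f} -> system R = {r:.4f}")
```

Repeating the sweep for each component shows which one the system reliability is most sensitive to, which is the kind of input a test-budget allocation would use.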

Evaluation: cost estimations in person-hours (best/worst case)


Conclusions: lessons learned

Getting failure and transition probabilities is hard

Time consuming, error-prone, limited automation

Main obstacle for ABSRA is data collection

Currently rather simple models

Technologies, concurrency, and hardware are not modeled

Difficult to evaluate architecture alternatives

Limited decision support from the predictions

Lack of empirical studies in literature

Predominantly small systems

Often dubious techniques for estimating failure rates

Replicated case studies needed

