ftc1.ppt


Transcript of ftc1.ppt

Page 1: ftc1.ppt

DS - IX - NFT - 1

HUMBOLDT-UNIVERSITÄT ZU BERLIN
INSTITUT FÜR INFORMATIK

DEPENDABLE SYSTEMS

Lecture 1

INTRODUCTION

Winter Semester 2000/2001

Instructor: Prof. Dr. Miroslaw Malek

www.informatik.hu-berlin.de/~rok/ftc

Page 2: ftc1.ppt

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:

1. Introduction (Unit I)
– Motivation
– System views
– Dependability rings
– Dependable design methodology

2. Dependability Concepts, Measures and Models (UNIT DCMM)
– Basic definitions
– Dependability measures
– Dependability models
– Examples
– Dependability evaluation tools

3. Testing Techniques (UNIT TT)
– Testing techniques principles
– Processor testing
– Memory testing
– Network testing

Page 3: ftc1.ppt

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:

4. Fault Diagnosis Techniques (UNIT FST)
– Fault detection techniques
– Fault location (isolation) methods

5. Fault Recovery and Tolerance Techniques (UNIT FRTT) (System Level)
– Dynamic techniques
– Static techniques
– Hybrid techniques

6. Fault-tolerant and Fault-secure Memories (UNIT FRTT)
– Fault-tolerant techniques in manufacturing
– Replication
– Coding
– Reconfiguration

Page 4: ftc1.ppt

FAULT-TOLERANT COMPUTING SYSTEMS
Topical Outline:

7. Network Fault Tolerance (UNIT NFT)
– Computer networks
– Basic techniques
– Example: multistage networks

8. Case Studies (UNIT CS)
– ESS and 3B20
– FTMP: Fault-tolerant Multiprocessor
– SIFT: Software-implemented Fault Tolerance
– Communication controller
– Fault-tolerant Building Block Architecture

Page 6: ftc1.ppt

Major References on Fault-tolerant Computing (Books/General) 1

• Chang, H. Y., E. G. Manning and G. Metze, Fault Diagnosis in Digital Systems, Wiley-Interscience, 1970.

• Friedman, A. D. and P. R. Menon, Fault Detection in Digital Circuits, Prentice-Hall, 1971.

• Breuer, M. A. and A.D. Friedman, Diagnosis and Reliable Design of Digital Systems, Computer Science Press, 1976.

• Kraft, G. D. and W. N. Toy, Microprogrammed Control and Reliable Design of Small Computers, Prentice-Hall, 1981.

• Anderson, T. and P. A. Lee, Fault Tolerance: Principles and Practice, Prentice-Hall, 1982.

• Siewiorek, D. P. and R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982 & 1995.

• Lala, P.K., Fault Tolerant and Fault Testable Hardware Design, Prentice-Hall International, 1985.

• Pradhan, D. K. (ed.), Fault Tolerant Computing: Theory and Techniques, Vols. I and II, Prentice-Hall, 1986.

Page 7: ftc1.ppt


Major References on Fault-tolerant Computing (Books/General) 2

• Avizienis, A., H. Kopetz and J. C. Laprie (eds.), The Evolution of Fault-Tolerant Computing, Springer-Verlag, 1987.

• Johnson, B. W., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.

• Negrini, R., M. G. Sami and R. Stefanelli, Fault Tolerance Through Reconfiguration in VLSI and WSI Arrays, MIT Press, 1989.

• Laprie, J. C. (ed.), Dependable Computing and Fault-Tolerant Systems, Vol. 5: Dependability: Basic Concepts and Terminology, Springer-Verlag Wien New York, 1992.

• Landwehr, C. E., B. Randell, L. Simoncini (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 8, Dependable Computing for Critical Applications 3, Springer-Verlag Wien New York, 1993.

• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: System Implementation, Kluwer Academic Publishers, 1994.

• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Paradigms for Dependable Applications, Kluwer Academic Publishers, 1994.

Page 8: ftc1.ppt

Major References on Fault-tolerant Computing (Books/General) 3

• Koob, G. M. and C. G. Lau (eds.), Foundations of Dependable Computing: Models and Frameworks for Dependable Systems, Kluwer Academic Publishers, 1994.

• Malek, M. (ed.), Responsive Computing, Kluwer Academic Publishers, 1994.

• Fussell, D. S. and M. Malek (eds.), Responsive Computer Systems: Steps Toward Fault-Tolerant Real-Time Systems, Kluwer Academic Publishers, 1995.

• Cristian, F., G. Le Lann and T. Lunt (eds.), Dependable Computing and Fault-Tolerant Systems, Vol. 9, Dependable Computing for Critical Applications 4, Springer-Verlag Wien New York, 1995.

• Pradhan, D. K., Fault-Tolerant Computer System Design, Prentice-Hall, 1996.

• Shvartsman, A. A., Fault-Tolerant Parallel Computation, Kluwer, 1997.

• Schneeweiss, W., Die Fehlerbaum-Methode, LiLoLe-Verlag, 1999.

• Montenegro, S., Sichere und fehlertolerante Steuerungen, Hanser, München, 1999.

Page 9: ftc1.ppt

Major References on Fault-tolerant Computing (Books/Reliability Evaluation)

• Myers, G. J., Software Reliability: Principles and Practices, Wiley-Interscience, 1976.

• Trivedi, K. S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.

• Ascher, H. and H. Feingold, Repairable Systems Reliability, Marcel Dekker, 1984.

• Musa, J. D., A. Iannino and K. Okumoto, Software Reliability: Measurement, Prediction, Application, McGraw-Hill, 1987.

• Schneeweiss, W., Petri Nets for Reliability Modeling, LiLoLe-Verlag, 1999.

Page 10: ftc1.ppt

Major References on Fault-tolerant Computing (Books/Coding)

• Sellers, E. F., M. Y. Hsiao and L. W. Bearnson, Error Detecting Logic for Digital Computers, McGraw-Hill, 1968.

• Peterson, W. and E. Weldon, Error-Correcting Codes (2nd ed.), MIT Press, 1972.

• Wakerly, J., Error Detecting Codes, Self-Checking Circuits and Applications, The Computer Science Library, 1978.

• Lin, S. and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, 1983.

• Nagle, H. T., J. D. Irwin and D. Hoffman, Error Detecting and Correcting Codes for Computer Scientist and Engineers, MacMillan Publishers, 1986.

• Rao, T. R. N. and E. Fujiwara, Error-Control Coding for Computer Systems, Prentice-Hall, 1989.

Page 11: ftc1.ppt

Major References on Fault-tolerant Computing (Books/Software)

• Myers, G. J., The Art of Software Testing, Wiley-Interscience, 1979.

• Deutsch, M. D., Software Verification and Validation, Prentice-Hall, 1982.

• Shooman, M. L., Software Engineering, McGraw-Hill, 1983.

• Beizer, B., Software Testing Techniques, Van Nostrand Reinhold, 1983.

• Bernstein, P. A., V. Hadzilacos and N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley, 1987.

• Neufelder, A. M., Ensuring Software Reliability, Marcel Dekker Inc., 1993.

• Lyu, M. R. (ed.), Software Fault Tolerance, John Wiley and Sons, 1995.

• Lyu, M. R. (ed.), Handbook of Software Reliability Engineering, Computer Science Press, 1995.

Page 12: ftc1.ppt

Major References on Fault-tolerant Computing (Journals)

• Special Issue of Proc. of the IEEE, October 1978
• Special Issue of Computer, October 1979
• Special Issue of Computer, March 1980
• Special Issue of Computer, August 1984
• Special Issue of IEEE Software, May 1995
• IEEE Trans. on Reliability
• IEEE Trans. on Software Engineering
• Computer
• Design and Test
• Electronics
• Proc. of the IEEE
• Computer Design
• Journal of Electronic Testing: Theory and Applications
• Journal of Parallel and Distributed Computing
• IEEE Trans. on Parallel and Distributed Systems
• Real-Time Systems Journal

Page 13: ftc1.ppt


Major References on Fault-tolerant Computing (Conference Proceedings)

• Fault-Tolerant Computing Symposium

• Reliability and Maintainability Symposium

• Reliability in Distributed Software and Database Systems Symposium

• Test Conference

• Distributed Computing Systems Conference

• Parallel Processing Conference

• Real-Time Systems Symposium

• Computer Architecture Symposium

Page 14: ftc1.ppt

INTRODUCTION

• OBJECTIVES:
– TO MOTIVATE THE NEED FOR FAULT-TOLERANT SYSTEMS
– TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS AND THEIR RELATIONS TO COMPUTER SYSTEM DEPENDABILITY
– TO PRESENT BASIC CONCEPTS AND APPROACHES
– TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY

• CONTENTS:
– MOTIVATION
– SYSTEM VIEWS
– SYSTEM DEPENDABILITY CONCEPTS
– APPROACHES TO DEPENDABLE DESIGN
– DEPENDABILITY RINGS
– DEPENDABLE DESIGN METHODOLOGY

Page 15: ftc1.ppt

TYPES OF SYSTEMS

• Dependable (Reliable) System
– A system that delivers the required service throughout its lifetime

• Fault-Tolerant Computer System
– A system that can continue to execute its programs and input/output functions correctly in the presence of faults

• Real-Time Computer System
– A system that delivers service to a user within a specified deadline (physical time, duration, etc.)

• Responsive Computer System
– A fault-tolerant real-time system that delivers satisfactory service in a timely manner

Page 16: ftc1.ppt


MOTIVATION FOR RELIABLE AND FAULT-TOLERANT COMPUTING

• ECONOMIC NECESSITY

• LIFE SAVING

• NOVICE USERS

• HARSH ENVIRONMENTS

• MORE COMPLEX SYSTEMS

Page 17: ftc1.ppt

DEVICE RELIABILITY AND SYSTEM RELIABILITY

[Figure: Mean Time Between Failures (MTBF) in years, on a logarithmic axis from 1 to 10^6, plotted against the years 1950 to 1990. Curves: Equivalent Device Reliability, System Reliability, Minimum Acceptable Reliability. Technology eras along the time axis: Relays, Vacuum Tubes, Semiconductors, SSI, MSI, LSI, VLSI.]
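
One way to relate the device and system curves is a series model, in which the system fails as soon as any one of its devices fails. The sketch below is only an illustration under that assumption, with independent devices and constant failure rates; the slide does not state a model, and the function name is hypothetical.

```python
def series_system_mtbf(device_mtbf_years: float, n_devices: int) -> float:
    """MTBF of a system that fails when any one of n identical devices fails.

    Assumption (not from the slide): devices fail independently with a
    constant failure rate, so the device failure rates simply add up.
    """
    if device_mtbf_years <= 0 or n_devices <= 0:
        raise ValueError("MTBF and device count must be positive")
    device_failure_rate = 1.0 / device_mtbf_years        # failures per year
    system_failure_rate = n_devices * device_failure_rate
    return 1.0 / system_failure_rate
```

Under these assumptions, series_system_mtbf(1e6, 100_000) is about 10 years: devices with a million-year MTBF still yield roughly a ten-year system MTBF once a hundred thousand of them must all work.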

Page 18: ftc1.ppt

DEPENDABILITY – PERFORMANCE TRADE-OFF

[Figure: Availability (0.9 to 0.99999) plotted against Throughput in MIPS (logarithmic axis from 1 to 100000). Labeled regions: Massively Parallel/Distributed Systems, Commercial Fault-Tolerant Systems, Ultra Reliable Systems.]
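
To make the availability axis more concrete, the following sketch converts an availability value into expected downtime per year. It assumes a 365-day year; the function name is illustrative and not taken from the lecture.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored (assumption)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# The availability levels shown on the vertical axis of the figure.
for level in (0.9, 0.99, 0.999, 0.9999, 0.99999):
    print(f"{level}: {downtime_hours_per_year(level):.2f} hours of downtime per year")
```

Each additional "nine" cuts the allowed downtime by a factor of ten, which is why the upper end of the axis is so much harder to reach.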

Page 19: ftc1.ppt

EXAMPLES

• DEFENSE SYSTEMS
• FLIGHT SYSTEMS
• AIR TRAFFIC CONTROL
• COMMUNICATION SYSTEMS
• BANKING SYSTEMS
• AIRLINE SEAT RESERVATIONS
• TELEPHONE SYSTEMS
• HOUSEHOLD APPLIANCES
• VIDEO GAMES

Page 20: ftc1.ppt

VIEW 1: SYSTEM LIFE CYCLE

[Diagram: the life cycle phases listed below, shown together with SYSTEM CONSTRAINTS, OBSOLESCENCE, NEEDS and NEW TECHNOLOGY]

– CONCEPT FORMULATION
– SYSTEM SPECIFICATION
– DESIGN
– PROTOTYPE
– PRODUCTION
– INSTALLATION
– OPERATIONAL LIFE
– MODIFICATION AND RETIREMENT

• Note that testing, verification or validation should occur after every phase of the life cycle
• Very few tools exist, and those cover only some steps of the cycle

Page 21: ftc1.ppt

VIEW 2: PACKAGING LEVELS OF INTEGRATION

• APPLICATIONS
• APPLICATIONS MODULES
• SPECIAL-PURPOSE LANGUAGES
• STANDARD LANGUAGES
• OPERATING SYSTEMS
• CABINETS/FRAMES
• BOXES/CAGES
• PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
• INTEGRATED CIRCUITS (CHIPS)

• Dependability must be considered at every level
• System decomposition (partitioning) may have a significant impact on dependability

Page 22: ftc1.ppt

VIEW 3: WORKLOAD VIEW

[Diagram: workload categories for LIVEWARE and for HARDWARE/SOFTWARE: PREPARATION, USEFUL WORK, SEMI-USEFUL WORK, FAULT SERVICING, IDLING]

• ELIMINATE IDLING AND USE IT FOR TESTING TO IMPROVE DEPENDABILITY
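
The idea of spending idle time on testing can be sketched as a loop that runs a self-test whenever no job is pending. This is a minimal illustration, not the lecture's design: the pattern test, the cycle model and all names are assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, List


def pattern_test(memory: bytearray) -> bool:
    """Write two complementary patterns to every byte and read them back.

    The original contents are saved first and restored afterwards.
    """
    saved = bytes(memory)
    ok = True
    for pattern in (0x55, 0xAA):
        for i in range(len(memory)):
            memory[i] = pattern
        if any(value != pattern for value in memory):
            ok = False
    memory[:] = saved
    return ok


def run_cycles(jobs: Deque[Callable[[], None]], memory: bytearray, cycles: int) -> List[str]:
    """Run one job per cycle; a cycle with no pending job is used for a self-test."""
    log: List[str] = []
    for _ in range(cycles):
        if jobs:
            jobs.popleft()()  # useful work
            log.append("work")
        else:
            log.append("test-pass" if pattern_test(memory) else "test-FAIL")
    return log
```

With two queued jobs and four cycles, the last two cycles would otherwise be idle and are spent on the memory test instead.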

Page 23: ftc1.ppt

VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS

COMPONENTS | SUBLEVEL | LEVEL
Disks, Tapes | | Quantum & Electromagnetic
Transistors, Resistors, Capacitors, Inductors, Power Sources, Diodes | | Circuit
Data Paths, Registers, Data Operators, Control (Hardwired), Microprogramming (Microstore) | Register Transfer Level (RTL) | Logic
Software, Memory State, Processor State, Effective Address Calculation, Instruction Decode, Instruction Execution | HLL, ISP (Instruction Set Processor) | Program
Processors, Memories, Switches, Links (Networks), Controllers, ALUs, I/Os | | PMS

• DEPENDABILITY AND TESTING MUST BE CONSIDERED AT EVERY LEVEL

Page 24: ftc1.ppt

VIEW 5: COMPUTER SYSTEM

SOFTWARE
– PACKAGES
– ASSEMBLERS
– COMPILERS
– OPERATING SYSTEMS
– UTILITY PROGRAMS
– DEBUGGING PROGRAMS
– FILE PROCESSING PROGRAMS

FIRMWARE
– MICROPROGRAM & MICROPROGRAMMING SYSTEMS

HARDWARE
– CPUs
– I/O DEVICES
– MEMORIES
– INTERCONNECTION NETWORKS

LIVEWARE
– MAINTENANCE PERSONNEL
– OPERATORS
– SYSTEM DESIGNERS
– SYSTEM ANALYSTS
– PROGRAMMERS
– USERS

FAULTS ARE ATTRIBUTED TO: HARDWARE: 20%-65%; SOFTWARE: 20%-80%; PEOPLE: 15%-40%. AT&T's figures: 20% / 40% / 40% (2/3 applications + 1/3 OS)

Page 25: ftc1.ppt

(WARNING!!!)

VIEW 6: IF YOU DO NOT FOLLOW DEPENDABLE DESIGN METHODOLOGY YOU MAY END UP WITH THE FOLLOWING:

SIX PHASES OF A PROJECT

1. ENTHUSIASM
2. DISILLUSIONMENT
3. PANIC AND HYSTERIA
4. SEARCH FOR THE GUILTY
5. PUNISHMENT OF THE INNOCENT
6. PRAISE AND AWARDS FOR THE NON-PARTICIPANTS

(Author unknown – found in one of the computer companies)

Page 26: ftc1.ppt

SYSTEM DEPENDABILITY CONCEPTS

• RELIABILITY
– R(t) is the conditional probability that the system performs its intended function without failure up to time t, provided it was fully operational at time t = 0

• AVAILABILITY
– Instantaneous availability A(t) is the probability that a system is performing correctly at time t; for non-repairable systems it equals the reliability: A(t) = R(t)
– Steady-state availability A_s is the probability that a system is operational at a random point in time, expressed as the fraction of its expected lifetime during which it is operational:

$$A_s = \frac{\text{UPTIME}}{\text{LIFETIME}}$$

• SURVIVABILITY
– The probability that a system will deliver the required service in the presence of an a priori defined set of faults or any subset of it
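
The two measures can be illustrated numerically. The sketch below computes steady-state availability as uptime divided by lifetime, as defined on the slide. The exponential reliability function is an added assumption (constant failure rate), since the slide defines reliability only as a probability; all names are illustrative.

```python
import math


def steady_state_availability(uptime: float, lifetime: float) -> float:
    """Fraction of the lifetime during which the system is operational."""
    if lifetime <= 0 or not 0 <= uptime <= lifetime:
        raise ValueError("require 0 <= uptime <= lifetime and lifetime > 0")
    return uptime / lifetime


def exponential_reliability(t: float, failure_rate: float) -> float:
    """R(t) = exp(-failure_rate * t).

    Assumption (not from the slide): a constant failure rate, which makes
    the time to failure exponentially distributed.
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)
```

A system that is up for 8750 of 8760 hours has a steady-state availability of 8750/8760, roughly 0.9989. Under the exponential assumption, R(0) = 1 and R(t) decreases steadily as t grows.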

Page 27: ftc1.ppt


APPROACHES

• FAULT INTOLERANCE

• FAULT TOLERANCE

• MAINTAINABILITY

• HARDWARE/SOFTWARE TRADE-OFFS

Page 28: ftc1.ppt

HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION

INSTRUCTIONS | EXAMPLES
(HARDWARE end of the continuum)
INTEGER ARITHMETIC ADD/SUB | M6800
MPY/DIV | MC68000
FLOATING-POINT ARITHMETIC | VAX-11/780, IBM-30XX
VECTOR PROCESSING | CRAY-XMP, C-205
MULTIPROCESSING (e.g., submachine set-up) | SYSTOLIC ARRAYS, RECONFIGURABLE OR EXPERIMENTAL MULTICOMPUTERS
(SOFTWARE end of the continuum)

VERTICAL MIGRATION is the transfer of a function's implementation from software to firmware and/or hardware, or vice versa.

Vertical migration can improve performance and dependability and reduce cost.

Page 29: ftc1.ppt

DEPENDABILITY (RELIABILITY) RINGS FOR FAULT TOLERANCE

[Diagram: Dependability Rings, each paired with an Acceptance Test: Logic Level; Register-Transfer Level; System Hardware; Operating System, Languages and Application]

Each Dependability Ring should provide measures and mechanisms for Fault Tolerance (Detection, Location, Testability and Recovery)
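
A rough way to picture rings with acceptance tests is a list of (ring, test) pairs that is checked in order, so that a failing test both detects a fault and locates the ring it belongs to. The sketch below is an interpretation for illustration only: the checking order, the test functions and all names are assumptions, not part of the lecture.

```python
from typing import Callable, List, Optional, Tuple

# A ring is a name plus an acceptance test that returns True when the ring passes.
Ring = Tuple[str, Callable[[], bool]]


def first_failing_ring(rings: List[Ring]) -> Optional[str]:
    """Run each ring's acceptance test in order.

    Returns the name of the first ring whose test fails (detection and
    location in one step), or None if every ring passes.
    """
    for name, acceptance_test in rings:
        if not acceptance_test():
            return name
    return None
```

For example, if the tests for "Logic Level" pass and the test for "Register-Transfer Level" fails, the function returns "Register-Transfer Level", and recovery can then be targeted at that ring.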

Page 30: ftc1.ppt

A BOOTSTRAP – TEST RINGS IN A MULTICOMPUTER SYSTEM

[Diagram: Test Rings around the Diagnostic and Maintenance Processor(s) (Hardcore): Processor, Memories, Network]

Page 31: ftc1.ppt

DEPENDABLE DESIGN METHODOLOGY

• Identify fault classes, fault latency and fault impact
• Determine qualitative and quantitative specs for fault tolerance and evaluate your design in the specific environment
• Identify "weak spots" and assess potential damage
• Decompose the system
• Develop fault and error detection techniques and algorithms
• Develop fault isolation techniques and algorithms
• Develop recovery/reintegration/restart procedures
• Evaluate degree of fault tolerance
• Refine, iterate for improvement; try to eliminate "weak spots" and minimize potential damage
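
As a small illustration of the "develop detection techniques" and "evaluate degree of fault tolerance" steps, the sketch below injects every possible k-bit fault into a data word and measures how many a single parity bit detects. Parity is chosen here only as a familiar example; the lecture does not prescribe it. The parity bit itself is assumed fault-free, and all names are hypothetical.

```python
from itertools import combinations


def parity(word: int) -> int:
    """Even/odd parity of a non-negative integer: 1 if it has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def detection_coverage(word: int, width: int, flips: int) -> float:
    """Fraction of all `flips`-bit fault patterns in a `width`-bit word
    that a single stored parity bit detects.

    Assumption: faults hit only the data bits, never the parity bit.
    """
    if not 1 <= flips <= width:
        raise ValueError("require 1 <= flips <= width")
    stored_parity = parity(word)
    patterns = list(combinations(range(width), flips))
    detected = 0
    for positions in patterns:
        faulty = word
        for position in positions:
            faulty ^= 1 << position          # inject a bit flip
        if parity(faulty) != stored_parity:  # detection check
            detected += 1
    return detected / len(patterns)
```

For an 8-bit word, single-bit faults give a coverage of 1.0 and double-bit faults a coverage of 0.0, which exposes a "weak spot" of the kind the methodology asks the designer to identify.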

Page 32: ftc1.ppt

REAL-TIME SYSTEMS DESIGN

• Identify time-critical tasks and specify their timing (deadlines, durations, frequency, periodicity, if any). Characterize the system load and environment.
• Characterize the timing of the system (hardware and software).
• Map the timing specification onto the system timing (find the best resource allocation and scheduling methods), and incorporate concurrent monitoring.
• Verify and validate the design against the quantitative and qualitative specifications.
• Refine, iterate and fine-tune the design.
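
Mapping a timing specification onto system timing usually ends in some schedulability check. The sketch below uses the rate-monotonic utilization bound of Liu and Layland as one concrete example; the lecture does not name a scheduling method, so this choice is an assumption. It applies to independent periodic tasks on one processor with deadlines equal to periods, and it is a sufficient test only.

```python
from typing import List, Tuple


def rm_utilization_bound(n: int) -> float:
    """Liu and Layland bound n * (2^(1/n) - 1) for n periodic tasks."""
    if n < 1:
        raise ValueError("need at least one task")
    return n * (2 ** (1.0 / n) - 1)


def passes_rm_bound(tasks: List[Tuple[float, float]]) -> bool:
    """tasks holds (worst-case execution time, period) pairs in the same time unit.

    True means the task set is schedulable under rate-monotonic priorities.
    False means only that this sufficient test is inconclusive.
    """
    utilization = sum(execution / period for execution, period in tasks)
    return utilization <= rm_utilization_bound(len(tasks))
```

A task set with (execution, period) pairs (1, 4), (1, 5) and (2, 10) has utilization 0.65, below the three-task bound of about 0.78, so it passes. A set with utilization above the bound needs a more exact analysis before any conclusion is drawn.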

Page 33: ftc1.ppt

RESPONSIVE SYSTEM DESIGN

• Determine qualitative and quantitative specifications for fault tolerance and task timeliness which meet user requirements.
• Determine system timing (hardware and software), and assess damage, availability and responsiveness.
• Develop and time fault and error detection techniques and algorithms.
• Develop and time fault isolation techniques and algorithms.
• Develop and time recovery/reintegration/restart.
• Map the timing specification onto system timing under appropriate assumptions and incorporate concurrent monitoring.
• Evaluate responsiveness.
• Refine and iterate for improvement.

RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND ARCHITECTS OF TIME
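
The reason each fault-handling step must be timed is that its duration counts against the task deadline. The sketch below checks a simple worst-case budget for one task and one fault. The additive model, the field names and the time unit are assumptions made for illustration; any re-execution after recovery is taken to be included in the recovery time.

```python
from dataclasses import dataclass


@dataclass
class TimedTask:
    execution: float  # worst-case execution time
    detection: float  # worst-case time to detect a fault
    isolation: float  # worst-case time to isolate the fault
    recovery: float   # worst-case time to recover, reintegrate and restart
    deadline: float   # relative deadline, in the same time unit


def worst_case_response(task: TimedTask) -> float:
    """Worst-case response time if one fault occurs (simple additive model)."""
    return task.execution + task.detection + task.isolation + task.recovery


def is_responsive(task: TimedTask) -> bool:
    """True if the task still meets its deadline when one fault is handled."""
    return worst_case_response(task) <= task.deadline
```

With execution 5, detection 1, isolation 1, recovery 2 and deadline 10, the worst case is 9 and the task remains responsive; raising the recovery time to 4 pushes the worst case to 11 and the deadline is missed.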

Page 34: ftc1.ppt

REFERENCES (TEXTBOOK)

• C. G. Bell, J. C. Mudge and J. E. McNamara, "Seven Views of Computer Systems", Chapter 1 in the book by the same authors titled "Computer Engineering", Digital Press, 1978.

• G. J. Lipovski and M. Malek, "Parallel Computing: Theory and Comparisons", Wiley-Interscience, New York, 1987.

• M. Malek, "Parallel Computer Systems Testing and Integration", in the book titled "Testing and Diagnosis of VLSI and LSI", M. G. Sami and F. Lombardi (eds.), Kluwer, 1988.

• P. Jalote, "Fault Tolerance in Distributed Systems", Prentice-Hall, 1994.

• D. K. Pradhan, "Fault-Tolerant Computer System Design", Prentice-Hall, 1996.