Human failure (LSCITS EngD 2012)

Description

An introduction to human error and the implications for the design of socio-technical systems.

Transcript of Human failure (LSCITS EngD 2012)

Page 1: Human failure (LSCITS EngD 2012)

Human failure

Page 2: Human failure (LSCITS EngD 2012)

In this lecture…

• What do we mean by human failure

• Human error and socio-technical systems

• Designing error tolerant systems

Page 3: Human failure (LSCITS EngD 2012)

Human failure

• Human failures are said to account for:

– 50-70% of aviation disasters

– 44,000-98,000 deaths each year in America from medical errors

– 60-85% of shuttle incidents at NASA

– 92-95% of car crashes

– 70% of shipping accidents

• A common reaction to human failure is to blame the user

• But “To Err is Human”, so we should be concerned with designing systems that are resilient to human error

Page 4: Human failure (LSCITS EngD 2012)

What is human error?

Page 5: Human failure (LSCITS EngD 2012)

Definitions of human error

• An inappropriate or undesirable human decision or behaviour that reduces or has the potential for reducing the effectiveness, dependability or performance of a system

• Examples:

– Errors of omission - forgetting to do something

– Errors of commission - doing something incorrectly

– Sequence errors - out of order

– Timing errors - too slow - too fast - too late

Page 6: Human failure (LSCITS EngD 2012)

System dependability model

• System fault: a system characteristic that can (but need not) lead to a system error

• System error: an erroneous system state that can (but need not) lead to a system failure

• System failure: externally-observed, unexpected and undesirable system behaviour

Page 7: Human failure (LSCITS EngD 2012)

A dependability perspective

• Human error – behaviour that leads to the introduction of a fault into a system

– Development errors

– Operational errors

– Maintenance errors

• Emphasises that human errors do not necessarily lead to system failures

• We are not just interested in errors in system operation

Page 8: Human failure (LSCITS EngD 2012)

Example

• Operator specifies a value of 12 rather than -12 for the temperature of a freezer

• Developer has not included a check that values set are below 0 degrees

• The resultant system fault is that the system thermostat is set to the wrong value

• The resultant system error (erroneous state) is that the refrigerant pump is not switched on

• The observed system failure is that the freezer is warm and its contents have defrosted
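
To make the fault-error-failure chain concrete, here is a minimal sketch (in Python, with hypothetical function and constant names, not taken from the lecture's system) of the check the developer omitted: a freezer setpoint at or above 0 degrees is rejected at the point of entry, so the operator's slip cannot become a system fault.

MAX_FREEZER_SETPOINT_C = 0.0  # freezers must be set below freezing

def set_freezer_temperature(setpoint_c: float) -> float:
    """Validate a freezer setpoint before applying it; reject out-of-range values."""
    if setpoint_c >= MAX_FREEZER_SETPOINT_C:
        # Trap the operator's slip (12 instead of -12) here, instead of
        # silently storing a faulty setpoint that leaves the pump off.
        raise ValueError(
            f"Freezer setpoint must be below {MAX_FREEZER_SETPOINT_C} C, got {setpoint_c}"
        )
    return setpoint_c  # in a real system this value would drive the thermostat

With such a check in place the error is reported immediately, rather than surfacing days later as a warm freezer.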

Page 9: Human failure (LSCITS EngD 2012)

What is an error?

• Determining whether a human action is an error often involves a judgment.

– Sometimes a human action is clearly a human error

– Sometimes a human action is only clearly an error with hindsight

– Sometimes a human action that would ordinarily be an error is not an error.

• Users are not just people who cause errors; they are often also the ones who trap and correct errors (both human and technical)

– Never simply assume that systems are inherently safe and humans introduce errors

– Many now prefer the terms “human reliability” or “resilience” to “human error”

Page 10: Human failure (LSCITS EngD 2012)

Human error ambiguity?

• It can be difficult to distinguish between safe and erroneous behaviour.

– An action that is an error in one context may not be in another

• Failing to follow procedures for the safe use of ladders

• Not an error if the goal is to rescue a child trapped in a fire

• Erroneous actions?

– Following a rule or instruction if following it causes a failure

– Deliberately not following a rule or instruction if resources are unavailable or if not following the rule avoids a failure

– Deviating from a defined process or procedure to save time or improve quality

Page 11: Human failure (LSCITS EngD 2012)

GEMS

• GEMS (Generic Error Modelling System) was developed by a psychologist, James Reason, at Manchester University

• Based on the notion that human actions are organised around:

• Intentions, Goals, Plans and Actions

• In GEMS, human error occurs as:

– The failure to perform some plan or task properly

– The failure to apply the correct plan

Page 12: Human failure (LSCITS EngD 2012)

Types of human activity

• GEMS distinguishes three ways in which actions are performed:

– Skills-based performance

• Routine things done without much cognitive effort e.g. driving a car

– Rule-based performance

• Following a set of rules or procedure e.g. transferring data from one system to another

– Knowledge-based performance

• Applying knowledge in completing some task e.g. planning travel from St Andrews to a meeting in Rome

Page 13: Human failure (LSCITS EngD 2012)

Human error classification

• Slips, which occur in skills-based performance

– An “execution failure”: the operator’s intentions are correct but the actions are not carried out properly

• Lapses, which also occur in skills-based performance

– Also an “execution failure”, but here the operator forgets to do something, loses their place in a task, etc.

• Mistakes, which occur in rule- and knowledge-based performance

– These are “planning failures”, where an inappropriate set of actions is carried out

Page 14: Human failure (LSCITS EngD 2012)

Human error in complex socio-technical systems

Page 15: Human failure (LSCITS EngD 2012)

Influences on human actions

[Diagram: influences on human actions - Technology, Users, Groups, Organisations, Regulations - arranged from the sharp end to the blunt end]

Page 16: Human failure (LSCITS EngD 2012)

The socio-technical systems stack

Page 17: Human failure (LSCITS EngD 2012)

Human fallibility and dependability

• Human fallibility can influence the dependability of an LSCITS:

– During the development process

– During the deployment process

– During the maintenance/management process

– During the operational process

• Errors made during development, deployment and maintenance create vulnerabilities that may interact with ‘errors’ during the operational process to cause system failure

Page 18: Human failure (LSCITS EngD 2012)

Example

• Maintenance error leads to a vulnerability in the system

– Say the automatic backup disk is switched from A to B to check that a change to the backup system has been made correctly. The maintainer then forgets to switch the backup disk back to A and dismounts B

– Consequence of maintenance error is that backups are not made

• Operator error leads to an erroneous command being input to the system

– Operator accidentally overwrites a file in the system with incorrect data

– Goes to backup system to recover previous version of file

• File cannot be recovered
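
The maintenance slip above could be caught long before the operator needs the backup by routinely checking that backups are actually being produced. A minimal sketch, assuming a hypothetical backup directory and a once-a-day backup schedule (the path and threshold are illustrative):

import os
import time

BACKUP_DIR = "/var/backups/system"   # hypothetical backup location
MAX_BACKUP_AGE_S = 24 * 60 * 60      # expect at least one backup per day

def latest_backup_age_seconds(backup_dir: str) -> float:
    """Return the age in seconds of the newest file in the backup directory."""
    if not os.path.isdir(backup_dir):
        return float("inf")
    paths = [os.path.join(backup_dir, name) for name in os.listdir(backup_dir)]
    if not paths:
        return float("inf")
    return time.time() - max(os.path.getmtime(path) for path in paths)

def check_backups(backup_dir: str = BACKUP_DIR) -> bool:
    """Warn if no recent backup exists, e.g. because the wrong disk is mounted."""
    if latest_backup_age_seconds(backup_dir) > MAX_BACKUP_AGE_S:
        print("WARNING: no recent backup found - check that the backup disk is mounted")
        return False
    return True

Such a check turns a silent latent condition into a visible alarm while there is still time to fix it.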

Page 19: Human failure (LSCITS EngD 2012)

Failure trajectories

• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously

– Loss of data in a critical system

• User mistypes command and instructs data to be deleted

• System does not check and ask for confirmation of the destructive action (a defence sketched below)

• No backup of data available

• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure in the defensive layers in the system
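
One of the missing defensive layers in this trajectory is a confirmation step for destructive commands. A minimal, hypothetical sketch of such a defence (the function name and prompt are illustrative):

from typing import Callable

def delete_dataset(name: str, confirm: Callable[[str], str] = input) -> bool:
    """Require explicit confirmation before a destructive action is carried out."""
    answer = confirm(f"Really delete dataset '{name}'? Type its name to confirm: ")
    if answer.strip() != name:
        print("Deletion cancelled")
        return False
    # ... the actual deletion would happen here ...
    print(f"Dataset '{name}' deleted")
    return True

A single mistyped command then no longer completes the failure trajectory on its own; the backup remains as a further layer behind it.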

Page 20: Human failure (LSCITS EngD 2012)

Vulnerabilities and defences

• Vulnerabilities

– Faults in the (socio-technical) system which, if triggered by a human error, can lead to system failure

– e.g. missing check on input validity

• Defences

– System features that avoid, tolerate or recover from human error

– e.g. type checking that disallows assigning a value of the wrong type (sketched below)

• When an adverse event happens, the key question is not ‘whose fault was it’ but ‘why did the system defences fail?’
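
As an illustration of the type-checking defence (a sketch, not an example from the lecture), an enumerated type lets a static checker such as mypy reject a command carrying the wrong kind of value before it can put the system into an erroneous state:

from enum import Enum

class PumpCommand(Enum):
    ON = "on"
    OFF = "off"

def set_pump(command: PumpCommand) -> None:
    """Accepts only a PumpCommand, not an arbitrary string or number."""
    print(f"Pump switched {command.value}")

set_pump(PumpCommand.ON)   # accepted
# set_pump("onn")          # flagged by a static type checker, so the typo
#                          # is caught in development rather than in operation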

Page 21: Human failure (LSCITS EngD 2012)

Reason’s Swiss Cheese Model

Page 22: Human failure (LSCITS EngD 2012)

Active failures

• Active failures

– Active failures are the unsafe acts committed by people who are in direct contact with the system (slips, lapses, mistakes, and procedural violations).

– Active failures have a direct and usually short-lived effect on the integrity of the defenses.

• Latent conditions

– Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.

– Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.

Page 23: Human failure (LSCITS EngD 2012)

Defensive layers

• Complex IT systems should have many defensive layers:

– some are engineered - alarms, physical barriers, automatic shutdowns,

– others rely on people - surgeons, anesthetists, pilots, control room operators,

– and others depend on procedures and administrative controls.

• In an ideal world, each defensive layer would be intact.

• In reality, they are more like slices of Swiss cheese, having many holes - although unlike in the cheese, these holes are continually opening, shutting, and shifting their location.

Page 24: Human failure (LSCITS EngD 2012)

Dynamic vulnerabilities

• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context where the system is used.

• For example

– vulnerabilities may be related to human actions whose performance is dependent on workload, state of mind, etc. An operator may be distracted and forget to check something

– vulnerabilities may depend on configuration – checks may depend on particular programs being up and running, so a check may be made if program A is running but not if program B is running

Page 25: Human failure (LSCITS EngD 2012)

Human error and complexity

• System complexity and change add to the ambiguity of human errors

– Many human errors are insignificant and do not lead to failure

– Many human errors are spotted and resolved by defensive layers in the system

– Human actions that are correct may become erroneous because of a change elsewhere in the system

– An error can be made many times without contributing to a failure, but then suddenly, because some system component has changed in some way, it will.

Page 26: Human failure (LSCITS EngD 2012)

Human error and complexity

• An error can be made many times without contributing to a failure, but then suddenly one day it will

• Example

– An operator logs information by sending it to an email address which uses an obsolete domain name (dcs.st-and.ac.uk)

– Version X of the system relies on a DNS that maps the obsolete name to the new name (cs.st-andrews.ac.uk) so this works OK – no error is reported

– Four years after the initial change, a new DNS is installed and the domain name mapping is removed

– The day after this happens, the system fails because the email log message cannot be sent
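
A latent condition like the obsolete domain name can be surfaced early by checking the configuration at start-up rather than waiting for the send to fail. A hypothetical sketch (the address variable is illustrative, and a fuller check would also look up MX records):

import socket

LOG_EMAIL_ADDRESS = "syslog@dcs.st-and.ac.uk"  # hypothetical configured address

def check_log_email_domain(address: str) -> bool:
    """Warn at start-up if the configured log e-mail domain no longer resolves."""
    domain = address.rsplit("@", 1)[-1]
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        print(f"WARNING: log e-mail domain '{domain}' does not resolve; "
              "log messages will be lost until the configuration is updated")
        return False

check_log_email_domain(LOG_EMAIL_ADDRESS)

The check converts a failure that would otherwise appear years after the original change into a warning at the moment the latent condition becomes active.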

Page 27: Human failure (LSCITS EngD 2012)

Designing resilient systems

Page 28: Human failure (LSCITS EngD 2012)

System resilience

• Failure avoidance

– Fault avoidance

– Fault detection

– Fault tolerance

• Failure recovery

– Returning to normal operation after the occurrence of a system failure

Page 29: Human failure (LSCITS EngD 2012)

Incident reduction

• Reduce the number of latent conditions in the different layers of the system (plug the holes)

– If the number of faults in a software system is reduced, this increases the strength of the defensive layer

– However, this technical approach on its own cannot be completely effective, as it is practically impossible to reduce the number of latent conditions in the system to zero

• Increase the number of defensive layers and hence reduce the probability of an accident trajectory occurring

• Reduce the number of active faults that occur

Page 30: Human failure (LSCITS EngD 2012)

Conditions leading to human error

• Distractions

• Incomplete or incorrect data

• Boredom

• Inadequate resources

• Cognitive overload

• Stress

• Illness

• Time pressure

Page 31: Human failure (LSCITS EngD 2012)

Systems design and human error

• Once we begin to understand what human errors are possible and how they can come about, we can start designing systems that better withstand human error

• Avoidance

– Design the system so that certain classes of human error are eliminated

• Detection

– Make it easier for the operator and others to spot errors

• Tolerance

– Ensure that individual errors are unlikely to lead to system failure

Page 32: Human failure (LSCITS EngD 2012)

Design guidelines

• Minimise potential for slips, lapses and mistakes by designing systems and work environments where people aren’t distracted or overwhelmed, but aren’t bored either

• Minimise potential for mistakes by designing systems and work environments where people are able to understand what is happening and the consequences of an action

• Minimise potential for mistakes by making sure people are trained properly

• Minimise potential for deliberate violations by making sure rules are well designed and well understood. Monitor the application of rules

Page 33: Human failure (LSCITS EngD 2012)

Detection and tolerance

• Detecting and correcting error

– Automated correction can be useful, but can be dangerous!

– Alarms and alerts may be better than automated correction, but need to be well designed.

– Allow for human correction, by making it possible to ‘undo’ (sketched at the end of this slide)

– Make it easier for the user or another person to spot errors

• Tolerating human error

– Remember, breaking the rules might be for good reasons. Think of users as people attempting to do things, rather than simply as operators of the system.

– Human error is common, so try not to create systems where a single human error can cause a failure.
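
Supporting ‘undo’, as suggested in the list above, usually means recording an inverse action for every change. A minimal, generic sketch (not from the lecture; the file-rename example is illustrative):

from typing import Callable, List

class UndoStack:
    """Record an inverse action for every change so users can correct their own slips."""

    def __init__(self) -> None:
        self._undo_actions: List[Callable[[], None]] = []

    def do(self, action: Callable[[], None], inverse: Callable[[], None]) -> None:
        action()
        self._undo_actions.append(inverse)

    def undo(self) -> None:
        if self._undo_actions:
            self._undo_actions.pop()()

# Example usage (commented out because the files are hypothetical):
# import os
# stack = UndoStack()
# stack.do(lambda: os.rename("a.txt", "b.txt"),
#          lambda: os.rename("b.txt", "a.txt"))
# stack.undo()  # restores the original file name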

Page 34: Human failure (LSCITS EngD 2012)

Recovery

• Design for failure

– Discussed in previous module on systems engineering for LSCITS

• Make work visible

• Switch from enforcing mode to auditing mode

• Support role transferability

• Balance recovery and security

Page 35: Human failure (LSCITS EngD 2012)

Key points

• Human error accounts for the majority of all system failures

• Human error is very common, and only occasionally leads to failure

• The same human action may or may not be an error depending on the context of that action

• There are methods for analysing and predicting human errors, and these can be used to improve system design

• Some systems are more prone to accidents than others because of the way they have been designed

• Critical systems should be designed to minimise or detect human error

• Blaming the user is a common response to human error, but the fault lies with the system and the system engineers.