Human failure (LSCITS EngD 2012)

Description

An introduction to human error and the implications for the design of socio-technical systems.

Transcript of Human failure (LSCITS EngD 2012)

Page 1: Human failure (LSCITS EngD 2012)

Human failure

Page 2: Human failure (LSCITS EngD 2012)

In this lecture…

• What do we mean by human failure

• Human error and socio-technical systems

• Designing error tolerant systems

Page 3: Human failure (LSCITS EngD 2012)

Human failure

• Human failures are said to account for:

– 50-70% of aviation disasters

– 44,000-98,000 deaths each year in America from medical errors

– 60-85% of shuttle incidents at NASA

– 92-95% of car crashes

– 70% of shipping accidents

• A common reaction to human failure is to blame the user

• But “To Err is Human”, so we should be concerned with designing systems that are resilient to human error

Page 4: Human failure (LSCITS EngD 2012)

What is human error?

Page 5: Human failure (LSCITS EngD 2012)

Definitions of human error

• An inappropriate or undesirable human decision or behaviour that reduces or has the potential for reducing the effectiveness, dependability or performance of a system

• Examples:

– Errors of omission - forgetting to do something

– Errors of commission - doing something incorrectly

– Sequence errors - out of order

– Timing errors - too slow - too fast - too late

Page 6: Human failure (LSCITS EngD 2012)

System dependability model

• System fault: a system characteristic that can (but need not) lead to a system error

• System error: an erroneous system state that can (but need not) lead to a system failure

• System failure: externally-observed, unexpected and undesirable system behaviour

Page 7: Human failure (LSCITS EngD 2012)

A dependability perspective

• Human error – behaviour that leads to the introduction of a fault into a system

– Development errors

– Operational errors

– Maintenance errors

• Emphasises that human errors do not necessarily lead to system failures

• We are not just interested in errors in system operation

Page 8: Human failure (LSCITS EngD 2012)

Example

• Operator specifies a value of 12 rather than -12 for the temperature of a freezer

• Developer has not included a check that values set are below 0 degrees

• The resultant system fault is that the system thermostat is set to the wrong value

• The resultant system error (erroneous state) is that the refrigerant pump is not switched on

• The observed system failure is that the freezer is warm and its contents have defrosted
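
To make the fault-error-failure chain concrete, here is a minimal sketch (in Python, with hypothetical function and constant names, not taken from the lecture's system) of the check the developer omitted: a freezer setpoint at or above 0 degrees is rejected at the point of entry, so the operator's slip cannot become a system fault.

MAX_FREEZER_SETPOINT_C = 0.0  # freezers must be set below freezing

def set_freezer_temperature(setpoint_c: float) -> float:
    """Validate a freezer setpoint before applying it; reject out-of-range values."""
    if setpoint_c >= MAX_FREEZER_SETPOINT_C:
        # Trap the operator's slip (12 instead of -12) here, instead of
        # silently storing a faulty setpoint that leaves the pump off.
        raise ValueError(
            f"Freezer setpoint must be below {MAX_FREEZER_SETPOINT_C} C, got {setpoint_c}"
        )
    return setpoint_c  # in a real system this value would drive the thermostat

With such a check in place the error is reported immediately, rather than surfacing days later as a warm freezer.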

Page 9: Human failure (LSCITS EngD 2012)

What is an error?

• Determining whether a human action is an error often involves a judgment.

– Sometimes a human action is clearly a human error

– Sometimes a human action is only clearly an error with hindsight

– Sometimes a human action that would ordinarily be an error is not an error.

• Users are not just people who cause errors; they are often also the ones who trap and correct errors (both human and technical)

– Never simply assume that systems are inherently safe and humans introduce errors

– Many now prefer the terms “human reliability” or “resilience” to “human error”

Page 10: Human failure (LSCITS EngD 2012)

Human error ambiguity?

• It can be difficult to distinguish between safe and erroneous behaviour.

– An action that is an error in one context may not be in another

• Failing to follow procedures for the safe use of ladders

• Not an error if the goal is to rescue a child trapped in a fire

• Erroneous actions?

– Following a rule or instruction if following it causes a failure

– Deliberately not following a rule or instruction if resources are unavailable or if not following the rule avoids a failure

– Deviating from a defined process or procedure to save time or improve quality

Page 11: Human failure (LSCITS EngD 2012)

GEMS

• GEMS (Generic Error Modelling System) was developed by a psychologist, James Reason, at Manchester University

• Based on the notion that human actions are organised around:

• Intentions, Goals, Plans and Actions

• In GEMS, human error occurs as:

– The failure to perform some plan or task properly

– The failure to apply the correct plan

Page 12: Human failure (LSCITS EngD 2012)

Types of human activity

• GEMS distinguishes three ways in which actions are performed:

– Skills-based performance

• Routine things done without much cognitive effort e.g. driving a car

– Rule-based performance

• Following a set of rules or procedure e.g. transferring data from one system to another

– Knowledge-based performance

• Applying knowledge in completing some task e.g. planning travel from St Andrews to a meeting in Rome

Page 13: Human failure (LSCITS EngD 2012)

Human error classification

• Slips, which occur in skills-based performance

– An “execution failure”: the operator’s intentions are correct but the actions are not carried out properly

• Lapses, which also occur in skills-based performance

– Also an “execution failure”, but here the operator forgets to do something, loses their place in a task, etc.

• Mistakes, which occur in rule- and knowledge-based performance

– These are “planning failures”, where an inappropriate set of actions is carried out

Page 14: Human failure (LSCITS EngD 2012)

Human error in complex socio-technical systems

Page 15: Human failure (LSCITS EngD 2012)

Influences on human actions

[Diagram: influences on human actions - Technology, Users, Groups, Organisations, Regulations - arranged from the sharp end to the blunt end]

Page 16: Human failure (LSCITS EngD 2012)

The socio-technical systems stack

Page 17: Human failure (LSCITS EngD 2012)

Human fallibility and dependability

• Human fallibility can influence the dependability of an LSCITS:

– During the development process

– During the deployment process

– During the maintenance/management process

– During the operational process

• Errors made during development, deployment and maintenance create vulnerabilities that may interact with ‘errors’ during the operational process to cause system failure

Page 18: Human failure (LSCITS EngD 2012)

Example

• Maintenance error leads to a vulnerability in the system

– Say the automatic backup disk is switched from A to B to check that a change to the backup system has been made correctly. The maintainer then forgets to switch the backup disk back to A and dismounts B

– Consequence of maintenance error is that backups are not made

• Operator error leads to an erroneous command being input to the system

– Operator accidentally overwrites a file in the system with incorrect data

– Goes to backup system to recover previous version of file

• File cannot be recovered
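
The maintenance slip above could be caught long before the operator needs the backup by routinely checking that backups are actually being produced. A minimal sketch, assuming a hypothetical backup directory and a once-a-day backup schedule (the path and threshold are illustrative):

import os
import time

BACKUP_DIR = "/var/backups/system"   # hypothetical backup location
MAX_BACKUP_AGE_S = 24 * 60 * 60      # expect at least one backup per day

def latest_backup_age_seconds(backup_dir: str) -> float:
    """Return the age in seconds of the newest file in the backup directory."""
    if not os.path.isdir(backup_dir):
        return float("inf")
    paths = [os.path.join(backup_dir, name) for name in os.listdir(backup_dir)]
    if not paths:
        return float("inf")
    return time.time() - max(os.path.getmtime(path) for path in paths)

def check_backups(backup_dir: str = BACKUP_DIR) -> bool:
    """Warn if no recent backup exists, e.g. because the wrong disk is mounted."""
    if latest_backup_age_seconds(backup_dir) > MAX_BACKUP_AGE_S:
        print("WARNING: no recent backup found - check that the backup disk is mounted")
        return False
    return True

Such a check turns a silent latent condition into a visible alarm while there is still time to fix it.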

Page 19: Human failure (LSCITS EngD 2012)

Failure trajectories

• Failures rarely have a single cause. Generally, they arise because several events occur simultaneously

– Loss of data in a critical system

• User mistypes command and instructs data to be deleted

• System does not check and ask for confirmation of the destructive action (a defence sketched below)

• No backup of data available

• A failure trajectory is a sequence of undesirable events that coincide in time, usually initiated by some human action. It represents a failure in the defensive layers in the system
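
One of the missing defensive layers in this trajectory is a confirmation step for destructive commands. A minimal, hypothetical sketch of such a defence (the function name and prompt are illustrative):

from typing import Callable

def delete_dataset(name: str, confirm: Callable[[str], str] = input) -> bool:
    """Require explicit confirmation before a destructive action is carried out."""
    answer = confirm(f"Really delete dataset '{name}'? Type its name to confirm: ")
    if answer.strip() != name:
        print("Deletion cancelled")
        return False
    # ... the actual deletion would happen here ...
    print(f"Dataset '{name}' deleted")
    return True

A single mistyped command then no longer completes the failure trajectory on its own; the backup remains as a further layer behind it.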

Page 20: Human failure (LSCITS EngD 2012)

Vulnerabilities and defences

• Vulnerabilities

– Faults in the (socio-technical) system which, if triggered by a human error, can lead to system failure

– e.g. missing check on input validity

• Defences

– System features that avoid, tolerate or recover from human error

– e.g. type checking that disallows assigning a value of the wrong type (sketched below)

• When an adverse event happens, the key question is not ‘whose fault was it’ but ‘why did the system defences fail?’
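
As an illustration of the type-checking defence (a sketch, not an example from the lecture), an enumerated type lets a static checker such as mypy reject a command carrying the wrong kind of value before it can put the system into an erroneous state:

from enum import Enum

class PumpCommand(Enum):
    ON = "on"
    OFF = "off"

def set_pump(command: PumpCommand) -> None:
    """Accepts only a PumpCommand, not an arbitrary string or number."""
    print(f"Pump switched {command.value}")

set_pump(PumpCommand.ON)   # accepted
# set_pump("onn")          # flagged by a static type checker, so the typo
#                          # is caught in development rather than in operation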

Page 21: Human failure (LSCITS EngD 2012)

Reason’s Swiss Cheese Model

Page 22: Human failure (LSCITS EngD 2012)

Active failures

• Active failures

– Active failures are the unsafe acts committed by people who are in direct contact with the system (slips, lapses, mistakes, and procedural violations).

– Active failures have a direct and usually short-lived effect on the integrity of the defenses.

• Latent conditions

– Fundamental vulnerabilities in one or more layers of the socio-technical system, such as system faults, system and process misfit, alarm overload, inadequate maintenance, etc.

– Latent conditions may lie dormant within the system for many years before they combine with active failures and local triggers to create an accident opportunity.

Page 23: Human failure (LSCITS EngD 2012)

Defensive layers

• Complex IT systems should have many defensive layers:

– some are engineered - alarms, physical barriers, automatic shutdowns,

– others rely on people - surgeons, anesthetists, pilots, control room operators,

– and others depend on procedures and administrative controls.

• In an ideal world, each defensive layer would be intact.

• In reality, they are more like slices of Swiss cheese, having many holes - although unlike in the cheese, these holes are continually opening, shutting, and shifting their location.

Page 24: Human failure (LSCITS EngD 2012)

Dynamic vulnerabilities

• While some vulnerabilities are static (e.g. programming errors), others are dynamic and depend on the context where the system is used.

• For example

– vulnerabilities may be related to human actions whose performance is dependent on workload, state of mind, etc. An operator may be distracted and forget to check something

– vulnerabilities may depend on configuration – checks may depend on particular programs being up and running, so a check may be made if program A is running but not if program B is running

Page 25: Human failure (LSCITS EngD 2012)

Human error and complexity

• System complexity and change add to the ambiguity of human errors

– Many human errors are insignificant and do not lead to failure

– Many human errors are spotted and resolved by defensive layers in the system

– Human actions that are correct may become erroneous because of a change elsewhere in the system

– An error can be made many times without contributing to a failure, but then suddenly, because some system component has changed in some way, it will.

Page 26: Human failure (LSCITS EngD 2012)

Human error and complexity

• An error can be made many times without contributing to a failure, but then suddenly one day it will

• Example

– An operator logs information by sending it to an email address which uses an obsolete domain name (dcs.st-and.ac.uk)

– Version X of the system relies on a DNS that maps the obsolete name to the new name (cs.st-andrews.ac.uk) so this works OK – no error is reported

– Four years after the initial change, a new DNS is installed and the domain name mapping is removed

– The day after this happens, the system fails because the email log message cannot be sent
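
A latent condition like the obsolete domain name can be surfaced early by checking the configuration at start-up rather than waiting for the send to fail. A hypothetical sketch (the address variable is illustrative, and a fuller check would also look up MX records):

import socket

LOG_EMAIL_ADDRESS = "syslog@dcs.st-and.ac.uk"  # hypothetical configured address

def check_log_email_domain(address: str) -> bool:
    """Warn at start-up if the configured log e-mail domain no longer resolves."""
    domain = address.rsplit("@", 1)[-1]
    try:
        socket.getaddrinfo(domain, None)
        return True
    except socket.gaierror:
        print(f"WARNING: log e-mail domain '{domain}' does not resolve; "
              "log messages will be lost until the configuration is updated")
        return False

check_log_email_domain(LOG_EMAIL_ADDRESS)

The check converts a failure that would otherwise appear years after the original change into a warning at the moment the latent condition becomes active.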

Page 27: Human failure (LSCITS EngD 2012)

Designing resilient systems

Page 28: Human failure (LSCITS EngD 2012)

System resilience

• Failure avoidance

– Fault avoidance

– Fault detection

– Fault tolerance

• Failure recovery

– Returning to normal operation after the occurrence of a system failure

Page 29: Human failure (LSCITS EngD 2012)

Incident reduction

• Reduce the number of latent conditions in the different layers of the system (plug the holes)

– If the number of faults in a software system is reduced, this increases the strength of the defensive layer

– However, this technical approach on its own cannot be completely effective, as it is practically impossible to reduce the number of latent conditions in the system to zero

• Increase the number of defensive layers and hence reduce the probability of an accident trajectory occurring

• Reduce the number of active faults that occur

Page 30: Human failure (LSCITS EngD 2012)

Conditions leading to human error

• Distractions

• Incomplete or incorrect data

• Boredom

• Inadequate resources

• Cognitive overload

• Stress

• Illness

• Time pressure

Page 31: Human failure (LSCITS EngD 2012)

Systems design and human error

• Once we begin to understand what human errors are possible and how they can come about, we can start designing systems that better withstand human error

• Avoidance

– Design the system so that certain classes of human error are eliminated

• Detection

– Make it easier for the operator and others to spot errors

• Tolerance

– Ensure that individual errors are unlikely to lead to system failure

Page 32: Human failure (LSCITS EngD 2012)

Design guidelines

• Minimise potential for slips, lapses and mistakes by designing systems and work environments where people aren’t distracted or overwhelmed, but aren’t bored either

• Minimise potential for mistakes by designing systems and work environments where people are able to understand what is happening and the consequences of an action

• Minimise potential for mistakes by making sure people are trained properly

• Minimise potential for deliberate violations by making sure rules are well designed and well understood. Monitor the application of rules

Page 33: Human failure (LSCITS EngD 2012)

Detection and tolerance

• Detecting and correcting error

– Automated correction can be useful, but can be dangerous!

– Alarms and alerts may be better than automated correction, but need to be well designed.

– Allow for human correction, by making it possible to ‘undo’ (sketched at the end of this slide)

– Make it easier for the user or another person to spot errors

• Tolerating human error

– Remember, breaking the rules might be for good reasons. Think of users as people attempting to do things, rather than simply as operators of the system.

– Human error is common, so try not to create systems where a single human error can cause a failure.
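
Supporting ‘undo’, as suggested in the list above, usually means recording an inverse action for every change. A minimal, generic sketch (not from the lecture; the file-rename example is illustrative):

from typing import Callable, List

class UndoStack:
    """Record an inverse action for every change so users can correct their own slips."""

    def __init__(self) -> None:
        self._undo_actions: List[Callable[[], None]] = []

    def do(self, action: Callable[[], None], inverse: Callable[[], None]) -> None:
        action()
        self._undo_actions.append(inverse)

    def undo(self) -> None:
        if self._undo_actions:
            self._undo_actions.pop()()

# Example usage (commented out because the files are hypothetical):
# import os
# stack = UndoStack()
# stack.do(lambda: os.rename("a.txt", "b.txt"),
#          lambda: os.rename("b.txt", "a.txt"))
# stack.undo()  # restores the original file name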

Page 34: Human failure (LSCITS EngD 2012)

Recovery

• Design for failure

– Discussed in previous module on systems engineering for LSCITS

• Make work visible

• Switch from enforcing mode to auditing mode

• Support role transferability

• Balance recovery and security

Page 35: Human failure (LSCITS EngD 2012)

Key points

• Human error accounts for the majority of all system failures

• Human error is very common, and only occasionally leads to failure

• The same human action may or may not be an error depending on the context of that action

• There are methods for analysing and predicting human errors, and these can be used to improve system design

• Some systems are more prone to accidents than others because of the way they have been designed

• Critical systems should be designed to minimise or detect human error

• Blaming the user is a common response to human error, but the fault lies with the system and the system engineers.