Quality Assurance & Quality Control University of New Mexico Kristin Vanderbilt William Michener...

Post on 13-Jan-2016

213 views 0 download

Transcript of Quality Assurance & Quality Control University of New Mexico Kristin Vanderbilt William Michener...

Quality Assurance & Quality Control

University of New Mexico Kristin Vanderbilt William Michener James Brunt Troy Maddux

University of South Carolina Don Edwards

Instruction Materials Primary References

Michener and Brunt (2000) Ecological Data: Design, Management and Processing. Blackwell Science.

Edwards (2000) Ch. 4 Brunt (2000) Ch. 2 Michener (2000) Ch. 7

Dux, J.P. 1986. Handbook of Quality Assurance for the Analytical Chemistry Laboratory. Van Nostrand Reinhold Company

Mullins, E. 1994. Introduction to Control Charts in the Analytical Laboratory: Tutorial Review. Analyst 119: 369 – 375.

Outline:

Define QA/QC QC procedures

Designing data sheets Data entry using validation rules, filters,

lookup tables QA procedures

Graphics and Statistics Outlier detection

Samples Simple linear regression

QA/QC “mechanisms [that] are designed to prevent

the introduction of errors into a data set, a process known as data contamination”

Brunt 2000

Errors (2 types)

Commission: Incorrect or inaccurate data in a dataset

Can be easy to find Malfunctioning instrumentation

Sensor drift Low batteries Damage Animal mischief

Data entry errors

Omission Difficult or impossible to find Inadequate documentation of data values, sampling

methods, anomalies in field, human errors

Quality Control “mechanisms that are applied in advance,

with a priori knowledge to ‘control’ data quality during the data acquisition process”

Brunt 2000

Quality Assurance “mechanisms [that] can be applied after

the data have been collected, entered in a computer and analyzed to identify errors of omission and commission” graphics statistics

Brunt 2000

QA/QC Activities Defining and enforcing standards for

formats, codes, measurement units and metadata.

Checking for unusual or unreasonable patterns in data.

Checking for comparability of values between data sets.

Brunt 2000

Outline:

Define QA/QC QC procedures

Designing data sheets Data entry using validation rules, filters,

lookup tables QA procedures

Graphics and Statistics Outlier detection

Samples Simple linear regression

Flowering Plant Phenology – Data Entry Form Design

Four sites, each with 3 transects Each species will have

phenological class recorded

Plant Life Stage______________ _____________________________ _____________________________ _____________________________ _____________________________ _____________________________ _____________________________ _____________________________ _______________

What’s wrong with this data sheet?

Data Collection Form Development:

Plant Life Stageardi P/G V B FL FR M S D NParpu P/G V B FL FR M S D NPatca P/G V B FL FR M S D NPbamu P/G V B FL FR M S D NPzigr P/G V B FL FR M S D NP

P/G V B FL FR M S D NPP/G V B FL FR M S D NP

PHENOLOGY DATA SHEETCollectors:_________________________________Date:___________________ Time:_________Location: black butte, deep well, five points, goat draw

Transect: 1 2 3

Notes: _________________________________________

P/G = perennating or germinating M = dispersingV = vegetating S = senescingB = budding D = deadFL = flowering NP = not presentFR = fruiting

Outline:

Define QA/QC QC procedures

Designing data sheets Data entry using validation rules, filters,

lookup tables QA procedures

Graphics and Statistics Outlier detection

Samples Simple linear regression

Other methods for preventing data contamination

Double-keying of data by independent data entry technicians followed by computer verification for agreement

Randomly check 10% of data to calculate frequency of errors

Use text-to-speech program to read data back Filters for “illegal” data

Computer/database programs Legal range of values Sanity checks

Edwards 2000

Table ofPossibleValues

and Ranges

Raw DataFile

Illegal DataFilter

Report ofProbable

Errors

Flow of Information in Filtering “Illegal” Data

Edwards 2000

Illegal-data filter in SASData Checkum; Set all;

Message=repeat(“ ”,39);If Y1<0 or Y1>1 then do; message=“Y1 is not

on the interval [0,1]”; output; end;If Y3>Y4 then do; message=“Y3 is larger than

Y4”; output; end;…….(add as many checks as desired)If message NE repeat(“ ”, 39);Keep ID message;

Proc Print Data=Checkum;

Outline:

Define QA/QC QC procedures

Designing data sheets Data entry using validation rules, filters,

lookup tables QA procedures

Graphics and Statistics Outlier detection

Samples Simple linear regression

Identifying Sensor Errors: Comparison of data from three Met stations, Sevilleta LTER

Identification of Sensor Errors: Comparison of data from three Met stations, Sevilleta LTER

QA/QC in the Lab: Using Control Charts

Laboratory quality control using statistical process control charts:

Determine whether analytical system is “in control” by examining Mean Variability (range)

All control charts have three basic components:

a centerline, usually the mathematical average of all the samples plotted.

upper and lower statistical control limits that define the constraints of common cause variations.

performance data plotted over time.

16

18

20

22

24

26

0 2 4 6 8 10 12 14 16 18 20

Time

Mean

LCL

UCL

X-Bar Control Chart

Constructing an X-Bar control chart

Each point represents a check standard run with each group of 20 samples, for example

UCL = mean + 3*standard deviation

Things to look for in a control chart:

The point of making control charts is to look at variation, seeking patterns or statistically unusual values. Look for: 1 data point falling outside the control limits 6 or more points in a row steadily increasing

or decreasing 8 or more points in a row on one side of the

centerline 14 or more points alternating up and down

Linear trend

0

2

4

6

8

10

12

0 2 4 6 8 10 12

time

Outline:

Define QA/QC QC procedures

Designing data sheets Data entry using validation rules, filters,

lookup tables QA procedures

Graphics and Statistics Outlier detection

Samples Simple linear regression

Outliers An outlier is “an unusually extreme value

for a variable, given the statistical model in use”

The goal of QA is NOT to eliminate outliers! Rather, we wish to detect unusually extreme values.

Edwards 2000

Outlier Detection

“the detection of outliers is an intermediate step in the elimination of [data] contamination” Attempt to determine if contamination is

responsible and, if so, flag the contaminated value.

If not, formally analyse with and without “outlier(s)” and see if results differ.

Edwards 2000

Methods for Detecting Outliers Graphics

Scatter plots Box plots Histograms Normal probability plots

Formal statistical methods Grubbs’ test

Edwards 2000

X-Y scatter plots of gopher tortoise morphometrics

Michener 2000

Example of exploratory data analysis using SAS Insight®Michener 2000

Box Plot Interpretation

Upper quartileMedian

Lower quartile

Upper adjacent value

Pts. > upper adjacent value

Lower adjacent valuePts. < lower adjacent value

Inter-quartile range

Box Plot Interpretation

Inter-quartile range

IQR = Q(75%) – Q(25%)

Upper adjacent value = largest observation < (Q(75%) + (1.5 X IQR))

Lower adjacent value = smallest observation > (Q(25%) - (1.5 X IQR))

Box Plots Depicting Statistical Distribution of Soil Temperature

Statistical tests for outliers assume that the data are normally distributed….

CHECK THIS ASSUMPTION!

Y

-3 -2 -1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

a. The Normal Frequency Curve

Y

-3 -2 -1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

b. Cumulative Normal Curve Areas

Normal density and Cumulative Distribution Functions

Edwards 2000

Quantiles of Standard Normal

Y

-2 -1 0 1 2

78

910

1112

Normal Plot of 30 Observations from a Normal Distribution

Edwards 2000

Quantiles of Standard Normal

Y

-2 -1 0 1 2

01

23

4

a. a right skewed distribution

Quantiles of Standard Normal

Y

-2 -1 0 1 2

0.0

1.0

2.0

3.0

b. a left-truncated Normal distribution

Quantiles of Standard Normal

Y

-2 -1 0 1 2

-50

5

c. a heavy-tailed distribution

Quantiles of Standard Normal

Y

-2 -1 0 1 22

46

810

12

14

16

d. a contaminated Normal distribution

Edwards 2000

Normal Plots from Non-normally Distributed Data

Grubbs’ Text is a Statistical Test to determine if a point is an outlier.

Grubbs’ test for outlier detection:

Tn = (Yn – Ybar)/S

where Yn is the possible outlier,

Ybar is the mean of the sample, and

S is the standard deviation of the sample

Contamination exists if Tn is greater than T.01[n]

Example of Grubbs’ test for outliers: rainfall in acre-feet from seeded clouds (Simpson et al. 1975)

4.1 7.7 17.5 31.432.7 40.6 92.4 115.3118.3 119.0 129.6 198.6200.7 242.5 255.0 274.7274.7 302.8 334.1 430.0489.1 703.4 978.0 1656.01697.8 2745.6T26 = 3.539 > 3.029; Contaminated

Edwards 2000But Grubbs’ test is sensitive to non-normality……

0 500 1500 2500

05

10

15

20

rainfall

a. rainfall

Quantiles of Standard Normal

rain

fall

-2 -1 0 1 2

0500

1500

2500

b. Normal plot

0.5 1.5 2.5 3.5

02

46

810

log10(rainfall)

c. log10(rainfall)

Quantiles of Standard Normal

log10(r

ain

fall)

-2 -1 0 1 2

0.5

1.5

2.5

3.5

d. Normal plot

Edwards 2000

Checking Assumptions on Rainfall Data

Contaminated

Normallydistributed

Simple Linear Regression:check for model-based……

Outliers Influential (leverage) points

Influential points in simple linear regression:

A “leverage” point is a point with an unusual regressor value that has more weight in determining regression coefficients than the other data values.

Edmonds 2000

x

y

0 10 20 30 40 50

50

100

150

outlier andleverage pt

leverage pt

outlier

Edwards 2000

Influential Data Points in a Simple Linear Regression

x

y

0 10 20 30 40 50

50

100

150

outlier andleverage pt

leverage pt

outlier

Edwards 2000

Influential Data Points in a Simple Linear Regression

x

y

0 10 20 30 40 50

50

100

150

outlier andleverage pt

leverage pt

outlier

Edwards 2000

Influential Data Points in a Simple Linear Regression

x

y

0 10 20 30 40 50

50

100

150

outlier andleverage pt

leverage pt

outlier

Edwards 2000

Influential Data Points in a Simple Linear Regression

body weight (kg)

bra

in w

eig

ht (g

)

0 2000 4000 6000

01000

2000

3000

4000

5000

African Elephant

Asian Elephant

Man

Edwards 2000

Brain weight vs. body weight, 63 species of terrestrial mammals

Leverage pts.

“Outliers”

log10 body weight (kg)

log10 b

rain

weig

ht (g

)

-2 -1 0 1 2 3 4

-10

12

3

Elephants

Man

Chinchilla

mispunched point

Edwards 2000

Logged brain weight vs. logged body weight

“Outliers”

Outliers in simple linear regression

0

50

100

150

200

250

0 50 100 150 200 250 300 350 400 450

Beta-Glucosidase

Ph

os

ph

ata

se

Observation 62

Outliers: identify using studentized residuals Contamination may exist if

|ri| > t /2, n-3

= 0.01

Simple linear regression:Outlier identification

n = 86

t/2,83 = 1.98

Simple linear regression:detecting leverage points

hi = (1/n) + (xi – x)2/(n-1)Sx2

A point is a leverage point if hi > 4/n, where n is the number of points used to fit the regression

Regression with leverage point:Soil nitrate vs. soil moisture

Regression without leverage point

Observation 46

Output from SAS:Leverage points

n = 336

hi cutoff value: 4/336=0.012

Process from raw data to verified data to validated data (reviewed by a qualified scientist) is called data maturity, and implies increasing

confidence in the quality of the data through time

Data Quality (Confidenc

e in the Data)

Data Maturity

Raw Data

Verified Data

Validated Data