Handling Data and Figures of Merit Data comes in different formats time Histograms Lists But…. Can...

Post on 28-Jan-2016


Handling Data and Figures of Merit

Data comes in different formats:
time
histograms
lists

But they can contain the same information about quality.

What is meant by quality?

(Figures of merit)
Precision, separation (selectivity), limits of detection, linear range

day  weight   day  weight   day  weight
1    140      31   143.9    61   144
2    140.1    32   144      62   144.2
3    139.8    33   142.5    63   144.5
4    140.6    34   142.9    64   144.2
5    140      35   142.8    65   143.9
6    139.8    36   143.9    66   144.2
7    139.6    37   144      67   144.5
8    140      38   144.8    68   144.3
9    140.8    39   143.9    69   144.2
10   139.7    40   144.5    70   144.9
11   140.2    41   143.9    71   144
12   141.7    42   144      72   143.8
13   141.9    43   144.2    73   144
14   141.4    44   143.8    74   143.8
15   142.3    45   143.5    75   144
16   142.3    46   143.8    76   144.5
17   141.9    47   143.2    77   143.7
18   142.1    48   143.5    78   143.9
19   142.5    49   143.6    79   144
20   142.3    50   143.4    80   144.2
21   142.1    51   143.9    81   144
22   142.5    52   143.6    82   144.4
23   143.5    53   144      83   143.8
24   143      54   143.8    84   144.1
25   143.2    55   143.6
26   143      56   143.8
27   143.4    57   144
28   143.5    58   144.2
29   142.7    59   144
30   143.7    60   143.9

My weight

Plot as a function of time data was acquired:

[Figure: weight (lbs) vs. day]

Do not use curved lines to connect data points – that assumes you know more about the relationship of the data than you really do

Comments: background is white (less ink); Font size is larger than Excel default (use 14 or 16)


A bin is the range of weights that are grouped together. It is like a grade curve that lists the number of students who got between 95 and 100 pts: 95-100 would be a bin.

Assume my weight is a single, random, set of similar data

Make a frequency chart (histogram) of the data

[Figure: histogram, number of observations vs. weight (lbs)]
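A minimal sketch of the binning step in Python, assuming 1 lb bins (the bin width used on the slide is not stated) and the weights transcribed from the table above. The function name `histogram` is only illustrative.

```python
import math
from collections import Counter

# Daily weights (lbs), days 1-84, transcribed from the table above.
weights = [
    140.0, 140.1, 139.8, 140.6, 140.0, 139.8, 139.6, 140.0, 140.8, 139.7,
    140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
    142.1, 142.5, 143.5, 143.0, 143.2, 143.0, 143.4, 143.5, 142.7, 143.7,
    143.9, 144.0, 142.5, 142.9, 142.8, 143.9, 144.0, 144.8, 143.9, 144.5,
    143.9, 144.0, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4,
    143.9, 143.6, 144.0, 143.8, 143.6, 143.8, 144.0, 144.2, 144.0, 143.9,
    144.0, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9,
    144.0, 143.8, 144.0, 143.8, 144.0, 144.5, 143.7, 143.9, 144.0, 144.2,
    144.0, 144.4, 143.8, 144.1,
]


def histogram(values, bin_width=1.0):
    """Count observations per bin; each bin covers [edge, edge + bin_width)."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


counts = histogram(weights)
for edge, n in counts.items():
    # One '#' per observation gives a quick text version of the frequency chart.
    print(f"{edge:.0f} lbs: {'#' * n}")
```

Changing `bin_width` shows how the apparent shape of the distribution depends on the choice of bins.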

Create a “model” of my weight and determine my average weight and how consistent my weight is

[Figures: weight (lbs) vs. day; histogram, number of observations vs. weight (lbs), with the model curve, its inflection point, and s marked]

s = measure of the consistency, or similarity, of weights

average = 143.11 lbs

s = 1.4 lbs

s = standard deviation
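The average and standard deviation quoted above can be checked with a short Python sketch. It assumes the weights transcribed from the table and uses the sample standard deviation (n - 1 in the denominator); the slides do not say which form was used.

```python
import statistics

# Daily weights (lbs), days 1-84, transcribed from the table above.
weights = [
    140.0, 140.1, 139.8, 140.6, 140.0, 139.8, 139.6, 140.0, 140.8, 139.7,
    140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
    142.1, 142.5, 143.5, 143.0, 143.2, 143.0, 143.4, 143.5, 142.7, 143.7,
    143.9, 144.0, 142.5, 142.9, 142.8, 143.9, 144.0, 144.8, 143.9, 144.5,
    143.9, 144.0, 144.2, 143.8, 143.5, 143.8, 143.2, 143.5, 143.6, 143.4,
    143.9, 143.6, 144.0, 143.8, 143.6, 143.8, 144.0, 144.2, 144.0, 143.9,
    144.0, 144.2, 144.5, 144.2, 143.9, 144.2, 144.5, 144.3, 144.2, 144.9,
    144.0, 143.8, 144.0, 143.8, 144.0, 144.5, 143.7, 143.9, 144.0, 144.2,
    144.0, 144.4, 143.8, 144.1,
]

average = statistics.mean(weights)
s = statistics.stdev(weights)  # sample standard deviation (n - 1 denominator)

print(f"average = {average:.2f} lbs, s = {s:.1f} lbs")
```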

Characteristics of the Model Population (Random, Normal)

Peak height, A
Peak location (mean or average), μ
Peak width, W, at baseline
Peak width at half height, W1/2

Standard deviation, s, estimates the variation in an infinite population, σ

Related concepts

$$f(x) = \frac{A}{s\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{s}\right)^{2}}$$

[Figure: normal error curve, amplitude vs. x in units of s]

Width is measured at the inflection point = s

W1/2

Triangulated peak: base width is 2s < W < 4s

[Figure: normal error curve, amplitude vs. x in units of s, with the areas within +/- 1s, 2s, and 3s marked]

Area +/- 1s = 68.3%

Area +/- 2s = 95.4%

Area +/- 3s = 99.73%

pp ~ 6s

pp = peak to peak, or the largest separation of measurements

Peak to peak is sometimes easier to “see” on the data vs. time plot
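The areas quoted for a normal population can be reproduced from the error function. This is a small Python sketch; `area_within` is an illustrative name, not something from the slides.

```python
import math


def area_within(k):
    """Fraction of a normal population within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"+/- {k}s: {100 * area_within(k):.2f}%")
```

Because nearly all observations fall within +/- 3s, the full spread of a data set (peak to peak) is roughly 6s.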

[Figure: weight (lbs) vs. day, with the peak-to-peak range marked]

pp ~ 6s

Peak to peak: 139.5 to 144.9 lbs

s ~ pp/6 = (144.9 - 139.5)/6 ~ 0.9

(Calculated s = 1.4)
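The peak-to-peak estimate is simple enough to write as a one-line function. A sketch in Python, using the two extremes quoted above (read from the plot); the function name is illustrative.

```python
def s_from_peak_to_peak(lowest, highest):
    """Rough standard deviation estimate, assuming peak to peak spans about 6s."""
    return (highest - lowest) / 6


s_est = s_from_peak_to_peak(139.5, 144.9)
print(f"s ~ {s_est:.1f} lbs")
```

The estimate is only as good as the assumption that the data come from a single normal population, which is why it differs from the calculated value.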

[Figures: histogram, number of observations vs. weight (lbs); normal error curve, amplitude vs. x in units of s]

There are some other important characteristics of a normal (random) population

Scale up the first derivative and second derivative to see better

[Figure: population (0th derivative), 1st derivative, and 2nd derivative, amplitude vs. x in units of s]

Population, 0th derivative

1st derivative: peak is at the inflection point of the population curve; determines the std. dev.

2nd derivative: peak is at the inflection point of the first derivative; should be symmetrical for a normal population; goes to zero at the std. dev.
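For reference, the derivative statements can be checked directly. The block below is a worked sketch that assumes the normal error curve has the form f(x) = A/(s√(2π)) exp(-(x-μ)²/(2s²)), with μ the mean; the symbols are assumptions where the slides do not show them.

```latex
f(x)   = \frac{A}{s\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{s}\right)^{2}}

f'(x)  = -\frac{x-\mu}{s^{2}}\, f(x)

f''(x) = \left(\frac{(x-\mu)^{2}}{s^{2}} - 1\right)\frac{f(x)}{s^{2}}

f''(x) = 0 \quad\Longleftrightarrow\quad x = \mu \pm s
```

The second derivative is zero at x = μ ± s, so the inflection points of the population curve, and therefore the extremes of the first derivative, sit one standard deviation from the mean.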

Asymmetry can be determined from principal component analysis

A. F. (≠ Alanah Fitch) = asymmetry factor

Is there a difference between my “baseline” weight and school weight?
Can you “detect” a difference? Can you “quantitate” a difference?

[Figure: weight (lbs) vs. day, with Baseline, Vacation, and School Begins marked]

Comparing TWO populations of measurements

[Figure: histogram, number of observations vs. weight (lbs), with the baseline and school populations marked]

Exact same information displayed differently, but now we divide the data into different measurement populations: baseline and school

Model of the data as two normal populations

[Figure: histogram, number of observations vs. weight (lbs), with two normal curves labeled with the average baseline weight, the average school weight, the standard deviation of the baseline weight, and the standard deviation of the school weight]

[Figures: weight (lbs) vs. day; histogram with the single-population model; histogram with the two-population model]

We have two models to describe the population of measurements of my weight. In one we assume that all measurements fall into a single population. In the second we assume that the measurements have sampled two different populations.

Which is the better model? How do we quantify “better”?

[Figures: histograms, number of observations vs. weight (lbs), comparing the single-population model and the two-population model with the data]

Compare how closely the measured data fit the model

Did I gain weight?

The red bars represent the difference between the two-population model and the data.

The purple lines represent the difference between the single-population model and the data. Which model has the smaller summed differences?

This process (summing the squares of the differences) is essentially what occurs in an ANOVA

Analysis of variance

Normally we sum the squares of the differences in order to account for both positive and negative differences.

In the bad old days you had to work out all the sums of squares. In the good new days you can ask Excel to do it for you.

Anova: Single Factor (5% significance level)

SUMMARY
Groups     Count   Sum      Average   Variance
Column 1   12      277.41   23.1175   8.70360227
Column 2   12      345.72   28.81     6.50010909

ANOVA
Source of Variation   SS         df   MS         F            P-value    F crit
Between Groups        194.4273   1    194.4273   25.5762995   4.59E-05   4.300949
Within Groups         167.2408   22   7.601856
Total                 361.6682   23

Test: is F < Fcritical?
If true: hypothesis true, single population
If false: hypothesis false; the data cannot be explained by a single population at the 5% significance level
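The F value in the Excel output can be reproduced from the SUMMARY rows alone. The Python sketch below assumes the standard one-way ANOVA sums of squares; `one_way_anova` is an illustrative name, and the critical value is copied from the table rather than computed.

```python
def one_way_anova(groups):
    """groups: list of (count, mean, variance) tuples, one per group.

    Returns (SS between groups, SS within groups, F).
    """
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * var for n, _, var in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_stat


# Count, average, and variance for Column 1 and Column 2 in the SUMMARY table.
summary = [(12, 23.1175, 8.70360227), (12, 28.81, 6.50010909)]
ss_b, ss_w, f_stat = one_way_anova(summary)

F_CRIT = 4.300949  # copied from the Excel output, not computed here
print(f"SS between = {ss_b:.4f}, SS within = {ss_w:.4f}, F = {f_stat:.2f}")
print("single population" if f_stat < F_CRIT else "not a single population")
```

Here F is well above F crit, so the single-population hypothesis is rejected for these two columns.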

[Figure: frequency vs. length (cm), red and white treated as separate populations]
White, N=12, sum sq diff=0.037, stdev=2.55
White, N=38, sum sq diff=0.028, stdev=2.15
Red, N=12, sum sq diff=0.11, stdev=3.27
Red, N=40, sum sq diff=0.017, stdev=2.67

[Figure: frequency vs. length (cm), all data treated as a single population]
N=24, sum sq diff=0.0449, stdev=3.96
N=78, sum sq diff=0.108, stdev=4.05

In an Analysis of Variance you test the hypothesis that the sample is best described as a single population.
1. Create the expected frequency (Gaussian from normal error curve)
2. Measure the deviation between the histogram point and the expected frequency
3. Square to remove signs
4. SS = sum squares
5. Compare to expected SS, which scales with population size
6. If larger than expected, then the deviations cannot be explained assuming a single population
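Steps 1 to 4 above can be sketched in Python as follows. This is only an illustration: the lengths are made-up values, the bin edges are arbitrary, and the expected frequency of each bin is approximated by the Gaussian density at the bin centre times the bin width. Steps 5 and 6 (comparison with the expected SS) are not implemented.

```python
import math
import statistics


def gaussian_pdf(x, mean, s):
    """Normal error curve with unit area."""
    return math.exp(-0.5 * ((x - mean) / s) ** 2) / (s * math.sqrt(2 * math.pi))


def sum_sq_diff(values, low, bin_width, n_bins):
    """Sum of squared differences between observed and expected bin frequencies."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    total = 0.0
    for i in range(n_bins):
        left = low + i * bin_width
        # Observed relative frequency in [left, left + bin_width).
        observed = sum(left <= v < left + bin_width for v in values) / len(values)
        # Expected relative frequency, approximated at the bin centre.
        expected = gaussian_pdf(left + bin_width / 2, mean, s) * bin_width
        total += (observed - expected) ** 2
    return total


# Hypothetical lengths (cm), made up for illustration only.
group_a = [20.1, 21.4, 22.0, 22.3, 23.1, 23.5, 24.0, 24.8]
group_b = [27.9, 28.6, 29.2, 29.5, 30.1, 30.8, 31.0, 32.2]

ss_single = sum_sq_diff(group_a + group_b, low=18, bin_width=2, n_bins=8)
ss_two = sum_sq_diff(group_a, 18, 2, 8) + sum_sq_diff(group_b, 18, 2, 8)
print(f"single population SS = {ss_single:.4f}, two populations SS = {ss_two:.4f}")
```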


[Figure: squared difference (expected - measured) vs. length (cm)]

The squared differences for the assumption of a single population are larger than for the assumption of two individual populations

There are other measurements which describe the two populations

Resolution of two peaks

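$$R = \frac{\bar{x}_a - \bar{x}_b}{\frac{W_a}{2} + \frac{W_b}{2}}$$

where x̄ is the mean or average of each peak and W is its baseline width.

[Figure: signal vs. x for two peaks, with the means x_a and x_b, the separation x_a - x_b, and the half widths W_a/2 and W_b/2 marked]

In this example W_a/2 = W_b/2

Peaks are baseline resolved when R > 1:

$$R = \frac{\bar{x}_a - \bar{x}_b}{\frac{W_a}{2} + \frac{W_b}{2}} > 1$$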

[Figure: signal vs. x for two peaks, with the means x_a and x_b, the separation x_a - x_b, and the half widths W_a/2 and W_b/2 marked]

In this example W_a/2 = W_b/2

Peaks are just baseline resolved when R = 1:

$$R = \frac{\bar{x}_a - \bar{x}_b}{\frac{W_a}{2} + \frac{W_b}{2}} = 1$$

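[Figure: signal vs. x for two peaks, with the means x_a and x_b, the separation x_a - x_b, and the half widths W_a/2 and W_b/2 marked]

In this example W_a/2 = W_b/2

Peaks are not baseline resolved when R < 1:

$$R = \frac{\bar{x}_a - \bar{x}_b}{\frac{W_a}{2} + \frac{W_b}{2}} < 1$$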

2008 Data

[Figure: frequency vs. length (cm). White, N=12, sum sq diff=0.037; Red, N=12, sum sq diff=0.11]

What is the R for this data?

x W Wp R W 12

R 1


Visually less resolved vs. visually better resolved

[Figures: “Comparison of 1978 Low Lead to 1979 High Lead” (Series2, Series3) and “Comparison of 1978 Low Lead to 1978 High Lead”, % measured vs. verbal IQ]

Anonymous 2009 student analysis of Needleman data

$$\frac{W_a}{2} \approx 11.2 - 7.0 \approx 4.2$$

$$\frac{W_b}{2} \approx 13.0 - 9.5 \approx 3.5$$

$$\bar{x}_a - \bar{x}_b \approx 11.2 - 9.5 \approx 1.7$$

$$R = \frac{\bar{x}_a - \bar{x}_b}{\frac{W_a}{2} + \frac{W_b}{2}} \approx \frac{1.7}{4.2 + 3.5} \approx 0.22$$
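The same arithmetic as a small Python sketch, using the values read off the plot in the student analysis above. The function name `resolution` is illustrative.

```python
def resolution(mean_a, mean_b, half_width_a, half_width_b):
    """R = |x_a - x_b| / (W_a/2 + W_b/2), with W the baseline width of each peak."""
    return abs(mean_a - mean_b) / (half_width_a + half_width_b)


r = resolution(mean_a=11.2, mean_b=9.5, half_width_a=4.2, half_width_b=3.5)
print(f"R ~ {r:.2f}")
print("baseline resolved" if r > 1 else "not baseline resolved")
```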

Other measures of the quality of separation of the Peaks

1. Limit of detection
2. Limit of quantification
3. Signal to noise (S/N)

[Figures: normal error curve of the blank, amplitude vs. x in units of s, with the blank mean (x blank) and the limit of detection (x limit of detection) marked 3s above it]

$$x_{LOD} = \bar{x}_{blank} + 3 s_{blank}$$

99.73% of the observations of the blank lie within +/- 3s of the blank mean, so about 99.87% of them will lie below the mean of the first detectable signal (LOD)
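A minimal Python sketch of the signal LOD, assuming a set of replicate blank readings is available. The readings below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import statistics


def signal_lod(blank_readings, k=3):
    """Signal at the limit of detection: blank mean plus k blank standard deviations."""
    return statistics.mean(blank_readings) + k * statistics.stdev(blank_readings)


blanks = [0.10, 0.12, 0.09, 0.11, 0.08]  # hypothetical blank readings
lod = signal_lod(blanks)
print(f"signal LOD = {lod:.4f}")
```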

[Figure: normal error curves for the blank and a signal 3s above it, amplitude vs. x in units of s]

Two peaks are visible when all the data is summed together

Estimate the LOD (signal) of this data

[Figures: weight (lbs) vs. day; histogram, number of observations vs. weight (lbs)]


[Figure: normal error curve of the blank, amplitude vs. x in units of s]

$$x_{LOQ} = \bar{x}_{blank} + 9 s_{blank}$$

Your book suggests 10

[Figure: normal error curves for the blank and a signal 9s above it, amplitude vs. x in units of s]

Limit of quantification requires absolute certainty that no blank is part of the measurement
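The limit of quantification follows the same pattern as the limit of detection with a larger multiplier. The Python sketch below compares the factor of 9 used here with the factor of 10 suggested by the book; the blank readings are hypothetical.

```python
import statistics


def signal_loq(blank_readings, k=9):
    """Signal at the limit of quantification: blank mean plus k blank standard deviations."""
    return statistics.mean(blank_readings) + k * statistics.stdev(blank_readings)


blanks = [0.10, 0.12, 0.09, 0.11, 0.08]  # hypothetical blank readings
loq_9 = signal_loq(blanks, k=9)
loq_10 = signal_loq(blanks, k=10)
print(f"LOQ (9s) = {loq_9:.4f}, LOQ (10s) = {loq_10:.4f}")
```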

[Figures: normal error curve, amplitude vs. x in units of s; weight (lbs) vs. day]

Estimate the LOQ (signal) of this data

[Figures: weight (lbs) vs. day; histogram, number of observations vs. weight (lbs)]


Signal = x_sample - x_blank

Noise = N = standard deviation, s

$$\frac{S}{N} = \frac{\bar{x}_{sample} - \bar{x}_{blank}}{s} = \frac{\bar{x}_{sample} - \bar{x}_{blank}}{pp/6}$$
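A short Python sketch of the S/N calculation, accepting either a standard deviation or a peak-to-peak range for the noise. The numbers in the example are hypothetical and are not read from the weight plot.

```python
def signal_to_noise(mean_sample, mean_blank, s=None, peak_to_peak=None):
    """S/N = (x_sample - x_blank) / s, with s estimated as pp/6 when not given."""
    if s is None:
        if peak_to_peak is None:
            raise ValueError("give either s or peak_to_peak")
        s = peak_to_peak / 6
    return (mean_sample - mean_blank) / s


# Hypothetical values: sample mean 144.0, blank mean 140.0, peak to peak 1.2.
sn = signal_to_noise(144.0, 140.0, peak_to_peak=1.2)
print(f"S/N ~ {sn:.0f}")
```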

Estimate the S/N of this data

[Figure: weight (lbs) vs. day, with Baseline, Vacation, School Begins, and Signal marked]

Peak-to-peak variation within the school data (about its mean) ~ 6s, where s = N for noise

(This assumes pp school ~ pp baseline)

[Figure: histogram, number of observations vs. weight (lbs)]

[Figure: length (cm) vs. sample number]

Can you “tell” where the switch between red and white potatoes begins?

What is the signal (length of white)?
What is the background (length of red)?
What is the S/N?

Effect of sample size on the measurement

Error curve: peak height grows with # of measurements.
+/- 1s always has the same proportion of the total number of measurements

However, the actual value of s decreases as population grows

$$s_{sample} = \frac{s_{population}}{\sqrt{n_{sample}}}$$
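The square-root dependence is easy to tabulate. A Python sketch, assuming a population standard deviation of 4.0 purely for illustration:

```python
import math


def s_sample(s_population, n_sample):
    """s_sample = s_population / sqrt(n_sample)."""
    return s_population / math.sqrt(n_sample)


for n in (1, 4, 16, 64):
    print(f"n = {n:2d}: s_sample = {s_sample(4.0, n):.2f}")
```

Quadrupling the number of measurements only halves s_sample, so the gain from extra samples shrinks quickly.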

[Figure: 2008 data, red running length average and red running stdev vs. sample number]

[Figure: stdev of red length (cm) vs. sqrt of number of samples, with trendline y = -0.8807x + 5.9303, R² = 0.9491]

$$s_{sample} = \frac{s_{population}}{\sqrt{n_{sample}}}$$


Calibration Curve

A calibration curve is based on a selected measurement that is linear in response to the concentration of the analyte.

Or… a prediction of a measurement due to some change. Can we predict my weight change if I had spent a longer time on vacation?

$$y = a + bx$$

$$\text{lbs}_{\text{Fitch}} = a + b\,(\text{days on vacation})$$

[Figure: histogram, number of observations vs. weight (lbs), with 5 days marked]

The calibration curve contains information about the sampling of the population

[Figure: Fitch weight (lbs) vs. days on vacation, with trendline y = 0.3542x + 140.04, R² = 0.7425]

Can get this by using “trend line”
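What the trend line does can be sketched as an ordinary least-squares fit of y = a + bx. The Python below is an illustration only: the days and weights are hypothetical values, not the points behind the plotted trendline.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope (sensitivity)
    a = mean_y - b * mean_x  # intercept
    return a, b


days = [0, 1, 2, 3, 4, 5]                          # hypothetical days on vacation
lbs = [140.0, 140.5, 140.6, 141.2, 141.4, 141.9]   # hypothetical weights
a, b = fit_line(days, lbs)

# Extrapolating beyond the measured range is a prediction, not a measurement.
print(f"y = {b:.3f}x + {a:.2f}; predicted weight after 10 days: {a + b * 10:.1f} lbs")
```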

[Figure: stdev of red length (cm) vs. sqrt of number of samples, with trendline y = -0.8807x + 5.9303, R² = 0.9491]

This is just a trendline from “format” data

Sample   sqrt(#samples)   stdev
1        1                #DIV/0!
2        1.414213562      2.036468
3        1.732050808      4.475727
4        2                4.31441
5        2.236067977      3.844045
6        2.449489743      3.844604
7        2.645751311      3.735124
8        2.828427125      3.458414
9        3                3.235055
10       3.16227766       3.093053
11       3.31662479       2.935944
12       3.464101615      2.950187

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.296113395
R Square            0.087683143
Adjusted R Square   -0.013685397
Standard Error      0.703143388
Observations        11

ANOVA
             df   SS            MS         F          Significance F
Regression   1    0.427662048   0.427662   0.864994   0.376617
Residual     9    4.449695616   0.494411
Total        10   4.877357664

               Coefficients   Standard Error   t Stat     P-value    Lower 95%
Intercept      3.884015711    0.514960076      7.542363   3.53E-05   2.719094
X Variable 1   -0.06235252    0.067042092      -0.93005   0.376617   -0.21401

Using the data analysis pack

Get an error associated with the intercept

In the best of all worlds you should have a series of blanks that determine the “noise” associated with the background

$$x_{LOD} = \bar{x}_{blank} + 3 s_{blank}$$

Sometimes you forget, so to fall back and punt, estimate the standard deviation of the “blank” from the linear regression

But remember, in doing this you are acknowledging a failure to plan ahead in your analysis

$$x_{LOD} = \bar{x}_{blank} + b\,[\text{conc. LOD}]$$

$$[\text{conc. LOD}] = \frac{3 s_{blank}}{b}$$

Extrapolation of the associated error can be obtained from the linear regression data

Sensitivity (slope): b

$$\bar{x}_{blank} + 3 s_{blank} = \bar{x}_{blank} + b\,[\text{conc. LOD}]$$

The concentration LOD depends on BOTH the stdev of the blank and the sensitivity

Signal LOD: x_LOD

!!Note!! Signal LOD ≠ Conc. LOD
We want Conc. LOD
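The step from signal LOD to concentration LOD is a division by the calibration slope. A Python sketch with hypothetical numbers follows; taking the absolute value of the slope is an added assumption so that a negative-going calibration still gives a positive concentration.

```python
def concentration_lod(s_blank, slope, k=3):
    """[conc. LOD] = k * s_blank / |b|, where b is the calibration slope (sensitivity)."""
    return k * s_blank / abs(slope)


# Hypothetical values: blank standard deviation 0.02 signal units,
# sensitivity 0.5 signal units per concentration unit.
conc_lod = concentration_lod(s_blank=0.02, slope=0.5)
print(f"conc. LOD = {conc_lod:.2f} concentration units")
```

A steeper slope (higher sensitivity) or a quieter blank both lower the concentration LOD.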

[Figure: mV vs. pH or pM, calibration line y = -31.143x - 74.333, R² = 0.9994]

[Figure: mV vs. pH or pM for Pb2+ and H+, with lines y = -31.143x - 74.333 (R² = 0.9994) and y = -41x - 118.5 (R² = 0.9872); the limit of linearity is marked at 5% deviation]

Selectivity

Difference in slope is one measure of selectivity

In a perfect method the sensing device would have zero slope for the interfering species

Limit of linearity

5% deviation

Summary: Figures of Merit Thus far

R = resolution
S/N
LOD = both signal and concentration
LOQ
LOL
Sensitivity (calibration curve slope)
Selectivity (essentially difference in slopes)

Can be expressed in terms of signal, but a better expression is in terms of concentration

Tests: Anova

Why is the limit of detection important?

Why has the limit of detection changed so much in the last 20 years?

The End

[Figures: % of measurements vs. verbal IQ, two panels]

Which of these two data sets would be likely to have a better numerical value for the ability to distinguish between two different populations?

Needleman’s data

2008 Data

[Figure: frequency vs. length (cm). White, N=12, sum sq diff=0.037; Red, N=12, sum sq diff=0.11]

Height for normalized bell curve < 1

Which population is more variable? How can you tell?

[Figure: frequency vs. length (cm)]
White, N=12, sum sq diff=0.037, stdev=2.55
White, N=38, sum sq diff=0.028, stdev=2.15
Red, N=12, sum sq diff=0.11, stdev=3.27
Red, N=40, sum sq diff=0.017, stdev=2.67

Increasing the sample size decreases the std dev and increases the separation of the populations. Notice that the means also change; they will do so until we have a reasonable sample of the population.
