
Handling Data and Figures of Merit

Data comes in different formats: time plots, histograms, lists

But... they can contain the same information about quality

What is meant by quality?

(Figures of merit) Precision, separation (selectivity), limits of detection, linear range

day  weight    day  weight    day  weight
1    140       31   143.9     61   144
2    140.1     32   144       62   144.2
3    139.8     33   142.5     63   144.5
4    140.6     34   142.9     64   144.2
5    140       35   142.8     65   143.9
6    139.8     36   143.9     66   144.2
7    139.6     37   144       67   144.5
8    140       38   144.8     68   144.3
9    140.8     39   143.9     69   144.2
10   139.7     40   144.5     70   144.9
11   140.2     41   143.9     71   144
12   141.7     42   144       72   143.8
13   141.9     43   144.2     73   144
14   141.4     44   143.8     74   143.8
15   142.3     45   143.5     75   144
16   142.3     46   143.8     76   144.5
17   141.9     47   143.2     77   143.7
18   142.1     48   143.5     78   143.9
19   142.5     49   143.6     79   144
20   142.3     50   143.4     80   144.2
21   142.1     51   143.9     81   144
22   142.5     52   143.6     82   144.4
23   143.5     53   144       83   143.8
24   143       54   143.8     84   144.1
25   143.2     55   143.6
26   143       56   143.8
27   143.4     57   144
28   143.5     58   144.2
29   142.7     59   144
30   143.7     60   143.9

My weight

Plot as a function of the time the data was acquired:

[Figure: my weight (lbs) versus day, 0 to 60]

Do not use curved lines to connect data points – that assumes you know more about the relationship of the data than you really do

Comments: background is white (less ink); Font size is larger than Excel default (use 14 or 16)


A bin is the range of weights that are grouped together. It is like a grade curve that lists the number of students who scored between 95 and 100 points: 95-100 would be a bin.

Assume my weight is a single, random, set of similar data

Make a frequency chart (histogram) of the data

Create a “model” of my weight and determine the average weight and how consistent my weight is

[Figure: histogram of weight (lbs)]

[Figure: weight (lbs) versus day, 0 to 60, beside the histogram of weight (lbs) with a normal curve; the inflection point of the curve is marked]

Average = 143.11 lbs

s = 1.4 lbs

s = standard deviation = measure of the consistency, or similarity, of weights
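The average, the standard deviation, and the frequency chart can be sketched in a few lines of Python. This is a minimal illustration, not the spreadsheet procedure used for the slides: the function names and the 1 lb bin width are assumptions, and only the first 30 days of the weight list are typed in, so the printed numbers will not match the 143.11 and 1.4 quoted for the full data set.

```python
import math
from collections import Counter

# First 30 days of the weight list (lbs); a shortened sample for illustration.
weights = [140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7,
           140.2, 141.7, 141.9, 141.4, 142.3, 142.3, 141.9, 142.1, 142.5, 142.3,
           142.1, 142.5, 143.5, 143, 143.2, 143, 143.4, 143.5, 142.7, 143.7]

def mean(values):
    """Average of the measurements."""
    return sum(values) / len(values)

def sample_stdev(values):
    """Sample standard deviation s (n - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def histogram(values, bin_width=1.0):
    """Count how many values fall in each bin; keys are the lower bin edges."""
    counts = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

print("average =", round(mean(weights), 2), "lbs")
print("s =", round(sample_stdev(weights), 2), "lbs")
for lower_edge, count in histogram(weights).items():
    print(lower_edge, "#" * count)
```

Each printed row is one bin, so the rows of `#` marks form a sideways frequency chart.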

Characteristics of the Model Population (Random, Normal)

Peak height, A
Peak location (mean or average)
Peak width, W, at baseline
Peak width at half height, W1/2

Standard deviation, s, estimates the variation in an infinite population, σ

Related concepts

[Figure: normal curve plotted from -5s to +5s]

Width is measured at the inflection point = s

Triangulated peak: base width is 2s < W < 4s

Area +/- 1s = 68.3%

Area +/- 2s = 95.4%

Area +/- 3s = 99.73%

pp ~ 6s

pp = peak to peak, or the largest separation of measurements

Peak to peak is sometimes easier to “see” on the data vs. time plot
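The areas under the normal curve can be checked numerically. The sketch below uses the error function from Python's standard library; the function name `area_within` is an illustrative choice.

```python
import math

def area_within(k):
    """Fraction of a normal population lying within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"+/- {k}s: {100 * area_within(k):.2f}%")
```

Almost the whole population lies within +/- 3s, a total span of 6s, which is why the peak-to-peak range is taken as roughly 6s.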

[Figure: weight (lbs) versus day, 0 to 60, with the peak-to-peak range marked]

pp ~ 6s

s ~ pp/6 = (144.9 - 139.5)/6 ~ 0.9

(Calculated s = 1.4)
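The peak-to-peak shortcut is a one-line calculation. A minimal sketch, with an assumed function name, using the largest and smallest values quoted above:

```python
def stdev_from_peak_to_peak(values):
    """Rough estimate s ~ pp/6, where pp = largest minus smallest measurement."""
    return (max(values) - min(values)) / 6

# Only the extremes matter for this estimate.
print(round(stdev_from_peak_to_peak([139.5, 144.9]), 2))
```

The estimate is only as good as the assumption that the data form a single random population. Here the weight drifts upward over time, which is one reason the quick estimate and the calculated s disagree.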

There are some other important characteristics of a normal (random) population

[Figure: normal population (0th derivative) with its 1st and 2nd derivatives, plotted from -5s to +5s; the first and second derivatives are scaled up to see them better]

Population: 0th derivative

1st derivative: its peak is at the inflection point of the population curve, which determines the std. dev.

2nd derivative: its peak is at the inflection point of the first derivative; it should be symmetrical for a normal population, and it goes to zero at the std. dev.
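These derivative properties can be verified for an ideal normal curve. The sketch below writes out the Gaussian and its analytic first and second derivatives; the function names and the unit-area normalization are assumptions made for illustration.

```python
import math

def gaussian(x, mean=0.0, s=1.0):
    """Normal error curve with unit area."""
    return math.exp(-((x - mean) ** 2) / (2 * s ** 2)) / (s * math.sqrt(2 * math.pi))

def first_derivative(x, mean=0.0, s=1.0):
    """Slope of the normal curve; largest in magnitude at mean +/- s."""
    return -(x - mean) / s ** 2 * gaussian(x, mean, s)

def second_derivative(x, mean=0.0, s=1.0):
    """Curvature of the normal curve; crosses zero at mean +/- s."""
    return ((x - mean) ** 2 / s ** 4 - 1 / s ** 2) * gaussian(x, mean, s)

for x in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(x, round(first_derivative(x), 4), round(second_derivative(x), 4))
```

The second derivative changes sign at x = s, the same point where the first derivative reaches its extreme value.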

Asymmetry can be determined from principal component analysis

A. F. (≠ Alanah Fitch) = asymmetry factor

Is there a difference between my “baseline” weight and school weight?
Can you “detect” a difference? Can you “quantitate” a difference?

[Figure: weight (lbs) versus day, 0 to 60, with three periods marked: Baseline, Vacation, School Begins]

Comparing TWO populations of measurements

[Figure: histogram of weight (lbs), 138 to 147, with the baseline and school measurements shown separately]

Exact same information displayed differently, but now we divide the data into different measurement populations: baseline and school

Model of the data as two normal populations

[Figure: two normal curves over weight (lbs), 138 to 147, labeled with the average baseline weight, the average school weight, the standard deviation of the baseline weight, and the standard deviation of the school weight]

We have two models to describe the population of measurements of my weight. In one we assume that all measurements fall into a single population. In the second we assume that the measurements have sampled two different populations.

Which is the better model? How do we quantify “better”?

[Figure: histogram of weight (lbs), 138 to 147, overlaid with the single-population model and with the two-population model]

Compare how closely the measured data fit the model

Did I gain weight?

The red bars represent the difference between the two-population model and the data

The purple lines represent the difference between the single-population model and the data

Which model has the smaller summed differences?

This process (summing the squares of the differences) is essentially what occurs in an ANOVA

Analysis of variance

Normally you sum the squares of the differences in order to account for both positive and negative differences.

In the bad old days you had to work out all the sums of squares. In the good new days you can ask Excel to do it for you.

Anova: Single Factor (5% certainty)

SUMMARY
Groups     Count   Sum      Average   Variance
Column 1   12      277.41   23.1175   8.70360227
Column 2   12      345.72   28.81     6.50010909

ANOVA
Source of Variation   SS         df   MS         F            P-value    F crit
Between Groups        194.4273   1    194.4273   25.5762995   4.59E-05   4.300949
Within Groups         167.2408   22   7.601856
Total                 361.6682   23

Test: is F < Fcritical?
If true: hypothesis true, single population.
If false: hypothesis false, the data cannot be explained by a single population at the 5% certainty level.

In this table F = 25.58 is larger than F crit = 4.30, so the two columns cannot be explained by a single population.
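For readers who want to see what the spreadsheet is doing, the sums of squares and F can be worked out directly. The sketch below is a plain-Python illustration of a single-factor ANOVA, not the Excel routine itself; the function name and the toy numbers are made up. It does not compute F crit, which still has to come from a table or from the spreadsheet output.

```python
def one_way_anova(groups):
    """Return (SS_between, SS_within, F) for a list of groups of measurements."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        # Spread of the group averages around the overall average.
        ss_between += len(g) * (group_mean - grand_mean) ** 2
        # Spread of the measurements around their own group average.
        ss_within += sum((v - group_mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_value

# Toy data (hypothetical), two groups of three measurements each.
ss_b, ss_w, f = one_way_anova([[1, 2, 3], [4, 5, 6]])
print("SS between =", ss_b, " SS within =", ss_w, " F =", f)
```

A large F means the group averages differ by more than the scatter within each group would explain.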

[Figure: histograms of length (cm), 14 to 39, modeled as two populations (white and red)]

White, N=12, sum sq diff=0.037, stdev=2.55; White, N=38, sum sq diff=0.028, stdev=2.15

Red, N=12, sum sq diff=0.11, stdev=3.27; Red, N=40, sum sq diff=0.017, stdev=2.67

[Figure: histograms of length (cm), 14 to 39, modeled as a single population]

N=24, sum sq diff=0.0449, stdev=3.96; N=78, sum sq diff=0.108, stdev=4.05

In an Analysis of Variance you test the hypothesis that the sample is best described as a single population.
1. Create the expected frequency (Gaussian from normal error curve)
2. Measure the deviation between the histogram point and the expected frequency
3. Square to remove signs
4. SS = sum of squares
5. Compare to the expected SS, which scales with population size
6. If larger than expected, the deviations cannot be explained assuming a single population

[Figure: histograms of length (cm) comparing the single-population model with the two-population model]

The squared differences for the assumption of a single population are larger than for the assumption of two individual populations

There are other measurements which describe the two populations

Resolution of two peaks

$R = \frac{\bar{x}_a - \bar{x}_b}{W_a/2 + W_b/2}$

$\bar{x}$ = mean or average

$W$ = baseline width

[Figure: two peaks with means $\bar{x}_a$ and $\bar{x}_b$ on an axis from 1 to 4; in this example $W_a = W_b$]

Peaks are baseline resolved when R > 1

[Figure: the same two peaks moved closer together; in this example $W_a = W_b$]

Peaks are just baseline resolved when R = 1

[Figure: the same two peaks overlapping; in this example $W_a = W_b$]

Peaks are not baseline resolved when R < 1
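A short sketch of the resolution calculation follows. It assumes the usual definition, the separation of the two means divided by the sum of the two half base widths; the function name and the example numbers are hypothetical.

```python
def resolution(mean_a, mean_b, width_a, width_b):
    """R = separation of the means / (W_a/2 + W_b/2), with W the baseline width."""
    return abs(mean_a - mean_b) / (width_a / 2 + width_b / 2)

# Two peaks of equal base width 1.0 (made-up values).
print(resolution(2.0, 3.5, 1.0, 1.0))   # means farther apart than one base width
print(resolution(2.0, 3.0, 1.0, 1.0))   # means exactly one base width apart
print(resolution(2.0, 2.5, 1.0, 1.0))   # means closer than one base width
```

With equal widths, R = 1 corresponds to the two peaks just touching at the baseline.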

2008 Data

[Figure: histograms of length (cm), 14 to 39. White, N=12, sum sq diff=0.037; Red, N=12, sum sq diff=0.11]

What is the R for this data?

x W Wp R W 12

[Figure: two panels, “Visually less resolved” and “Visually better resolved”. Comparison of 1978 Low Lead to 1979 High Lead; x-axis Verbal IQ, 0 to 160; series labeled Series2 and Series3]

Anonymous 2009 student analysis of Needleman data

$W_a/2 \approx 112 - 70 = 42$

$W_b/2 \approx 130 - 95 = 35$

$\bar{x}_a - \bar{x}_b \approx 112 - 95 \approx 17$

$R = \frac{\bar{x}_a - \bar{x}_b}{W_a/2 + W_b/2} \approx \frac{17}{42 + 35} \approx 0.22$

Other measures of the quality of separation of the peaks

1. Limit of detection
2. Limit of quantification
3. Signal to noise (S/N)

[Figure: normal curve of the blank, centered at $\bar{x}_{blank}$, and normal curve at the limit of detection, centered at $\bar{x}_{LOD}$, on an axis from -6 to 12]

$\bar{x}_{LOD} = \bar{x}_{blank} + 3 s_{blank}$

99.73% of the observations of the blank lie within +/- 3s of its mean, so nearly all of them (about 99.87%) lie below the mean of the first detectable signal (LOD)

Two peaks are visible when all the data is summed together

Estimate the LOD (signal) of this data

[Figure: weight (lbs) versus day, 0 to 60, and the histogram of weight (lbs), 138 to 147]

$\bar{x}_{LOQ} = \bar{x}_{blank} + 9 s_{blank}$ (your book suggests 10)

[Figure: normal curve of the blank and normal curve at the limit of quantification, on an axis from -6 to 12]

Limit of quantification requires absolute certainty that no blank is part of the measurement

Estimate the LOQ (signal) of this data

[Figure: weight (lbs) versus day, 0 to 60, and the histogram of weight (lbs), 138 to 147]
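The signal LOD and LOQ follow directly from the mean and standard deviation of the blank. In the sketch below the function name is an assumption, and treating the first ten days of the weight list as the “blank” (baseline) is also an assumption made only to have numbers to work with.

```python
import statistics

def detection_limits(blank, k_lod=3, k_loq=9):
    """Return (signal LOD, signal LOQ) = blank mean + k * blank standard deviation."""
    mean_blank = statistics.mean(blank)
    s_blank = statistics.stdev(blank)
    return mean_blank + k_lod * s_blank, mean_blank + k_loq * s_blank

# Days 1-10 of the weight list, assumed here to represent the baseline (lbs).
baseline = [140, 140.1, 139.8, 140.6, 140, 139.8, 139.6, 140, 140.8, 139.7]

lod, loq = detection_limits(baseline)
print("signal LOD =", round(lod, 2), "lbs; signal LOQ =", round(loq, 2), "lbs")

# Using the factor of 10 for the LOQ instead of 9.
print("LOQ with k = 10:", round(detection_limits(baseline, k_loq=10)[1], 2), "lbs")
```

A weight has to exceed the LOD before a gain can be detected, and exceed the LOQ before the size of the gain can be stated with confidence.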

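Signal = $\bar{x}_{sample} - \bar{x}_{blank}$

Noise = N = standard deviation, s

$\frac{S}{N} = \frac{\bar{x}_{sample} - \bar{x}_{blank}}{s} \approx \frac{\bar{x}_{sample} - \bar{x}_{blank}}{pp/6}$

Estimate the S/N of this data

[Figure: weight (lbs) versus day, 0 to 60, with the Baseline, Vacation, and School Begins periods marked; the signal is the step between the baseline and the school weight]

Peak-to-peak variation within the school weight ~ 6s, where s = N for noise

(This assumes pp school ~ pp baseline)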
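Both ways of estimating S/N can be sketched as follows. The function names and the example numbers are hypothetical. The first version takes the noise as the standard deviation of the blank, which is an assumption because the slide does not say which standard deviation to use. The second uses the peak-to-peak shortcut on the sample.

```python
import statistics

def signal_to_noise(sample, blank):
    """S/N with noise taken as the standard deviation of the blank (an assumption)."""
    signal = statistics.mean(sample) - statistics.mean(blank)
    return signal / statistics.stdev(blank)

def signal_to_noise_pp(sample, blank):
    """S/N with noise estimated as pp/6 of the sample measurements."""
    signal = statistics.mean(sample) - statistics.mean(blank)
    noise = (max(sample) - min(sample)) / 6
    return signal / noise

# Made-up numbers: blank readings near 2, sample readings near 11.
blank = [1, 2, 3]
sample = [10, 11, 12]
print(signal_to_noise(sample, blank))
print(signal_to_noise_pp(sample, blank))
```

The two estimates differ whenever pp/6 is a poor stand-in for s, which is common for small data sets.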

[Figure: length versus sample number, 0 to 30]

Can you “tell” where the switch between red and white potatoes begins?

What is the signal (length of white)?
What is the background (length of red)?
What is the S/N?

Effect of sample size on the measurement

Error curve: peak height grows with the number of measurements. +/- 1s always contains the same proportion of the total number of measurements.

However, the actual value of s decreases as the population grows

$\sigma_{sample} = \frac{\sigma_{population}}{\sqrt{n_{sample}}}$

[Figure: 2008 Data, standard deviation versus sample number, 0 to 14]

[Figure: standard deviation versus sqrt number of samples, 1.5 to 4, with trendline y = -0.8807x + 5.9303, R2 = 0.9491]

[Figure: histogram of length (cm), 14 to 39]

Calibration Curve

A calibration curve is based on a selected measurement that is linear in response to the concentration of the analyte.

Or... a prediction of a measurement due to some change. Can we predict my weight change if I had spent a longer time on vacation?

$\text{lbs}_{Fitch} = a + b \cdot (\text{days on vacation})$

[Figure: histogram of weight (lbs), 138 to 147; label: 5 days]

The calibration curve contains information about the sampling of the population

[Figure: weight (lbs) versus days on vacation, 0 to 6, with trendline y = 0.3542x + 140.04, R2 = 0.7425]

Can get this by using “trend line”
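What the trend line computes is an ordinary least-squares fit. The sketch below does the same arithmetic in plain Python; the function name and the example points are made up and are not the vacation data behind the plot.

```python
def linear_fit(x, y):
    """Least-squares line y = a + b*x; returns (intercept a, slope b, R squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared compares the scatter about the line with the scatter about the mean.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical points: weight (lbs) after 0 to 4 days on vacation.
days = [0, 1, 2, 3, 4]
lbs = [140.1, 140.3, 140.9, 141.0, 141.6]
a, b, r2 = linear_fit(days, lbs)
print(f"y = {b:.4f}x + {a:.2f}, R2 = {r2:.4f}")
```

The slope b is the sensitivity: the change in the measurement per unit change in what is being varied.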

[Figure: standard deviation versus sqrt number of samples, 1.5 to 4, with trendline y = -0.8807x + 5.9303, R2 = 0.9491]

This is just a trendline from “format” data

Sample   sqrt(#samples)   stdev
1        1                #DIV/0!
2        1.414213562      2.036468
3        1.732050808      4.475727
4        2                4.31441
5        2.236067977      3.844045
6        2.449489743      3.844604
7        2.645751311      3.735124
8        2.828427125      3.458414
9        3                3.235055
10       3.16227766       3.093053
11       3.31662479       2.935944
12       3.464101615      2.950187

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.296113395
R Square            0.087683143
Adjusted R Square   -0.013685397
Standard Error      0.703143388
Observations        11

ANOVA
             df   SS            MS         F          Significance F
Regression   1    0.427662048   0.427662   0.864994   0.376617
Residual     9    4.449695616   0.494411
Total        10   4.877357664

               Coefficients   Standard Error   t Stat     P-value    Lower 95%
Intercept      3.884015711    0.514960076      7.542363   3.53E-05   2.719094
X Variable 1   -0.06235252    0.067042092      -0.93005   0.376617   -0.21401

Using the data analysis pack, you get an error associated with the intercept

In the best of all worlds you should have a series of blanks that determine the “noise” associated with the background

$\bar{x}_{LOD} = \bar{x}_{blank} + 3 s_{blank}$ (signal LOD)

Sometimes you forget, so to fall back and punt, estimate the standard deviation of the “blank” from the linear regression

But remember, in doing this you are acknowledging a failure to plan ahead in your analysis

The associated error, by extrapolation, can be obtained from the linear regression data

$\bar{x}_{LOD} = \bar{x}_{blank} + b\,[\text{conc. LOD}]$, where b = sensitivity (slope)

$\bar{x}_{blank} + 3 s_{blank} = \bar{x}_{blank} + b\,[\text{conc. LOD}]$

$[\text{conc. LOD}] = \frac{3 s_{blank}}{b}$

The concentration LOD depends on BOTH the stdev of the blank and the sensitivity

!!Note!! Signal LOD ≠ Conc. LOD. We want the Conc. LOD
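The difference between the signal LOD and the concentration LOD is easy to see in code. This is a sketch with assumed function names and made-up numbers. The absolute value of the slope is used so that a calibration line with a negative slope still gives a positive concentration, which is a choice made here and not something stated on the slide.

```python
def signal_lod(mean_blank, s_blank, k=3):
    """Signal at the limit of detection: blank mean + k * blank standard deviation."""
    return mean_blank + k * s_blank

def concentration_lod(s_blank, slope, k=3):
    """Concentration at the limit of detection: k * s_blank / |slope|."""
    return k * s_blank / abs(slope)

# Hypothetical values: blank mean 1.0, blank stdev 0.5, sensitivity 2.0 signal units per conc. unit.
print("signal LOD =", signal_lod(1.0, 0.5))
print("conc. LOD  =", concentration_lod(0.5, 2.0))
```

Doubling the sensitivity halves the concentration LOD even though the signal LOD is unchanged.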

[Figure: calibration lines versus pH or pM, 0 to 12, with trendlines y = -31.143x - 74.333 (R2 = 0.9994) and y = -41x - 118.5 (R2 = 0.9872)]

Selectivity

The difference in slope is one measure of selectivity

In a perfect method the sensing device would have zero slope for the interfering species

Limit of linearity

[Figure: calibration curve with the limit of linearity marked at 5% deviation]

Summary: Figures of Merit Thus Far

R = resolution
S/N
LOD = both signal and concentration
LOQ
LOL
Sensitivity (calibration curve slope)
Selectivity (essentially difference in slopes)

Can be expressed in terms of signal, but the better expression is in terms of concentration

Tests: ANOVA

Why is the limit of detection important?

Why has the limit of detection changed so much in the last 20 years?

The End

[Figure: Needleman’s data, two histograms of Verbal IQ, 40 to 160]

[Figure: 2008 Data, histograms of length (cm), 14 to 39. White, N=12, sum sq diff=0.037; Red, N=12, sum sq diff=0.11]

Which of these two data sets would be likely to have a better numerical value for the ability to distinguish between two different populations?

Height for normalized bell curve < 1

Which population is more variable? How can you tell?

[Figure: histograms of length (cm), 14 to 39]

Increasing the sample size decreases the std dev and increases the separation of the populations. Notice that the means also change, and will do so until we have a reasonable sample of the population.

[Figure: two histograms of Verbal IQ, 40 to 160]
