Module 7: Comparing Datasets and Comparing a Dataset with a Standard

How different is enough?

Concepts

– Independence of each data point
– Test statistics
– Central Limit Theorem
– Standard error of the mean
– Confidence interval for a mean
– Significance levels
– How to apply in Excel

Independent Measurements

Each measurement must be independent (shake up the basket of tickets).

Examples of non-independent measurements:
– Public responses to questions (one result affects the next person’s answer)
– Samplers too close together, so air flows are affected

Test Statistics

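A test statistic is a number calculated from the data. In Student’s t test, for example, it is t.

If t >= 1.96 and the population is normally distributed, then:
– you are in the right tail of the curve,
– outside the inner portion that holds 95% of the data, which sits symmetrically between t = -1.96 on the left and t = 1.96 on the right.

The 1.96 cutoff holds for large samples; with few degrees of freedom the t cutoff is larger.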

Test statistics correspond to significance levels

“P” stands for percentile. The Pth percentile is the value that a fraction p of the data falls below and a fraction 1-p falls above.
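The link between a percentile and a test statistic can be sketched in a few lines of Python. This is an illustration only, not part of the original module: it assumes a standard normal curve and uses Python's built-in `statistics.NormalDist`. The helper name `central_cutoff` is made up for this sketch.

```python
from statistics import NormalDist

z_dist = NormalDist()  # standard normal curve: mean 0, standard deviation 1

def central_cutoff(confidence):
    """z value that leaves `confidence` of the data in the symmetric inner portion."""
    upper_percentile = 1 - (1 - confidence) / 2  # e.g. 0.975 for a 95% inner portion
    return z_dist.inv_cdf(upper_percentile)

# From a significance level to a test-statistic cutoff.
for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} inner portion: z = +/-{central_cutoff(confidence):.2f}")

# Going the other way: the percentile that corresponds to a test statistic.
print(f"z = 1.96 sits at percentile {z_dist.cdf(1.96):.3f}")
```

A 95% inner portion leaves 2.5% in each tail, which is why the sketch looks up the 97.5th percentile to recover the familiar cutoff of 1.96.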

Two Major Types of Questions

Comparing mean against a standard
– Does air quality here meet NAAQS?

Comparing two datasets
– Is air quality different in 2006 than 2005?
– Better?
– Worse?

Comparing Mean to a Standard

Did air quality meet the CARB annual standard of 12 µg/m3?

Year   Ft Smith avg   Ft Smith min   Ft Smith max   N (Ft Smith)
’05    14.78          0.1            37.9           77

Central Limit Theorem (magic!)

Even if the underlying population is not normally distributed, if we repeatedly take datasets:
– The means of these different datasets cluster around the true mean
– The distribution of these means is (approximately) normal!
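A small simulation can make this idea concrete. The sketch below is not from the original module. It assumes a strongly skewed (exponential) population with a true mean of 1.0, and the dataset size and number of datasets are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 1.0      # mean of the skewed population below (assumed)
DATASET_SIZE = 50    # measurements per dataset (assumed)
NUM_DATASETS = 2000  # how many datasets we draw (assumed)

def draw_dataset(size):
    """One dataset from an exponential population, which is strongly right-skewed."""
    return [random.expovariate(1.0 / TRUE_MEAN) for _ in range(size)]

dataset_means = [statistics.mean(draw_dataset(DATASET_SIZE)) for _ in range(NUM_DATASETS)]

mean_of_means = statistics.mean(dataset_means)
spread_of_means = statistics.stdev(dataset_means)

# If the means are roughly normal, about 95% fall within 1.96 spreads of the center.
inside = sum(abs(m - mean_of_means) <= 1.96 * spread_of_means for m in dataset_means)
share_inside = inside / NUM_DATASETS

print(f"mean of dataset means: {mean_of_means:.3f} (true mean {TRUE_MEAN})")
print(f"spread of dataset means: {spread_of_means:.3f}")
print(f"share within 1.96 spreads: {share_inside:.1%}")
```

The individual measurements are far from normal, yet the dataset means pile up symmetrically around the true mean, which is what lets us use z and t values on means.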

Magic Concept #2: Standard Error of the Mean

Represents uncertainty around mean

As sample size N gets bigger, error gets smaller!

The bigger the N, the more tightly you can estimate mean

LIKE a standard deviation, but it describes the uncertainty in YOUR sample’s mean rather than the spread of the individual data points
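To see how the standard error shrinks as N grows, here is a minimal Python sketch. It takes the standard error as the standard deviation divided by the square root of N, the form used later in this module. The standard deviation of 8.66 and N = 77 are the Ft Smith values from later slides; the other N values are illustrative only.

```python
import math

def standard_error(sample_sd, n):
    """Standard error of the mean: standard deviation over the square root of N."""
    return sample_sd / math.sqrt(n)

SAMPLE_SD = 8.66  # standard deviation used later for the Ft Smith data

for n in (10, 77, 300, 1000):  # 77 is the Ft Smith N; the others are illustrative
    print(f"N = {n:4d}: standard error = {standard_error(SAMPLE_SD, n):.2f}")
```

Because N sits under a square root, quadrupling the number of measurements only halves the standard error.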

For a “large” sample (N > 60), or when very close to a normal distribution…

Confidence interval for population mean is:

$\bar{x} \pm z \, \frac{s}{\sqrt{n}}$

Choice of z determines 90%, 95%, etc.

For a “Small” Sample

Replace the Z value with a t value to get…

$\bar{x} \pm t \, \frac{s}{\sqrt{n}}$

…where “t” comes from Student’s t distribution, and depends on sample size

Student’s t Distribution vs. Normal Z Distribution

[Figure: t distribution with 5 d.f. and standard normal Z distribution, plotted from -5 to 5]

Compare t and Z Values

Confidence level   t value with 5 d.f.   Z value
90%                2.015                 1.65
95%                2.571                 1.96
99%                4.032                 2.58

What happens as sample gets larger?

[Figure: t distribution with 60 d.f. and standard normal Z distribution, plotted from -5 to 5]

What happens to CI as sample gets larger?

For large samples, Z and t values become almost identical, so the CIs are almost identical.

First, graph and review data

– Use box plot add-in
– Evaluate spread
– Evaluate how far apart mean and median are

(Assume sampling design and QC are good.)

Excel Summary Stats

Ft Smith

Min      0.1
25th     7.5
Median   13.7
75th     18.1
Max      37.9
Mean     14.8
SD       8.7

1. Use the box-plot add-in
2. Calculate summary stats

Our Question

How confident can we be (90%? 95%?) that this mean of 14.78 is really greater than the standard of 12?

We saw that N = 77, and that the mean and median are not too different.

So use z (normal) rather than t.

The mean is 14.8 ± what?

We know the equation for the CI is

$\bar{x} \pm z \, \frac{s}{\sqrt{n}}$

The width of the confidence interval reflects how sure we want to be that this CI includes the true mean.

Now, decide how confident we want to be.

CI Calculation

– For 95%, z = 1.96 (often rounded to 2)
– Standard error (sigma / square root of N) = 8.66 / (square root of 77) = 0.987
– CI half-width around the mean = 2 x 0.987, or about 2
– We can be 95% sure that the true mean is included in (mean ± 2), or 14.8 - 2 at the low end, to 14.8 + 2 at the high end

This does NOT include 12!
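The same arithmetic can be checked outside Excel. The Python sketch below is an illustration rather than part of the original module. It uses the Ft Smith numbers from the slides and the unrounded z of 1.96, so the half-width comes out slightly under 2.

```python
import math

MEAN = 14.78     # Ft Smith annual mean
SIGMA = 8.66     # standard deviation
N = 77           # number of measurements
STANDARD = 12.0  # annual standard being tested
Z_95 = 1.96      # z value for a 95% confidence interval

std_err = SIGMA / math.sqrt(N)
half_width = Z_95 * std_err
ci_low, ci_high = MEAN - half_width, MEAN + half_width

print(f"standard error = {std_err:.3f}")
print(f"95% CI = {ci_low:.2f} to {ci_high:.2f} (mean +/- {half_width:.2f})")
print("CI includes the standard:", ci_low <= STANDARD <= ci_high)
```

Swapping `Z_95` for 1.65 or 2.58 gives the 90% and 99% intervals from the same standard error.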

Excel can also calculate a confidence interval around the mean

Mean, plus and minus 1.93, is a 95% confidence interval that does NOT include 12!

We know we are more than 95% confident, but how confident can we be that Ft Smith mean > 12?

– Calculate where on the curve our mean of 14.8 is, in terms of z (normal) score…
– …or if N is small, use t score

To find where we are on the curve, calc the test statistic…

Ft Smith mean = 14.8, sigma =8.66, N =77

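Calculate the test statistic, in this case the z factor (we decided we can use the z rather than the t distribution):

$z = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

where $\bar{x}$ is the data’s mean (14.8), $\mu$ is the standard of 12, and $s / \sqrt{n}$ is the standard error.

If N were < 60, the test statistic would be t, but calculated the same way.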

Calculate z Easily

– Our mean of 14.8 minus the standard of 12 (treat the standard as the real mean, mu) is the numerator (= 2.8)
– Standard error is sigma / square root of N = 0.987 (same as for CI)
– So z = 2.8 / 0.987 = 2.84

So where is this z on the curve? Remember, at z = 3 we are to the right of ~
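As a cross-check on the hand calculation of z, here is a short Python sketch. It is an illustration only, and the function name `z_statistic` is made up for this example. The inputs are the Ft Smith values from the slides, with the mean rounded to 14.8 as in the hand calculation.

```python
import math

def z_statistic(sample_mean, standard, sigma, n):
    """Distance of the sample mean from the standard, in standard errors."""
    std_err = sigma / math.sqrt(n)
    return (sample_mean - standard) / std_err

# Ft Smith values from the slides (mean rounded to 14.8).
z = z_statistic(sample_mean=14.8, standard=12.0, sigma=8.66, n=77)
print(f"z = {z:.2f}")
```

For a small sample the same expression gives the t statistic; only the curve it is compared against changes.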

Where on the curve?

So z = 2.84 is past the 95% point (z = 1.96) and also past the 99% point (z = 2.58): it is more than 99% probable that the true mean is greater than 12

You can calculate exactly where on the curve, using Excel

Use the NORMSDIST function, with z

If z (or t) = 2.84, in Excel: =NORMSDIST(2.84)

Yields 99.8% probability that the true mean is greater than 12
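For readers without Excel, the same lookup can be sketched in Python. This is an illustration, not part of the original module: `statistics.NormalDist().cdf` returns the cumulative standard normal probability, which is what NORMSDIST returns for a given z.

```python
from statistics import NormalDist

z = 2.84  # test statistic from the hand calculation

# Cumulative standard normal probability: the share of the curve to the left of z.
confidence = NormalDist().cdf(z)
print(f"share of curve below z = {z}: {confidence:.1%}")
```

This lookup assumes the normal (z) curve. For a small sample, the t distribution with the appropriate degrees of freedom would be needed instead.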
