Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.


Page 1: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Statistical basics

Marian Scott, Dept of Statistics, University of Glasgow

August 2008

Page 2: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

What shall we cover?

• Why might we need some statistical skills?

• Statistical inference: what is it?

• How to handle variation

• Exploring data

• Probability models

• Inferential tools: hypothesis tests and confidence intervals

Page 3: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Why bother with Statistics

We need statistical skills to:

• Make sense of numerical information

• Summarise data

• Present results (graphically)

• Test hypotheses

• Construct models

Page 4: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

statistical language

• variable- a single aspect of interest

• population- a large group of ‘individuals’

• sample- a subset of the population

• parameter- a single number summarising the variable in the population

• statistic- a single number summarising the variable in the sample

Page 5: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

statistical language- Radiation protection- C-14 in fish

• variable- radiocarbon level (Bq/kg C)

• population- all fish caught for human consumption in W Scotland

• sample- 20 fish bought in local markets

• parameter- population mean C-14 level

• statistic- sample mean C-14 level
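The distinction between a parameter and a statistic can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the population is simulated, and the centre (250) and spread (30) of the C-14 levels are made-up numbers, not measured values.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated C-14 levels (Bq/kg C) for 10,000 fish.
# The centre (250) and spread (30) are made-up illustration values.
population = [random.gauss(250, 30) for _ in range(10_000)]

# Parameter: a single number summarising the variable in the population.
population_mean = statistics.mean(population)

# Sample: 20 fish drawn at random from the population.
sample = random.sample(population, 20)

# Statistic: the same summary, computed from the sample only.
sample_mean = statistics.mean(sample)

print(f"population mean (parameter): {population_mean:.1f}")
print(f"sample mean (statistic):     {sample_mean:.1f}")
```

In practice the population mean is unknown; the sample mean is what we can compute, and it will usually differ a little from the parameter it estimates.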

Page 6: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Variables- number and type

• Univariate: there is one variable of interest measured on the individuals in the sample. We may ask:

• What is the distribution of results? This may be further resolved into questions concerning the mean or average value of the variable and the scatter or variability in the results.

Page 7: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Bivariate

• Bivariate: two variables of interest are measured on each member of the sample. We may ask:

• How are the two variables related?

• If one variable is time, how does the other variable change?

• How can we model the dependence of one variable on the other?

Page 8: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Multivariate

• Multivariate: many variables of interest are measured on the individuals in the sample. We might ask:

• What relationships exist between the variables?

• Is it possible to reduce the number of variables, but still retain 'all' the information?

• Can we identify any grouping of the individuals on the basis of the variables?

Page 9: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Data types

• Numerical: a variable may be either continuous or discrete.

• For a discrete variable, the values taken are whole numbers (e.g. number of invertebrates, numbers of eggs).

• For a continuous variable, values taken are real numbers (positive or negative and including fractional parts) (e.g. pH, alkalinity, DOC, temperature).

Page 10: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

categorical

• Categorical: a limited number of categories or classes exist, and each member of the sample belongs to one and only one of the classes, e.g. compliance status is categorical.

• Compliance is a nominal categorical variable since the categories are unordered.

• Level of diluent (e.g. recorded as low, medium, high) would be an ordinal categorical variable since the different classes are ordered.

Page 11: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Inference and Statistical Significance

Inference: from the sample to the population

• Is the sample representative?

• Is the population homogeneous?

Since only a sample has been taken from the population, we cannot be 100% certain.

Significance testing
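One way to express this uncertainty is an approximate 95% confidence interval for the population mean. The sketch below is a minimal illustration under stated assumptions: the 20 values are made up, and the interval uses a normal approximation (a t-based interval would be slightly wider for a sample this small).

```python
import math
import statistics

# Hypothetical measurements from a sample of 20 items (made-up values).
sample = [248, 261, 239, 255, 270, 243, 252, 258, 247, 266,
          235, 259, 251, 244, 263, 249, 256, 241, 268, 254]

n = len(sample)
mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
std_error = statistics.stdev(sample) / math.sqrt(n)

# 97.5th percentile of the standard normal distribution, about 1.96.
z = statistics.NormalDist().inv_cdf(0.975)

lower, upper = mean - z * std_error, mean + z * std_error
print(f"sample mean: {mean:.1f}")
print(f"approximate 95% interval: ({lower:.1f}, {upper:.1f})")
```

The interval is a statement about where the population mean plausibly lies given this one sample; a different sample would give a different interval.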

Page 13: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

what are your objectives?

• describing a characteristic of interest (usually the average, but could also be the variability or a high percentile)

• describing spatial patterns of a characteristic, mapping the spatial distribution

• quantifying contamination above a background or specified intervention level

• detecting temporal or spatial trends

• assessing environmental impacts of specific facilities, or of events such as accidental releases

Page 14: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

the statistical process

• A process that allows inferences about properties of a large collection of things (the population) to be made based on observations on a small number of individuals belonging to the population (the sample).

• The use of valid statistical sampling techniques increases the chance that a set of specimens (the sample, in the collective sense) is collected in a manner that is representative of the population.

Page 15: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Variation

• Soil or sediment samples taken side by side, samples from different parts of the same plant, or samples from different animals in the same environment exhibit different activity densities of a given radionuclide.

• The distribution of values observed will provide an estimate of the variability inherent in the population of samples that, theoretically, could be taken.
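A common way to put a number on this variability is the sample standard deviation. The following is a minimal sketch; the five activity values are made up for illustration and the units are arbitrary.

```python
import math
import statistics

# Hypothetical activity measurements from five side-by-side soil samples
# (made-up values, arbitrary units).
activity = [12.1, 9.8, 15.3, 11.0, 13.4]

mean = statistics.mean(activity)

# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in activity) / (len(activity) - 1)
std_dev = math.sqrt(variance)

print(f"mean: {mean:.2f}")
print(f"sample standard deviation: {std_dev:.2f}")
print(f"range: {min(activity)} to {max(activity)}")
```

The hand-written formula gives the same result as `statistics.stdev(activity)`; it is spelled out here only to show what the summary measures.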

Page 16: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Representativeness

An essential concept is that a sufficient number of individual samples, taken together, should reflect the population.

Representativeness of environmental samples is difficult to demonstrate.

Usually, representativeness is considered justified by the procedure used to select the samples.

Page 17: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

What is the population?

• The population is the set of all items that could be sampled, such as all fish in a lake, all people living in the UK, all trees in a spatially defined forest, or all 20-g soil samples from a field. Appropriate specification of the population includes a description of its spatial extent and perhaps its temporal stability.

Page 18: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

What are the sampling units?

In some cases, sampling units are discrete entities (e.g., animals, trees), but in others, the sampling unit might be investigator-defined and arbitrarily sized.

Example: technetium in shellfish

The objective here is to provide a measure (the average) of technetium in shellfish (e.g. lobsters for human consumption) for the west coast of Scotland.

• Population is all lobsters on the west coast.

• Sampling unit is an individual animal.

Page 19: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

• Summarising data- means, medians and other such statistics
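As a small illustration of why both the mean and the median are worth reporting, the sketch below compares them on a right-skewed set of values. The numbers are made up; the point is only that a single very large value pulls the mean upward while the median barely moves.

```python
import statistics

# Hypothetical, right-skewed measurements with one very large value.
values = [2, 3, 3, 4, 5, 6, 50]

mean = statistics.mean(values)      # sensitive to the extreme value
median = statistics.median(values)  # middle value of the sorted data

print(f"mean:   {mean:.2f}")
print(f"median: {median}")
```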

Page 24: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

• plotting data- histograms, boxplots, stem and leaf plots, scatterplots

Page 32: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

[Figure labels: median, lower quartile, upper quartile]
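The median and the lower and upper quartiles named above can be computed with Python's standard library. This is a sketch on made-up data; note that several conventions for quartiles exist, and the `method="inclusive"` choice here is just one of them, so other software may give slightly different values.

```python
import statistics

# Hypothetical data, chosen so the quartiles are easy to check by eye.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the data into four equal parts and returns the three cut points.
lower_q, median, upper_q = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: the spread of the middle half of the data.
iqr = upper_q - lower_q

print(f"lower quartile: {lower_q}")
print(f"median:         {median}")
print(f"upper quartile: {upper_q}")
print(f"IQR:            {iqr}")
```

These three values define the box and its centre line in a boxplot.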

Page 33: Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2008.

Preliminary Analysis

• There is considerable variation
  – Across different sites
  – Within the same site across different years

• Distribution of data is highly skewed with evidence of outliers and in some cases bimodality

[Figure: Boxplots of FS for the years 2004 to 2007, SEPA location code 114567; axes: Year and FS]