Statistical basics Marian Scott Dept of Statistics, University of Glasgow August 2009.


Statistical basics

Marian Scott, Dept of Statistics, University of Glasgow

August 2009


What shall we cover?

• Why might we need some statistical skills?

• Statistical inference- what is it?

• how to handle variation

• exploring data

• probability models

• inferential tools- hypothesis tests and confidence intervals


Why bother with Statistics?

We need statistical skills to:

• Make sense of numerical information

• Summarise data

• Present results (graphically)

• Test hypotheses

• Construct models


statistical language

• variable- a single aspect of interest

• population- a large group of ‘individuals’

• sample- a subset of the population

• parameter- a single number summarising the variable in the population

• statistic- a single number summarising the variable in the sample


statistical language- Radiation protection- C-14 in fish

• variable- radiocarbon level (Bq/kg C)

• population- all fish caught for human consumption in W Scotland

• sample- 20 fish bought in local markets

• parameter- population mean C-14 level

• statistic- sample mean C-14 level
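
The vocabulary above can be illustrated with a small simulation. This is only a sketch: the population of C-14 levels is simulated, and the population size, mean, standard deviation and random seed are all assumptions made for illustration, not values from any study.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: C-14 levels (Bq/kg C) for 10,000 fish.
# The mean of 250 and standard deviation of 20 are invented for illustration.
population = [random.gauss(250, 20) for _ in range(10_000)]

# Parameter: a single number summarising the variable in the population.
population_mean = statistics.mean(population)

# Sample: a subset of the population (here 20 fish chosen at random).
sample = random.sample(population, 20)

# Statistic: a single number summarising the variable in the sample.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
```

In practice the parameter is unknown; only the statistic can be computed, and it changes from sample to sample.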


Questions

• Univariate: What is the distribution of results? This may be further resolved into questions concerning the mean or average value of the variable and the scatter or variability in the results.

• Bivariate: How are the two variables related? How can we model the dependence of one variable on the other?

• Multivariate: What relationships exist between the variables? Is it possible to reduce the number of variables, but still retain 'all' the information? Can we identify any grouping of the individuals on the basis of the variables?


Data types

• Numerical: a variable may be either continuous or discrete.

 – For a discrete variable, the values taken are whole numbers (e.g. number of invertebrates).

 – For a continuous variable, values taken are real numbers (e.g. pH, alkalinity, DOC, temperature).

• Categorical: a limited number of categories or classes exist; each member of the sample belongs to one and only one of the classes.

 – Compliance is a nominal categorical variable since the categories are unordered.

 – Level of diluent (e.g. recorded as low, medium, high) would be an ordinal categorical variable since the different classes are ordered.


Inference and Statistical Significance

[Diagram: inference leads from the Sample to the Population]

• Is the sample representative?

• Is the population homogeneous?

Since only a sample has been taken from the population, we cannot be 100% certain.

Significance testing


the statistical process

• A process that allows inferences about properties of a large collection of things (the population) to be made based on observations on a small number of individuals belonging to the population (the sample).

• The use of valid statistical sampling techniques increases the chance that a set of specimens (the sample, in the collective sense) is collected in a manner that is representative of the population.


Variation

• Soil or sediment samples taken side by side, samples from different parts of the same plant, or samples from different animals in the same environment all exhibit different activity densities of a given radionuclide.

• The distribution of values observed will provide an estimate of the variability inherent in the population of samples that, theoretically, could be taken.
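
As a rough illustration of estimating that variability, the sketch below computes the sample standard deviation and coefficient of variation. The six activity values are invented for illustration and do not come from any real survey.

```python
import statistics

# Hypothetical activity measurements (Bq/kg) from six soil cores
# taken side by side; the numbers are invented for illustration.
activity = [12.1, 15.4, 9.8, 14.2, 11.7, 13.6]

mean = statistics.mean(activity)
sd = statistics.stdev(activity)   # sample standard deviation (n - 1 divisor)
cv = 100 * sd / mean              # coefficient of variation, in percent

print(f"mean = {mean:.2f}, sd = {sd:.2f}, cv = {cv:.1f}%")
```

The standard deviation is in the units of the measurement, while the coefficient of variation expresses the same spread relative to the mean.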


What is the population?

• The population is the set of all items that could be sampled, such as all fish in a lake, all people living in the UK, all trees in a spatially defined forest, or all 20-g soil samples from a field. Appropriate specification of the population includes a description of its spatial extent and perhaps its temporal stability.


What are the sampling units?

In some cases, sampling units are discrete entities (e.g., animals, trees), but in others, the sampling unit might be investigator-defined and arbitrarily sized.

Example- technetium in shellfish

The objective here is to provide a measure (the average) of technetium in shellfish (e.g. lobsters for human consumption) for the west coast of Scotland.

• Population is all lobsters on the west coast

• Sampling unit is an individual animal.


• Summarising data- means, medians and other such statistics
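
A minimal sketch of these summaries, using Python's standard `statistics` module. The data values are invented for illustration, and the "inclusive" quartile method is just one of several common conventions.

```python
import statistics

# Hypothetical, right-skewed measurements; invented for illustration.
values = [2, 3, 3, 4, 5, 6, 8, 12, 40]

mean = statistics.mean(values)
median = statistics.median(values)

# statistics.quantiles with n=4 returns the three quartile cut points.
# Several quartile conventions exist; "inclusive" is one common choice.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(f"mean = {mean:.2f}, median = {median}")
print(f"lower quartile = {q1}, upper quartile = {q3}, IQR = {q3 - q1}")
```

The single large value pulls the mean well above the median, which is why the median and quartiles are often preferred for skewed data.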


• plotting data- histograms, boxplots, stem and leaf plots, scatterplots
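
Of the plots listed, the stem and leaf plot is simple enough to build by hand. The sketch below is one possible text-only version; the function name `stem_and_leaf` and the data are hypothetical, and it assumes non-negative whole numbers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers.

    The stem is the value with its last digit removed; the leaf is the last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {digits}")
    return lines

# Hypothetical counts, invented for illustration.
data = [12, 15, 21, 23, 23, 27, 30, 34, 35, 38, 41, 58]
for line in stem_and_leaf(data):
    print(line)
```

Read sideways, the rows form a rough histogram while still showing every individual value.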


[Figure: boxplot annotated with the median, lower quartile and upper quartile]


Preliminary Analysis

• There is considerable variation

 – Across different sites

 – Within the same site across different years

• Distribution of data is highly skewed with evidence of outliers and in some cases bimodality

[Figure: Boxplots of FS by year (2004 to 2007), SEPA location code 114567; x-axis: Year, y-axis: FS]


• probability models- the Normal especially
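
As a brief sketch of what the Normal model provides, the code below uses `statistics.NormalDist` from the Python standard library to compute the probability of falling within 1, 2 and 3 standard deviations of the mean. The helper name `prob_within` is hypothetical.

```python
from statistics import NormalDist

# Standard Normal model: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

def prob_within(k):
    """Probability that a Normal variable lies within k standard deviations of its mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.3f}")

# Point below which 97.5% of the distribution lies.
upper_point = z.inv_cdf(0.975)
print(f"97.5th percentile: {upper_point:.2f}")
```

These probabilities are the same for any Normal distribution once values are expressed in standard deviations from the mean.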


• checking distributional assumptions


[Figure: Histogram of FS, SEPA location code 4556 (x-axis: FS/100ml, y-axis: Density), with its Normal Q-Q plot (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles)]

[Figure: Histogram of log10(FS), SEPA location code 4556 (x-axis: log10(FS)/100ml, y-axis: Density), with its Normal Q-Q plot (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles)]


Modelling Continuous Variables: checking normality

• Normal probability plot

• Should show a straight line

• p-value of test is also reported (null: data are Normally distributed)

[Figure: Normal probability plot of C1; x-axis: C1, y-axis: Percent. Legend: Mean 0.1211, StDev 1.015, N 100, AD 0.361, P-Value 0.439]
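
The straight-line idea behind a Normal probability plot can be sketched numerically. The code below is not the Anderson-Darling test reported on the slide; it swaps in a simpler, informal measure, the correlation between the ordered data and theoretical Normal quantiles. The function name `qq_correlation`, the simulated data and the plotting-position formula are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean

def qq_correlation(values):
    """Correlation between sorted data and theoretical Normal quantiles.

    Values close to 1 mean the Q-Q plot is close to a straight line.
    """
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n are one common convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    mx, my = mean(theoretical), mean(ordered)
    sxy = sum((x - mx) * (y - my) for x, y in zip(theoretical, ordered))
    sxx = sum((x - mx) ** 2 for x in theoretical)
    syy = sum((y - my) ** 2 for y in ordered)
    return sxy / math.sqrt(sxx * syy)

random.seed(2)
# Hypothetical right-skewed counts, simulated as log-Normal for illustration.
fs = [10 ** random.gauss(1.0, 0.5) for _ in range(200)]

r_raw = qq_correlation(fs)
r_log = qq_correlation([math.log10(v) for v in fs])
print(f"Q-Q correlation, raw data:   {r_raw:.3f}")
print(f"Q-Q correlation, log10 data: {r_log:.3f}")
```

For skewed data of this kind, the log-transformed values sit much closer to a straight line, which is the pattern a log10 transformation is meant to produce.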


Statistical inference

• Confidence intervals

• Hypothesis testing and the p-value

• Statistical significance vs real-world importance


• a formal statistical procedure- confidence intervals


Confidence intervals- an alternative to hypothesis testing

• A confidence interval is a range of credible values for the population parameter. The confidence coefficient is the percentage of times that the method will in the long run capture the true population parameter.  

• A common form is sample estimator ± 2 × estimated standard error, which gives an approximate 95% interval.
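
Both ideas above, the "estimate ± 2 standard errors" form and the long-run interpretation of the confidence coefficient, can be sketched in a few lines. The function names, the simulated Normal population (mean 50, sd 10) and the sample size are assumptions chosen for illustration.

```python
import math
import random
import statistics

def confidence_interval(sample, multiplier=2.0):
    """Return (lower, upper) as sample mean ± multiplier * estimated standard error."""
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return estimate - multiplier * std_error, estimate + multiplier * std_error

def coverage(true_mean=50.0, true_sd=10.0, n=30, repeats=2000, seed=3):
    """Fraction of simulated samples whose interval captures the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        lower, upper = confidence_interval(sample)
        hits += lower <= true_mean <= upper
    return hits / repeats

print(f"long-run coverage: {coverage():.3f}")
```

The coverage should come out close to 95%: any single interval either contains the true mean or does not, and the confidence coefficient describes how often the method succeeds over many repetitions.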


• another formal inferential procedure- hypothesis testing


Hypothesis Testing

• Null hypothesis: usually ‘no effect’

• Alternative hypothesis: ‘effect’

• Make a decision based on the evidence (the data)

• There is a risk of getting it wrong!

• Two types of error:

 – reject null when we shouldn’t - Type I

 – don’t reject null when we should - Type II
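
A minimal sketch of the decision process, using a two-sided z-test of the null hypothesis that the population mean equals a stated value. It makes the simplifying assumption that the population standard deviation is known (when it is estimated from the data, a t-test would normally be used instead); the function name `z_test` and the measurements are hypothetical.

```python
import math
from statistics import NormalDist, mean

def z_test(sample, mu0, sigma):
    """Two-sided z-test of the null hypothesis 'population mean equals mu0'.

    Assumes the population standard deviation sigma is known.
    Returns the test statistic and the p-value.
    """
    z = (mean(sample) - mu0) / (sigma / math.sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical measurements, invented for illustration.
sample = [5.3, 4.9, 5.8, 5.6, 5.1, 5.7, 5.4, 5.9]
z, p = z_test(sample, mu0=5.0, sigma=0.5)
decision = "reject null" if p < 0.05 else "do not reject null"
print(f"z = {z:.2f}, p-value = {p:.4f}: {decision}")
```

Whichever decision is reached, it may be wrong: rejecting here risks a Type I error, and not rejecting would risk a Type II error.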


Significance Levels

• We cannot reduce probabilities of both Type I and Type II errors to zero.

• So we control the probability of a Type I error.

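• This probability is referred to as the significance level. The p-value is the probability, if the null hypothesis is true, of obtaining data at least as extreme as those observed; we reject the null when the p-value falls below the significance level.

• Generally a significance level of 0.05 is considered a reasonable risk of a Type I error (beyond reasonable doubt), so a p-value of <0.05 is taken as evidence against the null.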
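
The claim that the significance level controls the Type I error probability can be checked by simulation. In the sketch below the null hypothesis is true by construction, so every rejection is a Type I error. The function name, sample size, number of repeats and the use of a z-test with known standard deviation are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=20, repeats=5000, seed=4):
    """Estimate how often a true null is rejected at significance level alpha.

    Data are simulated with the null hypothesis (mean 0, sd 1) true, and a
    two-sided z-test with known sd is applied to each simulated sample.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) / (1.0 / math.sqrt(n))
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        rejections += p_value < alpha
    return rejections / repeats

print(f"estimated Type I error rate: {type_i_error_rate():.3f}")
```

The estimated rate should sit close to the chosen level of 0.05, up to simulation noise.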


Statistical Significance vs. Practical Importance

• Statistical significance is concerned with the ability to discriminate between treatments given the background variation.

• Practical importance relates to the scientific domain and is concerned with scientific discovery and explanation.


Power

Power is related to Type II error:

power = 1 - probability of making a Type II error

Aim: to keep power as high as possible
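
To show how power responds to sample size, here is a sketch for the simplest case: a two-sided z-test on Normal data with known standard deviation. The function name `power_z_test`, the effect size of 0.5 and the sample sizes are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def power_z_test(effect, sigma, n, alpha=0.05):
    """Power of a two-sided z-test to detect a true mean shift of size `effect`.

    Assumes Normal data with known standard deviation sigma.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect / (sigma / math.sqrt(n))
    # Probability that the test statistic falls in either rejection region.
    return (1 - std_normal.cdf(critical - shift)) + std_normal.cdf(-critical - shift)

for n in (10, 20, 50, 100):
    power = power_z_test(effect=0.5, sigma=1.0, n=n)
    print(f"n = {n:3d}: power = {power:.2f}, Type II error probability = {1 - power:.2f}")
```

For a fixed effect and significance level, collecting more data is the usual way to raise power.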


• relationships- linear or otherwise


Correlations and linear relationships

• Pearson correlation

• Strength of linear relationship

• Simple indicator lying between –1 and +1

• Check your plots for linearity


Interpreting correlations

• The correlation coefficient is a measure of the strength of the linear association between two variables.

• If the relationship is non-linear, the coefficient can still be evaluated and may appear sensible, so beware: plot the data first.
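
A short sketch of why the plot matters. The hand-written `pearson` function and the two tiny data sets are hypothetical; the second set depends perfectly on x, but not linearly, so the coefficient on its own gives a misleading picture of the relationship.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]   # exact straight line
curved = [v ** 2 for v in x]      # exact, but non-linear, dependence

print(f"linear relationship: r = {pearson(x, linear):.2f}")
print(f"curved relationship: r = {pearson(x, curved):.2f}")
```

The coefficient only measures how well a straight line describes the data, so a scatterplot is needed to see what kind of relationship is actually present.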


[Figure: Scatterplots of log P vs log Fe, log N vs log Fe, and log P vs log N; values shown on the figure: 0.167, 0.134, 0.380]


• what is a statistical model?


Statistical models

• Outcomes or Responses: these are the results of the practical work and are sometimes referred to as ‘dependent variables’.

• Causes or Explanations: these are the conditions or environment within which the outcomes or responses have been observed and are sometimes referred to as ‘independent variables’, but more commonly known as covariates.


Specifying a statistical model

• Models specify the way in which outcomes and causes link together, e.g.

• Metabolite ~ Temperature

• There should be an additional item on the right-hand side, giving the formula:

• Metabolite ~ Temperature + Error


statistical model interpretation

• Metabolite ~ Temperature + Error

• The outcome Metabolite is explained by Temperature and other things that we have not recorded which we call Error.

• The task that we then have in terms of data analysis is simply to find out if the effect that Temperature has is ‘large’ in comparison to that which Error has so that we can say whether or not the Metabolite that we observe is explained by Temperature.
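
One way to make this comparison concrete is a straight-line least-squares fit, which splits the variation in the outcome into a part explained by Temperature and a residual (Error) part. This is a sketch under assumptions: the linear form, the function name `fit_line` and the data values are all invented for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; also returns R-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Split the total variation in y into a part explained by x and a residual part.
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    total_ss = sum((b - my) ** 2 for b in y)
    return intercept, slope, 1 - residual_ss / total_ss

# Hypothetical data, invented for illustration.
temperature = [10, 12, 14, 16, 18, 20]
metabolite = [2.1, 2.9, 3.2, 4.1, 4.4, 5.3]

intercept, slope, r_squared = fit_line(temperature, metabolite)
print(f"Metabolite = {intercept:.2f} + {slope:.3f} * Temperature + Error")
print(f"share of variation explained by Temperature: {r_squared:.2f}")
```

An R-squared near 1 means the Temperature effect is large relative to Error; a formal judgement would still use a hypothesis test or confidence interval for the slope.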


summary

• hypothesis tests and confidence intervals are used to make inferences

• we build statistical models to explore relationships and explain variation

• a general linear modelling framework is very flexible

• assumptions should be checked.