SEEC Toolbox seminars

SEEC Toolbox seminars: Data Exploration. Greg Distiller, 29th November

SEEC Toolbox seminars: Data Exploration

Greg Distiller

29th November

Motivation

- The availability of sophisticated statistical tools has grown. But “rubbish in - rubbish out”.

- “. . . even the well-known assumptions continue to be violated frequently”.

Motivation

Examples of statistical pitfalls:

- Collinearity → Type II errors

- Zero inflation in GLMs may bias parameter estimates

- Spatial / temporal autocorrelation → Type I errors

- Linearity

- Ability to model interactions

Motivation

- These various pitfalls can be avoided with proper data exploration.

- Exploration is separate from hypothesis testing.

- If a priori knowledge is very thin, exploration can be used as a hypothesis-generating exercise.

- Focus on graphical tools rather than testing.

Outliers

- Some techniques can be dominated by outliers.
- They should not be removed as a matter of course!
- Boxplots are typically used.
- Cleveland dotplot (row numbers vs observations) provides more information.
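
The two plots named above can be sketched in base R as follows. This is a minimal illustration, not the seminar's own code: the `wing` measurements are simulated and the suspect value of 95 is planted on purpose.

```r
# Simulated data (hypothetical): 50 plausible wing lengths plus one suspect value.
set.seed(1)
wing <- c(rnorm(50, mean = 60, sd = 3), 95)

# Boxplot: flags points more than 1.5 * IQR beyond the box.
flagged <- boxplot.stats(wing)$out
boxplot(wing, horizontal = TRUE, xlab = "Wing length (mm)")

# Cleveland dotplot: observed value on the x-axis, row number on the y-axis.
dotchart(wing, xlab = "Wing length (mm)", ylab = "Row number")

print(flagged)
```

The dotplot shows every observation in data order, so it can also reveal whether unusual values cluster in particular rows, which a boxplot hides.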

Outliers

- Extreme values by chance?
- Can simulate random observations.
- Can consider dropping if they seem to be measurement errors.
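
One way to simulate random observations is sketched below, assuming for illustration that the data are roughly normal. The sample size `n`, the number of simulations `n_sim` and the 3 SD threshold are arbitrary choices, not values from the seminar.

```r
set.seed(42)
n <- 100        # sample size of the (hypothetical) real data set
n_sim <- 2000   # number of simulated data sets

# Largest absolute standardised value in each simulated normal sample.
max_abs_z <- replicate(n_sim, max(abs(scale(rnorm(n)))))

# How often does pure chance produce a value more than 3 SDs from the mean?
prop_beyond_3 <- mean(max_abs_z > 3)
print(prop_beyond_3)
```

If a sizeable share of simulated samples contains a value this extreme, an observed value of similar size is weak evidence of a measurement error.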

Outliers

- Outliers in X vs Y space.
  - Influential observations are usually both!

- Transformations are an option but really not ideal (especially for the response).
  - Better to choose a more suitable model, or a more suitable metric if not modelling.
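
The distinction between outliers in X space and in Y space can be illustrated with standard regression diagnostics. The sketch below uses simulated data in which observation 31 is deliberately made extreme in both directions. Leverage, standardised residuals and Cook's distance are one common set of measures, not necessarily the ones used in the seminar.

```r
set.seed(3)
x <- c(runif(30, 0, 10), 25)          # last point is extreme in X (high leverage)
y <- 2 + 0.5 * x + rnorm(31, sd = 1)
y[31] <- y[31] - 10                   # ... and also extreme in Y given X

fit <- lm(y ~ x)
lev  <- hatvalues(fit)        # outlying in X space
res  <- rstandard(fit)        # outlying in Y space (standardised residuals)
cook <- cooks.distance(fit)   # combines both into a measure of influence

most_influential <- unname(which.max(cook))
print(most_influential)
```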

Homogeneity of variance

- Important assumption in regression-type models (incl. ANOVA) and MV techniques like discriminant analysis.
- Regression-type models → verify with residuals.
- Can use conditional boxplots.

Source: Milandri S.G. et al (2012)
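
A minimal sketch of checking homogeneity with residuals and conditional boxplots is given below. The data are simulated, with the noise standard deviation made to grow from site A to site C. The variable names are hypothetical and unrelated to the study cited above.

```r
set.seed(7)
site <- factor(rep(c("A", "B", "C"), each = 40))
# Assumed set-up: the noise SD grows from site A (1) to site C (4).
y <- c(rnorm(40, 10, 1), rnorm(40, 12, 2), rnorm(40, 14, 4))

fit <- lm(y ~ site)
res <- residuals(fit)

# Conditional boxplots: one box of residuals per site.
boxplot(res ~ site, xlab = "Site", ylab = "Residuals")

# Residuals vs fitted values: a funnel shape suggests heterogeneity.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

sd_by_site <- tapply(res, site, sd)
print(round(sd_by_site, 2))
```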

Normality

- Is normality required? What must be normal?

Source: Zuur et al. Analysing Ecological Data

- Implies normality of the residuals.
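
The point that it is the residuals, not the raw response, that should look normal can be sketched as follows. The two-group data are simulated for illustration: the raw response is bimodal because the groups have different means, while the residuals from a model that includes the group are close to normal.

```r
set.seed(11)
group <- factor(rep(c("juvenile", "adult"), each = 100))
y <- ifelse(group == "adult", 30, 10) + rnorm(200, sd = 2)

fit <- lm(y ~ group)
res <- residuals(fit)

op <- par(mfrow = c(1, 2))
hist(y, main = "Raw response", xlab = "y")      # bimodal
qqnorm(res, main = "Residuals"); qqline(res)    # close to a straight line
par(op)
```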

Normality

- There can be more to it than meets the eye.

Excess Zeros

- Poisson GLM often used to model count data.

Source: Abraham, E. R., & Thompson, F. N. (2011)

- Zero-inflated models (ZIPs, ZINBs, Hurdle models . . .)
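
A simple exploratory check for excess zeros is to compare the number of zeros observed with the number a fitted Poisson model would expect. The sketch below uses simulated counts with 40% structural zeros. The intercept-only model and all names are assumptions made for illustration, and fitting the zero-inflated models themselves would need an add-on package that is not shown here.

```r
set.seed(5)
n <- 500
# Hypothetical zero-inflated counts: 40% structural zeros, otherwise Poisson(3).
structural_zero <- rbinom(n, 1, 0.4) == 1
count <- ifelse(structural_zero, 0L, rpois(n, lambda = 3))

fit <- glm(count ~ 1, family = poisson)
mu_hat <- fitted(fit)

observed_zeros <- sum(count == 0)
expected_zeros <- sum(dpois(0, lambda = mu_hat))  # zeros implied by the Poisson fit

print(c(observed = observed_zeros, expected = round(expected_zeros, 1)))

# Frequency plot of the counts: a spike at zero stands out.
plot(table(count), xlab = "Count", ylab = "Frequency")
```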

Excess Zeros

MV setting:
- Is there useful information in joint zeros?

- corrgram can be used to assess the prevalence of “joint absences”.
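
The quantity behind such a display can be computed in base R. The sketch below is a stand-in for the plot rather than a use of any particular plotting package: for each pair of species it gives the proportion of sites where both are absent. The site-by-species matrix `abund` is simulated.

```r
set.seed(9)
# Hypothetical site-by-species abundance matrix (60 sites, 5 species) with many zeros.
abund <- matrix(rpois(60 * 5, lambda = 0.5), nrow = 60,
                dimnames = list(NULL, paste0("sp", 1:5)))

absent <- (abund == 0) * 1    # 1 where a species is absent at a site, else 0
# Entry [i, j]: proportion of sites where species i and j are both absent.
joint_absence <- crossprod(absent) / nrow(abund)

print(round(joint_absence, 2))
```

High off-diagonal values indicate species pairs whose apparent similarity would be driven largely by shared absences.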

Collinearity

- Important when the objective is to understand which covariates drive a system.

- Common examples: morphological measurements, water depth and distance to the shore . . .

- Can lead to a confusing set of results.

Collinearity

Source: Krivokapich J et al (1999)

[Figure: scatterplot of baseline ejection fraction (y-axis, 30 to 90) against ejection fraction on dobutamine (x-axis, 30 to 90).]

Collinearity

Source: Krivokapich J et al (1999)

[Figure: two boxplot panels comparing the "No event" and "Cardiac event" groups: baseline ejection fraction (axis 20 to 80) and ejection fraction on dobutamine (axis 30 to 90).]

Collinearity

> m1 <- glm(event ~ baseef + dobef, family = binomial)
> summary(m1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.66404    0.56674   2.936 0.003323 **
baseef       0.02878    0.02605   1.105 0.269212
dobef       -0.07770    0.02333  -3.331 0.000865 ***

Collinearity

- There are some diagnostic tools:
  - VIF: $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$
  - scatterplots: plot covariates against any temporal / spatial variables.

- High or even moderate collinearity is problematic when the signal is weak.
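
The VIF can be computed by hand in base R, taking R_j^2 to be the R-squared from regressing covariate j on all the other covariates. The sketch below uses simulated covariates. The names `depth`, `distance` and `temp` are hypothetical, and `distance` is constructed to be strongly related to `depth`.

```r
set.seed(2)
n <- 200
depth    <- rnorm(n, 10, 2)
distance <- 3 * depth + rnorm(n, sd = 1)   # strongly related to depth
temp     <- rnorm(n, 15, 3)                # unrelated to the other two
X <- data.frame(depth, distance, temp)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing covariate j on the others.
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

A VIF near 1 means a covariate is close to unrelated to the others. Large values mean its coefficient's standard error is inflated by the overlap.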

Nature of the Relationship

- Use scatterplots of the response variable vs each covariate.

- Note that the absence of two-way relationships does not necessarily mean that there are no relationships.

Nature of the Relationship

[Figure: September Arctic sea ice extent (1,000,000 sq km) against year, 1980 to 2010.]

Source: Gary Witt. Using data from climate science to teach introductory statistics. Journal of Statistics Education, 21(1), 2013.

Nature of the Relationship

> m1 <- lm(ice ~ year, data = dat)
> m2 <- lm(ice ~ year + I(year^2), data = dat)
> anova(m1, m2)
Analysis of Variance Table

Model 1: ice ~ year
Model 2: ice ~ year + I(year^2)
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     32 10.529
2     31  6.822  1    3.7072 16.846 0.0002731 ***
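
The comparison above relies on a data frame `dat` that is not included in the transcript. The same steps can be tried on a simulated stand-in, as sketched below. The `ice` values are invented to follow a curved decline and are not the sea ice data from the cited source, so the numbers will differ from the output above.

```r
set.seed(21)
year <- 1979:2012
t <- year - 1979
# Simulated stand-in series: a curved decline plus noise (not real data).
ice <- 7.5 - 0.02 * t - 0.004 * t^2 + rnorm(length(year), sd = 0.3)
dat <- data.frame(year, ice)

m1 <- lm(ice ~ year, data = dat)               # straight line
m2 <- lm(ice ~ year + I(year^2), data = dat)   # adds a quadratic term

cmp <- anova(m1, m2)    # F-test: does the quadratic term reduce the RSS?
print(cmp)
p_quadratic <- cmp[["Pr(>F)"]][2]
```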

Including interactions

- A coplot is a useful plot to visualise potential interactions.

- Can reveal data sparsity.
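
A coplot can be drawn with base R's `coplot()`. The sketch below uses simulated data in which the slope of `abundance` on `temp` differs between habitats and one habitat is deliberately rare. All variable names are hypothetical.

```r
set.seed(13)
n <- 150
temp <- runif(n, 5, 25)
habitat <- factor(sample(c("forest", "grassland", "wetland"), n, replace = TRUE,
                         prob = c(0.6, 0.35, 0.05)))   # wetland deliberately rare
slope <- unname(c(forest = 0.2, grassland = 1.0, wetland = 0.5)[as.character(habitat)])
abundance <- 5 + slope * temp + rnorm(n, sd = 2)
dat <- data.frame(abundance, temp, habitat)

# One panel of abundance vs temp per habitat; differing slopes hint at an interaction.
coplot(abundance ~ temp | habitat, data = dat,
       panel = function(x, y, ...) {
         points(x, y, ...)
         if (length(x) > 2) abline(lm(y ~ x), lty = 2)
       })

# Number of observations behind each panel: small counts mean sparse data.
n_per_habitat <- table(dat$habitat)
print(n_per_habitat)
```

Panels with only a handful of points are a warning that an interaction term would be estimated from very little data.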

Independence

- A crucial assumption for many techniques.

- Violation increases type I errors.
- Some examples:
  - Data from different locations → are birds from locations close to each other more similar than birds from locations further away?
  - Individuals from one family may tend to be similar due to shared genes.
  - Repeated measurements.

Independence

- Plot the response against time or space; any pattern suggests dependence.

- More formally, plot auto-correlation functions (ACFs).
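
Both checks can be sketched with base R on simulated series: one independent and one generated as an AR(1) process with coefficient 0.7. The series and their names are illustrative assumptions.

```r
set.seed(17)
n <- 200
independent <- rnorm(n)
dependent   <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))  # AR(1) series

op <- par(mfrow = c(2, 2))
plot(independent, type = "l", xlab = "Time", main = "Independent")
plot(dependent,   type = "l", xlab = "Time", main = "AR(1)")
acf_ind <- acf(independent, main = "ACF: independent")
acf_dep <- acf(dependent,   main = "ACF: AR(1)")
par(op)

# Lag-1 autocorrelation (element 1 of $acf is lag 0).
lag1 <- c(independent = acf_ind$acf[2], dependent = acf_dep$acf[2])
print(round(lag1, 2))
```

In the ACF plot, bars that extend beyond the dashed bounds at several successive lags suggest dependence.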

Independence

When the data are not independent:

- Need a model that can account for the dependence: mixed-effects models, AR models, generalised least squares (modelling the correlation structure).

- Can sometimes model the dependence with suitable covariates.

- Residuals should display no dependence.
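
As one illustration of the AR option, the sketch below fits a regression with AR(1) errors using `arima()` with an `xreg` covariate, and compares its residual autocorrelation with that of an ordinary `lm()` fit. The data are simulated, and this is only one of the approaches listed above.

```r
set.seed(19)
n <- 200
x <- seq(0, 10, length.out = n)
err <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))  # AR(1) errors
y <- 1 + 0.5 * x + err

fit_lm <- lm(y ~ x)                                # ignores the dependence
fit_ar <- arima(y, order = c(1, 0, 0), xreg = x)   # regression with AR(1) errors

# Lag-1 autocorrelation of the residuals from each model.
lag1_lm <- acf(residuals(fit_lm), plot = FALSE)$acf[2]
lag1_ar <- acf(residuals(fit_ar), plot = FALSE)$acf[2]
print(round(c(lm = lag1_lm, ar1 = lag1_ar), 2))
```

The residuals of the model that accounts for the dependence should show much weaker autocorrelation than those of the plain linear model.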

Final thoughts

- Not all steps are relevant for all data sets.

- Some techniques require analysis first (i.e. to produce residuals).

- Transformations are not ideal.

“...the routine use and transparent reporting of systematic data exploration would improve the quality of ecological research and any applied recommendations that it produces.”

SEEC Stats Toolbox

Want to broaden your stats knowledge? Unsure of what you can do with your data? Still developing your proposal?

Join us for the monthly SEEC Stats Toolbox seminars where we introduce you to statistical methods that are useful for ecologists, environmental and conservation scientists.

See you next year!