NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA


Lecture 12 Overview

John Birks

• Topics covered
  • Exploratory data analysis
  • Clustering
  • Gradient analysis
  • Hypothesis testing

• Principle of parsimony in data analysis

• Possible future developments
  • Conventional
  • Less conventional

• Some applications
  • Volcanic tephras
  • Scotland’s most famous product

• Integrated analyses

• Problems of percentage compositional data
  • Log-ratios
  • Chameleons of CA and CCA

• Software availability

• Web sites

• Final comments

OVERVIEW

EXPLORATORY DATA ANALYSIS

Essential first step

Feel for the data – ranges, need for transformations, rogue or outlying observations

NEVER FORGET THE GRAPH

CLUSTERING

Can be useful for some purposes – basic description, summarisation of large data sets. Fraught with problems and difficulties – choice of dissimilarity coefficient (DC), choice of clustering method, difficulties of validation and evaluation

Good general-purpose methods: TWINSPAN – ORBACLAN – COINSPAN

GRADIENT ANALYSIS

Regression, calibration, ordination, constrained ordination, discriminant analysis and canonical variates analysis, analysis of stratigraphical and spatial data.

HYPOTHESIS TESTING

Randomisation tests, Monte Carlo permutation tests.

1987 Wageningen

Cajo ter Braak

Classification of gradient analysis techniques by type of problem, response model and method of estimation.

Type of problem | Linear response model: least-squares estimation | Unimodal response model: maximum likelihood estimation | Unimodal response model: weighted averaging estimation
Regression | Multiple regression | Gaussian regression | Weighted averaging of site scores (WA)
Calibration | Linear calibration; ‘inverse regression’ | Gaussian calibration | Weighted averaging of species scores (WA)
Ordination | Principal components analysis (PCA) | Gaussian ordination | Correspondence analysis (CA); detrended CA (DCA)
Constrained ordination (a) | Redundancy analysis (RDA) (d) | Gaussian canonical ordination | Canonical CA (CCA); detrended CCA (DCCA)
Partial ordination (b) | Partial components analysis | Partial Gaussian ordination | Partial CA; partial DCA
Partial constrained ordination (c) | Partial redundancy analysis | Partial Gaussian canonical ordination | Partial CCA; partial detrended CCA

(a) Constrained multivariate regression
(b) Ordination after regression on covariables
(c) Constrained ordination after regression on covariables = constrained partial multivariate regression
(d) “Reduced-rank regression” = “PCA of y with respect to x”

A straight line displays the linear relation between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data. (a = intercept; b = slope or regression coefficient).

A unimodal relation between the abundance value (y) of a species and an environmental variable (x). (u = optimum or mode: t = tolerance; c = maximum).
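The two response models named in these captions can be written as y = a + bx (linear) and y = c exp(-(x - u)² / (2t²)) (Gaussian, unimodal). The following is a minimal Python sketch of both curves; the parameter values are illustrative assumptions, not values taken from the lecture figures.

```python
import numpy as np

def linear_response(x, a, b):
    """Straight line: a = intercept, b = slope (regression coefficient)."""
    return a + b * x

def gaussian_response(x, c, u, t):
    """Unimodal (Gaussian) curve: c = maximum, u = optimum, t = tolerance."""
    return c * np.exp(-0.5 * ((x - u) / t) ** 2)

# Illustrative (made-up) gradient and parameter values
x = np.linspace(0.0, 10.0, 101)
y_lin = linear_response(x, a=1.0, b=0.5)
y_uni = gaussian_response(x, c=20.0, u=5.0, t=1.5)
```

At x = u the Gaussian curve reaches its maximum c, and one tolerance away from the optimum the expected abundance has dropped to c·exp(-0.5), roughly 61% of the maximum.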

GRADIENT ANALYSIS

Linear-based methods or unimodal-based methods

Critical question, not a matter of personal preference

If gradients are short, sound statistical reasons to use linear methods – Gaussian-based methods break down, edge effects in CA and related techniques become serious, biplot interpretations easy.

If gradients are long, linear methods become ineffective (‘horseshoe’ effect).

How to estimate gradient length?

Regression | Hierarchical series of response models: GLM and HOF
Calibration | GLM, DCCA (single x variable)
Ordination | DCA (detrending by segments, non-linear rescaling)
Constrained ordination | DCCA (detrending by segments, non-linear rescaling)
Partial ordination | Partial DCA (detrending by segments, non-linear rescaling)
Partial constrained ordination | Partial DCCA (detrending by segments, non-linear rescaling)

HYPOTHESIS TESTING

Monte Carlo permutation tests and randomisation tests

Distribution free, do not require normality of error distribution

Do require INDEPENDENCE or EXCHANGEABILITY

Validity of permutation test results depends on the validity of the type of permutation for the data set at hand.

Completely randomised observations, completely random permutation is appropriate = randomisation test.

Randomised block design – permutation must be conditioned on blocks, e.g. if type of farm is declared as a covariable and the randomisation is conditioned on it, permutations are restricted to within farm.

Time series or line transect – restricted permutations and data kept in order.

Spatial data on grid – restricted permutations and data kept in position.

Repeated measurements – BACI (before-after-control-impact)
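The idea of conditioning a permutation test on blocks can be sketched in Python as follows. This is only an illustration under stated assumptions: the test statistic (absolute difference in treatment means), the function names, and the simulated farm data are all hypothetical and are not taken from CANOCO or any other package mentioned in the lecture.

```python
import numpy as np

def permute_within_blocks(values, blocks, rng):
    """Shuffle values only among observations that share a block label."""
    permuted = values.copy()
    for b in np.unique(blocks):
        idx = np.flatnonzero(blocks == b)
        permuted[idx] = rng.permutation(values[idx])
    return permuted

def block_permutation_test(y, treatment, blocks, n_perm=999, seed=0):
    """Monte Carlo p-value for a difference in treatment means,
    with permutations restricted to within blocks."""
    rng = np.random.default_rng(seed)

    def stat(t):
        return abs(y[t == 1].mean() - y[t == 0].mean())

    observed = stat(treatment)
    exceed = 0
    for _ in range(n_perm):
        if stat(permute_within_blocks(treatment, blocks, rng)) >= observed:
            exceed += 1
    # The observed arrangement counts as one of the permutations
    return (exceed + 1) / (n_perm + 1)

# Hypothetical data: 4 farms (blocks), 6 plots each, 3 treated per farm
rng = np.random.default_rng(42)
blocks = np.repeat(np.arange(4), 6)
treatment = np.tile([0, 0, 0, 1, 1, 1], 4)
y = 5.0 * blocks + 2.0 * treatment + rng.normal(scale=0.5, size=24)
p_value = block_permutation_test(y, treatment, blocks)
```

Because labels are only shuffled within a farm, the large between-farm differences never enter the reference distribution. For time series or spatial grids a different restriction (for example cyclic shifts that keep the data in order) would be needed; that is not shown here.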

PRINCIPLE OF PARSIMONY IN DATA ANALYSIS

William of Occam (Ockham), 14th century English nominalist philosopher. Insisted that given a set of equally good explanations for a given phenomenon, the explanation to be favoured is the SIMPLEST EXPLANATION.

Strong appeal to common sense.

Entities should not be multiplied without necessity.

It is vain to do with more what can be done with less.

An explanation of the facts should be no more complicated than necessary.

Among competing hypotheses or models, favour the simplest one that is consistent with the data.

‘Shaved’ explanations to the minimum.

In data analysis:

1) Models should have as few parameters as possible.

2) Linear models should be preferred to non-linear models.

3) Models relying on few assumptions should be preferred to those relying on many.

4) Models should be simplified/pared down until they are MINIMAL ADEQUATE.

5) Simple explanations should be preferred to complex explanations.

MINIMAL ADEQUATE MODEL (MAM)

- as statistically acceptable as the most complex model

- only contains significant parameters

- high explanatory power

- large number of degrees of freedom

- may not be one MAM

CLUSTERING - prefer simple cluster analysis methods (few assumptions, simple values of α, β, γ)

- intuitively sensible

REGRESSION - GAM – GLM

- In GAM, simplest smoothers to be used

- In GLM, model simplification to find MAM (e.g. AIC)

CALIBRATION - minimum number of components for lowest RMSEP in PLS or WA-PLS

ORDINATION - retain smallest number of statistically significant axes (broken stick test)

- retain ‘signal’ at expense of noise
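The broken-stick criterion mentioned for ordination can be sketched as follows. Under the broken-stick model, the expected proportion of variance for axis k out of p axes is (1/p) Σ_{i=k..p} 1/i; an axis is retained while its observed proportion exceeds that expectation. The eigenvalues in this Python sketch are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def broken_stick(p):
    """Expected proportion of variance for each of p axes under the broken-stick model."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])

def n_axes_to_retain(eigenvalues):
    """Count leading axes whose observed proportion exceeds the broken-stick expectation."""
    ev = np.asarray(eigenvalues, dtype=float)
    observed = ev / ev.sum()
    expected = broken_stick(len(ev))
    n = 0
    for o, e in zip(observed, expected):
        if o <= e:
            break
        n += 1
    return n

# Made-up eigenvalues for a five-axis ordination
n_keep = n_axes_to_retain([6.0, 2.0, 1.0, 0.5, 0.5])
```

With these illustrative eigenvalues only the first axis (60% observed against about 46% expected) passes, so one axis would be retained.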

RELEVANCE OF PRINCIPLE OF PARSIMONY TO DATA ANALYSIS

PARTIAL ORDINATION

remove effects of ‘nuisance variables’ (covariables or concomitant variables) by partialling out their effects

ordination of residuals

retain smallest number of statistically significant axes (broken stick test)

‘signal’ at expense of ‘noise’ and ‘nuisance variables’

CONSTRAINED ORDINATION

most powerful if the number of predictor variables is small compared to number of samples. Constraints are strong, arch effects avoided, no need for detrending, outlier effects minimised

minimal adequate model (forward selection, VIF, variable selection, AIC)

only retain statistically significant axes

PARTIAL CONSTRAINED ORDINATION

as above + partial ordination

STRATIGRAPHICAL DATA ANALYSIS

only retain statistically significant zones

simplify data to major axes or gradients of variation

CHOICE BETWEEN INDIRECT & DIRECT GRADIENT ANALYSIS

Indirect gradient analysis – two steps

Direct gradient analysis– one combined step

If relevant environmental data are to hand, direct approach is likely to be more effective and simpler than indirect approach. Generally achieve a simpler model from direct gradient analysis.

CHOICE BETWEEN REGRESSION & CONSTRAINED ORDINATION

Both regression procedures! One Y or many Y.

Depends on purpose – is it an advantage to analyse all species simultaneously or individually?

Community assemblage or individual taxa?

CONSTRAINED ORDINATION | REGRESSION
HOLISTIC | INDIVIDUALISTIC
COMMON GRADIENTS | SEPARATE GRADIENTS
QUICK, SIMPLE | SLOW, COMPLEX, DEMANDING
LITTLE THEORY | MUCH THEORY (GLM)
EXPLORATORY | MORE CONFIRMATORY, IN DEPTH

LIMITING FACTORS

Research questions

Hypotheses to be tested and evaluated

Data quality

TYPES OF GRADIENT ANALYSIS METHODS BASED ON WEIGHTED AVERAGING

Community data - incidences (1/0) or abundances (≥ 0) of species at sites.

Environmental data - quantitative and/or qualitative (1/0) variables at same sites.

Use weighted averages of species scores (appropriate for unimodal biological data) and linear combinations (weighted sums) of environmental variables (appropriate for linear environmental data)

Method | Abbreviation | Response variables (y) | Predictors (x) | Lecture
Correspondence analysis | CA (also DCA) | Community data | - | 6
Canonical correspondence analysis | CCA (also DCCA) | Community data | Environmental variables | 7
CCA partial least squares | CCA-PLS | Community data | Many environmental variables | 11
Weighted averaging calibration | WA | Environmental variable | Community data | 8
WA partial least squares | WA-PLS | Environmental variable(s) | Community data | 8
Co-correspondence analysis | CO-CA | Community data | Community data | 11

Also partial CA, partial DCA, partial CCA, partial DCCA.
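The two weighted-averaging steps that underlie these methods can be sketched in Python: species optima are abundance-weighted averages of the environmental variable over sites, and inferred site values are abundance-weighted averages of the species optima. The tiny data set and function names below are assumptions for illustration; no deshrinking step is included, and every species and every site is assumed to have a non-zero total.

```python
import numpy as np

def wa_optima(Y, x):
    """WA regression: species optimum = abundance-weighted mean of x over sites."""
    return (Y * x[:, None]).sum(axis=0) / Y.sum(axis=0)

def wa_infer(Y, optima):
    """WA calibration: site value = abundance-weighted mean of species optima."""
    return (Y * optima[None, :]).sum(axis=1) / Y.sum(axis=1)

# Made-up data: 3 sites (rows) x 2 species (columns), one environmental variable
Y = np.array([[10.0, 0.0],
              [5.0, 5.0],
              [0.0, 10.0]])
x = np.array([4.0, 5.0, 6.0])

optima = wa_optima(Y, x)        # approximately [4.33, 5.67]
inferred = wa_infer(Y, optima)  # approximately [4.33, 5.00, 5.67]
```

The inferred values span a narrower range than the observed values (4 to 6), which is the shrinkage that a deshrinking regression is normally used to correct.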

POSSIBLE FUTURE DEVELOPMENTS - CONVENTIONAL

Lecture | Topic | Possible developments
2 | Exploratory data analysis | Model-specific ‘outlier’ detection; interactive graphics
3 | Clustering | COINSPAN; better randomisation tests; CART; latent class analysis
4, 5 | Regression analysis | GLM and GAM framework; evaluation by cross-validation. Give up SS, deviance, t, etc!
6 | Indirect gradient analysis | ? quest for the ‘ideal’ ordination method; 2-matrix CA and PCA
7 | Direct gradient analysis | 3-matrix CCA and RDA (biology, environment, species attributes); multi-component variance partitioning; vector-based reduced rank models with GAMs
8 | Calibration and reconstruction | WA-PLS; non-linear deshrinking; ? ML; mixed response models; chemometrics; Bayesian framework; more consideration of spatial autocorrelation
9 | Classification | ? give up classical methods; use permutation tests; classification and regression trees and random forests
10 | Stratigraphical and spatial data | ? more consideration of temporal and spatial autocorrelation
11 | Hypothesis testing | More realistic permutation tests (restrictions); better p estimation

Back propagation neural network – layers containing neurons

input vector → input layer → hidden layer → output layer → output vector

Clearly can have different types of input and output vectors, e.g.

INPUT VECTORS | OUTPUT VECTORS | Analysis
> 1 Predictor | 1 or more Responses | Regression
> 1 ‘Responses’ | 1 or more ‘Predictors’ | Inverse regression or calibration
> 1 Variables | 2 or more Classes | Discriminant analysis

NEURAL NETWORKS – THE LESS CONVENTIONAL DATA ANALYSIS APPROACH IN THE FUTURE?

Malmgren & Nordlund (1997) Palaeo-3 136, 359–373

Planktonic foraminifera 54 core-top samples

Summer water and winter water temperatures

Core E48–22 Extends to oxygen stage 9 320,000 years

Compared neural network as a calibration tool with:

- Imbrie & Kipp principal component regression
- 2-block PLS (SIMCA)
- Modern analog technique (MAT)
- WA-PLS

CRITERION FOR NETWORK SUCCESS

Cross-validation leave-one-out

Estimate RMSE (average error rate in training set)

RMSEP (predictions based on leave-one-out cross-validation)

3 neurons, 600–700 cycles

Method | RMSEP summer (°C) | RMSEP winter (°C) | r summer | r winter
Neural network | 0.71 | 0.76 | 0.99 | 0.98
PLS | 1.01 | 1.05 | 0.98 | 0.97
MAT | 1.26 | 1.14 | 0.97 | 0.96
Imbrie & Kipp | 1.22 | 1.05 | 0.97 | 0.96
WA-PLS | 1.04 | 0.86 | 0.97 | 0.96

CALIBRATION (INVERSE REGRESSION) AND ENVIRONMENTAL RECONSTRUCTIONS

Changes in root-mean-square errors (RMSE) for S in relation to number of training epochs for 3-layer BP neural networks with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. The networks were trained over 50 intervals of 100 epochs each (in total of 5,000 epochs). As expected, the RMSEs decrease as training proceeds. The minimum RMSE, 0.3539, was obtained after training a network with 10 neurons in the hidden layer over 5,000 epochs. Similar results were obtained also for W (not shown in diagram).

Changes in root-mean-square errors of prediction (RMSEP) for S with increasing number of training epochs in a 3-layer back propagation neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. These error rates were determined using the Leave-One-Out technique, implying training of the networks over 54 sets consisting of 53 observations each, with one observation left out for later testing. The lowest RMSEPs for both S and W, 0.7176 and 0.7636, respectively, were obtained for a configuration with 3 neurons (only the results for S are shown in the diagram). Note that set-ups with 1, 2, and 3 neurons gave lower RMSEPs than for 4, 5, and 10 neurons.

Relationships between observed and predicted S and W using a 3-layer BP neural network with 3 neurons in the hidden layer. Lines are linear regression lines. The product-moment correlation coefficients (r) are shown in the lower right-hand corners.

Summer Winter

Prediction errors for different network configurations: root-mean-square errors for the differences between observed and predicted S and W using a 3-layer BP neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer.

Root-mean-square errors of prediction (RMSEP) are based on the Leave-One-Out technique in which each of the 54 observations in the data set is left out one at a time and the network is trained on the remaining observations. The trained network is then used to predict the excluded observation. The network was run over 50 intervals of 100 epochs each, and the error rates were recorded after each interval.

No. neurons | S: RMSEP | S: No. epochs | W: RMSEP | W: No. epochs
1 | 0.8779 | 500 | 0.8796 | 300
2 | 0.7850 | 1800 | 0.9013 | 700
3 | 0.7176 | 600 | 0.7636 | 700
4 | 1.0621 | 700 | 0.8776 | 700
5 | 1.0032 | 2200 | 0.9206 | 3600
10 | 1.2108 | 500 | 0.9332 | 3000

Prediction error for different methods: Root-mean-square errors of prediction (RMSEP) for S and W obtained from a 3-layer BP network, Imbrie-Kipp Transfer Functions (IKTF), the Modern Analog Technique (MAT), and Soft Independent Modelling of Class Analogy (SIMCA)

Method | S | W
BP network (neural network) | 0.7176 | 0.7636
IKTF | 1.2224 | 1.0550
MAT | 1.2610 | 1.1346
SIMCA (PLS) | 1.0058 | 1.0501
WA-PLS | 1.0419 | 0.8560

Predictions were made using the Leave-One-Out technique

Predictions of S and W in core E48-22 from southern Indian Ocean based on a BP network, compared to the oxygen isotope (δ18O of Globorotalia truncatulinoides) curve presented by Williams (1976) for the uppermost 440 cm of the core. The cross-correlation coefficients for the relationships between δ18O and the predicted S and W are –0.68 and –0.71, respectively, for zero lags (p<0.001). Interglacial isotope stages 1, 5, 7, and 9 as interpreted here, are indicated in the diagram.

Problems with ANN implementation and cross-validation

Easy to over-fit the model.

Leave-one-out cross-validation is not a stringent test as ANN will continue to train and optimise its network to the one sample left out. Need a training set (ca. 80%) and an optimisation (or selection set) (ca. 10%) to select the ANN model with the lowest prediction error AND an independent test set (ca. 10%) whose prediction error is calculated using the model selected by the optimisation set.
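The three-way partition described above can be sketched in Python. This is a minimal illustration only: the function name and the exact rounding are assumptions, and the fractions follow the approximate 80/10/10 proportions given in the text rather than any published split.

```python
import numpy as np

def three_way_split(n_samples, frac_train=0.8, frac_opt=0.1, seed=0):
    """Randomly partition sample indices into training, optimisation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(frac_train * n_samples))
    n_opt = int(round(frac_opt * n_samples))
    train_idx = order[:n_train]
    opt_idx = order[n_train:n_train + n_opt]
    test_idx = order[n_train + n_opt:]   # never used for training or model selection
    return train_idx, opt_idx, test_idx

train_idx, opt_idx, test_idx = three_way_split(947)
```

The model would be trained on `train_idx`, the network configuration chosen by its error on `opt_idx`, and the reported prediction error computed once on `test_idx`. Repeating the split with different seeds gives a distribution of test errors rather than a single value.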

Telford et al. (2004) Palaeoceanography 19

947 Atlantic foraminifera data.

Split randomly 100 times into training set (747 samples), optimisation set (100 samples), and test set (100 samples).

Median RMSEP (ºC) | ANN | MAT
Training set | 0.72 | 0.94
Optimisation set | 0.94 | 0.94
Test set | 1.11 | 1.02

No advantage in the hours of ANN computing when cross-validated rigorously. ANN appears to be a very complicated (and slow) way of doing a MAT!

May not be so good after all!

Descriptive statistics for the SWAP diatom-pH data set

DIATOMS AND NEURAL NETWORKS

 | Min. | Median | Mean | Max. | S.D. | Range
N2 for samples | 5.13 | 28.58 | 29.22 | 57.18 | |
N2 for taxa | 1 | 14.99 | 23.76 | 120.86 | |
pH | 4.33 | 5.27 | 5.56 | 7.25 | 0.77 | 2.92

No. of samples: 167

No. of taxa: 267

% no. of +ve values in data: 18.47

Total inertia: 3.39

SWAP data-set: 167 lakes convergence

Artificial Neural Network

Yves Prairie & Julien Racca (2002)

SWAP data-set: 167 lakes

jack-knife predicted pH against observed pH



pH reconstruction by ANN and WA-PLS: (RLGH core)

SKELETONISATION ALGORITHM

Pruning algorithm comparable to BACKWARD ELIMINATION in regression models

1. Measure relevance Pi for each taxon i

Pi = E without i – E with i where E = RMSE

2. Train network with all taxa using back-propagation

3. Compute relevance Pi based on error propagation and weights

4. Delete the taxon with the smallest estimated relevance Pi [Did this in 5% classes of importance]

5. Re-train the network to a minimum again [After deleting a taxon, the values of the remaining taxa are not re-calculated, so the input data are always the same original relative abundance values] Racca et al. (2003)
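The pruning loop is analogous to backward elimination, so its logic can be sketched with a much simpler stand-in model. In the Python sketch below an ordinary least-squares fit replaces the neural network, and the relevance Pi = E(without i) - E(with i) is obtained by direct refitting rather than from error propagation and weights as in the skeletonisation procedure. The data are simulated and the function names are hypothetical.

```python
import numpy as np

def rmse(X, y):
    """Apparent RMSE of a least-squares fit (stand-in for the trained network)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((y - A @ coef) ** 2)))

def prune_by_relevance(X, y, n_keep):
    """Repeatedly delete the predictor with the smallest relevance
    P_i = E(without i) - E(with i), refitting after each deletion."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        e_with = rmse(X[:, keep], y)
        relevance = [rmse(X[:, [j for j in keep if j != i]], y) - e_with
                     for i in keep]
        del keep[int(np.argmin(relevance))]
    return keep

# Simulated data: predictors 0 and 1 matter, predictors 2 and 3 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)
kept = prune_by_relevance(X, y, n_keep=2)
```

As in step 5, the remaining columns are used unchanged after each deletion; nothing is re-scaled or re-expressed as a percentage of the reduced total.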

[Figure: ANN functionality. Axis labels: N2; leave-one-out predicted pH (ANN)]

[Figure: ROUND LOCH OF GLENHEAD pH reconstructions. Curves: 0% pruned ANN, 30% pruned ANN, 60% pruned ANN, 85% pruned ANN, all taxa WA, all taxa ML]

General characteristics of the 37 most functional taxa for calibration based on ANN modelling approach.

Summary statistics of the SWAP diatom pH inference models according to the classes of taxa included based on the Skeletonisation procedure

Apparent and cross-validation

Ideally apparent RMSE should be a reliable measure of the actual predictive ability of a model, and the difference between apparent and cross-validated RMSE indicates the extent to which the model has overfitted the data.

Examples of the recently published diatom-based inference models in palaeolimnology used.

MAXIMUM ROBUSTNESS – ratio of taxa : lakes as small as possible

(1) increase the number of lakes

(2) decrease the number of taxa

CURSE OF DIMENSIONALITY related to ratio of number of taxa to number of lakes, as this ratio determines the ratio of the dimensional space in which the function is determined to the number of observations for which the function is determined.

“Neural networks have the potential for data analysis and represent a viable alternative to more conventional data-analytical methods”. Malmgren & Nordlund (1997)

Advantages:
1) Mixed linear and non-linear responses.
2) Good empirical performance.
3) Wide applicability.
4) Many predictors and many ‘responses’.

Disadvantages:
1) Very much a black box.
2) Conceptually complex.
3) Little underlying theory.
4) Easy to misuse and report erroneous model performance statistics.

PATTERN RECOGNITION

Unsupervised (cluster analysis, indirect gradient analysis) or supervised (discriminant analysis, direct gradient analysis)

[Diagram labels: Neural network; BELIEF NETWORKS; Statistical theory; Linear methods; Discriminants & Decision Theory; LDA; Non-parametric methods; CART trees; Nearest-neighbour; K-NN]

Vedde Ash (rhyolitic type), mid Younger Dryas, ca 10600 14C yrs BP
- Kråkenes, Norway
- Several other sites in W Norway
- Borrobol, Scotland
- Tynaspirit, Scotland
- Whitrig, Scotland

Vedde (basaltic type)
- Kråkenes, W Norway

Borrobol, Lower LG Interstadial, ca 12500 14C yrs BP
- Borrobol, Scotland
- Tynaspirit, Scotland
- Whitrig, Scotland

Saksunarvatn, early Holocene, ca 9000 14C yrs BP = 9930 – 10010 cal yr
- Faeroes
- Kråkenes, W Norway
- Dallican Water, Shetland

SiO2 TiO2 Al2O3 FeO MnO MgO CaO Na2O K2O

“The way in which correlation by tephrochronology may revolutionise approaches to reconstructing the sequence of events in the N.E.Atlantic region...” Lowe & Turney (1997)

VOLCANIC TEPHRAS IN N.W.EUROPE OF LATE-GLACIAL AND EARLY HOLOCENE AGE

[Figure panels: SiO2, K2O, CaO, MgO, TiO2, Al2O3, FeO, Na2O; each panel labelled SBVBV along the axis]

CANONICAL VARIATES ANALYSIS (= multiple discriminant analysis)

[Figure: CVA, group means. λ1 = 0.988 (32.9%); λ2 = 0.841 (28%)]

[Figure: CVA, individual samples. Groups: Vedde, Vedde basaltic, Saksunarvatn, Borrobol]

[Figure: CVA, biplot of variables. Groups: Vedde Scotland (+ a few Vedde Norway), Vedde Norway, Borrobol, Saksunarvatn, Vedde basaltic]

[Figure: Minimum-variance cluster analysis, √% data = chord distance; cophenetic correlation 0.955. Groups: Borrobol, Saksunarvatn, Vedde basaltic, Vedde Norway, Vedde Scotland]

[Figure: PCA of √% data, all samples. λ1 = 0.96 (95.9%); λ2 = 0.016 (1.6%); total 97.4%. Groups: Vedde Norway, Vedde Scotland, Borrobol, Saksunarvatn, Vedde basaltic]

“Tephrochronology offers the potential of overcoming problems of correlation because ash layers provide time-parallel markers and therefore precise comparisons between sequences”

“The geochemical signature of each ash is unmistakable”

Lowe & Turney (1997)

Turney et al. (1997)


Lapointe & Legendre (1994)

Applied Statistics 43, 237-257

SCOTLAND'S MOST FAMOUS PRODUCT

Dendrogram representing the minimum variance hierarchical classification of single-malt Scotch whiskies: two scales are provided at the top of the graph - the number of groups formed by cutting the dendrogram vertically at the given points and the fusion distances of the hierarchical classification (represented by vertical segments in the dendrogram); the vertical order of the whiskies is partly arbitrary - swapping the branches of a dendrogram does not change the corresponding cophenetic matrix (the 12 groups detailed in Appendix A are labelled A-L here)

Map of Scotland showing the positions of the Scottish distilleries, divided into 11 groups (symbols) in the regional classification of single-malt whiskies (Appendix B) (the six Speyside groups are deferred to Fig. 3): distillery names are represented by four-letter abbreviations (see Fig. 3); the names of regions and of some major cities are also indicated - notice that two Scotches in the present study come from the Springbank distillery; Springbank pertains to the western group whereas Longrow is a member of the Islay group.

Map of the Speyside region showing six of the 11 groups (symbols) of Scotch distilleries of the regional classification of single-malt whiskies (Appendix B) (the names of regions and of some major cities are also indicated) and abbreviations and full names of the distilleries.

Looked at spatially constrained classification and constrained ordination (RDA)

Looked at similarities between results based on:

- Colour
- Nose
- Body
- Palate
- Finish

All give consistent results. Can use one to predict the other, except for finish.

TEST OF CONGRUENCE AMONG DISTANCE MATRICES (CADM)

Legendre & Lapointe (2005)

5 data sets:

1 - colour (14 variables +/-)
2 - nose (12 variables +/-)
3 - body (8 variables +/-)
4 - palate (15 variables +/-)
5 - finish (19 variables +/-)

(1 - Jaccard coefficient)½ to give 5 distance matrices
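The distance used here, the square root of (1 - Jaccard coefficient), can be computed for presence/absence (+/-) data as in the following Python sketch. The three-row example matrix is made up, the function name is hypothetical, and treating two all-absent rows as identical is an assumption of this sketch.

```python
import numpy as np

def jaccard_distance_matrix(B):
    """sqrt(1 - Jaccard similarity) between the rows of a 0/1 matrix."""
    B = np.asarray(B, dtype=bool)
    n = B.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            both = np.sum(B[i] & B[j])      # attributes present in both rows
            either = np.sum(B[i] | B[j])    # attributes present in at least one row
            sim = both / either if either > 0 else 1.0  # assumption: all-absent pair = identical
            D[i, j] = D[j, i] = np.sqrt(1.0 - sim)
    return D

# Made-up +/- descriptors for three whiskies (rows) and four attributes (columns)
B = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 0]]
D = jaccard_distance_matrix(B)
```

Joint absences do not contribute to the Jaccard coefficient, which suits descriptor lists where most attributes are absent from most whiskies.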

Overall CADM test - null hypothesis (H0) of incongruence rejected (p = 0.0001)

Compare 1 with 2-5 - H0 rejected

Compare 2 with 1, 3-5 - H0 rejected

Compare 3 with 1, 2, 4, 5 - H0 rejected

Compare 4 with 1-3, 5 - H0 rejected

Compare 5 with 1-4 - H0 not rejected

Mantel test (2 matrices)

Finish not related to Colour, Nose, Body or Palate.
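A two-matrix Mantel test can be sketched in Python as below: the statistic is the correlation between the upper-triangle entries of two distance matrices, and the reference distribution comes from permuting the objects (rows and columns together) of one matrix. This is a simplified one-tailed sketch on simulated data with hypothetical names; it is not the CADM procedure, which extends the idea to more than two matrices.

```python
import numpy as np

def mantel_test(D1, D2, n_perm=999, seed=0):
    """One-tailed Mantel test: correlation of upper-triangle distances,
    with rows and columns of D2 permuted together."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(D1, k=1)

    def r(A, B):
        return np.corrcoef(A[iu], B[iu])[0, 1]

    observed = r(D1, D2)
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(D1.shape[0])
        if r(D1, D2[np.ix_(p, p)]) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

# Simulated example: D2 is D1 plus a little symmetric noise
rng = np.random.default_rng(3)
pts = rng.normal(size=(20, 2))
D1 = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
noise = rng.normal(scale=0.1, size=D1.shape)
D2 = D1 + (noise + noise.T) / 2.0
r_obs, p_value = mantel_test(D1, D2)
```

Permuting whole objects, rather than individual distances, respects the dependence among the entries of a distance matrix.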

Principal co-ordinates analysis of Mantel-test statistics. Axis 1 = 28.7%, axis 2 = 26.3%.

Why is FINISH so different?

It is important!

How were the whiskies tested by the tasters?

Did they swallow or spit?

If the latter, the finish variables may not be fully detected.

ONLY WHEN SWALLOWING CAN ONE TOTALLY CAPTURE THE AFTERTASTE.

But,

“some professional blenders work only with their nose, not finding it necessary to let the whisky pass their lips”.

SINGLE MALTS MUST BE SWALLOWED!

INTEGRATED ANALYSES OF BIOLOGICAL AND ENVIRONMENTAL DATA

For nature conservation and management purposes, useful to have an overview of the natural zonation of the area as a whole. Such zonation should:

1. Have characteristic or indicator species or life-forms

2. Correspond to a circumscribed range of environments

3. Have some geographical coherence

Requires integrated analysis of biological and environmental data.

INDIRECT CLUSTERING APPROACH

Biological data → clusters (e.g. TWINSPAN) → biological clusters

Biological clusters + environmental data → e.g. DISCRIM, canonical variates analysis, RIVPACS

cf. Indirect gradient analysis: biological data → PCA or CA → regression with environmental data

DIRECT CLUSTERING APPROACH

Biological data + Environmental data

Clusters or Zones

1. Latent class analysis with biological data as +/- or counts following binomial or Poisson distribution and environmental data following, after log transformation, normal distribution.

ter Braak et al. (2003) Ecological Modelling 160: 235-248

2. CCA, RDA, or DCCA of biological and environmental data combined in multivariate direct gradient analysis, followed by minimum-variance cluster analysis (Ward's method) or k-means minimum-variance cluster analysis.

Estimate characteristic species for each cluster.

Carey et al. (1995) J. Ecology 83: 833-845. Biogeographical zonation of Scotland.

Characteristic species of biogeographical zones

3. Principal co-ordinates analysis of mixed (biological and environmental) data using Gower's (1971) coefficient.

$$s_{ij} = \frac{\sum_{k=1}^{m} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{m} w_{ijk}}$$

where $s_{ijk}$ is the similarity between sites i and j as measured by the variable k and $w_{ijk}$ is typically 1 or 0 depending on whether or not the comparison is valid for variable k. Weights of zero are assigned when k is unknown for one or both sites or to binary variables to exclude negative matches. For binary variables $s_{ij}$ is the Jaccard coefficient. For categorical data the component similarity $s_{ijk}$ is one when the two sites have the same value and zero otherwise. For quantitative data

$$s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}$$

where $R_k$ is the range of variable k

AN EXAMPLE

 | Altitude | Moisture | Limestone | Sheep | Age
Site 1 | 120 | 1 | - | - | 1
Site 2 | 150 | 2 | + | - | 2
Site 3 | 110 | 3 | + | + | 3

$$s_{12} = \frac{1\,(1 - 30/40) + 1(0) + 1(0) + 0(1) + 1(0)}{1 + 1 + 1 + 0 + 1} = 0.0625$$

Clusters can then be defined using the principal co-ordinate axes scores in a minimum-variance cluster analysis or a partitioning of the sites on the basis of the ordination scores.
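The worked example can be checked with a short Python sketch of Gower's coefficient. Treating moisture and age as categorical variables is an assumption of this sketch, chosen because it gives a similarity of 0.0625 between sites 1 and 2; the function name and the encoding of +/- as True/False are hypothetical.

```python
def gower_similarity(site_a, site_b, kinds, ranges):
    """Gower similarity between two sites described by mixed variables.

    kinds[k] is "quantitative", "binary", or "categorical";
    ranges[k] is the range R_k (used only for quantitative variables).
    A value of None marks a missing observation (weight 0).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, kind, r in zip(site_a, site_b, kinds, ranges):
        if a is None or b is None:
            continue                      # w = 0: value unknown
        if kind == "quantitative":
            s = 1.0 - abs(a - b) / r
        elif kind == "binary":
            if not a and not b:
                continue                  # w = 0: negative match excluded
            s = 1.0 if (a and b) else 0.0
        else:                             # categorical
            s = 1.0 if a == b else 0.0
        numerator += s                    # w = 1
        denominator += 1.0
    return numerator / denominator if denominator > 0 else float("nan")

# Altitude, moisture, limestone, sheep, age (moisture and age assumed categorical)
kinds = ["quantitative", "categorical", "binary", "binary", "categorical"]
ranges = [40.0, None, None, None, None]   # altitude range = 150 - 110
site1 = (120, 1, False, False, 1)
site2 = (150, 2, True, False, 2)
site3 = (110, 3, True, True, 3)

s12 = gower_similarity(site1, site2, kinds, ranges)
s13 = gower_similarity(site1, site3, kinds, ranges)
s23 = gower_similarity(site2, site3, kinds, ranges)
```

For sites 1 and 2 only altitude contributes a non-zero similarity (1 - 30/40 = 0.25), sheep is excluded as a negative match, and the remaining four weights sum to 4, giving 0.25 / 4 = 0.0625.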

4. Constrained indicator species analysis (COINSPAN)

Carleton, T.J. et al. (1996) J. Vegetation Science 7: 125-130

Like TWINSPAN (biological data only) but uses CCA first axis instead of CA first axis (as in TWINSPAN) as the basis for ordering samples prior to creating dichotomies.

The resulting clustering is based on CCA axis 1, a linear combination of environmental variables that maximises the dispersion of species scores.

COINSPAN clustering thus integrates biology and environment together. Surprisingly little used - has considerable potential.

Jackson D.A. (1997) Ecology 78, 929–940

Simulated data SIM 200 observations x 5 variables

Different means and variances

x1 x2 x3 x4 x5

Mean 30 60 60 120 120

Variance 16 16 64 64 4096

Correlations between all variables = 0

Transformed into percentages

Raw data – BASIS

Transformed data – PERCENTAGE or PROPORTIONS
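The effect of closure on data of this kind can be reproduced with a short simulation. The Python sketch below draws 200 observations from independent normal distributions with the means and variances listed above, converts each row to percentages, and compares the two correlation matrices. Using normal distributions, clipping the rare negative draws to zero, and the random seed are assumptions of this sketch, not details taken from Jackson (1997).

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([30.0, 60.0, 60.0, 120.0, 120.0])
variances = np.array([16.0, 16.0, 64.0, 64.0, 4096.0])

# Basis: independent variables, so correlations should be near zero
basis = rng.normal(means, np.sqrt(variances), size=(200, 5))
basis = np.clip(basis, 0.0, None)   # assumption: abundances cannot be negative

# Composition: each row re-expressed as percentages (constant sum of 100)
composition = 100.0 * basis / basis.sum(axis=1, keepdims=True)

r_basis = np.corrcoef(basis, rowvar=False)
r_comp = np.corrcoef(composition, rowvar=False)
```

Because x5 has by far the largest variance, it dominates the row totals: its percentage is strongly negatively correlated with the percentages of the other variables, which in turn become positively correlated with one another even though the basis variables are independent.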

PROBLEMS OF PERCENTAGE (COMPOSITIONAL) DATA

Bivariate casement plots of the basis (lower triangular matrix) and composition (upper triangular matrix) for the simulated data SIM. The basis relationships are independently generated, and correlations approximate zero. Note the strong linear relationships in the composition arising due to the constant-sum constraint, i.e. matrix closure. S1-S5 represent variables.

Frequency distributions of the bivariate correlations for SIM obtained under randomization. Each plot corresponds to the correlation between two variables from the basis (lower triangular matrix) or the composition (upper triangular matrix) used in the previous figure. The basis matrix was randomized within each column, the composition recalculated, and the correlation recalculated. Each plot is a frequency distribution of the correlations obtained from 10 000 randomized matrices.

Eigenvector coefficients from a principal component analysis of the correlation matrix of SIM. Results from a PCA of the basis and the composition are presented.

Scree plots of the eigenvalues for each component from the (a) simulated data (SIM) and (b) herbivorous zooplankton data (ZOO). The solid line represents the eigenvalues from the basis data (i.e. non-standardised), and the dashed line represents the eigenvalues from the compositional data (i.e. proportions).

Scatterplots of the first two components from a principal component analysis of SIM using the (a) basis and (b) composition in calculating the correlation matrix. Letters refer to the points positioned at the ends of axes 1 and 2.


UPGMA cluster analysis based on a correlation matrix of the variables (S1-S5 and H1-H5) from: (a) the basis data of the simulation data (SIM); (b) the compositional data of SIM; (c) the basis data of the zooplankton data (ZOO); and (d) the compositional data of ZOO.


CLUSTER ANALYSIS

1. CENTRED LOG RATIO Aitchison (1986)

All variables are retained in analysis but are standardised by dividing each variable by a denominator based on a geometric composite of all variables.

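PCA of the covariance matrix

$$Y_{ij} = \operatorname{cov}\left[\log\frac{x_i}{g(x)},\ \log\frac{x_j}{g(x)}\right]$$

for i, j = 1, ..., m, where g(x) is the geometric mean of the variables, i.e.

$$g(x) = \left(\prod_{i=1}^{m} x_i\right)^{1/m}$$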

POSSIBLE SOLUTIONS

Advantages:
1. All variables are retained.
2. Pairwise relationships are the same regardless of using basis or compositional data.

Problems:
1. With SIM, correlations still very strong! (0.412 to 0.843; -0.799 to -0.906)
2. Zero values have an undefined log-ratio value. Replace zero values by a small value.
3. Matrix is singular, so only m-1 components.
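The centred log-ratio transformation and two of the properties listed above can be illustrated with a short Python sketch. The small percentage matrix is made up and the function name is hypothetical; zero values are assumed to have been replaced beforehand.

```python
import numpy as np

def clr(X):
    """Centred log-ratio: log(x_i / g(x)) for each row, g(x) = geometric mean of the row."""
    logX = np.log(np.asarray(X, dtype=float))
    return logX - logX.mean(axis=1, keepdims=True)

# Made-up compositional data: 4 samples x 3 variables (percentages, no zeroes)
X = np.array([[10.0, 20.0, 70.0],
              [30.0, 30.0, 40.0],
              [5.0, 15.0, 80.0],
              [25.0, 50.0, 25.0]])

Z = clr(X)
S = np.cov(Z, rowvar=False)   # covariance matrix that would be passed to PCA
```

Each transformed row sums to zero, so the covariance matrix is singular and has at most m - 1 non-zero eigenvalues. Multiplying a row by a constant leaves its transformed values unchanged, which is why basis and compositional data give the same result.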


2. CORRESPONDENCE ANALYSIS

Only considers proportional relationships between variables; unaffected by using basis or compositional data.

CA/DCA/CCA – focuses on relative abundances

PCA/RDA – focuses on absolute abundance

If an environmental variable influences total biomass, but leaves the species composition unchanged, the variable will be important in PCA/RDA but not at all important in CA/DCA/CCA.

One approach:
- analyse total biomass separately by regression
- analyse species composition by CCA

Analyses are fully complementary.

(PCA/RDA would probably give results close to the regression analysis).


UNRESOLVED QUESTION SINCE 1986 IN CA/CCA

How can CA and CCA

1. Model a unimodal function (cf. WA as approximate Gaussian ML regression)

and

2. Be linear, with fit

$$\frac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}} = 1 + b_{k1} x_{i1} + \dots$$

Partial answer

CA and CCA model compositional data (proportions)

This compares with Aitchison's log-ratio model and the polytomous GLM which are linear in centred logs but unimodal in the original data.


CA and CCA are methods for analysing unimodal data.

CA and CCA are CHAMELEONS

1) Unimodal methods

2) Linear methods

CCA can be derived as a weighted form of reduced rank regression = redundancy analysis = principal component analysis with respect to instrumental variables. The key element is that the relative abundance is a linear function of the environmental variables (relative here means relative to sample total and species total).

As unimodality and compositional data often go hand in hand, common element is that CCA models compositional (i.e. relative) abundance data instead of the absolute abundance data.

ECOLOGICAL TERMS

CCA (and CA) model relative abundances and take sample size for granted. Usually the α-diversity of a sample increases with its size. CCA and CA take that aspect of α-diversity for granted and focus, instead, on the β-diversity (dissimilarity between sites). If the trend in α-diversity coincides with β-diversity (e.g. species disappear one by one along a gradient), CA and CCA can extract such trends.

In unimodal context, species scores are weighted averages of sample scores and vice versa. In linear context, species scores are derived from a weighted linear regression of transformed species data on to the sample scores.

THE TWO FACES OF CORRESPONDENCE ANALYSIS AND CANONICAL CORRESPONDENCE ANALYSIS

Linear context most useful when gradient length is < 3SD. Unimodal context most useful when gradient length is > 4SD. For intermediate lengths, either context may be useful.

Can transform unimodal model into linear model by ‘take logarithms and double centre’ (for data with no zeroes).

If data contain zeroes, no explicit linearising data transformation because we cannot take logarithms. In CA and CCA, a transformation is implicit that is close to the exact transformation.

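EXACT

$\log y^{*}_{ik}$ with $y^{*}_{ik} = \dfrac{y_{ik}\, g}{g_{i}\, g_{k}}$

where $g_{i}$ and $g_{k}$ are geometric averages of $y_{ik}$ across rows and columns, respectively, and $g$ is the overall geometric average

CA/CCA

$y^{*}_{ik} = \dfrac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}}$

where $y_{i+}$ and $y_{+k}$ are the abundance totals across species in site i and across samples for species k, and $y_{++}$ is the overall total

INHERIT THEIR TWO FACES FROM MODELS OF COMPOSITIONAL DATA.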
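The two transformations can be compared numerically. The sketch below assumes the exact form is the double-centred logarithm, log(y_ik g / (g_i g_k)), and takes the implicit CA/CCA form to be y_ik y_++ / (y_i+ y_+k) - 1, which is the quantity CA decomposes. The data table and function names are invented for the illustration. Because log t is close to t - 1 when t is near 1, the two results are broadly similar for a table without strong structure.

```python
import math


def geometric_mean(values):
    """Geometric mean; raises ValueError if any value is zero."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def exact_transform(Y):
    """Double-centred logarithms: log(y_ik * g / (g_i * g_k)); needs all y_ik > 0."""
    rows, cols = len(Y), len(Y[0])
    g_row = [geometric_mean(row) for row in Y]
    g_col = [geometric_mean([Y[i][k] for i in range(rows)]) for k in range(cols)]
    g_all = geometric_mean([y for row in Y for y in row])
    return [
        [math.log(Y[i][k] * g_all / (g_row[i] * g_col[k])) for k in range(cols)]
        for i in range(rows)
    ]


def ca_transform(Y):
    """Assumed implicit CA/CCA form: y_ik * y_++ / (y_i+ * y_+k) - 1; zeros allowed."""
    rows, cols = len(Y), len(Y[0])
    row_tot = [sum(row) for row in Y]
    col_tot = [sum(Y[i][k] for i in range(rows)) for k in range(cols)]
    grand = sum(row_tot)
    return [
        [Y[i][k] * grand / (row_tot[i] * col_tot[k]) - 1 for k in range(cols)]
        for i in range(rows)
    ]


# Hypothetical abundances with no zeros: 3 sites (rows) x 3 species (columns).
Y = [
    [10.0, 12.0, 11.0],
    [20.0, 22.0, 25.0],
    [31.0, 30.0, 28.0],
]
exact = exact_transform(Y)
implicit = ca_transform(Y)
max_diff = max(
    abs(e - c)
    for exact_row, implicit_row in zip(exact, implicit)
    for e, c in zip(exact_row, implicit_row)
)
print(max_diff)
```

With a zero in the table, `exact_transform` fails because the logarithm is undefined, while `ca_transform` still returns a value (-1 for the empty cell).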


DATA TYPE AND CHOICE OF ORDINATION METHOD

Besides gradient length (in standard deviation units), data type is also important in selecting an ordination method.

                 Absolute abundance     Relative abundance (compositional differences)
Unconstrained    PCA (linear)           CA, DCA (unimodal)
Constrained      RDA (linear)           CCA, DCCA (unimodal)
Constrained      (PRC) (linear)         -

PCA/RDA are weighted summations; CA/CCA are weighted averages, hence the difference between modelling absolute values (PCA/RDA) or relative values (CA/CCA).
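The difference between a weighted summation and a weighted average can be seen by doubling every abundance in a sample. The sketch below uses made-up species scores and abundances, and ignores the centring and standardisation that PCA/RDA normally apply; it only illustrates why a sum responds to absolute abundance while an average does not.

```python
def weighted_sum(abundances, species_scores):
    """Sample score as a weighted summation (PCA/RDA-like)."""
    return sum(y * b for y, b in zip(abundances, species_scores))


def weighted_average(abundances, species_scores):
    """Sample score as a weighted average (CA/CCA-like)."""
    return weighted_sum(abundances, species_scores) / sum(abundances)


scores = [-1.0, 0.5, 2.0]          # hypothetical species scores
sample = [4.0, 10.0, 6.0]          # hypothetical abundances in one sample
doubled = [2 * y for y in sample]  # same composition, twice the total

print(weighted_sum(sample, scores), weighted_sum(doubled, scores))
print(weighted_average(sample, scores), weighted_average(doubled, scores))
```

The weighted sum doubles with the sample total, whereas the weighted average is unchanged because only the relative composition enters it.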

Absolute abundances over long gradients cannot currently be modelled satisfactorily. The data need to be partitioned into smaller gradients first (e.g. with TWINSPAN).

SPECIES ABSENCES IN DATA SETS

Besides removing the absolute abundance effect, CA, DCA, CCA, and partial CCA (and WA and WA-PLS) do not consider species absences or zero values in the biological data.

Zero values - ? Show real absence

? Reflect incomplete sampling

? Chance

Is this an advantage or disadvantage?

SOFTWARE AVAILABILITY

CANOCO & CANODRAW
MicroComputer Power
111 Clover Lane
ITHACA, NY 14850
USA
[email protected]
http://www.microcomputerpower.com

MAT, ZONE, WINTRAN, C2
Steve Juggins
Geography Department
University of Newcastle
NEWCASTLE UPON TYNE NE1 7RH
([email protected])
http://www.campus.ncl.ac.uk/staff/Stephen.Juggins/

HOF
Jari Oksanen
Department of Biology
University of Oulu
OULU
Finland
([email protected])
http://cc.oulu.fi/~jarioksa/

TWINSPAN (Mark Hill), DISCRIM (Cajo ter Braak), TWINGRP, RATEPOL, SPLIT, etc.
John Birks
Department of Biology
University of Bergen
Allégaten 41
N-5007 BERGEN
Norway
([email protected])

QUERIES

[email protected]
John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007 Bergen, Norway
Fax: (+47) 55 58 96 67

[email protected]
Gavin Simpson, Environmental Change Research Centre, University College London, Gower Street, London, WC1E 6BT, UK

http://www.homepages.ucl.ac.uk/~ucfagls/ncourse/

www.okstate.edu/artsci/botany/ordinate

Mike Palmer's ordination site with masses of documentation, explanatory notes, links, details of software, etc.

www.canoco.com

Cajo ter Braak's site about CANOCO and with answers to many frequently asked questions (FAQ)

www.microcomputerpower.com

Richard Furnas' site about CANOCO and related software availability and ordering

www.canodraw.com

Petr Šmilauer's site about CANODRAW and CANOCO and related software

http://regent.bf.jcu.cz/maed

Details of Petr Šmilauer and Jan Lepš' course and data on multivariate analysis of ecological data.

VALUABLE WEB SITES FOR NUMERICAL ECOLOGISTS AND PALAEOECOLOGISTS

WEB SITES continued

http://cc.oulu.fi/~jarioksa/

Jari Oksanen's site with his R vegan package, lecture notes, programs (e.g. HOF), documentation, comments, FAQ, and much more

http://www.bio.umontreal.ca/legendre/indexEnglish.html

Pierre Legendre's site with details of publications, software, activities, etc.

http://www.bio.umontreal.ca/Casgrain/en/labo/index.html

Software from Pierre Legendre's lab

http://labdsv.nr.usu.edu/

Dave Roberts' site about quantitative vegetation ecology with lecture notes, software details, etc.

www.nku.edu/~boycer/fso/

Rick Boyce's site about fuzzy set ordination

http://cran.r-project.org/

R website

WEB SITES continued

www.stat.auckland.ac.nz/~mja/

Marti Anderson's site with new software, details of publications, activities, etc.

www.campus.ncl.ac.uk/staff/Stephen.Juggins

Steve Juggins' site for C2, WinTran, ZONE, etc.

www.chrono.qub.ac.uk/psimpoll/psimpoll.html

Keith Bennett's site for palaeoecological software, notes, etc.

www.chrono.qub.ac.uk/inqua

Keith Bennett's site of INQUA Data Analysis Sub-Commission software, newsletters, etc.

www.env.duke.edu/landscape/classes/env358/env358.html

Dean Urban's site with excellent lecture notes on Multivariate Methods for Environmental Applications

FINAL COMMENTS

Basic building-blocks and concepts and the resulting numerical methods

Numerical Analysis of Biological Data
Niches; 'Communities'; Continuum concept; Weighted averaging; TWINSPAN; CA/DCA; Cluster analysis; Metric scaling; Non-metric scaling; Indicator-species analysis (INDVAL)

Numerical Analysis of Environmental Data
GLM; Correlation & covariance; Linear combinations; Cross-validation; Permutation tests; PLS; Cluster analysis; Multiple regression; PCA; RDA; Linear discriminant analysis, canonical correlation analysis; Regression models; Procrustes rotation; Co-inertia analysis; Gradients +

Numerical Analysis of Biological and Environmental Data
GLM & GAM; Niches & Gradients; Weighted averaging; Cross-validation; Permutation tests; PLS; Cluster analysis; Multiple regression; CA/DCA; CCA; TWINSPAN; WA, WA-PLS, CCA-PLS, Co-CA; Multiple discriminant analysis; Regression models +; DISCRIM; COINSPAN; Co-inertia analysis; Distance-based PCoA; Canonical analysis of principal co-ordinates (CAP)

Andrew Lang 1844-1912. He uses statistics as a drunken man uses lamp-posts – for support rather than illumination.

From MacKay, 1977, and reproduced through the courtesy of the Institute of Physics.

Statistics are for illumination!

Sketches illustrating statistical zap and shotgun

THE PEOPLE WHO HAVE MADE THE STATISTICAL ZAP POSSIBLE

Marti J. Anderson

Richard Telford

Pierre Legendre

Cajo J.F. ter Braak

Steve Juggins

Mark O. Hill

Gavin Simpson
