NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA
Lecture 12: Overview
John Birks
• Topics covered
  • Exploratory data analysis
  • Clustering
  • Gradient analysis
  • Hypothesis testing
• Principle of parsimony in data analysis
• Possible future developments
  • Conventional
  • Less conventional
• Some applications
  • Volcanic tephras
  • Scotland’s most famous product
• Integrated analyses
• Problems of percentage compositional data
  • Log-ratios
  • Chameleons of CA and CCA
• Software availability
• Web sites
• Final comments
OVERVIEW
EXPLORATORY DATA ANALYSIS
Essential first step
Feel for the data – ranges, need for transformations, rogue or outlying observations
NEVER FORGET THE GRAPH
CLUSTERING
Can be useful for some purposes – basic description, summarisation of large data sets. Fraught with problems and difficulties – choice of dissimilarity coefficient (DC), choice of clustering method, difficulties of validation and evaluation.
Good general-purpose methods: TWINSPAN, ORBACLAN, COINSPAN
GRADIENT ANALYSIS
Regression, calibration, ordination, constrained ordination, discriminant analysis and canonical variates analysis, analysis of stratigraphical and spatial data.
HYPOTHESIS TESTING
Randomisation tests, Monte Carlo permutation tests.
1987 Wageningen
Cajo ter Braak
Classification of gradient analysis techniques by type of problem, response model and method of estimation.
Type of problem | Linear response model (least-squares estimation) | Unimodal response model (maximum likelihood estimation) | Unimodal response model (weighted averaging estimation)
Regression | Multiple regression | Gaussian regression | Weighted averaging of site scores (WA)
Calibration | Linear calibration; ‘inverse regression’ | Gaussian calibration | Weighted averaging of species scores (WA)
Ordination | Principal components analysis (PCA) | Gaussian ordination | Correspondence analysis (CA); detrended CA (DCA)
Constrained ordination (a) | Redundancy analysis (RDA) (d) | Gaussian canonical ordination | Canonical CA (CCA); detrended CCA (DCCA)
Partial ordination (b) | Partial components analysis | Partial Gaussian ordination | Partial CA; partial DCA
Partial constrained ordination (c) | Partial redundancy analysis | Partial Gaussian canonical ordination | Partial CCA; partial detrended CCA
(a) Constrained multivariate regression
(b) Ordination after regression on covariables
(c) Constrained ordination after regression on covariables = constrained partial multivariate regression
(d) “Reduced-rank regression” = “PCA of y with respect to x”
A straight line displays the linear relation between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data. (a = intercept; b = slope or regression coefficient).
A unimodal relation between the abundance value (y) of a species and an environmental variable (x). (u = optimum or mode: t = tolerance; c = maximum).
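The two response models described in these captions can be sketched in a few lines. This is a minimal illustration, assuming the usual straight-line form y = a + bx and the standard Gaussian response curve y = c exp(-(x - u)^2 / (2t^2)); the function names and the parameter values are hypothetical and are not taken from the lecture.

```python
import numpy as np

def linear_response(x, a, b):
    """Linear response: y = a + b*x (a = intercept, b = slope)."""
    return a + b * np.asarray(x, dtype=float)

def gaussian_response(x, c, u, t):
    """Unimodal response: y = c * exp(-(x - u)^2 / (2 t^2)).
    c = maximum, u = optimum (mode), t = tolerance."""
    x = np.asarray(x, dtype=float)
    return c * np.exp(-0.5 * ((x - u) / t) ** 2)

# Hypothetical gradient and parameter values, for illustration only.
x = np.linspace(0.0, 10.0, 101)
y_lin = linear_response(x, a=1.0, b=0.5)
y_uni = gaussian_response(x, c=20.0, u=5.0, t=1.5)
```

Over a short stretch of the gradient the Gaussian curve is nearly straight, which is one way to see why linear methods are adequate when gradients are short.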
GRADIENT ANALYSIS
Linear-based methods or unimodal-based methods
Critical question, not a matter of personal preference
If gradients are short, sound statistical reasons to use linear methods – Gaussian-based methods break down, edge effects in CA and related techniques become serious, biplot interpretations easy.
If gradients are long, linear methods become ineffective (‘horseshoe’ effect).
How to estimate gradient length?
Regression | Hierarchical series of response models; GLM and HOF
Calibration | GLM, DCCA (single x variable)
Ordination | DCA (detrending by segments, non-linear rescaling)
Constrained ordination | DCCA (detrending by segments, non-linear rescaling)
Partial ordination | Partial DCA (detrending by segments, non-linear rescaling)
Partial constrained ordination | Partial DCCA (detrending by segments, non-linear rescaling)
HYPOTHESIS TESTING
Monte Carlo permutation tests and randomisation tests
Distribution free, do not require normality of error distribution
Do require INDEPENDENCE or EXCHANGEABILITY
Validity of permutation test results depends on the validity of the type of permutation for the data set at hand.
Completely randomised observations, completely random permutation is appropriate = randomisation test.
Randomised block design – permutation must be conditioned on blocks. For example, if type of farm is declared as a covariable and randomisation is conditioned on it, permutations are restricted to within farm.
Time series or line transect – restricted permutations and data kept in order.
Spatial data on grid – restricted permutations and data kept in position.
Repeated measurements – BACI
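The difference between a completely random permutation and a permutation restricted to blocks can be sketched as follows. This is an illustration only: the test statistic (absolute correlation), the function name `permutation_test`, and the simulated data with two "farms" are all assumptions made for the example, not the procedure used in CANOCO.

```python
import numpy as np

def permutation_test(x, y, blocks=None, n_perm=999, seed=0):
    """Monte Carlo permutation test of the correlation between x and y.
    If `blocks` is given, y is shuffled only within each block."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if blocks is None:
        blocks = np.zeros(len(y), dtype=int)  # one block = unrestricted
    blocks = np.asarray(blocks)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 1  # the observed ordering counts as one permutation
    for _ in range(n_perm):
        y_perm = y.copy()
        for b in np.unique(blocks):
            idx = np.flatnonzero(blocks == b)
            y_perm[idx] = rng.permutation(y[idx])
        if abs(np.corrcoef(x, y_perm)[0, 1]) >= observed:
            count += 1
    return count / (n_perm + 1)

# Hypothetical data: 20 samples, the first 10 from farm 0 and the rest from farm 1.
rng = np.random.default_rng(1)
x = np.arange(20, dtype=float)
y = 2.0 * x + rng.normal(scale=1.0, size=20)
farm = np.repeat([0, 1], 10)

p_free = permutation_test(x, y)                  # completely random permutation
p_blocked = permutation_test(x, y, blocks=farm)  # restricted to within farm
```

Time series, line transects and grids need other restrictions (for example cyclic shifts that keep the data in order), which this sketch does not cover.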
PRINCIPLE OF PARSIMONY IN DATA ANALYSIS
William of Occam (Ockham), 14th century English nominalist philosopher. Insisted that given a set of equally good explanations for a given phenomenon, the explanation to be favoured is the SIMPLEST EXPLANATION.
Strong appeal to common sense.
Entities should not be multiplied without necessity.
It is vain to do with more what can be done with less.
An explanation of the facts should be no more complicated than necessary.
Among competing hypotheses or models, favour the simplest one that is consistent with the data.
‘Shaved’ explanations to the minimum.
In data analysis:
1) Models should have as few parameters as possible.
2) Linear models should be preferred to non-linear models.
3) Models relying on few assumptions should be preferred to those relying on many.
4) Models should be simplified/pared down until they are MINIMAL ADEQUATE.
5) Simple explanations should be preferred to complex explanations.
MINIMAL ADEQUATE MODEL (MAM)
- as statistically acceptable as the most complex model
- only contains significant parameters
- high explanatory power
- large number of degrees of freedom
- may not be one MAM
CLUSTERING - prefer simple cluster analysis methods (few assumptions, simple values of α, β, γ)
- intuitively sensible
REGRESSION - GAM – GLM
- In GAM, simplest smoothers to be used
- In GLM, model simplification to find MAM (e.g. AIC)
CALIBRATION - minimum number of components for lowest RMSEP in PLS or WA-PLS
ORDINATION - retain smallest number of statistically significant axes (broken stick test)
- retain ‘signal’ at expense of noise
RELEVANCE OF PRINCIPLE OF PARSIMONY TO DATA ANALYSIS
PARTIAL ORDINATION
remove effects of ‘nuisance variables’ (covariables or concomitant variables) by partialling out their effects
ordination of residuals
retain smallest number of statistically significant axes (broken stick test)
‘signal’ at expense of ‘noise’ and ‘nuisance variables’
CONSTRAINED ORDINATION
most powerful if the number of predictor variables is small compared to number of samples. Constraints are strong, arch effects avoided, no need for detrending, outlier effects minimised
minimal adequate model (forward selection, VIF, variable selection, AIC)
only retain statistically significant axes
PARTIAL CONSTRAINED ORDINATION
as above + partial ordination
STRATIGRAPHICAL DATA ANALYSIS
only retain statistically significant zones
simplify data to major axes or gradients of variation
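The broken-stick criterion mentioned above for deciding how many ordination axes to retain can be sketched as below. It assumes the usual broken-stick expectation for axis k of p, b_k = (1/p) Σ_{i=k..p} 1/i, and retains leading axes only while their share of the total variance exceeds that expectation. The function names and eigenvalues are hypothetical.

```python
import numpy as np

def broken_stick(p):
    """Expected proportions of variance for p axes under the broken-stick model."""
    return np.array([np.sum(1.0 / np.arange(k, p + 1)) / p for k in range(1, p + 1)])

def n_axes_to_retain(eigenvalues):
    """Number of leading axes whose variance share exceeds the broken-stick expectation."""
    ev = np.asarray(eigenvalues, dtype=float)
    observed = ev / ev.sum()
    expected = broken_stick(len(ev))
    n = 0
    for obs, exp in zip(observed, expected):
        if obs <= exp:
            break
        n += 1
    return n

# Hypothetical eigenvalues from a five-axis ordination.
eigenvalues = [5.0, 2.0, 1.0, 0.6, 0.4]
n_keep = n_axes_to_retain(eigenvalues)
```

With these made-up eigenvalues only the first axis exceeds its broken-stick expectation, so a parsimonious summary would retain one axis.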
CHOICE BETWEEN INDIRECT & DIRECT GRADIENT ANALYSIS
Indirect gradient analysis – two steps
Direct gradient analysis – one combined step
If relevant environmental data are to hand, direct approach is likely to be more effective and simpler than indirect approach. Generally achieve a simpler model from direct gradient analysis.
CHOICE BETWEEN REGRESSION & CONSTRAINED ORDINATION
Both regression procedures! One Y or many Y.
Depends on purpose – is it an advantage to analyse all species simultaneously or individually?
Community assemblage or individual taxa?
CONSTRAINED ORDINATION | REGRESSION
HOLISTIC | INDIVIDUALISTIC
COMMON GRADIENTS | SEPARATE GRADIENTS
QUICK, SIMPLE | SLOW, COMPLEX, DEMANDING
LITTLE THEORY | MUCH THEORY (GLM)
EXPLORATORY | MORE CONFIRMATORY, IN DEPTH
LIMITING FACTORS
Research questions
Hypotheses to be tested and evaluated
Data quality
TYPES OF GRADIENT ANALYSIS METHODS BASED ON WEIGHTED
AVERAGING
Community data - incidences (1/0) or abundances (≥ 0) of species at sites.
Environmental data - quantitative and/or qualitative (1/0) variables at same sites.
Use weighted averages of species scores (appropriate for unimodal biological data) and linear combinations (weighted sums) of environmental variables (appropriate for linear environmental data)
Method | Abbreviation | Response variables (y) | Predictors (x) | Lecture
Correspondence analysis | CA (also DCA) | Community data | - | 6
Canonical correspondence analysis | CCA (also DCCA) | Community data | Environmental variables | 7
CCA partial least squares | CCA-PLS | Community data | Many environmental variables | 11
Weighted averaging calibration | WA | Environmental variable | Community data | 8
WA partial least squares | WA-PLS | Environmental variable(s) | Community data | 8
Co-correspondence analysis | CO-CA | Community data | Community data | 11
Also partial CA, partial DCA, partial CCA, partial DCCA.
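The weighted-averaging idea shared by the methods listed above can be sketched with a toy example: species optima are abundance-weighted averages of an environmental variable across sites, and site values are then inferred as abundance-weighted averages of those optima. The data, the variable names and the choice of pH are hypothetical, and the deshrinking step used in WA calibration is deliberately left out.

```python
import numpy as np

def wa_optima(Y, x):
    """Species optima: abundance-weighted averages of the environmental variable."""
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    return (Y * x[:, None]).sum(axis=0) / Y.sum(axis=0)

def wa_infer(Y, optima):
    """Inferred site values: abundance-weighted averages of the species optima."""
    Y = np.asarray(Y, dtype=float)
    return (Y * optima[None, :]).sum(axis=1) / Y.sum(axis=1)

# Hypothetical toy data: 4 sites (rows) x 3 species (columns), one pH value per site.
Y = np.array([[10, 0, 0],
              [5, 5, 0],
              [0, 5, 5],
              [0, 0, 10]])
ph = np.array([4.5, 5.0, 6.0, 7.0])

optima = wa_optima(Y, ph)        # weighted averaging regression
inferred = wa_infer(Y, optima)   # weighted averaging calibration (no deshrinking)
```

The inferred values span a narrower range than the observed pH values, which is the shrinkage that a deshrinking regression is meant to correct.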
Lecture | Topic | Possible developments
2 | Exploratory data analysis | Model specific ‘outlier’ detection; interactive graphics
3 | Clustering | COINSPAN; better randomisation tests; CART; latent class analysis
4, 5 | Regression analysis | GLM and GAM framework; evaluation by cross-validation. Give up SS, deviance, t, etc!
6 | Indirect gradient analysis | ? quest for the ‘ideal’ ordination method; 2-matrix CA and PCA
7 | Direct gradient analysis | 3-matrix CCA and RDA (biology, environment, species attributes); multi-component variance partitioning; vector-based reduced rank models with GAMs
8 | Calibration and reconstruction | WAPLS; non-linear deshrinking; ? ML; mixed response models; chemometrics; Bayesian framework; more consideration of spatial autocorrelation
9 | Classification | ? give up classical methods; use permutation tests; classification and regression trees and random forests
10 | Stratigraphical and spatial data | ? more consideration of temporal and spatial autocorrelation
11 | Hypothesis testing | More realistic permutation tests (restrictions); better p estimation
POSSIBLE FUTURE DEVELOPMENTS - CONVENTIONAL
Back propagation neural network – layers containing neurons
input vector
input layer
hidden layer
output layer
output vector
Clearly can have different types of input and output vectors, e.g.
INPUT VECTORS | OUTPUT VECTORS | Type of analysis
> 1 Predictor | 1 or more Responses | Regression
> 1 ‘Responses’ | 1 or more ‘Predictors’ | Inverse regression or calibration
> 1 Variables | 2 or more Classes | Discriminant analysis
NEURAL NETWORKS – THE LESS CONVENTIONAL DATA ANALYSIS
APPROACH IN THE FUTURE?
Malmgren & Nordlund (1997) Palaeo-3 136, 359–373
Planktonic foraminifera 54 core-top samples
Summer water and winter water temperatures
Core E48–22: extends to oxygen isotope stage 9 (320,000 years)
Compared neural network as a calibration tool with:
- Imbrie & Kipp principal component regression
- 2-block PLS (SIMCA)
- Modern analog technique (MAT)
- WA-PLS
CRITERION FOR NETWORK SUCCESS
Cross-validation: leave-one-out
Estimate:
- RMSE (average error rate in training set)
- RMSEP (predictions based on leave-one-out cross-validation)
Best configuration: 3 neurons, 600–700 cycles
Method | RMSEP Summer (°C) | RMSEP Winter (°C) | rs | rw
Neural N | 0.71 | 0.76 | 0.99 | 0.98
PLS | 1.01 | 1.05 | 0.98 | 0.97
MAT | 1.26 | 1.14 | 0.97 | 0.96
Imbrie & Kipp | 1.22 | 1.05 | 0.97 | 0.96
WA-PLS | 1.04 | 0.86 | 0.97 | 0.96
CALIBRATION (INVERSE REGRESSION) AND
ENVIRONMENTAL RECONSTRUCTIONS
Changes in root-mean-square errors (RMSE) for S in relation to number of training epochs for 3-layer BP neural networks with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. The networks were trained over 50 intervals of 100 epochs each (in total of 5,000 epochs). As expected, the RMSEs decrease as training proceeds. The minimum RMSE, 0.3539, was obtained after training a network with 10 neurons in the hidden layer over 5,000 epochs. Similar results were obtained also for W (not shown in diagram).
Changes in root-mean-square errors of prediction (RMSEP) for S with increasing number of training epochs in a 3-layer back propagation neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer. These error rates were determined using the Leave-One-Out technique, implying training of the networks over 54 sets consisting of 53 observations each, with one observation left out for later testing. The lowest RMSEPs for both S and W, 0.7176 and 0.7636, respectively, were obtained for a configuration with 3 neurons (only the results for S are shown in the diagram). Note that set-ups with 1, 2, and 3 neurons gave lower RMSEPs than for 4, 5, and 10 neurons.
Relationships between observed and predicted S and W using a 3-layer BP neural network with 3 neurons in the hidden layer. Lines are linear regression lines. The product-moment correlation coefficients (r) are shown in the lower right hand corners. [Figure panels: Summer, Winter]
Prediction errors for different network configurations: root-mean-square errors for the differences between observed and predicted S and W using a 3-layer BP neural network with 1, 2, 3, 4, 5, and 10 neurons in the hidden layer.
Root-mean-square errors of prediction (RMSEP) are based on the Leave-One-Out technique in which each of the 54 observations in the data set is left out one at a time and the network is trained on the remaining observations. The trained network is then used to predict the excluded observation. The network was run over 50 intervals of 100 epochs each, and the error rates were recorded after each interval.
No. neurons | S: RMSEP | S: No. epochs | W: RMSEP | W: No. epochs
1 | 0.8779 | 500 | 0.8796 | 300
2 | 0.7850 | 1800 | 0.9013 | 700
3 | 0.7176 | 600 | 0.7636 | 700
4 | 1.0621 | 700 | 0.8776 | 700
5 | 1.0032 | 2200 | 0.9206 | 3600
10 | 1.2108 | 500 | 0.9332 | 3000
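The Leave-One-Out procedure behind the RMSEP values above can be sketched generically: each observation is left out in turn, the model is fitted to the rest, and the left-out observation is predicted. To keep the sketch short and fast, an ordinary least-squares model stands in for the neural network; that substitution, the function names and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    """Predictions from a coefficient vector returned by fit_ols."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def apparent_rmse(X, y, fit, predict):
    """RMSE when the model is fitted and evaluated on the same observations."""
    return np.sqrt(np.mean((predict(fit(X, y), X) - y) ** 2))

def loo_rmsep(X, y, fit, predict):
    """RMSEP from leave-one-out cross-validation."""
    errors = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = fit(X[keep], y[keep])
        errors[i] = predict(model, X[i:i + 1])[0] - y[i]
    return np.sqrt(np.mean(errors ** 2))

# Hypothetical data: 54 samples, 3 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(54, 3))
y = 10.0 + X @ np.array([1.5, -0.8, 0.3]) + rng.normal(scale=0.7, size=54)

rmse = apparent_rmse(X, y, fit_ols, predict_ols)
rmsep = loo_rmsep(X, y, fit_ols, predict_ols)
```

Any model with a fit and a predict step can be passed in the same way; the gap between `rmse` and `rmsep` is the quantity to watch when comparing configurations.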
Prediction error for different methods: Root-mean-square errors of prediction (RMSEP) for S and W obtained from a 3-layer BP network, Imbrie-Kipp Transfer Functions (IKTF), the Modern Analog Technique (MAT), and Soft Independent Modelling of Class Analogy (SIMCA). Predictions were made using the Leave-One-Out technique.
Method | S | W | Also listed as
BP network | 0.7176 | 0.7636 | Neural Network
IKTF | 1.2224 | 1.0550 |
MAT | 1.2610 | 1.1346 |
SIMCA | 1.0058 | 1.0501 | PLS
WA-PLS | 1.0419 | 0.8560 | WA-PLS
Predictions of S and W in core E48-22 from southern Indian Ocean based on a BP network, compared to the oxygen isotope (δ18O of Globorotalia truncatulinoides) curve presented by Williams (1976) for the uppermost 440 cm of the core. The cross-correlation coefficients for the relationships between δ18O and the predicted S and W are –0.68 and –0.71, respectively, for zero lags (p<0.001). Interglacial isotope stages 1, 5, 7, and 9, as interpreted here, are indicated in the diagram.
Problems with ANN implementation and cross-validation
Easy to over-fit the model.
Leave-one-out cross-validation is not a stringent test as ANN will continue to train and optimise its network to the one sample left out. Need a training set (ca. 80%) and an optimisation (or selection set) (ca. 10%) to select the ANN model with the lowest prediction error AND an independent test set (ca. 10%) whose prediction error is calculated using the model selected by the optimisation set.
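The three-way partition described above can be sketched as a random split of sample indices. The proportions follow the approximate 80/10/10 figures quoted; the function name, the seed and the rounding rule are assumptions made for the example.

```python
import numpy as np

def three_way_split(n, frac_train=0.8, frac_opt=0.1, seed=0):
    """Randomly split sample indices into training, optimisation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(frac_train * n))
    n_opt = int(round(frac_opt * n))
    train = order[:n_train]
    optimisation = order[n_train:n_train + n_opt]
    held_out = order[n_train + n_opt:]  # independent test set, never used for model selection
    return train, optimisation, held_out

# Example with 947 samples; the exact set sizes here come from rounding.
train, optimisation, held_out = three_way_split(947)
```

The model is fitted on `train`, the stopping point or network configuration is chosen on `optimisation`, and the prediction error is reported only on `held_out`. Repeating the split many times gives a distribution of test-set errors.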
Telford et al. (2004) Palaeoceanography 19
Atlantic foraminifera data: 947 samples.
Split randomly 100 times into training set (747 samples), optimisation set (100 samples), and test set (100 samples).
Median RMSEP (ºC)
Data set | ANN | MAT
Training set | 0.72 | 0.94
Optimisation set | 0.94 | 0.94
Test set | 1.11 | 1.02
No advantage in the hours of ANN computing when cross-validated rigorously. ANN appears to be a very complicated (and slow) way of doing a MAT!
May not be so good after all!
Descriptive statistics for the SWAP diatom-pH data set
DIATOMS AND NEURAL NETWORKS
Statistic | Min. | Median | Mean | Max. | S.D. | Range
N2 for samples | 5.13 | 28.58 | 29.22 | 57.18 | |
N2 for taxa | 1 | 14.99 | 23.76 | 120.86 | |
pH | 4.33 | 5.27 | 5.56 | 7.25 | 0.77 | 2.92
No. of samples: 167
No. of taxa: 267
% no. of +ve values in data: 18.47
Total inertia: 3.39
[Figure: Artificial Neural Network convergence, SWAP data-set (167 lakes). Yves Prairie & Julien Racca (2002)]
[Figure: Jack-knife predicted pH against observed pH, SWAP data-set (167 lakes). Yves Prairie & Julien Racca (2002)]
[Figure: pH reconstruction by ANN and WA-PLS (RLGH core). Yves Prairie & Julien Racca (2002)]
SKELETONISATION ALGORITHM
Pruning algorithm comparable to BACKWARD ELIMINATION in regression models
1. Measure relevance Pi for each taxon i:
Pi = E(without i) – E(with i), where E = RMSE
2. Train network with all taxa using back-propagation
3. Compute relevance Pi based on error propagation and weights
4. Delete the taxon with the smallest estimated relevance Pi [Did this in 5% classes of importance]
5. Re-train the network to a minimum again [After deleting a taxon, the values of the remaining taxa are not re-calculated, so the input data are always the same original relative abundance values]
Racca et al. (2003)
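The pruning loop can be sketched as a backward elimination in which relevance is the increase in RMSE when a predictor is removed. This is only an analogy to the algorithm above: an ordinary least-squares model is swapped in for the neural network, relevance is obtained by refitting rather than from error propagation and weights, and predictors are removed one at a time rather than in 5% classes. All names and data are hypothetical.

```python
import numpy as np

def rmse_of_fit(X, y):
    """Apparent RMSE of a least-squares fit (with intercept) of y on the columns of X."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.sqrt(np.mean((A @ coef - y) ** 2))

def prune(X, y, n_keep):
    """Repeatedly delete the predictor with the smallest relevance P_i = E(without i) - E(with i)."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        e_with = rmse_of_fit(X[:, kept], y)
        relevance = []
        for i in kept:
            others = [j for j in kept if j != i]
            relevance.append(rmse_of_fit(X[:, others], y) - e_with)
        kept.pop(int(np.argmin(relevance)))  # drop the least relevant predictor, then refit
    return kept

# Hypothetical data: only the first two of five predictors carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)

kept = prune(X, y, n_keep=2)
```

As in the original procedure, the input columns themselves are never recalculated after a deletion; only the set of columns passed to the model changes.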
[Figure: ANN functionality of taxa plotted against N2]
[Figure: Leave-one-out predicted pH (ANN)]
[Figure: ROUND LOCH OF GLENHEAD pH reconstructions: 0% pruned ANN, 30% pruned ANN, 60% pruned ANN, 85% pruned ANN, all taxa WA, all taxa ML]
General characteristics of the 37 most functional taxa for calibration based on ANN modelling approach.
Summary statistics of the SWAP diatom pH inference models according to the classes of taxa included based on the Skeletonisation procedure (apparent and cross-validation).
Ideally apparent RMSE should be a reliable measure of the actual predictive power of a model, and the difference between apparent and cross-validated RMSE indicates the extent to which the model has overfitted the data.
Examples of the recently published diatom-based inference models in palaeolimnology used.
MAXIMUM ROBUSTNESS – ratio of taxa : lakes as small as possible
(1) increase the number of lakes
(2) decrease the number of taxa
CURSE OF DIMENSIONALITY related to ratio of number of taxa to number of lakes, as this ratio determines the ratio of the dimensional space in which the function is determined to the number of observations for which the function is determined.
“Neural networks have the potential for data analysis and represent a viable alternative to more conventional data-analytical methods”. Malmgren & Nordlund (1997)
Advantages:
1) Mixed linear and non-linear responses.
2) Good empirical performance.
3) Wide applicability.
4) Many predictors and many ‘responses’.
Disadvantages:
1) Very much a black box.
2) Conceptually complex.
3) Little underlying theory.
4) Easy to misuse and report erroneous model performance statistics.
PATTERN RECOGNITION
Unsupervised (cluster analysis, indirect gradient analysis) or supervised (discriminant analysis, direct gradient analysis)
[Diagram of approaches: Neural network; BELIEF NETWORKS; Statistical theory (Discriminants & Decision Theory); Linear methods (LDA); Non-parametric methods (CART trees); Nearest-neighbour (K-NN)]
Tephra | Age | Sites
Vedde Ash (Rhyolitic type) | mid Younger Dryas, ca 10600 14C yrs BP | Kråkenes, Norway; several other sites in W Norway; Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland
Vedde (Basaltic type) | | Kråkenes, W Norway
Borrobol | Lower LG Interstadial, ca 12500 14C yrs BP | Borrobol, Scotland; Tynaspirit, Scotland; Whitrig, Scotland
Saksunarvatn | early Holocene, ca 9000 14C yrs BP = 9930 – 10010 cal yr | Faeroes; Kråkenes, W Norway; Dallican Water, Shetland
Variables measured:
SiO2 TiO2 Al2O3 FeO MnO MgO CaO Na2O K2O
“The way in which correlation by tephrochronology may revolutionise approaches to reconstructing the sequence of events in the N.E.Atlantic region...” Lowe & Turney (1997)
VOLCANIC TEPHRAS IN N.W.EUROPE OF LATE-GLACIAL AND EARLY HOLOCENE AGE
[Figure: SiO2, K2O, CaO, MgO, TiO2, Al2O3, FeO and Na2O plotted by tephra group, one panel per oxide; panel axis labels: SBVBV]
CANONICAL VARIATES ANALYSIS
(= multiple discriminant analysis)
[Figure: CVA of the tephra data showing group means, individual samples (Vedde, Vedde B., Saksunarvatn, Borrobol) and a biplot of variables. Axis 1: λ1 = 0.988 (32.9%); axis 2: λ2 = 0.841 (28%)]
[Figure: Minimum-variance cluster analysis of √% data (= chord distance); cophenetic correlation 0.955. Groups: Borrobol; Saksunarvatn; Vedde Basaltic; Vedde Norway; Vedde Scotland (+ a few Vedde Norway)]
[Figure: PCA of √% data, all samples. Axis 1: λ1 = 0.96 (95.9%); axis 2: λ2 = 0.016 (1.6%); total 97.4%. Groups: Vedde Norway, Vedde Scotland, Borrobol, Saksunarvatn, Vedde Basaltic]
“Tephrochronology offers the potential of overcoming problems of correlation because ash layers provide time-parallel markers and therefore precise comparisons between sequences”
“The geochemical signature of each ash is unmistakable”
Lowe & Turney (1997)
Turney et al. (1997)
Lapointe & Legendre (1994)
Applied Statistics 43, 237-257
SCOTLAND'S MOST FAMOUS PRODUCT
Dendrogram representing the minimum variance hierarchical classification of single-malt Scotch whiskies: two scales are provided at the top of the graph - the number of groups formed by cutting the dendrogram vertically at the given points and the fusion distances of the hierarchical classification (represented by vertical segments in the dendrogram); the vertical order of the whiskies is partly arbitrary - swapping the branches of a dendrogram does not change the corresponding cophenetic matrix (the 12 groups detailed in Appendix A are labelled A-L here)
Map of Scotland showing the positions of the Scottish distilleries, divided into 11 groups (symbols) in the regional classification of single-malt whiskies (Appendix B) (the six Speyside groups are deferred to Fig. 3): distillery names are represented by four-letter abbreviations (see Fig. 3); the names of regions and of some major cities are also indicated - notice that two Scotches in the present study come from the Springbank distillery; Springbank pertains to the western group whereas Longrow is a member of the Islay group.
Map of the Speyside region showing six of the 11 groups (symbols) of Scotch distilleries of the regional classification of single-malt whiskies (Appendix B) (the names of regions and of some major cities are also indicated) and abbreviations and full names of the distilleries.
Looked at spatially constrained classification and constrained ordination (RDA)
Looked at similarities between results based on:
- Colour
- Nose
- Body
- Palate
- Finish
All give consistent results. Can use one to predict the other, except for finish.
TEST OF CONGRUENCE AMONG DISTANCE MATRICES (CADM)
Legendre & Lapointe (2005)
5 data sets:
1. colour (14 variables +/-)
2. nose (12 variables +/-)
3. body (8 variables +/-)
4. palate (15 variables +/-)
5. finish (19 variables +/-)
(1 - Jaccard coefficient)½ to give 5 distance matrices
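The distance used here, the square root of one minus the Jaccard coefficient, can be sketched for presence/absence (+/-) data as follows. The function name and the toy matrix are hypothetical, and treating two all-absent rows as identical is an assumption of this sketch.

```python
import numpy as np

def jaccard_distance_matrix(B):
    """Pairwise distances sqrt(1 - Jaccard similarity) between rows of a 0/1 matrix."""
    B = np.asarray(B, dtype=bool)
    n = B.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = np.sum(B[i] & B[j])   # attributes present in both
            either = np.sum(B[i] | B[j])   # attributes present in at least one
            # Assumption: two rows with no attributes at all are treated as identical.
            similarity = shared / either if either > 0 else 1.0
            D[i, j] = D[j, i] = np.sqrt(1.0 - similarity)
    return D

# Hypothetical presence/absence data: 3 objects x 4 descriptors.
B = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]])
D = jaccard_distance_matrix(B)
```

Joint absences do not enter the Jaccard coefficient, so two whiskies are not made more similar by descriptors that neither of them has.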
Overall CADM test - null hypothesis of incongruence (H0) rejected (p = 0.0001)
Compare 1 with 2-5 - H0 rejected
Compare 2 with 1, 3-5 - H0 rejected
Compare 3 with 1, 2, 4, 5 - H0 rejected
Compare 4 with 1-3, 5 - H0 rejected
Compare 5 with 1-4 - H0 not rejected
Mantel test (2 matrices)
Finish not related to Colour, Nose, Body or Palate.
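A Mantel test between two distance matrices can be sketched as a permutation test on the correlation between their off-diagonal elements, permuting rows and columns of one matrix together. This is a minimal one-sided version with a Pearson correlation; the function names and the simulated matrices are assumptions, and it is not the CADM procedure itself.

```python
import numpy as np

def euclidean_distances(P):
    """Pairwise Euclidean distances between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def mantel(D1, D2, n_perm=999, seed=0):
    """One-sided Mantel test: correlation of upper-triangle distances, with
    objects (rows and columns together) of D2 permuted to build the null distribution."""
    rng = np.random.default_rng(seed)
    D1 = np.asarray(D1, dtype=float)
    D2 = np.asarray(D2, dtype=float)
    n = D1.shape[0]
    iu = np.triu_indices(n, k=1)
    r_obs = np.corrcoef(D1[iu], D2[iu])[0, 1]
    count = 1  # the observed arrangement counts as one permutation
    for _ in range(n_perm):
        p = rng.permutation(n)
        r = np.corrcoef(D1[iu], D2[np.ix_(p, p)][iu])[0, 1]
        if r >= r_obs:
            count += 1
    return r_obs, count / (n_perm + 1)

# Hypothetical data: two closely related configurations of 12 objects.
rng = np.random.default_rng(0)
P = rng.normal(size=(12, 2))
Q = P + rng.normal(scale=0.1, size=(12, 2))
r_obs, p_value = mantel(euclidean_distances(P), euclidean_distances(Q))
```

Permuting whole objects, rather than individual distances, respects the dependence among the distances that involve the same object.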
Principal co-ordinates analysis of Mantel-test statistics. Axis 1 = 28.7%, axis 2 = 26.3%.
Why is FINISH so different?
It is important!
How were the whiskies tested by the tasters?
Did they swallow or spit?
If the latter, the finish variables may not be fully detected.
ONLY WHEN SWALLOWING CAN ONE TOTALLY CAPTURE THE AFTERTASTE.
But,
“some professional blenders work only with their nose, not finding it necessary to let the whisky pass their lips”.
SINGLE MALTS MUST BE SWALLOWED!
INTEGRATED ANALYSES OF BIOLOGICAL AND
ENVIRONMENTAL DATA
For nature conservation and management purposes, useful to have an overview of the natural zonation of the area as a whole. Such zonation should:
1. Have characteristic or indicator species or life-forms
2. Correspond to a circumscribed range of environments
3. Have some geographical coherence
Requires integrated analysis of biological and environmental data.
INDIRECT CLUSTERING APPROACH
Biological data -> clusters (e.g. TWINSPAN) -> biological clusters
Biological clusters + environmental data -> e.g. DISCRIM, canonical variates analysis, RIVPACS
cf. indirect gradient analysis: biological data -> PCA or CA -> regression with environmental data
DIRECT CLUSTERING APPROACH
Biological data + Environmental data
Clusters or Zones
1. Latent class analysis with biological data as +/- or counts following binomial or Poisson distribution and environmental data following, after log transformation, normal distribution.
ter Braak et al. (2003) Ecological Modelling 160: 235-248
2. CCA, RDA, or DCCA of biological and environmental data combined in multivariate direct gradient analysis, followed by minimum-variance cluster analysis (Ward's method) or k-means minimum-variance cluster analysis.
Estimate characteristic species for each cluster.
Carey et al. (1995) J. Ecology 83: 833-845. Biogeographical zonation of Scotland.
Characteristic species of biogeographical zones
3. Principal co-ordinates analysis of mixed (biological and environmental) data using Gower's (1971) coefficient:

$$s_{ij} = \frac{\sum_{k=1}^{m} w_{ijk}\, s_{ijk}}{\sum_{k=1}^{m} w_{ijk}}$$

where s_ijk is the similarity between sites i and j as measured by the variable k, and w_ijk is typically 1 or 0 depending on whether or not the comparison is valid for variable k. Weights of zero are assigned when variable k is unknown for one or both sites, or to binary variables to exclude negative matches. For binary variables s_ij is then the Jaccard coefficient. For categorical data the component similarity s_ijk is one when the two sites have the same value and zero otherwise. For quantitative data

$$s_{ijk} = 1 - \frac{|x_{ik} - x_{jk}|}{R_k}$$

where R_k is the range of variable k.
AN EXAMPLE
Site | Altitude | Moisture | Limestone | Sheep | Age
Site 1 | 120 | 1 | - | - | 1
Site 2 | 150 | 2 | + | - | 2
Site 3 | 110 | 3 | + | + | 3

$$s_{12} = \frac{1\left(1 - \frac{30}{40}\right) + 1 \times 0 + 1 \times 0 + 0 \times 1 + 1 \times 0}{1 + 1 + 1 + 0 + 1} = 0.0625$$

Clusters can then be defined using the principal co-ordinate axes scores in a minimum-variance cluster analysis or a partitioning of the sites on the basis of the ordination scores.
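The three-site example can be checked with a short sketch of Gower's coefficient. It assumes that altitude is quantitative, that moisture and age are treated as categorical (which is what the worked value of 0.0625 for sites 1 and 2 implies), and that limestone and sheep are binary with negative matches given zero weight. Coding '-' as 0 and '+' as 1 is a convention chosen here.

```python
import numpy as np

# Example data: one entry per site; '-' coded as 0 and '+' as 1.
altitude = np.array([120.0, 150.0, 110.0])  # quantitative
moisture = np.array([1, 2, 3])              # treated as categorical (assumption)
limestone = np.array([0, 1, 1])             # binary
sheep = np.array([0, 0, 1])                 # binary
age = np.array([1, 2, 3])                   # treated as categorical (assumption)

def gower(i, j):
    """Gower similarity between sites i and j (0-based indices)."""
    s, w = [], []
    # Quantitative variable: 1 - |difference| / range.
    value_range = altitude.max() - altitude.min()
    s.append(1.0 - abs(altitude[i] - altitude[j]) / value_range)
    w.append(1.0)
    # Categorical variables: 1 if equal, else 0.
    for v in (moisture, age):
        s.append(1.0 if v[i] == v[j] else 0.0)
        w.append(1.0)
    # Binary variables: joint absences (negative matches) get weight zero.
    for v in (limestone, sheep):
        if v[i] == 0 and v[j] == 0:
            s.append(0.0)
            w.append(0.0)
        else:
            s.append(1.0 if v[i] == v[j] else 0.0)
            w.append(1.0)
    s, w = np.array(s), np.array(w)
    return float((w * s).sum() / w.sum())

s12 = gower(0, 1)
s13 = gower(0, 2)
s23 = gower(1, 2)
```

The resulting similarity matrix (or the corresponding dissimilarities) is what would be passed to the principal co-ordinates analysis.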
4. Constrained indicator species analysis (COINSPAN)
Carleton, T.J. et al. (1996) J. Vegetation Science 7: 125-130
Like TWINSPAN (biological data only) but uses CCA first axis instead of CA first axis (as in TWINSPAN) as the basis for ordering samples prior to creating dichotomies.
The resulting clustering is based on CCA axis 1, a linear combination of environmental variables that maximises the dispersion of species scores.
COINSPAN clustering thus integrates biology and environment together. Surprisingly little used - has considerable potential.
Jackson D.A. (1997) Ecology 78, 929–940
Simulated data SIM 200 observations x 5 variables
Different means and variances
x1 x2 x3 x4 x5
Mean 30 60 60 120 120
Variance 16 16 64 64 4096
Correlations between all variables = 0
Transformed into percentages
Raw data – BASIS
Transformed data – PERCENTAGE or PROPORTIONS
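The closure effect in this kind of simulation can be reproduced in a few lines. The means and variances are those listed above; the use of independent gamma-distributed variables (so that all values are positive), the seed and the variable names are assumptions, since the distribution used for SIM is not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([30.0, 60.0, 60.0, 120.0, 120.0])
variances = np.array([16.0, 16.0, 64.0, 64.0, 4096.0])

# Assumption: independent gamma variables matched to the stated means and variances.
shape = means ** 2 / variances
scale = variances / means
basis = rng.gamma(shape, scale, size=(200, 5))

# Closure: convert each observation (row) to percentages of its row total.
composition = 100.0 * basis / basis.sum(axis=1, keepdims=True)

r_basis = np.corrcoef(basis, rowvar=False)
r_composition = np.corrcoef(composition, rowvar=False)
```

Comparing `r_basis` with `r_composition` shows correlations near zero in the basis and strong correlations in the composition, induced purely by the constant-sum constraint; the high-variance variable x5 drives most of them.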
PROBLEMS OF PERCENTAGE (COMPOSITIONAL) DATA
Bivariate casement plots of the basis (lower triangular matrix) and composition (upper triangular matrix) for the simulated data SIM. The basis relationships are independently generated, and correlations approximate zero. Note the strong linear relationships in the composition arising due to the constant-sum constraint, i.e. matrix closure. S1-S5 represent variables.
Frequency distributions of the bivariate correlations for SIM obtained under randomization. Each plot corresponds to the correlation between two variables from the basis (lower triangular matrix) or the composition (upper triangular matrix) used in the previous figure. The basis matrix was randomized within each column, the composition recalculated, and the correlation recalculated. Each plot is a frequency distribution of the correlations obtained from 10 000 randomized matrices.
Eigenvector coefficients from a principal component analysis of the correlation matrix of SIM. Results from a PCA of the basis and the composition are presented.
Scree plots of the eigenvalues for each component from the (a) simulated data (SIM) and (b) herbivorous zooplankton data (ZOO). The solid line represents the eigenvalues from the basis data (i.e. non-standardised), and the dashed line represents the eigenvalues from the compositional data (i.e. proportions).
Scatterplots of the first two components from a principal component analysis of SIM using the (a) basis and (b) composition in calculating the correlation matrix. Letters refer to the points positioned at the ends of axes 1 and 2.
UPGMA cluster analysis based on a correlation matrix of the variables (S1-S5 and H1-H5) from: (a) the basis data of the simulation data (SIM); (b) the compositional data of SIM; (c) the basis data of the zooplankton data (ZOO); and (d) the compositional data of ZOO.
1. CENTRED LOG RATIO Aitchison (1986)
All variables are retained in analysis but are standardised by dividing each variable by a denominator based on a geometric composite of all variables.
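PCA of covariance matrix

$$Y_{ij} = \operatorname{cov}\left[\log\frac{x_i}{g(x)},\ \log\frac{x_j}{g(x)}\right]$$

where i, j = 1, ..., m and g(x) is the geometric mean of the variables, i.e.

$$g(x) = \left(\prod_{i=1}^{m} x_i\right)^{1/m}$$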
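The centred log-ratio transformation can be sketched directly from the definition: each value is divided by the geometric mean of its row before taking logarithms, which is the same as subtracting the row mean of the logs. The function name and the toy compositions are hypothetical, and all values must be positive.

```python
import numpy as np

def clr(X):
    """Centred log-ratio: log(x_i / g(x)), with g(x) the geometric mean of each row."""
    logX = np.log(np.asarray(X, dtype=float))
    return logX - logX.mean(axis=1, keepdims=True)

# Hypothetical compositions: 3 samples x 3 variables (percentages).
X = np.array([[10.0, 30.0, 60.0],
              [20.0, 20.0, 60.0],
              [5.0, 45.0, 50.0]])
Z = clr(X)
S = np.cov(Z, rowvar=False)  # covariance matrix that would be passed to PCA
```

Each row of `Z` sums to zero, so the rows and columns of `S` sum to zero as well and the matrix is singular, leaving at most m - 1 components. The result is also unchanged if the rows of `X` are rescaled, which is why basis and composition give the same answer.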
POSSIBLE SOLUTIONS
Advantages:
1. All variables are retained.
2. Pairwise relationships are the same regardless of using basis or compositional data.
Problems:
1. With SIM, correlations still very strong! (0.412 to 0.843 and -0.799 to -0.906)
2. Zero values have an undefined log-ratio value. Replace zero values by a small value.
3. Matrix is singular, so only m-1 components.
2. CORRESPONDENCE ANALYSIS
Only considers proportional relationships between variables; unaffected by using basis or compositional data.
CA/DCA/CCA – focuses on relative abundances
PCA/RDA – focuses on absolute abundance
If an environmental variable influences total biomass, but leaves the species composition unchanged, the variable will be important in PCA/RDA but not at all important in CA/DCA/CCA.
One approach:
- analyse total biomass separately by regression
- analyse species composition by CCA
Analyses are fully complementary.
(PCA/RDA would probably give results close to the regression analysis).
UNRESOLVED QUESTION SINCE 1986 IN CA/CCA
How can CA and CCA
1. Model unimodal function (c.f. WA as approximate Gaussian ML regression)
and
2. Be linear with fit
Partial answer
CA and CCA model compositional data (proportions)
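This compares with Aitchison's log-ratio model and the polytomous GLM, which are linear in centred logs but unimodal in the original data.

$$\frac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}} = 1 + b_{k1} x_{i1} + \dots$$

where y_i+, y_+k and y_++ are the site, species and overall abundance totals.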
CA and CCA are methods for analysing unimodal data.
CA and CCA are CHAMELEONS
1) Unimodal methods
2) Linear methods
CCA can be derived as a weighted form of reduced rank regression = redundancy analysis = principal component analysis with respect to instrumental variables. The key element is that the relative abundance is a linear function of the environmental variables (relative here means relative to sample total and species total).
As unimodality and compositional data often go hand in hand, common element is that CCA models compositional (i.e. relative) abundance data instead of the absolute abundance data.
ECOLOGICAL TERMS
CCA (and CA) model relative abundances and take sample size for granted. Usually the α-diversity of a sample increases with its size. CCA and CA take that aspect of α-diversity for granted and focus, instead, on the β-diversity (dissimilarity between sites). If the trend in α-diversity coincides with β-diversity (e.g. species disappear one by one along a gradient), CA and CCA can extract such trends.
In unimodal context, species scores are weighted averages of sample scores and vice versa. In linear context, species scores are derived from a weighted linear regression of transformed species data on to the sample scores.
THE TWO FACES OF CORRESPONDENCE ANALYSIS AND CANONICAL CORRESPONDENCE ANALYSIS
Linear context most useful when gradient length is < 3SD. Unimodal context most useful when gradient length is > 4SD. For intermediate lengths, either context may be useful.
Can transform unimodal model into linear model by ‘take logarithms and double centre’ (for data with no zeroes).
If data contain zeroes, no explicit linearising data transformation because we cannot take logarithms. In CA and CCA, a transformation is implicit that is close to the exact transformation.
EXACT

$$\log \tilde{y}_{ik} \quad \text{with} \quad \tilde{y}_{ik} = \frac{y_{ik}\, g}{g_i\, g_k}$$

where g_i and g_k are geometric averages of y_ik across rows and columns, respectively, and g is the overall geometric average.
CA/CCA

$$\tilde{y}_{ik} = \frac{y_{ik}\, y_{++}}{y_{i+}\, y_{+k}}$$

where y_i+ and y_+k are the abundance totals across species in site i and across samples for species k, and y_++ is the overall total.
CA AND CCA INHERIT THEIR TWO FACES FROM MODELS OF COMPOSITIONAL DATA.
DATA TYPE AND CHOICE OF ORDINATION METHOD
Besides gradient length (standard deviations), data type is also important in selecting ordination method.
                     Absolute abundance    Relative abundance (compositional differences)
Unconstrained        PCA (linear)          CA, DCA (unimodal)
Constrained          RDA (linear)          CCA, DCCA (unimodal)
Constrained          (PRC) (linear)        -
PCA/RDA are weighted summations; CA/CCA are weighted averages, hence the difference between modelling absolute values (PCA/RDA) or relative values (CA/CCA).
Cannot currently model absolute abundances satisfactorily over long gradients. Need to partition the data into smaller gradients first (e.g. TWINSPAN).
SPECIES ABSENCES IN DATA SETS
Besides removing the absolute abundance effect, CA, DCA, CCA, and partial CCA (and WA and WA-PLS) do not consider species absences or zero values in the biological data.
Zero values:
• Show real absence?
• Reflect incomplete sampling?
• Chance?
Is this an advantage or disadvantage?
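The way weighted averaging disregards absences can be seen in a small sketch. The function name and the numbers are hypothetical, chosen only for illustration: a species' weighted-average optimum is computed from the sites where it occurs, and adding any number of sites where it is absent leaves the estimate unchanged.

```python
def wa_optimum(abundances, env_values):
    """Weighted-average optimum: sum(y_i * x_i) / sum(y_i).

    Sites with zero abundance contribute nothing to either sum.
    """
    total = sum(abundances)
    if total == 0:
        raise ValueError("species is absent from every site")
    return sum(y * x for y, x in zip(abundances, env_values)) / total


# Hypothetical species recorded at three sites along a gradient.
present_only = wa_optimum([2, 6, 2], [4.0, 5.0, 6.0])

# Same species, with extra sites where it was not found.
with_absences = wa_optimum([0, 2, 6, 2, 0, 0],
                           [1.0, 4.0, 5.0, 6.0, 9.0, 20.0])
```

Both calls give the same optimum, so the estimate cannot distinguish a real absence at one end of the gradient from a site that was simply never sampled adequately.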
SOFTWARE AVAILABILITY
CANOCO & CANODRAW
MicroComputer Power, 111 Clover Lane, Ithaca, NY 14850, USA
[email protected]
http://www.microcomputerpower.com
MAT, ZONE, WINTRAN, C2
Steve Juggins, Geography Department, University of Newcastle, Newcastle upon Tyne, NE1 7RH
([email protected])
http://www.campus.ncl.ac.uk/staff/Stephen.Juggins/
HOF
Jari Oksanen, Department of Biology, University of Oulu, Oulu, Finland
([email protected])
http://cc.oulu.fi/~jarioksa/
TWINSPAN (Mark Hill), DISCRIM (Cajo ter Braak), TWINGRP, RATEPOL, SPLIT, etc.
John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007 Bergen, Norway
QUERIES
John Birks, Department of Biology, University of Bergen, Allégaten 41, N-5007 Bergen, Norway
[email protected]
Fax: (+47) 55 58 96 67
Gavin Simpson, Environmental Change Research Centre, University College London, Gower Street, London, WC1E 6BT, UK
[email protected]
http://www.homepages.ucl.ac.uk/~ucfagls/ncourse/
VALUABLE WEB SITES FOR NUMERICAL ECOLOGISTS AND PALAEOECOLOGISTS
www.okstate.edu/artsci/botany/ordinate
Mike Palmer's ordination site with masses of documentation, explanatory notes, links, details of software, etc.
www.canoco.com
Cajo ter Braak's site about CANOCO and with answers to many frequently asked questions (FAQ)
www.microcomputerpower.com
Richard Furnas' site about CANOCO and related software availability and ordering
www.canodraw.com
Petr Šmilauer's site about CANODRAW and CANOCO and related software
http://regent.bf.jcu.cz/maed
Details of Petr Šmilauer and Jan Lepš' course and data on multivariate analysis of ecological data.
WEB SITES continued
http://cc.oulu.fi/~jarioksa/
Jari Oksanen's site with his R vegan package, lecture notes, programs (e.g. HOF), documentation, comments, FAQ, and much more
http://www.bio.umontreal.ca/legendre/indexEnglish.html
Pierre Legendre's site with details of publications, software, activities, etc.
http://www.bio.umontreal.ca/Casgrain/en/labo/index.html
Software from Pierre Legendre's lab
http://labdsv.nr.usu.edu/
Dave Roberts' site about quantitative vegetation ecology with lecture notes, software details, etc.
www.nku.edu/~boycer/fso/
Rick Boyce's site about fuzzy set ordination
http://cran.r-project.org/
R website
WEB SITES continued
www.stat.auckland.ac.nz/~mja/
Marti Anderson's site with new software, details of publications, activities, etc.
www.campus.ncl.ac.uk/staff/Stephen.Juggins
Steve Juggins' site for C2, WinTran, ZONE, etc.
www.chrono.qub.ac.uk/psimpoll/psimpoll.html
Keith Bennett's site for palaeoecological software, notes, etc.
www.chrono.qub.ac.uk/inqua
Keith Bennett's site of INQUA Data Analysis Sub-Commission software, newsletters, etc.
www.env.duke.edu/landscape/classes/env358/env358.html
Dean Urban's site with excellent lecture notes on Multivariate Methods for Environmental Applications
FINAL COMMENTS
[Diagram: Numerical Analysis of Biological Data. Basic building-blocks and concepts and the resulting numerical methods. Labels: niches; 'communities'; continuum concept; weighted averaging; TWINSPAN; CA/DCA; cluster analysis; metric scaling; non-metric scaling; indicator-species analysis (INDVAL).]
[Diagram: Numerical Analysis of Environmental Data. Basic building-blocks and concepts and the resulting numerical methods. Labels: gradients; GLM; correlation & covariance; linear combinations; cross-validation; permutation tests; PLS; cluster analysis; multiple regression; PCA; RDA; linear discriminant analysis; canonical correlation analysis; regression models; Procrustes rotation; co-inertia analysis.]
[Diagram: Numerical Analysis of Biological and Environmental Data. Basic building-blocks and concepts and the resulting numerical methods. Labels: niches & gradients; GLM & GAM; weighted averaging; cross-validation; permutation tests; PLS; cluster analysis; multiple regression; CA/DCA; CCA; TWINSPAN; WA; WA-PLS; CCA-PLS; Co-CA; multiple discriminant analysis; regression models; DISCRIM; COINSPAN; co-inertia analysis; distance-based PCoA; canonical analysis of principal co-ordinates (CAP).]
Andrew Lang 1844-1912. He uses statistics as a drunken man uses lamp-posts – for support rather than illumination.
From MacKay, 1977, and reproduced through the courtesy of the Institute of Physics.
Statistics are for illumination!
Sketches illustrating statistical zap and shotgun
THE PEOPLE WHO HAVE MADE THE STATISTICAL ZAP POSSIBLE
Marti J. Anderson
Richard Telford
Pierre Legendre
Cajo J.F. ter Braak
Steve Juggins
Mark O. Hill
Gavin Simpson