
Computers & Geosciences 32 (2006) 818–833

www.elsevier.com/locate/cageo

An open source implementation of the Modern Analog Technique (MAT) within the R computing environment☆

Michael Sawada*

Laboratory for Applied Geomatics and GIS Science (LAGGISS) and Laboratory for Paleoclimatology and Climatology (LPC), Department of Geography and Ottawa-Carleton Geoscience Center, University of Ottawa, Ottawa, Ont., Canada K1N 6N5

Received 28 January 2005; accepted 24 October 2005

☆ MATTOOLS can be downloaded at http://www.geomatics.uottawa.ca/mattools where news and updates will be posted. The version described in this paper is available at http://www.iamg.org/CGEditor/index.htm

* Tel.: +1 613 562 5800x1040; fax: +1 613 562 5145. E-mail address: [email protected].

0098-3004/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.cageo.2005.10.008

Abstract

The purpose of this paper is to introduce an analytical solution for Quaternary geoscientists applying the modern analog technique (MAT) to fossil biological assemblages. We present a package called MATTOOLS that implements the MAT and offers new calibration techniques related to Monte Carlo simulation and receiver operating characteristic (ROC) curves that are used in assessing the critical thresholds of biological assemblage dissimilarity. The MATTOOLS solution to the MAT has the advantage of operating in the R language environment, a free open-source high-level language with hundreds of functions for statistical analysis and visualization. MATTOOLS therefore offers an easily extensible solution for individual research endeavors. We review current solutions for MAT calculations and provide an example of modern calibration using MATTOOLS.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: Modern analog technique (MAT); Method of modern analogs; Quaternary; ROC; MATTOOLS; R-project.org

1. Introduction

Within the Quaternary geosciences, the modern analog technique (MAT) (Overpeck et al., 1985; Prell, 1985) has become an increasingly popular method for the reconstruction of past environments (Davis, 2000). The MAT has been used to infer past climates and vegetation from fossil pollen (Guiot et al., 1993; Sawada et al., 1999; Gajewski et al., 2000; Williams et al., 2001; Williams, 2003; Sawada et al., 2004) as well as marine/ocean sea surface temperature and salinity at numerous time-scales (e.g., Ikeya and Cronin, 1993; Gonzalez-Donoso et al., 2000; Li et al., 2001; Kandiano and Bauch, 2003; Perez-Folgado et al., 2003; Pflaumann et al., 2003).

The MAT provides an objective way of quantifying past environments, and the application of this method requires a number of decisions. The recent increased interest in the MAT has led to the development of new methods that aid in making such decisions. Here it is implied that making a decision requires a choice between alternatives. Choosing a critical numerical threshold of dissimilarity that indicates whether a given modern biological assemblage is a good analog for a fossil assemblage is the first of these decisions. New methods that aid in this decision include Monte Carlo techniques (Anderson et al., 1989; Bartlein and Whitlock, 1993; Sawada et al., 2004) and/or receiver operating characteristic (ROC) curves (Gavin et al., 2003; Jackson and Williams, 2004; Wahl, 2004). The second decision concerns the number of analogs to retain from a set of modern assemblages (Waelbroeck et al., 1998; Williams, 2003; Sawada et al., 2004). Statistical summaries of this retained set are used to derive estimates of past environmental conditions. Recent developments in ‘jump’ techniques have improved the decision process (Waelbroeck et al., 1998; Williams, 2003; Sawada et al., 2004).

I refer to the above as “critical” choices because they cannot be generalized across different datasets or disciplines and must therefore be determined on an application-by-application basis. Application of the MAT with regard to these critical choices is computationally intensive. No software is available that permits the easy application of the methods along with the flexibility to experiment with the different choices. Thus, there is a strong need for an Open Source solution that will allow for the simple application of the MAT and experimentation with the critical choices within individual applications. This paper presents a solution called MATTOOLS that operates within the R computing environment. The advantage of this approach to development within R is that the environment is complete and extensible, and the approach allows for visualization and exploratory data analysis prior and subsequent to MAT analysis.

2. The MAT

The MAT compares a fossil biological assemblage (F) to all modern assemblages (M) within a given geographic extent. The comparison between assemblages, each a multivariate vector of taxon observations, is done by way of a dissimilarity metric (Fig. 1). The modern assemblage with the smallest dissimilarity is in theory the ‘best’ modern analog. The environment that is associated with the most similar modern assemblage may be assigned to the time and location of the fossil assemblage being tested. In searching for analog assignments, the technique assumes that all possible past environmental conditions producing fossil biological assemblages currently exist within the spatial domain of the modern assemblages and, furthermore, that variations in assemblage composition are determined by the environmental variables of interest; otherwise the application of the technique is problematic (Jackson and Williams, 2004).

Dissimilarity between fossil and modern biological assemblages can be measured using a number of different metrics (Prentice, 1980; Overpeck et al., 1985; Gavin et al., 2003; Jackson and Williams, 2004). A signal-to-noise measure called squared chord distance (SCD) is robust in the context of terrestrial pollen assemblages (Overpeck et al., 1985; Gavin et al., 2003) and has been employed in the majority of terrestrial MAT reconstructions (Jackson and Williams, 2004; Sawada et al., 2004). MATTOOLS utilizes SCD by default. However, other dissimilarity metrics, such as the equal-weight distance metrics illustrated in Fig. 1, can be more accurate in regions like the North American tundra (Oswald et al., 2003). Because MATTOOLS code is open-source, other distance metrics are easily added for different purposes. In any case, SCD ranges from zero for a perfect analog (no dissimilarity) to two at maximum dissimilarity and is given as:

$$\mathrm{SCD} = \sum_{i=1}^{n} \left( Fp_i^{1/2} - Mp_i^{1/2} \right)^2, \qquad (1)$$

where SCD is the dissimilarity coefficient, $Fp_i$ is the proportion of species i within the fossil assemblage and $Mp_i$ is the proportion of species i within the modern assemblage (Overpeck et al., 1985).
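As an illustration of Eq. (1), the following minimal sketch computes the SCD between two assemblages and picks the least dissimilar modern assemblage. It is written in Python purely as a language-neutral illustration and is not MATTOOLS code; the function names `squared_chord_distance` and `best_analog` and the three-taxon data are hypothetical, and assemblages are assumed to be proportions listed in the same taxon order.

```python
import math


def squared_chord_distance(fossil, modern):
    """Eq. (1): sum over taxa of (sqrt(Fp_i) - sqrt(Mp_i))^2 for proportions."""
    if len(fossil) != len(modern):
        raise ValueError("assemblages must list the same taxa in the same order")
    return sum((math.sqrt(f) - math.sqrt(m)) ** 2 for f, m in zip(fossil, modern))


def best_analog(fossil, modern_set):
    """Return (index, SCD) of the modern assemblage least dissimilar to the fossil one."""
    distances = [squared_chord_distance(fossil, m) for m in modern_set]
    idx = min(range(len(distances)), key=distances.__getitem__)
    return idx, distances[idx]


# Fictitious three-taxon proportions (each assemblage sums to 1).
fossil = [0.5, 0.3, 0.2]
modern_set = [[0.1, 0.1, 0.8], [0.5, 0.25, 0.25], [1.0, 0.0, 0.0]]
index, scd_value = best_analog(fossil, modern_set)
print(index, round(scd_value, 4))
```

Identical assemblages give an SCD of zero, and assemblages with no taxa in common give the maximum of two, which matches the range stated above.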

Frequent arbitrary choices in the MAT have included the critical limit of dissimilarity or cutoff value used to distinguish between a good vs. poor analog as well as how many analogs to retain for estimating the past environment as an average prediction. Attempts to make these choices less subjective generally utilize the modern dataset itself within modern calibration or training (Birks, 1998). For each dataset, modern calibration allows determination of: (a) a critical threshold of dissimilarity that can be used to separate good vs. poor analogs, and (b) the optimal number of analogs to retain in reconstructions.

Fig. 1. Modern analog technique (MAT) applied to fictitious pollen assemblages. A fossil pollen spectrum, e.g., at 1000 yrs before present, is compared with modern spectra from a range of vegetation types and environments using some distance metric, e.g., squared chord distance. Modern pollen spectrum with minimum distance is considered best modern analog. From this, we infer that ecosystem around modern pollen-sampling site is an analog for past ecosystem centered on fossil pollen-sampling site. Distance metrics from Overpeck et al. (1985).

2.1. Determining a critical threshold of dissimilarity

The fundamental question as to what constitutes a good analog is most often addressed by choosing a numeric threshold of dissimilarity. Paired samples with dissimilarities less than this threshold are considered analogs; samples with dissimilarities greater than this threshold are considered non-analogs. The derivation of the threshold value itself has been either based upon comparisons between modern biological assemblages, the so-called ‘training’ dataset (Birks, 1998), or adopted uncritically from other studies (Jackson and Williams, 2004; Sawada et al., 2004). In either case, the chosen threshold is then applied ‘down-core’ to reconstruct paleoenvironments by rejecting as analogs any modern assemblages that exceed the threshold. However, any critical threshold of dissimilarity will be a function of the number of taxa included within a given dataset (Waelbroeck et al., 1998; Sawada et al., 2001; Sawada et al., 2004) and so critical thresholds need to be tailored to any MAT reconstruction (Sawada et al., 2001; Sawada et al., 2004; Wahl, 2004). Tailoring critical thresholds to a modern dataset is computationally intensive and based on numerous pairwise comparisons between assemblages within the modern training dataset. MATTOOLS implements two methods for assessing the critical limit of dissimilarity for a given MAT application.

Monte Carlo: Anderson et al. (1989), Bartlein and Whitlock (1993) and Birks (1995) utilized the empirical distribution of dissimilarity coefficients generated by all pairwise comparisons between modern samples to determine those thresholds that occur infrequently. This technique is computationally burdensome for large modern collections of assemblages such as that described in Whitmore et al. (2005). An alternative is to undertake a Monte Carlo simulation using paired comparisons between randomly chosen modern assemblage pairs in order to determine threshold values that are unlikely, in a statistical sense, to occur by chance alone (Sawada et al., 2004). For example, a threshold value that occurs by chance at most five times in one hundred (0.05) would correspond to an α = 0.05 level of significance. Monte Carlo derived cutoff values are subject to Type I and Type II statistical errors and, like other techniques based on pairwise comparisons of modern assemblages, are designed to quantify the risk associated with false positive analogs (Gavin et al., 2003; Wahl, 2004). False positive analogs occur when the applied threshold value accepts an assemblage that is in reality from a different biological zone. It is pertinent to note that statistical significance does not necessarily connote ecological significance. The Monte Carlo approach does not fully identify false negative analogs, that is, analogs with high dissimilarity that belong to the same biological zone. The advantage of the Monte Carlo approach is that it does not require a priori knowledge of the biological zonation or mutual affinity of the assemblages within the modern dataset.
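The Monte Carlo idea can be sketched as follows, under the assumption that the critical value is taken as the lower-tail quantile of dissimilarities between randomly drawn modern pairs. This is a Python illustration only, not the MATTOOLS routine; `monte_carlo_threshold`, its parameters and the randomly generated six-taxon dataset are all hypothetical.

```python
import math
import random


def scd(a, b):
    """Squared chord distance between two assemblages of proportions (Eq. (1))."""
    return sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))


def monte_carlo_threshold(modern_set, n_pairs=10000, alpha=0.05, seed=0):
    """Dissimilarity at or below which roughly a fraction `alpha` of random pairs fall."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(modern_set)), 2)  # two distinct assemblages
        draws.append(scd(modern_set[i], modern_set[j]))
    draws.sort()
    k = max(0, math.floor(alpha * n_pairs) - 1)  # position of the lower-tail quantile
    return draws[k]


# Fictitious modern dataset: 50 assemblages of 6 taxa, normalised to proportions.
data_rng = random.Random(42)
modern_set = []
for _ in range(50):
    counts = [data_rng.random() for _ in range(6)]
    total = sum(counts)
    modern_set.append([c / total for c in counts])

threshold_05 = monte_carlo_threshold(modern_set, n_pairs=2000, alpha=0.05, seed=1)
print(round(threshold_05, 4))
```

Random sampling of pairs avoids computing all pairwise dissimilarities, which is the computational saving described above; the number of draws trades run time against the stability of the estimated quantile.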

ROC: When knowledge exists regarding the biological zone membership for each modern assemblage, it is possible to determine an optimal threshold of dissimilarity. Here, the optimal dissimilarity is that value which maximizes the probability of true positive analogs while minimizing the probability of false positive ones. For this purpose, empirical receiver operating characteristic (ROC) analysis has been adapted to the MAT from the fields of diagnostic medical testing and signal analysis (Gavin et al., 2003; Oswald et al., 2003; Wahl, 2004).

Gavin et al. (2003) and Wahl (2004) develop a general framework for the application of ROC to the MAT. The technique is explained below and follows the notation of De Long et al. (1988). ROC analysis requires firstly that each modern assemblage be classified into biological zones of true membership. Zonal membership is clearly a function of spatial scale. For example, with a pollen dataset of N samples, m of the modern assemblages could be assigned to a particular biome (leaving N − m assemblages unassigned). For each of these m assemblages, the minimum dissimilarity is found among all other m assemblages, creating a set X = {1, 2, …, m} of minimum dissimilarity values. Alternately the top two, five, etc. dissimilarities could be retained. The number of analogs retained can be varied and could include as many as all pairwise dissimilarity values within a particular biome (cf. Wahl, 2004). The resultant set X = {1, 2, …, m} identifies the distribution of dissimilarity expected between true or good analogs (Figs. 2A and 2B, first panels). This distribution assumes that all samples were correctly classified. At this stage, errors not introduced in the pollen counting and identification process are due to incorrect assignment of a sample to a biome type. So, incorrect sample assignments yield dissimilarity values that represent comparisons of very different assemblages and these are called false positive analogs. Next, a second set Y = {1, 2, …, n} (where n = N − m) of minimum dissimilarity values is created by taking each of the n assemblages and searching for minimum dissimilarities within the m assemblages (Figs. 2A and 2B, first panels). This set represents the distribution of dissimilarities for paired samples from different biomes. Given these two distributions of dissimilarity values X and Y, at any given threshold of dissimilarity, z, one can determine the proportion of true analogs that are also classified as true (true positive fraction, TPF),

$$\mathrm{TPF}(z) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \le z), \qquad (2)$$

where $I(X_i \le z) = 1$ when the condition is true and zero otherwise. The proportion correctly classified as non-analogs at the same cutoff value is given as the true negative fraction (TNF),

$$\mathrm{TNF}(z) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i > z), \qquad (3)$$

where $I(Y_i > z) = 1$ when the condition is true and zero otherwise. An empirical ROC curve is constructed by varying the values of z across the range of possible dissimilarity values, for example with Eq. (1), squared chord distance, $z \in \{ z \in \mathbb{R} \mid 0 \le z \le 2 \}$, at regular intervals and plotting the false positive error FPE = 1 − TNF against the TPF (Eqs. (2) and (3), Figs. 2A and 2B, second panels). The dissimilarity value at which z = max(TPF − FPE) represents the cutoff value that maximizes the true positive analogs and minimizes the false positive ones, that is, the value of joint minimization (Gavin et al., 2003; Wahl, 2004) (Figs. 2A and 2B, third panels).

Fig. 2. Example output from ROC plotting function mat.plotroc. Overlapping histograms on left indicate relative frequencies of dissimilarity values between within zone comparisons (black) and outside of zone to within zone (grey) comparisons; gray arrow indicates optimal dissimilarity value. Center graphs illustrate ROC curves where intersection of y = x line with ROC curve indicates optimal value that minimizes probability of false positive analogs. Right histograms illustrate specificity minus sensitivity over range of dissimilarity coefficient; where this value is maximal, probability of false positive analogs is minimized, as indicated by gray vertical bars within histograms.
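The calculations in Eqs. (2) and (3) and the search for the cutoff that maximizes TPF − FPE can be sketched as below. This is a hedged Python illustration rather than the MATTOOLS implementation; the function names, the 0.01 step size and the dissimilarity values in `within` (set X) and `between` (set Y) are assumptions made for the example.

```python
def tpf(within, z):
    """Eq. (2): fraction of within-zone minimum dissimilarities at or below z."""
    return sum(1 for x in within if x <= z) / len(within)


def tnf(between, z):
    """Eq. (3): fraction of between-zone minimum dissimilarities above z."""
    return sum(1 for y in between if y > z) / len(between)


def optimal_threshold(within, between, step=0.01, z_max=2.0):
    """Scan z from 0 to z_max and return the first z maximising TPF - FPE."""
    best_z, best_score = 0.0, float("-inf")
    n_steps = int(round(z_max / step))
    for k in range(n_steps + 1):
        z = k * step
        score = tpf(within, z) - (1.0 - tnf(between, z))  # TPF - FPE
        if score > best_score:
            best_z, best_score = z, score
    return best_z, best_score


# Fictitious minimum dissimilarities: set X (same zone) and set Y (different zones).
within = [0.052, 0.081, 0.104, 0.123, 0.158, 0.305]
between = [0.204, 0.352, 0.405, 0.551, 0.603]
z_opt, score = optimal_threshold(within, between)
print(round(z_opt, 2), round(score, 3))
```

When several cutoffs tie for the maximum, this sketch keeps the smallest one; that tie-breaking rule is a choice made here for simplicity and is not taken from the source.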

The area under the ROC curve, AUC, is a measure of the diagnostic performance of the dissimilarity metric in distinguishing between similar versus different biological assemblages (Gavin et al., 2003). The AUC can be calculated using trapezoidal integration (default in MATTOOLS) or calculated directly from the nonparametric Mann–Whitney–Wilcoxon¹ rank-sum statistic which is equivalent to the AUC (Metz et al., 1998b; Copas and Corbett, 2002; Gavin et al., 2003; Yan et al., 2003). The AUC can range from AUC = 0.5 when there is no ability of the metric to distinguish between true and false analogs, in other words, the performance is no better than a coin toss, to AUC = 1 when there is no overlap between TPF and TNF. Oswald et al. (2003) and Gavin et al. (2003) used AUC as a measure of the ability of different dissimilarity metrics to distinguish between like and unlike taxon assemblages, thereby determining the best metric overall for a set of biological zones, or one specific region, or one type of dataset or a particular type of biological assemblage.

The empirical distributions of Eqs. (2) and (3) are often non-Gaussian (Metz et al., 1998a; Gavin et al., 2003) and so nonparametric methods are more frequently used in calculating the standard errors of the AUC. The standard error of the AUC, $SE_{AUC}$, can be calculated from the nonparametric Mann–Whitney–Wilcoxon rank-sum statistic or alternatively by the method of Hanley and McNeil (1982), which is computationally faster for large datasets and is the default method for calculating $SE_{AUC}$ in MATTOOLS.
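A minimal sketch of the AUC and its standard error follows, assuming the AUC is obtained from the Mann–Whitney rank statistic (the equivalence noted above) and the standard error from the Hanley and McNeil (1982) approximation. It is a Python illustration, not MATTOOLS code; the function names and data are hypothetical, and lower dissimilarity is treated as the "positive" (analog) direction.

```python
import math


def auc_mann_whitney(within, between):
    """AUC as P(within < between) + 0.5 * P(tie), over all cross-set pairs."""
    wins = 0.0
    for x in within:
        for y in between:
            if x < y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(within) * len(between))


def se_auc_hanley_mcneil(auc, n_pos, n_neg):
    """Standard error of the AUC using the Hanley and McNeil (1982) approximation."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(max(var, 0.0))  # guard against tiny negative rounding error


# Fictitious minimum dissimilarities for same-zone and different-zone comparisons.
within = [0.052, 0.081, 0.104, 0.123, 0.158, 0.305]
between = [0.204, 0.352, 0.405, 0.551, 0.603]
auc = auc_mann_whitney(within, between)
se = se_auc_hanley_mcneil(auc, len(within), len(between))
print(round(auc, 4), round(se, 4))
```

The pairwise loop costs m × n comparisons, whereas the Hanley and McNeil expression needs only the AUC and the two sample sizes, which is consistent with the speed advantage mentioned for large datasets.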

1 These may be referred to separately as the Mann–Whitney U and Wilcoxon rank-sum test (http://jsekhon.fas.harvard.edu/stats/html/wilcox.test.html). They are equivalent tests under the same assumptions.


Software is available for the computation of ROC curves, e.g., ROCKIT (Metz et al., 1998a) and Analyze-It for Microsoft Excel (Analyse-It Software Ltd., 2004) among many others. However, these packages are extremely difficult to apply to ROC analysis in the context of MAT applications. MATTOOLS implements the comparison framework of Gavin et al. (2003) as described above for the computation of optimal dissimilarity thresholds and includes a number of diagnostic plots.

2.2. The number of analogs to retain in reconstructions

The central issue here is simple: the single best modern analog, as measured by some metric, may not in reality yield the most similar environment to a fossil assemblage (Jackson and Williams, 2004). Why? Because there are many environmental factors that can shape biological assemblages (Jackson and Williams, 2004) and these are only confounded by analytical errors (cf. Mosimann, 1965). Consequently, nearly identical assemblages can arise from different environments. As such, the ‘true’ best analog for a fossil assemblage may not be the modern sample that is least dissimilar. Nevertheless, for application's sake, we must assume that the ‘true’ best modern analog is included somewhere within the top-n modern analogs for a given fossil assemblage. Exactly what n should be is unclear. In most studies the number of analogs retained has been arbitrary, ranging from 1 to 10 or more (reviewed in Jackson and Williams, 2004; Sawada et al., 2004). Sawada et al. (2004) further suggest that the optimal number of analogs to retain will depend on the environmental variable undergoing reconstruction. In the terrestrial case, for example, mean annual temperature and total annual precipitation each have a unique spatial structure and scale of influence. Thus, it should not be surprising that the optimal number of retained analogs differs between environmental variables.

Waelbroeck et al. (1998) introduced a ‘jump’ method to overcome the arbitrary choice of how many analogs to retain in estimating the environment of a fossil assemblage. The ‘jump’ method (JM) works in the following way for each assemblage:

1. All dissimilarity values are sorted in ascending order;

2. Some metric, α, that measures the change between adjacent values is determined. For example, percentage increase in dissimilarity of each analog with respect to its predecessor is calculated as,

$$\alpha = \left( \frac{\mathrm{SCD}_{i+1}}{\mathrm{SCD}_{i}} - 1 \right) \times 100 \qquad (4)$$

and is the most frequently used jump measure (Waelbroeck et al., 1998; Williams, 2003; Sawada et al., 2004). This is the default method in MATTOOLS for comparison;

3. Using the metric, α, each dissimilarity value is compared to the previous;

4. When the metric meets or exceeds some predefined level (α = C; {C is a constant}), this value represents a discontinuity or ‘jump’ in dissimilarity;

5. Retained in the calculation of the environmental estimate are the environmental values of those analogs that precede the ‘jump’ or discontinuity.
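The five steps above can be sketched as follows, using the percentage increase of Eq. (4) as the jump metric. This is a Python illustration under stated assumptions, not the MATTOOLS routine: `jump_retained` and its arguments are hypothetical, the input is assumed non-empty, a zero dissimilarity followed by a positive one is treated as an infinite jump, and all analogs considered are kept when no jump is found.

```python
def jump_retained(dissimilarities, env_values, alpha_crit, max_analogs=None):
    """Return (number of analogs kept, mean environmental value of those analogs)."""
    pairs = sorted(zip(dissimilarities, env_values))  # step 1: ascending dissimilarity
    if max_analogs is not None:
        pairs = pairs[:max_analogs]  # optional cap on the top-n analogs considered
    keep = len(pairs)  # default when no jump is found
    for i in range(len(pairs) - 1):
        d_prev, d_next = pairs[i][0], pairs[i + 1][0]
        if d_prev > 0:
            pct = (d_next / d_prev - 1.0) * 100.0  # Eq. (4)
        else:
            pct = float("inf") if d_next > 0 else 0.0
        if pct >= alpha_crit:  # step 4: a jump at or above the critical level
            keep = i + 1  # step 5: keep only the analogs preceding the jump
            break
    kept = pairs[:keep]
    estimate = sum(env for _, env in kept) / len(kept)
    return keep, estimate


# Fictitious dissimilarities to one fossil sample, with mean annual temperatures.
dissimilarities = [0.30, 0.10, 0.12, 0.11, 0.32]
temperatures = [20.0, 10.0, 12.0, 11.0, 21.0]
n_kept, estimate = jump_retained(dissimilarities, temperatures, alpha_crit=20.0)
print(n_kept, estimate)
```

In this example the successive increases are about 10%, 9% and 150%, so with a critical level of 20% the three closest analogs are averaged.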

In the case where no ‘jump’ is found, the estimated environmental value can approach the average of the dataset (Sawada et al., 2004). Adoption of the overall environmental variable average as a fossil estimate is clearly problematic if the modern set is known to span numerous different environmental gradients, for example the North American modern pollen dataset (Whitmore et al., 2005). To solve this issue, and in the absence of a jump in dissimilarity, Waelbroeck et al. (1998) suggest retaining some arbitrary number of analogs for such cases. Conversely, in such cases, Williams (2003) suggests retaining only those analogs below a predefined critical threshold of dissimilarity. Nevertheless, within large modern datasets with thousands of assemblages, like that of Whitmore et al. (2005), hundreds or thousands of analogs can be retained for a given assemblage simply because there is no distinguishing ‘jump’ or breach of a chosen critical threshold. As such, an arbitrary choice of retaining the top-n modern analogs for some assemblages seems unavoidable (Sawada et al., 2004). For large datasets, some reasonable number of the top-n may be retained and the jump method applied according to the modification of Williams (2003). However, neither Waelbroeck et al. (1998) nor Williams (2003) specify how to determine the critical ‘jump’ value, α, as described above.

Fig. 3. Output of jump analysis for optimizing alpha for a given environmental reconstruction using function mat.plotjump. [Plot of Pearson's R (correlation), on an axis from 0.888 to 0.900, against alpha from 0 to 100; annotated with alpha = 20.]

Building on the modifications of Williams (2003), Sawada et al. (2004) utilized a modified jump method (MJM), which is implemented in MATTOOLS. Briefly, in applying the MJM, the user retains the top-n analogs and the program then determines an optimal value of α based on comparisons between biological assemblages within the modern dataset. The optimal value of α is that which maximizes the linear correlation between the observed and predicted environmental values across all samples (Sawada et al., 2004). For N modern assemblages, the 2nd to ith dissimilarity values for each are searched using Eq. (4). The search steps through α values from 0% to 100% in intervals of 1%. For each modern biological assemblage, analogs with a percentage change less than the current level of α are retained and their environmental variable is averaged to produce a modern prediction for that assemblage. For each level of α, the predicted environmental values are correlated with the observed environmental values for the N modern assemblages and the correlation coefficients at each α level are retained. Finally, the coefficient of correlation is plotted against the levels of α to determine which value of α provides the maximum correlation between the observed and predicted environmental variable (Fig. 3).

In the MJM, the value of α that provides maximum correlation is sensitive to the number of analogs upon which it is tested (Sawada et al., 2004). For example, the optimal value of α will be different if the top 100 vs. top 10 analogs are retained. As such, α should be determined for each modern dataset or subset. In addition, α will vary when different environmental variables are predicted. The MATTOOLS function mat.jump optimizes the value of α for a given number of analogs in a modern dataset.
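The α search described for the MJM can be sketched as a leave-one-out loop over the modern dataset. The following Python code is an illustration under stated assumptions, not the mat.jump implementation: all function names are hypothetical, SCD is used as the metric, each modern sample is predicted from its top-n nearest other samples, and the synthetic two-taxon gradient exists only to make the example runnable.

```python
import math


def scd(a, b):
    """Squared chord distance between two assemblages of proportions (Eq. (1))."""
    return sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))


def pearson_r(x, y):
    """Pearson correlation; returns NaN if either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def predict_with_jump(sorted_pairs, alpha):
    """Average the environment of analogs preceding the first jump >= alpha (Eq. (4))."""
    keep = len(sorted_pairs)
    for i in range(len(sorted_pairs) - 1):
        d0, d1 = sorted_pairs[i][0], sorted_pairs[i + 1][0]
        pct = (d1 / d0 - 1.0) * 100.0 if d0 > 0 else (float("inf") if d1 > 0 else 0.0)
        if pct >= alpha:
            keep = i + 1
            break
    return sum(env for _, env in sorted_pairs[:keep]) / keep


def optimize_alpha(assemblages, env, top_n=10, alphas=range(0, 101)):
    """Return (best alpha, [(alpha, r), ...]) maximising observed-predicted correlation."""
    neighbours = []
    for i, a in enumerate(assemblages):  # leave-one-out: compare to all other samples
        pairs = sorted((scd(a, b), env[j]) for j, b in enumerate(assemblages) if j != i)
        neighbours.append(pairs[:top_n])
    results = []
    for alpha in alphas:
        predicted = [predict_with_jump(p, alpha) for p in neighbours]
        results.append((alpha, pearson_r(env, predicted)))
    valid = [t for t in results if not math.isnan(t[1])]
    if not valid:
        raise ValueError("correlation undefined at every alpha level")
    return max(valid, key=lambda t: t[1])[0], results


# Synthetic gradient: 20 two-taxon assemblages whose environment tracks taxon 1.
assemblages = [[0.1 + 0.8 * k / 19, 0.9 - 0.8 * k / 19] for k in range(20)]
env = [10.0 * a[0] for a in assemblages]
best_alpha, curve = optimize_alpha(assemblages, env, top_n=5)
print(best_alpha)
```

The returned list of (α, correlation) pairs corresponds to the kind of curve shown in Fig. 3; repeating the call with a different `top_n` or environmental variable would be expected to shift the optimum, as the text notes.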

3. Existing MAT solutions

At present, researchers requiring the MAT can choose among the programs designed specifically for MAT reconstructions such as ANALOG (Schweitzer, 1995, 1999), MODPOL (Maher, 2000), SIMMAX (Pflaumann et al., 1996), RAM98 (Waelbroeck et al., 1998) and programs designed for general transfer-function methods that include MAT such as PaleoToolBox (Sieger et al., 1999) and the commercial package C2 v1.3 (Juggins, 2003).² ANALOG is an MS-DOS based package with ANSI C source code that requires delimited text file input in addition to an ASCII ‘run-file’. ANALOG output is simple and offers only best analog assignments, but in a verbose text file requiring post-processing and reformatting. Similar in operation, SIMMAX is an MS-DOS executable with FORTRAN source code and utilizes a similarity metric that is suited to reconstructions based on planktonic foraminifera. The revised analog method (RAM) is available as FORTRAN source code and is particularly well suited to MAT assignments utilizing sparse modern calibration sets in ocean and marine environments. MODPOL is an MS-DOS application with VGA graphic display but is designed to work specifically with M70-type files distributed by the NGDC. Operating system and programming language evolution, in addition to new developments in the application of the MAT over the past few years, have left these programs dated in functionality. At the time of their introduction, however, ANALOG, SIMMAX, RAM98 and MODPOL were the first freely available solutions for the calculation of the MAT.

2 Juggins, S., 2003. C2 1.3: Win95/98/nt2000/xp program for analysing and visualizing paleoenvironmental data, http://www.campus.ncl.ac.uk/staff/Stephen.Juggins/software/c2home.htm

More recently, MS Windows software with functionality similar to the above MS-DOS counterparts has emerged, such as the freely available PaleoToolBox and the commercial package C2. PaleoToolBox offers a number of transfer function methods and includes the MAT as one option; within it, one can create project files that can then be used with a provided MS-DOS executable of the RAM program. PaleoToolBox has no graphical capabilities. More extensive functionality is found in C2, a commercial MS Windows package that computes a broad range of ecological and paleoecological transfer functions and ordination techniques, and provides graphical routines particularly suited to the plotting of stratigraphic diagrams. Although C2 has the ability to compute MAT assignments, the software is not specifically designed for this task and so contains the basic outputs with the addition of some diagnostic plots. While C2 complies with current Windows OS standards and can import files from standard spreadsheet programs, it is neither open nor extensible by the user. Functionality and extensibility issues with the above programs motivate the need for a more open and extensible solution for MAT computations.

4. MATTOOLS & the R environment

The ideal MAT software solution should:

1. Work with simple spreadsheet-like data-tables;

2. Provide a comprehensive environment with tools for data pre-processing, including such tasks as rearranging, formatting and variable transformations;

3. Provide a comprehensive set of graphical tools for visualization of data and results;

4. Include some of the more recent calibration methodologies;

5. Operate in a stable and freely available environment;

6. Be usable across all major platforms (Windows, MacOS, and Linux);

7. Have numerous learning resources and a large support network;

8. Allow easy extensions of MAT functions for individual research projects.

The R language for statistical computing provides an Open Source solution for the MATTOOLS package that satisfies all of the above criteria. Since 1996 (Ihaka and Gentleman, 1996), R has been enthusiastically developed by a core development team and researchers world-wide. The R environment (R Development Core Team, 2003) is now managed by the R Foundation for Statistical Computing. R is available under the GNU General Public License and offers a fully integrated environment largely based on the S language developed at Lucent Technologies (Becker et al., 1988). Grunsky (2002) reviews the definition, structure and philosophy of the R computing environment. R provides numerous statistical, mathematical and graphical functions that are completely extensible by individual researchers. Full programs and functions can be written in R that take advantage of existing functionality built into the R-core function set as well as functions within other user-donated R packages. Currently, there are more than one hundred supported libraries of functions that extend the base package functionality and allow for such things as interfacing with Geographical Information Systems (GIS) data formats for mapping in R (Lewin-Koh and Bivand, 2004)³ and spatial statistical libraries (Bivand, 2004)⁴ among others. R provides a free, mature and stable environment in which to implement MATTOOLS and expose its functionality to the Quaternary geoscience community.

³ Lewin-Koh, N.J., Bivand, R., 2004. Maptools 0.3-8. http://www.r-project.org/

⁴ Bivand, R., 2004. The spdep package 0.2-11. http://www.r-project.org/


Table 1

High-level functions used in MATTOOLS. Function details (arguments, examples, etc.) can be accessed by typing "?" before the function name; for example, typing ?mat.dissim will open the help file for that function. Examples and descriptions of operation for each function are given in the text of this paper, with the exception of mat.roc.allpair.

High-level function   Description
mat.dissim            Creates a dissimilarity object between either a modern dataset and itself or a modern and a fossil dataset. This list object contains matrices and parameters that are used both in optimizations of critical choices and in environmental reconstructions.
mat.mc                Undertakes a Monte Carlo simulation of pairwise modern spectra to find critical thresholds of dissimilarity and associated probabilities.
mat.plot.mc           Plots the output of mat.mc.
mat.roc               Undertakes receiver operating characteristic (ROC) analysis on a modern dataset when each row/site has a nominal class defining biological zone membership. This function undertakes ROC on a within-zone by out-zone basis.
mat.roc.allpair       Undertakes receiver operating characteristic (ROC) analysis on a modern dataset when each row/site has a nominal class defining biological zone membership. This function undertakes ROC on a pairwise zone-by-zone basis (all possible combinations of zones).
mat.plotroc           Plots the output from either mat.roc or mat.roc.allpair.
mat.jump              Undertakes the jump method on a mat.dissim object to determine the optimal alpha threshold for a given number of retained analogs and an environmental variable.
mat.plotjump          Plots the result of mat.jump for visualization of the optimal alpha value.
mat.jumprecon         Applies the jump method downcore for fossil reconstruction.
mat.fossavg           Constructs a data table that includes a choice of various weighted averages of the top-n analogs.
mat.plot.recon        Plots the output of mat.fossavg.


5. The MATTOOLS package

MATTOOLS implements a number of functions for calculation of the MAT and associated measures that underlie the critical decisions in the application of the MAT. A list of primary functions and brief descriptions is provided in Table 1, and full details can be found within the help files provided with the MATTOOLS package.

6. Example application of MATTOOLS: assessing critical limits of dissimilarity for use in a paleoclimate reconstruction of a fossil pollen core

The following provides an example session in R that utilizes both fossil and modern pollen datasets. This involves (a) loading MATTOOLS, (b) getting the data into the R environment, (c) determining critical thresholds based on modern calibration and (d) a paleotemperature reconstruction based on fossil pollen data.

6.1. Loading MATTOOLS

MATTOOLS is installed as a standard library from the "Packages" menu within R by choosing the item "Install package(s) from local (zip) files…". One then navigates to the folder containing MATTOOLS.zip and chooses this file. To load the package, the user again chooses "Packages" from the menu and the item "Load package…", which will reveal a window showing a list of all installed packages. Scroll to and select the item mattools and click on the OK button.
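The same menu steps can usually be scripted from the R prompt. The following is a minimal sketch only: the helper name load_mattools and the archive path are assumptions for illustration, and a Windows binary zip built for an older version of R may not install on a current one.

```r
# Hypothetical helper (not part of MATTOOLS): install the package from a
# local archive if it is not yet installed, then try to load it.
load_mattools <- function(zip_path = "MATTOOLS.zip") {
  if (!requireNamespace("mattools", quietly = TRUE)) {
    if (!file.exists(zip_path)) {
      message("Archive not found: ", zip_path)
      return(FALSE)
    }
    # repos = NULL tells install.packages() to install from a local file
    install.packages(zip_path, repos = NULL)
  }
  # logical.return = TRUE makes library() report success as TRUE/FALSE
  library("mattools", character.only = TRUE, logical.return = TRUE)
}

loaded <- load_mattools("MATTOOLS.zip")
```

The helper returns FALSE rather than stopping when the archive is missing, so a script can decide how to proceed.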

6.2. Getting data into the R environment

The R environment provides a number of functions for importing data from common data management and statistical analysis packages. The simplest function is 'read.csv', which reads comma-separated text files; these can be exported from MS Excel™ and most other statistical software. More sophisticated input/output functions are found in the package 'foreign' that is part of the standard R distribution. For MATTOOLS, data should be formatted in at most two files, using the file structure shown in Fig. 4. The modern data used in this example and included within the MATTOOLS package come from a subset of sites within a modern pollen research database contained in an Excel worksheet (Whitmore et al., 2005), and the fossil data are a single fossil pollen core from Zagoskin Lake (Ager, 2003) containing 60 levels ranging from 0 BP to 30 000 BP. Zagoskin Lake is located on St. Michael Island on the western coast of Alaska (162.6W, 63.27N). The data are first loaded into the R environment from two datasets within the Data\zagoskin subfolder of the installed MATTOOLS package,⁵

modpoll = read.csv("c://zagoskin//modpoll.csv")
zag = read.csv("c://zagoskin//zagoskin.csv")

or alternatively one can load these example datasets directly from the Internet using a URL within the read.csv function:

modpoll = read.csv("http://www.geomatics.uottawa.ca/mattools/modpoll.csv")
zag = read.csv("http://www.geomatics.uottawa.ca/mattools/zagoskin.csv")

Fig. 4. (A) Modern calibration dataset: a file (e.g., a text file like a comma-separated values file [*.csv]) containing field names in the first row, where each subsequent row contains a sample identifier (Sample ID), coordinates in either a planar/projected x,y system or as longitude and latitude in decimal degrees, and taxon abundances followed by modern environmental variables (Mod.Env 1,…,Mod.Env n) that will be used for modern training and/or paleoenvironmental reconstruction. A final, optional field would contain, for each row, a nominal code representing the biological zone to which each row/site belongs. (B) Fossil dataset: this file is only necessary if one will be undertaking an environmental reconstruction for a fossil dataset. The structure of the file is the same as (A), where each row contains the same information but now represents a unique age or depth in a fossil core. In this file, modern environmental variables observed at the locations of fossil cores are only required if anomalies between reconstructed and observed environments will be computed. Note: the order and number of taxon fields in each file should be identical; other fields can vary.
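The file layout of Fig. 4 can be tried out without the example data. The sketch below uses base R only; the column names, taxa and values are made up for illustration and are not the fields of modpoll.csv or zagoskin.csv.

```r
# Write two tiny illustrative files laid out as in Fig. 4
# (all values are invented for demonstration only).
modern_csv <- tempfile(fileext = ".csv")
fossil_csv <- tempfile(fileext = ".csv")

writeLines(c(
  "SampleID,Lon,Lat,Picea,Betula,Alnus,JulyT,Zone",
  "M1,-150.1,64.2,120,60,20,14.1,Taiga",
  "M2,-160.3,66.0,10,90,100,10.3,Tundra",
  "M3,-155.7,61.5,80,80,40,13.0,Taiga"
), modern_csv)

writeLines(c(
  "SampleID,Lon,Lat,Picea,Betula,Alnus",
  "F1,-162.6,63.27,5,70,25",
  "F2,-162.6,63.27,40,50,10"
), fossil_csv)

modern <- read.csv(modern_csv)
fossil <- read.csv(fossil_csv)

# Taxon columns must match in order and number between the two files
taxa_cols <- 4:6
same_taxa <- identical(names(modern)[taxa_cols], names(fossil)[taxa_cols])

# Convert raw counts to proportions, row by row
modern_prop <- prop.table(as.matrix(modern[, taxa_cols]), margin = 1)
```

Checking that the taxon fields agree before any dissimilarity calculation guards against silently comparing different taxa.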

6.3. Determination of optimal threshold of dissimilarity via modern calibration

The MATTOOLS functions mat.roc and mat.mc are the two main functions for critical threshold determination, using either ROCs or the Monte Carlo approach (according to a user-specified level of significance).

For example, the function called 'mat.mc' creates a Monte Carlo distribution for the dataset modpoll, which contains taxon counts⁶ for 104 taxa in columns 10 to 113. Ten thousand comparisons are undertaken and dissimilarity values are requested by the user through the argument 'probs' at significance levels of 0.1, 0.05, etc., the results of which are seen in Fig. 5.

critval = mat.mc(modpoll, modTaxa = 10:113, sampleSize = 10000, probs = c(0.1, 0.05, 0.025, 0.01, 0.001), counts = TRUE)
mat.plot.mc(critval) #Plot histogram and cumulative frequencies

⁵ The user may want to copy the zagoskin folder to the local c:\ drive for the read.csv example.

⁶ Taxon counts are the default input data measure; however, most MATTOOLS functions allow the user to specify whether the input data are in proportions or counts. If counts are input, they are automatically converted to proportions.


Fig. 5. (A) Distribution of dissimilarity values from the Monte Carlo pairwise simulation using the functions mat.mc and mat.plot.mc on the modern dataset; (B) cumulative distribution function across the range of dissimilarity, with the heavy portion of the line indicating the user-specified tail of significance; (C) a zoom view of the user-specified tail of significance, with corresponding critical thresholds indicated by intersecting horizontal/vertical gray lines; numbers annotated on these lines indicate the threshold (x-axis) and the corresponding Monte Carlo p-value (y-axis).


critval is a list object with associated Monte Carlo probabilities in the component "cutoffs", e.g.,

> critval$cutoffs
$x
[1] 0.100 0.050 0.025 0.010 0.001
$y
[1] 0.37052632 0.24913580 0.18030303 0.11696970 0.05090909

A list object is a general container that is like a computer folder with subfolders. Each subfolder is called a list component. Any dataset or R object can be stored in a list component, including other lists, much as a variety of different files can be stored in a computer folder. Because list components are independent of each other, list objects are a good way to store datasets with variable-length records or objects of different modes (say vectors, data tables and matrices). List objects aside and with reference to the example above, a squared chord distance of ~0.25 corresponds to the α = 0.05 level of significance. The function mat.mc also has a second, more computationally intensive method called "allpairs" that uses all pairwise comparisons (Anderson et al., 1989; Bartlein and Whitlock, 1993) in determining the critical value distribution. This second method is preferable for small modern datasets. Argument details are provided in the help file for each function and can be accessed by typing "?" before the function name; for example, typing '?mat.mc' at the prompt will open the help file specific to that function.
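The general idea of Monte Carlo critical thresholds, and of returning them in a list, can be sketched in base R. This is not the mat.mc implementation: the function names sq_chord and mc_cutoffs, the toy count data and the use of lower-tail quantiles of randomly paired spectra are assumptions made for illustration.

```r
# Squared chord distance between two vectors of taxon proportions
sq_chord <- function(p, q) sum((sqrt(p) - sqrt(q))^2)

# Hypothetical stand-in for the Monte Carlo step: sample random pairs of
# modern spectra and read critical dissimilarities off the lower tail.
mc_cutoffs <- function(counts, n_pairs = 1000,
                       probs = c(0.1, 0.05, 0.025, 0.01, 0.001)) {
  props <- prop.table(as.matrix(counts), margin = 1)  # counts -> proportions
  n <- nrow(props)
  d <- numeric(n_pairs)
  for (k in seq_len(n_pairs)) {
    ij <- sample.int(n, 2)  # two distinct rows, drawn without replacement
    d[k] <- sq_chord(props[ij[1], ], props[ij[2], ])
  }
  # A list with nested components, similar in shape to critval$cutoffs
  list(dissim = d,
       cutoffs = list(x = probs, y = unname(quantile(d, probs))))
}

set.seed(1)
toy_counts <- matrix(rpois(50 * 8, lambda = 20) + 1, nrow = 50)  # made-up counts
toy_crit <- mc_cutoffs(toy_counts)
toy_crit$cutoffs
```

Components are reached with the $ operator, so toy_crit$cutoffs$y holds the thresholds that correspond to the probabilities in toy_crit$cutoffs$x.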

MATTOOLS provides a set of functions that tailor the ROC method to dissimilarity coefficient optimization for individual MAT applications. The use of ROC in MATTOOLS requires that a column (coded as text or numbers) exist in the modern dataset that identifies the biological zone membership, affiliation or assignment of each modern biological assemblage (i.e., each sample row of the database). The MATTOOLS function 'mat.roc()' produces ROC curves for each biological zone and also groups all within-zone dissimilarities and outside-of-zone dissimilarities to produce an overall ROC curve and general optimal dissimilarity value. For example, using the same modpoll dataset with each pollen sample classified as belonging to one of nine vegetation zones (Fedorova et al., 1994), the function mat.roc produces,

modroc = mat.roc(modpoll, modTaxa = 10:113, colClasses = 116, rocEvalSeq = seq(0, 2, 0.02), counts = TRUE, aucmethod = "wilcox")

                    optDissVal       AUC       SEAUC nsol nWithin nOutside
Mountain vegetation       0.08 0.7646936 0.009997428    1    1184     1007
Oceanic meadows           0.10 0.9897317 0.013948721    1      26     2165
Forest-tundra             0.10 0.9654532 0.010657753    1     145     2046
Southern tundra           0.08 0.9112191 0.011684879    1     290     1901
Typical tundra            0.32 0.8633824 0.031352690    1      56     2135
Arctic tundra             0.20 0.9530608 0.016979984    1      76     2115
Grasslands                0.16 0.9583138 0.009715350    1     209     1982
Polar deserts             0.34 0.9861223 0.017194344    1      23     2168
Central taiga             0.16 0.9552153 0.010761066    1     182     2009
Overall                   0.16 0.9398312 0.003562054    1    2191    17528

By default, the argument 'numAnalogs = 2' provides ROC analyses based on the single best modern analog; increasing numAnalogs increases the number of analogs retained in the comparisons. The output of mat.roc is in the form of a list, with a summary of the analysis printed to the command window in R. Details on arguments, computations and output list components can be found in the associated help file. The output of mat.roc can be plotted using the function mat.plotroc (see Fig. 2).

mat.plotroc(modroc)

The optimal dissimilarity value output from ROC analysis depends considerably on the accuracy of the sample affiliations. Therefore, different results are produced by different biological sample zonations. In these examples, the zonation utilized for sample affiliations (Fedorova et al., 1994) is a rather coarse and generalized continental-scale vegetation zonation. As such, there tends to be considerable overlap between the TPF and TNF distributions for most zones. An overall optimal value of dissimilarity (the last line of the output above) that maximizes the true positive analogs and minimizes the false positive ones would be ~0.16, which corresponds roughly to the α = 0.025 level of significance derived from the Monte Carlo simulation approach. The overall ROC optimal dissimilarity value is calculated from all pairwise comparisons between zones.
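The logic of an ROC-based cutoff can be illustrated in base R with simulated dissimilarities. The sketch below is not the mat.roc implementation: the data are random draws, the function name wmw_auc is hypothetical, and the optimality rule used here (maximizing the true positive fraction minus the false positive fraction) is an assumption that may differ from the rule in MATTOOLS.

```r
# Simulated dissimilarities: within-zone comparisons tend to be smaller
set.seed(42)
d_within  <- rbeta(200, 2, 10) * 2   # same-zone distances (made up)
d_outside <- rbeta(300, 5, 5) * 2    # different-zone distances (made up)

# Area under the ROC curve via the Wilcoxon-Mann-Whitney statistic:
# the probability that a within-zone distance is smaller than an outside one
wmw_auc <- function(within, outside) {
  r <- rank(c(within, outside))
  n1 <- length(within)
  n2 <- length(outside)
  u_within <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2
  1 - u_within / (n1 * n2)
}

# True and false positive fractions over a grid of candidate cutoffs
cutoffs <- seq(0, 2, by = 0.02)
tpf <- sapply(cutoffs, function(cc) mean(d_within  <= cc))
fpf <- sapply(cutoffs, function(cc) mean(d_outside <= cc))

# Assumed optimality rule: largest difference between TPF and FPF
opt_cutoff <- cutoffs[which.max(tpf - fpf)]
auc <- wmw_auc(d_within, d_outside)
```

The more the two distributions overlap, the closer the AUC falls toward 0.5 and the less sharply defined the optimal cutoff becomes, which is the situation described above for a coarse zonation.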

6.4. Number of samples to retain

The number of analogs to retain in a given fossil reconstruction can be optimized using the modified jump method (MJM) of Sawada et al. (2004). First a reconstruction object, here called modern.10, is created with the function mat.dissim (see ?mat.dissim for details):

modern.10 = mat.dissim(inFossil = modpoll, inModern = modpoll, llMod = 3:4, modTaxa = 10:113, llFoss = 3:4, fosTaxa = 10:113, numAnalogs = 10, counts = TRUE, dist.method = "spherical")

This newly created modern.10 object contains the 10 best modern analogs for each modern sample. The modern.10 object is then sent to the function 'mat.jump()' to determine the optimal α value to use down-core when mean July temperature (in column 114 of the modpoll dataset) is considered:

modern.10.jump = mat.jump(dObj = modern.10, inModern = modpoll, envColumn = 114, cutoff = 0.16)
mat.plotjump(modern.10.jump)

With reference to Fig. 3, an α value of 20% is optimal when 10 analogs are retained for July temperature. This α value may change if a different environmental variable is utilized or a different number of analogs is chosen.
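To give a feel for what a jump criterion does, the base R sketch below applies one plausible reading of it: the ranked analogs are retained up to the first step at which dissimilarity increases by more than α percent. This rule, the function name n_before_jump and the example values are assumptions for illustration; the precise definition of the MJM is given in Sawada et al. (2004) and the mat.jump help file.

```r
# Assumed jump rule: walk down the ranked analogs and stop before the first
# step whose relative increase in dissimilarity exceeds alpha percent.
# (A leading dissimilarity of exactly zero yields an infinite relative
# increase, so only the first analog would be kept in that case.)
n_before_jump <- function(dissim, alpha = 20) {
  d <- sort(dissim)
  rel_increase <- 100 * diff(d) / d[-length(d)]
  jump_at <- which(rel_increase > alpha)
  if (length(jump_at) == 0) length(d) else jump_at[1]
}

# Ten ranked dissimilarities for one sample (made-up values)
d10 <- c(0.050, 0.054, 0.058, 0.061, 0.090, 0.095, 0.100, 0.110, 0.120, 0.130)
n_keep_20 <- n_before_jump(d10, alpha = 20)
n_keep_50 <- n_before_jump(d10, alpha = 50)
```

With these values the step from 0.061 to 0.090 is an increase of roughly 48%, so a threshold of 20% retains the first four analogs, while a threshold of 50% retains all ten.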

6.5. Applying modern calibration exercise to paleoenvironmental reconstruction

For a paleoenvironmental reconstruction, first a dissimilarity object that defines the top-n analogs for each level in the fossil core must be constructed. Once a dissimilarity object has been constructed, the MJM and the critical values derived from the modern calibration exercises can be applied. The reconstruction of the fossil core for Zagoskin Lake requires first that a dissimilarity object be created using the mat.dissim function

zag10.recon = mat.dissim(inFossil = zag, inModern = modpoll, llMod = 3:4, modTaxa = 10:113, llFoss = 3:4, fosTaxa = 10:113, numAnalogs = 10, counts = TRUE, dist.method = "spherical")
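Conceptually, such an object records, for every fossil level, which modern samples are closest and how dissimilar they are. A minimal base R sketch of that search follows; the function top_analogs and the random data are hypothetical, and the real mat.dissim object holds additional matrices and parameters (for example, geographic distances) that are not reproduced here.

```r
# Squared chord distance between two vectors of taxon proportions
sq_chord <- function(p, q) sum((sqrt(p) - sqrt(q))^2)

# For each fossil row, return the row indices and dissimilarities of the
# n closest modern rows (both inputs are matrices of taxon proportions).
top_analogs <- function(fossil, modern, n = 10) {
  lapply(seq_len(nrow(fossil)), function(i) {
    d <- apply(modern, 1, sq_chord, q = fossil[i, ])
    keep <- order(d)[seq_len(min(n, length(d)))]
    data.frame(modern_row = keep, dissim = d[keep])
  })
}

set.seed(7)
modern <- prop.table(matrix(runif(30 * 5), nrow = 30), margin = 1)
fossil <- modern[c(3, 12), , drop = FALSE]  # fossil rows copied from modern
res <- top_analogs(fossil, modern, n = 10)
res[[1]]$modern_row[1]  # the identical modern row ranks first
```

Because the two toy fossil rows are copies of modern rows 3 and 12, each finds its own source as the best analog with a dissimilarity of zero.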

Once the dissimilarity object (now stored as a list object), consisting of the 10 closest dissimilarity scores for each fossil sample, has been created, either the MJM can be applied to the reconstruction using the function 'mat.jumprecon':

zag10.jump = mat.jumprecon(dObj = zag10.recon, modEnvCol = 114, fossEnvCol = 1:7, alpha = 20)

or, alternatively, a reconstruction based on a weighted average of the top-n analogs can be utilized in addition to a cutoff value derived from the Monte Carlo or ROC methods above. Moreover, one can remove all values beyond a specified geographic distance (in meters). The function mat.fossavg constructs a data table that includes either a weighted average of the top-n analogs (possible weightings include inverse geographic distance, inverse dissimilarity, inverse rank-order and equal weight (average)) or just the single best or top-n individual best analogs. The resultant table may have a single entry for each fossil sample or multiple entries for each fossil sample, depending on the choice of weighting and the number of analogs chosen when the mat.dissim function was run. For example, the function mat.fossavg is used in the following, with an equal-weight average of the top 10 best analogs, for Zagoskin Lake:

zag10.avgrec = mat.fossavg(dObj = zag10.recon, modEnvCol = 114, fossCols = 1:7, wmethod = "equal.wt")
mat.plot.recon(zag10.avgrec, inCritVal = critval)

Fig. 6 shows the plot of this reconstruction utilizing all 10 best analogs for each sample.
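The weighting schemes named above amount to different choices of weights in a weighted mean. The base R sketch below shows three of them for a single fossil sample; the function analog_estimate, its method labels and the numbers are hypothetical and are not the mat.fossavg interface, and the weighted standard deviation formula is one common choice assumed here.

```r
# Weighted mean and weighted standard deviation of analog environments
analog_estimate <- function(env, dissim,
                            method = c("equal", "inv.dissim", "inv.rank")) {
  method <- match.arg(method)
  ord <- order(dissim)          # rank analogs from most to least similar
  env <- env[ord]
  dissim <- dissim[ord]
  w <- switch(method,
              equal      = rep(1, length(env)),
              inv.dissim = 1 / pmax(dissim, 1e-6),  # guard against zero
              inv.rank   = 1 / seq_along(env))
  w <- w / sum(w)               # normalize weights to sum to one
  est <- sum(w * env)
  c(estimate = est, sd = sqrt(sum(w * (env - est)^2)))
}

july_t <- c(12.1, 11.4, 13.0, 10.2, 12.6)  # made-up analog temperatures
dissim <- c(0.05, 0.08, 0.10, 0.15, 0.20)  # made-up dissimilarities
est_equal <- analog_estimate(july_t, dissim, "equal")
est_invd  <- analog_estimate(july_t, dissim, "inv.dissim")
```

Inverse-dissimilarity and inverse-rank weights pull the estimate toward the closest analogs, whereas equal weights reproduce the plain average.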


Fig. 6. (A) Zagoskin L. reconstruction of average July temperature for the past 30,000 years from function mat.fossavg based on the top 10 best analogs. The black line is the estimate and dotted lines represent the weighted standard deviation for the given weight method; (B) plot of dissimilarity with Monte Carlo derived significance levels from function mat.mc (see text).


7. Conclusions

MATTOOLS integrates recent advances within the modern analog technique into a single set of R functions contained in a distributable library that not only allows for paleoenvironmental reconstructions but also provides a set of tools for the assessment of critical decisions within the analog technique. These decisions include the threshold of dissimilarity and the number of analogs to retain for paleoenvironmental reconstruction. Because the aforementioned decisions are dataset specific, the inherent functionality within MATTOOLS is essential. Finally, because MATTOOLS is open source and programmed within the R language, it lends itself to easy modification and integration in larger projects.

Acknowledgements

This research is supported by grants to M. Sawada from the Canadian Foundation for Innovation, the Ontario Innovation Trust, and the Natural Sciences and Engineering Research Council (NSERC) of Canada. This paper represents a contribution to the NSERC funded Climate Systems History and Dynamics Project. The comments of Dr. K. Gajewski, Dr. Jack Williams and Dr. Roger Bivand on an earlier version of this manuscript significantly improved its content. The assistance of Dr. Andre E. Viau in compiling test datasets is gratefully acknowledged.

References

Ager, T.W., 2003. Late Quaternary vegetation and climate history of the central Bering land bridge from St. Michael island, Western Alaska. Quaternary Research 60, 19–32.

Analyse-It Software Ltd., 2004. Analyse-it. Leeds, UK.


Anderson, P., Bartlein, P.J., Brubaker, L., Gajewski, K., Ritchie, J.C., 1989. Modern analogs of late Quaternary pollen spectra from the western interior of North America. Journal of Biogeography 16, 573–596.

Bartlein, P.J., Whitlock, C., 1993. Paleoclimatic interpretation of the Elk lake pollen record. In: Bradbury, J.P., Dean, W.E. (Eds.), Elk Lake, Minnesota: Evidence for Rapid Climate Change in the North-Central United States. US Geological Survey, Denver, Colorado, Special Paper 276, pp. 275–294.

Becker, R.A., Chambers, J.M., Wilks, A.R., 1988. The New S Language. Chapman & Hall, London, pp. 702.

Birks, H.J.B., 1998. D.G. Frey & E.S. Deevey review #1. Numerical tools in palaeolimnology: progress, potentialities, and problems. Journal of Paleolimnology 20 (4), 307–332.

Copas, J.B., Corbett, P., 2002. Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika 89 (2), 315–331.

Davis, M.B., 2000. Palynology after Y2K: understanding the source area of pollen in sediments. Annual Review of Earth and Planetary Sciences 28 (1), 1–18.

De Long, E.R., De Long, D.M., Clarke-Pearson, D.L., 1988. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44 (3), 837–845.

Fedorova, I.T., Volkova, Y.A., Varlyguin, E., 1994. World vegetation cover. Digital raster data on a 30-minute Cartesian orthonormal geodetic (lat/long) 1080 × 2160 grid. Global Ecosystems Database Version 2.0. USDOC/NOAA National Geophysical Data Center, Boulder, CO.

Gajewski, K., Vance, R., Sawada, M., Fung, I., Gignac, L.D., Halsey, L., John, J., Maisongrande, P., Mandell, P., Mudie, P.J., Richard, P.J.H., Sherin, R.A.G., Soroko, J., Vitt, D., 2000. The climate of North America and adjacent ocean waters ca. 6 ka. Canadian Journal of Earth Sciences 37 (5), 661–681.

Gavin, D.G., Oswald, W.W., Wahl, E.R., Williams, J.W., 2003. A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60 (3), 356–367.

Gonzalez-Donoso, J.M., Serrano, F., Linares, D., 2000. Sea surface temperature during the Quaternary at ODP sites 976 and 975 (Western Mediterranean). Palaeogeography Palaeoclimatology Palaeoecology 162 (1–2), 17–44.

Grunsky, E.C., 2002. R: a data analysis and statistical programming environment: an emerging tool for the geosciences. Computers & Geosciences 28 (10), 1219–1222.

Guiot, J., de Beaulieu, J.L., Cheddadi, R., David, F., Ponel, P., Reille, M., 1993. The climate in Western Europe during the last glacial/interglacial cycle derived from pollen and insect remains. Palaeogeography, Palaeoclimatology, Palaeoecology 103 (1–2), 73–93.

Hanley, J.A., McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic ROC curve. Radiology 143, 29–36.

Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5 (3), 299–314.

Ikeya, N., Cronin, T.M., 1993. Quantitative analysis of ostracoda and water masses around Japan: application to Pliocene and Pleistocene paleooceanography. Micropaleontology 39 (3), 263–281.

Jackson, S.T., Williams, J.W., 2004. Modern analogs in Quaternary paleoecology: here today, gone yesterday, gone tomorrow? Annual Review of Earth and Planetary Sciences 32 (1), 495–537.

Kandiano, E.S., Bauch, H.A., 2003. Surface ocean temperatures in the north-east Atlantic during the last 500 000 years: evidence from foraminiferal census data. Terra Nova 15 (4), 265–271.

Li, T.G., Liu, Z.X., Hall, M.A., Berne, S., Saito, Y., Cang, S.X., Cheng, Z.B., 2001. Heinrich event imprints in the Okinawa trough: evidence from oxygen isotope and planktonic foraminifera. Palaeogeography Palaeoclimatology Palaeoecology 176 (1–4), 133–146.

Maher, L.J., 2000. MODPOL.EXE: A tool for searching for modern analogs of Pleistocene pollen data. INQUA Sub-Commission on Data-Handling Methods. Bennett, K.D. Newsletter 20, http://www.kv.geo.uu.se/inqua/news20/n20-ljm.htm.

Metz, C.E., Herman, B.A., Roe, C.A., 1998a. Statistical comparison of two ROC-curve estimates obtained from partially paired datasets. Medical Decision Making 18 (1), 110–121.

Metz, C.E., Herman, B.A., Shen, J.-H., 1998b. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from continuously distributed data. Statistics in Medicine 17, 1033–1053.

Mosimann, J.E., 1965. Statistical methods for the pollen analyst: multinomial and negative multinomial techniques. In: Kummel, B., Raup, D. (Eds.), Handbook of Paleontological Techniques. W.H. Freeman and Company, San Francisco, pp. 636–673.

Oswald, W.W., Brubaker, L.B., Hu, F.S., Gavin, D.G., 2003. Pollen-vegetation calibration for tundra communities in the arctic foothills, Northern Alaska. Journal of Ecology 91 (6), 1022–1033.

Overpeck, J.T., Webb III, T., Prentice, I.C., 1985. Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the method of modern analogs. Quaternary Research 23 (1), 87–108.

Perez-Folgado, M., Sierro, F.J., Flores, J.A., Cacho, I., Grimalt, J.O., Zahn, R., Shackleton, N., 2003. Western Mediterranean planktonic foraminifera events and millennial climatic variability during the last 70 kyr. Marine Micropaleontology 48, 49–70.

Pflaumann, U., Duprat, J., Pujol, C., Labeyrie, L.D., 1996. Simmax: a modern analog technique to deduce Atlantic sea surface temperatures from planktonic foraminifera in deep-sea sediments. Paleoceanography 11 (1), 15–35.

Pflaumann, U., Sarnthein, M., Chapman, M., d’Abreu, L., Funnell, B., Huels, M., Kiefer, T., Maslin, M., Schulz, H., Swallow, J., van Kreveld, S., Vautravers, M., Vogelsang, E., Weinelt, M., 2003. Glacial North Atlantic: sea-surface conditions reconstructed by GLAMAP 2000, art. no. 1065. Paleoceanography 18 (3), 1065.

Prell, W.L., 1985. The stability of low-latitude sea-surface temperatures: an evaluation of the CLIMAP reconstruction with emphasis on the positive SST anomalies. Report TR 025, US Department of Energy, Washington, DC, 60pp.


Prentice, I.C., 1980. Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Review of Palaeobotany and Palynology 31, 71–104.

R Development Core Team, 2003. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org.

Sawada, M., Gajewski, K., de Vernal, A., Richard, P., 1999. Comparison of marine and terrestrial Holocene climatic reconstructions from northeastern North America. Holocene 9 (3), 267–277.

Sawada, M., Viau, A.E., Gajewski, K., 2001. Critical thresholds of dissimilarity in the modern analog technique (MAT) for quantitative paleoclimate reconstruction. In: Chylek, P., Lesins, G. (Eds.), Proceedings of the First Annual Conference on Global Warming and the Next Ice Age, Dalhousie University, Halifax, NS, Canada, pp. 149–152.

Sawada, M., Viau, A.E., Vettoretti, G., Peltier, W.R., Gajewski, K., 2004. Comparison of North-American pollen-based temperature and global lake-status with CCCma AGCM2 output at 6 ka. Quaternary Science Reviews 23 (3–4), 225–244.

Schweitzer, P.N., 1995. Analog: a program for estimating paleoclimate parameters using the method of modern analogs. INQUA Working Group on Data-Handling Methods, Newsletter 13.

Schweitzer, P.N., 1999. ANALOG: A program for estimating paleoclimate parameters using the method of modern analogs. U.S. Geological Survey Open-File Report 94–645. United States Geological Survey, Reston, VA, http://geochange.er.usgs.gov/pub/tools/analog/doc/analog.html

Sieger, R., Gersonde, R., Zielinski, U., 1999. A new extended software package for quantitative paleoenvironmental reconstructions. EOS, Transactions, American Geophysical Union Electronic Supplement, 11 May 1999.

Waelbroeck, C., Labeyrie, L., Duplessy, J.C., Guiot, J., Labracherie, M., Leclaire, H., Duprat, J., 1998. Improving past sea surface temperature estimates based on planktonic fossil faunas. Paleoceanography 13 (3), 272–283.

Wahl, E.R., 2004. A general framework for determining cutoff values to select pollen analogs with dissimilarity metrics in the modern analog technique. Review of Palaeobotany and Palynology 128 (3–4), 263–280.

Whitmore, J., Gajewski, K., Sawada, M., Williams, J.W., Shuman, B., Bartlein, P.J., Minckley, T., Viau, A.E., Webb III, T., Shafer, S., Anderson, P., Brubaker, L., 2005. Modern pollen data from North America and Greenland for multi-scale paleoenvironmental applications. Quaternary Science Reviews 24 (16–17), 1828–1848.

Williams, J.W., 2003. Variations in tree cover in North America since the last glacial maximum. Global and Planetary Change 35 (1–2), 1–23.

Williams, J.W., Shuman, B.N., Webb, T., 2001. Dissimilarity analyses of late-Quaternary vegetation and climate in Eastern North America. Ecology 82 (12), 3346–3362.

Yan, L., Dodier, R., Mozer, M.C., Wolniewicz, R., 2003. Optimizing classifier performance via an approximation to the Wilcoxon–Mann–Whitney statistic. In: Fawcett, T., Mishra, N. (Eds.), Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), AAAI Press, Washington, DC, pp. 848–855. http://aaai.org/Press/Proceedings/ICML/2003/icml03.html