ARTICLE IN PRESS
Computers & Geosciences 32 (2006) 818–833
www.elsevier.com/locate/cageo
An open source implementation of the Modern Analog Technique (MAT) within the R computing environment☆
Michael Sawada*
Laboratory for Applied Geomatics and GIS Science (LAGGISS) and Laboratory for Paleoclimatology and Climatology (LPC),
Department of Geography and Ottawa-Carleton Geoscience Center, University of Ottawa, Ottawa, Ont., Canada K1N 6N5
Received 28 January 2005; accepted 24 October 2005
Abstract
The purpose of this paper is to introduce an analytical solution for Quaternary geoscientists applying the modern analog
technique (MAT) to fossil biological assemblages. We present a package called MATTOOLS that implements the MAT
and offers new calibration techniques related to Monte Carlo simulation and receiver operating characteristic (ROC) curves that are
used in assessing the critical thresholds of biological assemblage dissimilarity. The MATTOOLS solution to the MAT has
the advantage of operating in the R language environment, a free open-source high-level language with hundreds of
functions for statistical analysis and visualization. MATTOOLS therefore offers an easily extensible solution for individual
research endeavors. We review current solutions for MAT calculations and provide an example of modern calibration
using MATTOOLS.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Modern analog technique (MAT); Method of modern analogs; Quaternary; ROC; MATTOOLS; R-project.org
1. Introduction
Within the Quaternary geosciences, the modern analog technique (MAT) (Overpeck et al., 1985; Prell, 1985) has become an increasingly popular method for the reconstruction of past environments (Davis, 2000). The MAT has been used to infer past climates and vegetation from fossil pollen (Guiot et al., 1993; Sawada et al., 1999; Gajewski et al., 2000; Williams et al., 2001; Williams, 2003; Sawada et al., 2004) as well as marine/ocean sea surface temperature and salinity at numerous time-scales (e.g., Ikeya and Cronin, 1993; Gonzalez-Donoso et al., 2000; Li et al., 2001; Kandiano and Bauch, 2003; Perez-Folgado et al., 2003; Pflaumann et al., 2003).
The MAT provides an objective way of quantifying past environments, and the application of this method requires a number of decisions. The recent increased interest in the MAT has led to the development of new methods that aid in making such decisions. Here it is implied that making a decision requires a choice between alternatives. Choosing a critical numerical threshold of dissimilarity that indicates whether a given modern biological assemblage is a good analog for a fossil assemblage is the first of these decisions. New methods that aid in this decision include Monte Carlo techniques (Anderson et al., 1989; Bartlein and Whitlock, 1993; Sawada et al., 2004) and/or receiver operating characteristic (ROC) curves (Gavin et al., 2003; Jackson and Williams, 2004; Wahl, 2004). The second decision concerns the number of analogs to retain from a set of modern assemblages (Waelbroeck et al., 1998; Williams, 2003; Sawada et al., 2004). Statistical summaries of this retained set are used to derive estimates of past environmental conditions. Recent developments in “jump” techniques have improved the decision process (Waelbroeck et al., 1998; Williams, 2003; Sawada et al., 2004).
☆ MATTOOLS can be downloaded at http://www.geomatics.uottawa.ca/mattools where news and updates will be posted. The version described in this paper is available at http://www.iamg.org/CGEditor/index.htm
* Tel.: +1 613 562 5800x1040; fax: +1 613 562 5145.
E-mail address: [email protected].
0098-3004/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cageo.2005.10.008
I refer to the above as “critical” choices because they cannot be generalized across different datasets or disciplines and must therefore be determined on an application-by-application basis. Application of the MAT with regard to these critical choices is computationally intensive. No software is available that permits the easy application of the methods along with the flexibility to experiment with the different choices. Thus, there is a strong need for an Open Source solution that will allow for the simple application of the MAT and experimentation with the critical choices within individual applications. This paper presents a solution called MATTOOLS that operates within the R computing environment. The advantage of this approach to development within R is that the environment is complete and extensible, and the approach allows for visualization and exploratory data analysis prior and subsequent to MAT analysis.
2. The MAT
The MAT compares a fossil biological assemblage (F) to all modern assemblages (M) within a given geographic extent. The comparison between assemblages, each itself a multivariate vector of taxon observations, is done by way of a dissimilarity metric (Fig. 1). The modern assemblage with the smallest dissimilarity is in theory the ‘best’ modern analog. The environment that is associated with the most similar modern assemblage may be assigned to the time and location of the fossil assemblage being tested. In searching for analog assignments, the technique assumes that all possible past environmental conditions producing fossil biological assemblages currently exist within the spatial domain of the modern assemblages and, furthermore, that variations in assemblage composition are determined by the environmental variables of interest; otherwise the application of the technique is problematic (Jackson and Williams, 2004).
Dissimilarity between fossil and modern biological assemblages can be measured using a number of different metrics (Prentice, 1980; Overpeck et al., 1985; Gavin et al., 2003; Jackson and Williams, 2004). A signal-to-noise measure called squared chord distance (SCD) is robust in the context of terrestrial pollen assemblages (Overpeck et al., 1985; Gavin et al., 2003) and has been employed in the majority of terrestrial MAT reconstructions (Jackson and Williams, 2004; Sawada et al., 2004). MATTOOLS utilizes SCD by default. However, other dissimilarity metrics, such as the equal-weight distance metrics illustrated in Fig. 1, can be more accurate in regions like the North American tundra (Oswald et al., 2003). Because MATTOOLS code is open-source, other distance metrics are easily added for different purposes. In any case, SCD ranges from zero for a perfect analog (no dissimilarity) to two at maximum dissimilarity and is given as:
$$\mathrm{SCD} = \sum_{i=1}^{n} \left( Fp_i^{1/2} - Mp_i^{1/2} \right)^2, \tag{1}$$
where SCD is the dissimilarity coefficient, $Fp_i$ is the proportion of species i within the fossil assemblage and $Mp_i$ is the proportion of species i within the modern assemblage (Overpeck et al., 1985).
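As an illustration of Eq. (1) and of the search for the least dissimilar modern assemblage, the following is a minimal sketch in Python rather than MATTOOLS code. The function names `squared_chord_distance` and `best_analog` and the three-taxon proportions are hypothetical, and assemblages are assumed to be lists of proportions over the same taxa in the same order.

```python
import math

def squared_chord_distance(fossil, modern):
    """Eq. (1): sum over taxa of (sqrt(Fp_i) - sqrt(Mp_i))^2 for proportions."""
    if len(fossil) != len(modern):
        raise ValueError("assemblages must list the same taxa in the same order")
    return sum((math.sqrt(f) - math.sqrt(m)) ** 2 for f, m in zip(fossil, modern))

def best_analog(fossil, modern_set):
    """Return (index, SCD) of the modern assemblage least dissimilar to the fossil one."""
    distances = [squared_chord_distance(fossil, m) for m in modern_set]
    index = min(range(len(distances)), key=distances.__getitem__)
    return index, distances[index]

# Fictitious three-taxon proportions (each row sums to 1).
fossil = [0.5, 0.3, 0.2]
modern_set = [
    [0.5, 0.3, 0.2],   # identical to the fossil sample
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
index, scd = best_analog(fossil, modern_set)

# The two ends of the SCD range stated in the text.
identical = squared_chord_distance(fossil, fossil)          # perfect analog
disjoint = squared_chord_distance([1.0, 0.0], [0.0, 1.0])   # no shared taxa
```

Here `identical` is zero and `disjoint` is two, matching the range of SCD given above.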
Frequent arbitrary choices in the MAT have included the critical limit of dissimilarity or cutoff value used to distinguish between a good vs. poor analog as well as how many analogs to retain for estimating the past environment as an average prediction. Attempts to make these choices less subjective generally utilize the modern dataset itself within modern calibration or training (Birks, 1998). For each dataset, modern calibration allows determining: (a) a critical threshold of dissimilarity that can be used to separate good vs. poor analogs, and (b) the optimal number of analogs to retain in reconstructions.
Fig. 1. Modern analog technique (MAT) applied to fictitious pollen assemblages. A fossil pollen spectrum, e.g., at 1000 yrs before present,
is compared with modern spectra from a range of vegetation types and environments using some distance metric, e.g., squared chord
distance. Modern pollen spectrum with minimum distance is considered best modern analog. From this, we infer that ecosystem around
modern pollen-sampling site is an analog for past ecosystem centered on fossil pollen-sampling site. Distance metrics from Overpeck et al.
(1985).
2.1. Determining a critical threshold of dissimilarity
The fundamental question as to what constitutes a good analog is most often addressed by choosing a numeric threshold of dissimilarity. Paired samples with dissimilarities less than this threshold are considered analogs; samples with dissimilarities greater than this threshold are considered non-analogs. The derivation of the threshold value itself has been either based upon comparisons between modern biological assemblages, the so-called ‘training’ dataset (Birks, 1998), or adopted uncritically from other studies (Jackson and Williams, 2004; Sawada et al., 2004). In either case, the chosen threshold is then applied ‘down-core’ to reconstruct paleoenvironments by rejecting as analogs any modern assemblages that exceed the threshold. However, any critical threshold of dissimilarity will be a function of the number of taxa included within a given dataset (Waelbroeck et al., 1998; Sawada et al., 2001; Sawada et al., 2004) and so critical thresholds need to be tailored to any MAT reconstruction (Sawada et al., 2001; Sawada et al., 2004; Wahl, 2004). Tailoring critical thresholds to a modern dataset is computationally intensive and based on numerous pairwise comparisons between assemblages within the modern training dataset. MATTOOLS implements two methods for assessing the critical limit of dissimilarity for a given MAT application.
Monte Carlo: Anderson et al. (1989), Bartlein and Whitlock (1993) and Birks (1995) utilized the empirical distribution of dissimilarity coefficients generated by all pairwise comparisons between modern samples to determine those thresholds that occur infrequently. This technique is computationally burdensome for large modern collections of assemblages such as that described in Whitmore et al. (2005). An alternative is to undertake a Monte Carlo simulation using paired comparisons between randomly chosen modern assemblage pairs in order to determine threshold values that are unlikely, in a statistical sense, to occur by chance alone (Sawada et al., 2004). For example, a threshold value that occurs by chance at most five times in one hundred (0.05) would correspond to an α = 0.05 level of significance. Monte Carlo derived cutoff values suffer from Type I and Type II statistical errors and, like other techniques based on pairwise comparisons of modern assemblages, are designed to quantify the measure of risk associated with false positive analogs (Gavin et al., 2003; Wahl, 2004). False positive analogs occur when the threshold value applied chooses an assemblage that is in reality from a different biological zone. It is pertinent to note that statistical significance does not necessarily connote ecological significance. The Monte Carlo approach does not fully identify false negative analogs, that is, analogs with high dissimilarity that belong to the same biological zone. The advantage of the Monte Carlo approach is that it does not require a priori knowledge of the biological zonation or mutual affinity of the assemblages within the modern dataset.
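The Monte Carlo idea can be sketched as follows. This is an illustrative Python sketch, not the mat.mc implementation: the names `monte_carlo_threshold` and `random_assemblage`, the number of simulated pairs and the randomly generated five-taxon data are all assumptions. The threshold is read here as the lower-tail empirical quantile, i.e. the dissimilarity that random modern pairs fall at or below in roughly a fraction α of draws.

```python
import math
import random

def scd(a, b):
    """Squared chord distance between two vectors of proportions (Eq. (1))."""
    return sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))

def monte_carlo_threshold(modern_set, alpha=0.05, n_pairs=1000, seed=0):
    """Dissimilarity at the lower-tail alpha quantile of randomly paired modern samples."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_pairs):
        a, b = rng.sample(modern_set, 2)   # two distinct assemblages, chosen at random
        sims.append(scd(a, b))
    sims.sort()
    k = max(int(alpha * n_pairs) - 1, 0)   # index of the empirical alpha quantile
    return sims[k]

def random_assemblage(rng, n_taxa=5):
    """Fictitious assemblage: random positive counts normalised to proportions."""
    counts = [rng.random() for _ in range(n_taxa)]
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(42)
modern_set = [random_assemblage(rng) for _ in range(50)]
t05 = monte_carlo_threshold(modern_set, alpha=0.05)
t50 = monte_carlo_threshold(modern_set, alpha=0.50)
```

Because both calls use the same seed, they rank the same simulated pairs, so the α = 0.05 threshold cannot exceed the α = 0.50 threshold.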
ROC: When knowledge exists regarding the biological zone membership for each modern assemblage, it is possible to determine an optimal threshold of dissimilarity. Here, the optimal dissimilarity is that value which maximizes the probability of true positive analogs while minimizing the probability of false positive ones. For this purpose, empirical receiver operating characteristic (ROC) analysis has been adapted to the MAT from the fields of diagnostic medical testing and signal analysis (Gavin et al., 2003; Oswald et al., 2003; Wahl, 2004).
Gavin et al. (2003) and Wahl (2004) develop a general framework for the application of ROC to the MAT. The technique is explained below and follows the notation of De Long et al. (1988). ROC analysis requires firstly that each modern assemblage be classified into biological zones of true membership. Zonal membership is clearly a function of spatial scale. For example, with a pollen dataset of N samples, m of the modern assemblages could be assigned to a particular biome (leaving N − m assemblages unassigned). For each of these m assemblages, the minimum dissimilarity is found among all other m assemblages, creating a set $X = \{1, 2, \ldots, m\}$ of minimum dissimilarity values. Alternately the top two, five, etc. dissimilarities could be retained. The number of analogs retained can be varied and could include as many as all pairwise dissimilarity values within a particular biome (cf. Wahl, 2004). The resultant set $X = \{1, 2, \ldots, m\}$ identifies the distribution of dissimilarity expected between true or good analogs (Figs. 2A and 2B, first panels). This distribution assumes that all samples were correctly classified. At this stage, errors not introduced in the pollen counting and identification process are due to incorrect assignment of a sample to a biome type. So, incorrect sample assignments yield dissimilarity values that represent comparisons of very different assemblages and these are called false positive analogs. Next, a second set $Y = \{1, 2, \ldots, n\}$ (where $n = N - m$) of minimum dissimilarity values is created by taking each of the n assemblages and searching for minimum dissimilarities within the m assemblages (Figs. 2A and 2B, first panels). This set represents the distribution of dissimilarities for paired samples from different biomes. Given these two distributions of dissimilarity values X and Y, at any given threshold of dissimilarity, z, one can determine the proportion of true analogs that are also classified as true (true positive fraction, TPF),
$$\mathrm{TPF}(z) = \frac{1}{m} \sum_{i=1}^{m} I(X_i \le z), \tag{2}$$
where $I(X_i \le z) = 1$ when the condition is true and zero otherwise. The proportion correctly classified as non-analogs at the same cutoff value is given as the true negative fraction (TNF),
$$\mathrm{TNF}(z) = \frac{1}{n} \sum_{i=1}^{n} I(Y_i > z), \tag{3}$$
where $I(Y_i > z) = 1$ when the condition is true and zero otherwise. An empirical ROC curve is constructed by varying the values of z at regular intervals across the range of possible dissimilarity values, for example $\{z \in \mathbb{R} \mid 0 \le z \le 2\}$ for squared chord distance (Eq. (1)), and plotting the false positive error $\mathrm{FPE} = 1 - \mathrm{TNF}$ against the TPF (Eqs. (2) and (3), Figs. 2A and 2B, second panels). The dissimilarity value at which TPF − FPE is maximal, $z = \arg\max_z \left( \mathrm{TPF}(z) - \mathrm{FPE}(z) \right)$, represents the cutoff value that maximizes the true positive analogs and minimizes the false positive ones, that is, the value of joint minimization (Gavin et al., 2003; Wahl, 2004) (Figs. 2A and 2B, third panels).
Fig. 2. Example output from ROC plotting function mat.plotroc. Overlapping histograms on left indicate relative frequencies of dissimilarity values between within zone comparisons (black) and outside of zone to within zone (gray) comparisons; gray arrow indicates optimal dissimilarity value. Center graphs illustrate ROC curves where intersection of y = x line with ROC curve indicates optimal value that minimizes probability of false positive analogs. Right histograms illustrate specificity minus sensitivity over range of dissimilarity coefficient; where this value is maximal, probability of false positive analogs is minimized, as indicated by gray vertical bars within histograms.
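The threshold scan described by Eqs. (2) and (3) can be sketched as below. This is an illustrative Python sketch, not the mat.roc implementation: the function names, the 0.01 scan step and the two small sets of dissimilarities are hypothetical. `x` stands for the within-zone minimum dissimilarities (set X) and `y` for the between-zone ones (set Y).

```python
def tpf(x, z):
    """Eq. (2): fraction of within-zone dissimilarities X_i <= z."""
    return sum(1 for v in x if v <= z) / len(x)

def tnf(y, z):
    """Eq. (3): fraction of between-zone dissimilarities Y_i > z."""
    return sum(1 for v in y if v > z) / len(y)

def optimal_threshold(x, y, step=0.01, z_max=2.0):
    """Scan z over [0, z_max] at regular intervals; return (z, TPF - FPE) at the maximum.

    Ties are resolved in favour of the smallest z (an assumption of this sketch).
    """
    best_z, best_score = 0.0, float("-inf")
    n_steps = int(round(z_max / step))
    for k in range(n_steps + 1):
        z = k * step
        score = tpf(x, z) - (1.0 - tnf(y, z))   # TPF - FPE, with FPE = 1 - TNF
        if score > best_score:
            best_z, best_score = z, score
    return best_z, best_score

# Fictitious minimum dissimilarities (SCD) for illustration only.
x = [0.052, 0.081, 0.103, 0.124, 0.305]   # within-zone comparisons
y = [0.255, 0.290, 0.553, 0.702, 0.904]   # outside-zone to within-zone comparisons
z_opt, score = optimal_threshold(x, y)
```

With these values the scan settles on a cutoff of about 0.13, where four of the five within-zone comparisons are accepted and none of the between-zone comparisons are.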
The area under the ROC curve, AUC, is a measure of the diagnostic performance of the dissimilarity metric in distinguishing between similar versus different biological assemblages (Gavin et al., 2003). The AUC can be calculated using trapezoidal integration (default in MATTOOLS) or calculated directly from the nonparametric Mann–Whitney–Wilcoxon¹ rank-sum statistic, which is equivalent to the AUC (Metz et al., 1998b; Copas and Corbett, 2002; Gavin et al., 2003; Yan et al., 2003). The AUC can range from AUC = 0.5, when the metric has no ability to distinguish between true and false analogs (in other words, the performance is no better than a coin toss), to AUC = 1, when the within-zone and between-zone dissimilarity distributions (X and Y) do not overlap. Oswald et al. (2003) and Gavin et al. (2003) used AUC as a measure of the ability of different dissimilarity metrics to distinguish between like and unlike taxon assemblages, thereby determining the best metric overall for a set of biological zones, or one specific region, or one type of dataset or a particular type of biological assemblage.
The empirical distributions of Eqs. (2) and (3) are often non-Gaussian (Metz et al., 1998a; Gavin et al., 2003) and so nonparametric methods are more frequently used in calculating the standard errors of the AUC. The standard error of the AUC, $SE_{AUC}$, can be calculated from the nonparametric Mann–Whitney–Wilcoxon rank-sum statistic or alternatively by the method of Hanley and McNeil (1982), which is computationally faster for large datasets and is the default method for calculating $SE_{AUC}$ in MATTOOLS.
¹ These may be referred to separately as the Mann–Whitney U and Wilcoxon rank-sum tests (http://jsekhon.fas.harvard.edu/stats/html/wilcox.test.html). They are equivalent tests under the same assumptions.
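The two routes to the AUC and the Hanley and McNeil (1982) standard error can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not MATTOOLS code: the function names and data are hypothetical, thresholds are placed at every observed dissimilarity, and tied pairs are counted as one half in the rank-based version. The standard error uses the closed-form approximation $Q_1 = A/(2-A)$ and $Q_2 = 2A^2/(1+A)$ from Hanley and McNeil (1982).

```python
import math

def roc_points(x, y):
    """(FPE, TPF) pairs with one threshold per observed dissimilarity, starting at (0, 0)."""
    pts = [(0.0, 0.0)]
    for z in sorted(set(x + y)):
        tpf = sum(1 for v in x if v <= z) / len(x)
        fpe = sum(1 for v in y if v <= z) / len(y)   # FPE = 1 - TNF
        pts.append((fpe, tpf))
    return sorted(pts)

def auc_trapezoid(points):
    """Area under the empirical ROC curve by trapezoidal integration."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def auc_mann_whitney(x, y):
    """AUC as the proportion of (X_i, Y_j) pairs with X_i < Y_j; ties count one half."""
    wins = sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in x for b in y)
    return wins / (len(x) * len(y))

def se_hanley_mcneil(auc, m, n):
    """Standard error of the AUC for m true analogs and n non-analogs."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (m - 1) * (q1 - auc ** 2)
           + (n - 1) * (q2 - auc ** 2)) / (m * n)
    return math.sqrt(var)

# Fictitious minimum dissimilarities for illustration only.
x = [0.052, 0.081, 0.103, 0.124, 0.305]   # within-zone (set X)
y = [0.255, 0.290, 0.553, 0.702, 0.904]   # between-zone (set Y)
auc_t = auc_trapezoid(roc_points(x, y))
auc_mw = auc_mann_whitney(x, y)
se = se_hanley_mcneil(auc_mw, len(x), len(y))
```

For these values both estimates give an AUC of 0.92 (23 of the 25 X–Y pairs are correctly ordered), which illustrates the equivalence noted in the text.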
Software is available for the computation of ROC curves, e.g., ROCKIT (Metz et al., 1998a) and Analyse-It for Microsoft Excel (Analyse-It Software Ltd., 2004) among many others. However, these packages are extremely difficult to apply to ROC analysis in the context of MAT applications. MATTOOLS implements the comparison framework of Gavin et al. (2003) as described above for the computation of optimal dissimilarity thresholds and includes a number of diagnostic plots.
2.2. The number of analogs to retain in reconstructions
The central issue here is simple: the single best modern analog, as measured by some metric, may not in reality yield the most similar environment to a fossil assemblage (Jackson and Williams, 2004). Why? Because there are many environmental factors that can shape biological assemblages (Jackson and Williams, 2004) and these are only confounded by analytical errors (cf. Mosimann, 1965). Consequently, nearly identical assemblages can arise from different environments. As such, the ‘true’ best analog for a fossil assemblage may not be the modern sample that is least dissimilar. Nevertheless, for application's sake, we must assume that the ‘true’ best modern analog is included somewhere within the top-n modern analogs for a given fossil assemblage. Exactly what n should be is unclear; in most studies the number of analogs retained has been arbitrary, ranging from 1 to 10 or more (reviewed in Jackson and Williams, 2004; Sawada et al., 2004). Sawada et al. (2004) further suggest that the optimal number of analogs to retain will depend on the environmental variable undergoing reconstruction. In the terrestrial case, for example, mean annual temperature and total annual precipitation each have a unique spatial structure and scale of influence. Thus, it should not be surprising that the optimal number of retained analogs for different environmental variables should differ.
Waelbroeck et al. (1998) introduced a ‘jump’ method to overcome the arbitrary choice of how many analogs to retain in estimating the environment of a fossil assemblage. The ‘jump’ method (JM) works in the following way for each assemblage:
1. All dissimilarity values are sorted in ascending order;
2. Some metric, α, that measures the change between adjacent values is determined. For example, the percentage increase in dissimilarity of each analog with respect to its predecessor is calculated as
$$\alpha = \left( \frac{\mathrm{SCD}_{i+1}}{\mathrm{SCD}_{i}} - 1 \right) \times 100 \tag{4}$$
and is the most frequently used jump measure (Waelbroeck et al., 1998; Williams, 2003; Sawada et al., 2004). This is the default method in MATTOOLS for comparison;
3. Using the metric, α, each dissimilarity value is compared to the previous;
4. When the metric meets or exceeds some predefined level (α = C, where C is a constant), this value represents a discontinuity or ‘jump’ in dissimilarity;
5. Retained in the calculation of the environmental estimate are the environmental values of those analogs that precede the ‘jump’ or discontinuity.
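The five steps above can be sketched as follows. This is an illustrative Python sketch, not the MATTOOLS implementation: the function names and data are hypothetical, the environmental estimate is taken as an unweighted mean of the retained analogs, and a zero predecessor (for which Eq. (4) is undefined) is treated as a jump. Both of the latter are assumptions of this sketch.

```python
def percent_increase(prev, curr):
    """Eq. (4): percentage increase of a dissimilarity over its predecessor."""
    if prev == 0:
        # Assumption: Eq. (4) is undefined for a zero predecessor; treat as a jump.
        return float("inf") if curr > 0 else 0.0
    return (curr / prev - 1.0) * 100.0

def retained_by_jump(dissimilarities, critical_alpha):
    """Steps 1-4: sort, then return the values that precede the first jump (all if none)."""
    d = sorted(dissimilarities)
    for i in range(1, len(d)):
        if percent_increase(d[i - 1], d[i]) >= critical_alpha:
            return d[:i]
    return d

def jump_estimate(analogs, critical_alpha):
    """Step 5: mean environmental value of the analogs that precede the jump.

    analogs is a list of (dissimilarity, environmental value) pairs.
    """
    ordered = sorted(analogs)
    keep = len(retained_by_jump([d for d, _ in ordered], critical_alpha))
    values = [v for _, v in ordered[:keep]]
    return sum(values) / len(values)

# Fictitious analogs: (SCD, mean annual temperature in degrees C).
analogs = [(0.30, 15.0), (0.10, 5.0), (0.12, 7.0), (0.11, 6.0), (0.31, 16.0)]
kept = retained_by_jump([d for d, _ in analogs], critical_alpha=50.0)
estimate = jump_estimate(analogs, critical_alpha=50.0)
```

In this example the successive increases are roughly 10%, 9% and 150%, so with a critical level of 50% the first three analogs are retained and the estimate is their mean, 6.0.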
In the case where no ‘jump’ is found, the estimated environmental value can approach the average of the dataset (Sawada et al., 2004). Adoption of the overall environmental variable average as a fossil estimate is clearly problematic if the modern set is known to span numerous different environmental gradients, for example the North American modern pollen dataset (Whitmore et al., 2005). To solve this issue, and in the absence of a jump in dissimilarity, Waelbroeck et al. (1998) suggest retaining some arbitrary number of analogs for such cases. Conversely, in such cases, Williams (2003) suggests retaining only those analogs below a predefined critical threshold of dissimilarity. Nevertheless, within large modern datasets with thousands of assemblages, like that of Whitmore et al. (2005), hundreds or thousands of analogs can be retained for a given assemblage simply because there is no distinguishing “jump” or breach of a chosen critical threshold. As such, an arbitrary choice of retaining the top-n modern analogs for some assemblages seems unavoidable (Sawada et al., 2004). For large datasets, some reasonable number of the top-n may be retained and the jump method applied according to the modification of Williams (2003). However, neither Waelbroeck et al. (1998) nor Williams (2003) specify how to determine the critical ‘jump’ value, α, as described above.
Fig. 3. Output of jump analysis for optimizing alpha for a given environmental reconstruction using function mat.plotjump. [Plot of Pearson's R (correlation), axis range 0.888–0.900, against alpha, axis range 0–100; annotated optimum at alpha = 20.]
Building on the modifications of Williams (2003), Sawada et al. (2004) utilized a modified jump method (MJM), which is implemented in MATTOOLS. Briefly, in applying the MJM, the user retains the top-n analogs and the program then determines an optimal value of α based on comparisons between biological assemblages within the modern dataset. The optimal value of α is that which maximizes the linear correlation between the observed and predicted environmental values across all samples (Sawada et al., 2004). For N modern assemblages, the 2nd to ith dissimilarity values for each are searched using Eq. (4). The search commences with α at 0% and proceeds to 100% in steps of 1%. For each modern biological assemblage, analogs with a percentage change less than the current level of α are retained and their environmental variable is averaged to produce a modern prediction for that assemblage. For each level of α, the predicted environmental values are correlated with the observed environmental values for the N modern assemblages and the correlation coefficients at each α level are retained. Finally, the coefficient of correlation is plotted against the levels of α to determine which value of α provides the maximum correlation between the observed and predicted environmental variable (Fig. 3).
In the MJM, the value of α that provides maximum correlation is sensitive to the number of analogs upon which it is tested (Sawada et al., 2004). For example, the optimal value of α will be different if the top 100 vs. top 10 analogs are retained. As such, α should be determined for each modern dataset or subset. In addition, α will vary when different environmental variables are predicted. The MATTOOLS function mat.jump optimizes the value of α for a given number of analogs in a modern dataset.
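The search over α described above can be sketched as follows. This is an illustrative Python sketch, not the mat.jump implementation: the function names and the four fictitious modern samples are hypothetical, and the sketch assumes that the closest analog is always retained, that analogs are retained up to the first percentage change at or above α, and that the first α reaching the maximum correlation is reported.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient; NaN when either series is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    if saa == 0 or sbb == 0:
        return float("nan")
    return sab / math.sqrt(saa * sbb)

def predict(ranked, alpha):
    """Mean environmental value of the analogs preceding the first jump >= alpha.

    ranked holds the top-n (dissimilarity, env) pairs in ascending dissimilarity.
    """
    keep = 1                                   # assumption: closest analog always kept
    for i in range(1, len(ranked)):
        prev, curr = ranked[i - 1][0], ranked[i][0]
        change = (curr / prev - 1.0) * 100.0 if prev > 0 else float("inf")   # Eq. (4)
        if change >= alpha:
            break
        keep = i + 1
    return sum(env for _, env in ranked[:keep]) / keep

def optimize_alpha(ranked_per_sample, observed, alphas=range(0, 101)):
    """Return (best alpha, best r, [(alpha, r), ...]) over 0% to 100% in 1% steps."""
    best_alpha, best_r, curve = None, float("-inf"), []
    for alpha in alphas:
        predicted = [predict(r, alpha) for r in ranked_per_sample]
        r = pearson_r(predicted, observed)
        curve.append((alpha, r))
        if not math.isnan(r) and r > best_r:
            best_alpha, best_r = alpha, r
    return best_alpha, best_r, curve

# Fictitious top-3 analogs for four modern samples and their observed values.
observed = [10.0, 12.0, 20.0, 22.0]
ranked_per_sample = [
    [(0.10, 12.0), (0.1055, 9.0), (0.30, 21.0)],
    [(0.10, 10.0), (0.1105, 13.5), (0.33, 20.0)],
    [(0.10, 22.0), (0.1085, 18.5), (0.40, 11.0)],
    [(0.10, 20.0), (0.1125, 23.5), (0.35, 12.0)],
]
best_alpha, best_r, curve = optimize_alpha(ranked_per_sample, observed)
```

Plotting the `curve` pairs gives the kind of diagnostic shown in Fig. 3; in this small example the correlation first peaks at α = 13, the lowest level at which every sample retains its two closest analogs.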
3. Existing MAT solutions
At present, researchers requiring the MAT can choose among the programs designed specifically for MAT reconstructions such as ANALOG (Schweitzer, 1995, 1999), MODPOL (Maher, 2000), SIMMAX (Pflaumann et al., 1996), RAM98 (Waelbroeck et al., 1998) and programs designed for general transfer-function methods that include MAT such as PaleoToolBox (Sieger et al., 1999) and the commercial package C2 v1.3 (Juggins, 2003).² ANALOG is an MS-DOS based package with ANSI C source code that requires delimited text file input in addition to an ASCII ‘run-file’. ANALOG output is simple and offers only best analog assignments, but in a verbose text file requiring post-processing and reformatting. Similar in operation, SIMMAX is an MS-DOS executable with FORTRAN source code and utilizes a similarity metric that is suited to reconstructions based on planktonic foraminifera. The revised analog method (RAM) is available as FORTRAN source code and is particularly well suited to MAT assignments utilizing sparse modern calibration sets in ocean and marine environments. MODPOL is an MS-DOS application with VGA graphic display but is designed to work specifically with M70-type files distributed by the NGDC. Operating system and programming language evolution, in addition to new developments in the application of the MAT over the past few years, have left these programs outdated in functionality. At the time of their introduction, however, ANALOG, SIMMAX, RAM98 and MODPOL were the first freely available solutions for the calculation of the MAT.
² Juggins, S., 2003. C2 1.3: Win95/98/nt2000/xp program for analysing and visualizing paleoenvironmental data, http://www.campus.ncl.ac.uk/staff/Stephen.Juggins/software/c2home.htm
More recently, MS Windows software with functionality similar to the above MS-DOS counterparts has emerged, such as the freely available PaleoToolBox and the commercial package C2. PaleoToolBox offers a number of transfer function methods and includes the MAT as one option; therein, one can create project files that can then be used with a provided MS-DOS executable of the RAM program. PaleoToolBox has no graphical capabilities. More extensive functionality is found in C2, a commercial MS Windows package that computes a broad range of ecological and paleoecological transfer functions and ordination techniques, and provides graphical routines particularly suited to the plotting of stratigraphic diagrams. Although C2 has the ability to compute MAT assignments, the software is not specifically designed for this task and so contains the basic outputs with the addition of some diagnostic plots. While C2 complies with current Windows OS standards and can import files from standard spreadsheet programs, it is neither open nor extensible by the user. Functionality and extensibility issues with the above programs motivate the need for a more open and extensible solution for MAT computations.
4. MATTOOLS & the R environment
The ideal MAT software solution should:
1. Work with simple spreadsheet-like data-tables;
2. Provide a comprehensive environment with tools for data pre-processing, including such tasks as rearranging, formatting and variable transformations;
3. Provide a comprehensive set of graphical tools for visualization of data and results;
4. Include some of the more recent calibration methodologies;
5. Operate in a stable and freely available environment;
6. Be usable across all major platforms (Windows, MacOS, and Linux);
7. Have numerous learning resources and a large support network;
8. Allow easy extensions of MAT functions for individual research projects.
The R language for statistical computing provides an Open Source solution for the MATTOOLS package that satisfies all of the above criteria. Since 1996 (Ihaka and Gentleman, 1996), R has been enthusiastically developed by a core development team and researchers world-wide. The R environment (R Development Core Team, 2003) is now managed by the R Foundation for Statistical Computing. R is available under the GNU General Public License and offers a fully integrated environment largely based on the S language developed at Lucent Technologies (Becker et al., 1988). Grunsky (2002) reviews the definition, structure and philosophy of the R computing environment. R provides numerous statistical, mathematical and graphical functions that are completely extensible by individual researchers. Full programs and functions can be written in R that take advantage of existing functionality built into the R-core function set as well as functions within other user-donated R packages. Currently, there are more than one hundred supported libraries of functions that extend the base package functionality and allow for such things as interfacing with Geographical Information Systems (GIS) data formats for mapping in R (Lewin-Koh and Bivand, 2004)³ and spatial statistical libraries (Bivand, 2004)⁴ among others. R provides a free, mature and stable environment to implement MATTOOLS and expose its functionality to the Quaternary geoscience community.
³ Lewin-Koh, N.J., Bivand, R., 2004. Maptools 0.3-8. http://www.r-project.org/
⁴ Bivand, R., 2004. The spdep package 0.2-11. http://www.r-project.org/
Table 1
High-level functions used in MATTOOLS. Function details (arguments, examples etc.) can be accessed by typing “?” before the function name; for example, typing ?mat.dissim will open the help file for that function. Examples and descriptions of operation for each function are given in the text of this paper, with the exception of mat.roc.allpair.

| High-level function | Description |
|---|---|
| mat.dissim | Creates dissimilarity object between either a modern dataset and itself or a modern and fossil dataset. This list object contains matrices and parameters that are used in both optimizations of critical choices as well as environmental reconstructions |
| mat.mc | Undertakes a Monte Carlo simulation of pairwise modern spectra to find critical thresholds of dissimilarity and associated probabilities |
| mat.plot.mc | Plots output of mat.mc |
| mat.roc | Undertakes receiver operating characteristic (ROC) analysis on a modern dataset when each row/site has a nominal class defining biological zone membership. This function undertakes ROC on a within-zone by out-zone basis |
| mat.roc.allpair | Undertakes receiver operating characteristic (ROC) analysis on a modern dataset when each row/site has a nominal class defining biological zone membership. This function undertakes ROC on a pairwise zone-by-zone basis (all possible combinations of zones) |
| mat.plotroc | Plots output from either mat.roc or mat.roc.allpair |
| mat.jump | Undertakes jump method on a mat.dissim object to determine optimal alpha threshold for a given number of retained analogs and an environmental variable |
| mat.plotjump | Plots result of mat.jump for visualization of optimal alpha value |
| mat.jumrecon | Applies jump method downcore for fossil reconstruction |
| mat.fossavg | Constructs a data table that includes a choice of various weighted averages of top-n analogs |
| mat.plot.recon | Plots output of mat.fossavg |
5. The MATTOOLS package
MATTOOLS implements a number of functions for calculation of the MAT and associated measures that underlie the critical decisions in the application of the MAT. A list of primary functions with brief descriptions is provided in Table 1, and full details can be found within the help files provided with the MATTOOLS package.
6. Example application of MATTOOLS: assessing critical limits of dissimilarity for use in a paleoclimate reconstruction of a fossil pollen core
The following provides an example session in R that utilizes both fossil and modern pollen datasets. This involves (a) loading MATTOOLS, (b) getting the data into the R environment, and (c) determining critical thresholds based on modern calibration, followed by a paleotemperature reconstruction based on fossil pollen data.
6.1. Loading MATTOOLS
MATTOOLS is loaded as a standard library from the “Packages” menu within R by choosing the item “Install package(s) from local (zip) files…”. One then navigates to the folder containing MATTOOLS.zip and chooses this file. The user again chooses “Packages” from the menu and the item “Load package…”, which will reveal a window showing a list of all installed packages. Scroll to and select the item mattools and click on the OK button.
6.2. Getting data into the R environment
The R environment provides a number of functions for importing data from common data management and statistical analysis packages. The simplest function is 'read.csv', which reads comma-separated text files; these can be exported from MS Excel™ and most other statistical software. More sophisticated input/output
Fig. 4. (A) Modern calibration dataset: a file (e.g., a text file like a comma-separated values file [*.csv]) containing field names in first row of
modern calibration dataset where each subsequent row contains a sample identifier (Sample ID), coordinates in either a planar/projected
x,y system or as longitude and latitude in decimal degrees, and taxon abundances followed by modern environmental variables (Mod.Env 1, …, Mod.Env n)
that will be used for modern training and/or paleoenvironmental reconstruction. Final and optional field would contain,
for each row, a nominal code representing biological zone to which each row/site belongs. (B) Fossil dataset: this file is only necessary if
one will be undertaking an environmental reconstruction for a fossil dataset. Structure of file is same as (A) where each row contains same
information but now represents a unique age or depth in a fossil core. In this file modern environmental variables observed at locations of
fossil cores are only required if anomalies between reconstructed and observed environments will be computed. Note: Order and number of
taxon fields in each should be identical, other fields can vary.
functions are found in the package 'foreign' that is part of the core R distribution. For MATTOOLS, data should be formatted in at most two files, using the file structure shown in Fig. 4. The modern data used in this example and included within the MATTOOLS package come from a subset of sites within a modern pollen research database contained in an Excel worksheet (Whitmore et al., 2005), together with a single fossil pollen core from Zagoskin Lake (Ager, 2003) containing 60 levels ranging from 0 BP to 30 000 BP. Zagoskin Lake is located on St. Michael Island on the western coast of Alaska (162.6W, 63.27N). The data are first loaded into the R environment from two datasets within the Data\zagoskin subfolder of the installed MATTOOLS package (footnote 5):
    modpoll = read.csv("c://zagoskin//modpoll.csv")
    zag = read.csv("c://zagoskin//zagoskin.csv")
Alternatively, one can load these example datasets directly from the Internet using a URL within the read.csv function:
    modpoll = read.csv("http://www.geomatics.uottawa.ca/mattools/modpoll.csv")
    zag = read.csv("http://www.geomatics.uottawa.ca/mattools/zagoskin.csv")
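The file layout of Fig. 4 and the requirement that taxon fields match between the modern and fossil files can be checked before any analysis. The following is a minimal sketch in base R, not part of MATTOOLS; the file contents, the column names, and the object names mod_demo, foss_demo and taxa_cols are hypothetical and chosen only for illustration.

```r
# Hypothetical miniature files following the Fig. 4 layout (not the real data).
modern_csv <- tempfile(fileext = ".csv")
fossil_csv <- tempfile(fileext = ".csv")

writeLines(c(
  "SampleID,Lon,Lat,Picea,Betula,Alnus,JulyT,Zone",
  "m1,-162.6,63.3,120,60,20,11.5,Tundra",
  "m2,-150.1,61.2,200,30,10,14.2,Taiga"
), modern_csv)

writeLines(c(
  "Depth,Lon,Lat,Picea,Betula,Alnus",
  "10,-162.6,63.27,90,80,30",
  "20,-162.6,63.27,40,110,50"
), fossil_csv)

mod_demo  <- read.csv(modern_csv)
foss_demo <- read.csv(fossil_csv)

# Columns holding taxon abundances (the same positions in both files here).
taxa_cols <- 4:6

# Order and number of taxon fields should be identical in the two files.
same_taxa <- identical(names(mod_demo)[taxa_cols], names(foss_demo)[taxa_cols])
print(same_taxa)

# Inspect column types to confirm that the abundances were read as numbers.
str(mod_demo)
```

With real data, the same check would use the taxon column range passed to the MATTOOLS functions (10:113 in the example session).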
6.3. Determination of optimal threshold of dissimilarity via modern calibration
The MATTOOLS functions mat.roc and mat.mc are the two main functions for critical threshold determination using either ROCs or the Monte Carlo approach (according to a user-specified level of significance).
For example, the function 'mat.mc' creates a Monte Carlo distribution for the dataset modpoll, which contains taxon counts (footnote 6) for 104 taxa in columns 10 to 113. Ten thousand comparisons are undertaken, and dissimilarity values are requested by the user through the argument 'probs' at significance levels of 0.1, 0.05, etc.; the results are shown in Fig. 5.
    critval = mat.mc(modpoll, modTaxa = 10:113, sampleSize = 10000, probs = c(0.1, 0.05, 0.025, 0.01, 0.001), counts = TRUE)
    mat.plot.mc(critval)  # Plot histogram and cumulative frequencies
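The idea behind a Monte Carlo critical value can be sketched in a few lines of base R: draw random pairs of modern samples, compute their dissimilarity, and read off the lower-tail quantiles of the resulting distribution. This is an illustrative sketch on synthetic data and is not the mat.mc source code; the helper sq_chord, the synthetic count matrix, and the use of quantile() are assumptions made for the example.

```r
set.seed(1)

# Squared chord distance between two vectors of proportions.
sq_chord <- function(p, q) sum((sqrt(p) - sqrt(q))^2)

# Synthetic count matrix: 50 samples x 8 taxa (illustrative only).
counts <- matrix(rpois(50 * 8, lambda = 20), nrow = 50)
props  <- counts / rowSums(counts)   # convert counts to proportions

# Draw random pairs of distinct samples and record their dissimilarity.
n_draws <- 2000
pairs <- replicate(n_draws, sample(nrow(props), 2))   # 2 x n_draws index matrix
d <- apply(pairs, 2, function(ij) sq_chord(props[ij[1], ], props[ij[2], ]))

# Dissimilarity values at the requested lower-tail probabilities.
probs <- c(0.1, 0.05, 0.025, 0.01, 0.001)
crit  <- quantile(d, probs = probs)
print(crit)
```

A pair of samples whose dissimilarity falls below the value at, say, the 0.05 level is more alike than 95% of randomly drawn pairs, which is the sense in which such a value acts as a critical threshold.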
5 The user may want to copy the zagoskin folder to the local c:\ drive for the read.csv example.
6 Taxon counts are the default input data measure; however, most MATTOOLS functions allow the user to specify whether the input data are in proportions or counts. If counts are input, they are automatically converted to proportions.
Fig. 5. (A) Distribution of dissimilarity values from Monte Carlo pairwise simulation using function mat.mc and mat.plot.mc on modern
dataset; (B) cumulative distribution function across range of dissimilarity with heavy portion of line indicating user specified tail of
significance; (C) a zoom view of user specified tail of significance with corresponding critical thresholds indicated by intersecting
horizontal/vertical gray lines with numbers annotated on these lines indicating threshold (x-axis) and corresponding Monte Carlo p-value
on y-axis.
critval is a list object with associated Monte Carlo probabilities in the component "cutoffs", e.g.,
    > critval$cutoffs
    $x
    [1] 0.100 0.050 0.025 0.010 0.001
    $y
    [1] 0.37052632 0.24913580 0.18030303 0.11696970 0.05090909
A list object is a general container that is like a computer folder with subfolders. Each subfolder is called a list component. Any dataset or R object can be stored in a list component, including other lists, much as different files can be stored in a computer folder. Because list components are independent of each other, list objects are a good way to store datasets with variable-length records or objects of different modes (say, vectors, data tables, and matrices). List objects aside and with reference to the example above, a squared chord distance of ≈0.25 corresponds to the α = 0.05 level of significance. The function mat.mc also has a second, more computationally intensive method called "allpairs" that uses all pairwise comparisons (Anderson et al., 1989; Bartlein and Whitlock, 1993) in determining the critical value distribution. This second method is preferable for small modern datasets. Argument details are provided in the help file for each function and can be accessed by typing "?" before the function name; for example, typing '?mat.mc' at the prompt will open the help file specific to that function.
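The folder analogy can be made concrete with a small base R sketch. The object critval_demo is hypothetical and is built by hand from the values printed for critval$cutoffs; only the component name "cutoffs" and its elements x and y come from the example session, and the real object returned by mat.mc may hold additional components.

```r
# Hand-built stand-in for the object returned by mat.mc (hypothetical name),
# using the values printed for critval$cutoffs in the example session.
critval_demo <- list(
  cutoffs = list(
    x = c(0.100, 0.050, 0.025, 0.010, 0.001),   # significance levels
    y = c(0.37052632, 0.24913580, 0.18030303,
          0.11696970, 0.05090909)               # dissimilarity thresholds
  )
)

# List components are addressed by name with the $ operator.
print(names(critval_demo))
print(names(critval_demo$cutoffs))

# Look up the threshold that corresponds to a chosen significance level.
alpha <- 0.05
thr <- critval_demo$cutoffs$y[critval_demo$cutoffs$x == alpha]
print(thr)
```

The looked-up value is the squared chord distance of about 0.25 quoted in the text for the 0.05 level.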
MATTOOLS provides a set of functions that tailor the ROC method to dissimilarity coefficient optimization for individual MAT applications. The use of ROC in MATTOOLS requires that a column (coded as text or numbers) exist in the modern dataset that identifies the biological zone membership, affiliation or assignment of each modern biological assemblage (i.e., each sample row of the database). The MATTOOLS function 'mat.roc()' produces ROC curves for each biological zone and also groups all within-zone dissimilarities and outside-of-zone dissimilarities to produce an overall ROC curve and general optimal dissimilarity value. For example, using the same modpoll dataset with each pollen sample classified as belonging to one of nine vegetation zones (Fedorova et al., 1994), the function mat.roc produces,
    modroc = mat.roc(modpoll, modTaxa = 10:113, colClasses = 116, rocEvalSeq = seq(0, 2, 0.02), counts = TRUE, aucmethod = "wilcox")

                        optDissVal       AUC       SEAUC nsol nWithin nOutside
    Mountain vegetation       0.08 0.7646936 0.009997428    1    1184     1007
    Oceanic meadows           0.10 0.9897317 0.013948721    1      26     2165
    Forest-tundra             0.10 0.9654532 0.010657753    1     145     2046
    Southern tundra           0.08 0.9112191 0.011684879    1     290     1901
    Typical tundra            0.32 0.8633824 0.031352690    1      56     2135
    Arctic tundra             0.20 0.9530608 0.016979984    1      76     2115
    Grasslands                0.16 0.9583138 0.009715350    1     209     1982
    Polar deserts             0.34 0.9861223 0.017194344    1      23     2168
    Central taiga             0.16 0.9552153 0.010761066    1     182     2009
    Overall                   0.16 0.9398312 0.003562054    1    2191    17528

By default, the argument 'numAnalogs = 2' provides ROC analyses based on the single best modern analog; increasing numAnalogs increases the number of analogs retained in the comparisons. The output of mat.roc is in the form of a list, with a summary of the analysis printed to the command window in R. Details on arguments, computations and output list components can be found in the associated help file. The output of mat.roc can be plotted using the function mat.plotroc (see Fig. 2).
    mat.plotroc(modroc)
The optimal dissimilarity value output from ROC analysis depends considerably on the accuracy of the
sample affiliations. Therefore, different results are produced by different biological sample zonations. In these examples, the zonation utilized for sample affiliations (Fedorova et al., 1994) is a rather coarse and generalized continental-scale vegetation zonation. As such, there tends to be considerable overlap between the TPF and
TNF distributions for most zones. An overall optimal value of dissimilarity (the last, "Overall", line of the output above) that maximizes the true positive analogs and minimizes the false positive ones would be ≈0.16, which corresponds roughly to the α = 0.025 level of significance derived from the Monte Carlo simulation approach. This overall ROC optimal dissimilarity value is calculated from all pairwise comparisons between zones.
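The logic of an ROC-based threshold can be illustrated with a short base R sketch on synthetic dissimilarities. It is not the mat.roc implementation: the data are simulated, the area under the curve is computed directly in its Wilcoxon/Mann-Whitney form, and the optimality rule used here (maximizing the difference between true and false positive fractions) is an assumption that may differ from the criterion used in MATTOOLS.

```r
set.seed(42)

# Synthetic dissimilarities (illustrative only): pairs from the same zone tend
# to be more alike than pairs from different zones.
within  <- rbeta(200, 2, 12) * 2   # same-zone pairs
outside <- rbeta(800, 4, 6) * 2    # different-zone pairs

# Candidate thresholds, matching the rocEvalSeq used in the example session.
thresholds <- seq(0, 2, 0.02)

# A pair is called an analog when its dissimilarity is at or below the threshold.
tpf <- sapply(thresholds, function(t) mean(within  <= t))  # true positive fraction
fpf <- sapply(thresholds, function(t) mean(outside <= t))  # false positive fraction

# AUC in its Wilcoxon/Mann-Whitney form: P(within < outside), ties count half.
auc <- mean(outer(within, outside, "<") + 0.5 * outer(within, outside, "=="))

# One common optimality rule (an assumption here): maximize TPF - FPF.
opt <- thresholds[which.max(tpf - fpf)]

print(auc)
print(opt)
```

The more the within-zone and outside-zone distributions overlap, the closer the AUC falls to 0.5 and the less sharply the optimal threshold is defined, which is the behavior described for coarse zonations.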
6.4. Number of samples to retain
The number of analogs to retain in a given fossil reconstruction can be optimized using the modified jump method (MJM) of Sawada et al. (2004). First a reconstruction object, here called modern.10, is created with the function mat.dissim (see ?mat.dissim for details):
    modern.10 = mat.dissim(inFossil = modpoll, inModern = modpoll, llMod = 3:4, modTaxa = 10:113, llFoss = 3:4, fosTaxa = 10:113, numAnalogs = 10, counts = TRUE, dist.method = "spherical")
This newly created modern.10 object contains the 10 best modern analogs for each modern sample. The modern.10 object is then sent to the function 'mat.jump()' to determine the optimal α value to use down-core when mean July temperature (in column 114 of the modpoll dataset) is considered:
    modern.10.jump = mat.jump(dObj = modern.10, inModern = modpoll, envColumn = 114, cutoff = 0.16)
    mat.plotjump(modern.10.jump)
With reference to Fig. 3, an α value of 20% is optimal when 10 analogs are retained for July temperature. This α value may change if a different environmental variable is utilized or a different number of analogs is chosen.
6.5. Applying modern calibration exercise to paleoenvironmental reconstruction
For a paleoenvironmental reconstruction, first a dissimilarity object that defines the top-n analogs for each level in the fossil core must be constructed. Once a dissimilarity object has been constructed, the MJM and the critical values derived from the modern calibration exercises can be applied. The reconstruction of the fossil core for Zagoskin Lake requires first that a dissimilarity object be created using the mat.dissim function:
    zag10.recon = mat.dissim(inFossil = zag, inModern = modpoll, llMod = 3:4, modTaxa = 10:113, llFoss = 3:4, fosTaxa = 10:113, numAnalogs = 10, counts = TRUE, dist.method = "spherical")
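What a dissimilarity object of top-n analogs holds can be sketched in base R as follows. The sketch uses synthetic counts and the squared chord distance; it is not the mat.dissim code, it ignores the geographic coordinates that mat.dissim also receives, and all object names (modern, fossil, dmat, top_idx, top_diss) are hypothetical.

```r
set.seed(7)

# Squared chord distance between two vectors of proportions.
sq_chord <- function(p, q) sum((sqrt(p) - sqrt(q))^2)

# Synthetic counts (illustrative only): 30 modern sites and 5 fossil levels, 6 taxa.
modern <- matrix(rpois(30 * 6, 25), nrow = 30)
fossil <- matrix(rpois(5 * 6, 25),  nrow = 5)

# Convert counts to proportions, as described for count input.
to_prop  <- function(m) m / rowSums(m)
modern_p <- to_prop(modern)
fossil_p <- to_prop(fossil)

# Dissimilarity of every fossil level (rows) to every modern site (columns).
dmat <- t(apply(fossil_p, 1, function(f) apply(modern_p, 1, sq_chord, q = f)))

n_analogs <- 10
# For each fossil level: indices and dissimilarities of the n closest modern sites.
top_idx  <- t(apply(dmat, 1, function(d) order(d)[seq_len(n_analogs)]))
top_diss <- t(apply(dmat, 1, function(d) sort(d)[seq_len(n_analogs)]))

print(dim(top_idx))        # one row per fossil level, one column per analog
print(round(top_diss, 3))
```

Each row of top_idx points back to rows of the modern dataset, so the environmental values of the retained analogs can be looked up for the reconstruction step.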
Once the dissimilarity object (now stored as a list object), consisting of the 10 closest dissimilarity scores for each fossil sample, has been created, the MJM can be applied to the reconstruction using the function 'mat.jumprecon':
    zag10.jump = mat.jumprecon(dObj = zag10.recon, modEnvCol = 114, fossEnvCol = 1:7, alpha = 20)
Alternatively, a reconstruction based on a weighted average of the top-n analogs can be utilized in addition to a cutoff value derived from the Monte Carlo or ROC methods above. Moreover, one can remove all values beyond a specified geographic distance (in meters). The function mat.fossavg constructs a data table that includes either a weighted average of the top-n analogs (possible weightings include inverse geographic distance, inverse dissimilarity, inverse rank-order, and equal weight (average)) or just the single best or top-n individual best analogs. The resultant table may have a single entry for each fossil sample or multiple entries for each fossil sample, depending on the choice of weighting and the number of analogs chosen when the mat.dissim function was run. For example, the function mat.fossavg is used in the following, with an equal-weight average of the top 10 best analogs, for Zagoskin Lake:
    zag10.avgrec = mat.fossavg(dObj = zag10.recon, modEnvCol = 114, fossCols = 1:7, wmethod = "equal.wt")
    mat.plot.recon(zag10.avgrec, inCritVal = critval)
Fig. 6 reveals the plot of this reconstruction utilizing all 10 best analogs for each sample.
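The weighting options named above can be illustrated for a single fossil sample with base R. The numbers are made up, the variable names are hypothetical, and the way the cutoff is applied here (dropping analogs whose dissimilarity exceeds it before averaging) is an assumption about one reasonable use of a threshold, not a description of mat.fossavg internals.

```r
# Hypothetical top-5 analogs for one fossil sample (values are made up).
diss   <- c(0.05, 0.08, 0.12, 0.20, 0.30)   # dissimilarity to each analog
env    <- c(12.1, 11.4, 13.0, 9.8, 15.2)    # e.g. July temperature at each analog site
cutoff <- 0.16                              # threshold from modern calibration

keep <- diss <= cutoff                      # drop analogs beyond the threshold

w_equal <- rep(1, sum(keep))                # equal weight (plain average)
w_invd  <- 1 / diss[keep]                   # inverse dissimilarity
w_rank  <- 1 / rank(diss[keep])             # inverse rank order

est_equal <- weighted.mean(env[keep], w_equal)
est_invd  <- weighted.mean(env[keep], w_invd)
est_rank  <- weighted.mean(env[keep], w_rank)

print(c(equal = est_equal, inv.dissim = est_invd, inv.rank = est_rank))
```

Inverse-dissimilarity weights are undefined when a fossil sample matches a modern sample exactly (dissimilarity of zero), so a practical implementation would need to handle that case separately.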
Fig. 6. (A) Zagoskin L. reconstruction of average July temperature for past 30,000 years from function mat.fossavg based on top 10
best analogs. Black line is estimate and dotted lines represent weighted standard deviation for given weight method; (B) plot of
dissimilarity with Monte Carlo derived significance levels from function mat.mc (see text).
7. Conclusions
MATTOOLS integrates recent advances within the modern analog technique into a single set of R functions contained in a distributable library that not only allows for paleoenvironmental reconstructions but also provides a set of tools for the assessment of critical decisions within the analog technique. These decisions include the threshold of dissimilarity and the number of analogs to retain for paleoenvironmental reconstruction. Because the aforementioned decisions are dataset specific, the inherent functionality within MATTOOLS is essential. Finally, because MATTOOLS is open source and programmed within the R language, it lends itself to easy modification and integration in larger projects.
Acknowledgements
This research is supported by grants to M. Sawada from the Canadian Foundation for Innovation, the Ontario Innovation Trust, and the Natural Sciences and Engineering Research Council (NSERC) of Canada. This paper represents a contribution to the NSERC-funded Climate Systems History and Dynamics Project. The comments of Dr. K. Gajewski, Dr. Jack Williams, and Dr. Roger Bivand on an earlier version of this manuscript significantly improved its content. The assistance of Dr. Andre E. Viau in compiling test datasets is gratefully acknowledged.
References
Ager, T.W., 2003. Late Quaternary vegetation and climate history of the central Bering land bridge from St. Michael island, Western
Alaska. Quaternary Research 60, 19–32.
Analyse-It Software Ltd., 2004. Analyse-it. Leeds, UK.
Anderson, P., Bartlein, P.J., Brubaker, L., Gajewski, K., Ritchie, J.C., 1989. Modern analogs of late Quaternary pollen spectra from the
western interior of North America. Journal of Biogeography 16, 573–596.
Bartlein, P.J., Whitlock, C., 1993. Paleoclimatic interpretation of the Elk lake pollen record. In: Bradbury, J.P., Dean, W.E. (Eds.), Elk
Lake, Minnesota: Evidence for Rapid Climate Change in the North-Central United States. US Geological Survey, Denver, Colorado,
Special Paper 276, pp. 275–294.
Becker, R.A., Chambers, J.M., Wilks, A.R., 1988. The New S Language. Chapman & Hall, London, pp. 702.
Birks, H.J.B., 1998. D.G. Frey & E.S. Deevey review #1—numerical tools in palaeolimnology—progress, potentialities, and problems.
Journal of Paleolimnology 20 (4), 307–332.
Copas, J.B., Corbett, P., 2002. Overestimation of the receiver operating characteristic curve for logistic regression. Biometrika 89 (2),
315–331.
Davis, M.B., 2000. Palynology after Y2K understanding the source area of pollen in sediments. Annual Review of Earth and Planetary
Sciences 28 (1), 1–18.
De Long, E.R., De Long, D.M., Clarke-Pearson, D.L., 1988. Comparing the areas under two or more correlated receiver operating
characteristic curves: a nonparametric approach. Biometrics 44 (3), 837–845.
Fedorova, I.T., Volkova, Y.A., Varlyguin, E., 1994. World vegetation cover. Digital raster data on a 30-minute Cartesian orthonormal
geodetic (lat/long) 1080 × 2160 grid. Global Ecosystems Database Version 2.0. USDOC/NOAA National Geophysical Data Center,
Boulder, CO.
Gajewski, K., Vance, R., Sawada, M., Fung, I., Gignac, L.D., Halsey, L., John, J., Maisongrande, P., Mandell, P., Mudie, P.J., Richard,
P.J.H., Sherin, R.A.G., Soroko, J., Vitt, D., 2000. The climate of North America and adjacent ocean waters ca. 6 ka. Canadian Journal
of Earth Sciences 37 (5), 661–681.
Gavin, D.G., Oswald, W.W., Wahl, E.R., Williams, J.W., 2003. A statistical approach to evaluating distance metrics and analog
assignments for pollen records. Quaternary Research 60 (3), 356–367.
Gonzalez-Donoso, J.M., Serrano, F., Linares, D., 2000. Sea surface temperature during the Quaternary at ODP sites 976 and 975
(Western Mediterranean). Palaeogeography Palaeoclimatology Palaeoecology 162 (1–2), 17–44.
Grunsky, E.C., 2002. R: a data analysis and statistical programming environment: an emerging tool for the geosciences. Computers &
Geosciences 28 (10), 1219–1222.
Guiot, J., de Beaulieu, J.L., Cheddadi, R., David, F., Ponel, P., Reille, M., 1993. The climate in Western Europe during the last glacial/
interglacial cycle derived from pollen and insect remains. Palaeogeography, Palaeoclimatology, Palaeoecology 103 (1–2), 73–93.
Hanley, J.A., McNeil, B.J., 1982. The meaning and use of the area under a receiver operating characteristic ROC curve. Radiology 143,
29–36.
Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5 (3),
299–314.
Ikeya, N., Cronin, T.M., 1993. Quantitative analysis of ostracoda and water masses around Japan: application to Pliocene and Pleistocene
paleooceanography. Micropaleontology 39 (3), 263–281.
Jackson, S.T., Williams, J.W., 2004. Modern analogs in Quaternary paleoecology: here today, gone yesterday, gone tomorrow? Annual
Review of Earth and Planetary Sciences 32 (1), 495–537.
Kandiano, E.S., Bauch, H.A., 2003. Surface ocean temperatures in the north-east Atlantic during the last 500 000 years: evidence from
foraminiferal census data. Terra Nova 15 (4), 265–271.
Li, T.G., Liu, Z.X., Hall, M.A., Berne, S., Saito, Y., Cang, S.X., Cheng, Z.B., 2001. Heinrich event imprints in the Okinawa trough:
evidence from oxygen isotope and planktonic foraminifera. Palaeogeography Palaeoclimatology Palaeoecology 176 (1–4), 133–146.
Maher, L.J., 2000. MODPOL.EXE: A tool for searching for modern analogs of Pleistocene pollen data. INQUA Sub-Commission on
Data-Handling Methods. Bennett, K. D. Newsletter 20, http://www.kv.geo.uu.se/inqua/news20/n20-ljm.htm.
Metz, C.E., Herman, B.A., Roe, C.A., 1998a. Statistical comparison of two ROC-curve estimates obtained from partially paired datasets.
Medical Decision Making 18 (1), 110–121.
Metz, C.E., Herman, B.A., Shen, J.-H., 1998b. Maximum likelihood estimation of receiver operating characteristic (ROC) curves from
continuously distributed data. Statistics in Medicine 17, 1033–1053.
Mosimann, J.E., 1965. Statistical methods for the pollen analyst: multinomial and negative multinomial techniques. In: Kummel, B.,
Raup, D. (Eds.), Handbook of Paleontological Techniques. W.H. Freeman and Company, San Francisco, pp. 636–673.
Oswald, W.W., Brubaker, L.B., Hu, F.S., Gavin, D.G., 2003. Pollen-vegetation calibration for tundra communities in the arctic foothills,
Northern Alaska. Journal of Ecology 91 (6), 1022–1033.
Overpeck, J.T., Webb III, T., Prentice, I.C., 1985. Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the
method of modern analogs. Quaternary Research 23 (1), 87–108.
Perez-Folgado, M., Sierro, F.J., Flores, J.A., Cacho, I., Grimalt, J.O., Zahn, R., Shackleton, N., 2003. Western Mediterranean planktonic
foraminifera events and millennial climatic variability during the last 70 kyr. Marine Micropaleontology 48, 49–70.
Pflaumann, U., Duprat, J., Pujol, C., Labeyrie, L.D., 1996. Simmax—a modern analog technique to deduce Atlantic sea surface
temperatures from planktonic foraminifera in deep-sea sediments. Paleoceanography 11 (1), 15–35.
Pflaumann, U., Sarnthein, M., Chapman, M., d’Abreu, L., Funnell, B., Huels, M., Kiefer, T., Maslin, M., Schulz, H., Swallow, J., van
Kreveld, S., Vautravers, M., Vogelsang, E., Weinelt, M., 2003. Glacial North Atlantic: sea-surface conditions reconstructed by
GLAMAP 2000—art. no. 1065. Paleoceanography 18 (3), 1065.
Prell, W.L., 1985. The stability of low-latitude sea-surface temperatures: an evaluation of the CLIMAP reconstruction with emphasis on
the positive SST anomalies. Report TR 025, US Department of Energy, Washington, DC, 60pp.
Prentice, I.C., 1980. Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Review of
Palaeobotany and Palynology 31, 71–104.
R Development Core Team, 2003. R: A language and environment for statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-00-3, URL http://www.R-project.org.
Sawada, M., Gajewski, K., de Vernal, A., Richard, P., 1999. Comparison of marine and terrestrial Holocene climatic reconstructions from
northeastern North America. Holocene 9 (3), 267–277.
Sawada, M., Viau, A.E., Gajewski, K., 2001. Critical thresholds of dissimilarity in the modern analog technique (mat) for quantitative
paleoclimate reconstruction. In: Chylek, P., Lesins, G., (Eds.), Proceedings of the First Annual Conference on Global Warming and
the Next Ice Age, Dalhousie University, Halifax, NS, Canada, pp.149–152.
Sawada, M., Viau, A.E., Vettoretti, G., Peltier, W.R., Gajewski, K., 2004. Comparison of North-American pollen-based temperature and
global lake-status with CCCma AGCM2 output at 6 ka. Quaternary Science Reviews 23 (3–4), 225–244.
Schweitzer, P.N., 1995. Analog: a program for estimating paleoclimate parameters using the method of modern analogs. INQUA Working
Group on Data-Handling Methods, Newsletter 13.
Schweitzer, P.N., 1999. ANALOG: A program for estimating paleoclimate parameters using the method of modern analogs. U.S.
Geological Survey Open-File Report 94–645. United States Geological Survey, Reston, VA, http://geochange.er.usgs.gov/pub/tools/
analog/doc/analog.html
Sieger, R., Gersonde, R., Zielinski, U., 1999. A new extended software package for quantitative paleoenvironmental reconstructions. EOS,
Transactions, American Geophysical Union Electronic Supplement, 11 May 1999.
Waelbroeck, C., Labeyrie, L., Duplessy, J.C., Guiot, J., Labracherie, M., Leclaire, H., Duprat, J., 1998. Improving past sea surface
temperature estimates based on planktonic fossil faunas. Paleoceanography 13 (3), 272–283.
Wahl, E.R., 2004. A general framework for determining cutoff values to select pollen analogs with dissimilarity metrics in the modern
analog technique. Review of Palaeobotany and Palynology 128 (3–4), 263–280.
Whitmore, J., Gajewski, K., Sawada, M., Williams, J.W., Shuman, B., Bartlein, P.J., Minckley, T., Viau, A.E., Webb III, T., Shafer, S.,
Anderson, P., Brubaker, L., 2005. Modern pollen data from North America and Greenland for multi-scale paleoenvironmental
applications. Quaternary Science Reviews 24 (16–17), 1828–1848.
Williams, J.W., 2003. Variations in tree cover in North America since the last glacial maximum. Global and Planetary Change 35 (1–2),
1–23.
Williams, J.W., Shuman, B.N., Webb, T., 2001. Dissimilarity analyses of late-Quaternary vegetation and climate in Eastern North
America. Ecology 82 (12), 3346–3362.
Yan, L., Dodier, R., Mozer, M.C., Wolniewicz, R., 2003. Optimizing classifier performance via an approximation to the
Wilcoxon–Mann–Whitney statistic. In: Fawcett, T., Mishra, N. (Eds.), Proceedings of the Twentieth International Conference on
Machine Learning (ICML-2003), AAAI Press, Washington, DC., pp. 848–855. http://aaai.org/Press/Proceedings/ICML/2003/
icml03.html