

Soil Science Society of America Journal

Soil Sci. Soc. Am. J. 78:903–913 doi:10.2136/sssaj2013.08.0354 Received 19 Aug. 2013. *Corresponding author ([email protected]). © Soil Science Society of America, 5585 Guilford Rd., Madison WI 53711 USA All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.

Development and Update Process of VNIR-Based Models Built to Predict Soil Organic Carbon

Pedology

Cleiton H. Sequeira*
School of Natural Resources, Univ. of Nebraska-Lincoln, Hardin Hall, 3310 Holdrege St., Lincoln, NE 68583-0961
and
USDA-NRCS National Soil Survey Center, 100 Centennial Mall North, Rm. 152, Lincoln, NE 68508-3866

Skye A. Wills
USDA-NRCS National Soil Survey Center, 100 Centennial Mall North, Rm. 152, Lincoln, NE 68508-3866

Sabine Grunwald
Univ. of Florida, 2181 McCarty Hall, Gainesville, FL 32611-0290

Richard R. Ferguson, Ellis C. Benham, Larry T. West
USDA-NRCS National Soil Survey Center, 100 Centennial Mall North, Rm. 152, Lincoln, NE 68508-3866

The large number of samples, time, and cost required to assess soil organic C (SOC) with standard procedures have led to interest in proximal sensing with visible and near-infrared (VNIR) diffuse reflectance spectroscopy. The objectives of the present study were to (i) evaluate the effect of multivariate techniques and spectra preprocessing methods on the performance of VNIR-based models, (ii) evaluate the effect of subsetting datasets to improve the prediction accuracy of models, and (iii) present a systematic iterative model development and update process. There were three datasets: Dataset-1 was used for initial model development; Dataset-2 was used to revalidate models developed with Dataset-1; Dataset-3 was used to update promising models identified with Dataset-1 and Dataset-2. During initial model development with Dataset-1, the dataset was subset into clusters to try to improve model performance. Subsetting datasets did not improve model performance. Revalidating models with Dataset-2 helped to identify the lack of robustness in the initial models. This is related to the increased sample diversity in Dataset-2 compared to Dataset-1 and highlights the importance of continuously updating models to cover more variability. Based on Dataset-1 and Dataset-2, promising models were updated with the larger and more diverse Dataset-3. Following this update, the best model had a coefficient of multiple determination (R2), root mean squared prediction error (RMSPE), and residual prediction deviation (RPD) of 0.95, 2.062%, and 4.39, respectively. Collecting and evaluating data in separate sets allowed models to be revalidated and updated with new independent samples. This continuous process provides robust models to end users.

Abbreviations: CS, clipped spectra; NSSC, National Soil Survey Center; PC, principal component; PLSR, partial least squares regression; RF, random forest; RMSPE, root mean squared prediction error; RPD, residual prediction deviation; SOC, soil organic C; SSL, Soil Survey Laboratory; VNIR, visible and near-infrared.

Several samples are usually needed for an accurate assessment of spatial and temporal variability of SOC (Vasques et al., 2010). Soil organic C is used as an indicator of soil quality due to its direct influence on other soil properties (Stevenson, 1994; Weil and Magdoff, 2004). Furthermore, there are incentives to improve SOC levels to mitigate increasing atmospheric concentrations of greenhouse gases such as CO2, CH4, and N2O (Coppens et al., 2006; Fest et al., 2009; Purakayastha et al., 2008). Standard procedures for measuring SOC (e.g., dry and wet combustion) are time consuming and expensive, and can limit the number of samples that researchers and farmers can afford to analyze. Additionally, the wet combustion procedure proposed by Walkley and Black (1934) creates residues that can become an environmental concern. These constraints have led to an increasing interest in proximal SOC sensing with VNIR diffuse reflectance spectroscopy. The VNIR technique is rapid, cost effective, requires minimal sample preparation, can be used in situ, is nondestructive, requires no hazardous chemicals, and can accurately estimate multiple soil properties, including SOC (Brown et al., 2006; Reeves et al., 2002; Sankey et al., 2008; Viscarra Rossel and Behrens, 2010). The increasing use of VNIR in soil science over the last 20 yr is also attributed to recent advances in computation and instrument manufacturing, developments in multivariate statistics (chemometrics), and the greater number of potential applications in agriculture and soil science (Brown et al., 2006; Guerrero et al., 2010).

There is a good deal of research literature focused on developing VNIR-based models to predict SOC (Brown et al., 2006; Dunn et al., 2002; Minasny and McBratney, 2008; Vasques et al., 2010; Reeves et al., 2002; Sankey et al., 2008; Viscarra Rossel and Behrens, 2010). Many recent papers investigated the effect of using different multivariate techniques and different spectra preprocessing methods on the prediction accuracy of models. For instance, Viscarra Rossel and Behrens (2010) compared the performance of several multivariate techniques to calibrate VNIR models for SOC, clay content, and pH prediction. The authors found that support vector machines produced the best predictions for all three soil properties evaluated. Vasques et al. (2008) focused on combining multivariate techniques and spectra preprocessing methods to predict SOC for soils from the north-central region of Florida. They obtained the best SOC predictions by combining the partial least squares regression (PLSR) multivariate technique with the Norris gap derivative preprocessing method. There is still no consensus regarding the most successful multivariate technique and spectra preprocessing method, due to the great variety of options and the conflicting findings across studies.

The VNIR-based SOC prediction models are typically developed using a relatively simple, straightforward, and linear or one-time process that includes the following steps: soil samples are collected; samples are submitted to standard lab methods (e.g., dry combustion) and VNIR scanning; a database is compiled; the database is split into calibration and validation sets; different models are developed from the calibration set; models are validated with the validation set; models are recommended based on prediction metrics. All of the research previously mentioned follows these steps. However, to our knowledge, the literature is lacking an example of iterative model development in which models are submitted to additional validation rounds with new independent samples, or in which models are updated with new samples so that they cover a broader range of samples. These additional steps in VNIR-based model development will result in more robust analysis and recommendation of developed models that can be used in ongoing studies and production analysis environments.

The objectives of the present study were to (i) evaluate the effect of different multivariate techniques and spectra preprocessing methods on the performance of VNIR-based models developed to predict SOC, (ii) evaluate the effect of subsetting complete datasets to improve the prediction accuracy of models, and (iii) present a systematic iterative prediction model development and update process.

MATERIALS AND METHODS

Soil Samples

Soil samples were gathered from the USDA-NRCS National Soil Survey Center (NSSC) Soil Survey Laboratory (SSL). The data are stored in the National Cooperative Soil Survey-SSL characterization database [http://soils.usda.gov/survey/nscd/ (accessed 27 Sept. 2011)]. All queried samples had SOC and VNIR data and different degrees of completeness regarding horizon designation and texture data. Soil organic C was determined by dry combustion (total C) for samples without carbonates and by the difference between total C and inorganic C (pressure calcimeter method) for samples with carbonates present (Soil Survey Staff, 2004). All pedons were described according to Schoeneberger et al. (2002). Horizon designation was limited to the master horizons A, E, B, and C. Transitional horizons were generally pooled with the first described master horizon because the characteristics of that horizon usually predominate (e.g., A–E and AE horizons were considered A). Soil texture [determined by the pipet method according to SSL protocols (Soil Survey Staff, 2004)] was described by the 12 classes in the USDA textural triangle (Schoeneberger et al., 2002).

Dataset-1

Beginning in 2009, all soil samples arriving in the NSSC were scanned with a VNIR spectrometer (details below) as an initiative to build the U.S. soil spectra library. The first set of samples used for modeling was compiled on 18 May 2011 and contained 428 pedons (3053 mineral horizons) representing 30 states of the United States and the territories of Northern Mariana Islands and Puerto Rico. Samples without horizon designation and texture data were removed from the dataset, since these two soil properties were used to help subset (group) samples. Thus, a total of 2043 samples with SOC content ranging from 0.0 to 36.2% were used in the initial development of VNIR-based models to predict SOC. In this paper, Dataset-1 refers to these 2043 selected samples.

Dataset-2

The second set of samples, Dataset-2, was compiled on 2 Feb. 2012 to independently revalidate models developed with Dataset-1. Therefore, it consisted of all samples scanned from the initial scanning date in 2009 until the access date, except the original 2043 samples of Dataset-1. This included new samples that were scanned as they were processed in the center and some randomly selected archive pedons that dated back to 2006. Archive samples were scanned to boost the sample size of the spectra library. Dataset-2 had a total of 6581 samples (mineral horizons) with SOC content ranging from 0.0 to 63.5%. Out of the 6581 samples in Dataset-2, 2983 samples had horizon designation, texture, and spectra data that allowed them to be assigned to the clusters created with Dataset-1 (details below).

Dataset-3

The third set of sample data consisted of all samples with SOC and VNIR data collected by the NSSC-SSL until 2 Feb. 2012. It is therefore Dataset-1 and Dataset-2 combined. Dataset-3 had 2039 pedons (8624 mineral horizons) representing the 50 states of the United States and the territories of Northern Mariana Islands and Puerto Rico. Soil organic C content ranged from 0.0 to 63.5%. Dataset-3 was used to update promising models identified with Dataset-1 and Dataset-2 separately.

Subsetting Datasets

Two clustering methods were implemented for subsetting complete datasets: agglomerative hierarchical clustering and K-medoids. The basic tool of hierarchical clustering is a measure of the dissimilarity or proximity (i.e., distance) of one item (e.g., soil sample) relative to another item, so that items within a cluster are more similar than items between clusters (Izenman, 2008). This assumes that clusters contain subclusters (i.e., structure). In this case, Ward's minimum variance method was used to calculate the distance. K-medoids is a nonhierarchical or partitioning clustering method that searches for K representative objects (medoids) among the items in the data set, and a dissimilarity-based distance is used instead of the Euclidean distance commonly employed in hierarchical clustering (Izenman, 2008).

For both clustering methods, soil samples were grouped as a function of horizon designation, texture, and VNIR spectra (350–2500 nm). Before any cluster analysis, a Savitzky–Golay first derivative was applied to the VNIR spectra to correct for any baseline drift effect. The variables were then standardized so that all properties exert a similar effect on the clustering procedure regardless of variability (Izenman, 2008). The number of clusters was initially set to three. This number was then increased and decreased in an attempt to find the most appropriate number based on the performance of the derived models. Hierarchical clustering and K-medoids were implemented with the “hclust” and “pam” functions, respectively, in R (R Development Core Team, 2011). A classification tree was developed with “randomForest” in R (R Development Core Team, 2011) to assign new samples to the existing clusters using horizon designation, texture, and spectra.
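The standardization and partitioning steps described above can be illustrated with a minimal Python sketch. It is not the R "pam" routine used in the study: it is a simple alternating K-medoids on Euclidean dissimilarities, it handles numeric variables only, and the function names, the naive choice of starting medoids, and the toy data are all assumptions made for illustration.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def k_medoids(rows, k, max_iter=100):
    """Alternating K-medoids (assumption: not PAM's build/swap search)."""
    dist = [[math.dist(a, b) for b in rows] for a in rows]
    medoids = list(range(k))  # naive start: the first k samples (an assumption)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(len(rows))]
        # Update step: the new medoid is the member with the smallest
        # total dissimilarity to the other members of its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster emptied out
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Toy data: two well-separated groups of three hypothetical samples each.
samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
           [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]]
medoids, labels = k_medoids(standardize(samples), k=2)
```

In this sketch the medoids are always actual samples, which is the property that distinguishes K-medoids from K-means; categorical inputs such as horizon designation and texture class would need a mixed-type dissimilarity that is not shown here.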

Visible and Near-infrared Spectroscopy and Spectra Preprocessing

The VNIR diffuse reflectance spectra were measured on air-dry samples passed through a 2-mm sieve using the LabSpec 2500 spectrometer (Analytical Spectral Devices, Boulder, CO) with a spectral range of 350 to 2500 nm, acquired at 1-nm increments. Each sample was scanned once with an internal average of 100 readings. Soils were scanned from below using a Muglight (Analytical Spectral Devices, Boulder, CO). Spectralon (LabSphere, North Sutton, NH) was used as white reference once every 15 min.

The VNIR spectrum is subject to nonconstituent-related interferences (e.g., light scattering, light path-length, spectrum baseline drift) that decrease the signal-to-noise ratio. These interferences can be removed or at least minimized by applying preprocessing treatments (methods) to the spectrum (Duckworth, 2004; Heise and Winzen, 2002). Derivatives are known to correct baseline effects and enhance visual resolution, with the Savitzky–Golay method being one of the better algorithms for applying them (Duckworth, 2004). Thus, in addition to raw reflectance spectra, VNIR-based models were developed using spectra preprocessed to increase the signal-to-noise ratio with first and second derivatives through the Savitzky–Golay method. Savitzky–Golay derivatives were applied with a third-order polynomial across smoothing segments of 11 and 21 nm. Additionally, noise and artifact bands, which are intrinsic to the spectrometer, were removed from preprocessed spectra. The result is referred to as clipped spectra (CS). Noise bands were those located at 350 to 419 and 2491 to 2500 nm, while artifact bands were those at 997 to 1004 and 1827 to 1834 nm. Artifact bands occur at the edges of the spectral ranges of the three detectors built into the spectrometer. Thus, clipped spectra consisted of the bands at 420 to 996, 1005 to 1826, and 1835 to 2490 nm. Therefore, reflectance spectra were subject to the preprocessing treatments presented in Table 1. Savitzky–Golay derivatives were implemented with the “sgolayfilt” function in R (R Development Core Team, 2011).
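As an illustration of the clipping step only, the following Python sketch keeps the bands in the three retained ranges listed above and drops the noise and artifact bands. The function name and the placeholder reflectance values are hypothetical; the study itself worked in R, and the derivative step is not shown.

```python
# Wavelength ranges (nm, inclusive) retained in the clipped spectra, as listed in the text.
KEPT_RANGES = [(420, 996), (1005, 1826), (1835, 2490)]


def clip_spectrum(wavelengths, values):
    """Return the (wavelength, value) pairs that fall inside the retained ranges."""
    return [(w, v) for w, v in zip(wavelengths, values)
            if any(lo <= w <= hi for lo, hi in KEPT_RANGES)]


wavelengths = list(range(350, 2501))    # 1-nm grid from 350 to 2500 nm
reflectance = [0.5] * len(wavelengths)  # placeholder values, not real data
clipped = clip_spectrum(wavelengths, reflectance)
print(len(wavelengths), len(clipped))   # prints: 2151 2055
```

On a 1-nm grid, clipping therefore removes 96 of the 2151 bands (70 + 10 noise bands and 8 + 8 artifact bands).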

Multivariate Techniques for Modeling Soil Organic Carbon

Two multivariate techniques were used to model SOC as a function of the VNIR spectra: PLSR and random forest (RF). The PLSR and RF models were developed with the “plsr” and “randomForest” functions, respectively, in R (R Development Core Team, 2011).

Partial least squares regression was developed by Wold et al. (1983) and has become a popular chemometrics technique for quantitative analysis of diffuse reflectance spectra (Viscarra Rossel and Behrens, 2010). It is similar to principal component regression (PCR), as both employ statistical rotations to overcome high dimensionality and multicollinearity. However, in PLSR the X and Y variables are rotated relative to the response variables to maximize predictive power (Wold et al., 2001). Partial least squares regression models were developed using 10-fold cross-validation on the calibration set, and the optimum number of principal components (PCs) was chosen based on the minimum root mean squared error of cross-validation, up to a maximum of 25 PCs.

Table 1. Preprocessing treatments applied to reflectance spectra.

| Abbreviation | Description |
| --- | --- |
| Raw | Raw reflectance spectra without any preprocessing treatment |
| First/11 | Savitzky–Golay first derivative across smoothing segment of 11 nm |
| First/11/CS | Savitzky–Golay first derivative across smoothing segment of 11 nm on clipped spectra |
| First/21 | Savitzky–Golay first derivative across smoothing segment of 21 nm |
| First/21/CS | Savitzky–Golay first derivative across smoothing segment of 21 nm on clipped spectra |
| Second/11 | Savitzky–Golay second derivative across smoothing segment of 11 nm |
| Second/11/CS | Savitzky–Golay second derivative across smoothing segment of 11 nm on clipped spectra |
| Second/21 | Savitzky–Golay second derivative across smoothing segment of 21 nm |
| Second/21/CS | Savitzky–Golay second derivative across smoothing segment of 21 nm on clipped spectra |

Random forest, a tree-based algorithm developed by Breiman (2001), is a distribution-free method that has gained popularity in a variety of fields (Izenman, 2008). The goal of RF is to obtain stable predictors (regressors) and, hence, robust models by applying bootstrap aggregating (bagging) and random input selection. The bagging process draws, with replacement, random and independent samples from the calibration set to grow every tree in the forest (Breiman, 1996). Random input selection randomly selects a subset of variables to determine the best split at each node in the tree (Ho, 1998). Some advantages of RF are suitability for databases with more variables than observations, robustness to noise and irrelevant features, resistance to overfitting, ability to handle a mix of categorical and continuous variables, and little need to fine-tune parameters to achieve good performance (Izenman, 2008; Díaz-Uriarte and Andrés, 2006). The number of trees in the forest was set to 1500 in developing models with Dataset-1. However, it was set to 500 and 1000 trees in developing models with Dataset-3 because the larger sample size increased computational requirements. To reduce bias, the trees were always grown to the maximum possible size with no pruning.
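The two sources of randomness described above, bagging and random input selection, can be sketched in a few lines of Python. This is only an illustration of the sampling steps, not a random forest implementation; the helper names, the seed, the toy sample size of 10, and the choice of one-third of the bands as split candidates are assumptions that the text does not specify.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Bagging step: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_inputs(n_features, n_try, rng):
    """Random input selection: pick n_try distinct candidate variables for one split."""
    return rng.sample(range(n_features), n_try)


rng = random.Random(42)                           # fixed seed, only for repeatability
rows = bootstrap_indices(10, rng)                 # calibration rows used to grow one tree
out_of_bag = sorted(set(range(10)) - set(rows))   # rows that this tree never saw
candidates = random_inputs(2151, 2151 // 3, rng)  # assumed: one-third of 2151 bands
```

Because each tree sees a different bootstrap sample and each split sees a different subset of bands, the trees are decorrelated, and averaging their predictions is what gives the forest its stability.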

Thus, 18 VNIR-based models were developed to predict SOC using the combination of multivariate techniques and spectra preprocessing treatments. As mentioned above, these models were initially developed with Dataset-1, and revalidated and updated with Dataset-2 and Dataset-3, respectively. When developing models, each dataset was randomly divided into a 70% calibration set and a 30% validation set. The prediction quality of models was evaluated by the following validation parameters: prediction R2, Eq. [1]; RMSPE, Eq. [2]; and RPD (Williams, 1987), Eq. [3]. These are defined as

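$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad [1]$$

$$\mathrm{RMSPE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad [2]$$

$$\mathrm{RPD} = \frac{\mathrm{SD}_{\mathrm{VAL}}}{\mathrm{RMSPE}\,\sqrt{n/(n-1)}} \qquad [3]$$

where $\hat{y}_i$ is the predicted value; $\bar{y}$ is the mean of the observed values; $y_i$ is the observed value; $n$ is the number of observations with $i = 1, 2, \ldots, n$; and $\mathrm{SD}_{\mathrm{VAL}}$ is the standard deviation of the validation set.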
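For readers who want to check the three validation parameters numerically, a minimal Python sketch of Eq. [1] to [3] follows. The function names and the four-sample toy data are hypothetical, and the use of the sample standard deviation (n − 1 divisor) for SDVAL is an assumption, since the text does not state which divisor was used.

```python
import math
import statistics


def r_squared(observed, predicted):
    """Eq. [1]: sum of squares of predictions about the observed mean,
    divided by the total sum of squares of the observations."""
    y_bar = statistics.fmean(observed)
    explained = sum((p - y_bar) ** 2 for p in predicted)
    total = sum((o - y_bar) ** 2 for o in observed)
    return explained / total


def rmspe(observed, predicted):
    """Eq. [2]: root mean squared prediction error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def rpd(observed, predicted):
    """Eq. [3]: residual prediction deviation."""
    n = len(observed)
    sd_val = statistics.stdev(observed)  # sample SD (n - 1 divisor): an assumption
    return sd_val / (rmspe(observed, predicted) * math.sqrt(n / (n - 1)))


# Toy validation set (hypothetical SOC values in %, not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(r_squared(observed, predicted), rmspe(observed, predicted), rpd(observed, predicted))
```

For this toy set the squared errors sum to 0.5 and the total sum of squares is 5.0, so RMSPE is the square root of 0.125 and, under the sample-SD assumption, RPD reduces to the square root of 10.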

RESULTS AND DISCUSSION

Initial Model Development with Dataset-1

Descriptive statistics of SOC content for the 2043 samples of Dataset-1 are shown in Table 2. All datasets (Dataset-1 and Clusters 1, 2, and 3) were positively skewed, with kurtosis values greater than zero, indicating a non-normal distribution of SOC data. Soil organic C content ranged from 0.0 to 36.2% in Dataset-1 and from 0.0 to 36.2, 0.0 to 11.5, and 0.0 to 8.8% in Cluster 1, Cluster 2, and Cluster 3, respectively.

For the complete Dataset-1, validation results indicate a consistently better performance of RF models compared to PLSR models for each of the nine tested preprocessing treatments (Table 3). For instance, the RMSPE for PLSR and RF models ranged from 2.466 to 2.521 and 1.316 to 2.247%, respectively. Partial least squares regression is a parametric algorithm developed to model linear relationships between Y and strongly correlated X variables, and it relies on the assumption that the data (e.g., SOC content) have a normal (Gaussian) distribution (Wold et al., 2001). Thus, the poor performance of PLSR models can be explained by the nonlinear relationship between VNIR reflectance and soil properties (Clark, 1999), and by the non-normal distribution of SOC in the dataset (Table 2). It is commonly advised to apply a proper transformation (e.g., logarithmic) to the data so that statistical assumptions can be met (Sakia, 1992); however, when back-transformed to original units, the relationship between predicted and observed values is distorted. Vasques et al. (2010) found that the back-transformation of logarithmically transformed SOC values to the original unit (%) reduced the quality of PLSR models to the extent that extreme values were greatly under- or overestimated. The same was observed in the present study (data not shown). On the other hand, RF is a nonparametric (distribution-free) algorithm that can be used even if typical statistical analysis assumptions are violated.

Table 2. Descriptive statistics of measured soil organic C content (%) in Dataset-1.

| Statistic | Whole set | Calibration | Validation |
| --- | --- | --- | --- |
| Dataset-1 (n = 2043) | | | |
| Min. | 0.0 | 0.0 | 0.0 |
| First quart. | 0.2 | 0.2 | 0.2 |
| Median | 0.5 | 0.5 | 0.5 |
| Mean | 1.3 | 1.3 | 1.4 |
| Third quart. | 1.2 | 1.2 | 1.3 |
| Max. | 36.2 | 36.2 | 35.6 |
| Skew | 6.6 | 6.9 | 6.0 |
| Kurtosis | 54.2 | 62.4 | 44.1 |
| Cluster 1 (n = 930) | | | |
| Min. | 0.0 | 0.0 | 0.0 |
| First quart. | 0.2 | 0.2 | 0.2 |
| Median | 0.6 | 0.5 | 0.7 |
| Mean | 1.8 | 1.8 | 2.0 |
| Third quart. | 1.7 | 1.5 | 1.9 |
| Max. | 36.2 | 36.2 | 35.6 |
| Skew | 5.4 | 5.6 | 4.8 |
| Kurtosis | 34.0 | 37.7 | 26.6 |
| Cluster 2 (n = 764) | | | |
| Min. | 0.0 | 0.0 | 0.0 |
| First quart. | 0.1 | 0.1 | 0.1 |
| Median | 0.3 | 0.3 | 0.3 |
| Mean | 0.7 | 0.7 | 0.6 |
| Third quart. | 0.7 | 0.6 | 0.7 |
| Max. | 11.5 | 11.5 | 7.2 |
| Skew | 4.6 | 4.7 | 3.8 |
| Kurtosis | 25.7 | 25.7 | 17.6 |
| Cluster 3 (n = 349) | | | |
| Min. | 0.0 | 0.0 | 0.2 |
| First quart. | 0.5 | 0.5 | 0.5 |
| Median | 0.9 | 0.9 | 0.9 |
| Mean | 1.2 | 1.2 | 1.3 |
| Third quart. | 1.6 | 1.6 | 1.8 |
| Max. | 8.8 | 8.8 | 6.3 |
| Skew | 2.7 | 3.0 | 1.9 |
| Kurtosis | 11.6 | 13.7 | 4.6 |

Chang et al. (2001) created categories based on RPD values to distinguish the ability of VNIR-based models to predict soil properties. Models with RPD values >2.0 were considered stable and accurate; models with RPD values between 1.4 and 2.0 were considered suitable for further work; models with RPD < 1.4 were considered not reliable for prediction. For the complete Dataset-1, all PLSR models had RPD < 1.4 and would be considered of poor predictive capacity. On the other hand, for RF models, only the model derived from raw data (RF/Raw) had RPD < 2.0, which was still greater than the RPD for the comparable PLSR model (PLSR/Raw). Among the 18 models, the best model was obtained by using the RF algorithm with the Second/21/CS preprocessing treatment (RF/Second/21/CS) (Table 3). It is commonly found that applying preprocessing treatments to the spectra improves the accuracy of VNIR models. Dunn et al. (2002) reported improvement in their models by using first and second derivatives, while Kooistra et al. (2001) found better results with untransformed (raw) reflectance data. Vasques et al. (2009) suggested that the best preprocessing treatment associated with a multivariate technique varies with SOC fraction (e.g., total organic C, recalcitrant organic C).
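The RPD categories of Chang et al. (2001) quoted above amount to a simple threshold rule, sketched here in Python. The function name is hypothetical, and placing a value of exactly 1.4 or 2.0 in the middle category is an assumption, since the text does not say how boundary values are treated.

```python
def rpd_category(rpd):
    """Classify a model by its RPD using the thresholds quoted from Chang et al. (2001)."""
    if rpd > 2.0:
        return "stable and accurate"
    if rpd >= 1.4:  # boundary handling at exactly 1.4 and 2.0 is an assumption
        return "suitable for further work"
    return "not reliable for prediction"


# RPD values reported for the complete Dataset-1 in Table 3.
for model, rpd in [("PLSR/Raw", 1.35), ("RF/Raw", 1.48), ("RF/Second/21/CS", 2.53)]:
    print(model, rpd_category(rpd))
```

Applied to these three Dataset-1 models, the rule places PLSR/Raw in the unreliable category, RF/Raw in the intermediate category, and RF/Second/21/CS in the stable and accurate category, in line with the discussion above.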

Subsetting Dataset-1 was attempted to improve the prediction performance of VNIR-based models. It was expected that the greater similarity of samples within clusters compared to samples between clusters would result in more accurate cluster models than complete Dataset-1 models. Initially, both hierarchical clustering and K-medoids methods were used to create three clusters. This number was then decreased to two and increased up to five clusters. Increasing or decreasing the number of clusters from the initial three did not improve model performance (data not shown). Thus, the final number of clusters was maintained at three. Clusters obtained with hierarchical clustering were chosen over those obtained with K-medoids due to their consistently better performance (data not shown). Using the hierarchical clustering method, Clusters 1, 2, and 3 contained 930, 764, and 349 samples, respectively (Fig. 1).

As with the complete Dataset-1, validation results of Cluster 1 and 2 indicate that RF models were more accurate than PLSR models (Table 3). For both Cluster 1 and 2 the best model was obtained by using RF algorithm with First/21 preprocessing treatment (RF/First/21). Conversely, PLSR models tended to be more accurate than RF models for Cluster 3 with the best model being PLSR algorithm with First/11 preprocessing treatment

Table 3. Validation results of 18 visible and near-infrared-based models developed to predict soil organic C. Each model was devel-oped and validated with the complete Dataset-1 (2043 samples) and its derived cluster datasets.

Model Dataset 1 Cluster 1 Cluster 2 Cluster 3

R2† RMSPE‡ (%) RPD§ R2 RMSPE (%) RPD R2 RMSPE (%) RPD R2 RMSPE (%) RPD

PLSR/Raw¶ 0.46 2.475 1.35 0.42 3.262 1.31 0.53 0.756 1.43 0.78 0.508 2.10RF/Raw 0.55 2.247 1.48 0.62 2.734 1.57 0.36 0.798 1.21 0.46 0.824 1.29

PLSR/First/11 0.48 2.469 1.38 0.38 3.374 1.27 0.55 0.716 1.45 0.87 0.400 2.67

RF/First/11 0.85 1.437 2.32 0.86 1.456 2.62 0.74 0.491 1.95 0.78 0.494 2.15

PLSR/First/11/CS 0.46 2.521 1.35 0.36 3.431 1.25 0.53 0.751 1.44 0.76 0.527 2.02

RF/First/11/CS 0.85 1.436 2.32 0.86 1.459 2.62 0.75 0.481 1.99 0.79 0.488 2.17

PLSR/First/21 0.48 2.483 1.37 0.39 3.353 1.28 0.53 0.754 1.43 0.75 0.535 1.99

RF/First/21 0.85 1.424 2.34 0.87 1.386 2.74 0.76 0.469 2.04 0.80 0.476 2.23

PLSR/First/21/CS 0.46 2.514 1.35 0.36 3.435 1.25 0.54 0.748 1.44 0.75 0.532 2.00

RF/First/21/CS 0.85 1.410 2.36 0.86 1.421 2.67 0.76 0.473 2.03 0.80 0.475 2.23

PLSR/Second/11 0.46 2.521 1.35 0.37 3.443 1.24 0.43 0.838 1.29 0.78 0.515 2.10

RF/Second/11 0.85 1.481 2.25 0.79 1.754 2.18 0.63 0.581 1.65 0.67 0.613 1.73

PLSR/Second/11/CS 0.47 2.495 1.36 0.35 3.485 1.23 0.49 0.836 1.30 0.81 0.479 2.29

RF/Second/11/CS 0.81 1.488 2.24 0.79 1.763 2.17 0.64 0.577 1.66 0.68 0.598 1.77

PLSR/Second/21 0.48 2.466 1.38 0.42 3.290 1.30 0.51 0.776 1.39 0.79 0.513 2.07

RF/Second/21 0.90 1.321 2.52 0.79 1.801 2.13 0.74 0.490 1.96 0.75 0.534 1.98

PLSR/Second/21/CS 0.48 2.470 1.38 0.41 3.321 1.29 0.50 0.784 1.38 0.81 0.498 2.17

RF/Second/21/CS 0.89 1.316 2.53 0.80 1.782 2.15 0.75 0.480 2.00 0.75 0.528 2.01

† R2, coefficient of multiple determination.

‡ RMSPE, root mean squared prediction error.

§ RPD, residual prediction deviation.

¶ PLSR, partial least squares regression; RF, random forest; Raw, raw reflectance spectra; First, first derivative; Second, second derivative; 11, smoothing segment of 11 nm; 21, smoothing segment of 21 nm; CS, clipped spectra.


(PLSR/First/11) (Table 3). The better performance of RF models over PLSR models for the Cluster 1 and 2 datasets is likely related to the non-normal distribution of SOC data in these two clusters (Table 2). Conversely, the improved performance of PLSR models for the Cluster 3 dataset is related to the closer approximation to the normal distribution of SOC values in this dataset compared to Clusters 1 and 2 (Table 2). The RMSPE values of the best models for Clusters 1, 2, and 3 were 1.386, 0.469, and 0.400%, respectively. It is worth noting that the SOC content range of 0.0 to 36.2% for both the complete Dataset-1 and Cluster 1 (Table 2) resulted in models with validation RMSPE ranges of 1.316 to 2.521 and 1.386 to 3.485%, respectively (Table 3). In comparison, the SOC content ranges of 0.0 to 11.5 and 0.0 to 8.8% for Clusters 2 and 3, respectively, resulted in reduced RMSPE for models derived from these datasets (0.469–0.838 and 0.400–0.824%, respectively), indicating that RMSPE depends on the range of the dataset. This observation brings into question the appropriateness of using RMSPE alone to compare the accuracies of models derived from datasets with markedly different ranges of the property being modeled (e.g., SOC). In such cases, R2 and RPD become important for evaluating prediction performance. The best models in Cluster 1 (RF/First/21), Cluster 2 (RF/First/21), and Cluster 3 (PLSR/First/11) had RPD values of 2.74, 2.04, and 2.67, respectively, placing them in the category considered stable and accurate for predicting SOC (Chang et al., 2001).
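The range dependence of RMSPE noted above can be illustrated with a minimal Python sketch. It assumes that R2 is computed as the squared Pearson correlation between observed and predicted values and that RPD is the sample standard deviation of the observed values divided by RMSPE; the function name and the small datasets are hypothetical and are not taken from the study.

```python
import math


def validation_stats(observed, predicted):
    """Return (R2, RMSPE, RPD) for paired observed and predicted SOC values."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least two values")

    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # RMSPE: root of the mean squared difference, in the units of SOC (%).
    rmspe = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

    # R2 is taken here as the squared Pearson correlation (an assumption).
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    r2 = cov ** 2 / (var_obs * var_pred)

    # RPD: sample standard deviation of the observed values divided by RMSPE.
    sd_obs = math.sqrt(var_obs / (n - 1))
    return r2, rmspe, sd_obs / rmspe


# Hypothetical narrow-range data, then the same data stretched tenfold.
obs_narrow = [0.0, 1.0, 2.0, 3.0, 4.0]
pred_narrow = [0.2, 0.8, 2.3, 2.9, 4.1]
obs_wide = [10 * v for v in obs_narrow]
pred_wide = [10 * v for v in pred_narrow]

r2_n, rmspe_n, rpd_n = validation_stats(obs_narrow, pred_narrow)
r2_w, rmspe_w, rpd_w = validation_stats(obs_wide, pred_wide)
```

In this toy case the wide-range RMSPE is ten times the narrow-range RMSPE while R2 and RPD are unchanged, which is why the standardized statistics are more suitable for comparing models built on datasets with different SOC ranges.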

Additional Model Validation with Dataset-2

The results presented for Dataset-1 (Initial Model Development with Dataset-1) are the result of a basic three-step process that is common in chemometric modeling in soil science: (i) a dataset is acquired and split into calibration and validation sets, (ii) models are developed using the calibration set, and (iii) models are validated with the validation set. The literature lacks examples of models being validated in additional steps with new and independent datasets, which could support a more robust recommendation of models to be used.
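The three-step process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a least-squares line on a single hypothetical spectral feature stands in for the PLSR and RF models of the study, and the function names, the 30% validation fraction, and the data are invented for the example.

```python
import math
import random


def split_calibration_validation(samples, validation_fraction=0.3, seed=0):
    """Step (i): shuffle once and split into calibration and validation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def fit_line(calibration):
    """Step (ii): least-squares line y = a + b*x (stand-in for PLSR or RF)."""
    n = len(calibration)
    mean_x = sum(x for x, _ in calibration) / n
    mean_y = sum(y for _, y in calibration) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in calibration)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in calibration)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def validate(model, validation):
    """Step (iii): RMSPE of the fitted model on the held-out validation set."""
    intercept, slope = model
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in validation]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: one spectral feature x and one SOC value y per sample.
samples = [(float(x), 2.0 * x + 1.0) for x in range(20)]
calibration, validation = split_calibration_validation(samples)
model = fit_line(calibration)
rmspe = validate(model, validation)
```

An additional validation step of the kind advocated here would simply call `validate(model, new_samples)` on a dataset that played no part in the split.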

Descriptive statistics of Dataset-2 and its variants are presented in Table 4. Dataset-2 was prepared with the same preprocessing treatments as the initial Dataset-1. When SOC was predicted for the 6581 samples in Dataset-2, none of the previously developed models was successful. For instance, the models RF/Second/21 and RF/Second/21/CS (the top two models based on Dataset-1 results) predicted SOC for Dataset-2 samples with R2, RMSPE, and RPD ranging from 0.78 to 0.79, 6.408 to 6.481%, and 1.60 to 1.62, respectively. Both models presented extremely high RMSPE and RPD < 2.0, indicating poor prediction performance. The remaining models presented similar or worse results (data not shown). The performance of models RF/Second/21 and RF/Second/21/CS in predicting SOC content for Dataset-2 samples can be visualized in Fig. 2. Predicted values showed a clear pattern in which SOC predictions level off at about 20% SOC while observed values increase beyond 36% SOC. This upper limit of prediction is close to the upper range limit of SOC content in the calibration set (Table 2). Therefore, it is clear that predicting SOC for samples with SOC content above the upper range limit of the calibration set (36.2%) resulted in extrapolation and, hence, poor performance of the models.
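A simple way to flag the extrapolation described above is to compare measured SOC values against the range of the calibration set. The sketch below is illustrative only: the function name and the five SOC values are hypothetical, and only the 0.0 to 36.2% calibration range comes from the text.

```python
def split_by_calibration_range(soc_values, cal_min, cal_max):
    """Separate samples inside the calibration SOC range from those outside it."""
    inside = [v for v in soc_values if cal_min <= v <= cal_max]
    outside = [v for v in soc_values if not (cal_min <= v <= cal_max)]
    return inside, outside


# Hypothetical measured SOC contents (%) for five new samples.
measured_soc = [0.3, 12.0, 36.2, 40.1, 63.5]

# Calibration range of Dataset-1 as reported in the text: 0.0 to 36.2% SOC.
inside, outside = split_by_calibration_range(measured_soc, 0.0, 36.2)
```

Samples in `outside` would require the models to extrapolate, so their predictions should be treated with caution or excluded from validation.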

To limit the range of predictions and avoid extrapolation, Dataset-2 was queried for samples with measured SOC content ≤36.2%. This resulted in a dataset of 6345 samples, which was named Dataset-2.1. Prediction results of Dataset-2.1 with complete dataset models are shown in Table 5. The SOC prediction of Dataset-2.1 samples resulted in inferior validation results compared to those obtained with Dataset-1 (Tables 3 and 4). For instance, the best model based on Dataset-1 results (RF/Second/21/CS, Table 3) had R2, RMSPE, and RPD of 0.89, 1.316%, and 2.53, respectively, while the best model based on Dataset-2.1 results (RF/First/21, Table 5) had R2, RMSPE, and RPD of 0.72, 2.894%, and 1.83, respectively. The fact that validation results indicated different best models based on Dataset-1

Table 4. Descriptive statistics of measured soil organic C (%) in Dataset-2 and its variants. Cluster 1, 2, and 3 refer to subsets of the samples with horizon designation and texture that define Dataset-2.2.

Statistic Dataset 2 Dataset 2.1 Dataset 2.2 Cluster 1 Cluster 2 Cluster 3

N 6581 6345 2983 1395 874 714

Min. 0.0 0.0 0.0 0.0 0.0 0.0

First quart. 0.3 0.3 0.2 0.2 0.1 0.7

Median 0.7 0.8 0.6 0.5 0.3 1.3

Mean 3.5 2.5 1.1 1.1 0.7 1.6

Third quart. 2.0 2.0 1.4 1.3 0.6 2.0

Max. 63.5 36.2 23.6 23.6 21.0 11.1

Skew 4.2 4.0 5.0 5.4 6.3 2.7

Kurtosis 18.1 17.3 40.3 44.6 53.0 10.4

Dataset 2, whole set of samples; Dataset 2.1, samples with SOC ≤36.2%; Dataset 2.2, samples with horizon designation and texture.

Fig. 1. Cluster dendrogram used for subsetting the 2043 samples of Dataset-1.


and Dataset-2.1 highlights the importance of revalidating VNIR-based prediction models with new independent samples to support robust model development and recommendation. The best model based on Dataset-1 results predicted SOC for Dataset-2.1 samples with R2, RMSPE, and RPD of 0.69, 3.055%, and 1.73, respectively (Table 5). The lack of robustness of these models is related to the fact that Dataset-2.1 had a more diverse set of samples than the calibration set used to develop the models. As with Dataset-1, prediction results of Dataset-2.1 were better with RF models than with PLSR models. Figure 3 illustrates the performance of the best RF (First/21) and PLSR (First/11/CS) models in predicting SOC for the 6345 samples in Dataset-2.1. Both models resulted in dispersed prediction values, but the tendency of the PLSR model to give negative predictions for samples with low SOC content, together with the leveling off of predictions for samples with observed SOC content >10%, defined the worse performance of PLSR models. Again, the better performance of RF models compared to PLSR models in predicting SOC of Dataset-2.1 samples is related to the non-normal distribution of SOC values in Dataset-2.1 (Table 4).

Dataset-2 included 2983 samples with both horizon designation and texture data that could be analyzed using the cluster models. We refer to this dataset as Dataset-2.2. The descriptive statistics of the 2983 samples in Dataset-2.2 and in its assigned clusters are presented in Table 4. Once again, complete dataset RF models outperformed PLSR models, and all models were of poor prediction quality (RPD < 1.4) (Table 6). To avoid extrapolation, four samples with SOC content >11.5% were removed from the Cluster 2 dataset and three samples with SOC content >8.8% were removed from the Cluster 3 dataset. Cluster models also had unsatisfactory results, with all Cluster 1 and 2 models having RPD < 1.4 (Table 6). Cluster 3 had five models with RPD between 1.4 and 2.0 and the remaining models with RPD < 1.4 (Table 6). The low prediction quality of these models is related to the lack of representation in the calibration set. This observation stresses the need to continuously update models with diverse samples so that variability coverage is increased and prediction accuracy can be improved. The performance of the best complete dataset and cluster models in predicting the 2983 samples of Dataset-2.2 is shown in Fig. 4. It is noticeable that samples assigned to Cluster 1 are responsible for the greatest degree of dispersion, followed by Cluster 2.

Evaluating cluster model predictions individually does not allow us to assess the accuracy of these models compared to the complete dataset models. Thus, we combined the predictions of the best cluster models to better compare them to the best complete dataset model prediction. To facilitate this comparison, the samples that were removed from Clusters 2 and 3 to avoid extrapolation were also removed from Dataset-2.2. This approach gave a slightly better prediction for the combined clusters, which had R2, RMSPE, and RPD of 0.51, 1.242%, and 1.33, respectively, while the RF/First/21 complete dataset model had R2, RMSPE, and RPD of 0.46, 1.358%, and 1.21, respectively (Fig. 5). However, the combined cluster prediction still had RPD < 1.4 and did not present a noticeably superior prediction performance compared to the complete dataset model. These findings

Table 5. Validation results of 18 visible and near-infrared-based models developed with Dataset-1 (2043 samples) and used to predict soil organic C (SOC) for Dataset-2.1, which consisted of 6345 independent samples with observed SOC ranging from 0.0 to 36.2%.

Model R2† RMSPE‡, % RPD§

PLSR/Raw¶ 0.46 4.247 1.26

RF/Raw 0.40 4.256 1.25

PLSR/First/11 0.41 4.109 1.30

RF/First/11 0.72 2.950 1.80

PLSR/First/11/CS 0.54 3.684 1.44

RF/First/11/CS 0.72 2.948 1.80

PLSR/First/21 0.45 3.955 1.35

RF/First/21 0.72 2.894 1.83

PLSR/First/21/CS 0.53 3.729 1.42

RF/First/21/CS 0.72 2.933 1.81

PLSR/Second/11 0.42 4.064 1.32

RF/Second/11 0.65 3.195 1.66

PLSR/Second/11/CS 0.41 4.088 1.30

RF/Second/11/CS 0.65 3.186 1.67

PLSR/Second/21 0.32 4.610 1.18

RF/Second/21 0.68 3.072 1.72

PLSR/Second/21/CS 0.44 3.987 1.33

RF/Second/21/CS 0.69 3.055 1.73

† R2, coefficient of multiple determination.

‡ RMSPE, root mean squared prediction error.

§ RPD, residual prediction deviation.

¶ PLSR, partial least squares regression; RF, random forest; Raw, raw reflectance spectra; First, first derivative; Second, second derivative; 11, smoothing segment of 11 nm; 21, smoothing segment of 21 nm; CS, clipped spectra.

Fig. 2. Performance of two visible and near-infrared-based models developed with Dataset-1 (2043 samples) and used to predict soil organic C (SOC) for Dataset-2 that consisted of 6581 independent samples with SOC content range of 0.0 to 63.5%. Dashed lines represent the 1:1 relationship. Dark lines represent the regression analysis of observed vs. predicted SOC. RF, random forest; Second, second derivative; 21, smoothing segment of 21 nm; CS, clipped spectra.


support previous conclusions that subsetting datasets does not necessarily improve model performance. Vasques et al. (2010) grouped soil samples by soil order and observed that just two out of seven soil orders gave improved predictions, based on RPD values, when compared to the complete dataset model prediction. Based on relative error, Minasny and McBratney (2008) found that deriving models from defined ranges of total C content (e.g., 0–1%) decreased model accuracy.
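The comparison between combined cluster predictions and a complete dataset model rests on pooling the observed and predicted values of all clusters before computing a single statistic. The following Python sketch shows one way to do this; the function names, cluster labels, and values are hypothetical and serve only to show that the pooled RMSPE is not the average of the per-cluster RMSPE values.

```python
import math


def rmspe(observed, predicted):
    """Root mean squared prediction error for paired values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pool_clusters(cluster_results):
    """Concatenate (observed, predicted) pairs from every cluster model."""
    observed, predicted = [], []
    for obs, pred in cluster_results.values():
        observed.extend(obs)
        predicted.extend(pred)
    return observed, predicted


# Hypothetical predictions from two cluster models of unequal size.
cluster_results = {
    "cluster_a": ([1.0, 2.0], [1.0, 3.0]),
    "cluster_b": ([5.0, 6.0, 7.0, 8.0], [5.0, 6.0, 7.0, 8.0]),
}

pooled_obs, pooled_pred = pool_clusters(cluster_results)
pooled_rmspe = rmspe(pooled_obs, pooled_pred)
mean_of_cluster_rmspe = sum(rmspe(o, p) for o, p in cluster_results.values()) / 2
```

Because the clusters differ in size, the pooled value weights each sample equally, which makes it directly comparable to the RMSPE of a complete dataset model evaluated on the same samples.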

Models Update with Dataset-3

We used Dataset-3 to update the models and expand their range of accurate predictions. This consisted of recalibrating the best models found in the previous steps with the larger and more diverse Dataset-3. Based on our initial findings, we used only RF complete dataset models with preprocessed (not raw) spectra. Models were developed with 500 and 1000 trees in the forest. In this case, models developed with 500 trees had better validation results than those with 1000 trees, and these are reported in Table 7. These findings agree with Brown et al. (2006), who reported that, in general, increasing the number of trees in a model almost always improves cross-validation statistics but not necessarily validation statistics, and increases computational requirements.

Cross-validation and validation results of the eight updated models are in Table 7. Among the top four updated models, two had reflectance spectra preprocessed with the Savitzky–Golay first derivative and two with the Savitzky–Golay second derivative, but all had a smoothing segment of 21 nm. The greater spectral smoothing provided by the 21 nm smoothing segment was more effective in increasing the signal-to-noise ratio compared to the

Fig. 3. Performance of two visible and near-infrared-based models developed with Dataset-1 (2043 samples) and used to predict soil organic C (SOC) for Dataset-2.1 that consisted of 6345 independent samples with SOC content range of 0.0 to 36.2%. Dashed lines represent the 1:1 relationship. Dark lines represent the regression analysis of observed vs. predicted SOC. PLSR, partial least squares regression; RF, random forest; First, first derivative; 11, smoothing segment of 11 nm; 21, smoothing segment of 21 nm; CS, clipped spectra.

Table 6. Validation results of 18 visible and near-infrared-based models developed with Dataset-1 (2043 samples) and used to predict soil organic C (SOC) for Dataset-2.2 that consisted of 2983 independent new samples with observed SOC ranging from 0.0 to 23.6%.

Model Complete Dataset Cluster 1 Cluster 2 Cluster 3

R2† RMSPE‡ (%) RPD§ R2 RMSPE (%) RPD R2 RMSPE (%) RPD R2 RMSPE (%) RPD

PLSR/Raw¶ 0.34 1.546 1.12 0.49 1.882 0.99 0.38 1.104 1.17 0.47 1.002 1.31

RF/Raw 0.25 1.995 0.87 0.30 2.334 0.81 0.26 1.145 1.12 0.28 1.172 1.12

PLSR/First/11 0.18 3.033 0.63 0.32 2.680 0.69 0.17 2.082 0.61 0.45 1.310 1.15

RF/First/11 0.47 1.416 1.25 0.53 1.467 1.31 0.27 1.375 0.99 0.54 0.932 1.47

PLSR/First/11/CS 0.34 1.892 0.94 0.40 2.107 0.88 0.13 2.186 0.60 0.52 0.937 1.39

RF/First/11/CS 0.47 1.414 1.25 0.53 1.440 1.32 0.28 1.359 0.99 0.54 0.934 1.46

PLSR/First/21 0.22 2.773 0.70 0.35 2.453 0.75 0.13 2.331 0.56 0.49 0.983 1.33

RF/First/21 0.49 1.389 1.28 0.52 1.495 1.28 0.27 1.356 1.00 0.53 0.942 1.45

PLSR/First/21/CS 0.32 1.958 0.91 0.41 2.085 0.89 0.12 2.339 0.57 0.51 0.946 1.38

RF/First/21/CS 0.48 1.383 1.28 0.52 1.484 1.27 0.28 1.337 1.01 0.53 0.940 1.45

PLSR/Second/11 0.20 2.721 0.71 0.24 3.238 0.57 0.20 1.674 0.77 0.38 1.197 1.10

RF/Second/11 0.30 1.995 0.92 0.39 1.944 1.03 0.29 1.315 1.05 0.39 1.111 1.26

PLSR/Second/11/CS 0.16 3.039 0.59 0.36 2.462 0.76 0.26 2.036 0.71 0.50 1.032 1.27

RF/Second/11/CS 0.31 1.981 0.92 0.39 1.946 1.03 0.30 1.294 1.06 0.39 1.097 1.27

PLSR/Second/21 0.09 4.029 0.50 0.25 3.171 0.61 0.27 1.772 0.77 0.45 1.241 1.19

RF/Second/21 0.34 1.783 1.01 0.39 1.897 1.02 0.36 1.170 1.13 0.53 0.933 1.45

PLSR/Second/21/CS 0.13 3.013 0.60 0.31 2.555 0.73 0.29 1.557 0.85 0.45 1.135 1.19

RF/Second/21/CS 0.35 1.758 1.02 0.40 1.854 1.04 0.36 1.168 1.12 0.53 0.932 1.46

† R2, coefficient of multiple determination.

‡ RMSPE, root mean squared prediction error.

§ RPD, residual prediction deviation.

¶ PLSR, partial least squares regression; RF, random forest; Raw, raw reflectance spectra; First, first derivative; Second, second derivative; 11, smoothing segment of 11 nm; 21, smoothing segment of 21 nm; CS, clipped spectra.


11 nm smoothing segment. The top four models had similar performances, with slightly better performance for first derivative models than for second derivative models. Between the models developed with spectra treated with the first derivative across a smoothing segment of 21 nm, the one with clipped spectra (RF/First/21/CS) performed better. In fact, the majority of updated models in Table 7 that were developed with clipped spectra performed better than their counterparts without clipped spectra, indicating the effectiveness of clipping in removing noise.
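The effect of the smoothing segment can be made concrete with a minimal Savitzky–Golay first-derivative sketch in Python. It rests on assumptions not stated in the text: spectra are taken to be sampled at 1-nm intervals, so that segments of 11 and 21 nm correspond to 11 and 21 points, and the local polynomial is taken to be of order 1 or 2, for which the derivative at the window center reduces to the weights i / Σi². The function name is hypothetical, and edge points without a full window are simply dropped.

```python
def savgol_first_derivative(spectrum, window=21, spacing=1.0):
    """First derivative from a least-squares fit in a centered moving window.

    For a local polynomial of order 1 or 2, the derivative at the window
    center is sum(i * y_i) / sum(i * i), with i the offset from the center.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    if len(spectrum) < window:
        raise ValueError("spectrum is shorter than the window")

    half = window // 2
    offsets = range(-half, half + 1)
    norm = sum(i * i for i in offsets)
    weights = [i / norm for i in offsets]

    derivative = []
    # Only centers with a complete window are evaluated (edges are dropped).
    for center in range(half, len(spectrum) - half):
        segment = spectrum[center - half:center + half + 1]
        derivative.append(sum(w * v for w, v in zip(weights, segment)) / spacing)
    return derivative


# Hypothetical linear "spectrum" with slope 2; both windows recover the slope.
line = [2.0 * x + 1.0 for x in range(50)]
d11 = savgol_first_derivative(line, window=11)
d21 = savgol_first_derivative(line, window=21)
```

A wider window averages over more bands, which suppresses more noise at the cost of losing more points at the spectrum edges (30 values remain with 21 points, compared to 40 with 11 points in this example).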

On average, the RMSPE of updated models increased 36% compared to Dataset-1 results (Tables 3 and 6), likely due to the increased validation set size of Dataset-3 over Dataset-1. Nevertheless, the standardized parameter RPD increased 43% for the updated models, representing a real improvement in prediction performance. The performance of the top four updated models can be visualized in Fig. 6. Notice that all models had regression lines close to the 1:1 relationship and that the dispersion of predicted SOC values increased for observed SOC values >10%. Increasing the number of pedons in the dataset with SOC content >10% would likely further improve the algorithm’s learning process and, hence, predictions. Thus, further efforts will focus on including pedons with high SOC content in the dataset for future update rounds of prediction models.

Fig. 4. Performance of four visible and near-infrared-based models developed with Dataset-1 (2043 samples) and used to predict soil organic C (SOC) for Dataset-2.2 samples. The complete Dataset-2.2 had 2983 samples and its derived Cluster 1, 2, and 3 had 1395, 874, and 714 samples, respectively. Dashed lines represent the 1:1 relationship. Dark lines represent the regression analysis of observed vs. predicted SOC. PLSR, partial least squares regression; RF, random forest; Raw, raw reflectance spectra; First, first derivative; 11, smoothing segment of 11 nm; 21, smoothing segment of 21 nm; CS, clipped spectra.

Fig. 5. Performance of complete dataset and combined clusters models developed with Dataset-1 (2043 samples) and used to predict soil organic C (SOC) for Dataset-2.2 samples that consisted of 2983 independent new samples. Dashed lines represent the 1:1 relationship. Dark lines represent the regression analysis of observed vs. predicted SOC. RF, random forest; First, first derivative; 21, smoothing segment of 21 nm; CS, clipped spectra.

CONCLUSIONS

The RF multivariate technique consistently resulted in more accurate VNIR-based models for predicting SOC than PLSR. The non-normal distribution of SOC data caused the nonparametric RF models to outperform the parametric PLSR models. Preprocessing spectra with the Savitzky–Golay first derivative across a smoothing segment of 21 nm tended to produce better predictions than raw data. This was further improved by clipping out noise bands from the spectra.

Collecting and evaluating data in three separate sets allowed models to be revalidated and updated with new independent samples. This was valuable because it tested the robustness of the models over a wider range of SOC values. The revalidation process also indicated that subsetting data through cluster analysis was not worthwhile, since the small improvement found did not compensate for the time and computing resources needed for this more laborious procedure.

The updated RF models presented satisfactory validation results, with R2, RMSPE, and RPD ranging from 0.93 to 0.95, 2.062 to 2.501%, and 3.62 to 4.39, respectively. Developed with a dataset of 2039 pedons (8624 mineral horizons) from the 50 states of the United States, these models cover significant variability of U.S. soils and environments and should perform better than nonupdated RF models in predicting SOC for new independent samples. Additional revalidation and updating of these models should be done at regular intervals as new samples are added to the database. This continuous process will provide the most robust models possible to end users.

ACKNOWLEDGMENTS

This research was sponsored by the USDA-NRCS. The authors would like to thank all staff of the National Soil Survey Center for sampling, analyzing, and managing the data used in this research.

Table 7. Cross-validation and validation results of eight visible and near-infrared-based models developed to predict soil organic C (SOC). Calibration and validation sets had 6033 and 2580 samples, respectively, with observed SOC ranging from 0.0 to 63.5 and 0.0 to 63.3%, respectively.

Cross-validation Validation

Model R2† RMSPE‡ (%) R2 RMSPE (%) RPD§

RF/First/11¶ 0.93 2.500 0.95 2.107 4.29

RF/First/11/CS 0.93 2.464 0.95 2.099 4.31

RF/First/21 0.93 2.460 0.95 2.080 4.35

RF/First/21/CS 0.93 2.432 0.95 2.062 4.39

RF/Second/11 0.92 2.595 0.93 2.501 3.62

RF/Second/11/CS 0.92 2.585 0.93 2.498 3.62

RF/Second/21 0.93 2.403 0.95 2.082 4.34

RF/Second/21/CS 0.93 2.386 0.95 2.096 4.31

† R2, coefficient of multiple determination.

‡ RMSPE, root mean squared prediction error.

§ RPD, residual prediction deviation.

¶ RF, random forest; First, first derivative; Second, second derivative; 11, smoothing segment of 11 nm; 21, smoothing segment of 21 nm; CS, clipped spectra.

Fig. 6. Validation performance of four visible and near-infrared-based models used to predict soil organic C (SOC). Dataset-3 was used to calibrate and validate the models. Calibration and validation sets had 6033 and 2580 samples, respectively, with observed SOC ranging from 0.0 to 63.5%. Dashed lines represent the 1:1 relationship. Dark lines represent the regression analysis of observed vs. predicted SOC. RF, random forest; First, first derivative; Second, second derivative; 21, smoothing segment of 21 nm; CS, clipped spectra.

REFERENCES

Breiman, L. 1996. Bagging predictors. Mach. Learn. 24:123–140.

Breiman, L. 2001. Random forests. Mach. Learn. 45:5–32. doi:10.1023/A:1010933404324

Brown, D.J., K.D. Shepherd, M.G. Walsh, M.D. Mays, and T.G. Reinsch. 2006. Global soil characterization with VNIR diffuse reflectance spectroscopy. Geoderma 132:273–290. doi:10.1016/j.geoderma.2005.04.025

Chang, C., D.A. Laird, M.J. Mausbach, and C.R. Hurburgh, Jr. 2001. Near-infrared reflectance spectroscopy-principal components regression analyses of soil properties. Soil Sci. Soc. Am. J. 65:480–490. doi:10.2136/sssaj2001.652480x

Clark, R.N. 1999. Spectroscopy of rocks and minerals, and principles of spectroscopy. In: N. Rencz, editor, Remote sensing for the earth sciences: Manual of remote sensing. John Wiley & Sons, New York. p. 3–52.

Coppens, F., P. Garnier, S. De Gryze, R. Merckx, and S. Recous. 2006. Soil moisture, carbon and nitrogen dynamics following incorporation and surface application of labelled crop residues in soil columns. Eur. J. Soil Sci. 57:894–905. doi:10.1111/j.1365-2389.2006.00783.x

Díaz-Uriarte, R., and S.A. Andrés. 2006. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 7:3. doi:10.1186/1471-2105-7-3

Duckworth, J. 2004. Mathematical data preprocessing. In: C.A. Roberts, J. Workman Jr., and J.B. Reeves III, editors, Near-infrared spectroscopy in agriculture. ASA, CSSA, SSSA, Madison, WI. p. 115–132.

Dunn, B.W., H.G. Beecher, G.D. Batten, and S. Ciavarella. 2002. The potential of near-infrared reflectance spectroscopy for soil analysis: A case study from the Riverine Plain of south-eastern Australia. Aust. J. Exp. Agric. 42:607–614. doi:10.1071/EA01172

Fest, B.J., S.J. Livesley, M. Drösler, E. van Gorsel, and S.K. Arndt. 2009. Soil-atmosphere greenhouse gas exchange in a cool, temperate Eucalyptus delegatensis forest in south-eastern Australia. Agric. For. Meteorol. 149:393–406. doi:10.1016/j.agrformet.2008.09.007

Guerrero, C., R.A. Viscarra Rossel, and A.M. Mouazen. 2010. Diffuse reflectance spectroscopy in soil science and land resource assessment. Geoderma 158:1–2. doi:10.1016/j.geoderma.2010.05.008

Heise, H.M., and R. Winzen. 2002. Chemometrics in near-infrared spectroscopy. In: H.W. Siesler, Y. Ozaki, S. Kawata, and H.M. Heise, editors, Near-infrared spectroscopy. Wiley-VCH, Weinheim, Germany. p. 125–162.

Ho, T.K. 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20:832–844.

Izenman, A.J. 2008. Committee machines. In: A.J. Izenman, editor, Modern multivariate statistical techniques. Springer, New York. p. 505–549.

Kooistra, L., R. Wehrens, R.S.E.W. Leuven, and L.M.C. Buydens. 2001. Possibilities of visible-near-infrared spectroscopy for the assessment of soil contamination in river floodplains. Anal. Chim. Acta 446:97–105. doi:10.1016/S0003-2670(01)01265-X

Minasny, B., and A.B. McBratney. 2008. Regression rules as a tool for predicting soil properties from infrared reflectance spectroscopy. Chemom. Intell. Lab. Syst. 94:72–79. doi:10.1016/j.chemolab.2008.06.003

Purakayastha, T.J., D.R. Huggins, and J.L. Smith. 2008. Carbon sequestration in native prairie, perennial grass, no-till, and cultivated Palouse silt loam. Soil Sci. Soc. Am. J. 72:534–540. doi:10.2136/sssaj2005.0369

R Development Core Team. 2011. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Reeves, J., III, G. McCarty, and T. Mimmo. 2002. The potential of diffuse reflectance spectroscopy for the determination of carbon inventories in soils. Environ. Pollut. 116:S277–S284. doi:10.1016/S0269-7491(01)00259-7

Sakia, R.M. 1992. The Box-Cox transformation technique: A review. Statistician 41:169–178. doi:10.2307/2348250

Sankey, J.B., D.J. Brown, M.L. Bernard, and R.L. Lawrence. 2008. Comparing local vs. global visible and near-infrared (VisNIR) diffuse reflectance spectroscopy (DRS) calibration for the prediction of soil clay, organic C and inorganic C. Geoderma 148:149–158. doi:10.1016/j.geoderma.2008.09.019

Schoeneberger, P.J., D.A. Wysocki, E.C. Benham, and W.D. Broderson. 2002. Field book for describing and sampling soils. Version 2.0. NRCS-National Soil Survey Center, Lincoln, NE.

Soil Survey Staff. 2004. Soil survey laboratory methods manual. Soil Survey Investigations Rep. 42. Version 4.0. NRCS, Washington, DC.

Stevenson, F.J. 1994. Humus chemistry: Genesis, composition, reactions. 2nd ed. John Wiley & Sons, New York.

Vasques, G.M., S. Grunwald, and W.G. Harris. 2010. Spectroscopic models of soil organic carbon in Florida, USA. J. Environ. Qual. 39:923–934. doi:10.2134/jeq2009.0314

Vasques, G.M., S. Grunwald, and J.O. Sickman. 2008. Comparison of multivariate methods for inferential modeling of soil carbon using visible/near-infrared spectra. Geoderma 146:14–25. doi:10.1016/j.geoderma.2008.04.007

Vasques, G.M., S. Grunwald, and J.O. Sickman. 2009. Modeling of soil organic carbon fractions using visible-near-infrared spectroscopy. Soil Sci. Soc. Am. J. 73:176–184. doi:10.2136/sssaj2008.0015

Viscarra Rossel, R.A., and T. Behrens. 2010. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma 158:46–54. doi:10.1016/j.geoderma.2009.12.025

Walkley, A., and I.A. Black. 1934. An examination of Degtjareff method for determining soil organic matter and a proposed modification of the chromic acid titration method. Soil Sci. 37:29–38. doi:10.1097/00010694-193401000-00003

Weil, R.R., and F. Magdoff. 2004. Significance of soil organic matter to soil quality and health. In: F. Magdoff and R.R. Weil, editors, Soil organic matter in sustainable agriculture. CRC Press, Boca Raton, FL. p. 1–44.

Williams, P.C. 1987. Variables affecting near-infrared reflectance spectroscopic analysis. In: P. Williams and K. Norris, editors, Near-infrared technology in the agricultural and food industries. Am. Assoc. Cereal Chemists, St. Paul, MN. p. 143–167.

Wold, S., H. Martens, and H. Wold. 1983. The multivariate calibration method in chemistry solved by the PLS method. In: A. Ruhe and B. Kagstrom, editors, Matrix pencils. Lecture Notes in mathematics. Springer-Verlag, Heidelberg, Germany. p. 286–293.

Wold, S., M. Sjöström, and L. Eriksson. 2001. PLS-regression: A basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58:109–130. doi:10.1016/S0169-7439(01)00155-1