Supplementary Information
Quantifying the influence of surface physico-
chemical properties of biosorbents on heavy metal
adsorption
Chaamila Pathiranaa,b,c, Abdul M. Ziyathd, K.B.S.N. Jinadasac, Prasanna Egodawattab,
Sarina Sarinab, Ashantha Goonetillekeb*
aDepartment of Forestry and Environmental Science, University of Sri Jayewardenepura,
Nugegoda, Sri Lanka
bScience and Engineering Faculty, Queensland University of Technology (QUT), GPO Box
2434, Brisbane, 4001, Queensland, Australia
cDepartment of Civil Engineering, University of Peradeniya, Sri Lanka
dZedz Consultants Pty Ltd, Hillcrest, QLD 4118, Australia
[email protected]; [email protected]; [email protected];
[email protected]; [email protected]; [email protected]
*Corresponding author
Ashantha Goonetilleke
1. Selection of Biosorbents
For the selection of biosorbents, PROMETHEE (Preference Ranking Organisation METHod
for Enrichment Evaluations) which is a Multi Criteria Decision Making (MCDM) Technique
was employed. MCDM techniques are employed to help with the decision making process
when multi variable problems are involved. From the various MCDM methods available,
PROMETHEE is considered as a relatively sophisticated method compared to the others
(Brans, Vincke and Mareschal 1986; Keller, Massart and Brans 1991).
PROMETHEE is a non-parametric data analysis method used to rank the actions/objects on
the basis of a set of pre-determined criteria. For each of the variables in the data matrix, the
degree of preference of one object over another is assessed. The ranking order is developed
by calculating the net ranking flow (φ value) for the available objects/actions on the
basis of a range of criteria (Ayoko et al. 2007; Podvezko and Podviezko 2010). To calculate
the φ values, each criterion must be provided with three conditions: a preference function, a
preference order (maximise/minimise) and a weighting. The PROMETHEE algorithm then
employs a number of steps to calculate the φ values between objects as explained elsewhere
(Keller, Massart and Brans 1991; Kokot and Phuong 1999). Visual PROMETHEE software
was used for the analysis.
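The net ranking flow calculation outlined above can be sketched in base R. This is a minimal illustration using the usual (step) preference function, hypothetical criterion values and weights; it is not the Visual PROMETHEE implementation used in the study.

```r
# Hypothetical decision matrix: 3 actions (rows), 2 criteria (columns),
# both criteria to be maximised
X <- rbind(a1 = c(10, 2),
           a2 = c(8, 4),
           a3 = c(6, 1))
w <- c(0.6, 0.4)                  # criterion weights (sum to 1)

# Usual (step) preference function: P = 1 if the difference d > 0, else 0
pref <- function(d) as.numeric(d > 0)

n <- nrow(X)
phi <- numeric(n)
for (a in 1:n) {
  pos <- neg <- 0
  for (b in 1:n) {
    if (a == b) next
    # aggregated preference of a over b, and of b over a
    pos <- pos + sum(w * pref(X[a, ] - X[b, ]))
    neg <- neg + sum(w * pref(X[b, ] - X[a, ]))
  }
  # net ranking flow: positive flow minus negative flow
  phi[a] <- (pos - neg) / (n - 1)
}
names(phi) <- rownames(X)
phi                               # actions are ranked by decreasing phi
```

The net flows sum to zero over all actions, which is a useful sanity check on any PROMETHEE II implementation.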
Table S1 PROMETHEE II complete ranking results for the five biosorbents
Material               φ       Rank
Coconut shell biochar  0.0589  1
Coir pith              0.0342  2
Rice straw             0.0304  3
Rice husk              0.0228  4
Tea waste              0.0175  5
Tea waste (TW) and coconut shell biochar (CSB), ranked 5 and 1 respectively, were selected
for the preparation of material mixtures, as the variability of physico-chemical parameters
was highest between them.
Figure S1 Photographs of the two selected biosorbents: (a) tea factory waste; (b) coconut
shell biochar
Table S2 Weight percentage of TW and CSB used to generate biosorbent mixtures
Sample            1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21
Weight % of CSB  100   95   90   85   80   75   70   65   60   55   50   45   40   35   30   25   20   15   10    5    0
Weight % of TW     0    5   10   15   20   25   30   35   40   45   50   55   60   65   70   75   80   85   90   95  100
2. ANOVA results of the material mixtures for different variables
Table S3 Significance of p values obtained by ANOVA for each variable in the material
mixtures
Variable   p value (ANOVA)
SSA        < 0.001
PS         < 0.005
PV         < 0.001
ZP         > 0.5
TAG        < 0.001
TBG        < 0.005
3. Statistical approach adopted
Figure S2 Statistical approach adopted to investigate the influence of biosorbent physico-
chemical properties on heavy metal adsorption (SSA specific surface area, PV pore volume,
PS pore size, ZP zeta potential, TAG total acidic group and TBG total basic group)
4. Pearson product-moment correlation coefficient (PPMCC)
The Pearson product-moment correlation coefficient (PPMCC) is a measure of the strength
and direction of association that exists between two variables measured at an interval scale.
PPMCC is defined as the covariance of the two variables divided by the product of their
standard deviations. It takes a value between +1 and −1, where +1 implies a total positive
linear correlation, 0 implies no linear correlation, and −1 a total negative linear
correlation (Bruce and Bruce 2017).
This test involves four assumptions: 1. Variables are measured at the interval or ratio level
(i.e., they are continuous); 2. There is a linear relationship between the two variables; 3.
There should be no significant outliers; and 4. The variables should be approximately
normally distributed. Statistical inference based on Pearson's correlation coefficient often
involves running a permutation (resampling) test.
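The definition above can be checked directly in base R against the built-in cor() function, and the permutation test can be sketched with sample(); the data vectors below are illustrative only, not the study's measurements.

```r
# Illustrative data vectors (hypothetical, not the study's measurements)
x <- c(2.1, 3.4, 4.0, 5.6, 7.2)
y <- c(1.0, 2.2, 2.8, 4.1, 5.9)

# PPMCC from its definition: covariance divided by the product of
# the standard deviations
r_def <- cov(x, y) / (sd(x) * sd(y))

# The built-in Pearson correlation gives the same value
r_cor <- cor(x, y, method = "pearson")

# Simple permutation (resampling) test: shuffle one variable and count
# how often the permuted |r| reaches the observed |r|
set.seed(1)
r_perm  <- replicate(10000, cor(x, sample(y)))
p_value <- mean(abs(r_perm) >= abs(r_cor))
```

The perm.cor.test() function from the RVAideMemoire package, used later in this document, performs essentially this resampling procedure.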
5. Principal components analysis (PCA)
PCA is a multivariate data analysis technique employed to assess and visualize the
interdependencies among variables and to reduce the significantly correlated (redundant)
variables which serve to measure the same construct. Matrices of data containing significant
proportions of interrelated variables are converted to a set of new hypothetical variables
known as principal components (PCs) which are orthogonal (uncorrelated) to one another.
They are ordered so that the first few PCs represent most of the variation in
the original data matrix. PCs reflect both common and unique variance of the original
variables (conversely, common factor analysis aims to exclude unique variance) and serve to
reduce the number of variables under assessment, allowing the identification and assessment of
groups of interrelated variables (Salkind 2010).
The importance of a component for a given observation can be quantified by the squared
cosine:

cos²(i,l) = f²(i,l) / d²(i,g)

where d²(i,g) is the squared distance of observation i to the origin, computed (via the
Pythagorean theorem) as the sum of the squared factor scores of that observation, and
f(i,l) is the factor score of observation i on component l. Components with a large
cos²(i,l) value contribute a relatively large portion of the total distance and are
therefore important for that observation.
The distance to the centre of gravity is also defined for supplementary observations, for
which the squared cosine can be computed and remains meaningful. Therefore, the cos² value
can help to identify the components that are important for interpreting both active and
supplementary observations.
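As a cross-check of the definition above, the squared cosines of the observations can be computed directly from a prcomp fit in base R; the data matrix below is random illustrative data, not the study's data matrix.

```r
set.seed(42)
# Illustrative data: 20 observations, 4 variables (random, not the study's data)
dat <- as.data.frame(matrix(rnorm(80), ncol = 4))

pca    <- prcomp(dat, center = TRUE, scale. = TRUE)
scores <- pca$x                  # factor scores f(i, l)

# Squared distance of each observation to the origin: the sum of its
# squared factor scores (Pythagorean theorem)
d2 <- rowSums(scores^2)

# Squared cosine of observation i on component l
cos2_obs <- scores^2 / d2

# Each observation's cos2 values sum to 1 across all components
rowSums(cos2_obs)
```

By construction every cos² value lies between 0 and 1, and the values for one observation sum to 1 over all components.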
Table S4 cos² values for PC1 and PC2
Variable  cos² for PC1    Variable  cos² for PC2
TAG       0.9068          SSA       0.8150
ZP        0.7091          PS        0.7765
SSA       0.4609          PV        0.7116
PS        0.2076          ZP        0.1140
PV        0.0884          TBG       0.0882
TBG       0.0017          TAG       0.0515
6. Penalized Regression: Ridge, Lasso and Elastic Net Regressions
In contrast to the standard linear model fitted by ordinary least squares, penalized
regression creates a linear regression model that is penalized for having too many
variables by adding a constraint to the equation. Penalized regression methods are also
known as shrinkage or regularization methods. Imposing this penalty shrinks the coefficient
values towards zero, allowing the less contributive variables to have coefficients close
to or equal to zero. The most common penalized regression methods are ridge regression,
lasso regression and elastic net regression.
Lasso regression employs L1 regularization (lasso penalization), which adds a penalty equal
to the sum of the absolute values of the coefficients. It can shrink some parameters exactly
to zero, so that some variables play no role in the model, making variable selection an
intrinsic feature of the method. L2 regularization, used in ridge regression (ridge
penalization), adds a penalty equal to the sum of the squared values of the coefficients;
it forces the parameters to be relatively small but never exactly zero, so all variables
are retained. Lambda is the penalization parameter shared by both methods.
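The contrasting shrinkage behaviour of the two penalties can be sketched in base R for the special case of an orthonormal design, where the lasso and ridge solutions have closed forms (soft-thresholding and proportional shrinkage, respectively). The coefficient values are hypothetical, and this is an illustration of the penalties, not the glmnet fitting procedure.

```r
# Hypothetical ordinary least squares coefficient estimates
b_ols  <- c(3.0, 1.2, 0.4, -0.8)
lambda <- 0.5                     # penalization parameter

# Lasso (L1) with an orthonormal design: soft-thresholding --
# coefficients smaller than lambda in magnitude become exactly zero
b_lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda, 0)

# Ridge (L2) with an orthonormal design: proportional shrinkage --
# coefficients become smaller but are never exactly zero
b_ridge <- b_ols / (1 + lambda)

b_lasso   # 2.5  0.7  0.0  -0.3 (third coefficient dropped from the model)
b_ridge   # all four coefficients shrunken but retained
```

This illustrates the point made above: lasso performs variable selection, while ridge keeps every variable.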
Elastic net is a mix of both L1 and L2 regularization (James et al. 2013; Bruce and Bruce
2017). A penalty is applied to both the sum of the absolute values and the sum of the
squared values of the coefficients. The parameter alpha sets the ratio between L1 and L2
regularization, giving a hybrid behavior with variable selection as an intrinsic feature.
Optimal values for lambda and alpha are selected by the algorithm using the RMSE (root
mean squared error). The RMSE is the square root of the mean of the squared residuals and
is a measure of the absolute fit of the model to the data; in other words, it indicates how
close the observed measurements are to the model's predicted values (Bruce and Bruce 2017;
James et al. 2013; Friedman, Hastie and Tibshirani 2009).
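As a minimal sketch, the RMSE described above can be computed directly in base R; the observed and predicted values are hypothetical.

```r
# Hypothetical observed measurements and model predictions
observed  <- c(2.0, 3.5, 5.0, 6.5)
predicted <- c(2.2, 3.1, 5.3, 6.4)

# RMSE: the square root of the mean of the squared residuals
rmse <- sqrt(mean((observed - predicted)^2))
rmse
```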
7. Packages and libraries used for statistical analysis in RStudio.
The following packages and libraries were loaded.
library(devtools)
library(caret)
library(factoextra)
library(elasticnet)
library(glmnet)
library(VIF)
library(fmsb)
library(tidyverse)
library(plyr)
library(scales)
library(grid)
library(RVAideMemoire)
8. Code used for statistical analysis in RStudio.
# Loading Libraries for analysis
library(devtools)
library(caret)
library(factoextra)
library(elasticnet)
library(glmnet)
library(VIF)
library(fmsb)
library(tidyverse)
# datamatrix is the original data matrix.
# DataPbAll is a data matrix created by removing Cu and Cd adsorption data
# from the original data matrix.
# DataCuAll is a data matrix created by removing Pb and Cd adsorption data
# from the original data matrix.
# DataCdAll is a data matrix created by removing Pb and Cu adsorption data
# from the original data matrix.
# Prepare the correlation matrix
correlationMatrix <- cor(datamatrix)
# summarize the correlation matrix
print(correlationMatrix)
# Permutation test for PPMCC. Example: PPMCC between PV and SSA is tested
# with 10000 resamples. perm.cor.test() is provided by the RVAideMemoire package.
library(RVAideMemoire)
x <- datamatrix$PV
y <- datamatrix$SSA
perm.cor.test(x, y, nperm = 10000, progress = TRUE)
# Prepare PCA analysis
pcatest1 <- prcomp(datamatrix,
center = TRUE,
scale. = TRUE)
# Print and summarize output
print(pcatest1)
summary(pcatest1)
# Attributes of PCA test
res.var <-get_pca_var(pcatest1)
res.var$coord
res.var$contrib
# Obtain cos2 values for the variables
res.var$cos2
# Print the biplot for PCA
fviz_pca_biplot(pcatest1,
pointsize = 2,
col.var="dark blue",
repel = TRUE)
# Assessing VIF of variables was done using the following code. model1 is a
# linear model where Pb is the dependent variable. Cu or Cd can also be used as
# the dependent variable and the VIF values will be the same.
model1 <- lm(Pb~., data=DataPbAll)
car::vif(model1)
#Checking for aliased coefficients
alias(model1)
# Building models with Enet using glmnet package.
# DataCuSel = A data matrix created by removing the following from the original
# data matrix: Pb, Cd, Carboxylic, Phenolic, Lactonic, PV.
# DataPbSel = A data matrix created by removing the following from the original
# data matrix: Cu, Cd, Carboxylic, Phenolic, Lactonic, PV.
# DataCdSel = A data matrix created by removing the following from the original
# data matrix: Pb, Cu, Carboxylic, Phenolic, Lactonic, PV.
# Defining repeated k-fold cross validation (k=10, repeats=10).
fitControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats=10,
search = "grid")
# Building model for Cu.
model.Cu <- caret::train(Cu~ .,
data = DataCuSel,
method="glmnet",
trControl = fitControl,
tuneLength = 20)
# Printing the model.
summary(model.Cu)
print(model.Cu)
# Attributes for final model
model.Cu$finalModel
model.Cu$bestTune
model.Cu$coefnames
# Coefficients when lambda is set to the optimal value
FinModCu <- model.Cu$finalModel
coef(FinModCu, s=model.Cu$finalModel$lambdaOpt)
model.Cu$finalModel$lambdaOpt
# Define and summarize the importance of variables
impCu <- varImp(model.Cu, scale=FALSE)
print(impCu)
# Building model for Cd.
model.Cd <- caret::train(Cd~ .,
data = DataCdSel,
method="glmnet",
trControl = fitControl,
tuneLength = 20)
# Printing the model.
summary(model.Cd)
print(model.Cd)
# Attributes for final model
model.Cd$finalModel
model.Cd$bestTune
model.Cd$coefnames
# Coefficients when lambda is set to the optimal value
FinModCd <- model.Cd$finalModel
coef(FinModCd, s=model.Cd$finalModel$lambdaOpt)
model.Cd$finalModel$lambdaOpt
# Define and summarize the importance of variables
impCd <- varImp(model.Cd, scale=FALSE)
print(impCd)
# Building model for Pb.
model.Pb <- caret::train(Pb~ .,
data = DataPbSel,
method="glmnet",
trControl = fitControl,
tuneLength = 20)
# Printing the model.
summary(model.Pb)
print(model.Pb)
# Attributes for final model
model.Pb$finalModel
model.Pb$bestTune
model.Pb$coefnames
# Coefficients when lambda is set to the optimal value
FinModPb <- model.Pb$finalModel
coef(FinModPb, s=model.Pb$finalModel$lambdaOpt)
model.Pb$finalModel$lambdaOpt
# Define and summarize the importance of variables
impPb <- varImp(model.Pb, scale=FALSE)
print(impPb)
References
Ayoko, Godwin A, Kirpal Singh, Steven Balerea and Serge Kokot. 2007. "Exploratory
multivariate modeling and prediction of the physico-chemical properties of surface
water and groundwater." Journal of Hydrology 336 (1-2): 115-124.
Brans, Jean-Pierre, Ph Vincke and Bertrand Mareschal. 1986. "How to select and how to rank
projects: The PROMETHEE method." European Journal of Operational Research 24
(2): 228-238.
Bruce, Peter and Andrew Bruce. 2017. Practical Statistics for Data Scientists: 50 Essential
Concepts. O'Reilly Media.
Friedman, Jerome, Trevor Hastie and Robert Tibshirani. 2009. "glmnet: Lasso and
elastic-net regularized generalized linear models." R package version 1 (4).
James, Gareth, Daniela Witten, Trevor Hastie and Robert Tibshirani. 2013. An introduction
to statistical learning. Vol. 112: Springer.
Keller, HR, DL Massart and JP Brans. 1991. "Multicriteria decision making: a case study."
Chemometrics and Intelligent Laboratory Systems 11 (2): 175-189.
Kokot, S and Tran Dong Phuong. 1999. "Elemental content of Vietnamese rice. Part 2.
Multivariate data analysis." Analyst 124 (4): 561-569.
Podvezko, Valentinas and Askoldas Podviezko. 2010. "Use and choice of preference
functions for evaluation of characteristics of socio-economical processes."
Salkind, Neil J. 2010. Encyclopedia of research design. Vol. 1: Sage.