Exploring Variable Clustering and Importance in JMP
Copyright © 2012, SAS Institute Inc. All rights reserved.
EXPLORING VARIABLE CLUSTERING AND IMPORTANCE IN JMP
CHRIS GOTWALT AND RYAN PARKER
VARIABLE CLUSTERING: INTRODUCTION
• Variable clustering is a method that performs dimension reduction on the
number of input variables to be used in a predictive model.
• Reduces inputs by finding groups of similar variables so that a single variable
can represent each group.
• Helps reduce the effects of collinearity among the input variables.
• Developed by SAS/STAT Development Director Warren Sarle.
VARIABLE CLUSTERING: AN ITERATIVE ALGORITHM
• Iteratively splits and assigns variables to clusters.
• Sample iterations for variables in Wine Quality data set:
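[Figure: cluster assignments over Iterations 1 to 3. Iteration 1 starts with a single cluster: Alcohol, Citric Acid, pH, Sugar, Sulfur Dioxide. Cluster labels shown for the later iterations: "Alcohol, Citric Acid, Sulfur Dioxide"; "pH, Sugar"; "Alcohol, Sugar"; "pH, Sulfur Dioxide"; "Citric Acid".]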
VARIABLE CLUSTERING: ALGORITHM DETAILS
• At each iteration the cluster with the largest second eigenvalue is split.
• Variables within this cluster are assigned to two new clusters based on each
variable’s correlation with the first two orthoblique rotated principal
components.
• After the split, variables from other clusters are reassigned to one of the new
clusters if they have a higher correlation with the new cluster.
• Ends when the second eigenvalue of all clusters is less than one.
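The split and stopping rules above can be sketched in a few lines of Python. This is only an illustration, not JMP's implementation: the function names and the synthetic data are assumptions, and the assignment step based on orthoblique rotated components is left out.

```python
import numpy as np

def second_eigenvalue(X):
    """Second-largest eigenvalue of the correlation matrix of the columns of X."""
    if X.shape[1] < 2:
        return 0.0  # a single-variable cluster has no second eigenvalue
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # returned in ascending order
    return float(eigvals[-2])

def cluster_to_split(X, clusters):
    """Index of the cluster with the largest second eigenvalue,
    or None when every second eigenvalue is below one (stop)."""
    scores = [second_eigenvalue(X[:, cols]) for cols in clusters]
    best = int(np.argmax(scores))
    return best if scores[best] >= 1.0 else None

# Hypothetical data: two pairs of variables, each pair driven by its own factor.
rng = np.random.default_rng(0)
n = 500
a, b = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    a + 0.1 * rng.normal(size=n),
    a + 0.1 * rng.normal(size=n),
    b + 0.1 * rng.normal(size=n),
    b + 0.1 * rng.normal(size=n),
])

split_all = cluster_to_split(X, [[0, 1, 2, 3]])     # one big cluster: should split
split_done = cluster_to_split(X, [[0, 1], [2, 3]])  # already separated: should stop
print(split_all, split_done)
```

With all four variables in one cluster the second eigenvalue is well above one, so that cluster is chosen for splitting; once the two pairs are separated, no cluster qualifies and the procedure ends.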
VARIABLE CLUSTERING: REDUCING EACH CLUSTER TO A SINGLE VARIABLE
[Figure: example clusters reduced to single variables; labels shown: pH, Sugar, pH, Citric Acid.]
• Each cluster can be reduced to a single
variable for modeling.
• There are two ways to do this:
1. We can use the most representative
variable from each cluster.
2. Alternatively, the cluster component from
each cluster can be used.
VARIABLE CLUSTERING: MOST REPRESENTATIVE VARIABLES
• These are variables that best represent each cluster.
• Each has the highest correlation with the other variables in its cluster.
• Most representative variables provide a clear interpretation when used.
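As a rough sketch of the idea, the snippet below picks, within one cluster, the variable with the highest mean squared correlation with the other members. That scoring rule is an assumption made for illustration (the slides do not give the exact criterion JMP uses), and the function name and data are hypothetical.

```python
import numpy as np

def most_representative(X, cols):
    """Column index in `cols` with the highest mean squared correlation
    with the other columns of the same cluster (illustrative rule)."""
    if len(cols) == 1:
        return cols[0]
    r2 = np.corrcoef(X[:, cols], rowvar=False) ** 2
    np.fill_diagonal(r2, 0.0)  # ignore each variable's correlation with itself
    mean_r2 = r2.sum(axis=1) / (len(cols) - 1)
    return cols[int(np.argmax(mean_r2))]

# Hypothetical cluster: three noisy copies of one factor, column 0 the least noisy.
rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.1 * rng.normal(size=n),
    z + 0.5 * rng.normal(size=n),
    z + 1.0 * rng.normal(size=n),
])

rep = most_representative(X, [0, 1, 2])
print(rep)
```

Because the chosen variable is one of the original inputs, a model built on it keeps the original units and meaning, which is the interpretability advantage the slide points to.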
VARIABLE CLUSTERING: CLUSTER COMPONENTS
• New variables created using the first principal component of each cluster.
• Provide a way to combine variables in each cluster into a single variable.
• Similar to traditional principal components analysis (PCA) except that each
cluster component only uses variables from that cluster.
• Interpretation not as clear when compared to most representative variables.
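A minimal sketch of a cluster component, assuming it is the first principal component of the standardized variables in one cluster. The function name and the synthetic data are hypothetical, and JMP's scaling of the component may differ.

```python
import numpy as np

def cluster_component(X, cols):
    """First principal component score using only the columns in `cols`."""
    Z = X[:, cols]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)  # standardize each variable
    corr = np.atleast_2d(np.corrcoef(Z, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(corr)           # ascending eigenvalues
    w = eigvecs[:, -1]                                # loadings of the first component
    if w.sum() < 0:                                   # fix the arbitrary sign
        w = -w
    return Z @ w

# Hypothetical data: columns 0 and 1 share factor a, column 2 follows factor b.
rng = np.random.default_rng(1)
n = 1000
a, b = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    a + 0.1 * rng.normal(size=n),
    a + 0.1 * rng.normal(size=n),
    b + 0.1 * rng.normal(size=n),
])

comp = cluster_component(X, [0, 1])  # column 2 plays no part in this component
r = np.corrcoef(comp, a)[0, 1]
print(round(r, 3))
```

Unlike a PCA on all inputs, the loadings here are zero for every variable outside the cluster, so each component summarizes only its own group.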
VARIABLE CLUSTERING: DEMO: IMPORTANT TERMS
• RSquare with Own Cluster
  • The RSquare a variable has with variables in its cluster.
• RSquare with Next Closest
  • The RSquare a variable has with variables in the next most similar cluster.
• 1-RSquare Ratio
  • Relative similarity between a variable’s own cluster and the next closest cluster.
  • Values should be less than 1 for a variable that sits in the right cluster.
  • Values greater than 1 indicate that the variable should be moved to the next closest cluster.
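The slide does not spell out the ratio. Assuming the usual definition, (1 - RSquare with Own Cluster) / (1 - RSquare with Next Closest), a small worked example looks like this (the function name and input values are hypothetical):

```python
def one_minus_rsquare_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    return (1.0 - r2_own) / (1.0 - r2_next)

# Well-placed variable: high R^2 with its own cluster, low with the next closest.
good = one_minus_rsquare_ratio(0.9, 0.2)   # about 0.125

# Misplaced variable: it fits the next closest cluster better than its own.
bad = one_minus_rsquare_ratio(0.3, 0.6)    # about 1.75

print(good, bad)
```

Small ratios mean a variable is explained well by its own cluster and poorly by any other; a ratio above 1 is the signal to reassign it.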
VARIABLE IMPORTANCE: INTRODUCTION
• Provides a general way to assess the importance of variables for predictive
models in JMP.
• Insight is in terms of practical significance of input variables.
• Based on functional decomposition ideas of I. M. Sobol.
VARIABLE IMPORTANCE: FUNCTIONAL DECOMPOSITION
• I. M. Sobol showed that a function $f(X_1, \dots, X_p)$ can be decomposed into a
sum of lower-dimensional functions:
• $f(X_1, \dots, X_p) = f_0 + f_1(X_1) + \dots + f_p(X_p) + f_{12}(X_1, X_2) + \dots$
• The decomposition has a function for each $X_i$, each pair $(X_i, X_j)$, etc.
• The variability of these lower-dimensional functions assesses the importance of
the input variables.
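To make the link between the decomposition and variability concrete: under the standard assumptions of Sobol's result (independent inputs, and component functions with zero mean that are mutually orthogonal), the variance of the function splits in the same way as the function itself. The symbols $V_i$ and $V_{ij}$ are notation introduced here, not taken from the slides.

```latex
\operatorname{Var}\bigl(f(X_1, \dots, X_p)\bigr)
  = \sum_{i} V_i + \sum_{i<j} V_{ij} + \dots,
\qquad
V_i = \operatorname{Var}\bigl(f_i(X_i)\bigr),
\quad
V_{ij} = \operatorname{Var}\bigl(f_{ij}(X_i, X_j)\bigr)
```

Each term is the share of prediction variance attributable to one input or to one group of inputs acting together.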
VARIABLE IMPORTANCE: IMPORTANCE EFFECTS
• Assessment of variable importance is in terms of effect indices.
• These indices are numbers between 0 and 1 indicating relative importance.
• Main effect indices measure variability of predictions due to a single input.
  • They do not account for interaction effects.
• Total effect indices measure the total variability of predictions due to the input.
  • They combine all main and higher order interaction effects.
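The difference between the two indices shows up clearly on a toy model. The sketch below estimates both by Monte Carlo with independent uniform inputs, using the Saltelli estimator for main effects and the Jansen estimator for total effects. These estimators are a common textbook choice swapped in for illustration; the slides do not say which estimators JMP uses, and the model and function names are hypothetical.

```python
import numpy as np

def sobol_indices(f, p, n=100_000, low=-1.0, high=1.0, seed=0):
    """Monte Carlo main and total effect indices for independent uniform inputs."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(low, high, size=(n, p))
    B = rng.uniform(low, high, size=(n, p))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    main, total = np.empty(p), np.empty(p)
    for i in range(p):
        AB = A.copy()
        AB[:, i] = B[:, i]           # A with only column i taken from B
        fAB = f(AB)
        main[i] = np.mean(fB * (fAB - fA)) / var          # Saltelli estimator
        total[i] = 0.5 * np.mean((fA - fAB) ** 2) / var   # Jansen estimator
    return main, total

def model(X):
    """Toy prediction function: X0 acts alone, X1 and X2 act only together."""
    return X[:, 0] + X[:, 1] * X[:, 2]

main, total = sobol_indices(model, p=3)
print(np.round(main, 2), np.round(total, 2))
```

For this model the exact main effects are 0.75, 0 and 0, while the exact total effects are 0.75, 0.25 and 0.25. X1 and X2 look unimportant by main effect alone because they influence predictions only through their interaction, which the total effect picks up.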
VARIABLE IMPORTANCE: DISTRIBUTION OF INPUT VARIABLES
• Variability in predictions is due to the distribution of input variables.
• JMP 11 provides three input variable distribution options:
1. Independent Uniform
2. Independent Resampled
3. Dependent Resampled
• Monte Carlo estimation procedure used for independent cases.
• K-nearest neighbors estimation used for dependent case.
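One plausible reading of the two independent options is sketched below: uniform sampling draws each input evenly over its observed range, while resampling draws each input from its own observed values, column by column. This interpretation, the function names, and the data are assumptions; the dependent case with K-nearest neighbors is not sketched.

```python
import numpy as np

def independent_uniform(X, n, rng):
    """Each input drawn uniformly between its observed minimum and maximum."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=(n, X.shape[1]))

def independent_resampled(X, n, rng):
    """Each input resampled with replacement from its own observed values,
    independently of the other inputs."""
    cols = [rng.choice(X[:, j], size=n, replace=True) for j in range(X.shape[1])]
    return np.column_stack(cols)

# Hypothetical observed data: two skewed, strongly related inputs.
rng = np.random.default_rng(0)
x0 = rng.exponential(size=300)
X = np.column_stack([x0, x0 + 0.1 * rng.exponential(size=300)])

U = independent_uniform(X, 1000, rng)
R = independent_resampled(X, 1000, rng)
print(np.corrcoef(X, rowvar=False)[0, 1], np.corrcoef(R, rowvar=False)[0, 1])
```

Resampling keeps the skewed shape of each input's distribution, which uniform sampling discards, but both schemes break the relationship between the inputs. Preserving that relationship is the purpose of the dependent option.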
VARIABLE IMPORTANCE: USE RESAMPLED INPUTS?
[Figure: two example input distributions, one labeled "Uniform Acceptable" and one labeled "Resampled Needed".]
VARIABLE IMPORTANCE: MARGINAL INFERENCE
[Figure: main effects, with the values 0.16 and 0.03 shown.]
VARIABLE IMPORTANCE: DEMO