Exploring Variable Clustering and Importance in JMP

Copyright © 2012, SAS Institute Inc. All rights reserved. EXPLORING VARIABLE CLUSTERING AND IMPORTANCE IN JMP CHRIS GOTWALT AND RYAN PARKER

description

This presentation was given live at JMP Discovery Summit 2013 in San Antonio, Texas, USA. To sign up to attend this year's conference, visit http://jmp.com/summit

Transcript of Exploring Variable Clustering and Importance in JMP

Page 1: Exploring Variable Clustering and Importance in JMP

EXPLORING VARIABLE CLUSTERING AND IMPORTANCE IN JMP

CHRIS GOTWALT AND RYAN PARKER

Page 2: Exploring Variable Clustering and Importance in JMP

VARIABLE CLUSTERING: INTRODUCTION

• Variable clustering is a dimension-reduction method that reduces the number of input variables used in a predictive model.
• It reduces the inputs by finding groups of similar variables so that a single variable can represent each group.
• It helps reduce the effects of collinearity among the input variables.
• Developed by SAS/STAT Development Director Warren Sarle.

Page 3: Exploring Variable Clustering and Importance in JMP

VARIABLE CLUSTERING: AN ITERATIVE ALGORITHM

• Iteratively splits and assigns variables to clusters.
• Sample iterations for variables in the Wine Quality data set:
    Iteration 1: (Alcohol, Citric Acid, pH, Sugar, Sulfur Dioxide)
    Iteration 2: (Alcohol, Citric Acid, Sulfur Dioxide) | (pH, Sugar)
    Iteration 3: (Alcohol, Sugar) | (pH, Sulfur Dioxide) | (Citric Acid)

Page 4: Exploring Variable Clustering and Importance in JMP

VARIABLE CLUSTERING: ALGORITHM DETAILS

• At each iteration, the cluster with the largest second eigenvalue is split.
• Variables within this cluster are assigned to two new clusters based on each variable’s correlation with the first two orthoblique rotated principal components.
• After the split, variables from other clusters are reassigned to one of the new clusters if they have a higher correlation with the new cluster.
• The algorithm ends when the second eigenvalue of every cluster is less than one.
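The split-selection and stopping rules above can be sketched in a few lines of NumPy. This is a simplified illustration, not JMP's implementation: the function names are hypothetical, clusters are represented as lists of column indices, and the orthoblique rotation and reassignment steps are deliberately left out.

```python
import numpy as np

def second_eigenvalue(data):
    """Second-largest eigenvalue of the correlation matrix of one cluster.

    `data` holds observations in rows and the cluster's variables in columns.
    """
    if data.shape[1] < 2:
        return 0.0  # a single-variable cluster has nothing to split
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # returned in ascending order
    return float(eigvals[-2])

def next_cluster_to_split(data, clusters, threshold=1.0):
    """Index of the cluster with the largest second eigenvalue.

    Returns None when every second eigenvalue is below `threshold`,
    which is the stopping rule described on the slide.
    """
    seconds = [second_eigenvalue(data[:, cols]) for cols in clusters]
    best = int(np.argmax(seconds))
    return best if seconds[best] >= threshold else None

# Synthetic example (assumed data): columns 0-1 follow one latent factor,
# columns 2-3 follow another.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500))
data = np.column_stack([a, a, b, b]) + 0.3 * rng.normal(size=(500, 4))

# One cluster holding all four variables should be selected for splitting.
print(next_cluster_to_split(data, [[0, 1, 2, 3]]))
# Once the two groups are separated, no cluster should qualify.
print(next_cluster_to_split(data, [[0, 1], [2, 3]]))
```

A cluster whose variables share a single underlying dimension has one dominant eigenvalue, so a second eigenvalue of one or more signals that the cluster still mixes at least two dimensions.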

Page 5: Exploring Variable Clustering and Importance in JMP

VARIABLE CLUSTERING: REDUCING EACH CLUSTER TO A SINGLE VARIABLE

[Figure: cluster diagram with labels pH, Sugar, pH, Citric Acid]

• Each cluster can be reduced to a single variable for modeling.
• There are two ways to do this:
    1. Use the most representative variable from each cluster.
    2. Alternatively, use the cluster component from each cluster.

Page 6: Exploring Variable Clustering and Importance in JMP

VARIABLE CLUSTERING: MOST REPRESENTATIVE VARIABLES

• These are the variables that best represent each cluster.
• Each one has the highest correlation with the variables in its cluster.
• Most representative variables provide a clear interpretation when used.

Page 7: Exploring Variable Clustering and Importance in JMP

VARIABLE CLUSTERING: CLUSTER COMPONENTS

• Cluster components are new variables created from the first principal component of each cluster.
• They provide a way to combine the variables in each cluster into a single variable.
• Similar to traditional principal components analysis (PCA), except that each cluster component uses only variables from that cluster.
• Interpretation is not as clear as with the most representative variables.
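Both ways of reducing a cluster to one variable can be illustrated with a short NumPy sketch. It assumes that the cluster component is the first principal component of the standardized variables in the cluster, and that the most representative variable is the one with the largest squared correlation with that component. The function names are hypothetical and the data are synthetic.

```python
import numpy as np

def cluster_component(data):
    """First principal component scores for the variables of one cluster.

    The sign of the component is arbitrary, which does not affect
    squared correlations with it.
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.atleast_2d(np.corrcoef(z, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    weights = eigvecs[:, -1]                 # eigenvector of the largest one
    return z @ weights

def most_representative(data, names):
    """Name of the variable with the largest R^2 against the cluster component."""
    comp = cluster_component(data)
    r2 = [float(np.corrcoef(data[:, j], comp)[0, 1] ** 2)
          for j in range(data.shape[1])]
    return names[int(np.argmax(r2))], dict(zip(names, r2))

# Synthetic cluster (assumed data): three noisy copies of one latent factor.
rng = np.random.default_rng(1)
latent = rng.normal(size=1000)
cluster = np.column_stack([
    latent + 0.1 * rng.normal(size=1000),  # x1: least noise
    latent + 0.5 * rng.normal(size=1000),  # x2
    latent + 1.0 * rng.normal(size=1000),  # x3: most noise
])
best, r2_by_name = most_representative(cluster, ["x1", "x2", "x3"])
print(best, r2_by_name)
```

The trade-off is the one stated on the slides: the component blends all the variables in the cluster, while the representative variable keeps its original meaning and units.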

Page 8: Exploring Variable Clustering and Importance in JMP

VARIABLE CLUSTERING: DEMO: IMPORTANT TERMS

• RSquare with Own Cluster
    • The RSquare a variable has with the variables in its own cluster.
• RSquare with Next Closest
    • The RSquare a variable has with the variables in the next most similar cluster.
• 1-RSquare Ratio
    • Relative similarity between a variable’s own cluster and the next closest cluster.
    • Values should always be less than 1.
    • Values greater than 1 indicate that the variable should be moved to the next closest cluster.
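As a small worked example, the ratio can be computed from the two RSquare values. The formula used here, (1 - RSquare with Own Cluster) / (1 - RSquare with Next Closest), is the conventional definition of this statistic and is stated as an assumption, since the slide does not spell it out. The function name is hypothetical.

```python
def one_minus_rsquare_ratio(r2_own, r2_next):
    """(1 - R^2 with own cluster) / (1 - R^2 with next closest cluster)."""
    if not (0.0 <= r2_own <= 1.0 and 0.0 <= r2_next < 1.0):
        raise ValueError("expected 0 <= r2_own <= 1 and 0 <= r2_next < 1")
    return (1.0 - r2_own) / (1.0 - r2_next)

# A well-placed variable: strongly tied to its own cluster, weakly to the next.
well_placed = one_minus_rsquare_ratio(r2_own=0.8, r2_next=0.2)   # about 0.25

# A misplaced variable: closer to the next cluster than to its own.
misplaced = one_minus_rsquare_ratio(r2_own=0.3, r2_next=0.6)     # about 1.75

print(well_placed, misplaced)
```

Small ratios indicate a variable that sits firmly in its cluster. A ratio above 1 means RSquare with Next Closest exceeds RSquare with Own Cluster.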

Page 9: Exploring Variable Clustering and Importance in JMP

VARIABLE IMPORTANCE: INTRODUCTION

• Provides a general way to assess the importance of variables for predictive models in JMP.
• Insight is in terms of the practical significance of input variables.
• Based on the functional decomposition ideas of I. M. Sobol.

Page 10: Exploring Variable Clustering and Importance in JMP

VARIABLE IMPORTANCE: FUNCTIONAL DECOMPOSITION

• I. M. Sobol showed that a function $f(X_1, \ldots, X_p)$ can be decomposed into a sum of lower-dimensional functions:
    $f(X_1, \ldots, X_p) = f_0 + f_1(X_1) + \cdots + f_p(X_p) + f_{12}(X_1, X_2) + \cdots$
• The decomposition has a function for each $X_i$, each pair $(X_i, X_j)$, etc.
• The variability of these lower-dimensional functions is used to assess the importance of the input variables.
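To make "variability of the lower-dimensional functions" concrete: when the inputs are independent and the component functions satisfy Sobol's orthogonality conditions, the variance of the prediction splits into one term per component. The $V$ symbols below are notation introduced here for illustration and do not appear on the slides.

```latex
\operatorname{Var}\bigl(f(X_1,\ldots,X_p)\bigr)
  = \sum_{i} V_i + \sum_{i<j} V_{ij} + \cdots + V_{12\cdots p},
\qquad
V_i = \operatorname{Var}\bigl(f_i(X_i)\bigr),
\quad
V_{ij} = \operatorname{Var}\bigl(f_{ij}(X_i, X_j)\bigr).
```

Each term is the share of prediction variance attributable to one input or one group of interacting inputs.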

Page 11: Exploring Variable Clustering and Importance in JMP

VARIABLE IMPORTANCE: IMPORTANCE EFFECTS

• Variable importance is assessed in terms of effect indices.
• These indices are numbers between 0 and 1 that indicate relative importance.
• Main effect indices measure the variability of predictions due to a single input.
    • They do not account for interaction effects.
• Total effect indices measure the total variability of predictions due to the input.
    • They combine all main and higher-order interaction effects.
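The difference between main and total effect indices shows up clearly on a toy model with an interaction. The sketch below estimates both by Monte Carlo for independent uniform inputs, using the widely published Saltelli (main effect) and Jansen (total effect) estimators. It is an illustration under those assumptions and is not claimed to be the procedure JMP uses. The toy model and function names are hypothetical.

```python
import numpy as np

def effect_indices(f, p, n=100_000, low=-1.0, high=1.0, seed=0):
    """Monte Carlo estimates of main and total effect indices.

    Inputs are assumed independent and uniform on [low, high].
    """
    rng = np.random.default_rng(seed)
    A = rng.uniform(low, high, size=(n, p))
    B = rng.uniform(low, high, size=(n, p))
    fA, fB = f(A), f(B)
    total_var = np.var(np.concatenate([fA, fB]))

    main = np.empty(p)
    total = np.empty(p)
    for i in range(p):
        AB = A.copy()
        AB[:, i] = B[:, i]      # A with only column i taken from B
        fAB = f(AB)
        # f(B) and f(AB) share input i only, which isolates its main effect.
        main[i] = np.mean(fB * (fAB - fA)) / total_var
        # f(A) and f(AB) differ in input i only, which captures its total effect.
        total[i] = 0.5 * np.mean((fA - fAB) ** 2) / total_var
    return main, total

def toy_model(X):
    # X1 acts additively; X2 and X3 act only through their interaction.
    return X[:, 0] + X[:, 1] * X[:, 2]

main, total = effect_indices(toy_model, p=3)
print(np.round(main, 2), np.round(total, 2))
```

For this toy model the exact values are 0.75, 0, 0 for the main effects and 0.75, 0.25, 0.25 for the total effects. X2 and X3 look unimportant on main effects alone and matter only through their interaction. Because these are estimates, an index that is truly 0 can come out slightly negative.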

Page 12: Exploring Variable Clustering and Importance in JMP

VARIABLE IMPORTANCE: DISTRIBUTION OF INPUT VARIABLES

• Variability in predictions is due to the distribution of the input variables.
• JMP 11 provides three input variable distribution options:
    1. Independent Uniform
    2. Independent Resampled
    3. Dependent Resampled
• A Monte Carlo estimation procedure is used for the independent cases.
• K-nearest neighbors estimation is used for the dependent case.
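One way to picture the three options is through how each would generate input values from an observed data table. The sketch below is an interpretation, not JMP's implementation: uniform draws over each variable's observed range, independent resampling of each column, and resampling whole rows as a simple stand-in for the dependent case. It does not implement the K-nearest neighbors estimation, and all function names are hypothetical.

```python
import numpy as np

def independent_uniform(X, n, rng):
    """Each input drawn uniformly between its observed minimum and maximum."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=(n, X.shape[1]))

def independent_resampled(X, n, rng):
    """Each column resampled separately, so marginals are kept but dependence is lost."""
    cols = [rng.choice(X[:, j], size=n, replace=True) for j in range(X.shape[1])]
    return np.column_stack(cols)

def dependent_resampled(X, n, rng):
    """Whole rows resampled, so observed combinations of inputs are preserved."""
    rows = rng.integers(0, X.shape[0], size=n)
    return X[rows]

def corr01(S):
    """Correlation between the first two columns."""
    return float(np.corrcoef(S, rowvar=False)[0, 1])

# Synthetic inputs (assumed data): two strongly correlated variables.
rng = np.random.default_rng(2)
x = rng.normal(size=2000)
X = np.column_stack([x, x + 0.2 * rng.normal(size=2000)])

uni = independent_uniform(X, 5000, rng)
ind = independent_resampled(X, 5000, rng)
dep = dependent_resampled(X, 5000, rng)
print(corr01(X), corr01(uni), corr01(ind), corr01(dep))
```

Only the row-based sample keeps the correlation seen in the data. The two independent schemes generate input combinations that never occur in the observed table.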

Page 13: Exploring Variable Clustering and Importance in JMP

VARIABLE IMPORTANCE: USE RESAMPLED INPUTS?

[Figure: two example panels labeled "Uniform Acceptable" and "Resampled Needed"]

Page 14: Exploring Variable Clustering and Importance in JMP

VARIABLE IMPORTANCE: MARGINAL INFERENCE

[Figure: Main Effects values 0.16 and 0.03]

Page 15: Exploring Variable Clustering and Importance in JMP

VARIABLE IMPORTANCE: DEMO