
SVD Filtered Temporal Usage Pattern Analysis & Clustering

Liang Xie

SCSUG Educational Forum 2009, San Antonio, TX

Business Objective

• Provide a robust algorithm to cluster customers based on their temporal transactional data
• Issues:
  • Data
    • High dimensionality: 360 features, multi-million records
    • Capture amplitude at different resolutions
    • High volatility due to noise
    • Possible outliers
  • Algorithm
    • Robustness
    • Efficiency
    • Easy to implement in SAS!
• We choose an SVD-based algorithm
  • Successful application to gene-expression analysis by Alter et al. (PNAS, 2000)

SVD as a Filter

• SVD definition:
  • Singular Value Decomposition is a mathematical tool to decompose a rectangular matrix: X = U \Sigma V'
  • The left Eigenvector matrix U can be regarded as an input rotation matrix, \Sigma is the scaling matrix, and the right Eigenvector matrix V is the output matrix
  • SVD is similar to Fourier analysis
• Filter:
  • Each row of X is a linear combination of the right Eigenvectors
  • Each column of X is a linear combination of the left Eigenvectors
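The two "linear combination" statements follow directly from the decomposition X = U \Sigma V'. The expansion below is a sketch in notation assumed here, not taken from the slides: u_k and v_k are the k-th columns of U and V, \sigma_k is the k-th diagonal entry of \Sigma, and u_{ik}, v_{jk} are individual entries.

```latex
X = U \Sigma V' = \sum_{k} \sigma_k \, u_k v_k'

\text{row } i \text{ of } X: \quad x_{i\cdot} = \sum_{k} (\sigma_k u_{ik}) \, v_k'

\text{column } j \text{ of } X: \quad x_{\cdot j} = \sum_{k} (\sigma_k v_{jk}) \, u_k
```

Read this way, filtering amounts to dropping selected terms k from the sum before rebuilding X.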

Relationship Between PCA and SVD

• SAS/STAT doesn't explicitly support SVD
• We can tweak SAS/STAT to do SVD by linking one computation method of SVD to PCA
  • SVD and PCA are essentially the same: SVD of the covariance matrix of the original data X is equivalent to PCA of X
  • PCA on the non-centered covariance matrix of X is equivalent to SVD of X, with proper scaling: SVD(X'X) = V S V'
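The relation SVD(X'X) = V S V' can be checked in two lines. This is a short derivation that assumes the usual convention that the columns of U are orthonormal (U'U = I) and that \Sigma is diagonal:

```latex
X'X = (U \Sigma V')' (U \Sigma V') = V \Sigma \, U'U \, \Sigma V' = V \Sigma^{2} V'

\Rightarrow \quad S = \Sigma^{2}, \qquad \sigma_k = \sqrt{s_k}
```

So the Eigenvectors of X'X are the right Eigenvectors V of X, and the singular values of X are the square roots of the Eigenvalues of X'X.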

SVD in SAS/STAT

• We call PROC PRINCOMP to conduct SVD in SAS/STAT
• The uncorrected covariance matrix in PROC PRINCOMP is X'X/n, not X'X; therefore the singular value matrix S (the Eigenvalues of X'X) should be scaled by n, which scales the singular values of X by \sqrt{n}
• PROC PRINCOMP NOINT COV SING=
  • 'COV' computes the principal components from the covariance matrix
  • 'NOINT' omits the intercept from the model
  • 'SING=' specifies the singularity criterion to ensure accuracy
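The scaling step can be illustrated numerically. The sketch below is in Python rather than SAS and handles only a two-column matrix, so that the Eigenvalues of X'X/n have a closed form. The function names are hypothetical, and it only mimics the arithmetic, not PROC PRINCOMP itself.

```python
import math

def sym2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], largest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean + radius, mean - radius

def singular_values_via_uncorrected_cov(X):
    """Singular values of a two-column matrix X, recovered from X'X/n."""
    n = len(X)
    # Entries of the uncorrected (non-centered) covariance matrix X'X/n.
    a = sum(r[0] * r[0] for r in X) / n
    b = sum(r[0] * r[1] for r in X) / n
    d = sum(r[1] * r[1] for r in X) / n
    # Eigenvalues of X'X/n are sigma^2 / n, so multiply by n before the root.
    # max(..., 0.0) guards against tiny negative values from rounding.
    return [math.sqrt(max(n * lam, 0.0)) for lam in sym2x2_eigenvalues(a, b, d)]

X = [[3.0, 0.0], [4.0, 5.0]]
print(singular_values_via_uncorrected_cov(X))
```

For this X, X'X = [[25, 20], [20, 25]] has Eigenvalues 45 and 5, so the singular values are the square roots of those two numbers.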

Performance

• Accuracy
  • Test the code on a Hilbert matrix
  • With 'SING=1e-16' specified, our results are comparable to those obtained from R and MATLAB
• Efficiency
  • Test the code on an arbitrary rectangular matrix with 1.7 million rows and 400 columns
  • On a Core2Duo 1.86 GHz PC, SAS takes 7 min 56 sec to finish all data processing and computations; user CPU time is 5 min 52 sec
  • Note that the 32-bit Windows version of R is not able to handle data this big:
      > X<-matrix(runif(1.7E6*400), ncol=400)
      Error in runif(1700000 * 400) : cannot allocate vector of length 680000000
• A multi-threaded/parallel SVD algorithm from SAS is highly desired!
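A back-of-the-envelope calculation shows why the R allocation fails. The sketch below assumes the matrix is held as one dense in-memory array of 8-byte doubles (which is what `runif` produces) and compares its size with the 4 GiB address space of a 32-bit process. The function name is hypothetical.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to hold a dense matrix of 8-byte doubles."""
    return n_rows * n_cols * bytes_per_value

size = dense_matrix_bytes(1_700_000, 400)
address_space_32bit = 2 ** 32  # upper bound on what a 32-bit process can address

print(f"{size / 2**30:.2f} GiB needed")
print("fits in a 32-bit address space:", size <= address_space_32bit)
```

The matrix alone needs more than 5 GiB, so it cannot fit regardless of how much physical memory the machine has.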

Temporal Usage Pattern Analysis

• Time series usage data from customers for one year at 60-minute intervals
• Hourly usage data is normalized to:
  • Yearly total
  • Monthly total
• We want to identify segments with distinct usage patterns over one year, so that the marketing department can design customized messages for them
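One plausible reading of "normalized to" a total is that each reading is divided by the total of its period, so that customers are compared on the shape of their usage rather than its volume. The sketch below illustrates that reading only. The function names and the month labels are hypothetical and are not taken from the slides.

```python
def normalize_to_total(usage):
    """Divide every reading by the overall total (e.g. the yearly total)."""
    total = sum(usage)
    if total == 0:
        return [0.0] * len(usage)  # assumption: an all-zero profile stays zero
    return [u / total for u in usage]

def normalize_by_group(usage, groups):
    """Divide every reading by the total of its own group (e.g. its month)."""
    totals = {}
    for u, g in zip(usage, groups):
        totals[g] = totals.get(g, 0.0) + u
    return [u / totals[g] if totals[g] else 0.0 for u, g in zip(usage, groups)]

readings = [1.0, 3.0, 2.0, 2.0]
months = ["Jan", "Jan", "Feb", "Feb"]
print(normalize_to_total(readings))
print(normalize_by_group(readings, months))
```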

Traditional Approach

• Direct k-means clustering using PROC FASTCLUS on all features
• Problems:
  • Not robust: sensitive to outliers
  • Ambiguity in choosing the optimal number of clusters a priori
  • High dimensionality will affect the distance measure between each pair:
    • In high-dimensional spaces, distances between points become relatively uniform
  • Combining the lack of robustness with high dimensionality, we could get segments that are occupied by only a few observations, which is usually not desired
  • The k-means clustering algorithm doesn't take the time series nature of the data into consideration; all features are considered independent
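The claim that distances become relatively uniform in high dimensions can be seen in a small simulation. The sketch below uses uniformly random points (an assumption made for illustration, not the usage data) and reports the relative contrast (max - min) / min of the Euclidean distances from one query point to the others. The function name is hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of distances from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 20, 360):
    print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows toward the 360 features used here, the nearest and farthest points end up at nearly the same distance, which weakens any distance-based assignment to clusters.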

Our Approach

• Apply SVD to the original data to obtain the Eigenvectors and singular values
• Remove the components associated with the first singular value (Low Pass Filtering)
• Apply SVD again to the SVD-filtered matrix
• Calculate the Pearson correlation of each observation with the right Eigenvectors obtained in the previous step
• Apply the k-means clustering algorithm to this matrix of correlation elements
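The filtering and correlation steps can be sketched in a few lines of pure Python. This is an illustration under several assumptions that are not in the slides: the leading singular triplet is found by power iteration rather than PROC PRINCOMP, only the leading right Eigenvector of the filtered matrix is used, a constant vector is given correlation 0, and the final k-means step is left out. All function names are hypothetical.

```python
import math

def matvec(A, x):
    """A times x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, y):
    """A' times y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def leading_triplet(X, iters=500):
    """Power iteration on X'X; returns (sigma_1, u_1, v_1).

    Assumes X is nonzero and the constant start vector is not
    orthogonal to the leading right Eigenvector.
    """
    v = [1.0 / math.sqrt(len(X[0]))] * len(X[0])
    for _ in range(iters):
        w = rmatvec(X, matvec(X, v))
        nw = norm(w)
        v = [c / nw for c in w]
    Xv = matvec(X, v)
    sigma = norm(Xv)
    return sigma, [c / sigma for c in Xv], v

def remove_first_component(X):
    """Subtract sigma_1 * u_1 * v_1' from X."""
    sigma, u, v = leading_triplet(X)
    return [[X[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

# Toy data: 4 observations (rows) by 4 time points (columns).
X = [[5.0, 6.0, 7.0, 6.0],
     [5.0, 7.0, 9.0, 7.0],
     [6.0, 5.0, 4.0, 5.0],
     [4.0, 6.0, 8.0, 6.0]]
filtered = remove_first_component(X)      # filter out the first component
_, _, v = leading_triplet(filtered)       # SVD again on the filtered matrix
features = [pearson(row, v) for row in filtered]  # input to k-means
print(features)
```

In a fuller version, one correlation per retained right Eigenvector would be computed for each observation, and the resulting matrix would be passed to a k-means routine such as PROC FASTCLUS.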

Some Notes

• For a data matrix containing 360 days' profiles, we only need to use a few of the correlation elements. We keep correlation elements until 85% of the variation in the data is accounted for
• To determine the optimal number of clusters, we applied the Bayesian Information Criterion (BIC). This measure is very robust and simple to calculate:
  • BIC = Distortion + (Num of Var) * log(Num of Obs) * K
  • Distortion = sum of total variance of each cluster = sum of Distance from PROC FASTCLUS output
• With hourly data, we separate the analysis into two steps:
  • Daily level
  • Hourly level for a 'typical day' in a month
  • Apply the SVD Filtered Clustering algorithm in each step
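Both rules of thumb above are easy to compute. The sketch below makes two assumptions that the slides do not state: the share of variation carried by a component is taken as its squared singular value over the sum of squares, and log is the natural logarithm. The function names and the distortion values in the usage example are hypothetical.

```python
import math

def components_needed(singular_values, threshold=0.85):
    """Smallest number of leading components whose cumulative share of
    squared singular values reaches the threshold."""
    squares = [s * s for s in singular_values]
    total = sum(squares)
    running = 0.0
    for k, sq in enumerate(squares, start=1):
        running += sq
        if running / total >= threshold:
            return k
    return len(squares)

def bic(distortion, n_vars, n_obs, k):
    """BIC = Distortion + (Num of Var) * log(Num of Obs) * K."""
    return distortion + n_vars * math.log(n_obs) * k

# Usage: pick the K with the smallest BIC among candidate cluster counts.
# The distortion values below are made up for illustration only.
distortions = {2: 900.0, 3: 600.0, 4: 580.0}
scores = {k: bic(d, n_vars=5, n_obs=1000, k=k) for k, d in distortions.items()}
best_k = min(scores, key=scores.get)
print(components_needed([3.0, 2.0, 1.0]), best_k)
```

The penalty term grows linearly in K, so a larger K is chosen only if it reduces the distortion by more than (Num of Var) * log(Num of Obs).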

Simulated Data

• We simulate data using the Heterogeneous Mixed Model of Verbeke
  • High usage during Months B-D and Month H
  • Some outliers were deliberately generated by adding abnormal ad hoc error terms

Clustering Result on Filtered Data

THANK YOU

• You can reach me at:
  • [email protected]
  • www.linkedin.com/liangxie
• My blog:
  • http://sas-programming.blogspot.com