Wal-mart Sales Forecasting

WAL-MART SALES FORECASTING 94-832: Business Intelligence & Data Mining SAS TEAM 7 MITHUN MATHEW MEAGAN MUSGRAVE AKASH PATEL RENU THOMAS IVY YANG


Forecasting 2012 holiday sales of Wal-mart with SAS Enterprise Miner using data obtained from kaggle.com



Table of Contents

1 Introduction
2 Business Questions
2.1 Question One
2.2 Question Two
3 Description and Preparation of Data
3.1 Data Source
3.2 Data Sets Utilized
3.3 Data Preparation: Merging, Cleaning, and Transforming the Data
4 Exploratory Analysis
4.1 Top 10 Stores by Sales
4.2 Top 5 Departments across the stores
4.3 Sales vs CPI & Sales vs Fuel price
5 Unsupervised Learning: Clustering
5.1 Initial Results
5.2 Insight from Cluster A
5.3 Insight from Cluster B
5.4 Insight from Cluster C
5.5 Overall Insight
6 Supervised Learning: Regression
6.1 Linear Regression with Full Data
6.2 Linear Regression with Imputed and Transformed Full Data
6.3 Linear Regression with Filtered Data
6.4 Linear Regression with Normalized Data
7 Supervised Learning: Decision Tree
7.1 Two-way Split
7.2 Three-way Split
7.3 Two-way Split without DEPT and STORE
7.4 Decision Tree on Sampled Data
8 Time Series Analysis
8.1 Data Exploration
8.2 Hierarchical Clustering [6]
8.3 Sales Forecasting [7]
9 Business Implications
10 References
Appendix A
Appendix B


1 Introduction

This project was completed to fulfil the project requirement of the course 94-832: Business Intelligence & Data Mining SAS. The core of our analysis is the Walmart data set obtained from Kaggle.

The data contain weekly sales for various departments within different stores over a period of time. Most of the work in the project revolves around staging and cleaning the data, and then modelling it with different parameters and methodologies.

Using clustering, regression and decision trees, different models were generated and their errors were noted. Variables of importance were identified and insights were drawn from the clusters.

Time series analysis was used for hierarchical clustering of sales trends, which showed how the clusters differ from one another. Time series forecasting was then used to predict sales for the end-of-year holiday season of 2012.


2 Business Questions

2.1 Question One

Retailers face many challenges when trying to forecast sales: the scale of the problem, erratic sales at each individual store, seasonal changes, the constant introduction of new items, and repeated promotional activity [1]. In an attempt to address these issues, retailers have turned to large-scale demand forecasting that can accommodate large amounts of transaction data. By collecting these data, retailers can mine them and project future customer behavior. The ability to forecast on such a large scale gives retailers the opportunity to optimize their revenue systems, enabling better choices on promotions and pricing.

For our project we take on this challenge and attempt to forecast sales at Walmart accurately. Given Walmart’s reputation for competitive pricing, the ability to project sales accurately is key to its ability to function. Research out of the University of Michigan recently affirmed that clustering prior to forecasting sales greatly increases the accuracy of forecasts [2]. By clustering stores based on sales and on attributes such as average temperature and fuel prices, stores can eliminate the need to control for seasonal indices and classes (summer shoes versus winter shirts, etc.). After applying hierarchical clustering to the data, we hope to determine which stores are similar in terms of both sales and store attributes, so that we can ascertain which characteristics are key drivers of sales, thus allowing us to generate more accurate forecasts.

2.2 Question Two

Recent news reports have underscored the importance of getting an accurate forecast. In January of 2014, Walmart had several chains cut their forecasts due to the holiday season and “profit-eating” discounts [3]. Toward the end of 2014, Walmart again acknowledged that it needs to do a “better” job at forecasting in order to ensure that it keeps appropriate levels of inventory [4]. Given these recent developments, it is clear that forecasting plays an integral role in a retailer’s success.

We will address Walmart’s challenge by leveraging sales data from 45 Walmart stores in different regions of the United States. With these data we will be able to make predictions on department-wide sales at each of the 45 stores. In addition to attempting to predict department-wide sales accurately, we will also attempt to understand the impact of markdowns (price reductions) on holiday weeks. However, it is important to note that while we have department-wide sales data for each of the 45 stores, we will be modeling the effect of markdowns without possessing complete historical data. Overall, we hope to understand which attributes significantly impact sales at the store level via regression, time series analysis, and decision tree models. These results can then foster an accurate prediction of 2012 sales, thus allowing us to determine the best time to hire new employees.


3 Description and Preparation of Data

3.1 Data Source

The Walmart Store Sales data were published as a Walmart recruiting competition on Kaggle [5]. They cover historical sales data for 45 Walmart stores in different regions of the United States from 2010-02-05 to 2012-11-01. The data set contains three files: “stores.csv”, “features.csv” and “train.csv”.

3.2 Data Sets Utilized

stores.csv: This file describes three important features of the 45 stores. Each store (1-45) is defined with a store type (A-C) and a store size (numeric).

features.csv: This file describes additional information about each store for the given weeks. Each record contains 5 types of promotional markdowns for the given week. It also includes the average temperature, fuel price, CPI and unemployment rate for the corresponding geographic region in that week. In addition, each record indicates whether the week is a special holiday week.

train.csv: This is the main historical sales data for training. Each record represents the weekly sales for a certain department in a given store in a given week. It also maintains the “isHoliday” field specifying whether the week is a holiday week.

Based on preliminary analysis, we decided to use all the tables provided. Although we use the official train data as our dataset, our business goals in this project are not restricted to sales prediction. The next step focuses on data cleaning, merging and pre-processing.

3.3 Data Preparation: Merging, Cleaning, and Transforming the Data

To put together the three .csv files (train.csv, features.csv and stores.csv), the PK – FK relations were identified. Before denormalizing the data, all the ‘NA’ values in the table features.csv were changed to NULL. The ‘TRUE’ and ‘FALSE’ values for the ISHOLIDAY attribute were changed to the binary values 1 and 0 respectively.

The following statements generated the denormalized Walmart_Train dataset, which was used for the remainder of the project.

Combining the Stores and Features tables as Store_Features:

    -- Attach store type and size to every weekly feature record
    CREATE TABLE Store_Features AS
    SELECT *
    FROM   Stores JOIN Features USING(Store);

Combining the Store_Features and Train tables as Walmart_Train:

    -- Attach store and feature attributes to every weekly sales record
    CREATE TABLE Walmart_Train AS
    SELECT *
    FROM   Train JOIN Store_Features USING(Store, Week, IsHoliday);
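The cleaning and merge steps described above can be tried end to end on a few rows. The following is a minimal sketch, not the team's actual workflow: it assumes Python's built-in sqlite3 module as a stand-in database, and the column lists, the `clean` helper and all row values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, reduced column lists; the real files have more columns.
cur.execute("CREATE TABLE Stores (Store INTEGER, Type TEXT, Size INTEGER)")
cur.execute("CREATE TABLE Features (Store INTEGER, Week TEXT, Temperature REAL, "
            "MarkDown1 REAL, IsHoliday INTEGER)")
cur.execute("CREATE TABLE Train (Store INTEGER, Dept INTEGER, Week TEXT, "
            "Weekly_Sales REAL, IsHoliday INTEGER)")


def clean(value):
    """Map 'NA' to NULL and 'TRUE'/'FALSE' to 1/0, as described in the text."""
    if value == "NA":
        return None
    if value in ("TRUE", "FALSE"):
        return 1 if value == "TRUE" else 0
    return value


# Invented rows, in the raw form they would have in the .csv files.
stores = [(1, "A", 150000)]
features = [(1, "2010-02-05", 40.0, "NA", "FALSE"),
            (1, "2010-02-12", 38.0, "NA", "TRUE")]
train = [(1, 1, "2010-02-05", 20000.0, "FALSE"),
         (1, 1, "2010-02-12", 45000.0, "TRUE")]

cur.executemany("INSERT INTO Stores VALUES (?, ?, ?)", stores)
cur.executemany("INSERT INTO Features VALUES (?, ?, ?, ?, ?)",
                [tuple(clean(v) for v in row) for row in features])
cur.executemany("INSERT INTO Train VALUES (?, ?, ?, ?, ?)",
                [tuple(clean(v) for v in row) for row in train])

# The two statements from the report, unchanged apart from whitespace.
cur.execute("CREATE TABLE Store_Features AS "
            "SELECT * FROM Stores JOIN Features USING (Store)")
cur.execute("CREATE TABLE Walmart_Train AS "
            "SELECT * FROM Train JOIN Store_Features USING (Store, Week, IsHoliday)")

rows = cur.execute(
    "SELECT Store, Dept, Week, Weekly_Sales, IsHoliday, Type, Size, MarkDown1 "
    "FROM Walmart_Train ORDER BY Week").fetchall()
for row in rows:
    print(row)
```

Because the joins use USING, each key column (Store, Week, IsHoliday) appears only once in the denormalized table, which is why `SELECT *` is safe here.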

For analytical purposes and visualization, the variables TEMPERATURE, FUEL_PRICE and WEEKLY_SALES were categorized into the following classes (refer to Appendix A for the SQL queries):

Condition                                       TEMP_CLASS
TEMPERATURE < 32                                ‘Freezing’
TEMPERATURE >= 32 AND TEMPERATURE < 64          ‘Cold’
TEMPERATURE >= 64 AND TEMPERATURE < 79          ‘Comfortable’
TEMPERATURE >= 79 AND TEMPERATURE < 95          ‘Hot’
TEMPERATURE > 95                                ‘Extremely Hot’

3-1: TEMP_CLASS

Condition                                       FUEL_CLASS
FUEL_PRICE < 2.75                               ‘Low’
FUEL_PRICE >= 2.75 AND FUEL_PRICE < 3.12        ‘Medium’
FUEL_PRICE > 3.12                               ‘High’

3-2: FUEL_CLASS

Condition                                       SALES_CLASS
WEEKLY_SALES <= 0                               ‘Negative’
WEEKLY_SALES > 0 AND WEEKLY_SALES <= 10000      ‘Low’
WEEKLY_SALES > 10000 AND WEEKLY_SALES <= 25000  ‘Medium’
WEEKLY_SALES > 25000 AND WEEKLY_SALES <= 100000 ‘High’
WEEKLY_SALES > 100000                           ‘Very High’

3-3: SALES_CLASS

To visualize the data from a better perspective, further categorical attributes were added, including HOLIDAY (‘Super Bowl’, ‘Labor Day’, ‘Thanksgiving’, ‘Christmas’). The two weeks before each holiday were labeled (‘Before Super Bowl’, ‘Before Labor Day’, ‘Before Thanksgiving’, ‘Before Christmas’).

Furthermore, unemployment and CPI were categorized into ‘Low’, ‘Medium’ and ‘High’. Store size was categorized into ‘Small’, ‘Medium’ and ‘Large’ (refer to Appendix B for the SQL queries).
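The TEMP_CLASS, FUEL_CLASS and SALES_CLASS conditions listed in this section translate directly into small lookup functions. The sketch below is an illustration in Python rather than the SQL of Appendix A, and the function names are hypothetical. One assumption is made explicit: the listed conditions do not cover a temperature of exactly 95 or a fuel price of exactly 3.12, so the sketch assigns those boundary values to the upper class.

```python
def temp_class(temperature):
    """Temperature class; thresholds are taken from the TEMP_CLASS conditions."""
    if temperature < 32:
        return "Freezing"
    if temperature < 64:
        return "Cold"
    if temperature < 79:
        return "Comfortable"
    if temperature < 95:
        return "Hot"
    # Assumption: exactly 95 is treated as 'Extremely Hot'.
    return "Extremely Hot"


def fuel_class(fuel_price):
    """Fuel price class; thresholds are taken from the FUEL_CLASS conditions."""
    if fuel_price < 2.75:
        return "Low"
    if fuel_price < 3.12:
        return "Medium"
    # Assumption: exactly 3.12 is treated as 'High'.
    return "High"


def sales_class(weekly_sales):
    """Weekly sales class; upper bounds are inclusive, as in SALES_CLASS."""
    if weekly_sales <= 0:
        return "Negative"
    if weekly_sales <= 10000:
        return "Low"
    if weekly_sales <= 25000:
        return "Medium"
    if weekly_sales <= 100000:
        return "High"
    return "Very High"


print(temp_class(70.5), fuel_class(2.9), sales_class(18000))
```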


4 Exploratory Analysis

4.1 Top 10 Stores by Sales

4-4: Top 10 Stores

The above chart shows the top 10 stores in terms of sales revenue and each store’s percentage contribution to the total sales generated among them. Store 20 was the highest contributor with a total of 301 million. The group is a mix of 7 large and 3 medium-sized stores. Together, these 10 stores accounted for 39% of the revenue generated by the 45 stores.


4.2 Top 5 Departments across the stores

4-5: Top 5 Departments

The above figure shows the top 5 departments across the 3 store types, namely A, B and C. Interestingly, Department 72 showed a significant hike in sales across store types A and B. Store type A generated the most sales, whereas store type C generated the least.

4-6: Top 10 Stores

4-7: Top 5 Departments


The above figure shows the pre-holiday sales registered by the 3 store types. Sales were highest before Christmas, followed by pre-Thanksgiving, pre-Labor Day and pre-Super Bowl sales. Store type A registered the highest sales, followed by store type B and store type C.

4.3 Sales vs CPI & Sales vs Fuel price

4-8: Sales vs CPI & Sales vs Fuel Price

No strong relationships were clear from visualizing the weekly sales data with respect to the CPI and the fuel price during that week.


5 Unsupervised Learning: Clustering

5-9: Clustering Nodes

5.1 Initial Results

The clustering model utilizes all of the attributes within the data except weekly sales and the markdown variables, with the store ID assigned the ‘segment’ variable role. We also indicate that the model should standardize the data. Utilizing the centroid clustering method yields three unique clusters.

5-10: Clustering

Each of these clusters represents a group of stores that share similar values of each distinct attribute that has been clustered around the store ID. Based on the initial results table, we can see that each cluster has different averages across each attribute.


Comparing these averages via the input means plot allows us to draw conclusions about each individual segment (see Sections 5.2-5.4).

5-11: Clusters
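To make the standardize-then-cluster idea of Section 5.1 concrete, the sketch below uses plain k-means (Lloyd's algorithm) in Python. This is a swapped-in technique for illustration only: the exact algorithm and settings behind the SAS Enterprise Miner clustering node are not reproduced here, and the attribute values, the function names and the choice of seeding with the first k points are all assumptions.

```python
import math


def standardize(rows):
    """Z-score each column so that no attribute dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


# Invented per-store attributes: (store size, unemployment rate, CPI).
stores = [
    [200000, 6.0, 215.0],
    [40000, 9.5, 130.0],
    [120000, 8.0, 180.0],
    [205000, 6.2, 214.0],
    [42000, 9.8, 128.0],
    [118000, 7.9, 182.0],
]
labels = kmeans(standardize(stores), k=3)
print(labels)
```

Standardizing first matters because store size is several orders of magnitude larger than the unemployment rate; without it, distances would be driven almost entirely by size.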

5.2 Insight from Cluster A

Cluster A contains the largest number of stores within this data set. Based on the input means plot (above), this cluster of stores has experienced lower than average fuel prices and unemployment rates. This is complemented by a higher than average consumer price index. We can also observe that stores in this cluster are typically larger than the other stores. Overall, we might infer that Cluster A consists of stores in richer, suburban regions, which would explain the high CPI and the low unemployment rate and gas prices. However, because we do not have geographic information within this dataset, we are unable to draw further conclusions. In terms of which variables are important within this cluster, the chart below provides a visual of the importance of each attribute:


5-12: Cluster A - Variable Importance

Per the Variable Importance graph, CPI, unemployment rate, and store size are the top three important variables when considering this cluster of stores.

5.3 Insight from Cluster B

Cluster B, per the input means plot, has a higher than average unemployment rate and temperature, but a lower than average fuel price, store size, and consumer price index. Again, because we do not have geographic data pertaining to each of the stores, we are unable to make any further assumptions about the location of the stores within Cluster B. The variable importance graph (below) shows similar results to the graph for Cluster A.

5-13: Cluster B - Variable Importance

Again, the consumer price index, unemployment rate, and store size are all important variables within this cluster of stores. It appears that the same variables are important across Clusters A and B, but the averages of each of the attributes differ slightly relative to the overall attribute averages.


5.4 Insight from Cluster C

Cluster C is completely different from Clusters A and B in that it is the only segment that addresses the importance of holidays. Overall, Cluster C has lower than average fuel prices and temperature, but all other attributes are on par with the overall attribute averages. The variable importance graph below confirms that this cluster’s important variables are in stark contrast to those of Clusters A and B.

5-14: Cluster C - Variable Importance

The stores in this cluster are grouped together because holidays have a large impact, with the variable ‘Holiday?’ dwarfing all other attribute values.

5.5 Overall Insight

Looking at all of the clusters relative to the overall population averages reveals that clustering prior to forecasting can help eliminate errors that are often caused by seasonal changes or population disparity. The impact of store size remains constant across the clusters, but for the other attributes the correlation with weekly sales differs across the three clusters.

5-15: Correlation with Weekly Sales


To sum up, an initial clustering analysis reveals that the relationship between store attributes and weekly sales depends on which cluster a store belongs to. Holidays only appear to have an impact within Cluster C, while the other attributes of interest are more relevant to Clusters A and B. We now move on to supervised learning in an effort to test whether the relationships seen above are statistically significant.

6 Supervised Learning: Regression

6.1 Linear Regression with Full Data

6-16: Regression

In this model, we retain all the variables (CPI, DEPT, FUEL_PRICE, ISHOLIDAY, MARKDOWN1-5, STORE_SIZE, STORE_TYPE, TEMPERATURE, UNEMPLOYMENT) as inputs, with WEEK as the time ID, STORE as the ID and WEEKLY_SALES as the target. We first use the Data Partition node to split the data into a 70% training set and a 30% validation set. We then run the model with stepwise, forward and backward selection in turn, using validation error as the selection criterion.

Effect        DF    Sum of Squares    F Value    Pr > F
DEPT          80    2.46E+13          1456.02    <.0001
STORE_SIZE     1    2.15E+12          10162.5    <.0001

Table 6-1: Regression Error

The results of the stepwise and forward models are very similar, while the backward model gives a worse result. We therefore take the stepwise result here, which usually gives the best solution. This model has an average squared training error of 2.0121E8 and a validation error of 2.0451E8. Although the plot looks good, especially at the beginning, the overall error statistics are poor. As the Type 3 Analysis of Effects above shows, this is because only two variables remain in the final model: DEPT and STORE_SIZE. The model contains every level of DEPT, which means the nominal department values strongly affect the regression result. The average price of products may vary a lot between departments; however, it does not make sense to predict sales by department alone. Also, STORE_SIZE takes much larger values than the other variables, which can mask the other variables’ effects and affect the accuracy of the model.


6-17: Linear Regression with Full Data
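The 70/30 partition and the average squared error reported in Section 6.1 can be illustrated with a small Python sketch. It assumes a single predictor and ordinary least squares, and it does not reproduce the stepwise, forward or backward selection of SAS Enterprise Miner; the function names, the random seed and the data points are hypothetical.

```python
import random


def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle once, then split into training and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Ordinary least squares for one predictor: y = a + b * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def average_squared_error(pairs, a, b):
    """Mean of the squared residuals, the error measure quoted in the text."""
    return sum((y - (a + b * x)) ** 2 for x, y in pairs) / len(pairs)


# Invented (store_size, weekly_sales) pairs that lie exactly on a line.
data = [(size, 1000.0 + 0.1 * size) for size in range(50000, 250000, 10000)]
train, valid = partition(data)
a, b = fit_line(train)
print(len(train), len(valid))
print(average_squared_error(train, a, b), average_squared_error(valid, a, b))
```

Comparing the two printed errors is the same check the report performs: a validation error close to the training error suggests the model is not overfitting, even when both are large.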

6.2 Linear Regression with Imputed and Transformed Full Data

6-18: Linear Regression with Impute and Transform

To improve the results, we imputed the missing values of MARKDOWN1 – 5 and took the log of each interval variable to reduce its skew. This gave a better model, with an average squared training error of 1.9549E8 and an average squared validation error of 1.9831E8. This model also makes more sense than the previous one, since more attributes are involved.

6-19: Linear Regression with Imputed and Transformed Full Data

From the screenshot of the model below, we can see that DEPT still has a huge influence.


6-20: Linear Regression with Imputed and Transformed Full Data

6-21: Linear Regression with Imputed and Transformed Full Data
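The imputation and log transformation of Section 6.2 can be sketched as follows. The report does not state which imputation method or log offset was used, so two assumptions are made here and should be read as such: missing values are replaced by the column mean, and the transform is log(x + 1) so that zero is allowed. The function names and values are hypothetical.

```python
import math


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def log_transform(values):
    """Assumed transform log(x + 1); negative inputs would need an extra offset."""
    return [math.log1p(v) for v in values]


# Invented MARKDOWN1 column with missing weeks.
markdown1 = [None, 100.0, None, 300.0]
imputed = impute_mean(markdown1)
logged = log_transform(imputed)
print(imputed)
print(logged)
```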

6.3 Linear Regression with Filtered Data

6-22: Linear Regression with Filtered Data

To reduce the negative effect of DEPT, we filter out the department variable. We keep similar settings for all other variables and fit the model again. However, the result is even worse: the average squared training error is 4.842E8 and the validation error is 4.881E8. This means that the department is really important in this dataset, and if we want to make our model more accurate, we need to keep the department in our regression model.


6-23: Linear Regression with Filtered Data

6.4 Linear Regression with Normalized Data

To normalize the interval data, we simply modify the data in Microsoft Excel. For each variable, we create a normalized variable by dividing the original value by the largest value of that feature. This gives normalized versions of STORE_SIZE, TEMPERATURE, FUEL_PRICE, CPI, UNEMPLOYMENT and MARKDOWN 1-5. As discussed above, we add DEPT back to our model and fit a new model with these interval variables. However, the average squared training error and validation error are still poor, at 4.83E8 and 4.87E8 respectively.

6-24: Linear Regression with Normalized Data
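The normalization described in Section 6.4, each value divided by the largest value of its feature, is a one-line operation. The sketch below shows it in Python rather than Excel; the function name and the store sizes are invented for illustration.

```python
def max_normalize(values):
    """Divide every value by the largest value of the feature."""
    largest = max(values)
    return [v / largest for v in values]


# Invented STORE_SIZE values.
store_size = [50000, 100000, 200000]
normalized = max_normalize(store_size)
print(normalized)
```

One property worth keeping in mind: dividing a predictor by a constant rescales its coefficient but leaves the fitted values of an ordinary least-squares model unchanged, so this step alone would not be expected to change the error of such a model.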


6-25: Linear Regression with Normalized Data

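Hence, among the linear regression models, those built on the full data perform best: the imputed and transformed model (Section 6.2) has the lowest validation error (1.9831E8), followed by the untransformed full-data model (2.0451E8).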


7 Supervised Learning: Decision Tree

7-26: Decision Tree

7.1 Two-way Split

Variable Importance

Variable       Number of Splitting Rules   Importance               Validation Importance    Ratio of Validation to Training Importance
DEPT           72.0                        1.0                      1.0                      1.0
STORE          41.0                        0.4964515440958596       0.4964821958816832       1.0000617417473832
MARKDOWN3      14.0                        0.04084201073998688      0.04144053541125881      1.014654632826046
STORE_SIZE     7.0                         0.34207810450886517      0.3480075010929593       1.0173334583708806
STORE_TYPE     6.0                         0.09548333258131239      0.096222212274534        1.0077383106899038
UNEMPLOYMENT   4.0                         0.03701878109843059      0.031383248828117584     0.8477655907867824
TEMPERATURE    4.0                         0.023291162585169126     0.021652254529342163     0.9296339094352227
MARKDOWN5      3.0                         0.012643930440875074     0.01080177216385246      0.8543049342420205
MARKDOWN1      2.0                         0.0037723260628764943    0.0011206103465749833    0.297060839359281
CPI            1.0                         0.034734597871631134     0.03245507329474942      0.9343730828464989
ISHOLIDAY      1.0                         0.004704913825175917     0.00679092144395149      1.443367869484321
MARKDOWN4      1.0                         0.0017796209442395006    0.0                      0.0
MARKDOWN2      0.0                         0.0                      0.0                      NaN
FUEL_PRICE     0.0                         0.0                      0.0                      NaN

Table 7-2: Two Way Split

A two-way split decision tree was generated on the train dataset. The weekly sales classes generated earlier were used as the target classes. The model was heavily dependent on department (DEPT) and store (STORE): the majority of the splitting rules were based on these two attributes. The two-way split decision tree generated an average squared error of 0.04222.

WEEKLY_SALES is less dependent on the attribute ISHOLIDAY than on STORE_SIZE, STORE_TYPE, UNEMPLOYMENT and TEMPERATURE. Looking at the data from a broader perspective, the location of the store plays a major role in the weekly sales. A store located in a densely populated urban area would have more sales than one in a rural area, regardless of whether the week is a holiday week. The holiday sales in a store located far from the city might still be lower than the average sales in a city store on a day that is not a public holiday. Stores in the cities would be larger and would have higher sales. To explore this scenario another approach was pursued. (Refer to Section 4)


7-27: Two-way Split Decision Tree

7.2 Three-way Split

Variable Importance

Variable       Number of Splitting Rules   Importance               Validation Importance    Ratio of Validation to Training Importance
DEPT           150.0                       1.0                      1.0                      1.0
STORE          73.0                        0.7197389364808112       0.7172217848271354       0.9965026879524074
STORE_SIZE     39.0                        0.16985483324111295      0.17565253608886602      1.034133281562398
CPI            69.0                        0.13354907518567805      0.11875136590034423      0.8891964675550165
TEMPERATURE    49.0                        0.12228743722593886      0.1227097368514723       1.003453336132584
STORE_TYPE     11.0                        0.10896237311981295      0.1091215150571583       1.0014605219470611
UNEMPLOYMENT   43.0                        0.0870216497060246       0.08140314276386681      0.9354355271229843
MARKDOWN3      42.0                        0.0838513980255293       0.07108265278770555      0.8477217370432371
FUEL_PRICE     29.0                        0.055805194471475056     0.045466506369912014     0.8147360976074067
MARKDOWN4      4.0                         0.02260829885174643      0.01494100486542834      0.660863736957997
MARKDOWN5      4.0                         0.01614486406343703      0.013490146227052534     0.8355688951016575
MARKDOWN2      4.0                         0.014878670018523063     0.013747824483138145     0.9239955228540533
ISHOLIDAY      3.0                         0.014085153665462117     0.008854538152353212     0.6286433476452034
MARKDOWN1      4.0                         0.012841667939644022     0.016799803214149107     1.3082259479927658

Table 7-3: Three way split

In terms of variable importance, DEPT and STORE were again the most important variables. However, the three-way split gave the model more flexibility in decision making, and hence, as expected, there were fewer errors in classifying records into the weekly sales classes. The average squared error was 0.02765.


7-28: Three-way Split Decision Tree

7.3 Two-way Split without DEPT and STORE

To explore how much effect the attributes DEPT and STORE had on the decision tree model, these attributes were rejected and the model was generated with the same parameters as before. The resulting model showed a more than twofold increase in the average squared error (0.11243). Without information on which DEPT and STORE the sales belong to, the model misclassified every class other than LOW WEEKLY_SALES in almost all cases. This can be seen from the plots shown below.

7-29: Two-way Split without Dept. and Store


7.4 Decision Tree on Sampled Data

7-30: Decision Tree on Sampled Data

To observe how the weekly sales are dependent on the other features in the dataset, information on the department and store ID was rejected. The data was filtered such that the classes NEGATIVE and VERY HIGH WEEKLY_SALES were filtered out. The data was further sampled such that all the remaining classes, LOW, MEDIUM and HIGH WEEKLY_SALES had the same number of observations.
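The filtering and balancing step described above can be sketched in Python. The report does not say how the equal-sized sample was drawn, so this sketch assumes simple random downsampling of each remaining class to the size of the smallest one; the function name, the seed and the rows are hypothetical.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_of, keep=("Low", "Medium", "High"), seed=0):
    """Drop classes outside `keep`, then downsample to equal class sizes."""
    by_class = defaultdict(list)
    for row in rows:
        if label_of(row) in keep:
            by_class[label_of(row)].append(row)
    # Every kept class is reduced to the size of the smallest kept class.
    size = min(len(members) for members in by_class.values())
    rng = random.Random(seed)
    sample = []
    for label in keep:
        sample.extend(rng.sample(by_class[label], size))
    return sample


# Invented (sales_class, row_id) records with unequal class sizes.
rows = ([("Low", i) for i in range(6)]
        + [("Medium", i) for i in range(3)]
        + [("High", i) for i in range(4)]
        + [("Negative", 0), ("Very High", 0)])
sample = balanced_sample(rows, label_of=lambda row: row[0])
counts = {label: sum(1 for r in sample if r[0] == label)
          for label in ("Low", "Medium", "High")}
print(counts)
```

Balancing the classes keeps the tree from scoring well simply by predicting the most frequent class, which is what happened with LOW WEEKLY_SALES in Section 7.3.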

The decision trees modeled on this data returned results as expected: The STORE_SIZE was one of the major factors that determined the weekly sales and hence ended up as the most important variable for splitting nodes.

Variable Importance

Variable       Number of Splitting Rules   Importance               Validation Importance    Ratio of Validation to Training Importance
STORE_SIZE     16.0                        1.0                      1.0                      1.0
CPI            4.0                         0.2651285102856137       0.24717493036369145      0.9322834805559709
UNEMPLOYMENT   5.0                         0.24441028209455445      0.20442631071106554      0.8364063449342921
STORE_TYPE     3.0                         0.15295892375537554      0.15072179548628872      0.9853743200189836
MARKDOWN3      2.0                         0.052709576258346755     0.0315604300793954       0.5987608385373367
FUEL_PRICE     1.0                         0.022610228455857168     0.012600386988009991     0.5572870266485904
TEMPERATURE    1.0                         0.02168392343651502      0.010275651333624921     0.4738833986252257
MARKDOWN2      0.0                         0.0                      0.0                      NaN
MARKDOWN1      0.0                         0.0                      0.0                      NaN
MARKDOWN5      0.0                         0.0                      0.0                      NaN
ISHOLIDAY      0.0                         0.0                      0.0                      NaN
MARKDOWN4      0.0                         0.0                      0.0                      NaN

Table 7-4: Sampled Data Decision Tree

However, the average squared error for both trees (two-way split and three-way split) turned out to be 0.211, indicating that the trees produced nodes with lower purity.

The following table summarizes the decision tree models that were generated for the WALMART_TRAIN dataset.

Model                                        Average Squared Error
Two-way Split                                0.042
Three-way Split                              0.028
Two-way Split without DEPT and STORE         0.112
Two-way & Three-way Split on Sampled Data    0.211

Table 7-5: Avg. Error


8 Time Series Analysis

Due to the nature of the data, the results generated by the standard algorithms used in the previous sections provided little insight. To generate better results, the time series analysis tools of SAS were used.

8.1 Data Exploration

To analyze the data from a time series perspective, the time dimension was set up in conjunction with the cross-sectional dimensions, store and department, using the SAS TS Data Preparation node.

8-31: Dimension Cube

This structure made it possible to visualize the data aggregated over different dimensions.
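The idea of aggregating over different dimensions can be sketched outside SAS as follows. This is only an illustration of the concept, not the TS Data Preparation node itself; the rows, the column order and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (week, store, dept, weekly_sales).
rows = [
    ("2010-02-05", 1, 1, 100.0),
    ("2010-02-05", 1, 2, 50.0),
    ("2010-02-05", 2, 1, 80.0),
    ("2010-02-12", 1, 1, 120.0),
    ("2010-02-12", 2, 1, 70.0),
]

COLUMN = {"week": 0, "store": 1, "dept": 2}


def aggregate(rows, keep):
    """Sum weekly sales over every dimension that is not listed in `keep`."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[COLUMN[name]] for name in keep)
        totals[key] += row[3]
    return dict(totals)


sales_by_store_week = aggregate(rows, ("store", "week"))  # departments summed out
sales_by_week = aggregate(rows, ("week",))                # stores and departments summed out
print(sales_by_store_week)
print(sales_by_week)
```

Keeping the time dimension in every aggregation yields one time series per remaining cross-sectional key, which is the shape the later clustering and forecasting steps work on.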

The following plot shows the weekly sales for 100 of the store-department combinations. The plot makes it evident that sales were highest during the holiday seasons: Christmas in December, and before summer in May. Other notable peaks occurred around Thanksgiving in November, the Super Bowl in February, and Labor Day in September.

8-32: Weekly Sales for Store-Department

For further analysis, the mean weekly sales were plotted for each store and for each department. Some departments had very high average weekly sales compared to the others. Walmart does not identify the departments, for privacy reasons, but these may be departments that sell products people need day to day, such as groceries, or high-grossing departments such as electronics.

8-33: Mean Weekly Sales by Store

8-34: Mean Weekly Sales by Department

8.2 Hierarchical Clustering [6]

8-35: Hierarchical Clustering

The stores were clustered on their time series, using input variables such as CPI, UNEMPLOYMENT, ISHOLIDAY, TEMPERATURE and the MARKDOWN values. The mean squared error between the total weekly sales of two stores served as the similarity measure (a smaller error means more similar stores).

The following dendrogram shows the distance between the different clusters that were generated.

8-36: Clustering Dendrogram

Cutting the tree at a distance of 0.1 between clusters produced three main clusters. Stores 7, 16, 17, 38 and 44 formed cluster A; stores 28, 30, 33, 36, 37, 42 and 43 formed cluster B; the remaining stores formed cluster C. The features of these clusters became more evident during the forecasting process.
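The clustering step can be sketched roughly as follows: agglomerative clustering that uses the mean squared error between two sales series as the distance and stops merging once the closest pair of clusters is further apart than a cut-off. The average linkage rule, the function names, and the sample series are assumptions for illustration; the report does not state which linkage SAS Enterprise Miner used or how the sales were scaled.

```python
from itertools import combinations


def mse(a, b):
    """Mean squared error between two equally long series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def agglomerate(series, max_distance):
    """Merge the closest clusters (average linkage) while they are within max_distance."""
    clusters = [[name] for name in series]

    def linkage(c1, c2):
        pairs = [(p, q) for p in c1 for q in c2]
        return sum(mse(series[p], series[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical (already scaled) total weekly sales for five stores.
series = {
    "S1": [1.0, 1.1, 1.5],
    "S2": [1.0, 1.2, 1.4],
    "S3": [0.5, 0.5, 0.6],
    "S4": [0.5, 0.4, 0.6],
    "S5": [1.0, 0.8, 0.5],
}
clusters = agglomerate(series, max_distance=0.1)
print(clusters)
```

With these made-up series the cut-off of 0.1 leaves three clusters, because the remaining groups are all further apart than the threshold.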

The following graph shows how the stores were clustered according to their weekly sales trends and the trends of the other attributes.

8-37: Clustering Graph

The features of these clusters became more evident when the trends in sales for the stores were analyzed. Stores from the same cluster showed similar trends in weekly sales.

8.3 Sales Forecasting [7]

Using the Time Series Exponential Smoothing tool in SAS Enterprise Miner, the sales for each store were forecast for the next six weeks, until December 2012. This forecasting method does not depend on any of the input variables mentioned earlier. It takes into account only the seasonal effects and trends in sales over the period from February 2010 to October 2012.

8-38: Sales Forecasting

For each store, several models were fitted, and the model with the smallest standard error was automatically selected as the best model for forecasting that store's sales. The additive Winters and seasonal models proved to be the best fit for most stores. The following table lists the model used for each store, together with the parameter estimates and their standard errors.

Time Series ID   Store   Model        Parameter   Parameter Estimate       Standard Error
1                1       ADDWINTERS   LEVEL       0.0034631198964860067    0.00441693090198638
1                1       ADDWINTERS   SEASON      0.6055475095192919       0.040597985445483306
1                1       ADDWINTERS   TREND       0.001                    0.002625959422274718
2                2       WINTERS      LEVEL       0.16545582491041058      0.02507185069755535
2                2       WINTERS      SEASON      0.921247729943568        0.05853998565921424
2                2       WINTERS      TREND       0.001                    0.009608906293211008
3                3       ADDWINTERS   TREND       0.001                    0.004528332723662851
3                3       ADDWINTERS   SEASON      0.6250302553028629       0.04869015651345081
3                3       ADDWINTERS   LEVEL       0.18422291502403845      0.025950895825345977
4                4       ADDWINTERS   LEVEL       0.08795186376589817      0.019526521663305617
4                4       ADDWINTERS   TREND       0.001                    0.006350001834242006
4                4       ADDWINTERS   SEASON      0.7130415339734698       0.04124489642150956
5                5       ADDWINTERS   TREND       0.001                    0.01145787934036644
5                5       ADDWINTERS   SEASON      0.5774493510072217       0.04518008362353768
5                5       ADDWINTERS   LEVEL       0.1367460310044868       0.023629244732261773
6                6       SEASONAL     SEASON      0.7157152817395214       0.04219586666398573
6                6       SEASONAL     LEVEL       0.12046874034834318      0.016873522037056683
7                7       ADDWINTERS   LEVEL       0.1653571721391764       0.020244620448452346
7                7       ADDWINTERS   SEASON      0.7697969237235239       0.050169815195813205
7                7       ADDWINTERS   TREND       0.001                    0.004356459668509099
8                8       ADDWINTERS   SEASON      0.6850143638004          0.04029437117033705
8                8       ADDWINTERS   LEVEL       0.0662748336317954       0.013963823601356068
8                8       ADDWINTERS   TREND       0.001                    0.005333437425109881
9                9       ADDWINTERS   LEVEL       0.17980061535722747      0.02424450837665769
9                9       ADDWINTERS   TREND       0.001                    0.030250497941758804
9                9       ADDWINTERS   SEASON      0.814285370219351        0.05696269299246877

Table 8-6: Models
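As a rough illustration of what the LEVEL, TREND and SEASON weights in Table 8-6 control, the sketch below implements additive Winters (Holt-Winters) exponential smoothing in plain Python. The function name, the initialization from the first two seasons, and the sample series are assumptions made for this example; the procedure inside SAS Enterprise Miner may initialize and estimate the model differently.

```python
def additive_winters_forecast(y, period, level_w, trend_w, season_w, horizon):
    """Forecast `horizon` steps ahead with additive Winters smoothing."""
    if len(y) < 2 * period:
        raise ValueError("need at least two full seasons of history")

    # Assumed initialization: trend from the first two seasonal means,
    # seasonal offsets from the detrended first season.
    mean1 = sum(y[:period]) / period
    mean2 = sum(y[period:2 * period]) / period
    trend = (mean2 - mean1) / period
    mid = (period - 1) / 2
    seasonal = [y[i] - (mean1 + trend * (i - mid)) for i in range(period)]
    level = mean1 + trend * mid  # level at the end of the first season

    for t in range(period, len(y)):
        s = seasonal[t % period]
        previous_level = level
        level = level_w * (y[t] - s) + (1 - level_w) * (previous_level + trend)
        trend = trend_w * (level - previous_level) + (1 - trend_w) * trend
        seasonal[t % period] = season_w * (y[t] - level) + (1 - season_w) * s

    n = len(y)
    return [level + h * trend + seasonal[(n + h - 1) % period]
            for h in range(1, horizon + 1)]


# Hypothetical series: linear trend plus a repeating four-step seasonal pattern.
pattern = [1.0, -1.0, 2.0, -2.0]
history = [10.0 + 2.0 * t + pattern[t % 4] for t in range(12)]
forecast = additive_winters_forecast(history, period=4, level_w=0.2,
                                     trend_w=0.001, season_w=0.6, horizon=4)
print([round(value, 3) for value in forecast])
```

Each weight lies between 0 and 1 and sets how quickly the corresponding component reacts to new observations. A TREND weight as small as 0.001 means the trend estimate barely changes from week to week, while a SEASON weight around 0.6 to 0.9 lets the seasonal pattern adapt quickly. For weekly data with a yearly cycle the period would be 52 rather than 4.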

Based on these models, the weekly sales of each store were forecast for 12 weeks, covering the holiday season in December (the forecast sales are shown after the vertical line on the graph). The following graph shows the forecast for a store that is doing fairly well. Store 1 belongs to cluster B. All stores in this cluster show a similar trend: a very high peak in sales during Christmas.

8-39: Store 1 - Cluster B

The following graph shows the sales for Store 7, which was selected from cluster A. The store shows strong sales from May to September and from November to January, and could have good growth potential. All stores in this cluster follow a similar trend, which brings in a steady income in addition to higher sales during holidays. These can be considered stores with steady growth rates.

8-40: Store 7 - Cluster A

The following graph shows the sales for Store 36, which Walmart should focus on. The store has been losing sales and, if the trend continues, is likely to go out of business within the next couple of years: its total sales fell by half over a period of two years. Store 36 was taken from cluster C. Stores in this cluster generally showed a declining trend.

8-41: Store 36 – Cluster C

9 Business Implications

Based on this analysis, Walmart should hire personnel a few weeks before the holiday seasons, especially Thanksgiving and Christmas. This allows the new staff to perform better as sales rise gradually in the run-up to the holidays.

The cluster information from section 8.2 can be used in conjunction with sales forecasting to produce more accurate predictions.

Walmart should keep a close eye on the stores that are at risk of going out of business. It should also give the other stores incentives to improve their sales, and hire the right sales representatives.

10 References

[1] M. Gilliland, "Demand Forecasting in Retail," [Online]. Available: http://www.sas.com/news/feature/retail/aug06forecast.html.

[2] M. K. &. R. R. Nitin Patel, "Clustering Models to Improve Forecasts in Retails Merchandising," [Online]. Available: http://www.cytel.com/Papers/INFORMS_Prac_%2004.pdf.

[3] L. C.-L. &. R. Dudley, "Wal-Mart Sees Profit at Low End of Forecast," [Online]. Available: http://www.bloomberg.com/news/2014-01-31/wal-mart-sees-profit-at-low-end-of-forecast.html.

[4] R. Dudley, "Wal-Mart Cuts Annual Sales Forecast as Supercenters Struggle," [Online]. Available: http://www.businessweek.com/news/2014-10-16/wal-mart-cuts-annual-sales-forecast-as-its-supercenters-.

[5] "Kaggle - Walmart Recruiting - Stores Sales Forecasting," [Online]. Available: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting.

[6] S. Schubert and T. Lee, "Time Series Data Mining with SAS Enterprise Miner," [Online]. Available: http://support.sas.com/resources/papers/proceedings11/160-2011.pdf.

[7] S. J. Satyajit Dwivedi, "Time-series Data Mining," [Online]. Available: http://www.iasri.res.in/sscnars/data_mining/10-SAS%20Enterprise%20Miner%207.1%20Time%20Series%20Data%20Mining.pdf.

Appendix A

-- Bin TEMPERATURE into classes. The last bin uses >= so that a value of exactly 95 is classified.
ALTER TABLE WALMART_TRAIN
ADD TEMP_CLASS VARCHAR2(15);

UPDATE WALMART_TRAIN
SET    TEMP_CLASS = (CASE
                         WHEN TEMPERATURE < 32 THEN 'Freezing'
                         WHEN TEMPERATURE >= 32 AND TEMPERATURE < 64 THEN 'Cold'
                         WHEN TEMPERATURE >= 64 AND TEMPERATURE < 79 THEN 'Comfortable'
                         WHEN TEMPERATURE >= 79 AND TEMPERATURE < 95 THEN 'Hot'
                         WHEN TEMPERATURE >= 95 THEN 'Extremely Hot'
                         ELSE NULL
                     END);

-- http://www.gasbuddy.com/gb_gastemperaturemap.aspx

-- Bin FUEL_PRICE into classes. The last bin uses >= so that a value of exactly 3.12 is classified.
ALTER TABLE WALMART_TRAIN
ADD FUEL_CLASS VARCHAR2(15);

UPDATE WALMART_TRAIN
SET    FUEL_CLASS = (CASE
                         WHEN FUEL_PRICE < 2.75 THEN 'Low'
                         WHEN FUEL_PRICE >= 2.75 AND FUEL_PRICE < 3.12 THEN 'Medium'
                         WHEN FUEL_PRICE >= 3.12 THEN 'High'
                         ELSE NULL
                     END);

-- http://www.statisticbrain.com/wal-mart-company-statistics/

-- Bin WEEKLY_SALES into classes.
ALTER TABLE WALMART_TRAIN
ADD SALES_CLASS VARCHAR2(15);

UPDATE WALMART_TRAIN
SET    SALES_CLASS = (CASE
                          WHEN WEEKLY_SALES <= 0 THEN 'Negative'
                          WHEN WEEKLY_SALES > 0 AND WEEKLY_SALES <= 10000 THEN 'Low'
                          WHEN WEEKLY_SALES > 10000 AND WEEKLY_SALES <= 25000 THEN 'Medium'
                          WHEN WEEKLY_SALES > 25000 AND WEEKLY_SALES <= 100000 THEN 'High'
                          WHEN WEEKLY_SALES > 100000 THEN 'Very High'
                          ELSE NULL
                      END);
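The CASE expressions in Appendix A all follow the same binning pattern. As a cross-check of the temperature thresholds, the same idea can be sketched in Python with half-open intervals, so that every value, including the boundary values, falls into exactly one class. The constant and function names are hypothetical; the thresholds and labels are the ones used for TEMP_CLASS.

```python
from bisect import bisect_right

# Upper bounds of the first four classes; each class is a half-open interval [low, high).
TEMP_BOUNDS = [32, 64, 79, 95]
TEMP_LABELS = ["Freezing", "Cold", "Comfortable", "Hot", "Extremely Hot"]


def temp_class(temperature):
    """Return the temperature class, or None when the value is missing."""
    if temperature is None:
        return None
    return TEMP_LABELS[bisect_right(TEMP_BOUNDS, temperature)]


for value in (31.9, 32, 78.5, 95, None):
    print(value, temp_class(value))
```

Keeping the bounds in one sorted list makes it harder to leave a gap between two neighbouring classes than when each boundary is written twice in separate conditions.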

Appendix B

CREATE TABLE WALMART_TRAIN_HOLIDAY
AS
SELECT *
FROM   WALMART_TRAIN;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD HOLIDAY VARCHAR2(25);

-- Label the holiday weeks.
UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Super Bowl'
WHERE  WEEK IN (TO_DATE('12-Feb-10', 'DD-Mon-RR'), TO_DATE('11-Feb-11', 'DD-Mon-RR'),
                TO_DATE('10-Feb-12', 'DD-Mon-RR'), TO_DATE('08-Feb-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Labor Day'
WHERE  WEEK IN (TO_DATE('10-Sep-10', 'DD-Mon-RR'), TO_DATE('09-Sep-11', 'DD-Mon-RR'),
                TO_DATE('07-Sep-12', 'DD-Mon-RR'), TO_DATE('06-Sep-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Thanksgiving'
WHERE  WEEK IN (TO_DATE('26-Nov-10', 'DD-Mon-RR'), TO_DATE('25-Nov-11', 'DD-Mon-RR'),
                TO_DATE('23-Nov-12', 'DD-Mon-RR'), TO_DATE('29-Nov-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Christmas'
WHERE  WEEK IN (TO_DATE('31-Dec-10', 'DD-Mon-RR'), TO_DATE('30-Dec-11', 'DD-Mon-RR'),
                TO_DATE('28-Dec-12', 'DD-Mon-RR'), TO_DATE('27-Dec-13', 'DD-Mon-RR'));

-- Label the two weeks before each holiday week. The upper bound is exclusive
-- (an inclusive BETWEEN would overwrite the label of the holiday week itself).
UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Before Super Bowl'
WHERE  (WEEK >= TO_DATE('12-Feb-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('12-Feb-10', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('11-Feb-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('11-Feb-11', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('10-Feb-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('10-Feb-12', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('08-Feb-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('08-Feb-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Before Labor Day'
WHERE  (WEEK >= TO_DATE('10-Sep-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('10-Sep-10', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('09-Sep-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('09-Sep-11', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('07-Sep-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('07-Sep-12', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('06-Sep-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('06-Sep-13', 'DD-Mon-RR'));

-- The upper bound is exclusive so that the holiday week keeps its own label.
UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Before Thanksgiving'
WHERE  (WEEK >= TO_DATE('26-Nov-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('26-Nov-10', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('25-Nov-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('25-Nov-11', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('23-Nov-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('23-Nov-12', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('29-Nov-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('29-Nov-13', 'DD-Mon-RR'));

UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Before Christmas'
WHERE  (WEEK >= TO_DATE('31-Dec-10', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('31-Dec-10', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('30-Dec-11', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('30-Dec-11', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('28-Dec-12', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('28-Dec-12', 'DD-Mon-RR'))
   OR  (WEEK >= TO_DATE('27-Dec-13', 'DD-Mon-RR') - 14 AND WEEK < TO_DATE('27-Dec-13', 'DD-Mon-RR'));

-- Every remaining week is a non-holiday week.
UPDATE WALMART_TRAIN_HOLIDAY
SET    HOLIDAY = 'Not Holiday'
WHERE  HOLIDAY IS NULL;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD STORE_SIZE_CLASS VARCHAR2(10);

UPDATE WALMART_TRAIN_HOLIDAY
SET    STORE_SIZE_CLASS = CASE
                              WHEN STORE_SIZE < 100000 THEN 'Small'
                              WHEN STORE_SIZE >= 100000 AND STORE_SIZE < 200000 THEN 'Medium'
                              WHEN STORE_SIZE >= 200000 THEN 'Large'
                          END;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD UNEMPLOYMENT_CLASS VARCHAR2(10);

UPDATE WALMART_TRAIN_HOLIDAY
SET    UNEMPLOYMENT_CLASS = CASE
                                WHEN UNEMPLOYMENT < 7 THEN 'Low'
                                WHEN UNEMPLOYMENT >= 7 AND UNEMPLOYMENT < 11 THEN 'Medium'
                                WHEN UNEMPLOYMENT >= 11 THEN 'High'
                            END;

ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD CPI_CLASS VARCHAR2(10);

-- The 'Medium' bin tests CPI on both bounds.
UPDATE WALMART_TRAIN_HOLIDAY
SET    CPI_CLASS = CASE
                       WHEN CPI < 159 THEN 'Low'
                       WHEN CPI >= 159 AND CPI < 192 THEN 'Medium'
                       WHEN CPI >= 192 THEN 'High'
                   END;

-- Classify each department by its median weekly sales.
ALTER TABLE WALMART_TRAIN_HOLIDAY
ADD DEPT_CLASS VARCHAR2(12);

UPDATE WALMART_TRAIN_HOLIDAY OH
SET    DEPT_CLASS = 'Low Sales'
WHERE  DEPT IN (SELECT DEPT
                FROM   (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                        FROM   WALMART_TRAIN_HOLIDAY
                        GROUP BY DEPT)
                WHERE  MD < 20000);

UPDATE WALMART_TRAIN_HOLIDAY OH
SET    DEPT_CLASS = 'Medium Sales'
WHERE  DEPT IN (SELECT DEPT
                FROM   (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                        FROM   WALMART_TRAIN_HOLIDAY
                        GROUP BY DEPT)
                WHERE  MD >= 20000 AND MD < 40000);

UPDATE WALMART_TRAIN_HOLIDAY OH
SET    DEPT_CLASS = 'High Sales'
WHERE  DEPT IN (SELECT DEPT
                FROM   (SELECT DEPT, MEDIAN(WEEKLY_SALES) MD
                        FROM   WALMART_TRAIN_HOLIDAY
                        GROUP BY DEPT)
                WHERE  MD >= 40000);
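The intent of the holiday labelling at the start of Appendix B (a label for each holiday week, a "Before ..." label for the two preceding weeks, and "Not Holiday" otherwise) can be sketched in Python as follows. The function and constant names are hypothetical, and the sketch treats the "Before" window as ending just before the holiday week so that the holiday week keeps its own label. The dates are the ones listed in the appendix.

```python
from datetime import date, timedelta

# Holiday week dates as listed in Appendix B.
HOLIDAY_WEEKS = {
    "Super Bowl": [date(2010, 2, 12), date(2011, 2, 11), date(2012, 2, 10), date(2013, 2, 8)],
    "Labor Day": [date(2010, 9, 10), date(2011, 9, 9), date(2012, 9, 7), date(2013, 9, 6)],
    "Thanksgiving": [date(2010, 11, 26), date(2011, 11, 25), date(2012, 11, 23), date(2013, 11, 29)],
    "Christmas": [date(2010, 12, 31), date(2011, 12, 30), date(2012, 12, 28), date(2013, 12, 27)],
}
LEAD_TIME = timedelta(days=14)


def holiday_label(week):
    """Return the holiday label for a week-ending date."""
    # Exact holiday weeks take priority over the lead-up windows.
    for name, dates in HOLIDAY_WEEKS.items():
        if week in dates:
            return name
    for name, dates in HOLIDAY_WEEKS.items():
        if any(d - LEAD_TIME <= week < d for d in dates):
            return "Before " + name
    return "Not Holiday"


for week in (date(2012, 11, 2), date(2012, 11, 9), date(2012, 11, 16), date(2012, 11, 23)):
    print(week.isoformat(), holiday_label(week))
```

Checking the exact holiday weeks first and using an exclusive upper bound for the lead-up window keeps the two kinds of label from overwriting each other, regardless of the order in which the holidays are listed.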
