DA 592 - Term Project Report - Berker Kozan Can Koklu

Sabancı University Data Analytics M.Sc. Programme

DA 592 - 2015-2016 Term Project

Grupo Bimbo Inventory Demand - Kaggle Contest

Students: Berker Kozan, Can Köklü

Abstract

This data analytics project was done for a Kaggle contest where the goal was to perform

demand prediction for Grupo Bimbo company. Python language was used with Jupyter

notebooks. XGBoost library was used to perform training and predictions.

Various feature engineering features such as NLTK for text extraction, creation of lag

columns and averaging over a large number of variables were used to enhance data.

After the train table was created, XGBoost was utilized to optimize according to the

scoring function dictated by the contest, RMSLE. Hyperparameter tuning was also

leveraged after feature selection based on feature importance and correlation analysis

to determine the best parameters for the XGBoost optimizer.

The final submission to Kaggle achieved a score of 0.48666; placing our team in the top

17% of the 2000 contestants.

The biggest challenges were related to analyzing and training on a large data set. This

was overcome by forcing the data types to smaller types (unsigned integers, low

accuracy floats, etc.), using HDF5 file format for data storage and launching a

powerful Google Cloud Compute Preemptible Instance (with 208 GB RAM).

Further improvements would include attempting hyperparameter tuning across a wider

range of training tables (with different features) and also implementing a failsafe

method for running the experiment in preemptible instances. Additionally, creating

different models and averaging them to find optimal and non-overfitted models would

have yielded better results.

Keywords: data science, kaggle, demand prediction, python, jupyter, xgboost, cloud, google cloud compute, hdf5, hyperparameter tuning, feature selection

1

Table of Contents

Abstract 1

Table of Contents 2

Introduction 3

What is Kaggle? 3

What is the contest about? 3

Why this project? 4

Tools Used 4

Python 4

Platforms 5

Data Exploration 6

Definition of the Data Sets 6

Exploratory Data Analysis 7

Data Types and Sizes 7

Distributions and Summary of Data

8

Correlations 9

Decreasing the Data Size 9

Models 10

Naive Prediction 10

Score 10

NLTK based Modelling 10

Feature Engineering 10

Modeling 11

Technical Problems 11

Garbage Collection 11

Data Size 11

Score 12

Conclusion 12

Final Models 12

Digging Deeper in Data

Exploration 12

Demanda - Dev - Venta

Relationship 12

Train - Test Difference 13

Feature Engineering 14

Agencia 14

Producto 14

Client Features 14

Demand Features 15

General Totals 15

Validation Technique 17

Xgboost 17

Training 18

Hyperparameter Tuning 18

Max Depth: 18

Subsample: 18

ColSample By Tree: 18

Learning Rate: 19

Technical Problems 19

Storing Data 19

RAM Problem 19

Code Reuse and Automatization 19

Results 20

Conclusion 21

Critical Mistakes 22

Poor data exploration 22

Not preparing for system outages

22

Performing hyperparameter tuning

too late 22

Further Exploration 22

Partial Fitting 22

Multiple Models 22

Neural Networks 22

2

Introduction

What is Kaggle?

Kaggle is a website founded in 2010 that provides data science challenges to its 1

participants. Participants compete against each other to solve data science problems.

Kaggle has unranked “practice” challenges as well as contests with monetary rewards.

Companies that have data challenges work together with Kaggle to formulate the problem

and reward the top performers.

What is the contest about?

The contest that we have taken on for our project belongs to a Mexican company named 2

Grupo Bimbo. Grupo Bimbo is a company that produces and distributes fresh bakery

products. The nature of the problem at its core is demand estimation.

Grupo Bimbo produces the products and ships them from storage facilities (agencies) to

stores (clients). The following week, a certain number of products that aren’t sold are

returned from the clients to Bimbo. To maximize their profits, Grupo Bimbo needs to

predict the demand of stores accurately to minimize these returns.

In the contest, we are provided with 9 weeks worth of data regarding these shipments

and we are asked to predict the demand for weeks 10 and 11. Participants are allowed to

submit 3 sets of predictions every day until the deadline of the project and can pick

any two of these predictions as their final submissions.

As a standard practice at Kaggle, when making initial submissions, the predictions are

ranked based on a “public” ranking which only evaluates a certain part of the

submission. This is done to prevent gaming the system by overfitting through trial &

error of submissions. For this contest, the “public” ranking is done on week 10 data;

meaning, when submitting our predictions we would only be able to see our performance

for week 10. The private performance of our predictions (i.e. weeks 11) are only shown

after the contest ends.

For this contest the evaluation metric is the Root Mean Squared Logarithmic Error of

our predictions.

1 "About | Kaggle." 2012. 10 Sep. 2016 < https://www.kaggle.com/about > 2 "Grupo Bimbo Inventory Demand | Kaggle." 2016. 10 Sep. 2016 < https://www.kaggle.com/c/grupo-bimbo-inventory-demand >

3

https://www.kaggle.com/about

https://www.kaggle.com/c/grupo-bimbo-inventory-demand

Why this project?

We decided to do a Kaggle contest for our project for various reasons:

1. It would allow us to benchmark our data science abilities in an international

field.

2. Kaggle has very active forums for each individual contest and these would provide

us with great new methods and insights in solving problems.

3. Since the data provided is clean, we could spend more time in feature and model

building rather than data cleaning.

4. We could work towards a clear goal and not be distracted.

5. From the number of contest in Kaggle, we picked the Grupo Bimbo project because:

a. It deals with text data which is considerably easier to work with for

beginners.

b. The data was very large and provided a learning opportunity in working

with large data sets.

c. The deadline of the project (August 30) was in line with the deadline of

our term project.

Tools Used

Python

We decided to use Python (version 2.7 ) as the our scripting language. This is the 3

language we worked with most in our programme and also one of the most popular data

science languages. We built our systems on the Anaconda package by Continuum as it 4

offers a large number of libraries that help us face the challenges.

We mainly ran Jupyter (IPython) notebooks on various systems to code and report 5

results.

A few of the specific tools/packages that we used were:

3 "Python 2.7.0 Release | Python.org." 2014. 10 Sep. 2016 < https://www.python.org/download/releases/2.7/ > 4 "Download Anaconda Now! | Continuum - Continuum Analytics." 2015. 10 Sep. 2016 < https://www.continuum.io/downloads > 5 "Project Jupyter | Home." 2014. 10 Sep. 2016 < http://jupyter.org/ >

4

https://www.python.org/download/releases/2.7/

https://www.continuum.io/downloads

http://jupyter.org/

● NLTK : NLTK is the most popular Natural Language Processing Toolkit for Python. 6

It offers great features like stemming, tokenizing and chunking in multiple

languages. This was critical since the product names were in Spanish.

● XGBoost: XGBoost is a library that can be used in conjunction with various

scripting languages (including R and Python) that is designed for gradient

boosting trees. It is much faster than regular scripting tools since the

computational parts are written and precompiled in C++. We picked this solution

based simply on its fame, as many of the winners have used this tool in Kaggle

contests . 7

● Pickle : The pickle module implements binary protocols for serializing and 8

de-serializing a Python object structure.

● HDF5 File Format: HDF is self-describing, allowing an application to interpret 9

the structure and contents of a file with no outside information, a

general-purpose, machine-independent standard for storing scientific data in

files, developed by the National Center for Supercomputing Applications (NCSA).

● Scikit-Learn : Scikit-Learn is a simple and efficient tool for data mining and 10

machine learning beside that it’s free and build on numpy, matplotlib and scipy.

We used it on feature extraction phase.

● NumPy : NumPy is an open source extension module for Python, which provides fast 11

precompiled functions for mathematical and numerical routines. Furthermore, NumPy

enriches the programming language Python with powerful data structures for

efficient computation of multidimensional arrays and matrices.

● SciPy : SciPy is a Python-based ecosystem of open-source software for 12

mathematics, science, and engineering. We used it for sparse matrices.

● Garbage Collector : The gc module was used in order to free up memory 13

periodically to optimize performance.

Platforms

For coding and performing our computations, we initially attempted to use our laptops

(a Macbook Pro and an Ubuntu Machine each with 16GB of RAM). However, after getting

numerous Memory Errors, we gradually came to realize that our computers would not be

able to run the computations that we need (at least not in an efficient and timely

manner). To solve our problem we turned to cloud services.

We first set up an EC2 instance on Amazon Web Services with about 100GB of RAM and 16

virtual CPU cores, using a public tutorial . However, running such a powerful instance 14

6 "Natural Language Toolkit — NLTK 3.0 documentation." 2005. 10 Sep. 2016 < http://www.nltk.org/ > 7 "xgboost/demo at master · dmlc/xgboost · GitHub." 2015. 10 Sep. 2016 < https://github.com/dmlc/xgboost/tree/master/demo > 8 "12.1. pickle — Python object serialization — Python 3.5.2 documentation." 2014. 20 Sep. 2016 < https://docs.python.org/3/library/pickle.html > 9 "Importing HDF5 Files - MATLAB & Simulink - MathWorks." 2012. 20 Sep. 2016 < http://www.mathworks.com/help/matlab/import_export/importing-hierarchical-data-format-hdf5-files.html > 10 "scikit-learn: machine learning in Python — scikit-learn 0.17.1 ..." 2011. 20 Sep. 2016 < http://scikit-learn.org/ > 11 "What is NumPy? - Numpy and Scipy Documentation." 2009. 20 Sep. 2016 < http://docs.scipy.org/doc/numpy/user/whatisnumpy.html > 12 "SciPy.org — SciPy.org." 2002. 21 Sep. 2016 < http://www.scipy.org/ > 13 "28.12. gc — Garbage Collector interface — Python 2.7.12 ..." 2014. 21 Sep. 2016 < https://docs.python.org/2/library/gc.html > 14 "Setting up AWS for Kaggle Part 1 – Creating a first Instance – grants ..." 2016. 10 Sep. 2016 < http://www.grant-mckinnon.com/?p=6 >

5

https://docs.python.org/3/library/pickle.html#module-pickle

http://www.nltk.org/

https://github.com/dmlc/xgboost/tree/master/demo

https://docs.python.org/3/library/pickle.html

http://www.mathworks.com/help/matlab/import_export/importing-hierarchical-data-format-hdf5-files.html

http://www.mathworks.com/help/matlab/import_export/importing-hierarchical-data-format-hdf5-files.html

http://scikit-learn.org/

http://docs.scipy.org/doc/numpy/user/whatisnumpy.html

http://www.scipy.org/

https://docs.python.org/2/library/gc.html

http://www.grant-mckinnon.com/?p=6

continuously proved costly; a two day attempt to build and run models cost over 150USD.

(An important side note, one should make sure that all items that relate to the

instance created are removed completely to avoid incurring charges. In the case of one

of the authors of this paper, an extra 50USD was later charged because backup copies of

the instances were not deleted.)

We then decided to switch to Google Cloud Compute service; building a system with

similar specs, again following a publicly available tutorial . Although slightly 15

cheaper, having a dedicated machine run for an entire day again proved costly,

incurring about 50USD. At this point we decided to find a cheaper solution and decided

to look at Amazon’s Spot Instances and Google’s Preemptible Instances.

Both Amazon Spot Instances and Google Preemptible Instances operate on the principle

that they offer the company's surplus computing power at a discount. The caveat being

that if there are other consumers that want to use this computing power, the instances

can be stopped by the company at any point. The biggest difference between the two is

that Amazon offers a more bidding model where the prices for the computing power

fluctuates; if the bid that the buyer is higher than the current market price, the

instance remains active; however if the market price raises above the bid, it is shut

down. Google on the other hand offers a specific price for the instance . 16

We eventually settled down on using a Google Preemptible instance with 32 virtual CPUs

and 208 GB of RAM. We had to deal with a premature shutdown only once while running the

instance over the course of three days. The total cost of the preemptible instances and

backups etc came to about 60 USD.

The key interface to the Google Cloud interface was a command prompt terminal, where

the Jupyter notebook was initiated and data files were uploaded and submission files

were downloaded via SSH.

Github 17

GitHub is a code hosting platform for version control and collaboration which lets

people work together on projects from anywhere. We used this to work on our codes in

parallel while easily merging our developments.

Data Exploration

Definition of the Data Sets

The data sets that we were provided with was as follows:

● train.csv — the training set, total Demand data from clients and products per

week for weeks 3-9; containing the following fields:

○ Semana - The week

○ Agencia_ID - ID of the storage facility from which the order is

dispatched.

○ Canal_ID - The channel through which the order is placed.

15 "Set up Anaconda + IPython + Tensorflow + Julia on a Google ..." 2016. 10 Sep. 2016 < https://haroldsoh.com/2016/04/28/set-up-anaconda-ipython-tensorflow-julia-on-a-google-compute-engine-vm/ > 16 "What are the key differences between AWS Spot Instances ... - Quora." 2015. 10 Sep. 2016 < https://www.quora.com/What-are-the-key-differences-between-AWS-Spot-Instances-and-Googles-Preemptive-Instances > 17 "Hello World · GitHub Guides." 2014. 20 Sep. 2016 < https://guides.github.com/activities/hello-world/ >

6

https://haroldsoh.com/2016/04/28/set-up-anaconda-ipython-tensorflow-julia-on-a-google-compute-engine-vm/

https://haroldsoh.com/2016/04/28/set-up-anaconda-ipython-tensorflow-julia-on-a-google-compute-engine-vm/

https://www.quora.com/What-are-the-key-differences-between-AWS-Spot-Instances-and-Googles-Preemptive-Instances

https://www.quora.com/What-are-the-key-differences-between-AWS-Spot-Instances-and-Googles-Preemptive-Instances

https://guides.github.com/activities/hello-world/

○ Ruta_SAK - The route ID of the delivery route.

○ Cliente_ID - The Client ID

○ Producto_ID - The Product ID

○ Venta_uni_hoy - The number of items that were ordered

○ Venta_hoy - The total cost of the items that were ordered

○ Dev_uni_proxima - The number of items that were returned.

○ Dev_proxima - The total cost of the items that were returned.

○ Demanda_uni_equil - Actual demand (the stock that was actually sold), this is the label that we need to predict for weeks 10 and 11.

● test.csv — the test set, data from clients and products for weeks 10 and 11

containing the fields:

○ Id

○ Semana

○ Agencia_ID

○ Canal_ID

○ Ruta_SAK

○ Cliente_ID

○ Producto_ID

● cliente_tabla.csv — client names (can be joined with train/test on Cliente_ID)

● producto_tabla.csv — product names (can be joined with train/test on

Producto_ID)

● town_state.csv — town and state (can be joined with train/test on Agencia_ID)

● sample_submission.csv — a sample submission file in the correct format

Image 1: Data Structure

None of the numeric variables existing in train data are in the test set of the data,

so the problem here is predicting the demand with only 6 categorical features.

Exploratory Data Analysis

Data Types and Sizes

The sizes of the data files were as follows:

7

● town_state.csv 0.03MB

● train.csv 3199.36MB

● cliente_tabla.csv 21.25MB

● test.csv 251.11MB

● producto_tabla.csv 0.11MB

● sample_submission.csv 68.88MB

Distributions and Summary of Data

Image 2: Summary of Train Data

Image 3: Summary of Train Data (cont.)

Image 4: Distribution of Target Variable

8

Target variable's mean is 7, median is 3, max is 5000, std is 25 and %75 of the data is

between 0-6. This is a classical right-skewed data and this explains why evaluation

metric is RMSLE. Moreover, we logged target variable (log(variable+1)) before starting

modeling and than take exponential of it before submitting (exp(variable)-1).

Correlations

Image 5: Scatter Plots of Key Variables

In these scatter plots, we see that orders are highly correlated with demand, and

secondly, where demand is high returns are low.

Decreasing the Data Size

In order to optimize RAM usage and speed up XGBoost’s performance, we made sure to

force the type of the data fields explicitly. We defined all our integers to use

unsigned integer format and decreased the accuracy of floating point integers as much

as possible. For example, Canal_ID can be uint8. After conversions, memory is reduced

to 2.1 gb from 6.1gb.

Data with Default Data Types Data with Optimized Data Types

9

RangeIndex: 74180464 entries, 0 to 74180463

Data columns (total 11 columns):

Semana int64

Agencia_ID int64

Canal_ID int64

Ruta_SAK int64

Cliente_ID int64

Producto_ID int64

Venta_uni_hoy int64

Venta_hoy float64

Dev_uni_proxima int64

Dev_proxima float64

Demanda_uni_equil int64

dtypes: float64(2), int64(9)

memory usage: 6.1 GB

RangeIndex: 74180464 entries, 0 to 74180463

Data columns (total 11 columns):

Semana uint8

Agencia_ID uint16

Canal_ID uint8

Ruta_SAK uint16

Cliente_ID uint32

Producto_ID uint16

Venta_uni_hoy uint16

Venta_hoy float32

Dev_uni_proxima uint32

Dev_proxima float32

Demanda_uni_equil uint32

dtypes: float32(2), uint16(4), uint32(3),

uint8(2)

memory usage: 2.1 GB

Models

Naive Prediction

We first decided to create a naive prediction; for this we grouped the training data

based on Product ID, Client ID, Agency ID and Route ID. We simply took the median of

this grouping, if this specific grouping did not exist, in the training data set, we

defaulted back to the product’s median demand and if this also did not exist, we simply

took the average of the overall demand.

Score

This method resulted in a score of 0.73 when submitted.

NLTK based Modelling

Feature Engineering

We utilized the NLTK library to extract the following information from the Producto

Tabla file (we used a slightly modified version of a code provided by Andrey Vykhodtsev

) 18

● Weight: In grams

● Pieces

● Brand Name: Extracted through a three letter acronym

● Short Name: Extracted from the Product Name field. We processed this information

using the NLTK library. We first removed the Spanish “stop words” and then used

the stemming in order to make sure only the cores of the names remained.

18 "Exploring products - Kaggle." 2016. 10 Sep. 2016 < https://www.kaggle.com/vykhand/grupo-bimbo-inventory-demand/exploring-products >

10

https://www.kaggle.com/vykhand/grupo-bimbo-inventory-demand/exploring-products

Image 6: Product Data Names after preprocessing

Modeling

We wanted to model text data and predict. Here are the steps that were taken:

1) Separate x and y of train data

2) Append test data to train data to align them to have same sparse product features

order (If they don’t have same column order, training gives false results).

3) Merge this data with products.

4) Use “count vectorizer” of Scikit-learn on brand and short_name columns to create

sparse count-word matrices and append them to train-test data horizontally.

5) Separate appended train and test data.

6) Train Xgboost with default parameters on train data and predict test data.

Technical Problems

1) Garbage Collection

Garbage collection was a big problem because of the size of the data. When we

stopped using a python object, we had to remove and force garbage collection

mechanism to free this memory. For this the the gc library was used. 19

2) Data Size

Before using xgboost, we had 70+ million records with 577 columns. Holding this

sparse data in memory with dataframe was impossible. We solved this issue with

sparse matrices of SciPy library.

In the example below, instead of holding all data including zeros in memory,

sparse method holds only data different than 0. There are many sparse matrices

methods. We used “CSR” and “COO” ones . 20

19 "28.12. gc — Garbage Collector interface — Python 2.7.12 ..." 2014. 21 Sep. 2016 < https://docs.python.org/2/library/gc.html > 20 "Sparse matrices (scipy.sparse) — SciPy v0.18.1 Reference Guide." 2008. 21 Sep. 2016 < http://docs.scipy.org/doc/scipy/reference/sparse.html >

11

https://docs.python.org/2/library/gc.html

http://docs.scipy.org/doc/scipy/reference/sparse.html

Image 7: Visual explanation of how the COO sparse matrix works.

Score

The RMSLE scores obtained by using this method were as follows:

Validation Test 10 Test 11 (Private)

0.764 0.775 0.781

Conclusion

These scores are worse than the naive approach, so we started to think about a new

model.

Final Models

Digging Deeper in Data Exploration

1. Demanda - Dev - Venta Relationship

On data description page of contest, it’s said that Demanda = Venta-Dev except

some return situations.

When we query this equation, there are 615000 records which are exceptions as

shown below. It can mean that returns can be done after more than 1 week. We

flagged these products.

Image 8: Exceptional cases where the number of returns is higher than the number

of orders (lagging returns).

Secondly, we query Demanda = 0 & Dev = 0 and there are 199767 records which

includes only returns. When we find demand mean of a product, these records can

falsify our results as they only include return values.

12

Image 9: Exceptional cases where the number of orders and demand are both zero.

2. Train - Test Difference

We analyzed the missing products, clients, agencies, routes and product-client

tuples which exist in train but not in test and vice versa or in specific files

Producto

Train - Producto File Test - Producto File Test - Train

0 0 34

This table means;

● There are no products existing in train but not in producto file

● There are no products existing in test but not in producto file

● There are 34 products existing in test but not in train (probably new

products)

Similarly, the other missing items are listed below:

Cliente

Train - Cliente File Test - Cliente File Test - Train

0 0 9663

Agencia

Train - Agencia File Test - Agencia File Test - Train

0 0 0

Route

Train - Test Test - Train

17 1012

Producto Client Tuple

Test - Train

1280296

The important outcome of this analysis was that: we should build a general model that can handle new products, clients and routes which don’t exist in train data but in test data.

13

Feature Engineering

In order to provide our models with more information, we had to perform some feature

engineering.

Agencia

Agencia file shows each agency’s town id and state name. We can merge this file with

train and test data on Agencia_ID column and encode state columns into integers.

Image 10: Agencia Table after processing.

Producto

We used features from NLTK model, weights and pieces. In addition to them, we included

short names of product and brand id.

In the picture below, we can see same product with different weights and different ids.

We take the short name of this products (we will add a feature like they are the same

product) and include them to the features. Later we will see why.

Product file

2025 , Pan Blanco 460g BIM 2025

2027 , Pan Blanco 567g WON 2027

Client Features

The Client Features were more difficult to engineer. Unlike the product table, the

client table had a large number of duplicates, where clients names were misspelled in

different cases. We removed the duplicates from the client table and then used a code

snippet provided by AbderRahman Sobh (the process made use of using TF-IDF scoring of 21

the client names and then manual selection of certain keywords) in order to classify

the clients based on their types, resulting in the following categorization:

● Individual 353145

● NO IDENTIFICADO 281670

● Small Franchise 160501

● General Market/Mart 66416

● Eatery 30419

● Supermarket 16019

● Oxxo Store 9313

● Hospital/Pharmacy 5798

21 "Classifying Client Type using Client Names - Kaggle." 2016. 10 Sep. 2016 < https://www.kaggle.com/abbysobh/grupo-bimbo-inventory-demand/classifying-client-type-using-client-names >

14

https://www.kaggle.com/abbysobh/grupo-bimbo-inventory-demand/classifying-client-type-using-client-names

https://www.kaggle.com/abbysobh/grupo-bimbo-inventory-demand/classifying-client-type-using-client-names

● School 5705

● Post 2667

● Hotel 1127

● Fresh Market 1069

● Govt Store 959

● Bimbo Store 320

● Walmart 220

● Consignment 14

Demand Features

This was the most critical part of our data structure. We generated 4 new columns for

our training and testing data and named them Lag0, Lag1, Lag2 and Lag3. We asked

ourselves why we hadn’t added the product’s ex demands.

Lag0 is a special case that attempts to find the average demand for a specific row.

This is done by attempting to find the average based on a large number of variables (as

specific as possible) and failing that, attempting to find the average of a fewer

number of variables (a more relaxed, less accurate and more general average).

For example,

● Average demand based on:

"Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID", "Canal_ID"

● If this combination is not found, attempt to find average based on:

"Producto_ID", "Cliente_ID", "Ruta_SAK", "Agencia_ID"

● If this combination is not found, attempt to find average based on:

"Producto_ID", "Cliente_ID", "Ruta_SAK"

● And so on and so forth.

This was done in the order of finding various averages based on product id first, then

falling back on averages based on the short names of products , failing that, falling back on averages based on the brand names .

Lag 1 through 3 were constructed in a similar fashion but were more strict and

considered only a single week’s data. In these cases, we did not want to create any

information based on Brand names as it would be too general. Only combinations with

product id and product short name were used. So for a line of training data that

pertained to week 7, Lag 1 would be the averages of that product id or product name

based on week 6 data; Lag 2 would be averages of week 5 data; and so on and so forth.

General Totals

After obtaining the above averages, we also included the following:

● Total Venta per client (giro of client)

● Total Venta_uni_hoy per client (total unit product sold by a client)

● Division of sum of venta_hoy to venta_uni_hoy (giving the approximate price per

unit).

● Division of sum of demand to sum of Venta uni (giving the ratio of goods actually

sold by the client, i.e. ability to sell inventory)

This was done for product short names and also product ids; resulting in an additional

12 more columns for our training data. Other added columns are shown below:

● Client per town

● Sum of returns of product

15

● Sum of returns of short name of a product

After eliminating highly correlated features (%90), the training data table was as

follows:

Int64Index: 74180464 entries, 0 to 74180463 Data columns (total 36 columns): Semana uint8 Agencia_ID uint16 Canal_ID uint8 Ruta_SAK uint16 Cliente_ID uint32 Producto_ID uint16 Venta_uni_hoy uint16 Venta_hoy �oat32 Dev_uni_proxima uint32 Dev_proxima �oat32 Demanda_uni_equil �oat64 Town_ID uint16 State_ID uint8 weight uint16 pieces uint8 Prod_name_ID uint16 Brand_ID uint8 Demanda_uni_equil_original �oat64 DemandaNotEqualTheDifferenceOfVentaUniAndDev bool Lag0 �oat64 Lag1 �oat64 Lag2 �oat64 Lag3 �oat64 weightppieces uint16 Client_Sum_Venta_hoy �oat32 Client_Sum_Venta_uni_hoy �oat32 Client_Sum_venta_div_venta_uni �oat32 prod_name_sum_Venta_hoy �oat32 prod_name_sum_Venta_uni_hoy �oat32 prod_name_sum_venta_div_venta_uni �oat32 Producto_sum_Venta_hoy �oat32 Producto_sum_Venta_uni_hoy �oat32 Producto_sum_venta_div_venta_uni �oat32 Producto_ID_sum_demanda_divide_sum_venta_uni �oat64 Prod_name_ID_sum_demanda_divide_sum_venta_uni �oat64 Cliente_ID_sum_demanda_divide_sum_venta_uni �oat64 memory usage: 10.6 GB

16

Validation Technique

Validation is the maybe the most critical part of a data science project. Top priority

was to not overfitting the data. We used different models to predict week 10 and week

11.

Image 11: Structure of training, validation and test mechanism.

We used 6th and 7th week data as training. Our validation for week 10 was 8th, our

validation for week 11 was 9th. In the latter one, we didn’t use Lag1 variable, because

it means that in order to predict week 11, we should use week 10’s demand (Lag1 of week

11 is week 10) which doesn’t exist. Or, we should predict week 10 first and with this

predicted demands, we predict week 11 but it carries error from week 10 to week 11.

After feature extraction phase and adding features to each record, we deleted first 3

weeks. Because they don’t have Lag1, Lag2 and Lag3 features.

Xgboost

Xgboost can be given 2 different datasets (train and validation). With playing with

parameters, we can make it train until the validation score stops increasing after “N”

iterations. It automatically stops and tells you the best iteration number and its

score.

Xgboost can also give feature importances according to the counts of features on trees

of model. For example:

Image 12: Feature importance graph of fitted training data based on XGboost

17

Training

We started to make models after defining validation strategy and feature extraction.

Features Validation 1 (Week 8)

Validation 2 (Week 9)

Trial 1 0.476226 0.498475

Trial 2: Removing highly correlated features 0.477067 0.493038

Trial 3: Adding lag interactions. 0.502514 N/A

Trial 4: Adding more lag interactions 0.51825 N/A

Trial 5: Lag interactions but removing more

correlated features

0.517606 N/A

Trial 6: Replacing extreme values with NAN 0.517467 0.517375

Trial 7: Removing low importance features (all

lag interactions are removed)

0.480394 0.494104

Trial 8: Adding Client Types 0.48101 0.494804

Many other variations were tried but abandoned due to poor performance.

Interestingly, the original data set (with engineered features such as averages, lags

etc.) resulted in the best performance. There is a caveat however, these attempts were

all made with a fixed setting in XGBoost, as will be seen next, the number of trees may

have been set too low in these trials to take into account the benefits of added

features such as interactions between lags or clients types etc.

Hyperparameter Tuning

After selecting the data set, we proceeded with hyperparameter tuning of the XGBoost

model.

The XGBoost library has numerous parameters, the ones that were used for tuning were:

Max Depth:

The maximum depth of the decision trees.

● Values tried: 10, 12, 8, 6, 14, 18, 20, 22

● Optimal Value: 22

Subsample:

The subsampling rate of rows of the data.

● Values tried: 1, 0.9, 0.8, 0.6

● Optimal Value: 0.9

ColSample By Tree:

The subsampling rate of columns of the data.

● Values tried: 0.4, 0.3, 0.5, 0.6, 0.8, 1

18


Learning Rate:

The gradient descent optimization parameter (the size of each step).

● Values tried: 0.1, 0.05


Features Validation 1 (Week 8)

Validation 2 (Week 9)

Original Training 0.476226 0.498475

Training after Parameter Tuning 0.469628 0.489799

Technical Problems

Storing Data

“CSV” file type is very slow to load and save. In addition to that, it isn’t

self-describing. When we try to load data from it, we have to do all conversions as we

did before saving it. We searched for a better file format to store that much data.

Firstly, we tried “pickle” library which we used for storing xgboost models because of

self-describing feature. But after file size gets bigger, it starts to give error.

Secondly, we tried “HDF5” which is designed for storing big data on file. It was both

very fast to load and save and also self-describing. We picked this one.

RAM Problem

Due to the size of the training and test tables, it was not possible to perform the

operations using our underpowered laptops. Attempting to join large tables or use

XGBoost to create models always resulted in memory errors. We solved this issue by

migrating our environment to Google Cloud Compute. We used linux command line prompts

to install Anaconda and related libraries and then launched Jupyter notebook to create

a development environment. At its highest level, our instance (with 32 virtual CPUs and

208GB RAM) was performing at 100% CPU load and 40% RAM usage. Training and predicting

over our full train and test data took more than 2 hours.

Code Reuse and Automatization

There were lots of coding challenges for us as follows:

● Opening csv files with predefined data types and names

● Handling hdf5 files

● Adding configurable features (Lag0, Lag1, …) to data

● Automatically deleting first “N” lagged weeks from train data

● Appending test to train data

● Separating test and train data automatically

● Xgboost configurable hyperparameter tuning

● Handling memory issues

We solved this issues with Object Oriented Programming with Python. This is the

structure of our general class. class FeatureEngineering :

def __init__ ( self , ValidationStart, ValidationEnd, trainHdfPath, trainHdfFile,

19

testHdfPath1, testHdfPath2, testHdfFile, testTypes, trainTypes, trainCsvPath, testCsvPath,

maxLag =0 )

def __printDataFrameBasics__ (data)

def ReadHdf ( self , trainOrTestOrBoth)

def ReadCsv ( self , trainOrTestOrBoth)

def ConvertCsvToHdf (csvPath, HdfPath, HdfName, ColumnTypeDict)

def Preprocess ( self , trainOrTestOrBoth, columnFunctionTypeList)

def SaveDataFrameToHdf ( self ,trainOrTestOrBoth)

def AddCon�gurableFeaturesToTrain ( self , con�g)

def DeleteLaggedWeeksFromTrain ( self )

def ReadFirstNRowsOfACsv ( self , nrows, trainOrTestOrBoth)

def AppendTestToTrain ( self ,deleteTest = True )

def SplitTrainToTestUsingValidationStart ( self )

We can use this class by giving configurable parameters. parameterDict = { "ValidationStart" : 8 , "ValidationEnd" : 9 , "maxLag" : 3 ,

"trainHdfPath" : '../../input/train_wz.h5' , "trainHdfFile" : "train" ,

"testHdfPath1" : "../../input/test1_wz.h5" , "testHdfPath2" : "../../input/test2_wz.h5" ,

"testHdfFile" : "test" ,

"trainTypes" : { 'Semana' :np . uint8, 'Demanda_uni_equil' : np . uint32}, "testTypes" :

{ 'id' :np . uint32, 'Semana' :np . uint8, 'Agencia_ID' :np . uint16},

"trainCsvPath" : '../../input/train.csv' , "testCsvPath" : '../../input/test.csv' }

FE = FeatureEngineering( ** parameterDict)

To add complex lagged feature, we build an automation system which works with a config

variable. con�gLag0Target1DeleteColumnsFalse = Con�gElements( 0 ,[ ( "SPClRACh0_mean" ,

[ "Producto_ID, "Cliente_ID, "Ruta_SAK, "Agencia_ID, "Canal_ID], [ "mean" ]),

( "SPClRA0_mean" ,

[ "Producto_ID" , "Cliente_ID" , "Ruta_SAK" , "Agencia_ID" ], [ "mean" ]),

( "SB0_mean" ,[ "Brand_ID" ], [ "mean" ])], "Lag0" , True )

FE . AddCon�gurableFeaturesToTrain(con�gLag0Target1)

To do hyperparameter tuning automatically, we wrote a python function. defaultParams = { "max_depth" : 10 , "subsample" : 1. , "colsample_bytree" : 0.4 , "missing" :np . nan,

"n_estimators" : 500 , "learning_rate" : 0.1 }

testParams = [( "max_depth" ,[ 12 , 8 , 6 , 14,16,18,20,22 ]), ( "subsample" ,[ 0.9 , 0.8 , 0.6 ]),

( "colsample_bytree" ,[ 0.3 , 0.5 , 0.6 , 0.8 , 1 ]), ( "learning_rate" ,[ 0.05 ])]

�tParams = { "verbose" : 2 , "early_stopping_rounds" : 10 }

GiveBestParameterWithoutCV(defaultParams, testParams, X_train, X_test, y_train, y_test,

�tParams )

20

Results

Over the course of the contest, we can name 4 mile stone submissions. The validation

and public and private scores of these submissions are shown below.

Model Validation 1 Validation 2 Public Score Private Score

Naive

(averages)

0.736 0.734 0.754

Optimized with

Product Data

via NLTK

0.764 0.775 0.781

XGBoost with

default

parameters

0.476226 0.498475 0.46949 0.49596

XGBoost with

parameter

tuning

0.469628 0.489799 0.46257 0.48666

We can see from the results that we didn’t overfit data at any point.

For our final submission of predictions; we achieved a score of 0.48666; placing our

team in the top 17% of the 2000 contestants.

Looking over the scores of other participants, we’d like to say that for first time

participants of a Kaggle contest, our results were very promising.

Conclusion

We are extremely happy that we picked a Kaggle contest for our project. It allowed us

to work on a common real world problem while giving us a benchmark of our abilities in

the global arena.

We learned to leverage powerful cloud computing capabilities across various platforms

while also learning how to manipulate large data sets with memory and computing power

constraints; while also learning how to use XGBoost library for training and testing

purposes.

We also learned how to use important tools like command prompts to launch development

environments and Github for code sharing collaboration.

21

Critical Mistakes

Poor data exploration

We performed very little data exploration on our own. We mainly depended on the data

exploration that was done by other Kagglers. This resulted in sub-optimal solutions in

our training and testing as we did not exclude outliers etc.

Not preparing for system outages

We faced one outage while using the Google Cloud Preemptible Instance (possibly due to

high demand from other clients) which caused a key data file to become corrupted. The

re-creation of this data file cost us over 5 hours of work. In the future, it would be

preferable if the system was listening to “shut-down” signals that are sent by the

platforms and took necessary steps to prevent the corruption of this data.

Performing hyperparameter tuning too late

In our process we initially performed feature selection using a set of parameters for

XGBoost and then proceeded to hyperparameter tuning step. However, it became apparent

that some features were being given lower scores because our initial set of parameters

were not optimal for a high number of feature columns. Specifically, the depth of the

trees were set to 6 in our initial feature selection; when this was increased to 22, it

became apparent that the features that were originally dropped could have been good

predictors.

Further Exploration

If we had more time and resources, we would have liked to undertake additional actions.

Partial Fitting

When faced with the memory problem we decided to use cloud services. However, another

method would have been loading and processing the data in smaller batches. This would

be a more scalable model and could even be used to create a cluster of cloud machines

to perform operations in parallel.

Multiple Models

Although XGBoost is a very effective tool, it gives a single model (or in our case, 2

models one for each week). We would like to explore the possibility of creating a

larger number of models using different systems and seeing how they perform for

different slices of data. We would then take some sort of weighted average of these

predictions to reach our final prediction.

As an extension to this idea, we would also perform parameter tuning across these

various models to find optimal solutions for each one.

Neural Networks

We would also have like to approach this problem with a neural network solution to see

the accuracy of the predictions and also the performance of the neural network solution

vs the XGBoost tool.

22

DA 592 - Term Project Report - Berker Kozan Can Koklu

Documents

Transcript of DA 592 - Term Project Report - Berker Kozan Can Koklu