Behavioral Segmentation of Telecommunication Customers

EMILIA MATTILA

Master of Science Thesis Stockholm, Sweden 2008

Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2008
Supervisor at CSC was Stefan Arnborg
Examiner was Stefan Arnborg

TRITA-CSC-E 2008:075
ISRN-KTH/CSC/E--08/075--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se

Abstract

In today’s competitive environment, telecommunication operators are investing in understanding their customers better, especially their most profitable customer groups and the groups that have the biggest potential to become such. By segmenting customers based on their behavior, operators can better target their actions, such as launching tailored products and targeted one-to-one marketing, to meet customer expectations. However, data on customer behavior is often spread across several different sources, and analyzing such a large data set is exhausting and time consuming. With the help of data mining techniques, segmentation can be done automatically and based on actual customer behavior. In this thesis, K-means, self-organizing maps (SOM) and support vector clustering (SVC) were analyzed to find out which method would be the most suitable for behavioral customer segmentation. For the empirical study, SOM was applied to perform behavioral segmentation of telecommunication customers. The data was created by simulating wireless application protocol (WAP) data and hiding five pre-defined segments within the data for SOM to find. The results showed that, for each customer, average revenue, number of visits per service provider, download volume generated in conjunction with a visit and occasion of the visit are good attributes for behavioral segmentation. SOM is applicable to behavioral customer segmentation: it performs well with large data sets, and the segmentation produces meaningful results with the simulated data.

Acknowledgements

This study has been carried out at Aito Technologies with the purpose of discovering whether customer behavior data extracted from a telecommunication operator network can be used for customer segmentation. This thesis is the last milestone of my computer science studies, and the work was done for the Numerical Analysis and Computer Science department at the Royal Institute of Technology in Stockholm. I want to thank Anssi Tauriainen at Aito Technologies for this opportunity and my professor Stefan Arnborg for his help during the work. I also want to thank my co-worker Kim Green for his assistance in analyzing the findings during the empirical phase. Finally, I wish to thank Mikko for the support and encouragement throughout this process and my sister Sargit for the critical review of the written report.

Helsinki, May 30th 2008
Emilia Mattila

Table of contents

1 Introduction
  1.1 Problem
  1.2 Objectives
  1.3 Restrictions
2 Theory
  2.1 What is clustering?
  2.2 Clustering techniques
  2.3 Issues related to clustering
    2.3.1 Missing values and outliers
    2.3.2 Dimensionality
    2.3.3 Data understanding
  2.4 Methods for clustering analysis
    2.4.1 Self-organizing map
    2.4.2 K-means
    2.4.3 Support vector clustering
  2.5 Comparison of clustering techniques
    2.5.1 Scalability and performance
    2.5.2 High-dimensionality
    2.5.3 Outliers and missing values
    2.5.4 Input parameters
    2.5.5 Proximity
    2.5.6 Characteristics of a cluster
    2.5.7 Cluster summaries
3 Empirical study
  3.1 Choosing the method
  3.2 Research activities
  3.3 Data set
    3.3.1 Restrictions
    3.3.2 Simulation
  3.4 Preparation
    3.4.1 Aggregation
    3.4.2 Missing value replacement
    3.4.3 Attributes
    3.4.4 Categorization
    3.4.5 Transformations
  3.5 Applying SOM clustering
    3.5.1 Initial parameters
    3.5.2 Input vectors
    3.5.3 Training
    3.5.4 Evaluation of clusters
  3.6 Visualization
    3.6.1 Map
    3.6.2 Component planes
  3.7 Clusters and summaries
4 Results
  4.1 Attributes
  4.2 Segments
  4.3 Simulation
  4.4 Scalability
  4.5 Cluster quality
5 Summary
  5.1 Discussion
    5.1.1 Data preparations
    5.1.2 SOM segmentation
    5.1.3 Scalability
    5.1.4 End-user point of view
  5.2 Conclusions
  5.3 Future work
References

1 Introduction

Customer segmentation is the process of dividing a large set of heterogeneous customers into groups that have similar characteristics, behavior or needs (Kotler & Armstrong, 2005). By identifying and profiling the most important segments, companies discover how to reach their customers more effectively. By understanding customer behavior better, companies can provide customized services and products. As a consequence, segmentation has proven to be a valuable source of information for marketing, sales and business planning.

Traditionally, companies conduct segmentation based on market research and customer surveys. However, with the introduction of data mining methods, there are now ways to perform segmentation based on the actual behavior of the customers. Most commonly, segmentation is based on demographic factors or on customers’ views and beliefs. However, as customer demographics and opinions do not correlate well with actual behavior, there is strong support for behavioral segmentation (Saarenvirta, 1998). Behavioral segmentation enables customers to be segmented based on occasion and time, usage rate, benefits sought, user status and loyalty status (Kotler & Armstrong, 2005). Companies benefit from behavioral segmentation because they gain a better understanding of their customers’ actual behavior. Consequently, business and marketing personnel can make better decisions on marketing strategies, new market opportunities can be discovered and companies can differentiate themselves from their competitors.

Several researchers have recognized the importance of data mining techniques as a tool for customer segmentation. Consequently, segmentation has been applied in several different business areas such as tourism (Bloom, 2004), online shopping (Vellido, Lisboa, & Meehan, 1999), the drink industry (Huanga, Tzengb, & Ong, 2007) and banking (Hsieh, 2004). This thesis investigates which data mining techniques are applicable to behavioral segmentation in the telecommunication field.

1.1 Problem

Currently, telecommunication operators build segmentation models based on information combined from different sources such as billing data, call detail records (CDR) and customer surveys. The problem is, first of all, that the gathered information is too general and available only with a delay. Moreover, it does not provide information on actual usage, because operator customers tend to answer surveys differently from how they behave. Additionally, this kind of segmentation process is exhausting and time consuming and provides static results. With the help of data mining techniques, segmentation can be done in real time, automatically and based on actual customer transactions.

Typically in telecommunication, the transaction data provides information about customers, handsets and services. However, legislation sets strict rules on how customer-specific data can be used. The requirement is that segmentation cannot link the results to any specific customer.

The need for this thesis arose as Aito is investigating the possibility of developing a product that automatically identifies segments from network data. The main research question, with its two sub-questions, is formulated below:

What factors need to be considered when performing behavioral segmentation based on network transaction data?

• Which data mining method is most suitable?
• What are the most meaningful attributes for behavioral segmentation?

1.2 Objectives

The first objective of this thesis is to perform a comparative study of different segmentation methods. Based on previous research in the area of segmentation, a decision was made to compare two unsupervised learning algorithms and one supervised learning algorithm. In previous studies, K-means, self-organizing map and support vector clustering were identified as the most commonly used methods for customer segmentation. The first part of the thesis provides a critical analysis of these methods. The second part is an empirical study, which is performed to verify one of the clustering methods by conducting behavioral customer segmentation on telecommunication network transaction data.

1.3 Restrictions

As there is a wide range of clustering algorithms for different purposes, the discussion in this thesis is restricted to three algorithms that represent different approaches to segmentation. These methods are self-organizing maps (SOM), K-means and support vector clustering (SVC). Another restriction is that the data samples analyzed in this study consist solely of wireless application protocol (WAP) and customer demographic data. However, any findings made in this thesis are also applicable to other types of transaction data, such as mobile services. Finally, Aito was not able to obtain adequate amounts of real network data from a telecommunication operator, and therefore a decision was made to use artificial, simulated data instead.

2 Theory

A telecommunication network produces a large amount of data daily, containing information about network customers, services and handsets as well as their usage and quality. The amount of data is too large for a comprehensive manual analysis, and therefore valuable information might not be discovered. Fortunately, knowledge discovery techniques provide means for unveiling hidden patterns in the data. Data mining is an integral part of a framework called knowledge discovery in databases (KDD), which defines a process of transforming raw data into useful information. Data mining is a process of automatically discovering useful information, in the form of meaningful patterns and rules, from large data sets (Tan, Steinbach, & Kumar, 2006). The steps of the KDD framework are applied in this thesis in order to find out whether network operator customers can be grouped based on similarities in their behavior. In this chapter, clustering is introduced, different clustering methods are presented and finally their suitability for this problem domain is discussed in the form of benefits and pitfalls.

2.1 What is clustering?

Already at a very young age, people start to categorize objects based on their similarities. This ability to group things together helps people understand what different members of a population have in common and how they differ from the others. Data miners have applied this same approach in order to perform clustering. Hence, clustering is the task of dividing data into a number of subgroups which are homogeneous within the group and heterogeneous with respect to other groups (Berry & Linoff, 2004). Furthermore, clustering analysis refers to the process of automatically discovering clusters that lie hidden within the data (Tan, Steinbach, & Kumar, 2006). To clarify the naming conventions used in this thesis: segmentation is a term commonly used in marketing, whereas clustering is a much more widely used term that is not restricted to any specific business domain. In this thesis, however, these terms are used interchangeably.

2.2 Clustering techniques

In the field of data mining, several ways exist to cluster data. Two commonly applied clustering techniques are known as hierarchical and partition clustering. Characteristic of partition clustering is that the data set is divided into several clusters where each data sample can belong to only one cluster. To achieve this, partition clustering typically tries to minimize the distance within clusters and maximize it between clusters (Vesanto & Alhoniemi, Clustering of the self-organizing map, 2000). Hierarchical clustering, on the other hand, is about dividing data repeatedly into sub-groups. During hierarchical clustering, objects are ordered into a tree where each node represents a cluster. Hierarchical clustering can be performed either top-down or bottom-up (Vesanto, Using SOM in Data mining, 2000). The latter approach is better known as agglomerative clustering, where each data object is initially considered a cluster of its own and at each step is merged with the closest cluster. The former approach starts from a root node that represents the whole data set as one cluster and iteratively divides clusters into smaller ones (Vesanto, Using SOM in Data mining, 2000; Tan, Steinbach, & Kumar, 2006).

Other clustering techniques exist as well, such as mixture models and fuzzy clustering. Mixture models present clusters as distributions where each data sample belongs to a cluster based on a calculated probability distribution. The challenge for this type of clustering is how to mathematically locate the parameters that best describe the cluster. A considerable weakness of partition and hierarchical clustering compared to mixture models is that they fail to find overlapping groups (Arnborg, 2008). In fuzzy clustering, every data sample has a relationship to each cluster, described as a membership weight ranging from 0 to 1. Hence, fuzzy clustering differs from partition clustering in that it assigns each data sample to all clusters with a weight (Tan, Steinbach, & Kumar, 2006; Vesanto & Alhoniemi, Clustering of the self-organizing map, 2000). To simplify the analysis done in this thesis, each data sample is assumed to belong to one cluster at a time.

When considering which clustering technique to choose, the most restrictive aspect of hierarchical clustering is its quadratic time complexity O(n²), where n is the number of data samples, whereas partition clustering has linear time complexity. Therefore, especially when the data sets are large, partition clustering is considered superior to hierarchical clustering in computational performance. Partition clustering also has the advantage over hierarchical clustering of not depending on previously found clusters (Vesanto & Alhoniemi, Clustering of the self-organizing map, 2000). Hierarchical clustering is known to form better quality clusters, whereas partition clustering is more effective in handling noise (Jain, Murty, & Flynn, Data Clustering: A Review, 1999). The large data set imposes performance considerations, and therefore linear partition clustering is considered in this problem domain.
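The bottom-up (agglomerative) approach described above can be illustrated with a minimal sketch. It assumes Euclidean distance and single-link merging (the distance between two clusters is the distance between their closest members); neither choice is prescribed by this thesis, and the function names and sample values are hypothetical.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equally long tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_link(c1, c2):
    """Assumed cluster distance: distance of the closest pair of members."""
    return min(euclidean(p, q) for p in c1 for q in c2)


def agglomerative(points, n_clusters):
    # Bottom-up: every data sample starts as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Search all cluster pairs for the closest one. This pairwise search
        # is why the approach is at least quadratic in the number of samples.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]                  # safe because i < j
    return clusters


# Hypothetical two-attribute samples.
samples = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 4.9), (9.0, 0.0)]
result = agglomerative(samples, 3)
```

Stopping at a fixed number of clusters is a simplification; a full hierarchical clustering would keep merging until one root cluster remains and record the tree along the way.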

2.3 Issues related to clustering

Several factors should be taken into account when following the data mining process and performing clustering. These are data quality (measurement errors, noise, missing values, outliers, duplicate data), pre-processing (aggregation, sampling and dimensionality reduction) and data understanding (summary statistics and visualization). The following issues are considered the most relevant for the purposes of this thesis.

2.3.1 Missing values and outliers

Once data is gathered from the operator network logs, it is refined and aggregated to a higher level before being inserted into the database. During this process the data set can end up containing missing or distorted values. How missing values are handled should be considered carefully, as most clustering algorithms are affected by them. The available options are removing the data samples that contain missing values, estimating or sampling a replacement value, and ignoring the missing values (Tan, Steinbach, & Kumar, 2006). Removal is sometimes rather harsh, as it can substantially reduce the number of data samples. Instead, the algorithm can be tuned to ignore the missing values, so that comparisons between data samples use only the attributes that are present. This differs from removal in the sense that the other attributes of a data sample can still provide valuable information for the analysis; if the sample were removed completely, this information would be lost. Another good solution is to fill in the missing value with an estimate.

Outliers, on the other hand, are values that are in some sense different from the majority of values. For some business fields these might be of interest and worth analyzing further. However, in many cases it makes sense to discard data samples with known outliers before clustering so that they do not cause problems for the analysis.
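The three options for missing values can be compared on a small example. The sketch below is an illustration only: it assumes missing values are encoded as `None`, uses the column mean as the estimate, and the attribute values and function names are hypothetical.

```python
import math

# Hypothetical samples: [ARPU, visits, download volume]; None marks a missing value.
rows = [
    [30.0, 12.0, 4.5],
    [25.0, None, 3.0],
    [None, 8.0, 5.5],
    [40.0, 10.0, None],
]


def drop_incomplete(data):
    """Option 1: remove every sample that has at least one missing value."""
    return [r for r in data if None not in r]


def impute_mean(data):
    """Option 2: replace a missing value with the mean of the known values."""
    means = []
    for c in range(len(data[0])):
        known = [r[c] for r in data if r[c] is not None]
        means.append(sum(known) / len(known))
    return [[means[c] if v is None else v for c, v in enumerate(r)] for r in data]


def distance_ignoring_missing(a, b):
    """Option 3: compare two samples only on attributes present in both."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # nothing to compare on
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


complete_only = drop_incomplete(rows)
imputed = impute_mean(rows)
partial_distance = distance_ignoring_missing(rows[1], rows[2])
```

In this example removal keeps only one of the four samples, which shows why it can be harsh. Note that the third option compares samples over different numbers of attributes, so distances are not directly comparable unless they are rescaled.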

2.3.2 Dimensionality

In order to describe customer behavior, several attributes can be identified from the network transactions. Operator customers can be characterized by how much revenue they bring the operator per month (average revenue per user, ARPU), what kinds of services they use, and with what kind of handset. Additionally, usage-based factors, such as how much customers use different services, on which occasions and how much they tend to download, can be analyzed as well. Data that contains this many factors to analyze is referred to as high-dimensional. Characteristic of high-dimensional data is that it is hard to visualize and understand without reducing it to a lower dimension. Additionally, widely spread data weakens the quality of the resulting clusters (Tan, Steinbach, & Kumar, 2006). Therefore, dimensionality reduction has a positive impact on clustering algorithms, as they tend to perform better, require less memory and provide better clusters.

2.3.3 Data understanding

An important part of the knowledge discovery process is to understand the data and its characteristics (Tan, Steinbach, & Kumar, 2006). For this purpose, visual and statistical methods are used. Visual presentation is the most efficient way for a user to quickly gain an understanding of large data sets. Statistical methods, on the other hand, provide a more in-depth mathematical analysis of the results. Statistical methods include histograms, mean and standard deviation calculations, occurrence counts (i.e. frequencies) for categorical attributes and, for high-dimensional numeric data, covariance and correlation matrices.
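As a rough illustration of these summary statistics, the sketch below computes a mean, a standard deviation, frequency counts for a categorical attribute, and the covariance and correlation of two numeric attributes. The attribute names and values are hypothetical and are not taken from the thesis data.

```python
import statistics
from collections import Counter

# Hypothetical per-customer attributes.
arpu = [12.0, 30.0, 18.0, 45.0, 25.0]
downloads_mb = [1.0, 4.0, 2.0, 6.5, 3.5]
occasion = ["evening", "morning", "evening", "night", "evening"]

mean_arpu = statistics.mean(arpu)
std_arpu = statistics.stdev(arpu)      # sample standard deviation
frequency = Counter(occasion)          # occurrence counts per category


def covariance(xs, ys):
    """Sample covariance of two equally long numeric lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


corr_arpu_downloads = correlation(arpu, downloads_mb)
```

For more than two attributes, the same `correlation` function applied to every pair of attributes gives the entries of the correlation matrix.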

2.4 Methods for clustering analysis

Several algorithms exist for clustering purposes (Jain & Dubes, Algorithms for Clustering Data, 1988). Some of them are based on supervised learning, others on unsupervised learning. The difference is that supervised learning uses one or more training data sets that have been clustered manually (or otherwise), for example by labeling, in order to assign new data set members to clusters. Unsupervised learning, on the other hand, finds the underlying patterns in the data set autonomously and proposes these as clusters. Self-organizing map and K-means are based on unsupervised learning, whereas support vector clustering is a supervised learning process.

2.4.1 Self-organizing map

The self-organizing map (SOM) (Kohonen, Self-organizing map, 1995) is a neural network algorithm based on competitive unsupervised learning. A review of the relevant literature (Kiang, Hu, & Fisher, 2006; Vellido, Lisboa, & Meehan, 1999; Hsieh, 2004; Merkevičius, Garšva, & Simutis, 2004) shows that SOM has been widely used in the area of segmentation. Several variants of SOM can be found, but this thesis concentrates on the basic SOM.

Neural networks, like SOM, consist of a set of nodes. The job of these nodes is to take inputs and combine them into a single value, which is then transformed to produce an output. Typically, each input has its own weight and the node forms a weighted sum of these inputs, i.e. each input is multiplied by its own weight and all these values are added together. For neural networks the weights are important, as they are the parameters used for solving the clustering problem.

A basic SOM has an input layer and an output layer. The data samples are fed to the input layer, which does not do any work but only acts as a connection for the input attributes. As figure 1 shows, each attribute of a data sample (e.g. average revenue per month, visit count per service and occasion of visits per service) is connected to one node on the input layer. Each node on the input layer, in turn, is connected to all of the nodes on the output layer. The output layer resembles a chess board; however, the nodes on the output layer are not connected with each other. The output layer is also known as the feature map, as it describes the common attributes, i.e. features, of the underlying input data.

Figure 1. SOM consists of an input and an output layer. Each of the input nodes is connected to one attribute of a data sample, whereas each output node is connected to all of the input nodes. Image from (Fröhlich, 1997) with a small modification.

Given this structure and these components, how are the clusters formed? Imagine a bees’ nest found by boys who start throwing small stones at the nest. With each hit a small dent is formed on the surface of the nest. If several stones hit the same spot, a hole is formed. With SOM, the bees’ nest would be the output layer and the holes would be the clusters.

To describe in more detail how the clusters are found with SOM, training is a central concept. Figure 2 presents how the clusters are formed during the training. In SOM clustering, the data samples undergo several rounds of training. During each round, one could say that the output nodes compete against each other for the right to own the input data samples. The winning node is the one that best describes the input data sample. As a prize, the winning node gets to update the weights of its input edges. The more often a node wins, the more clearly it starts to form patterns and finally clusters.

SOM does not just form clusters; it forms them so that they have relationships. In practice this means that more similar clusters are closer to each other and the underlying structure of the data is preserved. SOM achieves this by updating not only the weights of the winning node but also those of its neighboring nodes. The further away the neighboring nodes are from the winning node, the less they are adjusted. Even though the output nodes are not directly connected, the map still changes its structure as the nearby nodes are updated. This is known as the neighborhood concept, which places similar clusters closer to each other and further away from the dissimilar ones.

Figure 2. SOM finds the hidden clusters in the data by adjusting the weights of the winning node and its neighboring nodes. (a) Initial output layer presented as a rectangular map. (b) Training of the map starts moving the winning nodes closer to the data samples while preserving the global structure of the map. (c) Final stage of the training, where the map does not change anymore. The clusters can be identified after training ends. The images are snapshots from an animation (Peltarion, 2007).
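The competition and neighborhood update can be sketched in a few lines. This is a minimal illustration and not the implementation used in the empirical study: the Gaussian neighborhood function, the linearly decaying learning rate and radius, and all names and parameter values (`train_som`, `lr0`, `sigma0`, the 3 x 3 grid) are assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """The winning node: the output node whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0  # assumed initial neighborhood radius
    # One weight vector per output node, initialized randomly in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # samples in random order
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # radius shrinks
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Distance on the map grid, not in the data space.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # Nodes further from the winner are adjusted less.
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Two hypothetical groups of normalized two-attribute samples.
group_a = [(0.10, 0.10), (0.15, 0.10), (0.10, 0.20), (0.05, 0.15)]
group_b = [(0.90, 0.90), (0.85, 0.95), (0.90, 0.80), (0.95, 0.85)]
som = train_som(group_a + group_b, rows=3, cols=3)
bmu_a = best_matching_unit(som, (0.10, 0.10))
bmu_b = best_matching_unit(som, (0.90, 0.90))
```

After training, samples from the two groups are won by different output nodes, and the grid coordinates of those nodes indicate how similar the groups are.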

2.4.2 K-means

K-means is one of the most commonly used clustering techniques. The K refers to the number of clusters that the method will find. The clusters are found based on the proximity of data samples to each other (Berry & Linoff, 2004). K-means searches for the cluster center points by calculating the mean of each group of data samples. It builds the clusters in three phases, which are described in figure 3. During the first phase, the K initial cluster centers are generated randomly. In the second phase, the distance between each data sample and each cluster center is computed, and each data sample is assigned to its closest center. The third and last phase recalculates each cluster center as the average of all the data samples in its group, which gives the new centers. The process alternates between the second and third phases and ends once the cluster centers no longer change. The purpose of these phases is to minimize the sum of squared error; the smaller the error, the better the quality of the clusters (Tan, Steinbach, & Kumar, 2006).

Figure 3. Three iterative steps of K-means for finding three clusters. (a) Cluster centers are initialized by random selection. (b) Each point is assigned to its nearest center. (c) Cluster centers are updated.
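The three phases translate almost directly into code. The following is a minimal sketch under stated assumptions: the initial centers are drawn at random from the data samples themselves (one common way to realize the random initialization), squared Euclidean distance is used as the proximity measure, and the names `k_means`, `max_iter` and `seed` are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(data, centers):
    """Phase 2: give each sample the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in data
    ]


def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Phase 1: pick K initial centers at random (here: from the samples).
    centers = [list(p) for p in rng.sample(data, k)]
    for _ in range(max_iter):
        labels = assign(data, centers)
        # Phase 3: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(data, labels) if label == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # centers no longer change: converged
            break
        centers = new_centers
    labels = assign(data, centers)
    # Sum of squared error: the quantity K-means tries to minimize.
    sse = sum(squared_distance(p, centers[label]) for p, label in zip(data, labels))
    return centers, labels, sse


# Two hypothetical, well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels, sse = k_means(data, k=2)
```

Because the result depends on the random initial centers, K-means is often run several times with different seeds and the run with the smallest sum of squared error is kept.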

2.4.3 Support vector clustering

A relatively new but promising method for segmentation is support vector clustering (SVC) (Ben-Hur, Horn, Siegelmann, & Vapnik, A support vector clustering method, 2000; Ben-Hur, Horn, Siegelmann, & Vapnik, Support vector clustering, 2001; Huanga, Tzengb, & Ong, 2007; Finley & Joachims, 2005). SVC is based on the theory of support vector machines (SVM) (Cortes & Vapnik, 1995), which is actually a classification technique. Support vector machines are able to find patterns in a complex set of data samples. Typically, this is done by mapping the input data samples into a high-dimensional feature space where the samples are easier to separate from each other. SVC applies the same approach.

How does SVC recognize patterns? Imagine that you are looking at the night sky on a clear evening. Let’s say that you could select a set of stars by drawing a circle in the air with your finger. The only rule for drawing the circle is that it must have the smallest possible radius that still encloses all the desired stars. Next, let’s say you could take this transparent sphere of stars down from the sky, place it on the ground and stomp all over it. The sphere is now flattened, and since it is transparent you can see that the stars have formed groups. With SVC, these star groups are the identifiable clusters. Figure 4 shows the sphere and the clusters.

Hence, the first phase of SVC consists of mapping the input data samples into a high-dimensional feature space and computing the minimal radius of a sphere that encloses all the data samples. This is like drawing the circle around the desired set of stars. Just as some of the stars reside on the surface of the transparent sphere, some of the data samples also reside on the surface of the sphere in SVC. These are known as the support vectors. The second phase transforms the sphere back to the input data space. After this phase, the flattened transparent sphere shows groups of stars that actually resemble a map with valleys and mountain peaks. With SVC, this corresponds to the support vectors forming areas of different shapes. Hence, as the support vectors are the points on the surface of the sphere, they are actually responsible for forming the cluster boundaries. The data samples inside the sphere end up within these boundaries.

Figure 4. The two phases of SVC are the transformation to feature space and back to data space. (a) The minimal enclosing sphere in the high-dimensional feature space. The points on the boundary are the support vectors. (b) The resulting three clusters. Images from (Ben-Hur, Horn, Siegelmann, & Vapnik, Support vector clustering, 2001).

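In SVC, two parameters determine the shape of the clusters: the scale parameter q of the Gaussian kernel and the soft margin constant C. Increasing q results in a greater number of support vectors and therefore also in an increased number of clusters. Adjusting q, often several times, in order to reach the desired number of clusters is a rather tedious process. Parameter C is used for handling outliers and overlapping clusters. Clusters are found using the Lagrangian function, which in the formulation of (Ben-Hur, Horn, Siegelmann, & Vapnik, Support vector clustering, 2001) reads:

$$L = R^2 - \sum_j \left(R^2 + \xi_j - \lVert \Phi(x_j) - a \rVert^2\right)\beta_j - \sum_j \xi_j \mu_j + C \sum_j \xi_j$$

The Lagrangian function can be written in Wolfe dual form:

$$W = \sum_j K(x_j, x_j)\,\beta_j - \sum_{i,j} \beta_i \beta_j K(x_i, x_j), \qquad 0 \le \beta_j \le C$$

where R is the radius and a the center of the enclosing sphere, $\Phi$ is the mapping to feature space, $\xi_j$ are slack variables, $\beta_j$ and $\mu_j$ are Lagrange multipliers and $K(x_i, x_j) = e^{-q \lVert x_i - x_j \rVert^2}$ is the Gaussian kernel.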

2.5 Comparison of clustering techniques

To make an informed decision on which clustering algorithm to choose, several factors need to be considered, some of which have already been presented in previous sections. This chapter answers questions related to determining the most suitable clustering method: is the method applicable to large data sets, can the dimensionality of the data be reduced, are there restrictions on the structure or type of the clusters, and how are the cluster summaries presented?

2.5.1 Scalability and performance

Telecommunication operators have millions of users producing network transactions on a daily basis. Handling a data set this large imposes scalability and performance requirements on the clustering algorithm. Both K-means and SOM have linear complexity, O(ndk) and O(nd) respectively, where n is the number of data samples, d is the number of dimensions and k the number of clusters. Hence, both methods scale to large data sets and have good performance. SOM has an additional benefit over K-means, namely that it can be applied as a two-level clustering. This means that first the winning nodes are formed from the initial data set and next the winning nodes are clustered. As a consequence, this approach reduces the initial large data set by choosing only the winning nodes for the cluster analysis.

The third method, SVC, has a computational complexity that depends on the number of support vectors rather than on the number of data samples. The main stumbling block for SVC with large data sets is locating the Lagrange multipliers that are used for finding the boundary values of the clusters. However, ways to overcome this issue have been proposed, such as the Sequential Minimal Optimization (SMO) algorithm (Platt, 1998). Consequently, all three algorithms are able to scale to large data sets and performance should not be an issue.
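To illustrate where the O(ndk) cost of K-means comes from, the following minimal Python sketch shows one possible implementation of the basic algorithm. The function names and the fixed number of iterations are assumptions made for illustration only; they are not taken from the thesis. The three nested levels in the assignment step (samples, centroids, dimensions) correspond to the factors n, k and d.

```python
import math

def assign_step(samples, centroids):
    """Assign every sample to its nearest centroid (n * k * d operations)."""
    labels = []
    for x in samples:                          # n samples
        best, best_dist = 0, math.inf
        for j, c in enumerate(centroids):      # k centroids
            # squared Euclidean distance over d dimensions
            dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            if dist < best_dist:
                best, best_dist = j, dist
        labels.append(best)
    return labels

def update_step(samples, labels, centroids):
    """Move each centroid to the mean of the samples assigned to it."""
    updated = []
    for j, c in enumerate(centroids):
        members = [x for x, label in zip(samples, labels) if label == j]
        if members:
            updated.append([sum(col) / float(len(members)) for col in zip(*members)])
        else:
            updated.append(list(c))            # an empty cluster keeps its centroid
    return updated

def kmeans(samples, centroids, iterations=10):
    """Run a fixed number of assignment/update rounds (hypothetical stopping rule)."""
    for _ in range(iterations):
        labels = assign_step(samples, centroids)
        centroids = update_step(samples, labels, centroids)
    return assign_step(samples, centroids), centroids
```

Note that the number of clusters is fixed by the number of initial centroids, which is exactly the input parameter K discussed for K-means.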

2.5.2 High-dimensionality

When segmentation is performed with several attributes, the data is often hard to visualize and understand. For the behavioral segmentation, several attributes were identified for clustering, such as the average revenue per customer, the visit count per occasion and the download volume per occasion. This high dimensionality affects the performance of the algorithms and the quality of the clusters: the fewer the dimensions, the faster the execution and the more compact the distribution of data samples. One of the strong points of SOM is its ability to represent high-dimensional data on a low-dimensional (often two-dimensional) map. The K-means algorithm, on the other hand, handles the high dimensionality of the data during its iterative steps. Finally, the SVC algorithm itself does not take care of dimensionality reduction and therefore requires the help of other methods such as singular value decomposition (SVD) or principal component analysis (PCA). More on these methods can be found in (Tan, Steinbach, & Kumar, 2006), appendix B.

2.5.3 Outliers and missing values

The real world customer transaction data can contain outliers and missing values. These are often problematic for the clustering as they distort the results and in this way weaken the quality of the clusters.

SOM visualizes the data set as a two-dimensional map, which is a useful aid when detecting outliers. Outliers can be spotted on the map as they usually have the longest distance to all other clusters. When finding such a group, the data miner can either remove the outliers directly or investigate further whether the group has some interesting characteristics. In the case of missing values, SOM can simply be tuned to leave out missing values from training, provided that the amount of missing values is not too large compared to the size of the data set.

Unlike SOM, K-means does not incorporate any functionality for handling outliers. Instead, outliers affect the clustering as they are treated equally with all other samples. This leads to poorer clustering results, depending on the amount of outliers in the data set. Neither does K-means take care of missing values, so often the only approach is to discard such values manually before applying the clustering algorithm.

SVC, on the other hand, handles outliers and missing values automatically by using the soft margin constant C, which leaves these data points outside the enclosing sphere. Consequently, if the algorithm is to be trusted to handle outliers and missing values automatically, SOM and SVC have ways to do this whereas K-means does not. With K-means, data samples with missing values or outliers have to be removed manually.

2.5.4 Input parameters

Some of the clustering methods require user-defined input parameters. Attention should be given when choosing these parameters: the algorithms always produce clusters, but the clusters might not be valid or meaningful if the parameters are badly chosen.

When self-organizing maps are used for clustering, four input parameters can be defined: map size, training length, learning rate and neighborhood radius. Some instructions on how to choose these values are provided by (Vesanto & Alhoniemi, Clustering of the self-organizing map, 2000; Vesanto, Himberg, Alhoniemi, & Parhankangas, 2000); more on this in chapter 3.5. However, at least Matlab’s SOMToolbox has functions that determine these parameters based on the input data. In this sense SOM can be said to be independent of any input parameter selections.

Support vector clustering requires two input parameters: the scale parameter q of the Gaussian kernel and the soft margin constant C. The value for q is often determined indirectly from the desired number of clusters. By setting the value of C the user chooses to what degree outliers are handled. The main challenge with SVC is determining q, as this often requires several iterations before a suitable value is found.

For K-means the input parameter is the number of clusters K. The downside is that K-means always converges to K clusters even if the data does not contain that many clusters. However, the clustering can be made more reliable by first applying some other method that determines the number of clusters.

Whereas SOM finds the clusters without having to set any specific parameters both SVC and K-means require that the number of clusters is determined before starting the clustering.

2.5.5 Proximity

For customer segmentation purposes it is important to know what makes some data samples more similar or dissimilar to others. This degree of similarity or dissimilarity of data samples is referred to as proximity. K-means and SOM are similar in the sense that both calculate distances and minimize some error rate to achieve good cluster quality. However, SOM also forms clusters based on the similarities of the input data samples, which means that more similar clusters are located close to each other. It is the neighborhood factor of SOM that creates the relationships between different clusters. This aspect distinguishes SOM from many other clustering algorithms. K-means does not take into account similarity between samples within clusters. This means that once a data sample is assigned to a cluster, all samples within the cluster are treated equally and have no relationships with samples outside the cluster. Like SOM, SVC groups data samples based on their similarities within clusters but, like K-means, it does not form relationships between clusters.

2.5.6 Characteristics of a cluster

Clustering algorithms vary in their ability to form groups of different sizes, densities and shapes. Behavioral segmentation can produce all kinds of groups and therefore these factors should not be limiting the results. Also, the type of the data samples can dictate the algorithm to use as some clustering algorithms only work with continuous data or with the kind of data where the mean can be computed. Basic K-means is restricted to numeric data and can only be used for data where mean or median can be calculated (Tan, Steinbach, & Kumar, 2006). Therefore, basic K-means does not work well with categorical data. Additionally, K-means can only find spherical clusters and usually of the same size, which can be restrictive in some cases. Instead, SVC and SOM have no problems in locating clusters of all shapes, sizes and densities.

2.5.7 Cluster summaries

When the data is analyzed for segmentation purposes the visual presentation of the resulting clusters is a great aid in understanding the data. For clustering, summaries are drawn to present both qualitative and quantitative information of the segments. For this purpose, there are both visual tools and mathematical summaries.

SOM is often considered superior due to its ability to visually represent the clusters on a map. By looking at the map, it is easy to grasp the different patterns in the originally large data set, such as the structure, the number of clusters and possible outliers. K-means and SVC do not directly have any built-in visualization tools, but both can be represented through other visualization means and with mathematical summaries. Examples of the latter are mathematical calculations, literal descriptions of the results and histograms. Consequently, SOM’s several visualization tools can be considered a great benefit in understanding and analyzing the data, whereas K-means and SVC are more limited in their presentation power.

3 Empirical study

The previous chapter provided an introduction to the area of clustering. The theoretical background covered three clustering methods that were considered applicable for behavioral segmentation of telecommunication customers. This chapter will introduce the method to be studied in more detail and explain the steps to be taken in order to produce customer segmentation.

3.1 Choosing the method

Chapter 2 identified benefits and pitfalls of the different clustering methods. Based on these findings, four apparent benefits of SOM compared to the other clustering methods were identified, which contributed to the choice of this method:

• topological ordering
• visualization
• robustness
• performance

In customer segmentation, it is interesting to identify which properties make two segments different from each other as well as which segments are more similar to each other and why. SOM provides an efficient visual presentation of relationships between segments by preserving the topological ordering of the original data set. With a quick glance at a SOM map one can identify which segments are closely located and therefore more similar to each other and which are different. Also, SOM’s component planes reveal which attributes are behind the segments and what factors make two segments different.

The most common argument for SOM is that it enables a visual presentation of the data clusters. This visualization enables qualitative inspection and analysis of the clusters, such as finding out the number of clusters, their shape and size, and neighboring clusters. The map is used for selecting the actual clusters for the summary phase, which provides quantitative information of the clusters. Finally, the actual clusters are used for drawing customer profiles for the resulting segments.

SOM has proven to be a robust and efficient tool for clustering as it handles outliers and missing values and performs well with a large data set. Performance-wise, the SOM algorithm is considered computationally efficient. Furthermore, the two-level approach makes the computation even faster as the winning nodes are trained and clustered instead of the original data (Vesanto & Alhoniemi, Clustering of the self-organizing map, 2000).

The rest of the chapter contains a description of the different stages of data mining using SOM as well as the research activities performed for this thesis.

3.2 Research activities

The complexity of a large data set often makes it hard or even impossible for humans to find the clusters. Therefore better ways to investigate the data are required. A data mining process known as the ‘preparation-survey cycle’ is presented in (Vesanto, Using SOM in Data mining, 2000). This is an approach that sheds light on understanding the data and was therefore considered a good choice for organizing the research activities of this thesis (see figure 5).

Figure 5. The SOM ‘preparation-survey cycle’ provides a framework for the data mining process (Vesanto, Using SOM in Data mining, 2000).

At first, the raw data samples are gathered. Next, the data is aggregated to reduce the data granularity and the amount of data. Then comes the preparation phase, during which the attributes for segmentation are determined and the raw data is transformed into a form that the clustering algorithm can understand. Now the data is ready for the actual clustering, during which the map is trained on the data samples and the results are presented visually. These results are then thoroughly analyzed and evaluated with the help of different visualization tools provided by SOM. As an input for the summary phase, the clusters are selected from the visual presentation. Quantitative summaries and customer profiles are drawn for each cluster and the domain experts determine whether the segmentation has produced useful models. Finally, by completing all these steps local preliminary models can be drawn for the business purposes.

The research environment used in this thesis consists of Windows XP hosting a MySQL database and Matlab, which was used for studying the SOM clustering. The study was conducted using the open-source SOMToolbox package available for Matlab (Vesanto, Himberg, Alhoniemi, & Parhankangas, 2000). This is an application that implements the basic SOM algorithm and visualizes the results. As the initial data set is stored in a database, MySQL queries are used to export the data samples into ASCII files. To perform pre-processing a small program was implemented in C. The calculations presented were made using Matlab functions. As this research topic was considered rather challenging, extra help was received from researcher Kim Green, who was involved during the evaluation phase of SOM.

3.3 Data set

The actual data samples used for this thesis are collected as raw transaction data from a telecommunication operator’s traffic logs. Before segmentation analysis the data is pre-processed and aggregated. This data is stored in a database and it consists of WAP visit information and customer demographic data.

3.3.1 Restrictions

For the purposes of this thesis only WAP and customer demographic data are analyzed. The absence of real data was a challenge in this thesis. We also needed to verify whether SOM can actually find hidden clusters in the data set. For these reasons, a decision was made to simulate the WAP and customer demographic data. Medium-sized European operators have from five to ten million customers, but as the performance of Matlab degrades with large data sets the simulation was conducted with one thousand customers. However, to ascertain that SOM is applicable for customer segmentation in terms of performance, estimates were made of how well SOM would perform with actual data sizes. Considering these restrictions, it is recommended to conduct further exploratory segmentation once real data is available to verify the applicability of SOM for behavioral customer segmentation.

3.3.2 Simulation

The empirical study of SOM clustering is conducted based on artificial data containing five clusters. Setting up the artificial data set was based on probabilities. Each of the customers within one of the five clusters had a probability value which tells how likely it is that the customer visits a certain service during a certain occasion of the day. Each visit produces a certain amount of download. The simulated data set contains information on 1000 customers using WAP services during a period of one month. In order to verify that SOM can discover hidden patterns and segments in the data, four use cases that should be found during the clustering phase were integrated into the simulated data. The use cases are:

1. Finding out which services could be cross-marketed
2. Detecting differences in usage patterns between two segments
3. Discovering the most profitable segment
4. Finding out the most active customers

3.4 Preparation

According to the practices of data mining, the data has to be pre-processed before cluster analysis. Hasty decisions made during the data preparation stage will still produce a clustering, but the clusters might not represent the true properties of the data (Vesanto, Using SOM in Data mining, 2000). This chapter provides an insight on how to prepare the data to make it conform to the requirements of the analysis algorithm. First, the data is aggregated and then a decision is made on what to do with the missing values. Next, the attributes for the clustering are determined. Finally, transformations are conducted to make the attributes comparable with each other.

3.4.1 Aggregation

In data mining the amount of data transactions can grow to several millions of rows per day but often there is no need to analyze such granular data. In this case compacting and summarizing the data is advantageous. This approach of aggregating data is beneficial as it

• raises the level of abstraction,
• reduces the initial data set and
• requires less processing time from the clustering algorithm.

For analysis purposes, the WAP transactions do not provide meaningful information and therefore the data samples are grouped to WAP visits. The definition of a visit is that it consists of all transactions made to a service provider’s site based on the URL during one unique packet data protocol (PDP) context (Aito Technologies, 2007). As a consequence, the input data samples are reduced to one tenth of the original size, which contributes significantly to performance of the clustering algorithm. The visits were even further aggregated for each customer based on a pre-selected time period. After aggregation, each table row describes different factors of customer’s usage during a chosen time period.
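The two aggregation levels described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record fields `customer`, `pdp_context`, `site` and `bytes` are hypothetical names, and the service provider's site is assumed to have been extracted from the URL beforehand.

```python
from collections import defaultdict

def aggregate_visits(transactions):
    """Group transactions into visits keyed by (customer, PDP context, site).

    Each visit accumulates the download volume of its transactions.
    """
    visits = defaultdict(int)
    for t in transactions:
        key = (t["customer"], t["pdp_context"], t["site"])
        visits[key] += t["bytes"]
    return dict(visits)

def aggregate_customers(visits):
    """Summarize visits per customer: visit count and total download volume."""
    summary = defaultdict(lambda: {"visits": 0, "bytes": 0})
    for (customer, _context, _site), volume in visits.items():
        summary[customer]["visits"] += 1
        summary[customer]["bytes"] += volume
    return dict(summary)
```

In practice the second level would also split the counts by service group and occasion, which is omitted here to keep the sketch short.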

3.4.2 Missing value replacement

The simulated data does not contain missing values, but real data might. Missing values need to be handled, as otherwise they will distort the clustering results. If not too many customers are missing data, the missing values can be left for SOM to take care of. SOM sets ‘x’ in the place of missing values and only uses the available values in the training phase. As a result, missing values do not have a negative impact on the clustering results. But if it turns out that most of the customers are lacking some of the data values, it would be advisable to drop these data values from clustering. If the values cannot be assumed to be missing at random (i.e., the absence of a value tells something about a customer), the absence of an attribute can itself be made into a logical (0/1) attribute.
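A minimal Python sketch of these two ideas is given below. It assumes that a missing value is represented by `None` (an assumption of this sketch, not the representation used by SOMToolbox) and shows how the winning node can be searched using only the available components, and how absence can be turned into a 0/1 attribute.

```python
def masked_distance2(x, w):
    """Squared distance using only the components of x that are present."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w) if xi is not None)

def best_matching_unit(weights, x):
    """Index of the node closest to x, ignoring the missing components of x."""
    return min(range(len(weights)), key=lambda i: masked_distance2(x, weights[i]))

def missing_indicator(x, index):
    """Logical (0/1) attribute: 1 if attribute `index` is missing, else 0."""
    return 1 if x[index] is None else 0
```

For example, a sample `[None, 4]` is matched against the map nodes on its second component only, so the missing first component neither helps nor penalizes any node.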

3.4.3 Attributes

As the attributes, i.e. dimensions, determine what kinds of clusters are formed, choosing the right attributes is an important phase of data mining. From the network transactions several attributes can be identified, such as which service URL the customer contacted and with what handset, on what occasion, and how much the customer downloaded. The customer demographics tell about customer age, gender and home location. The demographic factors were considered as input attribute candidates for segmentation, but as this information can be incorrect when the owner is not the actual user (e.g. a company phone used by an employee), using them is not advisable. Instead, demographics are used when profiling the final clusters as they give some indication of the customers in the segment.

Similarly, the handset model information would be interesting to use in the segmentation. Nevertheless, there are thousands of models which cannot be categorized in any meaningful way for behavioral segmentation purposes. Even if not used in the actual clustering phase, the handset information could be used when setting up the final profiles of each segment. However, as the handsets do not provide any value for the clustering they are left out of scope.

The values chosen for behavioral segmentation are presented in table 1. In order to provide these values as input data for a clustering algorithm, all attributes are categorized first (more on this in chapter 3.4.4). The behavioral attributes for customer segmentation are the eight service groups, the number of visits, the amount of download, two occasion groups and the average revenue per customer (ARPU). This means that each customer is described with 33 attributes. One attribute is ARPU and the other 32 attributes comprise the customer’s visit count to each of the service groups during the two occasions of a day and the download volume from each of the service groups during the two occasions of a day. This allows analyzing both service visit counts and download volumes per customer.
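The count of 33 attributes follows from 2 measures × 8 service groups × 2 occasions + 1 (ARPU). The short Python sketch below enumerates one possible layout of the attribute vector. The ordering and the label strings are assumptions made for illustration; the thesis only fixes the total count and that ARPU is one of the attributes.

```python
from itertools import product

SERVICE_GROUPS = [
    "Communication", "Media", "Search service", "Mobile phone utilities",
    "Portal", "Information services", "Entertainment",
    "Shopping, banks, commerce",
]
OCCASIONS = ["work hours", "evenings and weekends"]
MEASURES = ["visit count", "download volume"]

def attribute_names():
    """List the 33 attribute names: 2 * 8 * 2 usage attributes plus ARPU."""
    names = ["%s / %s / %s" % (measure, group, occasion)
             for measure, group, occasion in product(MEASURES, SERVICE_GROUPS, OCCASIONS)]
    names.append("ARPU")
    return names
```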

Table 1. Attribute descriptions for usage based segmentation

| Attribute category | Attribute | Type | Description |
|---|---|---|---|
| Customer demographics | Average revenue per user (ARPU) | Ordered | What is the value of the customer? Values range from 1 to 6, where 1 = less than 8 €/month, 2 = 8-25 €/month, 3 = 26-50 €/month, 4 = 51-85 €/month, 5 = 86-135 €/month, 6 = more than 135 €/month |
| Service | Service group | Categorical | Which services does the customer use? |
| Usage | Download volume | Continuous | How much does the customer download? |
| Usage | Visit count | Discrete | How many times does the customer visit a service? |
| Occasion | Occasion | Categorical | When does the user use the services? Work hours, or evening hours and weekends |
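The ordered ARPU attribute of table 1 can be derived from a monthly revenue figure as in the following Python sketch. Table 1 lists the ranges in whole euros; treating each upper limit as inclusive for fractional amounts is an assumption of this sketch.

```python
def arpu_category(euros_per_month):
    """Map monthly revenue in euros to the ordered ARPU category 1..6."""
    if euros_per_month < 8:
        return 1
    # (category, inclusive upper limit in euros per month)
    for category, upper in ((2, 25), (3, 50), (4, 85), (5, 135)):
        if euros_per_month <= upper:
            return category
    return 6
```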

3.4.4 Categorization

Data attributes can be discrete, ordered, continuous, binary or categorical. To find patterns in the data, categorical attributes are the most suitable type. For this thesis, most of the customer-based attributes were already organized into categories, but the data gathered from the traffic was not, i.e. visit count, download volume and the service providers. Therefore, some categorization was needed for these three attributes.

In the behavioral segmentation, download volume can be used as an attribute in two contexts: the overall download volume per customer or the download volume per service group. The former could be used to describe the total usage rate of one customer without knowledge of the services used. However, the latter is considered more interesting as it helps find out which services are downloaded the most. Download volume, being a continuous value, can have a skewed distribution. The root cause for this can be that most customers tend to download very little data whereas some small group is responsible for downloading very large amounts. To even out this distribution, the download volumes can be categorized with binning, logarithms or standardization. A decision to use equal-sized bins was made, even though it loses some information, as for example all data volumes between 100 and 200 kilobytes would be treated as equal. The equal-size categorization helps define the download volumes per service provider as ‘no usage’, ‘low usage’, ‘medium usage’, ‘high usage’ or ‘very high usage’. Similarly, the visit counts were categorized into equal-sized groups with ‘no usage’, ‘low usage’, ‘medium usage’, ‘high usage’ or ‘very high usage’. For behavioral segmentation this level of categorization is sufficient and descriptive.

As the services a customer uses are identified by a URL, we needed to categorize the URLs into groups, which are presented in table 2. However, this division should be considered exemplary as it does not take into account that services can actually belong to several classes. In the future, improvements should be made to achieve a more accurate classification. But for the purposes of this thesis this approach is sufficient.

Table 2. Service provider classification criteria.

| Group | Service |
|---|---|
| Communication | Chat, irq, mail, etc. |
| Media | News, magazines, tv, etc. |
| Search service | Google, altavista, etc. |
| Mobile phone utilities | Ring tones, games, background images, etc. |
| Portal | Service provider’s portals, etc. |
| Information services | Weather, time tables, etc. |
| Entertainment | Music, movies etc. |
| Shopping, banks, commerce | Online shopping services, bank services etc. |
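The categorization of download volumes and visit counts into five usage levels can be sketched in Python as shown below. The thesis does not state the bin limits, so this sketch assumes that zero maps to ‘no usage’ and that the positive range up to a chosen maximum is split into four equal-width bins; values above the maximum are clamped into the highest category.

```python
import math

LABELS = ["no usage", "low usage", "medium usage", "high usage", "very high usage"]

def usage_category(value, max_value):
    """Map a non-negative value to a usage category index 0..4."""
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    if value <= 0:
        return 0
    width = max_value / 4.0                 # four equal-width bins over (0, max_value]
    category = int(math.ceil(value / width))
    return min(category, 4)                 # clamp values above max_value
```

With a maximum of 400 kilobytes, for example, every volume from just above 100 up to 200 kilobytes falls into ‘medium usage’, which illustrates the loss of information mentioned above.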

3.4.5 Transformations

A final step of data preparation phase is the attribute transformations of all values. The transformations make the values comparable with each other and ensure that large values do not influence the calculations more than small values. The common way of transforming is to set the mean value to zero and the standard deviation to 1 (Tan, Steinbach, & Kumar, 2006). Finally, all attributes were scaled between -1 and +1, which is a common way to provide the input values for the clustering algorithm (Berry & Linoff, 2004).
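The two transformations mentioned above can be sketched in Python as follows. Using the population standard deviation and a linear min-max mapping to the range from -1 to +1 are assumptions of this sketch; the thesis does not specify the exact variants.

```python
import math

def standardize(values):
    """Shift and scale values to zero mean and unit standard deviation."""
    n = len(values)
    mean = sum(values) / float(n)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / float(n))
    if std == 0:
        return [0.0] * n                    # constant attribute carries no information
    return [(v - mean) / std for v in values]

def scale_to_unit_range(values):
    """Map values linearly so that the minimum is -1 and the maximum is +1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

Each attribute (column) would be transformed separately, so that for instance a download volume in kilobytes and a visit count end up on comparable scales.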

3.5 Applying SOM clustering

The data pre-processing phase has been conducted and the data samples are ready to be clustered. For finding the potential segments, SOM performs quantization and vector projection. Quantization means dividing input data samples into groups with approximately same number of data samples whereas vector projection refers to reducing a high dimensional input to a lower dimension. The resulting SOM map provides qualitative information for the analyzer about the possible segments and unveils the hidden relationships between different groups. The rest of this chapter discusses the different steps of SOM clustering starting from choosing the initial parameters to initializing the input vectors and finally training the data.

3.5.1 Initial parameters

Choosing initial parameters includes determining map size, training length, learning rate and neighborhood radius.

A SOM map can be organized as either a hexagonal or a rectangular grid consisting of M nodes. In this experiment, a hexagonal grid was applied as it gives better visualization of the map structure. A common rule for the SOM map size, presented in (Vesanto & Alhoniemi, Clustering of the self-organizing map, 2000), is $M \approx 5\sqrt{N}$, where N is the number of data samples. The two axes of the map also need to be determined so that the number of map nodes is roughly $5\sqrt{1000} \approx 158$. As there is no strict rule on how to choose the values for the axes, the map was set to have 17 nodes on the y axis and 9 nodes on the x axis.

Initially, the training length is set to the number of data samples, i.e. 1000, and for the refining phase to twice that amount, i.e. 2000. The learning rate determines how fast the weights change. Initially the learning rate should be high in order to adjust the map quickly to the input data. Over time, the learning rate decreases in order to fine-tune the map. The learning rates for the two phases are set to 0.5 and 0.05.

The neighborhood radius determines how large a neighborhood around a node is updated and by how much. Initially the neighborhood radius is large so that it affects a large area around a node. During the training this value is decreased so that in the end only the closest neighbors are affected. Typically, the initial neighborhood radius is taken as roughly half of the map height. The starting radius was set to 7. Table 3 summarizes all training parameters.

Table 3. Summary of SOM map parameters.

| SOM parameter | First phase | Second phase |
|---|---|---|
| Map size | 17 × 9 | 17 × 9 |
| Training length | 1000 | 2000 |
| Learning rate | 0.5 | 0.05 |
| Neighborhood radius | 7 → 1 | 3 → 1 |
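The parameter choices can be illustrated with a small Python sketch. It assumes the commonly used heuristic of about 5√N map nodes and a linear decay of the learning rate and neighborhood radius during a phase; both the function names and the linear form of the decay are assumptions of this sketch and not necessarily what SOMToolbox applies.

```python
import math

def map_units(n_samples):
    """Heuristic number of map nodes, assuming roughly 5 * sqrt(N)."""
    return int(round(5 * math.sqrt(n_samples)))

def linear_decay(initial, final, step, total_steps):
    """Value at `step` (0-based) when decaying linearly from initial to final."""
    fraction = step / float(max(total_steps - 1, 1))
    return initial + (final - initial) * fraction
```

For 1000 samples the heuristic suggests about 158 nodes, which is close to the 17 × 9 = 153 nodes used. With a training length of 1000, `linear_decay(7, 1, step, 1000)` would give the neighborhood radius of the first phase at each step.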

3.5.2 Input vectors

As we did not have any prior knowledge on which initial input vectors to choose, the two remaining options are random and linear initialization. The latter is considered faster than the former, as the input vectors are initialized along a two-dimensional plane spanned by the two eigenvectors with the greatest eigenvalues, which are found with Principal Component Analysis (PCA) (Kohonen, Self-organizing map, 1995). Linear initialization consists of a two-dimensional plane where the center is the mean of the data set and the plane captures the variance of the data set (Vehviläinen, 2004). Random initialization could be used as well, but usually it is recommended to run several rounds of random initialization and to choose the one with the lowest error rate. For this thesis we used linear initialization as it consistently gave stable results.

3.5.3 Training

The clusters are formed during two training phases, the rough phase and the refining phase. Each training phase consists of several iterations, also known as epochs. During each epoch both the learning rate and the neighborhood radius decrease. At the beginning both values are high, which means that the map adjusts quickly to the data set and a large neighborhood area is adjusted around the winning node. Towards the end of the training the learning rate is smaller, which enables the map to stabilize. The neighborhood radius is finally 1 so that it only affects the winning node, not any of the surrounding nodes. During the rough phase the ordering of the nodes is determined so that similar nodes are located close to each other. The result of the rough phase is a topologically ordered map. Most of the work is then done, but the refinement phase tunes the nodes to have better values. The refinement phase can therefore have lower initial values than the rough phase. After both phases are done the clusters are formed.
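A single update of the sequential training can be sketched in Python as follows. The sketch assumes a Gaussian neighborhood function and grid coordinates given explicitly per node; the thesis uses a hexagonal grid through SOMToolbox, so the names and details here are illustrative assumptions only.

```python
import math

def best_matching_unit(weights, x):
    """Index of the winning node, i.e. the node whose weights are closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

def train_step(weights, coords, x, learning_rate, radius):
    """Pull the winning node and its grid neighbors toward sample x (in place)."""
    bmu = best_matching_unit(weights, x)
    for i, w in enumerate(weights):
        # squared distance on the map grid between node i and the winner
        grid_dist2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[bmu]))
        h = math.exp(-grid_dist2 / (2.0 * radius ** 2))   # Gaussian neighborhood
        weights[i] = [wi + learning_rate * h * (xi - wi) for wi, xi in zip(w, x)]
    return bmu
```

A training phase would call `train_step` once per sample and epoch while the learning rate and the radius are gradually decreased, first with the rough-phase values and then with the refining-phase values.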

3.5.4 Evaluation of clusters

To determine how good the clusters are, a quantization error is measured. This error rate is calculated for each input sample while determining the winning node. The error rate for an input sample is determined based on the length to the winning node. The cluster quality is then determined by the average of the errors.
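As a minimal sketch, the average quantization error can be computed in Python as below, assuming the Euclidean distance is used as the length to the winning node. The function name is hypothetical.

```python
import math

def quantization_error(weights, samples):
    """Average Euclidean distance from each sample to its winning node."""
    total = 0.0
    for x in samples:
        total += min(math.sqrt(sum((w - v) ** 2 for w, v in zip(node, x)))
                     for node in weights)
    return total / len(samples)
```

A lower value means that the map nodes represent the data samples more closely.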

3.6 Visualization

Visual inspection is one of the most interesting steps of SOM cluster analysis as it reveals the hidden patterns of the original data set. Visualization provides instant overview of the cluster shapes, number of clusters, relations between clusters as well as existence of sub clusters. For the attributes, the component planes provide a powerful presentation. When looking at the whole map together with the component planes, one can find interesting facts about the data set. There are plenty of different presentations for the maps. Figure 6 displays U-matrix, D-matrix, similarity coloring and the hit histograms.

3.6.1 Map

The maps presented in figure 6 can be considered the most important visual presentations when determining the clusters. The D-matrix calculates the distances between the map nodes and constructs the visual map based on this information. The distance matrix describes the map using the shape and size of each map node. The size of a map node tells the average distance to its neighbors (Kaski, Venna, & Kohonen, 1999). The most popular distance matrix technique is the U-matrix, which stands for unified distance matrix. The U-matrix differs from the distance matrix presentation as it uses a grey scale for the map. Light areas represent a close relation between map nodes, whereas dark areas represent a long distance between nodes. Therefore, dark areas can be considered as divisive gaps between clusters (Vesanto, SOM-Based Data Visualization methods, 1999).

As the name indicates, similarity coloring displays the cluster structures with colors. In this method each map node is assigned a color and neighboring clusters are separated with different colorings, which makes the analysis of the maps user friendly (Kaski, Venna, & Kohonen, 1999; Vesanto, SOM-Based Data Visualization methods, 1999). The actual clusters in this thesis are extracted from the similarity coloring based on the different color codes.

Histograms are efficient tools for determining how the data is distributed on the map. The logic behind histograms is simple: each time a winning node is found, the counter for the respective map node is increased (Vesanto, Himberg, Alhoniemi, & Parhankangas, 2000).

Figure 6. Different presentations for the SOM clusters. (a) U-matrix with colors, (b) D-matrix, (c) similarity coloring and (d) hit histograms.
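
The two map summaries described above, the average distance of each node to its neighbors and the hit histogram, can be sketched as follows. This is only a minimal illustration in plain Python and not the tooling used in this thesis. The 4-connected rectangular neighborhood, the tiny 1 x 4 codebook, the sample values and the function names are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neighbor_distance_matrix(codebook):
    """codebook[r][c] is the weight vector of the node at row r, column c.
    Returns, for every node, the mean distance to its grid neighbors
    (assumed here: up, down, left and right on a rectangular grid)."""
    rows, cols = len(codebook), len(codebook[0])
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(euclidean(codebook[r][c], codebook[nr][nc]))
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result

def hit_histogram(codebook, samples):
    """Counts, for every node, how often it is the winning (closest) node."""
    rows, cols = len(codebook), len(codebook[0])
    hits = [[0] * cols for _ in range(rows)]
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    for sample in samples:
        r, c = min(positions,
                   key=lambda rc: euclidean(codebook[rc[0]][rc[1]], sample))
        hits[r][c] += 1
    return hits

# Hypothetical 1 x 4 map with a large gap between the second and third node.
codebook = [[[0.0, 0.0], [1.0, 0.0], [9.0, 0.0], [10.0, 0.0]]]
samples = [[0.2, 0.0], [0.1, 0.0], [1.2, 0.0], [9.8, 0.0]]

dmat = neighbor_distance_matrix(codebook)
hits = hit_histogram(codebook, samples)
```

In this toy map the two middle nodes get the largest neighbor distances, which is the kind of gap that a U-matrix or D-matrix would show as a cluster border, while the hit counts show where the samples land.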

3.6.2 Component planes

The map provides an initial understanding and a starting point for the cluster analysis, but it can be somewhat cumbersome at first glance. Taking a closer look at the attributes provides more information on the clusters and the characteristics of the data set. This is done by studying the component planes, with which one can

• hunt down possible correlations,
• see how the attributes are distributed and
• determine where in the map each attribute contributes the most.

Correlations are found if two attributes have similar patterns in the same positions of the component planes. Nevertheless, these findings are not the most accurate ones and should only be considered preliminary. Further analysis of correlation should be done either by mathematical means, e.g. computing the correlation coefficient from the covariance and the standard deviations, or visually with scatter plots (Himberg, 1998). When investigating a pair of attributes for correlation, values close to 0 imply that the two attributes are not correlated, whereas values close to 1 mean that if one of the attributes changes, the other attribute changes in the same direction. Finally, values close to -1 imply that when one of the attributes changes, the other one changes in the opposite direction.

Figure 7 presents the component planes of the 33 attributes used in this thesis. By comparing the attributes, one can see that average revenue per customer (attribute 33) and visit count (attribute 5) correlate as expected. Network customers with a high average revenue per month tend to be heavy users of the services as well.

The coloring of the component plane also reveals how the attribute values are distributed. From figure 7 one can see that average revenue per customer (ARPU) (attribute 33) has high values in the lower right part of the component plane and that the values decrease towards the upper left corner. In other words, the high ARPU customers are situated in the lower right part of the map, whereas the low ARPU customers are situated in the upper half of the map.

Another interesting analysis is to find out which attributes best describe the clusters. Component planes reveal attribute combinations that are typical for certain clusters. The way to perform this analysis is to choose a cluster and look at the component planes to identify what features the cluster has. Determining the characteristics of a cluster is presented in more detail in chapter 4.

Figure 7. The U-matrix and component planes presenting the behavioral features of operator customers.
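
The mathematical check mentioned above, computing the correlation coefficient from the covariance and the standard deviations, can be sketched as follows. The attribute values are invented for illustration and are not taken from the simulated data of this thesis; the function name is likewise an assumption.

```python
import math

def pearson(x, y):
    """Correlation coefficient = covariance / (std of x * std of y).
    Assumes both attributes vary (non-zero standard deviation)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

# Hypothetical attribute values for four customers.
arpu = [10.0, 20.0, 30.0, 40.0]
office_visits = [1.0, 2.0, 3.0, 4.0]     # rises with ARPU
evening_visits = [8.0, 6.0, 4.0, 2.0]    # falls as ARPU rises
unrelated = [1.0, -1.0, -1.0, 1.0]       # no linear relation to ARPU

r_positive = pearson(arpu, office_visits)
r_negative = pearson(arpu, evening_visits)
r_none = pearson(arpu, unrelated)
```

The three results fall close to 1, -1 and 0, matching the three cases described in the text.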

3.7 Clusters and summaries

In the previous chapter the most interesting groups were identified from the visual presentations and now the actual summarization and profile creation of the segmentation can be performed. As a first step, the clusters are extracted. Next, descriptive quantitative summaries can be drawn and finally, the clusters are profiled with customer demographics. In order to gain useful information out of the segmentation, answers to the following quantitative and qualitative questions should be provided:

• How many clusters are there?
• Are some clusters more dominant than others?
• Which clusters are the most similar or the most different?
• What are the demographics for the segments?
• What are the possible clusters?
• Are there sub clusters within clusters?
• Does the data contain outliers?
• Which attributes make one segment different from another?
• What is the quality of the clusters?

For the purposes of the quantitative summaries, the clusters had to be extracted manually from the map by gathering all the nodes of each cluster. In the future, this process should be automated. For SOM, some algorithms exist that identify the cluster boundaries and extract the nodes within them. For example, Murtaugh (1995) presents a hierarchical clustering constrained by the neighborhood relationship, and Vesanto (Using SOM in Data mining, 2000) advises using the distance matrix as the basis for clustering. However, for this thesis the similarity coloring map was used and the clusters were extracted based on the color coding. The findings of this study and answers to most of the above-mentioned questions are presented in the following chapter.
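
Once every map node has been given a cluster label, the quantitative summaries can be produced mechanically. The sketch below assumes that the winning node of each customer and a node-to-cluster mapping (for example read off the color coding) are already available. The function name, the labels "A" and "B" and all values are hypothetical.

```python
from collections import defaultdict

def summarize_clusters(samples, bmu_of_sample, cluster_of_node):
    """Groups samples by the cluster of their winning node and returns
    the size, share and per-attribute mean of every cluster."""
    members = defaultdict(list)
    for sample, node in zip(samples, bmu_of_sample):
        members[cluster_of_node[node]].append(sample)
    summary = {}
    for cluster, rows in members.items():
        n = len(rows)
        summary[cluster] = {
            "count": n,
            "share": n / len(samples),
            # zip(*rows) iterates over the attributes (columns).
            "means": [sum(col) / n for col in zip(*rows)],
        }
    return summary

# Hypothetical data: two attributes per customer, three map nodes.
samples = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0], [7.0, 70.0]]
bmu_of_sample = [0, 0, 1, 2]                 # winning node per customer
cluster_of_node = {0: "A", 1: "B", 2: "B"}   # e.g. from the color coding

summary = summarize_clusters(samples, bmu_of_sample, cluster_of_node)
```

The cluster shares answer the question of dominance, and the attribute means give a first profile for each segment.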

4 Results

It is time to present and analyze the findings of this thesis. The segmentation is based on 33 attributes describing the behavior of a customer. These attributes provide a basis for a profound analysis of the customers’ overall usage. Another option would have been to make separate analyses for visit based, download volume based and occasion based segmentations, but this would have narrowed down the analysis too much for our purposes. This chapter starts by analyzing the attributes for SOM, followed by answers to the questions posed in chapter 3.7. Then the verifications for the use cases from chapter 3.3.2 are presented, after which SOM’s scalability is estimated. The chapter is concluded with an analysis of the cluster quality.

4.1 Attributes

Attribute correlation investigations start by taking a look at figure 7 and finding out which attributes are the most similar. Most of the attributes happen to have a high correlation due to the simulated data: in most cases the visit count and download volume have been set to follow the same usage pattern. Nevertheless, the artificial data hides all types of correlation, i.e. positive, negative and no correlation. When similarities are found, the attribute pairs are further investigated by mathematical means.

There are of course many attribute pairs with no correlation, but one hidden case is the media service usage during the evening hours. The simulation produces very high visit count figures (attribute 7) but very low download volumes (attribute 8). The results show a low correlation of 0.0176 between these two attributes, which indicates that even if a service is actively visited, the download volumes can vary a lot among the customers. Thus, there is no similar pattern for this specific service in the visit count and download volume.

To hunt down a positive correlation, average revenue per customer (ARPU, i.e. attribute 33), visit count for search services (attribute 9) and download volume for search services (attribute 10) are investigated. Correlation calculations show that ARPU has a positive correlation of 0.9 with the visit count. Similarly, ARPU correlates strongly with the download volume for search services during office hours. The strong positive correlation indicates that the high ARPU customers tend to be heavy users of the search services.

On the other hand, the largest negative correlation of -0.8 can be found by comparing the visit count of communication services during evening hours (attribute 3) with ARPU. This means that customers with high ARPU tend not to use communication services during evening hours. In other words, the usage of the communication services during evening hours is highest among customers with small ARPU and lowest among high ARPU customers.

Another important matter to be investigated is whether some attributes form their own segments. This is analyzed by looking at skewness and kurtosis values, where very large or very small values could imply a segment. Calculations of mean, median, standard deviation, skewness and kurtosis are conducted on all attributes. High skewness and kurtosis values are identified for attributes 8, 21 and 29. These attributes represent the download volume of media services during evening hours, the visit count of info services during evening hours and the visit count of shopping services during office hours. When comparing the component planes of these attributes with the U-matrix, one can see that these attributes form rather vague areas in the upper right corner as well as the upper left corner of the U-matrix. These small segments can be considered as outliers, and the variables are removed from the cluster analysis.
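
The skewness and kurtosis screening described above can be sketched as follows. This thesis does not state which thresholds were used, so the limits in the sketch are purely illustrative assumptions, as are the attribute names and values. Kurtosis is computed here as excess kurtosis (a normal distribution gives 0).

```python
def skewness_and_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0, 0.0  # constant attribute: no shape to measure
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def flag_attributes(table, skew_limit=1.0, kurt_limit=3.0):
    """Returns the names of attributes whose skewness or excess kurtosis
    exceeds the (assumed, illustrative) limits in absolute value."""
    flagged = []
    for name, values in table.items():
        skew, kurt = skewness_and_kurtosis(values)
        if abs(skew) > skew_limit or abs(kurt) > kurt_limit:
            flagged.append(name)
    return flagged

# Hypothetical attributes: one symmetric, one dominated by a single value.
table = {
    "attr_symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "attr_skewed": [0.0] * 9 + [10.0],
}
flagged = flag_attributes(table)
skew_of_skewed, kurt_of_skewed = skewness_and_kurtosis(table["attr_skewed"])
```

Flagged attributes would then be compared with the U-matrix, as done above, before deciding whether to remove them from the clustering.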

4.2 Segments

The clustering is conducted with the remaining 30 attributes, of which 29 are usage and occasion based and one describes the average revenue of the customer (ARPU). The usage attributes consist of the visit count and the download volume for each of the eight service groups during two occasions: office hours, and evening and weekend hours. That gives 32 usage attributes in total, of which the three removed in chapter 4.1 are excluded, leaving 29.

How many clusters are there? What are the possible clusters?

The SOM map of the data set representing operator customers and their WAP service usage is shown in figure 8. The map clearly indicates five segments, which are outlined with solid circles. To make it easier to talk about the different clusters in the rest of this chapter, the clusters were named according to the segments created for the simulation.

Figure 8. The simulated data hides five clusters within.

Hunting down possible clusters is easiest by looking at the D-matrix or the U-matrix in figure 6. There are two cluster pairs that could possibly be joined together based on their similarities. STUDENTS and COMMON WORKERS are very similar in behavior. Another possibility is to merge BUSINESS USERS and EARLY ADOPTERS. In both cases, the segments are very similar in their service usage during different occasions, and in this sense these clusters could be treated as sub-clusters. But as they still have some differences, they are treated as different clusters in this study.

Are some clusters more dominant than others?

The size of a cluster indicates whether it consists of more data samples than the others. The largest and in this case also the most profitable groups from the operator perspective are the BUSINESS USERS and EARLY ADOPTERS, which together make up 50% of the whole customer population. One of SOM’s strengths is that it does not try to find clusters of the same size, which in this study would not have resulted in meaningful clusters.

Which clusters are the most similar or the most different?

SOM reveals neighboring clusters just by looking at the maps. STUDENTS and COMMON WORKERS are very similar to each other, whereas TEENS and COMMON WORKERS have nothing in common, as the distance between them illustrates. Neither do EARLY ADOPTERS and STUDENTS have anything in common. BUSINESS USERS and EARLY ADOPTERS, however, have a lot of similarities and therefore reside next to each other. Looking at the component planes and the original simulated data verifies these similarities.

What are the customer demographics for the segments?

In the simulation, we set the customer count for the segments TEENS, STUDENTS, EARLY ADOPTERS, COMMON WORKERS and BUSINESS USERS to 200, 150, 200, 150 and 300, respectively. The characteristics for the segments are identified as presented in table 4. The cluster sizes match the ones hidden in the data.

Table 4. Segment profiles.

| Segment | Customer count | Characteristics | Customer background information |
| --- | --- | --- | --- |
| Teens | 200 | Low average revenue per customer. Low usage during evening hours and on weekends. | 70% women and 30% men. Lives in large cities (more than 100000 inhabitants). Age less than 18 years. |
| Students | 150 | Low average revenue per customer. Low usage. High usage during working hours. | 80% men and 20% women. Lives in large cities (more than 100000 inhabitants). Age between 18 and 25 years. |
| Early adopters | 200 | High average revenue per customer (50-135 €/month). High usage during working hours. | 70% men and 30% women. Lives in capital region. Age between 26 and 30 years. |
| Common workers | 150 | Medium average revenue per customer (8-50 €/month). High usage during evenings and on weekends. | 50% men and 50% women. Lives in smaller cities (less than 100000 inhabitants). Age between 30 and 35 years. |
| Business users | 300 | High or medium high average revenue per customer (50-135 €/month). Low or medium usage during evenings and on weekends. | 80% men and 20% women. Lives in capital region. Age between 36 and 40 years. |

4.3 Simulation

Next, the use cases hidden in the simulated data are verified.

Use case 1: Finding out which services could be cross-marketed

This use case is verified by analyzing the component planes of attribute 1 (visit count for communication service usage during office hours), attribute 3 (visit count for communication service usage during evening hours), attribute 5 (visit count for media service usage during office hours) and attribute 7 (visit count for media service usage during evening hours). These attributes reveal that TEENS use communication services heavily in the evenings and on weekends. STUDENTS use both communication and media services heavily during the evenings and weekends. Unlike TEENS, COMMON WORKERS use media services heavily during evenings and weekends. In other words, during evenings and weekends TEENS and STUDENTS are heavy users of communication services, whereas STUDENTS and COMMON WORKERS are heavy users of media services. This implies that cross-marketing media services on communication sites, or the other way around, would be beneficial during the evenings and weekends. During office hours this conclusion does not apply, as none of the segments share the same service usage behavior.

Use case 2: Detecting differences in usage patterns between segments

During the office hours, TEENS are active visitors in mobile phone services but download relatively little (see attributes 13 and 14). The STUDENTS, however, seldom visit entertainment services but download large movies or watch TV series through WAP (attributes 25 and 26). Both segments can be said to be active users: STUDENTS in terms of download amounts and TEENS in terms of visit activity.

Use case 3: Discovering the most profitable segment(s)

When looking at the component planes for attributes 1, 2, 5, 6, 9, 10, 17, 18, 25, 26 and 33 at the positions of the segments EARLY ADOPTERS and BUSINESS USERS, the following findings can be made. During working hours, BUSINESS USERS use communication, media, search and portal services heavily, and EARLY ADOPTERS additionally use entertainment services. Analyzing the planes, one can see that the average revenue per customer (ARPU) is the highest for these two segments and that these segments use many services heavily during office hours.

Use case 4: Finding out the most active users

The analysis of all component planes from the evening hours (e.g. the attribute pairs 11 and 12, 27 and 28, 31 and 32) reveals that TEENS visit almost every service group but do not download that much. This is the most active customer group during the evenings and weekends. The download volumes were set to be low, but the component planes indicate high download volumes. This is evidently an error in the simulation: for these attributes all download volumes are very low, but they are still divided into equal-sized bins with usage levels ranging from 1 to 4 (from ‘no downloads’ to ‘high download volume’). However, these attributes could have been left out as well, as they strongly correlate with the visit count.
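
The binning effect behind this discrepancy can be illustrated with a small sketch. It assumes that the equal-sized bins are formed over each attribute's own minimum and maximum, which is an assumption about how the simulation was binned and not something stated in this thesis. The function name and the volumes are hypothetical.

```python
def equal_width_levels(values, levels=4):
    """Maps each value to a level 1..levels using equal-width bins
    spanning the attribute's own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1] * len(values)  # constant attribute: everything in level 1
    width = (hi - lo) / levels
    # The maximum value would fall just outside the last bin, so clamp it.
    return [min(int((v - lo) / width) + 1, levels) for v in values]

# Hypothetical download volumes: one attribute is tiny in absolute terms,
# the other is a thousand times larger.
low_volumes = [0.0, 1.0, 2.0, 3.0, 4.0]
high_volumes = [0.0, 1000.0, 2000.0, 3000.0, 4000.0]

low_levels = equal_width_levels(low_volumes)
high_levels = equal_width_levels(high_volumes)
```

Both attributes end up with the same levels, so a component plane built from the levels shows "high download volume" even where the absolute volumes are very low.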

4.4 Scalability

The scalability of SOM was verified by running tests with different amounts of data and then checking that the run times match the complexity formula O(nd) presented by Vesanto (Using SOM in Data mining, 2000). The formula is then used to estimate how long it takes to process one million and ten million data rows (n) with the dimension (d) being 30. The measured run times are presented in figure 9. For one million and 10 million rows, the estimated run times are around 8 minutes and 1.5 hours, respectively. The tests were run on Microsoft Windows XP SP2 with a 1.73 GHz processor and 504 MB of RAM.

Figure 9. Run time of SOM algorithm in seconds up to eleven thousand data samples.
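
Extrapolating from small test runs under the O(nd) assumption can be sketched as a least-squares fit of t = c * n * d through the origin. The timings below are hypothetical and are not the measurements of this thesis; they were only chosen so that the one-million-row estimate lands near the reported 8 minutes. The function names are assumptions as well.

```python
def fit_seconds_per_cell(measurements, dim):
    """Least-squares estimate of c in t = c * n * d (line through the origin).
    measurements is a list of (rows, seconds) pairs."""
    numerator = sum(rows * dim * seconds for rows, seconds in measurements)
    denominator = sum((rows * dim) ** 2 for rows, _ in measurements)
    return numerator / denominator

def estimate_seconds(rows, dim, c):
    """Predicted run time under the linear model t = c * n * d."""
    return c * rows * dim

# Hypothetical (rows, seconds) timings up to eleven thousand samples.
measurements = [(1000, 0.5), (5000, 2.5), (11000, 5.5)]
c = fit_seconds_per_cell(measurements, dim=30)

one_million = estimate_seconds(1_000_000, 30, c)     # seconds
ten_million = estimate_seconds(10_000_000, 30, c)    # seconds
```

Such an extrapolation is only as good as the linearity assumption; memory limits or disk access at larger data sizes are not captured by the model.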

4.5 Cluster quality

In this part, the cluster quality of SOM is investigated. To find out how good the clusters formed by SOM are, a comparison was made with segments formed by K-means. The Davies-Bouldin index was used to identify the k for K-means. A cluster quality test was conducted in which SOM and K-means clustering were each run ten times. When SOM was initialized linearly, both the squared error and the quantization error remained stable, i.e. with very small variations throughout the test runs. The number of clusters suggested by the Davies-Bouldin index, on the other hand, varies between four and six when the upper limit of the number of clusters is set to 10 (see figure 10). In the tests, K-means failed to find the expected five clusters 44% of the time. Because K-means always converges to the specified number of clusters, determining the cluster number is a crucial phase for K-means. When the same test runs were conducted with SOM, SOM was able to find the five clusters in 100% of the cases.

Figure 10. Davies-Bouldin index (y-axis) for K-means clustering when the maximum number of clusters is set to 10 (x-axis). The index value is lowest when the number of clusters is four.

In all the cases where K-means also found five clusters, the error rates were very close to the ones SOM gives. Based on the cluster quality measurements, K-means and SOM are equally good for segmentation purposes. Based on the overall test results, SOM is more stable than K-means, which gives poor results if the k is not chosen correctly.
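
For reference, the Davies-Bouldin index used above to choose k can be computed as in the following sketch, where lower values indicate more compact and better separated clusters. This is a plain Python illustration of the standard definition (mean distance to the centroid as scatter, Euclidean distance between centroids) and not the implementation used in this thesis. The points and labelings are hypothetical.

```python
import math

def centroid(points):
    """Per-attribute mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    """Davies-Bouldin index for a labeling with at least two clusters
    whose centroids are distinct."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    cents = {lab: centroid(ps) for lab, ps in clusters.items()}
    # Scatter: mean distance of the members to their cluster centroid.
    scatter = {lab: sum(dist(p, cents[lab]) for p in ps) / len(ps)
               for lab, ps in clusters.items()}
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)

# Hypothetical points forming two well separated groups.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good = davies_bouldin(points, [0, 0, 1, 1])   # labels follow the groups
bad = davies_bouldin(points, [0, 1, 0, 1])    # labels cut across the groups
```

To choose k for K-means, the index would be evaluated for each candidate k and the k with the lowest value selected, which is how figure 10 is read.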

5 Summary

5.1 Discussion

This thesis was commissioned by Aito Technologies to find out by what means customers can be segmented based on their usage. This chapter therefore concludes and discusses some of the main factors that Aito can use to decide how to proceed with this topic. All of these instructions should be considered preliminary, as the tests have not been run with actual data. The chapter focuses on the data preparation instructions, SOM related guidelines, scalability (because of its importance in this problem domain) and finally the end-user aspect of customer segmentation.

5.1.1 Data preparations

To be able to segment customers based on their behavior, the data needs to be aggregated to describe the service usage patterns of a customer. When considering the service usage of different operator customers, a period of one week can already show some usage patterns. Information regarding the usage occasion can also be extracted when the aggregation is done over a week. Consequently, this would be the minimum time period to perform customer segmentation on.

Another aspect of the aggregation is to decide the analysis angle for the segmentation. When talking about customer segmentation, the obvious starting point is the customers and their usage patterns. But there is no reason why handsets or service providers could not be the entry point for the segmentation analysis as well. These were left out of this thesis, but SOM is applicable for clustering from these perspectives as well.

When deciding which attributes are the most suitable for the clustering, skewness, kurtosis and correlation values should be checked. Attributes with skewed distributions can form segments of their own and are recommended to be removed. Also, the correlation between attributes shows whether some of the attribute pairs should be combined. This thesis handled visit count and download volume as separate attributes, and due to the simulated data many of them had strong correlations. If the real data shows that the download volume and visit count are highly correlated, it is strongly recommended that the attributes are combined into one based on some generic rule.

Finally, as the results of this thesis show, the average revenue per customer, occasion of usage, number of visits per service group per customer and download volume per service group per customer are applicable as clustering attributes for behavioral segmentation.

5.1.2 SOM segmentation

When considering what the initial data vectors for the SOM clustering should be, linear initialization was found to give the most stable results. Random initialization can be considered as well, but it would require several runs with the same data and selecting the run with the best cluster quality. An interesting analysis would be to see whether linear initialization is as stable with a real data set as with the simulated data. Based on the results from the empirical study, linear initialization is considered the best choice.

To be able to use SOM for automatic segmentation, an efficient algorithm is required for extracting the clusters from the map. If the clusters of the real data are as obvious as in the simulation, an algorithm based on similarity coloring would suffice for the cluster extraction. Another way would be to determine the clusters based on the distances (Vesanto, Using SOM in Data mining, 2000).

The patterns hidden in the data were all found by SOM. Although some of the use cases might be somewhat artificial, the idea was to prove that the patterns can be found with the help of SOM segmentation. For use case 1 the conclusion is that cross-marketing is beneficial from the operator point of view. The result showed that targeted marketing of media services on communication service sites, or vice versa, during evening hours might be beneficial for an operator. Next, the information from use case 2 could be used for offering handset packages with high download speeds to the student segment, as they tend to download heavily. The most profitable customers for the company, based on their heavy usage and high revenue, are found from the segmentation as shown by use case 3. Finally, the results from use case 4 showed that the teenagers are very active users of the services. These are low average revenue customers but still a very important group for the operators, as they are going to be the adult users of the future.

5.1.3 Scalability

Operator networks produce millions of rows of data per day. Therefore, the system for segmentation needs to be able to scale to support large data sets. A strain on the database side comes from the aggregation of a large data set. SOM has linear performance, and the scalability tests during this thesis verified that SOM is applicable for very large data sets. A large operator can have 100 million customers, and according to the findings of this thesis SOM segmentation can process this amount in around 15 hours. The environment from which these run time approximations were collected is not an optimal one. To get more precise run times, it is advisable to run the data set in a more efficient environment. All in all, these results show that the SOM algorithm performs linearly. Based on this, performance should not be a bottleneck for customer segmentation even for large telecommunication operators.

5.1.4 End-user point of view

From the end-user point of view, the segmentation application should allow users to select any subset of the data. Possible selections would be to choose a certain time frame, to narrow down the segmentation to cover only specific service groups or to restrict the segmentation analysis to some specific handset models. Also, the users should be able to choose what type of segmentation is of interest, e.g. demographic, location based, usage based or experience based. SOM is applicable for any subset of the huge data set. If the cluster selection can be automated, mapping each customer to a certain segment and presenting it to the end-users would be straightforward. The most important question is whether the processing can be done without labor-intensive analysis of the formed clusters. To overcome this, the clustering results should present profiles of the customers, and based on these profiles domain experts could determine their applicability. Another important matter is to understand whether the clusters are good enough, especially when they have been formed by an automated process. The automated way of determining cluster quality is often based on some error criterion. However, even if the clustering produces good quality clusters, they might not be applicable as such. Therefore, domain experts always need to verify the relevance and usefulness of the segments.

5.2 Conclusions

In today’s competitive environment, telecommunication operators are investing in understanding their customers better, especially their most profitable customer groups and the groups that have the biggest potential to become such. By segmenting customers based on their behavior, the operators can better target their actions for the different groups to meet the customer expectations. Automated behavioral segmentation based on actual customer usage is possible thanks to the transaction data gathered from the operator network. Data mining and clustering techniques are used to identify groups with similar behavior and how they relate to other groups. These methods make the automation of the segmentation process possible. Hence, three clustering methods were investigated in this thesis to find out which factors should be taken into consideration when segmenting customers based on network transaction data. For this problem domain, four factors were found to be the most crucial when selecting a clustering method: identifying relationships between clusters, performance of the algorithm, handling of outliers and missing values, and finally the ease of analyzing the results.

First of all, once customers are segmented, being able to say which segments are more similar to each other and how they differ from other clusters is considered an important factor. Out of the three studied methods, SOM was the only method that preserves the relations of the original data set. To verify this, we simulated artificial behavioral data based on customers, handsets and service providers, with clusters hidden inside. We found that SOM was able to identify the clusters, and the results confirmed that the neighboring clusters had similar usage patterns. By understanding what kind of behavior certain segments have, the customers of the segment can be reached through more precise and relevant sales and marketing activities.

Secondly, as network operators have millions of transactions passing through the system on a daily basis, the segmentation method needs to perform efficiently and scale. SOM and K-means both have linear time complexities and are equally suitable for large data sets. However, SOM’s advantage is that the performance can be improved significantly with the two-level approach that reduces the data to be clustered. As the verification of the run times showed, SOM is applicable for very large data sets.

Thirdly, missing values and outliers are something that the gathered network data can contain. After studying the literature, SOM was identified as the only method out of the three that is able to handle missing values and is robust against outliers.

Finally, from the analyst’s perspective the ease of describing the results is of great importance, as discovering clusters, sub-clusters or outliers is hard enough. SOM proved its efficiency in visualization, while the other methods mostly rely on descriptive summaries and histograms. With SOM we were able to spot the hidden patterns in the data with the help of different maps and attribute component planes.

Another purpose of this thesis was to identify attributes that are meaningful for behavioral segmentation. For customer segmentation in general, the value of the customer is the key factor, in this case described by average revenue per customer (ARPU). For behavioral segmentation, two attributes describing the usage of customers are extractable from the network data: visit count per service per customer and download volume per service per customer. The results showed that both attributes can be used for finding out different usage patterns of a customer. Additionally, an attribute such as the service provider group was used to find out what kind of service combinations customers use. This information provides valuable input for cross-marketing.

Based on this study, SOM is applicable for customer segmentation. It performs well with large data, and the usage based segmentation produces meaningful results. However, one might argue that the true nature of the segments cannot be presented without attitude and feeling based factors that are often covered by customer surveys. Fortunately, the clustering approach presented in this study does not prevent incorporating any of these additional factors into the customer analysis.

5.3 Future work

As the time frame of this thesis was limited, some of the interesting factors and ideas were left unexplored. This chapter presents suggestions regarding future research areas and subjects.

First of all, it would be interesting to study in more detail the parameters related to SOM training. For example, a longer training length would allow more data samples to be taken into account during the training. The clusters would therefore represent the actual data set better, but the extra computational time spent on this needs to be compared with the benefits gained from the better clustering. A larger neighborhood radius would mean that each training round affects a larger set of nodes. Instead of choosing the initial parameters through trial and error, Su, Liu, and Chang (1999) propose an efficient initialization scheme to construct an initial map.

Secondly, in order to automate the whole segmentation process, the cluster extraction algorithms need to be tested. Some algorithms for this already exist, identifying the cluster boundaries and extracting the nodes inside the boundaries. Murtaugh (1995) presents a hierarchical clustering constrained by the neighborhood relationship, and Vesanto (Using SOM in Data mining, 2000) advises using the distance matrix as the basis for clustering.

Also, the perspective of the behavioral segmentation can be changed from customers to handsets or specific services, as these are obtainable from the operator data as well. These investigations were left outside this thesis, but the analysis of the usage patterns of handsets and service groups would be an interesting one.

In addition, WAP data was analyzed for this thesis, but the findings made here are applicable to other technologies such as multimedia messaging (MMS), short message services (SMS), Internet and voice. Incorporating these technologies into the behavioral segmentation would provide an overview of the overall usage of the customers.

Also, an analysis that was omitted from this thesis but provides valuable information from the operator point of view is RFM: recency of last contact, frequency of contacts and monetary value of the customer. RFM would provide an interesting addition to the behavioral clustering research.

When real data is available, tests similar to the ones performed in this thesis would be necessary. This sort of investigation would reveal whether the real world data set consists of clear clusters or whether they overlap a lot, and whether there is a significant number of outliers. Once the customer segments have been profiled, it would be interesting to compare these findings with the actual segments that the operators are currently using, to see how much they correlate and what possible new opportunities would be found.

All in all, automating the whole process from transaction data to meaningful customer segments can be somewhat challenging. To avoid any pitfalls on the way, tests similar to those in this thesis should be performed once real data is available.


References

Aito Technologies. (2007). Product Description: Aito Alchemist release 1.1. Aito Technologies.
Arnborg, S. (2008). Statistical Methods in Applied Computer Science. Lecture notes, Stockholm.
Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2000). A support vector clustering method. International Conference on Pattern Recognition.
Ben-Hur, A., Horn, D., Siegelmann, H. T., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research.
Berry, M., & Linoff, G. S. (2004). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Indianapolis, Indiana, United States of America: Wiley Publishing.
Bloom, J. Z. (2004, December). Tourist market segmentation with linear and non-linear techniques. Tourism Management, 25 (6), pp. 723-733.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, pp. 273-297.
Finley, T., & Joachims, T. (2005). Supervised Clustering with Support Vector Machines. International Conference on Machine Learning. Bonn.
Fröhlich, J. (1997). Neural networks with Java. Retrieved May 20, 2008, from http://www.etimage.com/java/appletNN/NNtyper/e-1.html
Himberg, J. (1998). Enhancing SOM-based data visualization by linking different data projections. Intelligent Data Engineering and Learning, (pp. 427-434). Hong Kong.
Hsieh, N.-C. (2004, November). An integrated data mining and behavioral scoring model for analyzing bank customers. Expert Systems with Applications, pp. 623-633.


Huang, J.-J., Tzeng, G.-H., & Ong, C.-S. (2007, February). Marketing segmentation using support vector clustering. Expert Systems with Applications, 32 (2), pp. 313-317.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999, September). Data Clustering: A Review. ACM Computing Surveys, 31 (3), pp. 264-323.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Retrieved May 1, 2008, from Michigan State University: Department of Computer Science and Engineering: http://www.cse.msu.edu/~jain/Clustering_Jain_Dubes.pdf
Kaski, S., Venna, J., & Kohonen, T. (1999). Coloring that Reveals High-Dimensional Structures in Data. 6th International Conference on Neural Information Processing, II, pp. 729-734. Perth.
Kiang, M. Y., Hu, M. Y., & Fisher, D. M. (2006, October). An extended self-organizing map network for market segmentation - a telecommunication example. Decision Support Systems, pp. 36-47.
Kohonen, T. (1995). Self-organizing map. Berlin: Springer.
Kohonen, T., Hynninen, J., Kangas, J., & Laaksonen, J. (1995). The Self-Organizing Map Program Package. Espoo: Helsinki University of Technology.
Kotler, P., & Armstrong, G. (2005). Principles of Marketing. New Jersey, United States of America: Pearson Education.
Merkevičius, E., Garšva, G., & Simutis, R. (2004). Forecasting of credit classes with the self-organizing maps. Information Technology and Control, pp. 45-52.
Murtaugh, F. (1995). Interpreting the Kohonen self-organizing feature map using contiguity-constrained clustering. Pattern Recognition Letters, pp. 399-408.
Peltarion. (2007, October 10). Synaptic: The Peltarion Blog. Retrieved May 20, 2008, from http://blog.peltarion.com/2007/04/10/the-self-organized-gene-part-1/
Platt, J. C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research.
Saarenvirta, G. (1998). Mining Customer Data: A step-by-step look at a powerful clustering and segmentation methodology. Retrieved May 2008, from http://www.ibmdatabasemag.com/db_area/archives/1998/q3/98fsaar.shtml


Su, M.-C., Liu, T.-K., & Chang, H.-T. (1999). An efficient initialization scheme for the self-organizing feature map algorithm. Neural Networks, (pp. 1906-1910). Washington.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston, United States of America: Pearson Education, Inc.
Vehviläinen, P. (2004). Data Mining for Managing Intrinsic Quality of Service in Digital Mobile Telecommunications Networks. Thesis for the degree of Doctor of Technology, Tampere.
Vellido, A., Lisboa, P. J., & Meehan, K. (1999). Segmentation of the on-line shopping market using neural network. Expert Systems with Applications, pp. 303-314.
Vesanto, J. (2002, May). Data exploration process based on the self-organizing map. Acta Polytechnica Scandinavica.
Vesanto, J. (1999). SOM-Based Data Visualization methods. Intelligent Data Analysis, 3 (2), pp. 111-126.
Vesanto, J. (2000). Using SOM in Data mining. Espoo: Helsinki University of Technology.
Vesanto, J., & Alhoniemi, E. (2000, May). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11 (3), pp. 586-600.
Vesanto, J., Himberg, J., Alhoniemi, E., & Parhankangas, J. (2000). SOM Toolbox for Matlab 5. Espoo: Libella.
