

Debt Analytics: Proactive prediction of debtors in the telecommunications industry

Ana Henriques Narciso

Thesis to obtain the Master of Science Degree in

Information Systems and Computer Engineering

Supervisors: Prof. Francisco António Chaves Saraiva de Melo

Prof. José Alberto Rodrigues Pereira Sardinha

Examination Committee

Chairperson: Prof. Miguel Nuno Dias Alves Pupo Correia

Supervisor: Prof. Francisco António Chaves Saraiva de Melo

Member of the Committee: Prof. Paulo Jorge Fernandes Carreira

November 2015


Abstract

Telecommunications businesses sometimes face new customers who subscribe to services with no real intention of paying for them. This special class of fraudulent customers, known as never-payers, is responsible for significant revenue losses, despite being a tiny subset of all subscribers. Besides the unpaid monthly bills, additional resources are spent unnecessarily on service activation, CRM processes and collection management.

This thesis was developed in collaboration with a telecommunications company whose goal is to predict the never-payer population, consisting of post-paid customers who are never going to pay for the newly subscribed services. The main challenge is to predict the outcome even before the customer’s account is activated. At that point, little customer data is available for analysis. When that data is insufficient, the first month of behaviour can be integrated to improve predictions.

The final platform is built with Microsoft BI stack tools and follows the CRISP-DM methodology. The integration module is capable of loading, cleaning and summarising large amounts of input data that provide information about new customers. Then, the analytical module selects a specific set of relevant attributes to train several predictive models. Those models were tested on new, unknown customers to estimate the likelihood that they will never pay their debts. Ad-hoc exploration of the input data and results is also possible using tools such as Excel, Power Pivot and Power View. The solution was evaluated using data mining performance measures.

Keywords: Telecommunications, Fraud, Never-payer, Data Mining, Predictive Model, SQL Server


Resumo

Telecommunications companies sometimes face new customers who subscribe to services with no real intention of paying for them. This special class of fraudulent customers, the never-payers, is responsible for significant revenue losses, despite being a tiny subset of all customers. Besides the unpaid monthly bills, additional resources are spent unnecessarily on subscription, CRM processes and collection management.

This thesis was developed in collaboration with a telecommunications company whose goal is to predict the never-payer population, consisting of post-paid customers who will never pay for the newly subscribed services. The main challenge is to predict the outcome even before the customer account is activated. At that point, very little customer data is available for analysis. In those cases, the first month of behaviour can be integrated to improve the predictions.

The final platform is built with Microsoft BI tools, based on the CRISP-DM methodology. The integration module is responsible for loading, cleaning and summarising large amounts of data that provide information about new customers. The analytical module then selects a specific set of relevant attributes to train several predictive models. These models were tested with new customers, computing the probability that they will never pay their debts. Ad-hoc exploration of the input data and results is also possible using tools such as Microsoft Excel, Power Pivot and Power View. The solution was evaluated using performance metrics commonly used in data mining.

Keywords: Telecommunications, Fraud, Never-Payers, Data Mining, Predictive Model, SQL Server


Acknowledgments

First, I have to thank my thesis supervisors, Professor Francisco Melo and Professor Alberto Sardinha, for their support and understanding over this past year. Most of my energy was spent on my day job, but it was Rui Santos who always remembered to plan things and to make the extra effort, and for that I am forever grateful. I would also like to thank Pedro Estanislau and Luís Batista for putting up with my data cravings, and the telecom company that made everything possible.

I thank my family for their patience and encouragement. I am also grateful to my partner, who has supported me since my first year at this Institution and has taught me so much about perseverance and hard work. Finally, I thank my friends for understanding my sudden absence from our social gatherings.


Table of Contents

Abstract
Resumo
Acknowledgments
Table of Contents
List of figures
List of Tables
List of Acronyms
1. Introduction
1.1. Problem Definition
1.2. Objectives
1.3. Document Outline
2. Fraud Detection: Context and Related Work
2.1. Telecommunications CRM
2.2. Fraud Detection
2.2.1. Telecom Fraud
2.2.2. Related Work
3. A Telecom Company: A Case Study
3.1. Business Model
3.2. Data Model
4. Solution
4.1. Solution Overview
4.2. Methodology
4.2.1. Business Understanding
4.2.2. Data Understanding and Preparation
4.2.3. Modelling
4.2.4. Deployment
5. Validation and Results
5.1. Validation Plan
5.2. Results
6. Conclusion
6.1. Contributions
6.2. Future Work
7. References


List of figures

Figure 1 – Diagram of the typical customer lifecycle. Adapted from [3].
Figure 2 – Diagram of the bad payer lifecycle.
Figure 3 – Global Telecom Fraud Loss in billions of US dollars. Data retrieved from CFCA surveys [9], [11]–[15].
Figure 4 – Fraud Loss and Detection Risk. Adapted from [16].
Figure 5 – Fraud management organisation – people, processes and technology [8].
Figure 6 – Architecture of the patented system that collects customer data to compute the likelihood of being a never-payer [17].
Figure 7 – Flowchart of the patented system that calculates a never-pay score to determine the approval of credit applications [17].
Figure 8 – Diagram of the patented first-party fraud detection system [18].
Figure 9 – Business process (AS-IS) describing the customer lifecycle of a never-payer.
Figure 10 – Diagram of the three main entities of the telecom company’s data model.
Figure 11 – Diagram of the entities available at subscription time.
Figure 12 – The Segment hierarchy.
Figure 13 – Diagram of all the entities available before and after the customer account is activated.
Figure 14 – Entity-Relationship Diagram (Chen’s Database Notation) of the complete data model.
Figure 15 – Entity-Relationship Diagram (Crow’s Foot Notation) of the input data provided by the IT department.
Figure 16 – Application Architecture Overview.
Figure 17 – CRISP-DM, the standard approach for developing data mining applications [20].
Figure 18 – Microsoft’s proposed data mining process methodology [22], heavily based on CRISP-DM [20].
Figure 19 – Overview of the SSIS package responsible for extracting data from flat files.
Figure 20 – Data loading examples implemented by SSIS packages.
Figure 21 – Entity-Relationship Diagram (Crow’s Foot Notation) of the input data after it was prepared.
Figure 22 – Entity-Relationship Diagram (adapted from Crow’s Foot Notation) of the Case Table including the predictive attribute.
Figure 23 – Never-payer distribution across the cities around Lisbon (Power Map).
Figure 24 – Plots comparing the content of pricing plans for consumers and businesses.
Figure 25 – Plots comparing the type of service usage, for the NP0 and NP1 population.
Figure 26 – Plots comparing the number of risk evaluations before activation and the latest score across datasets.
Figure 27 – Mining model for Consumers using oversampling and different algorithms.
Figure 28 – Example of the mining model viewer for a Naïve Bayes algorithm.
Figure 29 – Example of the mining model viewer for a Microsoft Decision Tree algorithm.
Figure 30 – Example of a lift chart for a hybrid sampling data mining model.
Figure 31 – Data mining process orchestrated by Integration Services (SSIS).
Figure 32 – Business process (TO-BE) describing the customer lifecycle of a never-payer.


List of Tables

Table 1 – Comparison of the related work described in this document.
Table 2 – Client entity attributes.
Table 3 – Account entity attributes.
Table 4 – Service entity attributes.
Table 5 – Risk Evaluation attributes.
Table 6 – Postal entity attributes.
Table 7 – Pricing Plan entity attributes.
Table 8 – Usage entity attributes.
Table 9 – Billing table entity.
Table 10 – Payment entity attributes.
Table 11 – Campaign entity attributes.
Table 12 – Never-payer distribution across the cities around Lisbon (table view).
Table 13 – Never-payer distribution across all Portuguese districts.
Table 14 – Correlation analysis between usage attributes.
Table 15 – Confusion matrix for the never-payer classifier.
Table 16 – Training and testing set for Consumer and Business datasets.
Table 17 – Validation results for all combinations of segments, algorithms, sampling strategies and data types.


List of Acronyms

BI Business Intelligence

CDR Call Detail Record

CRM Customer Relationship Management

DM Data Mining

DW Data Warehouse

ERP Enterprise Resource Planning

ETL Extract, Transform, Load

GSM Global System for Mobile Communications

IT Information Technology

KPI Key Performance Indicator

M2M Machine to Machine Communications

MBB Mobile Broadband

MSSQL Microsoft SQL Server

MVNO Mobile Virtual Network Operator

NP Never-Payer

NP0 A customer who is not a never-payer

NP1 A customer who is a never-payer

SME Small and Medium Enterprises

SOHO Small Office Home Office

SQL Structured Query Language

SSAS (Microsoft) SQL Server Analysis Services

SSIS (Microsoft) SQL Server Integration Services


1. Introduction

1.1. Problem Definition

Every business tries to minimise the number of customers who do not make their payments on time. This debt is not always easy to recover: it depends on the willingness and financial capacity of the delinquent customers, as well as many other factors. Organisations have to spend considerable money and resources to recover bad payments. One strategy to avoid this kind of risk is to intervene proactively, before a customer runs into debt.

Telecommunications businesses sometimes face new customers who subscribe to services with no intention of paying for them, a form of subscription fraud. These so-called never-payers are responsible for large losses each month, not only because of the bills that will never be paid, but also because of the costs and resources associated with the subscription of a given service by the fraudulent customer. The main challenge of this work is to predict this particular class of post-paid customers who are never going to pay for the newly subscribed services.

This system was developed in collaboration with a telecommunications company that needed to identify the typical profile of a risky customer and to determine, for every new potential customer, the probability of being a never-payer. Although these customers are a very small fraction of the thousands of new subscriptions every month, they represent substantial losses that could be avoided or, at least, mitigated.

1.2. Objectives

This work performs a preliminary analysis of the data provided by the data warehouse of a telecommunications company, such as customer data, pricing plans, risk evaluations, usage and billing history, as well as other common reference attributes and historical data typical of the telecom industry. This exploratory analysis will help identify the set of attributes that best profiles a never-payer customer and defines a predictive model. Supervised mining models will then profile past debtors and learn their shared characteristics, enhancing the proactive detection of potential debtors.
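As a rough illustration of what profiling past debtors could look like, the sketch below estimates the never-payer rate for each value of a single attribute from labelled historical customers. The attribute name `segment`, the sample records and the function name are hypothetical, and this is not the method used in this work, which trains its models with Microsoft's data mining tools.

```python
from collections import defaultdict

def never_payer_rate_by_value(customers, attribute):
    """Return {attribute value: share of never-payers (NP1)} from labelled history."""
    totals = defaultdict(int)
    never_payers = defaultdict(int)
    for customer in customers:
        value = customer[attribute]
        totals[value] += 1
        never_payers[value] += customer["np"]  # np is 1 for NP1, 0 for NP0
    return {value: never_payers[value] / totals[value] for value in totals}

# Hypothetical labelled history (illustrative values only, not thesis data).
history = [
    {"segment": "consumer", "np": 0},
    {"segment": "consumer", "np": 1},
    {"segment": "consumer", "np": 0},
    {"segment": "consumer", "np": 0},
    {"segment": "business", "np": 0},
    {"segment": "business", "np": 0},
]

rates = never_payer_rate_by_value(history, "segment")
print(rates)
```

A real model would combine many attributes at once; a per-attribute rate like this is only a first look at which values are over-represented among past never-payers.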

State-of-the-art systems rely on both customer characteristics and behaviour. Only two patented systems rely solely on customer characteristics instead of behaviour, an approach that is much more difficult and prone to predictive errors. For instance, it is very challenging to judge whether a new customer is risky based purely on demographic attributes, with no past behaviour. This system tries to rely only on the customer attributes available upon acquisition and to minimise the behavioural data needed to predict the outcome.

The final solution comprises a platform built entirely on Microsoft BI stack tools. Integration components are capable of loading, cleaning and summarising large amounts of input data that provide information about new customers. Then, analytical components select a specific set of relevant attributes to train several predictive models, which are applied to unknown customers to decide whether they are likely to never pay their debts. Ad-hoc exploration of the data and results is also possible using tools such as Excel, Power Pivot and Power BI.

1.3. Document Outline

The rest of this thesis is organised as follows:

Chapter 2 presents the most relevant concepts that are important for understanding this work. It

describes the generic relationship between the customer and a telecommunications company – the

customer lifecycle. Then, it explains how fraudulent customers affect telecom companies, pointing out

the most relevant fraud detection systems for this thesis.

Chapter 3 introduces the main subject of this thesis – a telecommunications company. This chapter

focuses on the business model that supports the customer lifecycle described in Chapter 2, and

highlights the challenges of this thesis. The data model supporting the business is also detailed.

Chapter 4 presents the implemented solution, beginning with an overview of the data mining system

and how components interact. Then, it explains all the development phases, from business

understanding, data understanding and preparation, to modelling.

Chapter 5 describes the datasets used in the validation phase, as well as the validation methodology

employed in the different experiments. Additionally, it discusses the results obtained in each strategy.

Finally, Chapter 6 presents the main conclusions of this work and provides an overview of all

contributions of this thesis. It also presents a discussion of future directions that this thesis can point to.


2. Fraud Detection: Context and Related Work

This chapter introduces the context of this thesis starting with an overview of the telecommunications

industry. Section 2.1 details the general relationship between the company and customer, identifying

several challenges and opportunities. Section 2.2 introduces fraud in the industry as well as fraud

detection systems.

2.1. Telecommunications CRM

CRM (Customer Relationship Management) is defined as the strategy for building, managing, and

strengthening loyal and long-lasting customer relationships [1]. Just like in any other industry, the CRM

process of a telecom company should be a customer-centric approach based on customer insight.

The term customer lifecycle refers to the various stages of the relationship between a customer and a business. It is a framework for understanding customer behaviour, and understanding it is vital because it directly relates to long-term customer value [2].

Figure 1 – Diagram of the typical customer lifecycle. Adapted from [3].

Figure 1 shows that the relationship begins with a prospect, a person or company in the target market that is not yet a customer. Upon agreement, and after providing personal information, the prospect signs a contract and becomes a new customer. This customer becomes an established customer after interacting with the company, for instance, when a service is activated and used.

Relationship management should be the longest and most profitable phase of this lifecycle, in which the customer pays for the subscribed services, may be targeted by a campaign and may even make a complaint. However, sometimes the customer churns, becoming a former customer who could be won back through some campaign. Churn is a term used in the telecommunication service industry to denote the customer movement from one provider to another [4]. Churning may be voluntary or forced, for instance, when a customer incurs bad debt.

[Figure 1 diagram labels: states Prospect, New Customer, Established Customer and Former Customer; phases Acquisition and Relationship Management; transitions Agreement, Activation, Churn and Winback.]


This thesis focuses on understanding what enables a telecommunications company to predict the likelihood that a customer will not pay his bills on time. Therefore, it is necessary to analyse the lifecycle of a bad payer. Figure 2 depicts an instantiation of the customer lifecycle (Figure 1) for customers who incur debt.

Figure 2 – Diagram of the bad payer lifecycle.

It may be called the bad payer lifecycle, featuring a hypothetical telecom customer who signs a contract and uses the subscribed services, but whose bills are past due most of the time. This cycle features three main phases:

1. Acquisition, similar to the previous diagram. It represents the entry point for a prospect who wants to subscribe to some service. At this moment, the prospect supplies the telecom company with his information to draw up a contract and activate an account. The telecom company can perform some kind of risk evaluation, such as a background check, to avoid potential bad payers.

2. Intermediate, during which the customer uses the subscribed services and bills are issued. It should be lengthy and profitable unless he stops paying and enters the next phase.

3. Recovery, when a bad payer is subjected to a set of strategies carried out by the company in order to recover the bad debt. He could, ultimately, churn and leave voluntarily, or even be forced to churn because of his irrecoverable debt.
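The three phases above can be read as a small state machine. The sketch below is one possible encoding in Python, using the state and transition labels that appear in Figure 2; the exact wiring of each transition, and in particular the state reached after a winback, is an interpretation and should be treated as an assumption.

```python
# Transition table for the bad payer lifecycle: (state, event) -> next state.
# The wiring is an interpretation of the Figure 2 labels, not a specification.
TRANSITIONS = {
    ("Prospect", "Agreement"): "New Customer",
    ("New Customer", "Activation"): "Established Customer",
    ("Established Customer", "Bill past due"): "Bad Payer",
    ("Bad Payer", "Pay debt"): "Established Customer",
    ("Established Customer", "Churn"): "Former Customer",
    ("Bad Payer", "Churn"): "Former Customer",
    ("Former Customer", "Winback"): "New Customer",  # assumed target state
}

def run_lifecycle(events, state="Prospect"):
    """Apply events in order and return the list of visited states."""
    visited = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} is not allowed in state {state!r}")
        state = TRANSITIONS[key]
        visited.append(state)
    return visited

# A customer who signs up, is activated, misses a bill and is churned.
path = run_lifecycle(["Agreement", "Activation", "Bill past due", "Churn"])
print(" -> ".join(path))
```

In this reading, a never-payer is a customer whose path reaches Bad Payer on the first bill and never takes the Pay debt transition.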

Many companies are increasingly using data mining techniques for CRM, which help not only to address individual customers’ needs [5], but also to predict customer behaviour.

The key challenge of this thesis is to detect, at the earliest opportunity, the set of customers who will not pay any bill after signing a contract (never-payers). Ideally, this would be feasible during Acquisition, as soon as the company obtains some information about a prospect who matches a known never-payer pattern. However, if the results are inconclusive, one may look into a short period of behaviour during the Intermediate phase.
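The two-step idea in the paragraph above can be sketched as a simple decision rule. Everything specific in the code is an assumption made for illustration: the scores are taken to be probabilities in [0, 1], the thresholds `low`, `high` and 0.5 are arbitrary, and the label names are hypothetical.

```python
def classify_never_payer(acquisition_score, first_month_score=None,
                         low=0.2, high=0.8):
    """Two-stage decision: trust the acquisition-time score when it is
    conclusive, otherwise fall back to a score that uses first-month behaviour."""
    if acquisition_score >= high:
        return "never-payer"
    if acquisition_score <= low:
        return "regular"
    # Inconclusive at acquisition time: behavioural data is needed.
    if first_month_score is None:
        return "undecided"
    return "never-payer" if first_month_score >= 0.5 else "regular"

print(classify_never_payer(0.9))        # conclusive at acquisition
print(classify_never_payer(0.5))        # inconclusive, no behaviour yet
print(classify_never_payer(0.5, 0.7))   # resolved with first-month behaviour
```

The design choice here is that behavioural data is consulted only for the inconclusive middle band, which keeps the decision as early as possible for clear-cut cases.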

[Figure 2 diagram labels: states Prospect, New Customer, Established Customer, Bad Payer and Former Customer; phases Acquisition, Intermediate and Recovery; transitions Agreement, Activation, Bill past due, Pay debt, Churn and Winback.]


2.2. Fraud Detection

2.2.1. Telecom Fraud

Fraud, a dishonest attempt to convince an innocent party that a legitimate transaction is occurring when,

in fact, it is not [6], is as old as humanity itself [7]. In the last century, fraud matured in the area of

transactional businesses, such as the telecommunications industry [6]. Fraud in the telecom industry

means the use of telecommunications products or services without the intention of paying [8], [9].

Fraud incidents increased dramatically with the expansion of modern technology and the Internet,

leading to the loss of billions of dollars worldwide each year [7]. Fraud negatively impacts everyone,

especially customers, since fraud losses increase communications carriers’ operating costs [9].

Moreover, fraud can cause distress, loss of service and loss of customer confidence [10].

Current statistics, released by the Communications Fraud Control Association (CFCA) this year, point to a global loss of about 38 billion (USD), which represents almost 2% of telecom revenues. According to CFCA, subscription fraud has made it to the top five methods for committing fraud, adding up to more than 6 billion (USD) of loss worldwide.
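A quick back-of-the-envelope check puts these figures in proportion. The sketch below treats "almost 2%" as exactly 2% and "more than 6 billion" as exactly 6 billion, so the results are rough bounds rather than reported statistics.

```python
fraud_loss_busd = 38.0        # global fraud loss quoted above, USD billions
share_of_revenue = 0.02       # "almost 2%", treated here as exactly 2%
subscription_loss_busd = 6.0  # "more than 6 billion", used as a lower bound

# Revenue implied by the loss and its share of revenue.
implied_revenue_busd = fraud_loss_busd / share_of_revenue
# Share of the total fraud loss attributable to subscription fraud.
subscription_share = subscription_loss_busd / fraud_loss_busd

print(f"Implied global telecom revenue: about {implied_revenue_busd:,.0f} billion USD")
print(f"Subscription fraud share of total loss: at least {subscription_share:.1%}")
```

Under these assumptions, global telecom revenues are on the order of 1.9 trillion USD, and subscription fraud accounts for roughly one sixth of the total fraud loss.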

Figure 3 plots the estimated telecom fraud loss since 1999 [9], [11]–[15]. For ten years, fraud loss

increased continuously, peaking at $74 billion. The noticeable decrease in 2011 is attributed to growth

in global revenues outpacing the growth in fraud losses as well as improved anti-fraud programmes

implemented by operators and an increase in collaboration between professionals within the industry.
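The trend described above can be checked against the values shown on the Figure 3 chart. In the sketch below, the values are read from the chart labels and paired with the survey years by position, which is an assumption; the code then finds the peak and the relative change between consecutive surveys.

```python
# Values read from the Figure 3 chart labels (USD billions); pairing each value
# with a survey year by position is an assumption.
fraud_loss = {1999: 12.0, 2003: 37.5, 2006: 57.2, 2009: 74.0,
              2011: 40.1, 2013: 46.3, 2015: 38.1}

years = sorted(fraud_loss)
peak_year = max(fraud_loss, key=fraud_loss.get)

# Relative change between each pair of consecutive surveys.
changes = {
    (prev, curr): (fraud_loss[curr] - fraud_loss[prev]) / fraud_loss[prev]
    for prev, curr in zip(years, years[1:])
}

print(f"Peak: {fraud_loss[peak_year]} billion USD in {peak_year}")
for (prev, curr), change in changes.items():
    print(f"{prev} -> {curr}: {change:+.1%}")
```

With this pairing, the loss rises at every survey up to the peak and then falls by a little under half between the peak survey and the next one.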

Figure 3 – Global Telecom Fraud Loss in billions of US dollars. Data retrieved from CFCA surveys [9], [11]–[15].

[Figure 3 bar chart, Global Telecom Fraud Loss (USD billions) by year: 1999: 12; 2003: 37.5; 2006: 57.2; 2009: 74; 2011: 40.1; 2013: 46.3; 2015: 38.1.]

Telecom companies, customers and third parties may commit fraud against each other. This thesis focuses on fraud perpetrated by customers against telecom firms. Fraud, particularly telecom fraud, appears to be becoming more socially acceptable [10].

Telecommunications fraud is attractive to offenders for many reasons that have been challenging telecom carriers [6], [10], [16]:

Detection risk is low. The sheer volume of transactions increases the probability of fraud going unnoticed because it is such a small proportion of the overall business. Moreover, the level of punishment is relatively small.

There are more telecom carriers every day. As more carriers are created, the amount of intentional fraud increases. Bad payers can simply move between carriers to avoid credit checks.

No special equipment is required (usually). Many frauds do not require IT skills but exploit business procedures, such as selling international calls or subscribing to a product with no intention of paying.

Industry experts estimate that there are more than 200 types of fraud [8]. The nature of fraud committed against telecom carriers can be classified into three broad categories [6]:

Technical fraud. It involves attacks against loopholes in communications technology. It typically needs initial technical knowledge and ability, though once a weakness has been exploited, it can be quickly distributed in a form that non-technical people can use, for example, card cloning.

Contractual fraud. A fraud that generates revenue through the proper use of a service while having no intention of paying for it, for instance, subscription fraud and premium rate fraud.

Procedural fraud. Attacks against the procedures implemented to minimise exposure to fraud and to grant access to the system, for instance, roaming fraud and social engineering.

Some of the most common varieties of fraud in the telecom industry include subscription fraud, when

someone signs up for service (e.g., a new post-paid contract) with no intent to pay, and identity fraud,

when the fraudster masquerades as another customer, making or selling calls on that customer’s account [6].

Besides exploiting technological loopholes, fraudsters can use social engineering, exploiting human interactions, for instance, by pretending to be a phone repair person to gain access to a customer’s account. Furthermore, when telecom companies launch new services, fraudsters realise they can purchase them at a low price and then resell them illegally at a higher price to consumers who are unaware of the service. Even when companies implement regulations to promote fairness, this ends up spawning new types of fraud.

Subscription-based relationships take the form of an ongoing billing relationship where customers have

agreed to pay for a service over time [3]. Subscription fraud violates the relationship agreement and can

happen in two different ways. On the one hand, the customer may be consciously fraudulent; in fact,

fraud will often masquerade as a usage management problem. On the other hand, a legitimate customer

account may be hacked by someone fraudulent. There is no sharp line between intent to pay and the

ability to pay [6].

Figure 4 displays the risk of detection as perceived by a potential fraudster plotted against loss resulting

from the activities of those fraudsters [16].


Figure 4 – Fraud Loss and Detection Risk. Adapted from [16].

The exponential decay is a realistic portrayal, given that a considerable reduction in loss due to fraud can be achieved for a relatively small outlay, usually by improving fraud detection processes.

It is important to differentiate between fraud prevention and fraud detection. Fraud prevention aims to

stop fraud from occurring in the first place whereas fraud detection involves identifying fraud as quickly

as possible once it has been perpetrated [7]. Fraud detection comes into play once fraud prevention has

failed. Throughout this document, fraud detection is used in a broad sense: identifying fraud at the first opportunity, even before it has happened, which also covers prevention.

Fraud detection (or prevention) is a significant challenge for telecom companies because of the volumes of data involved [8]. Every day, a company with 5 million customers can generate

hundreds of millions of transaction records. Telecommunications fraud is not static; new techniques

evolve as businesses put up defences against existing ones [6]. Besides, fighting fraud is complicated

by the existence of multiple telecom carriers. In the game of fraud detection, when one telecom company

is better than its competitors at detecting and stopping fraud, the fraudsters are inclined to move to the

competition.

Figure 5 - Fraud management organisation – people, processes and technology [8].

[Figure 4 labels: axes Risk of Detection and Loss due to Fraud, each running between LOW and HIGH; annotations: Investigations, Increasing prevention costs, Most organizations start here.]

[Figure 5 labels: Enterprise Fraud Management Solution (Deterrence, Prevention, Detection, Mitigation, Policy, Analysis); Fraud Management Organisation (Right People, Right Processes, Right Technology).]

One step towards beating fraudsters is to build a secure “golden database” [8]. That is, using the right

technology to construct a corporate data warehouse (DW), with adequate levels of security, which can

be used across multiple organisational departments, including CRM, fraud management, revenue

assurance and business intelligence. Another step involves prioritising the building of a fraud

management organisation that establishes key control points in the customer lifecycle and combines

the right processes, the right technology and the right people in the right places (Figure 5).

This thesis aims to build an enterprise fraud management solution that uses data mining techniques to

help the organisation prevent subscription fraud.

2.2.2. Related Work

This section presents a brief summary of the two most relevant implementations of fraud detection

systems in the context of this work. These systems share similarities with the system this thesis aims to

build, such as:

Detecting forms of subscription fraud perpetrated by the customers, wherein fraudsters simply

do not pay their debts;

The moment of detection should be as early as possible, even before the customer acquisition;

Behavioural data is not accessible, only customer data and credit databases are used.

At the end, a table compares the two fraud detection strategies.

System and method for automated detection of never-pay datasets

Celka and Rojas patented a method for the automatic detection of never-pay datasets, known as credit rollers, for the credit services industry [17]. They define the never-pay population as those customers who request credit and obtain the credit instrument but never make a payment over the life of the account. The method is designed as a tool to help financial service providers know whether an applicant is likely to never pay after obtaining the credit instrument.

The architecture of this system is shown in Figure 6 and comprises data sources and a never-pay

module that runs the detection algorithm. Data sources contain customer data such as credit bureau

(i.e. collection agencies) data, tradeline data, historical balance data, demographic data and additional

data provided by the customer when applying for credit. The module is configured to obtain customer records from the sources and compare them with already-proven never-pay profiles. It then calculates the likelihood score of matching a never-pay profile.


Figure 6 – Architecture of the patented system that collects customer data to compute the likelihood of being a never-payer [17].

Figure 7 presents the overall flow of the patented system and shows that several predictive models can

be combined with each other to compute a final score. If this score is below a given threshold, the

application is approved. Conversely, if the score is higher than the threshold, the application is sent

for manual analysis.
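The decision rule described above can be illustrated with a minimal Python sketch. The patent does not state how the individual model scores are combined, so the simple mean used here is an assumption, as are the function names, the default threshold and the choice to send a score exactly equal to the threshold to manual analysis.

```python
from statistics import mean


def combine_scores(model_scores):
    """Combine per-model never-pay scores into one final score.

    Assumption: a plain average; the patent does not specify the combination.
    """
    if not model_scores:
        raise ValueError("at least one model score is required")
    return mean(model_scores)


def route_application(model_scores, threshold=0.5):
    """Approve below the threshold, otherwise send to manual analysis.

    Assumption: a score equal to the threshold is treated as risky.
    """
    final_score = combine_scores(model_scores)
    return "approve" if final_score < threshold else "manual_analysis"
```

For example, `route_application([0.1, 0.2])` approves the application, while `route_application([0.9, 0.7])` sends it for manual analysis.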

Figure 7 – Flowchart of the patented system that calculates a never-pay score to determine the approval of credit applications [17].


The patent does not specify how the predictive model is built; that is, how the never-pay profiles are

created from actual credit rollers. It is suggested that profiles are composed of business rules that are

compared with prospects and output a likelihood score.

The authors claim that this system is not limited to the financial services industry but is also suited to other industries, including the telecommunications services industry.

First-party fraud detection system

Mahdi et al. published a patent on a method for detecting first-party fraud using a supervised model that

calculates a risk score for current applications for credit or goods, using identity information provided by

the consumer [18].

In first-party fraud, the fraudster uses his true identity to fill in an application for credit, goods or services, without the intention of fulfilling the payment obligations. In other words, first-party fraud does not involve a stolen identity; the fraudster is willing to ruin his own credit to defraud the victim. The telecommunications industry is also affected, for instance because it offers heavily subsidised smartphones to those who pass a credit check. Fraudsters see this as an opportunity to sign up for as many plans as possible with as many carriers as possible in a short period of time, in order to get as many smartphones as possible at a low price. Fraudsters do present themselves as themselves (i.e., the first party), but they have no intention of paying for the goods as contractually obliged.

Since organisations attempt to check the identity of applicants and first-party fraudsters obviously satisfy

these criteria, first-party fraud is tough to detect and prevent. This type of fraud depends solely on the

will of the applicants, and whether they actually intend to pay after they get the credit.

Figure 8 shows a diagram of the system modules as well as the information flow describing the method

for predicting first-party fraud.

Figure 8 – Diagram of the patented first-party fraud detection system [18].


First, the system receives the customer application containing consumer identity data such as social security number (SSN), name, address, phone number, date of birth, and others. The search module is responsible for matching the current application against prior individual applications provided by a historical module, using linking keys. When identical, or similar, identity information is used frequently within a short period of time for the same or another commodity, this is evidence of first-party fraud.

Then, the generation module is responsible for producing markers (i.e. metrics) that are indicative of first-party fraud based on the identity linking keys. Examples of such markers include the number of applications in the last week/month/year linked by address, SSN, or phone, or the number of unique emails used in the last week/month/year linked by the various identity elements described.
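To make the idea of markers more concrete, the following Python sketch counts prior applications that share a linking key with the current application within each time window. It is only an illustration: the record layout (dictionaries with `date`, `address`, `ssn` and `phone` fields), the marker names and the window lengths of 7, 30 and 365 days are assumptions, not details taken from the patent.

```python
from datetime import date, timedelta

# Assumed window lengths in days for "last week/month/year".
WINDOWS = {"week": 7, "month": 30, "year": 365}


def velocity_markers(current, history, keys=("address", "ssn", "phone")):
    """Count prior applications sharing each linking key within each window."""
    markers = {}
    for key in keys:
        for name, days in WINDOWS.items():
            start = current["date"] - timedelta(days=days)
            markers[f"apps_by_{key}_last_{name}"] = sum(
                1
                for prior in history
                if prior[key] == current[key]
                and start <= prior["date"] < current["date"]
            )
    return markers
```

A supervised model would then receive these counts as input features, one value per linking key and window.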

Finally, the predictive module computes a risk score based on the markers, where the risk score represents the chance that the current application is first-party fraud. This module uses standard supervised machine learning algorithms trained on examples of previous fraud attempts. These algorithms can include neural networks, support vector machines, boosted trees or regressions.

Summary

After detailing the most relevant state of the art regarding fraud detection systems, it pays to compare the two systems along the dimensions that distinguish them, such as target market, data sources and predictive model.

Table 1 presents this comparison. Some conclusions can be drawn:


Dimension | Celka and Rojas 2008 | Mahdi et al. 2014
Target market | Credit Services from Finance, Telecom, Retail and other industries | Generic (Credit Services)
Targeted fraudulent population | Never-payers | First-party fraudsters
Data Sources | Customer data, Tradeline data (balance history), Credit Bureau scores, Demographic data, Public Records | Customer data, ID Network, Demographic data, Public Records
Prediction timing | Before account approval | Before account approval
Predictive Model Type | Classification (Supervised) | Classification (Supervised)
Predictive Algorithms | Rule-based (?) | Neural networks, support vector machines, boosted trees or regressions
Result | Risk score of being fraudulent | Risk score of being fraudulent
Mitigation strategy | Manual review | N/A

Table 1 - Comparison of the related work described in this document.

The system for detecting never-payers in the telecommunications industry will likely include data

sources such as demographic data, data provided by the customer and also some historical data, similar

to the systems described above. The source data will be used to train a classification algorithm to detect potential

never-payer customers, preferably before account approval. The final output will be the likelihood of not

paying any debts in the near future.


3. A Telecom Company: A Case Study

The main subject of this study is a Portuguese telecommunications company, whose challenge is to

detect fraudulent customers who never pay their debts. Firstly, it is important to understand its business

model and all the steps the customer follows through from the time he intends to sign a contract until he

runs into debt, that is, the customer lifecycle (Section 3.1). Secondly, to support the business model,

this company has implemented a DW-like data model that represents the input of this study and will also

be described (Section 3.2).

3.1. Business Model

The central entity of this business model is the Customer, who undergoes several phases during the

already mentioned customer lifecycle (Section 2.1). Considering the goal of this study is to identify, as

soon as possible, customers who will never pay their bills, it becomes pertinent to describe each lifecycle

phase. In addition to defining what makes a customer a never-payer, we need to pinpoint the precise

moment in time when we would have enough information to decide on his future.

The business process detailed in Figure 9 summarises all three customer lifecycle phases and their steps. Each lane represents a distinct stakeholder, and the three phases (Activation, Intermediate and Recovery) map exactly onto the bad payer lifecycle.


Figure 9 – Business process (AS-IS) describing the customer lifecycle of a never-payer.

The prospective customer begins the Activation phase when he shows an interest in subscribing to a post-paid service provided by the telecom company. After he fills in a sign-up form that provides customer information, the Activations Department is responsible for processing the new account information and requesting a Risk Evaluation on the potential client.

The prospective customer undergoes a Risk Evaluation, which includes criteria that have to be fulfilled for him to be considered an eligible client. The prospect’s fiscal number is supplied to evaluate each eligibility criterion; the criteria include questions like the examples described below:

Is the prospect a returning customer, and if so, did he leave any debts in the past?

Is the prospect featured in the database of previous debtors shared between the major Portuguese telecom companies? Debtors are automatically removed from this shared database if certain conditions apply, for instance, if the debt amount is less than 20% of the national minimum wage, if they are in an insolvency state, or if the debt has become time-barred or has been relieved.

Is the prospect signalled for fraudulent or suspicious behaviour?


Is the prospect contained in a white list?
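The removal conditions for the shared debtor database mentioned in the list above can be sketched as a simple predicate. This is a hedged illustration only: the function and parameter names are hypothetical, the national minimum wage is passed in by the caller because its value is not given in the text, and the conditions listed are the examples given in the text rather than the complete rule set.

```python
def removed_from_shared_debtor_list(
    debt_amount,
    minimum_wage,
    insolvent=False,
    time_barred=False,
    relieved=False,
):
    """Return True when a debtor would be dropped from the shared database.

    Covers only the example conditions: debt below 20% of the national
    minimum wage, insolvency, or a debt that is time-barred or relieved.
    """
    small_debt = debt_amount < 0.20 * minimum_wage
    return small_debt or insolvent or time_barred or relieved
```

With an illustrative minimum wage of 500, a debt of 50 would be removed (it is below 100), whereas a debt of 200 would stay listed unless one of the other conditions applies.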

The Risk Evaluation is an automated process, and if the results indicate that the client is eligible, its

results are then supplied to the activations’ assistant responsible for drafting the new contract.

Otherwise, the assistant can ask for a Manual Credit Evaluation that is performed by an activations’

specialist who will investigate historical data on the client (if available).

The activations’ specialist gathers all the historical information available on the potential client and informs the activations’ assistant about the client’s eligibility. If the manual evaluation is positive, the assistant can proceed to draft a new contract. If it is negative, the specialist can suggest alternative methods for minimising the risk, such as the regularisation of the debt before entering into a new contract. This could include the payment of a bond that is recurrently deducted in subsequent invoices. Moreover, even if the evaluation is negative, the specialist can override the decision, providing a meaningful comment. Sometimes the specialist does not have sufficient privileges to override a negative eligibility result; if so, he can escalate the decision to his manager.

At this point, the prospect can either be rejected or approved. If he passes the risk evaluation or

liquidates the existing debt, the Activations Department drafts a new contract that is provided to the

prospect for acceptance. Once the contract is accepted, the Activations Department approves the activation

of a customer account that is associated with the post-paid contract.

When the contract is accepted by the customer and entered into, the account associated with the

contract is assigned to one of the following billing cycles:

Cycle 1 – from the 1st day of the month to the last day of the same month.

Cycle 9 – from the 9th day of the month to the 8th of the next month.

Cycle 16 – from the 16th day of the month to the 15th of the next month.

Cycle 23 – from the 23rd day of the month to the 22nd of the next month.

For instance, a customer who enters into a contract on the 3rd day of the month will be assigned to Cycle 9, that is, the cycle that immediately follows the activation date. In that case, the billing cycle will be closed on the 9th and, depending on the speed of the billing process, an invoice will be issued afterwards, for example on the 13th (bill statement date). The bill due date is typically calculated by adding at least 11 working days (a legally acceptable period).
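The assignment rule in the example above can be expressed as a short Python sketch. It assumes that "immediately follows" means the first cycle whose start day falls strictly after the activation day, and that activations after the 23rd wrap around to Cycle 1 of the following month; both points are assumptions, since the text only gives the example of the 3rd day. The function name is hypothetical.

```python
from datetime import date

# Start days of the four billing cycles listed above.
CYCLE_START_DAYS = (1, 9, 16, 23)


def assign_billing_cycle(activation: date) -> int:
    """Return the cycle whose start day first follows the activation day.

    Assumption: an activation on a cycle start day goes to the next cycle.
    """
    for start_day in CYCLE_START_DAYS:
        if start_day > activation.day:
            return start_day
    # No later start day this month: wrap to Cycle 1 of the next month.
    return CYCLE_START_DAYS[0]
```

Under these assumptions, an activation on the 3rd is assigned to Cycle 9, matching the example in the text, and an activation on the 25th is assigned to Cycle 1.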

After contract activation, the customer enters the Intermediate phase and begins to use the subscribed

services. The Billing system is responsible for issuing an invoice one bill cycle after the usage of

subscribed services. If the first bill becomes past due and the customer does not make any payment,

he enters the Recovery phase that is controlled by the Collection Management system. This system

automatically opens a debt collection case and, depending on the credit rating of the debtor, warnings

such as SMS alerts and letters will be sent at different timings, and even subscribed services may be

suspended (hotline). If the debt is not liquidated after a certain amount of time, the client account is deactivated and the client becomes a never-payer.


At the end of recovery, if a client account is deactivated and none of its bills were liquidated, it is regarded as a never-payer. That is, a never-payer is a client account that was deactivated for any reason and has not paid its bills, i.e. the billed amount equals the open amount (current and past due debt). This is the outcome we want to predict.
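The never-payer definition above translates directly into a labelling rule. The sketch below is illustrative: the `AccountSummary` type and its field names are hypothetical, and the extra requirement that something was actually billed is an assumption added so that accounts without any bill are not labelled as never-payers.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class AccountSummary:
    status: str             # "activated", "deactivated" or "hotline"
    billed_amount: Decimal  # total amount billed to the account
    open_amount: Decimal    # amount still unpaid (current and past due)


def is_never_payer(account: AccountSummary) -> bool:
    """Deactivated account whose billed amount equals its open amount."""
    return (
        account.status == "deactivated"
        and account.billed_amount > 0  # assumption: at least one bill exists
        and account.open_amount == account.billed_amount
    )
```

An account that paid even part of a bill has an open amount below the billed amount and is therefore not labelled as a never-payer.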

After this point, and after several failed attempts to liquidate the client’s debt, the Collection Management System automatically runs a well-defined algorithm of collection actions that could include sending letters and legal notices, and even delegating the collection process to collection agencies and their lawyers. It is also

important to note that the customer can delay the payment of his debts simply by reporting a claim, for

instance, declaring that the contracted service is not working as intended. For that reason, never-payer

customers are capable of using contracted services for several months while not paying a single bill.

This thesis does not focus on the collection actions’ phase, but only on the period up until the customer is deactivated with all bills still unpaid. Ideally, the system would help the activations’ assistant avoid a potential never-payer simply by looking at the provided information. Failing that, it should at least detect, after some usage of services, risky behaviour indicating the likelihood of not paying any bills.

3.2. Data Model

The business model stated above is supported by a data model which is implemented across several

operational systems. All these systems converge to one central database, the data warehouse (DW).

The DW-like model is the main input for detecting the never-payer population.

Customer (Client/Account/Service)

As the above section established, the Customer is the central entity of the business model. The data

model supporting the business depicts the customer as the association between three main data

entities, as seen in the diagram below. The entry point is the Client, which aggregates at least one Account, which in turn subscribes to at least one Service.

Figure 10 - Diagram of the three main entities of the telecom company’s data model.

As an example, a business (e.g. a large corporation) may have one Account for each one of its

employees, which in turn subscribe to one or more Services (e.g. voice, fibre, TV). On the other hand,

a consumer can be represented by a Client entity which in turn may have one or more Accounts, for

instance, one for each one of his family members.
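As a rough illustration of these three entities and their cardinalities, the following Python sketch models them as nested data classes and checks that every Client has at least one Account and every Account at least one Service. The class layout, field names and the `validate` helper are assumptions made for illustration; they do not reproduce the company's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    name: str  # e.g. "voice", "fibre", "TV"


@dataclass
class Account:
    account_id: str
    services: List[Service] = field(default_factory=list)


@dataclass
class Client:
    fiscal_number: str
    segment: str  # "Business" or "Consumer"
    accounts: List[Account] = field(default_factory=list)


def validate(client: Client) -> bool:
    """Check the 1..* cardinalities between Client, Account and Service."""
    if not client.accounts:
        raise ValueError("a client needs at least one account")
    for account in client.accounts:
        if not account.services:
            raise ValueError(f"account {account.account_id} has no service")
    return True
```

A consumer with one account per family member would be a single `Client` holding several `Account` objects, each with its own list of services.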

The following tables detail some of the attributes that are available for these three entities and may be

relevant for detecting patterns of never-paying customers.


Client | A client is associated with at least one Account.
Segment | A client is segmented as a Business or Consumer client.
Location | The only accessible demographic information is City and Postal Code.
Fiscal Number | Vital to the risk evaluation analysis and billing system, since it could provide insights about past debts.

Table 2 – Client entity attributes.

Account | An Account belongs to a Client and subscribes to at least one Service.
Creation date | When the account was created.
Deactivation date | When the account was deactivated, i.e. when the status attribute became “deactivated”.
Status | An account may be activated, deactivated or hotline (on the verge of deactivation).
Location | This location is linked to bills. It represents the real location of the customer, including City and Postal Code.
Fiscal Number | Vital to the risk evaluation analysis and billing system, since it could provide insights about past debts.

Table 3 – Account entity attributes.

Service | A Service belongs to an Account and has Pricing Plans associated.
Pricing Plan | The pricing plan(s) attached to the service.

Table 4 - Service entity attributes.

Client Segment

Figure 11 adds another four important entities to the model. Firstly, clients belong to a Segment, which

has its own classification hierarchy. Secondly, a customer undergoes a Risk Evaluation each time he

wants to activate an account. Thirdly, geographical data can be extracted using Postal information from

the customer account. Lastly, the subscribed services include one or more Pricing Plans, which will be

charged differently according to their rate.


Figure 11 - Diagram of the entities available at subscription time.

A customer belongs to a specific Segment, which is stored at Client level. The client is segmented as Business or Consumer. For instance, the public sector and small and medium enterprises (SME) belong to the business segment, while the general public belongs to the consumer segment.

Figure 12 – The Segment hierarchy.

The types of services (and pricing plans) offered are specific to each segment. Additionally, each

segment can be classified as a three-level hierarchy as shown in Figure 12.

Risk Evaluation

Prior to account activation, the client undergoes a Risk Evaluation (detailed in the previous section) based on its fiscal number to determine, in theory, whether the account is to be activated. Each time a customer needs to activate a new account, at least one new risk evaluation is performed.

[Figure 11 diagram: Client, Account, Service, Pricing Plan, Risk Evaluation, Segment and Postal entities, linked with 1, 0..* and 1..* cardinalities.]

[Figure 12 diagram: Segment hierarchy with the labels Consumer, Business, SME & SOHO, SME, SOHO, MVNO, Self-employed, Large Corporations, Corporate, Public Sector and Group Enterprises.]

Risk Evaluation | Before every account activation, the risk is assessed.
Fiscal Number | The fiscal number of the customer being evaluated.
Criterion | Each evaluation comprises nine different criteria, as described in Section 3.1 above, and each one is represented by its number (between 1 and 9), its name and the result of the evaluation for that criterion. A criterion scoring 0 has passed; a criterion scoring 1 has failed, and the reason is also registered.
Evaluation Score | The sum of all nine criteria scores; non-risky customers score zero. Sometimes the activations’ specialist can pass the evaluation even if the real score is above zero (fail); if so, the reason is registered.
Creation info | The timestamp of the evaluation and the login name of the activations’ specialist who created it using the Risk Evaluation system.
Update info | Each time an evaluation is updated, the timestamp and the login name of the activations’ specialist who made the changes are logged.

Table 5 – Risk Evaluation attributes.
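The scoring scheme described in Table 5 can be sketched in a few lines of Python. The function names and the representation of an override as a free-text reason are assumptions for illustration; only the nine binary criteria, the sum and the zero-means-pass rule come from the text.

```python
def evaluation_score(criteria_scores):
    """Sum the nine binary criterion scores (0 = passed, 1 = failed)."""
    if len(criteria_scores) != 9:
        raise ValueError("a risk evaluation has exactly nine criteria")
    if any(score not in (0, 1) for score in criteria_scores):
        raise ValueError("each criterion scores either 0 or 1")
    return sum(criteria_scores)


def evaluation_passes(criteria_scores, override_reason=None):
    """Pass when the score is zero, or when a specialist overrides it.

    Assumption: an override is recorded as a non-empty reason string.
    """
    score = evaluation_score(criteria_scores)
    return score == 0 or bool(override_reason)
```

For instance, an evaluation with two failed criteria has a score of 2 and fails unless an override reason is supplied.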

Postal

The contact information provided by the Account entity includes the postal code and city, which roughly map to the national database of postal codes provided by the Portuguese Post Office (CTT) [19]. The Postal entity comprises postcode data associated with the municipality and district names, providing geographical insights that might be predictors of debt patterns.

Postal | An account is associated with postal and city contact information.
Postal Code | 4-digit postal code.
Postal name | The name that is placed after the postal code (town or municipality name).
Town | The name of the town.
Municipality | The name of the municipality.
District | The district name, such as Lisboa, Évora, Porto and Aveiro.

Table 6 – Postal entity attributes.

Pricing Plan

This system only analyses customers who subscribe post-paid Pricing Plans; that is, customers who

use the services provided by the telecom company and pay the bills generated at the end of each month.

Examples of Pricing Plans include, for instance, mobile voice post-paid plans, or even a triple-play

pricing plan that includes television, Internet and mobile services.


Pricing Plan | A Pricing Plan belongs to a Service.
Name | The name of the pricing plan.
Flag GSM | Whether it is a GSM (mobile communications) price plan or not.
Segment | Consumer or Business, similar to Client Segment.
Hierarchy | Classifies the pricing plan using a simple hierarchy of three levels. Level 1: similar to client segment (usually, a business client subscribes to business pricing plans, and so on). Level 2: this system focuses on post-paid pricing plans. Level 3: describes the content, such as GSM (mobile), fixed, MBB (mobile broadband) and M2M (machine-to-machine) communications.

Table 7 – Pricing Plan entity attributes.

All seven entities in Figure 11 represent the data that is present at the moment before the customer is accepted by the telecom company and a new account is created. In short, these entities are the ones affected during the customer Acquisition phase (see Section 2.1), and their attributes characterise a telecom customer a priori: they are the only attributes on hand at subscription time.

Preferably, risky customers should be detected at the first opportunity, but we could look further into their

behaviour right after customer accounts are activated. Figure 13 adds four entities that will eventually

become available after an account is created, and services are subscribed.

Figure 13 - Diagram of all the entities available before and after the customer account is activated.

Usage

The most evident sign of customer behaviour is the usage of the subscribed services. The Usage entity logs each customer’s behaviour regarding, for example, mobile calls and data usage, storing metrics that help understand that behaviour and, ultimately, generate charges. In this system, due to space and processing power limitations, usage data was aggregated on a daily basis, so a service has a corresponding usage record each day, as shown in Table 8.


Usage | A service generates usage records.
Service | The service this daily usage event refers to.
Event Date | The day of the event.
Event Description | Metrics such as megabytes, kilobytes and seconds.
Event Units | The amount of units spent for the given event, as well as the rounded units. For example, if a service is charged every 30 seconds and 35 seconds were spent, the rounded units are 60.
Total Calls | The number of events for the given event type. For two calls of 45 seconds each, this metric would be 2.

Table 8 – Usage entity attributes.
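The rounding and daily aggregation described in Table 8 can be sketched as follows. This is a minimal illustration under stated assumptions: units are whole numbers, rounding is applied to each event before aggregation (the text does not say whether rounding happens per event or per day), and the input tuple layout and function names are hypothetical.

```python
from collections import defaultdict


def rounded_units(actual_units: int, increment: int) -> int:
    """Round usage up to the next multiple of the charging increment."""
    return ((actual_units + increment - 1) // increment) * increment


def aggregate_daily(events, increment):
    """Aggregate (service_id, event_date, units) events per service and day.

    Assumption: each event is rounded individually before being summed.
    """
    daily = defaultdict(
        lambda: {"actual_units": 0, "rounded_units": 0, "total_calls": 0}
    )
    for service_id, event_date, units in events:
        row = daily[(service_id, event_date)]
        row["actual_units"] += units
        row["rounded_units"] += rounded_units(units, increment)
        row["total_calls"] += 1
    return dict(daily)
```

With a 30-second increment, `rounded_units(35, 30)` gives 60, as in the example of Table 8, and two calls of 45 seconds on the same day produce a single daily record with a total of 2 calls.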

Billing

Service usage generates charges associated with the corresponding account. Customer accounts are billed every month by the Billing system, according to the billing cycles described in the section above. Every billing cycle, a new bill is generated, and Billing keeps track of the charged amount, the due date and the amount that has already been paid. The Billing data entity is essential to classify existing customers into the never-payer population.

Billing | Each billing cycle an account is given a new bill, including paid services.
Account | The account this bill refers to.
Bill Cycle | The day of the month on which the bill is issued. Each bill is issued monthly, always on the same day. For instance, if an account belongs to Bill Cycle 9, its bills are issued on the ninth day of each month.
Due Date | The bill must be paid before this date, otherwise the account runs into debt.
Last Payment Date | The last time the bill was paid.
Amounts | Each bill keeps track of several types of amount. Charged amount: the initial amount that was charged for the given billing cycle. Original invoice due total: the total (cumulative) amount that is in debt. Billed payment: the amount that has already been paid. Open amount: the amount that is left to pay.

Table 9 – Billing entity attributes.


Payment

Each time a customer pays up, his account generates a new record which is stored in the Payment

entity.

Payment | Each payment amortises the current account debt.
Account | The account this payment refers to.
Payment Month | The month of payment. Payment date is aggregated by month.
Amount | The amount that was paid.

Table 10 – Payment entity attributes.

Campaign

Lastly, salespeople often contact customers (or prospects) to offer additional services, and these contacts are logged in the Campaign entity.

Campaign | Accounts and services may be contacted by salespeople.
Account / Service | The account and service of the campaign contact.
Contact Date | The date of the campaign contact.
Campaign Name | The name of the campaign.

Table 11 – Campaign entity attributes.

After this brief explanation of how the data model supports the business model of this telecom company, it becomes clearer which data attributes and metrics will be analysed and tested to verify whether never-paying customers are, in fact, predictable.

Figure 14 shows the complete diagram of the data model supporting the business entities. For more

detail concerning the data structure of the data sources that serve as input to the system, see Figure

15.


Figure 14 - Entity-Relationship Diagram (Chen’s Database Notation) of the complete data model.

[Figure 14 diagram: Client, Account, Service, Pricing Plan, Campaign, Usage, Billing, Payment, Risk Evaluation and Segment entities, linked with 1, 0..* and 1..* cardinalities.]

Figure 15 – Entity-Relationship Diagram (Crow’s Foot Notation) of the input data provided by the IT department.

[Figure 15 diagram: input tables Cli_Acc_Serv, PP / Service, Usage, Payments, Campaign, Billing, Risk Evaluation, Cli_Segment and PP_Ref, with columns including DW_CLIENT_ID, DW_CUST_ACCT_ID, DW_SERV_ID, CLIENT_FISCAL_NUM, ADDR_POSTAL_CODE, ADDR_CITY, ACTUAL_UNITS, ROUNDED_UNITS, TOTAL_NUM_OF_CALLS, PMT_AMT, BILLED_PMT, OPEN_AMT, ORIG_INVOICE_DUE_TOTAL, CAMP_NAME, PRICING_PLAN_DESC and CLIENT_SEGMENT_DESC.]

4. Solution

In this chapter, the implemented solution is generically described in terms of its architecture (Section

4.1). Then, each step of the methodology for building data mining applications [20] is detailed (Section

4.2): understanding the business problem; understanding and preparing the data; and creating models.

Chapter 5 is solely dedicated to explaining how the models were assessed and the results.

4.1. Solution Overview

In order to proactively predict the never-payer population, a data mining application is proposed. This

application has to be capable of processing input data that can help predict if a customer is likely to

never pay his bills.

The general overview of the system is illustrated in Figure 16.

Figure 16 – Application Architecture Overview

The main application is composed of three main layers:

Sources, where customer and behavioural data from across different dimensions of the

customer is stored, as described in Section 3.2. CSV flat files are the common interface between

the source systems and the application. Source data include the ERP, CRM and Billing systems

as well as CDRs and other sources.

ETL, built on SQL Server and SSIS, is responsible for extracting the data, transforming it and loading the model set, which feeds a set of data mining algorithms with different sampling strategies, complexities and data types. This component is supported by SQL Server and automated by SSIS.

Data Mining, where predictive models are trained and tested, predictions are stored and

evaluated. The analytical component is built on SSAS and orchestrated by SSIS.


Additionally, reporting tools such as Microsoft Excel, Power View, Power BI and Power Pivot integrate

seamlessly with the SQL Server database. A number of views and stored procedures are available to

the user, containing results and statistics. This enables the creation of ad-hoc exploratory analyses.

4.2. Methodology

Considering that this thesis aims to build a data mining application, it makes sense to guide its development with a widely used methodology such as CRISP-DM (Cross Industry Standard Process for Data Mining). CRISP-DM is a data mining methodology and process model that describes a common approach for conducting data mining projects [20], [21]. Figure 17 depicts the main phases of this methodology. Furthermore, the approach recommended by Microsoft is also loosely based on this methodology [22]; each Microsoft BI stack tool plays a role, as can be seen in Figure 18.

Figure 17 – CRISP-DM, the standard approach for developing data mining applications [20].

Figure 18 – Microsoft’s proposed data mining process methodology [22], heavily based on CRISP-DM [20].

The following sections will detail each step of the methodology for building the data mining application:

Business understanding: Section 4.2.1 focuses on the business goals and how they were

translated into a data mining problem definition.

Data understanding and preparation: Section 4.2.2 introduces the source data, first insights,

challenges, and how data was prepared for modelling.

Modelling: Section 4.2.3 presents the predictive models that were applied.

Evaluation: Chapter 5 defines a validation plan and describes the results obtained.

Deployment: Section 4.2.4 explains how the data mining application was operationalised and

deployed.

4.2.1. Business Understanding

The main subject of this thesis is a telecom company whose goal is to predict, as soon as possible, whether a new customer will become a never-payer in the near future. A never-payer is someone who subscribes to services or products but does not intend to pay for that subscription.


The first step consisted of attending several meetings to identify the requirements, the business processes and stakeholders involved, and the assumptions and constraints.

This company delineated simple requirements, including:

1. Describing the typical profile of a never-payer customer.

2. Determining, for every new potential customer, the probability (i.e. risk score) of being a never-payer subscriber in the future.

3. Predicting never-payer accounts, preferably during their acquisition.

4. Optionally testing the prediction using a small amount of behavioural data (e.g. usage history).

5. Operationalising the learning and testing process.

The complete detail of the business model was previously presented in Section 3.1, including all the steps the customer goes through from the time he intends to sign a contract until he becomes a never-payer.

Several assumptions were established:

1. At least one year of data must be provided for analysis.

2. Data sources are supplied by the company IT Department in the form of flat files¹. This simplifies the data loading process, as flat files are a common interface between all the company systems and this data mining application.

3. Data is provided by the IT Department at whatever aggregation level is feasible to extract, even if that level does not suit the application. Because of the sheer volume of data to be extracted and the processing time needed to transform it, some data was not filtered or aggregated as it should have been. This can introduce aggregation-level and data quality problems that have to be solved during data preparation.

4. The universe of customers is limited to those who have post-paid pricing plans. The main goal

is to detect customers who never pay their monthly bills, that is, post-paid contracts.

5. The aggregation level required for data mining is established at the account-level.

It was decided that the whole solution should be developed using Microsoft BI tools, mainly because this software was already available for the company to use. Besides, Microsoft provides a full-stack data mining development environment in which all the tools integrate well. Thus, the software setup included:

A server running Microsoft SQL Server 2012 with the following components installed:

o Database server running Microsoft SQL Server 2012 (MSSQL).

o Integration server running Microsoft SQL Server Integration Services 2012 (SSIS).

o Analytic server running Microsoft SQL Server Analysis Services 2012 (SSAS).

Microsoft Excel 2013 with the following components installed: Power View, Power Map, Power

Pivot and Power Query.

¹ Data files containing text records with a fixed number of fields.


The hardware for developing the solution included a PC with an Intel i7 @ 2.9 GHz processor with 8GB

of RAM. Microsoft SQL Server 2012 was configured with 3GB of RAM, so the remaining memory was

used by other software components.

In conclusion, the never-payer data mining problem was defined as a supervised classification problem. A supervised data mining algorithm learns the patterns contained in examples provided by the user, in this case examples of never-payer accounts. It is a classification problem because the model predicts events described by categorical labels such as “yes” or “no” [5] to answer one simple question: will this customer be a never-payer? A probability score is also calculated, describing the likelihood of that event happening.

4.2.2. Data Understanding and Preparation

The second phase of the CRISP-DM methodology begins with understanding and preparing the data. These two steps should be iterated continuously, falling back to the previous phase whenever some business input is needed.

Firstly, it is important to distinguish between customer data and behavioural data [23]. The former refers to first-level attributes that are present during the acquisition process, for instance the customer’s city, the services and pricing plans subscribed, and the risk evaluation scores. The latter refers to second-level attributes that become available after the customer’s application has been approved and he begins to use the subscribed services. Examples of behavioural data include usage metrics by service and the bills generated every month. The overview of the business data model is shown in Figure 14; customer data is green-marked, whereas behavioural data is blue-marked. For a detailed description of each data entity, see Section 3.2.

Data Extraction

The IT Department provided the first batch of data containing customer data. This included a set of customers whose accounts were activated during a two-year period, from April 2013 until the beginning of March 2015. This added up to approximately 613.000 clients with 937.000 accounts and 2.877.000 services with corresponding pricing plans. Segment and pricing plan were included, as well as publicly available geographical data from the Portuguese Post Office (CTT) [19]. Later, one year of risk evaluations was added, from September 2013 until September 2014: more than 17 million records covering almost 2 million unique risk evaluations. Additionally, behavioural data was provided. This included more than 16 million campaign contacts, and 4.6 million payments for 6.3 million bills. Daily usage of services added up to 20 million records. Because it was only possible to extract one year of risk evaluations, the training and test dataset was limited to this period (September 2013 until September 2014).

Notice that this volume of data had been filtered only by the customers of interest. Some data entities, such as usage and campaigns, needed to be filtered again to keep only the first month after the account activation date; it was established that behavioural entities should be limited to thirty days after account activation. This made it possible to store and process the 20 million usage records. In addition, data such as campaigns and usage was aggregated daily, which made it very difficult to load and process. For that reason, it was afterwards summarised by month.
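The thirty-day filter and the monthly summarisation described above can be sketched in a few lines. This is only an illustration, not the SSIS/T-SQL implementation used in the thesis: the function name, the tuple layout of the usage rows and the inclusive thirty-day window are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def summarise_first_month(usage_rows, activation_dates, window_days=30):
    """Keep usage within `window_days` of activation and sum it per account and month.

    usage_rows: iterable of (account_id, day, units) tuples (hypothetical layout).
    activation_dates: dict mapping account_id -> activation date.
    """
    totals = defaultdict(float)
    for account_id, day, units in usage_rows:
        activated = activation_dates.get(account_id)
        if activated is None:
            continue  # usage for an account outside the customers of interest
        # Assumption: the window is inclusive of the thirtieth day.
        if activated <= day <= activated + timedelta(days=window_days):
            totals[(account_id, day.year, day.month)] += units
    return dict(totals)


activation_dates = {"A1": date(2014, 1, 20)}
usage_rows = [
    ("A1", date(2014, 1, 25), 120.0),
    ("A1", date(2014, 2, 10), 30.0),
    ("A1", date(2014, 3, 15), 99.0),  # more than 30 days after activation
    ("B7", date(2014, 1, 25), 50.0),  # account not in the dataset
]
monthly = summarise_first_month(usage_rows, activation_dates)
```

Filtering before aggregating keeps the intermediate volume small, which is the same motivation given above for restricting behavioural entities to the first month.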

The process of collecting initial data into a staging area was implemented using SSIS. Figure 19

shows the collapsed view of the package wherein each entity is loaded from flat files, and certain data

dependencies have to be met. The parallelization degree is maximised to speed up the extraction.

Figure 19 – Overview of the SSIS package responsible for extracting data from flat files.

Several strategies for loading large amounts of data were used. The first one was the highly efficient BULK INSERT² T-SQL command. A number of stored procedures were developed to bulk load the data and apply data transformations at the same time.

During every bulk loading process, some actions were taken to maximise storage efficiency and

processing speed:

Trimming and uniformising string attributes;

Checking and removing duplicate records considering each entity’s business key;

Summarising data with the adequate aggregation level (daily vs. monthly);

Creating database indexes to speed up lookups.
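As a rough illustration of the first two actions in the list above (string normalisation and duplicate removal by business key), consider the following sketch. It is not the stored-procedure code used in the thesis; the function name is hypothetical, and upper-casing is only one assumed way of "uniformising" strings.

```python
def clean_records(records, business_key):
    """Trim and upper-case string fields, then drop duplicates by business key.

    records: list of dicts (one per input row).
    business_key: tuple of field names that identify a record.
    """
    seen = set()
    cleaned = []
    for record in records:
        normalised = {
            field: value.strip().upper() if isinstance(value, str) else value
            for field, value in record.items()
        }
        key = tuple(normalised[field] for field in business_key)
        if key in seen:
            continue  # duplicate of a record that was already kept
        seen.add(key)
        cleaned.append(normalised)
    return cleaned


raw = [
    {"AccountID": " a1 ", "City": "Lisboa "},
    {"AccountID": "A1", "City": "LISBOA"},
    {"AccountID": "A2", "City": "sintra"},
]
cleaned = clean_records(raw, business_key=("AccountID",))
```

Normalising before comparing keys matters: the first two rows only become recognisable as duplicates once whitespace and letter case have been made uniform.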

Figure 20 shows two examples of how large volumes of data were loaded into the application database. For instance, risk evaluation data was spread across several monthly files, each one containing all risk evaluations for accounts that were activated in that given month. Data was extracted using SSIS bulk load, which is less efficient but, for each 200MB file, considerably faster. The main problem was the tens of millions of records, most of them duplicates. Therefore, the loading process had to eliminate those duplicates and, in the end, summarise the data to keep only one evaluation per day. Usage loading is similar to risk evaluation, but since each monthly file was up to 5GB in size, and only 3GB of memory were available to SQL Server, the bulk insert T-SQL command was required.

² Bulk Insert (T-SQL): https://msdn.microsoft.com/en-us/library/ms188365(v=sql.110).aspx

Figure 20 – Data loading examples implemented by SSIS packages.

The complete data model that was loaded into the database is depicted in Figure 15. Data loading also enforced business constraints such as primary and foreign keys. For the sake of simplicity, several views were set up to build the data model shown in Figure 21.


Figure 21 - Entity-Relationship Diagram (Crow’s Foot Notation) of the input data after it was prepared.

[Diagram entities: Service-PricingPlan, Usage, Payments, Campaign, Billing, Client-Account-Service, RiskEvaluation, PricingPlan (Denorm), Client Segment.]

Classification

Before any further data exploration, each account in the dataset had to be classified with respect to the attribute we aim to predict. The classification step created an extra derived binary attribute using the never-payer condition defined in Section 3.1: accounts that are deactivated and have left all of their bills unpaid, as expressed in (1).

$$\mathit{AccountStatus} = \text{'Deactivated'} \;\land\; \mathrm{sum}(\mathit{OpenAmount}) > 0 \;\land\; \mathrm{sum}(\mathit{BilledPayment}) = 0 \qquad (1)$$
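A minimal sketch of the classification rule, for illustration only. It assumes that "all of their bills unpaid" means the sum of payments is zero; the function name and argument layout are hypothetical and do not come from the thesis.

```python
def flag_never_payer(account_status, open_amounts, billed_payments):
    """Return 1 if the account meets the never-payer condition, else 0.

    Assumption: an account has "all bills unpaid" when its payments sum to zero.
    """
    is_deactivated = account_status == "Deactivated"
    has_open_debt = sum(open_amounts) > 0
    paid_nothing = sum(billed_payments) == 0
    return int(is_deactivated and has_open_debt and paid_nothing)


never_payer = flag_never_payer("Deactivated", [30.0, 25.5], [0.0, 0.0])
still_active = flag_never_payer("Active", [30.0], [0.0])
partial_payer = flag_never_payer("Deactivated", [30.0], [10.0])
```

An account that paid even part of one bill is not flagged, which matches the strict reading of the never-payer definition.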

Then, the first step towards predictive modelling was taken by creating a case table, also known as the

model set. The model set is the data that is used to build the data mining models [2]. Each case table

record is a case; that is, a customer’s account with relevant attributes for predictive modelling. Figure

22 shows the main case table of this system, containing all possible attributes that were tested for

relevance. Notice that the Account ID is marked as the case table primary key; this means the case

table is a summary of all attributes and metrics at the account-level (assumption 5 of Section 4.2.1).

Besides picking categorical and numerical attributes from the original dataset, several derived attributes were also calculated, such as the number of services subscribed, the number of risk evaluations and the average number of calls in the first month of subscription, among others. This case table was composed of almost 400.000 accounts that were activated during a one-year period and were classified. The FlagNP binary attribute was 1 for never-payer accounts, and 0 otherwise.
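To make the account-level aggregation concrete, the sketch below derives two of the counts mentioned above (number of services and number of risk evaluations) and attaches the FlagNP label. It is an assumption-laden illustration: the input layouts and the function name are hypothetical, and the real case table was built with views in SQL Server.

```python
from collections import defaultdict


def build_case_table(services, risk_evaluations, flags):
    """Summarise service and risk evaluation rows into one case per account.

    services: iterable of (account_id, service_id) pairs.
    risk_evaluations: iterable of (account_id, score) pairs.
    flags: dict mapping account_id -> FlagNP (0 or 1).
    """
    cases = defaultdict(lambda: {"N_ServiceID": 0, "N_RiskEval": 0, "FlagNP": 0})
    for account_id, _service_id in services:
        cases[account_id]["N_ServiceID"] += 1
    for account_id, _score in risk_evaluations:
        cases[account_id]["N_RiskEval"] += 1
    for account_id, flag in flags.items():
        cases[account_id]["FlagNP"] = flag
    return dict(cases)


case_table = build_case_table(
    services=[("A1", "S1"), ("A1", "S2"), ("A2", "S3")],
    risk_evaluations=[("A1", 0.2), ("A2", 0.1), ("A2", 0.4)],
    flags={"A1": 1, "A2": 0},
)
```

Each key of the result is an account, mirroring the role of the Account ID as the case table primary key.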

Note: From this point on, the term “NP1” will refer to the population classified as never-payers; the remaining population will be referred to as “NP0” (without never-payers).


Figure 22 – Entity-Relationship Diagram (adapted from Crow’s Foot Notation) of the Case Table including the predictive attribute.

[Diagram attribute groups: Usage, RiskEvaluation, Service / PricingPlan, Client, Account, Never-Payer (FlagNP); primary key: AccountID.]

Data Quality

It was detected that almost 28% of accounts had invalid or missing attribute values. One example of poor data quality was the account’s city and postal code attributes, which were marked as “deleted” or “not available”, or were missing, in about 267.000 accounts. The account’s address should always be present, since it is mandatory for association with bills, very much like the fiscal number. For that reason, further analysis was carried out to understand why more than one-quarter of the accounts were invalid.

Because the data had come from the company’s DW, this data quality issue was discussed with the BI Department, and it was concluded that those records were, in fact, invalid: accounts with invalid statuses that should not be considered. For that reason, a new attribute flag was provided, indicating whether the account was valid. The dataset was thus narrowed down to about 670.000 accounts, which avoided future problems when evaluating mining models.

Input data was provided in batches, and its contents depend on the extraction date. When extraction mistakes occurred, such as missing or miscalculated fields, only the affected entities were re-extracted, and without historical data, due to their large volume and the time needed to extract them again. Because there is no historical data, later extractions always reflect the state at the time of the extraction, and they need to be synchronised with older data.

For instance, suppose the batch of Clients, Accounts and Services needs to be re-extracted. Even if

account creation dates are limited between date A and B to simulate the same time-span as the first

extraction, many Accounts will have a single set of Services, or they could even belong to different

Clients. Then, there will be (new) Services without Price Plans and Price Plans that used to belong to a

Service and are now orphan records.

There are two possible methods of avoiding this kind of time incoherence. The first one, and the easiest to implement, is to re-extract everything, ensuring every record refers to the same point in time, but this was completely impractical considering the time and processing power required. The alternative is to set up a sophisticated mechanism capable of re-extracting only the entities/records strictly needed for maintaining the model coherence. Considering the resources required by the first approach and the complexity of the alternative, the automatic resolution of this problem was considered out of scope. When this situation occurred, it was handled manually, on a case-by-case basis.

Finally, although campaign data was provided, it was not possible to incorporate campaign contacts in

the data mining model, due to aggregation level incompatibilities.

Exploration Analysis

Exploration analysis was now possible. Excel and its add-ins Power View, Power Pivot and Power

Map, helped to visualise how the data behaved.

Globally, ~2% of accounts are never-payers, which is a reasonable percentage when it comes to fraud. Firstly, 86% of the global dataset is composed of consumers, and ~2% of those consumers are never-payers. Of the remaining 14%, which are businesses, ~1% never pay their bills. Moreover, business and consumer segments have very distinct profiles, which translate into different attribute metrics. For that reason, they should be analysed separately.

Power Map helped to better understand the location of each customer (account). Attributes such as the account’s country, district, city, town and postal code proved to be very helpful when mapping the distribution of the never-payer population.

Figure 23 shows an example of the proportional distribution of never-payers across the accounts’ cities. The cities with the tallest yellow bars are those with the greatest relative amount of never-payers (NP). For instance, a customer from Sintra is ~80% more likely to be a never-payer than a customer from Cascais. Table 12 shows the detail of this distribution, wherein 3.14% of the accounts in Sintra are NPs, compared with only 1.74% in Cascais.
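The "~80% more likely" figure follows directly from the two rates quoted from Table 12. The short check below reproduces the arithmetic; the helper name is introduced here for illustration only.

```python
def relative_increase(rate_a, rate_b):
    """Percentage by which rate_a exceeds rate_b."""
    return (rate_a / rate_b - 1.0) * 100.0


# Never-payer rates (in %) quoted for Sintra and Cascais.
increase = relative_increase(3.14, 1.74)
```

The ratio 3.14 / 1.74 is roughly 1.80, so the never-payer rate in Sintra is about 80% higher than in Cascais.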

Figure 23 – Never-payer distribution across the cities around Lisbon (Power Map)

Table 12 – Never-payer distribution across the cities around Lisbon (table view)

Looking at the big picture, the likelihood of being a never-payer varies across different levels of geography. In the distribution across districts presented in Table 13, the top districts are those with the highest probability of holding never-payers. It is important to consider the size of each population: even though some locations have a high propensity towards debt, they lose importance if their population size is not relevant. Besides, the customer’s business segment (Consumer or Business) appeared to be very meaningful across locations.


Table 13 – Never-payer distribution across all Portuguese districts.

A more careful analysis of the location data leads to some interesting conclusions. Globally, ~2% of consumers are never-payers, whereas only ~1% of businesses are. When analysing by district, this gap is even clearer. For example:

In Ilha de São Miguel, ~3% of consumers do not pay any bills, while nearly every business pays its bills (only 0.83% are NP);

Ilha da Madeira has the highest never-payer rate (~6%), essentially because of its consumers (~7%);

In Guarda, almost everyone pays their debts (~99%), especially in the business segment;

In Viana do Castelo, 2.59% of the businesses do not pay their bills, while consumers are below the average NP likelihood.

When looking at districts with at least 4.000 accounts, the worst consumers are located in Ilha da

Madeira, Ilha de São Miguel, Setúbal and Lisboa. On the contrary, the best-behaved are from Santarém

and Aveiro. Most of the never-payer businesses are from Ilha da Madeira, Faro, Setúbal and Lisboa,

while the best are located in Viseu and Leiria.

The final case table also had attributes regarding the subscribed services (number of services) and

pricing plans, containing the number of pricing plans, and other hierarchy dimensions. These attributes

are related to the Services entity, and since the case table shows the account perspective, their values

had to be pivoted and summarised. For that reason, all pricing plan attributes are numeric.

Many plots were created to understand if certain changes of mean values affected the likelihood of

belonging to the NP population. Figure 24 is an example of how numeric attributes were plotted against

the average values of the consumer and business population.


Figure 24 – Plots comparing the content of pricing plans for consumers and businesses.

Steep slopes show that the never-payer population and the general population subscribe to different pricing plans. For instance, the NP1 population subscribes to more mobile broadband (MBB) pricing plans than the NP0 population, while the opposite holds for fixed, GSM (mobile voice) and M2M (machine-to-machine) pricing plans. Flat slopes indicate that the attribute will probably not be useful for predicting the never-payer population. Another interesting insight is that businesses can subscribe to consumer pricing plans, whereas consumers cannot subscribe to business pricing plans. The class of the pricing plan will therefore be relevant only for business predictive models.

Figure 25 compares the service usage type for both NP0 and NP1 populations. The total number of seconds used in calls seems to be a good indicator: the never-payer population spends almost three times fewer seconds in calls than the NP0 population. The second plot details the lines that overlap in the first one. It appears that the average number of call events per day does not vary between the two populations. Nonetheless, the amount of data transferred is also correlated: never-payers use ~80% less mobile data than the NP0 population.

Figure 25 – Plots comparing the type of service usage, for the NP0 and NP1 population.

[Figure 24 panels: “Pricing Plan Content - Consumer” (series FIXED, GSM, MBB; categories NP_0 Cons, NP_1 Cons) and “Pricing Plan Content - Business” (series AD, AI, ISP, M2M; categories NP_0 Bus, NP_1 Bus).]

[Figure 25 panels: “Average Usage by day (first month)” (series AVG_TotalCalls_Event, AVG_TotalCalls_MBytes, AVG_TotalCalls_Seconds) and “Average Usage by day (detail)” (series AVG_TotalCalls_Event, AVG_TotalCalls_MBytes); categories NP_0, NP_1.]

Figure 26 shows that the risk evaluations executed before account activation are relevant. Businesses that are never-payers have gone through the risk evaluation process ~10.5 times on average, compared with the ~9.5 assessments of the NP0 population. In addition, the last evaluation score immediately before activation is also relevant, with the worst scores (higher values) belonging to never-payer businesses. However, for the consumer counterpart, these conclusions are not as clear. In fact, these two attributes are only weakly correlated with membership of the NP0 or NP1 population. Nevertheless, consumer never-payers seem to score worse but with fewer evaluations.

Figure 26 – Plots comparing the number of risk evaluations before activation and the latest score across datasets.

Feature Selection

During data preparation, a total of 58 attributes were loaded. After initial data exploration and cleansing, the next step is feature selection [24]: using correlation to identify the top attributes that have a strong relationship with the target variable (FlagNP). Feature selection has proved effective in reducing dimensionality, improving mining efficiency and accuracy, and enhancing result comprehensibility [25].

Some of the previously displayed tables and charts show the correlation between potential predictive attributes and the target attribute. Nonetheless, the number of highly correlated predictive attributes should also be minimised, since redundant attributes do not add value to the model. Table 14 shows the correlation analysis between numeric usage attributes. For instance, the average of actual events is the same as the average of total call events; therefore, the model set should only have one of them.
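The idea of discarding one attribute from each highly correlated pair can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the thesis: it uses the Pearson coefficient, a hypothetical threshold of 0.95, made-up sample values, and a simple keep-the-first rule.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_redundant(columns, threshold=0.95):
    """Keep an attribute only if it is not highly correlated with one already kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[other])) < threshold for other in kept):
            kept.append(name)
    return kept


# Made-up values: the first two attributes are identical, as in the example above.
columns = {
    "AVG_TotalCalls_Event": [1.0, 2.0, 3.0, 4.0],
    "AVG_ActualUnits_Event": [1.0, 2.0, 3.0, 4.0],
    "AVG_TotalCalls_Seconds": [40.0, 10.0, 35.0, 20.0],
}
kept = drop_redundant(columns)
```

Which member of a correlated pair survives depends on iteration order here; a real selection would more likely keep the one more strongly related to FlagNP.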

Table 14 – Correlation analysis between usage attributes.

[Figure 26 chart “Risk Evaluation (first month)”, categories NP_0 Cons, NP_1 Cons, NP_0 Bus, NP_1 Bus. Count of Risk Evaluations: 1,98; 1,86; 9,46; 10,53. Last Real Score: 0,01; 0,02; 0,15; 0,22.]

This technique successfully eliminated numeric attributes from the pricing plan, usage and risk evaluation groups.

The final model set for the Consumer segment included the following attributes:

ACC_AccountID: The account being evaluated

FlagNP: Flag indicating whether the account is a never-payer

ACC_Postal2: Postal Code (3 digits)

ACC_PostalCode: Postal code (4 digits)

ACC_PostalName: Postal Name

ACC_Town: Town

ACC_Municipality: Municipality

ACC_District: District

SVC_N_ServiceID: Number of services subscribed

PPL_Content_GSM: Number of GSM pricing plans

PPL_Content_FIXED: Number of FIXED pricing plans

PPL_Content_MBB: Number of MBB pricing plans

RSK_N_RiskEval: Total number of risk evaluations

RSK_ModifyLogin: Last Modify Login (Risk Evaluation)

RSK_RealScore: Last Score (Risk Evaluation)

USG_AVG_TotalCalls_Event: Average Total of Calls (Event)

USG_AVG_TotalCalls_Seconds: Average Total of Calls (Seconds)

USG_AVG_TotalCalls_MBytes: Average Total of Calls (MBytes)

USG_AVG_ActualUnits_MBytes: Average Actual Units (MBytes)

For the Business segment, additional attributes were considered:

CLI_Segment: Segment (Level 3)

CLI_Segment1: Segment (Level 1)

CLI_Segment2: Segment (Level 2)

PPL_Content_M2M: Number of M2M pricing plans

PPL_Content_ISP: Number of ISP pricing plans

PPL_Content_AI: Number of AI pricing plans

PPL_Content_AD: Number of AD pricing plans

4.2.3. Modelling

Although the relative amount of risky customers is tiny (~2%) among the thousands of new subscriptions every month, they represent substantial losses that could be avoided or, at least, mitigated. Because the dataset is highly imbalanced, particular strategies need to be followed. Sampling is the most widespread means of overcoming the class imbalance problem [26].

A direct method to solve the imbalance problem is to artificially balance the class distribution so that the minority class (NP1, never-payers) is not under-represented when training the classifier [27]–[29].


There are three basic approaches to overcome the class imbalance problem and several works in the

literature that confirm the efficiency of these methods in practice [28]. These include:

Random Oversampling (ROS), which consists of sampling the minority class (NP1) with replacement until there are as many minority class examples as majority class (NP0) examples. This could lead to overfitting, since it produces exact copies of the never-payer class examples.

Random Undersampling (RUS), balancing class distribution through random elimination of

majority class (NP0) examples. The major drawback is that it can discard potentially useful data

that could be critical for the induction process.

Hybrid Sampling (ROS/RUS), combining ROS and RUS, wherein the majority class (NP0) is

undersampled and the minority class (NP1) is over-sampled.

In these experiments, four strategies were implemented, namely:

S1: The original training dataset was not altered.

S2: Undersampling the NP0 class.

S3: Oversampling the NP1 class.

S4: Undersampling the NP0 class and oversampling the NP1 class.
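The sampling strategies can be sketched with the standard library, following the ROS and RUS definitions given above (the majority class NP0 is undersampled, the minority class NP1 is oversampled). The function names, the toy dataset and the hybrid target size of 50 are assumptions made for the example; the thesis does not state the target sizes it used.

```python
import random


def random_oversample(minority, target_size, rng):
    """ROS: add copies drawn with replacement until the class reaches target_size."""
    extra = [rng.choice(minority) for _ in range(max(0, target_size - len(minority)))]
    return minority + extra


def random_undersample(majority, target_size, rng):
    """RUS: keep a random subset of target_size examples (without replacement)."""
    return rng.sample(majority, min(target_size, len(majority)))


def hybrid_sample(majority, minority, target_size, rng):
    """ROS/RUS: bring both classes to the same intermediate size."""
    return (
        random_undersample(majority, target_size, rng),
        random_oversample(minority, target_size, rng),
    )


rng = random.Random(42)
np0 = [("acct%d" % i, 0) for i in range(98)]  # toy majority class (~98%)
np1 = [("np%d" % i, 1) for i in range(2)]     # toy minority class (~2%)

s1 = np0 + np1                                   # S1: unaltered
s2 = random_undersample(np0, len(np1), rng) + np1  # S2: undersample NP0
s3 = np0 + random_oversample(np1, len(np0), rng)   # S3: oversample NP1
s4_np0, s4_np1 = hybrid_sample(np0, np1, 50, rng)  # S4: both
```

In this toy setting S2 shrinks the training set to four examples, which illustrates why undersampling can discard potentially useful data, while S3 repeats the same two never-payers 49 times each on average, which illustrates the overfitting risk.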

Two moments of prediction were also configured. The first one includes only “customer data” (CD), with attributes available during acquisition and before account approval. The second, “behavioural data” (BD), adds the usage attributes that become accessible after the subscription.

Three classification algorithms were tested: Decision Trees, Naïve Bayes and Logistic Regression. All three [5], [30] are implemented by Microsoft and included in SSAS.

The Decision Tree algorithm is very popular among data miners since it can predict both discrete and continuous variables and the generated rules are easy to understand.

The Naïve Bayes algorithm calculates probabilities for each possible state of the input attributes. It is very simple to process and provides baseline results. However, it does not support continuous variables, so for the purpose of this work they were discretised.
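As an illustration of what discretisation means here, the sketch below maps continuous values to bin indices. Equal-width binning is an assumption made for the example only; the text does not state which discretisation method was applied, and the function name and sample values are hypothetical.

```python
def discretise(values, n_bins):
    """Map each continuous value to an equal-width bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical average call durations in seconds, split into four bins.
print(discretise([0, 30, 60, 90, 120], 4))
```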

The Logistic Regression algorithm is a powerful and well-established statistical technique that estimates

the probabilities of the target categories [1]. It is analogous to simple linear regression but for discrete

outcomes.
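For reference, the standard binary logistic model can be written as follows. The symbols $x_i$ (input attributes) and $\beta_i$ (fitted coefficients) are notation introduced here for illustration, and the SSAS implementation may differ in its details.

```latex
\[
P(\text{NP1} \mid x_1, \dots, x_n) =
  \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
\]
```

The linear combination in the exponent plays the role of the linear regression, and the logistic function maps it to a probability between 0 and 1.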

Several techniques and datasets were combined to find the best approach, including:

• Different segments: consumer or business.

• Different balancing strategies: none (S1), undersampling (S2), oversampling (S3) or both (S4).

• Different data available: customer data (CD) or behavioural data (BD).

• Different algorithms: Decision Trees, Naïve Bayes or Logistic Regression.

Figure 27 shows an example of how this combination was achieved using the user interface of SSAS.

Figure 27 – Mining model for Consumers using oversampling and different algorithms.

The process of modelling is interactive, allowing the developer to understand how the algorithms pick the best attributes for prediction. Figure 28 shows an example of a Bayesian network produced by the Naïve Bayes algorithm applied to a hybrid sampling dataset, featuring consumer accounts with behavioural data. The Bayesian network shows that the strongest links with the target attribute (FlagNP) are the number of subscribed services, the customer district, the average call duration and the type of pricing plans subscribed.

Figure 28 – Example of the mining model viewer for a Naïve Bayes algorithm.

When applying the Microsoft Decision Tree algorithm on the same dataset, the best attributes are similar.

Figure 29 shows that the algorithm picks first the attributes that count the number of voice and mobile

broadband pricing plans of an account, as well as the account’s district and the login name that assessed

the risk before account activation. Accounts that subscribe more than five mobile (GSM) pricing plans,

more than four mobile broadband pricing plans and are from Beja District are likely to belong to the

never-payer population.

Figure 29 – Example of the mining model viewer for a Microsoft Decision Tree algorithm.

It is also possible to visually compare different algorithms using lift charts. The lift denotes how much better a classification model performs in comparison to a random selection [1]. The X-axis shows the percentage of customers analysed, whereas the Y-axis shows the percentage of never-payers correctly identified, out of all never-payers in the dataset. The lift chart therefore shows that if we analyse X% of customers, we will correctly identify Y% of the never-payer population.

Figure 30 shows the lift chart of a hybrid consumer dataset, wherein a randomly selected sample contains about 22% of never-payers. All the mining models perform better than the random guess model. The best is the yellow one, the decision tree algorithm using behavioural data, since it is closest to the line of the ideal model. The ideal model would find all of the never-payers just by looking at 22% of the total population.

Figure 30 – Example of a lift chart for a hybrid sampling data mining model.
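The points of such a chart can be computed by ranking customers by predicted risk and counting how many never-payers fall within the top X%. The sketch below is a minimal illustration, not the SSAS implementation; the function name, scores and labels are made up for the example.

```python
def cumulative_gains(scores, labels, fractions):
    """For each fraction of customers analysed (highest risk score first),
    return the share of all never-payers (label 1) found so far."""
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    total_never_payers = sum(ranked)
    gains = []
    for fraction in fractions:
        analysed = round(fraction * len(ranked))
        gains.append(sum(ranked[:analysed]) / total_never_payers)
    return gains


# Ten hypothetical customers, three of whom are never-payers.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(cumulative_gains(scores, labels, [0.2, 0.5, 1.0]))
```

A random selection would, on average, find a share of never-payers equal to the share of customers analysed, which is the diagonal baseline against which the models are compared.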

After defining the data mining models for training and testing, the next step comprises the design of a

validation plan and results evaluation. This phase is fully described in Section 5 and aims to pick the

best models for predicting the never-payer population. After validating the results, the next section

details how the system was operationalised and deployed for the end-user.

4.2.4. Deployment

This is the stage of the methodology that ensures the data mining process is repeatable across the enterprise [20]. The information and knowledge that were extracted from data need to be organised and presented so the end-user can use them. This included the operationalisation of all data mining steps using the available Microsoft BI stack technologies.

Integration Services (SSIS) orchestrates the complete data mining process, wherein each step is implemented by a package (.dtsx file) deployed to the SSIS server. Figure 31 lays out the sequence of SSIS packages, from extracting input files to testing new accounts. Each step is described in detail below.

Figure 31 – Data mining process orchestrated by Integration Services (SSIS).

[Figure 31 steps: Extract (collect and prepare initial data from flat files) → Classify (identify never-payer population) → Load (join all data in a single case table, account-level) → Sample (load sampling strategies) → Train (train each mining model) → Test (test mining models and log results).]

1. Extract is the data collection step: it reads initial data from CSV flat files containing business entities with attributes that may be useful for prediction. Data is stored in staging tables, where it is prepared, cleaned and summarised at the intended aggregation level. The loading process is designed to be efficient in both space and time, making use of bulk load SQL commands, table indexes and data compression.

2. Classify is responsible for adding an extra target attribute named FlagNP. This Boolean attribute indicates whether an account is a never-payer, and it is calculated by looking at the billing status of each account. The classification process outputs the labelled examples from which the supervised data mining algorithms learn to predict future never-payers.

3. Load integrates all business entities in a single case table. Each data record is commonly called a case, whose columns are the attributes with predictive potential as well as the target attribute, FlagNP. Each case describes the attributes of a specific account that will be used to train mining models. The final case table was the main input for exploration analysis and feature selection.

4. Sample automates the process of randomly sampling the case table for training, putting aside examples for subsequent testing in order to validate the models. Additionally, because of the high imbalance ratio between the class distributions available for training, sampling strategies had to be implemented, producing different datasets that artificially increase the proportion of never-payers among the account population. Those techniques included the aforesaid undersampling, oversampling and hybrid sampling strategies, as well as no sampling, to serve as a baseline.

5. Train systematises the process of training each model with the corresponding dataset. It combines four sampling strategies, three algorithms, two types of attributes and two degrees of complexity, resulting in 48 models (4 × 3 × 2 × 2) that were trained using the training examples set aside and sampled during the previous step.

6. Test mechanises the classification of the account examples that were set aside for testing. The already trained models are put into practice so that prediction results can be evaluated and validated. Additionally, it is possible to test new accounts fed by the user, which is, in fact, the main purpose of this application.
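The 48 combinations handled by the Train step can be enumerated as in the sketch below. The identifier pattern (for example `DT_Cons_S2_CD_Complex`) is inferred from the model IDs listed in Table 17; the function and variable names are hypothetical and the real SSIS package may name or generate the models differently.

```python
from itertools import product

STRATEGIES = ["S1", "S2", "S3", "S4"]   # none, undersampling, oversampling, both
ALGORITHMS = ["DT", "NB", "LR"]         # Decision Trees, Naive Bayes, Logistic Regression
DATA_TYPES = ["CD", "BD"]               # customer data, behavioural data
COMPLEXITIES = ["", "_Complex"]         # default or increased complexity


def model_ids(segment):
    """List one model identifier per combination for a segment ('Cons' or 'Bus')."""
    return [
        f"{algorithm}_{segment}_{strategy}_{data_type}{complexity}"
        for strategy, algorithm, data_type, complexity
        in product(STRATEGIES, ALGORITHMS, DATA_TYPES, COMPLEXITIES)
    ]


ids = model_ids("Cons")
print(len(ids), ids[:3])
```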

There are two main folders mapped into the user’s file system. The “Input” folder is where train and test files are dropped and processed, whereas the “Output” folder will contain the results of account predictions. From the user’s point of view, the interaction with the application begins with the input of the data needed to train the models, as well as the new accounts that need to be evaluated.

The “Input” folder contains a “Data” folder with one folder for each business entity, where the CSV flat files must be placed together with the format files (.fmt extension; see footnote 3) that describe the input data structures. The whole data mining process described in Figure 31 can be started on demand by running a batch file that starts an SQL Server Job responsible for running each one of the process packages. If the input files are faulty or some other error occurs, the errors are logged and the data is not loaded for training. Otherwise, the models are updated with the most recent input data and are ready for testing.

To make predictions using the already trained models, the user simply drops a CSV file containing account data inside the “Predict” folder at the root of the “Input” folder. This CSV must comply with the structure of the model set used for training, as described in Section 4.2.3. After running a batch file to start the SQL Job that tests new accounts, the file is moved to the “Processed” folder. Accounts featuring only customer data will be classified using data mining models that are specific to customer data. On the other hand, accounts supplied with behavioural data will be tested using both the customer data and the behavioural data mining models.

The “Output” folder now contains the account prediction results, presented as CSV files that can be viewed using Excel. The structure of the prediction output files is similar to that of the input files, but they also contain additional columns that describe the prediction result:

• Model Name – Name of the data mining model used for prediction.

• Prediction – Similar to the FlagNP attribute, but it represents the predicted result assigned by the algorithm; ‘1’ if the algorithm classifies the account as never-payer, ‘0’ otherwise.

• Probability – Also known as “risk score”, this is the probability assigned to the prediction made by the algorithm.

• Support – Represents the count of cases that match with the itemset or rule used for prediction.

• Description – Additional details that help the user understand how the algorithm works. For instance, the decision tree rule or association rule applied.

3 Format files https://technet.microsoft.com/en-us/library/aa173859(v=sql.80).aspx
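A prediction output file of this shape could be post-processed as in the sketch below, which keeps the accounts predicted as never-payers and ranks them by risk score. The column names follow the list above; the `AccountId` column, the sample rows and the function name are hypothetical, since the exact layout of the output files is not given here.

```python
import csv
import io

# Hypothetical excerpt of a prediction output file.
SAMPLE_OUTPUT = """AccountId,Model Name,Prediction,Probability,Support,Description
A1,DT_Cons_S4_CD,1,0.81,120,rule text
A2,DT_Cons_S4_CD,0,0.93,4500,rule text
A3,DT_Cons_S4_CD,1,0.64,75,rule text
"""


def flagged_accounts(csv_text, min_probability=0.5):
    """Return rows predicted as never-payers ('1'), highest risk score first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    hits = [row for row in rows
            if row["Prediction"] == "1"
            and float(row["Probability"]) >= min_probability]
    return sorted(hits, key=lambda row: float(row["Probability"]), reverse=True)


for row in flagged_accounts(SAMPLE_OUTPUT):
    print(row["AccountId"], row["Model Name"], row["Probability"])
```

Because Probability refers to the prediction that was made, the filter checks the Prediction column first rather than thresholding the probability alone.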

In addition to the prediction output files, CSV reports are also produced, featuring performance measures for each algorithm. These metrics are described in detail in the next section. All prediction results and statistics are accessible through a set of database views available to the end-user. Reporting tools such as Excel, Power View and others can connect to these views and produce statistical reports.

Every step of the data mining application is logged in logging tables, including information, warnings and

possible errors. Once again, database views are available for querying and reporting the state of the

process.

The business process describing the bad payer lifecycle presented in Section 3.1 (Figure 9) can be modified to accommodate the new data mining system, which enriches the risk evaluation step. Figure 32 introduces two different never-payer detection stages: the first during the Acquisition phase, using customer data, and the second during the Intermediate phase, after 30 days of service usage.

Figure 32 – Business process (TO-BE) describing the customer lifecycle of a never-payer.

The first “NP Detection” occurs right after the automatic credit evaluation performed by the Risk Evaluation system. This detection only uses customer data, that is, subscription data supplied by the prospect in the sign-up form and risk data from the risk assessment. After the never-payer detection algorithm runs, risk results are returned to the activations specialist, who decides whether the prospect is approved. If so, a mitigation strategy can be suggested to the prospect.

After the contract is accepted, the customer’s account is activated and service usage begins. After 30 days of usage, the second “NP Detection” takes place. This time, both customer and usage data are used as input for the detection algorithms. Risk results are, once again, returned to the activations specialist, who performs an intermediate risk evaluation. If something indicates that this customer will soon become a never-payer, a mitigation strategy is put in motion; otherwise, business proceeds as usual.

[Figure 32 diagram. Lanes: Prospect/Customer, Customer Activations, Risk Evaluation, Never-Payer Detection. Phases: Activation, Intermediate. Steps shown: sign-up form; process information; automatic credit evaluation (fiscal number); manual credit evaluation; NP Detection (customer data); approval, optionally with a mitigation strategy, or prospect rejected; draft contract; accept contract; account activation; 30 days of usage; NP Detection (customer data + behaviour data); intermediate risk evaluation; risk mitigation strategy or business as usual.]

5. Validation and Results

This chapter describes the validations performed on this system as well as their results.

5.1. Validation Plan

For evaluating the system, several items must be defined, particularly the testing set (i.e., dataset), and

the evaluation metrics to assess its effectiveness.

A classifier is typically evaluated by a confusion matrix [31]. Each confusion matrix entry provides the

number of customers with the given outcome, in terms of being a never-payer or not. For instance, a

true positive (TP) is a never-payer who was correctly identified. The effectiveness measures most widely

used in data mining are set out in terms of the contingency table [5], [32].

ACTUAL / PREDICTED | PREDICTED AS NP1 | PREDICTED AS NP0
ACTUAL NP1 | TP | FN
ACTUAL NP0 | FP | TN

Table 15 – Confusion matrix for the never-payer classifier.
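The four entries of the confusion matrix can be counted directly from the actual and predicted labels, as in the sketch below (1 stands for NP1 and 0 for NP0; the function name and sample labels are illustrative only).

```python
def confusion_matrix(actual, predicted):
    """Count TP, FN, FP and TN for binary labels (1 = NP1, 0 = NP0)."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["TP" if p == 1 else "FN"] += 1   # actual never-payer
        else:
            counts["FP" if p == 1 else "TN"] += 1   # actual regular payer
    return counts


print(confusion_matrix([1, 1, 0, 0, 0], [1, 0, 1, 0, 0]))
```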

Accuracy considers the population correctly identified by the system [5]. It reflects how well the classifier recognises customers of the two possible classes. Accuracy is the number of correct positive predictions (TP) and correct negative predictions (TN) divided by the total number of customers evaluated.

\[
\text{Accuracy} = \frac{|\{\text{correct predictions}\}|}{|\{\text{all population}\}|} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}
\]

Traditionally, accuracy is the most commonly used measure for these purposes [28]. However, because we have a class imbalance problem (only 2% are never-payers), accuracy is no longer a proper measure. Accuracy places more weight on the common classes than on rare classes, making it difficult for a classifier to perform well on the uncommon classes [29]. If the system classifies every customer as a regular payer (NP0), it achieves an accuracy of 98% without detecting a single never-payer, which is meaningless. Because of this, additional metrics are required.
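A small worked example makes the problem concrete. It assumes a hypothetical population of 1,000 customers with the 2% never-payer rate mentioned in the text, and a trivial classifier that labels every customer as NP0 (not a never-payer).

```python
n_customers = 1000
n_never_payers = 20          # 2% of customers, as in the text

# A trivial classifier that predicts NP0 (regular payer) for everyone:
# no never-payer is ever flagged, so TP = FP = 0.
tp, fn = 0, n_never_payers
fp, tn = 0, n_customers - n_never_payers

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2%} recall={recall:.2%}")
```

The accuracy is high only because the majority class dominates the count; the fraction of never-payers found is zero.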

If only the performance of the positive class is considered, precision and recall become relevant [28]. Precision (also known as positive predictive value) is the proportion of positive predictions that are correct [26], [28]. In other words, precision is the fraction of customers reported as never-payers by the system that actually are never-payers.

\[
\text{Positive Predictive Value} = \text{Precision} = \frac{|\{\text{correct NP1 predictions}\}|}{|\{\text{NP1 predictions}\}|} = \frac{TP}{TP + FP} \tag{3}
\]

Recall (also known as TP rate or sensitivity) is the proportion of positive items retrieved by the system [26], [28]. In short, recall is the fraction of all actual never-payers that are correctly identified by the system.

\[
\text{TP Rate} = \text{Sensitivity} = \text{Recall} = \frac{|\{\text{correct NP1 predictions}\}|}{|\{\text{NP1 population}\}|} = \frac{TP}{TP + FN} \tag{4}
\]

In an ideal system, precision and recall are close to one, but enhancing one metric can hurt the other.

F-measure is a combined score for the entire system that corresponds to the harmonic mean between

precision and recall [32]. It combines recall and precision, which are effective metrics when the

imbalance problem exists [28].

\[
\text{F-measure } (F_1) = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{5}
\]
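Equations (2) to (5) can be checked against the confusion-matrix counts reported for one of the models in Table 17. The sketch below uses the DT_Cons_S4_BD row (TN = 63,973, FN = 1,107, FP = 2,676, TP = 322); the function name is illustrative.

```python
def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F-measure) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure


# Counts reported for DT_Cons_S4_BD in Table 17.
acc, prec, rec, f1 = metrics(tp=322, tn=63973, fp=2676, fn=1107)
print(f"Acc. {acc:.2%}  Prec. {prec:.2%}  Recall {rec:.2%}  F-Measure {f1:.2%}")
```

These counts reproduce the 94.44%, 10.74%, 22.53% and 14.55% listed for that model.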

Each consumer and business dataset was split according to their number of attributes [33]. Table 16

summarises the splits for each segment. The whole set is divided into a testing and a training set. 20%

of the examples were saved for later testing. From the remaining 80%, 75-80% were used to train the

algorithms and the remaining 20-25% to validate them.

Consumer
Split | Share | # Cases | # NP1 (estimated) | % of Total
Total | 100% | 340,390 | 7,158 | -
Testing | 20% | 68,078 | 1,432 | 20%
Training | 80% | 272,312 | 5,726 | -
Validation | 25% of training | 68,078 | 1,432 | 20%
Training | 75% of training | 204,234 | 4,295 | 60%

Business
Split | Share | # Cases | # NP1 (estimated) | % of Total
Total | 100% | 55,797 | 631 | -
Testing | 20% | 11,159 | 126 | 20%
Training | 80% | 44,638 | 505 | -
Validation | 20% of training | 8,927 | 101 | 16%
Training | 80% of training | 35,710 | 404 | 64%

Table 16 – Training and testing set for Consumer and Business datasets.
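The split sizes for the Consumer segment follow from two successive percentage splits, as the sketch below shows. It only reproduces the arithmetic of Table 16; the function name is hypothetical and the actual random assignment of cases was done by the Sample step.

```python
def split_sizes(total, test_pct, validation_pct):
    """Return (testing, validation, training) sizes for a two-stage split.

    test_pct       - percentage of all cases set aside for testing
    validation_pct - percentage of the remaining cases used for validation
    """
    testing = total * test_pct // 100
    remaining = total - testing
    validation = remaining * validation_pct // 100
    training = remaining - validation
    return testing, validation, training


# Consumer segment: 20% for testing, then 25% of the remainder for validation.
print(split_sizes(340_390, 20, 25))
```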

5.2. Results

The predictive validation includes performance measures that evaluate the accuracy and precision of

predicted outcomes against expected results. It compares the performance of several predictive

algorithms in typical case scenarios (detailed in Section 4.2.3), using different sets of predictive

attributes.

The complete overview of the results when assessing all combinations of models is presented in Table 17. For the sake of simplicity, combinations that performed too poorly or were similar to one another were removed. Because we are dealing with very imbalanced datasets, the overall results are modest: precision never exceeds 13.17% and the F-measure never exceeds 14.55%. The number of false positives is clearly the biggest problem.

Consumer algorithms performed consistently better than the business algorithms. The main reason is that the never-payer population in the consumer segment (~2%) is bigger than in the business segment (~1%, see Table 16). It may seem a small difference, but the consumer algorithms also train with about ten times more never-payer examples than the business ones do. This difference leads to an even smaller precision for the business segment, since the number of never-payers correctly identified is much lower.

Overfitting happened mostly with oversampling strategies and when an algorithm’s complexity was increased. Even though the number of true positives increased, false alarms became too high.

When examples were oversampled to achieve a 50/50 chance of being a never-payer, more false alarms (false positives) were introduced. This increase happens because the algorithm is told that the world is a 50/50 chance, but when it is tested in the real world this assumption makes it biased. For example, the classifier may end up “remembering” a never-payer simply because it sees the same account several times.

Increasing the complexity of algorithms can also lead to overfitting. It is possible to increase or decrease the degree of complexity of every algorithm, for instance, by adjusting the maximum number of states or the maximum size of the tree or Bayesian network. As an example, a decision tree may be forced to grow a little longer, thus creating a more specific classifier. The more complex the algorithm is, the more specific and the less general the resulting model becomes. The key is to find the “sweet spot” wherein the degree of complexity does not hurt the precision and recall of the algorithm.

Undersampling also introduced many false alarms (false positives) and misses (false negatives), because random undersampling removed certain significant examples. One of its problems is that we do not control which critical examples from the NP0 class are thrown away. Valuable information about the decision boundary between the minority and majority classes may be eliminated.

The hybrid approach of using both undersampling and oversampling turned out to perform better,

especially when used with the Microsoft Decision Trees algorithm. Nevertheless, it was necessary to

tweak the size of each class to obtain these results.

In fact, Decision Trees performed best across the three sampling strategies, with both customer and behavioural data. One of the goals of this thesis was to see whether it is possible to predict the outcome with customer data only, and decision trees could do so, albeit introducing many false alarms.

Logistic Regression worked better with undersampling and hybrid strategies, but it generally performed worse, except for undersampling in the consumer dataset.

Naïve Bayes performed better for consumer datasets, but with many false alarms. It seems to work better with oversampling strategies.

In general, the models using behavioural data achieved the best results, which is understandable: a customer’s behaviour is an important indicator of how they will behave in the future. Stating that a customer will be a never-payer just by judging where they come from and which services they subscribe to seems to be very hard.

Model ID | Segment | Algorithm | Balancing | Obs. | Data Type | # Cases | # NP1 | TN | FN | FP | TP | Acc. | Prec. | Recall | F-Measure
NB_Cons_S1_CD | Consumer | Naive Bayes | None | - | Customer | 68,078 | 1,429 | 66,118 | 1,392 | 531 | 37 | 97.18% | 6.51% | 2.59% | 3.71%
NB_Cons_S1_BD_Complex | Consumer | Naive Bayes | None | - | Behaviour | 68,078 | 1,429 | 65,326 | 1,330 | 1,323 | 99 | 96.10% | 6.96% | 6.93% | 6.94%
LR_Cons_S1_CD_Complex | Consumer | Logistic Regression | None | - | Customer | 68,078 | 1,429 | 64,212 | 1,356 | 2,437 | 73 | 94.43% | 2.91% | 5.11% | 3.71%
NB_Bus_S1_CD | Business | Naive Bayes | None | - | Customer | 11,160 | 111 | 10,933 | 107 | 116 | 4 | 98.00% | 3.33% | 3.60% | 3.46%
NB_Bus_S1_BD | Business | Naive Bayes | None | - | Behaviour | 11,160 | 111 | 10,892 | 105 | 157 | 6 | 97.65% | 3.68% | 5.41% | 4.38%
DT_Cons_S2_CD_Complex | Consumer | Decision Trees | Undersampling | 10% NP0 | Customer | 68,078 | 1,429 | 64,181 | 1,174 | 2,468 | 255 | 94.65% | 9.36% | 17.84% | 12.28%
DT_Cons_S2_BD_Complex | Consumer | Decision Trees | Undersampling | 10% NP0 | Behaviour | 68,078 | 1,429 | 64,414 | 1,154 | 2,235 | 275 | 95.02% | 10.96% | 19.24% | 13.96%
NB_Cons_S2_BD_Complex | Consumer | Naive Bayes | Undersampling | 10% NP0 | Behaviour | 68,078 | 1,429 | 57,099 | 855 | 9,550 | 574 | 84.72% | 5.67% | 40.17% | 9.94%
LR_Cons_S2_CD | Consumer | Logistic Regression | Undersampling | 10% NP0 | Customer | 68,078 | 1,429 | 63,471 | 1,210 | 3,178 | 219 | 93.55% | 6.45% | 15.33% | 9.08%
LR_Cons_S2_BD | Consumer | Logistic Regression | Undersampling | 10% NP0 | Behaviour | 68,078 | 1,429 | 61,663 | 1,081 | 4,986 | 348 | 91.09% | 6.52% | 24.35% | 10.29%
NB_Bus_S2_CD | Business | Naive Bayes | Undersampling | 10% NP0 | Customer | 11,160 | 111 | 10,303 | 96 | 746 | 15 | 92.46% | 1.97% | 13.51% | 3.44%
NB_Bus_S2_BD_Complex | Business | Naive Bayes | Undersampling | 10% NP0 | Behaviour | 11,160 | 111 | 10,095 | 91 | 954 | 20 | 90.64% | 2.05% | 18.02% | 3.69%
LR_Bus_S2_CD_Complex | Business | Logistic Regression | Undersampling | 10% NP0 | Customer | 11,160 | 111 | 10,669 | 103 | 380 | 8 | 95.67% | 2.06% | 7.21% | 3.21%
LR_Bus_S2_BD_Complex | Business | Logistic Regression | Undersampling | 10% NP0 | Behaviour | 11,160 | 111 | 10,686 | 101 | 363 | 10 | 95.84% | 2.68% | 9.01% | 4.13%
DT_Cons_S3_CD | Consumer | Decision Trees | Oversampling | 50,000 NP1 | Customer | 68,078 | 1,429 | 65,820 | 1,315 | 829 | 114 | 96.85% | 12.09% | 7.98% | 9.61%
DT_Cons_S3_BD | Consumer | Decision Trees | Oversampling | 50,000 NP1 | Behaviour | 68,078 | 1,429 | 65,752 | 1,293 | 897 | 136 | 96.78% | 13.17% | 9.52% | 11.05%
DT_Cons_S3_BD_Complex | Consumer | Decision Trees | Oversampling | 50,000 NP1 | Behaviour | 68,078 | 1,429 | 64,488 | 1,176 | 2,161 | 253 | 95.10% | 10.48% | 17.70% | 13.17%
NB_Cons_S3_CD | Consumer | Naive Bayes | Oversampling | 50,000 NP1 | Customer | 68,078 | 1,429 | 59,998 | 1,013 | 6,651 | 416 | 88.74% | 5.89% | 29.11% | 9.79%
NB_Cons_S3_BD | Consumer | Naive Bayes | Oversampling | 50,000 NP1 | Behaviour | 68,078 | 1,429 | 58,922 | 936 | 7,727 | 493 | 87.27% | 6.00% | 34.50% | 10.22%
DT_Bus_S3_CD | Business | Decision Trees | Oversampling | 5,000 NP1 | Customer | 11,160 | 111 | 10,924 | 106 | 125 | 5 | 97.93% | 3.85% | 4.50% | 4.15%
DT_Bus_S3_BD | Business | Decision Trees | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 10,948 | 105 | 101 | 6 | 98.15% | 5.61% | 5.41% | 5.50%
DT_Bus_S3_BD_Complex | Business | Decision Trees | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 10,839 | 101 | 210 | 10 | 97.21% | 4.55% | 9.01% | 6.04%
NB_Bus_S3_CD | Business | Naive Bayes | Oversampling | 5,000 NP1 | Customer | 11,160 | 111 | 10,269 | 94 | 780 | 17 | 92.17% | 2.13% | 15.32% | 3.74%
NB_Bus_S3_BD | Business | Naive Bayes | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 9,885 | 85 | 1,164 | 26 | 88.81% | 2.18% | 23.42% | 4.00%
NB_Bus_S3_BD_Complex | Business | Naive Bayes | Oversampling | 5,000 NP1 | Behaviour | 11,160 | 111 | 10,177 | 90 | 872 | 21 | 91.38% | 2.35% | 18.92% | 4.18%
DT_Cons_S4_CD | Consumer | Decision Trees | Both | 40,000 NP0, 50% NP1 | Customer | 68,078 | 1,429 | 64,024 | 1,115 | 2,625 | 314 | 94.51% | 10.68% | 21.97% | 14.38%
DT_Cons_S4_BD | Consumer | Decision Trees | Both | 40,000 NP0, 50% NP1 | Behaviour | 68,078 | 1,429 | 63,973 | 1,107 | 2,676 | 322 | 94.44% | 10.74% | 22.53% | 14.55%
NB_Cons_S4_CD | Consumer | Naive Bayes | Both | 40,000 NP0, 50% NP1 | Customer | 68,078 | 1,429 | 55,193 | 811 | 11,456 | 618 | 81.98% | 5.12% | 43.25% | 9.15%
NB_Cons_S4_BD | Consumer | Naive Bayes | Both | 40,000 NP0, 50% NP1 | Behaviour | 68,078 | 1,429 | 54,124 | 736 | 12,525 | 693 | 80.52% | 5.24% | 48.50% | 9.46%
DT_Bus_S4_CD | Business | Decision Trees | Both | 10,000 NP1, 30% NP0 | Customer | 11,160 | 111 | 10,334 | 92 | 715 | 19 | 92.77% | 2.59% | 17.12% | 4.50%
DT_Bus_S4_BD | Business | Decision Trees | Both | 10,000 NP1, 30% NP0 | Behaviour | 11,160 | 111 | 10,128 | 81 | 921 | 30 | 91.02% | 3.15% | 27.03% | 5.65%
NB_Bus_S4_CD | Business | Naive Bayes | Both | 10,000 NP1, 30% NP0 | Customer | 11,160 | 111 | 8,615 | 68 | 2,434 | 43 | 77.58% | 1.74% | 38.74% | 3.32%
NB_Bus_S4_BD | Business | Naive Bayes | Both | 10,000 NP1, 30% NP0 | Behaviour | 11,160 | 111 | 8,243 | 57 | 2,806 | 54 | 74.35% | 1.89% | 48.65% | 3.64%
LR_Bus_S4_CD_Complex | Business | Logistic Regression | Both | 10,000 NP1, 30% NP0 | Customer | 11,160 | 111 | 10,762 | 98 | 287 | 13 | 96.55% | 4.33% | 11.71% | 6.33%
LR_Bus_S4_BD_Complex | Business | Logistic Regression | Both | 10,000 NP1, 30% NP0 | Behaviour | 11,160 | 111 | 10,728 | 99 | 321 | 12 | 96.24% | 3.60% | 10.81% | 5.41%

Table 17 – Validation results for all combinations of segments, algorithms, sampling strategies and data types.

6. Conclusion

After implementing and assessing the performance of the system, this chapter presents the goals that

were met and new ideas for future developments.

6.1. Contributions

The main goal of this work was to predict if a customer will not pay any of his bills, even before the

customer’s account is activated. At that point, too little customer data is available for analysis.

So, the first challenge was to understand how debtors behave in the telecommunications industry. The customer lifecycle was described, and the fraudster was introduced. Few systems can predict the probability of a customer becoming fraudulent during the acquisition phase. Two patented systems were described and compared. Even though few implementation details were available, the typology of data to be used as predictive attributes was the most valuable insight.

Then, the main subject of this case study was presented. Both the business and data models were detailed and fitted within the customer lifecycle. This business analysis was necessary for requesting the appropriate data for analysis.

The solution implemented combined database, integration and analytical components. The development was directed by the CRISP-DM methodology [20], [21], wherein the first step was to define the data mining problem, and it was decided that classification algorithms would be tested. Several techniques and exploration tools were used to get to know the provided data better. The data was loaded, cleaned and reduced to the final set of attributes that showed the most potential for prediction.

Several combinations of balancing strategies, types of data, segments and algorithms were tested to find the best approaches for predicting the never-payer population. These test runs produced metrics that were evaluated and discussed.

The full system can operationalise the learning and testing process, outputting the probability of a

customer being a never-payer.

6.2. Future Work

This section points out several enhancements that could be made to boost the performance measures

presented above, as well as operational improvements to the process.

The experiments performed in this thesis required between 13 and 24 features (attributes) to estimate the probability of being a never-payer customer. Since we are dealing with large data volumes, the computation of these features turned out to be burdensome to run. For that reason, continuous attributes were discretised. Also, not every combination of attributes was investigated. That would require 2^N experiments, where N represents the number of attributes, for every algorithm and balancing strategy. It would be interesting to use smarter feature selection methods to help choose the best set of predictive attributes.
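To give an idea of the scale involved, the sketch below counts the experiments an exhaustive search over attribute subsets would need, assuming the three algorithms and four balancing strategies used in this work and the 13 and 24 attributes mentioned above. The function name is illustrative.

```python
def exhaustive_experiments(n_attributes, n_algorithms=3, n_strategies=4):
    """Number of runs needed to try every attribute subset with every
    algorithm and balancing strategy."""
    subsets = 2 ** n_attributes   # every subset, including the empty one
    return subsets * n_algorithms * n_strategies


for n in (13, 24):
    print(n, 2 ** n, exhaustive_experiments(n))
```

With 24 attributes there are already more than 16 million subsets per algorithm and strategy, which is why a guided feature selection method is preferable to exhaustive search.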

There is much room for improvement regarding the inclusion of extra features that were not available at

the time. For instance, more demographic data, the contact channel and additional behaviour such as

complaints.

During data preparation, several cleansing techniques were applied, but there is room for a more careful

outlier analysis and sparseness removal. Cleansing techniques are especially important for continuous

attributes of usage and risk evaluation.

In addition to the three sampling strategies tested in this thesis – RUS, ROS and RUS/ROS (Section 4.2.3) – the SMOTE sampling strategy could also help by synthesising items belonging to the never-payer class [26]. Another way to improve the performance of these models is to include more never-payer examples from past years, to balance the dataset.
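The core idea of SMOTE is to create new minority examples by interpolating between an existing minority example and one of its nearest minority neighbours, instead of copying examples as ROS does. The sketch below is a simplified illustration of that interpolation for numeric attributes only; it is not a complete SMOTE implementation, and the function name, parameters and sample points are hypothetical.

```python
import random


def smote_like(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority points on segments between a randomly chosen
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Four hypothetical never-payer accounts described by two numeric attributes.
never_payers = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (2.5, 2.5)]
print(smote_like(never_payers, 3))
```

Categorical attributes such as the customer district would need a different treatment, since they cannot be interpolated.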

The set of the best predictive models could also be combined in order to output a weighted score. This

strategy is quite common in the credit industry (Section 2.2.2).
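One simple way to combine models is a weighted average of the never-payer probabilities they output. The sketch below is only an illustration: the model names, probabilities and weights are hypothetical, and the weights could for instance be derived from each model's F-measure.

```python
def weighted_score(probabilities, weights):
    """Weighted average of per-model never-payer probabilities.

    probabilities - model name -> predicted probability of being NP1
    weights       - model name -> non-negative weight of that model
    """
    total_weight = sum(weights[name] for name in probabilities)
    weighted_sum = sum(probabilities[name] * weights[name] for name in probabilities)
    return weighted_sum / total_weight


# Hypothetical scores for one account from three models.
scores = {"DT": 0.8, "NB": 0.6, "LR": 0.4}
weights = {"DT": 2.0, "NB": 1.0, "LR": 1.0}
print(round(weighted_score(scores, weights), 2))
```

All inputs must be probabilities of the same class (NP1); a model that predicts ‘0’ with probability p contributes 1 − p.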

Finally, in addition to the views and tables that were provided for the user to get the prediction results, a

predefined report could also be added, helping business users to obtain KPIs regarding the results.

7. References

[1] K. Tsiptsis and A. Chorianopoulos, Data Mining Techniques in CRM. 2010.

[2] M. J. A. Berry and G. S. Linoff, Mastering Data Mining - The Art and Science of Customer Relationship Management, 1st ed. Wiley, 1999.

[3] G. S. Linoff and M. J. A. Berry, Data Mining Techniques For Marketing, Sales, and Customer Relationship Management, 3rd ed. Wiley, 2011.

[4] Y. Zhang, R. Liang, Y. Li, Y. Zheng, and M. Berry, “Behavior-Based Telecommunication Churn Prediction with Neural Network Approach,” 2011 Int. Symp. Comput. Sci. Soc., pp. 307–310, 2011.

[5] J. Han and M. Kamber, Data Mining. Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006.

[6] R. A. Becker, C. Volinsky, and A. R. Wilks, “Fraud Detection in Telecommunications: History and Lessons Learned,” Technometrics, vol. 52, no. 1, pp. 20–33, 2010.

[7] R. J. Bolton and D. J. Hand, “Statistical Fraud Detection: A Review,” Stat. Sci., vol. 17, no. 3, pp. 235–255, 2002.

[8] M. Ghosh, “Telecoms fraud,” Comput. Fraud Secur., vol. 2010, no. 7, pp. 14–17, 2010.

[9] Communications Fraud Control Association, “CFCA 2015 Global Fraud Loss Survey,” 2015.

[10] P. Hoath, “Telecoms Fraud, The Gory Details,” Comput. Fraud Secur., vol. 1998, no. 1, pp. 10–14, 1998.






[16] P. Hoath, “What’s new in telecoms fraud?,” Comput. Fraud Secur., vol. 1999, no. 2, pp. 13–19, 1999.

[17] C. J. Celka and C. R. Rojas, “System and method for automated detection of never-pay data sets,” US20080294540 A1, 27-Nov-2008.

[18] R. Mahdi, D. Villagomez, and C. Jones, “First party fraud detection system,” US20140279379 A1, 18-Sep-2014.

[19] CTT Correios de Portugal, “Postcode,” 2015. [Online]. Available: https://www.ctt.pt/feapl_2/app/restricted/postalCodeSearch/postalCodeDownloadFiles.jspx?lang=01. [Accessed: 30-Jun-2015].

[20] C. Shearer, H. J. Watson, D. G. Grecich, L. Moss, S. Adelman, K. Hammer, and S. a Herdlein, “The CRISP-DM model: The New Blueprint for Data Mining,” J. Data Warehous., vol. 5, no. 4, pp. 13–22, 2000.

[21] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and W. Rudiger, “CRISP-DM 1.0,” Cris. Consort., p. 76, 2000.

[22] Microsoft, “Data Mining Concepts,” Microsoft Developer Network. [Online]. Available: https://msdn.microsoft.com/en-us/library/ms174949(v=sql.110).aspx. [Accessed: 01-Mar-2015].

[23] S. Rosset, U. Murad, E. Neumann, Y. Idan, and G. Pinkas, “Discovery of Fraud Rules for Telecommunications - Challenges and Solutions,” in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 409–413.

[24] A. Rehman and A. R. Ali, “Customer Churn Prediction, Segmentation and Fraud Detection in Telecommunication Industry,” pp. 1–9, 2014.

[25] V. Ganganwar, “An overview of classification algorithms for imbalanced datasets,” Int. J. Emerg. Technol. Adv. Eng, vol. 2, no. 4, pp. 42–47, 2012.

[26] P. Brennan, “A comprehensive survey of methods for overcoming the class imbalance problem in fraud detection,” no. June, pp. 1–107, 2012.

[27] M. R. C. De Leon and E. R. L. Jalao, “Prediction Model Framework for Imbalanced Datasets,” pp. 33–41, 2014.

[28] Q. Gu, Z. Cai, L. Zhu, and B. Huang, “Data Mining on Imbalanced Data Sets,” 2008 Int. Conf. Adv. Comput. Theory Eng., pp. 1020–1024, 2008.

[29] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Handling imbalanced datasets: A review,” GESTS Int. Trans. Comput. Sci. Eng., vol. 30, no. 1, pp. 25–36, 2006.

[30] W. Habboub, “The Nine Data Mining Algorithms in SSAS,” TechNet Articles, 2012. [Online]. Available: http://social.technet.microsoft.com/wiki/contents/articles/6775.the-nine-data-mining-algorithms-in-ssas.aspx. [Accessed: 16-Aug-2015].

[31] N. V. Chawla, “Data Mining for Imbalanced Datasets: An Overview,” in Data Mining and Knowledge Discovery Handbook, Springer-Verlag, 2005, pp. 853–867.

[32] N. Chinchor, “MUC-4 Evaluation Metrics,” Proc. 3rd Conf. Messag. Underst. - MUC3 ’91, pp. 22–29, 1991.

[33] I. Guyon, “A scaling law for the validation-set training-set size ratio,” AT&T Bell Lab., pp. 1–11, 1997.
