
SOMDEEP KUMAR SEN

Trimax Analytics and Optimization Services

2/21/2014

SUBSCRIPTION FRAUD ANALYTICS USING CLASSIFICATION

Subscription Fraud Analytics Using Naïve Bayes Classifier


Contents

Introduction
Overview of the study
Objective of the Study
Telecommunication fraud: an Overview
  Definition
  Types
    Subscription fraud
    Recharge Voucher Fraud
    Pre-paid Balance Fraud
    Unauthorized Service Fraud
Models Used
  Naïve Bayes Classification: an overview
  Decision Tree (A Supervised Learning Method):
Methodology
Analysis & Findings
  Using R
  Using RapidMiner
Conclusion



Introduction

The advancement of technological tools such as computers, the internet, and cellular phones has made life easier and more convenient for most people in our society. However, some individuals and groups have subverted these telecommunication devices into tools to defraud numerous unsuspecting victims. It is not uncommon for a scam to originate in a city, county, state, or even a country different from that in which the victim resides. While telecom fraud may occur in different forms, the present study focuses on the use of analytics to detect subscription fraud, specifically on the application of the Naïve Bayes classification algorithm to detect and predict probable fraudsters.

Overview of the study

A fictitious telecom company called Bad Idea came up with a strange rate plan called the Praxis Plan, under which callers are allowed to make only one call in each of four time slots: Morning (9AM-Noon), Afternoon (Noon-4PM), Evening (4PM-9PM) and Night (9PM-Midnight), i.e. four calls per day. Despite the popularity of the plan, Bad Idea was a target of subscription fraud by a gang of three fraudsters: Sally, Virginia and Vince. Their services were finally terminated. Bad Idea has their call logs spanning over one and a half months.

The analytics team of the company has been provided with two data sets: the Black-List Subscriber Call-Logs and the Audit Log. The Black-List Subscriber Call-Logs data set contains the calling patterns of the three fraudsters, i.e. Sally, Virginia and Vince. Every 5 days the company undertakes an audit to see whether these fraudsters have joined its network. The company reviews the list of subscribers who have made calls to the same people as these three fraudsters and in the same time frame. This list is provided in the Audit Log.

Test Data: http://bit.ly/1du9cRs

Training Data: http://bit.ly/1du9AQ1




Objective of the Study

To provide the names of the probable callers, with the confidence expressed as a probability

To provide the name of the fraudster, if any

To provide the code used to determine the subscriber

Telecommunication fraud: an Overview

Definition

Telecommunication fraud is the theft of telecommunication service (telephones, cell phones, computers, etc.) or the use of telecommunication service to commit other forms of fraud. Victims include consumers, businesses and communication service providers.

Types

Subscription fraud

Subscription fraud occurs when someone signs up for service with fraudulently obtained customer information or false identification. Lawbreakers obtain a person's personal information and use it to set up a cell phone account in that person's name.

Recharge Voucher Fraud

This mainly includes unusual top-up recharges and a high number of recharges in a given time period.

Pre-paid Balance Fraud

Employees with a high number of manual balance changes, as well as subscribers with high balances, might be an indication of Pre-paid Balance Fraud.

Unauthorized Service Fraud

A mismatch found when reconciling HLR (Home Location Register) records against post-paid subscriber profiles, or HLR services against post-paid subscriber services, or a sudden change in subscriber usage, could be an indication of Unauthorized Service Fraud.



Models Used

Naïve Bayes Classification: an overview

A Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".

In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined and not the entire covariance matrix.
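The independence assumption described above can be illustrated with a minimal sketch for categorical features, written here in plain Python rather than R so that it runs without any packages. The call-log rows, the two features (morning and evening callee) and the function names are invented for illustration; they are not the study's data or code.

```python
from collections import Counter, defaultdict

# Hypothetical call log: (morning callee, evening callee) -> caller.
# These rows are invented for illustration; they are not the study's data.
rows = [
    (("Kelly", "Frank"), "Vince"),
    (("Larry", "Frank"), "Vince"),
    (("Kelly", "Clark"), "Sally"),
    (("Robert", "Clark"), "Sally"),
    (("Kelly", "Frank"), "Sally"),
    (("Kelly", "Clark"), "Virginia"),
    (("Larry", "Clark"), "Virginia"),
]

def train(rows):
    """Count class frequencies and per-feature value frequencies within each class."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)  # (feature index, label) -> Counter of values
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def posterior(features, class_counts, value_counts):
    """Return P(class | features) under the naive independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(features):
            # Each feature contributes its own factor P(feature_i = value | class).
            score *= value_counts[(i, label)][value] / n
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

class_counts, value_counts = train(rows)
probs = posterior(("Kelly", "Clark"), class_counts, value_counts)
print(probs)
```

Each class score is the prior multiplied by one conditional probability per feature, which is exactly where the "independent feature model" enters. No smoothing is applied in this sketch, so a value never seen with a class drives that class's score to zero.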

Decision Tree (A Supervised Learning Method):

A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after evaluating all attributes). The paths from root to leaf represent classification rules. In decision analysis, a decision tree and the closely related influence diagram are used as visual and analytical decision support tools, where the expected values (or expected utility) of competing alternatives are calculated.
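As a rough illustration of how a path from root to leaf acts as a classification rule, the sketch below stores a tiny tree as nested Python dictionaries and walks it for one record. The tree, its attribute names and its default labels are hypothetical and are not the tree fitted in this study.

```python
# Hypothetical tree: internal nodes test one attribute, leaves hold a class label.
# The structure and names are invented for illustration only.
tree = {
    "attribute": "evening",
    "branches": {
        "Frank": {
            "attribute": "night",
            "branches": {"Clark": "Vince"},
            "default": "Sally",
        },
    },
    "default": "Sally",
}

def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        value = record.get(node["attribute"])
        node = node["branches"].get(value, node["default"])
    return node

print(classify(tree, {"evening": "Frank", "night": "Clark"}))
print(classify(tree, {"evening": "Larry", "night": "Clark"}))
```

The first record follows the rule "evening = Frank and night = Clark" down to its leaf; the second falls through to the default label at the root.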




Methodology

In order to make the final prediction, Naïve Bayes classification was carried out in two different packages, R and RapidMiner, so that the results provided by the two packages could be compared.

Analysis & Findings

Using R

Our training data (BlackListSubscriberCallLogs), in the form of an Excel sheet, has 138 instances of the names of the people called by the fraudsters Sally, Vince and Virginia in each of the time frames. We import this dataset into R as “blacklisted”.

We also have a file (Audit Log) of 15 instances, for which we predict the fraudster by the end of this report. This is our unseen data. We import this dataset into R as “audit”.

The Process:

Import the datasets and understand them

Install packages and load the libraries “caret” and “klaR” for Naïve Bayes and “party” for Decision Tree

Train our model (Naïve Bayes) using 10-fold cross validation

Tweak the parameters of the model to obtain finer results

Check for Accuracy and Kappa values

Compare the result of the Naïve Bayes model with a 10-fold cross validated Decision Tree model



The model is trained with 10-fold cross validation, which splits the entire dataset into 10 parts of roughly equal size and uses 9 parts for training and 1 part for testing. This is repeated 10 times, so that each part is held out for testing exactly once. The reported performance of the model is the average over all 10 trials.
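The fold mechanics can be sketched in a few lines of Python. This is only an illustration of how 138 rows might be partitioned into 10 folds; the function name and the seed are assumptions, and it is not the routine the R packages use internally.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices once, then cut them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(138, k=10)
for held_out in folds:
    training = [i for fold in folds if fold is not held_out for i in fold]
    # A model would be fit on `training` and scored on `held_out` here.
    assert len(training) + len(held_out) == 138

print([len(fold) for fold in folds])
```

Every row lands in exactly one fold, so each observation is used for testing once and for training nine times.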

Now, we shuffle (for random sampling) our dataset (blacklisted) and take 15 observations (about 10%) on which to apply our model and check its accuracy. Calling the set.seed() function beforehand makes the random sample reproducible, so that the same set of observations can be drawn again.
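The role of the seed can be shown with a small Python sketch, in which `random.Random(seed)` plays the part that set.seed() plays in R. The helper name and the seed value 42 are hypothetical, and the rows selected here will not match those drawn in R.

```python
import random

def holdout_sample(n_rows, n_holdout, seed):
    """Return sorted row indices of a reproducible random hold-out sample."""
    return sorted(random.Random(seed).sample(range(n_rows), n_holdout))

first = holdout_sample(138, 15, seed=42)
second = holdout_sample(138, 15, seed=42)
print(first == second)  # the same seed selects the same 15 rows
```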



Upon analyzing the confusion matrix, we find that:

The accuracy of our model is (6+3+2)/15 = 73.3%

Precision of predicting Sally = 6/9 = 66.67%

Precision of predicting Vince = 3/3 = 100%

Precision of predicting Virginia = 2/3 = 66.67%
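The arithmetic behind these figures can be checked with a short Python sketch. It uses only the counts quoted above (correct predictions per fraudster and the number of times each name was predicted); the off-diagonal cells of the confusion matrix are not given in the text and are not assumed here.

```python
# Counts taken from the figures quoted above: correct predictions per class
# and the total number of times each class was predicted.
correct = {"Sally": 6, "Vince": 3, "Virginia": 2}
predicted = {"Sally": 9, "Vince": 3, "Virginia": 3}

# Accuracy: all correct predictions over all predictions made.
accuracy = sum(correct.values()) / sum(predicted.values())
# Precision per class: correct predictions of a name over all predictions of that name.
precision = {name: correct[name] / predicted[name] for name in correct}

print(f"accuracy = {accuracy:.1%}")
for name, value in precision.items():
    print(f"precision({name}) = {value:.2%}")
```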

Now, we tweak our model using the parameters fL (Laplace correction) and usekernel. Laplace correction is a smoothing technique that assigns non-zero probability to events that do not occur in the sample. Setting usekernel replaces the normal density assumed for numeric predictors with a kernel density estimate, a non-parametric way to estimate the probability density function of a random variable.
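The effect of Laplace correction can be sketched with the standard additive smoothing formula. The parameter name fL is borrowed from the text; the counts are hypothetical, and the function is an illustration of the technique rather than the exact implementation in any R package.

```python
def smoothed_probability(count, class_total, n_values, fL=1.0):
    """Laplace-smoothed estimate of P(value | class) for a categorical feature."""
    return (count + fL) / (class_total + fL * n_values)

# Hypothetical counts: a callee never seen for this caller in 46 calls,
# with 20 distinct callees possible in that time slot.
print(smoothed_probability(0, 46, 20, fL=0))  # unsmoothed estimate
print(smoothed_probability(0, 46, 20, fL=1))  # smoothed estimate
```

With fL = 0 the unseen callee gets probability zero, which would wipe out the whole product for that class; with fL = 1 it gets a small positive probability instead.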



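We observe that the outcomes of both models, fit and fit1 (with Laplace and usekernel), are identical in this case. For usekernel this is expected, since kernel density estimation affects only numeric predictors and the predictors here are names, i.e. categorical.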

Thus, with 73% accuracy, we apply our model to the unseen data (audit).

To obtain the posterior probabilities for each observation in the unseen data, we type the following command:



Now, we build the Decision Tree model and plot it:

> plot(ctreeFit$finalModel)



Reading the graph: if the evening call is made to Frank, the night call to Clark, and the morning call to Kelly, Larry or Robert, the probability that the caller is Vince is close to 80%.

A disadvantage of the Decision Tree is that the name Sally is shown below each of the nodes, irrespective of the correct name.

We also find that the accuracy of Naïve Bayes (0.661) is better than that of the Decision Tree (0.578). We therefore now compare the Naïve Bayes output obtained from RapidMiner against the output of R.



Using RapidMiner

Initially both data sets are uploaded into RapidMiner. The Black-List Subscriber Call-Logs data set is named telecom234 and the Audit Log is named telecom56. Many different classification algorithms are available in RapidMiner; of these we choose the Naïve Bayes classifier. The telecom234 data set is dragged into the main window.

In data mining we use the concept of data splitting: we divide our data set into two parts, a training set and a validation set. The purpose of the training set is to create the model, whereas the validation set is used to estimate the accuracy of the created model. To create a model and estimate its accuracy using data splitting, we use the validation operators found in the Evaluation -> Validation folder of the operator window. The most commonly used operators are Split Validation and X-Validation. We first make use of Split Validation. Drag and drop the Split Validation operator into the process window. Split Validation is a group operator, i.e. it groups multiple operators within it. Group operators carry a special sign: two overlapping blue squares on their icon, as shown in the figure below.



The Validation operator has a split ratio parameter (visible in the parameter window on the right), which specifies how the data set will be split. The value 0.7 in the figure above splits the data set into 70% for the training set and the remaining 30% for the testing set. Because the Split Validation operator is a nested one, double-clicking it opens the Validation sub-process window, which has two parts: Training and Testing.
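To make the split ratio concrete, the sketch below applies a 0.7 ratio to 138 rows in plain Python. The function name, the seed, the shuffling and the rounding rule are assumptions made for illustration; RapidMiner's own sampling and rounding behaviour is not claimed here.

```python
import random

def split_by_ratio(rows, ratio=0.7, seed=0):
    """Shuffle a copy of rows and split it into (training, testing) by ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * ratio)  # assumed rounding rule
    return shuffled[:cut], shuffled[cut:]

training, testing = split_by_ratio(range(138), ratio=0.7)
print(len(training), len(testing))
```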

Now the Naïve Bayes operator is placed in the Training window. Validation allows us to estimate the accuracy of our model. For this purpose, RapidMiner provides many performance operators in the Performance Measurement folder. Place the Apply Model and Performance (Classification) operators in the Testing window, as shown in the figure below.



Now the averagable port of the Validation operator is connected to the result port of the process window. In the result perspective one gets a performance vector with details about the performance of the created model. For example, the created model has an accuracy of almost 71%, as shown in the figure below, which is close to the 73.3% obtained in R.



Once the model is created, it is time to use it to perform classification/prediction. The telecom56 data set, which is the unlabeled data set, is dragged into the main window, along with an Apply Model operator. The Apply Model operator receives the model from the Validation operator and applies it to the unlabeled input, i.e. the telecom56 data, as shown in the figure below.



Now, running the whole process provides the prediction, as shown in the figure below, in the form of the names of the probable callers along with the confidence expressed as a probability.



Conclusion

Comparing the accuracy and precision from the confusion matrices of the RapidMiner and R results, we see:

For R:

The accuracy of our model is (6+3+2)/15 = 73.3%

Precision of predicting Sally = 6/9 = 66.67%

Precision of predicting Vince = 3/3 = 100%

Precision of predicting Virginia = 2/3 = 66.67%

For RapidMiner:

The accuracy of our model is 70.73%

Precision of predicting Sally = 81.82%

Precision of predicting Vince = 58.33%

Precision of predicting Virginia = 72.22%

Since both statistical packages give an accuracy above 70%, we can be reasonably confident in our model and conclude that Naïve Bayes may be considered the best classifier in this case, where the training data is small and categorical.