
Expert Systems with Applications 40 (2013) 6561–6569


A combined mining-based framework for predicting telecommunications customer payment behaviors

0957-4174/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.eswa.2013.06.001

⇑ Corresponding author. E-mail addresses: [email protected] (C.-H. Chen), [email protected]. edu.tw, [email protected] (R.-D. Chiang), [email protected] (T.-F. Wu), [email protected] (H.-C. Chu).

Chun-Hao Chen, Rui-Dong Chiang ⇑, Terng-Fang Wu, Huan-Chen Chu
Department of Computer Science and Information Engineering, Tamkang University, Taipei 251, Taiwan, ROC

Keywords: Late payment prediction system; Association rules; Clustering; Decision trees; Domain-driven data mining

Abstract

Most existing data mining algorithms apply data-driven data mining technologies. The major disadvantage of this method is that expert analysis is required before the derived information can be used. In this paper, we thus adopt a domain-driven data mining strategy and utilize association rules, clustering, and decision trees to analyze the data from fixed-line users for establishing a late payment prediction system, namely the Combined Mining-based Customer Payment Behavior Predication System (CM-CoP). The CM-CoP can indicate potential users who may not pay the fee on time. In the implementation of the proposed system, association rules were first used to analyze customer payment behavior, and the results of the analysis were used to generate derivative attributes. Next, a clustering algorithm was used for customer segmentation. The cluster of customers who paid their bills was found and then deleted to reduce data imbalance. Finally, a decision tree was utilized to predict and analyze the rest of the data using the derivative attributes and the attributes provided by the telecom providers. In the evaluation results, the average accuracy of the CM-CoP model was 78.53% under an average recall of 88.13% and an average gain of 11.2% after a six-month validation. Since the prediction accuracy of the existing method used by telecom providers was 65.60%, the prediction accuracy of the proposed model was about 13 percentage points higher. In other words, the results indicate that the CM-CoP model is effective and outperforms the existing approach used by the telecom providers.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

The telecom market has developed rapidly, and telecom providers have spared no effort to increase their revenue by winning more customers and improving performance. However, they still have to deal with late payments from customers. Most customers pay their bills on time, but some do not, either intentionally or because they forget to make the payment. These two behaviors are collectively called late payments.

There are many types of fraud, and telephone fraud is a common one (Taniguchi, Haft, Hollmen, & Tresp, 1998). The literature indicates that telephone fraud causes losses of two to three billion US dollars each year, and that losses from telephone fraud comprise 1.5% to 5% of total turnover. The traditional monitoring methods used by fixed-line providers identify abnormal situations after the expiration of a payment period, but by then the loss has already occurred. Obviously, traditional monitoring methods do not satisfy the telecom providers' needs for risk control. A number of scholars have used data mining technologies (without the use of additional equipment or extra loads) to analyze communications and liaison records, and to conduct user profile analysis in order to discover fraudulent behavior (Cahill, Lambert, Pinheiro, & Sun, 2002; Fawcett & Provost, 1997; Hung, Yen, & Wang, 2006; Schommer, 2010; Wheeler & Aitken, 2000; Yan, Wolniewicz, & Dodier, 2004).

Late payments may be caused by fraud, habitual delays, or other special reasons (Hung et al., 2006; Schommer, 2010). Fraudulent behavior can cause telecom providers to suffer heavy short-term losses. Although late payments caused by non-fraudulent behavior may not immediately cause significant losses, they may cause other losses such as cash flow reductions, increased labor costs for debt collection (fixed-line providers indicate that some users paid all their fees within two years) and customer churn. Recent literature has discussed fraudulent behavior (Hung et al., 2006; Schommer, 2010), but has seldom paid attention to late payments caused by non-fraudulent behavior. Thus, this study aims to discuss late payments caused by specific reasons or irregular habits.

The above-mentioned studies employed data-driven data mining technologies. The disadvantage of that method is that the mined information requires expert analysis before use (Hung et al., 2006; Schommer, 2010). Data-driven data mining technology is useful for information technology personnel but is usually not meaningful for business personnel. Take association rule mining as an example: some of the derived rules only expose common-sense knowledge and may not be interesting from a business point of view. For example, suppose the rule “If milk is bought, Then bread is bought” is derived with high support and confidence. According to the well-known Apriori algorithm, this is a reliable rule, and it is therefore technically interesting. However, it may not be valuable for business, since the derived rule is common-sense knowledge, and it may also mislead decision makers, because the rule “If milk is bought, Then bread is not bought” may also exist if the “not bought” concept is taken into consideration at the same time. Likewise, if the proposed approach considered only technical interestingness, common-sense information might be derived, such as which customers often delay payment but eventually pay, or which customers always pay on time.

Recently, Longbing Cao suggested the domain-driven data mining concept (D3M) (Cao, 2010; Cao et al., 2010), which combines data mining with industry knowledge to mine useful real information. These studies emphasize the issues surrounding real-world data mining and propose a shift from data-centered hidden pattern discovery to domain-driven actionable knowledge discovery (AKD). Here, “actionable” means that the derived knowledge patterns can not only provide important grounds for business decision makers to take appropriate actions, but also deliver expected outcomes to business. Based on that innovation, several more relevant studies were performed (Du & Ling, 2010; Jin et al., 2010; Mansingh, Osei-Bryson, & Reichgelt, 2011; Marinica & Guillet, 2010; Xu, Lin, & Xu, 2009).

As mentioned in Cao (2010) and Cao et al. (2010), “For many complex enterprise applications, one-scan mining seems unworkable for many reasons. To this end, we propose the Combined Mining based AKD (CM-AKD) framework to progressively extract actionable knowledge.” Here, “one-scan mining seems unworkable” means that the data may need to be analyzed many times (not just once) by various techniques before the problem can be solved effectively. We believe that the prediction of telecommunications customer payment behavior is also a highly domain-driven task. In this paper, we thus adopt the CM-AKD framework to analyze data on fixed-line users for establishing a late payment prediction system, namely the Combined Mining-based Customer Payment Behavior Predication System (CM-CoP). The CM-CoP can indicate which users are more likely not to pay their bill on time. In the implementation of the late payment prediction system, this study used association rules to analyze customer payment behavior, but did not use them to produce any prediction rules, which makes it unique among the related studies (Fawcett & Provost, 1997; Hung et al., 2006; Schommer, 2010). Instead, the analysis of customer payment behavior is a part of data processing, and the payment behavior rules are used to produce derivative attributes, so that customer payment behavior can be stored directly in the data. Next, a clustering algorithm is used for customer segmentation, and the cluster of customers who paid their fees punctually is eliminated to reduce data imbalance. Finally, a decision tree is utilized to construct a prediction model from the rest of the data by using the derivative attributes from the association rules and the attributes provided by the telecom providers. Thus, based on the real dataset given by the telecom providers and the proposed framework, service personnel can remind customers who may make delayed payments to pay the fee.

Modeling requires an efficiency evaluation to maintain its predictive power. Generally, when the time frame is too long or policies change, user behavior may change, and the accuracy, recall, and predictive power of the system will be reduced. To avoid this, the system automatically verifies and compares the accuracy rate of each rule when it analyzes monthly user payment records. If the accuracy rate of a rule is lower than the set threshold value, the system highlights it for review and the providers decide whether to delete it. Next, providers may use new data and the constructed model to produce a new rule. If the new rule is verified, the system adds the rule to the database, thus maintaining the system's predictive power. After the system was built, the provider spent six months conducting verifications. A comparison with the data from providers indicated that, even though the testing environment was different from the conditions at the providers, the efficacy of the rules produced by CM-CoP is greater than that of the existing rules used by telecom providers. Thus, the two main contributions of this work are as follows:

1. Firstly, we adopt a domain-driven data mining strategy and utilize data mining techniques to analyze the data from fixed-line users for establishing a late payment prediction system, namely the Combined Mining-based Customer Payment Behavior Predication System (CM-CoP).

2. Secondly, the average accuracy of the CM-CoP model was 78.53% under an average recall of 88.13% and an average gain of 11.2% after a six-month validation.

Since this paper focuses on a real-world application in the telecom market, background concepts related to association rules, clustering, and the decision tree method are introduced only briefly. This paper is organized as follows: Section 2 introduces D3M and related work; the framework of the proposed CM-CoP is described in Section 3, which includes system flow, data preprocessing, and data mining flow; Section 4 presents cases and verification results; and conclusions are offered in Section 5.

2. Related work

This section introduces related data mining methods and domain-driven data mining concepts and structures. Data mining approaches, including association rules, clustering, and decision trees, are described in Section 2.1. The concepts of domain-driven data mining and relevant existing mining structures are introduced in Section 2.2.

2.1. Related data-mining approaches

Data mining aims to extract useful knowledge and patterns from existing data to solve a specific issue. To date, it has been used in many different fields, such as shopping cart analysis (Agrawal, Imielinski, & Swami, 1993), network intrusions (Tajbakhsh, Rahmati, & Mirzaei, 2009), and stock market analysis (Au & Chan, 2003; Hadavandi, Shavandi, & Ghanbari, 2010). One common use is the mining of association rules from transaction data, i.e. an analysis of correlations between products purchased by customers. An association rule is represented as A → B, where A and B are products, and the rule states that if product A is purchased, product B will be purchased together with it. Two measures are used to gauge the validity of association rules: support and confidence. The earliest association rule mining was suggested by Agrawal et al. (1993), and the three main steps are: (1) produce candidate itemsets; (2) produce frequent itemsets based on minimum support; and (3) produce association rules from the frequent itemsets based on minimum confidence.
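The two measures mentioned above can be illustrated with a minimal Python sketch. It assumes the usual definitions (support as the fraction of transactions containing an itemset, and confidence of A → B as the support of A and B together divided by the support of A). The function names and the toy baskets are hypothetical and are not taken from the paper's data.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """support(A and B) / support(A); 0.0 when A never occurs."""
    a = frozenset(antecedent)
    ab = a | frozenset(consequent)
    sup_a = support(transactions, a)
    return support(transactions, ab) / sup_a if sup_a > 0 else 0.0


# Hypothetical shopping baskets, for illustration only.
baskets = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "bread", "eggs"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
]

print(support(baskets, {"milk", "bread"}))       # support of {milk, bread}
print(confidence(baskets, {"milk"}, {"bread"}))  # confidence of milk -> bread
```

A rule is kept only if both values reach the user-chosen minimum support and minimum confidence.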

Grouping data based on similarity is known as clustering, and it is an unsupervised mining technology. The k-means method has been widely used (McQueen, 1967). The inputs of k-means are the given set of items and the designated number of clusters k, where k is an integer greater than 1. In k-means, the centroid of each cluster is used as a representative of the cluster center. The procedure has four steps: (1) k items are selected from the data set as the initial centroids; (2) each item is assigned to the cluster of its nearest centroid; (3) the centroid of each cluster is recalculated from the items now assigned to it; (4) if all the items still belong to the same cluster as before, the procedure ends; otherwise step (2) is repeated.
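The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes squared Euclidean distance and, to keep the result deterministic, takes the first k items as the initial centroids. Neither choice is specified in the paper, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(
    points: Sequence[Point], k: int, max_iter: int = 100
) -> Tuple[List[int], List[Point]]:
    """Return (cluster label per point, final centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: pick k items as initial centroids (assumption: the first k).
    centroids = [tuple(p) for p in points[:k]]
    labels: List[int] = [-1] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each item to the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop when no item changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Hypothetical two-dimensional items forming two obvious groups.
labels, centroids = k_means([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)
print(labels, centroids)
```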

Classification aims to find rules for the classification of new data. It is a supervised method, in which useful information is extracted from data of known classes as the basis for the classification of new data. Most classification mining approaches find the rules and sort them into an easy-to-operate structure called the classifier. The decision tree (Quinlan, 2003), proposed by Quinlan in 1992, is often used. The decision tree uses the differences in the distribution of the different classes of data as the classification criterion, and thus the classification rules extracted from the decision tree can use these characteristics to classify new data. In addition, the decision tree algorithm can find classification rules and convert them into a tree structure. With the tree structure, the rules can be applied to new data more rapidly, because the tree itself serves as the classifier produced by the classification mining method.
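One common way to quantify such distribution differences is information gain, the entropy-based criterion used by ID3-style decision trees. The Python sketch below is given only as an illustration of that idea. The paper does not state which splitting criterion its tool uses, so the criterion, the function names, and the toy labels are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(
    values: Sequence[Hashable], labels: Sequence[Hashable]
) -> float:
    """Entropy reduction obtained by splitting `labels` on an attribute."""
    total = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical class labels and two candidate attributes.
status = ["late", "late", "on_time", "on_time"]
separating = ["x", "x", "y", "y"]     # splits the classes perfectly
uninformative = ["x", "y", "x", "y"]  # leaves each branch evenly mixed

print(information_gain(separating, status))
print(information_gain(uninformative, status))
```

An attribute with a higher gain separates the classes better and would be preferred as a split near the root of the tree.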

2.2. Domain-driven data mining concepts

Domain-driven data mining (D3M) (Cao, 2010; Cao et al., 2010) was proposed by Longbing Cao at the University of Technology in Sydney, Australia. The D3M is defined by the following five items.

(1) Based on meta knowledge, a heuristic method is used for continuous testing to solve problems.

(2) The main objective is to mine actionable knowledge discovery (AKD).

(3) Technology and business purposes are the interested conditions.

(4) Actual complex enterprise application is a prerequisite.

(5) Actual business application is a post-condition.

Based on the above definition, Longbing Cao further described four frameworks for the logical concept of D3M: (1) post-analysis-based AKD (PA-AKD) (Du & Ling, 2010; Marinica & Guillet, 2010; Mansingh et al., 2011; Xu et al., 2009); (2) unified interestingness-based AKD (UI-AKD) (Jin et al., 2010; Sim, Indrawan, Zutshi, & Srinivasan, 2010); (3) combined mining-based AKD (CM-AKD) (He, Zhang, Shi, & Huang, 2010; Pradeep, Krishna, Illapu, Kumar, & Koyi, 2010); and (4) multisource combined-mining-based AKD (MSCM-AKD) (Xiang, Cao, Hu, & Yang, 2010). The common central concept of the four frameworks is actionable knowledge discovery (AKD).

In the PA-AKD framework, there are two steps. First, general patterns are found in the data, and then meta knowledge and domain-specific business interestingness are used to extract operable business rules. In the UI-AKD framework, the main concept is to use meta knowledge and domain knowledge to develop mining technology for deriving actionable knowledge patterns. This algorithm can mine actionable knowledge directly. The CM-AKD framework focuses on the use of different mining approaches, finding different patterns, and using these patterns for feature construction or comparison. Finally, the pattern merger method is combined with domain knowledge to aggregate the final mining results. The MSCM-AKD framework is the most complex among the four structures, and its main purpose is to extend the CM-AKD framework and increase the number of different data sources. For this reason, there is not much literature on MSCM-AKD. In this paper, the CM-AKD framework is utilized for establishing the customer payment behavior prediction system, which can be formalized as Eq. (1):

(1)

where $t_{i,j}$ and $b_{i,j}$ are the technical and business interestingness of model $m_j$, $[i_{i,j}()]$ indicates the alternative checking of unified interestingness, $\cup_J P_j$ is the merger function, and $\Omega_m$ is the meta-knowledge consisting of meta-data about patterns, features and their relationships. In other words, the CM-AKD consists of multiple steps for pattern extraction and refinement on the whole dataset. The process is first split into $J$ steps of mining based on business understanding, data understanding, exploratory analysis and goal definition. Then, each step $j$ is used for extracting a pattern sub-set $P_j$ based on technical significance ($t_i()$). The pattern sub-set $P_j$ is then fed into step $j+1$ for guiding the corresponding feature construction and the mining of pattern set $P_{j+1}$. The derived pattern sub-sets are then merged into a final pattern set ($P$) based on the environment ($e$), domain knowledge ($\Omega_d$) and business expectations ($b_i$). Finally, the merged pattern set $P$ is converted into business rules as final deliverables that reflect business preferences and needs. Based on the CM-AKD framework, the details of the proposed customer payment behavior prediction system are described in the next section.

Note that the CM-AKD framework is one of the domain-driven data mining frameworks proposed by Cao et al. for mining actionable knowledge rules (patterns). Based on the CM-AKD framework, we propose the CM-CoP framework for predicting telecommunications customer payment behavior. The domain-driven data mining strategy focuses on how to take objective and subjective interestingness, in terms of technical and business goals, into consideration for deriving actionable rules (patterns). The proposed domain-driven data mining strategy is described below in terms of the five items. Firstly, based on the CM-AKD framework, we propose the Combined Mining-based Customer payment behavior Predication framework (CM-CoP), in which heuristic methods (Step-1 to Step-3 mining, see Fig. 1) are used for continuous testing to solve problems. Secondly, the main objective of the CM-CoP is to use customer payment and communication behaviors to predict which users might not pay their bills. Thirdly, in the proposed approach, we take the attributes provided by the telecom providers (as business interestingness) and the rules produced by association rule mining (as technical interestingness) into consideration to achieve more accurate results. For the last two items, telecommunications customer payment behavior prediction is the complex enterprise application mentioned in the previous section, and, after verification by the provider, the accuracy of the proposed model was higher than that of the existing model.

3. The proposed predicting customer payment behavior system

This section introduces the method's structure. First, we describe the proposed CM-CoP system framework according to the CM-AKD framework. Next, the proposed customer payment behavior prediction algorithm is described, including using association rules to produce payment behavior patterns for analysis and derived attributes, utilizing the clustering technique to reduce imbalance, and combining the derived attributes and the attributes provided by the telecom providers to construct the decision trees for predicting user payment behavior patterns.

3.1. CM-CoP system framework

Based on the CM-AKD framework, we propose a system, namely the Combined Mining-based Customer payment behavior Predication framework (CM-CoP), which combines data mining techniques, including association rules, clustering, and decision trees, with industry knowledge provided by fixed-line providers for analyzing fixed-line user data. The purpose of the CM-CoP is to use customer payment and communication behaviors to predict which users might not pay their bills. The proposed CM-CoP framework is shown in Fig. 1.

[Figure 1. Domain-driven mining phase: Telecom Provider (Meta Knowledge, Domain Knowledge); CDR/BASE Database and Payment Records; ETL Transformation; Step-1 Mining (Association Mining); Association Patterns; Step-2 Mining (Clustering Analysis); Desired Clustering; Extracted DB; Step-3 Mining (Decision Tree 1, Decision Tree 2, ..., Decision Tree n); Pattern Merger (High Precision Rule Set); Actionable Rule Set (Model). Model tuning phase: Experts Validation; Validated Rules (R1: X1 → Y1, R2: X2 → Y2, ..., Rn: Xn → Yn); updating; Late Payment Customers Predicting; New Payment Records; Late Payment Customers List.]

Fig. 1. Combined Mining-based Customer Payment Behavior Prediction Framework (CM-CoP).

The overall system execution flow is shown in Fig. 1, and it includes two parts: (1) the domain-driven mining phase and (2) the model tuning phase. In the first part, ETL (extraction, transformation, loading) is utilized to derive CDR historical data. Next, this study uses association rules to analyze data about user bills based on telecom providers' practices to create a behavioral model of potential late-paying users. According to the derived rules, in combination with the professional knowledge of the providers, the derived attributes of payment behavior are established. Meanwhile, the clustering technique is used to derive the desired groups with business interestingness. Finally, decision tree algorithms are utilized to analyze the data by using various attributes, and the derived rules are stored in a database for validation.

In the second part, after the model is constructed, its efficiency has to be evaluated to maintain its predictive power. Generally, when the time frame is too long or policies change, user behavior may change, and the accuracy and recall of the system's predictions will decrease. The system's design must take this into account. The system automatically verifies and compares the accuracy rate of each rule when it retrieves the monthly user payment records. If the accuracy rate of a rule is lower than the set threshold value, the system highlights it so that the providers can review the rule. Apart from this, the providers can use new data to create rules from the constructed model, and if a new rule is verified, the system adds the rule to the database, thus maintaining predictive power.

3.2. The proposed domain-driven mining approach

In this subsection, based on the proposed CM-CoP framework, the CM-CoP algorithm is proposed for predicting customer payment behavior. The details of the proposed CM-CoP algorithm are stated in Table 1.

As shown in Table 1, the proposed CM-CoP algorithm can be divided into four parts: association pattern mining (lines 2–3), clustering analysis (lines 4–5), decision tree mining (lines 6–11) and rule evaluation (lines 12–16). In the first part, the preprocessed payment records are used to derive association patterns that represent customer behavior (see Section 3.3 for more details). Then, since only a few customers pay their bills late, the proportion between customers who pay late and customers who pay on time is imbalanced. Thus, the data imbalance should be taken into consideration before building the model. Here, after consulting the experts of the telecom providers to derive appropriate attributes, the clustering technique is used to divide customers into groups. Those groups that can be identified as containing customers who pay their bills on time are then removed. The remaining groups contain the customers that we need to focus on. However, it is not an easy task to find general actionable rules for predicting customer behavior, and directly using the rules produced by association rule mining could not solve the problem efficiently. To overcome this issue, we select different sets of attributes from the derived association patterns (as technical interestingness) and the consulted attributes (as business interestingness) for constructing decision trees. Each rule in the decision trees is then evaluated by the telecom providers to enhance its predictive ability. Finally, the verified rules are collected as the operable business rule set (see Section 3.4 for more details).

Table 1. The proposed CM-CoP algorithm.

Algorithm: CM-CoP algorithm
Input: A CDR/BASE dataset CDR, a set of payment records PR, business problem w, minimum support a, minimum confidence k, meta knowledge Ω_m, domain knowledge Ω_d.
Output: The operable business rule set R′.
Procedure CM-CoP() {
(1) AKD is split into 3 steps of mining;
(2) PR′ ← dataPreprocessing(PR, Ω_m);
(3) associationPattern ← Step-I AssociationMining(PR′, a, k);
(4) CDR′ ← dataPreprocessing(CDR, Ω_m);
(5) desiredCluster ← Step-II ClusteringAnalysis(CDR′);
(6) attributeSet ← featureSelection(CDR′);
(7) For each subset subAttributes of attributeSet
(8)     decisionTree ← Step-III MiningDecisionTree
(9)         (associationPattern, desiredCluster, subAttributes);
(10)    decisionTreeSet ← decisionTreeSet ∪ decisionTree;
(11) End For
(12) For each rule Ri in decisionTreeSet
(13)    If evaluationFunction(Ri, Ω_m, Ω_d) == true
(14)        R′ ← R′ ∪ Ri;
(15)    End If
(16) End For
(17) Output R′
}

Table 2. The operable business rule tuning procedure.

Procedure: The operable business rule tuning procedure
Input: The operable business rule set R′, a set of newly arrived payment records newPR, a CDR/BASE dataset CDR, meta knowledge Ω_m, domain knowledge Ω_d, an accuracy threshold k.
Output: The operable business rule set R′.
Procedure RuleTuning() {
(1) R′ ← R′ ∪ CM-CoP(CDR, newPR, Ω_m, Ω_d);
(2) For each rule Ri in R′
(3)     If expertValidation(Ri, k, Ω_m, Ω_d) == false
(4)         Remove Ri from R′;
(5)     End If
(6) End For
(7) Output the tuned operable business rule set R′;
}
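The imbalance-reduction step of Table 1 (clustering analysis, lines 4–5) can be illustrated with a small Python sketch. In the paper, the on-time groups are identified together with the providers' experts. The sketch replaces that judgment with a simple numeric cutoff, `min_on_time_rate`, which is purely an assumption for illustration, as are the function names and the toy data.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def on_time_clusters(
    cluster_ids: Sequence[int],
    paid_on_time: Sequence[bool],
    min_on_time_rate: float = 0.95,  # hypothetical cutoff, not from the paper
) -> Set[int]:
    """Clusters whose share of on-time payers reaches the cutoff."""
    counts: Dict[int, List[int]] = defaultdict(lambda: [0, 0])
    for cid, ok in zip(cluster_ids, paid_on_time):
        counts[cid][0] += int(ok)  # on-time payers in this cluster
        counts[cid][1] += 1        # cluster size
    return {
        cid
        for cid, (good, size) in counts.items()
        if good / size >= min_on_time_rate
    }


def drop_on_time_clusters(
    cluster_ids: Sequence[int],
    paid_on_time: Sequence[bool],
    min_on_time_rate: float = 0.95,
) -> List[int]:
    """Indices of the customers that remain after removing on-time clusters."""
    removed = on_time_clusters(cluster_ids, paid_on_time, min_on_time_rate)
    return [i for i, cid in enumerate(cluster_ids) if cid not in removed]


# Hypothetical clustering of six customers into two groups.
clusters = [0, 0, 0, 1, 1, 1]
on_time = [True, True, True, True, False, False]
kept = drop_on_time_clusters(clusters, on_time)
print(kept)
```

In this toy example the share of late payers rises from 2 of 6 customers before the removal to 2 of 3 afterwards, which is the kind of rebalancing the clustering step is meant to achieve before the decision trees are built.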

Furthermore, modeling requires an efficiency evaluation to maintain its predictive power. Generally, when the time frame is too long or policies change, user behavior may change, and the predictive power of the system will be reduced. To avoid this, the system automatically verifies and compares the accuracy rate of each rule when it analyzes monthly user payment records, using the operable business rule tuning procedure shown in Table 2.

As shown in Table 2, the new payment records are first used to generate new operable business rules (line 1). Then, for each rule in the operable business rule set R′, experts verify its predictive power. If its predictive power is lower than a threshold, the rule is removed from the operable business rule set (lines 2–6). Since we focus on analyzing real data to predict the customers who may make delayed payments, the goal of this paper is to design the CM-CoP framework (Fig. 1) and its algorithms (Tables 1 and 2). The underlying data mining algorithms are therefore taken from the existing tools in IBM Intelligent Miner.
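The threshold check at the core of the tuning procedure can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: expert validation is replaced by an automatic accuracy check, a rule's accuracy is assumed to be the share of matched customers who were actually late, and the `Rule` class, field names, and records are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Record], bool]  # antecedent X of a rule "X -> late"


def rule_accuracy(rule: Rule, records: List[Record]) -> Optional[float]:
    """Share of matched records that were actually late; None if no match."""
    hit = [r for r in records if rule.matches(r)]
    if not hit:
        return None
    return sum(1 for r in hit if r["late"]) / len(hit)


def tune_rules(
    rules: List[Rule], new_records: List[Record], threshold: float
) -> List[Rule]:
    """Keep rules whose accuracy on the new records meets the threshold.

    Rules that match no record are kept, since there is no evidence
    against them yet (a design choice of this sketch).
    """
    kept = []
    for rule in rules:
        acc = rule_accuracy(rule, new_records)
        if acc is None or acc >= threshold:
            kept.append(rule)
    return kept


# Hypothetical new payment records and candidate rules.
records: List[Record] = [
    {"late_count": 3, "late": True},
    {"late_count": 2, "late": True},
    {"late_count": 0, "late": False},
    {"late_count": 0, "late": True},
]
rules = [
    Rule("often_late", lambda r: r["late_count"] >= 2),
    Rule("never_late", lambda r: r["late_count"] == 0),
]
print([r.name for r in tune_rules(rules, records, threshold=0.7)])
```

In the paper's procedure a rule below the threshold is highlighted for the providers, who decide whether to delete it, so an automatic removal like the one above would be only the first half of that review.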

3.3. Payment behavior pattern analysis and derived attributes

The user communication characteristics consist of CDR information. Users' telephone usage habits are expressed by the start time, the number of users to make the call, the sum, duration, call type, call variation information, and other statistical data. As specific fraudulent behavior may have a fixed behavior pattern, most fraudulent behavior can be found by examining the CDR data. In combination with the professional knowledge of the providers and user data, fraudulent behavior can then be determined. Normal users sometimes delay payment because of special reasons or habitual delays. Although these delays are different from fraudulent behavior, they also have fixed behavior patterns. To identify late-paying users, this behavior must be compared with customer payment records. With the derived attributes (X → no format) of user payment and call behavior patterns, obtained from CDR through discussion with telecom providers, the proposed approach can be utilized to predict whether a user will default on a debt. As shown in Fig. 1, the system requires data on CDR, the customer base, and customer payment status. The attributes of the important data for user payment status are shown in Table 3.

As shown in Table 3, a Payment Status of “00” means that the user paid the bill on time, “01” means that the user failed to pay the bill on time, and “02” means that the user never paid the bill. Meanwhile, there is a delay of 35 days between the time the telephone is used and the time by which it must be predicted whether a user may default on their debt for more than ten days. Data acquisition and prediction can be completed in 20 days. Service personnel can then remind customers who may make delayed payments to pay the fee. Based on the definition used by telecom providers, there are six billing cycles. The data provided by telecom providers was taken from customer data in one area and one payment cycle. One billing cycle is listed in Table 4. Differences in regions and billing cycle time points are not considered.

Assume that it is early in the seventh month and that we need to relate payment behavior in the sixth month to payment habits in the first to fifth months. Since the customer payment status for the fifth month is still unknown early in the seventh month, the payment records for the fifth month are not used in the analysis of the customer's payment pattern. Thus, only the relation between user payment behavior from the first to fourth month and late payment behavior in the sixth month can be described.

Next, in an analysis of user payment behavior patterns, the payment records for the first to sixth month are summarized in the payment status table, as shown in Table 4, and the data for all of the months is aggregated to the original payment status. Attention is paid to whether a payment default exceeds ten days, not to whether the user eventually paid the bill. When the original payment status for the sixth month in Table 4 is "00", the new payment status is "yes"; otherwise it is "no". "Yes" indicates a timely payment, and "no" indicates delinquency. For example, in Table 4, the payment status for the sixth month changes from "02" to the new payment status "no", and the payment status for the fourth month changes from "01" to the new payment status "01C", where "C" denotes the fourth month. In this way, the limitation of repeated occurrences of the same payment status item can be overcome in the association rule analysis.
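The recoding described above can be illustrated with a minimal sketch. The function names `recode_status` and `build_transaction` are hypothetical and are not taken from the paper; the month letters follow the legend of Table 4, and the sample records are those of customer 001 in that table.

```python
# Month suffixes as defined under Table 4: C = 4th month, D = 3rd, E = 2nd, F = 1st.
MONTH_SUFFIX = {4: "C", 3: "D", 2: "E", 1: "F"}


def recode_status(bill_month, original_status):
    """Map an original payment status to a 'new payment status' item."""
    if bill_month == 6:
        # Target month: only an on-time payment ("00") counts as "yes".
        return "yes" if original_status == "00" else "no"
    if bill_month in MONTH_SUFFIX:
        # Appending the month letter keeps equal statuses in different months distinct.
        return original_status + MONTH_SUFFIX[bill_month]
    return None  # the fifth month is unknown at prediction time and is skipped


def build_transaction(records):
    """records: iterable of (bill_month, original_status) pairs for one customer."""
    items = (recode_status(month, status) for month, status in records)
    return {item for item in items if item is not None}


# Customer 001 from Table 4.
customer_001 = [(6, "02"), (4, "01"), (3, "02"), (2, "00"), (1, "02")]
transaction = build_transaction(customer_001)
```

Without the month letter, the two "02" statuses of customer 001 would collapse into a single item once the records are treated as a set, which is why the suffix matters for itemset mining.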

Table 3. Data format of original CDR payment situations.

| Attribute name | Description |
| --- | --- |
| Amount | Total amount of the bill, which includes communication fees for seven days, excluding international calls |
| Installed date | Installation date |
| Billing cycle | Billing cycle for each month, given as cycle number, billing period, and payment deadline. Cycle number 1: billing period from the 1st day of the previous month to the end of the previous month; payment deadline the 25th of the current month |
| Payment status | Payment status of the current month. 00: paid punctually; 01: paid, 10 days overdue; 02: not paid |

Table 4. New payment status.

| Customer ID | Bill month | Original payment status | New payment status |
| --- | --- | --- | --- |
| 001 | 06 | 02 | No |
| 001 | 04 | 01 | 01C |
| 001 | 03 | 02 | 02D |
| 001 | 02 | 00 | 00E |
| 001 | 01 | 02 | 02F |

C: the 4th month; D: the 3rd month; E: the 2nd month; F: the 1st month.

[Flowchart with the nodes: Start; CDR/BASE; Select data depends on billing cycle; Clustering; Remove customers paid punctually; DB; Decision tree 1 ... Decision tree n; Select high accuracy rules; Rules in rule base; End.]

Fig. 2. Data mining implementation flow.

During the analysis using association rules, the payment status of each customer during the period from the first to fourth month is regarded as one transaction. Since the goal of the proposed approach is to predict late-paying users, association rule mining is used to find rules of the form X → no whose confidence is greater than 50%. X refers to a subset of the payment statuses for the past four months, and is also the user payment model to be found. In the subsequent data preprocessing, all occurrences of X are regarded as the derived attributes related to payment behavior. This study uses ETL to set the derived attributes. Finally, a decision tree is utilized to analyze the training data. To avoid irrelevant attributes in the final result and to increase rule comprehensibility, the payment behavior-related derived attributes are used as substitutes for the original payment status from the first to fourth month.

From the payment behavior pattern analysis, a total of 19 meaningful rules are produced, and 19 derived attributes are set. An example using one of the 19 rules is shown as follows:

01D, 01C → no (support = 0.39%, confidence = 60.18%, lift = 7.19)

It can be seen that the probability of late payment in the sixth month is 60.18% when the user made a late payment in both the third month and the fourth month. In other words, if the user paid late two and three months before the analyzed month, the probability of a late payment in the analyzed month is 60.18%. When preprocessing data, if a user matches this pattern, the derived attribute for the rule is set to "yes"; otherwise it is set to "no". For long-term users, the advantage of this approach is that it can combine payment behavior patterns from the past two to five months with call behavior to increase prediction accuracy. This method is also applied to customers who have insufficient historical payment records. As long as the payment records of new customers satisfy any rule, the system can utilize the payment behavior patterns of new customers to increase the accuracy rate.
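As an illustration of how the three measures attached to a rule X → no relate to the transactions, the sketch below uses the standard textbook definitions of support, confidence, and lift; the paper does not spell out its formulas, so this is an assumption. The function names and the ten toy transactions are hypothetical and do not reproduce the provider's data.

```python
def rule_metrics(transactions, antecedent, consequent="no"):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    antecedent = set(antecedent)
    n = len(transactions)
    n_x = sum(1 for t in transactions if antecedent <= t)
    n_y = sum(1 for t in transactions if consequent in t)
    n_xy = sum(1 for t in transactions if antecedent <= t and consequent in t)
    support = n_xy / n                               # P(X and Y)
    confidence = n_xy / n_x if n_x else 0.0          # P(Y | X)
    lift = confidence / (n_y / n) if n_y else 0.0    # P(Y | X) / P(Y)
    return support, confidence, lift


def derived_attribute(transaction, antecedent):
    """Derived attribute for one rule: 'yes' if the customer matches X, else 'no'."""
    return "yes" if set(antecedent) <= transaction else "no"


# Toy data: 10 customers, 3 of whom paid late in both the 3rd and 4th months.
transactions = (
    [{"01D", "01C", "no"}] * 2
    + [{"01D", "01C", "yes"}]
    + [{"00D", "00C", "no"}]
    + [{"00D", "00C", "yes"}] * 6
)
support, confidence, lift = rule_metrics(transactions, {"01D", "01C"})
```

On this toy data the rule has a support of 0.2 and a confidence of 2/3. A lift above 1 means that matching X makes a late payment more likely than the base rate, which is the property that makes X useful as a derived attribute.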

3.4. Data mining implementation process

Late-paying customers comprise 7% of all customers. If a decision tree is used directly for analysis, prediction is difficult because of the small percentage of late-paying customers. In fact, if a decision tree is used directly, the result has a single node, the root node, and the prediction accuracy for normal customers is 93%. This accuracy is high, but it fails to meet the model prediction requirements, since the goal is to find customers who may make delayed payments. As shown in Fig. 2, clustering and decision trees are used to perform data mining in the second phase to analyze user behavior and find target customers.
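The imbalance problem can be made concrete with a small sketch. It assumes 100 hypothetical customers with the 7% late-payment rate quoted above and a degenerate model that, like a tree consisting only of its root node, labels everyone as normal. The function name is illustrative.

```python
def accuracy_and_recall(actual, predicted, positive="late"):
    """Overall accuracy, and recall for the positive (late-paying) class."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    true_positives = sum(a == positive and p == positive for a, p in pairs)
    actual_positives = sum(a == positive for a in actual)
    recall = true_positives / actual_positives if actual_positives else 0.0
    return correct / len(actual), recall


actual = ["late"] * 7 + ["normal"] * 93
predicted = ["normal"] * 100   # a root-only tree predicts the majority class
accuracy, recall = accuracy_and_recall(actual, predicted)
```

The accuracy is 0.93 while the recall for late payers is 0.0, so the model never identifies a single target customer even though its accuracy looks high.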

In the first phase, clustering is used to group customers and to find the cluster of normal customers. Then, the decision tree is utilized to create rules from those clusters. The behavior of normal customers does not change with time. After the behavior rules for normal customers are found, they are applied to the datasets of all time windows. In addition, some fraudulent or default behavior can be found by using the clustering algorithm in this phase. For example, an analysis of the pay-per-call usage cluster shows that some users may make numerous pay-per-calls over a short time. Since this type of customer often cannot pay the bill, if these customers had similar behavior in the past, their payment behavior will be used to identify whether they usually paid the fee normally. If customers have no similar past behavior, they will be included in the forecast list.

In the second phase, the behavior rules for normal customers found in the first phase are used to eliminate the customers who satisfy the rules, so the percentage of default customers in the data increases. In this phase, a decision tree is built to analyze the rest of the data and draw predictions from it. However, there are many different kinds of default behavior. Customers signed up on different dates, and some new customers have no historical data. In order to find the customers who meet the objective, different attributes are selected for analysis to produce different decision trees and rules. After verification, the rules that are more than 80% accurate are selected and stored in an SQL database in order to find the target customers.
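The rule selection step can be sketched as follows. This is a simplified illustration rather than the authors' implementation: a rule is represented as a Python predicate, its accuracy is taken to be the share of matched customers who actually paid late, and the candidate rules and verification records are hypothetical.

```python
def rule_accuracy(rule, customers):
    """Share of the customers matched by the rule who actually paid late."""
    matched = [c for c in customers if rule(c)]
    if not matched:
        return 0.0
    return sum(c["late"] for c in matched) / len(matched)


def select_rules(rules, customers, threshold=0.8):
    """Keep only the rules whose accuracy on the verification data exceeds the threshold."""
    return {
        name: rule
        for name, rule in rules.items()
        if rule_accuracy(rule, customers) > threshold
    }


# Hypothetical candidate rules, as if read off different decision trees.
candidates = {
    "high_spend": lambda c: c["TOTALAMOUNT"] >= 300,
    "everyone": lambda c: True,
}

# Hypothetical verification records: (TOTALAMOUNT, paid late?).
raw = [(350, True), (400, True), (500, True), (320, True), (600, True),
       (310, False), (100, False), (50, False), (80, True), (20, False)]
verification = [{"TOTALAMOUNT": amount, "late": late} for amount, late in raw]

selected = select_rules(candidates, verification)
```

Here `high_spend` matches six customers, five of whom paid late (about 83%), so it is kept, while `everyone` reaches only 60% and is discarded.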

4. Experimental results

This study used data from fixed-line providers in Taiwan for case discussions. The fixed-line providers provided customers' CDR and payment data from one base station, area, and payment cycle for a period of twelve months. The data was then used for model construction. According to statistics, 7% of the users still hadn't paid their bills more than ten days after they were overdue. Due to the confidentiality of the service agreements, this study only discusses the data mining process in the second phase as well as some verification results.

Table 5. Important related fields.

| Field name | Definition | Function |
| --- | --- | --- |
| MAXAMOUNT | Maximum amount for making one call in the current week | Function field |
| TOTALAMOUNT | Total amount for making calls in the current week | Function field |
| STDAMOUNT | Standard deviation of the call amount for the current week | Function field |
| AVGAMOUNT | Average amount of each call in the current week | Function field |
| NUMBERCOUNT | The number used to make calls in the current week | Function field |
| PAYTYPE | Payment type | Function field |
| PAYLASTMONTH | Pay-per-call in last month | Function field |
| NUMBERCLRS3 | Number of different calls for the current week | Function field |
| TOTALDURITIONS | Total call time for the current week | Function field |
| FIRSTCALL | Pay-per-call in the first usage | Function field |
| PAYSTATUS | Timely payment | Supplemental field |

Fig. 3. The clustering results of cluster ‘‘[6]6’’ and cluster ‘‘[4]3’’.

4.1. Training data and testing data

Since late payment behavior may change over time, the predictive power of the model may decrease. Thus, in order to overcome this problem and prevent the model from depending on historical data, the system is designed so that the datasets of the model cover different time intervals. The time window concept is used to set the datasets for the model construction (including the training data and testing data). The fixed-line provider offered twelve months of CDR data and payment data for model construction. The data was divided into six sections according to the sequence of months, and the model was constructed on each in turn to determine the useful rules, so that different datasets for the model would cover different time intervals. In other words, the sliding window size was set at six, and since the accuracy of the last month could not be evaluated, six datasets were generated for model construction. During each time interval, the customer data in the sixth month was used as the training data, and the data from the first month to the fifth month were used as historical data to construct the prediction model. The data from the following months were used as testing data to verify the derived rules. Since the derived rules would become ineffective over time, less accurate rules are deleted and new rules are generated during each system test in order to prevent the model from depending on historical data.
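One way to read the window arithmetic is sketched below. A six-month window can be placed at seven positions within twelve months, but the last position has no later month to test on, which leaves six usable datasets. The function name and the dictionary layout are illustrative assumptions, as is the choice to treat every later month as test data.

```python
def sliding_windows(n_months=12, window=6):
    """Enumerate six-month windows that still have at least one later month for testing."""
    windows = []
    for start in range(1, n_months - window + 2):
        months = list(range(start, start + window))
        test_months = list(range(start + window, n_months + 1))
        if not test_months:
            continue  # the last window has no later month to verify against
        windows.append({
            "history": months[:-1],   # first to fifth month of the window
            "train": months[-1],      # sixth month of the window
            "test": test_months,      # all following months
        })
    return windows


windows = sliding_windows()
```

With the default arguments the first window uses months 1 to 5 as history, month 6 for training, and months 7 to 12 for testing; the last usable window trains on month 11 and tests on month 12.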

There are many reasons why customers pay their bills late. Some fraudulent behaviors may cause heavy short-term losses to providers. We thus expect the system to analyze and predict user behavior from a small amount of CDR. Based on the experience of the past three months and on users who made habitual late payments, we also found that fraudulent behavior could be identified from the behavioral difference between the current week and past weeks, and between the current week and the past several months. To prevent special fraudulent behavior from being diluted by other behavior, the training data from each month was subdivided into different datasets for one week, two weeks, three weeks and one month. Ten datasets in total were generated: four for one week, three for two weeks, two for three weeks, and one for one month. Thus, a stable model that is easily updated over time was established, and the effectiveness of the data mining was increased.
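The count of ten datasets is consistent with taking every contiguous run of weeks inside a four-week month. The text gives only the counts, so the contiguity and the four-week month are assumptions of this sketch, and the function name is hypothetical.

```python
def week_windows(n_weeks=4):
    """All contiguous runs of weeks within a month, grouped by run length."""
    return {
        length: [
            tuple(range(start, start + length))
            for start in range(1, n_weeks - length + 2)
        ]
        for length in range(1, n_weeks + 1)
    }


by_length = week_windows()
counts = [len(by_length[length]) for length in (1, 2, 3, 4)]
total = sum(counts)
```

The counts come out as four, three, two, and one windows for lengths of one, two, three, and four weeks, for a total of ten.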

4.2. Case discussion

In the proposed system, this study used the function fields and supplemental field shown in Table 5 to cluster and describe user behavior in the first phase, in order to find the cluster for normal customers. Next, a decision tree was used to analyze the derived clusters.

After clustering, about thirty clusters were derived. However, most of them contain only a small share of users (less than 2% of all users). The two representative clusters, namely cluster "[6]6" and cluster "[4]3", which account for 55.92% and 31.49% of all users, respectively, were used for further analysis. The clustering results are shown in Fig. 3.

From Fig. 3, this study checked the supplemental field of the two clusters, i.e. the PAYSTATUS distribution. Cluster "[4]3" has 5644 instances. According to PAYSTATUS, the percentage of late payment users in cluster "[4]3" (approximately 22.57% (=1274/5644)) was higher than the percentage of total late payment users, so this cluster fails to meet the goal of finding normal customers. The number of late payment users in cluster "[6]6" was 82, comprising 0.818% of the users in the cluster, which was lower than the percentage of total late-paying users.
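The figures quoted for the two clusters can be cross-checked with a few lines of arithmetic. The size of cluster "[6]6" is not stated in the text, so the sketch infers it from the reported cluster shares (55.92% and 31.49%); that inferred size and the helper `is_normal_cluster` are assumptions made for illustration.

```python
OVERALL_LATE_RATE = 0.07   # 7% of all users pay late

# Cluster "[4]3": both counts are given in the text.
late_rate_43 = 1274 / 5644

# Cluster "[6]6": size inferred from the shares of all users held by each cluster.
implied_total_users = 5644 / 0.3149
implied_size_66 = 0.5592 * implied_total_users
late_rate_66 = 82 / implied_size_66


def is_normal_cluster(late_rate, overall=OVERALL_LATE_RATE):
    """A cluster is a candidate 'normal' cluster if its late share is below the overall share."""
    return late_rate < overall
```

Under these assumptions the late-payment share is about 22.57% for cluster "[4]3" and about 0.818% for cluster "[6]6", which agrees with the percentages reported above; only "[6]6" falls below the overall 7%.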

With reference to the distribution of the three function fields for user call behavior in the representative cluster, TOTALAMOUNT, MAXAMOUNT and AVGAMOUNT, it was found that, in cluster "[6]6", users whose telecom fee was lower than NT$200 each month account for 97% of users, and users whose maximum amount and average amount for each call were lower than NT$10 comprise 80%. These percentages were higher than those for all users. When the call amount was higher, the percentage of users in the cluster was lower than the percentage of the total users. Thus, it can be deduced that the users in cluster "[6]6" were normal users, and this cluster can therefore be excluded from late payment behavior. After customers with the characteristics of cluster "[6]6" are eliminated, the percentage of late-paying customers increases to 15% of the remaining users.

For users in cluster "[4]3", the percentage of users whose bill was lower than NT$200 each month, and the percentage of users whose maximum and average amounts per call were lower than NT$10, were both lower than the percentage for all users. When the telephone fee was high, the percentage of users in the cluster was higher than the percentage for all users. The users in cluster "[4]3" displayed behavior opposite to that of users in cluster "[6]6", so the users in cluster "[4]3" were retained for further analysis.

In the second phase, a decision tree was directly used to predict and analyze the rest of the data. In this study, ten useful rules were derived. Two rules are described as follows:

(1) IF PAY_PATTERN_04 = 'Y' and TOTALAMOUNT >= 300 THEN PAY = 'N'

In rule (1), PAY_PATTERN_04 = 'Y' indicates that the user did not pay their bill in the fourth month. Since the threshold value of the rule is set to 80%, rule (1) states that if the user has not paid their bill for the fourth month and the total call amount of the current week is at least NT$300, the probability of late payment is greater than 80%.
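Rule (1), together with rule (2) given later in this section, can be written as simple predicates over a customer record. This is only a sketch of how stored rules might be applied: the record layout, the `id` key, the `forecast_list` helper, and the sample customers are hypothetical, and the thresholds are copied from the rule statements as printed.

```python
def rule_1(c):
    # IF PAY_PATTERN_04 = 'Y' and TOTALAMOUNT >= 300 THEN PAY = 'N'
    return c.get("PAY_PATTERN_04") == "Y" and c.get("TOTALAMOUNT", 0) >= 300


def rule_2(c):
    # IF USEDMONTH <= 6.5 and MAXAMOUNT >= 697.5 and TOTALAMOUNT >= 5397.5 THEN PAY = 'N'
    return (
        c.get("USEDMONTH", float("inf")) <= 6.5
        and c.get("MAXAMOUNT", 0) >= 697.5
        and c.get("TOTALAMOUNT", 0) >= 5397.5
    )


RULES = {"rule_1": rule_1, "rule_2": rule_2}


def forecast_list(customers):
    """Return (customer id, names of fired rules) for customers predicted to pay late."""
    flagged = []
    for c in customers:
        fired = [name for name, rule in RULES.items() if rule(c)]
        if fired:
            flagged.append((c["id"], fired))
    return flagged


# Hypothetical customer records using field names from Table 5.
customers = [
    {"id": "A", "PAY_PATTERN_04": "Y", "TOTALAMOUNT": 320, "MAXAMOUNT": 40, "USEDMONTH": 30},
    {"id": "B", "PAY_PATTERN_04": "N", "TOTALAMOUNT": 6000, "MAXAMOUNT": 800, "USEDMONTH": 3},
    {"id": "C", "PAY_PATTERN_04": "N", "TOTALAMOUNT": 150, "MAXAMOUNT": 8, "USEDMONTH": 48},
]
flagged = forecast_list(customers)
```

Customer A is flagged by rule (1), customer B by rule (2), and customer C by neither. Keeping the name of the fired rule next to each flagged customer lets service personnel see why a reminder was suggested.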

(2) IF USEDMONTH <= 6.5 and MAXAMOUNT >= 697.5 and TOTALAMOUNT >= 5397.5 THEN PAY = 'N'

According to rule (2), if a new customer has used the service for at most 6.5 months, the price per call for the current week is greater than or equal to NT$657.5, and the total call amount for the week is at least NT$5,397.5, the probability of default for the new customer exceeds 80%.

4.3. Verification results

The last item in this study is a verification conducted by the telecom provider for six months. The provider investigated users whose monthly telephone fees exceed NT$2,000 and compared their data with the prediction results of the system. The statistical results are shown in Table 6. During the comparison, the subjects investigated by the provider were considered to be the population; therefore, the recall of the providers was not calculated. The evaluation criteria in Table 6 are calculated by Eqs. (2)–(4):

\[ \text{Prediction accuracy} = \frac{\text{number of correctly predicted late payment users}}{\text{total predicted number of late payment users}} = \frac{A}{B} \tag{2} \]

\[ \text{Recall} = \frac{\text{number of correctly predicted late payment users}}{\text{number of actual late payment users}} = \frac{A}{C} \tag{3} \]

\[ \text{Accuracy of provider} = \frac{\text{number of actual late payment users}}{\text{total actual number of users}} = \frac{C}{D} \tag{4} \]

According to these criteria and the results in Table 6, the average recall and the average accuracy of the proposed CM-CoP model were 88.13% and 78.53%, respectively. The accuracy was about 13 percentage points higher than that of the existing method used by the telecom provider, which was 65.60%. Meanwhile, the average gain of the system was 11.2 (=78.53%/7%). Thus, the proposed CM-CoP model is effective for predicting late-paying customers.

Table 6. Verification of statistical results.

| Month | Payment behavior | Prediction results | Subjects investigated | Evaluation criterion | Value |
| --- | --- | --- | --- | --- | --- |
| 1 | Late payment | 1576 (A) | 1837 (C) | Predictive accuracy | 78.33% |
| 1 | Normal payment | 436 | 1011 | Recall | 85.79% |
| 1 | Total number of users | 2012 (B) | 2848 (D) | Accuracy of provider | 64.50% |
| 2 | Late payment | 1592 | 1778 | Predictive accuracy | 77.77% |
| 2 | Normal payment | 455 | 987 | Recall | 89.54% |
| 2 | Total number of users | 2047 | 2765 | Accuracy of provider | 64.30% |
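The evaluation criteria of Eqs. (2)–(4) can be checked against the counts for months 1 and 2 as reconstructed from Table 6. The function name `evaluate` and the dictionary layout are illustrative; the letters A to D follow the labels used in the table.

```python
def evaluate(a, b, c, d):
    """a: correctly predicted late payers, b: all predicted late payers,
    c: actual late payers among investigated subjects, d: all investigated subjects."""
    return {
        "prediction_accuracy": a / b,    # Eq. (2)
        "recall": a / c,                 # Eq. (3)
        "accuracy_of_provider": c / d,   # Eq. (4)
    }


table_6 = {
    1: {"a": 1576, "b": 2012, "c": 1837, "d": 2848},
    2: {"a": 1592, "b": 2047, "c": 1778, "d": 2765},
}

results = {
    month: {name: round(100 * value, 2) for name, value in evaluate(**counts).items()}
    for month, counts in table_6.items()
}

# Average gain: average prediction accuracy divided by the 7% base rate of late payers.
gain = round(78.53 / 7, 1)
```

The computed percentages match the values listed in Table 6 for both months (78.33%, 85.79%, 64.50% and 77.77%, 89.54%, 64.30%), and the gain evaluates to 11.2.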






5. Conclusions and future work

In this paper, we first propose a late payment prediction framework, namely the Combined Mining-based Customer Payment Behavior Predication (CM-CoP) framework, which incorporates data mining technology with the domain-driven data mining strategy to predict which customers are most likely to leave their bill unpaid more than ten days after payment is due. Then, the CM-CoP algorithm is proposed for achieving this goal. After verification by the provider, the accuracy of the proposed model, which was 78.53%, was higher than that of the existing model, which was 65.60%. During the implementation of the plan, the user payment records for the previous month (the fifth month of each time interval) were added to the user payment behavior model, and the accuracy and recall of the existing model improved.

In addition, there are many reasons for late payment, and some behavior can cause providers to suffer heavy short-term losses. Since we hope that the system can use a small amount of recent CDR to predict and analyze users' behavior, each month-based window was subdivided into several week-based windows. The week-based windows increased the real-time capabilities of the system. Because of this, providers do not have to wait until the middle ten days of the next month for CDR prediction analysis, but can perform an analysis after the system collects the CDR for one week. In fact, if the providers are willing to use a week-based operation model to develop a real-time function, the system can make further analysis of different late payment behaviors in a short time. In the future, the authors will communicate with telecom providers regarding this issue and will conduct further analysis of each type of behavior in the hope that they will offer user payment records for the previous month on time. Further, the existing rules used by providers can be combined with the model. Different models can be provided for users with different call amounts in order to further improve the model's efficiency.

References

Agrawal, R., Imielinksi, T., & Swami, A. (1993). Mining association rules betweensets of items in large database. In The 1993 ACM SIGMOD conference, WashingtonDC, USA.

Au, W. H., & Chan, K. C. C. (2003). Mining fuzzy association rules in a bank-accountdatabase. IEEE Transactions on Fuzzy Systems, 11(2), 238–248.

Cahill, M. H., Lambert, D., Pinheiro, J. C., & Sun, D. X. (2002). Detecting fraud in therealworld. Handbook of massive data sets (massive computing 4). KluwerAcadamic Publishers (pp. 911–929). Kluwer Acadamic Publishers.

Cao, L. (2010). Domain-driven data mining: challenges and prospects. IEEETransactions on Knowledge and Data Engineering, 22(6), 755–769.

Cao, L., Zhao, Y., Zhang, H., Luo, D., Zhang, C., & Park, E. K. (2010). Flexibleframeworks for actionable knowledge discovery. IEEE Transactions on Knowledgeand Data Engineering, 22(9), 1299–1312.

Du, J., & Ling, C. X. (2010). Asking generalized queries to domain experts to improvelearning. IEEE Transactions on Knowledge and Data Engineering, 22(6), 812–825.

Fawcett, T., & Provost, F. (1997). Adaptive fraud detection. Data Mining andKnowledge Discovery, 1(3).

Hadavandi, E., Shavandi, H., & Ghanbari, A. (2010). Integration of genetic fuzzysystems and artificial neural networks for stock price forecasting. Knowledge-Based System, 23(8), 800–808.

He, J., Zhang, Y., Shi, Y., & Huang, G. (2010). Domain-driven classification based onmultiple criteria and multiple constraint-level programming for intelligentcredit scoring. IEEE Transactions on Knowledge and Data Engineering, 22(6),826–838.

Hung, S. Y., Yen, D. C., & Wang, H. Y. (2006). Applying data mining to telecom churnmanagement. Expert Systems with Applications, 31, 515–524.

Jin, H., Chen, J., He, H., Kelman, C., McAullay, D., & O’Keefe, C. M. (2010). Signalingpotential adverse drug reactions from administrative health databases. IEEETransactions on Knowledge and Data Engineering, 22(6), 839–853.

Mansingh, G., Osei-Bryson, K., & Reichgelt, H. (2011). Using ontologies to facilitatepost-processing of association rules by domain experts. Information Sciences,181(3), 419–434.

Marinica, C., & Guillet, F. (2010). Knowledge-based interactive postmining ofassociation rules using ontologies. IEEE Transactions on Knowledge and DataEngineering, 22(6), 784–797.

McQueen, J. B. (1967). Some methods of classification and analysis of mutivariateobservations. In The symposium on mathematical satistics and probability (pp.281–297).

Pradeep, I. K., Krishna, S. M., Illapu, S. S. R., Kumar, A., & Koyi, L. P. (2010). CRMsystem using CM-AKD approach of D3M. International Journal of EngineeringScience and Technology, 2(3), 237–242.

Quinlan, J. R. (2003). Induction of decision trees. Machine Learning, 1(1), 81–106.Schommer, C. (2010). Discovering fraud behaviour in call detailed records. Grande

region security and reliability day.Sim, A. T. H., Indrawan, M., Zutshi, S., & Srinivasan, B. (2010). Logic-based pattern

discovery. IEEE Transactions on Knowledge and Data Engineering, 22(6), 798–811.Tajbakhsh, A., Rahmati, M., & Mirzaei, A. (2009). Intrusion detection using fuzzy

association rules. Applied Soft Computing, 9(2), 462–469.Taniguchi, M., Haft, M., Hollmen, J., & Tresp, V. (1998). Fraud detection in

communication networks using neural and probilistic methods. IEEEInternational Conference on Acoustics, Speech and Signal Processing, 2, 12–15.

Wheeler, R., & Aitken, S. (2000). Multiple algorithms for fraud detection. Knowledge-Based System, 13, 93–99.

Xiang, E. W., Cao, B., Hu, D. H., & Yang, Q. (2010). Bridging domains using worldwide knowledge for transfer learning. IEEE Transactions on Knowledge and DataEngineering, 22(6), 70–783.

Xu, X., Lin, J., & Xu, D. (2009). Mining pattern of supplier with the methodology ofdomain-driven data mining. IEEE International Conference on Fuzzy Systems,1925–1930.

Yan, L., Wolniewicz, R. H., & Dodier, R. (2004). Predicting customer behavior intelecommunications. IEEE Intelligent Systems, 50–58.
