
hyperion.com

white paper

using analytic services data mining framework for classification

predicting the enrollment of students at a university – a case study

Data Mining is the process of knowledge discovery involving finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting mining results. Data Mining is one of the functional groups offered with Hyperion System 9 BI+ Analytic Services, a highly scalable analytic server (OLAP) with an enterprise-class architecture. The Data Mining Framework within Analytic Services integrates data mining functions with OLAP and provides users with highly flexible and extensible on-line analytical mining capabilities. On-line analytical mining greatly enhances the power of exploratory data analysis by letting users mine different subsets of data at different levels of abstraction, in combination with core analytic services such as drill up, drill down, pivoting, filtering, slicing and dicing, all performed on the same OLAP data source.

introduction

This paper focuses on using Naïve Bayes, one of the Data Mining algorithms shipped in-the-box with Analytic Services, to develop a model that solves a typical business problem in the admissions department of an academic university, referred to as “ABC University” in the paper. The paper details the approach taken by the user to solve the problem and explains the steps performed using Analytic Services in general, and the Analytic Services Data Mining Framework in particular, to arrive at the solution.

problem statement

One of the problems that universities typically face in managing admissions is predicting, with reasonable accuracy, the likelihood that an applicant will eventually enroll in an academic program. Universities typically incur considerable expense in promoting their programs and in following up with prospective candidates. Identifying applicants with a higher likelihood of enrollment will help the university channel its promotional expenditure more gainfully. Candidates typically apply to more than one university to improve their chances of being enrolled within that academic year. Universities that can quickly arrive at a decision on an applicant stand a higher chance of acceptance from candidates.

ABC University collects a variety of data from applicants as part of the admissions process: demographic, geographic, test scores, financial information, etc. In addition, the admissions department at ABC University has acceptance information from the previous year’s admissions process. The problem at hand is to use all this available data to predict whether an applicant will choose to enroll or not. ABC University is also interested in analyzing the composite factors influencing the enrollment decision. This additional analysis is useful in adjusting the university’s admissions policy and in ensuring effective cost management in the admissions department.

available data

The admissions department currently gathers demographic, geographic, test score, and financial information, etc., from applicants as part of the admissions process. Historical data is also available indicating the actual enrollment status of applicants, along with all the other attributes that were collected as part of the admissions process.

The dataset made available has 33 different attributes for each applicant, inclusive of the decision result attribute. There are about 11000 records available in all.


Table 1: List of potential mining attributes available in database


preparing for data mining

cube is the data source

The algorithms in the Data Mining Framework are designed to work on data present within an Analytic Services cube. The design of the cube should take into consideration the data needs for all kinds of analyses (OLAP and Data Mining) that the user is interested in performing. Once the data is brought into the cube environment, it can be accessed through the Data Mining Framework for predictive analytics.

The Data Mining Framework uses MDX expressions to identify sections within the cube, both to obtain input data for the algorithm and to write back the results. The Data Mining Framework can only take regular dimension members as mining attributes. This means that only data referenced through regular dimension members (not through attribute dimensions or user-defined attributes) can be presented as input data to the Data Mining Framework. Accordingly, the data required for predictive analytics should be modeled within the standard dimensions and measures of a cube.

In the case study discussed in this paper, the primary business requirement was to build a classification model for prediction. Since there were no other accompanying business requirements, the design of the Analytic Services cube was primarily driven by the Data Mining analytics need. For example, we have not used any attribute dimension modeling in the case study. In the generic case, however, it is more likely that the cube caters to both regular OLAP analytics and predictive analytics within the same dimensional model.

preparing mining attributes

The available input data can broadly be of two data types: ‘number’ or ‘string’. However, since measures in Analytic Services are stored in the database in a numerical format, ‘string’ type input data has to be encoded as ‘number’ type data before being stored in Analytic Services. For example, if the gender information is available as a string stating ‘Male’ or ‘Female’, it first needs to be encoded as a number, such as ‘1’ or ‘0’, before being stored as a measure in the Analytic Services OLAP database.

Mining attributes can be of two types: ‘categorical’ or ‘numerical’. Mining attributes that describe discrete information content, like gender (‘Male’ or ‘Female’), zip code (95054, 94304, 90210, etc.), customer category (‘Gold’, ‘Silver’, ‘Blue’), status information (‘Applied’, ‘Approved’, ‘Declined’, ‘On Hold’), etc., are termed ‘categorical’ attribute types. Mining attributes that describe continuous information content, like sales, revenue, income, etc., are termed ‘numerical’ attribute types. The Analytic Services Data Mining Framework can work with algorithms that handle both categorical and numerical attribute types. Among the algorithms shipped in the box with the Analytic Services Data Mining Framework, the Naïve Bayes and Decision Tree algorithms can handle both categorical and numerical mining attribute types and treat them accordingly.

One of the key steps in Data Mining is the data auditing, or data conditioning, phase. This involves assembling, cleansing, categorizing, normalizing, and properly encoding the data. This step is usually performed outside the Data Mining tool. The effectiveness of the Data Mining algorithm depends largely on the quality and completeness of the source data. In some cases, for various mathematical reasons, the available input data may also need to be transformed before it is brought into a Data Mining environment. Transformations may sometimes include splitting or combining input data columns. Some of these transformations may be done on the input dataset outside the Data Mining Framework, using standard data manipulation techniques available in ETL tools or RDBMS environments. In the current case the input data does not need any mathematical transformation, but some encoding is needed to convert the data into a format that can be processed within the Analytic Services OLAP environment.

In the current problem at ABC University, the available input data consisted of both ‘string’ and ‘number’ data types. The list below gives some of the input data that needed encoding from ‘string’ type into ‘number’ type:

• Identity related data, like Gender, City, State, Ethnicity
• Data related to the application process, like Application Status, Primary Source of contact, Applicant Type, etc.
• Date related data, like Application Date, Source Date, etc. (Dates were available in the original dataset as strings in two different formats, “yymmdd” and “mm/dd/yy”, and had to be encoded into a number.)
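The date encoding mentioned in the last item can be sketched as follows. This is a minimal illustration in Python, not the procedure used in the case study: the paper says only that both string formats had to be encoded into a number, so the function name `encode_date`, the choice of a `yyyymmdd` integer as the target encoding, and the reliance on Python's default two-digit-year pivot are all assumptions.

```python
from datetime import datetime

# The two layouts named in the paper; the order in which they are tried is arbitrary.
DATE_FORMATS = ("%y%m%d", "%m/%d/%y")


def encode_date(text):
    """Return the date as an integer yyyymmdd (an assumed target encoding)."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # not this layout, try the next one
        return parsed.year * 10000 + parsed.month * 100 + parsed.day
    raise ValueError(f"unrecognised date string: {text!r}")


# Both layouts of the same date map to the same number.
same_day = (encode_date("040915"), encode_date("09/15/04"))
```

An integer of this form keeps dates in chronological order when compared numerically; a day count from a fixed reference date would be an equally plausible choice.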

In the current case study, these encodings were done outside the Analytic Services environment by constructing look-up master tables in which the ‘string’ type inputs were listed in tabular format and the records were sequentially numbered. Subsequently, each ‘string’ type input was referred to by its corresponding numeric identifier during data load into Analytic Services. Table 2 shows a few samples of what such mapping files look like.


StateID   StateName
1         VT
2         CA
3         MA
4         MI
5         NH
6         NJ

AppliedStatusID   Application Status
3                 Applied
4                 Offered Admission
5                 Paid Fees
6                 Enrolled

Table 2: Typical mapping of numeric identifiers
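The look-up approach illustrated in Table 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `build_lookup` and `encode_column` are hypothetical, and the rule of numbering values in order of first appearance is an assumption, since the paper says only that the records were sequentially numbered.

```python
def build_lookup(values, start=1):
    """Number distinct strings sequentially, in order of first appearance."""
    lookup = {}
    for value in values:
        if value not in lookup:
            lookup[value] = start + len(lookup)
    return lookup


def encode_column(values, lookup):
    """Replace each string with its numeric identifier from the look-up table."""
    return [lookup[value] for value in values]


# State codes as they might appear in the raw applicant records (sample values).
states = ["VT", "CA", "MA", "CA", "MI", "VT", "NH", "NJ"]
state_ids = build_lookup(states)
encoded_states = encode_column(states, state_ids)
```

The `start` argument allows a numbering that does not begin at 1, as in the AppliedStatusID column of Table 2.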


preparing the cube

After all the input data has been identified and made ready, the next step is to design an outline and load the data into an Analytic Services cube.

In the context of the current case, the Analytic Services outline was created as follows:

• All the input data (measures in the OLAP context) were organized into five groups (a two-level hierarchy created in the measures dimension) based on a logical grouping of measures. The details of each measure group are explained in Table 3.
• Data load is performed just as it is normally done for any Analytic Services cube.

At this stage we have:
• Designed an Analytic Services cube
• Loaded it with relevant data

It should be noted that the steps described so far are generic to Analytic Services cube building and did not need any specific support from the Analytic Services Data Mining Framework.


Measure Group   Explanation

• Measures related to information about the applicants’ identity were organized into this group. Some of these measures were transformed from ‘string’ type to ‘number’ type to facilitate modeling them within the Analytic Services database context.
• Measures related to various test scores and high school examination results were organized into this group.
• Measures related to the context of the applicant’s application processing were organized into this group.
• Measures related to the academic background.
• Measures providing information about the financial support and funding associated with the applicant.

Table 3: Analytic Services outline expanded


identifying the optimal set of mining attributes

It is necessary to reduce the number of attributes (variables) presented to an algorithm so that the information content is enhanced and the noise minimized. This is usually done with supporting mathematical techniques that ensure the most significant attributes are retained in the dataset presented to the algorithm. Note that the choice of significant attributes is driven more by the particular data than by the problem itself. Attribute analysis, or attribute conditioning, is one of the initial steps in the Data Mining process and is currently performed outside the Data Mining Framework. The main objective of this exercise is to identify a subset of mining attributes that are highly correlated with the predicted attribute, while ensuring that the correlation within the identified subset is as low as possible.

The Analytic Services platform provides a wide variety of tools and techniques that can be used in the attribute selection process. One method of identifying an optimal set of attributes is to use special data reduction techniques implemented within Analytic Services through Custom Defined Functions (CDFs). Additionally, users can use data visualization tools like Hyperion Visual Explorer to decide how effectively specific attributes contribute to the overall predictive strength of the Data Mining algorithm. Depending on the nature of the problem, users may choose an appropriate tool and technique for deciding the optimal set of attributes.

One of the advantages of working with the Analytic Services Data Mining Framework is the inherent capability in Analytic Services to support customized methods for attribute selection through Custom Defined Functions (CDFs). This is essential because the process of mining attribute selection can vary significantly from problem to problem, and an extensible toolkit makes it possible to customize a method to suit a specific problem.

In the current case at ABC University, a CDF was used to identify the correlation effects among the available mining attributes. Various subsets of the available mining attributes were analyzed thoroughly to identify a subset that is highly correlated with the predicted mining attribute and at the same time has low correlation scores within the subset itself. Since some Data Mining algorithms (like Naïve Bayes and Neural Net) are quite sensitive to inter-attribute dependencies, an attempt was made to outline the clusters of mutually dependent attributes, with a certain degree of success. From each cluster a single, most convenient, attribute was selected. For this case study an expert made the decision, but this process can be generalized to a large degree. An optimal set of five mining attributes was identified after this exercise. Table 4 shows the list of identified mining attributes, grouped by input attribute type: categorical or numerical.
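The selection objective described above (high correlation with the predicted attribute, low correlation within the chosen subset) can be sketched as a simple greedy filter. This is not the CDF used in the case study, whose correlation measure and clustering procedure are not described in the paper. The sketch assumes Pearson correlation on numerically encoded attributes, and the names `pearson`, `select_attributes`, `max_inter_corr` and `max_attributes` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def select_attributes(columns, target, max_inter_corr=0.7, max_attributes=5):
    """Greedily keep attributes that track the target but not each other."""
    # Strongest association with the target first.
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], target)),
                    reverse=True)
    selected = []
    for name in ranked:
        if len(selected) == max_attributes:
            break
        # Skip an attribute that largely duplicates one already kept.
        if all(abs(pearson(columns[name], columns[kept])) <= max_inter_corr
               for kept in selected):
            selected.append(name)
    return selected
```

In this sketch a candidate that is strongly correlated with an already selected attribute is dropped, which roughly mirrors picking one representative per cluster of mutually dependent attributes; the threshold of 0.7 is arbitrary.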

At this stage we have:
• Designed an Analytic Services cube
• Loaded it with relevant data
• Identified the optimal subset of measures (mining attributes)

modeling the problem

We will now use the Data Mining Framework to define an appropriate model for the business problem, based on the Analytic Services cube and the identified subset of mining attributes (measures). Setting up the model includes selecting the algorithm, defining the algorithm parameters, and identifying the input and output data locations for the algorithm.

choosing the algorithm

The next step in the Data Mining process is to pick the appropriate algorithm. The Data Mining Framework provides a set of six basic algorithms: Naïve Bayes, Regression, Decision Tree, Neural Network, Clustering and Association Rules. The Analytic Services Data Mining Framework also allows new algorithms to be included through a well-defined process described in the vendor guide that is part of the Data Mining SDK. The six basic algorithms are a sample set shipped with the product to provide a starting point for using the Data Mining Framework. Choosing an algorithm for a specific problem requires basic knowledge of the problem domain and of the applicability of specific mathematical techniques to solving problems in that domain efficiently.

The specific problem discussed in this paper falls into the class of classification problems. The need here is to assign each applicant to one of a discrete set of classes on the basis of certain numerical and categorical information available about the applicant. The ‘class’ referred to in this context is the status of the applicant’s application from an enrollment perspective: “will enroll” or “will not enroll”. Historical data is available indicating which kinds of applicants (each with a specific combination of categorical and numerical factors) went ahead and accepted offers from ABC University and subsequently enrolled in its programs. Data is available for the negative case as well, i.e. applicants who did not eventually enroll.


Categorical Type   Numerical Type
FARecieved         StudBudget
AppStatus          TotalAward
Applicant Type

Table 4: Optimal set of mining attributes identified


Given that this problem can be treated as a classification problem and that historical information is available, one of the algorithms suitable for the analysis is the Naïve Bayes classification algorithm. We chose Naïve Bayes for modeling this particular business problem.
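For readers unfamiliar with the technique, the following Python sketch shows the general idea of a Naïve Bayes classifier over categorical attributes. It is a textbook-style illustration, not the implementation shipped with Analytic Services: the function names `train` and `predict`, the use of add-one (Laplace) smoothing, and the assumption that numerical attributes have already been converted to discrete bins are all choices made for this example.

```python
from collections import Counter, defaultdict
from math import log


def train(records, labels):
    """Count how often each attribute value occurs within each class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    distinct = defaultdict(set)          # attribute index -> distinct values seen
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct


def predict(model, record):
    """Return the class with the highest posterior score for one record."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = log(count / total)  # log of the class prior
        for i, value in enumerate(record):
            seen = value_counts[(i, label)][value]
            # Add-one smoothing keeps unseen values from zeroing the product.
            score += log((seen + 1) / (count + len(distinct[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The classifier multiplies per-attribute likelihoods as if the attributes were independent given the class, which is why the attribute selection step tries to keep inter-attribute correlation low.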

deciding on the algorithm parameters

Every algorithm has a set of parameters that control its behavior. Algorithm users need to choose the parameters based on their knowledge of the problem domain and the characteristics of the input data. Analytic Services provides adequate support for such preliminary analysis of data using Hyperion Visual Explorer or the Analytic Services Spreadsheet Client. Users are free to analyze the data using any convenient tool and determine their choices for the various algorithm parameters.

Each algorithm has a set of parameters that determine how it will process the input data. For the current case the algorithm chosen is Naïve Bayes, which has four parameters that need to be specified: ‘Categorical’, ‘Numerical’, ‘RangeCount’ and ‘Threshold’. The details of each parameter and the implications of setting it are described in the online help documentation.

Of the selected attributes, a few are of categorical type, and hence our choice for the ‘Categorical’ parameter is ‘yes’. Similarly, some attributes are of numerical type, and hence the choice for the ‘Numerical’ parameter is also ‘yes’. The data was analyzed using a histogram plot to understand its distribution before deciding on the value for the ‘RangeCount’ parameter. This parameter needs to be large enough to allow the algorithm to use all the variety available in the data, and at the same time small enough to prevent overfitting. From the analysis of the input data for this particular case, setting this parameter to ‘12’ seemed reasonable. ‘RangeCount’ controls the binning process in the algorithm. It should be emphasized that binning schemes (including the bin count) depend on the specific circumstances and may vary to a great degree between different problems.
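To make the role of a bin count concrete, the sketch below bins a numerical attribute into a fixed number of ranges. It assumes equal-width bins, which is only one possible scheme; the paper does not state how the Naïve Bayes algorithm in the Data Mining Framework derives its ranges from ‘RangeCount’, and the function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, range_count):
    """Map each value to a bin index in 0..range_count-1 using equal-width ranges."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]  # all values identical: a single bin
    width = (high - low) / range_count
    # The maximum value would land one past the last bin, so clamp it.
    return [min(int((v - low) / width), range_count - 1) for v in values]


# Twelve ranges, matching the RangeCount chosen in the case study.
sample_bins = equal_width_bins([0, 5, 10, 60, 119, 120], 12)
```

With too few bins, distinct behaviors are merged into one range; with too many, each bin holds few records and the estimated probabilities become unreliable, which is the trade-off described above.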

At this stage we have:
• Designed an Analytic Services cube
• Loaded it with relevant data
• Identified the optimal subset of measures (mining attributes)
• Chosen the algorithm suitable for the problem
• Identified the parameter values for the chosen algorithm

applying the data mining framework

Now that we have completed all the preparatory steps for Data Mining, the next step is to use the Data Mining Wizard in the Administration Services Console to build a Data Mining model for the business problem. There are three steps involved in effectively using the Data Mining functionality to provide predictive solutions to business problems:

1. Building the Data Mining model
2. Testing the Data Mining model
3. Applying the Data Mining model

Each of these steps, performed using the Data Mining Wizard in the Administration Services Console, uses MDX expressions to define the context within the cube in which to perform the data mining operation. Various accessors, specified as MDX expressions, identify data locations within the cube. The framework uses the data in these locations as input to the algorithm or writes output to the specified location.

Accessors need to be defined for each algorithm so that the algorithm knows the specific context for each of the following:

• (the attribute domain) the expression that identifies the factors of our analysis that will be used for prediction [In the current context this expression pertains to the mining attributes that we identified]
• (the sequence domain) the expression that identifies the cases/records that need to be analyzed [In the current context this expression identifies the list of applicants]
• (the external domain) the expression that identifies whether multiple models need to be built [Not relevant in the current context]
• (the anchor) the expression that specifies additional restrictions from dimensions that are not really participating in this data mining operation [In the current context all the dimensions of the cube that we used are relevant to the problem. Accordingly, the anchor only helps restrict the algorithm scope to the right measure in the ‘Measures’ dimension]

Additional details for each of these expressions can be obtained from the online help documentation.

building the data mining model

To access the Data Mining Framework, bring up the Data Mining Wizard in the Administration Services Console and choose the appropriate application and database, as shown in Figure 1.


In the next screen (Figure 2), depending on whether you are building a new model or revising an existing model, you choose the appropriate task option.


Figure 1: Choosing the application and database

Figure 2: Creating a Build Task


This brings up the wizard screen for setting the algorithm parameters and the accessor information associated with the chosen algorithm, in this case Naïve Bayes. The user selects a node in the left pane to see, and provide values for, the options and fields displayed in the right pane. As shown in Figure 3, select “Choose mining task settings” to set how missing data in the cube is handled. The choice in this case is ‘As NaN’ (Not-A-Number).

The Naïve Bayes algorithm requires that we declare up front whether we plan to use either or both of the ‘Categorical’ and ‘Numerical’ predictors. In the current case we have both categorical and numerical attribute types, and hence the choice is ‘True’ for both parameters. ‘RangeCount’ was set to 12. ‘Threshold’ was fixed at 1e-4, a very small value. Figure 4 shows the completed screen for the parameter settings.


Figure 3: Settings to handle missing data

Figure 4: Setting parameters


The Naïve Bayes algorithm has two predictor accessors, ‘Numerical Predictor’ and ‘Categorical Predictor’, and one target accessor. Figure 5 shows the various domains that need to be defined for the accessors. Table 5 shows the values that were used for the case being discussed. All the information provided during this stage of model building is preserved in a template file so that it can be reused if necessary.


Figure 5: Accessors associated with Naive Bayes algorithm

Table 5: Setting up accessors for the “build” mode while using Naive Bayes algorithm


Once the accessors are defined, the Data Mining Wizard prompts the user to provide names for the template and model that will be generated at this stage. Figure 6 shows the screen in which the model and template names are defined.

At this stage we have:
• Built a Data Mining model using the Naïve Bayes algorithm

testing the data mining model

The next step is to test the newly built model to verify that it satisfies the level of statistical significance needed for the model to be put to use. Ideally, a part of the input data (historical data with valid known outcomes) is set aside as a test dataset to verify the goodness of the Data Mining model developed by the algorithm. Testing the model on this test dataset, and comparing the outcomes predicted by the model against the known outcomes in the historical data, is one of the processes supported by the Data Mining Wizard. A ‘test’ mode template can be created by a process similar to creating a ‘build’ mode template, as described in the previous section. While building the ‘test’ mode template the user needs to provide a ‘Confidence’ parameter to let the Data Mining Framework know the minimum confidence level necessary to declare the model valid. We specified a value of 0.95 for the ‘Confidence’ parameter. The exact steps in the wizard and descriptions of the various parameters can be obtained from the online help documentation.
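The idea of setting aside part of the historical data and comparing predictions with known outcomes can be sketched as follows. This is a generic holdout illustration in Python, not a description of what the Data Mining Wizard does internally. The names `holdout_split` and `accuracy`, the 30 percent test fraction and the fixed random seed are assumptions, and the sketch deliberately does not model how the framework evaluates the ‘Confidence’ parameter, which the paper leaves to the online help.

```python
import random


def holdout_split(records, labels, test_fraction=0.3, seed=0):
    """Shuffle record indices reproducibly and set a fraction aside for testing."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    test_size = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:test_size], indices[test_size:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, records), pick(train_idx, labels),
            pick(test_idx, records), pick(test_idx, labels))


def accuracy(predicted, actual):
    """Fraction of test cases where the prediction matches the known outcome."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)
```

A model built on the training portion would be applied to the test records, and the resulting predictions passed to `accuracy` together with the known test labels.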


Figure 6: Generating the template and model


Once the process is completed, the test results (under the name specified in the last step of the Data Mining Wizard) appear under the ‘Model Results’ node. Figure 7 shows where the node is visible in the ‘Enterprise View’ pane of the Administration Services Console.

The model can be queried within the Administration Services Console interface to obtain a list of the model accessors by using the “Query Result” functionality. Invoking “Show Result” for the ‘Test’ accessor indicates the result of the test. Figure 8 shows the list of model accessors in the result set of a model based on the Naïve Bayes algorithm used in test mode.

If the ‘Test’ accessor has the value 1.0, the test is deemed successful and the model is declared ‘good’ or ‘valid’ for prediction. Figure 9 shows the test result for the case discussed in this paper.

At this stage we have:
• Built a Data Mining model using the Naïve Bayes algorithm
• Verified the model as valid with 95% confidence


Figure 7: Model Results node in the Administration Services Console interface

Figure 8: Model accessors for result set associated with a model based on Naive Bayes algorithm

Figure 9: Test results


applying the data mining modelThe intent at this stage is to use the recently constructed DataMining model to predict whether new applicants are likely toenroll into the program. Using the Data Mining model in theapply mode is similar to the earlier two steps. The Data MiningWizard guides the user to provide the parameters appropriateto the ‘apply’ mode. The ‘Target’ domain is usually different inthe ‘apply’ mode since data is written back to the cube. Thedetails of the various accessors and the associated domains canbe obtained from the online help documentation. Table 6shows the values that were provided to the Data MiningWizard to use the model in the ‘apply’ mode.

Just as in the ‘build’ mode the names of the results modeland template are specified in the wizard and the template issaved before the model is executed. The results of theprediction are written into the location specified by the‘Target’ accessor – The mining attribute that is referred to bythe MDX expression: {[ActualStatus]}. The results can bevisualized either by querying the model results in theAdministration Services Console using the “Query Result”functionality as described in the previous section, or byaccessing the cube and reviewing the data written back to thecube. One of the options to view the results will be to use theAnalytical Services Spread Sheet Client to connect to thedatabase and view the cube data for the ‘ActualStatus’ measure.

interpreting the results

The results of the Data Mining model need to be interpreted in the context of the business problem that it is attempting to solve. Any transformation applied to the input measures needs to be accounted for when interpreting the results. In the case discussed in this paper, the intent was to predict whether applicants were likely to enroll at ABC University. There are two possible outcomes: the applicant enrolls or the applicant does not enroll. The model was verified against the entire set of available data (over 11,300 records).

Table 6: Setting up accessors for the “apply” mode while using the Naïve Bayes algorithm

the confusion matrix

You can construct a confusion matrix by tabulating the model's predictions against the actual outcomes, which exposes the ‘false positives’ and ‘false negatives’. A ‘false positive’ happens when the model predicts that an applicant will enroll and in reality the applicant does not enroll. A ‘false negative’ happens when the model predicts that an applicant will not enroll and in reality the applicant does enroll. The results predicted by the model can be compared with the actual outcome available in the historical data to build the confusion matrix. In general, for such classification problems, one of these (‘false positives’ or ‘false negatives’) is likely to be more important than the other in a business context. In the case being discussed in this paper, a ‘false negative’ means lost revenue, whereas a ‘false positive’ means additional promotional expenditure in trying to follow up on an applicant who will eventually not enroll. The importance of each should be analyzed in the context of the business, and the model needs to be rebuilt if necessary with a different training set (historical data) or with a different set of attributes.

Figure 10 shows the confusion matrix constructed using the data set that was analyzed as part of this case study. The model predicted that 1550 (1478 + 72) students would enroll. Of these, 1478 actually enrolled and 72 did not, so there were 72 false positives. Similarly, the model predicted that 9805 (9356 + 449) students would not enroll. Of these, 9356 did not enroll, whereas 449 did, so there were 449 false negatives.
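The tally behind such a confusion matrix can be sketched in a few lines of Python. This is a minimal illustration and not part of the Analytic Services tooling: the function name confusion_counts is hypothetical, and the record list is synthetic, built only to reproduce the four counts quoted in the text.

```python
from collections import Counter

def confusion_counts(pairs):
    """Tally (predicted, actual) boolean pairs into the four confusion-matrix cells."""
    tally = Counter(pairs)
    return {
        "true_positive": tally[(True, True)],    # predicted enroll, did enroll
        "false_positive": tally[(True, False)],  # predicted enroll, did not enroll
        "false_negative": tally[(False, True)],  # predicted not enroll, did enroll
        "true_negative": tally[(False, False)],  # predicted not enroll, did not enroll
    }

# Synthetic (predicted, actual) pairs constructed to match the counts in the text.
records = (
    [(True, True)] * 1478
    + [(True, False)] * 72
    + [(False, True)] * 449
    + [(False, False)] * 9356
)

cells = confusion_counts(records)
predicted_enroll = cells["true_positive"] + cells["false_positive"]
predicted_not_enroll = cells["true_negative"] + cells["false_negative"]

print(cells)
print(predicted_enroll, predicted_not_enroll, len(records))
```

In practice the pairs would come from comparing the values written back by the model with the actual outcomes in the historical data, rather than from a hand-built list.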

analyzing the results

On further analysis of the results, the following observations can be made:

Success rate of the model: 95.41% (only 521 incorrect predictions in 11355 cases)
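The success rate follows directly from the confusion-matrix counts. The arithmetic is sketched below in Python; the variable names are illustrative, and precision and recall are additional derived metrics that the paper itself does not report.

```python
# Counts quoted in the confusion-matrix discussion.
true_positive = 1478   # predicted enroll, did enroll
false_positive = 72    # predicted enroll, did not enroll
false_negative = 449   # predicted not enroll, did enroll
true_negative = 9356   # predicted not enroll, did not enroll

total = true_positive + false_positive + false_negative + true_negative
incorrect = false_positive + false_negative

# Percentages of all cases.
success_rate = 100 * (total - incorrect) / total
false_positive_pct = 100 * false_positive / total
false_negative_pct = 100 * false_negative / total

# Derived metrics (not reported in the paper).
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"total cases: {total}, incorrect: {incorrect}")
print(f"success rate: {success_rate:.2f}%")
print(f"false positives: {false_positive_pct:.3f}%")
print(f"false negatives: {false_negative_pct:.3f}%")
print(f"precision: {precision:.3f}, recall: {recall:.3f}")
```

Because a false negative means lost revenue in this case, recall (the share of actual enrollees that the model identifies) may deserve as much attention as the overall success rate when deciding whether to rebuild the model.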

additional functionality

The Analytic Services Data Mining Framework offers more functionality that can be used when deploying models in real business scenarios. Some of the further steps that can be considered include:

transformations

The Data Mining Framework also offers the ability to apply a transform to the input data just before it is presented to the algorithm. Similarly, the output data can be transformed before being written into the Analytic Services cube. The Data Mining Framework offers a basic list of transformations (exp, log, pow, scale, shift, linear) that can be used through the Data Mining Wizard. The details of each of these transformations, what they do and how to use them, can be obtained from the Analytic Services online help documentation. This list of transformations is extensible through the import of custom Java routines written specifically for the purpose. The details of how to write Java routines to be imported as additional transforms can be obtained from the vendor guide that is shipped as part of the Data Mining SDK.
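As a rough illustration of what transformations of this kind do, the Python sketch below applies a named transform to input values before they would reach an algorithm. The parameterization of each function (for example, that scale multiplies by a factor and linear computes a*x + b) is an assumption made for illustration, as are the helper apply_transform and the sample income values. The actual definitions are in the Analytic Services online help, and custom transforms in the product are written in Java, not Python.

```python
import math

# Hypothetical stand-ins for the six built-in transform names.
# The parameters and their defaults are assumptions, not the product's definitions.
TRANSFORMS = {
    "exp": lambda x: math.exp(x),
    "log": lambda x: math.log(x),
    "pow": lambda x, p=2.0: x ** p,
    "scale": lambda x, factor=1.0: x * factor,
    "shift": lambda x, offset=0.0: x + offset,
    "linear": lambda x, a=1.0, b=0.0: a * x + b,
}

def apply_transform(name, values, **params):
    """Apply the named transform to every value, passing through any parameters."""
    fn = TRANSFORMS[name]
    return [fn(v, **params) for v in values]

# Hypothetical input measure: applicant household income.
incomes = [20000.0, 45000.0, 120000.0]

scaled = apply_transform("scale", incomes, factor=0.001)  # express in thousands
logged = apply_transform("log", incomes)                  # compress a wide range

print(scaled)
print(logged)
```

Whatever transform is applied on the way in has to be kept in mind when reading the model's output, which is why the choice is recorded with the model template rather than applied ad hoc.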

mapping

In some cases, when the model has been developed for a different context and needs to be used elsewhere, the ‘Mapping’ functionality is useful. Through this functionality the user can provide information to the Data Mining Framework on how to interpret the existing model accessors in the new context in which the model is being deployed. More information on using this functionality can be obtained from the online help documentation.

import/export of pmml models

The Data Mining Framework allows for portability through import and export of mining models using the Predictive Model Markup Language (PMML) format.

setting up models for scoring

The Data Mining models built using the Analytic Services Data Mining Framework can also be set up for ‘scoring’. In the ‘scoring’ mode the user interacts with the model in real time and the results are not written to the database. The input data can be sourced either from the cube or through data templates that the user fills in during execution. The ‘scoring’ mode of deployment can be combined with custom applications built using developer tools provided by Hyperion Application Builder to make applications that cater to a specific business process while leveraging the predictive analytic capability of the Analytic Services Data Mining Framework. The online help documentation provides additional details on how to ‘score’ a Data Mining model.

using the data mining framework in batch mode

There is also a batch mode interface to the functionality provided in the Data Mining Framework. Scripts written using the MaxL command interface can perform almost all of the functionality that is exposed through the Data Mining Wizard. Details of the MaxL commands and their usage can be obtained from the online help documentation.

building custom applications

Custom applications can be developed using Analytic Services as the backend database and developer tools provided along with Hyperion Application Builder. The functionality provided by the Data Mining Framework can be invoked through APIs.


Incorrect Predictions    # of Cases    Percentage of Cases
False positives          72            0.634%
False negatives          449           3.954%
Total                    521           4.59%

Figure 10: Confusion matrix to analyze the model’s effectiveness in prediction


© Copyright 2005 Hyperion Solutions Corporation. All rights reserved. “Hyperion,” the Hyperion “H” logo, and Hyperion’s product names are trademarks of Hyperion. References to other companies and their products use trademarks owned by the respective companies and are for reference purpose only. 5164_0805

summary

Data Mining is one of the functional groups among the comprehensive enterprise class analytic functionalities offered within Analytic Services. This case study focused on using the ‘Naïve Bayes’ algorithm to solve a classification problem, modeled using a real life data set. It was possible to get a 95.41% success rate in the classification exercise using the Analytic Services Data Mining Framework.

Some of the business benefits of Data Mining in the OLAP context that can be illustrated from the current case include:

• Data Mining can serve as a discovery tool in a critical decision-support process, including evaluation of the critical parameters that affect customer (applicant) behavior. ABC University had initially assumed that some time-related factors played a stronger role in influencing the decision to enroll. The Data Mining exercise showed this not to be true. In fact, certain financial attributes emerged as the most influential.

• The successful prediction mechanism can become a base for a full-blown risk-management application. In the case of ABC University, the admissions department can devise a policy to invest more promotional expenditure in tracking applicants with distinctly higher academic credentials but only a moderate probability of enrollment. Similarly, the prediction mechanism can help the admissions department make decisions on admission offers even before it has seen the entire applicant pool.

• Operational control and reporting tool. Traditional OLAP reporting can provide visibility into the state of the admissions operations, the extent of funds utilization, and various other financial and operational indicators, overall providing better control over the conformance between planned and actual business positions.

suggested reading

1. Data Mining: Concepts and Techniques. Jiawei Han, Micheline Kamber.

2. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Michael J. A. Berry, Gordon S. Linoff.

3. Data Mining Explained. Rhonda Delmater, Monte Hancock Jr.

4. Data Mining: A Hands-On Approach for Business Professionals (Data Warehousing Institute Series). Robert Groth.

footnote

1 Breaking up a continuous range of data into discrete segments / bins.