b28131.pdf

8/13/2019 b28131.pdf

1/118

Oracle Data Mining

Application Developer's Guide

11gRelease 1 (1.1)

B28131-01

January 2007

Beta Draft

8/13/2019 b28131.pdf

2/118

Oracle Data Mining Application Developers Guide, 11gRelease 1 (1.1)

B28131-01

Copyright 2005, 2007, Oracle. All rights reserved.

The Programs (which include both the software and documentation) contain proprietary information; theyare provided under a license agreement containing restrictions on use and disclosure and are also protected

by copyright, patent, and other intellectual and industrial property laws. Reverse engineering, disassembly,or decompilation of the Programs, except to the extent required to obtain interoperability with other

independently created software or as specified by law, is prohibited.

The information contained in this document is subject to change without notice. If you find any problems inthe documentation, please report them to us in writing. This document is not warranted to be error-free.Except as may be expressly permitted in your license agreement for these Programs, no part of thesePrograms may be reproduced or transmitted in any form or by any means, electronic or mechanical, for anypurpose.

If the Programs are delivered to the United States Government or anyone licensing or using the Programs onbehalf of the United States Government, the following notice is applicable:

U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical datadelivered to U.S. Government customers are "commercial computer software" or "commercial technical data"pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. Assuch, use, duplication, disclosure, modification, and adaptation of the Programs, including documentationand technical data, shall be subject to the licensing restrictions set forth in the applicable Oracle licenseagreement, and, to the extent applicable, the additional rights set forth in FAR 52.227-19, CommercialComputer Software--Restricted Rights (June 1987). Oracle USA, Inc., 500 Oracle Parkway, Redwood City, CA94065.

The Programs are not intended for use in any nuclear, aviation, mass transit, medical, or other inherentlydangerous applications. It shall be the licensee's responsibility to take all appropriate fail-safe, backup,redundancy and other measures to ensure the safe use of such applications if the Programs are used for suchpurposes, and we disclaim liability for any damages caused by such use of the Programs.

Oracle, JD Edwards, PeopleSoft, and Siebel are registered trademarks of Oracle Corporation and/or itsaffiliates. Other names may be trademarks of their respective owners.

The Programs may provide links to Web sites and access to content, products, and services from thirdparties. Oracle is not responsible for the availability of, or any content provided on, third-party Web sites.You bear all risks associated with the use of such content. If you choose to purchase any products or servicesfrom a third party, the relationship is directly between you and the third party. Oracle is not responsible for:(a) the quality of third-party products or services; or (b) fulfilling any of the terms of the agreement with thethird party, including delivery of products or services and warranty obligations related to purchasedproducts or services. Oracle is not responsible for any loss or damage of any sort that you may incur from

dealing with any third party.Alpha and Beta Draft documentation are considered to be in prerelease status. This documentation isintended for demonstration and preliminary use only. We expect that you may encounter some errors,ranging from typographical errors to data inaccuracies. This documentation is subject to change withoutnotice, and it may not be specific to the hardware on which you are using the software. Please be advisedthat prerelease documentation is not warranted in any manner, for any purpose, and we will not beresponsible for any loss, costs, or damages incurred due to the use of this documentation.

8/13/2019 b28131.pdf

3/118

Beta Draft iii

Contents

Preface ................................................................................................................................................................. v

Audience....................................................................................................................................................... v

Documentation Accessibility..................................................................................................................... v

Related Documentation.............................................................................................................................. vi

Conventions .............. ............... .............. ............... .............. ................ .............. ................ ............... ............ vi

1 Use Cases

Analyze Customer Demographics and Buying Patterns .................................................................. 1-1

Evaluate the Success of a Marketing Campaign ................................................................................ 1-4

Use Predictive Analytics to Create a Customer Profile..................................................................... 1-5

2 A Tour of the APIs

Data Mining PL/SQL Packages ............................................................................................................. 2-1

Data Mining Data Dictionary Views .................................................................................................... 2-3

Data Mining SQL Functions .................................................................................................................. 2-4

Data Mining Java API..............................................................................................................................

2-5

3 Creating a Mining Model

Mining Model Objects ............................................................................................................................ 3-1

Settings and Defaults .............................................................................................................................. 3-2

Model Build............................................................................................................................................... 3-3

4 Scoring and Deployment

Applying a Mining Model...................................................................................................................... 4-1

Real Time Scoring .................................................................................................................................... 4-1

Deployment and Model Management ................................................................................................. 4-2

5 The Case Table

Requirements ............................................................................................................................................ 5-1

About Attributes....................................................................................................................................... 5-2

Nested Data ............................................................................................................................................... 5-7

Missing Values and Sparse Data ........................................................................................................ 5-11

8/13/2019 b28131.pdf

4/118

iv Beta Draft

6 Data Transformation

Getting Started.......................................................................................................................................... 6-1

How Will the Model Use the Data? ...................................................................................................... 6-1

Algorithm-Specific Transformations.................................................................................................... 6-3

Automatic Embedded Transformations ............................................................................................... 6-5

Custom Embedded Transformations.................................................................................................... 6-7

7 Text Transformation

Oracle Text for Oracle Data Mining...................................................................................................... 7-1

Term Extraction in the Sample Programs ............................................................................................ 7-2

From Unstructured Data to Structured Data ....................................................................................... 7-3

Steps in the Term Extraction Process .................................................................................................... 7-4

Example: Transforming a Text Column ................................................................................................ 7-8

8 The Java API

The Java Environment ............................................................................................................................. 8-1

Connecting to the Data Mining Engine ............................................................................................... 8-1

API Design Overview.............................................................................................................................. 8-7

9 Sequence Similarity Search and Alignment (BLAST)

Bioinformatics Sequence Search and Alignment .............................................................................. 9-1

BLAST in the Oracle Database .............................................................................................................. 9-2

Oracle Data Mining Sequence Search and Alignment Capabilities.............................................. 9-2

NCBI BLAST ............................................................................................................................................. 9-3

Using Oracle BLAST................................................................................................................................ 9-4

Index

8/13/2019 b28131.pdf

5/118

Beta Draft v

Preface

This manual describes the programmatic interfaces to Oracle Data Mining. You canuse the PL/SQL and Java interfaces to create data mining applications or add datamining features to existing applications. You can use the SQL operators in applicationsor in ad hoc queries.

This manual should be used along with the demo applications and the related

reference documentation. (See "Related Documentation"on page -vi.)The preface contains these topics:

Audience

Documentation Accessibility

Related Documentation

Conventions

AudienceThis manual is intended for database programmers who are familiar with Oracle Data

Mining.

Documentation AccessibilityOur goal is to make Oracle products, services, and supporting documentationaccessible, with good usability, to the disabled community. To that end, ourdocumentation includes features that make information available to users of assistivetechnology. This documentation is available in HTML format, and contains markup tofacilitate access by the disabled community. Accessibility standards will continue toevolve over time, and Oracle is actively engaged with other market-leadingtechnology vendors to address technical obstacles so that our documentation can beaccessible to all of our customers. For more information, visit the Oracle Accessibility

Program Web site athttp://www.oracle.com/accessibility/

Accessibility of Code Examples in Documentation

Screen readers may not always correctly read the code examples in this document. Theconventions for writing code require that closing braces should appear on anotherwise empty line; however, some screen readers may not always read a line of textthat consists solely of a bracket or brace.

8/13/2019 b28131.pdf

6/118

vi Beta Draft

Accessibility of Links to External Web Sites in Documentation

This documentation may contain links to Web sites of other companies ororganizations that Oracle does not own or control. Oracle neither evaluates nor makesany representations regarding the accessibility of these Web sites.

TTY Access to Oracle Support Services

Oracle provides dedicated Text Telephone (TTY) access to Oracle Support Serviceswithin the United States of America 24 hours a day, seven days a week. For TTYsupport, call 800.446.2398.

Related DocumentationThe documentation set for Oracle Data Mining is part of the Oracle Database 11gRelease 1 (1.1) Online Documentation Library. The Oracle Data Mining documentationset consists of the following documents:

Oracle Data Mining Concepts

Oracle Data Mining Java API Reference(javadoc)

Oracle Data Mining Administrator's Guide

For information about Oracle Data Miner, the graphical user interface for Data Mining,see the online help. Oracle Data Miner is distributed on Oracle Technology Network at

http://www.oracle.com/technology/index.html

For detailed information about the Oracle Data Mining PL/SQL interface, see Oracle

Database PL/SQL Packages and Types Reference.For detailed information about the SQL operators for data mining, see Oracle DatabaseSQL Reference.

For information about the data mining process in general, independent of bothindustry and tool, a good source is the CRISP-DM project (Cross-Industry StandardProcess for Data Mining) at

http://www.crisp-dm.org

ConventionsThe following text conventions are used in this document:

Note: Information to assist you in installing and using the DataMining demo programs is provided in Oracle Data Mining

Administrator's Guide.

Convention Meaning

boldface Boldface type indicates graphical user interface elements associatedwith an action, or terms defined in text or the glossary.

italic Italic type indicates book titles, emphasis, or placeholder variables forwhich you supply particular values.

monospace Monospace type indicates commands within a paragraph, URLs, codein examples, text that appears on the screen, or text that you enter.

8/13/2019 b28131.pdf

7/118

Beta Draft Use Cases 1-1

1Use Cases

This chapter provides examples that illustrate some of the capabilities of the SQLinterface to Oracle Data Mining. The examples show simple data mining techniquesfor obtaining valuable information about your customers. This kind of information isneeded to answer business questions, such as: How should we position a product?When should we introduce a new product? What pricing and sales promotionstrategies should we use?

This chapter contains the following topics:

Analyze Customer Demographics and Buying Patterns

Evaluate the Success of a Marketing Campaign

Use Predictive Analytics to Create a Customer Profile

Analyze Customer Demographics and Buying PatternsThe code fragments in this section show how you can apply data mining models tolearn more about your customers.

These examples use models that were previously created. They exist in OracleDatabase as database objects. For the sake of simplicity, we can assume that you

Note: The base interface to Oracle Data Mining is SQL, and most ofthe examples throughout this manual use SQL code. Oracle DataMining also supports a standards-based Java API, which is layered onthe SQL API. The Java API is described in "Data Mining Java API"onpage 2-5and in Chapter 8.

See Also:

Oracle Database SQL Referencefor the semantics of the SQL DataMining functions.

Oracle Database PL/SQL Packages and Types Referencefor thesemantics of the PL/SQL API.

Oracle Data Mining Java API Reference(javadoc) for the semanticsof the Java API.

The sample PL/SQL and Java programs provided with OracleData Mining. To obtain the sample programs, see the instructionsin Oracle Data Mining Administrator's Guide.

8/13/2019 b28131.pdf

8/118


1-2 Oracle Data Mining Application Developers Guide Beta Draft

created the models, that they exist within your schema, and that you are executing theSQL statements that apply them. You could have used the PL/SQL API, the Java API,or Oracle Data Miner to create the models.

The primary method for applying data mining models is by executing specialized SQLfunctions. There exists a separate family of SQL functions for prediction, clustering,and feature extraction. The SQL functions for data mining are described in Chapter 4,

"Scoring and Deployment". They are documented in detail in Oracle Database SQLReference.

Segment Customer Base and Predict Attrition

The query in Example 11divides customers into segments and predicts customerattrition.

The segments are created by a clustering model named clus_model. The model uses

heuristics to divide the customer database into groups that have similarcharacteristics.The prediction is generated by a classification model calledsvmC_model.

The model predicts the attrition probability for each customer in each segment. Thedata is returned by segment, ordered by the overall attrition propensity in thesegment.

Example 11 Predict Attrition for Customer Segments

SELECT count(*) as cnt,AVG(PREDICTION_PROBABILITY(svmC_model,

'attrite' USING *)) as avg_attrite, AVG(cust_value_score)

FROM customersGROUP BY CLUSTER_ID(clus_modelUSING *)

ORDER BY avg_attrite DESC;

The sophisticated analytics within this seemingly simple query let you see yourcustomer base within natural groupings based on similarities. The SQL data miningfunctions show you where your most loyal and least loyal customers are. Thisinformation can help you make decisions about how to sell your product. For example,if most of the customers in one segment have a high probability to attrite, you mighttake a closer look at the demographics of that segment when considering yourmarketing strategy.

The CLUSTER_IDfunction in Example 11applies the mining model clus_modelto

the customerstable. It returns the segment for each customer. The number ofsegments is determined by the clustering algorithm. Oracle Data Mining supports twoclustering algorithms: enhanced k-Means and O-Cluster.

The PREDICTION_PROBABILITYfunction in Example 11applies the mining modelsvmC_modeland returns the probability to attrite for each customer. Oracle DataMining supports a number of classification algorithms. In this case, the model nameimplies that Support Vector Machine was chosen as the classifier.

Note: The examples illustrate SQL syntax for different use cases, butthe code uses hypothetical object names. For example, a table might benamed my_table.

8/13/2019 b28131.pdf

9/118



Predict Missing Incomes

The sample query in Example 12returns the ten customers who are most likely toattrite, based on age, gender, annual_income, and zipcode.

In addition, since annual_incomeis often missing, the PREDICTIONfunction is usedto perform missing value imputation for the annual_incomeattribute. ThePREDICTIONfunction applies the regression model, svmR_model, to predict the most

likely annual income. The NVLfunction replaces missing annual income values withthe resulting prediction.

Example 12 Predict Customer Attrition and Transform Missing Income Values

SELECT * FROM ( SELECT cust_name, cust_contact_info FROM customers ORDER BY

PREDICTION_PROBABILITY(tree_model, 'attrite' USING age, gender, zipcode, NVL(annual_income,

PREDICTION(svmR_modelUSING *))as annual_income) DESC)

WHERE rownum < 11;

Find Anomalies in the Customer Data

These examples use a one-class SVM model to discover atypical customers (outliers),find common demographic characteristics of the most typical customers, and computethe probability that a new or hypothetical customer will be a typical affinity cardholder.

Find the Top 10 Outliers

Find the top 10 outliers -- customers that differ the most from the rest of thepopulation. Depending on the application, such atypical customers can be removed

from the data.

SELECT cust_id FROM (SELECT cust_idFROM svmo_sh_sample_prepared ORDER BY prediction_probability(SVMO_SH_Clas_sample, 0 using *) DESC, 1)WHERE rownum < 11;

Compute the Probability of a New Customer Being a Typical Affinity Card Member

Compute the probability of a new or hypothetical customer being a typical affinitycard holder. Normalization of the numerical attributes is performed on-the-fly.

WITH age_norm AS (

SELECT shift, scale FROM SVMO_SH_sample_norm WHERE col = 'AGE'), yrs_residence_norm AS ( SELECT shift, scale FROM svmo_sh_sample_norm WHERE col = 'YRS_RESIDENCE') SELECT prediction_probability(SVMO_SH_Clas_sample, 1 using

(44 - a.shift)/a.scale AS age, (6 - b.shift)/b.scale AS yrs_residence, 'Bach.' AS education, 'Married' AS cust_marital_status, 'Exec.' AS occupation, 'United States of America' AS country_name, 'M' AS cust_gender,

8/13/2019 b28131.pdf

10/118

Evaluate the Success of a Marketing Campaign


'L: 300,000 and above' AS cust_income_level, '3' AS houshold_size ) prob_typicalFROM age_norm a, yrs_residence_norm b;

Find the Demographics of a Typical Affinity Card Member

Find demographic characteristics of the typical affinity card members. These statistics

will not be influenced by outliers and are likely to provide a more truthful picture ofthe population of interest than statistics computed on the entire group of affinitymembers.

SELECT a.cust_gender, round(avg(a.age)) age,round(avg(a.yrs_residence)) yrs_residence,

count(*) cntFROM mining_data_one_class_v aWHERE prediction(SVMO_SH_Clas_sampleusing *) = 1GROUP BY a.cust_genderORDER BY a.cust_gender;

Save Mining Results in a Table

This example shows how to save the results of scoring a customer response model.

UPDATE CUST_RESPONSE_APPLY_UPDATESET prediction = prediction(CUST_RESPONSE19964_DT using *), probability = prediction_probability(CUST_RESPONSE19964_DT using *)

The table in question has all of the predictors, and has columns to hold the predictionand probability. The assumption is that any necessary transformations are embeddedin the model (otherwise the using clause would need to contain them).

Evaluate the Success of a Marketing CampaignThis example uses a classification model to predict who will respond to a marketing

campaign for DVDs and why.

1. First predict the responders.

This statement uses the PREDICTIONand PREDICTION_DETAILSfunctions toapply the model campaign_model.

SELECT cust_name,PREDICTION(campaign_model USING *)

AS responder,

PREDICTION_DETAILS(campaign_model USING *)AS reason

FROM customers;

2. Combine the predictions with relational data.

This statement combines the predicted responders with additional informationfrom the salestable. In addition to predicting the responders, it shows howmuch each customer has spent for a period of three months before and after thestart of the campaign.

SELECT cust_name,PREDICTION(campaign_model USING *) AS responder,

SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS pre_purch, SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN

8/13/2019 b28131.pdf

11/118



purchase_amt ELSE 0 END) AS post_purchFROM customers, salesWHERE sales.cust_id = customers.cust_idAND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005

GROUP BY cust_id, PREDICTION(campaign_model USING *);

3. Combine the predictions and relational data with multi-domain, multi-database

data.In addition to predicting responders, find out how much each customer has spenton DVDs for a period of three months before and after the start of the campaign.

SELECT cust_name,PREDICTION(campaign_model USING *) as responder,

SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS pre_purch, SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS post_purchFROM customers, sales, products@PRODDBWHERE sales.cust_id = customers.cust_idAND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005

AND sales.prod_id = products.prod_id

AND CONTAINS(prod_description, 'DVD') > 0GROUP BY cust_id, PREDICTION(campaign_model USING *);

4. Evaluate the effectiveness and significance of the information you have obtained.

Compare the success rate of predicted responders and non-responders withindifferent regions and across the company. Is the success statistically significant?

SELECT responder, cust_region, COUNT(*) AS cnt, SUM(post_purch pre_purch) AS tot_increase, AVG(post_purch pre_purch) AS avg_increase, STATS_T_TEST_PAIRED(pre_purch, post_purch) AS significanceFROM (SELECT cust_name, cust_region PREDICTION(campaign_model USING *) AS responder, SUM(CASE WHEN purchase_date < 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS pre_purch, SUM(CASE WHEN purchase_date >= 15-Apr-2005 THEN purchase_amt ELSE 0 END) AS post_purchFROM customers, sales, products@PRODDBWHERE sales.cust_id = customers.cust_idAND purchase_date BETWEEN 15-Jan-2005 AND 14-Jul-2005

AND sales.prod_id = products.prod_id AND CONTAINS(prod_description, 'DVD') > 0GROUP BY cust_id, PREDICTION(campaign_model USING *) )GROUP BY ROLLUP responder, cust_regionORDER BY 4 DESC;

Use Predictive Analytics to Create a Customer ProfilePredictive analytics, implemented in the DBMS_PREDICTIVE_ANALYTICSpackage,support routines for making predictions, assessing attribute importance, and creatingprofiles. Example 13shows how you could use predictive analytics to generate acustomer profile.

Note: With predictive analytics, you do notneed to create a model.The routine dynamically creates and applies a model, which does notpersist upon completion.

8/13/2019 b28131.pdf

12/118



The PROFILEstatement in Example 13analyzes customer data in thesupplementary_demographics table. The XML rules in the profile are written to aresults table with these columns.

Name Type------------------------- ------------PROFILE_ID NUMBERRECORD_COUNT NUMBER

DESCRIPTION XMLTYPE

The rule identifier is stored in PROFILE_ID. The number of cases described by therule is stored in RECORD_COUNT. The XML that describes the rule is stored in theDESCRIPTIONcolumn.

Example 13 Generate a Customer Profile

--describe the source dataSQL> describe sh.supplementary_demographicsName Null? Type----------------------------------------- -------- ----------------------------CUST_ID NOT NULL NUMBEREDUCATION VARCHAR2(21)OCCUPATION VARCHAR2(21)HOUSEHOLD_SIZE VARCHAR2(21)YRS_RESIDENCE NUMBERAFFINITY_CARD NUMBER(10)BULK_PACK_DISKETTES NUMBER(10)FLAT_PANEL_MONITOR NUMBER(10)HOME_THEATER_PACKAGE NUMBER(10)BOOKKEEPING_APPLICATION NUMBER(10)PRINTER_SUPPLIES NUMBER(10)Y_BOX_GAMES NUMBER(10)OS_DOC_SET_KANJI NUMBER(10)COMMENTS VARCHAR2(4000)

-- generate the profileSQL> BEGIN DBMS_PREDICTIVE_ANALYTICS.PROFILE(

DATA_TABLE_NAME => 'supplementary_demographics', TARGET_COLUMN_NAME => 'affinity_card', RESULT_TABLE_NAME => 'cust_profile', DATA_SCHEMA_NAME => 'sh'); END; /

-- View the profile rules in the output table.-- This query returns the first two rules in the profileSQL> SELECT a.profile_id, a.record_count, a.description.extract('/') description FROM cust_profile a

WHERE profile_id < 3 ;

8/13/2019 b28131.pdf

13/118

Beta Draft A Tour of the APIs 2-1

2A Tour of the APIs

This chapter provides an overview of the PL/SQL, SQL, and Java interfaces to OracleData Mining.

This chapter contains the following sections:

Data Mining PL/SQL Packages

Data Mining Data Dictionary Views

Data Mining SQL Functions

Data Mining Java API

Data Mining PL/SQL PackagesThe PL/SQL interface to Oracle Data Mining is implemented in three packages:

DBMS_DATA_MININGincludes routines for creating, testing, examining, andscoring mining models. Also includes routines for import/export.

DBMS_DATA_MINING_TRANSFORMincludes routines for transforming the data tomake it suitable for mining. Since Oracle Data Mining supports automatic data

transformations, you will not need to use this package unless you want toimplement specialized transformations.

DBMS_PREDICTIVE_ANALYTICSincludes routines for performing predictiveanalytics, an automated form of data mining.

DBMS_DATA_MINING

The DBMS_DATA_MININGpackage includes procedures for performing these activities:

Creating, dropping, and renaming a model CREATE_MODEL, DROP_MODEL,RENAME_MODEL.

Scoring a model APPLY.

RankingAPPLYresults RANK_APPLY.

Describing the attributes, transformations, and other information used internallyby a model (model transparency) GET_MODEL_DETAILSfunctions for eachalgorithm.

Note: Scoring is more typically performed using the Data MiningSQL functions described in "Data Mining SQL Functions"on page 2-4.

8/13/2019 b28131.pdf

14/118

Data Mining PL/SQL Packages


Computing test metrics for a classification model COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, and COMPUTE_ROC.

Exporting and importing models EXPORT_MODEL, IMPORT_MODEL.

Build ResultsThe CREATE_MODELprocedure creates a mining model. The attributes and datatransformations used internally are provided to you through theGET_MODEL_DETAILSfunctions for each supported algorithm. You can obtaininformation about mining models by querying data dictionary views, as described in"Data Mining Data Dictionary Views"on page 2-3.

Apply Results

TheAPPLYprocedure creates and populates a pre-defined table. The columns of thistable vary based on the particular mining function, algorithm, and target attribute type numerical or categorical.

The RANK_APPLYprocedure takes this results table as input and generates another

table with results ranked based on a top-N input. Classification models can also beranked based on cost. The column structure of this table varies based on the particularmining function, algorithm, and the target attribute type numerical or categorical.

Test Results for Classification Models

The COMPUTEroutines provided in DBMS_DATA_MININGare the most popularly usedmetrics for classification. They are not tied to a particular model they can computethe metrics from any meaningful data input as long as the column structure of theinput tables fits the specification of the apply results table and the targets tables.

Test Results for Regression Models

The most commonly used metrics for regression models are root mean square errorand mean absolute error. See Oracle Data Mining Conceptsfor information.

DBMS_DATA_MINING_TRANSFORM

You can configure a mining model to automatically perform the transformationsrequired by the algorithm. You can supplement the system-generated transformationswith additional transformations that you specify yourself. Alternatively, you can electto transform the data yourself instead of using the automatic transformation feature.

The routines in DBMS_DATA_MINING_TRANSFORMare convenience routines to assistyou in creating your own transformations. If these routines do not entirely suit yourneeds, you can write SQL to modify their output, or you can write your own routines.

To specify transformations for a model, pass a transformation list to theDBMS_DATA_MINING.CREATE_MODEL procedure. The STACKprocedures in theDBMS_DATA_MINING_TRANSFORM package build the transformation list.

Whether you specify some or all of the transformations, Oracle Data Mining associatesthe transformation definitions with the mining model object and reuses themwhenever the model is applied. You do notneed to separately transform the test orscoring data.

See Also: Oracle Database PL/SQL Packages and Types Reference


8/13/2019 b28131.pdf

15/118

Data Mining Data Dictionary Views


DBMS_PREDICTIVE_ANALYTICS

The DBMS_PREDICTIVE_ANALYTICSpackage automates the process of data mining.The user does not need to be aware of model building or scoring. All mining activitiesare handled internally by predictive analytics.

Predictive analytics routines prepare the data, build a model, score the model, andreturn the results of model scoring. Before exiting, they delete the model and

supporting objects.

Oracle predictive analytics support these routines:

EXPLAIN Ranks attributes in order of influence in explaining a target column.

PREDICT Predicts the value of a target column based on values in the inputdata.

PROFILE Generates rules that identify the records that have the same targetvalue.

Data Mining Data Dictionary ViewsBecause mining models are first-class database objects, you can obtain informationabout them from the data dictionary. The data dictionary views for Data Mining areavailable forALL_, USER_, and DBA_access.

ALL_MINING_MODELS

You can query the data dictionary viewALL_MINING_MODELSto obtain a list ofaccessible mining models.

SQL> describe all_mining_modelsName Null? Type----------------------------------------- -------- ----------------------------

OWNER NOT NULL VARCHAR2(30)MODEL_NAME NOT NULL VARCHAR2(30)MINING_FUNCTION VARCHAR2(30)ALGORITHM VARCHAR2(30)CREATION_DATE NOT NULL DATEBUILD_DURATION NUMBERMODEL_SIZE NUMBERCOMMENTS VARCHAR2(4000)

ALL_MINING_MODEL_ATTRIBUTES

You can query the data dictionary viewALL_MINING_MODEL_ATTRIBUTESto obtain

a list of the attributes for each accessible mining model.

SQL> describe all_mining_model_attributesName Null? Type----------------------------------------- -------- ----------------------------OWNER NOT NULL VARCHAR2(30)MODEL_NAME NOT NULL VARCHAR2(30)ATTRIBUTE_NAME NOT NULL VARCHAR2(30)ATTRIBUTE_TYPE VARCHAR2(11)DATA_TYPE VARCHAR2(12)DATA_LENGTH NUMBER


See Also: Oracle Database Reference

8/13/2019 b28131.pdf

16/118

Data Mining SQL Functions


DATA_PRECISION NUMBERDATA_SCALE NUMBERUSAGE_TYPE VARCHAR2(8)TARGET VARCHAR2(3)

ALL_MINING_MODEL_SETTINGSThe viewALL_MINING_MODEL_SETTINGSreturns the settings for each accessiblemining model.

SQL> describe all_mining_model_settingsName Null? Type----------------------------------------- -------- ----------------------------OWNER NOT NULL VARCHAR2(30)MODEL_NAME NOT NULL VARCHAR2(30)SETTING_NAME NOT NULL VARCHAR2(30)SETTING_VALUE VARCHAR2(4000)SETTING_TYPE VARCHAR2(7)

Data Mining SQL FunctionsThe built-in SQL functions for Data Mining implement scoring operations for modelsthat have already been created in the database. They provide the following benefits:

Models can be easily deployed within the context of existing SQL applications.

Scoring performance is greatly improved, especially in single row scoring cases,since these functions take advantage of existing query execution functionality.

Scoring results are pipelined, enabling some of the results to be returned quicklyto the user.

When applied to a given row of scoring data, classification and regression modelsprovide the best predicted value for the target and the associated probability of thatvalue occurring. The SQL functions for prediction are described in Table 21.



Note: SQL functions are built into the Oracle Database and areavailable for use within SQL statements. SQL functions should not beconfused with functions defined in PL/SQL packages.

Table 21 SQL Functions for Prediction

Function Description

PREDICTION Returns the best prediction for the target.

PREDICTION_BOUNDS Applies to generalized linear models. Returns the upperand lower bounds of the values predicted by a multivariatelinear regression model. For a binary logistic regressionmodel, returns the probabilities of the predictions.

PREDICTION_COST Returns a measure of the cost of false negatives and falsepositives on the predicted target.

PREDICTION_DETAILS Returns an XML string containing details that help explainthe scored row.

PREDICTION_PROBABILITY Returns the probability of a given prediction

8/13/2019 b28131.pdf

17/118



Applying a cluster model to a given row of scoring data returns the cluster ID and theprobability of that row's membership in the cluster. The SQL functions for clusteringare described in Table 22.

Applying a feature extraction model involves the mapping of features (sets ofattributes) to columns in the scoring data set. The SQL functions for feature extractionare described in Table 23.

Data Mining Java APIThe Oracle Data Mining Java API is an Oracle implementation of the JDM standard(JSR-73) Java API. It is a thin API developed using the rich in-database functionality ofOracle Data Mining.

The Oracle Data Mining Java API implements Oracle specific extensions to provide allthe data mining features available in the database. All extensions are designed to becompliant with the JDM standards extension framework. All the mining functions andalgorithms available in the database are exposed through the Oracle Data Mining JavaAPI.

Oracle Database 10.2.0.1 introduced the JDM 1.0 standard compliant API that replacedthe old Oracle proprietary Java API in the previous releases. Database 10.2.0.2patch-set release extended the JDM standards support by implementing Oracle DataMining Java API compatible with JDM 1.1.

In this release, the Oracle Data Mining Java API continues to be compatible with theJDM 1.1 and provides new data mining functionality in the database server as Oracleextensions. In this release new Oracle features include auto and embedded datapreparations, generalized linear models, transformation sequencing and taskdependency specifications.

PREDICTION_SET Returns a list of objects containing all classes in a binary ormulti-class classification model along with the associatedprobability (and cost, if applicable).

Table 22 SQL Functions for Clustering


CLUSTER_ID Returns the ID of the predicted cluster.

CLUSTER_PROBABILITY Returns the probability of a case belonging to a given cluster.

CLUSTER_SET Returns a list of all possible clusters to which a given casebelongs along with the associated probability of inclusion.

Table 23 SQL Functions for Feature Extraction


FEATURE_ID Returns the ID of the feature with the highest coefficient value.

FEATURE_SET Returns a list of objects containing all possible features alongwith the associated coefficients.

FEATURE_VALUE Returns the value of a given feature.

See Also: Oracle Database SQL Reference

Table 21 (Cont.) SQL Functions for Prediction


8/13/2019 b28131.pdf

18/118



The JDM Standard

JDM is an industry standard Java API for data mining, it is developed under the JavaCommunity Process (JCP). It defines Java interfaces that any vendor could implementfor their Data Mining Engine. It includes interfaces supporting mining functions suchas classification, regression, clustering, attribute importance and association; alongwith specific mining algorithms such as nave bayes, support vector machines,

decision tree, feed forward neural networks, and k-means.An overview of the Java packages defined by the standards is listed in Table 24. Formore details, refer to the Java documentation published with the standard athttp://www.jcp.org. In the Go to JSRbox, type in 73.

Oracle Extensions to JDM

Oracle extensions are defined to support the functionality that is not part of the JDMstandards. This section gives an overview of these extensions.

Oracle extensions have the following major additional features:

Feature Extraction function with Non-negative Matrix Factorization (NMF)algorithm

Table 24 JDM Standard Java Packages

Package Description

javax.datamining Defines objects supporting all JDM subpackages

javax.datamining.base Defines objects supporting many top-level mining objects. Introduced toavoid cyclic package dependencies.

javax.datamining.resource Defines objects that support connecting to the Data Mining ENgine and

executing tasks.javax.datamining.data Defines objects supporting logical and physical data, model signature,

taxonomy, category set and the generic super class category matrix.

javax.datamining.statistics Defines objects supporting attribute statistics.

javax.datamining.rules Defines objects supporting rules and their predicate components.

javax.datamining.task Defines objects supporting tasks for build, compute statistics, import,and export. Task has an optional subpackage for apply since apply isused mainly for supervised and clustering functions.

javax.datamining.association Defines objects supporting the build settings and model for association.

javax.datamining.clustering Defines objects supporting the build settings and model for clustering.

javax.datamining.attributeimportance Defines objects supporting the build settings and model for attribute

importance.

javax.datamining.supervised Defines objects supporting the build settings and models for supervisedlearning functions, specifically classification and regression, withcorresponding optional packages. It also includes a common test taskfor the classification and regression functions.

javax.datamining.algorithm Defines objects supporting the settings that are specific to algorithms.The algorithm package has optional sub packages for differentalgorithms.

javax.datamining.modeldetail Defines objects supporting details of various model representation.Model Details has optional sub packages for different model details.

See Also: Oracle Data Mining Java API Reference (javadoc).

8/13/2019 b28131.pdf

19/118



Generalized linear models algorithm for regression and classification functions

Oracle proprietary algorithm called Orthogonal Clustering for clustering function

Adaptive Bayes Network (ABN) algorithm for classification function

Automated and embedded transformations

Predictive analytics tasks

An overview of the Oracle extensions higher-level Java packages is provided inTable 25.

Principal Objects in the Oracle Data Mining Java API

In JDM, named objectsare objects that can be saved using the save method in theConnection. All the named objects are inherited from thejavax.datamining.MiningObjectinterface. A vendor can choose to persist thenamed objects either permanently (persistent objects) or only for the lifetime of the

connection object (transient objects).Table 26lists the JDM named objects supported by Oracle.

Physical Data Set

Physical data sets refer to the data to be used as input to data mining operations.Physical data set objects reference specific data via a URI, e.g., specifying a table or file.Oracle supports a table or view in the same database as a valid physical dataset URI.Syntax of the oracle physical dataset URI is as follows:

Data URI Syntax:

Table 25 Oracle Extensions Higher-Level Packages

Package Description

oracle.dmt.jdm.featureextraction Defines the objects related to the feature extraction function. Featureextraction supports the scoring operation.

oracle.dmt.jdm.algorithm.nmf Defines the objects related to the Non-negative Matrix Factorization (NMF)algorithm.

oracle.dmt.jdm.algorithm.glm

oracle.dmt.jdm.modeldetail.glm

Defines the objects related to the Generalized Linear Model (GLM)algorithm.

oracle.dmt.jdm.algorithm.ocluster Defines the objects related to the Orthogonal Clustering (O-Cluster)algorithm.

oracle.dmt.jdm.algorithm.abn Defines the objects related to the Adaptive Bayes Network (ABN) algorithm.

oracle.dmt.jdm.transform Defines the objects related to the transformations.

Table 26 JDM Named Objects Supported by Oracle

Persistent Objects Transient Objects Unsupported Objects

Model Apply Settings Logical Data

Build Settings Physical Dataset Taxonomy

Task

Cost Matrix

Test Metrics

Transformation sequence

8/13/2019 b28131.pdf

20/118



[schemaName.] tableName/viewName

The physical data set object can support multiple data representations. Oracle DataMining supports two types of data representation, i.e., single-record case and widedata. Refer to data mining concepts guide for more details. The Oracle implementationrequires users to specify the case-id column in the physical dataset.

A physical data set object is a transient object in the Oracle Data Mining Java API. It is

stored in the Connection object as an in-memory object.

Build Settings

A build settings object captures the high-level specification input for building a model.The API specifies mining functions: classification, regression, attribute importance,association, clustering, and feature extraction.

Build settings allows a user to specify the type of result desired without having tospecify a particular algorithm. Although a build settings object allows for thespecification of an algorithm and its settings, if the algorithm settings are omitted, theDME selects an algorithm based on the build settings and possibly characteristics ofthe data.

Build settings may also be validated for correct parameters using the verify method.Build settings is a persistent object in the API, it is stored as a table with the userspecified name in the user schema. This settings table is interoperable with thePL/SQL API. It is recommended not to modify the build settings table manually in theschema.

Task

The execute method in the Connection object is used to start an execution of a miningtask. Typically, mining operations are done using tables with millions of records, so theexecution of operations like building a model can take more time. JDM supportsasynchronous execution of mining tasks using DBMS_SCHEDULERin the database.Each mining task is stored as a DBMS_SCHEDULERJob object in the user schema. When

the user saves the task object, it creates a Job object and sets the object to be in theDISABLEDstate. When the user executes a task, it enables the job to start execution.

Applications that would like to use the DBMS_SCHEDULERfunctionality likescheduling mining tasks could use DBMS_SCHEDULERPL/SQL package provided inthe database.

To monitor tasks that are executed asynchronously, the execute method returns ajavax.datamining.ExecutionHandle object. It provides all the necessaryfeatures like waitForCompletion, getStatusmethods to retrieve the status detailsof the task.

Model

A model object is the result of applying an algorithm to data as specified in a buildsettings object.

Models can be used in several operations. They can be:

Inspected, for example to examine the rules produced from a decision tree orassociation

Tested for accuracy

Applied to data for scoring

Exported to an external representation such as native format or PMML

8/13/2019 b28131.pdf

21/118



Imported for use in the DMS

When a model is applied to data, it is submitted to the DMS for interpretation. AModel references its BuildSettingsobject as well as the Taskthat created it. as.

Test Metrics

A Test Metrics object is the result of testing a supervised model with test data. Based

on the type of mining function, different test metrics are computed. For classificationmodels, accuracy, confusion-matrix, lift, and receiver-operating characteristic can becomputed to access the model. Similarly for regression models, R-squared and RMSerrors can be computed.

Apply Settings

An apply settings object allows users to tailor the results of an apply task. It contains aset of ordered items. Output can consist of:

Data to be passed through to the output from the input dataset, e.g., key attributes

Values computed from the apply itself, e.g., score, probability and in the case ofdecision trees, rule identifiers

Multi-class categories for its associated probabilities, e.g., in a classification modelwith target favoriteColor, users could select the specific colors to receive theprobability that a given color is favorite.

Each mining function class defines a method to construct a default apply settingsobject. This simplifies the programmer's effort if only standard output is desired. Forexample, typical output for classifications apply would include the top prediction andits probability.

Transformation Sequence

A transformation sequence object represents the sequence of transformations that areto be performed as part of a mining operation. For example, a Support Vector Machine

model build involves outlier handling and normalization transformations. In additionto this, there can be new derived attribute creation and business transformations, andso on. Typicaly these transformations are resused for other model builds and henceapplications can save the transformation sequence as a named object in the API.

Transformation sequence can be used either to perform transformations as a separatetask or embed them to the modeling process by specifying it as one of the input objectsfor model building.

8/13/2019 b28131.pdf

22/118



8/13/2019 b28131.pdf

23/118

Beta Draft Creating a Mining Model 3-1

3Creating a Mining Model

This chapter provides information about creating, configuring, and testing models fordata mining. It also explains how to retrieve model details to learn about informationstored internally in the model.

This chapter contains the following topics:

Mining Model Objects

Settings and Defaults

Model Build

Mining Model ObjectsMining models are database objects. You can view a list of the mining models in yourschema by querying the USER_MINING_MODELSdata dictionary view. The columns ofUSER_MINING_MODELSare described in Table 31.

Sample ModelsA set of sample programs in SQL and Java are provided with Oracle Data Mining. Youcan find the sample programs in $ORACLE_HOME/rdbms/demoas shown.

> cd $ORACLE_HOME/rdbms/demo> ls dm*

To set up the environment for the sample programs, follow the instructions in OracleData Mining Administrator's Guide. Assuming you have run the appropriate scripts and

Table 31 USER_MINING_MODELS View

Column Data Type Description

model_name VARCHAR2(25) Name of the mining model.

mining_function VARCHAR2(30) The mining model function. See Oracle Data Mining Conceptsforan overview of mining functions.

algorithm VARCHAR2(30) The algorithm used by the mining model. See Oracle DataMining Conceptsfor algorithms used by the mining functions.

creation_date DATE The date on which the mining model was created.

build_duration NUMBER The duration of the mining model build process in seconds.

model_size NUMBER The size of the mining model in megabytes.

comments VARCHAR2(30) Results of a SQL COMMENTapplied to the mining model.

8/13/2019 b28131.pdf

24/118

Settings and Defaults


created a user called dmuser, you can run the sample programs. Each program createsa mining model.

Example 31lists the mining models belonging to dmuserand created by the DataMining sample SQL programs.

Example 31 List the Sample Mining Models

SQL> connect dmuser/dmuserSQL> set linesize 1000SQL> set pagesize 1000SQL> SELECT model_name, mining_function, algorithm FROM user_mining_models;MODEL_NAME MINING_FUNCTION ALGORITHM----------------------- ----------------------- ----------------------------

ABN_SH_CLAS_SAMPLE CLASSIFICATION ADAPTIVE_BAYES_NETWORKAI_SH_SAMPLE ATTRIBUTE_IMPORTANCE MINIMUM_DESCRIPTION_LENGTHAR_SH_SAMPLE ASSOCIATION_RULES APRIORI_ASSOCIATION_RULESDT_SH_CLAS_SAMPLE CLASSIFICATION DECISION_TREEKM_SH_CLUS_SAMPLE CLUSTERING KMEANSNMF_SH_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTORNB_SH_CLAS_SAMPLE CLASSIFICATION NAIVE_BAYES

OC_SH_CLUS_SAMPLE CLUSTERING O_CLUSTERGLMR_SH_REGR_SAMPLE REGRESSION GENERALIZED_LINEAR_MODELGLMC_SH_CLAS_SAMPLE CLASSIFICATION GENERALIZED_LINEAR_MODELSVMO_SH_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINESSVMC_SH_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINESSVMR_SH_REGR_SAMPLE REGRESSION SUPPORT_VECTOR_MACHINEST_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINEST_NMF_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR

Model Privileges

You need the CREATE MINING MODELprivilege to create models in your ownschema. You can perform any operation on models that you own. This includes

applying the model, adding a cost matrix, renaming the model, and dropping themodel.

You can perform specific operations on mining models in other schemas if you havethe appropriate system privileges. For example,CREATE ANY MINING MODELallows you to create models in other schemas. SELECT ANY MINING MODELallowsyou to apply models that reside in other schemas. You can add comments to models ifyou have the COMMENT ANY MINING MODELprivilege.

Settings and DefaultsA settings table is a relational table that provides configuration information for aspecific model. Create a settings table with the columns described in Table 32if youwant a model to have any nondefault characteristics. Supply the name of the settingstable when you create the model.

See Also: Oracle Data Mining Administrator's Guidefor a completediscussion of mining model security.

Table 32 Settings Table Required Columns

Column Name Data Type

setting_name VARCHAR2(30)

8/13/2019 b28131.pdf

25/118

Model Build


The values inserted into the setting_namecolumn are one or more of severalconstants defined in the DBMS_DATA_MININGpackage. Depending on what thesetting name denotes, the value for the setting_valuecolumn can be a predefinedconstant or the actual numerical value corresponding to the setting itself. Thesetting_valuecolumn is defined to be VARCHAR2. You can explicitly castnumerical inputs to string using the TO_CHAR()function, or you can rely on theimplicit type conversion provided by the Database.

You specify the model function when you create the model. Depending on thefunction, you may have a choice of algorithms. The settings described in Table 34apply to a mining function. Use these settings to specify the algorithm that the modelwill use, the location of costs, weights, and prior probabilities tables, and otherfunction-specific characteristics.

This example creates a settings table for an SVM classification model and insertsseveral values.

CREATE TABLE drugstore_settings ( setting_name VARCHAR2(30), setting_value VARCHAR2(4000))

BEGIN-- override the default for convergence tolerance for SVM classificationINSERT INTO drugstore_model_settings (setting_name, setting_value)VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081));COMMIT;

END;

You can create a settings table based on another model's settings usingGET_MODEL_SETTINGS.

BEGINCREATE TABLE my_new_model_settings ASSELECT setting_name, setting_valueFROM TABLE (DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));

END;

Model Build

The mining function is a required argument to the CREATE_MODELprocedure. It canhave any of the values listed in Table 33.

DBMS_DATA_MINING.CREATE_MODEL ( model_name IN VARCHAR2, mining_function IN VARCHAR2, data_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2 DEFAULT NULL, settings_table_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, settings_schema_name IN VARCHAR2 DEFAULT NULL,

setting_value VARCHAR2(4000)

See Also: Mining model settings are fully described in Oracle

Database PL/SQL Packages and Types Reference.

Table 32 (Cont.) Settings Table Required Columns

Column Name Data Type

8/13/2019 b28131.pdf

26/118

Model Build


transform_list IN DM_TRANSFORMS DEFAULT NULL);

Table 33 Mining Functions

Value Description

association Association is a descriptive mining function. An association

model identifies relationships and the probability of theiroccurrence within a data set.

Association models use the Apriori algorithm.

attribute_importance Attribute Importance is a predictive mining function. Anattribute importance model identifies the relativeimportance of an attribute in predicting a given outcome.

Attribute Importance models use the Minimal DescriptorLength algorithm.

classification Classification is a predictive mining function. Aclassification model uses historical data to predict acategorical target.

Classification models can use: Naive Bayes, Adaptive BayesNetwork, Decision Tree, Binary Logistic Regression, or

Support Vector Machine algorithms. The default is NaiveBayes.

The classification function can also be used for anomalydetection. In this case, the SVM algorithm with a null targetis used (One-Class SVM).

clustering Clustering is a descriptive mining function. A clusteringmodel identifies natural groupings within a data set.

Clustering models can use: k-Means or O-Clusteralgorithms. The default is k-Means.

feature_extraction Feature Extraction is a descriptive mining function. Afeature extraction model creates an optimized data set onwhich to base a model.

Feature extraction models use the Non-Negative MatrixFactorization algorithm.

regression Regression is a predictive mining function. A regressionmodel uses historical data to predict a numerical target.

Regression models can use Support Vector Machine orMultivariate Linear Regression. The default is SupportVector Machine.

8/13/2019 b28131.pdf

27/118

Model Build


Table 34 Data Mining Function Settings

Setting Name Setting Value

algo_name Classification:

algo_naive_bayes(Default)

algo_support_vector_machines(Use this setting for both SVM and One-Class SVM

algo_adaptive_bayes_network

algo_decision_tree

algo_generalized_linear_model

Regression:

algo_support_vector_machines(Default)

algo_generalized_linear_model

Association Rules:

algo_apriori_association_rules

Clustering: algo_kmeans(Default)

algo_o_cluster

Feature Extraction:

algo_nonnegative_matrix_factor

Attribute Importance:

algo_ai_mdl

clas_cost_table_name The name of a relational table that specifies a cost matrix to beused by a Decision Tree model at build time. .

The cost matrix table must be present in the schema of themodel.

class_weights_table_name

Name of a relational table that specifies the weights for a GLMclassification model.

The weights table must be present in the schema of the model.

clas_priors_table_name The name of a relational table that specifies prior probabilities.Priors can be used by NB, ABN, and SVM (classification only?)

The prior probabilities table must be present in the schema ofthe model.

clus_num_clusters Number of clusters generated by a clustering algorithm.

Default is 10.

feat_num_features Number of features to be extracted by the NMF algorithm.

Default value estimated from the data by the algorithm.

asso_max_rule_length Maximum rule length for AR algorithm.

Default is 4.

asso_min_confidence Minimum confidence value for AR algorithm

Default is 0.1.

asso_min_support Minimum support value for AR algorithm

Default is 0.1.

8/13/2019 b28131.pdf

28/118

Model Build


8/13/2019 b28131.pdf

29/118

Beta Draft Scoring and Deployment 4-1

4Scoring and Deployment

This chapter will discuss how models are applied to the new data and how they can bedeployed to applications and reporting systems.

NOTE FOR BETA: This chapter is basically an outline at this stage. The sectionheadings convey the general subject matter.


Applying a Mining Model

Real Time Scoring

Deployment and Model Management

Applying a Mining ModelFor each function/algorithm, describe the scoring process and results.

Attribute importance and association models cannot be applied.

Examples of SQL Data Mining scoring functions and DBMS_DATA_MINING.APPLYfor model scoring.

Real Time ScoringMaximizing data mining's potential means pushing it from the exclusive realm oftechnical specialists into the hands of front-line workers, such as telemarketingpractitioners and order-entry personel. This is where point-of-sales fraud detectionand prevention needs to occur. Front-line workers, using operational systems withintegrated, seamless access to predictive models and extracted rule sets, can make useof the "business intelligence" offered by data mining in real-time, while they are stilldealing with individual customers.

The ability to extend the reach of business intelligence, through knowledge developedby the data mining process, increases the value of a company's information

warehouse. It also enables front-line personnel to detect potential fraud more quicklyand reliably.

Predictive models and rule sets can be integrated into live production and operationalsystems enabling real-time and near real-time interrogation of transactions. Byaugmenting human capabilities, the models and rule sets help front-line workers makebetter analytical decisions. Detection of fraudulent behavior, identification of potentialliabilities, such as bankruptcy, and recognition of better marketing and sellingopportunities are now possible, while the transaction occurs.

8/13/2019 b28131.pdf

30/118

Deployment and Model Management


A call center will place and receive thousands of calls every week, perhaps every day.Each call represents a potential sale. However, each call also represents the potentialfor fraud. Billions of dollars are lost every single year to credit card fraud in the callcenter environment.

Read-time data mining can detect and prevent fraud. For example, an operator takesan order, and within seconds, has scanned terabytes of information, to see if the credit

card presented is valid, as well as to determine if its proposed use matches one ofhundreds of fraud models.

Deployment and Model ManagementModel deployment is the process required to use a model in a target environment. Thismight be:

extracting the model details (rules of a decision tree or attribute importanceranking), to produce reports

scoring data using the model in the database where it was built, either for batch orreal-time results

moving the model from the system where it was built to the system where it willbe used for scoring (export/import)

8/13/2019 b28131.pdf

31/118

Beta Draft The Case Table 5-1

5The Case Table

This chapter explains how to create the data sets for building, testing, and applyingyour models. Algorithm-specific and application-specific data transformations arediscussed in Chapter 6.


Requirements

About Attributes

Nested Data

Missing Values and Sparse Data

RequirementsThe data that you wish to mine must be defined within a single table or view. Theinformation for each record must be stored in a separate row. The data records arecommonly called cases. Each case can be identified by a unique case ID. The table orview itself is referred to as a case table.

TheCUSTOMERS

table in theSH

schema is an example of a table that could be used formining. All the information for each customer is contained in a single row. The case IDis the CUST_IDcolumn. The rows listed in Example 51are selected fromSH.CUSTOMERS.

Example 51 Sample Case Table

SQL> select cust_id, cust_gender, cust_year_of_birth,cust_main_phone_number from sh.customers where cust_id < 11;

CUST_ID CUST_GENDER CUST_YEAR_OF_BIRTH CUST_MAIN_PHONE_NUMBER------- ----------- ---- ------------- -------------------------1 M 1946 127-379-89542 F 1957 680-327-1419

3 M 1939 115-509-33914 M 1934 577-104-27925 M 1969 563-667-77316 F 1925 682-732-72607 F 1986 648-272-61818 F 1964 234-693-87289 F 1936 697-702-261810 F 1947 601-207-4099

8/13/2019 b28131.pdf

32/118

About Attributes


Column Data Types

The columns of the case table hold the attributesthat describe each case. InExample 51, the attributes are: CUST_GENDER, CUST_YEAR_OF_BIRTH, andCUST_MAIN_PHONE_NUMBER. The attributes are the predictors in a supervised modeland the descriptors in an unsupervised model. The case ID, CUST_ID, can be viewedas a special attribute. It is not a predictor or a descriptor.

Oracle Data Mining accepts the following column data types:

VARCHAR2, CHARNUMBER, FLOATDM_NESTED_CATEGORICALS

DM_NESTED_NUMERICALS

See "Converting the Column Data Type"on page 5-5for information about convertingthe data type if necessary.

The nested types are described in "Nested Data"on page 5-7. The case ID columncannot have a nested type.

Data for Build, Test, and ApplyYou need two data sets to build a predictive model: one for building (training) themodel and one for testing the model. You only need one data set to build a descriptivemodel. Once the model has been built, you can apply it to the data of interest.

Each data set consists of a set of cases (rows). It is often convenient to derive the builddataand test datafrom the same data set. For example, you might select 60% of therows for building the model and 40% for testing the model.

The data to which you apply the model is called the apply dataor scoring data. TheAPPLYprocess matches column names in the scoring data with the names of the

columns that were used to build the model. TheAPPLYprocess does not require all thecolumns to be present in the scoring data. If the data types do not match, Oracle Data

Mining attempts to perform type coercion. For example, if a column calledPRODUCT_RATINGis VARCHAR2in the build data but NUMBERin the scoring data,Oracle Data Mining will effectively apply a TOCHAR()function to convert it.

The column in the test or scoring data must undergo the same transformations as thecorresponding column in the build data. For example, if theAGEcolumn in the builddata was transformed from numbers to the values CHILD,ADULT, and SENIOR, thentheAGEcolumn in the scoring data must undergo the same transformation so that themodel can properly evaluate it. Many transformations can be associated internallywith the attributes and automatically applied to the test or scoring data. See Chapter 6,"Data Transformation".

About AttributesAn attribute is an item of information in the data set being mined. In predictivemodels, attributes are the predictors that affect a given outcome. In descriptivemodels, attributes are the items of information being analyzed for natural groupingsor associations. A table of employee data might contain attributes such as job title, dateof hire, salary, age, sex, and so on.

8/13/2019 b28131.pdf

33/118

About Attributes


Physical Attributes and Logical Attributes

Physical attributesare the columns in the build data or scoring data. Logicalattributesare the data representations used internally by the model. The physicalattributes typically undergo various transformations before being processed by thealgorithm, thus the logical and physical attributes for a given model might be quitedifferent.

Physical Attributes

The physical attributes for a mining model can be obtained from the system catalogview USER/ALL/DBA_MINING_MODEL_ATTRIBUTES . When used with the USERprefix, this view returns all the physical attributes in the user's schema. When usedwith theALLprefix, it returns all the physical attributes accessible to the current user.The DBAprefix is only available to DBAs.

The query in Example 52returns the physical attributes for the modelT_SVM_CLAS_SAMPLE, an SVM model generated by one of the Data Mining sampleprograms.

Example 52 ALL_MINING_MODEL_ATTRIBUTES

SQL> select model_name, attribute_name, attribute_type, data_type, target from user_mining_model_attributes

where model_name = 'T_SVM_CLAS_SAMPLE';MODEL_NAME ATTRIBUTE_NAME ATTRIBUTE_TYPE DATA_TYPE TARGET------------------- --------------------- --------------- ----------- ------T_SVM_CLAS_SAMPLE COMMENTS NUMERICAL NESTED TABLE NO

T_SVM_CLAS_SAMPLE AGE NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE CUST_MARITAL_STATUS CATEGORICAL VARCHAR2 NOT_SVM_CLAS_SAMPLE COUNTRY_NAME CATEGORICAL VARCHAR2 NOT_SVM_CLAS_SAMPLE CUST_INCOME_LEVEL CATEGORICAL VARCHAR2 NOT_SVM_CLAS_SAMPLE EDUCATION CATEGORICAL VARCHAR2 NOT_SVM_CLAS_SAMPLE OCCUPATION CATEGORICAL VARCHAR2 NOT_SVM_CLAS_SAMPLE HOUSEHOLD_SIZE CATEGORICAL VARCHAR2 NOT_SVM_CLAS_SAMPLE YRS_RESIDENCE NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE BULK_PACK_DISKETTES NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE FLAT_PANEL_MONITOR NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE HOME_THEATER_PACKAGE NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE BOOKKEEPING_APPLICATION NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE PRINTER_SUPPLIES NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE Y_BOX_GAMES NUMERICAL NUMBER NO

T_SVM_CLAS_SAMPLE OS_DOC_SET_KANJI NUMERICAL NUMBER NOT_SVM_CLAS_SAMPLE CUST_GENDER CATEGORICAL CHAR NOT_SVM_CLAS_SAMPLE AFFINITY_CARD NUMERICAL NUMBER YES

The DATA_TYPEcolumn in Example 52displays the data type of thephysical attribute.TheATTRIBUTE_TYPEcolumn displays the type of the logical attributeand indicateshow it will be processed internally by the model.

The data type NESTED TABLErefers to either of the nested data types:DM_NESTED_NUMERICALSor DM_NESTED_CATEGORICALS. (See "Nested Data"onpage 5-7.)

Note: In this manual, the term "attributes" usually refers topredictors; the case ID, which plays a special role, is not generallyincluded.

8/13/2019 b28131.pdf

34/118

About Attributes


Logical AttributesA logical attribute is an object used internally by the model. It comprises the dataderived from a physical attribute, combined with embedded transformations andmetadata specific to the algorithm.

Logical attributes are either categorical(character data) or numerical(numeric data).

Embedded transformation and reverse transformation information is associated withthe logical attribute. For example, the physical attribute SALARYmight beautomatically transformed into 10 bins. The logical attribute SALARYthen has integervalues 1 10, not the original dollar values. The reverse transformation changes theinteger values into dollar value bin ranges. Reverse transformation expressions enablemodel transparencyfor model details and model scoring. Transparency is discussed inChapter 6. See also "Model Details"on page 5-6.

Categoricals

Categorical attributes have values that belong to a finite number of discrete categoriesor classes. There is no implicit order associated with the values. Some categoricals arebinary: They have only two possible values, such as yesor no, or maleor female.The term multiclassis used to describe models when the categorical target has more

than two values. For example, a target of clothing sizes could have the values small,medium, or large.

Oracle Data Mining interprets CHAR, VARCHAR2, and DM_NESTED_CATEGORICALSdata types as categorical attributes.

Numericals

Numerical attributes can theoretically have an infinite number of values. The valueshave an implicit order, and the differences between them are also ordered. Attributeswith a numeric data type, such as salary, are typically numerical attributes.

Oracle Data Mining interprets NUMBER, FLOAT, and DM_NESTED_NUMERICALSdatatypes as numerical attributes.

Target Attribute

When a model requires a targetattribute, it must be identified before the model isbuilt. The target attribute contains the historical values used to train (build) the model.When the model is applied, the target contains the predicted values generated by themodel.

For example, a model based on a table of demographic data about customers mightpredict which customers are most likely to respond to a promotion. The target would

Note: The *_MINING_MODEL_ATTRIBUTESview displays thephysical attributes that were used to construct the model. Some of thecolumns might have been eliminated through transformations oralgorithmic processing; these would not be included in the view.

Note: Oracle Data Mining internally processes the rows of nestedtables as logical attributes. Each row specifies an attribute and itsvalue. See "Scoping of Logical Attribute Name"on page 5-6and"Nested Data"on page 5-7.

8/13/2019 b28131.pdf

35/118

About Attributes


be the RESPONSEattribute, which would have the values yesor no, depending onwhether or not that customer was likely to respond.

The target attribute for a classification model is categorical. The target attribute for aregression model is numerical. The target for an attribute importance model is eithercategorical or numerical.

Clustering, feature extraction, and anomaly detection models do not use a target

attribute.

You can query the *_MINING_MODEL_ATTRIBUTESview to find the target for a givenmodel, as shown in Example 52.

Converting the Column Data Type

You must convert the data type of a column if its type is not supported by Oracle DataMining or if its type will cause Oracle Data Mining to interpret it incorrectly. Forexample, zip codes identify different postal zones; they do not typically imply order. Ifthe zip codes are stored in a numeric column and you want it to be treated as acategorical attribute, convert it to character data. You can do this using the TO_CHARfunction to convert the digits 1-9 and the LPADfunction to retain the leading 0, if there

is one.LPAD(TO_CHAR(ZIPCODE),5,'0')

Date Data

The Oracle Data Mining PL/SQL and Java APIs do not support DATEand TIMESTAMPdata. If your data includes date data, convert it to one of the numeric or character datatypes supported by Data Mining.

In most cases, DATEand TIMESTAMPshould be converted to NUMBER, but you shouldevaluate each case individually. A TIMESTAMPcolumn should generally be convertedto a number since it represents a unique point in time.

Alternatively, a column of dates in a table of annual sales data might indicate the

month when a product was sold. This DATEcolumn would be converted to VARCHAR2and treated as a categorical. You can use the TO_CHARfunction to convert a DATEdatatype to VARCHAR2.

You can convert dates to numbers by selecting a starting date and subtracting it fromeach date value. Another approach would be to parse the date and distribute itscomponents over several columns. This approach is used by Predictive Analytics,which doessupport DATEand TIMESTAMPdata types.

Model Signature

The model signature is a list of the physical attributes used to construct the model.Some or all of the physical attributes in the signature should be present for scoring. Ifsome columns are not present, they are disregarded. If columns with the same namesbut different data types are present, the model attempts to convert the data type.

The target and case ID columns are not included in the signature.

The columns in the model signature, plus the target column, are listed in the*_MINING_MODEL_ATTRIBUTEScatalog view. (See "Physical Attributes"on page 5-3.)

See Also: Oracle Database SQL Referencefor information on data typeconversion. See Oracle Database PL/SQL Packages and Types Referenceforinformation about date data types inDBMS_PREDICTIVE_ANALYTICS.

8/13/2019 b28131.pdf

36/118

About Attributes


Scoping of Logical Attribute Name

The logical attribute name consists of two parts: a column name, and a subcolumnname.

column_name[.subcolumn_name]

The column_namecomponent is the name of the physical attribute. It is present in all

logical attribute names. Nested attributes also have a subcolumn_namecomponent asshown in Example 53.

Example 53 A Nested Column and Corresponding Logical Attributes

The nested column SALESPRODhas three rows.

SALESPROD(ATTRIBUTE_NAME, VALUE)--------------------------------((PROD1, 300),(PROD2, 245),(PROD3, 679))

The name of the physical attribute is SALESPROD. Its associated logical attributes have

the following names.SALESPROD.PROD1SALESPROD.PROD2SALESPROD.PROD3

Model Details

Model details reveal information about logical attributes and their internal treatmentby the algorithm. There is a separate GET_MODEL_DETAILStable function for eachalgorithm.

Reverse transformation expressions are applied to model details, ensuring modeltransparency. See Chapter 6for a discussion of model transparency. See also "Logical

Attributes"on page 5-4Example 54shows the definition of the GET_MODEL_DETAILStable function forGeneralized Linear Models. The PIPELINEDkeyword instructs Oracle Database toreturn the rows as single values instead of returning all the rows as a single value.

Example 54 Model Details for a GLM Model

The syntax of the GET_MODEL_DETAILSfunction for Generalized Linear Models isshown as follows.

DBMS_DATA_MINING.GET_MODEL_DETAILS_GLM ( model_name VARCHAR2)RETURN DM_GLM_COEFF_SET PIPELINED;

The function returns a DM_GLM_COEFF_SET, a virtual table. The columns are themodel details. There is one row for each logical attribute in the specified model. Thecolumns are described as follows.

Column Name Column Type------------------------------------------------------class VARCHAR2(4000)attribute_name VARCHAR2(4000)attribute_subname VARCHAR2(4000)

8/13/2019 b28131.pdf

37/118

Nested Data


attribute_value VARCHAR2(4000)coefficient NUMBERstd_error NUMBERtest_statistic NUMBERp_value NUMBERvif NUMBERstd_coefficient NUMBERlower_coeff_limit NUMBERupper_coeff_limit NUMBERexp_coefficient BINARY_DOUBLEexp_lower_coeff_limit BINARY_DOUBLEexp_upper_coeff_limit BINARY_DOUBLE

Nested DataOracle Data Mining requires a case table in single-record case format, with each recordin a separate row. What if some or all of your data is in multi-record case format, witheach record in several rows? What if you want one attribute to represent a series orcollection of values, such as a student's test scores or the products purchased by acustomer?

This kind of one-to-many relationship is usually implemented as a join between tables.For example, you might define one column in your customer table as a join to a salestable and thus associate a list of products purchased with each customer.

Oracle Data Mining supports dimensioned data through nested columns. To includedimensioned data in your case table, create a view and cast the joined data to one ofthe Data Mining nested table types. Each row in the nested column consists of anattribute name/value pair. Oracle Data Mining internally processes each nested row asa separate attribute.

Nested Object TypesOracle Database supports user-defined data types that make it possible to modelreal-world entities as objects in the database. Collection typesare object data types formodeling multi-valued attributes. Nested tables are collection types. Nested tables canbe used anywhere that other data types can be used. You can learn more aboutcollection types in Oracle Database Application Developer's Guide - Object-RelationalFeatures.

Oracle Data Mining supports two nested object types: one for numerical attributes, theother for categorical attributes.

DM_NESTED_NUMERICALS

The DM_NESTED_NUMERICALSobject type is a nested table of numerical attributes.Each row is a single DM_NESTED_NUMERICAL.

The nested numerical attributes (rows) are described as follows.

SQL> describe dm_nested_numericalName Null? Type----------------------------------------- -------- ----------------------------ATTRIBUTE_NAME VARCHAR2(4000)VALUE NUMBER

The collection of numerical attributes (table) is described as follows.

See Also: Sample code for converting to a nested table in "Example:Use SQL to Transform a List to a Nested Column"on page 5-9.

8/13/2019 b28131.pdf

38/118

Nested Data


SQL> describe dm_nested_numericalsDM_NESTED_NUMERICALS TABLE OF SYS.DM_NESTED_NUMERICALName Null? Type----------------------------------------- -------- ----------------------------ATTRIBUTE_NAME VARCHAR2(4000)VALUE NUMBER

DM_NESTED_CATEGORICALSThe DM_NESTED_CATEGORICALSobject type is a nested table of categorical attributes.Each row is a single DM_NESTED_CATEGORICAL.

The nested categorical attributes (rows) are described as follows.

SQL> describe dm_nested_categoricalName Null? Type----------------------------------------- -------- ----------------------------ATTRIBUTE_NAME VARCHAR2(4000)VALUE VARCHAR2(4000)

The collection of categorical attributes (table) is described as follows.

SQL> describe dm_nested_categoricals

DM_NESTED_CATEGORICALS TABLE OF SYS.DM_NESTED_CATEGORICALName Null? Type----------------------------------------- -------- ----------------------------ATTRIBUTE_NAME VARCHAR2(4000)VALUE VARCHAR2(4000)

Example: Transform a List to a Nested Column

Example 55shows the data for six customers. Each customer purchased severalproducts. This data set is not suitable for mining, because the information for eachcustomer (case) is stored in several rows.

Example 55 Multi-Record Case

CUST_ID CUST_INCOME_LEVEL CUST_CR PRODUCT QUANTITY------- -------------------- --------- -------------------- --------100990 K: 250,000 - 299,999 7000 Extension Cable 1100990 K: 250,000 - 299,999 7000 Mouse Pad 1100990 K: 250,000 - 299,999 7000 SIMM- 16MB PCMCIAII card 1100990 K: 250,000 - 299,999 7000 Standard Mouse 1100991 J: 190,000 - 249,999 11000 SIMM- 16MB PCMCIAII card 1100991 J: 190,000 - 249,999 11000 Standard Mouse 1100992 L: 300,000 and above 9000 18" Flat Panel Graphics Monitor 1100992 L: 300,000 and above 9000 Extension Cable 1100992 L: 300,000 and above 9000 Mouse Pad 1100992 L: 300,000 and above 9000 Standard Mouse 1100993 K: 250,000 - 299,999 7000 Multimedia speakers- 3" cones 1

100993 K: 250,000 - 299,999 7000 O/S Documentation Set - Englis 1100993 K: 250,000 - 299,999 7000 SIMM- 16MB PCMCIAII card 1100994 I: 170,000 - 189,999 10000 Model SM26273 Black Ink Cartri 1100994 I: 170,000 - 189,999 10000 Mouse Pad 1100994 I: 170,000 - 189,999 10000 SIMM- 16MB PCMCIAII card 1100995 J: 190,000 - 249,999 9000 18" Flat Panel Graphics Monitor 1100995 J: 190,000 - 249,999 9000 CD-RW, High Speed Pack of 5 1100995 J: 190,000 - 249,999 9000 Extension Cable 1100995 J: 190,000 - 249,999 9000 Mouse Pad 1100995 J: 190,000 - 249,999 9000 Standard Mouse 1

8/13/2019 b28131.pdf

39/118

Nested Data


Example 56shows how the list of products would look as a nested table of typeDM_NESTED_NUMERICALS. This table is suitable for mining, because the informationfor each customer is specified in a single row.

Example 56 Single-Record Case

CUST_ID CUST_INCOME_LEVEL CUST_CR CUSTPRODS(ATTRIBUTE_NAME, VALUE)------- -------------------- ------- --------------------------------100990 K: 250,000 - 299,999 7000 ('Extension Cable', 1),

('Mouse Pad', 1),('SIMM- 16MB PCMCIAII card', 1),('Standard Mouse', 1))

100991 J: 190,000 - 249,999 11000 ('SIMM- 16MB PCMCIAII card', 1),('Standard Mouse', 1))

100992 L: 300,000 and above 9000 ('18" Flat Panel Graphics Monitor', 1), ('Extension Cable', 1),

('Mouse Pad', 1),('Standard Mouse', 1))

100993 K: 250,000 - 299,999 7000 ('Multimedia speakers- 3" cones', 1),('O/S Documentation Set - Englis', 1),

('SIMM- 16MB PCMCIAII card', 1))

100994 I: 170,000 - 189,999 10000 ('Model SM26273 Black Ink Cartri', 1),('Mouse Pad', 1),('SIMM- 16MB PCMCIAII card', 1))

100995 J: 190,000 - 249,999 9000 ('18" Flat Panel Graphics Moni to',1),('CD-RW, High Speed Pack of 5', 1),('Extension Cable', 1),('Mouse Pad',1),('Standard Mouse', 1))

Example: Use SQL to Transform a List to a Nested Column

Example 57shows how to define a nested column for data mining. This example usestransactional market basket data.

Example 57 Convert to a Nested Column

The view SALES_TRANS_CUSTprovides a list of transaction IDs to identify eachmarket basket and a list of the products in each basket.

SQL> describe sales_trans_custName Null? Type----------------------------------------------------- -------- ----------------TRANS_ID NOT NULL NUMBERPROD_NAME NOT NULL VARCHAR2(50)QUANTITY NUMBER

The following SQL statement transforms this data to a column of typeDM_NESTED_NUMERICALSin a view called SALES_TRANS_CUST_NESTED. This view

can be used as a case table for mining.

SQL> CREATE VIEW sales_trans_cust_nested AS SELECT trans_id, CAST(COLLECT(DM_NESTED_NUMERICAL( prod_name quantity)) AS DM_NESTED_NUMERICALS) custprods FROM sales_trans_cust GROUP BY trans_id;

This query returns two rows from the transformed data.

8/13/2019 b28131.pdf

40/118

Nested Data


SQL> select * from sales_trans_cust_nestedwhere trans_id < 101000

and trans_id > 100997;TRANS_ID CUSTPRODS(ATTRIBUTE_NAME, VALUE)------- ------------------------------------------------100998 DM_NESTED_NUMERICALS (DM_NESTED_NUMERICAL('O/S Documentation Set - English', 1)100999 DM_NESTED_NUMERICALS (DM_NESTED_NUMERICAL('CD-RW, High Speed Pack of 5', 2), DM_NESTED_NUMERICAL('External 8X CD-ROM', 1),

DM_NESTED_NUMERICAL('SIMM- 16MB PCMCIAII card', 1))

Market Basket Data

b28131.pdf

Documents

Transcript of b28131.pdf