ECLT5810 E-Commerce Data Mining Techniques …eclt5810/EMlab/em-overview-2016.pdfEnterprise Miner...

Enterprise Miner Overview 1

ECLT5810 E-Commerce Data Mining Techniques Overview of SAS Enterprise Miner

1. Start Enterprise Miner

I. Select English Windows
II. Under Windows, select Start Menu -> SAS -> SAS 9.3
III. Type “miner” in the command prompt of SAS


2. Enterprise Miner (EM) 4.3

I. Several windows (SAS 9.3 Interface):

1. Log - logging messages of the SAS program
2. Editor - SAS program editor
3. Output - output of the SAS program
4. Explorer
5. Results


II. SAS Enterprise Miner

- Some Terminology
  - Node: EM tool to perform data mining
  - Connection: data flow between nodes
  - Project Flow Diagram (PFD): diagram that specifies the data mining process

- The Interface

Project Navigator - the three tabs:

- Diagrams: lists the current project and diagrams
- Tools: contains a palette of EM tools for constructing the process flow diagram
- Reports: contains the HTML report entries generated by the Reporter node

Diagram Workspace - building, editing, running and saving the process flow diagram

Tools Bar
- contains commonly used EM nodes
- 7 categories:

1. Sample: e.g. Input Data Source, Sampling, Data Partition
2. Explore: e.g. Distribution Explorer, Multiplot, Variable Selection
3. Modify: e.g. Transform Variables, Filter Outliers, Clustering
4. Model: e.g. Regression, Tree, Neural Network
5. Assess: e.g. Assessment, Reporter
6. Scoring: e.g. Score
7. Utility: e.g. SAS Code

Progress Indicator - displays the progress of the EM task

Message Panel - displays messages about the execution of the EM task

Connection Status Indicator - displays the remote host name and the connection status for a client-server project

Command Prompt - command line input

Pull-Down Menus
- functions provided by different nodes and windows
- change dynamically with different nodes and windows

Toolbar
- specific functions provided by different nodes and windows
- change dynamically with different nodes and windows

Pop-up Menus
- common operations for the selected nodes (open, run, results, …)
- invoked by clicking the right mouse button on the node

3. Five Processes in Data Mining

SAS defines data mining as the following tasks:

1. Sample the data by creating one or more data tables. The samples should be large enough to contain the significant information, yet small enough to process.
   - define data source

2. Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
   - find mining direction / decide mining tools

3. Modify the data by creating, selecting, and transforming the variables to focus the model selection process.
   - preparation before modeling (normalization, missing value handling, …)

4. Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
   - the mining process

5. Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
   - evaluation

The SEMMA process: Sample -> Explore -> Modify -> Model -> Assess


4. Creating a New Project

Project: a collection of Enterprise Miner process flow diagrams and information
- only one project (but many diagrams) can be edited at the same time
- different data mining task -> different project

I. Create a new project: pull-down menu File -> New -> Project

II. Project structure

Project Location /
- EMDATA /
- EMPROJ /
- USERS /
- REPORTS /
- *.dmd (process flow diagram)
- project_name.dmp (*.dmp: project file)

EMDATA directory: contains data sets and large temporary files created when running processes

EMPROJ directory: contains information for each diagram and its nodes (settings and target profile)

USERS subdirectory: contains files that represent the users currently sharing the project

REPORTS directory: stores the HTML reports generated by the Reporter node. Each report has its own subdirectory, and the name of the subdirectory defines the name of the report.

5. Sample

I. Input Data Source

Specify the source of data and details about the variables in the data set

Typically the first node of a process flow diagram

View distribution of each variable

Create target profile for each target variable


- Data Tab
  - Specify the data set from a SAS destination
    - e.g. SAMPSIO.DMAGECR (All: German Credit Data)
  - Set the data set role / size
    - RAW: default
    - TRAIN: training set, used for model fitting
    - VALIDATE: assess the adequacy of the model, fine-tune model nodes (Regression, Tree and Neural Network)
    - TEST: obtain a final, unbiased estimate of the error of the model
    - SCORE: data set to be processed by the trained model

- Variables Tab
  - Define attributes (meta information)
  - Set the target profile (covered later)
    - Profit / loss, prior probability, …
  - Set model role, measurement, type and format, and view the distribution of different variables

    e.g. change the model role of variable A to target
    - right-click the Model Role of A to invoke the pop-up menu
    - select Set Model Role -> target

    e.g. the attribute GOOD_BAD in the German Credit Data can be set to target.

- Common model roles include:

target ‐ the variable required to be predicted.

input ‐ a variable that is used to predict the target.

rejected ‐ a variable that is excluded from the analysis in the process flow.

id - an identifier variable for every transaction in the data set. The Associations node requires it for association discovery, while other modeling nodes exclude it.

- View distribution of each variable

Right click the selected variable and choose View Distribution

- Interval Variables and Class Variables Tabs

Display summary statistics and distributions for interval (continuous) and class (nominal / ordinal / binary) variables

- Import Self-Created Data File

Sources: Excel, Oracle and Sybase

In CU, 4 formats:
- Delimited File (*.*)
- Comma Separated Values (*.csv) (can be exported from Excel)
- Tab Delimited File (*.txt) (can be exported from Excel)
- Excel file (*.xls)

Procedures:

1) Prepare a data file in Excel
   - variable names can be specified in the first row
   - save it in one of the above 4 formats, say Tab Delimited
   - remember to close the file before importing it into EM
2) Type “prog” in the command prompt
3) Select pull-down menu File -> Import Data (an Import Wizard will pop up)
4) Select the file type, say Tab Delimited -> Next
5) Browse to the input file -> Next
   - Options: specify whether the first row is read as variable names
6) Set the SAS Destination: specify Library and Member -> Finish
7) Create an Input Data Source node
   - select the input data
   - set model role, measurement, type, format, … of all the variables
   - set the target profile for the target variable (if any)
8) View the data: open the node and select pull-down menu File -> Details…


II. Sampling

Sample the data set

Recommended for extremely large databases because it can tremendously decrease model fitting time

- Data Tab
  Display data set information and data records
- Variables Tab
  Display variable information
- General Tab
  Select Sampling Method, Size and Random Seed
  - Simple Random
  - Nth (systematic sampling)
    1. Total: 100; sampling size: 5% (5 observations), so n = 100 / 5 = 20
    2. Randomly select a start position, say 10
    3. Select the 10th, 30th, 50th, 70th and 90th observations
    4. Caution: taking every nth observation may capture only part of the structure in the data, especially when the data set is sorted
  - Stratified
    1. Specify class variables to form strata (subsets)
    2. Preserve the strata proportions of the original data set
  - First N
    1. Select the first N observations
  - Cluster
    1. Cluster variable: class variable only
    2. Cluster the data set based on the selected cluster variable
    3. Data within the same cluster have the same probability of being selected
- Output Tab
  List the output data set
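The Nth (systematic) sampling arithmetic above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Sampling node is implemented; the function name `systematic_positions` and its parameters are made up for this example.

```python
import random

def systematic_positions(n_total, fraction, start=None, seed=None):
    """Return the 1-based positions picked by Nth (systematic) sampling."""
    sample_size = round(n_total * fraction)   # e.g. 5% of 100 -> 5 observations
    step = n_total // sample_size             # e.g. 100 / 5 -> every 20th observation
    if start is None:
        # Random start position inside the first interval (hypothetical choice).
        start = random.Random(seed).randint(1, step)
    return [start + k * step for k in range(sample_size)]

# Reproduces the worked example: total 100, size 5%, start position 10.
print(systematic_positions(100, 0.05, start=10))  # [10, 30, 50, 70, 90]
```

Because the positions are evenly spaced, a sorted or periodic data set can give a sample that reflects only part of its structure, which is the caution noted in step 4.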

III. Data Partition

Partition the data set into TRAIN, VALIDATION and TEST subsets

TRAIN - used for preliminary model fitting. The analyst attempts to find the best model weights using this data set.

VALIDATION - used to assess the adequacy of the model in the Model Manager and in the Assessment node. The validation data set is also used for model fine-tuning in the following nodes:
- Tree node - to create the best subtree.
- Neural Network node - to choose among network architectures or for the early stopping of the training algorithm.
- Regression node - to choose a final subset of predictors from all the subsets computed during stepwise regression.

TEST - used to obtain a final, unbiased estimate of the generalization error of the model.

- Data and Variables Tabs (same as the ones in Sampling)
- Partition Tab
  - Specify Partition Method and Random Seed
    - Stratification: maintain the distribution of the original data set
    - User Defined: partition using specific variables with specific values
  - Specify Train, Validation and Test data set sizes
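A stratified partition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Data Partition node's actual algorithm: the function name, the 40/30/30 split, the seed value and the synthetic records are all chosen for this example only.

```python
import random
from collections import defaultdict

def stratified_partition(rows, key, fractions=(0.4, 0.3, 0.3), seed=12345):
    """Split rows into TRAIN / VALIDATION / TEST, keeping class proportions."""
    rng = random.Random(seed)          # fixed seed -> reproducible partition
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)   # one stratum per class value
    parts = {"TRAIN": [], "VALIDATION": [], "TEST": []}
    for members in strata.values():
        rng.shuffle(members)
        n_train = round(len(members) * fractions[0])
        n_valid = round(len(members) * fractions[1])
        parts["TRAIN"] += members[:n_train]
        parts["VALIDATION"] += members[n_train:n_train + n_valid]
        parts["TEST"] += members[n_train + n_valid:]   # remainder goes to TEST
    return parts

# Synthetic data: 30% "bad" and 70% "good" records (invented for illustration).
rows = [{"id": i, "GOOD_BAD": "bad" if i % 10 < 3 else "good"} for i in range(1000)]
parts = stratified_partition(rows, key=lambda r: r["GOOD_BAD"])
for name, subset in parts.items():
    bad_share = sum(r["GOOD_BAD"] == "bad" for r in subset) / len(subset)
    print(name, len(subset), round(bad_share, 2))
```

Each subset keeps the 30% share of "bad" records found in the full data set, which is what the Stratification method is meant to preserve.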

6. Explore

I. Distribution Explorer

Visualize large volumes of data graphically

Uncover patterns and trends and reveal extreme values

Generate multidimensional histograms for interval and class variables

Fully interactive

- Variables Tab
  - Axis (x-, y- and z-axis) assignment for different variables
  - Create a summary report by deselecting the Chart Only check box

- X Axis, Y Axis and Z Axis Tabs
  - Show the distribution of the selected variables
  - Set the axis for different variables
  - Toolbar button: probe the values in the diagram (click, hold, and move the cursor over the diagram)
  - Toolbar button: resize the diagram legend
  - For an interval variable:
    - specify the no. of bins
    - Toolbar button: set the max and min boundaries (left-click the mark of the max / min bar and move it)

After running the Distribution Explorer node

- View results (Toolbar button)
  - A View pull-down menu is added
  - Dimension and Perspective: drill-up / drill-down (right click -> view -> dimension)
  - Toolbar button: move the legend (drag and move the legend)
  - Toolbar button: spin the diagram (move in any direction)

II. Multiplot

Visualize large volumes of data graphically

Uncover patterns and trends and reveal extreme values

Automatically create bar charts and scatter plots for the input and target variables

Runs in batch mode

- Variables Tab
  Specify whether each variable is used to plot graphs
- Set graph properties
  Pull-down menu Tools -> Settings
- View results
  The View and Graphs pull-down menus are created
- Bar Charts:
  - Histogram of each input and target.
  - Bar chart of each input versus each class target.
  - Bar chart of each input grouped by each interval target.
- Scatter Plots:
  - Plot of each interval input versus the target.
  - Plot of each class input versus the target.
- Model Tab
  - Check information of the data source


III. Insight (SAS/INSIGHT)

Analyzes data through linked graphs

Most useful and powerful visualization tool in EM

- Data Tab
  Specify whether to process the entire data set or only part of it
- After run, view results
  The Analyze, View, Tables, Graphs, Curves and Vars pull-down menus are created
  A data window showing the values of all the data will be shown
- How to analyze:

- Data Window
  - Select a subset of the data
  - Sort the data: select a variable and click the button in the upper left corner

- Histogram
  - Show the distribution of variables (frequency of different values)
  - Pull-down menu Analyze -> Histogram/Bar Chart (Y)

  1. Select Y variables from the left window and press Y (use the left mouse button with the Shift / Ctrl key)
     - if more than one Y variable is selected, the graphs will be shown simultaneously
     - relationships between multiple graphs can then be examined
  2. Things to note: the distribution, the center of the distribution, skewness, outliers
  3. Select any bar in any graph
     - The value of the bar will be displayed
     - Use Shift / Ctrl to select multiple regions
     - The same portion of data will be selected in the other graphs and in the data window
     - Double-click a point to show the values of the selected data
  4. Brushing observations
     - Select a portion of the observations in any graph (bounded by a rectangle)
     - Drag the rectangle; the selected portion changes accordingly
     - Useful for investigating multivariate data
  5. You can group the graphs by the values of class variables (select variable(s) and press Group)
     - e.g. two variables: X = 1, 2; Y = a, b
       grouping by X and Y gives 4 graphs:
       1. X = 1, Y = a; 2. X = 1, Y = b;
       3. X = 2, Y = a; 4. X = 2, Y = b.
     - Pull-down menu Edit -> Windows -> Align
       makes the y-axes of all graphs use the same scale
  6. Choose a Freq variable to display the sum of the values of a numeric variable instead of the no. of observations
     - choose GOOD_BAD as Y and AMOUNT as Freq
       to see the total loan amount of bad applicants
  7. Change the bin width and starting point (numeric variables only)
     - Pull-down menu Edit -> Windows -> Tools
     - Click the hand icon and click on the histogram
     - The width changes when you press and move the mouse vertically
     - The starting point changes when you move the mouse horizontally
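The effect of choosing a Freq variable (step 6 above) can be illustrated with a small Python sketch: without a Freq variable each bar counts observations, and with one each bar sums that variable's values. The five applicant records and the helper `bar_heights` are invented for this example and are not taken from the German Credit Data.

```python
from collections import defaultdict

# Invented records for illustration only.
applicants = [
    {"GOOD_BAD": "good", "AMOUNT": 1200},
    {"GOOD_BAD": "bad", "AMOUNT": 5000},
    {"GOOD_BAD": "good", "AMOUNT": 800},
    {"GOOD_BAD": "bad", "AMOUNT": 3000},
    {"GOOD_BAD": "good", "AMOUNT": 2000},
]

def bar_heights(rows, y, freq=None):
    """Height of each bar of Y: a count, or the sum of `freq` when it is given."""
    heights = defaultdict(int)
    for row in rows:
        heights[row[y]] += row[freq] if freq else 1
    return dict(heights)

print(bar_heights(applicants, "GOOD_BAD"))                 # {'good': 3, 'bad': 2}
print(bar_heights(applicants, "GOOD_BAD", freq="AMOUNT"))  # {'good': 4000, 'bad': 8000}
```

In this toy data the bad applicants are fewer in number but account for the larger total loan amount, which is the kind of pattern the Freq option makes visible.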


- Create Box Plots
  - Compare distributions of interval data (numeric variables)
  - Pull-down menu Analyze -> Box Plot/Mosaic Plot (Y)

  1. Select the Y variable (interval variable) and the X variable
  2. Select a Label variable to label the observations with extreme values on the selected variable
  3. Click the graph to see the range between regions

- Create Scatter Plots
  - Show the relationship between two variables Y and X (2D)
  - Pull-down menu Analyze -> Scatter Plot (Y X)

  1. Select X and Y variables
  2. Multiple variables can be selected for both X and Y; click OK
  3. Select any point in any graph
     - Variable IDs of the selected data will be displayed
  4. Use Edit -> Windows -> Tools to edit the scatter point style

- Distribution (Y)
  - The Tables, Graphs and Curves pull-down menus are enabled
  - Tables Menu
    1. Select the tables displayed in the distribution output
    2. Default: the Moments and Quantiles tables are displayed
  - Graphs Menu
    1. Select the graphs displayed in the distribution output
    2. Default: Box Plots and Histogram
    3. A QQ plot can be selected here. It compares ordered values of a variable with quantiles of a specific theoretical distribution. If the data are from the theoretical distribution, the points on the QQ plot lie approximately on a straight line. The normal, lognormal, exponential, and Weibull distributions can be used in the plot.

- Help
  - Pull-down menu Help -> EM Reference: find “insight”
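To make the QQ plot under the Graphs Menu more concrete, the sketch below computes the points of a normal QQ plot with the Python standard library. It assumes the plotting positions (i - 0.5) / n, which is one common convention and not necessarily the one SAS/INSIGHT uses; the function name `normal_qq_points` is hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each ordered value with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i - 0.5) / n is an assumed plotting position for the i-th ordered value.
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

# Each pair is (theoretical quantile, observed value); plot x against y.
for theoretical, observed in normal_qq_points([3.1, 1.9, 2.4, 2.0, 2.8]):
    print(round(theoretical, 3), observed)
```

If the observed values come from a normal distribution, the printed pairs fall close to a straight line whose intercept and slope estimate the mean and standard deviation.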

IV. Variable Selection

Reduce the number of inputs by setting the status of the input variables that are not related to the target to rejected

A target variable should be specified

Results can be overridden by resetting the status of the variables

Works on interval or binary targets only; for an ordinal target, use the Regression node to select variables

Variables Tab
- Select the variables to be processed

Manual Selection Tab
- Override the role of a variable (automatic / rejected / input)

Target Association Tab
- Disabling the automatic selection process: deselect the Remove variables unrelated to target check box, then select manually
- Selection Criterion

1. R-square (Linear and Logistic Regression)
   - Numeric target
   - Fast preliminary variable assessment
   - Settings
     - Squared correlation: cut-off threshold
     - Stepwise R2 Improvement: cut-off threshold of a further forward stepwise regression
     - Include 2-way interactions: measure the effect of a class variable across all the levels of another class variable
     - Bin interval variables: bin interval variables into 16 equally-spaced groups
     - Use only grouped class variables: reduce the levels of class variables

2. Chi-square (Decision Tree)
   - Binary target variable
   - Settings
     - Bins: no. of bins
     - Chi-square: larger value -> fewer splits
     - Passes: no. of passes through the data to determine the optimum splits

General Tab
- Remove variables with a certain amount of missing values
- Remove variables in hierarchies
  e.g. district (least detailed) and street (most detailed)
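The squared-correlation cut-off of the R-square criterion can be sketched as a simple screening step in Python. This is a simplified illustration of the first step only (it omits the forward stepwise stage, interactions and binning) and is not the Variable Selection node's implementation; the variable names, the toy data and the cut-off value 0.05 are assumptions made for this example.

```python
def squared_correlation(xs, ys):
    """Squared Pearson correlation between an input and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no linear information
    return sxy * sxy / (sxx * syy)

def screen_inputs(inputs, target, cutoff):
    """Mark each input as 'input' or 'rejected' by its squared correlation."""
    return {
        name: "input" if squared_correlation(column, target) >= cutoff else "rejected"
        for name, column in inputs.items()
    }

# Toy data (invented): "income" tracks the target, "noise" does not.
target = [1, 2, 3, 4, 5, 6]
inputs = {
    "income": [2, 4, 6, 8, 10, 12],
    "noise": [1, -1, 0, 0, -1, 1],
}
print(screen_inputs(inputs, target, cutoff=0.05))
# {'income': 'input', 'noise': 'rejected'}
```

As in the Manual Selection Tab, a status assigned this way is only a first pass and can be overridden by the analyst.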

7. Modify

I. Data Set Attribute

Modify data set attributes (name, description and role)

Modify metadata information

Similar to the Input Data Source node

II. Transform Variables

Create new variables transformed from existing variables in the data set

Transform interval variables only

Improve the fit of the model to the data

Stabilize variances, remove nonlinearity, …

Variables Tab
- A Table Editor and 3 Toolbar buttons: transform / create / delete variable
- select variables in the Table Editor and click a Toolbar button (transform / delete)
- only created / transformed variables can be deleted

Three types of transformation

1. Simple Transformations
   - log
   - square root
   - inverse
   - square
   - exponential
   - standardize

2. Binning Transformations
   - bucket (equal width)
   - quantile (equal no. of observations)
   - optimal binning for relationship to target
   - the no. of bins and the bin range can be specified

3. Best Power Transformations
   - maximize normality: use when the distribution is skewed
   - maximize correlation with target: match interval inputs with an interval target
   - equalize spread with target levels: match interval inputs with the different levels of a class target
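The difference between the bucket and quantile binning transformations can be seen in a short Python sketch. It is an illustration of the two ideas under simple assumptions (ties and boundary handling are kept naive) rather than the Transform Variables node's exact rules; the function names and the sample values are invented.

```python
def bucket_bins(values, n_bins):
    """Equal-width bins: return a 0-based bin index for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical -> a single bin
    # The maximum value is placed in the last bin instead of a bin of its own.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def quantile_bins(values, n_bins):
    """Equal-count bins: rank the values and cut the ranks into n_bins groups."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

amounts = [1, 2, 3, 4, 5, 6, 7, 100]   # one extreme value
print(bucket_bins(amounts, 4))          # [0, 0, 0, 0, 0, 0, 0, 3]
print(quantile_bins(amounts, 4))        # [0, 0, 1, 1, 2, 2, 3, 3]
```

With a skewed variable, equal-width buckets put almost every observation in one bin, while quantile bins keep the same number of observations in each bin.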


III. Replacement

Replace missing and nonmissing values

Default: observations with missing values are not used in the Variable Selection, Regression and Neural Network nodes

There is no single correct answer for which replacement value to use

Value imputation

Imputation can also be done by the Clustering / SOM/Kohonen nodes

Data Tab
- Training Subtab: specify the training set for imputation (i.e. the data set used to calculate mean, median, mode, …)

Defaults Tab
- General Subtab
  - Replace before imputation:
    - replace nonmissing values before imputation of missing values
    - specified in the Interval Variables and Class Variables Tabs
  - Replace unknown levels with:
    - replace unknown values of class variables in the score data set
    - e.g. new class values that appear in the scoring data set
  - Create imputed indicator variables:
    - create a variable to indicate whether the input data has a replaced value or not
- Imputation Method Subtab
  - Specify the imputation method for interval and class variables
  - Interval variable:
    - Mean
    - Median
    - Midrange: (max + min) / 2
    - Distribution-based: replacement values are taken from random percentiles of the variable's distribution; does not change the original distribution
    - Tree imputation: analyze each input as a target using the remaining input and rejected variables; may be more accurate
    - Mid-minimum spacing: same as midrange but using a portion of the data
    - Tukey's biweight / Huber's / Andrew's Wave: robust M-estimators of location; minimize functions of the deviations of the data
    - Default constant
    - None: no imputation
  - Class variable:
    - Most frequent value (count)
    - Distribution-based
    - Tree imputation
    - Default constant
    - None

Interval Variables Tab
- Customize the interval imputation method specified in the Defaults Tab
- Replace nonmissing values: ( Replace < / With Value (<) ) and ( Replace > / With Value (>) )
- Replace extreme values before performing missing value imputation

Class Variables Tab
- Customize the class imputation method specified in the Defaults Tab
- Replace nonmissing values (Replace Value)
- You can change the order of ordinal data here

Tree Imputation Tab
- Specify the predictor variables used in tree imputation

Frequency Tab
- List the freq variable(s) of the data set
- No imputation is done for freq variables
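Three of the interval imputation methods listed above (mean, median and midrange), together with the imputed indicator variable, can be sketched in Python. This is a minimal illustration, not the Replacement node's implementation: missing values are represented by `None`, and the function name `impute` and the sample ages are assumptions made for this example.

```python
from statistics import mean, median

def impute(values, method):
    """Fill missing values (None) and return (filled values, imputed indicators)."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = mean(observed)
    elif method == "median":
        fill = median(observed)
    elif method == "midrange":
        fill = (max(observed) + min(observed)) / 2
    else:
        raise ValueError(f"unknown imputation method: {method}")
    indicators = [v is None for v in values]   # True where a value was replaced
    filled = [fill if v is None else v for v in values]
    return filled, indicators

ages = [20, None, 30, 70, None]   # invented sample with two missing values
for method in ("mean", "median", "midrange"):
    filled, indicators = impute(ages, method)
    print(method, filled, indicators)
```

For this sample the three methods fill in 40, 30 and 45 respectively, which shows why there is no single correct replacement value: the choice depends on how skewed the variable is and how sensitive the model is to extreme values.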

8. Model

The mining process

1. Tree
   To construct decision tree models for your data (e.g., CHAID, CART, C4.5, and C5.0). The Tree node supports both automatic and interactive training.

2. Regression
   To use linear and logistic regression models for your data. It supports the stepwise, forward, and backward selection methods.

3. Neural Network
   To construct, train, and validate multilayer feedforward neural networks.

9. Assess

Assessment
- To compare models and predictions from the Model nodes.
- To make cross-model comparisons and assessments, independent of all other factors (such as sample size, modeling node, and so on).
- The champion model(s) can be exported to a Score node to score new data.

