WEKA Lab Manual


Data Mining Lab


LABORATORY MANUAL on

DATA MINING

Prepared by

INDRANEEL K Associate Professor

CSE Department

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SRI KOTTAM TULASI REDDY MEMORIAL COLLEGE OF ENGINEERING (Affiliated to JNTU, Hyderabad, Approved by AICTE, Accredited by NBA)

KONDAIR, MAHABOOBNAGAR (Dist), AP - 509125


INDEX

The objective of the lab exercises is to use data mining techniques to identify customer segments and understand their buying behaviour, and to use standard databases to understand DM processes using WEKA (or any other DM tool).

1. Gain insight into running pre-defined decision trees and explore results using MS OLAP Analytics.
2. Using IBM OLAP Miner, understand the use of data mining for evaluating the content of multidimensional cubes.
3. Using Teradata Warehouse Miner, create mining models that are executed in SQL.

BI Portal Lab: The objective of the lab exercises is to integrate pre-built reports into a portal application.

4. Publish Cognos cubes to a business intelligence portal.

Metadata & ETL Lab: The objective of the lab exercises is to implement metadata import agents to pull metadata from leading business intelligence tools and populate a metadata repository, and to understand ETL processes.

5. Import metadata from specific business intelligence tools and populate a metadata repository.
6. Publish metadata stored in the repository.
7. Load data from heterogeneous sources, including text files, into a pre-defined warehouse schema.


CONTENTS

S.No | Experiment | Week No | Page Nos
1 | Defining weather relation for different attributes | 1 | 7-18
2 | Defining employee relation for different attributes | 2 | 19-28
3 | Defining labor relation for different attributes | 3 | 29-38
4 | Defining student relation for different attributes | 4 | 39-49
5 | Exploring weather relation using Experimenter and obtaining results in various schemes | 5 | 49-59
6 | Exploring employee relation using Experimenter | 6 | 60-65
7 | Exploring labor relation using Experimenter | 7 | 66-71
8 | Exploring student relation using Experimenter | 8 | 72-78
9 | Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48 | 9 | 86-112
10 | Designing a knowledge flow layout to load a data set, perform attribute selection, normalize the attributes, and store the result with a CSV saver | 10 | 116-117


Aim: Implementation of Data Mining Algorithms using Attribute-Relation File Formats

Introduction to Weka (Data Mining Tool)

• Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset (using the GUI) or called from your own Java code (using the Weka Java library).

• Tools (or functions) in Weka include:
• Data preprocessing (e.g., Data Filters),
• Classification (e.g., BayesNet, KNN, C4.5 Decision Tree, Neural Networks, SVM),
• Regression (e.g., Linear Regression, Isotonic Regression, SVM for Regression),
• Clustering (e.g., Simple K-means, Expectation Maximization (EM)),
• Association rules (e.g., Apriori Algorithm, Predictive Accuracy, Confirmation Guided),
• Feature Selection (e.g., Cfs Subset Evaluation, Information Gain, Chi-squared Statistic), and
• Visualization (e.g., view different two-dimensional plots of the data).
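The clustering entry above mentions Simple K-means. As a rough illustration of the idea only (not Weka's SimpleKMeans, which additionally normalizes data and handles nominal attributes), here is a minimal one-dimensional k-means sketch in Python with made-up data:

```python
# Minimal sketch of the k-means idea behind clustering tools like
# Weka's SimpleKMeans (illustrative only; real implementations add
# normalization, better seeding, and nominal-attribute handling).
import random

def kmeans(points, k, iters=20, seed=1):
    random.seed(seed)
    centroids = random.sample(points, k)      # pick k initial centroids
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious one-dimensional groups, around 10 and around 50.
data = [8, 9, 10, 11, 12, 48, 49, 50, 51, 52]
print(kmeans(data, 2))  # [10.0, 50.0]
```

The two recovered centroids are simply the means of the two obvious groups in the data.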

Launching WEKA

The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, this is provided by an alternative launcher called "Main" (class weka.gui.Main). The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:

• Explorer An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).

• Experimenter An environment for performing experiments and conducting statistical tests between learning schemes.

• Knowledge Flow This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.

• Simple CLI Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.


Working with Explorer

Weka Data File Format (Input)

The most popular data input format of Weka is ARFF (with .arff being the extension of your input data file).

Experiment:1

WEATHER RELATION:

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
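Since every experiment below starts from an ARFF file, it may help to see the format handled programmatically. The following Python sketch is not Weka's own loader (Weka reads ARFF through its Java converter classes); it just writes the weather header shown above as a string and parses it by hand:

```python
# Hand-rolled ARFF parsing sketch: comments start with '%', the header
# declares @relation and @attribute lines, and rows follow @data.
arff = """% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue                         # skip blanks and comments
        low = line.lower()
        if low.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif low.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            data.append([v.strip() for v in line.split(',')])
    return relation, attributes, data

rel, attrs, rows = parse_arff(arff)
print(rel, attrs, len(rows))
# weather ['outlook', 'temperature', 'humidity', 'windy', 'play'] 3
```

This minimal parser ignores ARFF extras such as quoted values, dates, and the sparse format, which the real loader supports.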


PREPROCESSING:

In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program.

Open File- allows for the user to select files residing on the local machine or recorded medium

Open URL- provides a mechanism to locate a file or data source from a different location specified by the user

Open Database- allows the user to retrieve files or data from a database source provided by the user


CLASSIFICATION:

The user has the option of applying many different algorithms to the data set in order to produce a representation of information. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices. Figure 5 shows some of the categories.

Output:
Correctly Classified Instances       9     64.2857 %
Incorrectly Classified Instances     5     35.7143 %
Kappa statistic                      0
Mean absolute error                  0.4762
Root mean squared error              0.4934
Relative absolute error            100      %
Root relative squared error        100      %
Total Number of Instances           14

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        1        0.643      1       0.783      0.178     yes
               0        0        0          0       0          0.178     no
Weighted Avg.  0.643    0.643    0.413      0.643   0.503      0.178

=== Confusion Matrix ===

 a b <-- classified as
 9 0 | a = yes
 5 0 | b = no
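The headline figures in this output can be checked by hand: the weather data has 9 "yes" and 5 "no" instances, so a majority-class predictor (which is what Weka's ZeroR baseline does) gets 9 of 14 right, matching the 35.7143 % error and the 9/5 confusion matrix column. A sketch:

```python
# Reproducing the baseline numbers above: always predicting the
# majority class ("yes", 9 of 14) yields 64.2857 % accuracy.
from collections import Counter

labels = ['yes'] * 9 + ['no'] * 5
majority = Counter(labels).most_common(1)[0][0]
predictions = [majority] * len(labels)

correct = sum(p == a for p, a in zip(predictions, labels))
print(majority)                                     # yes
print(correct, len(labels))                         # 9 14
print(round(100 * correct / len(labels), 4))        # 64.2857
print(round(100 * (1 - correct / len(labels)), 4))  # 35.7143
```

This is why ZeroR is useful only as a sanity-check baseline: any real classifier should beat the majority-class rate.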

CLUSTERING:

The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

=== Run information ===
Output:
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5
  outlook temperature humidity windy play
Test mode: evaluate on training data

=== Model and evaluation on training set ===
Number of clusters selected by cross validation: 1

             Cluster
Attribute    0 (1)
======================
outlook
  sunny         6
  overcast      5
  rainy         6
  [total]      17
temperature
  mean      73.5714
  std. dev.  6.3326
humidity
  mean      81.6429
  std. dev.  9.9111
windy
  TRUE          7
  FALSE         9
  [total]      16
play
  yes          10
  no            6
  [total]      16

Clustered Instances
0   14 (100%)

Log likelihood: -9.4063
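With a single cluster, the per-attribute estimates in this EM output are just the maximum-likelihood (population) mean and standard deviation of each numeric attribute over the whole data set. Assuming the standard 14-instance weather data (only three rows were listed above), they can be reproduced directly:

```python
# Checking EM's single-cluster estimates by hand: the reported
# mean/std. dev. are population (maximum-likelihood) statistics of
# each numeric attribute over the assumed 14-instance weather data.
import math

attrs = {
    'temperature': [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    'humidity':    [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
}

stats = {}
for name, values in attrs.items():
    mean = sum(values) / len(values)
    # Divide by n (not n - 1): maximum-likelihood variance, as EM uses.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    stats[name] = (round(mean, 4), round(math.sqrt(var), 4))

print(stats)
# {'temperature': (73.5714, 6.3326), 'humidity': (81.6429, 9.9111)}
```

Both pairs match the figures in the run output above, confirming that the one-cluster model is simply modelling the whole data set.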

Choosing Relationship for cluster:


ASSOCIATION:

The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.

=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: weather
Instances: 14
Attributes: 5
  outlook temperature humidity windy play
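The Apriori scheme invoked above mines frequent itemsets by minimum support before deriving rules. A toy sketch of that core idea follows; the three transactions are made up for illustration, and Weka's weka.associations.Apriori additionally generates rules and scores them with the metric and confidence thresholds set by options such as -T and -C:

```python
# Toy Apriori sketch: grow candidate itemsets level by level, keeping
# only those whose support (fraction of transactions containing them)
# meets the minimum. Transactions here are made-up attribute=value sets.
transactions = [
    {'outlook=sunny', 'windy=false', 'play=no'},
    {'outlook=sunny', 'windy=true', 'play=no'},
    {'outlook=overcast', 'windy=false', 'play=yes'},
]

def apriori(transactions, min_support):
    # Level 1: candidates are the individual items.
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, k = [], 1
    while candidates:
        # Prune step: keep candidates meeting the support threshold.
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) / len(transactions)
                 >= min_support]
        frequent.extend(level)
        # Join step: build (k+1)-item candidates from frequent k-itemsets.
        k += 1
        candidates = sorted({a | b for a in level for b in level
                             if len(a | b) == k}, key=sorted)
    return frequent

freq = apriori(transactions, min_support=2 / 3)
print(sorted(sorted(f) for f in freq))
# [['outlook=sunny'], ['outlook=sunny', 'play=no'], ['play=no'], ['windy=false']]
```

Note the Apriori property at work: {outlook=sunny, play=no} is only ever considered because both of its single-item subsets are already frequent.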

SELECTING ATTRIBUTES:

The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wanted to exclude certain categories of the data they would deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

OUTPUT:


=== Run information ===
Evaluator: weka.attributeSelection.CfsSubsetEval
Search: weka.attributeSelection.BestFirst -D 1 -N 5
Relation: weather
Instances: 14
Attributes: 5
  outlook temperature humidity windy play
Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 11
  Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,4 : 2
  outlook
  windy
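CfsSubsetEval's merit score is not shown being derived, but the Information Gain evaluator mentioned in the introduction is easy to compute by hand. Assuming the standard 14-instance nominal weather data (9 yes / 5 no, split 2/3, 4/0, 3/2 across the outlook values), the gain of outlook works out as follows:

```python
# Information gain of "outlook" on the assumed 14-instance weather
# data: class entropy minus the weighted entropy after splitting.
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

play = ['yes'] * 9 + ['no'] * 5          # class distribution overall
splits = {                               # play labels per outlook value
    'sunny':    ['yes'] * 2 + ['no'] * 3,
    'overcast': ['yes'] * 4,
    'rainy':    ['yes'] * 3 + ['no'] * 2,
}

gain = entropy(play) - sum(len(s) / len(play) * entropy(s)
                           for s in splits.values())
print(round(gain, 3))  # 0.247
```

The overcast branch is pure (all yes, entropy 0), which is what gives outlook its relatively high gain; this same quantity drives attribute ranking in InfoGainAttributeEval and split selection in C4.5/J48.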

VISUALIZATION:

The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.


Experiment: 2

EMPLOYEE RELATION (INPUT):

% ARFF file for employee data with some numeric features
@relation employee
@attribute ename {john, tony, ravi}
@attribute eid numeric
@attribute esal numeric
@attribute edept {sales, admin}
@data
john, 85, 8500, sales
tony, 85, 9500, admin
john, 85, 8500, sales

OUTPUT

PREPROCESSING:

In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program.

Open File- allows for the user to select files residing on the local machine or recorded medium

Open URL- provides a mechanism to locate a file or data source from a different location specified by the user

Open Database- allows the user to retrieve files or data from a database source provided by the user

CLASSIFICATION:

The user has the option of applying many different algorithms to the data set in order to produce a representation of the information. The Classify tab is where the user selects the classifier choices; here the baseline rule learner ZeroR is applied to the employee relation.

=== Run information ===

OUTPUT:
Scheme: weka.classifiers.rules.ZeroR
Relation: employee
Instances: 3
Attributes: 4
  ename eid esal edept
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: sales

Time taken to build model: 0 seconds
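ZeroR, as used in this run, simply memorizes the most frequent class value in the training data and predicts it for every instance. A minimal sketch of that rule on the three employee instances:

```python
# ZeroR for a nominal class: predict the majority class value.
from collections import Counter

def zero_r(class_values):
    # most_common(1) returns [(value, count)] for the top value.
    return Counter(class_values).most_common(1)[0][0]

edept = ['sales', 'admin', 'sales']   # the three employee instances
print(zero_r(edept))  # sales
```

With two of three instances labelled "sales", the model's constant prediction of "sales" matches the classifier output above.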

CLUSTERING:

The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

OUTPUT:
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: employee
Instances: 3
Attributes: 4
  ename eid esal edept
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
==
Number of clusters selected by cross validation: 1

             Cluster
Attribute    0 (1)
======================
ename
  john          3
  tony          2
  ravi          1
  [total]       6
eid
  mean         85
  std. dev.     0
esal
  mean     8833.3333
  std. dev. 471.4045
edept
  sales         3
  admin         2
  [total]       5

Clustered Instances
0   3 (100%)

Log likelihood: 3.84763

ASSOCIATION:

The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.

=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: employee
Instances: 3
Attributes: 4
  ename eid esal edept

SELECTING ATTRIBUTES:

The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wanted to exclude certain categories of the data they would deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

OUTPUT:
=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 11
  Merit of best subset found: 0.196

Attribute Subset Evaluator (supervised, Class (nominal): 5 play):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,4 : 2
  outlook
  windy

VISUALIZATION:


The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.


Experiment: 3

STUDENT RELATION:

% ARFF file for student data with some numeric features
@relation student
@attribute sname {john, tony, ravi}
@attribute sid numeric
@attribute sbranch {ECE, CSE, IT}
@attribute sage numeric
@data
john, 285, ECE, 19
tony, 385, IT, 20
john, 485, ECE, 19

PREPROCESSING:

In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program.

Open File- allows for the user to select files residing on the local machine or recorded medium

Open URL- provides a mechanism to locate a file or data source from a different location specified by the user

Open Database- allows the user to retrieve files or data from a database source provided by the user

CLASSIFICATION:

The user has the option of applying many different algorithms to the data set in order to produce a representation of the information. The Classify tab is where the user selects the classifier choices; here the baseline rule learner ZeroR is applied to the student relation with sage as the (numeric) class.

Output:
Scheme: weka.classifiers.rules.ZeroR
Relation: student
Instances: 3
Attributes: 4
  sname sid sbranch sage
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: 19.333333333333332

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient            -0.5
Mean absolute error                 0.5
Root mean squared error             0.6455
Relative absolute error           100      %
Root relative squared error       100      %
Total Number of Instances           3
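For a numeric class attribute, ZeroR predicts the training-set mean, which is where the value 19.333... above comes from. Assuming student ages of 19, 20 and 19 (consistent with the reported prediction):

```python
# ZeroR for a numeric class: the constant prediction is the mean of
# the class values seen in training (ages assumed to be 19, 20, 19).
sage = [19, 20, 19]
prediction = sum(sage) / len(sage)
print(prediction)  # 19.333333333333332
```

The long decimal tail in the output is just floating-point representation of 58/3, not extra precision in the model.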

CLUSTERING:

The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: weather
Instances: 14
Attributes: 5
  outlook temperature humidity windy play
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM


==
Number of clusters selected by cross validation: 1

             Cluster
Attribute    0 (1)
======================
outlook
  sunny         6
  overcast      5
  rainy         6
  [total]      17
temperature
  mean      73.5714
  std. dev.  6.3326
humidity
  mean      81.6429
  std. dev.  9.9111
windy
  TRUE          7
  FALSE         9
  [total]      16
play
  yes          10
  no            6
  [total]      16

Clustered Instances
0   14 (100%)

Log likelihood: -9.4063

ASSOCIATION:

The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.

=== Run information ===
Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: student
Instances: 3
Attributes: 4
  sname sid sbranch sage

SELECTING ATTRIBUTES:

The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wanted to exclude certain categories of the data they would deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 7
  Merit of best subset found: 1

Attribute Subset Evaluator (supervised, Class (numeric): 4 sage):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1,3 : 2
  sname
  sbranch

VISUALIZATION: The last tab in the window is the visualization tab. Using the other tabs in the program, calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.


Experiment: 4

LABOR RELATION:

% ARFF file for labor data with some numeric features
@relation labor
@attribute name {rom, tony, santu}
@attribute wage-increase-first-year numeric
@attribute wage-increase-second-year numeric
@attribute working-hours numeric
@attribute pension numeric
@attribute vacation numeric
@data
rom, 500, 600, 8, 200, 15
tony, 400, 450, 8, 200, 15
santu, 600, 650, 8, 200, 15

PREPROCESSING:

In order to experiment with the application, the data set needs to be presented to WEKA in a format the program understands. There are rules for the type of data that WEKA will accept, and three options for loading data into the program.

Open File- allows for the user to select files residing on the local machine or recorded medium

Open URL- provides a mechanism to locate a file or data source from a different location specified by the user

Open Database- allows the user to retrieve files or data from a database source provided by the user

CLASSIFICATION:

The user has the option of applying many different algorithms to the data set in order to produce a representation of the information. The Classify tab is where the user selects the classifier choices; here the baseline rule learner ZeroR is applied to the labor relation with vacation as the (numeric) class.

Output:
Scheme: weka.classifiers.rules.ZeroR
Relation: labor
Instances: 3
Attributes: 6
  name wage-increase-first-year wage-increase-second-year working-hours pension vacation
Test mode: 2-fold cross-validation

=== Classifier model (full training set) ===
ZeroR predicts class value: 15.0

Time taken to build model: 0 seconds

=== Cross-validation ===
=== Summary ===
Correlation coefficient             0
Mean absolute error                 0
Root mean squared error             0
Relative absolute error           NaN %
Root relative squared error       NaN %
Total Number of Instances           3

CLUSTERING:

The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze. There are a few options within the cluster window that are similar to those described in the Classify tab. These options are: use training set, supplied test set and percentage split. The fourth option is classes to clusters evaluation, which compares how well the data compares with a pre-assigned class within the data. While in cluster mode, users have the option of ignoring some of the attributes from the data set. This can be useful if there are specific attributes causing the results to be out of range, or for large data sets. Figure 6 shows the Cluster window and some of its options.

Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: labor
Instances: 3


Attributes: 6
  name wage-increase-first-year wage-increase-second-year working-hours pension vacation
Test mode: evaluate on training data

=== Model and evaluation on training set ===
EM
==
Number of clusters selected by cross validation: 1

             Cluster
Attribute    0 (1)
=====================================
name
  rom           2
  tony          2
  santu         2
  [total]       6
wage-increase-first-year
  mean        500
  std. dev.   81.6497
wage-increase-second-year
  mean        566.6667
  std. dev.   84.9837
working-hours
  mean          8
  std. dev.     0
pension
  mean        200
  std. dev.     0
vacation
  mean         15
  std. dev.     0

Clustered Instances
0   3 (100%)

Log likelihood: 25.90833

ASSOCIATION:

The Associate tab opens a window to select the options for associations within the data set. The user selects one of the choices and presses Start to yield the results. There are few options for this window; one of the most popular, Apriori, is shown in the figure below.

Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1


Relation: labor
Instances: 3
Attributes: 6
  name wage-increase-first-year wage-increase-second-year working-hours pension vacation

SELECTING ATTRIBUTES:

The next tab is used to select the specific attributes used for the calculation process. By default all of the available attributes are used in the evaluation of the data set. If the user wanted to exclude certain categories of the data they would deselect those specific choices from the list in the window. This is useful if some of the attributes are of a different form, such as alphanumeric data, that could alter the results. The software searches through the selected attributes to decide which of them will best fit the desired calculation. To perform this, the user has to select two options: an attribute evaluator and a search method. Once this is done, the program evaluates the data based on the subset of the attributes, then it performs the necessary search for commonality within the data. Figure 8 shows the options for attribute evaluation.

=== Attribute Selection on all input data ===
Search Method:
  Best first.
  Start set: no attributes
  Search direction: forward
  Stale search after 5 node expansions
  Total number of subsets evaluated: 19
  Merit of best subset found: 0

Attribute Subset Evaluator (supervised, Class (numeric): 6 vacation):
  CFS Subset Evaluator
  Including locally predictive attributes

Selected attributes: 1 : 1
  name

VISUALIZATION:

The last tab in the window is the visualization tab. Using the other tabs in the program,

calculations and comparisons have occurred on the data set. Selections of attributes and methods of manipulation have been chosen. The final piece of the puzzle is looking at the information that has been derived throughout the process. The user can now actually see the data displayed in a two dimensional representation of the information. The first screen that the user sees when they select the visualization option is a matrix of plots representing the different attributes within the data set plotted against the other attributes. If a lot of attributes are selected, there is a scroll bar to view all of the produced plots. The user can select a specific plot from the matrix to analyze its contents in a larger, popup window. A grid pattern of the plots allows the user to select the attribute positioning to their liking for better understanding. Once a specific plot has been selected, the user can change the attributes from one view to another.
EXPERIMENTER: The Weka Experiment Environment enables the user to create, run, modify, and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the others. The Experiment Environment can be run from the command line using the Simple CLI.

Page 27: WEKA Lab Manual

Data Mining Lab

S.K.T.R.M College of Engineering 27

Experiment:5

COMMAND LINE:

java weka.experiment.Experiment -r -T data/weather.arff

Defining an Experiment

When the Experimenter is started, the Setup window (actually a pane) is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.

To define the dataset to be processed by a scheme, first select "Use relative paths" in the Datasets panel of the Setup window and then click "Add New" to open a dialog box below.


Select iris.arff and click Open to select the iris dataset.

The dataset name is now displayed in the Datasets panel of the Setup window.

Saving the Results of the Experiment

To identify a dataset to which the results are to be sent, click on the "CSVResultListener" entry in the Destination panel. Note that this window (and other similar windows in Weka) is not initially expanded and some of the information in the window is not visible. Drag the bottom right-hand corner of the window to resize the window until the scroll bars disappear.


The output file parameter is near the bottom of the window, beside the text "outputFile". Click on this parameter to display a file selection window.


Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window.

The dataset name is displayed in the Destination panel of the Setup window.

Saving the Experiment Definition

The experiment definition can be saved at any time. Select "Save..." at the top of the Setup window. Type the dataset name with the extension "exp" (or select the dataset name if the experiment definition dataset already exists).


The experiment can be restored by selecting Open in the Setup window and then selecting Experiment1.exp in the dialog window.

Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme.
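What each of those runs does can be sketched in plain Python. This is a stand-in for illustration only, not Weka's code: ZeroR simply predicts the majority class seen in the training patterns, and each run reshuffles the data before taking the 66/34 split.

```python
import random
from collections import Counter

def zero_r(train_labels):
    """ZeroR: always predict the most frequent class in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def run_experiment(labels, runs=10, train_frac=0.66, seed=1):
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        shuffled = labels[:]
        rng.shuffle(shuffled)             # a fresh randomization per run
        cut = int(len(shuffled) * train_frac)
        # 66% of 150 patterns -> 99 for training, 51 for testing
        train, test = shuffled[:cut], shuffled[cut:]
        majority = zero_r(train)
        accuracies.append(sum(y == majority for y in test) / len(test))
    return accuracies

# 150 iris-style labels, 50 per class
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(run_experiment(labels))  # roughly 1/3 accuracy on each run
```

With three equally frequent classes, ZeroR can do no better than chance, which is why the sample results further below show accuracies in the 20-30% range.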

Click Start to run the experiment.


If the experiment was defined correctly, the 3 messages shown above will be displayed in the Log panel. The results of the experiment are saved to the file Experiment1.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
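Because the results file is plain CSV, it can also be inspected programmatically. A minimal sketch follows; the embedded rows are abridged to just the columns actually read (values taken from the listing above), since the real file carries the full header.

```python
import csv, io

# two result rows in the same shape as Experiment1.txt, abridged to the
# columns this sketch reads
results = io.StringIO(
    "Dataset,Run,Scheme,Percent_correct\n"
    "iris,1,weka.classifiers.ZeroR,29.41176470588235\n"
    "iris,2,weka.classifiers.ZeroR,21.568627450980394\n")

rows = list(csv.DictReader(results))
mean_acc = sum(float(r["Percent_correct"]) for r in rows) / len(rows)
print(round(mean_acc, 2))  # 25.49
```

The same approach works on the full file: `csv.DictReader` maps every column name in the long header to its value, so any statistic can be pulled out by name.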


Experiment:6

Aim: To set up standard experiments that run locally on a single machine, or remote experiments distributed between several hosts, for the employee relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/emp.arff

Add a new relation using the Add new button on the right panel, give the database connection using JDBC, and click OK.

Choose the relation and click the OK button.


Choose ZeroR from the "Choose" button menu by clicking the Add new button on the right panel and click OK.

Click on the Run tab to get the output.


The results of the experiment are saved to the file Experiment2.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?


Experiment:7

Aim: To set up standard experiments that run locally on a single machine, or remote experiments distributed between several hosts, for the labor relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/labor.arff

Add a new relation using the Add new button on the right panel, give the database connection using JDBC, and click OK.


Choose the relation and click the OK button.


Choose ZeroR from the "Choose" button menu by clicking the Add new button on the right panel and click OK.

Click on the Run tab to get the output.


The results of the experiment are saved to the file Experiment3.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?


Experiment:8

Aim: To set up standard experiments that run locally on a single machine, or remote experiments distributed between several hosts, for the student relation.

Type this command in the Simple CLI:

java weka.experiment.Experiment -r -T data/student.arff


Add a new relation using the Add new button on the right panel, give the database connection using JDBC, and click OK.

Choose the relation and click the OK button.


Choose ZeroR from the "Choose" button menu by clicking the Add new button on the right panel and click OK.

Click on the Run tab to get the output.


The results of the experiment are saved to the file Experiment4.txt:

Dataset,Run,Scheme,Scheme_options,Scheme_version_ID,Date_time,Number_of_instances,Number_correct,Number_incorrect,Number_unclassified,Percent_correct,Percent_incorrect,Percent_unclassified,Mean_absolute_error,Root_mean_squared_error,Relative_absolute_error,Root_relative_squared_error,SF_prior_entropy,SF_scheme_entropy,SF_entropy_gain,SF_mean_prior_entropy,SF_mean_scheme_entropy,SF_mean_entropy_gain,KB_information,KB_mean_information,KB_relative_information,True_positive_rate,Num_true_positives,False_positive_rate,Num_false_positives,True_negative_rate,Num_true_negatives,False_negative_rate,Num_false_negatives,IR_precision,IR_recall,F_measure,Summary
iris,1,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,15.0,36.0,0.0,29.41176470588235,70.58823529411765,0.0,0.4462386261694216,0.47377732045597576,100.0,100.0,81.5923629400546,81.5923629400546,0.0,1.5998502537265609,1.5998502537265609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?
iris,2,weka.classifiers.ZeroR,'',6077547173920530258,2.00102021558E7,51.0,11.0,40.0,0.0,21.568627450980394,78.43137254901961,0.0,0.4513648596693575,0.48049218646442554,100.0,100.0,83.58463098131035,83.58463098131035,0.0,1.6389143329668696,1.6389143329668696,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,31.0,1.0,20.0,0.0,0.0,0.0,?

KNOWLEDGE FLOW: The Knowledge Flow provides an alternative to the Explorer as a graphical front end to Weka's core algorithms. The Knowledge Flow is a work in progress so some of the functionality from the Explorer is not yet available. On the other hand, there are things that can be done in the Knowledge Flow but not in the Explorer.


The Knowledge Flow presents a "data-flow" inspired interface to Weka. The user can select Weka components from a tool bar, place them on a layout canvas and connect them together in order to form a "knowledge flow" for processing and analyzing data. At present, all of Weka's classifiers, filters, clusterers, loaders and savers are available in the KnowledgeFlow along with some extra tools.

Features of the KnowledgeFlow:
* intuitive data flow style layout
* process data in batches or incrementally
* process multiple batches or streams in parallel (each separate flow executes in its own thread)
* chain filters together
* view models produced by classifiers for each fold in a cross validation
* visualize performance of incremental classifiers during processing (scrolling plots of classification accuracy, RMS error, predictions etc.)

The components available in the KnowledgeFlow (DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization) are described in the Components section below.

Launching the KnowledgeFlow

The Weka GUI Chooser window is used to launch Weka's graphical environments. Select the button labeled "KnowledgeFlow" to start the KnowledgeFlow. Alternatively, you can launch the KnowledgeFlow from a terminal window by typing "java weka.gui.beans.KnowledgeFlow". At the top of the KnowledgeFlow window are seven tabs: DataSources, DataSinks, Filters, Classifiers, Clusterers, Evaluation and Visualization. The names are pretty much self-explanatory.

Components

Components available in the KnowledgeFlow:

DataSources All of WEKA's loaders are available.


DataSinks All of WEKA's savers are available.

Filters All of WEKA's filters are available.

Classifiers All of WEKA's classifiers are available.

Clusterers All of WEKA's clusterers are available.


Evaluation

• TrainingSetMaker - make a data set into a training set.
• TestSetMaker - make a data set into a test set.
• CrossValidationFoldMaker - split any data set, training set or test set into folds.
• TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.
• ClassAssigner - assign a column to be the class for any data set, training set or test set.
• ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below and example 6.4.2).
• ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested classifiers.
• IncrementalClassifierEvaluator - evaluate the performance of incrementally trained classifiers.
• ClustererPerformanceEvaluator - evaluate the performance of batch trained/tested clusterers.
• PredictionAppender - append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions.
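The fold makers above all partition a data set. What the CrossValidationFoldMaker produces can be sketched in a few lines of plain Python — an illustration of the idea only, not Weka's implementation (Weka also stratifies and randomizes the folds):

```python
def cross_validation_folds(data, k):
    """Split data into k folds; yield (train, test) pairs, where each
    fold serves as the test set exactly once."""
    folds = [data[i::k] for i in range(k)]       # round-robin assignment
    for i, test in enumerate(folds):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(10))
for train, test in cross_validation_folds(data, 5):
    assert len(test) == 2 and len(train) == 8
    assert sorted(train + test) == data          # every instance used once
```

Each of the k pairs would then be routed over the "trainingSet" and "testSet" connections to a downstream classifier component.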

Visualization

• DataVisualizer - component that can pop up a panel for visualizing data in a single large 2D scatter plot.
• ScatterPlotMatrix - component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot).
• AttributeSummarizer - component that can pop up a panel containing a matrix of histogram plots - one for each of the attributes in the input data.
• ModelPerformanceChart - component that can pop up a panel for visualizing threshold (i.e. ROC style) curves.


• TextViewer - component for showing textual data. Can show data sets, classification performance statistics etc.
• GraphViewer - component that can pop up a panel for visualizing tree based models.
• StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers).

Experiment:9

Aim: Setting up a flow to load an ARFF file (batch mode) and perform a cross validation using J48 (Weka's C4.5 implementation). First start the KnowledgeFlow. Next click on the DataSources tab and choose "ArffLoader" from the toolbar (the mouse pointer will change to a "cross hairs").


Next place the ArffLoader component on the layout area by clicking somewhere on the layout (A copy of the ArffLoader icon will appear on the layout area). Next specify an arff file to load by first right clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select "Configure" under "Edit" in the list from this menu and browse to the location of your arff file.

Alternatively, you can double-click on the icon to bring up the configuration dialog.

Next click the "Evaluation" tab at the top of the window and choose the "ClassAssigner" (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.

Now connect the ArffLoader to the ClassAssigner: first right click


over the ArffLoader and select the "dataSet" under "Connections" in the menu. A "rubber band" line will appear.

Move the mouse over the ClassAssigner component and left click - a red line labeled "dataSet" will connect the two components.


Next right click over the ClassAssigner and choose "Configure" from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).



Next grab a "CrossValidationFoldMaker" component from the Evaluation toolbar and place it on the layout.

Connect the ClassAssigner to the CrossValidationFoldMaker by right clicking over "ClassAssigner" and selecting "dataSet" from under "Connections" in the menu.


Next click on the "Classifiers" tab at the top of the window and scroll along the toolbar until you reach the "J48" component in the "trees" section.

Place a J48 component on the layout.

Connect the CrossValidationFoldMaker to J48 TWICE by first choosing "trainingSet" and then "testSet" from the pop-up menu for the CrossValidationFoldMaker.


Next go back to the "Evaluation" tab and place a "ClassifierPerformanceEvaluator" component on the layout.


Connect J48 to this component by selecting the "batchClassifier" entry from the pop-up menu for J48.


Next go to the "Visualization" toolbar and place a "TextViewer" component on the layout.

Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the "text" entry from the pop-up menu for ClassifierPerformanceEvaluator.


Now start the flow executing by selecting "Start loading" from the pop-up menu for ArffLoader.


When finished, you can view the results by choosing "Show results" from the pop-up menu for the TextViewer component.
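Conceptually, the finished flow (ArffLoader → ClassAssigner → CrossValidationFoldMaker → J48 → ClassifierPerformanceEvaluator → TextViewer) is a chain of components passing data sets, models and results along their connections. A compact stdlib-only sketch of that chain follows; the weather rows are made up for illustration, and a trivial majority-class learner stands in for J48, which builds a real decision tree.

```python
from collections import Counter

# illustrative (outlook, play) rows standing in for a loaded ARFF file
rows = [("sunny", "no"), ("rainy", "yes"), ("overcast", "yes"),
        ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes")]

def class_assigner(dataset, class_index=-1):
    """Pick one column as the class (last by default), like ClassAssigner."""
    out = []
    for r in dataset:
        cols = list(r)
        y = cols.pop(class_index)
        out.append((tuple(cols), y))
    return out

def cv_folds(dataset, k):
    """Round-robin folds, like CrossValidationFoldMaker (unstratified)."""
    folds = [dataset[i::k] for i in range(k)]
    for i in range(k):
        yield [x for j, f in enumerate(folds) if j != i for x in f], folds[i]

def train_majority(train):
    """Stand-in learner (NOT J48): predict the majority training class."""
    return Counter(y for _, y in train).most_common(1)[0][0]

def performance_evaluator(dataset, k=3):
    """Accumulate accuracy over folds, like ClassifierPerformanceEvaluator."""
    correct = total = 0
    for train, test in cv_folds(dataset, k):
        model = train_majority(train)
        correct += sum(y == model for _, y in test)
        total += len(test)
    return correct / total

def text_viewer(accuracy):
    print(f"Correctly classified: {accuracy:.1%}")

text_viewer(performance_evaluator(class_assigner(rows)))
# prints: Correctly classified: 83.3%
```

The point of the sketch is the wiring: each function consumes the previous one's output exactly as the KnowledgeFlow components pass data along their "dataSet", "trainingSet", "testSet", "batchClassifier" and "text" connections.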


Simple CLI

The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers, etc., but without the hassle of the CLASSPATH (it uses the CLASSPATH with which Weka was started). It offers a simple Weka shell with separated command line and output.

Commands

The following commands are available in the Simple CLI:
• java <classname> [<args>] - invokes a java class with the given arguments (if any)
• break - stops the current thread, e.g., a running classifier, in a friendly manner
• kill - stops the current thread in an unfriendly fashion
• cls - clears the output area
• exit - exits the Simple CLI
• help [<command>] - provides an overview of the available commands if without a command name as argument, otherwise more help on the specified command


Command redirection

Starting with this version of Weka one can perform a basic redirection:

java weka.classifiers.trees.J48 -t test.arff > j48.txt

Note: the > must be preceded and followed by a space, otherwise it is not recognized as redirection, but as part of another parameter.

Command completion

Commands starting with java support completion for classnames and filenames via Tab (Alt+BackSpace deletes parts of the command again). In case there are several matches, Weka lists all possible matches.

• package name completion: java weka.cl<Tab> results in the following output of possible matches of package names: Possible matches: weka.classifiers weka.clusterers

• classname completion: java weka.classifiers.meta.A<Tab> lists the following classes: Possible matches: weka.classifiers.meta.AdaBoostM1 weka.classifiers.meta.AdditiveRegression weka.classifiers.meta.AttributeSelectedClassifier

• filename completion: In order for Weka to determine whether the string under the cursor is a classname or a filename, filenames need to be absolute (Unix/Linux: /some/path/file; Windows: C:\Some\Path\file) or relative and starting with a dot (Unix/Linux: ./some/other/path/file; Windows: .\Some\Other\Path\file)


Experiment:10

AIM:

To design a knowledge flow layout that loads an ARFF file, applies attribute selection, normalizes the attributes, and stores the result with a CSV saver.

Procedure:

1) Click on "KnowledgeFlow" from the Weka GUI Chooser.
2) It opens a window called "Weka Knowledge Flow Environment".
3) Click on "DataSources" and select "ArffLoader" to read data from the ARFF source.
4) Now click on the knowledge flow layout area, which places the ArffLoader in the layout.
5) Click on "Filters" and select an attribute selection filter from the "supervised" filters. Place it on the design layout.
6) Now select another filter, from the "unsupervised" filters, to normalize the numeric attribute values. Place it on the design layout.
7) Click on "DataSinks" and choose "CSVSaver", which writes to a destination in CSV format. Place it on the design layout of the knowledge flow.
8) Now right click on "ArffLoader" and click on "dataSet" to direct the flow to "Attribute selection".
9) Now right click on "Attribute selection" and select "dataSet" to direct the flow to "Normalize", from which the flow is directed to the CSVSaver in the same way.
10) Right click on the CSVSaver and click on "Configure" to specify the destination where to store the results; here z:\weka @ ravi was selected.
11) Now right click on the "ArffLoader" and select "Configure" to specify the source data; here the iris relation was selected.
12) Now again right click on the "ArffLoader" and click on "Start loading", which executes the knowledge flow layout.
13) We can observe the results of the above process by opening the file z:\Weka@ravi\iris-weka.filters.supervised.attribute… in Notepad, which displays the results in comma separated value form:

Petal length, Petal width, Class
0.067797, 0.041667, Iris-setosa
0.067797, 0.041667, Iris-setosa
0.050847, 0.041667, Iris-setosa
0.627119, 0.541667, Iris-versicolor
0.830508, 0.833333, Iris-virginica
0.677966, 0.791667, Iris-virginica
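The Normalize filter used in step 6 performs min-max scaling of each numeric attribute to [0, 1]. A short sketch of that rescaling follows; the raw centimetre values and the 1.0-6.9 range are illustrative iris petal-length figures, chosen because they reproduce the numbers in the table above.

```python
def normalize(values):
    """Min-max normalization to [0, 1], as an unsupervised Normalize
    filter does for each numeric attribute."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# raw petal lengths in cm; 1.0 and 6.9 act as the attribute's min and max
petal_length = [1.0, 1.4, 1.4, 1.3, 4.7, 5.9, 5.0, 6.9]
print([round(v, 6) for v in normalize(petal_length)])
# the 1.4 entries map to (1.4 - 1.0) / (6.9 - 1.0) = 0.067797,
# consistent with the first column of the table above
```

Running this yields 0.067797, 0.050847, 0.627119, 0.830508 and 0.677966 for the interior values, matching the saved CSV.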


Description of the German credit dataset in ARFF (Attribute Relation File Format) Format:

Structure of ARFF Format:

% comment lines
@relation <relation name>
@attribute <attribute name> <type>
@data
set of data items separated by commas

% 1. Title: German Credit data
%
% 2. Source Information
%
% Professor Dr. Hans Hofmann
% Institut f"ur Statistik und "Okonometrie
% Universit"at Hamburg
% FB Wirtschaftswissenschaften
% Von-Melle-Park 5
% 2000 Hamburg 13
%
% 3. Number of Instances: 1000
%
% Two datasets are provided. The original dataset, in the form provided
% by Prof. Hofmann, contains categorical/symbolic attributes and
% is in the file "german.data".
%
% For algorithms that need numerical attributes, Strathclyde University
% produced the file "german.data-numeric". This file has been edited
% and several indicator variables added to make it suitable for
% algorithms which cannot cope with categorical variables. Several
% attributes that are ordered categorical (such as attribute 17) have
% been coded as integer. This was the form used by StatLog.
%
% 6. Number of Attributes german: 20 (7 numerical, 13 categorical)
%    Number of Attributes german.numer: 24 (24 numerical)
%
% 7. Attribute description for german
%
% Attribute 1: (qualitative)
% Status of existing checking account
% A11 : ... < 0 DM
% A12 : 0 <= ... < 200 DM
% A13 : ... >= 200 DM / salary assignments for at least 1 year
% A14 : no checking account


% Attribute 2: (numerical) % Duration in month % % Attribute 3: (qualitative) % Credit history % A30 : no credits taken/ % all credits paid back duly % A31 : all credits at this bank paid back duly % A32 : existing credits paid back duly till now % A33 : delay in paying off in the past % A34 : critical account/ % other credits existing (not at this bank) % % Attribute 4: (qualitative) % Purpose % A40 : car (new) % A41 : car (used) % A42 : furniture/equipment % A43 : radio/television % A44 : domestic appliances % A45 : repairs % A46 : education % A47 : (vacation - does not exist?) % A48 : retraining % A49 : business % A410 : others % % Attribute 5: (numerical) % Credit amount % % Attibute 6: (qualitative) % Savings account/bonds % A61 : ... < 100 DM % A62 : 100 <= ... < 500 DM % A63 : 500 <= ... < 1000 DM % A64 : .. >= 1000 DM % A65 : unknown/ no savings account % % Attribute 7: (qualitative) % Present employment since % A71 : unemployed % A72 : ... < 1 year % A73 : 1 <= ... < 4 years % A74 : 4 <= ... < 7 years % A75 : .. >= 7 years % % Attribute 8: (numerical) % Installment rate in percentage of disposable income


% % Attribute 9: (qualitative) % Personal status and sex % A91 : male : divorced/separated % A92 : female : divorced/separated/married % A93 : male : single % A94 : male : married/widowed % A95 : female : single % % Attribute 10: (qualitative) % Other debtors / guarantors % A101 : none % A102 : co-applicant % A103 : guarantor % % Attribute 11: (numerical) % Present residence since % % Attribute 12: (qualitative) % Property % A121 : real estate % A122 : if not A121 : building society savings agreement/ % life insurance % A123 : if not A121/A122 : car or other, not in attribute 6 % A124 : unknown / no property % % Attribute 13: (numerical) % Age in years % % Attribute 14: (qualitative) % Other installment plans % A141 : bank % A142 : stores % A143 : none % % Attribute 15: (qualitative) % Housing % A151 : rent % A152 : own % A153 : for free % % Attribute 16: (numerical) % Number of existing credits at this bank % % Attribute 17: (qualitative) % Job % A171 : unemployed/ unskilled - non-resident % A172 : unskilled - resident % A173 : skilled employee / official % A174 : management/ self-employed/


% highly qualified employee/ officer
%
% Attribute 18: (numerical)
% Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)
% Telephone
% A191 : none
% A192 : yes, registered under the customer's name
%
% Attribute 20: (qualitative)
% foreign worker
% A201 : yes
% A202 : no
%
% 8. Cost Matrix
%
% This dataset requires use of a cost matrix (see below)
%
%        1   2
% ---------------
% 1      0   1
% 2      5   0
%
% (1 = Good, 2 = Bad)
%
% The rows represent the actual classification and the columns
% the predicted classification.
%
% It is worse to class a customer as good when they are bad (5),
% than it is to class a customer as bad when they are good (1).
%
% Relabeled values in attribute checking_status
% From: A11 To: '<0'
% From: A12 To: '0<=X<200'
% From: A13 To: '>=200'
% From: A14 To: 'no checking'
%
% Relabeled values in attribute credit_history
% From: A30 To: 'no credits/all paid'
% From: A31 To: 'all paid'
% From: A32 To: 'existing paid'
% From: A33 To: 'delayed previously'
% From: A34 To: 'critical/other existing credit'


% % % Relabeled values in attribute purpose % From: A40 To: 'new car' % From: A41 To: 'used car' % From: A42 To: furniture/equipment % From: A43 To: radio/tv % From: A44 To: 'domestic appliance' % From: A45 To: repairs % From: A46 To: education % From: A47 To: vacation % From: A48 To: retraining % From: A49 To: business % From: A410 To: other % % % Relabeled values in attribute savings_status % From: A61 To: '<100' % From: A62 To: '100<=X<500' % From: A63 To: '500<=X<1000' % From: A64 To: '>=1000' % From: A65 To: 'no known savings' % % % Relabeled values in attribute employment % From: A71 To: unemployed % From: A72 To: '<1' % From: A73 To: '1<=X<4' % From: A74 To: '4<=X<7' % From: A75 To: '>=7' % % % Relabeled values in attribute personal_status % From: A91 To: 'male div/sep' % From: A92 To: 'female div/dep/mar' % From: A93 To: 'male single' % From: A94 To: 'male mar/wid' % From: A95 To: 'female single' % % % Relabeled values in attribute other_parties % From: A101 To: none % From: A102 To: 'co applicant'

% From: A103 To: guarantor
%
% Relabeled values in attribute property_magnitude
% From: A121 To: 'real estate'
% From: A122 To: 'life insurance'


% From: A123 To: car
% From: A124 To: 'no known property'
%
% Relabeled values in attribute other_payment_plans
% From: A141 To: bank
% From: A142 To: stores
% From: A143 To: none
%
% Relabeled values in attribute housing
% From: A151 To: rent
% From: A152 To: own
% From: A153 To: 'for free'
%
% Relabeled values in attribute job
% From: A171 To: 'unemp/unskilled non res'
% From: A172 To: 'unskilled resident'
% From: A173 To: skilled
% From: A174 To: 'high qualif/self emp/mgmt'
%
% Relabeled values in attribute own_telephone
% From: A191 To: none
% From: A192 To: yes
%
% Relabeled values in attribute foreign_worker
% From: A201 To: yes
% From: A202 To: no
%
% Relabeled values in attribute class
% From: 1 To: good
% From: 2 To: bad

@relation german_credit

@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment real


@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}

@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good


Lab Experiments

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment:

All valid attributes (20):
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment
8. installment_commitment
9. personal_status
10. other_parties (debtors)
11. residence_since
12. property_magnitude
13. age
14. other_payment_plans
15. housing
16. existing_credits
17. job
18. own_telephone
19. num_dependents
20. foreign_worker

Categorical (nominal) attributes, which take values from a fixed set (e.g. yes/no):
1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. other_parties
8. property_magnitude
9. other_payment_plans
10. housing
11. job
12. own_telephone
13. foreign_worker

Real-valued attributes:
1. duration
2. credit_amount
3. installment_commitment
4. residence_since
5. age
6. existing_credits
7. num_dependents
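These two groups can be read straight off the ARFF header: nominal attributes list their values in braces, while numeric ones are declared real. A minimal standard-library sketch, using a few declarations copied from the german_credit header shown earlier:

```python
# Classify ARFF attribute declarations as nominal or numeric.
# The declarations below are a sample taken from the german_credit header.
declarations = [
    "@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}",
    "@attribute duration real",
    "@attribute credit_amount real",
    "@attribute housing { rent, own, 'for free'}",
]

nominal, numeric = [], []
for line in declarations:
    _, name, rest = line.split(None, 2)   # "@attribute", name, type specification
    if rest.lstrip().startswith("{"):     # nominal attributes enumerate values in braces
        nominal.append(name)
    else:                                 # "real"/"numeric" attributes
        numeric.append(name)

print(nominal)  # ['checking_status', 'housing']
print(numeric)  # ['duration', 'credit_amount']
```

Running this over all 20 declarations reproduces the two lists above.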


2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

The following attributes may be crucial in making the credit-risk assessment:

1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment_commitment
8. existing_credits

Based on the above attributes, we can make a decision whether to give credit or not.

checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good
checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good
foreign_worker = no AND personal_status = male single: good
checking_status = no checking AND purpose = used car AND other_payment_plans = none: good
duration <= 15 AND other_parties = guarantor: good
duration <= 11 AND credit_history = critical/other existing credit: good
checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good
checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good
savings_status = >=1000 AND property_magnitude = real estate: good
savings_status = 500<=X<1000 AND employment = >=7: good
credit_history = no credits/all paid AND housing = rent: bad
savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good


checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good
installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good
installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good
duration > 30 AND savings_status = 100<=X<500: bad
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
duration > 30 AND credit_history = delayed previously: bad
duration > 42 AND savings_status = <100 AND residence_since > 1: bad
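Rules of this kind read as an ordered decision list: each record is checked against the rules in turn and the first match wins. The sketch below encodes three of the rules above as a plain Python function; the record is a dict keyed by the ARFF attribute names. This is only an illustration of the idea, not WEKA's internal representation.

```python
# A sketch of a few of the plain-English credit rules as a decision list.
# Rules are checked in order; the first one that fires decides the class.
def assess(record):
    if (record["checking_status"] == "no checking"
            and record["other_payment_plans"] == "none"
            and record["credit_history"] == "critical/other existing credit"):
        return "good"
    if record["duration"] <= 15 and record["other_parties"] == "guarantor":
        return "good"
    if (record["credit_history"] == "no credits/all paid"
            and record["housing"] == "rent"):
        return "bad"
    return "unknown"   # no rule fired

print(assess({"checking_status": "<0", "other_payment_plans": "bank",
              "credit_history": "no credits/all paid", "duration": 24,
              "other_parties": "none", "housing": "rent"}))  # bad
```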


3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.

A decision tree is a flow-chart-like tree structure where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Decision trees can be easily converted into classification rules. Examples of decision-tree learners are ID3, C4.5 and CART.

J48 pruned tree

1. Using the WEKA tool, we can generate a decision tree by selecting the "Classify" tab.
2. In the Classify tab, select the Choose option, where a list of different decision-tree learners is available. From that list select J48.
3. Under Test options, select the "Use training set" option.
4. The resulting window in WEKA is as follows:


5. To generate the decision tree, right-click on the result list and select the "Visualize tree" option, by which the decision tree will be generated.
6. The decision tree obtained for credit-risk assessment is too large to fit on the screen.
7. The decision tree above is unclear due to the large number of attributes.


4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in the dataset.

For example:

IF purpose = vacation THEN credit = bad;
ELSE IF purpose = business THEN credit = good;

In this way we classified each of the examples in the dataset.

We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples are incorrectly classified. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained on. This affects the accuracy, and hence we cannot reach 100% training accuracy.


5. Is testing on the training set as you did above a good idea? Why or why not?

It is a bad idea: if we take all the data into the training set, there is no independent data left to test whether the classification is correct or not.

As a rule of thumb, for a reliable accuracy estimate we take about 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But here, in the above model, we have taken the complete dataset as the training set, which results in only 85.5% accuracy.

Unnecessary attributes, which do not play a crucial role in credit-risk assessment, are also analyzed and trained on; this increases the complexity and finally leads to lower accuracy. If some part of the dataset is used as the training set and the remainder as the test set, the results are more trustworthy and the computation time is less.

This is why we prefer not to take the complete dataset as the training set.

Use training set result for the table GermanCreditData:

Correctly Classified Instances 855 85.5 %

Incorrectly Classified Instances 145 14.5 %

Kappa statistic 0.6251

Mean absolute error 0.2312

Root mean squared error 0.34

Relative absolute error 55.0377 %

Root relative squared error 74.2015 %

Total Number of Instances 1000
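The accuracy figures above follow directly from the instance counts, as a quick standard-library check shows:

```python
# Recomputing the "use training set" figures reported above:
# 855 of 1000 instances classified correctly.
correct, total = 855, 1000
accuracy = 100.0 * correct / total
error_rate = 100.0 * (total - correct) / total
print(f"{accuracy:.1f}% correct, {error_rate:.1f}% incorrect")  # 85.5% correct, 14.5% incorrect
```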


6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe what cross-validation is briefly. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?

Cross-validation:

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds" D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing is performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model.

That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.
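The partitioning step can be sketched with the standard library only: shuffle the instance indices, split them into k roughly equal folds, and in iteration i hold out fold D_i as the test set. This is a simplified sketch of the idea, not WEKA's implementation (which also stratifies the folds by class).

```python
# Minimal k-fold partitioning sketch.
import random

def k_folds(n_instances, k, seed=0):
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)          # random partitioning of the instances
    return [idx[i::k] for i in range(k)]      # k mutually exclusive folds

folds = k_folds(1000, 10)
for i, test_fold in enumerate(folds):
    train = [j for f in folds if f is not test_fold for j in f]
    # ...train a model on `train`, evaluate it on `test_fold`...
assert sum(len(f) for f in folds) == 1000     # every instance is used exactly once
```

With 1000 instances and 10 folds, each fold holds 100 instances, matching the run reported below.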

1. Select the Classify tab and the J48 decision tree, and under Test options select the cross-validation radio button with the number of folds set to 10.
2. The number of folds indicates the number of partitions of the dataset.
3. A Kappa statistic nearing 1 indicates close to 100% accuracy, so the errors would be nearly zeroed out; but in reality there is no such training set that gives 100% accuracy.


Cross Validation Result at folds: 10 for the table GermanCreditData:

Correctly Classified Instances 705 70.5 %

Incorrectly Classified Instances 295 29.5 %

Kappa statistic 0.2467

Mean absolute error 0.3467

Root mean squared error 0.4796

Relative absolute error 82.5233 %

Root relative squared error 104.6565 %

Total Number of Instances 1000

Here there are 1000 instances with 100 instances per partition.

Cross Validation Result at folds: 20 for the table GermanCreditData:

Correctly Classified Instances 698 69.8 %

Incorrectly Classified Instances 302 30.2 %

Kappa statistic 0.2264

Mean absolute error 0.3571

Root mean squared error 0.4883

Relative absolute error 85.0006 %

Root relative squared error 106.5538 %

Total Number of Instances 1000

Cross Validation Result at folds: 50 for the table GermanCreditData:

Correctly Classified Instances 709 70.9 %

Incorrectly Classified Instances 291 29.1 %

Kappa statistic 0.2538

Mean absolute error 0.3484

Root mean squared error 0.4825


Relative absolute error 82.9304 %

Root relative squared error 105.2826 %

Total Number of Instances 1000

Cross Validation Result at folds: 100 for the table GermanCreditData:

Correctly Classified Instances 710 71 %

Incorrectly Classified Instances 290 29 %

Kappa statistic 0.2587

Mean absolute error 0.3444

Root mean squared error 0.4771

Relative absolute error 81.959 %

Root relative squared error 104.1164 %

Total Number of Instances 1000

Percentage split does not allow 100%; it allows only up to 99.9%.
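The reason is simple arithmetic: a p% split trains on p% of the data and tests on the rest, so 100% would leave no test instances at all. A small sketch of the split sizes:

```python
# Percentage-split arithmetic: p% of the instances train, the rest test.
def split_sizes(n_instances, train_percent):
    n_train = round(n_instances * train_percent / 100.0)
    return n_train, n_instances - n_train

print(split_sizes(1000, 50))    # (500, 500) -> 500 test instances, as in the 50% run below
print(split_sizes(1000, 99.9))  # (999, 1)   -> a single test instance, as in the 99.9% run
```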


Percentage Split Result at 50%:

Correctly Classified Instances 362 72.4 %

Incorrectly Classified Instances 138 27.6 %

Kappa statistic 0.2725

Mean absolute error 0.3225

Root mean squared error 0.4764

Relative absolute error 76.3523 %

Root relative squared error 106.4373 %

Total Number of Instances 500


Percentage Split Result at 99.9%:

Correctly Classified Instances 0 0 %

Incorrectly Classified Instances 1 100 %

Kappa statistic 0

Mean absolute error 0.6667

Root mean squared error 0.6667

Relative absolute error 221.7054 %

Root relative squared error 221.7054 %

Total Number of Instances 1


7. Check to see if the data shows a bias against "foreign workers" (attribute 20), or "personal_status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

The accuracy increases because the two attributes "foreign_worker" and "personal_status" are not very important for training and analysis. By removing them, the training time is reduced to some extent and the accuracy increases. The decision tree created now is much smaller than the decision tree trained on the full dataset; this is the main difference between the two trees.

After "foreign_worker" is removed, the accuracy increases to 85.9%.


If we remove the 9th attribute, the accuracy further increases to 86.6%, which shows that these two attributes are not significant for training.


Cross validation after removing 9th attribute.

Percentage split after removing 9th attribute.


After removing the 20th attribute, the cross validation is as above.

After removing 20th attribute, the percentage split is as above.


8. Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)

Select attributes 2, 3, 5, 7, 10, 17 and 21, and click on Invert to remove the remaining attributes. Here the accuracy decreases.

Select random attributes and then check the accuracy.


After removing the attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the left-over attributes and visualize them.


After we remove these 14 attributes, the accuracy decreases to 76.4%; hence we can further try random combinations of attributes to increase the accuracy.

Cross validation


Percentage split


9. Sometimes, the cost of rejecting an applicant who actually has good credit (Case 1) might be higher than accepting an applicant who has bad credit (Case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?

In problem 6 we used equal costs when training the decision tree. Here we consider two cases with different costs: cost 5 in Case 1 and cost 2 in Case 2.

When we apply these costs and train the decision tree again, we observe a tree almost equal to the one obtained in problem 6, but the evaluation now reports a cost:

                 Case 1 (cost 5)    Case 2 (cost 2)
Total cost           3820               1705
Average cost         3.82               1.705

We do not find this cost factor in problem 6, as there we used equal costs. This is the major difference between the results of problem 6 and problem 9.

The cost matrices we used here:

Case 1:   5  1
          1  5

Case 2:   2  1
          1  2


1. Select the Classify tab.
2. Select More options from Test options.
3. Tick Cost-sensitive evaluation and go to Set.
4. Set the number of classes as 2.
5. Click on Resize to obtain the cost matrix.
6. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. The confusion matrix will be generated, and you can find the difference between the good and bad classifications.
8. Check whether the accuracy changes or not.


10. Do you think it is a good idea to prefer simple decision trees instead of having long complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?

When we consider long, complex decision trees, we have many unnecessary attributes in the tree, which results in an increase in the bias of the model. Because of this, the accuracy of the model can also be affected.

This problem can be reduced by considering a simple decision tree: the attributes will be fewer, which decreases the bias of the model, and the result will be more accurate. So it is a good idea to prefer simple decision trees instead of long, complex trees.

1. Open any existing ARFF file, e.g. labor.arff.
2. In the Preprocess tab, select All to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.


4. To generate the decision tree, right-click on the result list and select the "Visualize tree" option, by which the decision tree will be generated.


5. Right-click on the J48 algorithm to get the Generic Object Editor window.
6. In this, set the unpruned option to True.
7. Then press OK and then Start. We find that the tree becomes more complex if not pruned.

Visualize tree


8. The tree has become more complex.


11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning - explain this idea briefly. Try reduced-error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?

Reduced-error pruning:

The idea of using a separate pruning set for pruning - which is applicable to decision trees as well as rule sets - is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning.

Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests. However, this method is much slower.

Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed-world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the pruning set, has been used to evaluate the success of a rule when using reduced-error pruning.
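The success ratio [p + (N - n)] / T can be evaluated directly from the counts defined above; the numbers in the example below are illustrative, chosen only to exercise the formula:

```python
# Success ratio of a rule under reduced-error pruning:
# the rule covers t instances and gets p of them right; there are
# P positive instances among T instances in total.
def success_ratio(p, t, P, T):
    n = t - p        # negative instances the rule covers
    N = T - P        # negative instances in total
    return (p + (N - n)) / T

print(success_ratio(p=12, t=15, P=20, T=50))  # (12 + (30 - 3)) / 50 = 0.78
```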

1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the reducedErrorPruning option to True.
3. Then press OK and then Start.
4. We find that the accuracy increases when the reduced-error pruning option is selected.


12. (Extra credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules - one such classifier in WEKA is rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In WEKA, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE" rules.

Converting decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier:

PART decision list

outlook = overcast: yes (4.0)

windy = TRUE: no (4.0/1.0)

outlook = sunny: no (3.0/1.0)

: yes (3.0)

Number of Rules : 4

Yes, sometimes just one attribute can be good enough in making the decision.

In this dataset (weather), the single attribute for making the decision is "outlook":

outlook:

sunny -> no

overcast -> yes

rainy -> yes

(10/14 instances correct)
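The OneR idea is small enough to sketch in full: for each attribute, predict the majority class of each of its values, and keep the attribute whose rule makes the fewest training errors. On the weather.nominal data this selects "outlook" with 10/14 correct, matching the rule above. This is a simplified sketch (ties are broken by attribute order, and numeric attributes are not handled), not WEKA's implementation.

```python
# A compact OneR sketch on the weather.nominal dataset.
from collections import Counter

data = [  # outlook, temperature, humidity, windy, play
    ("sunny","hot","high","FALSE","no"), ("sunny","hot","high","TRUE","no"),
    ("overcast","hot","high","FALSE","yes"), ("rainy","mild","high","FALSE","yes"),
    ("rainy","cool","normal","FALSE","yes"), ("rainy","cool","normal","TRUE","no"),
    ("overcast","cool","normal","TRUE","yes"), ("sunny","mild","high","FALSE","no"),
    ("sunny","cool","normal","FALSE","yes"), ("rainy","mild","normal","FALSE","yes"),
    ("sunny","mild","normal","TRUE","yes"), ("overcast","mild","high","TRUE","yes"),
    ("overcast","hot","normal","FALSE","yes"), ("rainy","mild","high","TRUE","no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]

def one_r(data):
    best = None
    for i, name in enumerate(attrs):
        counts = {}                                   # value -> Counter of class labels
        for row in data:
            counts.setdefault(row[i], Counter())[row[-1]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}  # majority class per value
        correct = sum(c[rule[v]] for v, c in counts.items())
        if best is None or correct > best[2]:
            best = (name, rule, correct)
    return best

name, rule, correct = one_r(data)
print(name, rule, f"{correct}/{len(data)}")  # outlook {'sunny': 'no', ...} 10/14
```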

With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART gets 3rd place.


               J48     PART    OneR
TIME (sec)     0.12    0.14    0.04
RANK           II      III     I

But if you consider the accuracy, the J48 classifier has the highest ranking, PART gets second place and OneR gets third place.

               J48     PART    OneR
ACCURACY (%)   70.5    70.2    66.8

1. Open the existing file weather.nominal.arff.
2. Select All.
3. Go to the Classify tab.
4. Start.


Here the accuracy is 100%


The tree is equivalent to the following "if-then-else" rules:

If outlook=overcast then

play=yes

If outlook=sunny and humidity=high then

play = no

else

play = yes

If outlook=rainy and windy=true then

play = no

else

play = yes
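The rules above translate directly into nested if-then-else code: each root-to-leaf path of the tree becomes one branch. A minimal sketch:

```python
# The weather decision tree written directly as nested if-then-else rules.
def play(outlook, humidity, windy):
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity == "high" else "yes"
    # remaining case: outlook == "rainy"
    return "no" if windy else "yes"

print(play("sunny", "high", False))    # no
print(play("rainy", "normal", True))   # no
print(play("overcast", "high", True))  # yes
```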

To obtain the rules:

1. Go to Choose, then click on Rules and select PART.
2. Click on Save and Start.
3. Similarly for the OneR algorithm.


If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no
If outlook = sunny and humidity = normal then play = yes