
WEKA

CS 595 Knowledge Discovery and Data Mining

Assignment #1: Evaluation Report for WEKA

(Waikato Environment for Knowledge Analysis)

Presented by: Manoj Wartikar, Sameer Sagade

Date: 14th March, 2000



Weka Machine Learning Project.

Machine Learning:
An exciting and potentially far-reaching development in contemporary computer science is the invention and application of methods of Machine Learning. These enable a computer program to automatically analyze a large body of data and decide what information is most relevant. This crystallized information can then be used to help people make decisions faster and more accurately.

One of the central problems of the information age is dealing with the enormous explosion in the amount of raw information that is available. Machine learning (ML) has the potential to sift through this mass of information and convert it into knowledge that people can use. So far, however, it has been used mainly on small problems under well-controlled conditions.

The aim of the Weka Project is to bring the technology out of the laboratory and provide solutions that can make a difference to people. The overall goal of this research programme is to build a state-of-the-art facility for the development of ML techniques.

Objectives:
The team at Waikato has incorporated several standard ML techniques into a software “workbench” called WEKA (Waikato Environment for Knowledge Analysis). With WEKA, a specialist in a particular field is able to use ML to derive useful knowledge from databases that are far too large to be analyzed by hand. The main objectives of WEKA are to:

1. Make machine learning (ML) techniques generally available;
2. Apply them to practical problems such as agriculture;
3. Develop new machine learning algorithms;
4. Design a theoretical framework for the field.

Documented Features:
WEKA presents a collection of algorithms for solving real-world data mining problems. The software is written in Java 2 and includes a uniform interface to the standard techniques in machine learning. The following data mining techniques are implemented in WEKA:

1. Attribute Selection
2. Clustering
3. Classifiers (both numeric and non-numeric)
4. Association Rules
5. Filters
6. Estimators



Of these options, only the classifiers, association rules, and filters are available as direct executables; all the remaining functions are available as APIs. The data required by the software is in the “.arff” format, and sample databases are provided with the software.
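For reference, an ARFF file is plain text: a header names the relation and declares each attribute (numeric, or nominal with its allowed values), followed by a @data section listing one comma-separated instance per line. A small sketch in the style of the bundled iris.arff (attribute names taken from the outputs later in this report; the two data rows are illustrative):

@relation iris

@attribute sepallength numeric
@attribute sepalwidth numeric
@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}

@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor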

Features:
The WEKA package comprises a number of classes arranged in an inheritance hierarchy; to execute any function, we create an instance of the corresponding class. The functionality of WEKA is organized around the steps of the machine learning process.
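Since most functionality is exposed through classes rather than executables, a program can drive WEKA directly from Java. The following is a minimal sketch of ours (not taken from the WEKA documentation), assuming the WEKA 3 API; it loads an ARFF file, builds a J48 tree, and classifies the first instance:

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.classifiers.j48.J48;

public class WekaApiDemo {
    public static void main(String[] args) throws Exception {
        // Load the dataset and declare the last attribute as the class.
        Instances data = new Instances(
                new BufferedReader(new FileReader("data/iris.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Create an instance of the classifier class and build the model.
        J48 tree = new J48();
        tree.buildClassifier(data);

        // Predict the class of the first instance and print its label.
        double pred = tree.classifyInstance(data.instance(0));
        System.out.println(data.classAttribute().value((int) pred));
    }
}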

Classifiers:
A classifier class prints out the classifier it builds (for example, a decision tree) for the dataset given as input; a ten-fold cross-validation estimate of its performance is also calculated. The classifiers package implements the most common techniques, separately for categorical and numerical class values.

a) Classifiers for categorical prediction:

1. weka.classifiers.IBk              K-nearest neighbor learner
2. weka.classifiers.j48.J48          C4.5 decision trees
3. weka.classifiers.j48.PART         Rule learner
4. weka.classifiers.NaiveBayes       Naive Bayes with/without kernels
5. weka.classifiers.OneR             Holte's OneR
6. weka.classifiers.KernelDensity    Kernel density classifier
7. weka.classifiers.SMO              Support vector machines
8. weka.classifiers.Logistic         Logistic regression
9. weka.classifiers.AdaBoostM1       AdaBoost
10. weka.classifiers.LogitBoost      LogitBoost
11. weka.classifiers.DecisionStump   Decision stumps (for boosting)



Sample Executions of the Various Categorical Classifier Algorithms:

K Nearest Neighbour Algorithm:

>java weka.classifiers.IBk -t data/iris.arff

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

=== Error on training data ===

Correctly Classified Instances        150              100      %
Incorrectly Classified Instances        0                0      %
Mean absolute error                     0.0085
Root mean squared error                 0.0091
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica
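(In these confusion matrices, each row is the actual class and each column the predicted class, so off-diagonal entries count misclassifications; on the training data above, all 150 instances lie on the diagonal.)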

=== Stratified cross-validation ===

Correctly Classified Instances        144               96      %
Incorrectly Classified Instances        6                4      %
Mean absolute error                     0.0356
Root mean squared error                 0.1618
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  3 47 |  c = Iris-virginica



J48 Pruned Tree Algorithm:

>java weka.classifiers.j48.J48 -t data/iris.arff

J48 pruned tree
------------------

petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
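The numbers in parentheses at each leaf give how many training instances reach that leaf and, after the slash, how many of those are misclassified; (48.0/1.0), for example, means 48 instances reach the leaf and 1 of them is mislabeled.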

Number of Leaves : 5

Size of the tree : 9

=== Error on training data ===

Correctly Classified Instances        147               98      %
Incorrectly Classified Instances        3                2      %
Mean absolute error                     0.0233
Root mean squared error                 0.108
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances        143               95.3333 %
Incorrectly Classified Instances        7                4.6667 %
Mean absolute error                     0.0391
Root mean squared error                 0.1707
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor



=== Error on training data ===

Correctly Classified Instances        144               96      %
Incorrectly Classified Instances        6                4      %
Mean absolute error                     0.0324
Root mean squared error                 0.1495

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 48  2 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica

SMO (support vector machines) and Logistic (logistic regression) can handle only two-class data sets, so they are not evaluated here.

AdaBoostM1 and LogitBoost are meta-algorithms that boost the performance of a base classifier (such as DecisionStump, which exists mainly to serve as their weak learner). The base algorithm is run repeatedly inside the booster, which monitors its performance and reweights the training instances so that later rounds concentrate on the mistakes of earlier ones.
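For example, a boosted run could be invoked as follows (a hypothetical command line, extrapolating the -W sub-classifier convention used with RegressionByDiscretization later in this report; verify the flags against the installed version):

> java weka.classifiers.AdaBoostM1 -t data/iris.arff -W weka.classifiers.DecisionStump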



b) Classifiers for numerical prediction:

1. weka.classifiers.LinearRegression             Linear regression
2. weka.classifiers.m5.M5Prime                   Model trees
3. weka.classifiers.IBk                          K-nearest neighbor learner
4. weka.classifiers.LWR                          Locally weighted regression
5. weka.classifiers.RegressionByDiscretization   Uses categorical classifiers

Sample Executions of the Various Numeric Classifier Algorithms:

Linear Regression Model:

> java weka.classifiers.LinearRegression -t data/cpu.arff

Linear Regression Model

class =

-152.7641 * vendor=microdata,formation,prime,harris,dec,wang,perkin-elmer,nixdorf,bti,sratus,dg,burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl +
141.8644 * vendor=formation,prime,harris,dec,wang,perkin-elmer,nixdorf,bti,sratus,dg,burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl +
-38.2268 * vendor=burroughs,cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl +
39.4748 * vendor=cambex,magnuson,honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl +
-39.5986 * vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl +
21.4119 * vendor=ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl +
-41.2396 * vendor=gould,siemens,nas,adviser,sperry,amdahl +
32.0545 * vendor=siemens,nas,adviser,sperry,amdahl +
-113.6927 * vendor=adviser,sperry,amdahl +
176.5204 * vendor=sperry,amdahl +
-51.2583 * vendor=amdahl +
0.0616 * MYCT +
0.0171 * MMIN +
0.0054 * MMAX +
0.6654 * CACH +
-1.4159 * CHMIN +
1.5538 * CHMAX +


-41.4854
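Each vendor=... term is a binary indicator arising from WEKA's conversion of the nominal vendor attribute: it contributes its coefficient only when the instance's vendor appears in the listed set. A prediction is then the sum of the active indicator terms, the weighted numeric attributes (MYCT through CHMAX), and the intercept of -41.4854.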

=== Error on training data ===

Correlation coefficient                 0.963
Mean absolute error                    28.4042
Root mean squared error                41.6084
Relative absolute error                32.5055 %
Root relative squared error            26.9508 %
Total Number of Instances             209

=== Cross-validation ===

Correlation coefficient                 0.9328
Mean absolute error                    35.014
Root mean squared error                55.6291
Relative absolute error                39.9885 %
Root relative squared error            35.9513 %
Total Number of Instances             209


Pruned Training Model Tree:

> java weka.classifiers.m5.M5Prime -t data/cpu.arff

Pruned training model tree:

MMAX <= 14000 : LM1 (141/4.18%)
MMAX > 14000 : LM2 (68/51.8%)

Models at the leaves:

Smoothed (complex):

LM1: class = 4.15
    - 2.05vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl
    + 5.43vendor=adviser,sperry,amdahl
    - 5.78vendor=amdahl
    + 0.00638MYCT
    + 0.00158MMIN
    + 0.00345MMAX
    + 0.552CACH
    + 1.14CHMIN
    + 0.0945CHMAX

LM2: class = -113
    - 56.1vendor=honeywell,ipl,ibm,cdc,ncr,basf,gould,siemens,nas,adviser,sperry,amdahl
    + 10.2vendor=adviser,sperry,amdahl
    - 10.9vendor=amdahl
    + 0.012MYCT
    + 0.0145MMIN
    + 0.0089MMAX
    + 0.808CACH
    + 1.29CHMAX

Number of Leaves : 2
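To make a prediction, an instance is first routed down the tree (here by comparing its MMAX value against 14000) and the linear model at the leaf it reaches, LM1 or LM2, is then evaluated; different regions of the input space thus get different regression equations.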

=== Error on training data ===

Correlation coefficient                 0.9853
Mean absolute error                    13.4072
Root mean squared error                26.3977
Relative absolute error                15.3431 %
Root relative squared error            17.0985 %
Total Number of Instances             209

=== Cross-validation ===

Correlation coefficient                 0.9767
Mean absolute error                    13.1239
Root mean squared error                33.4455
Relative absolute error                14.9884 %


Root relative squared error            21.6147 %
Total Number of Instances             209


K Nearest Neighbour classifier Algorithm:

> java weka.classifiers.IBk -t data/cpu.arff

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

=== Error on training data ===

Correlation coefficient                 1
Mean absolute error                     0
Root mean squared error                 0
Relative absolute error                 0      %
Root relative squared error             0      %
Total Number of Instances             209
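The perfect training-set figures are expected rather than impressive: with one nearest neighbour, every training instance is its own closest neighbour, so only the cross-validation results below say anything about generalization.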

=== Cross-validation ===

Correlation coefficient                 0.9475
Mean absolute error                    20.8589
Root mean squared error                53.8162
Relative absolute error                23.8223 %
Root relative squared error            34.7797 %
Total Number of Instances             209


Locally Weighted Regression:

> java weka.classifiers.LWR -t data/cpu.arff

Locally weighted regression
===========================
Using linear weighting kernels
Using all neighbours

=== Error on training data ===

Correlation coefficient                 0.9967
Mean absolute error                     8.9683
Root mean squared error                12.6133
Relative absolute error                10.2633 %
Root relative squared error             8.1699 %
Total Number of Instances             209

=== Cross-validation ===

Correlation coefficient                 0.9808
Mean absolute error                    14.9006
Root mean squared error                31.0836
Relative absolute error                17.0176 %
Root relative squared error            20.0884 %
Total Number of Instances             209


Regression by Discretization:

> java weka.classifiers.RegressionByDiscretization -t data/cpu.arff -W weka.classifiers.IBk

// The sub-classifier is selected from the categorical classifiers

Regression by discretization

Class attribute discretized into 10 values

Subclassifier: weka.classifiers.IBk

IB1 instance-based classifier
using 1 nearest neighbour(s) for classification

=== Error on training data ===

Correlation coefficient                 0.9783
Mean absolute error                    32.0353
Root mean squared error                35.6977
Relative absolute error                36.6609 %
Root relative squared error            23.1223 %
Total Number of Instances             209

=== Cross-validation ===

Correlation coefficient                 0.9244
Mean absolute error                    41.5572
Root mean squared error                64.7253
Relative absolute error                47.4612 %
Root relative squared error            41.8299 %
Total Number of Instances             209
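The idea behind this method is to reduce regression to classification: the numeric class attribute is discretized into 10 intervals (as reported above), the chosen categorical classifier predicts an interval for each instance, and a numeric value derived from that interval is returned as the prediction.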


Association Rules:
Association rule mining finds interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored in databases, many industries are becoming interested in mining association rules from their databases. For example, the discovery of interesting association relationships among huge amounts of business transaction records can help catalog design, cross marketing, loss-leader analysis, and other business decision-making processes.

A typical example of association rule mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales.
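Two numbers govern which rules are reported. The support of a rule is the fraction of transactions containing all of its items, and the confidence is that support divided by the fraction containing just the antecedent. With illustrative figures: if 10 of 100 baskets contain both milk and bread while 40 contain milk, the rule milk ==> bread has support 10/100 = 0.1 and confidence 10/40 = 0.25.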

The WEKA software efficiently produces association rules for a given data set. The Apriori algorithm is used as the foundation of the package. It outputs all the large (frequent) itemsets for the specified minimum support, together with the rules that meet the specified minimum confidence.

A typical output of the associations package is:

Apriori Algorithm:

> java weka.associations.Apriori -t data/weather.nominal.arff -I yes

Apriori
=======

Minimum support: 0.2
Minimum confidence: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Large Itemsets L(1):
outlook=sunny 5
outlook=overcast 4
outlook=rainy 5
temperature=hot 4
temperature=mild 6
temperature=cool 4


humidity=high 7
humidity=normal 7
windy=TRUE 6
windy=FALSE 8
play=yes 9
play=no 5

Size of set of large itemsets L(2): 47

Large Itemsets L(2):
outlook=sunny temperature=hot 2
outlook=sunny temperature=mild 2
outlook=sunny humidity=high 3
outlook=sunny humidity=normal 2
outlook=sunny windy=TRUE 2
outlook=sunny windy=FALSE 3
outlook=sunny play=yes 2
outlook=sunny play=no 3
outlook=overcast temperature=hot 2
outlook=overcast humidity=high 2
outlook=overcast humidity=normal 2
outlook=overcast windy=TRUE 2
outlook=overcast windy=FALSE 2
outlook=overcast play=yes 4
outlook=rainy temperature=mild 3
outlook=rainy temperature=cool 2
outlook=rainy humidity=high 2
outlook=rainy humidity=normal 3
outlook=rainy windy=TRUE 2
outlook=rainy windy=FALSE 3
outlook=rainy play=yes 3
outlook=rainy play=no 2
temperature=hot humidity=high 3
temperature=hot windy=FALSE 3
temperature=hot play=yes 2
temperature=hot play=no 2
temperature=mild humidity=high 4
temperature=mild humidity=normal 2
temperature=mild windy=TRUE 3
temperature=mild windy=FALSE 3
temperature=mild play=yes 4
temperature=mild play=no 2
temperature=cool humidity=normal 4
temperature=cool windy=TRUE 2
temperature=cool windy=FALSE 2
temperature=cool play=yes 3


humidity=high windy=TRUE 3
humidity=high windy=FALSE 4
humidity=high play=yes 3
humidity=high play=no 4
humidity=normal windy=TRUE 3
humidity=normal windy=FALSE 4
humidity=normal play=yes 6
windy=TRUE play=yes 3
windy=TRUE play=no 3
windy=FALSE play=yes 6
windy=FALSE play=no 2

Size of set of large itemsets L(3): 39

Large Itemsets L(3):
outlook=sunny temperature=hot humidity=high 2
outlook=sunny temperature=hot play=no 2
outlook=sunny humidity=high windy=FALSE 2
outlook=sunny humidity=high play=no 3
outlook=sunny humidity=normal play=yes 2
outlook=sunny windy=FALSE play=no 2
outlook=overcast temperature=hot windy=FALSE 2
outlook=overcast temperature=hot play=yes 2
outlook=overcast humidity=high play=yes 2
outlook=overcast humidity=normal play=yes 2
outlook=overcast windy=TRUE play=yes 2
outlook=overcast windy=FALSE play=yes 2
outlook=rainy temperature=mild humidity=high 2
outlook=rainy temperature=mild windy=FALSE 2
outlook=rainy temperature=mild play=yes 2
outlook=rainy temperature=cool humidity=normal 2
outlook=rainy humidity=normal windy=FALSE 2
outlook=rainy humidity=normal play=yes 2
outlook=rainy windy=TRUE play=no 2
outlook=rainy windy=FALSE play=yes 3
temperature=hot humidity=high windy=FALSE 2
temperature=hot humidity=high play=no 2
temperature=hot windy=FALSE play=yes 2
temperature=mild humidity=high windy=TRUE 2
temperature=mild humidity=high windy=FALSE 2
temperature=mild humidity=high play=yes 2
temperature=mild humidity=high play=no 2
temperature=mild humidity=normal play=yes 2
temperature=mild windy=TRUE play=yes 2
temperature=mild windy=FALSE play=yes 2
temperature=cool humidity=normal windy=TRUE 2


temperature=cool humidity=normal windy=FALSE 2
temperature=cool humidity=normal play=yes 3
temperature=cool windy=FALSE play=yes 2
humidity=high windy=TRUE play=no 2
humidity=high windy=FALSE play=yes 2
humidity=high windy=FALSE play=no 2
humidity=normal windy=TRUE play=yes 2
humidity=normal windy=FALSE play=yes 4

Size of set of large itemsets L(4): 6

Large Itemsets L(4):
outlook=sunny temperature=hot humidity=high play=no 2
outlook=sunny humidity=high windy=FALSE play=no 2
outlook=overcast temperature=hot windy=FALSE play=yes 2
outlook=rainy temperature=mild windy=FALSE play=yes 2
outlook=rainy humidity=normal windy=FALSE play=yes 2
temperature=cool humidity=normal windy=FALSE play=yes 2

Best rules found:

1. humidity=normal windy=FALSE 4 ==> play=yes 4 (1)
2. temperature=cool 4 ==> humidity=normal 4 (1)
3. outlook=overcast 4 ==> play=yes 4 (1)
4. temperature=cool play=yes 3 ==> humidity=normal 3 (1)
5. outlook=rainy windy=FALSE 3 ==> play=yes 3 (1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 (1)
7. outlook=sunny humidity=high 3 ==> play=no 3 (1)
8. outlook=sunny play=no 3 ==> humidity=high 3 (1)
9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2 (1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2 (1)
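Reading the rules: the count after the antecedent is the number of instances it covers, the count after the consequent is how many of those instances also satisfy it, and the figure in parentheses is the resulting confidence. Rule 1, for example, covers 4 instances with humidity=normal and windy=FALSE, all 4 of which have play=yes, giving a confidence of 1.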


Advantages, Disadvantages and Future Upgrades:

The WEKA system covers the entire machine learning (knowledge discovery) process. Although a research project, the WEKA system has been able to implement and evaluate a number of different algorithms for the different steps in the machine learning process.

The output and the information provided by the package are sufficient for an expert in machine learning and related topics. The results displayed by the system give a detailed description of the flow and the steps involved in the entire machine learning process. The outputs produced by different algorithms are easy to compare, which makes analysis easier.

ARFF is one of the most widely used data storage formats for research databases, which makes the system well suited to research-oriented projects.

This package provides a number of application program interfaces (APIs) which help novice data miners build their own systems on top of the “core WEKA system”.

Since the system provides a number of switches and options, we can customize the output of the system to suit our needs.

The first major disadvantage is that the system is Java based and requires a Java Virtual Machine to be installed for its execution. Since the system is driven entirely by command-line parameters and switches, it is difficult for an amateur to use efficiently. The textual interface and output make the results all the more difficult to interpret and visualize.

Important results such as pruned trees and other hierarchical outputs cannot be displayed graphically, making them difficult to visualize.

Although a commonly used format, ARFF is the only data format that the WEKA system supports.

While the current version, 3.0.1, has these bugs and disadvantages, the developers are working on a better system and have come up with a new version with a graphical user interface, making the system more complete.


Appendix

(Sample executions for other algorithms covered)


PART Decision List Algorithm

>java weka.classifiers.j48.PART -t data/iris.arff

PART decision list
------------------

petalwidth <= 0.6: Iris-setosa (50.0)

petalwidth <= 1.7 AND
petallength <= 4.9: Iris-versicolor (48.0/1.0)

: Iris-virginica (52.0/3.0)

Number of Rules : 3

=== Error on training data ===

Correctly Classified Instances        146               97.3333 %
Incorrectly Classified Instances        4                2.6667 %
Mean absolute error                     0.0338
Root mean squared error                 0.1301
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  1 49 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances        142               94.6667 %
Incorrectly Classified Instances        8                5.3333 %
Mean absolute error                     0.0454
Root mean squared error                 0.1805
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica


Naïve Bayes Classifier Algorithm:

> java weka.classifiers.NaiveBayes -t data/iris.arff

Naive Bayes Classifier

Class Iris-setosa: Prior probability = 0.33

sepallength: Normal Distribution. Mean = 4.9913 StandardDev = 0.355 WeightSum = 50 Precision = 0.10588235294117648
sepalwidth: Normal Distribution. Mean = 3.4015 StandardDev = 0.3925 WeightSum = 50 Precision = 0.10909090909090911
petallength: Normal Distribution. Mean = 1.4694 StandardDev = 0.1782 WeightSum = 50 Precision = 0.14047619047619048
petalwidth: Normal Distribution. Mean = 0.2743 StandardDev = 0.1096 WeightSum = 50 Precision = 0.11428571428571428

Class Iris-versicolor: Prior probability = 0.33

sepallength: Normal Distribution. Mean = 5.9379 StandardDev = 0.5042 WeightSum = 50 Precision = 0.10588235294117648
sepalwidth: Normal Distribution. Mean = 2.7687 StandardDev = 0.3038 WeightSum = 50 Precision = 0.10909090909090911
petallength: Normal Distribution. Mean = 4.2452 StandardDev = 0.4712 WeightSum = 50 Precision = 0.14047619047619048
petalwidth: Normal Distribution. Mean = 1.3097 StandardDev = 0.1915 WeightSum = 50 Precision = 0.11428571428571428

Class Iris-virginica: Prior probability = 0.33

sepallength: Normal Distribution. Mean = 6.5795 StandardDev = 0.6353 WeightSum = 50 Precision = 0.10588235294117648
sepalwidth: Normal Distribution. Mean = 2.9629 StandardDev = 0.3088 WeightSum = 50 Precision = 0.10909090909090911
petallength: Normal Distribution. Mean = 5.5516 StandardDev = 0.5529 WeightSum = 50 Precision = 0.14047619047619048
petalwidth: Normal Distribution. Mean = 2.0343 StandardDev = 0.2646 WeightSum = 50 Precision = 0.11428571428571428
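To classify a new instance, the classifier multiplies each class's prior probability by the Gaussian density of every attribute value under that class's distributions above, and predicts the class with the largest product.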

OneR Classifier Algorithm:


> java weka.classifiers.OneR -t data/iris.arff

petallength:
< 2.45  -> Iris-setosa
< 4.75  -> Iris-versicolor
>= 4.75 -> Iris-virginica

(143/150 instances correct)
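OneR builds its rule set from the single attribute whose discretized rules make the fewest errors on the training data; here that attribute is petallength, which alone classifies 143 of the 150 instances correctly.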

=== Error on training data ===

Correctly Classified Instances        143               95.3333 %
Incorrectly Classified Instances        7                4.6667 %
Mean absolute error                     0.0311
Root mean squared error                 0.1764
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 44  6 |  b = Iris-versicolor
  0  1 49 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances        142               94.6667 %
Incorrectly Classified Instances        8                5.3333 %
Mean absolute error                     0.0356
Root mean squared error                 0.1886
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 44  6 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

Kernel Density Algorithm:

> java weka.classifiers.KernelDensity -t data/iris.arff

Kernel Density Estimator


=== Error on training data ===

Correctly Classified Instances        148               98.6667 %
Incorrectly Classified Instances        2                1.3333 %
Mean absolute error                     0.0313
Root mean squared error                 0.0944
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  1 49 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances        144               96      %
Incorrectly Classified Instances        6                4      %
Mean absolute error                     0.0466
Root mean squared error                 0.1389
Total Number of Instances             150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 48  2 |  b = Iris-versicolor
  0  4 46 |  c = Iris-virginica
