WEKA Lab Record



INTRODUCTION

WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.

WEKA is an open-source application that is freely available under the GNU General Public License. Originally written in C, the WEKA application has been completely rewritten in Java and is compatible with almost every computing platform.

It is user friendly, with a graphical interface that allows for quick setup and operation.

WEKA operates on the premise that the user data is available as a flat file or relation. This means that each data object is described by a fixed number of attributes of a specific type, normally nominal (alphanumeric) or numeric values.

WEKA gives novice users a tool to identify hidden information in databases and file systems, with simple-to-use options and visual interfaces.

The WEKA workbench contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to this functionality.

The original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (WEKA 3), for which development started in 1997, is now used in many different application areas, in particular for education and research.

ADVANTAGES OF WEKA

The obvious advantage of a package like WEKA is that a whole range of data preparation, feature selection and data mining algorithms are integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes easy. The package also comes with a GUI, which makes it easier to use.

Portability: since it is fully implemented in the Java programming language, it runs on almost any modern computing platform.

A comprehensive collection of data preprocessing and modeling techniques.

Ease of use due to its graphical user interfaces.

WEKA supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection.

All of WEKA's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute types are also supported).


WEKA provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query.

It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using WEKA. Another important area that is not covered is sequence modeling.

Attribute-Relation File Format (ARFF) is the text file format used by WEKA to store data.

The ARFF file contains two sections: the header and the data section. The first line of the header gives the relation name. Then there is the list of attributes (@attribute...). Each attribute is associated with a unique name and a type. The type describes the kind of data contained in the variable and what values it can have. The variable types are: numeric, nominal, string and date. The class attribute is by default the last one in the list. The header section can also contain comment lines, identified by a '%' at the beginning, which can describe the database content or give the reader information about the author. After that comes the data itself (@data); each line stores the attributes of a single entry, separated by commas.
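For illustration, a minimal ARFF file in the style of WEKA's weather.nominal dataset might look as follows (a sketch; the relation, attribute names and values are illustrative):

% Who plays, given the weather (comment line)
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,FALSE,no
overcast,FALSE,yes
rainy,TRUE,no

Here play, the last attribute, is the class attribute by default.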

WEKA's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the Experimenter, which allows the systematic comparison of the predictive performance of WEKA's machine learning algorithms on a collection of datasets.

Launching WEKA

The WEKA GUI Chooser window is used to launch WEKA's graphical environments. At the bottom of the window are four buttons:

1. Simple CLI. Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command line interface.

2. Explorer. An environment for exploring data with WEKA.

3. Experimenter. An environment for performing experiments and conducting statistical tests between learning schemes.

4. Knowledge Flow. This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
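As an example of what the Simple CLI accepts, the following command trains WEKA's J48 decision tree learner on an ARFF file (a sketch; it assumes weka.jar is on the classpath and weather.arff is in the current directory; with no separate test file, WEKA evaluates by 10-fold cross-validation):

java weka.classifiers.trees.J48 -t weather.arff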


If you launch WEKA from a terminal window, some text begins scrolling in the terminal. Ignore this text unless something goes wrong, in which case it can help in tracking down the cause. This User Manual focuses on using the Explorer but does not explain the individual data preprocessing tools and learning algorithms in WEKA. For more information on the various filters and learning methods in WEKA, see the book Data Mining (Witten and Frank, 2005).

The WEKA Explorer

Section Tabs

At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are greyed out. This is because it is necessary to open (and potentially pre-process) a dataset before starting to explore the data.

    The tabs are as follows:

    1. Preprocess. Choose and modify the data being acted on.

    2. Classify. Train and test learning schemes that classify or perform regression.

    3. Cluster. Learn clusters for the data.

    4. Associate. Learn association rules for the data.

    5. Select attributes. Select the most relevant attributes in the data.

    6. Visualize. View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the WEKA bird) stays visible regardless of which section you are in.

Status Box

The status box appears at the very bottom of the window. It displays messages that keep you informed about what's going on. For example, if the Explorer is busy loading a file, the status box will say that.

TIP: right-clicking the mouse anywhere inside the status box brings up a little menu. The menu gives two options:

1. Available memory. Display in the log box the amount of memory available to WEKA.

2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer needed and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running as a background task anyway.

    Log Button


Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a record of what has happened.

WEKA Status Icon

To the right of the status box is the WEKA status icon. When no processes are running, the bird sits down and takes a nap. The number beside the symbol gives the number of concurrent processes running. When the system is idle it is zero, but it increases as the number of processes increases. When any process is started, the bird gets up and starts moving around. If it is standing but stops moving for a long time, it is sick: something has gone wrong! In that case you should restart the WEKA Explorer.

1. Preprocessing

Opening files

The first three buttons at the top of the preprocess section enable you to load data into WEKA:

1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file system.

2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.

3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the file weka/experiment/DatabaseUtils.props.)

Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.

The Current Relation

Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation box (the current relation is the currently loaded data, which can be interpreted as a single relational table in database terminology) has three entries:

1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.

2. Instances. The number of instances (data points/records) in the data.

3. Attributes. The number of attributes (features) in the data.

Working with Attributes

Below the Current relation box is a box titled Attributes. There are three buttons, and beneath them is a list of the attributes in the current relation. The list has three columns:


1. No. A number that identifies the attribute in the order it is specified in the data file.

2. Selection tick boxes. These allow you to select which attributes are present in the relation.

3. Name. The name of the attribute, as it was declared in the data file.

When you click on different rows in the list of attributes, the fields change in the box to the right titled Selected attribute. This box displays the characteristics of the currently highlighted attribute in the list:

1. Name. The name of the attribute, the same as that given in the attribute list.

2. Type. The type of attribute, most commonly Nominal or Numeric.

3. Missing. The number (and percentage) of instances in the data for which this attribute is missing (unspecified).

4. Distinct. The number of different values that the data contains for this attribute.

5. Unique. The number (and percentage) of instances in the data having a value for this attribute that no other instances have.

Below these statistics is a list showing more information about the values stored in this attribute, which differs depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data: the minimum, maximum, mean and standard deviation. Below these statistics there is a colored histogram, color-coded according to the attribute chosen as the Class using the box above the histogram. (This box will bring up a drop-down list of available selections when clicked.) Note that only nominal Class attributes will result in a color-coding. Finally, after pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window.

Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The three buttons above can also be used to change the selection:

1. All. All boxes are ticked.

2. None. All boxes are cleared (unticked).

3. Invert. Boxes that are ticked become unticked and vice versa.

Once the desired attributes have been selected, they can be removed by clicking the Remove button below the list of attributes. Note that this can be undone by clicking the Undo button, which is located next to the Save button in the top-right corner of the Preprocess panel.


Working with Filters

The preprocess section allows filters to be defined that transform the data in various ways. The Filter box is used to set up the filters that are required. At the left of the Filter box is a Choose button. By clicking this button it is possible to select one of the filters in WEKA. Once a filter has been selected, its name and options are shown in the field next to the Choose button. Clicking on this box brings up a GenericObjectEditor dialog box.

The GenericObjectEditor Dialog Box

The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the available options. Clicking on any of these gives an opportunity to alter the filter's settings. For example, the setting may take a text string, in which case you type the string into the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required. Information on the options is provided in a tool tip if you let the mouse pointer hover over the corresponding field.

More information on the filter and its options can be obtained by clicking on the More button in the About panel at the top of the GenericObjectEditor window. Some objects display a brief description of what they do in an About box, along with a More button. Clicking on the More button brings up a window describing what the different options do.

At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open... and Save..., allow object configurations to be stored for future use. The Cancel button backs out without remembering any changes that have been made. Once you are happy with the object and settings you have chosen, click OK to return to the main Explorer window.

Applying Filters

Once you have selected and configured a filter, you can apply it to the data by pressing the Apply button at the right end of the Filter panel in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be undone by pressing the Undo button. You can also use the Edit... button to modify your data manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel saves the current version of the relation in the same formats available for loading data, allowing it to be kept for future use.

Note: Some of the filters behave differently depending on whether a class attribute has been set or not (using the box above the histogram, which will bring up a drop-down list of possible selections when clicked). In particular, the supervised filters require a class attribute to be set, and some of the unsupervised attribute filters will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which case no class is set.
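Filters can also be applied from the command line. As a sketch (the file names are illustrative), the following uses the unsupervised Remove filter to delete attributes 9 and 20 from a dataset:

java weka.filters.unsupervised.attribute.Remove -R 9,20 -i german_credit.arff -o german_reduced.arff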


2. Classification

Selecting a Classifier

At the top of the Classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier and its options. Clicking on the text box brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. The Choose button allows you to choose one of the classifiers that are available in WEKA.

Test Options

The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:

1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.

2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.

3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.

4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

Note: No matter which evaluation method is used, the model that is output is always the one built from all the training data. Further testing options can be set by clicking on the More options... button:

1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.

2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option is also selected by default.

3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.

4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.

5. Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is selected by default.

6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a cross-validation the instance numbers do not correspond to the location in the data!


7. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows you to specify the cost matrix used.

8. Random seed for xval / % Split. This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes.

The Class Attribute

The classifiers in WEKA are designed to be trained to predict a single class attribute, which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both. By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

Training a Classifier

Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button. When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below; but first we investigate the text that has been output.

The Classifier Output Text

The text in the Classifier output area has scroll bars allowing you to browse the results. Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:

1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were involved in the process.

2. Classifier model (full training set). A textual representation of the classification model that was produced on the full training data.

3. The results of the chosen test mode are broken down thus:

4. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.

5. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.

6. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.


The Result List

After training several classifiers, the result list will contain several entries. Left-clicking the entries flicks back and forth between the various results that have been generated. Right-clicking an entry invokes a menu containing these items:

1. View in main window. Shows the output in the main window (just like left-clicking the entry).

2. View in separate window. Opens a new independent window for viewing the results.

3. Save result buffer. Brings up a dialog allowing you to save a text file containing the textual output.

4. Load model. Loads a pre-trained model object from a binary file.

5. Save model. Saves a model object to a binary file. Objects are saved in Java serialized object form.

6. Re-evaluate model on current test set. Takes the model that has been built and tests its performance on the data set that has been specified with the Set... button under the Supplied test set option.

7. Visualize classifier errors. Brings up a visualization window that plots the results of classification. Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares.

8. Visualize tree or Visualize graph. Brings up a graphical representation of the structure of the classifier model, if possible (i.e. for decision trees or Bayesian networks). The graph visualization option only appears if a Bayesian network classifier has been built. In the tree visualizer, you can bring up a menu by right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in. The graph visualizer should be self-explanatory.

9. Visualize margin curve. Generates a plot illustrating the prediction margin. The margin is defined as the difference between the probability predicted for the actual class and the highest probability predicted for the other classes. For example, boosting algorithms may achieve better performance on test data by increasing the margins on the training data.

10. Visualize threshold curve. Generates a plot illustrating the tradeoffs in prediction that are obtained by varying the threshold value between classes. For example, with the default threshold value of 0.5, the predicted probability of positive must be greater than 0.5 for the instance to be predicted as positive. The plot can be used to visualize the precision/recall tradeoff, for ROC curve analysis (true positive rate vs false positive rate), and for other types of curves.

11. Visualize cost curve. Generates a plot that gives an explicit representation of the expected cost, as described by Drummond and Holte (2000).

Options are greyed out if they do not apply to the specific set of results.


3. Clustering

Selecting a Clusterer

By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

Cluster Modes

The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split, except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem, it may be helpful to disable this option.

Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

Learning Clusters

The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster assignments and Visualize tree. The latter is grayed out when it is not applicable.


4. Associating

Setting Up

This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.

Learning Associations

Once appropriate parameters for the association rule learner have been set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.

5. Selecting Attributes

Searching and Evaluating

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.

Options

The Attribute Selection Mode box has two options:

1. Use full training set. The worth of the attribute subset is determined using the full set of training data.

2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data. As with Classify, there is a drop-down box that can be used to specify which attribute to treat as the class.

Performing Selection

Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the Classify panel. It is also possible to Visualize reduced data, or, if you have used an attribute transformer such as PrincipalComponents, Visualize transformed data.


6. Visualizing

WEKA's visualization section allows you to visualize 2D plots of the current relation.

The scatter plot matrix

When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, color coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the attribute used to color the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to subsample the data. Note that changes will only come into effect once the Update button has been pressed.

Selecting an individual 2D scatter plot

When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualization of the scatter plot you selected. (We described above how to visualize particular results in a separate window, for example, classifier errors; the same visualization controls are used here.) Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is used for the y-axis.

Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up. To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points.

You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The 'X' and 'Y' written beside the strips show what the current axes are ('B' is used for both X and Y). Above the attribute strips is a slider labelled Jitter, which is a random displacement given to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different to just a single lonely instance.

Selecting Instances

There may be situations where it is helpful to select a subset of the data using the visualization tool. (A special case of this is the UserClassifier in the Classify panel, which lets you build your own classifier by interactively selecting instances.) Below the y-axis selector button is a drop-down list button for choosing a selection method. A group of data points can be selected in four ways:

1. Select Instance. Clicking on an individual data point brings up a window listing its attributes. If more than one point appears at the same location, more than one set of attributes is shown.


2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.

3. Polygon. You can build a free-form polygon that selects the points inside it. Left-click to add vertices to the polygon, right-click to complete it. The polygon will always be closed off by connecting the first point to the last.

4. Polyline. You can build a polyline that distinguishes the points on one side from those on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as opposed to a polygon, which is always closed).

Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area. Clicking on the Clear button erases the selected area without affecting the graph. Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous removals and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.


WEKA (Waikato Environment for Knowledge Analysis)

WEKA supports the following file formats:

ARFF (Attribute-Relation File Format)
CSV (Comma Separated Values)
C4.5
Binary files (serialized Instances, .bsi)

Confusion Matrix

A confusion matrix shows how well your classifier can recognize tuples of different classes.

True Positives: positive tuples that were correctly labeled by the classifier.
True Negatives: negative tuples that were correctly labeled by the classifier.
False Positives: negative tuples that were incorrectly labeled as positive by the classifier.
False Negatives: positive tuples that were incorrectly labeled as negative by the classifier.

                         Predicted Class
                         C1                 C2
Actual Class   C1        True Positives     False Negatives
               C2        False Positives    True Negatives
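As a worked example with hypothetical counts TP = 70, FN = 30, FP = 10 and TN = 90 (200 test tuples in all):

Accuracy  = (TP + TN) / total = (70 + 90) / 200 = 80.0%
Precision = TP / (TP + FP)    = 70 / 80  = 87.5%
Recall    = TP / (TP + FN)    = 70 / 100 = 70.0%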


1. Convert the following relation into CSV file format.
a. Write a chart for the result & analyze the result.
b. Perform evaluation on the training set.
c. Write the summary and confusion matrix and justify it.

sno  NAME             CD  MEFA  NS  OOAD  VLSI  WT  sum  avg    Result
1    MANASA           71  77    75  66    60    61  552  73.60  PASS
2    RAGHU VARDHAN    64  49    61  64    78    79  539  71.87  PASS
3    PRIYANKA         75  72    77  72    82    65  590  78.67  PASS
4    ZAKIA            64  47    36  18    37    39  381  50.80  Fail
5    NAVEEN           32  68    49  45    32    56  426  56.80  Fail
6    HARITHA          79  57    86  72    68    87  585  78.00  PASS
7    PURUSHOTHAM      54  71    66  58    70    59  522  69.60  PASS
8    BHANU PRAKASH    60  37    61  63    64    44  473  63.07  Fail
9    ABHILASH         12  45    11  7     26    27  259  34.53  Fail
10   SUDHEER          69  68    68  52    58    57  510  68.00  PASS
11   PARUSHA RAMULU   56  55    47  48    50    58  451  60.13  PASS
12   CHENNAKESHAVULU  87  52    81  72    90    88  618  82.40  PASS
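A sketch of the corresponding CSV file (the first line holds the attribute names; each remaining line holds one record):

sno,NAME,CD,MEFA,NS,OOAD,VLSI,WT,sum,avg,Result
1,MANASA,71,77,75,66,60,61,552,73.60,PASS
2,RAGHU VARDHAN,64,49,61,64,78,79,539,71.87,PASS
3,PRIYANKA,75,72,77,72,82,65,590,78.67,PASS
4,ZAKIA,64,47,36,18,37,39,381,50.80,Fail
5,NAVEEN,32,68,49,45,32,56,426,56.80,Fail
6,HARITHA,79,57,86,72,68,87,585,78.00,PASS
7,PURUSHOTHAM,54,71,66,58,70,59,522,69.60,PASS
8,BHANU PRAKASH,60,37,61,63,64,44,473,63.07,Fail
9,ABHILASH,12,45,11,7,26,27,259,34.53,Fail
10,SUDHEER,69,68,68,52,58,57,510,68.00,PASS
11,PARUSHA RAMULU,56,55,47,48,50,58,451,60.13,PASS
12,CHENNAKESHAVULU,87,52,81,72,90,88,618,82.40,PASS

Saved with a .csv extension, the file can be loaded from the Preprocess panel's Open file... dialog; WEKA reads the first row as the attribute names.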


Working with ARFF files

2. Convert the following relation into ARFF format. Analyze it and write the confusion matrix and justify it.

Eno  Ename       Salary  deptname
1    Pranav      45000   CSE
2    Karthika    35000   ECE
3    Ishaanth    35000   EEE
4    Mithrika    50000   CSE
5    Sriharsha   45000   ECE
6    Rishanth    50000   CSE
7    Sushanth    45000   EEE
8    Sai Rudhra  35000   ECE
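A sketch of the corresponding ARFF file (the attribute types are assumptions: Eno and Salary numeric, Ename string, deptname nominal; values containing spaces are quoted):

@relation employee
@attribute Eno numeric
@attribute Ename string
@attribute Salary numeric
@attribute deptname {CSE, ECE, EEE}
@data
1,Pranav,45000,CSE
2,Karthika,35000,ECE
3,Ishaanth,35000,EEE
4,Mithrika,50000,CSE
5,Sriharsha,45000,ECE
6,Rishanth,50000,CSE
7,Sushanth,45000,EEE
8,'Sai Rudhra',35000,ECE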


3. Convert the following relation into ARFF format.
a. Analyze it.
b. Classify records using J48 and generate the decision tree.
c. Write the confusion matrix and justify it.

Outlook   Temperature  Humidity  Windy  Play
Sunny     45           15        False  Yes
Overcast  20           30        False  Yes
Rainy     30           20        True   No
Sunny     50           10        True   No
Sunny     48           15        True   Yes
Overcast  28           36        False  Yes
Rainy     35           25        False  No
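A sketch of the corresponding ARFF file (Temperature and Humidity are taken as numeric, the other attributes as nominal; Play, the last attribute, becomes the default class for J48):

@relation weather
@attribute Outlook {Sunny, Overcast, Rainy}
@attribute Temperature numeric
@attribute Humidity numeric
@attribute Windy {True, False}
@attribute Play {Yes, No}
@data
Sunny,45,15,False,Yes
Overcast,20,30,False,Yes
Rainy,30,20,True,No
Sunny,50,10,True,No
Sunny,48,15,True,Yes
Overcast,28,36,False,Yes
Rainy,35,25,False,No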


4. Convert the following transaction database into ARFF format.

Transaction id   Items Bought
T1               I1, I2, I5
T2               I2, I4
T3               I2, I3
T4               I1, I2, I4
T5               I1, I3
T6               I2, I3
T7               I1, I3
T8               I1, I2, I3, I5
T9               I1, I2, I3
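One common way to encode a transaction database for WEKA is one nominal attribute per item, marking whether the item occurs in each transaction. A sketch (the {t, f} encoding is an assumption; some encodings use '?' instead of f so that Apriori reports only the items that are present):

@relation transactions
@attribute I1 {t, f}
@attribute I2 {t, f}
@attribute I3 {t, f}
@attribute I4 {t, f}
@attribute I5 {t, f}
@data
t,t,f,f,t
f,t,f,t,f
f,t,t,f,f
t,t,f,t,f
t,f,t,f,f
f,t,t,f,f
t,f,t,f,f
t,t,t,f,t
t,t,t,f,f

The same item-per-attribute encoding applies to the next experiment, with one attribute per item letter.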


5. Convert the following transaction database into ARFF format.

Transaction id   Items Bought
1                {M, O, N, K, E, Y}
2                {D, O, N, K, E, Y}
3                {M, A, K, E}
4                {M, U, C, K, Y}
5                {C, O, O, K, I, E}


Description of the German Credit dataset in ARFF (Attribute-Relation File Format):

Structure of the ARFF format:

% comment lines
@relation <relation name>
@attribute <attribute name> <type>
@data
<set of data items separated by commas>

% 1. Title: German Credit data
%
% 2. Source Information
%
% Professor Dr. Hans Hofmann
% Institut f"ur Statistik und "Okonometrie
% Universit"at Hamburg
% FB Wirtschaftswissenschaften
% Von-Melle-Park 5
% 2000 Hamburg 13
%
% 3. Number of Instances: 1000
%
% Two datasets are provided. The original dataset, in the form provided
% by Prof. Hofmann, contains categorical/symbolic attributes and
% is in the file "german.data".
%
% For algorithms that need numerical attributes, Strathclyde University
% produced the file "german.data-numeric". This file has been edited
% and several indicator variables added to make it suitable for
% algorithms which cannot cope with categorical variables. Several
% attributes that are ordered categorical (such as attribute 17) have
% been coded as integer. This was the form used by StatLog.
%
% 6. Number of Attributes german: 20 (7 numerical, 13 categorical)
% Number of Attributes german.numer: 24 (24 numerical)
%
% 7. Attribute description for german
%
% Attribute 1: (qualitative)
% Status of existing checking account
% A11 : ... < 0 DM
% A12 : 0 <= ... < 200 DM
% A13 : ... >= 200 DM / salary assignments for at least 1 year
% A14 : no checking account
%
% Attribute 2: (numerical)
% Duration in month
%
% Attribute 3: (qualitative)
% Credit history
% A30 : no credits taken/ all credits paid back duly
% A31 : all credits at this bank paid back duly


% A32 : existing credits paid back duly till now
% A33 : delay in paying off in the past
% A34 : critical account/ other credits existing (not at this bank)
%
% Attribute 4: (qualitative)
% Purpose
% A40 : car (new)
% A41 : car (used)
% A42 : furniture/equipment
% A43 : radio/television
% A44 : domestic appliances
% A45 : repairs
% A46 : education
% A47 : (vacation - does not exist?)
% A48 : retraining
% A49 : business
% A410 : others
%
% Attribute 5: (numerical)
% Credit amount
%
% Attribute 6: (qualitative)
% Savings account/bonds
% A61 : ... < 100 DM
% A62 : 100 <= ... < 500 DM


% Attribute 11: (numerical)
% Present residence since
%
% Attribute 12: (qualitative)
% Property
% A121 : real estate
% A122 : if not A121 : building society savings agreement/ life insurance
% A123 : if not A121/A122 : car or other, not in attribute 6
% A124 : unknown / no property
%
% Attribute 13: (numerical)
% Age in years
%
% Attribute 14: (qualitative)
% Other installment plans
% A141 : bank
% A142 : stores
% A143 : none
%
% Attribute 15: (qualitative)
% Housing
% A151 : rent
% A152 : own
% A153 : for free
%
% Attribute 16: (numerical)
% Number of existing credits at this bank
%
% Attribute 17: (qualitative)
% Job
% A171 : unemployed/ unskilled - non-resident
% A172 : unskilled - resident
% A173 : skilled employee / official
% A174 : management/ self-employed/ highly qualified employee/ officer
%
% Attribute 18: (numerical)
% Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)
% Telephone
% A191 : none
% A192 : yes, registered under the customer's name
%
% Attribute 20: (qualitative)
% foreign worker
% A201 : yes
% A202 : no
%
% 8. Cost Matrix
%
% This dataset requires use of a cost matrix (see below)


%               1      2
%        ----------------
%   1    |     0      1
%   2    |     5      0
%
% (1 = Good, 2 = Bad)
%
% The rows represent the actual classification and the columns
% the predicted classification.
%
% It is worse to class a customer as good when they are bad (5),
% than it is to class a customer as bad when they are good (1).
%
% Relabeled values in attribute checking_status
% From: A11 To: '


    % From: A73 To: '1


% Relabeled values in attribute class
% From: 1 To: good
% From: 2 To: bad
%
@relation german_credit
@attribute checking_status { '


Lab Experiments

1. List all the categorical (or nominal) attributes and the real-valued attributes separately.

From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment:

Total valid attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment_duration
8. installment_rate
9. personal_status
10. debtors
11. residence_since
12. property
13. age
14. installment_plans
15. housing
16. existing_credits
17. job
18. num_dependents
19. telephone
20. foreign_worker

Categorical or nominal attributes (which take True/False, etc. values): checking_status, credit_history, purpose, savings_status, employment_duration, personal_status, debtors, property, installment_plans, housing, job, telephone, foreign_worker.

Real-valued attributes: duration, credit_amount, installment_rate, residence_since, age, existing_credits, num_dependents.


2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.

According to me, the following attributes may be crucial in making the credit risk assessment.


    Based on the above attributes, we can make a decision whether to give credit or not.


3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete dataset as the training data. Report the model obtained after training.

A decision tree is a flow-chart-like tree structure where each internal node (non-leaf) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Decision trees can be easily converted into classification rules, e.g. by ID3, C4.5 and CART.

J48 pruned tree

The resulting window in WEKA is as follows:


    The decision tree above is unclear due to a large number of attributes.


4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?

In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in the dataset.

For example:

IF purpose = vacation THEN credit = bad;
ELSE IF purpose = business THEN credit = good;

In this way we classified each of the examples in the dataset.


5. Is testing on the training set as you did above a good idea? Why or why not?


6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?

Cross-validation:

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.


1. Select the Classify tab and the J48 decision tree, and in the test options select the cross-validation radio button with the number of folds as 10.

2. The number of folds indicates the number of partitions of the data.

3. A Kappa statistic nearing 1 indicates 100% accuracy, and hence all the errors would be zeroed out; but in reality there is no such training set that gives 100% accuracy.
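The same evaluation can also be run from the command line; a sketch (weka.jar on the classpath; the dataset file name is illustrative, and -x sets the number of cross-validation folds):

java weka.classifiers.trees.J48 -t german_credit.arff -x 10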

    Cross Validation Result at folds:10 for the table GermanCreditData:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances

    Here there are 1000 instances with 100 instances per partition.

    Cross Validation Result at folds:20 for the table GermanCreditData:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


    Cross Validation Result at folds:50 for the table GermanCreditData:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances

    Cross Validation Result at folds:100 for the table GermanCreditData:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


Percentage split does not allow 100%; it allows only up to 99.9%.

Percentage Split Result at 50%: (above screenshot)


Percentage Split Result at 50%:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances

Percentage Split Result at 99.9%:

Correctly Classified Instances

Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case, which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

This increases the accuracy, because the two attributes foreign worker and personal status are not very important in training and analyzing. By removing them, the training time is reduced to some extent, and this results in an increase in accuracy. The decision tree created from the full dataset is very large compared to the decision tree we have trained now. This is the main difference between the two decision trees.

After removing the foreign worker attribute, write the summary in the table shown below:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


After removing the 9th attribute (personal status), write the summary in the table shown below:

Correctly Classified Instances

Incorrectly Classified Instances

Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


Using cross-validation: after removing the 9th attribute (personal status), write the summary in the table shown below:

Correctly Classified Instances

Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


Using percentage split: after removing the 9th attribute (personal status), write the summary in the table shown below:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


Using cross-validation:

After removing the foreign worker attribute, write the summary in the table shown below:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances

After removing the 20th attribute (foreign worker), the cross-validation results are as above.


Using percentage split:

After removing the foreign worker attribute, write the summary in the table shown below:

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances

After removing the 20th attribute (foreign worker), the percentage split results are as above.


8. Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes initially before you start selecting the ones you want.)

Select attributes 2, 3, 5, 7, 10, 17 and 21, and click on Invert to remove the remaining attributes.

Classify using J48, collect the summary and write it in the table below.

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

Mean absolute error

Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances



    Select random attributes and then check the accuracy.

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


After removing the attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the leftover attributes and visualize them.

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances

After removing the above 14 attributes, classify using J48 and cross-validation:

Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances

After removing the above 14 attributes, classify using J48 and percentage split:

Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


9. Sometimes, the cost of rejecting an applicant who actually has good credit (Case 1) might be higher than accepting an applicant who has bad credit (Case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say, cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA.

Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?


1. Select the Classify tab.
2. Select More Options... from Test Options.
3. Tick Cost-sensitive evaluation and click on Set.


4. Set classes as 2.
5. Click on Resize and then we get the cost matrix.
6. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0, as shown below:

0.0  5.0
5.0  0.0

7. Then the confusion matrix will be generated and you can find out the difference between the good and bad classes.
8. Check whether the accuracy is changing or not.

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


10. Do you think it is a good idea to prefer simple decision trees instead of long, complex decision trees? How does the complexity of a Decision Tree relate to the bias of the model?

When we consider long, complex decision trees, we will have many unnecessary attributes in the tree, which results in an increase in the bias of the model. Because of this, the accuracy of the model can also be affected.

This problem can be reduced by considering a simple decision tree. The attributes will be fewer and this decreases the bias of the model. Due to this, the result will be more accurate.

So it is a good idea to prefer simple decision trees instead of long, complex trees.

1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, select ALL to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.


4. To generate the decision tree, right-click on the result list and select the Visualize tree option, by which the decision tree will be generated.


5. Right-click on the J48 algorithm to get the Generic Object Editor window.

6. In this, set the unpruned option to True.

7. Then press OK and then Start. We find the tree will become more complex if not pruned.

Visualize tree
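Equivalently, an unpruned J48 tree can be built from the command line with the -U option; a sketch (weka.jar on the classpath; the file name is illustrative):

java weka.classifiers.trees.J48 -U -t labour.arff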


8. The tree has become more complex.


11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use reduced error pruning - explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?


Reduced-error pruning:

1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the reducedErrorPruning option to True (leave unpruned as False, otherwise no pruning is performed at all).
3. Then press OK and then Start.
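From the command line, the corresponding J48 options are -R (use reduced error pruning) and -N (number of folds held out for pruning); a sketch with an illustrative file name:

java weka.classifiers.trees.J48 -R -N 3 -t german_credit.arff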


4. Find whether the accuracy has increased or decreased by selecting the reduced error pruning option, write it in the table below and justify it.

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


12. (Extra Credit): How can you convert a Decision Tree into "if-then-else rules"? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules - one such classifier in WEKA is rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In WEKA, rules.PART is one of the classifiers which converts decision trees into IF-THEN-ELSE rules.

Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list

outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)

Number of Rules: 4

Yes, sometimes just one attribute can be good enough in making the decision. In this dataset (weather), the single attribute for making the decision is outlook:

outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)

With respect to time, the OneR classifier has the highest rank, J48 is in 2nd place and PART gets 3rd place:

            J48    PART   OneR
TIME (sec)  0.12   0.14   0.04
RANK        II     III    I

But if you consider accuracy, the J48 classifier has the highest rank, PART gets second place and OneR gets last place:

              J48    PART   OneR
ACCURACY (%)  70.5   70.2   66.8


1. Open an existing file, e.g. weather.nominal.arff.

2. Select All.

3. Go to the Classify tab.

4. Start.

=== Evaluation on training set ===

    === Summary ===

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances


    Here the accuracy is 100%


The tree is something like an if-then-else rule.

Write if-then-else rules using the decision tree in the following table.


To extract the rules:

1. Go to Choose, then click on Rules and select PART.


    2. Click on Save and start.

=== Evaluation on training set ===

    === Summary ===

    Using PART Algorithm

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

Root relative squared error

Total Number of Instances

3. Similarly for the OneR algorithm.

    4. Click on Save and start.


=== Evaluation on training set ===

    === Summary ===

    Using OneR Algorithm

    Correctly Classified Instances

    Incorrectly Classified Instances

    Kappa statistic

    Mean absolute error

    Root mean squared error

    Relative absolute error

    Root relative squared error

    Total Number of Instances
