Source: dmce.ac.in/dmce-002/IT/Faculty/16005/dmbi_manual.pdf
DATA MINING AND BUSINESS INTELLIGENCE
(DMBI)
LAB MANUAL
EXPERIMENT NO. 1
AIM: Solving exercises in Data Exploration.
THEORY:
Data Exploration
Data Exploration is about describing the data by means of statistical and visualization techniques. We explore
data in order to bring important aspects of that data into focus for further analysis.
1. Univariate Analysis
Univariate analysis explores variables (attributes) one by one. Variables could be either categorical or
numerical. There are different statistical and visualization techniques of investigation for each type of variable.
Numerical variables can be transformed into categorical counterparts by a process called binning
or discretization. It is also possible to transform a categorical variable into its numerical
counterpart by a process called encoding. Finally, proper handling of missing values is an
important issue in mining data.
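As an illustration, here is a minimal sketch of both transformations (the data, bin count, and function names are assumptions for this example, not part of the manual):

```python
# Binning a numeric variable and encoding a categorical one,
# using only the standard library.

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Values equal to the maximum fall into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def encode(labels):
    """Map each distinct category to an integer code."""
    codes = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [codes[c] for c in labels]

heights = [150, 160, 165, 170, 180, 190]   # made-up sample
print(equal_width_bins(heights, 3))
print(encode(["red", "blue", "red"]))
```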
Numerical Variables
A numerical or continuous variable (attribute) is one that may take on any value within a finite or infinite
interval
(e.g., height, weight, temperature, blood glucose, ...). There are two types of numerical variables, interval and
ratio. An interval variable has values whose differences are interpretable, but it does not have a true zero. A good
example is temperature in Centigrade degrees. Data on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided. For example, we cannot say that one day is twice as hot as another day. In
contrast, a ratio variable has values with a true zero and can be added, subtracted, multiplied or divided (e.g., weight).
Univariate Analysis - Numerical
| Statistic | Visualization | Equation | Description |
|---|---|---|---|
| Count | Histogram | N | The number of values (observations) of the variable. |
| Minimum | Box Plot | Min | The smallest value of the variable. |
| Maximum | Box Plot | Max | The largest value of the variable. |
| Mean | Box Plot | Σxᵢ / N | The sum of the values divided by the count. |
| Median | Box Plot | | The middle value: an equal number of values lies below and above the median. |
| Mode | Histogram | | The most frequent value. There can be more than one mode. |
| Quantile | Box Plot | | A set of "cut points" that divide a set of data into groups containing equal numbers of values (quartile, quintile, percentile, ...). |
| Range | Box Plot | Max − Min | The difference between maximum and minimum. |
| Variance | Histogram | Σ(xᵢ − x̄)² / N | A measure of data dispersion. |
| Standard Deviation | Histogram | √Variance | The square root of variance. |
| Coefficient of Variation | Histogram | StDev / Mean | The standard deviation divided by the mean; a relative measure of data dispersion. |
| Skewness | Histogram | | A measure of symmetry or asymmetry in the distribution of data. |
| Kurtosis | Histogram | | A measure of whether the data are peaked or flat relative to a normal distribution. |
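The statistics in the table can be computed with the standard library; the sample values below are made up for illustration (not the Iris data):

```python
import statistics as st

data = [4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3]   # made-up sample values

count  = len(data)                 # Count: N
rng    = max(data) - min(data)     # Range: Max - Min
mean   = st.mean(data)             # Mean: sum of values / count
median = st.median(data)           # Median: the middle value
var    = st.pvariance(data)        # Variance: a measure of dispersion
stdev  = st.pstdev(data)           # Standard deviation: sqrt(variance)
cv     = stdev / mean              # Coefficient of variation: stdev / mean
# Skewness: the standardized third central moment.
skew   = sum((x - mean) ** 3 for x in data) / (count * stdev ** 3)

print(count, rng, round(mean, 3), median, round(cv, 4), round(skew, 4))
```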
Figure: box plot and histogram for the "sepal length" variable from the Iris dataset.
2. Bivariate Analysis
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the
concept of relationship between two variables, whether there exists an association and the
strength of this association, or whether there are differences between two variables and the
significance of these differences. There are three types of bivariate analysis: numerical &
numerical, categorical & categorical, and numerical & categorical.
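For two numerical variables, one common measure of association is the Pearson correlation coefficient; here is a minimal sketch with made-up height/weight data:

```python
# Pearson correlation: covariance divided by the product of the
# standard deviations; ranges from -1 to +1.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

height = [150, 160, 170, 180, 190]
weight = [50, 58, 66, 74, 82]     # perfectly linear in height
print(round(pearson(height, weight), 4))  # +1.0 for a perfect linear relation
```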
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 2
AIM: Solving exercises in Data Preprocessing.
THEORY:
Why preprocessing?
1. Real world data are generally
o Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
o Noisy: containing errors or outliers
o Inconsistent: containing discrepancies in codes or names
2. Tasks in data preprocessing
o Data cleaning: fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.
o Data integration: using multiple databases, data cubes, or files.
o Data transformation: normalization and aggregation.
o Data reduction: reducing the volume but producing the same or similar analytical
results.
o Data discretization: part of data reduction, replacing numerical attributes with
nominal ones.
Data cleaning
1. Fill in missing values (attribute or class value):
o Ignore the tuple: usually done when class label is missing.
o Use the attribute mean (or majority nominal value) to fill in the missing value.
o Use the attribute mean (or majority nominal value) for all samples belonging to
the same class.
o Predict the missing value by using a learning algorithm: consider the attribute
with the missing value as a dependent (class) variable and run a learning
algorithm (usually Bayes or decision tree) to predict the missing value.
2. Identify outliers and smooth out noisy data:
o Binning
Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below);
Then smooth by bin means, bin median, or bin boundaries.
o Clustering: group values in clusters and then detect and remove outliers
(automatic or manual)
o Regression: smooth by fitting the data into regression functions.
3. Correct inconsistent data: use domain knowledge or expert decision.
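The binning step above can be sketched as follows: sort the values, partition them into equal-frequency bins, then replace each value with its bin mean (the price values are illustrative):

```python
# Smoothing noisy data by bin means.

def smooth_by_bin_means(values, bins):
    data = sorted(values)
    size = len(data) // bins
    out = []
    for i in range(bins):
        # The last bin absorbs any remainder.
        chunk = data[i * size:(i + 1) * size] if i < bins - 1 else data[i * size:]
        mean = sum(chunk) / len(chunk)
        out.extend([mean] * len(chunk))
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # sample values
print(smooth_by_bin_means(prices, 3))
```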
Data transformation
1. Normalization:
o Scaling attribute values to fall within a specified range.
Example: to transform V in [min, max] to V' in [0, 1], apply V' = (V − min) / (max − min).
o Scaling by using mean and standard deviation (useful when min and max are
unknown or when there are outliers): V' = (V − mean) / stdev.
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by existing attributes.
Data reduction
1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filtering and wrapper
methods), searching the attribute space (see Lecture 5: Attribute-oriented
analysis).
o Principal component analysis (numeric attributes only): searching for a lower
dimensional space that can best represent the data.
2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping them into
intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization
3. Reducing the number of tuples
o Sampling
Discretization and generating concept hierarchies
1. Unsupervised discretization - class variable is not used.
o Equal-interval (equiwidth) binning: split the whole range of numbers in intervals
with equal size.
o Equal-frequency (equidepth) binning: use intervals containing equal number of
values.
2. Supervised discretization - uses the values of the class variable.
o Using class boundaries. Three steps:
Sort values.
Place breakpoints between values belonging to different classes.
If too many intervals, merge intervals with equal or similar class
distributions.
o Entropy (information)-based discretization. Example:
Information in a class distribution:
Denote a set of five values occurring in tuples belonging to two
classes (+ and -) as [+,+,+,-,-]
That is, the first 3 values belong to "+" tuples and the last 2 to "-" tuples.
Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5) - (2/5)*log(2/5) (logs are
base 2);
3/5 and 2/5 are the relative frequencies (probabilities).
Ignoring the order of the values, we can use the notation [3,2],
meaning 3 values from one class and 2 from the other.
Then, Info([3,2]) = -(3/5)*log(3/5) - (2/5)*log(2/5)
Information in a split (2/5 and 3/5 are weight coefficients):
Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
Method:
Sort the values;
Calculate information in all possible splits;
Choose the split that minimizes information;
Do not include breakpoints between values belonging to the same
class (this will increase information);
Apply the same to the resulting intervals until some stopping
criterion is satisfied.
3. Generating concept hierarchies: recursively applying partitioning or discretization
methods.
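The information calculations in the entropy-based example above can be verified with a short script:

```python
# Entropy of a class distribution, and the weighted information of a split,
# matching Info([3,2]) and Info([2,0],[1,2]) from the example.
import math

def info(counts):
    """Entropy of a class-count distribution, in bits."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def split_info(parts):
    """Weighted average entropy over the intervals of a split."""
    n = sum(sum(p) for p in parts)
    return sum(sum(p) / n * info(p) for p in parts)

print(round(info([3, 2]), 4))                  # Info([3,2])
print(round(split_info([[2, 0], [1, 2]]), 4))  # Info([2,0],[1,2])
```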
Normalization by Z-score
Assume that there are five rows with the IDs A, B, C, D and E, each row containing n different
variables (columns). We use record E as an example in the calculations below. The remaining
rows are normalized in the same way.
The normalized value e'i of value ei for row E in the ith column is calculated as:

e'i = (ei − mean(E)) / std(E)

where mean(E) is the mean of all values in row E and std(E) is their standard deviation.
If all values for row E are identical—so the standard deviation of E (std(E)) is equal to zero—
then all values for row E are set to zero.
Normalization by min-max transformation
To normalize data to the range [0, 1], simply calculate:
zi = (xi − min(x)) / (max(x) − min(x))
where x = (x1, ..., xn) and zi is the ith normalized value.
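Both normalizations can be sketched together (the sample row E is made up; the zero-standard-deviation rule matches the note above):

```python
# Z-score and min-max normalization of one row of values.
import statistics as st

def z_score(row):
    m, s = st.mean(row), st.pstdev(row)
    # If all values are identical, std is zero: set everything to zero.
    return [0.0] * len(row) if s == 0 else [(v - m) / s for v in row]

def min_max(row):
    lo, hi = min(row), max(row)
    return [(v - lo) / (hi - lo) for v in row]

E = [10, 20, 30, 40]   # made-up sample row
print(z_score(E))
print(min_max(E))
```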
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 3
AIM: Study of Data Mining tool – WEKA.
THEORY:
1. Introduction To WEKA
A) MAIN FEATURE OF WEKA
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand.
Weka is free software available under the GNU General Public License. The Weka workbench
contains a collection of visualization tools and algorithms for data analysis and predictive
modeling, together with graphical user interfaces for easy access to this functionality.
Weka supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. Weka provides access
to SQL databases using Java Database Connectivity and can process the result returned by a
database query. It is not capable of multi-relational data mining, but there is separate software for
converting a collection of linked database tables into a single table that is suitable for processing
using Weka.
B) DOWNLOAD AND INSTALLATION:
Step 1: Download from the link:
http://sourceforge.net/projects/weka/postdownload?source=dlp
Step 2: Install via the command:
sudo apt-get install weka
C) START APP: Start the app as shown in the diagram below.
D) MAIN USER INTERFACE:
Weka's main user interface is the Explorer, but essentially the same functionality can be accessed
through the component-based Knowledge Flow interface and from the command line. There is
also the Experimenter, which allows the systematic comparison of the predictive performance of
Weka's machine learning algorithms on a collection of datasets.
The Explorer interface features several panels providing access to the main components of the
workbench:
The Preprocess panel has facilities for importing data from a database, a CSV file, etc.,
and for preprocessing this data using a so-called filtering algorithm. These filters can be
used to transform the data (e.g., turning numeric attributes into discrete ones) and make it
possible to delete instances and attributes according to specific criteria.
The Classify panel enables the user to apply classification and regression algorithms
(indiscriminately called classifiers in Weka) to the resulting dataset, to estimate the
accuracy of the resulting predictive model, and to visualize erroneous predictions, ROC
curves, etc., or the model itself (if the model is amenable to visualization like, e.g., a
decision tree).
The Associate panel provides access to association rule learners that attempt to identify
all important interrelationships between attributes in the data.
The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-
means algorithm. There is also an implementation of the expectation maximization
algorithm for learning a mixture of normal distributions.
The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.
The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.
2. WEKA Functions and Tools:
A) Loading, Preprocessing and Visualization of Data file:
• Load data file in formats: ARFF, CSV, C4.5, binary
• Import from URL or SQL database (using JDBC)
• Preprocessing filters
– Adding/removing attributes
– Attribute value substitution
– Discretization
– Time series filters (delta, shift)
– Sampling, randomization
– Missing value management
– Normalization and other numeric transformations
B) FEATURE SELECTION:
In Weka, there are three options for performing attribute selection from the command line (not
everything is possible from the GUI):
the native approach, using the attribute selection classes directly
using a meta-classifier
the filter approach
C) CLASSIFICATION
A trained model can be saved as follows, e.g., J48:
train your model on the training data /some/where/train.arff
right-click in the Results list on the item whose model you want to save
select Save model and save it to /other/place/j48.model
You can load the previously saved model with the following steps:
load your test data /some/where/test.arff via the Supplied test set button
right-click in the Results list, select Load model and choose /other/place/j48.model
select Re-evaluate model on current test set
D) CLUSTERING:
Load the data file AUTOS.arff into WEKA using the same steps we used to load data into
the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the columns,
the attribute data, the distribution of the columns, etc. The screen should look like the figure
shown below after loading the data.
With this data set, we are looking to create clusters, so instead of clicking on the Classify tab,
click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear
(this will be our preferred method of clustering for this article). WEKA Explorer window should
look like the following figure at this point.
E) REGRESSION:
• Predicted target is continuous
• Methods
– Linear regression
– Simple Linear Regression
– Neural networks
– Regression trees …
3.) Data format in WEKA:
ARFF:
Attribute-Relation File Format (ARFF) is the text file format used by Weka to store a dataset.
This kind of file is structured as follows (the "weather" dataset):
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
The ARFF file contains two sections: the header and the data section. The first line of the header
tells us the relation name. Then there is the list of the attributes (@attribute...). Each attribute is
associated with a unique name and a type. The latter describes the kind of data contained in the
variable and what values it can have. The variables types are: numeric, nominal, string and date.
The class attribute is by default the last one of the list. In the header section there can also be
some comment lines, identified with a '%' at the beginning, which can describe the database
content or give the reader information about the author. After that comes the data itself (@data);
each line stores the attribute values of a single entry, separated by commas.
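Here is a minimal sketch of reading such a file (an illustration, not WEKA's own loader; quoted attribute names and date types are not handled):

```python
# Parse the header and data sections of a simple ARFF file.

def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif low.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])   # attribute name
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            data.append(line.split(','))
    return relation, attributes, data

sample = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}
@data
sunny,no
rainy,yes"""
rel, attrs, rows = parse_arff(sample)
print(rel, attrs, rows)
```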
4). Pros and Cons of WEKA data mining:
The WEKA system covers the entire machine learning (knowledge discovery) process.
Although a research project, the WEKA system has implemented and evaluated a
number of different algorithms for different steps in the machine learning process.
The output and the information provided by the package is sufficient for an expert in machine
learning and related topics. The results as displayed by the system show a detailed description of
the flow and the steps involved in the entire machine learning process. The outputs provided by
different algorithms are easy to compare and hence make the analysis easier.
ARFF dataset is one of the most widely used data storage formats for research databases, making
this system easier for use in research oriented projects.
This package provides a number of application program interfaces (APIs) which help novice
data miners build their systems using the "core WEKA system".
Since the system provides a number of switches and options, we can customize the output of the
system to suit our needs.
The first major disadvantage is that the system is Java based and requires a Java
Virtual Machine for its execution. Since the system is entirely based on command-line
parameters and switches, it is difficult for an amateur to use efficiently. The textual
interface and output make it all the more difficult to interpret and visualize the results.
Important results such as pruned trees and hierarchy-based outputs cannot be displayed, making
the results difficult to visualize.
Although a commonly used format, ARFF is the only format that the WEKA system supports.
The current version, i.e. 3.0.1, has some bugs and disadvantages; the developers are working on
a better system and have come up with a new version which has a graphical user interface,
making the system complete.
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 4
AIM: Implementation of preprocessing in WEKA.
THEORY:
This example illustrates some of the basic data preprocessing operations that can be performed
using WEKA. The sample data set used for this example is the "bank data" available in
comma-separated format.
The data contains the following fields:
- id: a unique identification number
- age: age of customer in years (numeric)
- sex: MALE / FEMALE
- region: inner_city / rural / suburban / town
- income: income of customer (numeric)
- married: is the customer married (YES/NO)
- children: number of children (numeric)
- car: does the customer own a car (YES/NO)
- save_acct: does the customer have a savings account (YES/NO)
- current_acct: does the customer have a current account (YES/NO)
- mortgage: does the customer have a mortgage (YES/NO)
- pep: did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in ".csv"
format files. This is fortunate since many databases or spreadsheet applications can save or
export data into flat files in this format. As can be seen in the sample data file, the first row
contains the attribute names (separated by commas) followed by each data row with attribute
values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the
data set can be saved into ARFF format. If, however, you are interested in converting a ".csv" file
into WEKA's native ARFF format using the command line, this can be accomplished using the
following command: java weka.core.converters.CSVLoader filename.csv > filename.arff
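As an alternative illustration of the same conversion, here is a simplified Python sketch (not WEKA's CSVLoader; the "numeric if every value parses as a float, else nominal" rule is an assumption for this example):

```python
# Convert a small CSV string into ARFF text.
import csv, io

def csv_to_arff(text, relation):
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}"]
    for i, name in enumerate(header):
        col = [r[i] for r in data]
        try:
            [float(v) for v in col]            # all values numeric?
            lines.append(f"@attribute {name} numeric")
        except ValueError:                     # otherwise treat as nominal
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(col)))))
    lines.append("@data")
    lines += [",".join(r) for r in data]
    return "\n".join(lines)

sample = "age,sex\n30,MALE\n42,FEMALE\n"
print(csv_to_arff(sample, "bank"))
```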
In this example, we load the data set into WEKA, perform a series of operations using WEKA's
attribute and discretization filters, and then perform association rule mining on the resulting data
set. While all of these operations can be performed from the command line, we use the GUI
interface for WEKA Explorer.
Initially (in the Preprocess tab) click "open" and navigate to the directory containing the data file
(.csv or .arff). In this case we will open the above data file. This is shown in Figure p1.
Figure p1
Since the data is not in ARFF format, a dialog box will prompt you to use the converter, as in
Figure p2. You can click on the "Use Converter" button, and click OK in the next dialog box that
appears (see Figure p3).
Figure p3
Once the data is loaded, WEKA will recognize the attributes and during the scan of the data will
compute some basic statistics on each attribute. The left panel in Figure p4 shows the list of
recognized attributes, while the top panels indicate the names of the base relation (or table) and
the current working relation (which are the same initially).
Figure p4
Clicking on any attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes, the frequency for each attribute value is shown, while for continuous
attributes we can obtain min, max, mean, standard deviation, etc.
Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id" attribute).
We need to remove this attribute before the data mining step. We can do this by using the
Attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This will show a
popup window with a list of available filters. Scroll down the list and select the
"weka.filters.unsupervised.attribute.Remove" filter as shown in Figure p7.
Figure p7
Next, click on the text box immediately to the right of the "Choose" button. In the resulting dialog
box enter the index of the attribute to be filtered out (this can be a range or a comma-separated
list). In this case, we enter 1, which is the index of the "id" attribute (see the left panel).
Make sure that the "invertSelection" option is set to false (otherwise everything except attribute 1
will be filtered). Then click "OK". Now, in the filter box you will see "Remove -R 1" (see Figure p9).
Figure p9
Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and
create a new working relation (whose name now includes the details of the filter that was
applied). The result is depicted in Figure p10:
Figure p10
It is possible now to apply additional filters to the new working relation. In this example,
however, we will save our intermediate results as separate data files and treat each step as a
separate WEKA session. To save the new working relation as an ARFF file, click the Save button
in the top panel. Here, we will save the new relation in the file "bank-data-R1.arff".
Figure p12 shows the top portion of the newly generated ARFF file (in TextPad).
Figure p12
Note that in the new data set, the "id" attribute and all the corresponding values in the records
have been removed. Also, note that Weka has automatically determined the correct types and
values associated with the attributes, as listed in the Attributes section of the ARFF file.
Discretization
Some techniques, such as association rule mining, can only be performed on categorical data.
This requires performing discretization on numeric or continuous attributes. There are 3 such
attributes in this data set: "age", "income", and "children". In the case of the "children" attribute,
the range of possible values is only 0, 1, 2, and 3. In this case, we have opted to keep all of
these values in the data. This means we can simply discretize by removing the keyword
"numeric" as the type for the "children" attribute in the ARFF file and replacing it with the set of
discrete values. We do this directly in our text editor, as seen in Figure p13. In this case,
we have saved the resulting relation in a separate file "bank-data2.arff".
Figure p13
We will rely on WEKA to perform discretization on the "age" and "income" attributes. In this
example, we divide each of these into 3 bins (intervals). The WEKA discretization filter can
divide the ranges blindly or use various statistical techniques to automatically determine the
best way of partitioning the data. In this case, we will perform simple binning. First we will load
our filtered data set into WEKA by opening the file "bank-data2.arff". If we select the "children"
attribute in this new data set, we see that it is now a categorical attribute with four possible
discrete values. This is depicted in Figure p15.
Figure p15
Now, once again we activate the Filter dialog box, but this time we select
"weka.filters.unsupervised.attribute.Discretize" from the list (see Figure p16).
Figure p16
Next, to change the defaults for this filter, click on the box immediately to the right of the
"Choose" button. This will open the Discretize Filter dialog box. We enter the index of the
attributes to be discretized. In this case we enter 1, corresponding to the attribute "age". We also
enter 3 as the number of bins (note that it is possible to discretize more than one attribute at the
same time by using a list of attribute indices). Since we are doing simple binning, all of the other
available options are set to "false". The dialog box is depicted in Figure p17.
Figure p17
Click "Apply" in the Filter panel. This will result in a new working relation with the selected
attribute partitioned into 3 bins (see Figure p18). To examine the results, we save the new
working relation in the file "bank-data3.arff" as depicted in Figure p19. Figure p18
Figure p19
Let us now examine the new data set using our text editor (in this case, TextPad). The top portion
of the data is shown in Figure p19. You can observe that WEKA has assigned its own labels to
each of the value ranges for the discretized attribute. For example, the lower range in the "age"
attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), while
the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in
the data records where the original age value was in the corresponding range.
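These cut points are consistent with simple equal-width binning. A minimal Java sketch of the computation, assuming the "age" values range from 18 to 67 (this range is inferred from the labels above, not stated explicitly in the data):

```java
// Equal-width binning sketch (min/max are assumptions inferred from WEKA's labels).
public class EqualWidthBinning {
    // Returns the 0-based index of the bin that value falls into.
    static int bin(double value, double min, double max, int bins) {
        double width = (max - min) / bins;
        int b = (int) ((value - min) / width);
        return Math.min(b, bins - 1); // clamp the maximum value into the last bin
    }

    public static void main(String[] args) {
        double min = 18, max = 67;        // assumed range of "age" in the bank data
        double width = (max - min) / 3.0; // 16.333... per bin
        // prints the two cut points appearing in WEKA's labels
        System.out.printf("cut points: %.6f, %.6f%n", min + width, min + 2 * width);
        System.out.println(bin(30, min, max, 3)); // 0 -> '(-inf-34.333333]'
        System.out.println(bin(40, min, max, 3)); // 1 -> '(34.333333-50.666667]'
    }
}
```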
Next, we apply the same process to discretize the "income" attribute into 3 bins. Again, Weka
automatically performs the binning and replaces the values in the "income" column with the
appropriate automatically generated labels. We save the new file into "bank-data3.arff",
replacing the older version.
Clearly, the WEKA labels, while readable, leave much to be desired as far as naming
conventions go. We will thus use the global search/replace functions in TextPad to replace these
labels with more succinct and readable ones. Fortunately, TextPad has a powerful regular
expression pattern-matching capability which allows us to do this efficiently. We use the
TextPad search/replace dialog box to replace the age label "(-inf-34.333333]" with the label
"0_34". Note that the "regular expression" option is selected. In the "Find what" box we enter the
full label '\'(-inf-34.333333]\'' (including the back-slashes and single quotes). Furthermore,
each back-slash is escaped with another back-slash so that the regular expression engine treats
it as a literal (resulting in: '\\'(-inf-34.333333]\\''). In the "Replace with" box we enter
"0_34".
Now we click on the "Replace All" button to replace all instances of the old patterns with the
new one. The result of this operation is depicted in Figure p21.
Figure p21
Note that the new label now appears in place of the old one both in the attribute section of the
ARFF file as well as in the relevant data records. We repeat this manual re-labeling process with
all of the WEKA-assigned labels for the "age" and the "income" attributes. Figure p22 shows the
final result of the transformation and the newly assigned labels for these attribute values.
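The same relabeling can also be scripted. A minimal Java sketch (a hypothetical helper, not part of the manual's TextPad procedure) using Pattern.quote, which plays the same role as the escaped back-slashes in TextPad's regex dialog:

```java
import java.util.regex.Pattern;

// Sketch: replace a WEKA-generated range label with a readable one.
public class RelabelRanges {
    static String relabel(String arffText, String wekaLabel, String newLabel) {
        // Pattern.quote escapes the (, ], - and \ characters so the label
        // is matched literally rather than as a regular expression.
        return arffText.replaceAll(Pattern.quote(wekaLabel), newLabel);
    }

    public static void main(String[] args) {
        // In Java source, each back-slash in the ARFF label is written as \\
        String line = "age '\\'(-inf-34.333333]\\''";
        System.out.println(relabel(line, "'\\'(-inf-34.333333]\\''", "0_34")); // age 0_34
    }
}
```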
Figure p22
We now also change the relation name in the ARFF file to "bank-data-final" and save the file as
bank-data-final.arff
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 5
AIM: Implementation of any one classifier using JAVA.
THEORY:
A Bayesian classifier is a simple probabilistic classifier. A Bayesian classifier can predict
class membership probabilities, such as the probability that a sample belongs to a particular
class or grouping. Bayesian classification is based on Bayes' theorem, and the technique tends
to be highly accurate and fast, making it useful on large databases.
Bayes' Theorem: Bayes' theorem states that the posterior probability of event A given event B is
P(A|B) = P(B|A)P(A) / P(B)
Naïve Bayesian Classification Algorithm:
The operation of the naïve Bayesian classifier is as follows:
1) Each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn).
2) Suppose there are m classes C1, C2, ..., Cm. Given an unknown data sample X, the
classifier predicts that X belongs to the class with the highest posterior probability, given
by Bayes' theorem as
P(Ci|X) = P(X|Ci)P(Ci) / P(X)
The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
3) As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. The prior
probability P(Ci) can be estimated as P(Ci) = Si/S, where Si is the number of training samples
of class Ci and S is the total number of training samples.
4) Sample X is therefore assigned to class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for
all 1 <= j <= m, j != i. In other words, X is assigned to the class Ci for which P(X|Ci)P(Ci)
is maximum.
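The steps above can be sketched as a minimal categorical naïve Bayes classifier in Java. The training data here is a hypothetical toy set, purely for illustration; no smoothing or numeric-attribute handling is included:

```java
import java.util.*;

// Minimal categorical naive Bayes sketch following the algorithm above.
public class NaiveBayesSketch {
    // Each sample: feature values, with the class label in the last position.
    static String classify(List<String[]> train, String[] x) {
        Map<String, Integer> classCount = new HashMap<>();
        for (String[] s : train)
            classCount.merge(s[s.length - 1], 1, Integer::sum);
        String best = null;
        double bestScore = -1;
        for (String c : classCount.keySet()) {
            // P(Ci) = Si / S
            double score = classCount.get(c) / (double) train.size();
            // multiply by P(xk | Ci) for each attribute (class-conditional independence)
            for (int k = 0; k < x.length; k++) {
                int match = 0;
                for (String[] s : train)
                    if (s[s.length - 1].equals(c) && s[k].equals(x[k])) match++;
                score *= match / (double) classCount.get(c);
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best; // class maximizing P(X|Ci)P(Ci)
    }

    public static void main(String[] args) {
        List<String[]> train = Arrays.asList(
            new String[]{"sunny", "hot",  "no"},
            new String[]{"sunny", "mild", "no"},
            new String[]{"rain",  "mild", "yes"},
            new String[]{"rain",  "cool", "yes"});
        System.out.println(classify(train, new String[]{"rain", "mild"})); // yes
    }
}
```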
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 6
AIM: Implementation of any one clustering algorithm using JAVA and verify result with
WEKA.
THEORY: Types of clustering:
Hierarchical algorithms: Hierarchical algorithms find successive clusters using previously
established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive
("top- down"). Agglomerative algorithms begin with each element as a separate cluster and
merge them into successively larger clusters. Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters.
Partitional algorithms: Partitional algorithms typically determine all clusters at once, but can
also be used as divisive algorithms in hierarchical clustering, e.g., k-means and k-medoids.
Density-based clustering algorithms: Density-based clustering algorithms are devised to
discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the
density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of
this kind.
Agglomerative clustering Algorithm:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s),
according to
d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster
to form clustering m. Set the level of this clustering to
L(m) = d[(r),(s)]
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters
(r) and (s) and adding a row and column corresponding to the newly formed cluster. The
proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined in this way:
d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }
5. If all objects are in one cluster, stop. Else, go to step 2.
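A compact Java sketch of this agglomerative procedure, using single-linkage distances on hypothetical one-dimensional points and recording the merge levels L(m):

```java
import java.util.*;

// Agglomerative clustering sketch: repeatedly merge the least dissimilar
// pair of clusters (single linkage), recording each merge level L(m).
public class AgglomerativeSketch {
    static List<Double> mergeLevels(double[] points) {
        // start with each point in its own cluster
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) clusters.add(new ArrayList<>(List.of(p)));
        List<Double> levels = new ArrayList<>();
        while (clusters.size() > 1) {
            int br = 0, bs = 1;
            double bestD = Double.MAX_VALUE;
            // step 2: find the least dissimilar pair (r, s) over all cluster pairs
            for (int r = 0; r < clusters.size(); r++)
                for (int s = r + 1; s < clusters.size(); s++) {
                    double d = dist(clusters.get(r), clusters.get(s));
                    if (d < bestD) { bestD = d; br = r; bs = s; }
                }
            levels.add(bestD);                            // step 3: L(m) = d[(r),(s)]
            clusters.get(br).addAll(clusters.remove(bs)); // merge (r) and (s)
        }
        return levels; // stops when all objects are in one cluster (step 5)
    }

    // single linkage: distance between the closest pair of members
    static double dist(List<Double> a, List<Double> b) {
        double min = Double.MAX_VALUE;
        for (double x : a) for (double y : b) min = Math.min(min, Math.abs(x - y));
        return min;
    }

    public static void main(String[] args) {
        System.out.println(mergeLevels(new double[]{1, 2, 9, 10})); // [1.0, 1.0, 7.0]
    }
}
```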
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved).
EXPERIMENT NO. 7
AIM: Implementation of association rule mining (Apriori algorithm) using JAVA and verify the
result with WEKA.
THEORY:
Apriori is an algorithm for frequent item set mining and association rule learning over transactional
databases. It proceeds by identifying the frequent individual items in the database and extending them to
larger and larger item sets as long as those item sets appear sufficiently often in the database. The
frequent item sets determined by Apriori can be used to determine association rules which highlight
general trends in the database: this has applications in domains such as market basket analysis.
Apriori is designed to operate on databases containing transactions (for example, collections of items
bought by customers, or the pages visited on a website). Each transaction is seen as a set of items (an
itemset). Given a support threshold, the Apriori algorithm identifies the item sets which are subsets of
at least that many transactions in the database. Apriori uses a "bottom up" approach, where frequent
subsets are extended one item at a time (a step known as candidate generation), and groups of candidates
are tested against the data. The algorithm terminates when no further successful extensions are found.
Limitations
Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs,
which have spawned other algorithms. Candidate generation generates large numbers of subsets
(the algorithm attempts to load up the candidate set with as many as possible before each scan).
Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any
maximal subset S only after all of its proper subsets.
Later algorithms such as Max-Miner[2] try to identify the maximal frequent item sets without
enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-
up approach.
Algorithm:
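A minimal level-wise sketch of the algorithm in Java. The database is a hypothetical toy set, and candidate generation is simplified to a pairwise join with direct support counting (the usual subset-pruning step is omitted):

```java
import java.util.*;

// Level-wise Apriori sketch: grow frequent itemsets one item at a time.
public class AprioriSketch {
    static List<Set<String>> frequentItemsets(List<Set<String>> db, int minSupport) {
        List<Set<String>> result = new ArrayList<>();
        // L1: frequent individual items
        Set<Set<String>> level = new LinkedHashSet<>();
        for (Set<String> t : db)
            for (String item : t)
                if (support(db, Set.of(item)) >= minSupport) level.add(Set.of(item));
        while (!level.isEmpty()) {
            result.addAll(level);
            // candidate generation: join pairs of k-itemsets into (k+1)-itemsets,
            // keeping only candidates that meet the support threshold
            Set<Set<String>> next = new LinkedHashSet<>();
            for (Set<String> a : level)
                for (Set<String> b : level) {
                    Set<String> c = new TreeSet<>(a);
                    c.addAll(b);
                    if (c.size() == a.size() + 1 && support(db, c) >= minSupport)
                        next.add(c);
                }
            level = next; // terminate when no successful extensions remain
        }
        return result;
    }

    // support count: number of transactions containing all the items
    static int support(List<Set<String>> db, Set<String> items) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(items)) n++;
        return n;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            Set.of("bread", "milk"), Set.of("bread", "jam"), Set.of("milk"));
        System.out.println(frequentItemsets(db, 2).size()); // 2: {bread} and {milk}
    }
}
```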
Example 1
Consider the following database, where each row is a transaction and each cell is an individual
item of the transaction:
alpha, beta, epsilon
alpha, beta, theta
alpha, beta, epsilon
alpha, beta, theta
The association rules that can be determined from this database are the following:
1. 100% of sets with alpha also contain beta
2. 50% of sets with alpha, beta also have epsilon
3. 50% of sets with alpha, beta also have theta
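These percentages are rule confidences: confidence(X -> Y) = support(X u Y) / support(X). A short Java sketch verifying the three rules against the four transactions above:

```java
import java.util.*;

// Confidence computation for the four-transaction example database.
public class AprioriExample {
    static List<Set<String>> db = Arrays.asList(
        Set.of("alpha", "beta", "epsilon"),
        Set.of("alpha", "beta", "theta"),
        Set.of("alpha", "beta", "epsilon"),
        Set.of("alpha", "beta", "theta"));

    // support count: number of transactions containing all the items
    static int support(Set<String> items) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(items)) n++;
        return n;
    }

    // confidence of rule X -> Y = support(X u Y) / support(X)
    static double confidence(Set<String> x, Set<String> y) {
        Set<String> xy = new HashSet<>(x);
        xy.addAll(y);
        return support(xy) / (double) support(x);
    }

    public static void main(String[] args) {
        System.out.println(confidence(Set.of("alpha"), Set.of("beta")));            // 1.0
        System.out.println(confidence(Set.of("alpha", "beta"), Set.of("epsilon"))); // 0.5
        System.out.println(confidence(Set.of("alpha", "beta"), Set.of("theta")));   // 0.5
    }
}
```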
We can also illustrate this through a variety of examples
select * from grocery_customer;
SALES_ID PROD_NAME
111 bread
111 milk
112 bread
112 jam
113 milk
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 8
AIM: Using WEKA to compare different classifiers using Experimenter.
THEORY:
1. Start Weka: Start Weka. This may involve finding it in your program launcher or double-clicking
on the weka.jar file. This will start the Weka GUI Chooser.
Weka GUI Chooser: Lets you choose one of the Explorer, Experimenter, KnowledgeFlow
and the Simple CLI (command line interface).
Click the “Experimenter” button to launch the Weka Experimenter.
The Weka Experimenter allows you to design your own experiments of running algorithms on
datasets, run the experiments and analyze the results. It’s a powerful tool.
2. Design Experiment: Click the “New” button to create a new experiment configuration.
Weka Experimenter
Start a new Experiment
Test Options
The experimenter configures the test options for you with sensible defaults. The experiment is
configured to use Cross Validation with 10 folds. It is a “Classification” type problem and each
algorithm + dataset combination is run 10 times (iteration control).
Iris flower Dataset: Let’s start out by selecting the dataset.
1. In the “Datasets” section click the “Add new…” button.
2. Open the “data” directory and choose the “iris.arff” dataset.
The Iris flower dataset is a famous dataset from statistics and is heavily borrowed by researchers
in machine learning. It contains 150 instances (rows) and 4 attributes (columns) and a class
attribute for the species of iris flower (one of setosa, versicolor, virginica). You can read more
about Iris flower dataset on Wikipedia.
Let’s choose 3 algorithms to run our dataset.
ZeroR
1. Click “Add new…” in the “Algorithms” section.
2. Click the “Choose” button.
3. Click “ZeroR” under the “rules” selection.
ZeroR is the simplest algorithm we can run. It picks the class value that is the majority in the
dataset and gives that for all predictions. Given that all three class values have an equal share (50
instances), it picks the first class value “setosa” and gives that as the answer for all predictions.
Just off the top of our head, we know that the best result ZeroR can give is 33.33% (50/150).
This is good to have as a baseline that we demand algorithms to outperform.
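A ZeroR sketch in Java illustrating this baseline (a hypothetical helper; ties resolve to the first class value seen, which is why "setosa" wins on the balanced iris data):

```java
import java.util.*;

// ZeroR sketch: predict the majority class value for every instance.
public class ZeroRSketch {
    static String zeroR(List<String> classValues) {
        // LinkedHashMap preserves first-seen order, so ties go to the first class
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String c : classValues) counts.merge(c, 1, Integer::sum);
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (best == null || e.getValue() > counts.get(best)) best = e.getKey();
        return best; // every future instance receives this prediction
    }

    public static void main(String[] args) {
        // 150 iris labels, 50 of each class
        List<String> labels = new ArrayList<>();
        for (String c : List.of("setosa", "versicolor", "virginica"))
            for (int i = 0; i < 50; i++) labels.add(c);
        System.out.println(zeroR(labels));        // setosa
        System.out.println(50.0 / labels.size()); // 0.3333... baseline accuracy
    }
}
```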
OneR
1. Click “Add new…” in the “Algorithms” section.
2. Click the “Choose” button.
3. Click “OneR” under the “rules” selection.
OneR is like our second simplest algorithm. It picks one attribute that best correlates with the
class value and splits it up to get the best prediction accuracy it can. Like the ZeroR algorithm,
the algorithm is so simple that you could implement it by hand and we would expect more
sophisticated algorithms to outperform it.
J48
1. Click “Add new…” in the “Algorithms” section.
2. Click the “Choose” button.
3. Click “J48” under the “trees” selection.
J48 is a decision tree algorithm. It is an implementation of the C4.8 algorithm in Java (“J” for Java
and 48 for C4.8). The C4.8 algorithm is a minor extension to the famous C4.5 algorithm and is a
very powerful prediction algorithm.
Weka Experimenter
Configure the experiment
We are ready to run our experiment.
3. Run Experiment
Click the “Run” tab at the top of the screen.
This tab is the control panel for running the currently configured experiment.
Click the big “Start” button to start the experiment and watch the “Log” and “Status” sections to
keep an eye on how it is doing.
Weka Experimenter
Run the experiment
Given that the dataset is small and the algorithms are fast, the experiment should complete in
seconds.
4. Review Results
Click the “Analyse” tab at the top of the screen.
This will open up the experiment results analysis panel.
Weka Experimenter
Load the experiment results
Click the “Experiment” button in the “Source” section to load the results from the current
experiment.
Algorithm Rank
The first thing we want to know is which algorithm was the best. We can do that by ranking the
algorithms by the number of times a given algorithm beat the other algorithms.
1. Click the “Select” button for the “Test base” and choose “Ranking“.
2. Now Click the “Perform test” button.
Weka Experimenter
Rank the algorithms in the experiment results
The ranking table shows the number of statistically significant wins each algorithm has had
against all other algorithms on the dataset. A win, means an accuracy that is better than the
accuracy of another algorithm and that the difference was statistically significant.
We can see that both J48 and OneR have one win each and that ZeroR has two losses. This is
good; it means that OneR and J48 are both potential contenders for outperforming our baseline
of ZeroR.
Algorithm Accuracy
Next we want to know what scores the algorithms achieved.
1. Click the “Select” button for the “Test base” and choose the “ZeroR” algorithm in the list
and click the “Select” button.
2. Click the check-box next to “Show std. deviations“.
3. Now click the “Perform test” button.
Weka Experimenter
Algorithm accuracy compared to ZeroR
In the “Test output” we can see a table with the results for the 3 algorithms. Each algorithm was
run 10 times on the dataset, and the accuracy reported is the mean, with the standard deviation
in brackets, of those 10 runs.
We can see that both the OneR and J48 algorithms have a little “v” next to their results. This
means that the difference in the accuracy for these algorithms compared to ZeroR is statistically
significant. We can also see that the accuracy for these algorithms compared to ZeroR is high, so
we can say that these two algorithms achieved a statistically significantly better result than the
ZeroR baseline.
The score for J48 is higher than the score for OneR, so next we want to see if the difference
between these two accuracy scores is significant.
1. Click the “Select” button for the “Test base” and choose the “J48” algorithm in the list
and click the “Select” button.
2. Now click the “Perform test” button.
Weka Experimenter
Algorithm accuracy compared to J48
We can see that the ZeroR has a “*” next to its results, indicating that its results compared to the
J48 are statistically different. But we already knew this. We do not see a “*” next to the results
for the OneR algorithm. This tells us that although the mean accuracy between J48 and OneR is
different, the difference is not statistically significant.
All things being equal, we would choose the OneR algorithm to make predictions on this
problem because it is the simpler of the two algorithms.
If we wanted to report the results, we would say that the OneR algorithm achieved a
classification accuracy of 92.53% (+/- 5.47%) which is statistically significantly better than
ZeroR at 33.33% (+/- 5.47%).
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 9
AIM: Implementation of KDD process in WEKA – Knowledge Flow
THEORY: Knowledge Flow
Major steps for building a process
1. Adding required nodes
1) Add a data source node from “DataSources”
2) Right click to configure it with a data set
3) Add a ClassAssigner node from “Evaluation” and a CrossValidationFoldMaker node
4) Add a classifier, e.g. J48, from “Classifiers”
5) Add a ClassifierPerformanceEvaluator node from “Evaluation”
6) Add a text viewer from “Visualization”
2. Connect the nodes
Right click “DataSource” node and choose DataSet, then connect it to the ClassAssigner
node, do the same or similar for connecting between the other nodes.
3. Run the process (using the default setups for each node)
Right click the DataSource node and choose “Start loading”; the process should run, and the
“Status” window should indicate whether the run completed correctly.
4. View the results:
If the run is correctly completed, right click “Text Viewer” node and choose “Show
results”, then another window pops out to show the results.
Results of the KDD process
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 10
AIM: Study and implementation of mining algorithm in XL Miner.
THEORY: Hierarchical cluster Analysis using XL Miner
Cluster Analysis: Cluster Analysis, also called data segmentation, has a variety of goals. All
relate to grouping or segmenting a collection of objects (also called observations, individuals,
cases, or data rows) into subsets or "clusters", such that those within each cluster are more
closely related to one another than objects assigned to different clusters. Central to all of the
goals of cluster analysis is the notion of degree of similarity (or dissimilarity) between the
individual objects being clustered. There are two major methods of clustering -- hierarchical
clustering and k-means clustering.
In hierarchical clustering the data are not partitioned into a particular cluster in a single step.
Instead, a series of partitions takes place, which may run from a single cluster containing all
objects to n clusters each containing a single object. Hierarchical Clustering is subdivided into
agglomerative methods, which proceed by series of fusions of the n objects into groups, and
divisive methods, which separate n objects successively into finer groupings. Agglomerative
techniques are more commonly used, and this is the method implemented in XLMiner™.
Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram,
which illustrates the fusions or divisions made at each successive stage of the analysis. An
example of such a dendrogram is given below:
Agglomerative methods
An agglomerative hierarchical clustering procedure produces a series of partitions of the data,
Pn, Pn-1, ..., P1. The first, Pn, consists of n single-object 'clusters'; the last, P1, consists
of a single group containing all n cases.
At each particular stage the method joins together the two clusters which are closest together
(most similar). (At the first stage, of course, this amounts to joining together the two objects
that are closest together, since at the initial stage each cluster has one object.)
Differences between methods arise because of the different ways of defining distance (or
similarity) between clusters. Several agglomerative techniques will now be described in detail.
Single linkage clustering
One of the simplest agglomerative hierarchical clustering methods is single linkage, also known
as the nearest neighbor technique. The defining feature of the method is that distance between
groups is defined as the distance between the closest pair of objects, where only pairs consisting
of one object from each group are considered.
In the single linkage method, D(r,s) is computed as
D(r,s) = min { d(i,j) : object i is in cluster r and object j is in cluster s }
Here the distance between every possible object pair (i,j) is computed, where object i is in cluster
r and object j is in cluster s. The minimum value of these distances is said to be the distance
between clusters r and s. In other words, the distance between two clusters is given by the value
of the shortest link between the clusters.
At each stage of hierarchical clustering, the clusters r and s , for which D(r,s) is minimum, are
merged.
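The definition of D(r,s) can be sketched directly in Java (hypothetical 2-D points, with Euclidean distance as d(i,j)):

```java
// Single-linkage inter-cluster distance: the minimum pairwise distance
// between a member of cluster r and a member of cluster s.
public class SingleLinkage {
    static double d(double[][] r, double[][] s) {
        double min = Double.MAX_VALUE;
        for (double[] i : r)
            for (double[] j : s)
                min = Math.min(min, euclid(i, j));
        return min;
    }

    // Euclidean distance d(i,j) between two points
    static double euclid(double[] a, double[] b) {
        double sum = 0;
        for (int k = 0; k < a.length; k++) sum += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] r = {{0, 0}, {1, 0}};
        double[][] s = {{4, 0}, {9, 9}};
        System.out.println(d(r, s)); // 3.0: the shortest link is (1,0)-(4,0)
    }
}
```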
This measure of inter-group distance is illustrated in the figure below:
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 11
AIM: Mini Project on any Business Intelligence application.
OBJECTIVE:
A BI report must be prepared outlining the following steps:
a) Problem definition, Identifying which data mining task is needed
b) Identify and use a standard data mining dataset available for the problem. Some links for data
mining datasets are: WEKA site, UCI Machine Learning Repository, KDD site, KDD Cup etc.
c) Implement the data mining algorithm of choice
d) Interpret and visualize the results