Source: dmce.ac.in/dmce-002/IT/Faculty/16005/dmbi_manual.pdf
DATA MINING AND BUSINESS INTELLIGENCE
(DMBI)
LAB MANUAL
EXPERIMENT NO. 1
AIM: Solving exercises in Data Exploration.
THEORY:
Data Exploration
Data Exploration is about describing the data by means of statistical and visualization techniques. We explore
data in order to bring important aspects of that data into focus for further analysis.
1. Univariate Analysis
Univariate analysis explores variables (attributes) one by one. Variables could be either categorical or
numerical. There are different statistical and visualization techniques of investigation for each type of variable.
Numerical variables can be transformed into categorical counterparts by a process called binning
or discretization. It is also possible to transform a categorical variable into its numerical
counterpart by a process called encoding. Finally, proper handling of missing values is an
important issue in mining data.
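As an illustration, here is a minimal sketch of both transformations (the data, bin count, and function names are assumptions for this example, not part of the manual):

```python
# Binning a numeric variable and encoding a categorical one,
# using only the standard library.

def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Values equal to the maximum fall into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def encode(labels):
    """Map each distinct category to an integer code."""
    codes = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [codes[c] for c in labels]

heights = [150, 160, 165, 170, 180, 190]   # made-up sample
print(equal_width_bins(heights, 3))
print(encode(["red", "blue", "red"]))
```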
Numerical Variables
A numerical or continuous variable (attribute) is one that may take on any value within a finite or infinite
interval
(e.g., height, weight, temperature, blood glucose, ...). There are two types of numerical variables, interval and
ratio. An interval variable has values whose differences are interpretable, but it does not have a true zero. A good
example is temperature in Centigrade degrees. Data on an interval scale can be added and subtracted but cannot be
meaningfully multiplied or divided. For example, we cannot say that one day is twice as hot as another day. In
contrast, a ratio variable has values with a true zero and can be added, subtracted, multiplied or divided (e.g., weight).
Univariate Analysis - Numerical
| Statistic | Visualization | Equation | Description |
|---|---|---|---|
| Count | Histogram | N | The number of values (observations) of the variable. |
| Minimum | Box Plot | Min | The smallest value of the variable. |
| Maximum | Box Plot | Max | The largest value of the variable. |
| Mean | Box Plot | Σxᵢ / N | The sum of the values divided by the count. |
| Median | Box Plot | | The middle value: an equal number of values lies below and above the median. |
| Mode | Histogram | | The most frequent value. There can be more than one mode. |
| Quantile | Box Plot | | A set of "cut points" that divide a set of data into groups containing equal numbers of values (quartile, quintile, percentile, ...). |
| Range | Box Plot | Max − Min | The difference between maximum and minimum. |
| Variance | Histogram | Σ(xᵢ − x̄)² / N | A measure of data dispersion. |
| Standard Deviation | Histogram | √Variance | The square root of variance. |
| Coefficient of Variation | Histogram | StDev / Mean | The standard deviation divided by the mean; a relative measure of data dispersion. |
| Skewness | Histogram | | A measure of symmetry or asymmetry in the distribution of data. |
| Kurtosis | Histogram | | A measure of whether the data are peaked or flat relative to a normal distribution. |
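The statistics in the table can be computed with the standard library; the sample values below are made up for illustration (not the Iris data):

```python
import statistics as st

data = [4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3]   # made-up sample values

count  = len(data)                 # Count: N
rng    = max(data) - min(data)     # Range: Max - Min
mean   = st.mean(data)             # Mean: sum of values / count
median = st.median(data)           # Median: the middle value
var    = st.pvariance(data)        # Variance: a measure of dispersion
stdev  = st.pstdev(data)           # Standard deviation: sqrt(variance)
cv     = stdev / mean              # Coefficient of variation: stdev / mean
# Skewness: the standardized third central moment.
skew   = sum((x - mean) ** 3 for x in data) / (count * stdev ** 3)

print(count, rng, round(mean, 3), median, round(cv, 4), round(skew, 4))
```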
Figure: box plot and histogram for the "sepal length" variable from the Iris dataset.
2. Bivariate Analysis
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the
concept of relationship between two variables, whether there exists an association and the
strength of this association, or whether there are differences between two variables and the
significance of these differences. There are three types of bivariate analysis: numerical &
numerical, categorical & categorical, and numerical & categorical.
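For two numerical variables, one common measure of association is the Pearson correlation coefficient; here is a minimal sketch with made-up height/weight data:

```python
# Pearson correlation: covariance divided by the product of the
# standard deviations; ranges from -1 to +1.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

height = [150, 160, 170, 180, 190]
weight = [50, 58, 66, 74, 82]     # perfectly linear in height
print(round(pearson(height, weight), 4))  # +1.0 for a perfect linear relation
```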
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 2
AIM: Solving exercises in Data Preprocessing.
THEORY:
Why preprocessing?
1. Real world data are generally
o Incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
o Noisy: containing errors or outliers
o Inconsistent: containing discrepancies in codes or names
2. Tasks in data preprocessing
o Data cleaning: fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.
o Data integration: using multiple databases, data cubes, or files.
o Data transformation: normalization and aggregation.
o Data reduction: reducing the volume but producing the same or similar analytical
results.
o Data discretization: part of data reduction, replacing numerical attributes with
nominal ones.
Data cleaning
1. Fill in missing values (attribute or class value):
o Ignore the tuple: usually done when class label is missing.
o Use the attribute mean (or majority nominal value) to fill in the missing value.
o Use the attribute mean (or majority nominal value) for all samples belonging to
the same class.
o Predict the missing value by using a learning algorithm: consider the attribute
with the missing value as a dependent (class) variable and run a learning
algorithm (usually Bayes or decision tree) to predict the missing value.
2. Identify outliers and smooth out noisy data:
o Binning
Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below);
Then smooth by bin means, bin median, or bin boundaries.
o Clustering: group values in clusters and then detect and remove outliers
(automatic or manual)
o Regression: smooth by fitting the data into regression functions.
3. Correct inconsistent data: use domain knowledge or expert decision.
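The binning step above can be sketched as follows: sort the values, partition them into equal-frequency bins, then replace each value with its bin mean (the price values are illustrative):

```python
# Smoothing noisy data by bin means.

def smooth_by_bin_means(values, bins):
    data = sorted(values)
    size = len(data) // bins
    out = []
    for i in range(bins):
        # The last bin absorbs any remainder.
        chunk = data[i * size:(i + 1) * size] if i < bins - 1 else data[i * size:]
        mean = sum(chunk) / len(chunk)
        out.extend([mean] * len(chunk))
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # sample values
print(smooth_by_bin_means(prices, 3))
```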
Data transformation
1. Normalization:
o Scaling attribute values to fall within a specified range.
Example: to transform V in [min, max] to V' in [0, 1], apply V' = (V − min) / (max − min).
o Scaling by using mean and standard deviation (useful when min and max are
unknown or when there are outliers): V' = (V − mean) / stdev.
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by existing attributes.
Data reduction
1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filtering and wrapper
methods), searching the attribute space (see Lecture 5: Attribute-oriented
analysis).
o Principal component analysis (numeric attributes only): searching for a lower
dimensional space that can best represent the data.
2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping them into
intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization
3. Reducing the number of tuples
o Sampling
Discretization and generating concept hierarchies
1. Unsupervised discretization - class variable is not used.
o Equal-interval (equiwidth) binning: split the whole range of numbers in intervals
with equal size.
o Equal-frequency (equidepth) binning: use intervals containing equal number of
values.
2. Supervised discretization - uses the values of the class variable.
o Using class boundaries. Three steps:
Sort values.
Place breakpoints between values belonging to different classes.
If too many intervals, merge intervals with equal or similar class
distributions.
o Entropy (information)-based discretization. Example:
Information in a class distribution:
Denote a set of five values occurring in tuples belonging to two
classes (+ and -) as [+,+,+,-,-]
That is, the first 3 values belong to "+" tuples and the last 2 to "-" tuples.
Then, Info([+,+,+,-,-]) = -(3/5)*log(3/5) - (2/5)*log(2/5) (logs are
base 2);
3/5 and 2/5 are the relative frequencies (probabilities).
Ignoring the order of the values, we can use the notation [3,2],
meaning 3 values from one class and 2 from the other.
Then, Info([3,2]) = -(3/5)*log(3/5) - (2/5)*log(2/5)
Information in a split (2/5 and 3/5 are weight coefficients):
Info([+,+],[+,-,-]) = (2/5)*Info([+,+]) + (3/5)*Info([+,-,-])
Or, Info([2,0],[1,2]) = (2/5)*Info([2,0]) + (3/5)*Info([1,2])
Method:
Sort the values;
Calculate information in all possible splits;
Choose the split that minimizes information;
Do not include breakpoints between values belonging to the same
class (this will increase information);
Apply the same to the resulting intervals until some stopping
criterion is satisfied.
3. Generating concept hierarchies: recursively applying partitioning or discretization
methods.
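The information calculations in the entropy-based example above can be verified with a short script:

```python
# Entropy of a class distribution, and the weighted information of a split,
# matching Info([3,2]) and Info([2,0],[1,2]) from the example.
import math

def info(counts):
    """Entropy of a class-count distribution, in bits."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def split_info(parts):
    """Weighted average entropy over the intervals of a split."""
    n = sum(sum(p) for p in parts)
    return sum(sum(p) / n * info(p) for p in parts)

print(round(info([3, 2]), 4))                  # Info([3,2])
print(round(split_info([[2, 0], [1, 2]]), 4))  # Info([2,0],[1,2])
```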
Normalization by Z-score
Assume that there are five rows with the IDs A, B, C, D and E, each row containing n different
variables (columns). We use record E as an example in the calculations below. The remaining
rows are normalized in the same way.
The normalized value e'i of value ei for row E in the ith column is calculated as:

e'i = (ei − mean(E)) / std(E)

where mean(E) is the mean of all values in row E and std(E) is their standard deviation.
If all values for row E are identical—so the standard deviation of E (std(E)) is equal to zero—
then all values for row E are set to zero.
Normalization by min-max transformation
To normalize data to the range [0, 1], simply calculate:
zi = (xi − min(x)) / (max(x) − min(x))
where x = (x1, ..., xn) and zi is the ith normalized value.
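Both normalizations can be sketched together (the sample row E is made up; the zero-standard-deviation rule matches the note above):

```python
# Z-score and min-max normalization of one row of values.
import statistics as st

def z_score(row):
    m, s = st.mean(row), st.pstdev(row)
    # If all values are identical, std is zero: set everything to zero.
    return [0.0] * len(row) if s == 0 else [(v - m) / s for v in row]

def min_max(row):
    lo, hi = min(row), max(row)
    return [(v - lo) / (hi - lo) for v in row]

E = [10, 20, 30, 40]   # made-up sample row
print(z_score(E))
print(min_max(E))
```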
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 3
AIM: Study of Data Mining tool – WEKA.
THEORY:
1. Introduction To WEKA
A) MAIN FEATURE OF WEKA
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand.
Weka is free software available under the GNU General Public License. The Weka workbench
contains a collection of visualization tools and algorithms for data analysis and predictive
modeling, together with graphical user interfaces for easy access to this functionality.
Weka supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. Weka provides access
to SQL databases using Java Database Connectivity and can process the result returned by a
database query. It is not capable of multi-relational data mining, but there is separate software for
converting a collection of linked database tables into a single table that is suitable for processing
using Weka.
B) DOWNLOAD AND INSTALLATION:
Step 1: Download from the link:
http://sourceforge.net/projects/weka/postdownload?source=dlp
Step 2: Install via the command:
sudo apt-get install weka
C) START APP: Start the app as shown in the diagram below.
D) MAIN USER INTERFACE:
Weka's main user interface is the Explorer, but essentially the same functionality can be accessed
through the component-based Knowledge Flow interface and from the command line. There is
also the Experimenter, which allows the systematic comparison of the predictive performance of
Weka's machine learning algorithms on a collection of datasets.
The Explorer interface features several panels providing access to the main components of the
workbench:
The Preprocess panel has facilities for importing data from a database, a CSV file, etc.,
and for preprocessing this data using a so-called filtering algorithm. These filters can be
used to transform the data (e.g., turning numeric attributes into discrete ones) and make it
possible to delete instances and attributes according to specific criteria.
The Classify panel enables the user to apply classification and regression algorithms
(indiscriminately called classifiers in Weka) to the resulting dataset, to estimate the
accuracy of the resulting predictive model, and to visualize erroneous predictions, ROC
curves, etc., or the model itself (if the model is amenable to visualization like, e.g., a
decision tree).
The Associate panel provides access to association rule learners that attempt to identify
all important interrelationships between attributes in the data.
The Cluster panel gives access to the clustering techniques in Weka, e.g., the simple k-
means algorithm. There is also an implementation of the expectation maximization
algorithm for learning a mixture of normal distributions.
The Select attributes panel provides algorithms for identifying the most predictive
attributes in a dataset.
The Visualize panel shows a scatter plot matrix, where individual scatter plots can be
selected and enlarged, and analyzed further using various selection operators.
2. WEKA Functions and Tools:
A) Loading, Preprocessing and Visualization of Data file:
• Load data file in formats: ARFF, CSV, C4.5, binary
• Import from URL or SQL database (using JDBC)
• Preprocessing filters
– Adding/removing attributes
– Attribute value substitution
– Discretization
– Time series filters (delta, shift)
– Sampling, randomization
– Missing value management
– Normalization and other numeric transformations
B) FEATURE SELECTION:
In Weka, there are three options for performing attribute selection from the command line (not
everything is possible from the GUI):
the native approach, using the attribute selection classes directly
using a meta-classifier
the filter approach
C) CLASSIFICATION
A trained model can be saved as follows, e.g., J48:
train your model on the training data /some/where/train.arff
right-click in the Results list on the item whose model you want to save
select Save model and save it to /other/place/j48.model
You can load the previously saved model with the following steps:
load your test data /some/where/test.arff via the Supplied test set button
right-click in the Results list, select Load model and choose /other/place/j48.model
select Re-evaluate model on current test set
D) CLUSTERING:
Load the data file AUTOS.arff into WEKA using the same steps we used to load data into
the Preprocess tab. Take a few minutes to look around the data in this tab. Look at the columns,
the attribute data, the distribution of the columns, etc. The screen should look like the figure
shown below after loading the data.
With this data set, we are looking to create clusters, so instead of clicking on the Classify tab,
click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear
(this will be our preferred method of clustering for this article). WEKA Explorer window should
look like the following figure at this point.
E) REGRESSION:
• Predicted target is continuous
• Methods
– Linear regression
– Simple Linear Regression
– Neural networks
– Regression trees …
3.) Data format in WEKA:
ARFF:
Attribute-Relation File Format (ARFF) is the text file format used by Weka to store a dataset.
This kind of file is structured as follows (the "weather" dataset):
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
The ARFF file contains two sections: the header and the data section. The first line of the header
tells us the relation name. Then there is the list of the attributes (@attribute...). Each attribute is
associated with a unique name and a type. The latter describes the kind of data contained in the
variable and what values it can have. The variables types are: numeric, nominal, string and date.
The class attribute is by default the last one of the list. In the header section there can also be
some comment lines, identified with a '%' at the beginning, which can describe the database
content or give the reader information about the author. After that comes the data itself (@data);
each line stores the attribute values of a single entry, separated by commas.
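Here is a minimal sketch of reading such a file (an illustration, not WEKA's own loader; quoted attribute names and date types are not handled):

```python
# Parse the header and data sections of a simple ARFF file.

def parse_arff(text):
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith('@relation'):
            relation = line.split(None, 1)[1]
        elif low.startswith('@attribute'):
            attributes.append(line.split(None, 2)[1])   # attribute name
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            data.append(line.split(','))
    return relation, attributes, data

sample = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}
@data
sunny,no
rainy,yes"""
rel, attrs, rows = parse_arff(sample)
print(rel, attrs, rows)
```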
4). Pros and Cons of WEKA data mining:
The WEKA system covers the entire machine learning (knowledge discovery) process.
Although a research project, the WEKA system has implemented and evaluated a
number of different algorithms for different steps in the machine learning process.
The output and the information provided by the package is sufficient for an expert in machine
learning and related topics. The results as displayed by the system show a detailed description of
the flow and the steps involved in the entire machine learning process. The outputs provided by
different algorithms are easy to compare and hence make the analysis easier.
ARFF dataset is one of the most widely used data storage formats for research databases, making
this system easier for use in research oriented projects.
This package provides a number of application program interfaces (APIs) which help novice
data miners build their systems using the "core WEKA system".
Since the system provides a number of switches and options, we can customize the output of the
system to suit our needs.
The first major disadvantage is that the system is Java based and requires a Java
Virtual Machine for its execution. Since the system is entirely based on command-line
parameters and switches, it is difficult for an amateur to use efficiently. The textual
interface and output make it all the more difficult to interpret and visualize the results.
Important results such as pruned trees and hierarchy-based outputs cannot be displayed, making
the results difficult to visualize.
Although a commonly used format, ARFF is the only format that the WEKA system supports.
The current version, i.e. 3.0.1, has some bugs and disadvantages; the developers are working on
a better system and have come up with a new version which has a graphical user interface,
making the system complete.
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 4
AIM: Implementation of preprocessing in WEKA.
THEORY:
This example illustrates some of the basic data preprocessing operations that can be performed
using WEKA. The sample data set used for this example is the "bank data" available in
comma-separated format.
The data contains the following fields:
- id: a unique identification number
- age: age of customer in years (numeric)
- sex: MALE / FEMALE
- region: inner_city / rural / suburban / town
- income: income of customer (numeric)
- married: is the customer married (YES/NO)
- children: number of children (numeric)
- car: does the customer own a car (YES/NO)
- save_acct: does the customer have a savings account (YES/NO)
- current_acct: does the customer have a current account (YES/NO)
- mortgage: does the customer have a mortgage (YES/NO)
- pep: did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in ".csv"
format files. This is fortunate since many databases or spreadsheet applications can save or
export data into flat files in this format. As can be seen in the sample data file, the first row
contains the attribute names (separated by commas) followed by each data row with attribute
values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the
data set can be saved into ARFF format. If, however, you are interested in converting a ".csv" file
into WEKA's native ARFF format using the command line, this can be accomplished using the
following command: java weka.core.converters.CSVLoader filename.csv > filename.arff
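As an alternative illustration of the same conversion, here is a simplified Python sketch (not WEKA's CSVLoader; the "numeric if every value parses as a float, else nominal" rule is an assumption for this example):

```python
# Convert a small CSV string into ARFF text.
import csv, io

def csv_to_arff(text, relation):
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}"]
    for i, name in enumerate(header):
        col = [r[i] for r in data]
        try:
            [float(v) for v in col]            # all values numeric?
            lines.append(f"@attribute {name} numeric")
        except ValueError:                     # otherwise treat as nominal
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(col)))))
    lines.append("@data")
    lines += [",".join(r) for r in data]
    return "\n".join(lines)

sample = "age,sex\n30,MALE\n42,FEMALE\n"
print(csv_to_arff(sample, "bank"))
```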
In this example, we load the data set into WEKA, perform a series of operations using WEKA's
attribute and discretization filters, and then perform association rule mining on the resulting data
set. While all of these operations can be performed from the command line, we use the GUI
interface for WEKA Explorer.
Initially (in the Preprocess tab) click "open" and navigate to the directory containing the data file
(.csv or .arff). In this case we will open the above data file. This is shown in Figure p1.
Figure p1
Since the data is not in ARFF format, a dialog box will prompt you to use the converter, as in
Figure p2. You can click on the "Use Converter" button, and click OK in the next dialog box that
appears (see Figure p3).
Figure p3
Once the data is loaded, WEKA will recognize the attributes and during the scan of the data will
compute some basic statistics on each attribute. The left panel in Figure p4 shows the list of
recognized attributes, while the top panels indicate the names of the base relation (or table) and
the current working relation (which are the same initially).
Figure p4
Clicking on any attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes, the frequency for each attribute value is shown, while for continuous
attributes we can obtain min, max, mean, standard deviation, etc.
Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id" attribute).
We need to remove this attribute before the data mining step. We can do this by using the
Attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This will show a
popup window with a list of available filters. Scroll down the list and select the
"weka.filters.unsupervised.attribute.Remove" filter as shown in Figure p7.
Figure p7
Next, click on the text box immediately to the right of the "Choose" button. In the resulting dialog
box enter the index of the attribute to be filtered out (this can be a range or a comma-separated
list). In this case, we enter 1, which is the index of the "id" attribute (see the left panel).
Make sure that the "invertSelection" option is set to false (otherwise everything except attribute 1
will be filtered). Then click "OK". Now, in the filter box you will see "Remove -R 1" (see Figure p9).
Figure p9
Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and
create a new working relation (whose name now includes the details of the filter that was
applied). The result is depicted in Figure p10:
Figure p10
It is possible now to apply additional filters to the new working relation. In this example,
however, we will save our intermediate results as separate data files and treat each step as a
separate WEKA session. To save the new working relation as an ARFF file, click the Save button
in the top panel. Here, we will save the new relation in the file "bank-data-R1.arff".
Figure p12 shows the top portion of the newly generated ARFF file (in TextPad).
Figure p12
Note that in the new data set, the "id" attribute and all the corresponding values in the records
have been removed. Also, note that Weka has automatically determined the correct types and
values associated with the attributes, as listed in the Attributes section of the ARFF file.
Discretization
Some techniques, such as association rule mining, can only be performed on categorical data.
This requires performing discretization on numeric or continuous attributes. There are 3 such
attributes in this data set: "age", "income", and "children". In the case of the "children" attribute,
the range of possible values is only 0, 1, 2, and 3. In this case, we have opted to keep all of
these values in the data. This means we can simply discretize by removing the keyword
"numeric" as the type for the "children" attribute in the ARFF file and replacing it with the set of
discrete values. We do this directly in our text editor, as seen in Figure p13. In this case,
we have saved the resulting relation in a separate file "bank-data2.arff".
Figure p13
We will rely on WEKA to perform discretization on the "age" and "income" attributes. In this
example, we divide each of these into 3 bins (intervals). The WEKA discretization filter can
divide the ranges blindly or use various statistical techniques to automatically determine the
best way of partitioning the data. In this case, we will perform simple binning. First we will load
our filtered data set into WEKA by opening the file "bank-data2.arff". If we select the "children"
attribute in this new data set, we see that it is now a categorical attribute with four possible
discrete values. This is depicted in Figure p15.
Figure p15
Now, once again we activate the Filter dialog box, but this time we select
"weka.filters.unsupervised.attribute.Discretize" from the list (see Figure p16).
Figure p16
Next, to change the defaults for this filter, click on the box immediately to the right of the
"Choose" button. This will open the Discretize Filter dialog box. We enter the index of the
attributes to be discretized. In this case we enter 1, corresponding to the attribute "age". We also
enter 3 as the number of bins (note that it is possible to discretize more than one attribute at the
same time by using a list of attribute indices). Since we are doing simple binning, all of the other
available options are set to "false". The dialog box is depicted in Figure p17.
Figure p17
Click "Apply" in the Filter panel. This will result in a new working relation with the selected
attribute partitioned into 3 bins (see Figure p18). To examine the results, we save the new
working relation in the file "bank-data3.arff" as depicted in Figure p19. Figure p18
Figure p19
Let us now examine the new data set using our text editor (in this case, TextPad). The top portion
of the data is shown in Figure p19. You can observe that WEKA has assigned its own labels to
each of the value ranges for the discretized attribute. For example, the lower range in the "age"
attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), while
the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in
the data records where the original age value was in the corresponding range.
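These cut points are consistent with simple equal-width binning. A minimal Java sketch of the computation, assuming the "age" values range from 18 to 67 (this range is inferred from the labels above, not stated explicitly in the data):

```java
// Equal-width binning sketch (min/max are assumptions inferred from WEKA's labels).
public class EqualWidthBinning {
    // Returns the 0-based index of the bin that value falls into.
    static int bin(double value, double min, double max, int bins) {
        double width = (max - min) / bins;
        int b = (int) ((value - min) / width);
        return Math.min(b, bins - 1); // clamp the maximum value into the last bin
    }

    public static void main(String[] args) {
        double min = 18, max = 67;        // assumed range of "age" in the bank data
        double width = (max - min) / 3.0; // 16.333... per bin
        // prints the two cut points appearing in WEKA's labels
        System.out.printf("cut points: %.6f, %.6f%n", min + width, min + 2 * width);
        System.out.println(bin(30, min, max, 3)); // 0 -> '(-inf-34.333333]'
        System.out.println(bin(40, min, max, 3)); // 1 -> '(34.333333-50.666667]'
    }
}
```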
Next, we apply the same process to discretize the "income" attribute into 3 bins. Again, Weka
automatically performs the binning and replaces the values in the "income" column with the
appropriate automatically generated labels. We save the new file into "bank-data3.arff",
replacing the older version.
Clearly, the WEKA labels, while readable, leave much to be desired as far as naming
conventions go. We will thus use the global search/replace functions in TextPad to replace these
labels with more succinct and readable ones. Fortunately, TextPad has a powerful regular
expression pattern-matching capability which allows us to do this efficiently. We use the
TextPad search/replace dialog box to replace the age label "(-inf-34.333333]" with the label
"0_34". Note that the "regular expression" option is selected. In the "Find what" box we enter the
full label '\'(-inf-34.333333]\'' (including the back-slashes and single quotes). Furthermore,
each back-slash is escaped with another back-slash so that the regular expression engine treats
it as a literal (resulting in: '\\'(-inf-34.333333]\\''). In the "Replace with" box we enter
"0_34".
Now we click on the "Replace All" button to replace all instances of the old patterns with the
new one. The result of this operation is depicted in Figure p21.
Figure p21
Note that the new label now appears in place of the old one both in the attribute section of the
ARFF file as well as in the relevant data records. We repeat this manual re-labeling process with
all of the WEKA-assigned labels for the "age" and the "income" attributes. Figure p22 shows the
final result of the transformation and the newly assigned labels for these attribute values.
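The same relabeling can also be scripted. A minimal Java sketch (a hypothetical helper, not part of the manual's TextPad procedure) using Pattern.quote, which plays the same role as the escaped back-slashes in TextPad's regex dialog:

```java
import java.util.regex.Pattern;

// Sketch: replace a WEKA-generated range label with a readable one.
public class RelabelRanges {
    static String relabel(String arffText, String wekaLabel, String newLabel) {
        // Pattern.quote escapes the (, ], - and \ characters so the label
        // is matched literally rather than as a regular expression.
        return arffText.replaceAll(Pattern.quote(wekaLabel), newLabel);
    }

    public static void main(String[] args) {
        // In Java source, each back-slash in the ARFF label is written as \\
        String line = "age '\\'(-inf-34.333333]\\''";
        System.out.println(relabel(line, "'\\'(-inf-34.333333]\\''", "0_34")); // age 0_34
    }
}
```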
Figure p22
We now also change the relation name in the ARFF file to "bank-data-final" and save the file as
bank-data-final.arff
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 5
AIM: Implementation of any one classifier using JAVA.
THEORY:
A Bayesian classifier is a simple probabilistic classifier. A Bayesian classifier can predict
class membership probabilities, such as the probability that a sample belongs to a particular
class or grouping. Bayesian classification is based on Bayes' theorem, and the technique tends
to be highly accurate and fast, making it useful on large databases.
Bayes' Theorem: Bayes' theorem states that the posterior probability of event A given event B is
P(A|B) = P(B|A)P(A) / P(B)
Naïve Bayesian Classification Algorithm:
The operation of the naïve Bayesian classifier is as follows:
1) Each data sample is represented by an n-dimensional feature vector X = (x1, x2, ..., xn).
2) Suppose there are m classes C1, C2, ..., Cm. Given an unknown data sample X, the
classifier predicts that X belongs to the class with the highest posterior probability, given
by Bayes' theorem as
P(Ci|X) = P(X|Ci)P(Ci) / P(X)
The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
3) As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. The prior
probability P(Ci) can be estimated as P(Ci) = Si/S, where Si is the number of training samples
of class Ci and S is the total number of training samples.
4) Sample X is therefore assigned to class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for
all 1 <= j <= m, j != i. In other words, X is assigned to the class Ci for which P(X|Ci)P(Ci)
is maximum.
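The steps above can be sketched as a minimal categorical naïve Bayes classifier in Java. The training data here is a hypothetical toy set, purely for illustration; no smoothing or numeric-attribute handling is included:

```java
import java.util.*;

// Minimal categorical naive Bayes sketch following the algorithm above.
public class NaiveBayesSketch {
    // Each sample: feature values, with the class label in the last position.
    static String classify(List<String[]> train, String[] x) {
        Map<String, Integer> classCount = new HashMap<>();
        for (String[] s : train)
            classCount.merge(s[s.length - 1], 1, Integer::sum);
        String best = null;
        double bestScore = -1;
        for (String c : classCount.keySet()) {
            // P(Ci) = Si / S
            double score = classCount.get(c) / (double) train.size();
            // multiply by P(xk | Ci) for each attribute (class-conditional independence)
            for (int k = 0; k < x.length; k++) {
                int match = 0;
                for (String[] s : train)
                    if (s[s.length - 1].equals(c) && s[k].equals(x[k])) match++;
                score *= match / (double) classCount.get(c);
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best; // class maximizing P(X|Ci)P(Ci)
    }

    public static void main(String[] args) {
        List<String[]> train = Arrays.asList(
            new String[]{"sunny", "hot",  "no"},
            new String[]{"sunny", "mild", "no"},
            new String[]{"rain",  "mild", "yes"},
            new String[]{"rain",  "cool", "yes"});
        System.out.println(classify(train, new String[]{"rain", "mild"})); // yes
    }
}
```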
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 6
AIM: Implementation of any one clustering algorithm using JAVA and verify result with
WEKA.
THEORY: Types of clustering:
Hierarchical algorithms: Hierarchical algorithms find successive clusters using previously
established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive
("top- down"). Agglomerative algorithms begin with each element as a separate cluster and
merge them into successively larger clusters. Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters.
Partitional algorithms: Partitional algorithms typically determine all clusters at once, but can
also be used as divisive algorithms in hierarchical clustering, e.g., k-means and k-medoids.
Density-based clustering algorithms: Density-based clustering algorithms are devised to
discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the
density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of
this kind.
Agglomerative clustering Algorithm:
1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s),
according to
d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster
to form clustering m. Set the level of this clustering to
L(m) = d[(r),(s)]
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters
(r) and (s) and adding a row and column corresponding to the newly formed cluster. The
proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined in this way:
d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }
5. If all objects are in one cluster, stop. Else, go to step 2.
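A compact Java sketch of this agglomerative procedure, using single-linkage distances on hypothetical one-dimensional points and recording the merge levels L(m):

```java
import java.util.*;

// Agglomerative clustering sketch: repeatedly merge the least dissimilar
// pair of clusters (single linkage), recording each merge level L(m).
public class AgglomerativeSketch {
    static List<Double> mergeLevels(double[] points) {
        // start with each point in its own cluster
        List<List<Double>> clusters = new ArrayList<>();
        for (double p : points) clusters.add(new ArrayList<>(List.of(p)));
        List<Double> levels = new ArrayList<>();
        while (clusters.size() > 1) {
            int br = 0, bs = 1;
            double bestD = Double.MAX_VALUE;
            // step 2: find the least dissimilar pair (r, s) over all cluster pairs
            for (int r = 0; r < clusters.size(); r++)
                for (int s = r + 1; s < clusters.size(); s++) {
                    double d = dist(clusters.get(r), clusters.get(s));
                    if (d < bestD) { bestD = d; br = r; bs = s; }
                }
            levels.add(bestD);                            // step 3: L(m) = d[(r),(s)]
            clusters.get(br).addAll(clusters.remove(bs)); // merge (r) and (s)
        }
        return levels; // stops when all objects are in one cluster (step 5)
    }

    // single linkage: distance between the closest pair of members
    static double dist(List<Double> a, List<Double> b) {
        double min = Double.MAX_VALUE;
        for (double x : a) for (double y : b) min = Math.min(min, Math.abs(x - y));
        return min;
    }

    public static void main(String[] args) {
        System.out.println(mergeLevels(new double[]{1, 2, 9, 10})); // [1.0, 1.0, 7.0]
    }
}
```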
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved).
EXPERIMENT NO. 7
AIM: Implementation of association rule mining (Apriori algorithm) using JAVA and verify the
result with WEKA.
THEORY:
Apriori is an algorithm for frequent item set mining and association rule learning over transactional
databases. It proceeds by identifying the frequent individual items in the database and extending them to
larger and larger item sets as long as those item sets appear sufficiently often in the database. The
frequent item sets determined by Apriori can be used to determine association rules which highlight
general trends in the database: this has applications in domains such as market basket analysis.
Apriori is designed to operate on databases containing transactions (for example, collections of items
bought by customers, or the pages visited on a website). Each transaction is seen as a set of items (an
itemset). Given a support threshold, the Apriori algorithm identifies the item sets which are subsets of
at least that many transactions in the database. Apriori uses a "bottom up" approach, where frequent
subsets are extended one item at a time (a step known as candidate generation), and groups of candidates
are tested against the data. The algorithm terminates when no further successful extensions are found.
Limitations
Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs,
which have spawned other algorithms. Candidate generation generates large numbers of subsets
(the algorithm attempts to load up the candidate set with as many as possible before each scan).
Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any
maximal subset S only after all of its proper subsets.
Later algorithms such as Max-Miner[2] try to identify the maximal frequent item sets without
enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-
up approach.
Algorithm:
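A minimal level-wise sketch of the algorithm in Java. The database is a hypothetical toy set, and candidate generation is simplified to a pairwise join with direct support counting (the usual subset-pruning step is omitted):

```java
import java.util.*;

// Level-wise Apriori sketch: grow frequent itemsets one item at a time.
public class AprioriSketch {
    static List<Set<String>> frequentItemsets(List<Set<String>> db, int minSupport) {
        List<Set<String>> result = new ArrayList<>();
        // L1: frequent individual items
        Set<Set<String>> level = new LinkedHashSet<>();
        for (Set<String> t : db)
            for (String item : t)
                if (support(db, Set.of(item)) >= minSupport) level.add(Set.of(item));
        while (!level.isEmpty()) {
            result.addAll(level);
            // candidate generation: join pairs of k-itemsets into (k+1)-itemsets,
            // keeping only candidates that meet the support threshold
            Set<Set<String>> next = new LinkedHashSet<>();
            for (Set<String> a : level)
                for (Set<String> b : level) {
                    Set<String> c = new TreeSet<>(a);
                    c.addAll(b);
                    if (c.size() == a.size() + 1 && support(db, c) >= minSupport)
                        next.add(c);
                }
            level = next; // terminate when no successful extensions remain
        }
        return result;
    }

    // support count: number of transactions containing all the items
    static int support(List<Set<String>> db, Set<String> items) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(items)) n++;
        return n;
    }

    public static void main(String[] args) {
        List<Set<String>> db = Arrays.asList(
            Set.of("bread", "milk"), Set.of("bread", "jam"), Set.of("milk"));
        System.out.println(frequentItemsets(db, 2).size()); // 2: {bread} and {milk}
    }
}
```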
Example 1
Consider the following database, where each row is a transaction and each cell is an individual
item of the transaction:
alpha, beta, epsilon
alpha, beta, theta
alpha, beta, epsilon
alpha, beta, theta
The association rules that can be determined from this database are the following:
1. 100% of sets with alpha also contain beta
2. 50% of sets with alpha, beta also have epsilon
3. 50% of sets with alpha, beta also have theta
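These percentages are rule confidences: confidence(X -> Y) = support(X u Y) / support(X). A short Java sketch verifying the three rules against the four transactions above:

```java
import java.util.*;

// Confidence computation for the four-transaction example database.
public class AprioriExample {
    static List<Set<String>> db = Arrays.asList(
        Set.of("alpha", "beta", "epsilon"),
        Set.of("alpha", "beta", "theta"),
        Set.of("alpha", "beta", "epsilon"),
        Set.of("alpha", "beta", "theta"));

    // support count: number of transactions containing all the items
    static int support(Set<String> items) {
        int n = 0;
        for (Set<String> t : db) if (t.containsAll(items)) n++;
        return n;
    }

    // confidence of rule X -> Y = support(X u Y) / support(X)
    static double confidence(Set<String> x, Set<String> y) {
        Set<String> xy = new HashSet<>(x);
        xy.addAll(y);
        return support(xy) / (double) support(x);
    }

    public static void main(String[] args) {
        System.out.println(confidence(Set.of("alpha"), Set.of("beta")));            // 1.0
        System.out.println(confidence(Set.of("alpha", "beta"), Set.of("epsilon"))); // 0.5
        System.out.println(confidence(Set.of("alpha", "beta"), Set.of("theta")));   // 0.5
    }
}
```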
We can also illustrate this through a variety of examples
select * from grocery_customer;
SALES_ID PROD_NAME
111 bread
111 milk
112 bread
112 jam
113 milk
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 8
AIM: Using WEKA to compare different classifiers using Experimenter.
THEORY:
1. Start Weka: Start Weka. This may involve finding it in your program launcher or double-clicking
on the weka.jar file. This will start the Weka GUI Chooser.
Weka GUI Chooser: Lets you choose one of the Explorer, Experimenter, KnowledgeFlow
and the Simple CLI (command line interface).
Click the “Experimenter” button to launch the Weka Experimenter.
The Weka Experimenter allows you to design your own experiments of running algorithms on
datasets, run the experiments and analyze the results. It’s a powerful tool.
2. Design Experiment: Click the “New” button to create a new experiment configuration.
Weka Experimenter
Start a new Experiment
Test Options
The experimenter configures the test options for you with sensible defaults. The experiment is
configured to use Cross Validation with 10 folds. It is a “Classification” type problem and each
algorithm + dataset combination is run 10 times (iteration control).
Iris flower Dataset: Let’s start out by selecting the dataset.
1. In the “Datasets” section click the “Add new…” button.
2. Open the “data” directory and choose the “iris.arff” dataset.
The Iris flower dataset is a famous dataset from statistics and is heavily borrowed by researchers
in machine learning. It contains 150 instances (rows) and 4 attributes (columns) and a class
attribute for the species of iris flower (one of setosa, versicolor, virginica). You can read more
about Iris flower dataset on Wikipedia.
Let’s choose 3 algorithms to run our dataset.
ZeroR
1. Click “Add new…” in the “Algorithms” section.
2. Click the “Choose” button.
3. Click “ZeroR” under the “rules” selection.
ZeroR is the simplest algorithm we can run. It picks the class value that is the majority in the
dataset and gives that for all predictions. Given that all three class values have an equal share (50
instances), it picks the first class value “setosa” and gives that as the answer for all predictions.
Just off the top of our head, we know that the best result ZeroR can give is 33.33% (50/150).
This is good to have as a baseline that we demand algorithms to outperform.
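A ZeroR sketch in Java illustrating this baseline (a hypothetical helper; ties resolve to the first class value seen, which is why "setosa" wins on the balanced iris data):

```java
import java.util.*;

// ZeroR sketch: predict the majority class value for every instance.
public class ZeroRSketch {
    static String zeroR(List<String> classValues) {
        // LinkedHashMap preserves first-seen order, so ties go to the first class
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String c : classValues) counts.merge(c, 1, Integer::sum);
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (best == null || e.getValue() > counts.get(best)) best = e.getKey();
        return best; // every future instance receives this prediction
    }

    public static void main(String[] args) {
        // 150 iris labels, 50 of each class
        List<String> labels = new ArrayList<>();
        for (String c : List.of("setosa", "versicolor", "virginica"))
            for (int i = 0; i < 50; i++) labels.add(c);
        System.out.println(zeroR(labels));        // setosa
        System.out.println(50.0 / labels.size()); // 0.3333... baseline accuracy
    }
}
```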
OneR
1. Click “Add new…” in the “Algorithms” section.
2. Click the “Choose” button.
3. Click “OneR” under the “rules” selection.
OneR is like our second simplest algorithm. It picks one attribute that best correlates with the
class value and splits it up to get the best prediction accuracy it can. Like the ZeroR algorithm,
the algorithm is so simple that you could implement it by hand and we would expect more
sophisticated algorithms to outperform it.
J48
1. Click “Add new…” in the “Algorithms” section.
2. Click the “Choose” button.
3. Click “J48” under the “trees” selection.
J48 is a decision tree algorithm. It is an implementation of the C4.8 algorithm in Java (“J” for Java
and 48 for C4.8). The C4.8 algorithm is a minor extension to the famous C4.5 algorithm and is a
very powerful prediction algorithm.
Weka Experimenter
Configure the experiment
We are ready to run our experiment.
3. Run Experiment
Click the “Run” tab at the top of the screen.
This tab is the control panel for running the currently configured experiment.
Click the big “Start” button to start the experiment and watch the “Log” and “Status” sections to
keep an eye on how it is doing.
Weka Experimenter
Run the experiment
Given that the dataset is small and the algorithms are fast, the experiment should complete in
seconds.
4. Review Results
Click the “Analyse” tab at the top of the screen.
This will open up the experiment results analysis panel.
Weka Experimenter
Load the experiment results
Click the “Experiment” button in the “Source” section to load the results from the current
experiment.
Algorithm Rank
The first thing we want to know is which algorithm was the best. We can do that by ranking the
algorithms by the number of times a given algorithm beat the other algorithms.
1. Click the “Select” button for the “Test base” and choose “Ranking“.
2. Now Click the “Perform test” button.
Weka Experimenter
Rank the algorithms in the experiment results
The ranking table shows the number of statistically significant wins each algorithm has had
against all other algorithms on the dataset. A win, means an accuracy that is better than the
accuracy of another algorithm and that the difference was statistically significant.
We can see that both J48 and OneR have one win each and that ZeroR has two losses. This is
good; it means that OneR and J48 are both potential contenders for outperforming our baseline
of ZeroR.
Algorithm Accuracy
Next we want to know what scores the algorithms achieved.
1. Click the “Select” button for the “Test base” and choose the “ZeroR” algorithm in the list
and click the “Select” button.
2. Click the check-box next to “Show std. deviations“.
3. Now click the “Perform test” button.
Weka Experimenter
Algorithm accuracy compared to ZeroR
In the “Test output” we can see a table with the results for the 3 algorithms. Each algorithm was
run 10 times on the dataset, and the accuracy reported is the mean, with the standard deviation
in brackets, of those 10 runs.
We can see that both the OneR and J48 algorithms have a little “v” next to their results. This
means that the difference in the accuracy for these algorithms compared to ZeroR is statistically
significant. We can also see that the accuracy for these algorithms compared to ZeroR is high, so
we can say that these two algorithms achieved a statistically significantly better result than the
ZeroR baseline.
The score for J48 is higher than the score for OneR, so next we want to see if the difference
between these two accuracy scores is significant.
1. Click the “Select” button for the “Test base” and choose the “J48” algorithm in the list
and click the “Select” button.
2. Now click the “Perform test” button.
Weka Experimenter
Algorithm accuracy compared to J48
We can see that the ZeroR has a “*” next to its results, indicating that its results compared to the
J48 are statistically different. But we already knew this. We do not see a “*” next to the results
for the OneR algorithm. This tells us that although the mean accuracy between J48 and OneR is
different, the difference is not statistically significant.
All things being equal, we would choose the OneR algorithm to make predictions on this
problem because it is the simpler of the two algorithms.
If we wanted to report the results, we would say that the OneR algorithm achieved a
classification accuracy of 92.53% (+/- 5.47%) which is statistically significantly better than
ZeroR at 33.33% (+/- 5.47%).
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 9
AIM: Implementation of KDD process in WEKA – Knowledge Flow
THEORY: Knowledge Flow
Major steps for building a process
1. Adding required nodes
1) Add a data source node from “DataSources”
2) Right click to configure it with a data set
3) Add a ClassAssigner node from “Evaluation” and a CrossValidationFoldMaker node
4) Add a classifier, e.g. J48, from “Classifiers”
5) Add a ClassifierPerformanceEvaluator node from “Evaluation”
6) Add a text viewer from “Visualization”
2. Connect the nodes
Right click “DataSource” node and choose DataSet, then connect it to the ClassAssigner
node, do the same or similar for connecting between the other nodes.
3. Run the process (using the default setups for each node)
Right click the DataSource node and choose “Start loading”; the process should run, and the
“Status” window should indicate whether the run completed correctly.
4. View the results:
If the run is correctly completed, right click “Text Viewer” node and choose “Show
results”, then another window pops out to show the results.
Results of the KDD process
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 10
AIM: Study and implementation of mining algorithm in XL Miner.
THEORY: Hierarchical cluster Analysis using XL Miner
Cluster Analysis: Cluster Analysis, also called data segmentation, has a variety of goals. All
relate to grouping or segmenting a collection of objects (also called observations, individuals,
cases, or data rows) into subsets or "clusters", such that those within each cluster are more
closely related to one another than objects assigned to different clusters. Central to all of the
goals of cluster analysis is the notion of degree of similarity (or dissimilarity) between the
individual objects being clustered. There are two major methods of clustering -- hierarchical
clustering and k-means clustering.
In hierarchical clustering the data are not partitioned into a particular cluster in a single step.
Instead, a series of partitions takes place, which may run from a single cluster containing all
objects to n clusters each containing a single object. Hierarchical Clustering is subdivided into
agglomerative methods, which proceed by series of fusions of the n objects into groups, and
divisive methods, which separate n objects successively into finer groupings. Agglomerative
techniques are more commonly used, and this is the method implemented in XLMiner™.
Hierarchical clustering may be represented by a two-dimensional diagram known as a dendrogram,
which illustrates the fusions or divisions made at each successive stage of the analysis. An
example of such a dendrogram is given below:
Agglomerative methods
An agglomerative hierarchical clustering procedure produces a series of partitions of the data,
Pn, Pn-1, ..., P1. The first, Pn, consists of n single-object 'clusters'; the last, P1, consists
of a single group containing all n cases.
At each particular stage the method joins together the two clusters which are closest together
(most similar). (At the first stage, of course, this amounts to joining together the two objects
that are closest together, since at the initial stage each cluster has one object.)
Differences between methods arise because of the different ways of defining distance (or
similarity) between clusters. Several agglomerative techniques will now be described in detail.
Single linkage clustering
One of the simplest agglomerative hierarchical clustering methods is single linkage, also known
as the nearest neighbor technique. The defining feature of the method is that distance between
groups is defined as the distance between the closest pair of objects, where only pairs consisting
of one object from each group are considered.
In the single linkage method, D(r,s) is computed as
D(r,s) = min { d(i,j) : object i is in cluster r and object j is in cluster s }
Here the distance between every possible object pair (i,j) is computed, where object i is in cluster
r and object j is in cluster s. The minimum value of these distances is said to be the distance
between clusters r and s. In other words, the distance between two clusters is given by the value
of the shortest link between the clusters.
At each stage of hierarchical clustering, the clusters r and s , for which D(r,s) is minimum, are
merged.
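The definition of D(r,s) can be sketched directly in Java (hypothetical 2-D points, with Euclidean distance as d(i,j)):

```java
// Single-linkage inter-cluster distance: the minimum pairwise distance
// between a member of cluster r and a member of cluster s.
public class SingleLinkage {
    static double d(double[][] r, double[][] s) {
        double min = Double.MAX_VALUE;
        for (double[] i : r)
            for (double[] j : s)
                min = Math.min(min, euclid(i, j));
        return min;
    }

    // Euclidean distance d(i,j) between two points
    static double euclid(double[] a, double[] b) {
        double sum = 0;
        for (int k = 0; k < a.length; k++) sum += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] r = {{0, 0}, {1, 0}};
        double[][] s = {{4, 0}, {9, 9}};
        System.out.println(d(r, s)); // 3.0: the shortest link is (1,0)-(4,0)
    }
}
```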
This measure of inter-group distance is illustrated in the figure below:
CONCLUSION: (Conclusion to be based on the aim and outcomes achieved)
EXPERIMENT NO. 11
AIM: Mini Project on any Business Intelligence application.
OBJECTIVE:
A BI report must be prepared outlining the following steps:
a) Problem definition, Identifying which data mining task is needed
b) Identify and use a standard data mining dataset available for the problem. Some links for data
mining datasets are: WEKA site, UCI Machine Learning Repository, KDD site, KDD Cup etc.
c) Implement the data mining algorithm of choice
d) Interpret and visualize the results