Carter
SECTION 1: INTRODUCTION
1.1 OVERVIEW
The process of data collection, no matter the objective, can provide a person or
entity with a large amount of generic information on a particular subject. However, to
learn from and to interpret the data requires much more than simply gathering it. For
example, a business may build meaningful relationships with its customers by learning
from previous interactions with them, observing their needs, and remembering what their
preferences are, in order to determine how to serve them better in the future. In order for
this type of learning to take place, data must first be collected and organized in a useful
and consistent way. This procedure is known as data warehousing. Data warehousing
allows a user to remember what has been noticed in the data. Afterwards, the data must
then be analyzed, interpreted, and transformed into useful information. This is where data mining comes into play. Data mining is the exploration and analysis, by
automatic or semiautomatic means, of large quantities of data in order to discover
meaningful patterns and rules (Berry, Linoff, pg. 5). Data mining can be applied in a
wide variety of areas, from sports to law enforcement to education.
In this project, I use data mining techniques to predict the current contraceptive
method choice (no use, long-term methods, or short-term methods) of Indonesian women
based on their demographic and socio-economic characteristics. The algorithms that are
implemented here are Naive Bayesian Classification, One-Rule Classification and
Decision Tree. This project presents a Web-based client/server application. The project
makes use of the three-tier client/server architecture, with the Web browser as the client
front-end, the Common Gateway Interface (CGI), Perl, Visual Basic, and Active Server
Pages (ASP) as the middle-tier software, and Microsoft Access 2000 and a comma-
separated value (CSV) text file for the database back-end. The database administrator
has the capability to add, delete, edit, and search for records. The administrator can also
change the administrator password and add users who have permission to gain access to
the website. Users have privileges to add records and to search for records. A logging
system is also implemented, which keeps track of the time, date, host server, browser,
and operating system of users that access the database. The log is accessible by both the
administrator and the users.
1.2 BACKGROUND INFORMATION
According to the Central Bureau of Statistics, the nation of Indonesia is the
fourth-most populous country in the world, with an estimated total population of 207
million in 2000 (United Nations Population Fund). Indonesia has a growth rate of 1.5
percent a year, and although the population growth rate is at a moderate level, the country
has a significant momentum of growth. The government of Indonesia is concerned about
the uneven distribution of the population and the scale of the population growth. This is especially true when considering overcrowding in urban, densely populated areas, such
as Java and Bali. Other areas of concern are the relatively high infant and under-five
mortality rates (52 and 71 per 1,000, respectively) and the persistently high maternal
mortality ratio (estimated at 370 per 100,000 births).
Indonesia has been recognized for the success of its family planning efforts.
However, according to (United Nations Population Fund), the progress in the
contraceptive prevalence rate (CPR) seems to have stalled at about 57 percent. Also, the
burden of use of contraceptives appears to be unevenly shouldered by women, as the
male-based CPR is less than 2 percent. And even though the “unmet need” for
contraceptives of currently married women has been estimated at the relatively small 9.2
percent, this number is probably considerably higher when unmarried men and women
are taken into account. In order to meet this need, it is paramount that the quality and
scope of contraceptive services and information be expanded. A critical challenge for
Indonesia remains the access to affordable contraceptives by all its citizens, especially the
poor.
1.3 ABOUT THE DATASET
This dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey. It was created and donated by Tjen-Sien Lim on June 7, 1997. The contents were downloaded from the UCI Machine Learning Repository. The samples
contained in the survey are of married women who were either not pregnant or did not
know if they were pregnant at the time of the interview. The problem faced is predicting
the contraceptive method choice of the woman based on her demographic and socio-economic characteristics. Predicting the contraceptive method choice of Indonesian women can assist the government in determining how and where to target and provide information on contraceptive choices for its female population. The three choices are no
use, long-term methods, or short-term methods. The number of instances is 1473, and the
number of attributes is 11, including the primary key (ID) and the classifying attribute
(cmchoice).
1.4 ATTRIBUTE INFORMATION
No.  Attribute   Description                          Type / Values
1.   ID          ID number                            (primary key attribute)
2.   wife_age    Wife's age                           numerical
3.   wife_ed     Wife's education                     categorical: 1=low level, 2, 3, 4=high level
4.   hus_ed      Husband's education                  categorical: 1=low level, 2, 3, 4=high level
5.   no_child    Number of children ever born         numerical
6.   wife_rel    Wife's religion                      binary: 0=Non-Islam, 1=Islam
7.   wife_work   Wife now working?                    binary: 0=Yes, 1=No
8.   hus_oc      Husband's occupation                 categorical: 1=low level, 2, 3, 4=high level
9.   st_live     Standard-of-living index             categorical: 1=low, 2, 3, 4=high
10.  media       Media exposure                       binary: 0=Good, 1=Not good
11.  cmchoice    Contraceptive method used (class)    1=No-use, 2=Long-term, 3=Short-term
Figure 1: Attribute Information
There are no missing values in the dataset.
SECTION 2: TECHNICAL DESCRIPTION
2.1 THE THREE-TIER ARCHITECTURE
The most commonly used application development architecture, and the one
supported by most application servers, is a component-based, three-tier model (Directions
on Microsoft). Components increase code reuse and simplify development. By using components, a developer can package the compiled (binary) code
in such a way that another developer is able to easily and efficiently discover the
functions provided by the component (usually by using a programming language
application such as Visual Basic) and invoke those functions. This is accomplished while
keeping the internal workings of the component hidden.
The three-tier architecture increases scalability and reliability by separating the
three major logical functions of an application (user interaction, business logic, data
storage) from one another. Many Web services must provide functionality that displays
the graphical user interface (GUI), performs the main logic of the program, and then
stores and retrieves data. And although a developer may write a single module that will
interconnect the three functions of user interaction, logic, and data storage, such an
approach would require a great deal of work in maintenance and in deployment.
Therefore, developers attempt to divide the application’s functionality into tiers, or
layers. Years ago, as business applications moved from minicomputer or mainframe
systems to the PC, developers adopted a two-tier strategy, which is also known as the
client-server model. In this model, the data storage (typically provided by a server
running a database management system such as SQL Server or DB2) is separated from
the rest of the application (typically running on desktop PCs). This resulted in many
developer tools being created around the client-server model. However, the client-server
model had its drawbacks, which included the following, as described in (Directions on
Microsoft):
Difficult to evolve. Because the client piece of a client-server system included
both the GUI and the business logic, developers updating the GUI could
inadvertently change the business logic as well.
Difficult to deploy. A client application had to be deployed on the desktop PC of
each user who wanted to access the application, potentially requiring thousands of
deployments.
Difficult to scale. Each running client connected directly to the database, thereby
consuming server resources and often limiting the number of simultaneous users
that could access an application.
On the other hand, the three-tier model introduces an intermediate business-logic tier
between the GUI and the data storage, which provides these advantages over the client-
server model:
Increased scalability. Logic components can be pooled and shared across multiple
running clients.
Easier to maintain. Since the GUI code is separate from the business logic, the GUI
can be changed and enhanced without accidentally altering core business rules. In
addition, when the business logic must be changed, only a relatively small number of
middle-tier servers need to be updated instead of a larger number of desktop PCs.
Shared business logic and support for multiple interfaces. The same business logic
can be used from a Web-based interface and a thick-client interface.
Figure 2 illustrates the setup of a typical three-tier architectural model:
Figure 2 (Delphi 2)
2.2 WEB BROWSER / HTML
HTML, the HyperText Markup Language, is the standard authoring language for
publishing on the World Wide Web. Having gone through several stages of evolution,
today’s HTML has a wide range of features reflecting the needs of a very diverse and
international community wishing to make information available on the Web (HTML
Activity Statement). HTML defines the layout and structure of a Web document by
using a series of tags and attributes.
In this project, I use HTML for the structure of the Web pages within my project
site. A Web browser is a software application used to locate and display HTML pages.
The Microsoft Internet Explorer Web browser serves as the client in this application.
2.3 CGI
The Common Gateway Interface (CGI) is a standard for interfacing external
applications with information servers, such as HTTP or Web servers (CGI: Common
Gateway Interface). A CGI program is executed in real-time, which means that it can
output dynamic data to a Web page. On the other hand, a generic HTML document
that is retrieved contains static information, which means it exists in a constant state and
the information outputted to the screen does not change. Because a CGI program is
executable, it allows visitors to a Web page to run a program on the server where the CGI
document is hosted. For this and other reasons, authors of CGI scripts must take some
security measures when it comes to the execution of the scripts. CGI programs must
reside in a special directory, so that the Web server knows to execute the program instead
of merely displaying it to the browser. Typically, this directory is under direct control of
the webmaster, which prevents the average user from creating CGI programs. The most
common practice is to place CGI programs in a directory entitled ‘/cgi-bin’.
2.4 PERL
A CGI program can be written in any language that allows it to be executed on
the user’s system, and Perl is the language of choice for many developers. Perl is an
acronym for the Practical Extraction and Report Language. Perl is available for most
operating systems, including virtually all Unix-like platforms (Perl). The language is
optimized for scanning arbitrary text files, extracting information from those text files,
and printing reports based on that information. Perl can handle many system
management tasks, and the language’s designers intended it to be practical, easy to use,
and efficient. Perl combines many features of C, sed, awk, and sh, as well as csh,
Pascal, and BASIC-PLUS. Expression syntax in Perl corresponds closely to C
expression syntax. Perl, unlike most Unix utilities, does not arbitrarily limit the size of
the user’s data, as long as the required memory is available. As an example, Perl can
parse a whole file as a single string. Recursion in Perl is of unlimited depth. The tables
used by hashes, commonly referred to as associative arrays, grow as necessary to
prevent diminished performance. One of Perl’s most useful capabilities is that it can
use sophisticated pattern matching techniques to scan large amounts of data quickly.
And although optimized for scanning text, Perl can also deal with binary data.
In this project, I use Perl to implement CGI scripts for performing the database
manipulation operations, such as insert, delete, edit, and search. Perl and CGI serve as
a part of the middle tier of this application.
2.5 ASP / VBScript
Active Server Pages (ASP) are components that allow Web developers to create
server-side scripted templates. In turn, these templates generate dynamic, interactive web
server applications. By embedding special programmatic codes in standard HTML pages,
a user can interact with page objects such as ActiveX or Java components, access data in
a database, or create other types of dynamic output. The HTML output by an Active
Server Page is totally browser independent, which means that it can be read equally well
by Microsoft Internet Explorer, Netscape Navigator, or most other browsers (ASP-help.com).
In this project, I use ASP technology to allow the implementation of the user login
feature, as well as the add user function, which is done using Visual Basic script, or
VBScript. ASP / VBScript serve as a part of the middle tier of this application.
2.6 B-Course
B-Course is a Web-based data analysis tool for Naive Bayesian modeling.
Specifically, B-Course is used for dependence and classification modeling. B-Course can
be freely used for educational and research purposes as an analysis tool where
dependence or classification modeling based on data is needed. The software provides
two courses of modeling: dependency modeling and classification.
2.7 VISUAL BASIC DATA MINING.NET
Visual Basic Data Mining.Net is a Web portal that provides data mining
algorithm and application documentation, as well as various source codes in .Net and
Visual Basic. These features of the site demonstrate how the .NET Framework and/or
Visual Basic can be used to either learn how data mining algorithms and applications
function or to build data mining applications. Visual Basic Data Mining.Net also offers
a data mining community and provides functionality of data mining algorithms and
applications. The site provides a wizard-based interface for implementing the
algorithms. Visual Basic Data Mining.Net can be found online at: http://www.visual-
basic-data-mining.net.
2.8 SEE5
See5 analyzes data to produce decision trees and/or rulesets that relate a case’s
class to the values of its attributes (See5). In See5, an application consists of a
collection of text files. These files define classes and attributes, describe the cases to
be analyzed, provide new cases to test the classifiers produced by See5, and specify
misclassification costs or penalties. A See5 application consists of two mandatory
files, which are a .names file and a .data file. The .names file defines the classes and
attributes associated with the data. The .data file contains the actual cases to be
analyzed by See5 in the process of producing a classifier.
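As an illustration, the .names file for this dataset might look like the following sketch, which follows See5's documented conventions (the `|` character begins a comment, discrete attributes list their values, and the first entry names the class attribute); the attribute names are taken from Figure 1:

```
| cmc.names -- defines the class and attributes
cmchoice.                 | the class attribute

ID:         label.        | case identifier, not used for prediction
wife_age:   continuous.
wife_ed:    1, 2, 3, 4.
hus_ed:     1, 2, 3, 4.
no_child:   continuous.
wife_rel:   0, 1.
wife_work:  0, 1.
hus_oc:     1, 2, 3, 4.
st_live:    1, 2, 3, 4.
media:      0, 1.
cmchoice:   1, 2, 3.
```

The matching cmc.data file would then hold one comma-separated case per line, with values given in the same attribute order.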
SECTION 3: DATA MINING ALGORITHMS
3.1 NAIVE BAYESIAN CLASSIFICATION
Bayes' Theorem shows how to calculate the probability of one event given that some other event is known to have occurred. Expressed algebraically, the theorem states:
P(A|B) = P(A) * P(B|A) / P(B)
or, the probability that A takes place given that B has occurred (P(A|B)) equals the
probability that A occurs (P(A)) times the probability that B occurs if A has happened
(P(B|A)), divided by the probability of B occurring (P(B)). Naive Bayesian classifiers
make the assumption that an attribute’s effect on a given class is independent of values of
any other attribute, and this assumption is known as class conditional independence. It is
made to simplify the computation and in this sense considered to be “naive” (Naive
Bayes – Introduction).
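As a quick numeric illustration of the formula (the probabilities here are invented for the example, not taken from the dataset):

```python
def bayes(p_a, p_b_given_a, p_b):
    """P(A|B) = P(A) * P(B|A) / P(B)."""
    return p_a * p_b_given_a / p_b

# Hypothetical values: P(A) = 0.3, P(B|A) = 0.8, P(B) = 0.5
print(round(bayes(0.3, 0.8, 0.5), 2))  # 0.48
```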
The independence assumption that underlies the Naive Bayesian classification technique is a strong one and, therefore, may not be realistic. However, a Naive Bayesian classifier can still yield excellent predictions. One example of this case
may occur when a feature selection process on the data is completed prior to
classification. This ensures that only one pair of any highly correlated features is saved
and used in the classification process. When dealing with gene expression data, feature
selection must be performed prior to classification due to the extremely high
dimensionality of the feature space (Wallach, 2003).
A Bayesian network consists of nodes and arcs that can connect pairs of nodes
(P.Myllymäki, et. al). For each variable, exactly one node exists. A major
restriction for the Bayesian network is that arcs are not allowed to form loops. If
the arcs can be followed such that some node is visited twice, the model is not a
Bayesian network. Figure 3 is an example of a network that is NOT a Bayesian
network:
Figure 3 (P.Myllymäki, et.al.)
Presented next is a dependency model for a Bayesian network. This example
model is given in (P.Myllymäki, et. al):
A and B are dependent on each other if we know something about C or D (or both).
A and C are dependent on each other no matter what we know and what we don't know about B or D (or both).
B and C are dependent on each other no matter what we know and what we don't know about A or D (or both).
C and D are dependent on each other no matter what we know and what we don't know about A or B (or both).
There are no other dependencies that do not follow from those listed above.
Figure 4 shows the Bayesian network for these dependencies:
Figure 4 (P.Myllymäki, et.al.)
A and B are considered dependent, when given a (possibly empty) set S that contains
some other variables of the network, if one can freely travel the arcs from A to B. If the
arcs cannot be freely traveled from A to B, A and B are not dependent given S. The
ability to travel an arc is generally independent of the direction of the arc. If S is an empty set, one may travel the arcs forward and backward, provided that the same node is never visited twice and that an arc is never first traveled forward and then immediately afterward traveled backward on some other arc.
In this project, I use B-Course to perform Naive Bayesian dependency modeling
and Naive Bayesian Classification on the contraceptive method choice database.
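B-Course performs this modeling internally, but the core of a Naive Bayesian classifier is small enough to sketch. The following is a minimal illustration in Python (toy records with invented values, not the actual survey data and not B-Course's implementation); it scores each class by P(class) times the product of P(value | class):

```python
from collections import Counter, defaultdict

def train(records, class_attr):
    """Count class frequencies and, per class, attribute-value frequencies."""
    class_counts = Counter(r[class_attr] for r in records)
    value_counts = defaultdict(Counter)
    for r in records:
        for attr, val in r.items():
            if attr != class_attr:
                value_counts[(r[class_attr], attr)][val] += 1
    return class_counts, value_counts

def predict(record, class_counts, value_counts):
    """Choose the class maximizing P(class) * product of P(value | class)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for cls, n in class_counts.items():
        score = n / total
        for attr, val in record.items():
            score *= value_counts[(cls, attr)][val] / n
        if score > best_score:
            best_class, best_score = cls, score
    return best_class

# Toy records (invented values): 1 = no use, 3 = short-term methods
toy = [
    {"wife_rel": 1, "media": 0, "cmchoice": 1},
    {"wife_rel": 1, "media": 0, "cmchoice": 1},
    {"wife_rel": 0, "media": 1, "cmchoice": 3},
]
cc, vc = train(toy, "cmchoice")
print(predict({"wife_rel": 1, "media": 0}, cc, vc))  # 1
```

A production classifier would also smooth the conditional probabilities so that an unseen attribute value does not zero out a class's score.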
3.2 ONE-RULE CLASSIFICATION
The one-rule algorithm creates one data mining rule for the dataset based on one
attribute (one column in a database table). After comparing the error rates from all the
attributes, it then chooses the rule that gives the lowest classification error. The rule will
assign to one category or class each distinct value of one chosen attribute. This rule can
be defined in pseudocode as (Tagbo):
For each attribute in the data set
    For each distinct value of the attribute
        Find the most frequent classification
        Assign the classification to the value
        Calculate the error rate for the value
    Calculate the total error rate for the attribute
Choose the attribute with the lowest error rate
Create one rule for the chosen attribute
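The pseudocode above can be turned into a short runnable sketch (illustrative Python over toy records, not the project's actual middle-tier implementation):

```python
from collections import Counter, defaultdict

def one_rule(records, class_attr):
    """Return (attribute, rule, error_rate) for the best single-attribute rule."""
    best = None
    for attr in records[0]:
        if attr == class_attr:
            continue
        counts = defaultdict(Counter)            # value -> class frequencies
        for r in records:
            counts[r[attr]][r[class_attr]] += 1
        # assign each distinct value its most frequent class
        rule = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        rate = errors / len(records)
        if best is None or rate < best[2]:
            best = (attr, rule, rate)
    return best

# Toy records (invented values): 1 = no use, 3 = short-term methods
toy = [
    {"media": 0, "wife_rel": 1, "cmchoice": 1},
    {"media": 0, "wife_rel": 1, "cmchoice": 1},
    {"media": 1, "wife_rel": 1, "cmchoice": 3},
    {"media": 1, "wife_rel": 0, "cmchoice": 3},
]
print(one_rule(toy, "cmchoice"))  # ('media', {0: 1, 1: 3}, 0.0)
```

On these toy records the media attribute classifies every case correctly, so it is chosen as the one rule with an error rate of zero.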
The goal of the one-rule data mining algorithm in this implementation is to classify each distinct value of the attributes wife_age, wife_ed, hus_ed, no_child, wife_rel, wife_work, hus_oc, st_live, and media of the contraceptive method choice database as no use, long-term methods, or short-term methods. Afterwards, the attribute with the lowest error rate is
chosen as the best rule. In this project, I use Visual Basic Data Mining.Net to process
the results of the One-Rule Classification algorithm on the contraceptive method choice
database.
3.3 DECISION TREE
A visual aid for data mining is the decision tree. A decision tree is in essence a
flow chart of questions or data points. These questions or data points eventually lead to a
decision. Decision tree algorithms begin by finding the test that best splits the data among the preferred categories. At each successive level of the tree,
subsets created by the previous split are themselves split, making a path down the tree.
Each of the paths through the tree represents a rule. However, some rules are more useful than others, and in some cases the predictive power of the entire tree can be improved by pruning back the weaker branches. At each node of the tree, three things can
be measured: the number of records entering the node, the percentage of records
classified correctly at the node, and the way the records would be classified if it were a
leaf node. The tree continues to grow until it is no longer possible to locate more useful
ways to split the incoming records. Decision trees create a set of bins or boxes where the
data miner may place records.
Figure 5 shows a partial binary tree for the classification of musical instruments. The
gap in the center of the row of bins corresponds to the root node of the tree. All stringed
instruments then fall to the left of the gap, and all other instruments fall to the right.
Figure 5 (Berry, Linoff, pg. 245)
In this project, I use See5 to construct decision trees and process those results for the
contraceptive method choice database.
SECTION 4: SYSTEM DESIGN
4.1 SYSTEM LAYOUT
Figure 6: Project System Flow
[Figure 6 nodes: Administrator Login; User Login; Query / Manipulation (Search, Add Records, Delete, Edit); Add Users; Change Admin Password; Logs; Data Mining (Naive Bayes, One Rule, Decision Tree)]
4.2 WEBSITE PRESENTATION
Figure 7: The contraceptive method choice database homepage
Figure 8: Administrator Login Page
To guarantee security, only the privileged database administrator can log in to the database to perform three restricted functions: adding users, deleting records, and editing records. The administrator can also change the admin password.
Figure 9: Administrator Options
After the administrator successfully logs in, administrator options are presented. These
options include: search records, change password, add records, add users, delete records,
and edit records. NOTE: Clicking the “Delete Record” button next to an entry will
delete that entry from the database.
Figure 10: Password Change Success Page
Figure 11: Add User Page
Figure 12: Add User Success Page
Figure 13: Edit Record Page
Figure 14: Edit Record Success Page
Figure 15: User Login Page
Users have privileges to add records and to search for records.
Figure 16: Bad User Login Page
Figure 17: User Request Page
Figure 18: User Request Success Page
Figure 19: Email Message
This is the email that the system automatically sends to the database administrator when a
user requests a login name and password.
Figure 20: Search Page
Figure 21: Search Page Results
Both the database administrator and users have access to the search function.
Figure 22: Add Record Page
Figure 23: Add Record Success Page
Both the database administrator and users have access to the add records function.
Figure 24: Access Log Detail
Both the database administrator and users have access to the access log feature. A count
is kept for the different types of browsers and operating systems used. The log detail
contains the time, date, host server, browser, and operating system of the computer that
accesses the system.
SECTION 5: DISCUSSION
5.1 NAIVE BAYESIAN RESULTS
B-Course was used to construct Bayesian dependency models for the
contraceptive method choice database. All variables, excluding the primary key
ID, were used in constructing the model. When the software is invoked, B-Course
searches for the most probable model for the data and returns these intermediate
results. B-Course can then continue using a search strategy of selecting models
that resemble the current best model, instead of picking models randomly from a
set. As B-Course continues, it collects a set of relatively good models and then
attempts to combine the best parts of these models so that the resulting combined
model is better than any of the original models.
After evaluating 8539 candidate models, B-Course returned the following
Bayesian network as the best model:
Figure 25: Bayesian Network (P.Myllymäki, et.al.)
B-Course was started again, evaluating 444681 more candidate models, for a grand
total of 453220 models evaluated. After searching these candidate models, B-
Course located a new Bayesian network that represents the same model as the
previous network:
Figure 26: New Bayesian Network (P.Myllymäki, et.al.)
B-Course also provides for Naive Bayesian classification. In classification
modeling, one attribute of the data is chosen as the class variable, and the other attributes
become predictor variables. The ultimate goal is to find the model that, given the values
of predictor variables, deduces the value of the class variable. Classification modeling
can also help to test whether some classes are similar or not. For example, if a model can
correctly tell the classes apart, then there must be some difference in those particular
classes. More analysis can measure how significant the differences in classes are.
B-Course merges many quantitative models to build one single classification model.
After running B-Course, 301 candidate models were evaluated. The estimated
classification accuracy of the best model found was 48.74%. On average, the correct class received 36.56% probability. Figure 27 displays the variables B-Course found as
the best subset for predicting the class variable:
Figure 27: Classification model (P.Myllymäki, et.al.)
Figure 28: Class arc weights (P.Myllymäki, et.al.)
It was estimated that if the selected model were used, then 48.74% of future classifications would be done correctly. B-Course built 1473 models, each constructed with one data item of the dataset left out. Each model was then used to classify the data item not used in its construction. Out of the 1473 models, 718 succeeded in classifying the one unseen data item correctly.
A confusion matrix displays how many members of a certain class were predicted
to be members of a different class. Figure 29 shows a confusion matrix for the Naïve Bayesian classifier, where the entries denoting numbers of correct classifications are in bold print.
                       Predicted
             Long-term   No-use   Short-term
Actual
Long-term       102        60        171
No-use           79       319        231
Short-term       66       147        297
Figure 29: Confusion Matrix (P.Myllymäki, et.al.)
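The reported figures can be checked from the matrix: the diagonal entries are the correct classifications, and dividing by the 1473 leave-one-out trials reproduces the stated accuracy:

```python
correct = 102 + 319 + 297        # diagonal of the confusion matrix
accuracy = correct / 1473        # 1473 leave-one-out classifications
print(correct, round(100 * accuracy, 2))  # 718 48.74
```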
5.2 ONE-RULE RESULTS
Using Visual Basic Data Mining.Net software, I applied the one-rule
classification algorithm to the contraceptive method choice database. The steps used in
producing the one-rule results are as follows:
Step 1: Decide which of the attributes will be used to create the best one-rule for the dataset. Attribute ID is not chosen because it is the primary key for the database. Attribute cmchoice is not selected because it is the class attribute, containing the categories needed for classification. The remaining 9 attributes are chosen.
Step 2: List the distinct values of each attribute. These values can be seen in Figure 1.
Step 3: Find the most frequent classification for every distinct value of an attribute using the contraceptive method choice class values (no use, long-term methods, short-term methods). For example, according to the output, when no_child = 8, there were 9 cases of category no use, 7 cases of category long-term methods, and 8 cases of category short-term methods. Therefore, the most frequent classification is category no use, and a rule is made classifying 8 children as category no use (8 children → no use). The error rate for 8 children is the total number of times it appears in the dataset (24) minus the number of instances of its most frequent class (9), divided by the total (24). So the error rate in this case is 15 / 24.
Step 4: Repeat Step 3 for each case of each attribute.
Step 5: Choose the attribute with the lowest error rate
Step 6: Create a one-rule classification based on this attribute
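The arithmetic in Step 3 can be reproduced directly; the class counts below are the no_child = 8 frequencies reported in the output:

```python
counts = {"no use": 9, "long-term": 7, "short-term": 8}  # cases with no_child = 8
total = sum(counts.values())                              # 24
best = max(counts, key=counts.get)                        # most frequent class
error_rate = (total - counts[best]) / total               # (24 - 9) / 24
print(best, error_rate)  # no use 0.625
```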
Figure 30 displays a portion of the one-rule classification output. As shown, the attribute
with the lowest error rate, which was selected as the best rule, is no_child.
Attribute IsNumeric BestRule Value L.P.B. U.P.B. Class Frequency Total
wife_work False False
wife_rel False False
wife_ed False False
st_liv False False
media False False
hus_oc False False
hus_ed False False
wife_age True False
no_child True True 0 0 0 1 62 62
no_child True True 0 0 0 1 33 34
no_child True True 0 0 1 1 94 95
no_child True True 0 1 1 2 31 31
no_child True True 0 1 1 3 61 61
no_child True True 0 1 1 1 49 49
no_child True True 0 1 1 2 15 15
no_child True True 0 1 1.5 3 26 26
no_child True True 0 1.5 2 1 83 83
no_child True True 0 2 2 2 39 39
no_child True True 0 2 2 3 77 77
no_child True True 0 2 2 1 31 31
no_child True True 0 2 2 2 17 17
no_child True True 0 2 2.5 3 29 29
no_child True True 0 2.5 3 1 46 46
no_child True True 0 3 3 2 44 44
no_child True True 0 3 3 3 90 90
no_child True True 0 3 3 1 24 24
no_child True True 0 3 3 2 26 26
no_child True True 0 3 3.5 3 29 29
no_child True True 0 3.5 4 1 37 37
Figure 30: One Rule Output (Tagbo)
5.3 DECISION TREE RESULTS
I performed decision tree analysis on the contraceptive method choice dataset
using See5. There are 1473 instances in the dataset, with the 10 attributes, plus the
unique identifier ID. However, this version of See5 allowed a maximum of 400 cases
that could be used at a time. The class attribute, cmchoice, is represented by three
categories (1 = no use; 2 = long-term; 3 = short-term). The numbers shown between 0
and 1 represent the probability of the attribute, at the given criteria, belonging to the
specific class (no use, long-term, short term). The 400 cases were selected such that
relatively equal numbers of cases for each contraceptive method choice classification are
present. Thus, for the cmc.data file, the breakdown by ID is as follows:
No use (1): ID# 1 – 133
Long-term (2): ID# 416-549
Short-term (3): ID# 643-776
Below is a partial screen shot of a mining run for the ruleset of the attributes. A 95% confidence interval was used for all runs.
Figure 31: Ruleset (Quinlan)
Figure 32: Decision Tree Output (Quinlan)
See5 creates a decision tree of the results. To paraphrase, the tree can be translated in
this manner:
if no_child is less than or equal to 0, then no use
else if no_child > 0
    if wife_ed = 1
        if wife_age > 36, then no use
        else if wife_age <= 36
            if st_live = 1, then no use
            if st_live = 2, then long-term
            if st_live = 3, then short-term
            if st_live = 4, then long-term
    if wife_ed = 2
        ....... (etc.)
From the decision tree, conclusions can be drawn for determining which contraceptive
method choice is best for Indonesian women. For example, a woman with no children
would be most likely to choose no use. A wife with at least one child, a low educational
level, and above the age of 36 is predicted for no use. A wife with at least one child, a
low educational level, less than or equal to 36 years old, and with a standard of living
index of 2 is predicted to have long-term methods. A wife with those same
characteristics, but with a standard of living index of 3, is predicted to have short-term
methods. Numerous predictions can be seen throughout the decision tree.
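For illustration, the branch of the tree spelled out above can be written as a small function (only the wife_ed = 1 branch is covered, as in the excerpt; the remaining branches are omitted):

```python
def predict(no_child, wife_ed, wife_age, st_live):
    """Partial translation of the See5 tree (wife_ed = 1 branch only)."""
    if no_child <= 0:
        return "no use"
    if wife_ed == 1:
        if wife_age > 36:
            return "no use"
        # wife_age <= 36: the decision falls to the standard-of-living index
        return {1: "no use", 2: "long-term", 3: "short-term", 4: "long-term"}[st_live]
    return None  # branches for wife_ed = 2, 3, 4 not shown

print(predict(0, 1, 40, 1))   # no use
print(predict(2, 1, 30, 2))   # long-term
print(predict(2, 1, 30, 3))   # short-term
```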
Many times, classification decisions should change only gradually with changes in attribute values. For example, a threshold may be a value less than or equal to 0.5 for one classification, say long-term methods, and the values more than 0.5 may be another
classification, say short-term methods. If the former holds, we go no further and predict
long-term methods. By default, a threshold such as this is sharp. Therefore, a case with a
hypothetical value of 0.49 is treated quite differently from one with a value of 0.51.
See5 contains an option to use fuzzy thresholds instead of the sharp
thresholds described in the previous paragraph. A fuzzy set is a set whose
elements are usually neither totally in the set nor totally out of the set (Meadow, et
al., pg. 217). When this option is invoked, each threshold is broken into three ranges,
denoted by a lower bound lb, an upper bound ub, and a central value t. If the attribute
value in question is below lb or above ub, classification is made using the single branch
corresponding to the '<=' or '>' result, respectively. If the value falls between lb and ub,
then both branches of the tree are investigated, with the results combined
probabilistically. The values of lb and ub are determined by See5 based on a study of the
perceived sensitivity of classification to small changes in the threshold. Figure 33 shows
a screenshot of the classifier construction options, and Figure 34 displays part of the
decision tree with fuzzy thresholds:
Figure 33: Classifier Construction Options (Quinlan)
Figure 34: Decision Tree Output with Fuzzy Thresholds (Quinlan)
Of note is how the upper and lower bounds of the thresholds are specified. For instance,
in the non-fuzzy example, when no_child is > 0 and wife_ed = 1, wife_age has one
threshold, 36 – if wife_age is greater than 36, no use is returned; if wife_age is less than
or equal to 36, then the tree branches to the st_live attribute to determine the appropriate
class. However, in the fuzzy example, there is no single specific threshold, or cut-off. If
no_child >= 1 and wife_ed = 1, then when wife_age is >= 38, no use is returned; when
wife_age is <= 35, the tree branches to the st_live attribute to determine which
class is predicted. The fuzzy thresholds option constructs an interval close to the
threshold. Within this interval, both branches of the tree are explored. Next, the results
are combined to give a predicted class. When 35 < wife_age < 38, the two branches are
blended and the prediction becomes imprecise. A wife_age value of 36.5 is chosen as the
central fuzzy threshold.
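The fuzzy-threshold behaviour described above can be sketched as follows. Below lb the '<=' branch is taken outright, above ub the '>' branch, and in between the two subtrees are blended. The linear interpolation rule used here is an assumption for illustration only; See5's exact weighting scheme is not documented in this section. The bounds lb = 35 and ub = 38 come from the wife_age example above:

```python
def fuzzy_branch_weights(value, lb, ub):
    """Return (weight_le, weight_gt): how strongly the '<=' and '>'
    branches contribute for a given attribute value. Outside [lb, ub]
    a single branch gets all the weight; inside the interval, the
    weight shifts linearly from the '<=' branch to the '>' branch."""
    if value <= lb:
        return 1.0, 0.0
    if value >= ub:
        return 0.0, 1.0
    w_gt = (value - lb) / (ub - lb)
    return 1.0 - w_gt, w_gt

def fuzzy_classify(value, lb, ub, probs_le, probs_gt):
    """Combine the class-probability dictionaries of the two subtrees
    according to the branch weights, giving one blended prediction."""
    w_le, w_gt = fuzzy_branch_weights(value, lb, ub)
    classes = set(probs_le) | set(probs_gt)
    return {c: w_le * probs_le.get(c, 0.0) + w_gt * probs_gt.get(c, 0.0)
            for c in classes}
```

With these bounds, a wife_age of 34 follows the '<=' subtree alone, 39 follows the '>' subtree alone, and 36.5 (the central value t) mixes the two subtrees equally.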
5.4 CONCLUSION
All three data mining algorithms were successful at predicting the contraceptive
method choice of an Indonesian woman based on her demographic and socio-economic
characteristics. B-Course created Bayesian dependency networks for the attributes of the
dataset. The estimated classification accuracy of the best model found was 48.74%.
Since the resulting classification accuracy is below 50% in this case, the
Naive Bayesian algorithm may not be the best model for this dataset. Creating
additional candidate models might improve the accuracy. One-Rule
classification determined that the no_child attribute, which is the number of children born
to an Indonesian woman, was the best rule for predicting the contraceptive method
choice. The decision tree algorithm determined that the best predictor of the
contraceptive method choice was the rule no_child <= 0, which predicts the
no use category (95.7%). In comparing the regular decision tree to the decision tree
containing fuzzy thresholds, the regular decision tree had an error rate of 25.0%, while
the decision tree with fuzzy thresholds had an error rate of 25.5%. There was not a
significant difference between these two methods.
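The claim that the 25.0% and 25.5% error rates do not differ significantly can be checked with a standard two-proportion z-test. This is a sketch only; it assumes both error rates were measured on 400-case samples, which matches the See5 sample size used above but is not stated explicitly for the evaluation:

```python
import math

def two_proportion_z(err1, n1, err2, n2):
    """z-statistic for the difference between two error proportions,
    using the pooled-proportion standard error."""
    x1, x2 = err1 * n1, err2 * n2          # error counts
    p_pool = (x1 + x2) / (n1 + n2)         # pooled error proportion
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (err1 - err2) / se

z = two_proportion_z(0.255, 400, 0.250, 400)
# |z| is far below the 1.96 critical value at the 95% level,
# consistent with the conclusion of no significant difference.
```

So under these assumptions the test agrees with the stated conclusion: the fuzzy and sharp decision trees perform essentially the same on this dataset.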
WORKS CITED
ASPhelp.com. “What are Active Server Pages?”. Retrieved March 8, 2003, from the World Wide Web. http://www.asp-help.com/getstarted/gs_aboutasp.asp
Berry, Michael, and Gordon Linoff. Data Mining Techniques for Marketing, Sales, and Customer Support. New York: John Wiley and Sons. 1997.
CGI: Common Gateway Interface. Retrieved March 8, 2003, from the World Wide Web. http://hoohoo.ncsa.uiuc.edu/cgi/intro.html.
Delphi 2 – Developing for Multi-Tier Distributed Computing Architectures. Retrieved March 9, 2003, from the World Wide Web. http://community.borland.com/article/0,1410,10343,00.html#three.
Directions on Microsoft. “What is an Application Server?”. Retrieved March 9, 2003, from the World Wide Web. http://www.directionsonmicrosoft.com/sample/DOMIS/research/2002/12dec/1202wiaas.htm
HTML Activity Statement. Retrieved March 8, 2003, from the World Wide Web. http://www.w3.org/MarkUp/Activity.
Lewis, David. “Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval”. Proceedings of ECML-98, 10th European Conference on Machine Learning. Florham Park, NJ: AT&T Labs Research, 1998.
Meadow, Charles, B.R. Boyce, D.H. Kraft. Text Information Retrieval Systems, 2nd Edition. San Diego: Academic Press. 2000.
“Naïve Bayes – Introduction”. Retrieved February 5, 2003, from the World Wide Web. http://www.resample.com/xlminer/help/NaiveBC/classiNB_intro.htm.
O’Reilly and Associates. “Perl”. Retrieved March 8, 2003, from the World Wide Web. http://www.perldoc.com/perl5.6/pod/perl.html.
Myllymäki, P., T. Silander, H. Tirri, and P. Uronen. “B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis”. International Journal on Artificial Intelligence Tools, Vol. 11, No. 3 (2002): 369-387.
Quinlan, Ross. “RuleQuest Research Data Mining Tools”. Retrieved March 18, 2003, from the World Wide Web. http://www.rulequest.com/.
Tagbo, Kingsley. “Visual Basic Data Mining.Net”. http://www.visual-basic-data-mining.net. 2002.
United Nations Population Fund - Indonesia. Retrieved March 16, 2003, from the World Wide Web. http://www.un.or.id/unfpa/idpop.html.
Wallach, Hannah. “Supervised Learning Methods”. Retrieved March 14, 2003, from the World Wide Web. http://www.srcf.ucam.org/~hmw26/coursework/dme/node14.html