Web-Based Data Mining System
description
Transcript of Web-Based Data Mining System
C.-C. ChanDepartment of Computer Science
University of AkronAkron, OH 44325-4003
1UA Faculty Forum 2008 by C.-C. Chan
OutlineOverview of Data MiningSoftware ToolsA Rule-Based System for Data MiningConcluding Remarks
2UA Faculty Forum 2008 by C.-C. Chan
Data Mining (KDD)From Data to KnowledgeProcess of KDD (Knowledge Discovery in
Databases)Related TechnologiesComparisons
3UA Faculty Forum 2008 by C.-C. Chan
Why KDD?
We are drowning in information, but starving for knowledge John Naisbett
Growing Gap between Data Generation and Data Understanding:Automation of business activities:
Telephone calls, credit card charges, medical tests, etc.Earth observation satellites:
Estimated will generate one terabyte (1015 bytes) of data per day. At a rate of one picture per second.Biology: Human Genome database project has collected over gigabytes of data on the human genetic code [Fasman, Cuticchia, Kingsbury, 1994.]
US Census data: NASA databases: …World Wide Web:
4UA Faculty Forum 2008 by C.-C. Chan
Process of KDD
5
[1] Fayyad, U., Editorial, [1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge DiscoveryInt. J. of Data Mining and Knowledge Discovery, Vol.1, Issue 1, 1997., Vol.1, Issue 1, 1997.[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an [2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in overview," in Advances in Knowledge Discovery and Data MiningAdvances in Knowledge Discovery and Data Mining, Fayyad et al (Eds.), MIT Press, 1996., Fayyad et al (Eds.), MIT Press, 1996.
UA Faculty Forum 2008 by C.-C. Chan
Process of KDD
1. Selection Learning the application domain Creating a target dataset
2. Pre-Processing Data cleaning and preprocessing
3. Transformation Data reduction and projection
4. Data Mining Choosing the functions and algorithms of data mining Association rules, classification rules, clustering rules
5. Interpretation and Evaluation Validate and verify discovered patterns
6. Using discovered knowledge
6UA Faculty Forum 2008 by C.-C. Chan
Typical Data Mining TasksFinding Association Rules [Rakesh Agrawal et al,
1993]Each transaction is a set of items.
Given a set of transactions, an association rule is of the form X Y
where X and Y are sets of items. e.g.: 30% of transactions that contain beer also contain
diapers; 2% of all transactions contain both of these items.
Applications:Market basket analysis and cross-marketingCatalog designStore layoutBuying patterns
7UA Faculty Forum 2008 by C.-C. Chan
Finding Sequential Patterns Each data sequence is a list of transactions. Find all sequential patterns with a user-specified minimum
support. e.g.: Consider a book-club database A sequential pattern might be
5% of customers bought “Harry Potter I”, then “Harry Potter II”, and then “Harry Potter III”.
Applications:Add-on salesCustomer satisfactionIdentify symptoms/diseases that precede certain
diseases
8UA Faculty Forum 2008 by C.-C. Chan
Finding Classification Rules Finding discriminant rules for objects of different
classes.Approaches:
Finding Decision Trees Finding Production Rules
Applications:Process loans and credit cards applicationsModel identification
9UA Faculty Forum 2008 by C.-C. Chan
Text MiningWeb Usage MiningEtc.
10UA Faculty Forum 2008 by C.-C. Chan
Related Technologies Database Systems
MS SQL server Transaction databases OLAP (Data Cubes) Data Mining
Decision Trees Clustering Tools
Machine Learning/Data Mining Systems CART (Classification And Regression Trees) C 5.x (Decision Trees) WEKA (Waikato Environment for Knowledge Analysis) LERS ROSE 2
Rule-Based Expert System Development Environments CLIPS, JESS EXSYS
Web-based Platforms Java MS .Net
11UA Faculty Forum 2008 by C.-C. Chan
12
Pre-Processing
LearningData Mining
Inference Engine
End-User Interface
Web-Based Access
Reasoning withUncertaintie
s
MS SQL Server
N/A Decision TreesClustering
N/A N/A N/A N/A
CARTC 5.x
N/A Decision Trees Built-in Embedded N/A N/A
WEKA Yes Trees, Rules, Clustering, Association
N/A Embedded Need Programming
N/A
CLIPSJESS
N/A N/A Built-in Embedded NeedProgramming
3rd parties Extensions
Comparisons
UA Faculty Forum 2008 by C.-C. Chan
Rule-Based Data Mining System ObjectivesDevelop an integrated rule-based data mining
system provides Synergy of database systems, machine
learning, and expert systemsDealing with uncertain rulesDelivery of web-based user interface
13UA Faculty Forum 2008 by C.-C. Chan
Structure of Rule-Based SystemsStructure of Rule-Based Systems
Rule Base
Working Memory
Execution
Selector
Matcher
No
Yes
Answer
Inference Result
14UA Faculty Forum 2008 by C.-C. Chan
15
System Workflow
InputData Set
Data Pre-processing
RuleGenerator
UserInterfaceGenerator
UA Faculty Forum 2008 by C.-C. Chan
16
Input Data Set:Text file with comma separated values (CSV)It is assumed that there are N columns of values
corresponding to N variables or parameters, which may be real or symbolic values.
The first N – 1 variables are considered as inputs and the last one is the output variable.
Data Preprocessing:Discretize domains of real variables into a finite number
of intervals Discretized data file is then used to generate an
attribute information file and a training data file.Rule Generator:
A symbolic learning program called BLEM2 is used to generate rules with uncertainty
User Interface Generator:Generate a web-based rule-based system from a rule file
and corresponding attribute file
UA Faculty Forum 2008 by C.-C. Chan
Architecture of RBC generator
17
Requests
Middle Tier
Client
Responses
SQL DB server
Workflow of RBC generator
Rule set File
Metadata File
Rule Table Definition
RBC Generator
UA Faculty Forum 2008 by C.-C. Chan
Concluding RemarksA system for generating rule-based classifier from data with the following benefits:
No need of end user programmingAutomatic rule-based system creationDelivery system is web-based provides easy
access
18UA Faculty Forum 2008 by C.-C. Chan
Project StatusThe current version 1.4 of our system provides fundamental features for data mining from data including:
Data PreprocessingManagement of preprocessed data filesMachine Learning tool to generate rules from
data Rule-Based Classifier system supporting
uncertain rulesWeb-Based access
19UA Faculty Forum 2008 by C.-C. Chan
Future WorkMore advanced features in Data
Preprocessing such as data cleansing, data transformation, and data statistics
Learning from multi-criteria inputs with preferential rankings to support Multiple Criteria Decision Making processes
Concept-Oriented information retrieval and search
20UA Faculty Forum 2008 by C.-C. Chan
Thank You!
21UA Faculty Forum 2008 by C.-C. Chan