Post on 04-Jan-2016
A/W & Dr. Chen, Data Mining
Part I
Data Mining Fundamentals
Chapter 1
Data Mining: A First View
Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
chen@jepson.gonzaga.edu
1.1 Data Mining: A Definition
• The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Induction-based Learning
• The process of forming general concept definitions by observing specific examples of concepts to be learned.
Knowledge Discovery in Databases (KDD)
• The application of the scientific method to data mining. Data mining is one step of the KDD process.
Data Mining Examples
• A telephone company used a data mining tool to analyze its customer data warehouse. The tool found about 10,000 supposedly residential customers who were spending over $1,000 monthly on phone bills.
• After further study, the phone company discovered that they were really small business owners trying to avoid paying business rates.
Other Data Mining Examples
• 65% of customers who did not use the credit card in the last six months are 88% likely to cancel their accounts.
• If age < 30 and income <= $25,000 and credit rating < 3 and credit amount > $25,000 then the minimum loan term is 10 years.
• 82% of customers who bought a new TV 27" or larger are 90% likely to buy an entertainment center within the next 4 weeks.
1.2 What Can Computers Learn?
Four Levels of Learning
• Fact
– a simple statement of truth
• Concept
– a set of objects, symbols, or events grouped together because they share certain characteristics
• Procedure
– a step-by-step course of action to achieve a goal. We use procedures in our everyday functioning as well as in the solution of difficult problems.
• Principle
– represents the highest level of learning. Principles are general truths or laws that are basic to other truths.
Source: Merrill and Tennyson, 1977; p. 5 of the text
Concepts
• Computers are good at learning concepts. Concepts are the output of a data mining session.
Three Concept Views
• Classical View
– Attests that all concepts have definite defining properties.
• Probabilistic View
– Concepts are represented by properties that are probable of concept members.
• Exemplar View
– States that a given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept.
Figure - A hierarchy of data mining strategies
Data Mining Strategies
• Unsupervised Clustering
– No output attributes
• Supervised Learning
– Classification: categorical/discrete output (current behavior)
– Estimation: numeric output
– Prediction: future outcome (categorical/numeric)
• Market Basket Analysis
Supervised Learning
Two purposes:
• 1. Build a learner (classification) model using data instances of known origin.
– This is an induction process.
• 2. Use the model to determine the outcome of new instances of unknown origin.
– This is a deduction process.
Supervised learning is the process of building classification models using data instances of known origin.
Supervised Learning:
A Decision Tree Example
Decision Tree
• A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Table 1.1 – Hypothetical Training Data for Disease Diagnosis
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold
Swollen Glands
    Yes: Diagnosis = Strep Throat
    No: Fever
        Yes: Diagnosis = Cold
        No: Diagnosis = Allergy
Figure 1.1 – A decision tree for the data in Table 1.1
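The induction step (building a tree from instances of known origin) can be sketched in a few lines. The slides do not say which algorithm produced Figure 1.1, so the code below is only an illustration that assumes an ID3-style learner: at each node it picks the attribute with the highest information gain. The names `entropy`, `information_gain`, and `build_tree` are hypothetical helpers, not part of the text.

```python
import math
from collections import Counter

ATTRIBUTES = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache"]

# Table 1.1: attribute values in ATTRIBUTES order, followed by the diagnosis.
ROWS = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No", "No", "No", "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No", "Yes", "No", "Cold"),
    ("Yes", "No", "Yes", "No", "No", "Strep throat"),
    ("No", "Yes", "No", "Yes", "No", "Cold"),
    ("No", "No", "No", "Yes", "No", "Allergy"),
    ("No", "No", "Yes", "No", "No", "Strep throat"),
    ("Yes", "No", "No", "Yes", "Yes", "Allergy"),
    ("No", "Yes", "No", "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No", "Yes", "Yes", "Cold"),
]
TRAINING = [dict(zip(ATTRIBUTES + ["Diagnosis"], row)) for row in ROWS]


def entropy(instances):
    """Shannon entropy (in bits) of the Diagnosis values in a set of instances."""
    counts = Counter(inst["Diagnosis"] for inst in instances)
    total = len(instances)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(instances, attribute):
    """Reduction in entropy obtained by splitting the instances on one attribute."""
    remainder = 0.0
    for value in {inst[attribute] for inst in instances}:
        subset = [inst for inst in instances if inst[attribute] == value]
        remainder += len(subset) / len(instances) * entropy(subset)
    return entropy(instances) - remainder


def build_tree(instances, attributes):
    """Return a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    classes = {inst["Diagnosis"] for inst in instances}
    if len(classes) == 1:
        return classes.pop()  # all instances agree: terminal node
    if not attributes:
        # No tests left: fall back to the most common diagnosis.
        return Counter(inst["Diagnosis"] for inst in instances).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(instances, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({inst[best] for inst in instances}):
        subset = [inst for inst in instances if inst[best] == value]
        branches[value] = build_tree(subset, remaining)
    return {best: branches}


tree = build_tree(TRAINING, ATTRIBUTES)
print(tree)
```

On this small data set the root test is Swollen Glands and the only other test is Fever, which agrees with Figure 1.1. Other learners or tie-breaking rules could yield a different tree on other data.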
Table 1.2 – Data Instances with an Unknown Classification
Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
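The deduction step (applying the model to instances of unknown origin) amounts to following the tests in Figure 1.1 from the root down. A minimal sketch, where the function name `diagnose` and the dictionary layout are illustrative assumptions:

```python
def diagnose(patient):
    """Follow the tests of the Figure 1.1 decision tree for one patient."""
    if patient["Swollen Glands"] == "Yes":
        return "Strep throat"
    if patient["Fever"] == "Yes":
        return "Cold"
    return "Allergy"


# The three unclassified instances, keyed by patient ID.
# Only the attributes tested by the tree are needed.
UNKNOWN = {
    11: {"Fever": "No", "Swollen Glands": "Yes"},
    12: {"Fever": "Yes", "Swollen Glands": "No"},
    13: {"Fever": "No", "Swollen Glands": "No"},
}

predictions = {pid: diagnose(patient) for pid, patient in UNKNOWN.items()}
for pid, diagnosis in predictions.items():
    print(pid, diagnosis)
```

Under this tree, patient 11 is classified as strep throat, patient 12 as cold, and patient 13 as allergy. Sore throat, congestion, and headache play no role because the tree never tests them.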
Production Rules
We can translate any decision tree into a set of production rules. They are rules of the form:
IF <antecedent conditions> THEN <consequent conditions>
• IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
• IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
• IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
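The translation from tree to rules can be sketched as a walk over every root-to-leaf path: the tests along a path form the antecedent and the leaf forms the consequent. The nested-dictionary encoding of the tree and the function name `tree_to_rules` are assumptions made for illustration.

```python
# The decision tree for the disease diagnosis data, encoded as
# {attribute: {value: subtree-or-leaf}}. A leaf is a plain string.
TREE = {
    "Swollen Glands": {
        "Yes": "Strep Throat",
        "No": {"Fever": {"Yes": "Cold", "No": "Allergy"}},
    }
}


def tree_to_rules(node, conditions=()):
    """Return one IF ... THEN ... rule per root-to-leaf path."""
    if not isinstance(node, dict):
        # Leaf: the accumulated tests become the antecedent.
        antecedent = " & ".join(f"{attr} = {value}" for attr, value in conditions)
        return [f"IF {antecedent} THEN Diagnosis = {node}"]
    rules = []
    (attribute, branches), = node.items()  # each internal node tests one attribute
    for value, child in branches.items():
        rules.extend(tree_to_rules(child, conditions + ((attribute, value),)))
    return rules


rules = tree_to_rules(TREE)
for rule in rules:
    print(rule)
```

Each terminal node yields exactly one rule, so a tree with three leaves produces three rules.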
Unsupervised Clustering
• A data mining method that builds models from data without predefined classes (see Table 1.3).
• Data instances are grouped together based on a similarity scheme defined by the clustering system.
• With the help of one or several evaluation techniques, it is up to us to decide the meaning of the formed clusters.
Table 1.3 – Acme Investors Incorporated
Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5             M    40–49  Golf                 60–79K
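To make the idea of a similarity scheme concrete, the sketch below scores each pair of Acme customers by the fraction of categorical attributes on which they agree (a simple matching measure). This measure is an assumption chosen for illustration; the slides do not specify one, and a real clustering system would define its own scheme. The numeric Trades/Month column is left out to keep the example short.

```python
from itertools import combinations

# Categorical attributes from Table 1.3, in this order:
# account type, margin account, transaction method, sex, age, recreation, income
CUSTOMERS = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}


def simple_matching(a, b):
    """Fraction of attributes on which two customers have the same value."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Similarity score for every unordered pair of customers.
similarity = {
    (i, j): simple_matching(CUSTOMERS[i], CUSTOMERS[j])
    for i, j in combinations(CUSTOMERS, 2)
}

# List pairs from most to least similar.
for pair, score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A clustering system would group instances whose pairwise scores are high. Deciding what the resulting groups mean is still left to the analyst.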
Possible Questions
Questions for supervised learning
1. Can I develop a general profile of an online investor? If so, what characteristics distinguish online investors from investors that use a broker?
2. Can I determine if a new customer who does not initially open a margin account is likely to do so in the future?
3. Can I build a model able to accurately predict the average number of trades per month for a new investor?
4. What characteristics differentiate female and male investors?
Questions for unsupervised learning
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?
1.3 Is Data Mining Appropriate for My Problem?
Data Mining or Data Query?
• Shallow Knowledge
– Is factual; tools used: DBMS/SQL
• Multidimensional Knowledge
– Is factual; tools used: OLAP
• Hidden Knowledge
– Represents patterns or regularities in data that cannot be easily found; tools used: data mining
• Deep Knowledge
– Knowledge stored in a database that can only be found if we are given some direction.
23A/W & Dr. Chen, Data MiningDr. Chen, Data Mining
Data Mining vs. Data Query: An Example
• Use data query if you already know, or almost know, what you are looking for.
• Use data mining to find regularities in data that are not obvious.
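A data query answers a question that is already fully formed. As a hedged illustration, the sketch below loads three columns of the Acme Investors data (Table 1.3) into an in-memory SQLite database and asks for the average number of trades per month by transaction method. The table name `customers` and its column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, transaction_method TEXT, trades_per_month REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1005, "Online", 12.5),
        (1013, "Broker", 0.5),
        (1245, "Online", 3.6),
        (2110, "Broker", 22.3),
        (1001, "Online", 5.0),
    ],
)

# We already know what we are looking for, so a query is enough.
rows = conn.execute(
    "SELECT transaction_method, AVG(trades_per_month) "
    "FROM customers GROUP BY transaction_method "
    "ORDER BY transaction_method"
).fetchall()
avg_trades = dict(rows)
conn.close()

for method, average in avg_trades.items():
    print(method, round(average, 2))
```

The query returns facts about attributes we chose in advance. It does not reveal which combinations of attributes distinguish online investors from broker customers; discovering that kind of regularity is the job of data mining.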
1.4 Expert Systems or Data Mining?
Expert System and Knowledge Engineer
• An expert system is a computer program that emulates the problem-solving skills of one or more human experts.
• A knowledge engineer is a person trained to interact with an expert in order to capture their knowledge.
Data → Data Mining Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert → Knowledge Engineer → Expert System Building Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
1.5 A Simple Data Mining Process Model
Figure 1.3 - A simple data mining process model
(Diagram components: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application)
Characteristics of Data Warehouse
• Data Warehouse:
– Definition: a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
– Subject-oriented: e.g. customers, patients, students, products
– Integrated: Consistent naming conventions, formats, encoding structures; from multiple data sources
– Time-variant: Can study trends and changes
– Non-updatable: Read-only, periodically refreshed
• Data Mart:
– A data warehouse that is limited in scope
A four-step process for performing a data mining session
• 1. Assembling the data
– Operational database (relational databases and flat files) vs. data warehouse
• 2. Mining the data (giving the data to a mining tool)
– Instances for building the model or testing the model
• 3. Interpreting the results
• 4. Result application
1.7 Data Mining Applications (p.24)
• Fraud Detection
• Health care
• Business and finance
• Scientific applications
• Sports and gaming
Figure - Customer Intrinsic Value (axes: Intrinsic (Predicted) Value and Actual Value; labels A, B, C)