Data Mining Lecture 1: Introduction to Data Mining

Data Mining Lecture 1: Introduction to Data Mining Manuel Penaloza, PhD


Page 1: Data Mining Lecture 1: Introduction to Data Mining

Data Mining

Lecture 1:

Introduction to Data Mining

Manuel Penaloza, PhD

Page 2: Data Mining Lecture 1: Introduction to Data Mining

Introduction to Data Mining

• Society produces huge amounts of data daily
  – Retail stores
    – Point-of-sale (POS) data on customer purchases
  – Banks
    – Collections of customer service calls
  – Telecommunications
    – Phone call records (mobile and landline calls)
  – Medicine
    – Genomic data collected on the structure of genes
  – Government
    – Law enforcement data, income tax data
  – Others: (transactional) data from sports, schools, research, search engines, etc.

Page 3: Data Mining Lecture 1: Introduction to Data Mining

What is Data Mining (DM)?

• It is the process of discovering hidden relationships and patterns in large data sets
  – It can also predict the outcome of a future observation
• Data mining is an interdisciplinary field
  – It is an extension of statistical analysis
  – It uses techniques from:
    – Statistics
    – Machine learning
    – Pattern recognition
    – Database technology
    – Visualization
    – High-performance computing

Page 4: Data Mining Lecture 1: Introduction to Data Mining

Questions answered by DM

• Extracting useful information from a data set that answers:
  – Which credit card (CC) customers are most profitable?
  – Which loan applicants are high-risk?
  – Which customers will respond to a planned promotion?
  – How do we detect phone card fraud?
  – How do customer profiles change over time?
  – Which customers prefer product A over product B?
  – What is the revenue prediction for next year?
  – Which students are more likely than others to transfer?
  – Which taxpayers may be cheating the system?
  – Who is most likely to violate a probation sentence?
  – What is the predicted outcome of a given treatment?

Page 5: Data Mining Lecture 1: Introduction to Data Mining

Data sources

• Relational databases
  – Transactional data with many tables
• Data warehouses
  – Historical data, aggregated and updated periodically
• Files
  – In a special format (e.g., CSV) or a proprietary binary format
• Internet or electronic mail
  – HTML, XML, web search results, e-mails
• Scientific and research data
  – Seismology, remote sensing, etc.

Page 6: Data Mining Lecture 1: Introduction to Data Mining

Example: Health System

• Characteristics of the health system:
  – Personal medical records (GP, specialists, etc.)
  – Billing records
  – Hospital data (surgery, admission, etc.)
• Questions:
  – Are MDs following the procedures?
  – Which patients may have an adverse drug reaction?
  – Are people committing fraud?
  – Which patients are most likely to get cancer?

Page 7: Data Mining Lecture 1: Introduction to Data Mining

Case study: E-commerce

• A person buys a book from Amazon.com
• Objective: Recommend other books this person is likely to buy
• Amazon may do clustering or sequential pattern analysis based on books bought by other people
• Data analyzed:
  – Customers who bought “Data Mining: Practical Machine Learning Tools and Techniques” also bought “Introduction to Data Mining”
• Recommendations have been successful for Amazon
  – Increasing buyers’ satisfaction and purchases
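
The "customers who bought X also bought Y" pattern can be sketched as a simple co-purchase count. This is only an illustration of the idea, not Amazon's actual method; the order data, the shortened titles, and the `also_bought` function are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: one set of (shortened) book titles per customer.
orders = [
    {"Practical ML Tools", "Intro to Data Mining"},
    {"Practical ML Tools", "Intro to Data Mining", "Concepts and Techniques"},
    {"Practical ML Tools", "Concepts and Techniques"},
    {"Intro to Data Mining", "Discovering Knowledge in Data"},
    {"Practical ML Tools", "Intro to Data Mining"},
]


def also_bought(orders, title, top_n=3):
    """Rank other titles by how many customers bought them together with `title`."""
    counts = Counter()
    for basket in orders:
        if title in basket:
            # Count every other title in the same customer's history.
            counts.update(basket - {title})
    return counts.most_common(top_n)


recommendations = also_bought(orders, "Practical ML Tools")
print(recommendations)
```

A real recommender would also need to correct for titles that are popular with everyone; raw counts are the simplest possible starting point.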

Page 8: Data Mining Lecture 1: Introduction to Data Mining

What motivated data mining?

• Growth in data collection
• Presence of data warehouses with reliable data
• Competitive pressure to increase sales
• The development of commercial off-the-shelf (COTS) data mining software
  – Examples: XLMiner, Insightful Miner, SAS, SPSS
• Growth of computing power and storage capacity
• High dimensionality of the data
• Heterogeneous and complex data
• Limitations of humans

Page 9: Data Mining Lecture 1: Introduction to Data Mining

Insightful Miner™ 7: GUI

*Figures taken from the Insightful Miner 7 Guide

Page 10: Data Mining Lecture 1: Introduction to Data Mining

Creating Models

• Create a network of pipelined components
  – By dragging and dropping components

Page 11: Data Mining Lecture 1: Introduction to Data Mining

Choosing a data mining system

• Systems differ in functionality and methodology
• Selection is determined by:
  – Type of operating system used in your organization
  – The data sources handled by the tool:
    – ASCII text files, relational databases, XML data
  – The data mining functions and methods offered
  – Scalability of the system
    – Row and column scalability
  – Visualization tools available
  – Graphical user interface that guides the execution of the methods
  – Integration with other information systems
  – Cost and performance

Page 12: Data Mining Lecture 1: Introduction to Data Mining

Data Mining in Databases

• Current applications include data mining modules
• Examples:
  – Database management systems such as Oracle and MS SQL Server
  – CRM (Customer Relationship Management)
• Advantages for database systems:
  – One-stop shopping
  – Minimize data movement and conversion
• Disadvantages for database systems:
  – Limited to the DM methods available in the system
  – Data extractions and transformations may not be powerful enough

Page 13: Data Mining Lecture 1: Introduction to Data Mining

Standard data mining life cycle

• CRISP-DM (Cross-Industry Standard Process for Data Mining)
• It is an iterative process with phase dependencies
• It consists of six (6) phases: see www.crisp-dm.org for more information

Page 14: Data Mining Lecture 1: Introduction to Data Mining

CRISP-DM

• Cross-industry standard developed in 1996
  – By analysts from SPSS/ISL, NCR, Daimler-Benz, and OHRA
• Funding from the European Commission
• Important characteristics:
  – Non-proprietary
  – Application/industry neutral
  – Tool neutral
  – General problem-solving process
  – Process with six phases, but missing:
    – Saving results and updating the model

Page 15: Data Mining Lecture 1: Introduction to Data Mining

CRISP-DM Phases (1)

• Business Understanding
  – Understand project objectives and requirements
  – Formulate a data mining problem definition
• Data Understanding
  – Collect the data
  – Evaluate the quality of the data
  – Perform exploratory data analysis
• Data Preparation
  – Clean, prepare, integrate, and transform the data
  – Select appropriate attributes and variables
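
As a rough illustration of the Data Preparation phase, the sketch below cleans a small data set by dropping records with missing values and then transforms the selected attributes to a common 0 to 1 scale. The records, the attribute names, and the `prepare` function are hypothetical, and min-max scaling is just one of many possible transformations.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"age": 20, "income": 30000},
    {"age": 40, "income": None},
    {"age": 60, "income": 90000},
    {"age": 40, "income": 60000},
]


def prepare(records, fields):
    """Drop records with missing values, then min-max scale each field to [0, 1]."""
    complete = [r for r in records if all(r.get(f) is not None for f in fields)]
    if not complete:
        return []
    prepared = [dict(r) for r in complete]  # copy so the input stays untouched
    for f in fields:
        values = [r[f] for r in complete]
        low, high = min(values), max(values)
        span = high - low
        for r in prepared:
            # A constant column carries no information; map it to 0.0.
            r[f] = (r[f] - low) / span if span else 0.0
    return prepared


prepared = prepare(records, ["age", "income"])
print(prepared)
```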

Page 16: Data Mining Lecture 1: Introduction to Data Mining

CRISP-DM Phases (2)

• Modeling
  – Select and apply appropriate modeling techniques
  – Calibrate model parameters to optimize results
  – If necessary, return to the data preparation phase to satisfy the model's data format
• Evaluation
  – Determine whether the model satisfies the objectives set in phase 1
  – Identify business issues that have not been addressed
• Deployment
  – Organize and present the model to the “user”
  – Put the model into practice
  – Set up for continuous mining of the data

Page 17: Data Mining Lecture 1: Introduction to Data Mining

Data mining tasks (1)

• Classification
  – Predict the categorical value of a target (dependent) variable based on the values of other attributes
  – Target variable is partitioned into classes
  – It predicts the class membership of a new observation
  – Example: Which drug should be prescribed for older patients with low sodium/potassium ratios?
• Estimation
  – Similar to classification except that the target variable is numeric
  – That is, predicting a numeric value
  – Example: Estimate the blood pressure of a person based on his/her age, gender, body mass index, etc.
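
The estimation task can be sketched with a one-variable least-squares line. The ages and blood pressure readings below are invented for illustration (they are not clinical data), and `fit_line` is a hypothetical helper; a realistic model would also use gender, body mass index, and other attributes.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: age in years, systolic blood pressure in mmHg.
ages = [30, 40, 50, 60]
pressures = [115, 120, 125, 130]

slope, intercept = fit_line(ages, pressures)
estimate_at_45 = slope * 45 + intercept  # numeric target, so this is estimation
print(slope, intercept, estimate_at_45)
```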

Page 18: Data Mining Lecture 1: Introduction to Data Mining

Data mining tasks (2)

• Prediction
  – Similar to estimation except that the results lie in the future
  – Example: Predict the price of a stock 3 months into the future
• Clustering
  – Group similar records together
  – Example: Find patients with similar profiles
• Associations
  – Uncover rules that indicate the association between two or more attributes
  – Example: Find out which items are purchased together
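
Association rules are usually scored with support and confidence. The slides do not define these measures, so the sketch below uses their standard definitions; the baskets and function names are hypothetical.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

Here the rule "bread implies milk" holds in two of the four baskets, and in two of the three baskets that contain bread.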

Page 19: Data Mining Lecture 1: Introduction to Data Mining

Task: Classification

• Build a model that learns to predict the class from pre-labeled instances or observations
  – Many approaches: Regression, Decision Trees, Neural Networks
• Given a set of points from known classes, what is the class of a new point?

* Diagram taken from www.kdnuggets.com/data_mining_course/index.html
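
The question "what is the class of a new point?" can be sketched with a k-nearest-neighbor vote. Note that this swaps in k-nearest neighbor, which the lecture lists among its methods, rather than any of the three approaches named on this slide; the training points, labels, and `knn_classify` function are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-labeled observations: ((x, y), class label).
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]


def knn_classify(train, point, k=3):
    """Return the majority class among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify(train, (2, 2)))
print(knn_classify(train, (6, 5)))
```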

Page 20: Data Mining Lecture 1: Introduction to Data Mining

Task: Clustering

• Find groupings of instances given unlabeled data

* Diagram taken from www.kdnuggets.com/data_mining_course/index.html
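
One common way to group unlabeled data is k-means, shown here as a minimal sketch. The slide does not name a clustering algorithm, so k-means is an assumption, as are the sample points, the `kmeans` function, and the choice to start from the first k points.

```python
import math

# Hypothetical unlabeled 2-D observations.
points = [(1, 1), (8, 8), (1, 2), (2, 1), (8, 9), (9, 8)]


def kmeans(points, k, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(points[:k])  # deterministic start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The result depends on the starting centroids; practical implementations usually try several random starts and keep the best grouping.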

Page 21: Data Mining Lecture 1: Introduction to Data Mining

DM looks easy

[Diagram: Data → Data Mining Method (Regression, Decision Tree, Neural Network, Association Rules) → Model]

- But it is not easy
- The real world is complicated

Page 22: Data Mining Lecture 1: Introduction to Data Mining

Methods and Techniques

• Cluster analysis (task: clustering)
• Association rules (task: association)
• Decision trees (tasks: prediction, classification)
• Neural networks (tasks: prediction, classification)
• K-nearest neighbor (tasks: prediction, classification, clustering)
• Regression analysis (tasks: estimation, prediction)
• Confidence interval estimation (task: estimation)
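
As an example of the last method in the list, the sketch below computes a confidence interval for a mean using the normal approximation. The sample values and the `mean_confidence_interval` function are hypothetical; for a sample this small, a t-distribution would normally be used, which gives a somewhat wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample of a numeric attribute (e.g., blood pressure readings).
sample = [118, 122, 125, 130, 121, 127, 124, 119]


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    center = mean(sample)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return center - half_width, center + half_width


low, high = mean_confidence_interval(sample)
print(low, high)
```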

Page 23: Data Mining Lecture 1: Introduction to Data Mining

Fallacies of Data Mining (1)

• Fallacy 1: There are data mining tools that automatically find the answers to our problems
  – Reality: There are no automatic tools that will solve your problems “while you wait”
• Fallacy 2: The DM process requires little human intervention
  – Reality: The DM process requires human intervention in all its phases, including updating and evaluating the model by human experts
• Fallacy 3: Data mining has a quick ROI
  – Reality: It depends on the startup costs, personnel costs, data source costs, and so on

Page 24: Data Mining Lecture 1: Introduction to Data Mining

Fallacies of Data Mining (2)

• Fallacy 4: DM tools are easy to use
  – Reality: Analysts must be familiar with the model
• Fallacy 5: DM will identify the causes of the business problem
  – Reality: DM tools only identify patterns in your data; analysts must identify the causes
• Fallacy 6: Data mining will clean up a data repository automatically
  – Reality: A sequence of transformation tasks must be defined by an analyst during the early DM phases

* Fallacies described by Jen Que Louie, President of Nautilus Systems, Inc.

Page 25: Data Mining Lecture 1: Introduction to Data Mining

In summary,

• Problems suitable for data mining:
  – Require discovering knowledge to make the right decisions
  – Current solutions are not adequate
  – Expected high payoff for the right decisions
  – Have accessible, sufficient, and relevant data
  – Have a changing environment
• IMPORTANT:
  – ENSURE privacy if personal data is used!
  – Not every data mining application is successful!

Page 26: Data Mining Lecture 1: Introduction to Data Mining

Main References

• Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann Publishers
• Daniel Larose. Discovering Knowledge in Data: An Introduction to Data Mining, Wiley
• Pang-Ning Tan et al. Introduction to Data Mining, Addison Wesley
• Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
• Online data mining course offered by KDnuggets™ at www.kdnuggets.com/data_mining_course/index.html
• Engineering Statistics Handbook, available online at http://www.itl.nist.gov/div898/handbook/eda/section1/eda126.htm

Page 27: Data Mining Lecture 1: Introduction to Data Mining

Exercise #1

• CRISP-DM is not the only DM process. Do a quick search on the Internet for another process, and describe its similarities to and differences from CRISP-DM.
• Determine how data mining could help a web search engine company like Google in its operations.
  – Identify one or more objectives.
  – Which data mining task(s) could help this company?