lovede

Data Mining Adrian Tuhtan 004757481 CS157A Section1


Page 1: lovede

Data Mining

Adrian Tuhtan 004757481 CS157A Section1

Page 2: lovede

Overview

Introduction
Explanation of Data Mining
Techniques
Advantages
Applications
Privacy

Page 3: lovede

Data Mining
What is Data Mining?
  “The process of semi-automatically analyzing large databases to find useful patterns” (Silberschatz)
  KDD – “Knowledge Discovery in Databases” (3)
  “Attempts to discover rules and patterns from data”
    Discover rules
    Make predictions
Areas of Use
  Internet – Discover needs of customers
  Economics – Predict stock prices
  Science – Predict environmental change
  Medicine – Match patients with similar problems to find a cure

Page 4: lovede

Example of Data Mining
A credit card company wants to discover information about its clients from its databases. It wants to find:
  Clients who respond to promotions in “junk mail”
  Clients who are likely to switch to a competitor
  Clients who are likely not to pay
  Services that clients use, in order to promote services affiliated with the credit card company
  Anything else that may help the company provide and promote services that help its clients and ultimately make more money

Page 5: lovede

Data Mining & Data Warehousing

Data Warehouse: “is a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.” (Silberschatz)
  Collect data
  Store in a single repository
  Allows for easier query development, as a single repository can be queried

Data Mining:
  Analyzing databases or data warehouses to discover patterns in the data and gain knowledge
  Knowledge is power

Page 6: lovede

Discovery of Knowledge

Page 7: lovede

Data Mining Techniques

Classification
Clustering
Regression
Association Rules

Page 8: lovede

Classification
Classification: Given a set of items that fall into several classes, and given past instances (training instances) with their associated classes, classification is the process of predicting the class of a new item.

The goal is therefore to classify the new item, that is, to identify which class it belongs to.

Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes “Responds Rarely”, “Responds Sometimes”, and “Responds Frequently”.

The bank will then attempt to find rules about the customers that respond frequently and sometimes.

The rules could be used to predict the needs of potential customers.

Page 9: lovede

Technique for Classification

Decision-Tree Classifiers

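[Figure: decision tree. The root node splits on Job (Carpenter, Engineer, Doctor). Each branch leads to an Income node whose leaves are rated Bad or Good. Income thresholds shown on the leaves, left to right: <30K, <40K, <50K, >50K, >90K, >100K.]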

Predicting credit risk of a person with the jobs specified.
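A decision tree of this shape can be read as nested rules: branch on the job first, then compare income against a cutoff. The following is only a minimal Python sketch. The function name `credit_risk` and the single cutoff per job (30K, 50K, 100K) are assumptions chosen for illustration; they are not taken verbatim from the slide.

```python
# Hypothetical income cutoffs per job (assumed values, for illustration only).
THRESHOLDS = {"Carpenter": 30_000, "Engineer": 50_000, "Doctor": 100_000}


def credit_risk(job: str, income: float) -> str:
    """Walk a two-level decision tree: split on job, then on income."""
    if job not in THRESHOLDS:
        raise ValueError(f"no branch in the tree for job {job!r}")
    # Leaf: income at or above the job's cutoff is rated Good, otherwise Bad.
    return "Good" if income >= THRESHOLDS[job] else "Bad"


print(credit_risk("Engineer", 62_000))  # Good
print(credit_risk("Doctor", 85_000))    # Bad
```

In practice the splits and cutoffs would be learned from the training instances rather than written by hand.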

Page 10: lovede

Clustering
“Clustering algorithms find groups of items that are similar. … It divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.” (2)

Example: An insurance company could use clustering to group clients by their age, location, and types of insurance purchased.

The categories are unspecified, and this is referred to as ‘unsupervised learning’.
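As a rough illustration of grouping without predefined categories, the sketch below clusters client ages with a tiny k-means loop. K-means is not named on the slide; it is swapped in here as one common clustering algorithm. The ages, the starting centers, and the function name `kmeans_1d` are all made up for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers (simple k-means)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centers


ages = [22, 25, 27, 41, 45, 48, 67, 70, 72]  # hypothetical client ages
clusters, centers = kmeans_1d(ages, centers=[20, 40, 60])
print(clusters)
```

No labels are supplied: the three age groups emerge only from how close the values are to each other.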

Page 11: lovede

Clustering
Group data into clusters
  Similar data is grouped in the same cluster
  Dissimilar data is grouped in different clusters

How is this achieved?
K-Nearest Neighbor
  A classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). (2)
Hierarchical
  Group data into t-trees
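The K-Nearest Neighbor description on this slide maps onto a few lines of Python. This is a minimal sketch: the training points, the labels, and the function name `knn_classify` are invented for illustration, and Euclidean distance is assumed because the slide does not name a distance measure.

```python
from collections import Counter
import math


def knn_classify(point, training, k=3):
    """Return the most common label among the k training points nearest to point."""
    # Sort training instances by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]


# Hypothetical training data: (age, number of policies) -> customer segment.
training = [
    ((25, 1), "young"), ((27, 1), "young"), ((30, 2), "young"),
    ((55, 3), "senior"), ((60, 4), "senior"), ((62, 3), "senior"),
]
print(knn_classify((28, 2), training))  # young
```

Note that this method needs labeled training data, so the classes must already be known for the training points.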

Page 12: lovede

Regression
“Regression deals with the prediction of a value, rather than a class.” (1, p747)
Example: Find out if there is a relationship between smoking patients and cancer-related illness.

Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, a2, ..., an such that
  Y = a0 + a1X1 + a2X2 + … + anXn
This is linear regression.
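For a single input variable, the coefficients a0 and a1 can be estimated by ordinary least squares. The sketch below shows the closed-form calculation in Python. The function name `fit_line` and the data points are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Estimate a0 and a1 in Y = a0 + a1*X by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Made-up data that lies exactly on Y = 1 + 2X.
a0, a1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a0, a1)  # 1.0 2.0
```

With several input variables X1 ... Xn the same idea applies, but the coefficients are usually found with a matrix solver rather than by hand.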

Page 13: lovede

Regression
Example graphs:
[Figures: Line of Best Fit; Curve Fitting]

Page 14: lovede

Association Rules
“An association algorithm creates rules that describe how often events have occurred together.” (2)

Example: When a customer buys a hammer, then 90% of the time they will also buy nails.

Page 15: lovede

Association Rules
Support: “is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule” (1, p748)

Example:
  Hotdog buns and hotdog sausages are bought together in 99% of cases = high support
  Hotdog buns and hangers are bought together in 0.005% of cases = low support

Situations where there is high support for the antecedent are worth careful attention.
  E.g. hotdog sausages should be placed near hotdog buns in supermarkets if there is also high confidence.
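Support as defined on this slide can be computed by counting the transactions that contain both sides of a rule. A minimal Python sketch follows; the baskets and the function name `support` are hypothetical.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions that contain both item sets."""
    both = set(antecedent) | set(consequent)
    hits = sum(1 for t in transactions if both <= set(t))
    return hits / len(transactions)


# Hypothetical shopping baskets.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages", "mustard"},
    {"buns", "hangers"},
    {"milk"},
]
print(support(baskets, {"buns"}, {"sausages"}))  # 0.5
print(support(baskets, {"buns"}, {"hangers"}))   # 0.25
```

The denominator is the whole population of transactions, not just those containing buns.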

Page 16: lovede

Association Rules
Confidence: “is a measure of how often the consequent is true when the antecedent is true.” (1, p748)

Example:
  90% of hotdog bun purchases are accompanied by hotdog sausages.

High confidence is meaningful, as we can derive rules:
  Hotdog bun → Hotdog sausage
Two rules may have different confidence levels and yet have the same support.
  E.g. Hotdog sausage → Hotdog bun may have a much lower confidence than Hotdog bun → Hotdog sausage, yet both can have the same support.
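The asymmetry between the two rules can be checked on a small example. The Python sketch below is illustrative only; the baskets and the function name `confidence` are assumptions.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent <= set(t))
    return both / len(with_antecedent)


# Hypothetical baskets: sausages are often bought without buns.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages"},
    {"sausages"},
    {"sausages", "eggs"},
    {"sausages", "beans"},
]
print(confidence(baskets, {"buns"}, {"sausages"}))  # 1.0
print(confidence(baskets, {"sausages"}, {"buns"}))  # 0.4
```

Both rules cover the same two baskets out of five, so their support is identical even though their confidence differs.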

Page 17: lovede

Advantages of Data Mining
Provides new knowledge from existing data
  Public databases
  Government sources
  Company databases
Old data can be used to develop new knowledge
New knowledge can be used to improve services or products
Improvements lead to:
  Bigger profits
  More efficient service

Page 18: lovede

Uses of Data Mining
Sales/Marketing
  Diversify target market
  Identify clients' needs to increase response rates
Risk Assessment
  Identify customers that pose high credit risk
Fraud Detection
  Identify people misusing the system, e.g. people who have two Social Security Numbers
Customer Care
  Identify customers likely to change providers
  Identify customer needs

Page 19: lovede

Applications of Data Mining

(4)

Source IDC 1998

Page 20: lovede

Privacy Concerns
Effective data mining requires large sources of data
To achieve a wide spectrum of data, link multiple data sources
Linking sources can be problematic for privacy, as follows:
  If the following histories of a customer were linked:
    Shopping history
    Credit history
    Bank history
    Employment history
  the user's life story could be painted from the collected data

Page 21: lovede

References
Silberschatz, Korth, Sudarshan, “Database System Concepts”, 5th Edition, McGraw-Hill, 2005
http://www.twocrows.com/glossary.htm, “Two Crows, Data Mining Glossary”
http://en.wikipedia.org/wiki/Data_mining, “Wikipedia”
http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
http://wwwmaths.anu.edu.au/~steve/pdcn.pdf