slide 1

DSCI 4520/5240: Data Mining

Fall 2013 – Dr. Nick Evangelopoulos

Lecture 1:

Introduction to Data Mining

Some slide material based on: Groth; Han and Kamber; Cerrito; SAS Education

slide 2

DSCI 4520/5240 DATA MINING

ITDS Résumé Book

ITDS majors (BCIS/DS), please send your résumé to [email protected] so that we can include it in the ITDS Résumé Book we send to our corporate partners for hiring/co-op consideration. Make sure your résumé is formatted per UNT standards. Sample résumés are available here: https://unt.optimalresume.com/

slide 3

Data (and the lack thereof)

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

(Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")

slide 4

Data (and the lack thereof)

http://www.dilbert.com/2012-12-05/

slide 5

Nobel Laureate Calls Data Mining "A Must"

In a January 1999 interview with ComputerWorld, Dr. Penzias (winner of the 1978 Nobel Prize in Physics, and vice president and chief scientist at Bell Laboratories) named large-scale data mining of very large databases as the key application for corporations in the next few years.

In response to ComputerWorld's age-old question, "What will be the killer applications in the corporation?", Dr. Penzias replied:

"Data mining." He then added: "Data mining will become much more important and companies will throw away nothing about their customers because it will be so valuable. If you're not doing this, you're out of business," he said.

slide 6

What Is Data Mining?

Data mining (knowledge discovery in databases):

A process of identifying hidden patterns and relationships within data (Groth)

Data mining:

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

slide 7

Motivation: “Necessity is the Mother of Invention”

Data explosion problem

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

Problem: We are drowning in data, but starving for knowledge!

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

slide 8

Data Deluge

[Figure labels: electronic point-of-sale data, hospital patient registries, catalog orders, bank transactions, remote sensing images, tax returns, airline reservations, credit card charges, stock trades, OLTP, telephone calls]

slide 9

Data Mining, circa 1963

IBM 7090

600 cases

“Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”

slide 10

Business Decision Support

Database Marketing

– Target marketing

– Customer relationship management

Credit Risk Management

– Credit scoring

Fraud Detection

Healthcare Informatics

– Clinical decision support

slide 11

Required Expertise

Domain

Data

Analytical Methods

slide 12

Multidisciplinary

Databases

Statistics

Pattern Recognition

KDD

Machine Learning

AI

Neurocomputing

Data Mining

slide 13

What Is Data Mining?

IT: Complicated database queries

ML: Inductive learning from examples

Stat: What we were taught not to do

slide 14

Comparing Statistics to Data Mining (from Cerrito 2006)

slide 15

Comparing Statistics to Data Mining (from Cerrito 2006)

slide 16

Predictive Modeling

[Figure: data matrix labeled Inputs, Cases, and Target]

slide 17

Types of Targets

Supervised Classification

– Event/no event (binary target)

– Class label (multiclass problem)

Regression

– Continuous outcome

Survival Analysis

– Time-to-event (possibly censored)
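To make the target types above concrete, here is a minimal Python sketch that guesses the modeling task from the values in a target column. The function name infer_target_type and its rules are assumptions made for illustration, not part of the lecture material. In particular, survival targets are assumed to be stored as (time, event_observed) pairs, with event_observed set to False for censored cases.

```python
from typing import Sequence, Union

Target = Union[int, float, str, tuple]


def infer_target_type(values: Sequence[Target]) -> str:
    """Heuristically guess which modeling task a target column calls for."""
    if not values:
        raise ValueError("target column is empty")

    # Assumed convention: survival targets are (time, event_observed) pairs,
    # where event_observed is False when the case is censored.
    if all(isinstance(v, tuple) and len(v) == 2 for v in values):
        return "survival analysis"

    # Exactly two distinct values: event / no event.
    if len(set(values)) == 2:
        return "binary classification"

    # More than two distinct text labels: a multiclass problem.
    if all(isinstance(v, str) for v in values):
        return "multiclass classification"

    # Anything else is treated as a continuous outcome.
    return "regression"


print(infer_target_type([0, 1, 1, 0]))                  # event / no event
print(infer_target_type(["gold", "silver", "bronze"]))  # class labels
print(infer_target_type([12.5, 48.0, 7.3]))             # continuous outcome
print(infer_target_type([(5.0, True), (12.0, False)]))  # time-to-event
```

The heuristic has limits: integer class codes with three or more levels would be labeled as regression, so in practice the analyst declares the target type rather than inferring it.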

slide 18

Why Data Mining? — Potential Applications

Database analysis and decision support

Market analysis and management

– Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation

Risk analysis and management

– Forecasting, customer retention, improved underwriting, quality control, competitive analysis

Fraud detection and management

Other Applications

Text mining (newsgroups, email, documents) and Web analysis

Intelligent query answering

slide 19

Market Analysis and Management (1)

Where are the data sources for analysis?

Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies

Target marketing

Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

Cross-market analysis

Associations/correlations between product sales

Prediction based on the association information
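As an illustration of cross-market analysis, the sketch below computes three common association measures (support, confidence, lift) for a single rule over a handful of made-up shopping baskets. The basket contents and the function name rule_metrics are hypothetical; the slide does not prescribe these particular measures.

```python
from typing import Dict, Iterable, List, Set


def rule_metrics(transactions: List[Set[str]],
                 antecedent: Iterable[str],
                 consequent: Iterable[str]) -> Dict[str, float]:
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)

    count_a = sum(1 for t in transactions if a <= t)         # baskets with A
    count_c = sum(1 for t in transactions if c <= t)         # baskets with C
    count_ac = sum(1 for t in transactions if (a | c) <= t)  # baskets with both

    support = count_ac / n
    confidence = count_ac / count_a if count_a else 0.0
    # Lift compares the rule's confidence with the baseline rate of C.
    lift = confidence / (count_c / n) if count_c else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Hypothetical baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

print(rule_metrics(baskets, {"bread"}, {"milk"}))
```

A lift above 1 suggests the two products sell together more often than their individual sales rates would predict; a lift below 1 suggests the opposite.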

slide 20

Market Analysis and Management (2)

Customer profiling

Data mining can tell you what types of customers buy what products (clustering or classification)

Identifying customer requirements

Identifying the best products for different customers

Use prediction to find what factors will attract new customers

slide 21

Corporate Analysis and Risk Management

Finance planning and asset evaluation

– Cash flow analysis and prediction

– Contingent claim analysis to evaluate assets

– Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning:

– Summarize and compare the resources and spending

Competition:

– Monitor competitors and market directions

– Group customers into classes and a class-based pricing procedure

– Set pricing strategy in a highly competitive market

slide 22

Other Applications

Sports

IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat

Astronomy

JPL and the Palomar Observatory discovered 22 quasars with the help of data mining

Internet Web Surf-Aid

IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

slide 23

In the News: Rexer Analytics Annual Data Mining Survey

The 2013 survey will become available in Fall 2013 (stay tuned)

slide 24

Rexer Analytics 2011 Survey Overview

• SURVEY & PARTICIPANTS: 52-item survey of data miners, conducted on-line in 2011. Participants: 1,319 data miners from over 60 countries.

• FIELDS & GOALS: CRM/Marketing has been the #1 field for the past five years. “Improving the understanding of customers”, “retaining customers” and other CRM goals continue to be the primary goals.

• ALGORITHMS: Decision trees, regression, and cluster analysis continue to form the top three algorithms for most data miners. A third of data miners currently use text mining and another third plan to do so in the future.

• TOOLS: R continued its rise this year and is now being used by close to half of all data miners (47%). R users prefer it for being free, open source, and having a wide variety of algorithms. STATISTICA is selected as the primary data mining tool (17%). STATISTICA, KNIME, Rapid Miner and Salford Systems received the strongest satisfaction ratings.

• ANALYTIC CAPABILITY AND SUCCESS MEASUREMENT: Only 12% of corporate respondents rate their company as having very high analytic sophistication. Measures of analytic success: Return on Investment (ROI), and predictive validity or accuracy of their models. Challenges to measuring success: user cooperation and data availability/quality.

slide 25

Where Data Miners Work

Data Mining is everywhere!

Data miners also report working in Non-profit (6%), Hospitality / Entertainment / Sports (3%), Military / Security (3%), and Other (9%).

© 2012 Rexer Analytics

slide 26

The Algorithms Data Miners Use

© 2012 Rexer Analytics

slide 27

The positive impact of Data Mining

In the 5th Annual Survey (2011) of Rexer Analytics (1,319 participant data miners from over 60 countries) data miners shared examples of situations where data mining is having a positive impact on society. The five areas mentioned most often were:

Health / Medical Progress

Business Improvements

Personalized Communications & Marketing

Fraud Detection

Environmental

slide 28

The rise of Text Mining

[Pie chart segments, in the order shown: Text Miners; Plan to Start Text Mining; No Plans to Conduct Text Mining. Values, in the order shown: 34%, 33%, 33%]

Text Material
Customer / market surveys: 38%
Blogs and other social media: 33%
E-mail or other correspondence: 27%
News articles: 25%
Scientific or technical literature: 23%
Web-site feedback: 22%
Online forums or review sites: 21%
Contact center notes or transcripts: 16%
Employee surveys: 15%
Insurance claims or underwriting notes: 15%
Medical records: 11%
Point of service notes or transcripts: 10%

© 2012 Rexer Analytics

slide 29

Data Mining Software

• The average data miner reports using 4 software tools.

• R is used by the most data miners (47%).

[Chart legend: Overall, Corporate, Consultants, Academics, NGO / Gov’t]

© 2012 Rexer Analytics

slide 30

Satisfaction with Data Mining Tools

[Chart scale: Extremely Satisfied to Extremely Dissatisfied]

© 2012 Rexer Analytics

slide 31

Measuring Analytic Success

© 2012 Rexer Analytics

[Bar chart, axis: Number of respondents (0 to 60). Categories, in the order shown: Model Performance (Accuracy, F, ROC, AUC, Lift); Financial Performance (ROI, etc.); Performance in Control or Other Group; Feedback from User / Client / Management; Cross-Validation. Bar values, in the order shown: 53, 43, 35, 29, 14]

Question: Please share your best practices concerning how you measure analytic project performance / success. (text box provided for response)
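Several of the model performance measures named on this slide (accuracy, F, lift) can be computed from a list of actual outcomes, predicted labels, and model scores. The sketch below is one way to do so in plain Python. The data is invented for illustration, and the function names classification_metrics and lift_at are assumptions rather than anything taken from the survey.

```python
from typing import Dict, List


def classification_metrics(actual: List[int], predicted: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for a binary (0/1) target."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def lift_at(actual: List[int], scores: List[float], fraction: float) -> float:
    """Event rate among the top-scored fraction of cases over the overall rate."""
    k = max(1, int(len(actual) * fraction))
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(a for _, a in ranked[:k]) / k
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate if base_rate else 0.0


# Invented example data: 10 cases, 4 of which are events.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1, 0.2, 0.1]

print(classification_metrics(actual, predicted))
print(lift_at(actual, scores, 0.2))
```

Lift is the measure closest to the target-marketing use case: it tells how much more often the event occurs among the highest-scored cases than in the population as a whole.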

slide 32

Overcoming Data Mining challenges

In the four annual data miner surveys, these key challenges have been identified by data miners more than any others:

Dirty Data

Explaining Data Mining to Others

Unavailability of Data / Difficult Access to Data

slide 33

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

slide 34

Steps of a KDD Process

Learning the application domain: relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation:

– Find useful features, dimensionality/variable reduction, invariant representation.

Choosing data mining algorithms:

– Summarization, classification, regression, association, clustering.

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation

– Visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge
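The steps above can be sketched as a small pipeline of functions. This is a toy illustration under stated assumptions: the records, field names, age cutoff, and function names (select, clean, transform, mine, evaluate, run_kdd) are all hypothetical, and the "mining" step is a simple summarization rather than a real algorithm.

```python
from collections import defaultdict

# Hypothetical raw records; one has a missing age (dirty data).
RAW = [
    {"age": 23, "income": 31000, "bought": True},
    {"age": None, "income": 52000, "bought": False},
    {"age": 41, "income": 64000, "bought": True},
    {"age": 37, "income": 58000, "bought": True},
    {"age": 29, "income": 28000, "bought": False},
    {"age": 52, "income": 71000, "bought": True},
    {"age": 26, "income": 33000, "bought": False},
]


def select(records, fields):
    """Creating a target data set: keep only the task-relevant fields."""
    return [{f: r.get(f) for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Data reduction and transformation: replace age with a coarse age band."""
    return [{"age_band": "under 35" if r["age"] < 35 else "35 and over",
             "bought": r["bought"]} for r in records]


def mine(records):
    """Data mining (summarization): case count and purchase rate per age band."""
    groups = defaultdict(list)
    for r in records:
        groups[r["age_band"]].append(r["bought"])
    return {band: (len(flags), sum(flags) / len(flags))
            for band, flags in groups.items()}


def evaluate(patterns, min_cases):
    """Pattern evaluation: keep patterns backed by at least min_cases cases."""
    return {band: p for band, p in patterns.items() if p[0] >= min_cases}


def run_kdd(raw, min_cases=3):
    data = transform(clean(select(raw, ["age", "bought"])))
    return evaluate(mine(data), min_cases)


for band, (cases, rate) in run_kdd(RAW).items():
    print(f"{band}: {cases} cases, purchase rate {rate:.2f}")
```

Even in this toy version, most of the code goes to selection, cleaning, and transformation rather than to the mining step itself, which echoes the slide's point about where the effort goes.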

slide 35

Data Mining and Business Intelligence

Increasing potential to support business decisions

End User

Business Analyst

Data Analyst

DBA

Making Decisions

Data Presentation

Visualization Techniques

Data Mining

Information Discovery

Data Exploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data Sources

Paper, Files, Information Providers, Database Systems, OLTP