Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Page 1: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Introduction to Data-Mining

Marko Grobelnik

Institut Jozef Stefan

Page 2: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Outline

• Motivation & Definition

• What are typical applications?

• How do we build solutions?

• Method & algorithms

• Tools & standards

• …conclusion

Page 3: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Motivation: “Necessity is the Mother of Invention”

• Data explosion problem:

  • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

• We are drowning in data, but starving for knowledge!

Page 4: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Data pyramid

[Figure: pyramid with four levels, listed here from base to top]

• Data

• Information = Data + context

• Knowledge = Information + rules

• Wisdom = Knowledge + experience

Page 5: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

What Is Data Mining?

• Data mining (knowledge discovery in databases - KDD, business intelligence):

  • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information from data in large databases

• “Tell me something interesting about the data.”

• “Describe the data.”

Page 6: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Potential Applications

• Database analysis and decision support

• Market analysis and management

• Risk analysis and management

• Fraud detection and management

• Text analysis - Text Mining

• Web analysis - Web Mining

• Intelligent query answering

Page 7: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Market Analysis and Management

• Where are the data sources for analysis?

  • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.

• Target marketing:

  • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

• Determine customer purchasing patterns over time:

  • Conversion of a single to a joint bank account: marriage, etc.

Page 8: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Analysis and Risk Management

• Finance planning and asset evaluation:

  • cash flow analysis and prediction

  • time series analysis (trend analysis, etc.)

• Resource planning:

  • summarize and compare the resources and spending

• Competition:

  • Monitor competitors and market directions

  • Set pricing strategy in a highly competitive market

Page 9: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Fraud Detection and Management

• Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances

• Example applications:

  • Auto insurance: detect a group of people who stage accidents to collect on insurance

  • Money laundering: detect suspicious money transactions

  • Telephone fraud: detect suspicious patterns (generate call model - destination, time, duration)

Page 10: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Other Areas of Application

• Sports

  • Analysis of games in the NBA (e.g., detect the opponent’s strategy)

• Astronomy

  • discovery and classification of new objects

• Internet

  • analysis of Web access logs, discovery of user behavior patterns, analyzing effectiveness of Web marketing, improving Web site organization

• Text

  • news analysis, medical record analysis, automatic email sorting and filtering, automatic document categorization

Page 11: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Data mining: intersection of multiple disciplines

• Database systems, data warehouse and OLAP

• Statistics

• Machine learning

• Visualization

• Information science

• High performance computing

• Other disciplines:

• Neural networks, mathematical modeling, information retrieval, pattern recognition, ...

Page 12: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

From data to knowledge

• Data mining: the core of knowledge discovery process.

[Figure: the knowledge discovery process as a pipeline: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Page 13: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Main steps of KDD

• Learning the application domain:

  • relevant prior knowledge and goals of application

• Data cleaning and preprocessing (may take 60% of effort!):

  • creating a target data set: data selection

  • find useful features, generate new features, map feature values, discretization of values

• Choosing data mining tools/algorithms:

  • summarization, classification, regression, association, clustering.

• Data mining: search for patterns of interest

• Interpretation: analysis of results.

  • visualization, transformation, removing redundant patterns, etc.

• Use of discovered knowledge.
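The preprocessing step above mentions discretization of values. The slides do not say which scheme is meant, so the following is only a minimal sketch of one common choice, equal-width binning. The function name `equal_width_bins` and the `ages` sample are hypothetical and chosen for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0 .. n_bins - 1.

    The range [min, max] is split into n_bins intervals of equal width.
    This is one possible discretization scheme, assumed for illustration.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample: customer ages discretized into three groups.
ages = [18, 22, 25, 31, 40, 47, 58, 64]
bins = equal_width_bins(ages, 3)
print(list(zip(ages, bins)))
```

Equal-width bins are simple but sensitive to outliers; equal-frequency bins are a frequent alternative when the values are strongly skewed.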

Page 14: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases toward the top]

Layers, from top to bottom:

• Making Decisions

• Data Presentation: Visualization Techniques

• Data Mining: Information Discovery

• Data Exploration: Statistical Analysis, Querying and Reporting

• Data Warehouses / Data Marts: OLAP, MDA

• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

Page 15: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Mining the data: what kind of data?

• Relational databases

• Data warehouses

• Transactional databases

• Advanced DB systems and information repositories: object-oriented and object-relational databases, spatial databases, time-series data and temporal data, text databases and multimedia databases, heterogeneous and legacy databases, WWW

Page 16: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Data mining algorithms (I)

• Association:

  • finding rules like “if the customer bought item A, then in X% of transactions she/he also bought item B”. This holds for Y% of all transactions

• Classification and Prediction:

  • classify data based on the values in a classifying attribute, e.g., classify countries based on climate, or classify cars based on gas mileage

  • predict some unknown or missing attribute values based on other information

Page 17: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Data mining algorithms (II)

• Clustering:

  • group data to form new classes, e.g., find groups of customers with similar behavior

• Time-series analysis:

  • trend and deviation analysis: find and characterize evolution trend, sequential patterns, similar sequences, and deviation data, e.g., stock analysis.

  • similarity-based pattern-directed analysis: find and characterize user-specified patterns in large databases.

  • cyclicity/periodicity analysis: find segment-wise or total cycles or periodic behaviors in time-related data.

• Other pattern-directed or statistical analysis
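Trend and deviation analysis can be illustrated with a very small sketch. It assumes a moving average as the trend estimate and flags a point as a deviation when it lies more than a chosen number of standard deviations away from the mean of the preceding window. Both choices, the function names and the `prices` series are assumptions made for illustration, not methods named in the slides.

```python
from statistics import mean, pstdev


def moving_average(series, window):
    """Smooth the series: each output is the mean of `window` consecutive values."""
    return [
        mean(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]


def deviations(series, window, threshold=2.0):
    """Return indices whose value is far from the mean of the preceding window.

    "Far" means more than `threshold` population standard deviations,
    a simple rule assumed here for illustration.
    """
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window : i]
        mu, sigma = mean(past), pstdev(past)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Hypothetical daily prices with one obvious spike at index 6.
prices = [10, 11, 10, 12, 11, 12, 30, 12, 13, 12]
trend = moving_average(prices, 3)
outliers = deviations(prices, 5)
print(trend)
print(outliers)
```

Note that the spike inflates the standard deviation of the windows that follow it, so this simple rule becomes less sensitive right after a deviation.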

Page 18: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Association rules

• Finding associations or correlations among a set of items

• Applications:

  • basket data analysis, cross-marketing,…

• Example:

  • buying beer and chips -> ketchup [0.5%, 60%]

• rule form: LHS -> RHS [support, confidence]
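Support and confidence of a single rule can be computed directly from a list of transactions. The sketch below assumes the usual definitions: support is the fraction of all transactions containing both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The function `rule_stats` and the `baskets` data are hypothetical.

```python
def rule_stats(transactions, lhs, rhs):
    """Return (support, confidence) of the rule lhs -> rhs.

    transactions: list of sets of items
    lhs, rhs:     iterables of items
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n_total = len(transactions)
    # Transactions that contain every item on the left-hand side.
    n_lhs = sum(1 for t in transactions if lhs <= t)
    # Transactions that contain every item on both sides.
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


# Hypothetical baskets, not data from the slides.
baskets = [
    {"beer", "chips", "ketchup"},
    {"beer", "chips"},
    {"beer", "chips", "ketchup", "bread"},
    {"bread", "milk"},
    {"chips", "ketchup"},
]
support, confidence = rule_stats(baskets, {"beer", "chips"}, {"ketchup"})
print(f"beer and chips -> ketchup [{support:.0%}, {confidence:.0%}]")
```

Real association-rule miners do not test rules one by one; they first search for frequent itemsets and only then derive rules from them.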

Page 19: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Classification

• Finding rules that describe given groups of objects

• Applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis,...

• Example: based on the past symptoms and diagnoses of patients, generate a model describing the influence of symptoms on disease; use it to classify future test data and to better understand each class

• Methods: decision-trees (e.g., ID3, C5), statistics, neural networks,...

Page 20: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Classification using decision trees

• A decision tree:

[Figure: decision tree with root node "outlook" (branches: sunny, overcast, rain), internal nodes "humidity" and "windy", and leaves labeled N and P]

• Top-down decision tree generation algorithm, at each step:

  • partition examples based on the selected attribute value

  • select attribute favoring the partitioning which makes the majority of examples belong to a single class
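The attribute-selection step can be sketched as follows. The slides do not name the selection criterion; this sketch assumes information gain (entropy reduction), the criterion used by ID3. The table is a toy data set modeled on the well-known weather example and reuses the attribute names from the figure; it is not data taken from the slides.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning the examples on `attribute`."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose partition is purest (highest information gain)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy examples: (outlook, humidity, windy, class), with classes P and N.
TABLE = [
    ("sunny", "high", False, "N"), ("sunny", "high", True, "N"),
    ("overcast", "high", False, "P"), ("rain", "high", False, "P"),
    ("rain", "normal", False, "P"), ("rain", "normal", True, "N"),
    ("overcast", "normal", True, "P"), ("sunny", "high", False, "N"),
    ("sunny", "normal", False, "P"), ("rain", "normal", False, "P"),
    ("sunny", "normal", True, "P"), ("overcast", "high", True, "P"),
    ("overcast", "normal", False, "P"), ("rain", "high", True, "N"),
]
ATTRIBUTES = ["outlook", "humidity", "windy"]
rows = [dict(zip(ATTRIBUTES, record[:3])) for record in TABLE]
labels = [record[3] for record in TABLE]

root = best_attribute(rows, labels, ATTRIBUTES)
print(root, round(information_gain(rows, labels, root), 3))
```

A full tree builder would apply `best_attribute` recursively to each partition until the examples in a node all belong to one class or no attributes remain.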

Page 21: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Classification methods

• Decision trees and decision rules:

  • given a training set of labeled data

  • tree pruning used for noise handling and avoiding data overfitting

• Bayesian classification:

  • Naïve Bayesian classification

  • Bayesian belief networks

• Neural network approach:

  • multi-layer networks and back-propagation

• Genetic algorithms:

  • genetic operators (mutation, cross-over,…) and fitness function selection
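Of the methods listed, naïve Bayesian classification is short enough to sketch in full. The version below handles categorical features only and uses add-one (Laplace) smoothing, which is an assumption of this sketch and not something the slides specify. The symptom data and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, row):
    """Return the class with the highest posterior, assuming independent features."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        # Work in log space to avoid underflow when there are many features.
        score = math.log(n_label / total)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Add-one smoothing keeps unseen values from zeroing the product.
            score += math.log((count + 1) / (n_label + len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (fever, cough) -> diagnosis.
rows = [("yes", "yes"), ("yes", "no"), ("yes", "yes"),
        ("no", "yes"), ("no", "no"), ("no", "yes")]
labels = ["flu", "flu", "flu", "cold", "cold", "cold"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("yes", "yes")))
print(predict(model, ("no", "no")))
```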

Page 22: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Clustering methods

• partitioning a set of data into a set of classes, called clusters, such that the members of each class share some interesting common properties.

• clusters are of high quality if the intra-class similarity is high and the inter-class similarity is low

• The choice of distance measure is important
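The quality criterion above can be made concrete with a small sketch. It assumes Euclidean distance as the distance measure and compares the average distance between points in the same cluster with the average distance between points in different clusters. The function names and the sample points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_intra_distance(clusters, dist=euclidean):
    """Average distance over all pairs of points that share a cluster."""
    pairs = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(pairs) / len(pairs)


def mean_inter_distance(clusters, dist=euclidean):
    """Average distance over all pairs of points from different clusters."""
    pairs = [
        dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return sum(pairs) / len(pairs)


# Hypothetical two-dimensional points, already assigned to two clusters.
clusters = [
    [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
]
intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"intra = {intra:.2f}, inter = {inter:.2f}")
```

A small intra-cluster distance together with a large inter-cluster distance corresponds to high intra-class similarity and low inter-class similarity. Passing a different `dist` function shows how strongly the result depends on the distance measure.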

Page 23: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Data-Mining tools

• Main producers of Data-Mining software:

  • IBM – Intelligent Miner, extender for DB2

  • SAS – Enterprise Miner

  • SPSS – Clementine

  • Microsoft – Analysis Server (…part of SQL Server 2000)

  • …many more smaller producers

Page 24: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

Data Mining standards

• PMML (Predictive Model Markup Language)

  • XML-based language for saving and sharing models (most widely accepted standard)

• CRISP-DM

  • standardized methodology for building Data Mining applications

• OLE DB for Data Mining

  • Microsoft’s standard for developing OLE DB/COM components for extending Analysis Server with new Data Mining functionality (uses a customized SQL language)

• IBM and Oracle prepared standard extensions to the SQL language to support Data Mining functionality

Page 25: Introduction to Data-Mining Marko Grobelnik Institut Jozef Stefan.

…conclusion

• Data Mining is a rapidly developing area

• Who needs Data Mining, and why?

  • …(almost) everybody who has data

  • …to get something more out of the data

• More information:

  • http://www.kdnuggets.com/