
Data Mining

Piotr Paszek

[email protected]

Introduction

(Piotr Paszek) Data Mining DM – KDD 1 / 43

Recommended Reference Books

1. J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011.

2. I. Witten, E. Frank, and M. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd ed., 2011.

3. P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2nd ed., 2016.

4. X. Wu and V. Kumar. The Top Ten Algorithms in Data Mining. Chapman & Hall, 2009.

Data Mining – Etymology

In the 1960s, statisticians used terms like data fishing or data dredging to refer to what they considered the bad practice of analysing data without an a priori hypothesis.

The term data mining appeared around 1990 in the database community.

Other terms used include data archaeology, information harvesting, information discovery, and knowledge extraction.

Gregory Piatetsky-Shapiro coined the term knowledge discovery in databases, and this term became more popular in the AI and machine learning communities. However, the term data mining became more popular in the business and press communities.

Currently, the terms data mining and knowledge discovery are often used interchangeably.

Data mining (definition?)

Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

It is an essential process where intelligent methods are applied to extract data patterns, and it is an interdisciplinary subfield of computer science.

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Source: https://en.wikipedia.org/wiki/Data_mining

Data Mining: Confluence of Multiple Disciplines


Data Mining

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. . . .

Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases . . .

Data mining is defined as the process of discovering patterns in data. The process must be automatic or semiautomatic.

The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.

Ian Witten, Eibe Frank, Mark Hall. Data Mining: Practical Machine Learning Tools and Techniques. Third Edition. Morgan Kaufmann Publishers, 2011.

Data Mining

Data mining, also popularly referred to as Knowledge Discovery in Databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories or data streams.

Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. 2nd Edition. Morgan Kaufmann Publishers, 2006.

Knowledge Discovery in Databases (KDD)

The KDD field is concerned with the development of methods and techniques for making sense of data. . . .

At the core of the process is the application of specific data-mining methods for pattern discovery and extraction. . . .

KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data.

Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth. From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3): 37–54, 1996.

KDD process

1. Understand the application domain and the goal of the process.

2. Create the target dataset as a subset of all the data that is available.

3. Clean and preprocess the data to remove noise and to handle missing data and outliers.

4. Reduce and project the data in order to focus on the features that are relevant to the problem.

5. Match the goals of the process to a data mining method. Decide the purpose of the model, such as summarization or classification.

6. Choose the data mining algorithms to match the purpose of the model (from step 5).

7. Data mining, that is, run the algorithms on the data.

8. Interpret the mined patterns to make them understandable by the user, for example through summarization and visualization.

9. Act on the discovered knowledge, for example by reporting or making decisions.

U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11): 27–34, 1996.
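
Steps 3 and 4 of the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (id, age, income), the choice of median imputation and the 1.5 * IQR outlier rule are all hypothetical and are not prescribed by Fayyad et al.

```python
from statistics import median, quantiles

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "income": 3000.0},
    {"id": 2, "age": 31, "income": 3200.0},
    {"id": 3, "age": 47, "income": 3100.0},
    {"id": 4, "age": 52, "income": None},
    {"id": 5, "age": 38, "income": 2900.0},
    {"id": 6, "age": 29, "income": 50000.0},
]

def impute_missing(records, field):
    """Step 3a: replace None in `field` with the median of the observed values."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

def drop_outliers(records, field, k=1.5):
    """Step 3b: keep records within k * IQR of the first and third quartiles."""
    values = [r[field] for r in records]
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = k * (q3 - q1)
    return [r for r in records if q1 - spread <= r[field] <= q3 + spread]

def project(records, fields):
    """Step 4: keep only the features assumed to be relevant to the problem."""
    return [{f: r[f] for f in fields} for r in records]

cleaned = drop_outliers(impute_missing(raw, "income"), "income")
target = project(cleaned, ["age", "income"])
print(len(raw), "->", len(target), "records")
```

Here the missing income is filled with the median of the observed incomes, and the single extreme income is dropped before the id column is projected away.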


KDD Process

1. Data cleaning to remove noise and inconsistent data.

2. Data integration, where multiple data sources may be combined.

3. Data selection, where data relevant to the analysis task are retrieved from the database.

4. Data transformation, where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.

5. Data mining, which is an essential process where intelligent methods are applied to extract data patterns.

6. Pattern evaluation to identify the truly interesting patterns representing knowledge, based on interestingness measures.

7. Knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.

Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, 2006.
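
Steps 2 to 4 of this process (integration, selection, transformation by aggregation) might look as follows on a toy example. The two data sources, their field names and the helper functions are assumptions made for illustration; they do not come from Han and Kamber.

```python
from collections import defaultdict

# Two hypothetical data sources sharing the key "customer_id".
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
    {"customer_id": 3, "segment": "retail"},
]
purchases = [
    {"customer_id": 1, "year": 2023, "amount": 120.0},
    {"customer_id": 1, "year": 2024, "amount": 80.0},
    {"customer_id": 2, "year": 2024, "amount": 500.0},
    {"customer_id": 3, "year": 2024, "amount": 40.0},
    {"customer_id": 3, "year": 2024, "amount": 60.0},
]

def integrate(customers, purchases):
    """Step 2: combine both sources by joining on customer_id."""
    by_id = {c["customer_id"]: c for c in customers}
    return [{**p, **by_id[p["customer_id"]]}
            for p in purchases if p["customer_id"] in by_id]

def select(rows, year):
    """Step 3: retrieve only the rows relevant to the analysis task."""
    return [r for r in rows if r["year"] == year]

def aggregate(rows):
    """Step 4: consolidate the rows into total amount per segment."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["segment"]] += r["amount"]
    return dict(totals)

totals_2024 = aggregate(select(integrate(customers, purchases), 2024))
print(totals_2024)  # {'retail': 180.0, 'business': 500.0}
```

The aggregated table is the kind of consolidated form that a mining algorithm in step 5 would then consume.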


KDD Process

Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, 2006.

Data Mining in Business Intelligence


KDD Process

This is the view typical of the machine learning and statistics communities.

CRISP-DM

Cross-Industry Standard Process for Data Mining (CRISP-DM) is a data mining process model that describes commonly used approaches that data mining experts use to tackle problems.

Phases of the CRISP-DM reference model

P. Chapman, J. Clinton et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide.

CRISP-DM major phases

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

P. Chapman, J. Clinton et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide.

CRISP-DM

Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.

CRISP-DM

Data Understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

CRISP-DM

Data Preparation: The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.

CRISP-DM

Modeling: In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
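
As one possible reading of "parameters are calibrated", the sketch below picks the neighbourhood size k of a k-nearest-neighbour classifier by comparing accuracy on held-out validation data. The technique, the one-dimensional data and all function names are assumptions chosen for illustration; CRISP-DM itself does not prescribe any particular model or calibration procedure.

```python
# Hypothetical one-dimensional training data as (feature, label) pairs.
# The point (4.0, 1) plays the role of a noisy label.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
# Held-out data used only for choosing the parameter.
validation = [(3.8, 0), (4.3, 0), (6.5, 1), (7.5, 1)]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbours)
    return 1 if votes * 2 > k else 0

def accuracy(train, validation, k):
    """Fraction of validation points classified correctly for a given k."""
    hits = sum(knn_predict(train, x, k) == y for x, y in validation)
    return hits / len(validation)

def calibrate_k(train, validation, candidates):
    """Return the first candidate k with the highest validation accuracy."""
    return max(candidates, key=lambda k: accuracy(train, validation, k))

best_k = calibrate_k(train, validation, [1, 3, 5])
print(best_k, accuracy(train, validation, best_k))  # 3 1.0
```

With k = 1 the noisy training point misleads two of the four validation predictions, so a larger k is selected.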


CRISP-DM

Evaluation: At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to evaluate the model more thoroughly and to review the steps executed to construct it, to be certain that it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

CRISP-DM

Deployment: Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.

DM Software

Best free DM software (in alphabetical order):

KNIME Analytics Platform

Orange Data mining

R Software Environment, Rattle GUI

RapidMiner Studio

Rough Set Exploration System

Weka Data Mining


KNIME Analytics Platform

The Konstanz Information Miner (KNIME) is an open source data analytics, reporting and integration platform.

KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. Its graphical user interface allows the assembly of nodes for data preprocessing, modelling, data analysis and visualization.

KNIME Analytics Platform provides over 1000 data analytic routines, either natively or through R and Weka.

KNIME is written in Java and based on Eclipse, and it makes use of the Eclipse extension mechanism to add plugins providing additional functionality.

Orange Data mining

Orange is an open source data visualization and analysis tool. It is developed at the University of Ljubljana, Slovenia, together with the open source community.

Data mining is done through visual programming or Python scripting.

Orange is a Python library.

Orange consists of a canvas interface onto which the user places widgets and creates a data analysis workflow. In Orange, the data analysis process can be designed through visual programming.

Orange runs on many platforms (Windows, Mac OS X, Linux).

Orange can read files in native and other data formats.

Orange is devoted to machine learning methods for classification, or supervised data mining.

R Software Environment

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.

Rattle GUI

The R Analytical Tool To Learn Easily (Rattle) is a popular GUI for data mining using R. It is Free Open Source Software.

Rattle runs on many platforms (Windows, Mac OS X, Linux).

It presents statistical and visual summaries of data, transforms data so that it can be readily modeled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.

One of the most important features is that all of the user's interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.

Through a simple and logical graphical user interface based on Gnome, Rattle can be used by itself to deliver data mining projects.

RapidMiner Studio

RapidMiner Studio is a visual design environment for machine learning, data mining, text mining, predictive analytics and business analytics.

It provides a deep library of machine learning algorithms, data preparation and exploration functions, and model validation tools to support all your data science projects and use cases.

Data science teams can easily re-use existing R and Python code, and add new functionality via a large marketplace of pre-built extensions.

RapidMiner supports all steps of the data mining process, including results visualization, validation and optimization.

RapidMiner is written in the Java programming language.

RapidMiner provides learning schemes, models and algorithms from Weka and R scripts that can be used through extensions.

Rough Set Exploration System

Rough Set Exploration System (RSES) is a tool set for analysing data with the use of methods coming from Rough Set Theory. It is a graphical, user-friendly front-end running under Windows and providing access to methods from the RSESlib library.

RSESlib is the core of the RSES computational kernel.

Both the library and the GUI are designed and implemented at Warsaw University.

RSESlib is a library of functions for performing various data exploration tasks, such as calculation of reducts, generation of decision rules, classification, discretization, decomposition, search for patterns in data, and data manipulation.

The library is implemented in Java. The first version of the library was included in the computational kernel of the ROSETTA system.

Weka Data Mining

WEKA is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka features include machine learning, data mining, preprocessing, classification, regression, clustering, association rules, attribute selection, experiments, workflow and visualization.

Weka is written in Java and developed at the University of Waikato, New Zealand.

It runs on many platforms (Windows, Mac OS X, Linux).

Weka is open source software issued under the GNU General Public License.

Data Mining - tasks I

Anomaly detection (outlier/change/deviation detection): The identification of unusual data records that might be interesting, or of data errors that require further investigation.

Association rule learning (dependency modelling): Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits and use association rules to find which products are frequently bought together.

Clustering: Discovering groups and structures in the data that are in some way or another similar, without using known structures in the data.
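
To make the supermarket example for association rule learning concrete, the sketch below computes the two standard measures of a rule, support and confidence, for the rule {bread} -> {butter}. The baskets and item names are invented for illustration.

```python
# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

print(support(baskets, {"bread", "butter"}))       # 0.6
print(confidence(baskets, {"bread"}, {"butter"}))  # 0.75
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst; the thresholds themselves are a modelling decision.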


Data Mining - tasks II

Classification: Building a model that describes how to classify (assign) the data items into one of a set of predefined classes. For example, an e-mail program might attempt to classify an e-mail as legitimate or as spam.

Regression: Predicting the value of a given (continuous) feature based on the values of other features in the data, assuming a linear or non-linear model of dependency.

Summarization: Providing a more compact representation of the data set, including visualization and report generation.
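
For the regression task, the simplest case is a linear model with one input feature fitted by ordinary least squares. The sketch below assumes that setting; the data points are made up and lie exactly on the line y = 2x + 1 so that the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict the continuous target for a new feature value x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

model = fit_line(xs, ys)
print(model, predict(model, 6.0))
```

On real data the points would not lie on a line, and the fitted slope and intercept would be the values that minimize the sum of squared prediction errors.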
