Post on 16-Mar-2018
Knowledge Discovery in Databases T1: introduction
P. Berka, 2012 1/19
Knowledge Discovery in Databases
(Information Harvesting, Data Archeology, Data
Mining, Knowledge Distillery, ...)
Non-trivial process of identifying valid, novel,
potentially useful and ultimately understandable
patterns from data (Fayyad et al., 1996)
Data mining involves the use of sophisticated data
analysis tools to discover previously unknown, valid
patterns and relationships in large data sets
(Adriaans, Zantinge, 1999)
Analysis of observational data sets to find
unsuspected relationships and summarize data in
novel ways that are both understandable and useful
to the data owner (Hand, Mannila, Smyth, 2001)
Data mining is the process of analyzing hidden
patterns of data from different perspectives and
categorizing them into useful information
(techopedia.org, 2011)
Three sources
databases (query languages, OLAP), statistics
(data analysis), artificial intelligence (machine
learning)
KDD Tasks
(Klosgen, Zytkow, 1997)
classification/prediction: the task is to
find knowledge that can be applied to process
new examples automatically
description: the task is to find the dominant
structure or relationships
search for "nuggets": the task is to find
partial, novel and surprising knowledge
(Chapman et al., 2000)
data description and summarisation:
concise description of characteristics of
the data, typically in elementary and
aggregated form
segmentation: separation of the data into
interesting and meaningful subgroups or
classes
concept description: understandable
description of concepts or classes to gain
insight
classification: build classification models
(sometimes called classifiers) which assign
the correct class label to previously unseen
and unlabeled objects
prediction: similar to classification, but the
target attribute (class) is not a qualitative
discrete attribute but a continuous one.
Prediction also often deals with
time-dependent concepts
dependency analysis: describe significant
dependencies (or associations) between data
items or events
Managerial viewpoint
[Figure: steps leading from a managerial problem to knowledge for its solution]
1. Solution team
2. Problem specification
3. Data acquisition
4. Selection of methods
5. Data preprocessing
6. Data mining
7. Interpretation
Data processing viewpoint
Application areas of KDD
Segmentation and classification (clients of a bank
or insurance company),
Credit Risk Assessment,
Fraud detection,
Prediction of stock market prices,
Prediction of energy consumption,
Intrusion detection,
Churn Analysis (telco services providers, internet
providers),
Microarray data analysis (molecular biology),
Targeted marketing,
Medical diagnosis,
Market Basket Analysis.
Market basket analysis: data exploration
Collected data – content of market baskets in transactional form
Basket_id Item_id
10011 152
10011 37
10012 1
10012 152
10012 785
10012 6
10013 10
10014 15
10014 811
. . . . . .
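The transactional form above stores one (basket, item) pair per row. A minimal sketch of how such rows could be regrouped into whole baskets before further analysis follows; it uses only the rows listed on the slide, and the helper names `to_baskets` and `item_counts` are hypothetical.

```python
from collections import Counter, defaultdict

# The (Basket_id, Item_id) rows listed above.
ROWS = [
    (10011, 152), (10011, 37),
    (10012, 1), (10012, 152), (10012, 785), (10012, 6),
    (10013, 10),
    (10014, 15), (10014, 811),
]


def to_baskets(rows):
    """Group transactional rows into one set of items per basket."""
    baskets = defaultdict(set)
    for basket_id, item_id in rows:
        baskets[basket_id].add(item_id)
    return dict(baskets)


def item_counts(baskets):
    """Count in how many baskets each item occurs."""
    return Counter(item for items in baskets.values() for item in items)


baskets = to_baskets(ROWS)
counts = item_counts(baskets)
```

Using sets means an item repeated within one basket is counted once, which is a design choice of this sketch rather than something the slide prescribes.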
Market basket analysis: dependency analysis
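The figure for this slide is not available here. As a generic illustration only, dependencies between items are commonly expressed as association rules and rated by support and confidence; the sketch below computes both for the four baskets listed earlier. The function names are hypothetical, and these measures are an assumption about what a dependency analysis would report, not a reconstruction of the slide.

```python
# Baskets built from the (Basket_id, Item_id) rows shown earlier.
BASKETS = {
    10011: {152, 37},
    10012: {1, 152, 785, 6},
    10013: {10},
    10014: {15, 811},
}


def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    hits = sum(1 for items in baskets.values() if itemset <= items)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0  # the rule never applies in this data
    return support(antecedent | consequent, baskets) / base


# Rule "152 => 37": item 152 is in 2 of 4 baskets, both items in 1 of 4.
rule_support = support({152, 37}, BASKETS)
rule_confidence = confidence({152}, {37}, BASKETS)
```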
Market basket analysis: classification
KDD Standards
1. Methodologies
(Marbán et al., 2009)
5A: developed in the mid-1990s by SPSS. The name is an
acronym for the steps performed:
Assess – assess the requirements of the project,
Access – access the available data,
Analyze – perform the analyses,
Act – turn knowledge into actions,
Automate – deploy the models in an automatic way.
SEMMA: developed in the mid-1990s by SAS:
Sample the data by creating one or more data
tables,
Explore the data by searching for relationships,
trends or anomalies,
Modify the data by creating, selecting, and
transforming the variables,
Model the relationships between input and output
variables by using various data mining techniques,
Assess the quality of the models.
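The five SEMMA steps can be sketched as a small pipeline. The data below is a made-up toy set and every function name is hypothetical; the "model" is a plain threshold between class means, chosen only to keep the example short, not a technique prescribed by SEMMA.

```python
import random
from statistics import mean

# Toy data (hypothetical): (income, loan decision) pairs.
ROWS = [(inc, "n") for inc in range(10, 30)] + [(inc, "A") for inc in range(70, 90)]


def sample(rows, per_class, seed=0):
    """Sample: draw a stratified training table; the rest is held out."""
    rng = random.Random(seed)
    train = []
    for label in ("n", "A"):
        group = [r for r in rows if r[1] == label]
        train += rng.sample(group, per_class)
    test = [r for r in rows if r not in train]
    return train, test


def explore(rows):
    """Explore: mean of the input variable for each class."""
    return {label: mean(x for x, y in rows if y == label) for label in ("n", "A")}


def modify(rows, lo, hi):
    """Modify: rescale income using the range seen in the training table."""
    return [((x - lo) / (hi - lo), y) for x, y in rows]


def model(rows):
    """Model: classify by a threshold halfway between the class means."""
    means = explore(rows)
    threshold = (means["n"] + means["A"]) / 2
    return lambda x: "A" if x >= threshold else "n"


def assess(classifier, rows):
    """Assess: accuracy on the held-out rows."""
    return sum(classifier(x) == y for x, y in rows) / len(rows)


train, test = sample(ROWS, per_class=10)
lo, hi = min(x for x, _ in train), max(x for x, _ in train)
classifier = model(modify(train, lo, hi))
accuracy = assess(classifier, modify(test, lo, hi))
```

The rescaling parameters are taken from the training table only and then reused for the held-out rows, so the assessment does not leak information from the test data.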
CRISP-DM: currently a de facto standard supported by most
data mining systems
2. Standards to describe models
Predictive Model Markup Language (PMML)
An XML-based standard developed by the Data Mining
Group (www.dmg.org) that makes it possible to describe data,
data transformations and the resulting models. Main parts
of a PMML document:
Header
Data Dictionary
Data Transformations
Model
<?xml version="1.0" ?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header copyright="P.B." description="An example decision tree model."/>
  <!-- Data dictionary: the five attributes and their allowed values -->
  <DataDictionary numberOfFields="5">
    <DataField name="income" optype="categorical" dataType="string">
      <Value value="low"/>
      <Value value="high"/>
    </DataField>
    <DataField name="account" optype="categorical" dataType="string">
      <Value value="low"/>
      <Value value="medium"/>
      <Value value="high"/>
    </DataField>
    <DataField name="sex" optype="categorical" dataType="string">
      <Value value="male"/>
      <Value value="female"/>
    </DataField>
    <DataField name="unemployed" optype="categorical" dataType="string">
      <Value value="yes"/>
      <Value value="no"/>
    </DataField>
    <DataField name="loan" optype="categorical" dataType="string">
      <Value value="A"/>
      <Value value="n"/>
    </DataField>
  </DataDictionary>
  <!-- Model: a decision tree predicting the "loan" attribute -->
  <TreeModel modelName="loan approval decision tree" functionName="classification">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="account"/>
      <MiningField name="sex"/>
      <MiningField name="unemployed"/>
      <MiningField name="loan" usageType="predicted"/>
    </MiningSchema>
    <Node score="A">
      <True/>
      <Node score="A">
        <SimplePredicate field="income" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="income" operator="equal" value="low"/>
        <Node score="A">
          <SimplePredicate field="account" operator="equal" value="high"/>
        </Node>
        <Node score="n">
          <SimplePredicate field="account" operator="equal" value="low"/>
          <Node score="n">
            <SimplePredicate field="unemployed" operator="equal" value="yes"/>
          </Node>
          <Node score="A">
            <SimplePredicate field="unemployed" operator="equal" value="no"/>
          </Node>
        </Node>
      </Node>
    </Node>
  </TreeModel>
</PMML>
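To illustrate how a consumer might apply such a tree, here is a minimal sketch that parses the Node structure with Python's standard library and scores one record. It embeds a trimmed copy of the tree without the PMML namespace, handles only `True` and `SimplePredicate` with `operator="equal"`, and, when no child matches, returns the score of the last matching node. That fallback is an assumption of this sketch, not the behaviour PMML applies by default. The helper names `matches` and `score` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the tree from the PMML example (no namespace, no schema).
PMML_TREE = """
<TreeModel modelName="loan approval decision tree">
  <Node score="A">
    <True/>
    <Node score="A">
      <SimplePredicate field="income" operator="equal" value="high"/>
    </Node>
    <Node score="n">
      <SimplePredicate field="income" operator="equal" value="low"/>
      <Node score="A">
        <SimplePredicate field="account" operator="equal" value="high"/>
      </Node>
      <Node score="n">
        <SimplePredicate field="account" operator="equal" value="low"/>
        <Node score="n">
          <SimplePredicate field="unemployed" operator="equal" value="yes"/>
        </Node>
        <Node score="A">
          <SimplePredicate field="unemployed" operator="equal" value="no"/>
        </Node>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""


def matches(node, record):
    """Return True if the node's predicate holds for the record."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    if pred is None or pred.get("operator") != "equal":
        return False  # other predicate types are not handled in this sketch
    return record.get(pred.get("field")) == pred.get("value")


def score(node, record):
    """Descend into the first matching child; fall back to this node's score."""
    for child in node.findall("Node"):
        if matches(child, record):
            return score(child, record)
    return node.get("score")


root = ET.fromstring(PMML_TREE).find("Node")
decision = score(root, {"income": "low", "account": "low", "unemployed": "no"})
```

A record with `account` equal to `medium` matches neither child of the `income = low` node, so this sketch returns that node's own score.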
3. Programming standards (API)
SQL/MM Data Mining
A standard interface that gives access to data
mining algorithms from relational databases
OLE DB for Data Mining
API developed by Microsoft
Java Data Mining
Example of an OLE DB for Data Mining statement:
CREATE MINING MODEL CreditRisk
(
    CustomerId LONG KEY,
    Income TEXT DISCRETE,
    Account TEXT DISCRETE,
    Sex TEXT DISCRETE,
    Unemployed BOOLEAN DISCRETE,
    Loan TEXT DISCRETE PREDICT
)
USING Microsoft_Decision_Trees
Data Mining Systems
cover the whole KDD process (from data
preprocessing to model evaluation),
offer more data mining algorithms than single-purpose
machine learning systems,
focus on visualization (both in how the system is
used and in how data and results are presented
and interpreted).
System                 Vendor                 URL
SPM                    Salford Systems        www.salford-systems.com
Clementine             SPSS                   www-01.ibm.com/software/analytics/spss/products/modeler/
Enterprise Miner       SAS Institute          www.sas.com/technologies/analytics/datamining/miner/
GhostMiner             Fujitsu                www.fqs.pl/business_intelligence/products/ghostminer
Intelligent Miner      IBM                    www-01.ibm.com/software/data/infosphere/warehouse/enterprise.html
KnowledgeStudio        Angoss                 www.angoss.com
Oracle Data Mining     Oracle                 www.oracle.com/us/products/database/options/data-mining/index.html
PolyAnalyst            Megaputer              www.megaputer.com/
Statistica Data Miner  StatSoft               www.statsoft.com/products/data-mining-solutions/
LISp Miner             VŠE                    lispminer.vse.cz
RapidMiner             Rapid-I                rapid-i.com/
Weka                   University of Waikato  www.cs.waikato.ac.nz/ml/weka/index.html
[Screenshots: Weka, RapidMiner, SAS Enterprise Miner, IBM SPSS Modeler (Clementine)]