Knowledge Discovery in Databases T1: introduction

P. Berka, 2012 1/19

Knowledge Discovery in Databases

(Information Harvesting, Data Archeology, Data Mining, Knowledge Distillery, ...)

Non-trivial process of identifying valid, novel,

potentially useful and ultimately understandable

patterns from data (Fayyad et al., 1996)

Data mining involves the use of sophisticated data

analysis tools to discover previously unknown, valid

patterns and relationships in large data sets

(Adriaans, Zantinge, 1999)

Analysis of observational data sets to find

unsuspected relationships and summarize data in

novel ways that are both understandable and useful

to the data owner (Hand, Mannila, Smyth, 2001)

Data mining is the process of analyzing hidden

patterns of data from different perspectives and

categorizing them into useful information

(techopedia.org, 2011)

Three sources

databases (query languages, OLAP), statistics

(data analysis), artificial intelligence (machine

learning)

KDD Tasks

(Klosgen, Zytkow, 1997)

classification/prediction: the task is to find knowledge that can be used to process new examples automatically

description: the task is to find dominant structure or relationships

search for "nuggets": the task is to find partial, novel, and surprising knowledge

(Chapman et al., 2000)

data description and summarisation:

concise description of characteristics of

the data, typically in elementary and

aggregated form

segmentation: separation of the data into

interesting and meaningful subgroups or

classes

concept description: understandable

description of concepts or classes to gain

insight

classification: build classification models

(sometimes called classifiers) which assign

the correct class label to previously unseen

and unlabeled objects

prediction: similar to classification, but the

target attribute (class) is not a qualitative

discrete attribute but a continuous one.

Prediction also often deals with time

dependent concepts

dependency analysis: describe significant

dependencies (or associations) between data

items or events
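
The difference between classification and prediction described above can be sketched with a 1-nearest-neighbour lookup. The choice of method, the helper names, and the toy data are illustrative assumptions and do not come from the slides; the point is only that the same procedure returns a discrete label in one case and a continuous value in the other.

```python
from math import dist

def nearest(train, x):
    """Return the training pair (features, target) whose features are closest to x."""
    return min(train, key=lambda pair: dist(pair[0], x))

def classify(train, x):
    """Classification: the target is a discrete class label."""
    return nearest(train, x)[1]

def predict(train, x):
    """Prediction: the same lookup, but the target is a continuous value."""
    return float(nearest(train, x)[1])

# Hypothetical toy data: (income, age) -> loan decision / loan amount
labelled = [((10.0, 25.0), "no"), ((40.0, 35.0), "yes"), ((55.0, 50.0), "yes")]
valued = [((10.0, 25.0), 0.0), ((40.0, 35.0), 120.0), ((55.0, 50.0), 200.0)]

print(classify(labelled, (42.0, 33.0)))  # class label of the closest example
print(predict(valued, (42.0, 33.0)))     # numeric target of the closest example
```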

Managerial viewpoint

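[Figure: the KDD process from the managerial viewpoint. Labels recovered from the diagram, translated from Czech (several are truncated in the source): Managerial problem; 1. Solving ...; 2. Problem specification; 3. Acquisition ...; 4. Selection ...; 5. Data preprocessing; 6. Data mining; 7. Interpre...; Knowledge for solving ...]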

Data processing viewpoint

Application areas of KDD

Segmentation and classification (clients of a bank

or insurance company),

Credit Risk Assessment,

Fraud detection

Prediction of stock market prices,

Prediction of energy consumption,

Intrusion detection,

Churn Analysis (telco services providers, internet

providers),

Microarray data analysis (molecular biology),

Targeted marketing,

Medical diagnosis,

Market Basket Analysis.

Market basket analysis: data exploration

Collected data – content of market baskets in transactional form

Basket_id Item_id

10011 152

10011 37

10012 1

10012 152

10012 785

10012 6

10013 10

10014 15

10014 811

. . . . . .
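
As a minimal sketch, the transactional rows above can be grouped into baskets and then used to compute support and confidence, the usual measures for association rules in market basket analysis. Only the rows visible in the table are used (the table itself is elided after them), and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Rows copied from the transactional table above: (Basket_id, Item_id)
rows = [
    (10011, 152), (10011, 37),
    (10012, 1), (10012, 152), (10012, 785), (10012, 6),
    (10013, 10),
    (10014, 15), (10014, 811),
]

def to_baskets(rows):
    """Group transactional rows into one item set per basket."""
    baskets = defaultdict(set)
    for basket_id, item_id in rows:
        baskets[basket_id].add(item_id)
    return dict(baskets)

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for content in baskets.values() if items <= content)
    return hits / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Support of the whole rule divided by the support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

baskets = to_baskets(rows)
print(sorted(baskets[10012]))            # items bought together in basket 10012
print(support(baskets, [152]))           # item 152 is in 2 of the 4 baskets
print(confidence(baskets, [152], [37]))  # rule "152 -> 37"
```

On these four baskets the rule "152 -> 37" has support 0.25 and confidence 0.5; with so few rows the numbers only show the mechanics.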

Market basket analysis: dependency analysis

Market basket analysis: classification

KDD Standards

1. Methodologies

(Marban et al., 2009)

5A: developed in the mid-1990s by SPSS. The name is an acronym for the steps performed:

Assess – assess the requirements of the project,

Access – access the available data,

Analyze – perform the analyses,

Act – turn knowledge into actions,

Automate – deploy the models in an automatic way.

SEMMA: developed in the mid-1990s by SAS:

Sample the data by creating one or more data

tables,

Explore the data by searching for relationships,

trends or anomalies,

Modify the data by creating, selecting, and

transforming the variables,

Model the relationships between input and output

variables by using various data mining techniques,

Assess the quality of the models.
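
The five SEMMA steps can be sketched as a small pipeline, one function per step. Everything below is an illustrative assumption rather than SAS functionality: the toy records, the function names, the min-max rescaling, and the threshold classifier (which also assumes that the positive class has the higher input values).

```python
import random
from statistics import mean

# Hypothetical toy records (income, repaid); not data from the slides.
RECORDS = [(12, 0), (15, 0), (18, 0), (22, 0), (30, 1), (35, 1), (41, 1), (50, 1)]

def sample(records, n_train, seed=0):
    """Sample: split the records into a training table and a hold-out table."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def explore(table):
    """Explore: summarise the table to spot trends or anomalies."""
    return {"mean_income": mean(x for x, _ in table),
            "repaid_rate": mean(y for _, y in table)}

def modify(table, lo, hi):
    """Modify: rescale the input variable relative to the bounds lo and hi."""
    return [((x - lo) / (hi - lo), y) for x, y in table]

def model(table):
    """Model: place a threshold midway between the two class means."""
    mean0 = mean(x for x, y in table if y == 0)
    mean1 = mean(x for x, y in table if y == 1)
    threshold = (mean0 + mean1) / 2
    return lambda x: int(x > threshold)

def assess(classifier, table):
    """Assess: share of hold-out records classified correctly."""
    return mean(int(classifier(x) == y) for x, y in table)

train, holdout = sample(RECORDS, n_train=6)
summary = explore(train)
lo, hi = min(x for x, _ in train), max(x for x, _ in train)
classifier = model(modify(train, lo, hi))
accuracy = assess(classifier, modify(holdout, lo, hi))
print(summary, accuracy)
```

The hold-out table is rescaled with the bounds taken from the training table, so the assessment does not use information from the records it is evaluated on.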

CRISP-DM: currently a de facto standard supported by most data mining systems

2. Standards to describe models

Predictive Modeling Markup Language

An XML-based standard developed by the Data Mining Group (www.dmg.org) for describing data, data transformations, and the resulting models. Main parts of a PMML document:

Header

Data Dictionary

Data Transformations

<?xml version="1.0" ?>
</DataField>
</DataField>
</DataField>
</DataDictionary>
</MiningSchema>
<True/>
</Node>
<SimplePredicate field="account" operator="equal" value="high"/>
</Node>
<SimplePredicate field="account" operator="equal" value="low"/>
<SimplePredicate field="unemployed" operator="equal" value="yes"/>
</Node>
<SimplePredicate field="unemployed" operator="equal" value="no"/>
</Node>
</TreeModel>
</PMML>
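
To show how a decision tree stored this way can be applied, the sketch below parses a simplified tree and scores records against it. The XML is a hypothetical reconstruction built around the predicates in the fragment above: the nesting and the score values are assumptions, and a real PMML file would also carry a namespace, a Header, a DataDictionary and a MiningSchema. Only the `<True/>` predicate and the `equal` operator are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified tree; scores and nesting are assumed, not from the slides.
TREE_XML = """
<TreeModel functionName="classification">
  <Node score="no">
    <True/>
    <Node score="yes">
      <SimplePredicate field="account" operator="equal" value="high"/>
    </Node>
    <Node score="no">
      <SimplePredicate field="account" operator="equal" value="low"/>
      <Node score="no">
        <SimplePredicate field="unemployed" operator="equal" value="yes"/>
      </Node>
      <Node score="yes">
        <SimplePredicate field="unemployed" operator="equal" value="no"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

def matches(node, record):
    """Evaluate a node's predicate against a record (dict of field -> value)."""
    if node.find("True") is not None:
        return True
    pred = node.find("SimplePredicate")
    return pred is not None and record.get(pred.get("field")) == pred.get("value")

def score(node, record):
    """Descend into the first matching child; return the score of the deepest match."""
    for child in node.findall("Node"):
        if matches(child, record):
            return score(child, record)
    return node.get("score")

root = ET.fromstring(TREE_XML).find("Node")
print(score(root, {"account": "high", "unemployed": "yes"}))
print(score(root, {"account": "low", "unemployed": "yes"}))
```

A record that matches none of the child predicates falls back to the score of the last node it reached, which is how the root's default score is used here.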

3. Programming standards (API)

SQL/MM Data Mining

A standard interface for accessing data mining algorithms from relational databases

OLE DB for Data Mining

API developed by Microsoft. Example of a model definition:

CREATE MINING MODEL CreditRisk
(
    CustomerId LONG KEY,
    Income TEXT DISCRETE,
    Account TEXT DISCRETE,
    Sex TEXT DISCRETE,
    Unemployed BOOLEAN DISCRETE,
    Loan TEXT DISCRETE PREDICT
)
USING [Microsoft Decision Tree]

Java Data Mining

Data Mining Systems

cover the whole KDD process (from data

preprocessing to model evaluation),

offer more data mining algorithms (than single-

purpose machine learning systems),

focus on visualization (both in how the system is used and in how data and results are presented and interpreted).

System                 Vendor                 URL
SPM                    Salford Systems        www.salford-systems.com
Clementine             SPSS                   www-01.ibm.com/software/analytics/spss/products/modeler/
Enterprise Miner       SAS Institute          www.sas.com/technologies/analytics/datamining/miner/
GhostMiner             Fujitsu                www.fqs.pl/business_intelligence/products/ghostminer
Intelligent            IBM                    www-01.ibm.com/software/data/infosphere/warehouse/enterprise.html
KnowledgeSt            Angoss                 www.angoss.com
Oracle Data Mining     Oracle                 www.oracle.com/us/products/database/options/data-mining/index.html
PolyAnalyst            Megaputer              www.megaputer.com/
Statistica Data Miner  StatSoft               www.statsoft.com/products/data-mining-solutions/
LISp Miner             VŠE                    lispminer.vse.cz
RapidMiner             Rapid-I                rapid-i.com/
Weka                   University of Waikato  www.cs.waikato.ac.nz/ml/weka/index.

Rapid Miner

SAS Enterprise Miner

IBM SPSS Modeler (Clementine)
