Elementary Concepts of Data Mining

Post on 22-May-2015




Elementary Concepts: Data Mining Technology

Anjan K., II Sem M.Tech

CSE, M.S.R.I.T

Agenda

Need for Dimensionality Reduction
PCA revisited
Data Mining elementary concepts
Hands-on Problem: Q3
Potter's Wheel: a data cleaning tool

Need for Dimensionality Reduction

Data is easy to collect, and it accumulates at an unprecedented speed.

Data is not collected only for data mining.

Data preprocessing is an important part of effective machine learning and data mining.

Dimensionality reduction is an effective approach to downsizing data.

Why Dimensionality Reduction?

Learning and data mining techniques may not be effective for high-dimensional data due to its high dimensionality (the curse of dimensionality).

Query accuracy and efficiency degrade rapidly as the dimension increases.

Visualization: projection of high-dimensional data onto 2D or 3D.

Data compression: efficient storage and retrieval.

Noise removal: positive effect on query accuracy.
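The degradation of query accuracy can be illustrated with a small experiment (random data, invented setup, not from the slides): as dimensionality grows, the nearest and farthest neighbours of a query point end up at almost the same distance, so similarity queries lose contrast.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ratio of farthest to nearest neighbour distance for a random query
# point; it shrinks toward 1 as the dimension d grows.
ratios = {}
for d in (2, 10, 1000):
    X = rng.random((500, d))            # 500 uniform random points
    q = rng.random(d)                   # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    ratios[d] = dist.max() / dist.min()
    print(d, round(ratios[d], 2))
```

At d = 2 the ratio is large (some points are much closer than others); at d = 1000 it is close to 1, so "nearest" is barely meaningful.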

Principal Component Analysis

PCA is a statistical technique used in face recognition and image compression; it is an unsupervised linear algorithm.

It is a common technique for finding patterns in high-dimensional data, e.g. mining for principal components in images.

PCA reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set, that retains most of the sample's information.

Example: a high-resolution image transformed to a low-resolution image.

Geometric Picture of Principal Components (PCs)

Algebraic Derivation of PCs
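The algebraic derivation can be sketched in a few lines of NumPy (a minimal illustration, not the slides' code): center the data, form the sample covariance matrix, and take its leading eigenvectors as the principal components.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)          # center each feature
    cov = np.cov(X_centered, rowvar=False)   # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    components = eigvecs[:, order[:k]]       # top-k eigenvectors
    return X_centered @ components           # reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The first column of `Z` captures the direction of largest variance, the second the largest variance orthogonal to it.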

Knowledge Discovery (KDD) Process

Data mining is the core of the knowledge discovery process.

Databases → Data Cleaning and Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation
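The KDD stages can be sketched as a toy Python pipeline; the records, field names, and support threshold below are invented for illustration only.

```python
from collections import Counter

# Two hypothetical source databases with a missing value to clean.
raw_db1 = [{"name": "Ann", "course": "CS101", "gpa": 3.8},
           {"name": "Bob", "course": None,    "gpa": 3.1}]
raw_db2 = [{"name": "Ann", "course": "CS102", "gpa": 3.8},
           {"name": "Eve", "course": "CS101", "gpa": 2.9}]

def clean(records):
    """Data cleaning: drop records with missing fields."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine cleaned sources into one warehouse."""
    warehouse = []
    for src in sources:
        warehouse.extend(clean(src))
    return warehouse

def select(warehouse):
    """Selection: keep only the task-relevant attribute (course taken)."""
    return [r["course"] for r in warehouse]

def mine(courses, min_support=2):
    """Data mining + pattern evaluation: keep frequently taken courses."""
    counts = Counter(courses)
    return {c: n for c, n in counts.items() if n >= min_support}

patterns = mine(select(integrate(raw_db1, raw_db2)))
print(patterns)  # {'CS101': 2}
```

Each function stands in for one box of the pipeline; a real system would replace them with ETL jobs, a warehouse, and a mining engine.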

Data Mining: Confluence of Multiple Disciplines

Data mining draws on multiple disciplines: Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithms, Visualization, and other disciplines.

Question 3 (1.3 of Chap 1 of Han & Kamber)

Suppose your task as a software engineer at Big University is to design a data mining system to examine the university course database, which contains the following information: the name, address, and status of each student, the courses taken, and the cumulative grade point average (GPA). Describe the architecture you would choose. What is the purpose of each component of this architecture?

Proposed Data Mining Architecture

Source systems: College DB, University DB, Exam DB, Back Office Systems, Response Attribution

University Warehouse: integrates data from the source systems

Data Mining System: mining engine with OLAP Tools and Pattern Evaluation

Graphical Interface: front end for the user

Potter's Wheel

Problems with conventional approaches:
Time consuming (many iterations), long waiting periods
Users have to write complex transformation scripts
Separate tools for auditing and transformation

Potter's Wheel approach:
Interactive system with instant feedback
Integration of both data auditing and transformation
Intuitive, spreadsheet-like user interface


Potter's Wheel: Features

Instead of specifying complex transforms with regular expressions or custom programs, the user specifies transforms by example (e.g. for splitting).

Data auditing is extensible with user-defined domains: parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]³ to [A-Z]³ on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*".

This allows easier detection of logical errors such as invalid airport codes.

Problem: a tradeoff between overfitting and underfitting the structure. Potter's Wheel uses the minimum description length (MDL) method to balance this tradeoff and choose an appropriate structure.

Data auditing runs in the background on the fly (data streaming is also possible).

A reorderer allows sorting on the fly.

The user only works on a view; the real data is not changed until the user exports the set of transforms, e.g. as a C program, and runs it on the real data.

Undo without problems: just delete the unwanted transform from the sequence and redo everything else.
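A minimal sketch of domain-based parsing in Python, assuming hypothetical regex definitions for the <Airport>, <Date>, and <Class> domains (Potter's Wheel itself infers such structures via MDL rather than taking hand-written patterns):

```python
import re

# Hypothetical domain patterns standing in for user-defined domains.
DOMAINS = {
    "Airport": r"[A-Z]{3}",
    "Date": r"[A-Za-z]+ \d{1,2}, \d{4}",
    "Class": r"[A-Za-z]+",
}

# The record structure expressed over domains instead of raw
# character classes, as in the slide's example.
STRUCTURE = re.compile(
    r"(?P<name>[A-Za-z, ]+) "
    r"(?P<origin>{Airport}) to (?P<dest>{Airport}) "
    r"on (?P<date>{Date}) (?P<cls>{Class})$".format(**DOMAINS)
)

m = STRUCTURE.match("Tayler, Jane, JFK to ORD on April 23, 2000 Coach")
print(m.group("origin"), m.group("dest"))  # JFK ORD
```

Because the domains are named, a checker could validate `origin` and `dest` against a table of real airport codes, catching the logical errors mentioned above.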

Potter's Wheel: Conclusion

Problems:
Usability of the user interface
How does duplicate elimination work?
Kind of a black-box system

General open problems of data cleaning:

(Automatic) correction of wrong values

Mask wrong values but keep them

Keep several possible values at the same time (2*age, 2*birthday); this leads to problems if other values depend on a certain alternative and it turns out to be wrong

Maintenance of cleaned data, especially if the sources can't be cleaned

A data cleaning framework is desirable