Post on 22-May-2015
Elementary Concepts of Data Mining Technology
Anjan.K, II Sem M.Tech, CSE, M.S.R.I.T
Agenda
- Need for Dimensionality Reduction
- PCA revisited
- Data Mining elementary concepts
- Hands-On Problem (Q3)
- Potter's Wheel, a Data Cleaning Tool
Need for Dimensionality Reduction
- Data is easy to collect, but it accumulates at an unprecedented speed.
- Data is not collected only for data mining; data preprocessing is an important part of effective machine learning and data mining.
- Dimensionality reduction is an effective approach to downsizing data.
Why Dimensionality Reduction?
- Learning and data mining techniques may not be effective for high-dimensional data due to the curse of dimensionality.
- Query accuracy and efficiency degrade rapidly as the dimension increases.
- Visualization: projection of high-dimensional data onto 2D or 3D.
- Data compression: efficient storage and retrieval.
- Noise removal: positive effect on query accuracy.
Principal Component Analysis
- PCA is an unsupervised linear statistical technique, used for example in face recognition and image compression.
- A common technique for finding patterns in high-dimensional data, e.g. mining for principal components in images.
- Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set, that retains most of the sample's information.
- Example: a high-resolution image transformed into a low-resolution image.
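The steps PCA performs (center the data, compute the covariance matrix, take the top eigenvectors, project) can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the code used in the deck; the sample matrix is made up for demonstration.

```python
import numpy as np

def pca(X, k):
    """Reduce X (n samples x d features) to its top-k principal components."""
    # Center the data so each feature has zero mean.
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features (d x d).
    cov = np.cov(Xc, rowvar=False)
    # Eigendecomposition; eigh is appropriate because cov is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort eigenvectors by descending eigenvalue (variance explained).
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Project the centered data onto the top-k components.
    return Xc @ components

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Z = pca(X, 1)
print(Z.shape)  # (6, 1): six samples reduced from 2 features to 1
```

Keeping only the components with the largest eigenvalues is exactly the "smaller set of variables that retains most of the sample's information" described above.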
Geometric Picture of Principal Components (PCs)
Algebraic Derivation of PCs
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process.
Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
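The pipeline above can be sketched as a chain of small functions. All function names and the toy records below are illustrative stand-ins, not part of any real system; the "mining" step is simplified to a frequency count.

```python
def clean(records):
    # Data cleaning: drop records with missing values.
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # Data integration: merge several cleaned databases into one warehouse.
    warehouse = []
    for src in sources:
        warehouse.extend(src)
    return warehouse

def select(warehouse, fields):
    # Selection: keep only the task-relevant attributes.
    return [{f: r[f] for f in fields} for r in warehouse]

def mine(data):
    # Data mining stand-in: count value frequencies of the "status" field.
    counts = {}
    for r in data:
        counts[r["status"]] = counts.get(r["status"], 0) + 1
    return counts

db1 = [{"name": "A", "status": "ug"}, {"name": "B", "status": None}]
db2 = [{"name": "C", "status": "pg"}, {"name": "D", "status": "ug"}]
patterns = mine(select(integrate(clean(db1), clean(db2)), ["status"]))
print(patterns)  # {'ug': 2, 'pg': 1}
```

The discovered patterns would then go to the pattern-evaluation stage, where uninteresting ones are filtered out.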
Data Mining: Confluence of Multiple Disciplines
Data mining draws on:
- Database Technology
- Statistics
- Machine Learning
- Pattern Recognition
- Algorithms
- Visualization
- Other Disciplines
Question 3 (1.3 of Chap 1 of Han & Kamber)
Suppose your task as a software engineer at Big University is to design a data mining system to examine the university course database, which contains the following information: name, address, status, courses taken, and the cumulative grade point average (GPA) of each student. Describe the architecture you would choose. What is the purpose of each component of this architecture?
Proposed Data Mining Technology
The data sources (College DB, University DB, Exam DB, and Back Office Systems) feed a University Warehouse. The Data Mining System operates on the warehouse, supported by OLAP Tools and a Pattern Evaluation module, with results delivered through a Graphical Interface for response attribution.
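One way the components above could fit together is sketched below. All class and method names are hypothetical, and the records are invented; the mining step is reduced to a simple characterization query (average GPA per student status).

```python
class UniversityWarehouse:
    """Integrates college, university and exam DBs into one store."""
    def __init__(self, *databases):
        # Flatten all source databases into a single record list.
        self.records = [r for db in databases for r in db]

    def query(self, predicate):
        # Return the task-relevant subset matching the predicate.
        return [r for r in self.records if predicate(r)]

class DataMiningSystem:
    """Mining engine operating on the warehouse."""
    def __init__(self, warehouse):
        self.warehouse = warehouse

    def average_gpa(self, status):
        # Characterization example: average GPA for one student status.
        rows = self.warehouse.query(lambda r: r["status"] == status)
        return sum(r["gpa"] for r in rows) / len(rows)

college_db = [{"name": "A", "status": "ug", "gpa": 3.2}]
exam_db = [{"name": "B", "status": "ug", "gpa": 3.8}]
dms = DataMiningSystem(UniversityWarehouse(college_db, exam_db))
print(dms.average_gpa("ug"))  # 3.5
```

In the full architecture, the OLAP tools, pattern evaluation, and graphical interface would sit on top of a mining engine like this one.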
Potter's Wheel
Problems of conventional approaches:
- Time consuming (many iterations), long waiting periods
- Users have to write complex transformation scripts
- Separate tools for auditing and transformation

The Potter's Wheel approach:
- Interactive system with instant feedback
- Integration of data auditing and transformation
- Intuitive, spreadsheet-like user interface
Potter's Wheel - Features
- Instead of complex transform specifications with regular expressions or custom programs, the user specifies transforms by example (e.g. splitting).
- Data auditing is extensible with user-defined domains: parse "Tayler, Jane, JFK to ORD on April 23, 2000 Coach" as "[A-Za-z,]* <Airport> to <Airport> on <Date> <Class>" instead of "[A-Za-z,]* [A-Z]{3} to [A-Z]{3} on [A-Za-z]* [0-9]*, [0-9]* [A-Za-z]*". This allows easier detection of logical errors such as invalid airport codes.
- Problem: a tradeoff between overfitting and underfitting structure. Potter's Wheel uses the Minimum Description Length (MDL) method to balance this tradeoff and choose an appropriate structure.
- Data auditing runs in the background on the fly (data streaming is also possible); a reorderer allows sorting on the fly.
- The user only works on a view; the real data isn't changed until the user exports the set of transforms (e.g. as a C program) and runs it on the real data.
- Undo without problems: just delete the unwanted transform from the sequence and redo everything else.
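The view/undo model described above can be sketched as an ordered list of transforms replayed over an unchanged source. This is a toy illustration of the idea, not Potter's Wheel's implementation; the split-by-example step is stood in for by a fixed comma split.

```python
source = ["Tayler, Jane", "Smith, John"]  # real data, never mutated

transforms = []  # ordered sequence of row -> row functions

def add_transform(fn):
    transforms.append(fn)

def view():
    # The user sees the source with all transforms replayed over it.
    rows = list(source)
    for fn in transforms:
        rows = [fn(r) for r in rows]
    return rows

def undo(fn):
    # Undo = delete the unwanted transform; the rest replay automatically.
    transforms.remove(fn)

# Stand-in for split-by-example: split "Last, First" at the comma.
split_name = lambda r: tuple(p.strip() for p in r.split(","))
upper = lambda r: tuple(p.upper() for p in r)

add_transform(split_name)
add_transform(upper)
print(view())   # [('TAYLER', 'JANE'), ('SMITH', 'JOHN')]
undo(upper)
print(view())   # [('Tayler', 'Jane'), ('Smith', 'John')]
print(source)   # unchanged: ['Tayler, Jane', 'Smith, John']
```

Exporting the transform sequence as a standalone program (as Potter's Wheel does, e.g. to C) is then just a matter of emitting the recorded steps.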
Potter's Wheel - Conclusion
Problems:
- Usability of the user interface
- How does duplicate elimination work? It is something of a black-box system.

General open problems of data cleaning:
- (Automatic) correction of wrong values
- Masking wrong values while keeping them
- Keeping several possible values at the same time (e.g. 2*age vs. 2*birthday); this leads to problems if other values depend on a certain alternative that turns out to be wrong
- Maintenance of cleaned data, especially if the sources can't be cleaned
- A data cleaning framework is desirable