Post on 18-Dec-2015
CogNova Technologies
Knowledge Discovery and Data Mining
An Introduction
Daniel L. Silver
Copyright (c) 2003, All Rights Reserved
Agenda
• Introduction to KDD & DM
• Overview of the KDD Process
• Benefits, Costs, Status and Trends
"We are drowning in information, but starving for knowledge." John Naisbitt, Megatrends, 1988
Data Analytics or KDD: Data Warehousing, Data Mining, Data Visualization
Introduction
Data Analytics is not a new field ...
• Since the 1990s referred to as: Data Analysis, Data Mining, Data Warehousing
• A multidisciplinary field:
    • Database and data warehousing
    • Data and model visualization methods
    • On-line Analytical Processing
    • Statistics and machine learning
    • Knowledge management
Introduction
What is Data Analytics (KDD)?
• A process
• The selection and processing of data for:
    • the identification of novel, accurate, and useful patterns, and
    • the modeling of real-world phenomena.
• Data Warehousing, Data Mining, and Data Visualization are major components.
The KDD Process
[Figure: the KDD process. Data Sources -> Data Consolidation -> Consolidated Data (Data Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]
Introduction – KDD In Context
The KDD Process
[Figure: the KDD process diagram (Data Sources, Data Consolidation, Warehouse, Consolidated Data, Selection and Preprocessing, Prepared Data, Data Mining, Patterns & Models, Interpretation and Evaluation, Knowledge) set inside "The Virtuous Cycle" of Berry & Linoff: Identify Problem or Opportunity, Act on Knowledge, Measure Effect of Action. Arrow labels: Problem, Knowledge, Strategy, Results]
Introduction - CRISP
• CRoss Industry Standard Process for Data Mining
• Developed by employees at SPSS, NCR, DaimlerChrysler
• Iterative process with 6 major steps:
    • Business Understanding
    • Data Understanding
    • Data Preparation
    • Modeling
    • Evaluation
    • Deployment
Why? ... Relationship Marketing, a.k.a. Customer Relationship Management
Marketing Embraces KM, DW, DM
[Figure labels: Marketing, Traditional Marketing, MIS, Data Warehousing, Data Mining]
What is Relationship Marketing?
• Knowing your customers on an individual basis
• Maximizing life-time value, not individual sales
• Developing and maintaining a mutually beneficial relationship
• Acquire, retain, win-back desirable customers
[Figure: Arbuckle's Market, "The Corner Store"]
Knowledge Discovery
What can KDD do for an organization?
Impact on Marketing
• Target marketing at a credit card company
• Consumer usage analysis at a telecomm provider
• Loyalty assessment at a service bureau
• Quality of service analysis at an appliance chain
Application Areas
Private/Commercial Sector
• Marketing: segmentation, product targeting, customer value and retention, ...
• Finance: investment support, portfolio management
• Banking & Insurance: credit and policy approval
• Security: fraud detection, access control
• Science and medicine: hypothesis discovery, prediction, classification, diagnosis
• Manufacturing: process modeling, quality control, resource allocation
• Engineering: pattern recognition, signal processing
• Internet: smart search engines, web marketing
Application Areas
Public/Gov't Sector
• Finance: investment management, price forecasting
• Taxation: adaptive monitoring, fraud detection
• Health care: medical diagnosis, risk assessment, cost/quality control
• Education: process and quality modeling, resource forecasting
• Insurance: worker's compensation analysis
• Security: bomb, iceberg detection
• Transportation: simulation and analysis
• Statistics: demographic analysis, municipal planning
The KDD Process
[Figure: KDD process diagram: Data Consolidation, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, leading to Knowledge]
The KDD Process
Possible results for any one effort:
• Confirmation of the obvious
• New knowledge - the data mine "nugget"
• No significant relations found (random data)
The KDD Process
Core Problems & Approaches
Problems:
• identification of relevant data
• representation of data
• search for valid pattern or model
Approaches:
• top-down deduction by expert
• interactive visualization of data/models
• * bottom-up induction from data *
[Figure labels: Probability of sale, Income, Age, Data Mining, OLAP]
The KDD Process
The Architecture of a KDD System
[Figure labels: Graphical User Interface; Data Consolidation; Selection and Preprocessing; Data Mining; Interpretation and Evaluation; Data Sources; Warehouse; Knowledge]
The KDD Process
[Figure: KDD process diagram: Data Consolidation, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, leading to Knowledge]
Data Consolidation
Garbage in, garbage out
• The quality of results relates directly to quality of the data
• 50%-70% of KDD process effort will be spent on data consolidation, cleansing and preprocessing
• Major justification for a corporate Data Warehouse
Data Consolidation & Warehousing
From data sources to consolidated data repository
[Figure labels: RDBMS, Legacy DBMS, Flat Files, External; Data Consolidation and Cleansing; Warehouse or Datamart; Analysis and Info Sharing. Flows: Inflow, Metaflow, Upflow, Downflow, Outflow]
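As an illustration of the consolidation and cleansing step, here is a minimal sketch in Python. The two record sources, the field names (cust_id, name, income), and the merge rule are all assumptions made for the example; a real warehouse load would be driven by the organization's own schemas and cleansing rules.

```python
# Hypothetical records from two sources; field names are illustrative only.
rdbms_rows = [
    {"cust_id": 1, "name": "Ann Lee ", "income": "42000"},
    {"cust_id": 2, "name": "bob ray", "income": ""},
]
flat_file_rows = [
    {"cust_id": 2, "name": "Bob Ray", "income": "38000"},
    {"cust_id": 3, "name": "Cy Dunn", "income": "51000"},
]

def cleanse(row):
    """Trim and title-case names; convert income to int, or None if missing."""
    income = row["income"].strip()
    return {
        "cust_id": row["cust_id"],
        "name": row["name"].strip().title(),
        "income": int(income) if income else None,
    }

def consolidate(*sources):
    """Merge sources by cust_id; later sources fill fields that are still missing."""
    warehouse = {}
    for source in sources:
        for row in map(cleanse, source):
            merged = warehouse.setdefault(row["cust_id"], row)
            for key, value in row.items():
                if merged[key] is None:
                    merged[key] = value
    return warehouse

warehouse = consolidate(rdbms_rows, flat_file_rows)
```

Customer 2 appears in both sources; the missing income from the first source is filled from the second, which is one simple way to reconcile overlapping records.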
Data Warehousing – A Process
Definition: The strategic collection, cleansing, and consolidation of organizational data to meet operational, analytical, and communication needs.
• 75% of early DW projects were not completed
• Data warehousing is not a project
• It is an on-going set of organizational activities
• Must be business benefits driven
Relationship between DW and DM?
[Figure labels: Data Warehousing; Analysis (Query/Reporting, OLAP, Data Mining); "Source of consolidated data"; "Rationale for data consolidation"; Strategic, Tactical]
The KDD Process
[Figure: KDD process diagram: Data Consolidation, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, leading to Knowledge]
Selection and Preprocessing
• Generate a set of examples
    • choose sampling method
    • consider sample complexity
    • deal with volume bias issues
• Reduce attribute dimensionality
    • remove redundant and/or correlating attributes
    • combine attributes (sum, multiply, difference)
• Reduce attribute value ranges
    • group symbolic discrete values
    • quantize continuous numeric values
• OLAP and visualization tools play key role (Han calls this descriptive data mining)
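Two of the reduction steps listed here, removing correlating attributes and quantizing continuous values, can be sketched in a few lines of Python. The column names, the 0.95 correlation threshold, and the bin edges are assumptions chosen for illustration, not values from the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.95):
    """Keep a column only if |r| with every already-kept column is below threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept

def quantize(values, edges):
    """Map each value to a bin index: the number of edges it meets or exceeds."""
    return [sum(v >= e for e in edges) for v in values]

# Hypothetical attributes; income_dollars is redundant with income_k.
columns = {
    "age": [23, 35, 47, 52, 61],
    "income_k": [40, 28, 72, 41, 55],
    "income_dollars": [40000, 28000, 72000, 41000, 55000],
}
kept = drop_correlated(columns)
age_bins = quantize(columns["age"], edges=[30, 50])
```

Here income_dollars is dropped because it is perfectly correlated with income_k, and ages are mapped to three bins (under 30, 30 to 49, 50 and over).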
OLAP: On-Line Analytical Processing
OLAP Functionality
• Dimension selection
    • slice & dice
• Rotation
    • allows change in perspective
• Filtration
    • value range selection
• Hierarchies
    • drill-downs to lower levels
    • roll-ups to higher levels
[Figure: OLAP cube of Profit Values with dimensions Year by Month, Product Class by Product Name, and Sales Region]
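The OLAP operations named on this slide can be illustrated on a tiny in-memory cube. This is only a sketch: the fact rows, the dimension names, and the helper functions roll_up and slice_ are hypothetical, and a real OLAP server precomputes and indexes such aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, product_class, profit)
facts = [
    (2002, "Jan", "East", "Audio", 120),
    (2002, "Jan", "West", "Audio", 80),
    (2002, "Feb", "East", "Video", 200),
    (2003, "Jan", "East", "Audio", 150),
    (2003, "Feb", "West", "Video", 90),
]
DIMS = ("year", "month", "region", "product_class")

def roll_up(rows, dims):
    """Sum profit grouped by the listed dimensions, aggregating over the rest."""
    idx = [DIMS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_(rows, dim, value):
    """Keep only the rows where one dimension takes the given value."""
    i = DIMS.index(dim)
    return [row for row in rows if row[i] == value]

by_year = roll_up(facts, ["year"])                  # roll-up to the year level
by_year_month = roll_up(facts, ["year", "month"])   # drill-down to months
east_by_class = roll_up(slice_(facts, "region", "East"), ["product_class"])
```

Changing the list of dimensions passed to roll_up corresponds to moving up or down a hierarchy, and reordering it corresponds to rotation.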
Selection and Preprocessing
• Transform data
    • decorrelate and normalize values
    • map time-series data to static representation
• Encode data
    • representation must be appropriate for the Data Mining tool which will be used
    • continue to reduce attribute dimensionality where possible without loss of information
• OLAP and visualization tools as well as transformation and encoding software
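A minimal Python sketch of two of these transformations follows. It assumes z-score scaling as the normalization and a fixed-width sliding window as the static representation of a time series; the slides do not prescribe either choice.

```python
import statistics

def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def sliding_windows(series, width):
    """Turn a time series into static rows: `width` past values -> next value."""
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]

normalized = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
rows = sliding_windows([10, 12, 11, 15, 14], width=3)
```

Each row produced by sliding_windows is an ordinary fixed-length example, so a tool that expects static input/output pairs can be applied to temporal data.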
The KDD Process
[Figure: KDD process diagram: Data Consolidation, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, leading to Knowledge]
Overview of Data Mining Methods
• Automated Exploration/Discovery
    • e.g. discovering new market segments
    • distance and probabilistic clustering algorithms
• Prediction/Classification
    • e.g. forecasting gross sales given current factors
    • regression, neural networks, genetic algorithms
• Explanation/Description
    • e.g. characterizing customers by demographics and purchase history
    • inductive decision trees, association rule systems
Focus is on induction of a model from specific examples
[Figure: clusters in the x1, x2 plane; curve f(x) against x; rule "if age > 35 and income < $35k then ..."]
Data Mining Methods
Automated Exploration and Discovery
• Distance-based numerical clustering
    • metric grouping of examples (KNN)
    • graphical visualization can be used
• Bayesian clustering
    • search for the number of classes which result in best fit of a probability distribution to the data
Unsupervised Learning
[Figure: clusters of examples plotted by Income and Age]
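Distance-based clustering can be sketched with plain k-means, used here as one common example of metric grouping; it is a substitute chosen for illustration and not necessarily the algorithm the slide has in mind. The (age, income) points and the starting centroids are hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical (age, income in $k) examples forming two visible groups.
points = [(25, 30), (27, 32), (29, 31), (55, 80), (58, 85), (60, 82)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

No class labels are supplied; the two groups emerge from distances alone, which is what makes this unsupervised learning.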
Data Mining Methods
Prediction and Classification
• Function approximation (curve fitting)
• Classification (concept learning, pattern recognition)
• Methods:
    • Statistical regression
    • Artificial neural networks
    • Genetic algorithms
    • Nearest neighbour algorithms
Supervised Learning
[Figure: network with inputs I1 to I4 and outputs O1, O2; curve f(x) against x; classes A and B in the x1, x2 plane]
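Of the methods listed, the nearest neighbour approach is the simplest to show in code. The sketch below assumes Euclidean distance, majority voting with k=3, and a made-up training set of (x1, x2) points labelled A or B.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training examples."""
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled examples: ((x1, x2), class)
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

label_near_a = knn_classify(train, (2.0, 2.0))
label_near_b = knn_classify(train, (6.0, 7.0))
```

Unlike clustering, every training example carries a target label, so the learner is supervised.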
Data Mining Methods
Generalization
• The objective of learning is to achieve good generalization to new cases, otherwise just use a look-up table.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points.
[Figure: curve f(x) fitted through training points along x]
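The contrast between a look-up table and a generalizing model can be shown with a least-squares line fit. The training points below are hypothetical and lie exactly on y = 2x + 1, so the fitted line can answer for an x value that never appeared in training, while the table cannot.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]          # hypothetical training points on y = 2x + 1

lookup = dict(zip(xs, ys))         # memorizes the training cases only
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Interpolate (or extrapolate) with the fitted line."""
    return slope * x + intercept

unseen = 2.5
table_answer = lookup.get(unseen)  # None: the table has no entry for 2.5
model_answer = predict(unseen)
```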
Data Mining Methods
Explanation and Description
• Learn a generalized hypothesis (model) from selected data
• Description/Interpretation of model provides new human knowledge
• Methods:
    • Inductive decision tree and rule systems
    • Association rule systems
    • Link Analysis
[Figure: decision tree with Root node A?, internal nodes B?, C?, D?, and a Leaf labelled Yes]
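Association rule systems score a rule such as "bread => milk" by its support and confidence. The sketch below uses the standard definitions of those two measures; the market-basket transactions are invented for the example.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

A rule with high support and confidence can be read directly by a domain expert, which is the sense in which these methods explain as well as predict.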
Modeling & Data Mining
DEMO
WEKA – A Data Mining Environment
The KDD Process
[Figure: KDD process diagram: Data Consolidation and Warehousing, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, leading to Knowledge]
Interpretation and Evaluation
Evaluation
• Statistical validation and significance testing
• Qualitative review by experts in the field
• Pilot surveys to evaluate model accuracy
Interpretation
• Inductive tree and rule models can be read directly
• Clustering results can be graphed and tabled
• Code can be automatically generated by some systems (ANNs, IDTs, Regression models)
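One common form of statistical validation is k-fold cross-validation, sketched below as an assumed example; the slides do not name a specific procedure. The toy threshold learner and the (value, class) examples are hypothetical and exist only to give the evaluation loop something to score.

```python
def k_fold_accuracy(examples, train_fn, k=3):
    """Average held-out accuracy over k folds (fold i = every k-th example)."""
    scores = []
    for i in range(k):
        test = examples[i::k]
        train = [ex for j, ex in enumerate(examples) if j % k != i]
        model = train_fn(train)
        correct = sum(model(x) == y for x, y in test)
        scores.append(correct / len(test))
    return sum(scores) / k

def train_threshold(train):
    """Toy learner: split halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x > cut)

# Hypothetical (value, class) examples; the last one sits close to the boundary.
examples = [(1, 0), (2, 0), (3, 0), (4, 0),
            (6, 1), (7, 1), (8, 1), (9, 1), (4.5, 1)]
accuracy = k_fold_accuracy(examples, train_threshold)
```

Because each example is scored by a model that never saw it during training, the averaged accuracy estimates how well the model generalizes, not how well it memorizes.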
Interpretation and Evaluation
Visualization tools can be very helpful:
• sensitivity analysis (I/O relationship)
• histograms of value distributions
• time-series plots and animation
• requires training and practice
[Figure: Response plotted against Velocity and Temp]
Benefits of Data Analytics (KDD)
• Maximum utility from corporate data
    • discovery of new knowledge
    • generation of predictive models
• Important feedback to data warehousing effort
    • identification and justification of essential data
• Reduction of application dev't backlog
    • model development vs. software development
• Effect on bottom line of organization
    • cost reduction, increased productivity, risk avoidance … competitive advantage
Requirements and Costs of KDD
• Hardware - computationally intensive
• Software - micro < $20k, integrated suites $100k+
• Data - internal collection, surveys, external sources
• Human resources
    • DB/DP/DC expertise to consolidate and preprocess data
    • Machine learning and stats competence
    • Application knowledge & project mgmt
70% of the effort is expended on the data consolidation and preprocessing activities
Current Status and Trends
• Standards and methodologies are maturing
• Many products:
    • Open source (WEKA, RapidMiner)
    • Micro DM packages (IBM Cognos)
    • Macro integrated suites (IBM SPSS Modeler, SAS Enterprise Miner)
• Software costs have stabilized
• Major players have been determined
• Internet - "the" sink and source of data
• Legal and ethical issues on the horizon
Current Status and Trends
• Methods used:
    • http://www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html
• Application areas:
    • http://www.kdnuggets.com/polls/2012/where-applied-analytics-data-mining.html
• Other polls:
    • http://www.kdnuggets.com/polls/index.html
The Current Status and Trends
What has prevented the use of Data Mining?
• Products:
    • General in nature, not tailored for business
    • Missing standard interfaces to organizational data
    • Emphasis on sales and not training/consulting
• Customers:
    • Frightened by technical skill set required
    • Uncertain of mining results and ROI
    • Convinced warehouse must be completed first
    • Lacking knowledge of external data sources
Key Technologies for KDD
• Data warehousing and distributed database
• Parallel computing
• AI and expert systems
• Machine learning and statistical inference
• Visualization (including Virtual Reality)
• Internet - future sink and source of data
    • adaptive filters, knowledge extractors
    • smart web services
Current Management Issues
• Ownership of data and knowledge
• Security of customer data
• Responsibility for accuracy of information
• Ethical practices - fair use of data
A List of Major Vendors
Lots of Players
• Approaching market from hardware, database, statistical, machine learning, education, financial/marketing, and management consulting:
• IBM, SAS, SPSS, SGI, Thinking Machines, Cognos, ZDM Scientific, Neuralware, Information Discovery, American Heuristics, Data Distilleries, SuperInduction