Data Mining: The Next Revolution in Institutional Research
C. R. Thulasi Kumar
Office of Information Management & Analysis
University of Northern Iowa
May 31, 2004
The Evolution of Data Analysis

Evolutionary Step: Data Collection (1960s)
Business Question: "What was my total revenue in the last five years?"
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: "What were unit sales in New England last March?"
Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
Business Question: "What were unit sales in New England last March? Drill down to Boston."
Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product Providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
Business Question: "What’s likely to happen to Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
Product Providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
Characteristics: Prospective, proactive information delivery

Source: SPSS BI
What is Data Mining?
• The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).
• The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
• The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).
Differences between Statistics and Data Mining
STATISTICS                      DATA MINING
Confirmative                    Explorative
Small data sets/File-based      Large data sets/Databases
Small number of variables       Large number of variables
Deductive                       Inductive
Numeric data                    Numeric and non-numeric
Clean data                      Data cleaning
Why Data Mining?
• Too much data
  • Too many records
  • Too many variables
• Interesting patterns difficult to find with traditional statistics, due to
  • Complex nonlinear relationships
  • Multi-variable combinations
Source: Abbott, Data Mining: Level II
Data Mining is not…
• OLAP
• Data Warehousing
• Data Visualization
• SQL
• Ad Hoc Queries
• Reporting
Data Mining Algorithms
• Statistics
  • Distributions, mathematics, etc.
• Machine Learning
  • Computer science, heuristics and induction algorithms
• Artificial Intelligence
  • Emulating human intelligence
• Neural Networks
  • Biological models, psychology and engineering
Data Mining is…
• Predictive Modeling
  • Linear/Logistic Regression
  • Neural Networks
  • Decision Trees
• Clustering
  • Kohonen Neural Networks Clustering
  • K-Means Clustering
  • Nearest Neighbor Clustering
Data Mining is… (cont’d)
• Segmentation
  • Decision Trees
  • Neural Networks
  • Predictive Modeling
• Affinity Analysis
  • Association Rule
  • Sequence Generators
Decision tree: Credit ranking (1=default)
Root (n=323, 100.00%): Bad 52.01% (168), Good 47.99% (155)
Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
  Weekly pay (n=165, 51.08%): Bad 86.67% (143), Good 13.33% (22)
  Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
    Young (< 25); Middle (25-35) (n=158, 48.92%): Bad 90.51% (143), Good 9.49% (15)
    Old (> 35) (n=7, 2.17%): Bad 0.00% (0), Good 100.00% (7)
  Monthly salary (n=158, 48.92%): Bad 15.82% (25), Good 84.18% (133)
  Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
    Young (< 25) (n=49, 15.17%): Bad 48.98% (24), Good 51.02% (25)
    Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
      Management; Clerical (n=41, 12.69%): Bad 58.54% (24), Good 41.46% (17)
      Professional (n=8, 2.48%): Bad 0.00% (0), Good 100.00% (8)
    Middle (25-35); Old (> 35) (n=109, 33.75%): Bad 0.92% (1), Good 99.08% (108)
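The chi-square value reported for the first split of the credit-ranking tree can be checked from the node counts alone: 165 weekly-pay cases (143 bad, 22 good) and 158 monthly-salary cases (25 bad, 133 good). The sketch below computes a likelihood-ratio chi-square (G-squared) for that 2x2 table. That the tree software used this particular statistic is an assumption; the function name and table layout are illustrative only. Under that assumption the result lands close to the 179.6665 shown for the Paid Weekly/Monthly split.

```python
import math

def likelihood_ratio_chi2(table):
    """Return (G^2, df), where G^2 = 2 * sum(O * ln(O / E)) over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * total, df

# Rows: weekly pay, monthly salary. Columns: bad, good (counts from the tree).
root_split = [[143, 22], [25, 133]]
g2, df = likelihood_ratio_chi2(root_split)
print(f"G^2 = {g2:.4f}, df = {df}")
```

A large statistic on one degree of freedom is what makes pay frequency the first split: the bad/good mix differs sharply between the two child nodes.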
Kohonen Network
• Seeks to describe dataset in terms of natural clusters of cases
Source: SPSS BI
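To make "natural clusters of cases" concrete, here is a minimal sketch of a one-dimensional Kohonen map (self-organizing map) in plain Python. It is not the Clementine implementation. The function names, the two-unit map, the decay schedules, the starting positions, and the six toy cases are all assumptions chosen for illustration.

```python
import math

def best_unit(weights, x):
    """Index of the unit whose weight vector is closest to case x."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))

def train_som(data, n_units=2, epochs=50, lr0=0.5, radius0=1.0):
    """Train a 1-D Kohonen map and return the unit weight vectors."""
    # Assumption: start the units on evenly spaced cases from the data list.
    step = (len(data) - 1) // max(n_units - 1, 1)
    weights = [list(data[j * step]) for j in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                      # learning rate decays
        radius = max(radius0 * (1.0 - frac), 1e-3)   # neighbourhood shrinks
        for x in data:
            winner = best_unit(weights, x)
            for j, w in enumerate(weights):
                # Units near the winner on the grid are pulled along with it.
                h = math.exp(-((j - winner) ** 2) / (2.0 * radius ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
    return weights

# Hypothetical cases forming two obvious groups.
cases = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1),
         (5.0, 5.1), (5.1, 4.9), (4.9, 5.0)]
units = train_som(cases)
assignments = [best_unit(units, x) for x in cases]
print(assignments)
```

Each case is assigned to its closest unit after training, so cases that share a unit form one cluster. Real maps typically use a two-dimensional grid with many more units.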
Apriori
• Seeks association rules in dataset
• “Market Basket” analysis
• Sequence discovery
Source: SPSS BI
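As a rough illustration of what "seeks association rules" means in market-basket terms, the sketch below runs a small level-wise Apriori search and then derives rules from the frequent itemsets. The baskets, the thresholds, and the function names are hypothetical, and this is a teaching sketch rather than any vendor's implementation.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)
    current = [frozenset([i]) for i in sorted({i for b in baskets for i in b})]
    frequent = {}
    k = 1
    while current:
        # Support = share of baskets that contain the whole itemset.
        kept = {}
        for itemset in current:
            supp = sum(1 for b in baskets if itemset <= b) / n
            if supp >= min_support:
                kept[itemset] = supp
        frequent.update(kept)
        # Build (k+1)-item candidates and prune any with an infrequent subset.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent

def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules meeting min_confidence."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = supp / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((set(lhs), set(itemset - lhs), supp, confidence))
    return found

# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=0.5)
found_rules = association_rules(frequent, min_confidence=0.8)
for lhs, rhs, supp, conf in found_rules:
    print(sorted(lhs), "->", sorted(rhs), f"support={supp:.2f} confidence={conf:.2f}")
```

Support measures how often the items appear together and confidence measures how often the right-hand side appears given the left-hand side. Raising either threshold trims the rule list.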
Areas of Current Application
• Credit Card/Insurance Fraud Detection
• Credit/Risk Scoring
• Direct Mail Marketing
• Parts Failure Prediction
• Recruiting/Attracting Customers
• Service Delivery and Customer Retention
• “Market Basket” Analysis
Higher Education Applications
• Student academic success/Retention and graduation
• Identify high risk students
• Predict course demand
• Profile good transfer candidates
• Application success rates
• Predict potential alumni donations
Software Vendors
• Clementine (SPSS)
• Intelligent Miner (IBM)
• Insightful Miner (Insightful)
• Enterprise Miner (SAS)
• Affinium Model (Unica)
• CART (Salford Systems)
• XLMiner
• GhostMiner
• S-PLUS
Clementine (SPSS)
Insightful Miner (Insightful)
CART (Salford Systems)
How much does it cost?
• Clementine (SPSS): Price varies
• Insightful Miner (Insightful): Small; a fraction of the cost of other mining tools
• Enterprise Miner (SAS): Academic server license $40K-100K
• Affinium Model (Unica)
• Intelligent Miner (IBM)
• XLMiner: Standard academic version $199 for two years
• GhostMiner: $2.5K-30K + maintenance fee
• CART (Salford Systems): Very low for academic license
Resources
• Web Sites
  • http://www.kdnuggets.com/
  • http://www.uni.edu/instrsch/dm/index.html
• Training
  • http://www.the-modeling-agency.com
What is Data Mining?
• The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
• The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).
Data Mining in Institutional Research
• Data analysis for institutional research (IR) has evolved from simple retrospective data delivery in the 1960s to retrospective, dynamic data delivery at multiple levels in the 1990s. Unlike past methodologies, data mining is prospective and proactive in data analysis and information delivery. With a blend of tools and techniques from disciplines such as statistics, computer science, mathematics, biology and engineering, data mining provides new opportunities for institutional research professionals to provide decision support data. This site provides a collection of resources from an introductory perspective for institutional research professionals interested in data mining.
• As this area is still in its infancy, real-world examples of IR applications are difficult to find, let alone emulate. As more examples in IR become available, this site will be updated. Until then, most of the examples refer to current data mining applications in the business and industry sectors.
• Data mining has been used by universities in a number of areas, including but not limited to enrollment management, retention and graduation analysis, survey data analysis, and donation prediction (alumni contribution).
Comments or Suggestions? Email Dr. Kumar, Information Management & Analysis
Last Modified: March 25, 2004
Copyright 2004 University of Northern Iowa Office of Information Management & Analysis
Training (The Modeling Agency)

DATA MINING: LEVEL I
A Strategic Overview of Methods, Resources and Applications for Predictive Analytics by Tony Rathburn; Eric Siegel
Registration: $1,295, 2 Days*
Washington, DC - June 21 & 22, 2004
San Diego, CA - September 20 & 21, 2004
Las Vegas, NV - November 29 & 30, 2004
*DM Levels I & II Package $1,995

DATA MINING: LEVEL II
A Tactical Drill-Down of the Data Mining Process, Tools and Techniques by Dean Abbott
Registration: $1,295, 2 Days*
Washington, DC - June 23 & 24, 2004
San Diego, CA - September 22 & 23, 2004
Las Vegas, NV - December 1 & 2, 2004

DATA MINING: LEVEL III
A Hands-On Application Workshop for Data Mining Practitioners by Dean Abbott
Registration: $695, 1 Day*
Washington, DC - June 25, 2004
San Diego, CA - September 24, 2004
Las Vegas, NV - December 3, 2004
Selected Data Mining Books

What percentage (%) of time in your data mining project(s) is spent on data cleaning and preparation? (187 votes total)
Over 80%       (46)   25%
61 to 80%      (73)   39%
41 to 60%      (46)   25%
21 to 40%      (7)    4%
20% or less    (15)   8%
Source: http://www.kdnuggets.com/
Thank You