Private Sector Program Workshop on Data Mining
Transcript of Private Sector Program Workshop on Data Mining
Michael Welge
Automated Learning Group, National Center for Supercomputing Applications, University of Illinois
[email protected]. 244.1999
April 28, 2003
alg | Automated Learning Group
Workshop Overview
• Data Mining Concepts and Techniques
• Break
• Data Mining Frameworks D2K/D2KSL
• Lunch – Center Atrium
• Data Mining Applications
  • Text Mining
  • Image Mining
Data Mining Concepts and Techniques Overview
• Automated Learning Group Background
• Introduction to Knowledge Discovery in Databases and Data Mining
• Applications of Data Mining
• Knowledge Discovery in Database Process
• Data Mining Paradigms
• Knowledge Discovery in Databases Framework
• Current and Future Research Activities
• Major Challenges in Data Mining
• Summary/References
Goals
• Understanding of the Knowledge Discovery in Databases Processes
• Gain Knowledge of Basic Data Mining Operations and Techniques
• Key Issues in Application Deployment
• Understanding the Role of Information Visualization in Data Mining
• Understanding the Role of the Knowledge Discovery Framework
ALG Background
• A brief history of the NCSA Automated Learning Group (ALG)
  • NCSA Industrial program foundation
  • State and Federal program support
  • Evolving framework to support KDD
• ALG’s Participation in Related Campus Activities
  • OVCR Faculty Fellows Program
  • REU Data Mining
  • Disability Research Institute (DRI)
  • Mid-America Earthquake Center (MAE)
  • Multi-Sector Crisis Management Consortium (MSCMC)
  • Technology Research Education Collaboration Center (TRECC)
ALG Mission
The specific mission of the Automated Learning Group is:
• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making
• To work closely with industrial, government, and academic partners to explore new application areas for such methods, and
• To transfer the resulting software technology into real world applications
ALG Research, Development, & Technology Transfer Model
Motivation: “Necessity is the Mother of Invention”
• Data Explosion Problem
  • Automated Data Collection Tools And Mature Database Technology Lead To Tremendous Amounts Of Data Stored In Databases, Data Warehouses, And Other Information Repositories.
• We Are Drowning In Data, But Starving For Knowledge
• Solution: Data Management Environments and Data Mining
  • Data Warehousing and On-Line Analytical Processing
  • Extraction Of Interesting Knowledge (Rules, Regularities, Patterns) From Large Data And Large Databases
Why Do We Need Data Mining?
• Data volumes are too large for classical analysis approaches:
  • Large number of records (10^8 – 10^12 bytes)
  • High dimensional data (10^2 – 10^4 attributes)
How do you explore millions of records, tens or hundreds of fields, and find patterns?
Why Do We Need Data Mining?
• As databases grow, supporting the decision-making process with traditional query languages becomes infeasible
• Many queries of interest are difficult to state in a query language (query formulation problem)
  • “Find all cases of fraud”
  • “Find all individuals likely to need Education Credit Assistance”
  • “Find all documents that are similar to this customer's problem”
What is Data Mining? (Knowledge Discovery in Databases)
Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
• The understandable patterns are used to:
  • Make predictions or classifications about new data
  • Discover new business rules
  • Summarize the contents of a large database to support decision making
  • Visualize information to aid humans in discovering deeper patterns
Why Data Mining? – Potential Applications
• Database analysis and decision support
  • Market analysis and management
    – target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    – forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and management
• Other Applications
  • Text mining (newsgroups, email, documents) and Web analysis
  • Many, many others
Market Analysis and Management
• Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
  • Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
  • Associations/correlations between product sales
  • Prediction based on the association information
Market Analysis and Management
• Customer profiling
  • data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
  • identifying the best products for different customers
  • use prediction to find what factors will attract new customers
• Provides summary information
  • various multidimensional summary reports
  • statistical summary information (data central tendency and variation)
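A multidimensional summary report of the kind mentioned above can be sketched in a few lines. This is only an illustration: the sales rows, the two dimensions (region and product), and the `summarize` helper are hypothetical and not taken from the workshop. Central tendency is shown as the mean and variation as the population standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# Hypothetical sales records: (region, product, amount).
sales: List[Tuple[str, str, float]] = [
    ("north", "PC", 1200.0),
    ("north", "PC", 1000.0),
    ("south", "PC", 900.0),
    ("south", "printer", 200.0),
    ("south", "printer", 300.0),
]


def summarize(rows: List[Tuple[str, str, float]]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Group amounts by (region, product) and report count, mean, and spread."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for region, product, amount in rows:
        groups[(region, product)].append(amount)
    return {
        key: {
            "count": len(amounts),
            "mean": mean(amounts),        # central tendency
            "std_dev": pstdev(amounts),   # variation (population std. dev.)
        }
        for key, amounts in groups.items()
    }


report = summarize(sales)
for key, stats in sorted(report.items()):
    print(key, stats)
```

Each key of `report` is one cell of a region-by-product table; adding a third dimension (for example, month) would only change the grouping key.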
Fraud and Inappropriate Behavior Management
• Applications
  • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach
  • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
• Examples
  • tax claims: detect a group of people who file false tax claims
  • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  • medical insurance: detect professional patients and rings of doctors and rings of references
Fraud and Inappropriate Behavior Management
• Detecting inappropriate medical treatment
  • Australian Health Insurance Commission identified that in many cases blanket screening tests were requested.
• Detecting telephone fraud
  • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
• Retail
  • Analysts estimate that 38% of retail shrink is due to dishonest employees.
Corporate Analysis and Risk Management
• Finance planning and asset evaluation
  • cash flow analysis and prediction
  • contingent claim analysis to evaluate assets
  • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning:
  • summarize and compare the resources and spending
• Competition:
  • monitor competitors and market directions
  • group customers into classes and a class-based pricing procedure
  • set pricing strategy in a highly competitive market
Many Many Others
• Description of Land Uses
• Precision Farming
• Peer Group Study
• Real-time Diagnosis of Mechanical Systems
• National Crime Incident Reporting System (Homeland Security)
• Student/Teacher Performance System
• Making Human Resource Decisions
• Automated Completion of Repetitive Forms
• Predicting the Function of a Gene Complex
• Auditing Tool
• Systems for Intrusion Detection
Data Management Environments and Data Mining
KDD Process
• Develop an Understanding of the Application Domain
  • Relevant prior knowledge, problem objectives, success criteria, current solution, inventory resources, constraints, terminology, cost and benefits
• Create Target Data Set
  • Collect initial data, describe, focus on a subset of variables, verify data quality
• Data Cleaning and Preprocessing
  • Remove noise and outliers, handle missing fields, account for time sequence information and known trends, integrate data
• Data Reduction and Projection
  • Feature subset selection, feature construction, discretizations, aggregations
• Selection of Data Mining Task
  • Classification, segmentation, deviation detection, link analysis
• Select Data Mining Approach(es)
• Data Mining to Extract Patterns or Models
• Interpretation and Evaluation of Patterns/Models
• Consolidating Discovered Knowledge
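The middle steps of the process above (cleaning, projection, discretization, and a very simple mining step) can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not the workflow used by any particular tool: the records, field names, cut points, and helper functions are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]

# Hypothetical raw records; None marks a missing field.
raw: List[Record] = [
    {"age": 25.0, "income": 28.0, "noise": 1.0},
    {"age": 41.0, "income": None, "noise": 2.0},
    {"age": 33.0, "income": 55.0, "noise": 3.0},
    {"age": 58.0, "income": 90.0, "noise": 4.0},
]


def clean(records: List[Record]) -> List[Record]:
    """Data cleaning: drop records that have any missing field."""
    return [r for r in records if all(v is not None for v in r.values())]


def project(records: List[Record], fields: List[str]) -> List[Record]:
    """Data reduction: keep only a chosen subset of features."""
    return [{f: r[f] for f in fields} for r in records]


def discretize(value: float, cut_points: List[float], labels: List[str]) -> str:
    """Map a numeric value to a label; labels has one more entry than cut_points."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


def mine(records: List[Record], field: str,
         cut_points: List[float], labels: List[str]) -> Counter:
    """A deliberately simple mining step: frequency of each discretized value."""
    return Counter(discretize(r[field], cut_points, labels) for r in records)


target = project(clean(raw), ["age", "income"])
pattern = mine(target, "income", cut_points=[30.0, 60.0], labels=["low", "mid", "high"])
print(target)
print(pattern)
```

Interpretation and consolidation are left out on purpose; in practice the loop is iterative, and results from the mining step often send the analyst back to cleaning or feature selection.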
Knowledge Discovery In Databases Process
Required Effort for Each KDD Step
[Figure: bar chart of effort (%), on an axis from 0 to 60, for each step: Business Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]
Data Mining and Business Intelligence
[Figure: pyramid of layers, with increasing potential to support business decisions toward the top]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the pyramid, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
  • Object-Relational
  • Spatial
  • Temporal
  • Text
  • Heterogeneous, Legacy, and Distributed
  • WWW
Data Mining Paradigms
• Concept description: Characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Discovery - Association (correlation and causality)
  • age(“20..29”) ^ income(“20..29K”) → buys(“PC”) [support = 2%, confidence = 60%]
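The support and confidence figures attached to an association rule can be computed directly from transaction data. The sketch below uses the standard definitions: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The five transactions are hypothetical toy data, so the resulting numbers differ from the 2% and 60% on the slide.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    lhs = frozenset(antecedent)
    both = lhs | frozenset(consequent)
    lhs_support = support(transactions, lhs)
    return support(transactions, both) / lhs_support if lhs_support else 0.0


# Hypothetical customer records encoded as attribute=value items.
transactions = [
    frozenset({"age=20..29", "income=20..29K", "buys=PC"}),
    frozenset({"age=20..29", "income=20..29K", "buys=PC"}),
    frozenset({"age=20..29", "income=20..29K"}),
    frozenset({"age=30..39", "income=30..39K", "buys=PC"}),
    frozenset({"age=30..39", "income=20..29K"}),
]

antecedent = {"age=20..29", "income=20..29K"}
consequent = {"buys=PC"}

rule_support = support(transactions, antecedent | consequent)
rule_confidence = confidence(transactions, antecedent, consequent)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here two of the five records match the whole rule, and two of the three records matching the left-hand side also buy a PC.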
Data Mining Paradigms
• Classification and Prediction
  • Finding models (functions) that describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Presentation: decision-tree, classification rule, neural network
  • Prediction: Predict some unknown or missing numerical values
• Cluster analysis
  • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
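The clustering principle above can be illustrated with k-means, one common algorithm that groups points so that each is close to its own cluster center. The slides do not name a specific algorithm, so k-means is a choice made here for illustration; the one-dimensional house positions and the starting centers are hypothetical.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], centers: List[float],
              iterations: int = 20) -> Tuple[List[float], List[List[float]]]:
    """Plain k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters: List[List[float]] = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged: assignments will not change any more
        centers = new_centers
    return centers, clusters


# Hypothetical house positions along a road, in metres.
house_positions = [100.0, 110.0, 120.0, 300.0, 310.0, 320.0]
centers, clusters = kmeans_1d(house_positions, centers=[100.0, 320.0])
print(centers)
print(clusters)
```

No class labels are supplied; the two groups emerge from the distances alone, which is the distinction from classification drawn on the slide.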
Data Mining Paradigms
• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
• Other pattern-directed or statistical analyses
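One simple way to flag objects that do not comply with the general behavior of the data is a robust score based on the median and the median absolute deviation (MAD). This is a sketch of one common heuristic, not a method named in the workshop; the 3.5 threshold is a conventional rule of thumb, and the call durations are hypothetical.

```python
from statistics import median
from typing import List


def mad_outliers(values: List[float], threshold: float = 3.5) -> List[float]:
    """Return values whose modified z-score exceeds the threshold.

    modified z = 0.6745 * |x - median| / MAD, where MAD is the median of
    the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against; flag nothing in this sketch
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical call durations in minutes; one call is far from the rest.
call_minutes = [52.0, 48.0, 50.0, 51.0, 49.0, 500.0]
print(mad_outliers(call_minutes))
```

The median-based score is used instead of a mean and standard deviation because a single extreme value inflates the standard deviation enough to hide itself in small samples.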
Origins of Data Mining
• Draws ideas from database systems, machine learning, statistics, mathematical programming, information visualization, and high performance computing.
• Traditional techniques may be unsuitable
  • Enormity of data
  • High dimensionality of data
  • Heterogeneous, distributed nature of data
Data Mining in Action
Requirements For a Successful Data Mining Effort
• There is a sponsor for the application.
• The business case for the application is clearly understood and measurable, and the objectives are likely to be achievable given the resources being applied.
• The application has a high likelihood of having a significant impact on the business.
• Business domain knowledge is available.
• Good quality relevant data in sufficient quantities is available.
• The right people (domain, data management, and data mining experts) are available.
For a first time project the following criteria could be added:
• The scope of the application is limited: try to show results in 6-9 months
• The data sources should be limited to those that are well known, relatively clean, and freely accessible
Need for Data Mining Framework
• Human analysis breaks down with volume and dimensionality.
  • How quickly can you digest 10 million records with 100 fields each?
  • High data growth rate, changing underlying source
• What is typically done by non-statisticians?
  • Select a few fields (usually 2-3 out of 50-100), attempt to visualize or fit to a simple model
• What about traditional statistical approaches?
  • In general, do not scale to large databases
D2K - Data To Knowledge
D2K is a rapid, flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization.
• Visual Programming Environment
• Robust Computational Infrastructure
• Flexible And Extensible Architecture
• Rapid Application Development Environment
• Integrated Environment For Models And Visualization
• Workflow and Group Use Interface
D2K – Infrastructure, Toolkit, Modules, and Applications
• Data Selection
  • Distributed Knowledge Sources
• Data Transformation
  • Feature Selection/Construction
  • Example Selection
• Data Modeling
  • Scalable Algorithms
    – Predictive
    – Discovery
    – Anomaly Detection
  • Bias Optimization
  • Layer Learning
• Model Evaluation
  • Information Visualization
D2K – Infrastructure, Toolkit, Modules, and Applications
D2K/T2K/I2K - Data, Text, and Image Analysis
D2K – SL
• Intuitive interfaces into D2K functionality for non-data mining professionals.
• Transparent access to mine data stored in databases.
• Extensible from desktop to cluster to grid.
• Visualization support at all stages of the data mining process.
• Support for very large data sets.
REVEAL
• Mines and archives information from the web, Usenet, news-feeds, mailing lists, intranets, and databases
• Provides cost effective, efficient, easy to use solutions for searching multiple government/military web sites
• Automated information clustering, classification, and association discovery
• Visualization of search and data organization
• Learns from users; leverages the power of large user communities
• Provides the means to share information and alert others with similar interests
Decision Making in Uncertain Settings
• Evolutionary Multi-Objective Optimization
• DISCUS
  • Computer -> Computer
    – Genetic Algorithms
  • Computer -> Human
    – Interactive Genetic Algorithms
  • Human -> Human
    – Human-based Genetic Algorithms
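Multi-objective optimization, evolutionary or otherwise, looks for the set of non-dominated (Pareto-optimal) solutions: candidates that cannot be improved on one objective without getting worse on another. The sketch below shows only that dominance test, not a genetic algorithm and not the DISCUS system; the two objectives (cost and risk, both minimized) and the candidate values are hypothetical.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidate decisions scored as (cost, risk).
candidates = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 4.0), (6.0, 5.0), (7.0, 2.0)]
front = pareto_front(candidates)
print(front)
```

An evolutionary method would repeatedly generate new candidates and use a test like `dominates` to decide which ones survive; a decision maker then chooses among the points on the front.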
Data Spaces - Publish, Query, and Discover Data
Mining Alarming Incidents in Data Streams - MAIDS
MAIDS aims to:
• Discover changes, trends and evolution characteristics in data streams.
• Construct clusters and classification models from data streams.
• Explore frequent patterns and similarities among data streams.
MAIDS can be applied to:
• Network intrusion detection
• Remote sensor data
• Telecommunication data flow analysis
• Financial data trend prediction
• Web click stream analysis
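Discovering a change in a data stream can be illustrated with two adjacent sliding windows: an alarm is raised when the mean of the most recent window moves away from the mean of the window just before it. This is a generic sketch of the idea, not the algorithm used in MAIDS; the window size, threshold, and stream values are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List


def detect_shifts(stream: Iterable[float], window: int = 3,
                  threshold: float = 5.0) -> List[int]:
    """Return the indices at which the recent-window mean differs from the
    preceding-window mean by more than threshold.

    Only the last 2 * window values are kept, so memory use stays constant
    no matter how long the stream runs.
    """
    previous: Deque[float] = deque(maxlen=window)
    recent: Deque[float] = deque(maxlen=window)
    alarms: List[int] = []
    for i, x in enumerate(stream):
        if len(recent) == window:
            # The oldest recent value slides into the previous window.
            previous.append(recent[0])
        recent.append(x)
        if len(previous) == window:
            if abs(sum(recent) / window - sum(previous) / window) > threshold:
                alarms.append(i)
    return alarms


# Hypothetical sensor readings with a level shift starting at index 6.
readings = [10.0, 11.0, 10.0, 10.0, 11.0, 10.0, 30.0, 31.0, 30.0]
alarms = detect_shifts(readings)
print(alarms)
```

The function accepts any iterable, including a generator that yields values as they arrive, which matches the single-pass constraint that stream mining usually works under.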
D2K Infrastructure – Grid Powered
Major Challenges in Data Mining
• Mining methodology and user interaction
  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad-hoc data mining
  • Expression and visualization of data mining results
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem
• Performance and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed and incremental mining methods
Major Challenges in Data Mining
• Issues relating to the diversity of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
  • Application of discovered knowledge
    – Domain-specific data mining tools
    – Intelligent query answering
    – Process control and decision making
  • Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
  • Protection of data security, integrity, and privacy
Summary
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining paradigms: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
• Data mining framework
• Major issues in data mining
References
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. (A Very Special Thanks to Jiawei Han for Slide Use)
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
• G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.