Transcript of Data Mining Tools Overview & Tutorial Ahmed Sameh Prince Sultan University Department of Computer...
- Slide 1
- Data Mining Tools Overview & Tutorial. Ahmed Sameh, Prince
Sultan University, Department of Computer Science & Info Sys, May
2010. (Some slides belong to IBM)
- Slide 2
- Introduction Outline
  - Define data mining
  - Data mining vs. databases
  - Basic data mining tasks
  - Data mining development
  - Data mining issues
  - Goal: Provide an overview of data mining.
- Slide 3
- Introduction
  - Data is growing at a phenomenal rate
  - Users expect more sophisticated information
  - How? Uncover hidden information: data mining
- Slide 4
- Data Mining Definition
  - Finding hidden information in a database
  - Fit data to a model
  - Similar terms
    - Exploratory data analysis
    - Data driven discovery
    - Deductive learning
- Slide 5
- Data Mining Algorithm
  - Objective: Fit data to a model
    - Descriptive
    - Predictive
  - Preference: technique to choose the best model
  - Search: technique to search the data
    - Query
- Slide 6
- Database Processing vs. Data Mining Processing
  - Database processing
    - Query: well defined, SQL
    - Data: operational data
    - Output: precise, subset of database
  - Data mining processing
    - Query: poorly defined, no precise query language
    - Data: not operational data
    - Output: fuzzy, not a subset of database
- Slide 7
- Query Examples
  - Database
    - Find all credit applicants with last name of Smith.
    - Identify customers who have purchased more than $10,000 in the last month.
    - Find all customers who have purchased milk.
  - Data Mining
    - Find all credit applicants who are poor credit risks. (classification)
    - Identify customers with similar buying habits. (clustering)
    - Find all items which are frequently purchased with milk. (association rules)
- Slide 8
- Related Fields: Statistics, Machine Learning, Databases,
Visualization, Data Mining and Knowledge Discovery
- Slide 9
- Statistics, Machine Learning and Data Mining
  - Statistics
    - more theory-based
    - more focused on testing hypotheses
  - Machine learning
    - more heuristic
    - focused on improving performance of a learning agent
    - also looks at real-time learning and robotics, areas not part of data mining
  - Data Mining and Knowledge Discovery
    - integrates theory and heuristics
    - focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
  - Distinctions are fuzzy
- Slide 10
- Definition
  - A class of database applications that analyze data in a database using tools which look for trends or anomalies.
  - Data mining was invented by IBM.
- Slide 11
- Purpose
  - To look for hidden patterns or previously unknown relationships among the data in a group of data that can be used to predict future behavior.
  - Ex: Data mining software can help retail companies find customers with common interests.
- Slide 12
- Background Information
  - Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s.
  - Data mining tools are only now being applied to large-scale database systems.
- Slide 13
- The Need for Data Mining
  - The amount of raw data stored in corporate data warehouses is growing rapidly.
  - There is too much data and complexity that might be relevant to a specific problem.
  - Data mining promises to bridge the analytical gap by giving knowledge workers the tools to navigate this complex analytical space.
- Slide 14
- The Need for Data Mining, cont.
  - The need for information has resulted in the proliferation of data warehouses that integrate information from multiple sources to support decision making.
  - These warehouses often include data from external sources, such as customer demographics and household information.
- Slide 15
- Definition (Cont.) Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
  - Valid: The patterns hold in general.
  - Novel: We did not know the pattern beforehand.
  - Useful: We can devise actions from the patterns.
  - Understandable: We can interpret and comprehend the patterns.
- Slide 16
- Of Laws, Monsters, and Giants
  - Moore's law: processing capacity doubles every 18 months (CPU, cache, memory)
  - Its more aggressive cousin:
    - Disk storage capacity doubles every 9 months
  - What do the two laws combined produce? A rapidly growing gap between our ability to generate data, and our ability to make use of it.
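The size of that gap follows directly from the two doubling periods quoted on the slide. The sketch below is only arithmetic on those two figures (18 and 9 months); the function names are illustrative and not from the slides.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Growth after `months` for a quantity that doubles every period."""
    return 2.0 ** (months / doubling_period_months)


def storage_to_processing_gap(months: float) -> float:
    """Ratio of disk-storage growth (doubling every 9 months) to
    processing growth (doubling every 18 months), per the slide."""
    return growth_factor(months, 9.0) / growth_factor(months, 18.0)


# The gap itself doubles every 18 months: 2^(t/9) / 2^(t/18) = 2^(t/18).
for months in (18, 36, 72):
    print(months, growth_factor(months, 18.0),
          growth_factor(months, 9.0), storage_to_processing_gap(months))
```

Under these assumptions, storage grows 16-fold over three years while processing grows 4-fold, so the ratio between them is 4.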
- Slide 17
- What is Data Mining? Finding interesting structure in data
  - Structure: refers to statistical patterns, predictive models, hidden relationships
  - Examples of tasks addressed by Data Mining
    - Predictive Modeling (classification, regression)
    - Segmentation (Data Clustering)
    - Summarization
    - Visualization
- Slide 18
- Slide 19
- Major Application Areas for Data Mining Solutions
  - Advertising
  - Bioinformatics
  - Customer Relationship Management (CRM)
  - Database Marketing
  - Fraud Detection
  - eCommerce
  - Health Care
  - Investment/Securities
  - Manufacturing, Process Control
  - Sports and Entertainment
  - Telecommunications
  - Web
- Slide 20
- Data Mining
  - The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets.
    - Extremely large datasets
    - Discovery of the non-obvious
    - Useful knowledge that can improve processes
    - Cannot be done manually
  - Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind.
  - Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data.
- Slide 21
- Data Mining (cont.)
- Slide 22
- Data Mining (cont.)
  - Data mining is a step of the Knowledge Discovery in Databases (KDD) process
    - Data Warehousing
    - Data Selection
    - Data Preprocessing
    - Data Transformation
    - Data Mining
    - Interpretation/Evaluation
  - Data mining is sometimes referred to as KDD; DM and KDD tend to be used as synonyms
- Slide 23
- Data Mining Evaluation
- Slide 24
- Data Mining is Not
  - Data warehousing
  - SQL / Ad Hoc Queries / Reporting
  - Software Agents
  - Online Analytical Processing (OLAP)
  - Data Visualization
- Slide 25
- Data Mining Motivation
  - Changes in the Business Environment
    - Customers becoming more demanding
    - Markets are saturated
  - Databases today are huge:
    - More than 1,000,000 entities/records/rows
    - From 10 to 10,000 fields/attributes/variables
    - Gigabytes and terabytes
  - Databases are growing at an unprecedented rate
  - Decisions must be made rapidly
  - Decisions must be made with maximum knowledge
- Slide 26
- Why Use Data Mining Today?
  - Human analysis skills are inadequate:
    - Volume and dimensionality of the data
    - High data growth rate
  - Availability of:
    - Data
    - Storage
    - Computational power
    - Off-the-shelf software
    - Expertise
- Slide 27
- An Abundance of Data
  - Supermarket scanners, POS data
  - Preferred customer cards
  - Credit card transactions
  - Direct mail response
  - Call center records
  - ATM machines
  - Demographic data
  - Sensor networks
  - Cameras
  - Web server logs
  - Customer web site trails
- Slide 28
- Evolution of Database Technology
  - 1960s: IMS, network model
  - 1970s: The relational data model, first relational DBMS implementations
  - 1980s: Maturing RDBMS, application-specific DBMS (spatial data, scientific data, image data, etc.), OODBMS
  - 1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology
  - 2000s: High availability, zero-administration, seamless integration into business processes
  - 2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ???
- Slide 29
- Much Commercial Support
  - Many data mining tools
    - http://www.kdnuggets.com/software
  - Database systems with data mining support
  - Visualization tools
  - Data mining process support
  - Consultants
- Slide 30
- Why Use Data Mining Today? Competitive pressure!
  - "The secret of success is to know something that nobody else knows." -- Aristotle Onassis
  - Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
  - Personalization, CRM
  - The real-time enterprise
  - Systemic listening
  - Security, homeland defense
- Slide 31
- The Knowledge Discovery Process. Steps:
  1. Identify business problem
  2. Data mining
  3. Action
  4. Evaluation and measurement
  5. Deployment and integration into business processes
- Slide 32
- Data Mining Step in Detail
  - 2.1 Data preprocessing
    - Data selection: Identify target datasets and relevant fields
    - Data cleaning
      - Remove noise and outliers
      - Data transformation
      - Create common units
      - Generate new fields
  - 2.2 Data mining model construction
  - 2.3 Model evaluation
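A minimal sketch of the preprocessing sub-steps (common units, new fields, outlier removal), assuming made-up retail rows. The field names, the kg/g units, and the median-absolute-deviation rule for outliers are illustrative choices, not something the slides prescribe.

```python
from statistics import median

# Assumption: quantities arrive in either kilograms or grams.
UNITS_PER_KG = {"kg": 1, "g": 1000}


def preprocess(rows, threshold=3.0):
    """Normalize units, derive price_per_kg, then drop outliers."""
    cleaned = []
    for row in rows:
        # Create common units: express every quantity in kilograms.
        weight_kg = row["quantity"] / UNITS_PER_KG[row["unit"]]
        # Generate a new field from the existing ones.
        cleaned.append({
            "item": row["item"],
            "weight_kg": weight_kg,
            "price_per_kg": row["price"] / weight_kg,
        })

    # Remove outliers: drop rows far from the median, measured in units
    # of the median absolute deviation (MAD).
    values = [r["price_per_kg"] for r in cleaned]
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return cleaned
    return [r for r in cleaned
            if abs(r["price_per_kg"] - center) / mad <= threshold]


rows = [
    {"item": "rice", "quantity": 2, "unit": "kg", "price": 4.0},
    {"item": "rice", "quantity": 500, "unit": "g", "price": 1.1},
    {"item": "rice", "quantity": 1, "unit": "kg", "price": 1.9},
    {"item": "rice", "quantity": 1, "unit": "kg", "price": 2.1},
    {"item": "rice", "quantity": 1, "unit": "kg", "price": 200.0},  # likely a data-entry error
]
result = preprocess(rows)
print(len(result), [round(r["price_per_kg"], 2) for r in result])
```

The 200.0 row is dropped because its price per kilogram lies far outside the spread of the other four; the threshold is a tuning choice.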
- Slide 33
- Preprocessing and Mining
  - Data stages: Original Data, Target Data, Preprocessed Data, Patterns, Knowledge
  - Steps between stages: Data Integration and Selection, Preprocessing, Model Construction, Interpretation
- Slide 34
- Data Mining Techniques
  - Descriptive
    - Clustering
    - Association
    - Sequential Analysis
  - Predictive
    - Classification
      - Decision Tree
      - Rule Induction
      - Neural Networks
      - Nearest Neighbor Classification
    - Regression
- Slide 35
- Data Mining Models and Tasks
- Slide 36
- Basic Data Mining Tasks
  - Classification maps data into predefined groups or classes
    - Supervised learning
    - Pattern recognition
    - Prediction
  - Regression is used to map a data item to a real valued prediction variable.
  - Clustering groups similar data together into clusters.
    - Unsupervised learning
    - Segmentation
    - Partitioning
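To make the clustering task concrete, here is a small sketch of k-means (Lloyd's algorithm), one common clustering technique. The slide does not name an algorithm, so this choice, the sample points, and the fixed starting centroids are all assumptions made for illustration.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, [(0, 0), (10, 10)])
print(assignment)
```

No class labels are supplied, which is what makes this unsupervised: the two groups emerge from distances alone.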
- Slide 37
- Basic Data Mining Tasks (contd)
  - Summarization maps data into subsets with associated simple descriptions.
    - Characterization
    - Generalization
  - Link Analysis uncovers relationships among data.
    - Affinity Analysis
    - Association Rules
    - Sequential Analysis determines sequential patterns.
- Slide 38
- Ex: Time Series Analysis
  - Example: Stock Market
  - Predict future values
  - Determine similar patterns over time
  - Classify behavior
- Slide 39
- Data Mining vs. KDD
  - Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
  - Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
- Slide 40
- Data Mining Development (contributing areas): Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines, Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis, Neural Networks, Decision Tree Algorithms, Algorithm Design Techniques, Algorithm Analysis, Data Structures, Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
- Slide 41
- KDD Issues
  - Human Interaction
  - Overfitting
  - Outliers
  - Interpretation
  - Visualization
  - Large Datasets
  - High Dimensionality
- Slide 42
- KDD Issues (contd)
  - Multimedia Data
  - Missing Data
  - Irrelevant Data
  - Noisy Data
  - Changing Data
  - Integration
  - Application
- Slide 43
- Visualization Techniques
  - Graphical
  - Geometric
  - Icon-based
  - Pixel-based
  - Hierarchical
  - Hybrid
- Slide 44
- Data Mining Applications
- Slide 45
- Data Mining Applications: Retail
  - Performing basket analysis
    - Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
  - Sales forecasting
    - Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
  - Database marketing
    - Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer-label clothing or those who attend sales. This information can be used to focus cost-effective promotions.
  - Merchandise planning and allocation
    - When retailers add new stores, they can improve merchandise planning and allocation by examining patterns in stores with similar demographic characteristics. Retailers can also use data mining to determine the ideal layout for a specific store.
- Slide 46
- Data Mining Applications: Banking
  - Card marketing
    - By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing.
  - Cardholder pricing and profitability
    - Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes risk-based pricing.
  - Fraud detection
    - Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns.
  - Predictive life-cycle management
    - DM helps banks predict each customer's lifetime value and service each segment appropriately (for example, offering special deals and discounts).
- Slide 47
- Data Mining Applications: Telecommunication
  - Call detail record analysis
    - Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
  - Customer loyalty
    - Some customers repeatedly switch providers, or "churn", to take advantage of attractive incentives by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit.
- Slide 48
- Data Mining Applications: Other Applications
  - Customer segmentation
    - All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
  - Manufacturing
    - Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand.
  - Warranties
    - Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
  - Frequent flier incentives
    - Airlines can identify groups of customers that can be given incentives to fly more.
- Slide 49
- A producer wants to know...
  - Which are our lowest/highest margin customers?
  - Who are my customers and what products are they buying?
  - Which customers are most likely to go to the competition?
  - What impact will new products/services have on revenue and margins?
  - What product promotions have the biggest impact on revenue?
  - What is the most effective distribution channel?
- Slide 50
- Data, Data everywhere yet...
  - I can't find the data I need
    - data is scattered over the network
    - many versions, subtle differences
  - I can't get the data I need
    - need an expert to get the data
  - I can't understand the data I found
    - available data poorly documented
  - I can't use the data I found
    - results are unexpected
    - data needs to be transformed from one form to another
- Slide 51
- What is a Data Warehouse? A single, complete and consistent
store of data obtained from a variety of different sources made
available to end users in a way they can understand and use in a
business context. [Barry Devlin]
- Slide 52
- What are the users saying...
  - Data should be integrated across the enterprise
  - Summary data has a real value to the organization
  - Historical data holds the key to understanding data over time
  - What-if capabilities are required
- Slide 53
- What is Data Warehousing? A process of transforming data
into information and making it available to users in a timely
enough manner to make a difference. [Forrester Research, April 1996]
- Slide 54
- Very Large Data Bases
  - Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
  - Petabytes -- 10^15 bytes: Geographic Information Systems
  - Exabytes -- 10^18 bytes: National Medical Records
  - Zettabytes -- 10^21 bytes: Weather images
  - Yottabytes -- 10^24 bytes: Intelligence Agency Videos
- Slide 55
- Data Warehousing -- It is a process
  - Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions possible that were not previously possible
  - A decision support database maintained separately from the organization's operational database
- Slide 56
- Data Warehouse
  - A data warehouse is a
    - subject-oriented
    - integrated
    - time-varying
    - non-volatile
  - collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse, 1996
- Slide 57
- Data Warehousing Concepts
  - Decision support is key for companies wanting to turn their organizational data into an information asset
  - A traditional database is transaction-oriented, while a data warehouse is data-retrieval optimized for decision support
  - Data warehouse: "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process"
  - OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications
- Slide 58
- What does a data warehouse do?
  - integrates diverse information from various systems, which enables users to quickly produce powerful ad-hoc queries and perform complex analysis
  - creates an infrastructure for reusing the data in numerous ways
  - creates an open systems environment to make useful information easily accessible to authorized users
  - helps managers make informed decisions
- Slide 59
- Benefits of Data Warehousing
  - Potential high returns on investment
  - Competitive advantage
  - Increased productivity of corporate decision-makers
- Slide 60
- Comparison of OLTP and Data Warehousing
  - OLTP systems | Data warehousing systems
  - Holds current data | Holds historic data
  - Stores detailed data | Stores detailed, lightly summarized, and highly summarized data
  - Data is dynamic | Data is largely static
  - Repetitive processing | Ad hoc, unstructured, and heuristic processing
  - High level of transaction throughput | Medium to low transaction throughput
  - Predictable pattern of usage | Unpredictable pattern of usage
  - Transaction driven | Analysis driven
  - Application oriented | Subject oriented
  - Supports day-to-day decisions | Supports strategic decisions
  - Serves large number of clerical / operational users | Serves relatively lower number of managerial users
- Slide 61
- Data Warehouse Architecture (components): Operational Data, Load Manager, Warehouse Manager, Query Manager, Detailed Data, Lightly and Highly Summarized Data, Archive / Backup Data, Meta-Data, End-user Access Tools
- Slide 62
- End-user Access Tools
  - Reporting and query tools
  - Application development tools
  - Executive Information System (EIS) tools
  - Online Analytical Processing (OLAP) tools
  - Data mining tools
- Slide 63
- Data Warehousing Tools and Technologies
  - Extraction, Cleansing, and Transformation Tools
  - Data Warehouse DBMS
    - Load performance
    - Load processing
    - Data quality management
    - Query performance
    - Terabyte scalability
    - Networked data warehouse
    - Warehouse administration
    - Integrated dimensional tools
    - Advanced query functionality
- Slide 64
- Data Marts
  - A subset of a data warehouse that supports the requirements of a particular department or business function
- Slide 65
- Online Analytical Processing (OLAP)
  - OLAP
    - The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data
  - Multi-dimensional OLAP
    - Cubes of data
- Slide 66
- Problems of Data Warehousing
  - Underestimation of resources for data loading
  - Hidden problems with source systems
  - Required data not captured
  - Increased end-user demands
  - Data homogenization
  - High demand for resources
  - Data ownership
  - High maintenance
  - Long duration projects
  - Complexity of integration
- Slide 67
- Codd's Rules for OLAP
  - Multi-dimensional conceptual view
  - Transparency
  - Accessibility
  - Consistent reporting performance
  - Client-server architecture
  - Generic dimensionality
  - Dynamic sparse matrix handling
  - Multi-user support
  - Unrestricted cross-dimensional operations
  - Intuitive data manipulation
  - Flexible reporting
  - Unlimited dimensions and aggregation levels
- Slide 68
- OLAP Tools
  - Multi-dimensional OLAP (MOLAP)
    - Multi-dimensional DBMS (MDDBMS)
  - Relational OLAP (ROLAP)
    - Creation of multiple multi-dimensional views of the two-dimensional relations
  - Managed Query Environment (MQE)
    - Deliver selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally
- Slide 69
- Data Mining Definition
  - The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions
  - Knowledge discovery
    - Association rules
    - Sequential patterns
    - Classification trees
  - Goals
    - Prediction
    - Identification
    - Classification
    - Optimization
- Slide 70
- Data Mining Techniques
  - Predictive Modeling
    - Supervised training with two phases
    - Training phase: building a model using a large sample of historical data called the training set
    - Testing phase: trying the model on new data
  - Database Segmentation
  - Link Analysis
  - Deviation Detection
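The two phases of predictive modeling can be sketched with a 1-nearest-neighbor classifier: "training" stores labeled historical examples, and "testing" measures accuracy on examples the model has not seen. The classifier choice and the credit-risk figures (income, debt) are hypothetical, chosen only to keep the example short.

```python
import math


def predict(training_set, features):
    """Return the label of the closest training example (1-nearest-neighbor)."""
    _, label = min(training_set, key=lambda example: math.dist(example[0], features))
    return label


def accuracy(training_set, test_set):
    """Fraction of test examples whose predicted label matches the true label."""
    correct = sum(1 for features, label in test_set
                  if predict(training_set, features) == label)
    return correct / len(test_set)


# Training phase: historical applicants as ((income, debt), credit risk).
training_set = [
    ((50, 5), "good"), ((60, 10), "good"),
    ((20, 30), "poor"), ((25, 40), "poor"),
]
# Testing phase: held-out applicants with known outcomes.
test_set = [((55, 8), "good"), ((22, 35), "poor"), ((45, 12), "poor")]

print(accuracy(training_set, test_set))
```

The third test applicant is misclassified, which is the point of holding data back: accuracy on the training set alone would overstate how well the model generalizes.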
- Slide 71
- What are Data Mining Tasks?
  - Classification
  - Regression
  - Clustering
  - Summarization
  - Dependency modeling
  - Change and Deviation Detection
- Slide 72
- What are Data Mining Discoveries?
  - New Purchase Trends
  - Plan Investment Strategies
  - Detect Unauthorized Expenditure
  - Fraudulent Activities
  - Crime Trends
  - Smugglers-border crossing
- Slide 73
- Data Warehouse Architecture
  - Sources: Relational Databases, Legacy Data, Purchased Data, ERP Systems
  - Loading: Extraction, Cleansing, Optimized Loader
  - Data Warehouse Engine, Metadata Repository
  - Access: Analyze, Query
- Slide 74
- Data Warehouse for Decision Support & OLAP
  - Putting information technology to help the knowledge worker make faster and better decisions
    - Which of my customers are most likely to go to the competition?
    - What product promotions have the biggest impact on revenue?
    - How did the share price of software companies correlate with profits over the last 10 years?
- Slide 75
- Decision Support
  - Used to manage and control business
  - Data is historical or point-in-time
  - Optimized for inquiry rather than update
  - Use of the system is loosely defined and can be ad-hoc
  - Used by managers and end-users to understand the business and make judgements
- Slide 76
- Data Mining works with Warehouse Data
  - Data Warehousing provides the Enterprise with a memory
  - Data Mining provides the Enterprise with intelligence
- Slide 77
- We want to know...
  - Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
  - Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
  - If I raise the price of my product by Rs. 2, what is the effect on my ROI?
  - If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
  - If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
  - Which of my customers are likely to be the most loyal?
  - Data Mining helps extract such information
- Slide 78
- Application Areas
  - Industry | Application
  - Finance | Credit Card Analysis
  - Insurance | Claims, Fraud Analysis
  - Telecommunication | Call record analysis
  - Transport | Logistics management
  - Consumer goods | Promotion analysis
  - Data Service providers | Value added data
  - Utilities | Power usage analysis
- Slide 79
- Data Mining in Use
  - The US Government uses Data Mining to track fraud
  - A Supermarket becomes an information broker
  - Basketball teams use it to track game strategy
  - Cross Selling
  - Warranty Claims Routing
  - Holding on to Good Customers
  - Weeding out Bad Customers
- Slide 80
- What makes data mining possible?
  - Advances in the following areas are making data mining deployable:
    - data warehousing
    - better and more data (i.e., operational, behavioral, and demographic)
    - the emergence of easily deployed data mining tools and
    - the advent of new data mining techniques.
  - -- Gartner Group
- Slide 81
- Why Separate Data Warehouse?
  - Performance
    - Op dbs designed & tuned for known txs & workloads.
    - Complex OLAP queries would degrade perf. for op txs.
    - Special data organization, access & implementation methods needed for multidimensional views & queries.
  - Function
    - Missing data: Decision support requires historical data, which op dbs do not typically maintain.
    - Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.
    - Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
- Slide 82
- What are Operational Systems?
  - They are OLTP systems
  - Run mission critical applications
  - Need to work with stringent performance requirements for routine tasks
  - Used to run a business!
- Slide 83
- RDBMS used for OLTP
  - Database Systems have been used traditionally for OLTP
    - clerical data processing tasks
    - detailed, up to date data
    - structured repetitive tasks
    - read/update a few records
    - isolation, recovery and integrity are critical
- Slide 84
- Operational Systems
  - Run the business in real time
  - Based on up-to-the-second data
  - Optimized to handle large numbers of simple read/write transactions
  - Optimized for fast response to predefined transactions
  - Used by people who deal with customers, products -- clerks, salespeople etc.
  - They are increasingly used by customers
- Slide 85
- Examples of Operational Data
- Slide 86
- Application-Orientation vs. Subject-Orientation
  - Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
  - Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
- Slide 87
- OLTP vs. Data Warehouse
  - OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
  - Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
    - e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
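The example query constrains three dimensions at once (city, month, time of day) before aggregating. A rough sketch over in-memory call records follows; the record layout and the sample values are invented for illustration, and a real warehouse would answer this from indexed or pre-aggregated storage.

```python
from datetime import datetime
from statistics import mean


def average_call_amount(calls, city, month, start_hour, end_hour):
    """Average amount for calls in `city` during `month` that started
    at or after start_hour and before end_hour."""
    amounts = [
        call["amount"] for call in calls
        if call["city"] == city
        and call["start"].month == month
        and start_hour <= call["start"].hour < end_hour
    ]
    return mean(amounts) if amounts else None


calls = [
    {"city": "Pune", "start": datetime(2009, 12, 3, 10, 15), "amount": 10.0},
    {"city": "Pune", "start": datetime(2009, 12, 9, 16, 40), "amount": 20.0},
    {"city": "Pune", "start": datetime(2009, 12, 9, 20, 5), "amount": 50.0},    # outside 9AM-5PM
    {"city": "Mumbai", "start": datetime(2009, 12, 4, 11, 0), "amount": 70.0},  # other city
    {"city": "Pune", "start": datetime(2009, 11, 30, 12, 0), "amount": 90.0},   # other month
]
print(average_call_amount(calls, "Pune", 12, 9, 17))
```

Only the first two records satisfy all three conditions, so the average is taken over those two amounts.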
- Slide 88
- OLTP vs Data Warehouse
  - OLTP
    - Application Oriented
    - Used to run business
    - Detailed data
    - Current up to date
    - Isolated Data
    - Repetitive access
    - Clerical User
  - Warehouse (DSS)
    - Subject Oriented
    - Used to analyze business
    - Summarized and refined
    - Snapshot data
    - Integrated Data
    - Ad-hoc access
    - Knowledge User (Manager)
- Slide 89
- OLTP vs Data Warehouse
  - OLTP
    - Performance Sensitive
    - Few Records accessed at a time (tens)
    - Read/Update Access
    - No data redundancy
    - Database Size 100 MB - 100 GB
  - Data Warehouse
    - Performance relaxed
    - Large volumes accessed at a time (millions)
    - Mostly Read (Batch Update)
    - Redundancy present
    - Database Size 100 GB - few terabytes
- Slide 90
- OLTP vs Data Warehouse
  - OLTP
    - Transaction throughput is the performance metric
    - Thousands of users
    - Managed in entirety
  - Data Warehouse
    - Query throughput is the performance metric
    - Hundreds of users
    - Managed by subsets
- Slide 91
- To summarize...
  - OLTP Systems are used to "run" a business
  - The Data Warehouse helps to "optimize" the business
- Slide 92
- Why Now?
  - Data is being produced
  - ERP provides clean data
  - The computing power is available
  - The computing power is affordable
  - The competitive pressures are strong
  - Commercial products are available
- Slide 93
- Myths surrounding OLAP Servers and Data Marts
  - Data marts and OLAP servers are departmental solutions supporting a handful of users
  - Million dollar massively parallel hardware is needed to deliver fast response time for complex queries
  - OLAP servers require massive and unwieldy indices
  - Complex OLAP queries clog the network with data
  - Data warehouses must be at least 100 GB to be effective
  - Source -- Arbor Software Home Page
- Slide 94
- II. On-Line Analytical Processing (OLAP) Making Decision
Support Possible
- Slide 95
- Typical OLAP Queries
  - Write a multi-table join to compare sales for each product line YTD this year vs. last year.
  - Repeat the above process to find the top 5 product contributors to margin.
  - Repeat the above process to find the sales of a product line to new vs. existing customers.
  - Repeat the above process to find the customers that have had negative sales growth.
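The first query on this slide can be sketched as a join plus conditional aggregation. The example below uses Python's built-in `sqlite3` module with an in-memory database; the `products` and `sales` tables, their columns, and the sample rows are hypothetical, and dates are assumed to be stored as ISO `YYYY-MM-DD` text.

```python
import sqlite3


def ytd_sales_by_product_line(conn, this_year, as_of_month_day):
    """Year-to-date sales per product line, this year vs. last year.

    `as_of_month_day` is an 'MM-DD' cutoff applied to both years.
    """
    query = """
        SELECT p.product_line,
               SUM(CASE WHEN strftime('%Y', s.sale_date) = ? THEN s.amount ELSE 0 END),
               SUM(CASE WHEN strftime('%Y', s.sale_date) = ? THEN s.amount ELSE 0 END)
        FROM sales AS s
        JOIN products AS p ON p.product_id = s.product_id
        WHERE strftime('%m-%d', s.sale_date) <= ?
        GROUP BY p.product_line
        ORDER BY p.product_line
    """
    params = (str(this_year), str(this_year - 1), as_of_month_day)
    return conn.execute(query, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_line TEXT);
    CREATE TABLE sales (product_id INTEGER, sale_date TEXT, amount REAL);
""")
conn.executemany("INSERT INTO products VALUES (?, ?)", [(1, "Audio"), (2, "Video")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    (1, "2009-02-10", 100.0),
    (1, "2010-03-05", 150.0),
    (2, "2009-04-01", 200.0),
    (2, "2010-01-20", 120.0),
    (2, "2009-11-15", 999.0),  # after the cutoff, so not year-to-date
])
print(ytd_sales_by_product_line(conn, 2010, "05-31"))
```

Each follow-up question on the slide would need another hand-written query of this kind, which is the repetition that OLAP tools aim to remove.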
- Slide 96
- What Is OLAP?
  - Online Analytical Processing: coined by E. F. Codd in a 1994 paper contracted by Arbor Software*
  - Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
  - OLAP = Multidimensional Database
  - MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
  - ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
  - * Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
- Slide 97
- The OLAP Market
  - Rapid growth in the enterprise market
    - 1995: $700 Million
    - 1997: $2.1 Billion
  - Significant consolidation activity among major DBMS vendors
    - 10/94: Sybase acquires ExpressWay
    - 7/95: Oracle acquires Express
    - 11/95: Informix acquires Metacube
    - 1/97: Arbor partners up with IBM
    - 10/96: Microsoft acquires Panorama
  - Result: OLAP shifted from small vertical niche to mainstream DBMS category
- Slide 98
- Strengths of OLAP
  - It is a powerful visualization paradigm
  - It provides fast, interactive response times
  - It is good for analyzing time series
  - It can be useful to find some clusters and outliers
  - Many vendors offer OLAP tools
- Slide 99
- OLAP Is FASMI (Nigel Pendse, Richard Creeth - The OLAP Report)
  - Fast
  - Analysis
  - Shared
  - Multidimensional
  - Information
- Slide 100
- Multi-dimensional Data
  - "Hey... I sold $100M worth of goods"
  - Dimensions: Product, Region, Time
  - Hierarchical summarization paths
    - Product: Industry, Category, Product
    - Region: Country, Region, City, Office
    - Time: Year, Quarter, Month or Week, Day
  - [Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N), and Month (1 to 7)]
- Slide 101
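- A Visual Operation: Pivot (Rotate)
  - [Figure: cube with axes Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF), and Date (3/1 to 3/4), rotated to show Month, Region, and Product; sample cell values 10, 47, 30, 12]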
- Slide 102
- Slicing and Dicing
  - [Figure: cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special), and Regions (India, Far East, Europe); highlighted: The Telecomm Slice]
- Slide 103
- Roll-up and Drill Down
  - Hierarchy levels: Sales Channel, Region, Country, State, Location Address, Sales Representative
  - Roll up: higher level of aggregation
  - Drill-down: low-level details
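Roll-up and drill-down amount to grouping the same facts by a shorter or longer prefix of the hierarchy. A minimal sketch follows, using three of the slide's levels (region, country, state) and invented sales figures; the `roll_up` helper is illustrative only.

```python
from collections import defaultdict


def roll_up(facts, levels):
    """Sum sales grouped by the given hierarchy levels.

    Fewer levels give a higher level of aggregation (roll-up);
    more levels expose lower-level detail (drill-down).
    """
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[level] for level in levels)
        totals[key] += fact["sales"]
    return dict(totals)


facts = [
    {"region": "Asia", "country": "India", "state": "Maharashtra", "sales": 10.0},
    {"region": "Asia", "country": "India", "state": "Karnataka", "sales": 5.0},
    {"region": "Asia", "country": "Japan", "state": "Tokyo", "sales": 7.0},
    {"region": "Europe", "country": "France", "state": "Ile-de-France", "sales": 3.0},
]

by_region = roll_up(facts, ["region"])              # rolled up
by_country = roll_up(facts, ["region", "country"])  # drilled down one level
print(by_region)
print(by_country)
```

OLAP servers typically precompute or cache such aggregates so that moving between levels stays interactive.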
- Slide 104
- Results of Data Mining Include:
  - Forecasting what may happen in the future
  - Classifying people or things into groups by recognizing patterns
  - Clustering people or things into groups based on their attributes
  - Associating what events are likely to occur together
  - Sequencing what events are likely to lead to later events
- Slide 105
- Data mining is not
  - Brute-force crunching of bulk data
  - "Blind" application of algorithms
  - Going to find relationships where none exist
  - Presenting data in different ways
  - A database intensive task
  - A difficult to understand technology requiring an advanced degree in computer science
- Slide 106
- Data Mining versus OLAP
  - OLAP - On-line Analytical Processing
    - Provides you with a very good view of what is happening, but cannot predict what will happen in the future or why it is happening
- Slide 107
- Data Mining Versus Statistical Analysis
  - Data Mining
    - Originally developed to act as expert systems to solve problems
    - Less interested in the mechanics of the technique
    - If it makes sense then let's use it
    - Does not require assumptions to be made about data
    - Can find patterns in very large amounts of data
    - Requires understanding of data and business problem
  - Data Analysis
    - Tests for statistical correctness of models
      - Are statistical assumptions of models correct? E.g., is the R-Square good?
    - Hypothesis testing
      - Is the relationship significant? Use a t-test to validate significance
    - Tends to rely on sampling
    - Techniques are not optimised for large amounts of data
    - Requires strong statistical skills
- Slide 108
- Examples of What People are Doing with Data Mining:
  - Fraud/Non-Compliance
    - Anomaly detection
    - Isolate the factors that lead to fraud, waste and abuse
    - Target auditing and investigative efforts more effectively
  - Credit/Risk Scoring
  - Intrusion detection
  - Parts failure prediction
  - Recruiting/Attracting customers
  - Maximizing profitability (cross selling, identifying profitable customers)
  - Service Delivery and Customer Retention
    - Build profiles of customers likely to use which services
  - Web Mining
- Slide 109
- What data mining has done for... The US Internal Revenue Service
needed to improve customer service and... scheduled its workforce to
provide faster, more accurate answers to questions.
- Slide 110
- What data mining has done for... The US Drug Enforcement Agency
needed to be more effective in their drug busts and... analyzed
suspects' cell phone usage to focus investigations.
- Slide 111
- What data mining has done for... HSBC needed to cross-sell more
effectively by identifying profiles that would be interested in
higher yielding investments and... reduced direct mail costs by 30%
while garnering 95% of the campaign's revenue.
- Slide 112
- Suggestion: Predicting Washington
  - C-SPAN has launched a digital archive of 500,000 hours of audio debates.
  - Text mining or audio mining of these talks to reveal certain questions such as...
- Slide 113
- Example Application: Sports IBM Advanced Scout analyzes NBA
game statistics yShots blocked yAssists yFouls zGoogle: IBM
Advanced Scout
- Slide 114
- zDSS Agent uses intelligent agents data mining provides
multiple functions recognizes sales patterns among stores discovers
sales patterns by time of day day of year category of product etc.
swiftly identifies trends & shifts in customer tastes performs
Market Basket Analysis (MBA) analyzes Point-of-Sale or -Service
(POS) data identifies relationships among products and/or services
purchased E.g. A customer who buys Brand X slacks has a 35% chance
of buying Brand Y shirts. Agent tool is also used by other Fortune
1000 firms average ROI > 300 % average payback in 1 ~ 2 years
Market Basket Analysis
- Slide 157
- Case Based Reasoning (CBR) General scheme for a case based
reasoning (CBR) model. The target case is matched against similar
precedents in the historical database, such as cases A and B.
- Slide 158
- Case Based Reasoning (CBR) zLearning through the accumulation
of experience zKey issues Indexing: storing cases for quick,
effective access of precedents Retrieval: accessing the appropriate
precedent cases zAdvantages Explicit knowledge form recognizable to
humans No need to re-code knowledge for computer processing
zLimitations Retrieving precedents based on superficial features
E.g. Matching Indonesia with U.S. because both have similar
population size Traditional approach ignores the issue of
generalizing knowledge
- Slide 159
- Genetic Algorithm
  - Generation of candidate solutions using the procedures of biological evolution.
  - Procedure
    0. Initialize. Create a population of potential solutions ("organisms").
    1. Evaluate. Determine the level of "fitness" for each solution.
    2. Cull. Discard the poor solutions.
    3. Breed.
       a. Select 2 "fit" solutions to serve as parents.
       b. From the 2 parents, generate offspring.
          * Crossover: Cut the parents at random and switch the 2 halves.
          * Mutation: Randomly change the value in a parent solution.
    4. Repeat. Go back to Step 1 above.
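The procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own implementation: the bit-string encoding, the "count the 1s" fitness function, the population size, and the mutation rate are all assumptions chosen to keep the example small, and mutation is applied to the offspring rather than to a parent.

```python
import random

def fitness(bits):
    # Hypothetical fitness: number of 1-bits (a toy problem, not from the slides).
    return sum(bits)

def crossover(a, b, rng):
    # Cut both parents at one random point and join the two halves.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(bits, rng, rate=0.05):
    # Flip each bit independently with a small probability.
    return [1 - b if rng.random() < rate else b for b in bits]

def genetic_algorithm(n_bits=20, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    # Step 0: initialize a random population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Step 1: evaluate; Step 2: cull the weaker half.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        # Step 3: breed replacements from pairs of surviving parents.
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        pop = survivors + children  # Step 4: repeat.
    return max(pop, key=fitness)

best = genetic_algorithm()
```

Because the surviving half is carried over unchanged, the best fitness in the population never decreases from one generation to the next.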
- Slide 160
- Genetic Algorithm (Cont.) zAdvantages Applicable to a wide
range of problem domains. Robustness: can obtain solutions even
when the performance function is highly irregular or input data are
noisy. Implicit parallelism: can search in many directions
concurrently. zLimitations Slow, like neural networks. But:
computation can be distributed over multiple processors (unlike
neural networks) Source: www.pathology.washington.edu
- Slide 161
- Multistrategy Learning zEvery technique has advantages &
limitations zMultistrategy approach Take advantage of the strengths
of diverse techniques Circumvent the limitations of each
methodology
- Slide 162
- Types of Models
  - Prediction Models for Predicting and Classifying
    - Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
    - Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
  - Descriptive Models for Grouping and Finding Associations
    - Clustering/Grouping algorithms: K-means, Kohonen
    - Association algorithms: apriori, GRI
- Slide 163
- Neural Networks zDescription yDifficult interpretation yTends
to overfit the data yExtensive amount of training time yA lot of
data preparation yWorks with all data types
- Slide 164
- Rule Induction Description zIntuitive output zHandles all forms
of numeric data, as well as non-numeric (symbolic) data C5
Algorithm a special case of rule induction zTarget variable must be
symbolic
- Slide 165
- Apriori Description Seeks association rules in dataset Market
basket analysis Sequence discovery
- Slide 166
- Data Mining Is zThe automated process of finding relationships
and patterns in stored data z It is different from the use of SQL
queries and other business intelligence tools
- Slide 167
- Data Mining Is
  - Motivated by business need, large amounts of available data, and humans' limited cognitive processing abilities
  - Enabled by data warehousing, parallel processing, and data mining algorithms
- Slide 168
- Common Types of Information from Data Mining zAssociations --
identifies occurrences that are linked to a single event zSequences
-- identifies events that are linked over time zClassification --
recognizes patterns that describe the group to which an item
belongs
- Slide 169
- Common Types of Information from Data Mining zClustering --
discovers different groupings within the data zForecasting --
estimates future values
- Slide 170
- Commonly Used Data Mining Techniques zArtificial neural
networks zDecision trees zGenetic algorithms zNearest neighbor
method zRule induction
- Slide 171
- The Current State of Data Mining Tools
  - Many of the vendors are small companies
  - IBM and SAS have been in the market for some time, and more biggies are moving into this market
  - BI tools and RDBMS products are increasingly including basic data mining capabilities
  - Packaged data mining applications are becoming common
- Slide 172
- The Data Mining Process zRequires personnel with domain, data
warehousing, and data mining expertise zRequires data selection,
data extraction, data cleansing, and data transformation zMost data
mining tools work with highly granular flat files zIs an iterative
and interactive process
- Slide 173
- Why Data Mining zCredit ratings/targeted marketing : yGiven a
database of 100,000 names, which persons are the least likely to
default on their credit cards? yIdentify likely responders to sales
promotions zFraud detection yWhich types of transactions are likely
to be fraudulent, given the demographics and transactional history
of a particular customer? zCustomer relationship management :
yWhich of my customers are likely to be the most loyal, and which
are most likely to leave for a competitor? : Data Mining helps
extract such information
- Slide 174
- Applications zBanking: loan/credit card approval ypredict good
customers based on old customers zCustomer relationship management:
yidentify those who are likely to leave for a competitor. zTargeted
marketing: yidentify likely responders to promotions zFraud
detection: telecommunications, financial transactions yfrom an
online stream of event identify fraudulent events zManufacturing
and production: yautomatically adjust knobs when process parameter
changes
- Slide 175
- Applications (continued) zMedicine: disease outcome,
effectiveness of treatments yanalyze patient disease history: find
relationship between diseases zMolecular/Pharmaceutical: identify
new drugs zScientific data analysis: yidentify new galaxies by
searching for sub clusters zWeb site/store design and promotion:
yfind affinity of visitor to pages and modify layout
- Slide 176
- The KDD process
  - Problem formulation
  - Data collection
    - subset data: sampling might hurt if the data are highly skewed
    - feature selection: principal component analysis, heuristic search
  - Pre-processing: cleaning
    - name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values
  - Transformation
    - map complex objects (e.g. time series data) to features (e.g. frequency)
  - Choosing mining task and mining method
  - Result evaluation and visualization
  - Knowledge discovery is an iterative process
- Slide 177
- Relationship with other fields zOverlaps with machine learning,
statistics, artificial intelligence, databases, visualization but
more stress on yscalability of number of features and instances
ystress on algorithms and architectures whereas foundations of
methods and formulations provided by statistics and machine
learning. yautomation for handling large, heterogeneous data
- Slide 178
- Some basic operations zPredictive: yRegression yClassification
yCollaborative Filtering zDescriptive: yClustering / similarity
matching yAssociation rules and variants yDeviation detection
- Slide 179
- Classification
  - Given old data about customers and payments, predict new applicants' loan eligibility.
  - [Diagram: previous customers (age, salary, profession, location, customer type) feed a classifier that produces decision rules (Salary > 5 L, Prof. = Exec); a new applicant's data is then labeled good/bad]
- Slide 180
- Classification methods
  - Goal: predict class Ci = f(x1, x2, ..., xn)
  - Regression (linear or any other polynomial): a*x1 + b*x2 + c = Ci
  - Nearest neighbour
  - Decision tree classifier: divides the decision space into piecewise constant regions
  - Probabilistic/generative models
  - Neural networks: partition by non-linear boundaries
- Slide 181
- Nearest neighbor
  - Define proximity between instances, find neighbors of the new instance, and assign the majority class
  - Case based reasoning: when attributes are more complicated than real-valued
  - Pros: fast training
  - Cons: slow during application; no feature selection; notion of proximity vague
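A minimal sketch of the nearest-neighbor idea, assuming numeric feature vectors and squared Euclidean distance as the proximity measure (the slide leaves the measure open). The function name `knn_predict` and the toy training data are hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, class_label) pairs.
    # Proximity here is squared Euclidean distance (an assumed choice).
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    # Majority vote among the k closest training instances.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "good"), ((1, 2), "good"), ((2, 1), "good"),
         ((8, 8), "bad"), ((9, 8), "bad")]
```

There is no training step beyond storing the data, which is the "fast training" advantage; every prediction scans all stored instances, which is the "slow during application" cost.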
- Slide 182
- Clustering zUnsupervised learning when old data with class
labels not available e.g. when introducing a new product.
zGroup/cluster existing customers based on time series of payment
history such that similar customers in same cluster. zKey
requirement: Need a good measure of similarity between instances.
zIdentify micro-markets and develop policies for each
- Slide 183
- Applications zCustomer segmentation e.g. for targeted marketing
yGroup/cluster existing customers based on time series of payment
history such that similar customers in same cluster. yIdentify
micro-markets and develop policies for each zCollaborative
filtering: ygroup based on common items purchased zText clustering
zCompression
- Slide 184
- Distance functions
  - Numeric data: Euclidean, Manhattan distances
  - Categorical data: 0/1 to indicate presence/absence, followed by
    - Hamming distance (number of dissimilarities)
    - Jaccard coefficients: number of shared 1s / (number of 1s)
    - data dependent measures: similarity of A and B depends on co-occurrence with C
  - Combined numeric and categorical data
    - weighted normalized distance
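The measures named on this slide can be written out directly. This is a sketch under two assumptions: categorical data are already encoded as 0/1 vectors, and "number of 1s" in the Jaccard coefficient means positions where either vector has a 1. The weighted normalized distance is omitted because the slide does not give its formula.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # Number of positions where two 0/1 vectors disagree.
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    # Shared 1s divided by positions where either vector has a 1.
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return both / either if either else 1.0
```

Note that Hamming is a distance (larger means less alike) while Jaccard is a similarity (larger means more alike).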
- Slide 185
- Clustering methods zHierarchical clustering yagglomerative Vs
divisive ysingle link Vs complete link zPartitional clustering
ydistance-based: K-means ymodel-based: EM ydensity-based:
- Slide 186
- Agglomerative Hierarchical clustering
  - Given: matrix of similarity between every pair of points
  - Start with each point in a separate cluster and merge clusters based on some criterion:
    - Single link: merge the two clusters for which the minimum distance between two points from the two different clusters is the least
    - Complete link: merge two clusters such that all points in one cluster are close to all points in the other
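The difference between single link and complete link shows up even on a handful of one-dimensional points. The sketch below is an illustration only: it recomputes distances on the fly instead of using a precomputed similarity matrix, and the sample points are made up.

```python
def agglomerative(points, n_clusters, linkage="single"):
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    # Single link uses the closest pair; complete link uses the farthest pair.
    pick = min if linkage == "single" else max

    def cluster_distance(c1, c2):
        return pick(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Find and merge the two clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [1, 2, 4, 7, 11]
single = agglomerative(points, 2, "single")
complete = agglomerative(points, 2, "complete")
```

On these points single link chains 7 onto the left-hand cluster and leaves 11 alone, while complete link pairs 7 with 11 because merging 7 into {1, 2, 4} would create a cluster with a larger diameter.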
- Slide 187
- Partitional methods: K-means
  - Criteria: minimize the sum of squared distances
    - between each point and the centroid of its cluster, or
    - between each pair of points in the cluster
  - Algorithm:
    - Select an initial partition with K clusters: random, first K, or K separated points
    - Repeat until stabilization:
      - Assign each point to the closest cluster center
      - Generate new cluster centers
      - Adjust clusters by merging/splitting
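A minimal K-means sketch following the loop on this slide. It assumes the "first K" initialization and squared Euclidean distance, and it leaves out the optional merge/split adjustment; the sample points are hypothetical.

```python
def kmeans(points, k, max_iter=100):
    # Initial partition: use the first k points as centers ("first K" option).
    centers = [tuple(p) for p in points[:k]]
    assign = None
    for _ in range(max_iter):
        # Assign each point to the closest center (squared Euclidean distance).
        new_assign = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assign == assign:  # stabilized: no point changed cluster
            break
        assign = new_assign
        # Recompute each center as the centroid of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centers, labels = kmeans(points, k=2)
```

The result depends on the initial centers, which is why the slide lists several ways of choosing them.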
- Slide 188
- Collaborative Filtering zGiven database of user preferences,
predict preference of new user zExample: predict what new movies
you will like based on yyour past preferences yothers with similar
past preferences ytheir preferences for the new movies zExample:
predict what books/CDs a person may want to buy y(and suggest it,
or give discounts to tempt customer)
- Slide 189
- Association rules
  - Given set T of groups of items
  - Example: set of item sets purchased
  - Goal: find all rules on itemsets of the form a --> b such that
    - support of a and b > user threshold s
    - conditional probability (confidence) of b given a > user threshold c
  - Example: Milk --> bread
  - Purchase of product A --> service B
  - T (example item sets): Milk, cereal; Tea, milk; Tea, rice, bread; cereal
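Support and confidence can be computed directly from the item sets. The sketch below uses the four item sets that appear on the slide; how the flattened text splits into individual transactions is an assumption, and the function names are illustrative.

```python
transactions = [
    {"milk", "cereal"},
    {"tea", "milk"},
    {"tea", "rice", "bread"},
    {"cereal"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    # Estimated P(b | a) = support(a and b together) / support(a).
    return support(a | b, transactions) / support(a, transactions)

s = support({"milk", "cereal"}, transactions)
c = confidence({"milk"}, {"cereal"}, transactions)
```

A rule a --> b is reported only if its support exceeds the user threshold s and its confidence exceeds the user threshold c. Here the rule milk --> cereal has support 0.25 and confidence 0.5.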
- Slide 190
- Prevalent ≠ Interesting
  - Analysts already know about prevalent rules
  - Interesting rules are those that deviate from prior expectation
  - Mining's payoff is in finding surprising phenomena
  - [Cartoon: 1995, "Milk and cereal sell together!"; 1998, "Zzzz... Milk and cereal sell together!"]
- Slide 191
- Applications of fast itemset counting Find correlated events:
zApplications in medicine: find redundant tests zCross selling in
retail, banking zImprove predictive capability of classifiers that
assume attribute independence z New similarity measures of
categorical attributes [Mannila et al, KDD 98]
- Slide 192
- Application Areas IndustryApplication FinanceCredit Card
Analysis InsuranceClaims, Fraud Analysis TelecommunicationCall
record analysis TransportLogistics management Consumer
goodspromotion analysis Data Service providersValue added data
UtilitiesPower usage analysis
- Slide 193
- Usage scenarios zData warehouse mining: yassimilate data from
operational sources ymine static data zMining log data zContinuous
mining: example in process control zStages in mining: y data
selection pre-processing: cleaning transformation mining result
evaluation visualization
- Slide 194
- Mining market
  - Around 20 to 30 mining tool vendors
  - Major tool players: Clementine, IBM's Intelligent Miner, SGI's MineSet, SAS's Enterprise Miner
  - All pretty much the same set of tools
  - Many embedded products:
    - fraud detection
    - electronic commerce applications
    - health care
    - customer relationship management: Epiphany
- Slide 195
- Vertical integration: Mining on the web zWeb log analysis for
site design: ywhat are popular pages, ywhat links are hard to find.
zElectronic stores sales enhancements: yrecommendations,
advertisement: yCollaborative filtering: Net perception, Wisewire
yInventory control: what was a shopper looking for and could not
find..
- Slide 196
- State of art in mining OLAP integration
  - Decision trees [Information discovery, Cognos]
    - find factors influencing high profits
  - Clustering [Pilot software]
    - segment customers to define hierarchy on that dimension
  - Time series analysis [Seagate's Holos]
    - query for various shapes along time, e.g. spikes, outliers
  - Multi-level associations [Han et al.]
    - find association between members of dimensions
  - Sarawagi [VLDB2000]
- Slide 197
- Data Mining in Use zThe US Government uses Data Mining to track
fraud zA Supermarket becomes an information broker zBasketball
teams use it to track game strategy zCross Selling zTarget
Marketing zHolding on to Good Customers zWeeding out Bad
Customers
- Slide 198
- Some success stories zNetwork intrusion detection using a
combination of sequential rule discovery and classification tree on
4 GB DARPA data yWon over (manual) knowledge engineering approach
yhttp://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good
detailed description of the entire process zMajor US bank: customer
attrition prediction yFirst segment customers based on financial
behavior: found 3 segments yBuild attrition models for each of the
3 segments y40-50% of attritions were predicted == factor of 18
increase zTargeted credit marketing: major US banks yfind customer
segments based on 13 months credit balances ybuild another response
model based on surveys yincreased response 4 times -- 2%
- Slide 199
- Data Mining Tools: KnowledgeSeeker 4.5 What is
KnowledgeSeeker? Produced by ANGOSS Software Corporation, who focus
solely on data mining software. Offer training and consulting
services Produce data mining add-ins which accepts data from all
major databases Works with popular query and reporting,
spreadsheet, statistical and OLAP & ROLAP tools.
- Slide 200
- Major Competitors (Company / Software): Clementine 6.0,
Enterprise Miner 3.0, Intelligent Miner
- Slide 201
- Major Competitors (Company / Software): Mineset 3.1, Darwin,
Scenario
- Slide 202
- Current Applications
Manufacturing Used by the R.R. Donnelly & Sons commercial
printing company to improve process control, cut costs and increase
productivity. Used extensively by Hewlett Packard in their United
States manufacturing plants as a process control tool both to
analyze factors impacting product quality as well as to generate
rules for production control systems.
- Slide 203
- Current Applications
Auditing Used by the IRS to combat fraud, reduce risk, and increase
collection rates. Finance Used by the Canadian Imperial Bank of
Commerce (CIBC) to create models for fraud detection and risk
management.
- Slide 204
- Current Applications
CRM Telephony Used by US West to reduce churning and increase
customer loyalty for a new voice messaging technology.
- Slide 205
- Current Applications
Marketing Used by the Washington Post to improve their direct mail
targeting and to conduct survey analysis. Health Care Used by the
Oxford Transplant Center to discover factors affecting transplant
survival rates. Used by the University of Rochester Cancer Center
to study the effect of anxiety on chemotherapy-related nausea.
- Slide 206
- More Customers
- Slide 207
- Questions
  1. What percentage of people in the test group have high blood pressure with these characteristics: 66-year-old male regular smoker that has low to moderate salt consumption?
  2. Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages?
  3. If you are a 2% milk drinker, how many factors are still interesting?
  4. Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels?
  5. Grow an automatic tree. Look to see if gender is an interesting factor for a 55-year-old regular smoker who does not eat cheese.
- Slide 208
- Association zClassic market-basket analysis, which treats the
purchase of a number of items (for example, the contents of a
shopping basket) as a single transaction. zThis information can be
used to adjust inventories, modify floor or shelf layouts, or
introduce targeted promotional activities to increase overall sales
or move specific products. zExample : 80 percent of all
transactions in which beer was purchased also included potato
chips.
- Slide 209
- Sequence-based analysis zTraditional market-basket analysis
deals with a collection of items as part of a point-in-time
transaction. zto identify a typical set of purchases that might
predict the subsequent purchase of a specific item.
- Slide 210
- Clustering
  - Clustering approaches address segmentation problems.
  - These approaches assign records with a large number of attributes into a relatively small set of groups or "segments."
  - Example: Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign.
- Slide 211
- Classification
  - Most commonly applied data mining technique
  - Algorithm uses preclassified examples to determine the set of parameters required for proper discrimination.
  - Example: A classifier that is capable of identifying risky loans could be used to aid in the decision of whether to grant a loan to an individual.
- Slide 212
- Issues of Data Mining zPresent-day tools are strong but require
significant expertise to implement effectively. zIssues of Data
Mining ySusceptibility to "dirty" or irrelevant data. yInability to
"explain" results in human terms.
- Slide 213
- Issues zsusceptibility to "dirty" or irrelevant data yData
mining tools of today simply take everything they are given as
factual and draw the resulting conclusions. yUsers must take the
necessary precautions to ensure that the data being analyzed is
"clean."
- Slide 214
- Issues, cont.
  - Inability to "explain" results in human terms
    - Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms.
    - What good does the information do if you don't understand it?
- Slide 215
- Comparison with reporting, BI and OLAP Reporting zSimple
relationships zChoose the relevant factors zExamine all details
(Also applies to visualisation & simple statistics) Data Mining
zComplex relationships zAutomatically find the relevant factors
zShow only relevant details zPrediction
- Slide 216
- Comparison with Statistics Statistical analysis zMainly about
hypothesis testing zFocussed on precision Data mining zMainly about
hypothesis generation zFocussed on deployment
- Slide 217
- Example: data mining and customer processes
  - Insight: Who are my customers and why do they behave the way they do?
  - Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer?
  - Uses: targeted marketing, mail-shots, call-centres, adaptive web-sites
- Slide 218
- Example: data mining and fraud detection
  - Insight: How can (specific method of) fraud be recognised? What constitutes normal, abnormal and suspicious events?
  - Prediction: Recognise similarity to previous frauds (how similar?); spot abnormal events (how suspicious?)
  - Used by: banks, telcos, retail, government
- Slide 219
- Example: data mining and diagnosing cancer
  - Complex data from genetics
    - Challenging data mining problem
  - Find patterns of gene activation indicating different diseases / stages
  - "Changed the way I think about cancer" (oncologist from Chicago Children's Memorial Hospital)
- Slide 220
- Example: data mining and policing
  - Knowing the patterns helps plan effective crime prevention
  - Crime hot-spots understood better
  - Sift through mountains of crime reports
  - Identify crime series
  - "Other people save money using data mining; we save lives." (police force homicide specialist and data miner)
- Slide 221
- Data mining tools: Clementine and its philosophy
- Slide 222
- How to do data mining zLots of data mining operations zHow do
you glue them together to solve a problem? zHow do we actually do
data mining? zMethodology yNot just the right way, but any way
- Slide 223
- Myths about Data Mining (1) Data, Process and Tech Data mining
is all about massive data It can be, but some important datasets
are very small, and sampling is often appropriate Data mining is a
technical process Business analysts perform data mining every day
It is a business process Data mining is all about algorithms
Algorithms are a key tool But data mining is done by people, not by
algorithms Data mining is all about predictive accuracy It's about
usefulness Accuracy is only a small component
- Slide 224
- Myths about Data Mining (2) Data Quality Data mining only works
with clean data Cleaning the data is part of the data mining
process Need not be clean initially Data mining only works with
complete data Data mining works with whatever data you have.
Complete is good, incomplete is also ok. Data mining only works
with correct data Errors in data are inevitable. Data mining helps
you deal with them.
- Slide 225
- One last exploding myth Neural Networks are not useful when you
need to understand the patterns that you find (which is nearly
always in data mining) Related to over-simplistic views of data
mining Data mining techniques form a toolkit We often use
techniques in surprising ways E.g. Neural nets for field selection
Neural nets for pattern confirmation Neural nets combined with
other techniques for cross-checking What use is a pair of
pliers?
- Slide 226
- 226 Related Concepts Outline zDatabase/OLTP Systems zFuzzy Sets
and Logic zInformation Retrieval(Web Search Engines) zDimensional
Modeling zData Warehousing zOLAP/DSS zStatistics zMachine Learning
zPattern Matching Goal: Examine some areas which are related to
data mining.
- Slide 227
- Fuzzy Sets and Logic
  - Fuzzy set: the set membership function is a real-valued function with output in the range [0,1].
  - f(x): probability x is in F.
  - 1-f(x): probability x is not in F.
  - Ex:
    - T = {x | x is a person and x is tall}
    - Let f(x) be the probability that x is tall
    - Here f is the membership function
  - DM: Prediction and classification are fuzzy.
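A membership function for the "tall" example might look like the sketch below. The piecewise-linear shape and the 160 cm and 190 cm cut-offs are assumptions made for illustration; the slide only requires that the output lie in [0, 1].

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    # Degree of membership in the fuzzy set "tall", in [0, 1].
    # Below `low` a person is not tall at all; above `high` fully tall.
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)
```

A crisp (non-fuzzy) set would instead return only 0 or 1, so someone at 175 cm would have to be classed as either tall or not tall rather than "tall to degree 0.5".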
- Slide 228
- 228 Information Retrieval zInformation Retrieval (IR):
retrieving desired information from textual data. zLibrary Science
zDigital Libraries zWeb Search Engines zTraditionally keyword based
zSample query: Find all documents about data mining. DM: Similarity
measures; Mine text/Web data.
- Slide 229
- Prentice Hall 229 Dimensional Modeling zView data in a
hierarchical manner more as business executives might zUseful in
decision support systems and mining zDimension: collection of
logically related attributes; axis for modeling data. zFacts: data
stored zEx: Dimensions products, locations, date Facts quantity,
unit price DM: May view data as dimensional.
- Slide 230
- 230 Dimensional Modeling Queries zRoll Up: more general
dimension zDrill Down: more specific dimension zDimension
(Aggregation) Hierarchy zSQL uses aggregation zDecision Support
Systems (DSS): Computer systems and tools to assist managers in
making decisions and solving problems.
- Slide 231
- 231 Cube view of Data
- Slide 232
- 232 Data Warehousing z Subject-oriented, integrated,
time-variant, nonvolatile William Inmon zOperational Data: Data
used in day to day needs of company. zInformational Data: Supports
other functions such as planning and forecasting. zData mining
tools often access data warehouses rather than operational data.
DM: May access data in warehouse.
- Slide 233
- 233 OLAP zOnline Analytic Processing (OLAP): provides more
complex queries than OLTP. zOnLine Transaction Processing (OLTP):
traditional database/transaction processing. zDimensional data;
cube view zVisualization of operations: ySlice: examine sub-cube.
yDice: rotate cube to look at another dimension. yRoll Up/Drill
Down DM: May use OLAP queries.
- Slide 234
- 234 OLAP Operations Single CellMultiple CellsSliceDice Roll Up
Drill Down
- Slide 235
- 235 Statistics zSimple descriptive models zStatistical
inference: generalizing a model created from a sample of the data
to the entire dataset. zExploratory Data Analysis: yData can
actually drive the creation of the model yOpposite of traditional
statistical view. zData mining targeted to business user DM: Many
data mining methods come from statistical techniques.
- Slide 236
- 236 Machine Learning zMachine Learning: area of AI that
examines how to write programs that can learn. zOften used in
classification and prediction zSupervised Learning: learns by
example. zUnsupervised Learning: learns without knowledge of
correct answers. zMachine learning often deals with small static
datasets. DM: Uses many machine learning techniques.
- Slide 237
- Pattern Matching (Recognition) zPattern
Matching: finds occurrences of a predefined pattern in the data.
zApplications include speech recognition, information retrieval,
time series analysis. DM: Type of classification.
- Slide 238
- 238 DM vs. Related Topics
- Slide 239
- Data Mining Techniques Outline zStatistical
yPoint Estimation yModels Based on Summarization yBayes Theorem
yHypothesis Testing yRegression and Correlation zSimilarity
Measures zDecision Trees zNeural Networks yActivation Functions
zGenetic Algorithms Goal: Provide an overview of basic data mining
techniques
- Slide 240
- Point Estimation
  - Point estimate: estimate a population parameter.
  - May be made by calculating the parameter for a sample.
  - May be used to predict value for missing data.
  - Ex:
    - R contains 100 employees
    - 99 have salary information
    - Mean salary of these is $50,000
    - Use $50,000 as the value of the remaining employee's salary. Is this a good idea?
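The salary example amounts to mean imputation. The sketch below uses four made-up salaries instead of the slide's 100 employees so that the arithmetic is easy to follow; the variable names are illustrative.

```python
salaries = [48_000, 52_000, 50_000, None]  # None marks the missing salary

known = [s for s in salaries if s is not None]
mean_salary = sum(known) / len(known)  # point estimate of the population mean

# Fill the missing value with the estimate.
imputed = [s if s is not None else mean_salary for s in salaries]
```

Whether this is a good idea depends on the data: the mean is sensitive to a few very large salaries, and the missing employee may not resemble the others at all.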
- Slide 241
- Estimation Error
  - Bias: difference between expected value and actual value.
  - Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$
  - Why square?
  - Root Mean Square Error (RMSE)
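The three error measures can be approximated from a set of repeated estimates of a known value. In this sketch the expectations are replaced by simple averages, and both the estimates and the true value are hypothetical numbers.

```python
import math

def estimation_error(estimates, true_value):
    # Averages over repeated estimates stand in for the expectations.
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return bias, mse, math.sqrt(mse)

bias, mse, rmse = estimation_error([48.0, 52.0, 53.0, 51.0], true_value=50.0)
```

Squaring keeps positive and negative errors from cancelling and penalizes large errors more heavily; taking the root afterwards returns the measure to the units of the original data.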
- Slide 242
- 242 Expectation-Maximization (EM) zSolves estimation with
incomplete data. zObtain initial estimates for parameters.
zIteratively use estimates for missing data and continue until
convergence.
- Slide 243
- 243 Models Based on Summarization zVisualization: Frequency
distribution, mean, variance, median, mode, etc. zBox Plot:
- Slide 244
- Bayes Theorem
  - Posterior probability: $P(h_1 \mid x_i)$
  - Prior probability: $P(h_1)$
  - Bayes theorem: $P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$
  - Assign probabilities of hypotheses given a data value.
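A small worked example of assigning posterior probabilities to hypotheses. The priors and likelihoods are invented for illustration; the only rule used is Bayes' theorem, posterior = likelihood × prior / evidence, with the evidence obtained by summing over all hypotheses.

```python
def posterior(priors, likelihoods):
    # priors[j] = P(h_j); likelihoods[j] = P(x | h_j) for one observed value x.
    evidence = sum(p * l for p, l in zip(priors, likelihoods))  # P(x)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Hypothetical numbers: two hypotheses with priors 0.7 and 0.3.
post = posterior([0.7, 0.3], [0.2, 0.6])
```

Although the first hypothesis starts with the larger prior, the observed value is three times as likely under the second, so the second ends up with the larger posterior (0.5625 against 0.4375).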
- Slide 245
- Hypothesis Testing
  - Find model to explain behavior by creating and then testing a hypothesis about the data.
  - Exact opposite of usual DM approach.
  - $H_0$: null hypothesis; the hypothesis to be tested.
  - $H_1$: alternative hypothesis
- Slide 246
- Regression
  - Predict future values based on past values
  - Linear regression assumes a linear relationship exists: $y = c_0 + c_1 x_1 + \dots + c_n x_n$
  - Find values to best fit the data
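For a single predictor, the best-fitting coefficients have a closed form under least squares. The sketch below assumes that "best fit" means minimizing the sum of squared errors, which the slide does not state explicitly; `fit_line` and the sample data are illustrative.

```python
def fit_line(xs, ys):
    # Least-squares estimates of c0 and c1 in y = c0 + c1 * x.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

c0, c1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data generated by y = 1 + 2x
```

With more than one predictor the same idea applies, but the coefficients are usually found with matrix methods rather than these scalar formulas.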
- Slide 247
- Correlation
  - Examine the degree to which the values for two variables behave similarly.
  - Correlation coefficient r:
    - 1 = perfect correlation
    - -1 = perfect but opposite correlation
    - 0 = no correlation
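The coefficient r can be computed as the covariance of the two variables divided by the product of their spreads. This sketch assumes r is the Pearson correlation coefficient, which the slide does not name; the sample data in the note are made up.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Co-variation of the two variables around their means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

For example, `pearson_r([1, 2, 3], [2, 4, 6])` is 1 up to rounding, and reversing the second list gives -1.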
- Slide 248
- Similarity Measures
  - Determine similarity between two objects.
  - Similarity characteristics:
  - Alternatively, distance measures measure how unlike or dissimilar objects are.
- Slide 249
- 249 Distance Measures zMeasure dissimilarity between
objects
- Slide 250
- Decision Trees
  - Decision Tree (DT):
    - Tree where the root and each internal node is labeled with a question.
    - The arcs represent each possible answer to the associated question.
    - Each leaf node represents a prediction of a solution to the problem.
  - Popular technique for classification; leaf node indicates the class to which the corresponding tuple belongs.
- Slide 251
- Decision Trees
  - A Decision Tree Model is a computational model consisting of three parts:
    - Decision tree
    - Algorithm to create the tree
    - Algorithm that applies the tree to data
  - Creation of the tree is the most difficult part.
  - Processing is basically a search similar to that in a binary search tree (although DT may not be binary).
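Applying a tree to a tuple is a walk from the root to a leaf. The sketch below covers only that third part of the model: the tree is hand-written rather than learned, is binary for simplicity, and its questions, thresholds, and class labels are hypothetical.

```python
# Hypothetical tree: internal nodes ask a question, leaves hold a class.
tree = {
    "question": lambda t: t["salary"] > 50_000,
    "yes": {"label": "good"},
    "no": {
        "question": lambda t: t["profession"] == "exec",
        "yes": {"label": "good"},
        "no": {"label": "bad"},
    },
}

def classify(node, tuple_):
    # Follow the arc matching each answer until a leaf is reached.
    while "label" not in node:
        node = node["yes"] if node["question"](tuple_) else node["no"]
    return node["label"]
```

For example, `classify(tree, {"salary": 30_000, "profession": "clerk"})` answers "no" to both questions and ends at the "bad" leaf. The cost of classifying one tuple is proportional to the depth of the tree, as in a search tree lookup.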
- Slide 252
- Neural Networks
  - Based on observed functioning of human brain.
  - Also called Artificial Neural Networks (ANN)
  - Our view of neural networks is very simplistic.
  - We view a neural network (NN) from a graphical viewpoint.
  - Alternatively, a NN may be viewed from the perspective of matrices.
  - Used in pattern recognition, speech recognition, computer vision, and classification.
- Slide 253
- 253 Generating Rules zDecision tree can be converted into a
rule set zStraightforward conversion: yeach path to the leaf
becomes a rule makes an overly complex rule set zMore effective
conversions are not trivial y(e.g. C4.8 tests each node in
root-leaf path to see if it can be eliminated without loss in
accuracy)
- Slide 254
- 254 Covering algorithms zStrategy for generating a rule set
directly: for each class in turn find rule set that covers all
instances in it (excluding instances not in the class) zThis
approach is called a covering approach because at each stage a rule
is identified that covers some of the instances
- Slide 255
- Rules vs. trees
  - Corresponding decision tree: (produces exactly the same predictions)
  - But: rule sets can be clearer when decision trees suffer from replicated subtrees.
  - Also: in multi-class situations, a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account.
- Slide 256
- A simple covering algorithm
  - Generates a rule by adding tests that maximize the rule's accuracy.
  - Similar to the situation in decision trees: the problem of selecting an attribute to split on.
    - But: a decision tree inducer maximizes overall purity.
  - Each new test reduces the rule's coverage. (witten&eibe)
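A rough sketch of growing a single rule this way is shown below. It assumes accuracy is measured as p/t (the fraction of covered instances that belong to the target class) and breaks ties in favor of more positives; the `grow_rule` name and the toy data are invented for the example.

```python
def grow_rule(instances, target_class, class_key="class"):
    """Greedily add attribute=value tests until only target_class is covered."""
    rule = {}                                  # attribute -> required value
    covered = list(instances)
    while any(x[class_key] != target_class for x in covered):
        best = None                            # (accuracy, positives, attribute, value)
        for x in covered:
            for attr, value in x.items():
                if attr == class_key or attr in rule:
                    continue
                subset = [y for y in covered if y[attr] == value]
                pos = sum(1 for y in subset if y[class_key] == target_class)
                candidate = (pos / len(subset), pos, attr, value)
                if best is None or candidate[:2] > best[:2]:
                    best = candidate
        if best is None:                       # no attributes left to test
            break
        _, _, attr, value = best
        rule[attr] = value
        covered = [y for y in covered if y[attr] == value]  # coverage shrinks
    return rule, covered


# Hypothetical training instances.
instances = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "stay"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
    {"outlook": "rainy", "windy": "yes", "class": "stay"},
    {"outlook": "sunny", "windy": "no", "class": "play"},
]
rule, covered = grow_rule(instances, "play")
print(rule, len(covered))
```

Each added test narrows the covered set, so the rule gains accuracy at the cost of coverage, as the last bullet notes.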
- Slide 257
- Algorithm Components
  1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
  2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
  3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
  4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
  5. The data management technique used for storing, indexing, and retrieving data (critical when the data is too large to reside in memory)
- Slide 258
- Slide 259
- Models and Patterns
  - Models
    - Prediction: linear regression, piecewise linear
    - Probability Distributions
    - Structured Data
- Slide 260
- Models
  - Prediction: linear regression, piecewise linear, nonparametric regression
  - Probability Distributions
  - Structured Data
- Slide 261
- Slide 262
- Models
  - Prediction: linear regression, piecewise linear, nonparametric regression, classification
    - Classification: logistic regression, naive Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc.
  - Probability Distributions
  - Structured Data
- Slide 263
- Models
  - Prediction: linear regression, piecewise linear, nonparametric regression, classification
  - Probability Distributions: parametric models, mixtures of parametric models, graphical Markov models (categorical, continuous, mixed)
  - Structured Data
- Slide 264
- Models
  - Prediction: linear regression, piecewise linear, nonparametric regression, classification
  - Probability Distributions: parametric models, mixtures of parametric models, graphical Markov models (categorical, continuous, mixed)
  - Structured Data: time series, Markov models, Mixture Transition Distribution models, hidden Markov models, spatial models
- Slide 265
- Bias-Variance Tradeoff
  - High Bias - Low Variance
  - Low Bias - High Variance
  - Overfitting: modeling the random component
  - Score function should embody the compromise
- Slide 266
- Patterns
  - Global: clustering via partitioning, hierarchical clustering, mixture models
  - Local: outlier detection, changepoint detection, bump hunting, scan statistics, association rules
- Slide 267
- Scan Statistics via Permutation Tests
  - [Figure: a curve representing a road, with each x marking an accident. A red x denotes an injury accident; a black x means no injury.]
  - Is there a stretch of road where there is an unusually large fraction of injury accidents?
- Slide 268
- Scan with Fixed Window
  - If we know the length of the stretch of road that we seek, we could slide a window of that length along the road and find the most unusual window location.
  - [Figure: the road with accidents marked by x.]
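A fixed-window scan with a permutation test could be sketched as follows. The statistic (highest injury fraction among windows holding at least `min_count` accidents), the function names, and the accident data are all assumptions for illustration; the slides do not specify them.

```python
import random


def best_window(positions, injury, width, min_count=3):
    """Return (fraction, start) of the window with the highest injury fraction."""
    best = (0.0, None)
    for start in positions:                    # candidate windows begin at an accident
        inside = [inj for p, inj in zip(positions, injury)
                  if start <= p < start + width]
        if len(inside) >= min_count:           # skip nearly empty windows
            frac = sum(inside) / len(inside)
            if frac > best[0]:
                best = (frac, start)
    return best


def permutation_p_value(positions, injury, width, trials=999, seed=0):
    """Shuffle injury labels over the fixed locations and rescan each time."""
    rng = random.Random(seed)
    observed, _ = best_window(positions, injury, width)
    labels = list(injury)
    hits = 0
    for _ in range(trials):
        rng.shuffle(labels)
        stat, _ = best_window(positions, labels, width)
        if stat >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Hypothetical accidents: position along the road and injury flag (1 = injury).
positions = [0.5, 1.0, 1.5, 2.0, 4.0, 4.2, 4.4, 4.6, 7.0, 8.0, 9.0, 9.5]
injury = [0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0]

fraction, start = best_window(positions, injury, width=1.0)
p_value = permutation_p_value(positions, injury, width=1.0)
print(fraction, start, p_value)
```

Because the permutations rescan the whole road each time, the p-value accounts for having searched over many window locations rather than testing one window chosen in advance.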
- Slide 269
- Spatial-Temporal Scan Statistics
  - Spatial-temporal scan statistics use cylinders, where the height of the cylinder represents a time window.
- Slide 270
- Major Data Mining Tasks
  - Classification: predicting an item class
  - Clustering: finding clusters in data
  - Associations: e.g. A & B & C occur frequently
  - Visualization: to facilitate human discovery
  - Summarization: describing a group
  - Deviation Detection: finding changes
  - Estimation: predicting a continuous value
  - Link Analysis: finding relationships
- Slide 271
- Classification
  - Learn a method for predicting the instance class from pre-labeled (classified) instances.
  - Many approaches: Statistics, Decision Trees, Neural Networks, ...
- Slide 272
- Clustering
  - Find a natural grouping of instances given unlabeled data.
- Slide 273
- Association Rules & Frequent Itemsets
  - Transactions
  - Frequent Itemsets:
    - Milk, Bread (4)
    - Bread, Cereal (3)
    - Milk, Bread, Cereal (2)
  - Rules: Milk => Bread (66%)
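The counts and the rule percentage can be read as support counts and rule confidence. The transaction table itself is not in the transcript, so the seven transactions below are made up, chosen only so that they reproduce the slide's counts; the function names are likewise illustrative.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(transactions, lhs, rhs):
    """Confidence of lhs => rhs: support(lhs and rhs) / support(lhs)."""
    both = support_count(transactions, set(lhs) | set(rhs))
    return both / support_count(transactions, lhs)


# Hypothetical transactions (NOT the slide's table).
transactions = [
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread", "Cereal"},
    {"Milk", "Bread"},
    {"Milk", "Bread"},
    {"Milk"},
    {"Milk"},
    {"Bread", "Cereal"},
]

print(support_count(transactions, ["Milk", "Bread"]))
print(support_count(transactions, ["Bread", "Cereal"]))
print(support_count(transactions, ["Milk", "Bread", "Cereal"]))
print(confidence(transactions, ["Milk"], ["Bread"]))
```

With this data, Milk appears in six transactions and Milk with Bread in four, so the confidence of Milk => Bread is 4/6, which matches the slide's 66% if that figure is a truncated confidence value.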
- Slide 274
- Visualization & Data Mining
  - Visualizing the data to facilitate human discovery
  - Presenting the discovered results in a visually "nice" way
- Slide 275
- Summarization
  - Describe features of the selected group
  - Use natural language and graphics
  - Usually in combination with deviation detection or other methods
  - Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because..."
- Slide 276
- Data Mining Central Quest
  - Find true patterns and avoid overfitting (finding seemingly significant but really random patterns due to searching too many possibilities).
- Slide 277
- Classification
  - Learn a method for predicting the instance class from pre-labeled (classified) instances.
  - Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ...
  - Given a set of points from classes, what is the class of a new point?
- Slide 278
- Classification: Linear Regression
  - Linear regression: $w_0 + w_1 x + w_2 y \ge 0$
  - Regression computes the weights $w_i$ from the data to minimize the squared error of the fit.
  - Not flexible enough
- Slide 279
- Classification: Decision Trees
  - if X > 5 then blue
  - else if Y > 3 then blue
  - else if X > 2 then green
  - else blue
  - [Figure: X-Y plot with splits at X = 5, X = 2, and Y = 3.]
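The rule chain on this slide translates directly into nested tests. The function name `classify` is an assumption; the thresholds and colors are the ones given on the slide.

```python
def classify(x, y):
    """Apply the slide's tests in order; the first one that matches decides."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


for point in [(6, 0), (3, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Only points with 2 < X <= 5 and Y <= 3 come out green, so the tree carves an axis-parallel rectangle out of the plane.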
- Slide 280
- DECISION TREE
  - An internal node is a test on an attribute.
  - A branch represents an outcome of the test, e.g., Color=red.
  - A leaf node represents a class label or class label distribution.
  - At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
  - A new instance is classified by following a matching path to a leaf node.
- Slide 281
- Classification: Neural Nets
  - Can select more complex regions
  - Can be more accurate
  - Can also overfit the data: find patterns in random noise
- Slide 282
- Evaluating which method works the best for classification
  - No model is uniformly the best
  - Dimensions for comparison
    - speed of training
    - speed of model application
    - noise tolerance
    - explanation ability
  - Best results: hybrid, integrated models
- Slide 283
- Comparison of Major Classification Approaches
  - A hybrid method will have higher accuracy
- Slide 284
- Evaluation of Classification Models
  - How predictive is the model we learned?
  - Error on the training data is not a good indicator of performance on future data
    - The new data will probably not be exactly the same as the training data!
  - Overfitting (fitting the training data too precisely) usually leads to poor results on new data
- Slide 285
- Classification: Train, Validation, Test split
  - [Figure: data with known results is split into a training set, a validation set, and a final test set. The model builder learns from the training set, its predictions are evaluated on the validation set, and the final model receives a final evaluation on the final test set.]
- Slide 286
- Cross-validation
  - Cross-validation avoids overlapping test sets
    - First step: data is split into k subsets of equal size
    - Second step: each subset in turn is used for testing and the remainder for training
  - This is called k-fold cross-validation
  - Often the subsets are stratified before the cross-validation is performed
  - The error estimates are averaged to yield an overall error estimate
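The two steps of k-fold cross-validation can be sketched in a few lines. Everything named here (`k_fold_indices`, `cross_validate`, and the majority-class learner used as a stand-in model) is an assumption for illustration; in practice the data would normally be shuffled or stratified before splitting.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists; every index is tested exactly once."""
    # Fold sizes differ by at most one when k does not divide n.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validate(xs, ys, k, fit, error):
    """Average the per-fold error estimates into one overall estimate."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(error(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(errors) / len(errors)


# Stand-in learner: always predict the most common training label.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def error_rate(model, test_x, test_y):
    return sum(1 for y in test_y if y != model) / len(test_y)


xs = list(range(10))
ys = ["a"] * 8 + ["b"] * 2
cv_error = cross_validate(xs, ys, k=5, fit=fit_majority, error=error_rate)
print(cv_error)
```

In this unshuffled toy data the last fold contains only the rare class, which illustrates why the slide recommends stratifying the subsets.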
- Slide 287
- Cross-validation example:
  - Break up data into groups of the same size.
  - Hold aside one group for testing and use the rest to build the model.
  - Repeat.
- Slide 288
- More on cross-validation
  - Standard method for evaluation: stratified ten-fold cross-validation
  - Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
  - Stratification reduces the estimate's variance
  - Even better: repeated stratified cross-validation
    - E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
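One way to sketch repeated stratified cross-validation is to deal the indices of each class round-robin into the folds, then average over several reshuffled repetitions. The function names, the `evaluate(train, test)` callback, and the imbalanced toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Assign each index to one of k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    next_fold = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i in members:                      # deal indices round-robin
            folds[next_fold % k].append(i)
            next_fold += 1
    return folds


def repeated_stratified_cv(labels, k, repeats, evaluate, seed=0):
    """Average the k-fold error estimate over several reshuffled repetitions."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        fold_errors = []
        for test in stratified_folds(labels, k, rng):
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            fold_errors.append(evaluate(train, test))
        estimates.append(sum(fold_errors) / k)
    return sum(estimates) / repeats


labels = ["pos"] * 20 + ["neg"] * 80           # hypothetical imbalanced data set


def majority_error(train, test):
    """Error of a stand-in model that predicts the most common training label."""
    train_labels = [labels[i] for i in train]
    guess = max(set(train_labels), key=train_labels.count)
    return sum(1 for i in test if labels[i] != guess) / len(test)


estimate = repeated_stratified_cv(labels, k=10, repeats=10, evaluate=majority_error)
print(estimate)
```

Because every fold here holds the same 20/80 class mix as the full data set, the per-fold errors are identical, which is the variance reduction that stratification aims for.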
- Slide 289
- Clustering Methods
  - Many different methods and algorithms:
    - For numeric and/or symbolic data
    - Deterministic vs. probabilistic
    - Exclusive vs. overlapping
    - Hierarchical vs. flat
    - Top-down vs. bottom-up
- Slide 290
- Clustering Evaluation
  - Manual inspection
  - Benchmarking on existing labels
  - Cluster quality measures
    - distance measures
    - high similarity within a cluster, low across clusters
- Slide 291
- The distance function
  - Simplest case: one numeric attribute A
    - Distance(X,Y) = |A(X) - A(Y)|
  - Several numeric attributes:
    - Distance(X,Y) = Euclidean distance between X and Y
  - Nominal attributes: distance is set to 1 if values are different, 0 if they are equal
  - Are all attributes equally important?
    - Weighting the attributes might be necessary
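These cases can be combined into one function, sketched below. Folding the nominal 0/1 mismatch into a weighted Euclidean sum is an assumption about how the pieces fit together, as are the `distance` name and the `weights` parameter; numeric attributes are used unscaled here.

```python
import math


def distance(x, y, weights=None):
    """Weighted Euclidean-style distance over numeric and nominal attributes."""
    weights = weights or [1.0] * len(x)
    total = 0.0
    for a, b, w in zip(x, y, weights):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            diff = a - b                       # numeric attribute: plain difference
        else:
            diff = 0.0 if a == b else 1.0      # nominal attribute: 0 if equal, else 1
        total += w * diff * diff
    return math.sqrt(total)


print(distance((7,), (4,)))                    # one numeric attribute
print(distance((0, 0), (3, 4)))                # several numeric attributes
print(distance((1, "red"), (1, "blue")))       # nominal mismatch
print(distance((0, "a"), (2, "b"), weights=[1, 0]))  # second attribute ignored
```

Setting a weight to zero removes an attribute from the comparison, which is one simple way to express that not all attributes are equally important.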
- Slide 292
- Simple Clustering: K-means
  - Works with numeric data only
  1. Pick a number (K) of cluster centers (at random)
  2. Assign every item to its nearest cluster center (e.g. using Euclidean distance)
  3. Move each cluster center to the mean of its assigned items
  4. Repeat steps 2, 3 until convergence (change in cluster assignments less than a threshold)
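The four steps can be sketched as below. This version picks the initial centers from the data points and stops when no assignment changes at all (a threshold of zero), both of which are simplifying assumptions; the `k_means` name and the sample points are invented for the example.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means on tuples of numbers; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:       # step 4: nothing changed, stop
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned items.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                        # leave an empty cluster's center alone
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

The result depends on the random starting centers in general, so running the algorithm from several seeds and keeping the tightest clustering is a common precaution.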
- Slide 293
- Data Mining in CRM: Customer Life Cycle
  - Customer Life Cycle
    - The stages in the relationship between a customer and a business
  - Key stages in the customer lifecycle
    - Prospects: people who are not yet customers but are in the target market
    - Responders: prospects who show an interest in a product or service
    - Active Customers: people who are currently using the product or service
    - Former Customers: may be bad customers who did not pay their bills or who incurred high costs
  - It's important to know life cycle events (e.g. retirement)
- Slide 294
- Data Mining in CRM: Customer Life Cycle
  - What marketers want: increasing customer revenue and customer profitability
    - Up-sell
    - Cross-sell
    - Keeping the customers for a longer period of time
  - Solution: applying data mining
- Slide 295
- Data Mining in CRM
  - DM helps to
    - Determine the behavior surrounding a particular lifecycle event
    - Find other people in similar life stages and determine which customers are following similar behavior patterns
- Slide 296
- Data Mining in CRM (cont.)
  - [Figure: Data Warehouse, Data Mining, Campaign Management, Customer Profile, Customer Life Cycle Info.]
- Slide 297
- CRISP-DM: Benefits of a standard methodology
  - Communication
    - A common language
  - Repeatability
    - Rational structure
  - Education
    - How do I start?
  - www.crisp-dm.org
- Slide 298
- CRISP-DM Overview
  - An industry-standard process model for data mining.
  - Not sector-specific
  - Non-proprietary
  - CRISP-DM Phases:
    - Business Understanding
    - Data Understanding
    - Data Preparation
    - Modeling
    - Evaluation
    - Deployment
  - Not strictly ordered: respects the iterative aspect of data mining
  - www.crisp-dm.org
- Slide 299
- Rules vs. decision lists
  - PRISM with the outer loop removed generates a decision list for one class
    - Subsequent rules are designed for instances that are not covered by previous rules
    - But: order doesn't matter because all rules predict the same class
  - Outer loop considers all classes separately
    - No order dependence implied
  - Problems: overlapping rules, default rule required
- Slide 300
- Process Standardization
  - CRISP-DM: CRoss Industry Standard Process for Data Mining
  - Initiative launched Sept. 1996
  - SPSS/ISL, NCR, Daimler-Benz, OHRA
  - Funding from European Commission
  - Over 200 members of the CRISP-DM SIG worldwide
    - DM vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ...
    - System suppliers / consultants: Cap Gemini, ICL Retail, Deloitte & Touche, ...
    - End users: BT, ABB, Lloyds Bank, AirTouch, Experian, ...
- Slide 301
- CRISP-DM
  - Non-proprietary
  - Application/Industry neutral
  - Tool neutral
  - Focus on business issues
    - As well as technical analysis
  - Framework for guidance
  - Experience base
    - Templates for Analysis
- Slide 302
- Why CRISP-DM?
  - The data mining process must be reliable and repeatable by people with little data mining skill.
  - CRISP-DM provides a uniform framework for guidelines and for experience documentation.
  - CRISP-DM is flexible to account for differences:
Different business/agency problems Different data