DAT205 Advanced Data Mining Using SQL Server 2000
ZhaoHui Tang
Program Manager
SQL Server Analysis Services
Microsoft Corporation
Agenda
• Microsoft Data Mining Algorithms
• OLE DB for DM: data mining query
• Data Mining Case Study: Click Stream Analysis
– Customer Segmentation
– Site affiliation
– Target ads in banner
• Performance of Microsoft Data Mining Algorithms
• Q&A
Data Mining Algorithms in SQL Server 2000
Decision Tree
• Popular technique for classification and prediction tasks
– Churn analysis
– Credit risk analysis
– …
• Easy to understand
– Any path from the root node to a leaf forms a rule
• Fast to build
• Prediction based on leaf node statistics
• Variations: C4.5, C5, CART, CHAID
[Figure: example decision tree for "Attend College"]
All Students: 55% Yes, 45% No
– IQ = High: 79% Yes, 21% No
  – Parent Income = High: 94% Yes, 6% No
  – Parent Income = Low: 69% Yes, 31% No
– IQ <> High: 35% Yes, 65% No
How the tree works
College Plan counts by attribute value:
Attribute             Value    Yes    No
IQ                    High     300    100
IQ                    Medium   500    1000
IQ                    Low      200    900
Parent Encouragement  True     700    400
Parent Encouragement  False    300    1600
Parent Income         High     400    400
Parent Income         False    600    1600
Gender                Male     500    1100
Gender                Female   500    900
[Figure: four bar charts of College Plan Yes/No counts, by IQ (High, Medium, Low), Parent Income (PI=High, PI=FALSE), Parent Encouragement (PE=TRUE, PE=FALSE), and Gender (Male, Female)]
Split recursively
College Plan, All Students: 33% Yes, 67% No
– Parent Encouragement = True: 63% Yes, 37% No
– Parent Encouragement = False: 16% Yes, 84% No
College Plan counts by attribute value (Parent Encouragement = True branch):
Attribute             Value    Yes    No
IQ                    High     200    50
IQ                    Medium   400    250
IQ                    Low      100    100
Parent Encouragement  True     700    400
Parent Encouragement  False    0      0
Parent Income         High     300    100
Parent Income         False    400    300
Gender                Male     400    250
Gender                Female   250    150
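The choice of Parent Encouragement as the first split can be checked with a small calculation. The sketch below is only an illustration: it scores each attribute by entropy-based information gain, using the counts from the "How tree works" slide. It is not the Microsoft Decision Trees implementation (which the deck says can also use a Bayesian score), and the names `COUNTS`, `entropy`, and `information_gain` are introduced here for the example.

```python
import math

# College Plan counts from the "How tree works" slide: value -> (yes, no).
COUNTS = {
    "IQ": {"High": (300, 100), "Medium": (500, 1000), "Low": (200, 900)},
    "Parent Encouragement": {"True": (700, 400), "False": (300, 1600)},
    "Parent Income": {"High": (400, 400), "False": (600, 1600)},
    "Gender": {"Male": (500, 1100), "Female": (500, 900)},
}


def entropy(yes, no):
    """Shannon entropy (in bits) of a yes/no distribution given as counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # skip empty classes: 0 * log(0) is treated as 0
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(split):
    """Entropy of the parent node minus the weighted entropy of its children."""
    yes_total = sum(y for y, _ in split.values())
    no_total = sum(n for _, n in split.values())
    total = yes_total + no_total
    weighted = sum((y + n) / total * entropy(y, n) for y, n in split.values())
    return entropy(yes_total, no_total) - weighted


gains = {name: information_gain(split) for name, split in COUNTS.items()}
best = max(gains, key=gains.get)  # attribute with the highest gain
```

With these counts, `best` is Parent Encouragement, which matches the first split shown on the slide. Applying the same scoring to the counts inside each branch gives the recursive step.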
Microsoft Decision Trees
• Probabilistic classification tree
• Splitting methods: Bayesian score and entropy
• Forward pruning
• Tree shape: binary and n-ary trees
• Scalable framework
Clustering Algorithm (EM)
• A popular method for customer segmentation, mailing lists, profiling…
• Algorithm process
– Assign a set of initial points
– Assign an initial cluster to each point
– Assign data points to each cluster with a probability
– Compute a new central point for each cluster based on a weighted computation
– Cycle until convergence
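The steps above can be sketched in a few lines. This is a deliberately simplified, one-dimensional illustration, not the Microsoft Clustering implementation. It assumes Gaussian clusters with a fixed, shared spread (`sigma`) and equal cluster weights, and the function name `em_cluster_1d` and the sample points are made up for the example.

```python
import math


def em_cluster_1d(points, centers, sigma=1.0, tol=1e-6, max_iter=100):
    """Soft-assign 1-D points to clusters and re-estimate the cluster centers.

    Simplifying assumptions: every cluster is a Gaussian with the same fixed
    sigma and equal weight, so only the centers are learned.
    """
    centers = list(centers)
    resp = []
    for _ in range(max_iter):
        # E-step: probability of each cluster for each point.
        resp = []
        for x in points:
            weights = [math.exp(-((x - c) ** 2) / (2 * sigma ** 2)) for c in centers]
            total = sum(weights)
            if total == 0.0:  # point is far from every center: split it evenly
                resp.append([1.0 / len(centers)] * len(centers))
            else:
                resp.append([w / total for w in weights])
        # M-step: each new center is the probability-weighted mean of the points.
        new_centers = []
        for k, old in enumerate(centers):
            mass = sum(r[k] for r in resp)
            if mass > 0.0:
                new_centers.append(sum(r[k] * x for r, x in zip(resp, points)) / mass)
            else:
                new_centers.append(old)
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # converged: centers stopped moving
            break
    return centers, resp


# Hypothetical data: two loose groups, around 1 and around 9.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, responsibilities = em_cluster_1d(points, centers=[0.0, 5.0])
```

Because each point belongs to every cluster with some probability, a point between two groups pulls on both centers in proportion to those probabilities. That is the "weighted computation" in the list above.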
EM Illustration
[Figure: EM illustration with three marked points (X)]
Microsoft Clustering Algorithm (Scalable EM)
[Figure: flowchart with the boxes Data, Fill Buffer, Build/Update Model, Identify Data to be Compressed, Compressed Data / Sufficient Stats, Stop?, and Final Model]
OLE DB for Data Mining
OLE DB for DM
• Industry standard for data mining
• Based on existing technologies
– SQL
– OLE DB
• Defines common concepts for DM
– Case, Nested Case
– Mining Model
– Model Creation
– Model Training
– Prediction
• Language-based API
Customer Table
Customer ID  Profession  Income  Gender  Risk
1            Engineer    85      Male    No
2            Worker      40      Male    Yes
3            Doctor      90      Female  No
4            Teacher     50      Female  No
5            Worker      45      Male    No
…            …           …       …       …
DM Query Language
Create Mining Model CreditRisk
(CustomerID long key,
 Gender text discrete,
 Income long continuous,
 Profession text discrete,
 Risk text discrete predict)
Using Microsoft_Decision_Trees

Insert into CreditRisk
(CustomerID, Gender, Income, Profession, Risk)
Select CustomerID, Gender, Income, Profession, Risk
From Customers

Select NewCustomers.CustomerID, CreditRisk.Risk, PredictProbability(CreditRisk.Risk)
From CreditRisk Prediction Join NewCustomers
On CreditRisk.Gender = NewCustomers.Gender
And CreditRisk.Income = NewCustomers.Income
And CreditRisk.Profession = NewCustomers.Profession
Schema Rowsets
• Tabular data that provides metadata
• List of schema rowsets in OLE DB for DM
– Mining_Services
– Mining_Service_Parameters
– Mining_Models
– Mining_Columns
– Mining_Model_Contents
– Model_Content_PMML
Mining Model Contents Schema Rowsets
Schema Rowsets & Thin Client Browser
Case Study: Click Stream Analysis
Schema
Customer table: CustomerGuid, DayTimeOnLine, NightTimeOnLine, BrowserType, EmailTime, ChatTime, GeoLocation
WebClick table: CustomerGuid, URLCategory, Time, Duration, ReferPage
Web Customer Segmentation
Web Visitors Segmentation
Segmentation based on Customer table
Create Mining Model CustomerClustering
(CustomerID text key,
 DayTimeOnline long continuous,
 NightTimeOnline long continuous,
 BrowserType text discrete,
 ChatTime long continuous,
 EmailTime long continuous,
 GeoLocation text discrete
)
Using Microsoft_Clustering
Segmentation based on Customer and WebClick
Create Mining Model CustomerClustering
(CustomerID text key,
 DayTimeOnline long continuous,
 NightTimeOnline long continuous,
 BrowserType text discrete,
 ChatTime long continuous,
 EmailTime long continuous,
 GeoLocation text discrete,
 WebClick table (
   UrlCategory text key )
)
Using Microsoft_Clustering
MSFTies Segmentation
Web Site Affiliation
Association analysis using Microsoft Decision Trees
[Figure: four decision trees predicting Business, Insurance, Stock, and Loan; each tree splits on whether the visitor viewed other categories (Insurance, Loan, Stock, Business, Shopping)]
Site Affiliation
Create Mining Model SiteAffiliation
(CustomerID text key,
 WebClick table predict (
   UrlCategory text key )
)
Using Microsoft_Decision_Trees

Insert into SiteAffiliation (CustomerID, WebClick (skip, UrlCategory))
OpenRowset('MSDataShape',
  'data provider=SQLOLEDB;Server=myserver;UID=me;PWD=mypass',
  'Shape {Select CustomerID from Customer}
   Append ({Select CustomerID, URLCategory from WebClick}
   relate CustomerID to CustomerID) as WebClick'
)
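The Shape / Append / relate clause turns two flat tables into nested cases: one row per customer, with that customer's WebClick rows attached as a nested table. The sketch below illustrates that grouping in plain Python. It is not what the MSDataShape provider does internally, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def shape_nested_cases(customers, web_clicks):
    """Attach each customer's URL categories as a nested list.

    customers:  iterable of CustomerID values (the parent rowset)
    web_clicks: iterable of (CustomerID, URLCategory) rows (the child rowset)
    """
    clicks_by_customer = defaultdict(list)
    for customer_id, url_category in web_clicks:
        clicks_by_customer[customer_id].append(url_category)
    # One case per parent row, related on CustomerID; customers without
    # clicks still produce a case with an empty nested table.
    return [
        {"CustomerID": cid, "WebClick": clicks_by_customer.get(cid, [])}
        for cid in customers
    ]


# Hypothetical sample rows, for illustration only.
customers = ["c1", "c2", "c3"]
web_clicks = [("c1", "Business"), ("c1", "Stock"), ("c2", "Insurance")]
cases = shape_nested_cases(customers, web_clicks)
```

Each element of `cases` corresponds to one training case for the SiteAffiliation model: the key column plus a nested table keyed by UrlCategory.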
Path Prediction
Singleton Prediction
Select Flattened
  TopCount((Select URLCategory, $AdjustedProbability as prob
            From Predict([Web Click], INCLUDE_STATISTICS, EXCLUSIVE)), prob, 5)
From
  WebLog PREDICTION JOIN
  (Select (Select 'Business' as URLCategory) union (Select 'Telecom' as URLCategory) as WebClick) as input
On
  WebLog.[Web Click].URLCategory = input.WebClick.URLCategory
Architecture
[Figure: architecture for real-time prediction; components shown: Web Customer, Internet, IIS, ASP, ADO/DSO, DM Provider, DMM]
Performance of DM Algorithms
DM Performance Study
• Joint effort between Unisys and Microsoft
• Two parts of the white paper:
– First part: use AS2k to build DM models for a banking business scenario
– Second part: performance results of the DM algorithms study
• Some results in this session…
• Details in the paper and SQL Server Magazine articles…
Data Source for DMMs
Training Performance Results…
Sample Business Question for Non-Nested MDT
1. Identify those customers who are most likely to churn (leave), based on customer demographic information.
Non-Nested: Training Times for Varying Number of Input Attributes
[Figure: line chart of training time (minutes) versus number of attributes]
Assumptions:
• 1 million cases
• 25 states
• 1 predictable attribute
Observations:
Input Attributes  Training Time (minutes)
10                4.08
20                7.27
50                31.54
100               40.55
200               129.35
Non-Nested: Training Times for Varying Number of Cases
[Figure: line chart of training time (minutes) versus number of cases]
Assumptions:
• 20 attributes
• 25 states
• 1 predictable attribute
Observations:
Cases        Training Time (minutes)
10,000       0.38
1,000,000    11.32
5,000,000    34.19
10,000,000   100.53
Sample Business Question for Nested MDT
2. Find the list of other products that the customer may be interested in, based on the products the customer has purchased.
Nested Cases: Training Times for Varying Sample Size of Case Table
[Figure: line chart of training time (minutes) versus number of master cases]
Assumptions:
• Avg. customer purchases = 25
• States in nested = 200
• Nested key predictable
Observations:
Master Cases  Training Time (minutes)
10,000        15.09
50,000        67.79
100,000       120.88
200,000       240.62
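One way to read the table above is to normalize each training time by the number of master cases. The short sketch below does that arithmetic. The figures are copied from the table, and the helper name `minutes_per_10k` is introduced here for the example.

```python
# Master cases -> training time in minutes, copied from the table above.
NESTED_TRAINING = {
    10_000: 15.09,
    50_000: 67.79,
    100_000: 120.88,
    200_000: 240.62,
}


def minutes_per_10k(results):
    """Training minutes spent per 10,000 master cases, for each sample size."""
    return {cases: minutes / (cases / 10_000) for cases, minutes in results.items()}


per_10k = minutes_per_10k(NESTED_TRAINING)
```

Under the stated assumptions, the cost per 10,000 master cases stays between roughly 12 and 15 minutes across the four sample sizes. That suggests training time grows close to linearly with the number of master cases over this range.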
Nested Cases: Training Times for Varying Number of Products Purchased per Customer
Assumptions:
• 200,000 cases
• 1,000 products in nested
Observations:
Nested Cases  Training Time (minutes)
10            85.26
25            120.82
50            172.96
100           281.65
For more info…
• DM URL
– www.microsoft.com/data/oledb
– www.microsoft.com/data/oledb/DMResKit.htm
• Newsgroups:
– Microsoft.public.SQLserver.datamining
– Communities.msn.com/AnalysisServicesDataMining
• White papers:
– Performance paper:
  www.unisys.com/windows2000/default-07.asp
  www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp
Don't forget to complete the on-line Session Feedback form on the Attendee Web site
https://web.mseventseurope.com/teched/