Computational and Statistical Issues in Data-Mining
Yoav Freund
Banter Inc.
Plan of talk
• Two large-scale classification problems.
• Generative versus Predictive modeling
• Boosting
• Applications of boosting
• Computational issues in data-mining.
AT&T customer classification
• Distinguish business vs. residence customers
  – Classification unavailable for about 30% of known customers.
  – Calculate a “Buizocity” score.
• Using statistics from call-detail records
  – Records contain: calling number, called number, time of day, length of call.
Freund, Mason, Rogers, Pregibon, Cortes 2000
Massive datasets
• 260 million calls / day
• 230 million telephone numbers to be classified.
Paul Viola’s face recognizer
[Figure: example “Faces” and “Non-Faces” training image patches]
Training data: 5,000 faces, 10^8 non-faces
Application of face detector
Many uses:
- User Interfaces
- Interactive Agents
- Security Systems
- Video Compression
- Image Database Analysis
Generative vs. Predictive models
Toy Example
• Computer receives telephone call
• Measures Pitch of voice
• Decides gender of caller
[Diagram: human voice → pitch measurement → decision “Male” or “Female”]
Generative modeling
[Plot: probability vs. voice pitch, with one Gaussian fitted per class: (mean1, var1) and (mean2, var2)]
Discriminative approach
[Plot: number of mistakes vs. voice-pitch threshold]
Ill-behaved data
[Plot: probability vs. voice pitch, where the fitted class means (mean1, mean2) are distorted by ill-behaved data, shown together with the number-of-mistakes curve for the discriminative threshold]
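To make the contrast concrete, here is a minimal sketch (my own illustration, not code from the talk) of the two approaches on the voice-pitch toy problem: the generative model fits one Gaussian per class and classifies by the larger class-conditional likelihood, while the discriminative approach simply scans pitch thresholds and keeps the one with the fewest training mistakes. The synthetic data, class means, and outlier fraction are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic voice-pitch data (Hz): two classes, plus a few high-pitch outliers in class 0
pitch0 = np.concatenate([rng.normal(120, 15, 200), rng.normal(300, 5, 10)])  # class 0 + outliers
pitch1 = rng.normal(210, 20, 200)                                            # class 1
x = np.concatenate([pitch0, pitch1])
y = np.concatenate([np.zeros(len(pitch0)), np.ones(len(pitch1))])

# Generative: estimate (mean, var) per class, classify by the larger class-conditional likelihood
def gaussian_loglik(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

m0, v0 = pitch0.mean(), pitch0.var()
m1, v1 = pitch1.mean(), pitch1.var()
gen_pred = (gaussian_loglik(x, m1, v1) > gaussian_loglik(x, m0, v0)).astype(float)

# Discriminative: pick the single pitch threshold with the fewest training mistakes
thresholds = np.sort(x)
errors = [np.mean((x > t).astype(float) != y) for t in thresholds]
best_t = thresholds[int(np.argmin(errors))]
disc_pred = (x > best_t).astype(float)

print("generative training error:    ", round(float(np.mean(gen_pred != y)), 3))
print("discriminative training error:", round(float(np.mean(disc_pred != y)), 3),
      "at threshold", round(float(best_t), 1))
```

The outliers inflate the fitted variance of class 0 and pull the generative decision boundary, while the mistake-minimizing threshold is largely unaffected, which is the point of the "ill-behaved data" slide.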
Traditional Statistics vs. Machine Learning
[Diagram: Statistics maps Data to an estimated world state, and Decision Theory maps that state to Actions/Predictions; Machine Learning maps Data directly to Predictions/Actions.]
Comparison of methodologies

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications
Boosting
A weak learner

Weighted training set: (x1,y1,w1), (x2,y2,w2), …, (xn,yn,wn)
  – instances x1, x2, x3, …, xn: feature vectors
  – labels y1, y2, y3, …, yn: binary labels
  – weights w1, …, wn: non-negative, sum to 1

[Diagram: the weighted training set goes into the weak learner, which outputs a weak rule h that maps instances to predicted labels.]

The weak requirement: h must do better than random guessing on the weighted training set, i.e. its weighted error must be at most 1/2 − γ for some advantage γ > 0.
The boosting process

[Diagram: the uniformly weighted training set (x1,y1,1/n), …, (xn,yn,1/n) is fed to the weak learner to produce h1; the example weights are then updated, the re-weighted set (x1,y1,w1), …, (xn,yn,wn) produces h2, and the cycle repeats to produce h3, h4, …, hT.]

Final rule: Sign[ α1·h1 + α2·h2 + … + αT·hT ]
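A minimal sketch of that loop (my own illustration rather than code from the talk), using exhaustive search over decision stumps as the weak learner; the function and class names are assumptions:

```python
import numpy as np

class Stump:
    """Weak rule: sign * (+1 if x[feature] > threshold else -1)."""
    def __init__(self, feature, threshold, sign):
        self.feature, self.threshold, self.sign = feature, threshold, sign
    def predict(self, X):
        return self.sign * np.where(X[:, self.feature] > self.threshold, 1.0, -1.0)

def best_stump(X, y, w):
    """Exhaustive search for the stump with the smallest weighted training error."""
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (+1.0, -1.0):
                h = Stump(f, t, s)
                err = np.sum(w[h.predict(X) != y])
                if err < best_err:
                    best, best_err = h, err
    return best, best_err

def adaboost(X, y, T=20):
    """y must be in {-1, +1}. Returns a list of (alpha_t, h_t) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # start with uniform weights
    ensemble = []
    for _ in range(T):
        h, err = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * h.predict(X))    # up-weight mistakes, down-weight correct examples
        w /= w.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    """Final rule: Sign[ sum_t alpha_t * h_t(x) ]."""
    score = sum(a * h.predict(X) for a, h in ensemble)
    return np.sign(score)
```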
Main properties of AdaBoost
• If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most exp(−2 Σ_t γ_t²).
• Even after the in-sample error reaches zero, additional boosting iterations usually improve the out-of-sample error. [Schapire, Freund, Bartlett, Lee, Ann. Stat. 1998]
What is a good weak learner?
• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Small enough to allow exhaustive search for the minimal weighted training error.
• Small enough to avoid over-fitting.
• Should be able to calculate predicted label very efficiently.
• Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
Image Features

6,000,000 unique binary features

Each weak rule thresholds a single feature value:
    h_t(x_i) = +1 if f_t(x_i) > θ_t, and −1 otherwise
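As a sketch of how such a feature can be computed efficiently, the snippet below builds an integral image so that any rectangle sum costs O(1), and wraps the thresholded weak rule in the form given above. The helper names and the plain two-rectangle filter are illustrative assumptions; real rectangle filters combine several adjacent rectangle sums.

```python
import numpy as np

def integral_image(img):
    """Padded cumulative-sum table: rectangle sums in O(1) regardless of rectangle size."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left corner (x, y), width w, height h."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def weak_rule(feature_value, theta):
    """h_t(x) = +1 if f_t(x) > theta_t, -1 otherwise."""
    return 1 if feature_value > theta else -1

# Usage on a toy 24x24 sub-window: a left-minus-right two-rectangle filter
window = np.random.default_rng(1).random((24, 24))
ii = integral_image(window)
f = rect_sum(ii, 0, 0, 12, 24) - rect_sum(ii, 12, 0, 12, 24)
print(weak_rule(f, theta=0.0))
```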
Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost.
[Figure: ROC curve for the 200-feature classifier]
95% correct detection on the test set, with 1 in 14,084 false positives.
Not quite competitive...
Alternating Trees
Joint work with Llew Mason
Decision Trees
[Diagram: a decision tree that first tests X>3 (no → −1) and then Y>5 (yes → +1, no → −1), together with the corresponding partition of the (X, Y) plane into +1 and −1 regions.]
Decision tree as a sum
[Diagram: the same tree rewritten as a sum of real-valued contributions: a baseline of −0.2 at the root, ±0.1 for the X>3 test and +0.2/−0.3 for the Y>5 test; the predicted label is the sign of the total.]
An alternating decision tree
[Diagram: the sum-of-contributions tree extended with an additional test, Y<1, contributing +0.7 when true and 0.0 otherwise, so that several root-to-node paths can contribute to the score of a single instance; the predicted label is the sign of the summed contributions.]
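Scoring an instance with an alternating tree is just that summation: add the root's prediction value, and for every decision node reached, add the prediction value of the branch the instance satisfies, recursing into any further decision nodes underneath. A minimal sketch with a made-up node representation (the class names and the exact branch values are illustrative, loosely following the toy tree above):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PredictionNode:
    value: float                                   # contribution added whenever this node is reached
    splitters: List["SplitterNode"] = field(default_factory=list)

@dataclass
class SplitterNode:
    condition: Callable[[dict], bool]              # e.g. lambda x: x["X"] > 3
    yes: PredictionNode
    no: PredictionNode

def score(node: PredictionNode, x: dict) -> float:
    """Total score = sum of prediction values along every satisfied path; label = sign(score)."""
    total = node.value
    for split in node.splitters:
        total += score(split.yes if split.condition(x) else split.no, x)
    return total

# Toy tree loosely mirroring the slide: root -0.2, tests on X>3 and Y>5
tree = PredictionNode(-0.2, [
    SplitterNode(lambda x: x["X"] > 3, PredictionNode(+0.1), PredictionNode(-0.1)),
    SplitterNode(lambda x: x["Y"] > 5, PredictionNode(+0.2), PredictionNode(-0.3)),
])
print(score(tree, {"X": 4, "Y": 6}))   # -0.2 + 0.1 + 0.2 = +0.1  ->  predict +1
```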
Example: Medical Diagnostics
• Cleve dataset from the UC Irvine database.
• Heart disease diagnostics (+1 = healthy, −1 = sick).
• 13 features from tests (real-valued and discrete).
• 303 instances.
ADtree for the Cleveland heart-disease diagnostics problem
Cross-validated accuracy

Learning algorithm    Number of splits    Average test error    Test error variance
ADtree                6                   17.0%                 0.6%
C5.0                  27                  27.2%                 0.5%
C5.0 + boosting       446                 20.2%                 0.5%
Boosted stumps        16                  16.5%                 0.8%
Alternating tree for “buizocity”
Alternating Tree (Detail)
Precision/recall graphs
[Plot: accuracy vs. score for the classifier]
“Drinking out of a fire hose”
Allan Wilks, 1997
Massive distributed data streams
[Diagram: front-end systems (cashier’s system, telephone switch, web server, web camera) feed into data aggregation and a “data warehouse”, which in turn feeds analytics.]
The database bottleneck
• Physical limit: a disk “seek” takes 0.01 sec
  – the same time it takes to read/write 10^5 bytes
  – the same time it takes to perform 10^7 CPU operations
• Commercial DBMSs are optimized for varying queries and transactions.
• Classification tasks require evaluating fixed queries on massive data streams.
Working with large flat files
1. Sort the file according to X (“called telephone number”).
   – Can be done very efficiently for very large files.
   – Counting occurrences becomes efficient because all records for a given X appear in the same disk block (see the sketch after this list).
2. Randomly permute the records.
   – Reading k consecutive records then suffices to estimate a few statistics for a few decisions (e.g. splitting a node in a decision tree).
   – Done by sorting on a random number.
3. “Hancock” – a system for efficient computation of statistical signatures for data streams.
   http://www.research.att.com/~kfisher/hancock/
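A minimal sketch of point 1 (my own illustration, with an assumed record layout of caller, called number, time, and call length): once the file is sorted by the called number, per-number statistics fall out of a single sequential pass with no random seeks.

```python
import csv
from itertools import groupby

def per_number_stats(path):
    """Stream a call-detail file already sorted by the called number (column 1).
    One sequential pass; only one group of records is held in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        for called, records in groupby(reader, key=lambda row: row[1]):
            records = list(records)
            yield called, len(records), sum(float(r[3]) for r in records)

# Usage:
# for number, n_calls, total_length in per_number_stats("calls_sorted_by_called.csv"):
#     ...
```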
Working with data streams
• “You get to see each record only once”
• Example problem: identify the 10 most popular items for each retail-chain customer over the last 12 months (a one-pass sketch follows below).
• To learn more:
Stanford’s Stream Dream Team: http://www-db.stanford.edu/sdt/
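One way to sketch that example under the see-each-record-once constraint is to keep a small, bounded counter per customer, in the spirit of the space-saving heavy-hitters algorithm, so memory does not grow with the number of distinct items. This is my illustration, not a method from the talk; the capacity of 50 tracked items per customer is an arbitrary assumption.

```python
from collections import defaultdict

def space_saving_update(counters, item, capacity=50):
    """Approximate counting with at most `capacity` tracked items per counter table."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < capacity:
        counters[item] = 1
    else:
        # classic space-saving step: evict the current minimum and inherit its count
        victim = min(counters, key=counters.get)
        counters[item] = counters.pop(victim) + 1

def top_items_per_customer(stream, k=10, capacity=50):
    """stream yields (customer_id, item_id) pairs; each record is seen exactly once."""
    per_customer = defaultdict(dict)
    for customer, item in stream:
        space_saving_update(per_customer[customer], item, capacity)
    return {c: sorted(counts, key=counts.get, reverse=True)[:k]
            for c, counts in per_customer.items()}
```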
Analyzing at the source
[Diagram: analytics drives code generation; the generated code is downloaded to the front-end systems; the front-end systems upload statistics, which are aggregated and fed back into analytics.]
Learn Slowly, Predict Fast!
• Buizocity:
  – 10,000 instances are sufficient for learning.
  – 300,000,000 have to be labeled (weekly).
  – Generate the ADTree classifier in C, compile it, and run it using Hancock.
Paul Viola’s face detector:
• Scan 50,000 location/scale boxes in each image, 15 images per second, to detect a few faces.
• Cascaded method minimizes average processing time
• Training takes a day on a fast parallel machine.
[Diagram: cascade of classifiers. Each image box passes through Classifier 1, Classifier 2, Classifier 3, …; a box that fails (F) any stage is immediately rejected as NON-FACE, and only a box that passes (T) every stage is declared a FACE.]
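The control flow of the cascade is simple enough to sketch in a few lines (an illustration of the idea, not Viola's implementation): run the cheap stages first and bail out on the first rejection, so the expensive stages only ever see the rare boxes that look face-like.

```python
def cascade_is_face(box, stages):
    """stages: classifier functions ordered cheapest-first; each returns True (pass) or False (reject).
    Most boxes are rejected by the first stage or two, so the average cost per box stays small."""
    for stage in stages:
        if not stage(box):
            return False          # early exit: NON-FACE
    return True                   # passed every stage: FACE
```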
Summary
• Generative vs. Predictive methodology
• Boosting
• Alternating trees
• The database bottleneck
• Learning slowly, predicting fast.
Other work 1
• Specialized data compression:
  – When data is collected in small bins, most bins are empty.
  – Instead of storing the zeros, smart compression dramatically reduces the data size.
• Model averaging:
  – Boosting and bagging make classifiers more stable.
  – We need theory that does not rely on Bayesian assumptions.
  – Closely related to margin-based analysis of boosting and of SVMs.
• Zipf’s law:
  – The distribution of words in free text is extremely skewed.
  – Methods should scale exponentially in the entropy rather than linearly in the number of words.
Other work 2
• Online methods:
  – The data distribution changes with time.
  – Online refinement of the feature set.
  – Long-term learning.
• Effective label collection:
  – Selective sampling to label only the hard cases.
  – Comparing labels from different people to estimate reliability.
  – Co-training: different channels train each other. (Blum, Mitchell, McCallum)