Computational and Statistical Issues in Data-Mining
Yoav Freund
Banter Inc.
Plan of talk
• Two large-scale classification problems.
• Generative versus Predictive modeling
• Boosting
• Applications of boosting
• Computational issues in data-mining.
AT&T customer classification
• Distinguish business vs. residence customers
  – Classification unavailable for about 30% of known customers.
  – Calculate a “Buizocity” score.
• Using statistics from call-detail records
  – Records contain: calling number, called number, time of day, length of call.
Freund, Mason, Rogers, Pregibon, Cortes 2000
Massive datasets
• 260 million calls / day
• 230 million telephone numbers to be classified.
Paul Viola’s face recognizer
[Figure: example “Faces” and “Non-Faces” training image patches]
Training data: 5,000 faces, 10^8 non-faces
Application of face detector
Many uses:
- User Interfaces
- Interactive Agents
- Security Systems
- Video Compression
- Image Database Analysis
Generative vs. Predictive models
Toy Example
• Computer receives telephone call
• Measures Pitch of voice
• Decides gender of caller
[Diagram: human voice → pitch measurement → decision “Male” or “Female”]
Generative modeling
[Plot: probability vs. voice pitch, with one Gaussian fitted per class: (mean1, var1) and (mean2, var2)]
Discriminative approach
[Plot: number of mistakes vs. voice-pitch threshold]
Ill-behaved data
[Plot: probability vs. voice pitch, where the fitted class means (mean1, mean2) are distorted by ill-behaved data, shown together with the number-of-mistakes curve for the discriminative threshold]
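To make the contrast concrete, here is a minimal sketch (my own illustration, not code from the talk) of the two approaches on the voice-pitch toy problem: the generative model fits one Gaussian per class and classifies by the larger class-conditional likelihood, while the discriminative approach simply scans pitch thresholds and keeps the one with the fewest training mistakes. The synthetic data, class means, and outlier fraction are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic voice-pitch data (Hz): two classes, plus a few high-pitch outliers in class 0
pitch0 = np.concatenate([rng.normal(120, 15, 200), rng.normal(300, 5, 10)])  # class 0 + outliers
pitch1 = rng.normal(210, 20, 200)                                            # class 1
x = np.concatenate([pitch0, pitch1])
y = np.concatenate([np.zeros(len(pitch0)), np.ones(len(pitch1))])

# Generative: estimate (mean, var) per class, classify by the larger class-conditional likelihood
def gaussian_loglik(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

m0, v0 = pitch0.mean(), pitch0.var()
m1, v1 = pitch1.mean(), pitch1.var()
gen_pred = (gaussian_loglik(x, m1, v1) > gaussian_loglik(x, m0, v0)).astype(float)

# Discriminative: pick the single pitch threshold with the fewest training mistakes
thresholds = np.sort(x)
errors = [np.mean((x > t).astype(float) != y) for t in thresholds]
best_t = thresholds[int(np.argmin(errors))]
disc_pred = (x > best_t).astype(float)

print("generative training error:    ", round(float(np.mean(gen_pred != y)), 3))
print("discriminative training error:", round(float(np.mean(disc_pred != y)), 3),
      "at threshold", round(float(best_t), 1))
```

The outliers inflate the fitted variance of class 0 and pull the generative decision boundary, while the mistake-minimizing threshold is largely unaffected, which is the point of the "ill-behaved data" slide.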
Traditional Statistics vs. Machine Learning
[Diagram: Statistics maps Data to an estimated world state, and Decision Theory maps that state to Actions/Predictions; Machine Learning maps Data directly to Predictions/Actions.]
Comparison of methodologies

Model                 Generative              Discriminative
Goal                  Probability estimates   Classification rule
Performance measure   Likelihood              Misclassification rate
Mismatch problems     Outliers                Misclassifications
Boosting
A weak learner

Weighted training set: (x1,y1,w1), (x2,y2,w2), …, (xn,yn,wn)
  – instances x1, x2, x3, …, xn: feature vectors
  – labels y1, y2, y3, …, yn: binary labels
  – weights w1, …, wn: non-negative, sum to 1

[Diagram: the weighted training set goes into the weak learner, which outputs a weak rule h that maps instances to predicted labels.]

The weak requirement: h must do better than random guessing on the weighted training set, i.e. its weighted error must be at most 1/2 − γ for some advantage γ > 0.
The boosting process

[Diagram: the uniformly weighted training set (x1,y1,1/n), …, (xn,yn,1/n) is fed to the weak learner to produce h1; the example weights are then updated, the re-weighted set (x1,y1,w1), …, (xn,yn,wn) produces h2, and the cycle repeats to produce h3, h4, …, hT.]

Final rule: Sign[ α1·h1 + α2·h2 + … + αT·hT ]
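A minimal sketch of that loop (my own illustration rather than code from the talk), using exhaustive search over decision stumps as the weak learner; the function and class names are assumptions:

```python
import numpy as np

class Stump:
    """Weak rule: sign * (+1 if x[feature] > threshold else -1)."""
    def __init__(self, feature, threshold, sign):
        self.feature, self.threshold, self.sign = feature, threshold, sign
    def predict(self, X):
        return self.sign * np.where(X[:, self.feature] > self.threshold, 1.0, -1.0)

def best_stump(X, y, w):
    """Exhaustive search for the stump with the smallest weighted training error."""
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (+1.0, -1.0):
                h = Stump(f, t, s)
                err = np.sum(w[h.predict(X) != y])
                if err < best_err:
                    best, best_err = h, err
    return best, best_err

def adaboost(X, y, T=20):
    """y must be in {-1, +1}. Returns a list of (alpha_t, h_t) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                       # start with uniform weights
    ensemble = []
    for _ in range(T):
        h, err = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * h.predict(X))    # up-weight mistakes, down-weight correct examples
        w /= w.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    """Final rule: Sign[ sum_t alpha_t * h_t(x) ]."""
    score = sum(a * h.predict(X) for a, h in ensemble)
    return np.sign(score)
```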
Main properties of AdaBoost
• If the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most exp(−2 Σ_t γ_t²).
• Even after the in-sample error reaches zero, additional boosting iterations usually improve the out-of-sample error. [Schapire, Freund, Bartlett, Lee, Ann. Stat. 1998]
What is a good weak learner?
• The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label.
• Small enough to allow exhaustive search for the minimal weighted training error.
• Small enough to avoid over-fitting.
• Should be able to calculate predicted label very efficiently.
• Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
Image Features

6,000,000 unique binary features

Each weak rule thresholds a single feature value:
    h_t(x_i) = +1 if f_t(x_i) > θ_t, and −1 otherwise
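As a sketch of how such a feature can be computed efficiently, the snippet below builds an integral image so that any rectangle sum costs O(1), and wraps the thresholded weak rule in the form given above. The helper names and the plain two-rectangle filter are illustrative assumptions; real rectangle filters combine several adjacent rectangle sums.

```python
import numpy as np

def integral_image(img):
    """Padded cumulative-sum table: rectangle sums in O(1) regardless of rectangle size."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle with top-left corner (x, y), width w, height h."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def weak_rule(feature_value, theta):
    """h_t(x) = +1 if f_t(x) > theta_t, -1 otherwise."""
    return 1 if feature_value > theta else -1

# Usage on a toy 24x24 sub-window: a left-minus-right two-rectangle filter
window = np.random.default_rng(1).random((24, 24))
ii = integral_image(window)
f = rect_sum(ii, 0, 0, 12, 24) - rect_sum(ii, 12, 0, 12, 24)
print(weak_rule(f, theta=0.0))
```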
Example Classifier for Face Detection
A classifier with 200 rectangle features was learned using AdaBoost.
[Figure: ROC curve for the 200-feature classifier]
95% correct detection on the test set, with 1 in 14,084 false positives.
Not quite competitive...
Alternating Trees
Joint work with Llew Mason
Decision Trees
[Diagram: a decision tree that first tests X>3 (no → −1) and then Y>5 (yes → +1, no → −1), together with the corresponding partition of the (X, Y) plane into +1 and −1 regions.]
Decision tree as a sum
[Diagram: the same tree rewritten as a sum of real-valued contributions: a baseline of −0.2 at the root, ±0.1 for the X>3 test and +0.2/−0.3 for the Y>5 test; the predicted label is the sign of the total.]
An alternating decision tree
[Diagram: the sum-of-contributions tree extended with an additional test, Y<1, contributing +0.7 when true and 0.0 otherwise, so that several root-to-node paths can contribute to the score of a single instance; the predicted label is the sign of the summed contributions.]
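Scoring an instance with an alternating tree is just that summation: add the root's prediction value, and for every decision node reached, add the prediction value of the branch the instance satisfies, recursing into any further decision nodes underneath. A minimal sketch with a made-up node representation (the class names and the exact branch values are illustrative, loosely following the toy tree above):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PredictionNode:
    value: float                                   # contribution added whenever this node is reached
    splitters: List["SplitterNode"] = field(default_factory=list)

@dataclass
class SplitterNode:
    condition: Callable[[dict], bool]              # e.g. lambda x: x["X"] > 3
    yes: PredictionNode
    no: PredictionNode

def score(node: PredictionNode, x: dict) -> float:
    """Total score = sum of prediction values along every satisfied path; label = sign(score)."""
    total = node.value
    for split in node.splitters:
        total += score(split.yes if split.condition(x) else split.no, x)
    return total

# Toy tree loosely mirroring the slide: root -0.2, tests on X>3 and Y>5
tree = PredictionNode(-0.2, [
    SplitterNode(lambda x: x["X"] > 3, PredictionNode(+0.1), PredictionNode(-0.1)),
    SplitterNode(lambda x: x["Y"] > 5, PredictionNode(+0.2), PredictionNode(-0.3)),
])
print(score(tree, {"X": 4, "Y": 6}))   # -0.2 + 0.1 + 0.2 = +0.1  ->  predict +1
```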
Example: Medical Diagnostics
• Cleve dataset from the UC Irvine database.
• Heart disease diagnostics (+1 = healthy, −1 = sick).
• 13 features from tests (real-valued and discrete).
• 303 instances.
ADtree for the Cleveland heart-disease diagnostics problem
Cross-validated accuracy

Learning algorithm    Number of splits    Average test error    Test error variance
ADtree                6                   17.0%                 0.6%
C5.0                  27                  27.2%                 0.5%
C5.0 + boosting       446                 20.2%                 0.5%
Boosted stumps        16                  16.5%                 0.8%
Alternating tree for “buizocity”
Alternating Tree (Detail)
Precision/recall graphs
[Plot: accuracy vs. score for the classifier]
“Drinking out of a fire hose”
Allan Wilks, 1997
Massive distributed data streams
[Diagram: front-end systems (cashier’s system, telephone switch, web server, web camera) feed into data aggregation and a “data warehouse”, which in turn feeds analytics.]
The database bottleneck
• Physical limit: a disk “seek” takes 0.01 sec
  – the same time it takes to read/write 10^5 bytes
  – the same time it takes to perform 10^7 CPU operations
• Commercial DBMSs are optimized for varying queries and transactions.
• Classification tasks require evaluating fixed queries on massive data streams.
Working with large flat files
1. Sort the file according to X (“called telephone number”).
   – Can be done very efficiently for very large files.
   – Counting occurrences becomes efficient because all records for a given X appear in the same disk block (see the sketch after this list).
2. Randomly permute the records.
   – Reading k consecutive records then suffices to estimate a few statistics for a few decisions (e.g. splitting a node in a decision tree).
   – Done by sorting on a random number.
3. “Hancock” – a system for efficient computation of statistical signatures for data streams.
   http://www.research.att.com/~kfisher/hancock/
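A minimal sketch of point 1 (my own illustration, with an assumed record layout of caller, called number, time, and call length): once the file is sorted by the called number, per-number statistics fall out of a single sequential pass with no random seeks.

```python
import csv
from itertools import groupby

def per_number_stats(path):
    """Stream a call-detail file already sorted by the called number (column 1).
    One sequential pass; only one group of records is held in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        for called, records in groupby(reader, key=lambda row: row[1]):
            records = list(records)
            yield called, len(records), sum(float(r[3]) for r in records)

# Usage:
# for number, n_calls, total_length in per_number_stats("calls_sorted_by_called.csv"):
#     ...
```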
Working with data streams
• “You get to see each record only once”
• Example problem: identify the 10 most popular items for each retail-chain customer over the last 12 months (a one-pass sketch follows below).
• To learn more:
Stanford’s Stream Dream Team: http://www-db.stanford.edu/sdt/
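One way to sketch that example under the see-each-record-once constraint is to keep a small, bounded counter per customer, in the spirit of the space-saving heavy-hitters algorithm, so memory does not grow with the number of distinct items. This is my illustration, not a method from the talk; the capacity of 50 tracked items per customer is an arbitrary assumption.

```python
from collections import defaultdict

def space_saving_update(counters, item, capacity=50):
    """Approximate counting with at most `capacity` tracked items per counter table."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < capacity:
        counters[item] = 1
    else:
        # classic space-saving step: evict the current minimum and inherit its count
        victim = min(counters, key=counters.get)
        counters[item] = counters.pop(victim) + 1

def top_items_per_customer(stream, k=10, capacity=50):
    """stream yields (customer_id, item_id) pairs; each record is seen exactly once."""
    per_customer = defaultdict(dict)
    for customer, item in stream:
        space_saving_update(per_customer[customer], item, capacity)
    return {c: sorted(counts, key=counts.get, reverse=True)[:k]
            for c, counts in per_customer.items()}
```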
Analyzing at the source
[Diagram: analytics drives code generation; the generated code is downloaded to the front-end systems; the front-end systems upload statistics, which are aggregated and fed back into analytics.]
Learn Slowly, Predict Fast!
• Buizocity:
  – 10,000 instances are sufficient for learning.
  – 300,000,000 have to be labeled (weekly).
  – Generate the ADTree classifier in C, compile it, and run it using Hancock.
Paul Viola’s face detector:
• Scan 50,000 location/scale boxes in each image, 15 images per second, to detect a few faces.
• Cascaded method minimizes average processing time
• Training takes a day on a fast parallel machine.
[Diagram: cascade of classifiers. Each image box passes through Classifier 1, Classifier 2, Classifier 3, …; a box that fails (F) any stage is immediately rejected as NON-FACE, and only a box that passes (T) every stage is declared a FACE.]
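The control flow of the cascade is simple enough to sketch in a few lines (an illustration of the idea, not Viola's implementation): run the cheap stages first and bail out on the first rejection, so the expensive stages only ever see the rare boxes that look face-like.

```python
def cascade_is_face(box, stages):
    """stages: classifier functions ordered cheapest-first; each returns True (pass) or False (reject).
    Most boxes are rejected by the first stage or two, so the average cost per box stays small."""
    for stage in stages:
        if not stage(box):
            return False          # early exit: NON-FACE
    return True                   # passed every stage: FACE
```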
Summary
• Generative vs. Predictive methodology
• Boosting
• Alternating trees
• The database bottleneck
• Learning slowly, predicting fast.
Other work 1
• Specialized data compression:
  – When data is collected in small bins, most bins are empty.
  – Instead of storing the zeros, smart compression dramatically reduces the data size.
• Model averaging:
  – Boosting and bagging make classifiers more stable.
  – We need theory that does not rely on Bayesian assumptions.
  – Closely related to margin-based analysis of boosting and of SVMs.
• Zipf’s law:
  – The distribution of words in free text is extremely skewed.
  – Methods should scale exponentially in the entropy rather than linearly in the number of words.
Other work 2
• Online methods:
  – The data distribution changes with time.
  – Online refinement of the feature set.
  – Long-term learning.
• Effective label collection:
  – Selective sampling to label only the hard cases.
  – Comparing labels from different people to estimate reliability.
  – Co-training: different channels train each other. (Blum, Mitchell, McCallum)