
Is Random Model Better? - On its accuracy and efficiency -

Wei Fan, IBM T.J. Watson

Joint work with Haixun Wang, Philip S. Yu, and Sheng Ma

Optimal Model
• Loss function L(t, y) to evaluate performance; t is the true label and y is the prediction.
• Optimal decision: y* is the label that minimizes the expected loss when x is sampled repeatedly.
• 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
• Cost-sensitive loss: y* is the label that minimizes the "empirical risk". If P(fraud|x) * $1000 > $90, or equivalently P(fraud|x) > 0.09, predict fraud (see the sketch below).
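
A minimal sketch of the optimal-decision rule above: given the class probabilities at x and a loss matrix L(t, y), pick the label y* with the smallest expected loss. The $1000 fraud amount and $90 overhead are the slide's figures; the loss-matrix layout and function names are illustrative.

    # L[t][y]: loss of predicting y when the true label is t.
    # Predicting "normal" on a $1000 fraud loses $1000; investigating costs $90.
    LOSS = {
        "fraud":  {"fraud": 90.0, "normal": 1000.0},
        "normal": {"fraud": 90.0, "normal": 0.0},
    }

    def optimal_label(probs, loss=LOSS):
        """y* = argmin_y sum_t P(t|x) * L(t, y), i.e. the minimum expected loss."""
        labels = loss["fraud"].keys()
        return min(labels, key=lambda y: sum(p * loss[t][y] for t, p in probs.items()))

    # P(fraud|x) = 0.09 is exactly the break-even point of the slide's rule.
    print(optimal_label({"fraud": 0.10, "normal": 0.90}))  # fraud  (expected loss 90 vs 100)
    print(optimal_label({"fraud": 0.05, "normal": 0.95}))  # normal (expected loss 90 vs 50)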

How do we look for the optimal model?
• NP-hard for most "model representations".
• We believe the simplest hypothesis that fits the data is the best.
• We employ all kinds of heuristics to look for it: info gain, gini index, Kearns-Mansour, etc. (see the sketch below).
• Pruning: MDL pruning, reduced-error pruning, cost-based pruning.
• Reality: tractable, but still pretty expensive.
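
For reference, the split heuristics named above can be written in a few lines. This is a minimal sketch over per-class count vectors; the function names are illustrative, and the Kearns-Mansour criterion is omitted.

    from math import log2

    def entropy(counts):
        """Entropy (in bits) of a vector of per-class example counts."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    def gini(counts):
        """Gini index of a vector of per-class example counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def info_gain(parent, children):
        """Information gain of splitting `parent` counts into `children` counts."""
        total = sum(parent)
        weighted = sum(sum(ch) / total * entropy(ch) for ch in children)
        return entropy(parent) - weighted

    # A split that separates the two classes well has high gain and low Gini.
    print(info_gain([10, 10], [[9, 1], [1, 9]]))  # about 0.53 bits
    print(gini([9, 1]))                           # 0.18

Greedy tree learners evaluate such a score for every candidate feature at every node, which is exactly the expense the slide refers to.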

On the other hand
• Occam's Razor's interpretation: given two hypotheses with the same loss, we should prefer the simpler one.
• Yet very complicated hypotheses can be highly accurate: meta-learning, boosting (weighted voting), bagging (sampling with replacement).

Where are we?
• The above are all very complicated to compute.
• Question: do we have to?

Do we have to be "perfect"?
• 0-1 loss, binary problem: if P(positive|x) > 0.5, we predict x to be positive. Whether P(positive|x) = 0.6 or P(positive|x) = 0.9 makes no difference to the final prediction!
• Cost-sensitive problems: if P(fraud|x) * $1000 > $90, we predict x to be fraud. Rewriting it: P(fraud|x) > 0.09. Whether P(fraud|x) = 1.0 or P(fraud|x) = 0.091 makes no difference.

Random Decision Tree
• Build several empty iso-depth tree structures without even looking at the data.
• Each example is sorted through a unique path from the root to a leaf. Each tree node records the number of instances belonging to each class.
• Update every empty node by scanning the data set only once. It is like "classifying" the data: when an example reaches a node, the count of examples belonging to that example's class label is incremented (see the sketch below).
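
A minimal sketch of the two steps just described: first build an empty, fixed-depth tree whose test features are chosen at random without consulting the data, then fill in the per-class counts with a single scan. The class and function names are illustrative and binary features are assumed for brevity; this is not the authors' actual implementation.

    import random

    class RandomTreeNode:
        def __init__(self, depth, features, max_depth):
            self.counts = {}        # class label -> number of training examples seen
            self.feature = None
            self.children = {}
            if depth < max_depth and features:
                # The test feature is picked at random; the data is never consulted.
                self.feature = random.choice(features)
                remaining = [f for f in features if f != self.feature]
                self.children = {v: RandomTreeNode(depth + 1, remaining, max_depth)
                                 for v in (0, 1)}   # assumes binary 0/1 features

        def update(self, x, label):
            """Route one example down its unique path, incrementing class counts."""
            self.counts[label] = self.counts.get(label, 0) + 1
            if self.feature is not None:
                self.children[x[self.feature]].update(x, label)

    def train_random_trees(dataset, features, n_trees, max_depth):
        """Build empty trees, then update all of them in one scan of the data."""
        trees = [RandomTreeNode(0, list(features), max_depth) for _ in range(n_trees)]
        for x, label in dataset:        # one complete pass over the training data
            for tree in trees:
                tree.update(x, label)
        return trees

    # Tiny usage example with made-up binary data; x is a tuple of feature values.
    data = [((0, 1, 1, 0), "fraud"), ((1, 1, 0, 0), "normal"), ((0, 0, 1, 1), "normal")]
    trees = train_random_trees(data, features=[0, 1, 2, 3], n_trees=10, max_depth=2)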

Classification
• Each tree outputs a membership probability: p(fraud|x) = n_fraud / (n_fraud + n_normal).
• The membership probabilities from multiple random trees are averaged to approximate the true probability.
• A loss function is required to make a decision. 0-1 loss: if p(fraud|x) > 0.5, predict fraud. Cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud (see the sketch below).
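
Continuing the sketch, prediction averages the per-tree estimates of p(fraud|x) and only then applies the loss function. The helper below assumes the per-tree class counts for x have already been read off the node where x landed; the counts in the example are made up.

    def rdt_probability(leaf_counts, positive="fraud"):
        """Average p(positive|x) = n_pos / (n_pos + n_neg) over several random trees."""
        probs = []
        for counts in leaf_counts:
            total = sum(counts.values())
            probs.append(counts.get(positive, 0) / total if total else 0.0)
        return sum(probs) / len(probs)

    # Example: the counts recorded at the node x reached in each of three trees.
    p = rdt_probability([{"fraud": 2, "normal": 8},
                         {"fraud": 1, "normal": 7},
                         {"fraud": 3, "normal": 9}])
    print(p)              # averaged estimate of P(fraud|x), about 0.19
    print(p > 0.5)        # 0-1 loss decision: not fraud
    print(p * 1000 > 90)  # cost-sensitive decision from the earlier slide: fraud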

Tree depth
• Set to half the number of features, to create diversity.
• The number of feature combinations peaks at half the size of the population; for example, combining 2 out of 4 gives 6 choices (checked below).
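
The "combinations peak at half the population" point can be checked directly with the standard-library binomial coefficient; the m = 10 case below is just an extra illustration.

    from math import comb

    # Choosing k features out of 4: the count peaks at k = 2, the slide's "6 choices".
    print([comb(4, k) for k in range(5)])     # [1, 4, 6, 4, 1]

    # The same holds for any number of features m: C(m, k) is largest near k = m/2,
    # which is why a depth of half the feature count gives the most distinct trees.
    print([comb(10, k) for k in range(11)])   # peaks at k = 5 with 252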

Number of trees
• Sampling theory: 30 trees give a pretty good estimate with reasonably small variance.
• 10 trees are usually already in the range.

Worst scenario
• Only one feature is relevant; all the rest are noise.
• Probability:
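
The probability itself is not reproduced in this transcript. One hedged way to reconstruct the argument, under the assumption that a tree of depth m/2 tests m/2 distinct randomly chosen features along the path an example follows: a single tree then tests the one relevant feature with probability about 1/2, so at least one of N trees does so with probability about 1 - 0.5^N.

    # Assumed reconstruction (not the slide's exact formula): chance that at least
    # one of N random trees tests the single relevant feature, if each tree's path
    # covers half of the features and therefore hits it with probability 1/2.
    for n_trees in (10, 30):
        print(n_trees, 1 - 0.5 ** n_trees)    # 10 -> ~0.999, 30 -> ~0.9999999991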

Simple Feature Info Gain
• Limitation: at least one feature must have information gain by itself.
• Same limitation as C4.5 and dti.
• Example (see the sketch below).
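
The example itself is not reproduced in the transcript; a standard illustration of this limitation is a two-feature XOR target, where no single feature carries any information gain by itself. The check below is added for illustration, not taken from the slide.

    from math import log2

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    # XOR: label = x1 ^ x2, one example per input combination.
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    for i in (0, 1):
        left  = [label for x, label in data if x[i] == 0]
        right = [label for x, label in data if x[i] == 1]
        split = 0.5 * entropy([left.count(0), left.count(1)]) \
              + 0.5 * entropy([right.count(0), right.count(1)])
        print(f"info gain of feature {i}:", entropy([2, 2]) - split)   # 0.0 for both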

Training Efficiency
• One complete scan of the training data.
• Memory requirement: hold one tree (or, better, multiple trees); one example is read at a time.

Donation Dataset
• Decide to whom to send a charity solicitation letter.
• It costs $0.68 to send a letter.
• Loss function (a sketch follows below).
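
The slide does not spell the loss function out. A hedged sketch, assuming (as in related cost-sensitive learning work on this dataset) that the benefit of mailing is the expected donation and the cost is the $0.68 postage; the donation amounts below are made up, and in practice the amount would come from a separate model.

    # Mail the solicitation letter when the expected donation exceeds the cost.
    def should_mail(p_donate, predicted_amount, mailing_cost=0.68):
        """Expected profit = P(donate|x) * predicted donation amount - letter cost."""
        return p_donate * predicted_amount - mailing_cost > 0

    print(should_mail(0.05, 20.0))   # True:  expected $1.00 > $0.68
    print(should_mail(0.05, 10.0))   # False: expected $0.50 < $0.68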

Result

Result

Credit Card Fraud
• Detect whether a transaction is fraudulent.
• There is an overhead to investigate a fraud: {$60, $70, $80, $90}.
• Loss function (a sketch follows below).
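
Again a hedged sketch of the decision rule the slide implies: flag a transaction when the expected recovered amount exceeds the investigation overhead, for each overhead in the {$60, $70, $80, $90} set. The probability and transaction amount below are made up.

    # Flag for investigation when P(fraud|x) * transaction_amount > overhead.
    def investigate(p_fraud, amount, overhead):
        return p_fraud * amount > overhead

    for overhead in (60, 70, 80, 90):
        print(overhead, investigate(p_fraud=0.08, amount=1000.0, overhead=overhead))
    # 0.08 * $1000 = $80: investigate for overheads $60 and $70, but not $80 or $90.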

Result

Extreme situation

Tree Depth

Compare with Boosting

Compare with Bagging

Conclusion
• Points out the reality that conventional inductive learning (a single best model, or multiple complicated ones) is probably far more complicated than necessary.
• Proposes a very efficient and accurate random decision tree algorithm.