Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches
Wei Fan, Ed Greengrass, Joe McCloskey, Philip S. Yu, Kevin Drummey
Example: Simple Decision Tree Method
Construction: at each node, a feature is chosen at random.
- Discrete feature: chosen only if it has never been chosen previously on the decision path starting from the root, since every example on the same path has the same discrete feature value.
- Continuous feature: can be chosen multiple times on the same decision path; each time, a random threshold value is chosen.
continued…
Stop when the number of examples in the leaf node is too small, or the total height of the tree exceeds some limit.
Each node of the tree keeps the number of examples belonging to each class, e.g., 10 + and 5 -.
Construct at least 10 trees; there is no need for more than 30.
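The construction rule above can be sketched in Python. This is a minimal, illustrative sketch, not the authors' code: the dict-based tree layout, parameter names, and default limits are all assumptions.

```python
import random

def build_random_tree(examples, features, depth=0, max_depth=10, min_leaf=5):
    """Randomized construction as described above.
    `examples`: list of (feature_dict, label) pairs.
    `features`: feature name -> 'discrete' or 'continuous' (illustrative)."""
    labels = [y for _, y in examples]
    counts = {c: labels.count(c) for c in set(labels)}
    # Stop: too few examples in the node, height limit, or no features left.
    if len(examples) < min_leaf or depth >= max_depth or not features:
        return {"counts": counts}
    # Pick a feature completely at random -- no gain function at all.
    name = random.choice(list(features))
    if features[name] == "discrete":
        # A discrete feature is used at most once on any decision path.
        child_feats = {f: t for f, t in features.items() if f != name}
        groups = {}
        for ex in examples:
            groups.setdefault(ex[0][name], []).append(ex)
        children = {v: build_random_tree(g, child_feats, depth + 1,
                                         max_depth, min_leaf)
                    for v, g in groups.items()}
        return {"feature": name, "children": children, "counts": counts}
    # A continuous feature may be reused with a fresh random threshold.
    values = [ex[0][name] for ex in examples]
    threshold = random.uniform(min(values), max(values))
    left = [ex for ex in examples if ex[0][name] <= threshold]
    right = [ex for ex in examples if ex[0][name] > threshold]
    if not left or not right:  # degenerate split: make a leaf instead
        return {"counts": counts}
    return {"feature": name, "threshold": threshold,
            "children": {"le": build_random_tree(left, features, depth + 1,
                                                 max_depth, min_leaf),
                         "gt": build_random_tree(right, features, depth + 1,
                                                 max_depth, min_leaf)},
            "counts": counts}
```

Every node, not just the leaves, keeps its per-class counts, matching the slide's "10 + and 5 -" bookkeeping.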
Classification
Each tree outputs an estimated posterior probability: a node with 10 + and 5 - outputs P(+|x,t) = 0.67.
Multiple trees average their probability estimates as the final output.
Use the estimated probability and the given loss function to choose the label that minimizes expected loss:
- 0-1 loss (traditional accuracy): choose the most probable label.
- Cost-sensitive loss: choose the label that minimizes risk.
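The decision rule above (average the per-tree estimates, then pick the label with the smallest expected loss) can be sketched as follows; the dict layouts and the `loss[pred][true]` convention are illustrative assumptions, not from the paper:

```python
def classify(per_tree_probs, loss):
    """Average the per-tree posterior estimates, then choose the label
    that minimizes expected loss under the given loss matrix.
    `per_tree_probs`: list of {label: P(label|x, tree)} dicts.
    `loss[pred][true]`: loss of predicting `pred` when truth is `true`."""
    labels = list(per_tree_probs[0])
    # Multiple trees average their probability estimates.
    avg = {y: sum(p.get(y, 0.0) for p in per_tree_probs) / len(per_tree_probs)
           for y in labels}
    # Expected loss (risk) of predicting a given label.
    def risk(pred):
        return sum(avg[true] * loss[pred][true] for true in labels)
    return min(labels, key=risk), avg
```

Under 0-1 loss this reduces to picking the most probable label; with an asymmetric loss matrix it becomes the cost-sensitive rule.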
Difference from Traditional Decision Trees
- No gain function: no info gain, Gini index, Kearns-Mansour criterion, or others.
- No feature selection: do not choose the feature with the highest "gain".
- Multiple trees: relies on probability estimates.
How Well Does It Work?
Credit card fraud detection: each transaction has a transaction amount, and there is a $90 overhead to challenge a fraud. Predict fraud iff P(fraud|x) * $1000 > $90, where P(fraud|x) * $1000 is the expected loss: when the expected loss exceeds the overhead, take action.
Three models: a traditional unpruned decision tree, a traditional pruned decision tree, and RDT.
Results
Randomization
- Feature selection randomization: RDT chooses completely at random; Random Forest considers a random subset at each node; etc.
- Feature subset randomization: a fixed random subset.
- Data randomization: bootstrap sampling (Bagging and Random Forest); data partitioning.
- Feature combination.
Methods Included
- RDT: chooses each feature randomly; chooses thresholds for continuous features randomly.
- RF and RF+ (variations of Random Forest): choose k features randomly and split on the one among the k with the highest info gain. Variation I uses the original dataset; Variation II outputs probability instead of voting.
More Methods
- Bagged probabilistic trees: bootstrap samples; traditional trees; compute probabilities.
- Disjoint subset trees: shuffle the data into equal-sized subsets; traditional trees.
Some Concepts
True posterior probability P(y|x): the probability that an example belongs to class y, conditioned on its feature vector x; generated by some unknown function F.
Given a loss function, the optimal decision y* is the class label that minimizes the expected loss.
- 0-1 loss: the most probable label. Binary problem with classes + and -: if P(+|x) = 0.7 and P(-|x) = 0.3, predict +.
- Cost-sensitive loss: choose the class label that reduces expected risk, e.g., predict fraud iff P(fraud|x) * $1000 > $90.
The optimal label y* may not always be the true label. For example, under 0-1 loss with P(+|x) = 0.6, the true label may still be - with probability 0.4.
Estimated Probability
We use a model M to "approximate" the true function F; we almost never know F. The probability estimated by model M is P(y|x,M).
The dependency on M is non-trivial:
- A decision tree uses its tree structure and the parameters within that structure to approximate P(y|x).
- A mixture model uses basis functions such as naïve Bayes and Gaussians.
What is the relation between P(y|x,M) and P(y|x)?
Important Observation
If P(y|x,M) = P(y|x), the expected loss under any loss function will be the smallest.
Interesting cases:
- If P(y|x,M) = P(y|x) under 0-1 loss, is accuracy 100%? Only if the problem is deterministic, i.e., P(y|x) = 1 for the true label and 0 for all others. Otherwise, you can only choose the most likely label, which can still be wrong for some examples.
- Can M beat the accuracy of P(y|x), even if P(y|x,M) ≠ P(y|x)? Yes, for some specific example or specific test set, but not in general, i.e., not in expected loss.
Reality
Class labels are given; however, P(y|x) is not given in any dataset unless the dataset is synthesized.
Next question: how do we set the "true" P(y|x) for a realistic dataset?
Choosing P(y|x): Naïve Approach
Assume that P(y|x) is 1 for the true class label of x and 0 for all other class labels. For example, in a two-class problem with + and -: if x's label is +, assume P(+|x) = 1 and P(-|x) = 0.
This is only true if the problem is deterministic and noise free; it is a rather strong assumption and may cause problems. Suppose x has true class label +, with M1: P(+|x,M1) = 1, P(-|x,M1) = 0 and M2: P(+|x,M2) = 0.8, P(-|x,M2) = 0.2. Both M1 and M2 are correct, but the assumption penalizes M2.
Utility-Based Choice of P(y|x)
Definition: v is the probability threshold for model M to correctly predict the optimal label y* of x: if P(y*|x,M) > v, predict y*.
Assume y* to be the true class label of an example.
- Example, binary class with 0-1 loss: v = 0.5, i.e., if P(y|x,M) > 0.5, predict y.
- Example, credit card fraud with cost-sensitive loss: P(y|x,M) * $1000 > $90, so v = 90/1000 = 0.09.
In summary, we use [v, 1] as the range of the true probability P(y|x). This is weaker than assuming P(y|x) = 1 for the true class label.
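The threshold derivation above amounts to one line of arithmetic; a small sketch, with illustrative parameter names:

```python
def probability_threshold(overhead, benefit):
    """v such that acting is worthwhile iff P(y*|x,M) > v,
    derived from P * benefit > overhead."""
    return overhead / benefit

def should_act(p_est, overhead, benefit):
    """Take the costly action iff the expected gain beats the overhead."""
    return p_est > probability_threshold(overhead, benefit)
```

For the fraud example, `probability_threshold(90, 1000)` gives v = 0.09; for binary 0-1 loss the same role is played by v = 0.5.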
Example
Two-class problem. Naïve assumption: P(y|x) = 1 for the correct label and 0 for all others.
We instead assume P(y|x) ∈ (0.5, 1], which includes the "naïve assumption" P(y|x) = 1.
We re-define some measurements to fix the "penalty" problem.
Desiderata
- If P(y|x,M) ∈ [v, 1], the exact value is immaterial, since we already predict the true label.
- When P(y|x,M) < v(x,M), the difference is important: it measures how far off we are from making the right decision.
- Take the loss function into account, since the goal is to minimize its expected value.
Evaluating P(y|x,M)
- Improved MSE (squared error), where [[a]] = min(a, 1).
- Cross-entropy: undefined either when P(y|x,M) = 0 or when the true probability P(y|x) = 0, and has no relation to the loss function.
- Reliability plots, previously proposed and used, e.g., by Zadrozny and Elkan '02 (explained later).
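The exact improved-MSE formula lives in the slide figure. One reading consistent with the definition [[a]] = min(a, 1) and the desiderata (no penalty once the estimate clears v, a growing penalty below it) is sketched here; this is an assumption, not a transcription of the authors' formula:

```python
def truncated(a):
    """[[a]] = min(a, 1), as defined on the slide."""
    return min(a, 1.0)

def improved_se(p_est, v):
    """Hypothetical per-example squared error: zero whenever the
    estimate reaches the decision threshold v, quadratic below it."""
    return (1.0 - truncated(p_est / v)) ** 2
```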
Synthetic Dataset
True probability P(y|x) is known and can be used to measure the exact MSE.
Standard Bias and Variance Decomposition of MSE
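The standard decomposition splits the MSE of repeated probability estimates into squared bias plus variance; a small plain-Python sketch with illustrative names:

```python
def bias_variance(estimates, true_p):
    """Decompose the MSE of repeated estimates of a fixed true
    probability: MSE = bias^2 + variance (population variance)."""
    n = len(estimates)
    mean = sum(estimates) / n
    bias_sq = (mean - true_p) ** 2
    variance = sum((e - mean) ** 2 for e in estimates) / n
    mse = sum((e - true_p) ** 2 for e in estimates) / n
    return bias_sq, variance, mse
```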
Charts
[Bar chart: bias and variance of probability estimates, on a 0-0.6 scale, for the Unpruned, Pruned, RDT, and RF+ models.]
Binary Dataset
Donation Dataset: Send a letter to solicit donation. Costs 68c to send a letter Cost-sensitive loss:
P(donate|x) * amt(x) > 68c Used MLR to estimate amt(x). Better results
could be obtained by Heckman’s two-step procedure (Zadrozny and Elkan’02)
How much money we got
Reliability Plot
Divide the score or output probability into bins, either of equal width (e.g., 10 or 100 bins) or with equal numbers of examples.
For the examples in the same bin:
- Average the predicted probabilities of these examples; call it bin_x.
- Divide the number of examples with label y by the total number of examples in the bin; call it bin_y.
Plot (bin_x, bin_y).
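The binning procedure above, sketched for the equal-width case with a binary label (names are illustrative):

```python
def reliability_points(predicted, actual, n_bins=10):
    """Return (bin_x, bin_y) points: bin the predicted probabilities
    into equal-width bins; per bin, bin_x is the mean prediction and
    bin_y is the empirical fraction of positive labels."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, actual):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p = 1.0 into last bin
        bins[i].append((p, y))
    points = []
    for b in bins:
        if not b:
            continue  # skip empty bins
        bin_x = sum(p for p, _ in b) / len(b)
        bin_y = sum(1 for _, y in b if y) / len(b)
        points.append((bin_x, bin_y))
    return points
```

A perfectly calibrated model yields points on the diagonal bin_y = bin_x.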
Reliability Plot: Unpruned Tree
Multi-Class Dataset
Artificial Character dataset from UCI. Class labels: 10 letters. Three loss functions:
- Top 1: the true label is the most probable letter.
- Top 2: the true label is among the two most probable letters.
- Top 3: the true label is among the three most probable letters.
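The three top-k criteria share one check; a sketch, assuming the model's output comes as a label-to-probability dict:

```python
def top_k_correct(probs, true_label, k):
    """Top-k criterion: the true label is among the k most probable."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return true_label in ranked[:k]
```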
Losses
[Bar chart: top-1, top-2, and top-3 losses, on a 0-30 scale, for the Unpruned, Pruned, RDT, and RF+ models.]
MSE
[Bar chart: MSE, on a 0-0.8 scale, for the Unpruned, Pruned, RDT, and RF+ models under the P(y|x)=1, top-1, top-2, and top-3 assumptions.]
Detailed Probability
What We Learned
On the study of probability approximation:
- Assuming P(y|x) = 1 is a very strong assumption and causes problems.
- Suggested a relaxed choice of P(y|x).
- Improved the definition of MSE to take the loss function into account.
On the methodology side: proposed a variation of Random Forest.
Summary of Experiments
Various experiments: synthetic data with known true probability P(y|x); binary and multi-class problems. Reliability plots and MSE show that the randomized approaches approximate P(y|x) significantly more closely.
Bias and variance decomposition of the probability, as compared to the loss function: the reduction comes mainly from variance, though bias is reduced as well.
What Next
We traditionally think that probability estimation is a harder problem than predicting class labels:
- Simplified approach, naïve Bayes: relies on the assumption of uncorrelated features.
- Finite mixture models: still based on an assumed basis function.
- Logistic regression: sensitive to example layout, and subjective in its use of categorical features.
- Bayesian networks: need knowledge about causal relations, and finding the optimal one is NP-hard.
continued
We show that rather simple randomized approaches approximate probability very well.
Next step: is it time for us to re-design better and simpler algorithms to approximate probability even better?