Lecture 10: Multivariate Method - Boosted Decision Tree€¦ · Multivariate Method - Boosted...
Transcript of Lecture 10: Multivariate Method - Boosted Decision Tree€¦ · Multivariate Method - Boosted...
University of Copenhagen Niels Bohr Institute
D. Jason Koskinen [email protected]
Photo by Howard Jackman
Advanced Methods in Applied Statistics Feb - Apr 2019
Lecture 10: Multivariate Method - Boosted
Decision Tree
Lecture Material Credits: H. Voss (MPIK), TMVA group
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Using likelihoods to separate background from signal is not always feasible • Likelihood may be too complicated for analytic or Monte Carlo evaluation • High dimensionality makes Monte Carlo computationally expensive
• Data sets which are linearly separable in variables, e.g. between signal and background, have useful tools for doing such a separation (Fisher Discriminant)
• For linear and non-linear classification scenarios and/or where the available separators are weak, there is a class of multivariate tools • k-Nearest Neighbor • Random Forest • Artificial Neural Networks • Support Vector Machine (can be a linear regression classifier too) • (Boosted) Decision Trees • etc.
“Simple” Problems
!2
D. Jason Koskinen - Advanced Methods in Applied Statistics
• We might not be able to easily calculate likelihoods or probability distribution functions, but we can separately identify data or generate Monte Carlo which is known signal and known background
• Use the known signal/background as training samples for a learning algorithm to classify events as signal/background based on only the available variables
• A user provides the events, true classification data sets, and variables, and the machine algorithm (hopefully) produces an inference which can be used on unclassified data to separate signal/background
Supervised Learning Algorithms
!3
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Depending on the kernel, the SVM maps the data into a higher or alternate dimension space, and makes classifications via hyperplanes
Support Vector Machine
!4
*”Support Vector Networks”, Cortes & Vapnik
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Below are events in only x, y, and z for some class of signal and background, which are not obviously easy to separate via straight cuts • Machine algorithms can ‘see’ in higher dimensions very quickly
• This semi-simple example with a boosted decision tree can get a true positive rate of 88% at 1% false positive rate
Training Samples
!5
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Statistical terms can be different depending on the research field
• A standard in Machine Learning is the terms associated with a confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix)
Clarity
!6
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Decision Tree: Sequential application of cuts splits the data into nodes, where the final nodes (leaves) classify an event as signal or background • Easy to visualize and
interpret
• Resistant to outliers in data
• Disadvantage is that statistical fluctuations in training create instability
Decision Tree
!7
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Machine Learning algorithms can be overly optimized wherein statistical fluctuations from the training data are wrongly characterized as true features of the distributions • Deficit of training data statistics versus number of variables or
complexity
• Model flexibility, e.g. many free parameters
Overtraining
!8*H. Voss (MPIK)
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Just because the training uses lots of data, doesn’t mean the algorithm is overtrained. It’s the balance between statistical fluctuations in the training sample AND the complexity of the model that drive ‘overtraining’.
Overtraining
!9*H. Voss (MPIK)
D. Jason Koskinen - Advanced Methods in Applied Statistics
• A common way to check over-training for any machine learning algorithm is to test the learned inference classification on a statistically independent data set of known signal/background
• Classification should be as similar between the training sample results and testing sample results as statistical fluctuations permit. Significant differences, e.g. in the ROC curves, are a common indicator of over training.
Testing & Training
!10
*n.b. just because the training uses lots of data, doesn’t mean the algorithm is overtrained. It’s the balance between statistical
fluctuations in the training sample AND the complexity of
the model that drive ‘overtraining’.
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Start with a training sample and split using a variable that gives the best separation (for some definition of ‘best’, e.g. Gini-index)
• Continue splitting until some threshold is met: • Statistics per node
• Number of nodes
• Depth of decision tree
• Further splits fall below separation threshold
• Events that fall in the final nodes are classified as signal/background via some metric (binary classification, sig/bkg probability, etc.)
Creating the Decision Tree
!11
D. Jason Koskinen - Advanced Methods in Applied Statistics
• The decision tree (left) and the 2D space (right) which represent the different areas of x1 and x2 classified as signal/background regions by the decision tree
Decision Tree Walkthrough
!12
*J. Therhaag
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Similar to machine learning algorithms, a decision tree can be overtrained. Specifically, it can be very sensitive to statistical fluctuations in the training sample. • High statistics training samples can minimize the impact, but never
remove it
• Removing nodes with low separation power helps, due to reducing the complexity
• Could generate multiple trees and combine them to increase separation power and decrease overtraining, but the decision process would create identical trees for the same training sample
• Common solution to avoid identical decisions with multiple iterations is to use Boosting
Decision Tree Overtraining
!13
D. Jason Koskinen - Advanced Methods in Applied Statistics
• d
Boosting
!14
D. Jason Koskinen - Advanced Methods in Applied Statistics
• While all boosted decision trees have the same principal, the algorithms that determine the ‘boosting’, splitting and pruning of the leaves, etc. are different
• Gradient boosting minimizes the residual error from the prediction/classification of the previous ‘tree’, and uses gradient information to inform the new tree(s)
• Adaptive boosting iteratively modifies training events with new importance ‘weights’ in order to improve learning
Boosting Algorithm
!15
D. Jason Koskinen - Advanced Methods in Applied Statistics !16
D. Jason Koskinen - Advanced Methods in Applied Statistics
Adaptive Boosting Visually
!17
*AdaBoost by Y. Freund & R. E.Schapire
D. Jason Koskinen - Advanced Methods in Applied Statistics
AdaBoost Boosted Decision Trees
!18
• Past the first one, each iterative boosted decision tree (classifier) is trained on the ‘same’ events. But now, the events have weights according to whether they were previously wrongly classified. Also, the regions have weights (W1, W2, W3 in the example below) corresponding to the classification error.
D. Jason Koskinen - Advanced Methods in Applied Statistics
• The combined classifier is the weighted average from all trees for the different regions
• Works very well “out-of-the-box”
AdaBoost Boosted Decision Trees
!19
D. Jason Koskinen - Advanced Methods in Applied Statistics
• After training, and hopefully testing, the BDT can generate a score when run over new data that allows signal/background separation • Place a cut at some value of the test statistic (here it is the BDT
score) to get desired precision and true positive rate
Boosted Decision Tree Classifier
!20
BDT Score1− 0.8− 0.6− 0.4− 0.2− 0 0.2 0.4 0.6 0.8 10
100
200
300
400
500
SignalBackground Keep
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Different software packages and algorithms will return different ‘test statistics’ for the classification • From AdaBoost, we get the ‘decision score’ as the test statistic.
More negative values are background.
• XGBoost gives a probability for an event to be background or signal, which is the test statistic.
Boosted Decision Tree Classifier
!21
BDT Score1− 0.8− 0.6− 0.4− 0.2− 0 0.2 0.4 0.6 0.8 10
100
200
300
400
500
SignalBackground Keep
We are using the BDT as a classifier and want a decision
about whether a new data event is more similar to class-A or
class-B, e.g. signal or background. In AdaBoost we use a “BDT Score” which is here the
BDT decision score.
D. Jason Koskinen - Advanced Methods in Applied Statistics
• It is common to throw an absurd number of variables into a BDT and have the software signify the variables of importance. The more variables used in any supervised learning algorithm, the more difficult it is to debug when something goes wrong, e.g. user error.
• The number of nodes, variables, size of training sample, and depth of each tree can influence the classification outcome. Because BDTs are generally fast to train, play around with the settings/options to see the effects.
• Ensure that the variables used in training match the distribution shapes in data. Poor variable agreement will bias the BDT, and if the BDT uses many variables it can be hard to notice that a problem exists.
BDT Comments
!22
D. Jason Koskinen - Advanced Methods in Applied Statistics
• On the class webpage there are 4 files which include signal and background for training and testing of a learning algorithm (nominally a boosted decision tree)
• The data is generated as an unknown function of three variables
• Plot the three variables for the signal and background samples for training: • 1D histograms, one for each variable
• 2D graphs of the x vs. y, x vs. z, and y vs. z
• After training the BDT, plot the BDT decision score for the test sample separated by color for the signal and background
Exercise #1
!23
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Not a whole lot of separation going on here
Exercise #1 Simple Plots
!24
x3− 2− 1− 0 1 2 3 4 5 6 7
0
20
40
60
80
100
120
140
160
180
200
220
240
Training Sample
y0 1 2 3 4 5 6
0
20
40
60
80
100
120
140
160
180
200
220
240
Training Sample
z0 5 10 15 20 25 30 35 40 45
0
20
40
60
80
100
120
140
160
180
200
220
240
Training Sample
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Not much better in 2D than it was in 1D
Exercise #1 in 2D
!25
D. Jason Koskinen - Advanced Methods in Applied Statistics
Exercise #1 BDT Score Result
!26
BDT Score1− 0.8− 0.6− 0.4− 0.2− 0 0.2 0.4 0.6 0.8 10
100
200
300
400
500
SignalBackground
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Now with 16 variables using the single file on the webpage
• Separate the data from the file into a training and testing sample and repeat • Do not use any training data as a test sample
• Train a BDT (or some other machine learning algorithm) and plot the output scores
• Are any variables essentially worthless? What happens when the lowest rank variable is removed from the training?
Exercise #2
!27
D. Jason Koskinen - Advanced Methods in Applied Statistics
1D Histograms of the 16 Variables
!28
Entries 3000Mean 12.98RMS 2.514
Variable00 5 10 15 20 25
Freq
uenc
y
0
20
40
60
80
100
120
140 Entries 3000Mean 12.98RMS 2.514
Entries 3000Mean 49.96RMS 6.154
Variable120 30 40 50 60 70 80 90
Freq
uenc
y
0
20
40
60
80
100
120
140Entries 3000Mean 49.96RMS 6.154
Entries 3000Mean 0.003199− RMS 0.9744
Variable25− 4− 3− 2− 1− 0 1 2 3 4 5
Freq
uenc
y
0
20
40
60
80
100
120
140Entries 3000Mean 0.003199− RMS 0.9744
Entries 3000Mean 0.5014RMS 0.4945
Variable31− 0.5− 0 0.5 1 1.5 2 2.5 3 3.5 4
Freq
uenc
y
0
20
40
60
80
100
120
Entries 3000Mean 0.5014RMS 0.4945
Entries 3000
Mean 0.008832− RMS 0.7032
Variable42− 1.5− 1− 0.5− 0 0.5 1 1.5 2
Freq
uenc
y
0
20
40
60
80
100Entries 3000
Mean 0.008832− RMS 0.7032
Entries 3000Mean 0.01131RMS 0.7166
Variable52− 1.5− 1− 0.5− 0 0.5 1 1.5 2
Freq
uenc
y
0
20
40
60
80
100 Entries 3000Mean 0.01131RMS 0.7166
Entries 3000Mean 0.4856RMS 0.4645
Variable60 0.5 1 1.5 2 2.5
Freq
uenc
y
0
20
40
60
80
100
120
140
160 Entries 3000Mean 0.4856RMS 0.4645
Entries 3000Mean 0.4816RMS 0.286
Variable70 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Freq
uenc
y
0
10
20
30
40
50
60
70 Entries 3000Mean 0.4816RMS 0.286
Entries 3000Mean 5.134RMS 2.687
Variable80 2 4 6 8 10 12 14
Freq
uenc
y
0
10
20
30
40
50
60
70
80Entries 3000Mean 5.134RMS 2.687
Entries 3000Mean 0.9989RMS 0.1014
Variable90 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
Freq
uenc
y
0
50
100
150
200
250 Entries 3000Mean 0.9989RMS 0.1014
Entries 3000Mean 0.6294− RMS 1.243
Variable105− 4− 3− 2− 1− 0 1 2
Freq
uenc
y
0
10
20
30
40
50
60
70
80
90Entries 3000Mean 0.6294− RMS 1.243
Entries 3000Mean 1.158RMS 1.766
Variable115− 4− 3− 2− 1− 0 1 2 3 4 5
Freq
uenc
y
0
10
20
30
40
50
Entries 3000Mean 1.158RMS 1.766
Entries 3000Mean 5.983RMS 2.31
Variable120 5 10 15 20 25
Freq
uenc
y
0
100
200
300
400
500Entries 3000Mean 5.983RMS 2.31
Entries 3000Mean 9.989RMS 2.264
Variable130 5 10 15 20 25
Freq
uenc
y
0
100
200
300
400
500
Entries 3000Mean 9.989RMS 2.264
Entries 3000Mean 1.598RMS 1.116
Variable140.5− 0 0.5 1 1.5 2 2.5 3 3.5
Freq
uenc
y
660
680
700
720
740
760
780
800
820
840Entries 3000Mean 1.598RMS 1.116
Entries 3000Mean 0.9013RMS 1.078
Variable150.5− 0 0.5 1 1.5 2 2.5 3 3.5
Freq
uenc
y
200
400
600
800
1000
1200
1400
1600
Entries 3000Mean 0.9013RMS 1.078
D. Jason Koskinen - Advanced Methods in Applied Statistics
Exercise #2 Correlation Plots
!29
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Surprisingly from looking at the data, a BDT can get decent separation
Exercise #2 BDT Scores
!30
BDT Score1− 0.8− 0.6− 0.4− 0.2− 0 0.2 0.4 0.6 0.8 10
10
20
30
40
50
60
70
D. Jason Koskinen - Advanced Methods in Applied Statistics
• The following is one of the problems from the 2016 take-home exam
• For 2b the goal is to predict the cancer rate from an MVA trained on a blind data sample • 87%+ true positive rate got 100% credit
• 85-87% true positive rate got 80% credit
• 80-85% true positive rate got 60% credit
• <80% true positive rate got 40% credit (for a valid effort)
• I will post the ‘solution’ data set so you can test your own training acumen
Next
!31
D. Jason Koskinen - Advanced Methods in Applied Statistics
• A cancer study in 1991 conducted in Wisconsin collected data from ~700 patients. There were 9 variables associated with the digitized image of a fine needle aspirate biopsy sample of a tissue mass. Each variable has discrete values from 1-10. There was also the patient identifier (code number) and whether the sample mass was ultimately benign or malignant.
Problem 2
!32
# Attribute Domain/Range —————————————————————
1. Sample code number id number 2. Clump Thickness 1 - 10 3. Uniformity of Cell Size 1 - 10 4. Uniformity of Cell Shape 1 - 10 5. Marginal Adhesion 1 - 10 6. Single Epithelial Cell Size 1 - 10 7. Bare Nuclei 1 - 10 8. Bland Chromatin 1 - 10 9. Normal Nucleoli 1 - 10 10. Mitoses 1 - 10
11. Class: (2 for benign, 4 for malignant)
*From 2016 Exam
D. Jason Koskinen - Advanced Methods in Applied Statistics
• There are two files online: training data and blind data
• The following training data includes the aforementioned variables as well as whether the biopsy was benign or malignant (www.nbi.dk/~koskinen/Teaching/AdvancedMethodsInAppliedStatistics2016/data/breast-cancer-wisconsin_train-test.txt)
• The following data includes the same variables, but the information of whether the biopsy was benign or malignant has been removed, i.e. a blind sample, (www.nbi.dk/~koskinen/Teaching/AdvancedMethodsInAppliedStatistics2016/data/breast-cancer-wisconsin_mod_real.txt)
• Using some method (straight cuts, support vector machine, boosted decision tree, etc.) and the training data, come up with a classification algorithm which uses the 9 variables to identify malignant and benign tissue samples • Note: the separate variables can be assumed to have the same features
and shape between the training and blind sample
Problem 2a
!33
*From 2016 Exam
D. Jason Koskinen - Advanced Methods in Applied Statistics
• With a developed algorithm, run the classifier over the training sample and calculate the efficiency (true positive rate) of identifying a malignant mass • It is possible to get 100% true positive rate, provided the method is
overtrained
• But, you will have to use the same settings for classifying the blind sample in Problem 2b
• Calculate the overall classification true positive rate for the whole training sample • (classified_true_malignant + classified_true_benign)/
(total_trainingtest_sample)
Problem 2a cont.
!34
*From 2016 Exam
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Using the same setting(s) as developed in Problem 2a, run the classifier over all the entries in the blind sample (breast-cancer-wisconsin_mod_real.txt) • Produce a text file which contains only the ID of the samples which
your classifier classifies as malignant (last_name.malignant_ID.txt)
• Produce a text file which contains only the ID of the samples which your classifier classifies as benign (last_name.benign_ID.txt)
• Basic text files. No Microsoft Word documents, Adobe PDF, or any other extraneous text editor formats. Only a single ID number per line in the text file that can be easily read by numpy.loadtxt(). • Example online at http://www.nbi.dk/~koskinen/Teaching/
AdvancedMethodsInAppliedStatistics2016/data/koskinen.benign_ID.txt
• Any and all duplicates, i.e. two samples with the same ID, should be kept and included in the text files and analysis
Problem 2b
!35
*From 2016 Exam
D. Jason Koskinen - Advanced Methods in Applied Statistics
• Gradient Boosting • Original paper covering gradient boosting https://projecteuclid.org/
euclid.aos/1013203451
• XGBoost - “Extreme Gradient Boost” • Paper from XGBoost authors at https://doi.org/10.1145/2939672.2939785
• Weight quantile sketch algorithm in XGBoost at https://homes.cs.washington.edu/~tqchen/pdf/xgboost-supp.pdf
• Nice webpage covering explanation of gradient boosting concept https://explained.ai/gradient-boosting/
• Adaptive Boosting • AdaBoost is one of the most common boosting algorithms https://
doi.org/10.1006/jcss.1997.1504
Additional Information
!36