Bootstrapped Aggregationand Random Forests
Bonus: Regression Trees
2
o Suppose you want to PREDICT the salary of a MLB player based on two features: how many hits they average a year and how many years they’ve been in the league
Years
Hits
1
117.5
238
1 4.5 24
R1
R3
R2
|Years < 4.5
Hits < 117.55.11
6.00 6.74
.
• ×(
Ri-
i
'
Mgau( SAHR ) 1=5.11Rz 123
Bonus: Regression Trees
3
o Suppose you want to PREDICT the salary of a MLB player based on two features: how many hits they average a year and how many years they’ve been in the league
o Perform the usual binary splitting by feature
o Instead of classification, predict the response based on average response in leaf node
Bonus: Regression Trees
4
o Instead of classification, predict the response based on average response in leaf node
|
t1
t2
t3
t4
R1
R1
R2
R2
R3
R3
R4
R4
R5
R5
X1
X1X1
X2
X2
X2
X1 ≤ t1
X2 ≤ t2 X1 ≤ t3
X2 ≤ t4
|
t1
t2
t3
t4
R1
R1
R2
R2
R3
R3
R4
R4
R5
R5
X1
X1X1
X2
X2
X2
X1 ≤ t1
X2 ≤ t2 X1 ≤ t3
X2 ≤ t4
|
t1
t2
t3
t4
R1
R1
R2
R2
R3
R3
R4
R4
R5
R5
X1
X1X1
X2
X2
X2
X1 ≤ t1
X2 ≤ t2 X1 ≤ t3
X2 ≤ t4
L gift.
,:|q
Bonus: Regression Trees
5
o What measure do we use to decide which feature to split on?
o The goal is to find boxes that minimize the
o Consider splitting the data on a node into two boxes. Choose feature and split to minimize
where
RSS =JX
j=1
X
i2Rj
�yi � yRj
�2
<latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit><latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit><latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit><latexit sha1_base64="VB4eeT2jjl/WokMOFtW/uCEEKLg=">AAACRHicbVA9b9RAEF2Hr2A+ckBJM+KCFApO9jWBIlKkNIgqHxyJdHux1nvju03Wa2t3HMmy/OfSpKfjH9BQEESLWN9dAQlPGuntezOanZeWWjmKoq/B2p279+4/WH8YPnr85OlG79nzz66orMSRLHRhT1LhUCuDI1Kk8aS0KPJU43F6vtf5xxdonSrMJ6pLnORiZlSmpCAvJT2+GR4eHcEO8KlypRa1o1ojd1WeNGc7cXv6EZYPBVwZOEzOWuAaM9qCOlHwFvhcUFO3SdNZ3KrZnN6cDsPN0CPp9aNBtADcJvGK9NkK+0nvC58WssrRkNTCuXEclTRphCUlNbYhrxyWQp6LGY49NSJHN2kWKbTw2itTyArryxAs1L8nGpE7V+ep78wFzd1NrxP/540ryt5NGmXKitDI5aKs0kAFdJHCVFmUpGtPhLTK/xXkXFghyQffhRDfPPk2GQ0H7wfxwbC/u71KY529ZK/YFovZNttlH9g+GzHJLtk39oNdB1fB9+Bn8GvZuhasZl6wfxD8/gPTFa75</latexit>
X
i: xi2R1(j,s)
(yi � yR1)2 +
X
i: xi2R2(j,s)
(yi � yR2)2
<latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit><latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit><latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit><latexit sha1_base64="4N+0l5Bkl4OtHgO3IdysuJfHzHo=">AAACiXichVHRTtswFHXCGJCxUdgjL9YKUhGjSvJAgSckXniEaR1ITYkc96Y1OE5k3yCiKPwLv7Q3/ga3VBoDJI5k6eice4/te5NCCoO+/+i4C58WPy8tr3hfVr9+W2utb/wxeak59Hkuc32ZMANSKOijQAmXhQaWJRIukpuTqX9xC9qIXP3GqoBhxsZKpIIztFLcetjyopEwhWSVwUpCZMosrsXR/V0saCQU/RUHneufZqehkYQUO7Syxh6NJgzrqolr6zeRFuMJ7lyFdJd+lBZ+kBb+S/O2PIu41fa7/gz0LQnmpE3mOItbf6NRzssMFHLJjBkEfoHDmmkUXELjRaWBgvEbNoaBpYplYIb1bJQN3bbKiKa5tkchnakvO2qWGVNlia3MGE7Ma28qvucNSkwPhrVQRYmg+PNFaSkp5nS6FzoSGjjKyhLGtbBvpXzCNONotzcdQvD6y29JP+wedoPzsH3cm09jmWySH6RDAtIjx+SUnJE+4c6Ss+fsOz131Q3dA/foudR15j3fyX9wT54A34rBEg==</latexit>
R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj � s}<latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit><latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit><latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit><latexit sha1_base64="Zs67o2YmObQBXLhrj9WKj1xrYoQ=">AAACPXicbVBNSxxBFOxRk5jJh6sec3m4BgyEZWYvJhBByMWjkYwu7CxDT8/btbW7Z9L9JrgM+8u85DfklqMXD0nINdf0rHswakFDUVWP16/ySklHUfQjWFpeefT4yerT8NnzFy/XOusbx66srcBElKq0g5w7VNJgQpIUDiqLXOcKT/Lzj61/8hWtk6X5TNMKR5pPjBxLwclLWSfZDo+yeOfsrXsDe5A2A0i1LGCQncEHcOkMUsILsroBbgqYwVHWfzCcTvBLmw+3Q4+s04160Rxwn8QL0mULHGad72lRilqjIaG4c8M4qmjUcEtSKJyFae2w4uKcT3DoqeEa3aiZnz+D114pYFxa/wzBXL090XDt3FTnPqk5nbq7Xis+5A1rGr8bNdJUNaERN4vGtQIqoe0SCmlRkJp6woWV/q8gTrnlgnzjbQnx3ZPvk6Tfe9+LP/W7+7uLNlbZK7bFdljMdtk+O2CHLGGCXbIr9pP9Cr4F18Hv4M9NdClYzGyy/xD8/QcN2aif</latexit>
,# regions
Ey= =
-
=
= -
Previously on CSCI 4622
6
Last time we saw that full Decision Tree Classifiers tended to overfit
Previously on CSCI 4622
7
Last time we saw that full Decision Tree Classifiers tended to overfit
We could alleviate this a bit with pruning
Prepruning:o Don’t let the tree have too many levels
o Stopping conditions
Postpruning: o Build the complete tree
o Go back and remove vertices that don’t affect performance too much
Ensemble Methods
8
Today we’ll see how we can use many decision trees to do even better
Ensemble methods are very popular in Machine Learning of late
General Idea: o Train numerous slightly different classifiers
o Use each classifier to make a prediction on a query point
o Make the ensembled prediction by taking majority vote from individual classifiers
Bagging
9
This week we’ll look at three different types of Ensembled Decision Tree Classifiers
The simplest is called Bootstrapped Aggregation (or Bagging)
Idea: What would have happened if those trainingexamples that we clearly overfit to weren’t there?
00
Bagging
10
The simplest is called Bootstrapped Aggregation (or Bagging)
Bagging
11
The simplest is called Bootstrapped Aggregation (or Bagging)
o Sample n training examples with replacemento Fit full-depth Decision Tree Classifier
o Repeat many times
o Aggregate results by majority vote
n Training8×5
⇒-
Bagging
12
The simplest is called Bootstrapped Aggregation (or Bagging)
Bagging
13
The simplest is called Bootstrapped Aggregation (or Bagging)
Bagging
14
The simplest is called Bootstrapped Aggregation (or Bagging)
213 's IR
⇒ naV⇒B
• ••
B B R
Bootstrap Refresher
15
.tt#TF )
-
=
p§*YEw0¥÷ ;;¥¥⇒£⇒201L
30kEMPIRICAL
D 1St.
YouR , = 10/20,20 ,
20,20 ,zo
- Rz= 10110,70/30,10 ,2i
Bagging
16
In general, assume that we define and each individual DCT has
Then we can define the Bagged Classifier as
How do we Validate the Bagged model?
o Could use cross-validation or a held-out validation set, as usual
o But there is a better way.
y 2 {�1, 1}<latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit><latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit><latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit><latexit sha1_base64="+dcH34nr7UAWK2Kcg6/pOeN+7ic=">AAAB/XicbVBNS8NAEJ34WeNXVDx5WWwFD1qSXqq3ghePFawtNKFstpt26WYTdjdCCAX/ihcPKl79H978N24/Dtr6YODx3gwz88KUM6Vd99taWV1b39gsbdnbO7t7+87B4YNKMkloiyQ8kZ0QK8qZoC3NNKedVFIch5y2w9HNxG8/UqlYIu51ntIgxgPBIkawNlLPOa7Yuc8E8otL78Lzx3bFNug5ZbfqToGWiTcnZZij2XO+/H5CspgKTThWquu5qQ4KLDUjnI5tP1M0xWSEB7RrqMAxVUExPX+MzozSR1EiTQmNpurviQLHSuVxaDpjrIdq0ZuI/3ndTEdXQcFEmmkqyGxRlHGkEzTJAvWZpETz3BBMJDO3IjLEEhNtEpuE4C2+vExatep11burlRv1eRolOIFTOAcP6tCAW2hCCwgU8Ayv8GY9WS/Wu/Uxa12x5jNH8AfW5w9W0pH/</latexit>
hk(x)<latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit><latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit><latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit><latexit sha1_base64="KOJGeqMAnba4cY+E9X2PCsU1FKg=">AAAB/HicbVC7TsNAEFzzDOZlHh3NiQQpNJGdJtBFoqEMEiaREis6X87JKeeH7s6IYEX8Cg0FIFo+hI6/4Zy4gISRVhrN7Gp3x084k8q2v42V1bX1jc3Slrm9s7u3bx0c3sk4FYS6JOax6PhYUs4i6iqmOO0kguLQ57Ttj69yv31PhWRxdKsmCfVCPIxYwAhWWupbxxVz1B9Xs54foIfpuVkxNfpW2a7ZM6Bl4hSkDAVafeurN4hJGtJIEY6l7Dp2orwMC8UIp1Ozl0qaYDLGQ9rVNMIhlV42u36KzrQyQEEsdEUKzdTfExkOpZyEvu4MsRrJRS8X//O6qQouvIxFSapoROaLgpQjFaM8CjRgghLFJ5pgIpi+FZERFpgoHVgegrP48jJx67XLmnNTLzcbRRolOIFTqIIDDWjCNbTABQKP8Ayv8GY8GS/Gu/Exb10xipkj+APj8wcunJH1</latexit>
H(x) = sign
"KX
k=1
hk(x)
#
<latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit><latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit><latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit><latexit sha1_base64="hedUj3SeYVZGuiD6Sfx1qXtpQsI=">AAACPXicbVA9a9xAFFw5jj+U2L4kZZrF54DdHNI1tgvDQRpDGhui2HCSxWrv6W653ZXYfTInhH5ZmvyGdCnTpEhM2rTeOx/BXwMLw8wb3r7JSiksBsEPb+XF6su19Y1N/9Xrre2dzpu3X2xRGQ4RL2RhLjNmQQoNEQqUcFkaYCqTcJFNP879i2swVhT6M9YlJIqNtcgFZ+iktBPt+af7TZzldNYe0BMaI8zQqMaKsW5jCTkO45GwpWS1xVpCbCuVNtOTsL36RCfp9H82NmI8wcTf8x3STjfoBQvQpyRcki5Z4iztfI9HBa8UaOSSWTsMgxKThhkUXELrx5WFkvEpG8PQUc0U2KRZnN/SD04Z0bww7mmkC/V+omHK2lplblIxnNjH3lx8zhtWmB8ljdBlhaD53aK8khQLOu+SjoQBjrJ2hHEj3F8pnzDDOLrG5yWEj09+SqJ+77gXnve7g8NlGxvkPdkl+yQkh2RATskZiQgnX8lP8pv88b55v7wb7+/d6Iq3zLwjD+D9uwUO6a1Y</latexit>
* ,-1,
1,1a=
=
-2=2Sign (2) =/
Out-of-Bag Error Estimation
17
o Trees are repeatedly fit on bootstrapped re-samples of the training set
o Can be shown that around 2/3 of training examples present in each bootstrapped re-sample
o OOB Error Estimation makes use of the training examples NOT in the re-sample
Here’s How: 560
Xi ⇒ "gi I ,hiucx;)) ⇒ +1 or
- I
has ¥Rin
Bagging Pros and Cons
18
Pros:o Results in a much lower variance classifier than a single Decision Tree
Cons: o Results in a much less interpretable model
We essentially give up some interpretability in favor of better prediction accuracy
But we can still get insight into what our model is doing using Bagging
Bagging and Feature Importance
19
In a single Decision Tree Classifier we can get an idea of feature importance by
o Measuring Information Gain / reduction in Impurity is achieved by splitting on feature
o Compare these values at the end and rank features
With Bagging we can still do this.
We just compute these measures over all trees and average
Thal
Ca
ChestPain
Oldpeak
MaxHR
RestBP
Age
Chol
Slope
Sex
ExAng
RestECG
Fbs
0 20 40 60 80 100
Variable Importance
An Even Better Approach
20
It turns out, we can take the Bagging idea and make it even better
Suppose you have a particular feature that is a very strong predictor for the response
So strong that in almost all Bagged Decision Trees, it’s the first feature that gets split on
When this happens, the trees can all look very similar. They’re too correlated
Maybe always splitting on the same feature isn’t the best idea
Maybe, like with bootstrapping training examples, we split on a random subset of features too
Random Forests
21
o Still doing Bagging, in the sense that we fit bootstrapped resamples of training data
o BUT, every time we have to choose a feature to split on, don’t consider all p features
o Instead, consider only features to split on
Advantages:o Since at each split, considering less than half of features, this gives very different trees
o Very different trees in your ensemble decreases correlation, and gives more robust results
o Also, slightly cheaper because you don’t have to evaluate each feature to choose best split
m ⇡ pp
<latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit><latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit><latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit><latexit sha1_base64="6OmA/EtTp4raBkb6upyfWmzZprc=">AAAB/XicbVBNSwMxEM36WevXqnjyEmwFT2W3l+qt4MVjBdcWukvJptk2NNnEJCuWpeBf8eJBxav/w5v/xrTdg7Y+GHi8N8PMvFgyqo3nfTsrq2vrG5ulrfL2zu7evntweKdFpjAJsGBCdWKkCaMpCQw1jHSkIojHjLTj0dXUbz8QpalIb81YkoijQUoTipGxUs89rnIYIimVeIShvlcml5Mq7LkVr+bNAJeJX5AKKNDquV9hX+CMk9RghrTu+p40UY6UoZiRSTnMNJEIj9CAdC1NESc6ymfnT+CZVfowEcpWauBM/T2RI671mMe2kyMz1IveVPzP62YmuYhymsrMkBTPFyUZg0bAaRawTxXBho0tQVhReyvEQ6QQNjaxsg3BX3x5mQT12mXNv6lXmo0ijRI4AafgHPigAZrgGrRAADDIwTN4BW/Ok/PivDsf89YVp5g5An/gfP4AvouU3Q==</latexit>
mm
Practical Results
22
o Look at expression of 500 different genes in tissue samples. Predict 15 cancer states
0 100 200 300 400 500
0.2
0.3
0.4
0.5
Number of Trees
Test
Cla
ssifi
catio
n E
rro
r
m=p
m=p/2m= p
[James,etal]
Where Do We Go From Here?
23
o Unfortunately, Decision Tree Classifiers are prone to overfitting
o On their own, tend to do worse than other classifiers we’ve seen
But WITH THEIR POWERS COMBINED!
Next up, Ensemble Methods.
o Take various weaker classifiers, and combine them to get strong classifiers
o Bagging and Random Forests
o Boosted Decision Trees and the AdaBoost Algorithm
24
25
26
Top Related