Ensemble Based Systems in Decision Making
description
Transcript of Ensemble Based Systems in Decision Making
Ensemble Based Systems in Decision Making
Yu-Mei, ChangSpeech Lab, CSIE
National Taiwan Normal University
Robi PolikarThird Quarter 2006IEEE circuits and system magazine
National Taiwan Normal University 2
Abstract • Ensemble based systems have shown to produce favorable results
compared to those of single-expert systems for a broad range of applications and under a variety of scenarios.
• This paper reviews conditions under which ensemble based systems may be more beneficial than their single classifier counterparts, algorithms for generating individual components of the ensemble systems, and various procedures through which the individual classifiers can be combined.
• They discuss popular ensemble based algorithms– bagging, boosting, AdaBoost, stacked generalization , and hierarchical
mixture of experts.
• As well as commonly used combination rules.– algebraic combination of outputs, voting based techniques, behavior
knowledge space, and decision templates.
2009/10/05
National Taiwan Normal University 3
Reasons for Using Ensemble Based Systems (1)
a) Statistical Reason– A set of classifiers with similar training performances may have different
generalization performances.– Combining the outputs of several classifiers by averaging may reduce the
risk of an unfortunate selection of a poorly performing classifier.– The averaging may or may not beat the performance of the best classifier in
the ensemble, but it certainly reduces the overall risk of making a particularly poor selection.
b) Large Volumes of Data– In certain applications, the amount of data to be analyzed can be too large to
be effectively handled by a single classifier.
c) Too little data– Ensemble systems can also be used to address the exact opposite problem
of having too little data.– resampling techniques
2009/10/05
National Taiwan Normal University 4
Reasons for Using Ensemble Based Systems (2)d) Divide and Conquer
– By dividing the data space into smaller and easier-to-learn partitions, where each classifier learns only one of the simpler partitions.
2009/10/05
National Taiwan Normal University 5
Reasons for Using Ensemble Based Systems (3)
e) Data Fusion– If we have several sets of data obtained from various sources, where the
nature of features are different (heterogeneous features), a single classifier cannot be used to learn the information contained in all of the data.
– Applications in which data from different sources are combined to make a more informed decision are referred to as data fusion applications.
– Ensemble based approaches have successfully been used for such applications.
• All ensemble systems must have two key components:1) an algorithm to generate the individual classifiers of the ensemble2) a method for combining the outputs of these classifiers
2009/10/05
National Taiwan Normal University 6
Brief History of Ensemble Systems• Perhaps one of the earliest work on ensemble systems is Dasarathy and
Sheela’s 1979 paper , which discusses partitioning the feature space using two or more classifiers.
• Schapire proved that a strong classifier in probably approximately correct (PAC) sense can be generated by combining weak classifiers through boosting ,the predecessor of the suite of AdaBoost algorithms.
• There are generally two types of combination:– classifier selection
• each classifier is trained to become an expert in some local area of the total feature space
• One or more local experts can be nominated to make the decision– classifier fusion
• all classifiers are trained over the entire feature space• the classifier combination process involves merging the individual
(weaker) classifier designs to obtain a single (stronger) expert of superior performance.
2009/10/05
National Taiwan Normal University 7
Diversity:Cornerstone of Ensemble Systems(1)
• The strategy in ensemble systems is therefore to create many classifiers, and combine their outputs such that the combination improves upon the performance of a single classifier.
• Individual classifiers make errors on different instances. – The intuition is that if each classifier makes different errors, then a strategic
combination of these classifiers can reduce the total error• They need classifiers whose decision boundaries are adequately different
from those of others. – Such a set of classifiers is said to be diverse.
• Classifier diversity can be achieved in several ways.– The most popular method is to use different training datasets to train
individual classifiers– Such datasets are often obtained through resampling techniques, such as
bootstrapping or bagging, where training data subsets are drawn randomly, usually with replacement, from the entire training data.
2009/10/05
National Taiwan Normal University 8
Diversity:Cornerstone of Ensemble Systems(2)
• Random and overlapping training data subsets are selected to train three classifiers, which then form three different decision boundaries.
• These boundaries are combined to obtain a more accurate classification.
2009/10/05
National Taiwan Normal University 9
Diversity:Cornerstone of Ensemble Systems(3)
• If the training data subsets are drawn without replacement, the procedure is also called jackknife or k-fold data split: – The entire dataset is split into k blocks, and each classifier is trained only on
k-1 of them.– A different subset of k blocks is selected for each classifier as shown in
Figure 4.
2009/10/05
National Taiwan Normal University 10
Measures of Diversity(1)• Several measures have been defined for quantitative assessment of
diversity.– The simplest ones are pair-wise measures, defined between two classifiers.– For T classifiers, they can calculate T(T-1)/2 pair-wise diversity measures.– An overall diversity of the ensemble can be obtained by averaging these
pair-wise measures.
• Given two hypotheses and– a is the fraction of instances that are correctly classified by both classifiers– b is the fraction of instances correctly classified by but incorrectly
classified by
2009/10/05
is correct is incorrect
is correct a b
is incorrect c d
ih jh
ih
ih
jhjh
ihjh
National Taiwan Normal University 11
Measures of Diversity(2)
a) Correlation – Diversity is measured as the correlation between two classifier outputs,
defined as
– Maximum diversity is obtained for , indicating that the classifiers are uncorrelated.
b) Q-Statistic– Defined as– Q assumes positive values if the same instances are correctly classified by
both classifiers; and negative values, otherwise.
c) Disagreement and Double Fault Measures– The diversity increases with both the disagreement and the double fault
value
2009/10/05
10
,
dbcadcbabcad
j,i
bcad/bcadQ j,i
cbD j,i
dDF j,i
0
National Taiwan Normal University 12
Bagging(1)• Breiman’s bagging, short for bootstrap aggregating, is one of the earliest
ensemble based algorithms.
• It is also one of the most intuitive and simplest to implement, with a surprisingly good performance.
• Different training data subsets are randomly drawn—with replacement—from the entire training data.
2009/10/05
National Taiwan Normal University 13
Bagging(2)
2009/10/05
tion.classifica final theas votetalhighest to thereceives that class the4.Choose
1 classeach by received voteotal3.Obtain t
.classifierby class given to vote thebe otherwise 0
class picks if 12.Let
. on ensemble the1.Evaluate instance unlabledGiven -
. ensemble, the to 3.Add . r)(classifie hypothesis thereceive and with 2.Call . ofpercent drawingrandomly by replica edbootstrapp a 1.Take
1 data trainingedbootstrapp create toF fraction) of (Percent
.iterations ofnumber specifying Integer , algorithm learning Weak
classes ngrepresenti labelscorrect with data Training
1
1
C,,j,υV
,
h, υ
xx
EhhSSS
,T,t
T
C,ω,ωΩωS
T
tt,jj
jtt,j
t
tt
t
Ci
Voting Majority Simple:TestEnd
WeakLearn
Do
WeakLearn
:inputBagging:Algorithm
National Taiwan Normal University 14
Variations of Bagginga) Random Forests
– so-called because it is constructed from decision trees.– A random forest can be created from individual decision trees, whose certain
training parameters vary randomly.– Such parameters can be bootstrapped replicas of the training data, as in
bagging– But they can also be different feature subsets as in random subspace
methods.b) Pasting Small Votes
– Unlike bagging, pasting small votes is designed to be used with large datasets
– A large dataset is partitioned into smaller subsets, called bites, each of which is used to train a different classifier.
– Two variations of pasting small votes have emerged:• Rvotes-creates the data subsets at random• Ivotes-creates consecutive datasets based on the importance of the
instances(better results)
2009/10/05
National Taiwan Normal University 15
Variations of Bagging-Pasting Small Votes
2009/10/05
data.in test grity votinsimplemajo Use
. obtain toensemble the to 6.Add 4. step return to , 21 If . of error obtain and dataset,n validatioaon 5.Evaluate
.hypothesis thereceive and with 4.Call instances.such has until 3-1 StepReapeat
11
, y probabilitin with place ,.Otherwise in place ied,misclassif is 3.If
. ensemblecurrent in the sclassifier bag-of-out of otemajority v using 2.Evaluate on.distributi uniform toaccording from instancean draw 1.Randomly
1. 1 step return to , 21 4.If
. of error obtain and dataset,n validatioaon 3.Evaluate . r)(classifie hypothesis thereceive and ,with 2.Call
from size ofsubset random a 1.Choose
created. be tosubsets trainingindividual of size theindicating Bitesize ;iterations ofnumber specifying Integer
algorithm learning Weak ; classes ngrepresenti labelscorrect with data Training
tt
1
0
000
00
0
-TestEnd
WeakLearn
Do
WeakLearn
eInititaliz
WeakLearn;
:Input(Ivotes) Votes Small Pasting :Algorithm
tt
tt
tt
t
t
tt
t
Ci
Eh/hh
hSMS
εε
ppxSxx
ExSx
T,,t/
hhhS
.SMS
,MT
,,S
National Taiwan Normal University 16
Boosting(1)• Schapire provides an elegant algorithm for boosting the performance of a
weak learner to the level of a strong one. – The algorithm is now considered as one of the most important developments
in the recent history of machine learning.• Similar to bagging, boosting also creates an ensemble of classifiers by
resampling the data, which are then combined by majority voting.– resampling is strategically geared to provide the most informative training
data for each consecutive classifier. • boosting creates three weak classifiers:
– The first classifier C1 is trained with a random subset of the available training data.
– The training data subset for the second classifier C2 is chosen as the most informative subset, given C1.
• C2 is trained on a training data only half of which is correctly classified by C1, and the other half is misclassified.
– The third classifier C3 is trained with instances on which C1and C2 disagree.
2009/10/05
National Taiwan Normal University 17
Boosting(2)
2009/10/05
tion.classifica ginal theas by predicted class thechoose disagree, they 2.If tion.classifica final theis class thisclass, on the agree they .If and by 1.Classify
instance test aGiven -. with classifier thirdTrain the disagree. and for which instances thoseselectingby 5.Create
. with classifier second e4.Train th . toadded becan patterns more no until coins flipping c.Continue
. toinstance thisAdd .classifiedcorrectly is onefirst theuntil toempresent th and , from samplesselect tail,b.If . toinstance thisAdd
. iedmisclassif is instancefirst theuntil toempresent th and , from samplesselect Head, If coin.fair a a.Filp :so do ied.Tomisclassif is halfother theand
, by classifiedcorrectly is of halfsuch that ,given , dataset einformativmost theasdataset 3.Create . classifier create toh train witand 2.Call
.subset data create to fromt replacemen without patterns 1.Select
. algorithm learning Weak ; labelscorrect with size theof data Training
3
21
3321
22
2
21
2
1
2212
11
11
21i
CCCx
xSCCC
SCS
SCSS
CS
CSCSCS
SSNN
,NS
Test
WeakLearn
TrainingWeakLearn
:Input
Boosting:Algprithm
National Taiwan Normal University 18
AdaBoost(1) • AdaBoost is a more general version of the original boosting algorithm. Among
its many variations.• Among its many variations, AdaBoost.M1 and AdaBoost.R are more
commonly used, as they are capable of handling multiclass and regression problems, respectively.
• AdaBoost generates a set of hypotheses(classifiers), and combines them through weighted majority voting of the classes predicted by the individual hypotheses.
• The hypotheses are generated by training a weak classifier, using instances drawn from an iteratively updated distribution of the training data.
• This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier.
• Consecutive classifiers’ training data are geared towards increasingly hard-to-classify instances.
2009/10/05
National Taiwan Normal University 19
AdaBoost(2) • A weight distribution on training instances , from which
training data subsets are chosen for each consecutive classifier (hypothesis) .
• A normalized error is then obtained as , such that for , they have .
• Distribution update rule: – The distribution weights of those instances that are correctly classified by
the current hypothesis are reduced by a factor of , whereas the weights of the misclassified instances are unchanged.
– AdaBoost focuses on increasingly difficult instances. • AdaBoost raises the weights of instanced misclassified by , and lowers
the weights of correctly classified instances.• AdaBoost is ready for classifying unlabeled test instances. Unlike bagging or
boosting, AdaBoost uses a rather undemocratic voting scheme, called the weighted majority voting.
• is therefore a measure of performance, and can be used to weight the classifiers.
2009/10/05
iDt ix N,,i 1
th
210 /t 10 t
t
t
tS
th
t/ 1
National Taiwan Normal University 20
AdaBoost.M1(1)
2009/10/05
tion.classifica final theas votetalhighest to thereceives that class the2.Choose
1 1
classeach by received voteotal1.Obtain t . instance unlabeledan Given
function.on distributiproper a becomes thatsochosen constant ion normalizat a is where
otherwise, 1 if ,
ondistributi 5.Update 14.Set
. ,21 If
oferror thee3.Calculat . hypothesis receive, with 2.Train
. on distributi thefrom draw , subset data traininga1.Select :21
11
.iterations ofnumber specifying Integer ; algorithm learning Weak
;, labels with 1 examples N of Sequence
1
1
1
1
,C.,j,β
log V
x
DiDZ
yxhZ
iDiD: D
.ε/εβ/ε
iD: h
hSDS
T,, t
N,,i,N
iD
T
,y,N,,i,yxS
jt
iit
ωxt:h tj
ti
tt
iitt
t
ttt
ttt
t
yxh:ittt
tt
tt
Ciii
:Voting MajorityWeight -Test
abort
WeakLearn
for Do
Initialize
WeakLearn
:InputM1 AdaBoost.Algorithm
National Taiwan Normal University 21
AdaBoost.M1(2)
2009/10/05
National Taiwan Normal University 22
Stacked Generalization(1)• An ensemble of classifiers are first created, whose outputs are used as
inputs to a second level meta-classifier to learn the mapping between the ensemble outputs and the actual correct classes.
• are trained using training parameters through to output hypotheses through .
• The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second level classifier, .
2009/10/05
TCC 1 1 T
1h Th
1TC
National Taiwan Normal University 23
Mixture-of-Experts• A conceptually similar technique is the mixture-of-experts model, where a
set of classifiers constitute the ensemble, followed by a second level classifier used for assigning weights for the consecutive combiner.
• the combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all.– the weight distribution used for the combiner is determined by a second level
classifier, usually a neural network, called the gating network.– The inputs to the gating network are the actual training data instances
themselves (unlike outputs of first level classifiers for stacked generalization)
• Mixture-of-experts can therefore be seen as a classifier selection algorithm.
• Individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x.
2009/10/05
TCC 1
1TC
National Taiwan Normal University 24
Stacked Generalization(2)• The pooling system may use the weights in several different ways.
– it may choose a single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class, and pick the class that receives the highest weighted sum.
2009/10/05
National Taiwan Normal University 25
Combining Classifiers• The second key component of any ensemble system is the strategy
employed in combining classifiers. Combination rules are often grouped as – (i) trainable vs. non-trainable combination rules
• In trainable combination rules, the parameters of the combiner, usually called weights, are determined through a separate training algorithm.
– The combination parameters created by trainable rules are usually instance specific, and hence are also called dynamic combination rules.
• weighted majority voting is an example of such non-trainable rules, since the parameters become immediately available as the classifiers are generated.
– (ii) combination rules that apply to class labels vs. to class-specific continuous outputs.
• combination rules that apply to class labels need the classification decision only (that is, one of , ), whereas others need the continuous-valued outputs of individual classifiers.
2009/10/05
j C,,j 1
National Taiwan Normal University 26
Combining Class Labels• Assume that only the class labels are available from the classifier outputs. • Define the decision of the classifier as ,
and , where is the number of classifiers and is the number of classes.
• If classifier chooses class , then , and , otherwise.
a) Majority Voting :
b) Weighted Majority Voting –
c) Behavior Knowledge Space (BKS) – look up Tabled) Borda Count –
– each voter (classifier) rank orders the candidates (classes). If there are N candidates, the first-place candidate receives N − 1 votes, the second-place candidate receives N − 2, with the candidate in i th place receiving N − i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision.
2009/10/05
tht 10,d j,t T,,t 1 C,,j 1T C
tht j 1j,td 0
T
t
T
tj,t
C
jJ,t dmaxd1 11
T
t
T
tj,tt
C
jJ,tt dmaxd1 11
National Taiwan Normal University 27
Behavior Knowledge Space (BKS)
2009/10/05
National Taiwan Normal University 28
Combining Continuous Outputs(1)• Algebraic combiners
a) Mean Rule:
b) Weighted Average:
c) Minimum/Maximum/Median Rule:
d) Product Rule:
e) Generalized Mean: • Many of the above rules are in fact special cases of the generalized
mean
• : minimum rule; :maximum rule; :• : mean rule
2009/10/05
T
tj,tj xd
Tx
1
1
T
tj,tj,tj xd
Tx
1
1
xdmaxx j,tT,,tj 1 xdminx j,tT,,tj 1
xdmedianx j,tT,,tj 1
T
tj,tj xd
Tx
1
1
1
1
1
T
tj,tj xd
T,x
1
0 TT
tj,tj xd,x
1
1
National Taiwan Normal University 29
Combining Continuous Outputs(2)
2009/10/05
National Taiwan Normal University 30
Combining Continuous Outputs(3)• Decision Templates
– Kuncheva takes the idea of decision profiles one step further by defining decision templates as the average decision profile observed for each class throughout training.
– Given a test instance , its decision profile is compared to the decision templates of each class, and the class whose decision template is closest, in some similarity measure, is chosen as the ensemble decision.
– the decision template for , illustrated in Figure 16, is calculated as
– Given an unlabeled test instance , they first construct its from the ensemble outputs, and calculate the similarity S between and the decision template for each class as the degree of support given to class .
2009/10/05
801010207010206020205030140010850
...
...
...
......
xDP
x
j
jjPX
jj
j XDPN
DT
1
x xDP xDP
jjDT
j
C,,j,DT,xDPSx jj 1
National Taiwan Normal University 31
Combining Continuous Outputs(4)• Decision Templates
– The similarity measure S is usually a squared Euclidean distance, obtained as
2009/10/05
2
1 1
11
T
t
C
kk,tjj xdk,tDT
CTx
National Taiwan Normal University 32
Current & Emerging Areas(1)• The primary thrust of using ensemble systems reduce the risk of choosing a
single classifier with a poor performance, or to improve upon the performance of a single classifier by using an intelligently combined ensemble of classifiers.
• Many additional areas and applications have recently emerged, however, for which the ensemble systems are inherently appropriate.
a) Incremental Learning– Learn the novel information content in the new data, without forgetting the
previously acquired knowledge, and without requiring access to previously seen data.
– The algorithm Learn++ and its recent variations have been shown to achieve incremental learning on a broad spectrum of applications, even when new data introduce instances of previously unseen classes.
– Learn++ is inspired in part by AdaBoost;
• AdaBoost : updates its weight distribution based on the performance of hypothesis generated in the previous iteration.
• Learn++: composite hypothesis- the combined ensemble generated thus far at any given iteration- and updates its distribution based on the performance of the current ensemble through the use of this
• composite hypothesis. 2009/10/05
th
National Taiwan Normal University 33
Current & Emerging Areas(2)
b) Data Fusion– The goal of data fusion is to extract complementary pieces of information
from different data sources, allowing a more informed decision about the phenomenon generating the underlying data distributions.
– In data fusion applications, data obtained from different sources may use different feature sets that may be quite heterogeneous in their composition.
c) Feature selection– One way to improve diversity in the ensemble is to train the classifiers with
data that consist of different feature subsets.– Feature subsets should promote diversity, that is, identical subsets should
not be repeated.– Krause and Polikar, who used an ensemble of classifiers trained with an
iteratively updated feature distribution to address the missing feature problem.
• The idea is to generate a large enough ensemble of classifiers using random subsets of features.
2009/10/05
National Taiwan Normal University 34
Conclusions• In this paper, they discussed the fundamental aspects of these ensemble
systems, including the need to ensure—and ways to measure—diversity in the ensemble; ensemble generation techniques such as bagging, AdaBoost, mixture of experts; and classifier combination strategies, such as algebraic combiners, voting methods, and decision templates.
• They have also reviewed some of the more popular areas where ensemble systems become naturally useful, such as incremental learning, data fusion, feature selection and error correcting output codes.
• Whereas there is no single ensemble generation algorithm or combination rule that is universally better than others, all of the approaches discussed above have been shown to be effective on a wide range of real world and benchmark datasets, provided that the classifiers can be made as diverse as possible.
2009/10/05