Classifier Ensembles: Facts, Fiction, Faults and Future
Ludmila I Kuncheva
School of Computer Science, Bangor University, Wales, UK
1. Facts
[Diagram: feature values (object description) → classifier → class label]
Classifier ensembles
[Diagram: feature values (object description) → several classifiers → “combiner” → class label]
Classifier ensembles
[Diagram: the same ensemble, where the combiner is itself a classifier, e.g. a neural network]
Classifier ensembles
[Diagram: a single classifier mapping feature values to a class label: is this an ensemble?]
Classifier ensembles
[Diagram: many classifiers feeding a combiner: is the “ensemble” just a fancy combiner?]
Classifier ensembles
[Diagram: is the combiner a classifier? Is the ensemble just a fancy feature extractor followed by a classifier?]
Why classifier ensembles then?
a. because we like to complicate entities beyond necessity (anti-Occam’s razor)
b. because we are lazy and stupid and can’t be bothered to design and train one single sophisticated classifier
c. because democracy is so important to our society, it must be important to classification
Classifier ensembles
Juan: “I just like combining things…”
combination of multiple classifiers [Lam95, Woods97, Xu92, Kittler98]
classifier fusion [Cho95, Gader96, Grabisch92, Keller94, Bloch96]
mixture of experts [Jacobs91, Jacobs95, Jordan95, Nowlan91]
committees of neural networks [Bishop95, Drucker94]
consensus aggregation [Benediktsson92, Ng92, Benediktsson97]
voting pool of classifiers [Battiti94]
dynamic classifier selection [Woods97]
composite classifier systems [Dasarathy78]
classifier ensembles [Drucker94, Filippi94, Sharkey99]
bagging, boosting, arcing, wagging [Sharkey99]
modular systems [Sharkey99]
collective recognition [Rastrigin81, Barabash83]
stacked generalization [Wolpert92]
divide-and-conquer classifiers [Chiang94]
pandemonium system of reflective agents [Smieja96]
change-glasses approach to classifier selection [KunchevaPRL93]
etc.
fanciest … oldest
Classifier ensembles - the oldest:
• “The method of collective recognition”, Moscow, Energoizdat, 1981 (≈ 1 c): classifier ensemble; classifier selection (regions of competence); weighted majority vote
• “Collective statistical decisions in [pattern] recognition”, Moscow, Radio i svyaz’, 1983: weighted majority vote
This superb graph was borrowed from “Fuzzy models and digital signal processing (for pattern recognition): Is this a good marriage?”, Digital Signal Processing, 3, 1993, 253-270, by my good friend Jim Bezdek.
[Figure: the technology hype curve, Expectation vs. time (1965-1993): naive euphoria, peak of hype, overreaction to immature technology, depth of cynicism, true user benefit, asymptote of reality]
So where are we?
[Figure: the same hype curve relabelled for classifier ensembles, 1978-2008, with the five stages numbered 1-5]
To make matters worse...
Expert 1: J. Ghosh
Forum: 3rd International Workshop on Multiple Classifier Systems, 2002 (invited lecture)
Quote: “... our current understanding of ensemble-type multiclassifier systems is now quite mature...”
Expert 2: T.K. Ho
Forum: Invited book chapter, 2002
Quote: “Many of the above questions are there because we do not yet have a scientific understanding of the classifier combination mechanisms”
half full
half empty
[Chart: number of publications per year, 2000-2008 (as of 13 Nov 2008), for the queries 1. Classifier ensembles; 2. AdaBoost – (1); 3. Random Forest – (1) – (2); 4. Decision Templates – (1) – (2) – (3). The count for 2008 is incomplete]
Literature
“One cannot embrace the unembraceable.”
Kozma Prutkov
ICPR 2008: 984 papers, ~2000 words in the titles, first 2 principal components
[Scatter plot of title words; clusters labelled image/segment/feature, video/track/object, feature/local/select, and classifier ensembles]
So where are we?
still here… somewhere…
2. Fiction
Fiction?
Diversity: diverse ensembles are better ensembles? Diversity = independence?
AdaBoost: “the best off-the-shelf classifier”?
Minority Report
- a science fiction short story by Philip K. Dick, first published in 1956. It is about a future society where murders are prevented through the efforts of three mutants (“precogs”) who can see two weeks ahead in the future. The story was made into a popular film in 2002.
Each of the three “precogs” generates its own report or prediction. The three reports are analysed by a computer.
If these reports differ from one another, the computer identifies the two reports with the greatest overlap and produces a "majority report," taking this as the accurate prediction of the future.
But the existence of majority reports implies the existence of a "minority report."
Classifier Ensemble
And, of course, the most interesting case is when the classifiers disagree – the minority report.
Diversity is good
[Diagram: Wrong/Correct patterns of 3 classifiers on 15 objects]
individual accuracy = 10/15 ≈ 0.667
• independent classifiers: ensemble accuracy (majority vote) = 11/15 ≈ 0.733
• identical classifiers: ensemble accuracy (majority vote) = 10/15 ≈ 0.667
• dependent classifiers 1: ensemble accuracy (majority vote) = 7/15 ≈ 0.467
• dependent classifiers 2: ensemble accuracy (majority vote) = 15/15 = 1.000
Myth: Independence is the best scenario. Myth: Diversity is always good.
Summary: identical 0.667; independent 0.733; dependent 1: 0.467 (worse than an individual classifier); dependent 2: 1.000 (better than independence).
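The slide's arithmetic is easy to verify in code. Below is a minimal sketch with illustrative correctness patterns; these are stand-ins, not the slide's exact tables (in fact, with all three classifiers at exactly 10/15, the worst majority vote any pattern can produce is 8/15, so the 7/15 case presumably uses slightly different individual accuracies):

```python
import numpy as np

def majority_vote_accuracy(correct):
    """correct: (n_classifiers, n_objects) 0/1 matrix, where 1 means the
    classifier labels that object correctly. The majority vote is correct
    whenever more than half of the classifiers are."""
    correct = np.asarray(correct)
    return float(np.mean(correct.sum(axis=0) > correct.shape[0] / 2))

n = 15  # objects; every classifier below is correct on 10/15 of them

# identical classifiers: all three correct on the same 10 objects
identical = np.tile(np.r_[np.ones(10), np.zeros(5)], (3, 1))

# helpful dependence: every object is covered by exactly 2 of the 3
# classifiers, so the majority vote is always correct
helpful = np.ones((3, n))
for i in range(3):
    helpful[i, 5 * i:5 * i + 5] = 0

# harmful dependence: correct votes pile up three-at-a-time on 8 objects
# and are wasted as lone minority votes on the rest
harmful = np.zeros((3, n))
harmful[:, :8] = 1
harmful[0, 8:10] = 1
harmful[1, 10:12] = 1
harmful[2, 12:14] = 1

for pattern in (identical, helpful, harmful):
    assert pattern.sum(axis=1).tolist() == [10, 10, 10]  # 10/15 each

print(majority_vote_accuracy(identical))  # 10/15 ≈ 0.667
print(majority_vote_accuracy(helpful))    # 15/15 = 1.0
print(majority_vote_accuracy(harmful))    # 8/15 ≈ 0.533
```

The "helpful" pattern reproduces the best case on the slide (dependent classifiers 2): dependence, not independence, gives the perfect majority vote.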
Example
The set-up:
• UCI data repository
• “heart” data set
• First 9 features; all 280 different partitions into [3, 3, 3]
• Ensemble of 3 linear classifiers
• Majority vote
• 10-fold cross-validation
What we measured:
• Individual accuracies of the ensemble members
• The ensemble accuracy
• The ensemble diversity (just one of all these measures…)
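A minimal sketch of this kind of experiment, assuming scikit-learn (not used in the original work) and a synthetic stand-in for the UCI "heart" data; `subset_classifier` and the particular [3, 3, 3] partition are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic stand-in for the UCI "heart" data: 9 features, 2 classes.
X, y = make_classification(n_samples=300, n_features=9, n_informative=6,
                           random_state=0)

def subset_classifier(cols):
    """A linear classifier restricted to a 3-feature subset."""
    pick = FunctionTransformer(lambda Z, c=tuple(cols): Z[:, list(c)])
    return make_pipeline(pick, LogisticRegression(max_iter=1000))

# One of the 280 ways to partition 9 features into groups of [3, 3, 3]
# (9! / (3! * 3! * 3! * 3!) = 280, matching the slide).
partition = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]

ensemble = VotingClassifier(
    [(f"lc{i}", subset_classifier(g)) for i, g in enumerate(partition)],
    voting="hard")  # hard voting = majority vote

scores = cross_val_score(ensemble, X, y, cv=10)  # 10-fold cross-validation
print(round(scores.mean(), 3))
```

Looping `partition` over all 280 partitions and recording individual accuracies, ensemble accuracy, and a diversity measure for each reproduces the shape of the study.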
Example
280 ensembles
[Scatter plot: ensemble accuracy vs. individual accuracy for the 280 ensembles, showing the minimum, average and maximum individual accuracy; the region above the diagonal is marked “ensemble is better”]
Example
[Scatter plot: ensemble accuracy vs. diversity for the 280 ensembles: the more diverse ensembles are, puzzlingly, less accurate]
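The "diversity" axis needs a concrete measure, and the slides do not say which of the many measures was used. Two common pairwise choices, sketched here for illustration, are Yule's Q statistic and the disagreement measure:

```python
import numpy as np

def q_statistic(c1, c2):
    """Yule's Q over two classifiers' correctness vectors (1 = correct).
    Q ≈ 0 for independent classifiers, Q > 0 when they tend to be correct
    together, Q < 0 when they err on different objects (more diverse)."""
    c1, c2 = np.asarray(c1, bool), np.asarray(c2, bool)
    n11 = np.sum(c1 & c2)    # both correct
    n00 = np.sum(~c1 & ~c2)  # both wrong
    n10 = np.sum(c1 & ~c2)   # only the first correct
    n01 = np.sum(~c1 & c2)   # only the second correct
    return float(n11 * n00 - n01 * n10) / float(n11 * n00 + n01 * n10)

def disagreement(c1, c2):
    """Fraction of objects on which exactly one of the two is correct."""
    c1, c2 = np.asarray(c1, bool), np.asarray(c2, bool)
    return float(np.mean(c1 ^ c2))

print(q_statistic([1, 1, 0, 0], [0, 0, 1, 1]))   # -1.0: maximally diverse pair
print(disagreement([1, 1, 0, 0], [0, 0, 1, 1]))  # 1.0: they always disagree
```

Averaging such a pairwise measure over all classifier pairs gives one diversity value per ensemble, which is what plots like the one above put on the x-axis.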
Example
[3-D plot: ensemble accuracy as a function of individual accuracy and diversity]
Example
[Scatter plot: individual accuracy vs. diversity; large ensemble accuracy is expected where individual accuracy is high and diversity is large]
AdaBoost is everything
[Cartoon: AdaBoost as a Swiss Army Knife with every blade labelled AdaBoost; Bagging as a Russian Army Knife]
Surely, there is more to combining classifiers than Bagging and AdaBoost
“This altogether gives a very bad impression of ill-conceived experiments and confusing and unreliable conclusions. ... The current spotty conclusions are incomprehensible, and are of no generalization or reference value.”
“This is a potentially great new method and any experimental analysis would be very useful for understanding its potential. Good study, with very useful information in the Conclusions.”
Example – Rotation Forest
[Chart: % of data sets (out of 32) where the respective ensemble method (Rotation Forest, Random Forest, Boosting, Bagging) is best, as a function of ensemble size (20-100)]
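A comparison of this shape (counting on how many data sets each method wins) can be sketched as follows. The synthetic data sets here are stand-ins for the 32 benchmark sets, and Rotation Forest itself is omitted because scikit-learn does not provide it; it rotates random feature subsets with PCA before training each tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# A handful of synthetic data sets stands in for the 32 benchmark sets.
datasets = [make_classification(n_samples=300, n_features=10,
                                n_informative=5, random_state=s)
            for s in range(5)]

ensembles = {
    "Bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "Boosting": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Count, per method, the data sets on which it has the best CV accuracy.
wins = dict.fromkeys(ensembles, 0)
for X, y in datasets:
    scores = {name: cross_val_score(clf, X, y, cv=5).mean()
              for name, clf in ensembles.items()}
    wins[max(scores, key=scores.get)] += 1

for name, w in wins.items():
    print(f"{name}: best on {w}/{len(datasets)} data sets")
```

No single method wins everywhere, which is the point of the chart: the win counts shift with the data sets and the ensemble size.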
So, no, AdaBoost is NOT everything
3. Faults
OUR faults!
Complacent: We don’t care about terminology.
Vain: To get publications, we invent complex models for simple problems or, even worse, for complex non-existent problems.
Untidy: There is little effort to systemise the area.
Ignorant and lazy: By virtue of ignorance we tackle problems well and truly solved by others. Krassi’s motto “I don’t have time to read papers because I am busy writing them”.
Haughty: Simple things that work do not impress us until they get proper theoretical proofs.
God, seeing what the people were doing, gave each person a different language
to confuse them and scattered the people throughout the earth…
image taken from http://en.wikipedia.org/wiki/Tower_of_Babel
Terminology
• Pattern recognition land
• Data mining kingdom
• Machine learning ocean
• Statistics underworld and…
• Weka…
The same notions under different names (ML / Stats / Weka):
instance = example = observation = data point = object
attribute = feature = variable
classifier = hypothesis = learner
classifier ensemble = meta learner
SVM = SMO
nearest neighbour = lazy learner
naïve Bayes = AODE
decision tree = C4.5 = J48
Classifier ensembles - names
[The list of names again, with most of them marked as out of fashion or subsumed; the survivors:]
combination of multiple classifiers [Lam95, Woods97, Xu92, Kittler98]
classifier ensembles [Drucker94, Filippi94, Sharkey99]
United terminology! Yey!
MCS – Multiple Classifier Systems Workshops, 2000-2009
Simple things that work…
We detest simple things that work well for an unknown reason!!!
[Cartoon: in the ideal scenario, the flagship of THEORY leads empirics and applications; in reality they are hijacked by HEURISTICS, with real theory left behind]
Lessons from the past: fuzzy sets
• stability of the system? • reliability? • optimality? • why not probability?
Who cares?...
• temperature for washing machine programmes
• automatic focus in digital cameras
• ignition angle of internal combustion in cars
Because it is
• computationally simpler (faster)
• easier to build, interpret and maintain
Learn to trust heuristics and empirics…
4. Future
[Figure: the hype curve once more, Expectation vs. time, 1978-2008, settling on the asymptote of reality]
Future: branch out?
Multiple instance learning
Non i.i.d. examples
Skewed class distributions
Noisy class labels
Sparse data
Non-stationary data
classifier ensembles for changing environments
classifier ensembles for change detection
D.J. Hand, “Classifier Technology and the Illusion of Progress”, Statistical Science, 21(1), 2006, 1-14.
“… I am not suggesting that no major advances in classification methods will ever be made. Such a claim would be absurd in the face of developments such as the bootstrap and other resampling approaches, which have led to significant advances in classification and other statistical models. All I am saying is that much of the purported advance may well be illusory.”...
So have we truly made progress or are we just kidding ourselves?
Empty-y-y…
(not even half empty-y-y-y …)
Bo-o-ori-i-i-ing....