Semi-random model tree ensembles: an effective and scalable regression method
Bernhard Pfahringer
Department of Computer Science
University of Waikato, New Zealand
September 22nd, 2011
Background
Outline
1 Background
2 Algorithm
3 Results
4 Summary
Background
Local regression
Non-linear functions can be approximated by a set of locally linear estimators
Regression and model trees are fast multi-variate versions of local regression
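To make the idea concrete, here is a minimal, self-contained Python sketch (not from the slides): a noisy sine curve is approximated by two locally fitted straight lines, one per half of the input range.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, np.pi, 200)
y = np.sin(x) + rng.normal(scale=0.05, size=x.shape)

def fit_line(xs, ys):
    # ordinary least squares for y = a*x + b on one local region
    a, b = np.polyfit(xs, ys, deg=1)
    return a, b

split = np.pi / 2                       # boundary between the two local regions
left = x < split
models = {True: fit_line(x[left], y[left]),
          False: fit_line(x[~left], y[~left])}

def predict(x_new):
    a, b = models[x_new < split]        # pick the local linear estimator
    return a * x_new + b

print(predict(0.3), predict(2.8))       # piece-wise linear estimates of sin(x)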
Background
Piece-wise linear approximation example
Background
Sample Regression Tree: constants in the leaves
A159 <= −0.62 :
    A149 <= 0.52 : Y = 1.6977
    A149 > 0.52 : Y = 1.2213
A159 > −0.62 :
    A149 <= 0.638 :
        A57 <= −0.485 : Y = 0.8388
        A57 > −0.485 : Y = 1.0569
    A149 > 0.638 : Y = 0.6062
Background
Sample Model Tree: linear models in the leaves
A159 <= −0.62 :
    A149 <= 0.52 : LM1
    A149 > 0.52 : LM2
A159 > −0.62 :
    A149 <= 0.638 : LM3
    A149 > 0.638 : LM4

LM1: Y = −0.597 ∗ A149 − 0.211 ∗ A159 + 1.901
LM2: Y = −0.471 ∗ A149 − 0.211 ∗ A159 + 1.353
LM3: Y = −0.365 ∗ A149 − 0.232 ∗ A159 + 1.017
LM4: Y = −0.555 ∗ A149 − 0.232 ∗ A159 + 0.776
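Read as code, this tree routes an instance down the splits and then evaluates that leaf's linear model; a small Python rendering of the sample tree above (the feature values in the last line are made up for illustration):

def predict_model_tree(a149, a159):
    # route down the splits of the sample model tree, then apply the leaf's linear model
    if a159 <= -0.62:
        if a149 <= 0.52:                                    # LM1
            return -0.597 * a149 - 0.211 * a159 + 1.901
        return -0.471 * a149 - 0.211 * a159 + 1.353         # LM2
    if a149 <= 0.638:                                       # LM3
        return -0.365 * a149 - 0.232 * a159 + 1.017
    return -0.555 * a149 - 0.232 * a159 + 0.776             # LM4

print(predict_model_tree(a149=0.3, a159=-1.0))   # falls into LM1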
Algorithm
Outline
1 Background
2 Algorithm
3 Results
4 Summary
Algorithm
Ensembles of Semi-Random Model Trees
Ensembles usually improve results
Most ensembles use randomization to generate diversity
Two sources of randomness:
    For each tree: divide the data into a train and a validation set
    To split: select the best attribute from a random subset of all attributes
Algorithm
Single Semi-Random Model Tree
Only consider the median as split value (=> balanced trees)
Leaf model: a linear ridge regression model
Cap model predictions inside the observed extremes
Optimise tree depth and ridge value using the validation set
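A minimal sketch of such a leaf model, assuming a plain normal-equations ridge solver (the helper names fit_leaf/predict_leaf are mine, not the authors'); note how predictions are capped at the target extremes seen in the leaf:

import numpy as np

def fit_leaf(X, y, ridge):
    # ridge regression with an intercept; remember the target range seen in this leaf
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    d = X1.shape[1]
    w = np.linalg.solve(X1.T @ X1 + ridge * np.eye(d), X1.T @ y)
    return w, y.min(), y.max()

def predict_leaf(model, X):
    w, lo, hi = model
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.clip(X1 @ w, lo, hi)      # cap predictions inside the observed extremes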
Algorithm
Build ensemble
BUILDENSEMBLE(data, numTrees, k)
1  for i = 1 to numTrees
2      do randomly split data into two:
3             train + validate
4         BUILDTREE(train, validate, k)
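A Python rendering of this loop, as a sketch under my own naming: it assumes a build_tree/predict_tree pair as sketched after the BUILDTREE pseudocode below, and a 50/50 split into build and validation halves as used in the reported experiments.

import numpy as np

def build_ensemble(X, y, num_trees, k, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(num_trees):
        # per-tree randomness: a fresh random split into train and validation halves
        idx = rng.permutation(len(y))
        half = len(y) // 2
        tr, va = idx[:half], idx[half:]
        trees.append(build_tree(X[tr], y[tr], X[va], y[va], k, rng))
    return trees

def predict_ensemble(trees, X):
    # the ensemble prediction is the average of the individual tree predictions
    return np.mean([predict_tree(t, X) for t in trees], axis=0)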
Algorithm
BuildTree
BUILDTREE(train, validate, k)
 1  min ← MINTARGETVALUE(train)
 2  max ← MAXTARGETVALUE(train)
 3  localSSE ← LINREG(train, validate)
 4
 5  if |train| > 10 & |validate| > 10
 6      do split ← RANDOMSPLIT(train, k)
 7
 8         smT ← SMALLER(train, split)
 9         smV ← SMALLER(validate, split)
10         smaller ← BUILDTREE(smT, smV, k)
11
12         laT ← LARGER(train, split)
13         laV ← LARGER(validate, split)
14         larger ← BUILDTREE(laT, laV, k)
15
16         subSSE ← SSE(smaller, larger, validate)
17
18         if localSSE < subSSE
19             do smaller ← null
20                larger ← null
21         else
22             localModel ← null
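The same logic in Python, as a sketch: linreg, random_split and fit_leaf/predict_leaf are the helpers sketched around the neighbouring slides, and the Node container and boolean-mask splitting are my own simplifications, not the original implementation.

import numpy as np
from dataclasses import dataclass

@dataclass
class Node:
    model: object = None      # local ridge model, used when this node acts as a leaf
    attr: int = None          # split attribute index
    thresh: float = None      # split value (approximate median)
    smaller: "Node" = None
    larger: "Node" = None

def build_tree(Xtr, ytr, Xva, yva, k, rng):
    node = Node()
    node.model, local_sse = linreg(Xtr, ytr, Xva, yva)       # local linear model and its SSE
    if len(ytr) > 10 and len(yva) > 10:
        node.attr, node.thresh = random_split(Xtr, ytr, k, rng)
        if node.attr is None:                 # no usable split found: keep the leaf model
            return node
        tr_small = Xtr[:, node.attr] <= node.thresh
        va_small = Xva[:, node.attr] <= node.thresh
        node.smaller = build_tree(Xtr[tr_small], ytr[tr_small], Xva[va_small], yva[va_small], k, rng)
        node.larger = build_tree(Xtr[~tr_small], ytr[~tr_small], Xva[~va_small], yva[~va_small], k, rng)
        sub_sse = np.sum((predict_tree(node, Xva, use_local=False) - yva) ** 2)
        if local_sse < sub_sse:               # the flat local model wins: prune the subtrees
            node.smaller = node.larger = None
        else:                                 # the subtrees win: drop the local model
            node.model = None
    return node

def predict_tree(node, X, use_local=True):
    if node.smaller is None or (use_local and node.model is not None):
        return predict_leaf(node.model, X)
    go_small = X[:, node.attr] <= node.thresh
    out = np.empty(len(X))
    out[go_small] = predict_tree(node.smaller, X[go_small])
    out[~go_small] = predict_tree(node.larger, X[~go_small])
    return out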
Algorithm
Ridge regression
LINREG(train, validate)
1  for ridge in 10^-8, 10^-4, 10^-2, 10^-1, 1, 10
2      do model_ridge ← RIDGEREGRESS(train, ridge)
3         sse_ridge ← SSE(model_ridge, validate)
4  if bestModel == model_10
5      do build models for ridge = 10^2, 10^3, ...
6         and so on while improving
7  localModel ← bestModel
8  return minimum-sse-on-validation-data
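A minimal sketch of this ridge-value search in Python: fit one ridge model per candidate value, score it on the validation data, and keep extending the grid upwards (100, 1000, ...) while the largest value keeps winning. fit_leaf/predict_leaf are the leaf-model helpers sketched earlier; the grid is the one on the slide.

import numpy as np

def linreg(Xtr, ytr, Xva, yva):
    def score(ridge):
        model = fit_leaf(Xtr, ytr, ridge)
        return model, np.sum((predict_leaf(model, Xva) - yva) ** 2)

    grid = [1e-8, 1e-4, 1e-2, 1e-1, 1.0, 10.0]
    best_model, best_sse, best_ridge = None, np.inf, None
    for ridge in grid:
        model, sse = score(ridge)
        if sse < best_sse:
            best_model, best_sse, best_ridge = model, sse, ridge
    # keep growing the ridge value (100, 1000, ...) while it improves
    while best_ridge == grid[-1]:
        grid.append(grid[-1] * 10.0)
        model, sse = score(grid[-1])
        if sse < best_sse:
            best_model, best_sse, best_ridge = model, sse, grid[-1]
        else:
            break
    return best_model, best_sse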
Algorithm
Random split selection
RANDOMSPLIT(train, k)
1  for i = 1 to k
2      do splitAttr ← RANDOM_CHOICE(allAttrs)
3         stump ← STUMP(APPROX_MEDIAN(splitAttr))
4         compute SSE(stump, train)
5  return minimum-sse stump
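A sketch of the same selection in Python: try k randomly chosen attributes, split each at its median, and keep the stump with the lowest training SSE. Scoring a stump by the mean target of each side is my assumption about what SSE(stump, train) means; k must be at most the number of attributes.

import numpy as np

def random_split(Xtr, ytr, k, rng):
    best = (None, None, np.inf)
    for attr in rng.choice(Xtr.shape[1], size=k, replace=False):
        thresh = np.median(Xtr[:, attr])
        left = Xtr[:, attr] <= thresh
        if left.all() or not left.any():
            continue                          # degenerate split, skip this attribute
        sse = np.sum((ytr[left] - ytr[left].mean()) ** 2) \
            + np.sum((ytr[~left] - ytr[~left].mean()) ** 2)
        if sse < best[2]:
            best = (attr, thresh, sse)
    return best[0], best[1]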
Algorithm
Parameter Settings
reported experiments:
average the predictions of 50 randomized model trees
to split: select the best of a randomly chosen 50% of all attributes

generally: one should optimise separately for every application, e.g. using cross-validation

number of trees: "the more the merrier", but with diminishing returns
number of randomly selected attributes: 50% is a good default, but it may depend on the total number of attributes and on sparseness
Results
Outline
1 Background
2 Algorithm
3 Results
4 Summary
Results
Comparison
use more than 20 Torgo/UCI datasets, > 900 examples
repeated 2/3 training, 1/3 testing splits
training split into equal build and validation halves (1/3, 1/3)
preprocessed for missing or categorical values
compare to:
    LR: linear ridge regression, optimise the ridge value
    GP: Gaussian process regression, optimise the noise level and RBF gamma
    AG: additive groves, use the "fast" script
use RMAE: relative mean absolute error
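A small sketch of the RMAE score used in the following charts, assuming the usual definition (this definition is an assumption; the slides only expand the acronym): the model's mean absolute error relative to that of a baseline which always predicts the training-set mean, expressed as a percentage.

import numpy as np

def rmae(y_true, y_pred, y_train_mean):
    # relative mean absolute error: 100 * MAE(model) / MAE(predict-the-mean baseline)
    mae_model = np.mean(np.abs(y_true - y_pred))
    mae_baseline = np.mean(np.abs(y_true - y_train_mean))
    return 100.0 * mae_model / mae_baseline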
Results
RMAE on Torgo/UCI
[Bar chart "RMAE for Torgo/UCI data": RMAE on a 0-100 scale for each Torgo/UCI dataset, with bars for RMT, GP, LR, and AG.]
Figure: RMAE for Torgo/UCI datasets, sorted by the linear regression result.
Results
Build times on Torgo/UCI
[Bar chart "Training time in seconds for Torgo/UCI data": per-dataset training times on a logarithmic scale from 0.1 to 100000 seconds, with bars for RMT, GP, LR, and AG.]
Figure: Training time in seconds for Torgo/UCI datasets, sorted by the number of instances in each dataset; note the use of a logarithmic y-scale.
Results
UCI Census dataset
Table: Partial results, 2458285 examples in total, therefore about 800000 in the training fold.
Method   RMAE    Time (secs)
LR       15.96          1205
RMT       9.78         19811
GP        ?       ? (would need 5 TB RAM)
AG        ?       ? (estimated 2000000)
Results
Near infrared (NIR) Datasets
proprietary NIR data
7 datasets
from 255 up to 7500 spectra
between 170 and 500-odd features
preprocessed for noise and baseline shift
Results
Sample NIR spectrum
Preprocessed sample spectrum (nitrogen in soil)
[Line plot: a preprocessed NIR spectrum with roughly 170 wavelength channels and values between −2 and 4.]
Results
RMAE on NIR data
RMAE for NIR datasets
[Bar chart: RMAE on a 10-90 scale for the NIR datasets n, omd, rmd, tc, phe, ph, p5, na, and g5, with bars for RMT, GP, LR, and AG.]
Figure: RMAE for NIR datasets, sorted by the linear regression result.
Results
Build times on NIR data
Training time in seconds for NIR data
[Bar chart: per-dataset training times on a logarithmic scale from 0.1 to 100000 seconds for the NIR datasets omd, rmd, na, n, tc, ph, phe, p5, and g5, with bars for RMT, GP, LR, and AG.]
Figure: Training time in seconds for NIR datasets, sorted by the number of instances in each dataset; note the use of a logarithmic y-scale.
Results
Random Model Tree Build Times discussion
complexity is O(K · N · log N + K² · N)
the second term (linear model computation) seems to dominate
therefore the observed complexity is ≈ O(K² · N)
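One plausible reading of where the two terms come from, assuming K is the number of attributes and N the number of training instances (this breakdown is my interpretation, not stated on the slide):

% Assumed breakdown (interpretation, not from the slide):
% a balanced tree has O(log N) levels; at each level every instance is scanned
% once for each of the O(K) candidate split attributes, giving the first term.
% Fitting a ridge model over K attributes at a node costs O(K^2 * n) to form the
% normal equations over that node's n instances, summing to the second term.
O\bigl(\underbrace{K \cdot N \log N}_{\text{split selection over a balanced tree}}
  \;+\; \underbrace{K^{2} \cdot N}_{\text{ridge models at the nodes}}\bigr)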
Summary
Outline
1 Background
2 Algorithm
3 Results
4 Summary
Summary
Conclusions
Semi-Random Model Trees perform well
They are fast: build time is practically linear in N
They can model non-linear relationships
Summary
Future Work
Improve efficiency for large K
Study more and different regression problems
More comparisons to alternative regression schemes
A streaming/MOA variant