
Transcript of Geoff

Page 1: Geoff

Copyright © 2013 Geoffrey I Webb

Fundamental and Advanced Machine Learning Methods for Big Data Applications

Geoffrey I Webb,

Ana Martinez, Nayyar Zaidi, Shenglei Chen

Monash University
http://www.csse.monash.edu.au/~webb

Page 2: Geoff

Copyright © 2013 Geoffrey I Webb

Page 3: Geoff

Copyright © 2013 Geoffrey I Webb

Overview

• Big data

• Classification learning

• Sampling

• Dimensionality reduction

• Scaling-up existing algorithms

• Stream learning

• Bias and variance and big data

• Selective KDB

• Incremental Bayesian Network Classifiers

Page 4: Geoff

Copyright © 2013 Geoffrey I Webb


Big data

• Can mean many things

– Complex integration of many heterogeneous data sources

– Very large/streaming data

Name (SI decimal prefixes)   Value    Binary usage
kilobyte (kB)                10^3     2^10
megabyte (MB)                10^6     2^20
gigabyte (GB)                10^9     2^30
terabyte (TB)                10^12    2^40
petabyte (PB)                10^15    2^50
exabyte (EB)                 10^18    2^60
zettabyte (ZB)               10^21    2^70
yottabyte (YB)               10^24    2^80

Page 5: Geoff

Copyright © 2013 Geoffrey I Webb


What is ‘big’?

• Number of
– instances
– dimensions
– classes

• "Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set." – Wikipedia

• Machine learning research usually treats more than 1 million examples as very large.

Page 6: Geoff

Copyright © 2013 Geoffrey I Webb


Examples

• Spelling correction

• Translation

• Farecast

• Recommender systems

• Electoral outcomes

Whitelaw, C, B Hutchinson, GY Chung, & G Ellis. "Using the web for language independent spellchecking and autocorrection." In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pp. 890-899. Association for Computational Linguistics, 2009.

Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. Penguin Press, 2012.

Page 7: Geoff

Copyright © 2013 Geoffrey I Webb


Not a universal panacea

• Jeopardy but not chess

• Spelling correction and translation but not comprehension

http://www.engadget.com/2011/02/15/watson-soundly-beats-the-humans-in-first-round-of-jeopardy/

Page 8: Geoff

Copyright © 2013 Geoffrey I Webb


Classification learning

Page 9: Geoff

Copyright © 2013 Geoffrey I Webb


Evolving distributions

• Key issue

– Is the distribution from which the data are drawn static or dynamic?

– Concept drift

• class membership changes, e.g. what counts as 'rich'

– Concept evolution

• new classes emerge

– Distribution drift

• probabilities change

Page 10: Geoff

Copyright © 2013 Geoffrey I Webb


Dimension of change

• Normally time, but may be another dimension such as location

• A classifier can only take the dimension of change into account if the data to be classified fall within the current scope or if it is possible to extrapolate

[Figure: two schematic plots over the dimension of change, each marking a Training region and a Testing region.]

Page 11: Geoff

Copyright © 2013 Geoffrey I Webb


Loss functions
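The formulas on this slide are not in the transcript. As an assumption, standard definitions of the two losses used later in the deck, for $N$ test instances with true classes $y_i$, predicted classes $\hat{y}_i$ and predicted class probabilities $\hat{P}$:

\mathrm{0\text{-}1\ loss} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\bigl[\hat{y}_i \neq y_i\bigr],
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(1-\hat{P}(y_i \mid \mathbf{x}_i)\bigr)^{2}}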

Page 12: Geoff

Copyright © 2013 Geoffrey I Webb


Imbalanced classes

• Many big datasets have a rare class of interest and a majority class from which we seek to distinguish it.

– Ad click-through

– Conversions

– Disease

– Fraud

– Homeland security

Page 13: Geoff

Copyright © 2013 Geoffrey I Webb


Loss functions for imbalanced classes

Page 14: Geoff

Copyright © 2013 Geoffrey I Webb


Loss functions for imbalanced classes

              Predictions
              Pos    Neg
Actual  Pos   TP     FN
        Neg   FP     TN

Page 15: Geoff

Copyright © 2013 Geoffrey I Webb


Loss functions for imbalanced classes

• Area under the ROC curve

True Positive Rate (TPR) and False Positive Rate (FPR), read from the confusion matrix:

              Predictions
              Pos    Neg
Actual  Pos   TP     FN
        Neg   FP     TN

Prof. William H. Press, "Unit 17: Classifier Performance: ROC, Precision-Recall, and All That."
http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf
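In terms of the confusion matrix above (standard definitions, added for reference):

\mathrm{TPR} = \frac{TP}{TP+FN}, \qquad \mathrm{FPR} = \frac{FP}{FP+TN}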

Page 16: Geoff

Copyright © 2013 Geoffrey I Webb


Loss functions for imbalanced classes

• Area under the Precision Recall Curve

Prof. William H. Press, “Unit 17: Classifier Performance: ROC, Precision-Recall, and All That.”

http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf

Recall = True Positive Rate (TPR); Precision is likewise read from the confusion matrix:

              Predictions
              Pos    Neg
Actual  Pos   TP     FN
        Neg   FP     TN
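Again in terms of the confusion matrix (standard definitions, added for reference):

\mathrm{Precision} = \frac{TP}{TP+FP}, \qquad \mathrm{Recall} = \mathrm{TPR} = \frac{TP}{TP+FN}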

Page 17: Geoff

Copyright © 2013 Geoffrey I Webb


Mutual information
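The slide's formula is not in the transcript; the standard definition of the mutual information between a discrete attribute X and the class C is

I(X;C) = \sum_{x}\sum_{c} P(x,c)\,\log\frac{P(x,c)}{P(x)\,P(c)}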

Page 18: Geoff

Copyright © 2013 Geoffrey I Webb


Learning curves

[Figure: learning curve for KDB k=2, root mean squared error (y-axis, 0 to 0.7) against data quantity (x-axis, 0 to 1,000,000).]

Page 19: Geoff

Copyright © 2013 Geoffrey I Webb


Sampling

• Select s instances from a dataset of size n

• Important that sample be selected randomly

• Make sure you use a robust random number generator

Page 20: Geoff

Copyright © 2013 Geoffrey I Webb


Ideal Sampling

• Select data quantity at which learning curve approaches asymptotic error and learn from sample

[Figure: the KDB k=2 learning curve (RMSE against data quantity).]

Page 21: Geoff

Copyright © 2013 Geoffrey I Webb


Finding asymptotic error

• Progressive sampling

[Figure: the KDB k=2 learning curve (RMSE against data quantity).]

Provost, F, D Jensen, T Oates. “Efficient progressive sampling.” In Proc 5th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 23-32. ACM, 1999.
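A minimal sketch of geometric progressive sampling in Python. The fit_and_score(n) callable is hypothetical (not from the slides): it is assumed to train on the first n instances and return held-out error; sampling stops once the learning curve flattens.

import random

def progressive_sample_size(n_total, fit_and_score, start=1000, factor=2, tol=0.005):
    # fit_and_score is a user-supplied, hypothetical helper: error after training on n instances
    prev_err = float("inf")
    n = start
    while n <= n_total:
        err = fit_and_score(n)
        if prev_err - err < tol:      # learning curve has flattened: asymptotic error reached
            return n
        prev_err = err
        n *= factor                   # geometric sampling schedule
    return n_total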

Page 22: Geoff

Copyright © 2013 Geoffrey I Webb


Hoeffding's bound

Hulten, G, and P Domingos. "Mining complex models from arbitrarily large databases in constant time." In Proceedings 8th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 525-531. ACM, 2002.

[Equation (not in transcript): Hoeffding's bound, relating the sample size, error margin, sample mean, and population mean.]
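A hedged reconstruction of the bound as it is usually stated in this setting: with probability at least $1-\delta$, the sample mean $\bar{x}$ of $n$ independent observations of a quantity with range $R$ lies within $\epsilon$ of the population mean $\mu$, where

\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}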

Page 23: Geoff

Copyright © 2013 Geoffrey I Webb


Maximum sample

• Take largest sample capacity can handle

[Figure: the KDB k=2 learning curve (RMSE against data quantity).]

Page 24: Geoff

Copyright © 2013 Geoffrey I Webb


Maximum sample

• Take largest sample capacity can handle

• Saves overheads of repeated sampling and risk of terminating too soon

• Has risk that asymptotic error may not be reached

– but alternative techniques wouldn’t be able to handle a larger sample anyway!

Page 25: Geoff

Copyright © 2013 Geoffrey I Webb


Sampling with and without replacement

• Sampling involves deciding how many times K_i each element i of a collection should occur in the sample

• Sampling without replacement restricts K_i to 0 or 1

Page 26: Geoff

Copyright © 2013 Geoffrey I Webb


Uniform fixed-sized sampling with replacement for fixed n

selected ← 0
while selected < s:
    add a randomly selected instance to the sample
    increment selected
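A minimal Python sketch of the pseudocode above; the function name and use of the standard random module are my own choices.

import random

def sample_with_replacement(data, s, seed=0):
    # draw s instances uniformly at random, with replacement, from an in-memory dataset
    rng = random.Random(seed)
    return [data[rng.randrange(len(data))] for _ in range(s)]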

Page 27: Geoff

Copyright © 2013 Geoffrey I Webb


Uniform sequential variable-sized sampling without replacement

i ← 1
while i ≤ n:
    with fixed probability p:
        add instance i to the sample
    increment i
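A minimal Python sketch of this Bernoulli-style sampling: each instance is kept independently with probability p, so the sample size varies.

import random

def bernoulli_sample(stream, p, seed=0):
    rng = random.Random(seed)
    return [x for x in stream if rng.random() < p]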

Page 28: Geoff

Copyright © 2013 Geoffrey I Webb


Uniform sequential fixed-sized sampling without replacement for known n

selected ← 0
i ← 1
while selected < s:
    with probability (s − selected) / (n − i + 1):
        add instance i to the sample
        increment selected
    increment i
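A minimal Python sketch of the same scheme: instance i is taken with probability (instances still needed) / (instances still remaining), which yields a uniform sample of exactly s of the n instances.

import random

def selection_sample(stream, n, s, seed=0):
    rng = random.Random(seed)
    sample, selected = [], 0
    for i, x in enumerate(stream, start=1):
        if selected == s:
            break
        if rng.random() < (s - selected) / (n - i + 1):
            sample.append(x)
            selected += 1
    return sample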

Page 29: Geoff

Copyright © 2013 Geoffrey I Webb


Uniform sequential fixed-sized sampling without replacement for unknown n

count ← 0
while count < s and more instances remain:
    add the next instance to the sample
    increment count
while more instances remain:
    increment count
    with probability s / count:
        add the next instance to the sample, replacing an existing sample member selected at random
    otherwise:
        discard the next instance

Tille, Yves. Sampling algorithms. Springer, 2006.
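This is reservoir sampling; a minimal Python sketch:

import random

def reservoir_sample(stream, s, seed=0):
    rng = random.Random(seed)
    reservoir = []
    for count, x in enumerate(stream, start=1):
        if count <= s:
            reservoir.append(x)                 # fill the reservoir with the first s instances
        elif rng.random() < s / count:
            reservoir[rng.randrange(s)] = x     # replace a random member with probability s/count
    return reservoir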

Page 30: Geoff

Copyright © 2013 Geoffrey I Webb


Dimensionality reduction

• Many learning algorithms are super-linear with respect to dimensionality

• Dimensionality can be reduced by

– feature selection

– feature projection

Page 31: Geoff

Copyright © 2013 Geoffrey I Webb


Feature selection

• Most powerful techniques are too computationally intensive for big data

– Eg wrapper techniques

– Best approach varies depending on base learner

• Techniques that consider only the relationship between an attribute and the class are efficient

– Eg top-k mutual information

– However, overlook complex interactions between attributes

• May be most effective to use powerful technique on a sample
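A minimal sketch of top-k mutual-information feature selection using scikit-learn; this particular implementation and the synthetic dataset are my own choices, not prescribed by the slides.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=10000, n_features=50, n_informative=8, random_state=0)
selector = SelectKBest(mutual_info_classif, k=10)   # keep the 10 attributes with highest MI with the class
X_reduced = selector.fit_transform(X, y)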

Page 32: Geoff

Copyright © 2013 Geoffrey I Webb


Feature Projection

• Project feature space onto lower dimensional space

• Principal Components Analysis

• First principal component is the planar projection that maximises variance (= minimises RMSE with respect to original)

• Subsequent principal components are those that maximise variance (= minimise RMSE) while being uncorrelated with prior components

• First few principal components will capture most of the variation (= information) in the data

• Generalisations including principal curves and manifolds project onto manifolds instead of planes
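A minimal sketch of principal components analysis via the singular value decomposition (one of several equivalent ways to compute it):

import numpy as np

def pca_project(X, d):
    Xc = X - X.mean(axis=0)                           # centre each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                              # scores on the first d principal components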

Page 33: Geoff

Copyright © 2013 Geoffrey I Webb


Scaling-up existing algorithms

• Distributed cloud/cluster computing

• Hadoop

– Commodity clusters

– Map Reduce

• Map problem onto sub-problems and distribute these

• Assemble solution from solutions to sub-problems

White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012
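A toy, single-machine illustration of the map/reduce pattern described above (not Hadoop itself): each data shard is mapped to partial attribute-value/class counts, and the partial results are reduced into a global count table.

from collections import Counter
from functools import reduce

def map_shard(shard):
    # shard: list of (attribute-value tuple, class) pairs
    return Counter((i, v, y) for x, y in shard for i, v in enumerate(x))

def reduce_counts(c1, c2):
    c1.update(c2)                 # Counter.update adds counts
    return c1

shards = [[(("a", "x"), 1), (("b", "x"), 0)], [(("a", "y"), 1)]]
global_counts = reduce(reduce_counts, (map_shard(s) for s in shards), Counter())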

Page 34: Geoff

Copyright © 2013 Geoffrey I Webb


Streaming algorithms

• Handle data that are too large to retain

– computer network/phone traffic, financial transactions, web searches, sensor data

• May be difficult to get labelled data

• Strong memory and running time constraints

– the rate at which the learner processes data must exceed the rate at which data arrive

– only limited data can be retained

• Real-time evaluation of accuracy, mainly so that parameters can be adjusted accordingly

Page 35: Geoff

Copyright © 2013 Geoffrey I Webb


Online and incremental learning

• Online learning

– Data arrives as input stream

– Classifier makes prediction

– Then correct classification is revealed and classifier updated

– Examples

• Ad placement, online conversions

• Incremental learning

– Classifier is updated as input arrives

– Classifier is identical to batch classifier

Auer, Peter. “Online Learning.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 736-743.

Page 36: Geoff

Copyright © 2013 Geoffrey I Webb


Streaming Strategies

• Retain samples of data and learn from these

– Continually assess the current model against incoming data; when it loses accuracy, take a new sample and relearn

• Continually update a model using current data

– Refine using new data

– Prune elements that decline in accuracy

• Create ensemble of classifiers each learned from successive time periods

– Retire older classifiers as newer ones are created

Page 37: Geoff

Copyright © 2013 Geoffrey I Webb


Weighted majority algorithm

• Each classifier E has a weight w_t^E

• Classification by weighted majority vote

• All incorrect classifiers have their weights reduced: w_{t+1}^E = β · w_t^E, where 0 < β < 1

• Error is bounded to no more than twice the error of the best classifier

Littlestone, N, and MK Warmuth. "The weighted majority algorithm." In 30th Annual Symposium on Foundation of Computer Science, pp. 256-261. IEEE, 1989.
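A minimal sketch of the weighted majority algorithm for binary predictions, with beta the penalty factor from the slide:

def weighted_majority(rounds, labels, beta=0.5):
    # rounds: for each time step, the list of classifier predictions (0 or 1)
    w = [1.0] * len(rounds[0])
    mistakes = 0
    for preds, y in zip(rounds, labels):
        vote_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        vote_for_0 = sum(wi for wi, p in zip(w, preds) if p == 0)
        y_hat = 1 if vote_for_1 >= vote_for_0 else 0
        mistakes += int(y_hat != y)
        w = [wi * beta if p != y else wi for wi, p in zip(w, preds)]   # demote incorrect classifiers
    return w, mistakes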

Page 38: Geoff

Copyright © 2013 Geoffrey I Webb


Winnow

• Binary attributes

• Non-negative real-valued weights w_i

• Threshold θ: predict 1 if Σ_i w_i x_i > θ, otherwise predict 0

Prediction   Correct   x_i = 0     x_i = 1
1            0         unchanged   demoted (w_i ← w_i / α)
0            1         unchanged   promoted (w_i ← α · w_i)

Littlestone, N. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm." Machine Learning 2(4)(1988): 285-318.

Page 39: Geoff

Copyright © 2013 Geoffrey I Webb


Stochastic gradient descent

• Many classifiers have parameters that are learned by optimisation

e.g. logistic regression and SVM

– usually requires many passes through the data

• For linear classifiers stochastic gradient descent often converges before a single pass is completed.

– global gradient approximated by the gradient at each example

– performs sequential updates

– good step size is essential

• learn from an initial sample

– must take examples in random order

Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." In Proceedings 21st International Conference on Machine learning, p. 116. ACM, 2004.

Page 40: Geoff

Copyright © 2013 Geoffrey I Webb


[Figure: learning curves (RMSE against data quantity) for KDB k=2 and KDB k=5.]

Bias and variance

• Learning curves are not all equal

[Figure: learning curve (RMSE against data quantity) for KDB k=2.]

Page 41: Geoff

Copyright © 2013 Geoffrey I Webb


Bias and variance

• A major factor in the difference between learning curves

• Decomposition of 0-1 loss

• Bias and variance relate to the performance of the learner given different training sets

“Bias Variance Decomposition.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 100-101.

Page 42: Geoff

Copyright © 2013 Geoffrey I Webb


Bias and Variance

[Figure: several training sets are drawn from the same source; a learner trained on each is asked to classify the same two test instances (true classes withheld), and the predictions can differ from one training set to the next.]

Page 43: Geoff

Copyright © 2013 Geoffrey I Webb


Bias and Variance

[Figure: the same diagram, at the next step of the slide build.]

Page 44: Geoff

Copyright © 2013 Geoffrey I Webb


Bias and Variance

[Figure: the same diagram, at the next step of the slide build.]

Page 45: Geoff

Copyright © 2013 Geoffrey I Webb


Bias and Variance

[Figure: the same diagram, with some predictions now marked as errors (X).]

Variance ≈ (lower limit on) error due to variability in response to sampling

Page 46: Geoff

Copyright © 2013 Geoffrey I Webb


Bias and Variance

[Figure: the same diagram, with further predictions marked as errors (X).]

Variance ≈ (lower limit on) error due to variability in response to sampling

Bias ≈ error due to central tendency of the learner

Bias = error - variance

Page 47: Geoff

Copyright © 2013 Geoffrey I Webb


Bias and variance

[Image: four dartboard panels illustrating high bias with high variance, low bias with high variance, high bias with low variance, and low bias with low variance.]

Image from Bias Variance Decomposition, in Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 100-101.

Page 48: Geoff

Copyright © 2013 Geoffrey I Webb


Intrinsic error

• Many bias/variance analyses also include intrinsic error

• For our purposes this is included in bias

Page 49: Geoff

Copyright © 2013 Geoffrey I Webb


Bias/variance and big data

• As data quantity increases, variance should decrease

• Low variance important for small data

• Low bias important for big data

Page 50: Geoff

Copyright © 2013 Geoffrey I Webb


Low bias important for big data

• Low bias requires capacity to describe wide variety of multivariate distributions

• Big datasets contain fine detail needed to precisely delineate complex multivariate distributions

Page 51: Geoff

Copyright © 2013 Geoffrey I Webb


Bias/variance and big data

[Figure: learning curves (RMSE against data quantity) for naïve Bayes, KDB k=2, and KDB k=5.]

Page 52: Geoff

Copyright © 2013 Geoffrey I Webb


Most machine learning research has used small data

[Figure: the same learning curves for naïve Bayes, KDB k=2, and KDB k=5 (RMSE against data quantity).]

Page 53: Geoff

Copyright © 2013 Geoffrey I Webb


Computational tractability

• Error will be minimised by low bias algorithms

• Big data require efficient computation

– Linear wrt size

– Learn in a limited number of passes

• Most low-bias learners are compute intensive

– super-linear with respect to data quantity

– Kernel SVM and Random Forests

Page 54: Geoff

Copyright © 2013 Geoffrey I Webb


k-dependence Bayesian classifier (KDB)

• Bayesian network classifier proposed by Sahami (1996).

• KDB

– the probability of each attribute value is conditioned by the class and at most k other attributes.

– Extends TAN to multiple parents.

[Diagram: KDB network, class C is a parent of attributes A1–A4, which also have edges between them.]

Page 55: Geoff

Copyright © 2013 Geoffrey I Webb


k-dependence Bayesian classifier (KDB)

• k=0 is Naïve Bayes

• As k increases, bias decreases and variance increases

• High k with low bias should have low error for big data.

[Diagram: the KDB network over class C and attributes A1–A4. Figure: learning curves (RMSE against data quantity) for naïve Bayes, KDB k=2, and KDB k=5.]

Page 56: Geoff

Copyright © 2013 Geoffrey I Webb


KDB algorithm

1st pass:
• Order the attributes according to mutual information (MI) with the class.

2nd pass:
• Assign k parents to each attribute according to MI conditioned on the class.
• Add the class as a parent of all attributes.

[Diagram: the resulting KDB network over class C and attributes A1–A4.]
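A minimal Python sketch of the two passes described above, for discrete attributes; the helper functions and their names are mine.

import math
from collections import Counter

def mutual_information(x, y):
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def conditional_mi(x, z, y):
    # I(X;Z | Y) = sum over class values of P(y) * I(X;Z within that class)
    n = len(y)
    return sum((len(idx) / n) * mutual_information([x[i] for i in idx], [z[i] for i in idx])
               for idx in ([i for i in range(n) if y[i] == c] for c in set(y)))

def kdb_structure(columns, y, k=2):
    # columns: list of attribute columns; returns each attribute's (up to k) attribute parents.
    # The class is additionally a parent of every attribute.
    order = sorted(range(len(columns)), key=lambda i: mutual_information(columns[i], y), reverse=True)
    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]
        ranked = sorted(earlier, key=lambda j: conditional_mi(columns[i], columns[j], y), reverse=True)
        parents[i] = ranked[:k]
    return parents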

Page 57: Geoff

Copyright © 2013 Geoffrey I Webb


Two pass learning

[Equations (not in transcript): time and space complexity of the two-pass learning, expressed in terms of the number of instances, number of attributes, number of classes, and average number of values per attribute.]

Page 58: Geoff

Copyright © 2013 Geoffrey I Webb


Selective KDB - Motivation

• KDB is efficient and effective for large data.

• Irrelevant attributes can increase error.

• Cannot predetermine the best k for a given data quantity.

• Want an efficient way to select attributes and best k.

Page 59: Geoff

Copyright © 2013 Geoffrey I Webb


Selective KDB

[Diagram: attributes A1–A4 are ordered by MI(Ai;C) and parents are chosen by MI(Ai;Aj,C); candidate models are evaluated by leave-one-out CV (Pazzani's trick) under loss functions LF1–LF4. Because attributes are ordered by MI, each alternative model tested is a minor addition to the previous, and the best-scoring model is kept.]

Page 60: Geoff

Copyright © 2013 Geoffrey I Webb


Selective KDB

• The loss function can be RMSE, 0-1 loss, Matthews Correlation Coefficient (for unbalanced datasets), etc.

• The value of k still has to be tuned.

– Solution: Selective² KDB: a k × n matrix of loss-function results.

[Figure: the matrix of loss-function results built up incrementally, with entries for attributes a1–a6 and parent settings p1–p3.]

Page 61: Geoff

Copyright © 2013 Geoffrey I Webb


Selective KDB

• The loss function can be RMSE, 0-1 loss, Matthews Correlation Coefficient (for unbalanced datasets), etc.

• The value of k still has to be tuned.

– Solution: Selective² KDB: a k × n matrix of loss-function results.

[Table (values not in transcript): training time and test time for KDB, Selective KDB, and Selective² KDB.]

Page 62: Geoff

Copyright © 2013 Geoffrey I Webb


Selective KDB – Results (RMSE)

• Competitive with KDB in 16 very large datasets (165K-54.6M examples):

• Mean best k = 4.11

• Mean % attributes selected = 82.66 ± 26.72

[Table: win–draw–loss records relative to KDB (column headings not recoverable from the transcript): selective KDB 8-8-0, 5-11-0, 5-11-0, 6-10-0, 6-9-1; k-selective KDB 5-11-0.]

Page 63: Geoff

Copyright © 2013 Geoffrey I Webb


Selective KDB – Results (RMSE)

• Comparison with Random Forest.

• Need to sample in 3/4 (out of 16) datasets to get RF 10/100 results.

Win–draw–loss vs Random Forest:

                   RF (5EF)                  RF (Num)
                   Trees=10    Trees=100     Trees=10    Trees=100
k-selective KDB    6-1-6       4-1-7         5-0-8       4-0-8

RMSE:

                            Mnist            MITC             Satellite        Splice
                            (250K/8.1M)      (600K/839K)      (2M/8.7M)        (10M/54.6M)
RF (100), Sample            0.2958±0.0017    0.0518±0.0007    0.4568±0.0006    0.0530±0.0005
k-selective KDB, Sample     0.2324±0.0029    0.0455±0.0019    0.4531±0.0011    0.0521±0.0006
k-selective KDB, All data   0.1449±0.0007    0.0446±0.0020    0.4448±0.0004    0.0523±0.0002

Page 64: Geoff

Copyright © 2013 Geoffrey I Webb


Selective KDB – Results (MCC)

• Unbalanced datasets: use MCC as loss function.

• Splice dataset: the positive class makes up only 0.32% of instances.

• Comparison with Random Forest.

KDB      selective KDB
0.1768   0.1918
0.1855   0.1984
0.1932   0.2043
0.1986   0.2105
0.2061   0.2148

MCC                         MITC (600K/839K)   Splice (10M/54.6M)
RF (100), Sample            0.9989             0.0950
k-selective KDB, Sample     0.9954             0.1963
k-selective KDB, All data   0.9956             0.2148

Discrete attributes

Numeric attributes

Page 65: Geoff

Copyright © 2013 Geoffrey I Webb


Incremental Bayesian Network Classifiers

[Diagram: naïve Bayes network, class y is the parent of attributes x1, x2, x3, …, xn.]

Page 66: Geoff

Copyright © 2013 Geoffrey I Webb


Incremental naïve Bayes

• Probability estimates are based on counts of the frequency of each attribute value co-occurring with the class

• These can be updated incrementally

• Can these desirable features be generalised to more sophisticated learners?
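A minimal sketch of the counts involved and how they are updated one instance at a time (the class and method names are mine):

from collections import defaultdict

class IncrementalNBCounts:
    def __init__(self):
        self.n = 0
        self.class_counts = defaultdict(int)
        self.joint_counts = defaultdict(int)      # (class, attribute index, value) -> count

    def update(self, x, y):
        # incremental learning: one instance at a time, just increment counts
        self.n += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.joint_counts[(y, i, v)] += 1

    def cond_prob(self, i, v, y, n_values):
        # P(x_i = v | y) with Laplace smoothing over the attribute's n_values values
        return (self.joint_counts[(y, i, v)] + 1) / (self.class_counts[y] + n_values)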

Page 67: Geoff

Copyright © 2013 Geoffrey I Webb


Adding edges reduces bias

• With additional edges it is possible to exactly represent all naïve Bayes distributions and more

– Lower bias

– Higher variance

– Should be more accurate for bigger data

– But which edges should we add?

[Diagram: the naïve Bayes network with additional edges added between attributes.]

Page 68: Geoff

Copyright © 2013 Geoffrey I Webb


Averaged n-Dependence Estimators

• Develop all of a family of classifiers that each add edges to naïve Bayes

• Select order of dependence, n

• Each model selects n attributes

– All other attributes are independent given these attributes and the class

– Each model has lower bias but higher variance than NB

– Ensembling reduces the variance

Webb, GI, JR Boughton, F Zheng, KM Ting, H Salem. "Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification." Machine Learning 86(2) (2012): 233-272.

Page 69: Geoff

Copyright © 2013 Geoffrey I Webb


Averaged n-Dependence Estimators

[Equation (not in transcript): the AnDE estimate averages over all subsets of n attributes.]
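A hedged reconstruction of the averaged estimate (the published method also excludes subsets whose attribute-value combinations fall below a minimum frequency):

\hat{P}_{\mathrm{AnDE}}(y,\mathbf{x}) \;\propto\; \sum_{s \in \binom{\{1,\dots,a\}}{n}} \hat{P}(y, x_{s}) \prod_{i=1}^{a} \hat{P}(x_{i} \mid y, x_{s})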

Page 70: Geoff

Copyright © 2013 Geoffrey I Webb


Averaged n-Dependence Estimators

• Incremental learning in a single pass through the data

• Training time complexity O(m a^(n+1))

• Classification time complexity O(a^(n+1) k)

• Space complexity O(a^(n+1) v^(n+1) k)

where m is the number of training examples, a the number of attributes, k the number of classes, and v the average number of values per attribute.

Page 71: Geoff

Copyright © 2013 Geoffrey I Webb


Averaged n-Dependence Estimators

• As n increases bias decreases

– Good for big data

[Figure: learning curves (RMSE against data quantity) for naïve Bayes, A1DE, A2DE, and A3DE.]

Page 72: Geoff

Copyright © 2013 Geoffrey I Webb


Subsumption resolution

• If P(x1 | x2) = 1.0 then P(y | x1,x2) = P(y | x2)

– Eg P(oedema | female, pregnant) = P(oedema | pregnant)

• Subsumption resolution looks for subsuming attributes at classification time and ignores them

– Simple correction for extreme form of violation of attribute independence assumption

– Very effective in practice – reduces bias at a small cost in variance – though not always applicable

– For AnDE with n≥1 it uses statistics that are already collected – no learning overhead – and often reduces classification time

Zheng, F, GI Webb, P Suraweera, L Zhu. "Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning." Machine Learning 87(1)(2012): 93-125.
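A minimal sketch of the subsumption test using the frequency counts such a learner already stores; the minimum-support parameter is an assumption, not a value taken from the paper.

def is_subsumed(joint_count, marginal_count, v1, v2, min_support=100):
    # v1 can be ignored at classification time if every (sufficiently frequent) training
    # instance containing v2 also contains v1, i.e. P(v1 | v2) is estimated as 1
    return marginal_count[v2] >= min_support and joint_count[(v1, v2)] == marginal_count[v2]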

Page 73: Geoff

Copyright © 2013 Geoffrey I Webb


Weighting

Jiang, Liangxiao, and Harry Zhang. "Weightily averaged one-dependence estimators." In PRICAI 2006, pp. 970-974. Springer Berlin, 2006.

Page 74: Geoff

Copyright © 2013 Geoffrey I Webb


Weighting

• Weighting also reduces bias at the cost of a small increase in variance

[Figure: learning curves (RMSE against data quantity) for A3DE and weighted A3DE (A3DE W).]

Page 75: Geoff

Copyright © 2013 Geoffrey I Webb


Weighting and subsumption resolution are complementary

• When SR is applicable, both in combination have lower bias but slightly higher variance than either alone

RMSE

        Dataset         Size        A2DE    A2DE-SR   A2DE-W   A2DE-WSR
small   cleveland       303         0.359   0.360     0.361    0.361
        balance-scale   625         0.430   0.430     0.430    0.430
        anneal          898         0.118   0.098     0.116    0.096
large   adult           48,842      0.313   0.306     0.308    0.303
        localization    164,860     0.499   0.499     0.498    0.498
        covtype         581,102     0.371   0.349     0.350    0.335
        poker-hand      1,025,010   0.496   0.496     0.420    0.420
        kddcup          5,209,460   0.044   0.040     0.043    0.039

Page 76: Geoff

Copyright © 2013 Geoffrey I Webb

Questions?

Page 77: Geoff

Copyright © 2013 Geoffrey I Webb

References

Silver, Nate. The Signal and the Noise: Why So Many Predictions Fail-but Some Don't. Penguin Press, 2012.

Whitelaw, Casey, Ben Hutchinson, Grace Y. Chung, and Gerard Ellis. "Using the web for language independent spellchecking and autocorrection." In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pp. 890-899. Association for Computational Linguistics, 2009.

Prof. William H. Press, “Unit 17: Classifier Performance: ROC, Precision-Recall, and All That.” http://www.nr.com/CS395T/lectures2008/17-ROCPrecisionRecall.pdf

Provost, Foster, David Jensen, and Tim Oates. “Efficient progressive sampling.” In Proceedings 5th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, pp. 23-32. ACM, 1999.

Hulten, Geoff, and Pedro Domingos. "Mining complex models from arbitrarily large databases in constant time." In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 525-531. ACM, 2002.

Tille, Yves. Sampling algorithms. Springer, 2006.

White, Tom. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012

Auer, Peter. “Online Learning.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 736-743.

Littlestone, Nick, and Manfred K. Warmuth. "The weighted majority algorithm." In Foundations of Computer Science, 1989., 30th Annual Symposium on, pp. 256-261. IEEE, 1989.

Littlestone, Nick. "Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm." Machine learning 2, no. 4 (1988): 285-318.

Zhang, Tong. "Solving large scale linear prediction problems using stochastic gradient descent algorithms." In Proceedings 21st

International Conference on Machine learning, p. 116. ACM, 2004.

“Bias Variance Decomposition.” In Encyclopedia of Machine Learning, C. Sammut and G.I. Webb, Editors. 2010, Springer: New York. p. 100-101.

Sahami, Mehran. "Learning limited dependence Bayesian classifiers." In KDD-96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 335-338. 1996.

Webb, Geoffrey I., Janice R. Boughton, Fei Zheng, Kai Ming Ting, and Houssam Salem. "Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification." Machine Learning 86, no. 2 (2012): 233-272.

Zheng, Fei, Geoffrey I. Webb, Pramuditha Suraweera, and Liguang Zhu. "Subsumption resolution: an efficient and effective technique for semi-naive Bayesian learning." Machine Learning 87, no. 1 (2012): 93-125.

Jiang, Liangxiao, and Harry Zhang. "Weightily averaged one-dependence estimators." In PRICAI 2006: trends in artificial intelligence, pp. 970-974. Springer Berlin Heidelberg, 2006.