Incremental Methods for Machine Learning Problems


Page 1: Incremental Methods for Machine Learning Problems

Incremental Methods for Machine Learning Problems

Aristidis Likas

Department of Computer Science, University of Ioannina
e-mail: [email protected]

http://www.cs.uoi.gr/~arly

Page 2: Incremental Methods for Machine Learning Problems

Outline
• Machine Learning: Data Modeling + Optimization
• The incremental machine learning framework
• Global k-means (PR, IEEE TNN)
• Greedy EM (NPL 2002, Bioinformatics)
• Incremental Bayesian GMM learning (IEEE TNN)
• Dip-Means
• Incremental Bayesian Supervised learning (IEEE TNN)
• Current research problems

– Matlab code available for all methods

Page 3: Incremental Methods for Machine Learning Problems

Machine Learning Problems
• Unsupervised Learning
  – Clustering
  – Density estimation
  – Dimensionality Reduction
• Supervised Learning
  – Classification
  – Regression

Also considered as Data mining or Pattern Recognition problems

Page 4: Incremental Methods for Machine Learning Problems

Machine Learning as Optimization
To solve a machine learning problem:
• dataset X of training examples
• parametric Data Model that ‘explains’ the data
  – f(x; Θ), Θ the set of parameters to be estimated during training
• objective function L(X; Θ)

Model training is achieved through the optimization of the objective function.

• Usually non-convex optimization with many local optima
• We search for a ‘near-optimal’ solution

Θ* = arg opt_Θ L(X; Θ)   s.t. constraints on Θ

Page 5: Incremental Methods for Machine Learning Problems

Machine Learning as Optimization
• Local search algorithms (gradient descent, BFGS, EM, k-means)
• Performance depends on the initialization of the parameters
• Typical solution: multiple (random) restarts (see the sketch below)
  – multiple local search runs from (random) initializations
  – keep the solution of the best run

• Weaknesses:
  – poor solutions for large models
  – How many runs?
  – How to initialize?
  – non-determinism: non-repeatability, difficulty in comparing different methods
• An alternative approach (in some cases):
  – incremental model training
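A minimal sketch of the multiple-restarts baseline described above, assuming plain k-means as the local search and NumPy; the function names and defaults are illustrative and are not the author's Matlab code.

```python
import numpy as np

def kmeans(X, init_centers, n_iters=100):
    """Plain k-means (local search) from a given initialization."""
    centers = init_centers.copy()
    for _ in range(n_iters):
        # assign each point to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update centers (keep the old center if a cluster becomes empty)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    error = ((X - centers[labels]) ** 2).sum()   # clustering error L(X; Theta)
    return centers, labels, error

def multiple_restarts(X, k, n_restarts=20, seed=0):
    """Run the local search from several random initializations, keep the best run."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        init = X[rng.choice(len(X), size=k, replace=False)]
        centers, labels, error = kmeans(X, init)
        if best is None or error < best[2]:
            best = (centers, labels, error)
    return best
```

Each restart draws k data points as initial centers; the run with the lowest clustering error is kept, which illustrates the non-determinism and "how many runs?" issues listed above.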

Page 6: Incremental Methods for Machine Learning Problems

Building Blocks formulation
• Many popular Data Models can be written as a combination (or simply as a set) of “Building Blocks” (BBs)
  – Number of BBs = model order
• The combination function may also include parameters (w1, …, wM)
• Set of model parameters ΘM
• Examples
  – k-means clustering: B = cluster centers, L = clustering error
  – Mixture Models: B = component densities, L = likelihood
  – FF Neural Networks: B = sigmoidal or RBF hidden units, L = LS error
  – Kernel Models: B = basis functions (kernels), L = loss functions

f_M(x) = Comb_M(B_1(x; θ_1), …, B_M(x; θ_M))
Θ_M = {θ_1, …, θ_M, w_1, …, w_M}

Page 7: Incremental Methods for Machine Learning Problems

Building Blocks
• In some models the building blocks are fixed a priori.
  – Only optimization w.r.t. the combination weights wi is required (a convex problem in many cases, e.g. SVM).
• In the general case all the BB parameters θi should be learnt.
• Non-convex optimization problem
  – many local optima
  – local search methods
  – dependence on the initialization of ΘM

• Resort to incremental training

Page 8: Incremental Methods for Machine Learning Problems

Incremental training
• The incremental (greedy) approach can offer a simple and effective solution to the random restarts problem in training ML models.
• Incremental methods are based on the following assumption:
  – We can obtain a ‘near-optimal’ model with k BBs by exploiting a ‘near-optimal’ model with (k-1) BBs.
• Method: starting with k=1 BB, incremental methods sequentially add one BB at each step until M BBs have been added (see the sketch below).
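A schematic sketch of the generic greedy loop, in Python for concreteness; `fit_new_block`, `local_search` and `objective` are placeholder callables standing for the problem-specific steps, not functions from the slides.

```python
def incremental_training(X, M, fit_new_block, local_search, objective):
    """Generic greedy loop: add one building block at a time (schematic).

    fit_new_block(X, model) -> parameters of a new BB given the current model
    local_search(X, model)  -> optionally refine all parameters (full training)
    objective(X, model)     -> value of L(X; Theta_k)
    """
    model = []                       # list of building-block parameters
    history = []                     # solutions for all intermediate orders k = 1..M
    for k in range(1, M + 1):
        theta_k = fit_new_block(X, model)   # fast step: optimize only the new BB
        model = model + [theta_k]
        model = local_search(X, model)      # optional: refine the full k-BB model
        history.append((k, objective(X, model), list(model)))
    return model, history
```

The `history` list reflects the point made later in the slides that incremental methods also return solutions for all intermediate model orders.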

Page 9: Incremental Methods for Machine Learning Problems

Incremental Training Approaches
• 1. Fast approach: optimize only w.r.t. θk of the k-th ΒΒ, keeping θ1…θk-1 fixed to the solution of the (k-1)-BB model.
  – Exhaustive enumeration (deterministic)
  – Multiple restarts, but the search space is much smaller
• 2. Fast approach followed by full model training (once)

1. θ*_k = arg opt_{θ_k} L(X; Θ*_{k-1} ∪ {θ_k}),   Θ*_k = Θ*_{k-1} ∪ {θ*_k}
2. Θ*_k = LS(Θ*_{k-1} ∪ {θ*_k})   (LS: local search on the full k-BB model)

Page 10: Incremental Methods for Machine Learning Problems

Incremental Training
• 3. Full model training with multiple restarts:
  – Initializations based on the (k-1)-BB model.
• Deterministic search is preferable (avoids randomness)
• Incremental methods also offer solutions for all intermediate models with k=1,…,M BBs

Θ_k^{(l)} = Θ*_{k-1} ∪ {θ_k^{(l)}},   l = 1, 2, …
Θ*_k = best over l of LS(Θ_k^{(l)})

Page 11: Incremental Methods for Machine Learning Problems

Prototype-Based Clustering
• Partition a dataset X of N vectors xi into M subsets (clusters) Ck such that intra-cluster variance is minimized.
• Intra-cluster variance: avg. distance from the cluster prototype mk
• k-means: prototype = cluster center
• Finds local minima w.r.t. the clustering error
  – sum of intra-cluster variances
• Highly dependent on the initial positions of the centers mk

(demo video: km.wmv)

Page 12: Incremental Methods for Machine Learning Problems

Global k-means
• Incremental, deterministic clustering algorithm that runs k-means several times
• Finds near-optimal solutions w.r.t. the clustering error

• Idea: a near-optimal solution for k clusters can be obtained by running k-means from an initial state

– the k-1 centers are initialized from a near-optimal solution of the (k-1)-clustering problem

– the k-th center is initialized at some data point xn (which?)

• Consider all possible initializations (one for each xn)

Initial state: (m*_1, m*_2, …, m*_{k-1}, x_n), where (m*_1, m*_2, …, m*_{k-1}) is the solution of the (k-1)-clustering problem.

Page 13: Incremental Methods for Machine Learning Problems

Global k-means

• In order to solve the M-clustering problem (see the sketch below):
  1. Solve the 1-clustering problem (trivial).
  2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
     – Execute k-means N times, initialized at the n-th run (n=1,…,N) as (m*_1, m*_2, …, m*_{k-1}, x_n).
     – Keep the solution corresponding to the run with the lowest clustering error as the solution with k clusters.
     – k := k+1. Repeat step 2 until k = M.
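A compact sketch of the loop above, assuming NumPy and a basic k-means routine; this is an illustrative reimplementation, not the author's Matlab code, and it omits the fast variant described on the following slides.

```python
import numpy as np

def kmeans(X, centers, n_iters=100):
    """Standard k-means started from a given set of initial centers."""
    centers = centers.copy()
    for _ in range(n_iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    error = ((X - centers[labels]) ** 2).sum()
    return centers, error

def global_kmeans(X, M):
    """Solve the 1,...,M clustering problems incrementally (one k-means run per data point)."""
    mean = X.mean(axis=0, keepdims=True)
    solutions = {1: (mean, ((X - mean) ** 2).sum())}
    for k in range(2, M + 1):
        prev_centers, _ = solutions[k - 1]
        best = None
        for x_n in X:                                  # try every data point as the k-th center
            init = np.vstack([prev_centers, x_n[None, :]])
            centers, error = kmeans(X, init)
            if best is None or error < best[1]:
                best = (centers, error)
        solutions[k] = best
    return solutions
```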

Page 14: Incremental Methods for Machine Learning Problems

[Figure: global k-means progression — panels show the best initial position for m2, m3, m4 and m5]

(demo video: glkm.wmv)

Page 15: Incremental Methods for Machine Learning Problems

Fast Global k-Means

• How is the complexity reduced?
  – We select the initial state that provides the greatest reduction in clustering error in the first iteration of k-means (the reduction can be computed analytically), i.e. the best state of the form (m*_1, m*_2, …, m*_{k-1}, x_n) — see the sketch below.
  – k-means is executed only once from this state.
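A sketch of the analytic selection step, assuming NumPy; it uses the guaranteed-reduction criterion b_n = Σ_j max(d_j − ||x_n − x_j||², 0) from the global k-means paper, where d_j is the squared distance of x_j to its closest current center — the formula itself is not spelled out on the slide, so treat it as a reconstruction.

```python
import numpy as np

def best_insertion_point(X, centers):
    """Pick the data point whose insertion as a new center gives the largest
    guaranteed reduction of the clustering error (first k-means iteration)."""
    # d[j]: squared distance of x_j to its closest existing center
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    # pairwise squared distances between candidate centers x_n and points x_j
    pair = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # b[n] = sum_j max(d[j] - ||x_n - x_j||^2, 0): guaranteed error reduction
    b = np.maximum(d[None, :] - pair, 0.0).sum(axis=1)
    return int(b.argmax())          # index of the chosen initial position
```

k-means is then run once from (m*_1, …, m*_{k-1}, X[best_insertion_point(X, centers)]).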

Page 16: Incremental Methods for Machine Learning Problems

Kernel-Based Clustering (non-linear separation)

– Given a set of objects and the kernel matrix K=[Kij] containing the similarities between each pair of objects

– Goal: Partition the dataset into subsets (clusters) Ck such that intra-cluster similarity is maximized.

– Kernel trick: Data points are mapped from input space to a higher dimensional feature space through a transformation φ(x).

– The kernel function corresponds to the inner product in feature space

– Kernel k-means ≡ k-means in feature space
  K_ij = φ(x_i)^T φ(x_j),   ||φ(x_i) − φ(x_j)||² = K_ii + K_jj − 2 K_ij

Page 17: Incremental Methods for Machine Learning Problems

Kernel k-Means
• Kernel k-means = k-means in feature space
  – Minimizes the clustering error in feature space
• Differences from k-means
  – Cluster centers mk in feature space cannot be computed explicitly
  – Each cluster Ck is explicitly described by its data objects
  – Distances from the centers in feature space are computed through the kernel matrix (see the sketch below)
• Finds local minima - strong dependence on the initial partition
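A minimal kernel k-means sketch, assuming NumPy and a precomputed kernel matrix; it uses the standard feature-space distance ||φ(x_i) − m_k||² = K_ii − (2/|C_k|) Σ_{j∈C_k} K_ij + (1/|C_k|²) Σ_{j,l∈C_k} K_jl, which the slide refers to but does not display.

```python
import numpy as np

def kernel_kmeans(K, labels, n_iters=50):
    """Kernel k-means on a precomputed kernel matrix K, starting from an
    initial partition given by `labels` (illustrative sketch, not the authors' code)."""
    n = K.shape[0]
    for _ in range(n_iters):
        clusters = np.unique(labels)
        dist = np.empty((n, len(clusters)))
        for c, k in enumerate(clusters):
            idx = np.where(labels == k)[0]
            # ||phi(x_i) - m_k||^2 = K_ii - 2/|C_k| sum_j K_ij + 1/|C_k|^2 sum_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        labels = clusters[dist.argmin(axis=1)]
    error = dist.min(axis=1).sum()     # clustering error in feature space
    return labels, error
```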

Page 18: Incremental Methods for Machine Learning Problems

Global Kernel k-Means
• In order to solve the M-clustering problem:
  1. Solve the 1-clustering problem with kernel k-means (trivial solution).
  2. Solve the k-clustering problem using the solution of the (k-1)-clustering problem:
     a) Let (C_1, C_2, …, C_{k-1}) denote the solution of the (k-1)-clustering problem.
     b) Execute kernel k-means N times, initialized during the n-th run as (C_1, …, C_l − {x_n}, …, C_{k-1}, C_k = {x_n}), where x_n ∈ C_l.
     c) Keep the run with the lowest clustering error as the solution with k clusters.
     d) k := k+1
  3. Repeat step 2 until k = M.
• The fast global kernel k-means variant can also be applied.

Page 19: Incremental Methods for Machine Learning Problems

[Figure: best initial clusters C2, C3 and C4 — empty circles mark the optimal initialization of the cluster to be added]

Page 20: Incremental Methods for Machine Learning Problems

Global Kernel k-means - Applications

• MRI image segmentation

• Key frame extraction - shot clustering

[Figure: MRI segmentation into Background, Muscle/Skin, Skin, White matter, Grey matter, CSF and Skull]

Page 21: Incremental Methods for Machine Learning Problems

Mixture Models
• Probability density estimation: estimate the density model f(x) that generated a given dataset X={x1,…, xN}

• Mixture Models

– M pdf components φj(x),

– mixing weights: π1, π2, …, πM (priors)

• Gaussian Mixture Model (GMM): φj = N(μj, Σj)

f(x) = Σ_{j=1}^M π_j φ_j(x; θ_j),   π_j ≥ 0,   Σ_{j=1}^M π_j = 1

Page 22: Incremental Methods for Machine Learning Problems

GMM (graphical model)

[Figure: GMM as a graphical model — hidden component indicator z with mixing weights πj, and the observation x]

Page 23: Incremental Methods for Machine Learning Problems

GMM examples


GMMs can be used for density estimation (like histograms) or for clustering:

P(j | x_n) = <z_j^n> = π_j φ_j(x_n; θ_j) / f(x_n)   (cluster membership probability)

Page 24: Incremental Methods for Machine Learning Problems

Mixture Model training
• Given a dataset X={x1,…, xN} and a GMM f(x; Θ)
• Likelihood:
• GMM training: log-likelihood maximization
• Expectation-Maximization (EM) algorithm
  – Applicable when the posterior P(Z|X) can be computed

p(X; Θ) = p(x_1, …, x_N; Θ) = ∏_{i=1}^N f(x_i; Θ)
Θ* = arg max_Θ Σ_{i=1}^N ln f(x_i; Θ)

Page 25: Incremental Methods for Machine Learning Problems

EM for Mixture Models
• E-step: compute the expectation of the hidden variables given the observations:
  <z_j^n> = P(j | x_n) = π_j φ_j(x_n; θ_j) / Σ_{p=1}^K π_p φ_p(x_n; θ_p)
• M-step: maximize the expected complete likelihood
  Θ^{(t+1)} = arg max_Θ Q(Θ) = <log p(X, Z; Θ)>_{P(Z|X)}
  Q(Θ) = Σ_{n=1}^N Σ_{j=1}^K <z_j^n> [ log π_j + log φ_j(x_n | θ_j) ]

Page 26: Incremental Methods for Machine Learning Problems

EM for GMM (M-step)

Mean:   μ_j^{(t+1)} = Σ_{n=1}^N <z_j^n> x_n / Σ_{n=1}^N <z_j^n>
Covariance:   Σ_j^{(t+1)} = Σ_{n=1}^N <z_j^n> (x_n − μ_j^{(t+1)}) (x_n − μ_j^{(t+1)})^T / Σ_{n=1}^N <z_j^n>
Mixing weights:   π_j^{(t+1)} = Σ_{n=1}^N <z_j^n> / N

(see the sketch below)
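A minimal EM sketch for a GMM combining the E-step and M-step above, assuming NumPy/SciPy; the initial parameters are supplied by the caller and the implementation is illustrative rather than the author's code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, means, covs, weights, n_iters=100):
    """One possible EM implementation for a GMM (E-step + M-step of the slides)."""
    N, M = len(X), len(weights)
    for _ in range(n_iters):
        # E-step: responsibilities <z_j^n> = pi_j phi_j(x_n) / f(x_n)
        dens = np.column_stack([
            weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(M)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, covariances and mixing weights
        Nj = resp.sum(axis=0)
        means = (resp.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nj[j]
        weights = Nj / N
    loglik = np.log(dens.sum(axis=1)).sum()
    return means, covs, weights, loglik
```

Depending on the initialization, EM converges to different local maxima of the log-likelihood, which is exactly the issue the next slide illustrates.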

Page 27: Incremental Methods for Machine Learning Problems

EM Local Maxima

Page 28: Incremental Methods for Machine Learning Problems

Greedy EM for GMM

• Start with k=1: f_1(x) = N(μ_1, Σ_1), μ_1 = mean(X), Σ_1 = cov(X)
• Let f_k be the GMM solution with k components
• Let φ(x; μ, Σ) be the (k+1)-th component to be added:
  f_{k+1}(x) = (1−a) f_k(x) + a φ(x; θ),   a ∈ (0,1)
• Select the new component by maximizing the partial log-likelihood:
  L_{k+1}(θ, a) = Σ_{i=1}^N log f_{k+1}(x_i) = Σ_{i=1}^N log[ (1−a) f_k(x_i) + a φ(x_i; θ) ]
  (θ*, a*) = arg max_{θ, a} Σ_{i=1}^N log[ (1−a) f_k(x_i) + a φ(x_i; θ) ]
• Refine f_{k+1}(x) using EM → final GMM with k+1 components

Page 29: Incremental Methods for Machine Learning Problems

Greedy EM for GMM

(θ*, a*) = arg max_{θ, a} Σ_{i=1}^N log[ (1−a) f_k(x_i) + a φ(x_i; θ) ]

• Candidate search restricted to the data points: μ̂ = arg max_{μ ∈ X} L̂_{k+1}(μ, σ), where L̂_{k+1} is a closed-form local criterion (see the sketch below)
• Σ = σI (spherical candidate components)
• Given θ = (μ, σ), the optimal weight a* can be computed analytically
  – Remark: the new component should be placed in a data region
  – Deterministic approach

(demo video: gem.wmv)
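A sketch of the candidate-selection step, assuming NumPy/SciPy; `fk_pdf` is a placeholder callable returning the current mixture density f_k at each point, and for simplicity the mixing weight a is kept fixed instead of using the analytic a* mentioned on the slide.

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_new_component(X, fk_pdf, sigma, a=0.5):
    """Greedy candidate search: try a spherical component phi(.; x_n, sigma^2 I)
    at every data point and keep the one maximizing the partial log-likelihood
    L_{k+1} = sum_i log[(1-a) f_k(x_i) + a phi(x_i)]."""
    fk_vals = fk_pdf(X)                          # current mixture density f_k(x_i)
    d = X.shape[1]
    cov = (sigma ** 2) * np.eye(d)
    best_mu, best_L = None, -np.inf
    for x_n in X:
        phi_vals = multivariate_normal.pdf(X, mean=x_n, cov=cov)
        L = np.log((1 - a) * fk_vals + a * phi_vals).sum()
        if L > best_L:
            best_mu, best_L = x_n, L
    return best_mu, cov, best_L
```

The selected component is then appended with weight a and the full (k+1)-component mixture is refined with EM (e.g. the gmm_em sketch above).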

Page 30: Incremental Methods for Machine Learning Problems

Greedy-EM applications

• Image modeling for content-based retrieval and relevance feedback

• Motif discovery in sequences (discrete data, mixture of multinomials)

• Time series clustering (mixture of regression models)

Page 31: Incremental Methods for Machine Learning Problems

Bayesian GMM

f(x) = Σ_{j=1}^M π_j φ(x; μ_j, T_j),   Σ_{j=1}^M π_j = 1
Precisions: T_j ~ Wishart(v, V),   p(T) = ∏_{j=1}^M p(T_j)
Means: μ_j ~ N(m, S),   p(μ) = ∏_{j=1}^M p(μ_j)
Mixing weights: (π_1, …, π_M) ~ Dirichlet(a_1, …, a_M)

Typical approach: Priors on all GMM parameters

Page 32: Incremental Methods for Machine Learning Problems

Bayesian GMM training
• Parameters Θ become (hidden) RVs: H = {Z, Θ}

• Objective: Compute Posteriors P(Z|X), P(Θ|X) (intractable)

• Approximations

• Sampling (RJMCMC)

• MAP approach

• Variational approach

• MAP approximation

• mode of the posterior P(Θ|Χ) (MAP-EM)

• compute P(Z|X,ΘMAP)

Θ_MAP = arg max_Θ { log P(X | Θ) + log P(Θ) }

Page 33: Incremental Methods for Machine Learning Problems

Variational Inference (no parameters)

• Computes an approximation q(H) of the true posterior P(H|X)
• For any pdf q(H):
• Variational bound F maximization
• Mean field approximation
• System of equations

ln p(X) = F(q) + KL( q(H) || P(H|X) )
q* = arg max_q F(q) = arg max_q ∫ q(H) ln [ p(X, H) / q(H) ] dH
Mean field: q(H) = ∏_k q_k(H_k)
q_k(H_k) = exp( <ln p(X, H)>_{q(H_{\k})} ) / ∫ exp( <ln p(X, H)>_{q(H_{\k})} ) dH_k

Page 34: Incremental Methods for Machine Learning Problems

Variational Inference (with parameters)
• X data, H hidden RVs, Θ parameters
• For any pdf q(H; Θ):
  ln p(X; Θ) = F(q, Θ) + KL( q(H; Θ) || p(H|X; Θ) )
  F(q, Θ) = ∫ q(H) ln [ p(X, H; Θ) / q(H) ] dH ≤ ln p(X; Θ)
• Maximization of the variational bound F
• Variational EM
  – VE-step: q^{(t+1)} = arg max_q F(q, Θ^{(t)})
  – VM-step: Θ^{(t+1)} = arg max_Θ F(q^{(t+1)}, Θ)

Page 35: Incremental Methods for Machine Learning Problems

Bayesian GMM training

• Bayesian GMMs

• mean field variational approximation

• tackles the covariance singularity problem

• requires specifying the parameters of the priors

• Estimating the number of components:

• Start with a large number of components

• Let the training process prune redundant components (πj=0)

• Dirichlet prior on πj prevents component pruning

Page 36: Incremental Methods for Machine Learning Problems

Bayesian GMM without prior on π

• Mixing weights πj are parameters (remove Dirichlet prior)

• Training using Variational EM

Method (C-B)

• Start with a large number of components

• Perform variational maximization of the marginal likelihood

• Pruning of redundant components (πj=0)

• Only components that fit well to the data are finally retained

(demo video: CBdemo.wmv)

Page 37: Incremental Methods for Machine Learning Problems

Bayesian GMM (C-B)

• C-B method: Results depend on

• the number of initial components

• initialization of components

• specification of the scale matrix V of the Wishart prior p(T)

Page 38: Incremental Methods for Machine Learning Problems

Incremental Bayesian GMM

• Modification of the Bayesian GMM is needed

• Divide the components into ‘fixed’ and ‘free’
• Prior on the weights of ‘fixed’ components (retained)
• No prior on the weights of ‘free’ components (may be eliminated)
• Pruning restricted among ‘free’ components
• Solution: incremental training using component splitting
• Local scale matrix V: based on the variance of the component to be split

Page 39: Incremental Methods for Machine Learning Problems

Incremental Bayesian GMM

Page 40: Incremental Methods for Machine Learning Problems

Incremental Bayesian GMM
• Start with k=1 component.
• At each step:
  • select a component j
  • split component j in two subcomponents
  • set the scale matrix V analogous to Σj
  • apply variational EM considering the two subcomponents as free and the remaining components as fixed
  • either the two subcomponents are retained and adjusted,
  • or one of them is eliminated and the other recovers the original component (before the split)
• Repeat until all components have been unsuccessfully tested for splitting

C-L

Page 41: Incremental Methods for Machine Learning Problems

Incremental Bayesian GMM Image segmentation

Number of segments determined automatically

Page 42: Incremental Methods for Machine Learning Problems

Incremental Bayesian GMM: Image segmentation

Number of segments determined automatically

Page 43: Incremental Methods for Machine Learning Problems

Relevance Vector Machine
• RVM model (Tipping 2001)
  – φi(x) = K(x, xi) (the same kernel function ‘centered’ on training example xi)
• Fixed pool of N basis functions
  – Initially M=N basis functions:
  – Bayesian inference with a sparse prior on w prunes redundant basis functions
  – Only a few basis functions are retained (relevance vectors)

y(x) = Σ_{i=1}^N w_i φ_i(x),   t_n = Σ_{i=1}^N w_i φ_i(x_n) + ε_n
Basis function pool: {φ_1(x), φ_2(x), …, φ_N(x)}
t = Φw + ε,   [Φ]_ij = φ_j(x_i) = K(x_i, x_j)

Page 44: Incremental Methods for Machine Learning Problems

Relevance Vector Machine
• Likelihood:
  p(t | w, β) = N(t | Φw, β^{-1} I)
• Sparse prior on w: a separate precision αi for each weight wi
  p(w | α) = ∏_{i=1}^N N(w_i | 0, α_i^{-1}),   α = (α_1, …, α_N)^T
  p(α_i) = Gamma(α_i; a, b)
• Weight prior p(w): Student's t (enforces sparsity)

Page 45: Incremental Methods for Machine Learning Problems

RVM Training

• Maximize the marginal likelihood (see the sketch below):
  p(t; α, β) = ∫ p(t | w; β) p(w; α) dw
• Use the Expectation-Maximization (EM) algorithm:
  – E-step: p(w | t, α, β) = N(w; μ, Σ),   μ = β Σ Φ^T t,   Σ = (β Φ^T Φ + diag(α))^{-1}
  – M-step: re-estimate the αi and β from μ, Σ and ||t − Φμ||²
• Sparsity: most α_i → ∞, hence w_i = 0
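A hedged sketch of RVM training, assuming NumPy; the posterior computation matches the E-step above, while the hyperparameter step uses the widely used γ_i/μ_i² re-estimation from Tipping (2001), since the slide's exact update expressions are not legible in this transcript.

```python
import numpy as np

def rvm_train(Phi, t, n_iters=100, alpha0=1.0, beta0=1.0, prune=1e12):
    """Illustrative RVM hyperparameter re-estimation (Tipping-style updates)."""
    N, M = Phi.shape
    alpha = np.full(M, alpha0)
    beta = beta0
    keep = np.arange(M)                        # indices of retained basis functions
    for _ in range(n_iters):
        P = Phi[:, keep]
        # posterior over weights: p(w | t, alpha, beta) = N(w; mu, Sigma)
        Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[keep]))
        mu = beta * Sigma @ P.T @ t
        # re-estimate alpha and beta
        gamma = 1.0 - alpha[keep] * np.diag(Sigma)
        alpha[keep] = gamma / (mu ** 2 + 1e-12)
        beta = (N - gamma.sum()) / (np.linalg.norm(t - P @ mu) ** 2 + 1e-12)
        keep = keep[alpha[keep] < prune]       # prune basis functions with huge alpha
    return keep, alpha, beta
```

The surviving indices in `keep` correspond to the relevance vectors.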

Page 46: Incremental Methods for Machine Learning Problems

RVM example

Page 47: Incremental Methods for Machine Learning Problems

RVM Incremental Training

• Incrementally add basis functions, starting from an empty model (Faul & Tipping 2003)
• Optimization w.r.t. a single parameter αi
• The optimal αi can be estimated analytically (see the sketch below the equations):

l(α_i) = (1/2) [ log α_i − log(α_i + s_i) + q_i² / (α_i + s_i) ]
s_i = φ_i^T C_{-i}^{-1} φ_i,   q_i = φ_i^T C_{-i}^{-1} t,   C_{-i} = β^{-1} I + Σ_{j≠i} α_j^{-1} φ_j φ_j^T

α_i = s_i² / (q_i² − s_i),   if q_i² > s_i
α_i = ∞,                     if q_i² ≤ s_i
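A small sketch of these quantities and the resulting keep/prune decision, assuming NumPy; the variable names (`phi_i`, `C_inv`) are mine.

```python
import numpy as np

def sparsity_quantities(phi_i, C_inv, t):
    """s_i = phi_i^T C^{-1} phi_i and q_i = phi_i^T C^{-1} t for one candidate
    basis function (C computed without the contribution of basis i itself)."""
    s_i = phi_i @ C_inv @ phi_i
    q_i = phi_i @ C_inv @ t
    return s_i, q_i

def optimal_alpha(s_i, q_i):
    """Analytic maximizer of the single-alpha marginal likelihood term l(alpha_i)."""
    if q_i ** 2 > s_i:
        return s_i ** 2 / (q_i ** 2 - s_i)   # keep / update the basis function
    return np.inf                            # alpha_i -> infinity: prune it
```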

Page 48: Incremental Methods for Machine Learning Problems

RVM Incremental Training
• At each iteration of the training algorithm:
  – Compute the optimal αi for all Ν basis functions
  – Select the best basis function φi(x) from the pool of N candidates {φ_1(x), φ_2(x), …, φ_N(x)}
  – Perform one of the following:
    • Add this basis function to the current model
    • Update αi (if it is already included in the model)
    • Remove this basis function (if it is included in the model and αi = ∞)

Page 49: Incremental Methods for Machine Learning Problems

RVM Limitations
• How to specify the kernel parameter? (e.g. the scale of an RBF kernel)
  – Typical solution: cross-validation
    • Computationally expensive
    • Cannot be used when many parameters must be adjusted
• How to model non-stationary functions?
  – RVM uses the same kernel for the whole input space

Page 50: Incremental Methods for Machine Learning Problems

Adaptive RVM with Kernel Learning (aRVM)

• Assume different parameters θi for each basis function φ(x; θi)
• RBF kernel: the center mi and scale hi are parameters
• Generally mi is different from the training points xn

• Employ incremental RVM training

• Typical incremental RVM: select from a fixed set of N basis functions the best basis function to add

• aRVM: select the basis function φ(x;θi) to add by optimizing marginal likelihood sl(αi,θi) w.r.t (αi,θi)

φ(x; m_i, h_i) = exp( −h_i ||x − m_i||² )

Page 51: Incremental Methods for Machine Learning Problems

Sparsity Controlling Prior
• The aRVM model is more flexible than the typical RVM
• Employ a “stronger” prior on the weights to enforce sparsity
• “Sparsity controlling prior” [Schmolck & Everson 2007]:
  p(α) ∝ exp(−c·DF),   DF = trace(S) (degrees of freedom),   S = ΦΣΦ^T (smoothing matrix)
  – c = 0: typical RVM
  – c = log(N) (typical value in experiments)
• The prior can be written as:
  p(α) ∝ exp( −c (M − Σ_{i=1}^M α_i Σ_ii) )
• The likelihood is modified due to the new prior

Page 52: Incremental Methods for Machine Learning Problems

Learning αi, θi

• Maximize
  sl(α_i, θ_i) = (1/2) [ log α_i − log(α_i + s_i) + (q_i² + 2 c α_i) / (α_i + s_i) ]
  s_i = φ_i^T C_{-i}^{-1} φ_i,   q_i = φ_i^T C_{-i}^{-1} t,   C_{-i} = β^{-1} I + Σ_{j≠i} α_j^{-1} φ_j φ_j^T
  w.r.t. αi, θi
• Alternate maximization steps
• Optimal αi (for fixed θi):
  α_i = s_i² / (q_i² − (2c+1) s_i),   if q_i² > (2c+1) s_i
  α_i = ∞,                            if q_i² ≤ (2c+1) s_i
  – for c=0 we obtain the incremental RVM update

Page 53: Incremental Methods for Machine Learning Problems

Learning αi, θi

• Maximize sl w.r.t. θi (αi fixed)
  – Use the quasi-Newton BFGS method (analytical derivatives)
  – Perform multiple restarts from several initial values of θi and keep the solution with the best value of sl

The derivative ∂sl/∂θ_ik is available in closed form in terms of q_i, s_i, α_i, c, ∂φ_i/∂θ_ik and C_{-i}^{-1} (used for the BFGS updates).

Page 54: Incremental Methods for Machine Learning Problems

aRVM Learning Algorithm
Start from an empty model. At each iteration:
1. Optimize the parameters (αi, θi) of a new basis function φ(x; θi) and add it to the model
2. Train the current model:
   a. Update the parameters θi of all current basis functions (BFGS updates)
   b. Update the parameters αi and β (noise precision)
   c. Delete redundant basis functions (αi > 10^12)
3. Repeat steps 1-2 until convergence

The method can be used with any differentiable form of basis function φ(x; θi)

Page 55: Incremental Methods for Machine Learning Problems

aRVM Example (RBF kernel)

• Demos
• Tables

Page 56: Incremental Methods for Machine Learning Problems

aRVM Example (RBF kernel)

• Demos
• Tables

Page 57: Incremental Methods for Machine Learning Problems

Incremental Bayesian MLP Training
• For sigmoidal basis functions we get the Multilayer Perceptron (MLP) with one hidden layer.
• It is straightforward to apply the incremental kernel learning algorithm.
• Tackles the model selection problem (number of hidden units) in MLP neural networks.

Page 58: Incremental Methods for Machine Learning Problems

Incremental Learning: Current Research
• The dynamic nature of incremental training methods makes them particularly suitable for machine learning on data streams.
• Sparse & high-dimensional data: text clustering
• Multi-view clustering
• Theoretical support for the successful empirical performance
  – Submodular cost functions (www.submodularity.org)
  – The addition of a building block to a model M provides greater cost improvement than adding the same building block to a larger model M' that includes M.
  – For submodular functions the simple greedy heuristic performs ‘surprisingly’ well (within a factor (1 − 1/e) ≈ 0.63 of the maximum).

• Challenge: prove that machine learning objective functions are (approximately) submodular (proved for k-medoids, feature selection, dictionary learning).

Page 59: Incremental Methods for Machine Learning Problems

Thank you

Collaborators
N. Vlassis (Global k-means, Greedy EM)
G. Tzortzis (Global kernel k-means)
C. Constantinopoulos (Bayesian GMM)
A. Kalogeratos (Dip-means)
D. Tzikas, N. Galatsanos (aRVM)

Matlab code available for all methods