Kernel Adaptive Filtering - Signal processing


Page 1: Kernel Adaptive Filtering - Signal processing

Kernel Adaptive Filtering

Jose C. Principe and Weifeng Liu

Computational NeuroEngineering Laboratory (CNEL)

University of Florida

[email protected], [email protected]

Page 2: Kernel Adaptive Filtering - Signal processing

Acknowledgments

Dr. Badong Chen

Tsinghua University and Post Doc CNEL

NSF ECS – 0300340 and 0601271

(Neuroengineering program)

Page 3: Kernel Adaptive Filtering - Signal processing

Outline

1. Optimal adaptive signal processing fundamentals

Learning strategy

Linear adaptive filters

2. Least-mean-square in kernel space

Well-posedness analysis of KLMS

3. Affine projection algorithms in kernel space

4. Extended recursive least squares in kernel space

5. Active learning in kernel adaptive filtering

Page 4: Kernel Adaptive Filtering - Signal processing

Wiley Book (2010)

Papers are available at

www.cnel.ufl.edu

Page 5: Kernel Adaptive Filtering - Signal processing

Part 1: Optimal adaptive signal

processing fundamentals

Page 6: Kernel Adaptive Filtering - Signal processing

Problem Setting

Optimal Signal Processing seeks to find optimal models for time

series.

The linear model is well understood and widely applied. Optimal

linear filtering is regression in functional spaces, where the user

controls the size of the space by choosing the model order.

Problems are fourfold:

In many important applications data arrives in real time, one sample at a time, so on-line learning methods are necessary.

Optimal algorithms must obey physical constraints: FLOPS, memory, response time, battery power.

Application conditions may be non-stationary, i.e. the model must continuously adapt to track changes.

It is unclear how to go beyond the linear model.

Although the optimal problem is the same as in machine learning,

constraints make the computational problem different.

Page 7: Kernel Adaptive Filtering - Signal processing

Machine Learning

Assumption: Examples are drawn independently from an

unknown probability distribution P(u, y) that represents the

rules of Nature.

Expected Risk:  $R(f) = \int L(f(u), y)\, dP(u, y)$

We would like $f^*$ that minimizes $R(f)$ among all functions, but we use a mapper class $F$ and in general $f^* \notin F$.

The best we can have is $f_F^* \in F$ that minimizes $R(f)$ over $F$.

P(u, y) is also unknown by definition.

Empirical Risk:  $\hat{R}_N(f) = \frac{1}{N}\sum_i L(f(u_i), y_i)$

Instead we compute $f_N \in F$ that minimizes $\hat{R}_N(f)$.

Vapnik-Chervonenkis theory tells us when this can work, but the optimization is computationally costly.

Exact estimation of $f_N$ is done through optimization.

Page 8: Kernel Adaptive Filtering - Signal processing

Machine Learning Strategy

The excess error decomposes as

$R(f_N) - R(f^*) = \big[R(f_F^*) - R(f^*)\big] + \big[R(f_N) - R(f_F^*)\big]$

(approximation error plus estimation error), and an approximate optimization adds a third term through $R(\tilde{f}_N) \ge R(f_N)$.

Page 9: Kernel Adaptive Filtering - Signal processing

Machine Learning Strategy

The optimality conditions in learning and optimization theories

are mathematically driven:

Learning theory favors cost functions that ensure a fast estimation

rate when the number of examples increases (small estimation error

bound).

Optimization theory favors superlinear algorithms (small

approximation error bound)

What about the computational cost of these optimal solutions, in

particular when the data sets are huge? Algorithmic complexity

should be as close as possible to O(N).

Change the design strategy: Since these solutions are never

optimal (non-reachable set of functions, empirical risk), goal

should be to get quickly to the neighborhood of the optimal

solution to save computation.

Page 10: Kernel Adaptive Filtering - Signal processing

Learning Strategy in Biology

In Biology optimality is stated in relative terms: the best possible

response within a fixed time and with the available (finite)

resources.

Biological learning shares both constraints of small and large

learning theory problems, because it is limited by the number of

samples and also by the computation time.

Design strategies for optimal signal processing are closer to the biological framework than to the machine learning framework.

What matters is “how much the error decreases per sample for a fixed memory/flop cost”.

It is therefore no surprise that the most successful algorithm in

adaptive signal processing is the least mean square algorithm

(LMS) which never reaches the optimal solution, but is O(L) and

tracks continuously the optimal solution!

Page 11: Kernel Adaptive Filtering - Signal processing

Extensions to Nonlinear Systems

Many algorithms exist to solve the on-line linear regression problem:

LMS stochastic gradient descent

LMS-Newton handles eigenvalue spread, but is expensive

Recursive Least Squares (RLS) tracks the optimal solution with the available data.

Nonlinear solutions either append nonlinearities to linear filters (not optimal) or require the availability of all data (Volterra, neural networks) and are not practical.

Kernel based methods offer a very interesting alternative to neural networks.

Provided that the adaptation algorithm is written as an inner product, one can take advantage of the “kernel trick”.

Nonlinear filters in the input space are obtained.

The primary advantage of doing gradient descent learning in RKHS is that the performance surface is still quadratic, so there are no local minima, while the filter now is nonlinear in the input space.

Page 12: Kernel Adaptive Filtering - Signal processing

Adaptive Filtering Fundamentals

Adaptive

System

Output

Page 13: Kernel Adaptive Filtering - Signal processing

On-Line Learning for Linear Filters

The current estimate $w_i$ is computed in terms of the previous estimate $w_{i-1}$ as:

$w_i = w_{i-1} + G_i e_i$

where $e_i$ is the model prediction error arising from the use of $w_{i-1}$ and $G_i$ is a gain term.

[Block diagram: transversal filter with input u_i and weights w_i produces the output y(i); it is compared with the desired response d(i) to form the error e(i) = d(i) - y(i), which drives the adaptive weight-control mechanism.]

Notation:
wi    weight estimate at time i (vector, dim = L)
ui    input at time i (vector)
e(i)  estimation error at time i (scalar)
d(i)  desired response at time i (scalar)
ei    estimation error at iteration i (vector)
di    desired response at iteration i (vector)
Gi    capital letter: matrix

Page 14: Kernel Adaptive Filtering - Signal processing

On-Line Learning for Linear Filters

$J = E[e(i)^2]$ (mean-square error cost), $\eta$ = step size

Steepest descent:  $w_i = w_{i-1} - \eta\, \nabla J_i$, so that $\lim_{i\to\infty} E[w_i] = w^*$

Newton update:  $w_i = w_{i-1} - \eta\, H^{-1} \nabla J_{i-1}$

[Figure: contours of the performance surface in the weight plane (W1, W2) with adaptation trajectories (MEE and FP-MEE)]

Page 15: Kernel Adaptive Filtering - Signal processing

On-Line Learning for Linear Filters

Gradient descent learning for linear mappers has also great

properties

It accepts an unbiased sample by sample estimator that is easy to

compute (O(L)), leading to the famous LMS algorithm.

The LMS is a robust ($H^\infty$) estimation algorithm.

For small stepsizes, the visited points during adaptation always belong to the input data manifold (dimension L), since the algorithm always moves in the opposite direction of the gradient.

$w_i = w_{i-1} + \eta\, u_i\, e(i)$  (LMS update)
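As an illustration of how compact the LMS recursion above is, here is a minimal sketch in Python (the function name and array layout are assumptions for the example, not part of the slides):

```python
import numpy as np

def lms(U, d, eta=0.1):
    """LMS: w_i = w_{i-1} + eta * u_i * e(i). U holds one input vector per row."""
    N, L = U.shape
    w = np.zeros(L)
    e = np.zeros(N)
    for i in range(N):
        y = w @ U[i]                 # filter output with the previous weights
        e[i] = d[i] - y              # prediction error
        w = w + eta * e[i] * U[i]    # O(L) update in the negative gradient direction
    return w, e
```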

Page 16: Kernel Adaptive Filtering - Signal processing

On-Line Learning for Non-Linear Filters?

Can we generalize the linear model $y = w^T u$ to nonlinear models $y = f(u)$, and create the nonlinear mapping incrementally?

[Block diagram: universal function approximator $f_i$ with input $u_i$, output $y(i)$, desired response $d(i)$, error $e(i)$, and an adaptive weight-control mechanism]

$w_i = w_{i-1} + G_i e_i \quad\longrightarrow\quad f_i = f_{i-1} + G_i e_i$

Page 17: Kernel Adaptive Filtering - Signal processing

Part 2: Least-mean-squares in kernel

space

Page 18: Kernel Adaptive Filtering - Signal processing

Non-Linear Methods - Traditional (Fixed topologies)

Hammerstein and Wiener models

An explicit nonlinearity followed (preceded) by a linear filter

Nonlinearity is problem dependent

Do not possess universal approximation property

Multi-layer perceptrons (MLPs) with back-propagation

Non-convex optimization

Local minima

Least-mean-square for radial basis function (RBF) networks

Non-convex optimization for adjustment of centers

Local minima

Volterra models, Recurrent Networks, etc

Page 19: Kernel Adaptive Filtering - Signal processing

Non-linear Methods with kernels

Universal approximation property (kernel dependent)

Convex optimization (no local minima)

Still easy to compute (kernel trick)

But require regularization

Sequential (On-line) Learning with Kernels

(Platt 1991) Resource-allocating networks

Heuristic

No convergence and well-posedness analysis

(Frieb 1999) Kernel adaline

Formulated in a batch mode

well-posedness not guaranteed

(Kivinen 2004) Regularized kernel LMS

with explicit regularization

Solution is usually biased

(Engel 2004) Kernel Recursive Least-Squares

(Vaerenbergh 2006) Sliding-window kernel recursive least-squares

Page 20: Kernel Adaptive Filtering - Signal processing

Neural Networks versus Kernel Filters

                                   ANNs    Kernel filters
Universal Approximators            YES     YES
Convex Optimization                NO      YES
Model Topology grows with data     NO      YES
Require Explicit Regularization    NO      YES/NO (KLMS)
Online Learning                    YES     YES
Computational Complexity           LOW     MEDIUM

ANNs are semi-parametric, nonlinear approximators

Kernel filters are non-parametric, nonlinear approximators

Page 21: Kernel Adaptive Filtering - Signal processing

Kernel Methods

$y(n) = \sum_{i=0}^{L-1} w_i\, x(n-i) = w^T x(n)$   (the FIR filter output as an inner product)

Kernel filters operate in a very special Hilbert space of

functions called a Reproducing Kernel Hilbert Space (RKHS).

An RKHS is a Hilbert space where all function evaluations are finite.

Operating with functions seems complicated and it is! But it

becomes much easier in RKHS if we restrict the computation

to inner products.

Most linear algorithms can be expressed as inner products.

Remember the FIR

Page 22: Kernel Adaptive Filtering - Signal processing

Kernel methods

Moore-Aronszajn theorem

Every symmetric positive definite function of two real variables has

a unique Reproducing Kernel Hilbert Space (RKHS).

Mercer's theorem

Let $\kappa(x, y)$ be symmetric positive definite. The kernel can be expanded in the series

$\kappa(x, y) = \sum_{i=1}^{m} \lambda_i\, \varphi_i(x)\, \varphi_i(y)$

Construct the transform as

$\Phi(x) = [\sqrt{\lambda_1}\,\varphi_1(x), \sqrt{\lambda_2}\,\varphi_2(x), \ldots, \sqrt{\lambda_m}\,\varphi_m(x)]^T$

Inner product:  $\Phi(x)^T \Phi(y) = \kappa(x, y)$

Example (Gaussian kernel):  $\kappa(x, y) = \exp(-h\, \|x - y\|^2)$

Page 23: Kernel Adaptive Filtering - Signal processing

Kernel methods

Mate L., Hilbert Space Methods in Science and Engineering, Adam Hilger, 1989

Berlinet A. and Thomas-Agnan C., Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer, 2004

Page 24: Kernel Adaptive Filtering - Signal processing

Basic idea of on-line kernel filtering

Transform data into a high dimensional feature space:  $\varphi : u(i) \mapsto \varphi(u(i))$

Construct a linear model in the feature space F:  $y = \langle \Omega, \varphi(u) \rangle_F$

Adapt the parameters iteratively with gradient information:  $\Omega_i = \Omega_{i-1} - \eta\, \nabla J_i$

Compute the output:  $f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \sum_{j=1}^{m_i} a_j\, \kappa(u, c_j)$

Universal approximation theorem

For the Gaussian kernel and a sufficiently large $m_i$, $f_i(u)$ can approximate any continuous input-output mapping arbitrarily closely in the $L_p$ norm.

Page 25: Kernel Adaptive Filtering - Signal processing

Growing network structure

[Network diagram: the input u is mapped by φ(·) into feature space, weighted by Ω and summed to give the output y; equivalently, a growing RBF network with centers c_1, ..., c_{m_i} and coefficients a_1, ..., a_{m_i}.]

$\Omega_i = \Omega_{i-1} + \eta\, e(i)\, \varphi(u(i))$

$f_i = f_{i-1} + \eta\, e(i)\, \kappa(u(i), \cdot)$

Page 26: Kernel Adaptive Filtering - Signal processing

Kernel Least-Mean-Square (KLMS)

Least-mean-square:

$w_0 = 0, \qquad e(i) = d(i) - w_{i-1}^T u_i, \qquad w_i = w_{i-1} + \eta\, u_i\, e(i)$

Transform data into a high dimensional feature space F:  $\varphi : u(i) \mapsto \varphi(u(i))$

$\Omega_0 = 0$

$e(i) = d(i) - \langle \Omega_{i-1}, \varphi(u(i)) \rangle_F$

$\Omega_i = \Omega_{i-1} + \eta\, e(i)\, \varphi(u(i))$

so that after i updates

$\Omega_i = \eta \sum_{j=1}^{i} e(j)\, \varphi(u(j))$

$f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \eta \sum_{j=1}^{i} e(j)\, \kappa(u(j), u)$

RBF Centers are the samples, and Weights are the errors!

Step by step:

$\Omega_0 = 0$

$e(1) = d(1) - \langle \Omega_0, \varphi(u(1)) \rangle_F = d(1)$

$\Omega_1 = \Omega_0 + \eta\, e(1)\, \varphi(u(1)) = a_1 \varphi(u(1))$

$e(2) = d(2) - \langle \Omega_1, \varphi(u(2)) \rangle_F = d(2) - a_1\, \kappa(u(1), u(2))$

$\Omega_2 = \Omega_1 + \eta\, e(2)\, \varphi(u(2)) = a_1 \varphi(u(1)) + a_2 \varphi(u(2))$

...

Page 27: Kernel Adaptive Filtering - Signal processing

Kernel Least-Mean-Square (KLMS)

KLMS written directly in the input space:

$f_{i-1} = \eta \sum_{j=1}^{i-1} e(j)\, \kappa(u(j), \cdot)$

$f_{i-1}(u(i)) = \eta \sum_{j=1}^{i-1} e(j)\, \kappa(u(j), u(i))$

$e(i) = d(i) - f_{i-1}(u(i))$

$f_i = f_{i-1} + \eta\, e(i)\, \kappa(u(i), \cdot)$
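The recursion above translates almost line by line into code. A minimal sketch in Python, assuming a Gaussian kernel and one center per sample (the class and parameter names are illustrative, not from the slides):

```python
import numpy as np

class KLMS:
    """Kernel LMS: centers are the past inputs, coefficients are eta * e(j)."""
    def __init__(self, eta=0.2, h=1.0):
        self.eta, self.h = eta, h
        self.centers, self.alphas = [], []   # c_j = u(j), a_j = eta * e(j)

    def predict(self, u):
        if not self.centers:
            return 0.0
        C = np.asarray(self.centers)
        k = np.exp(-self.h * np.sum((C - u) ** 2, axis=1))   # kappa(u(j), u)
        return float(np.dot(self.alphas, k))

    def update(self, u, d):
        e = d - self.predict(u)              # e(i) = d(i) - f_{i-1}(u(i))
        self.centers.append(np.asarray(u, dtype=float))
        self.alphas.append(self.eta * e)     # f_i = f_{i-1} + eta*e(i)*kappa(u(i),.)
        return e
```

Running one pass over a data stream is simply `for u, d in zip(inputs, desired): model.update(u, d)`.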

Page 28: Kernel Adaptive Filtering - Signal processing

Free Parameters in KLMS Step size

Traditional wisdom in LMS still applies here:

$\eta < \dfrac{N}{\operatorname{tr}[G_\varphi]} = \dfrac{N}{\sum_{j=1}^{N} \kappa(u(j), u(j))}$

where $G_\varphi$ is the Gram matrix and N its dimensionality.

For translation-invariant kernels, $\kappa(u(j), u(j)) = g_0$ is a constant independent of the data.

The misadjustment is therefore

$M = \dfrac{\eta}{2N} \operatorname{tr}[G_\varphi] = \dfrac{\eta\, g_0}{2}$
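A small helper that evaluates these two quantities directly from the data, under the formulas above (the function names are illustrative):

```python
def klms_stepsize_bound(inputs, kernel):
    """eta < N / tr[G_phi], with tr[G_phi] = sum_j kappa(u(j), u(j))."""
    trace = sum(kernel(u, u) for u in inputs)
    return len(inputs) / trace

def klms_misadjustment(eta, g0):
    """M = eta * g0 / 2 for translation-invariant kernels."""
    return eta * g0 / 2.0
```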

Page 29: Kernel Adaptive Filtering - Signal processing

Free Parameters in KLMS Rule of Thumb for h

Although KLMS is not kernel density estimation,

these rules of thumb still provide a starting point.

Silverman's rule can be applied:

$h = 1.06\, \min\{\sigma_s,\; R/1.34\}\; N^{-1/(5L)}$

where $\sigma_s$ is the input data standard deviation, R is the interquartile range, N is the number of samples and L is the input dimension.

Alternatively: look at the dynamic range of the data, assume it is uniformly distributed, and select h to put 10 samples in 3 standard deviations.
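A sketch of the rule of thumb in Python, assuming the formula as recovered from the slide text (the exponent and the function name are assumptions):

```python
import numpy as np

def silverman_kernel_size(U):
    """Rule-of-thumb kernel size h = 1.06*min(std, IQR/1.34)*N**(-1/(5*L))."""
    U = np.atleast_2d(np.asarray(U, dtype=float))
    N, L = U.shape
    sigma = U.std()
    iqr = np.subtract(*np.percentile(U, [75, 25]))   # interquartile range
    return 1.06 * min(sigma, iqr / 1.34) * N ** (-1.0 / (5 * L))
```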

Page 30: Kernel Adaptive Filtering - Signal processing

Free Parameters in KLMS

Kernel Design

The Kernel defines the inner product in RKHS

Any positive definite function (Gaussian,

polynomial, Laplacian, etc.), but we should choose

a kernel that yields a class of functions that allows

universal approximation.

A strictly positive definite function is preferred

because it will yield universal mappers (Gaussian,

Laplacian).

See Sriperumbudur et al, On the Relation Between Universality, Characteristic Kernels and RKHS Embedding of

Measures, AISTATS 2010

Page 31: Kernel Adaptive Filtering - Signal processing

Free Parameters in KLMS Kernel Design

Estimate and minimize the generalization error, e.g.

cross validation

Establish and minimize a generalization error upper

bound, e.g. VC dimension

Estimate and maximize the posterior probability of

the model given the data using Bayesian inference

Page 32: Kernel Adaptive Filtering - Signal processing

Free Parameters in KLMS Bayesian model selection

The posterior probability of a Model H (kernel and

parameters q) given the data is

where d is the desired output and U is the input vector.

This is hardly ever done for the kernel function, but it

can be applied to q and leads to Bayesian principles

to adapt the kernel parameters.

$p(H_i \mid d, U) = \dfrac{p(d \mid U, H_i)\, p(H_i)}{p(d \mid U)}$

Page 33: Kernel Adaptive Filtering - Signal processing

Free Parameters in KLMS Maximal marginal likelihood

$J(H_i) = \max_{\theta}\Big[-\tfrac{1}{2}\, d^T \big(G + \sigma_n^2 I\big)^{-1} d \;-\; \tfrac{1}{2}\log\big|G + \sigma_n^2 I\big| \;-\; \tfrac{N}{2}\log(2\pi)\Big]$
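A sketch of evaluating this objective for a candidate kernel parameter, using a Cholesky factorization for numerical stability (the function and variable names are assumptions):

```python
import numpy as np

def log_marginal_likelihood(G, d, sigma_n):
    """-0.5 d^T (G + s^2 I)^{-1} d - 0.5 log|G + s^2 I| - N/2 log(2 pi)."""
    N = len(d)
    A = G + sigma_n ** 2 * np.eye(N)
    L = np.linalg.cholesky(A)                  # A = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, d))
    return float(-0.5 * d @ alpha
                 - np.sum(np.log(np.diag(L)))  # equals 0.5 * log|A|
                 - 0.5 * N * np.log(2 * np.pi))
```

Maximizing this value over the kernel size (and noise variance) implements the Bayesian selection of the slide.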

Page 34: Kernel Adaptive Filtering - Signal processing

Sparsification

Filter size increases linearly with samples!

If RKHS is compact and the environment stationary,

we see that there is no need to keep increasing the

filter size.

Issue is that we would like to implement it on-line!

Two ways to cope with growth:

Novelty Criterion

Approximate Linear Dependency

First is very simple and intuitive to implement.

Page 35: Kernel Adaptive Filtering - Signal processing

Sparsification

Novelty Criterion

Present dictionary is $C(i) = \{c_j\}_{j=1}^{m_i}$. When a new data pair (u(i+1), d(i+1)) arrives:

First compute the distance to the present dictionary

$dis = \min_{c_j \in C(i)} \| u(i+1) - c_j \|$

If it is smaller than a threshold $\delta_1$, do not create a new center.

Otherwise, check whether the prediction error is larger than $\delta_2$ before augmenting the dictionary.

Typical choices: $\delta_1 \approx 0.1 \times$ kernel size and $\delta_2 \approx \sqrt{\mathrm{MSE}}$.
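A minimal sketch of the novelty-criterion test in Python, assuming the thresholds are supplied by the user (names are illustrative):

```python
import numpy as np

def novelty_criterion(u_new, pred_error, centers, delta1, delta2):
    """Return True when (u_new, d_new) should become a new center."""
    dist = min(np.linalg.norm(np.asarray(u_new) - np.asarray(c)) for c in centers)
    if dist <= delta1:               # too close to the dictionary: redundant
        return False
    return abs(pred_error) > delta2  # far away AND badly predicted: add it
```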

Page 36: Kernel Adaptive Filtering - Signal processing

Sparsification

Approximate Linear Dependency

Engel proposed to estimate the distance to the linear span of the centers, i.e. compute

$dis = \min_{b} \Big\| \varphi(u(i+1)) - \sum_{j} b_j\, \varphi(c_j) \Big\|$

which can be estimated by

$dis^2 = \kappa\big(u(i+1), u(i+1)\big) - h(i+1)^T\, G^{-1}(i)\, h(i+1)$

Only increase the dictionary if dis is larger than a threshold.

Complexity is O(m^2).

Easy to estimate in KRLS (dis ~ r(i+1)).

Can simplify the sum to the nearest center, and then it defaults to NC.
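A direct (non-recursive) sketch of the ALD distance in Python; a small jitter term is added for numerical safety (an assumption, not part of the slide):

```python
import numpy as np

def ald_distance2(u_new, centers, kernel, jitter=1e-10):
    """dis^2 = kappa(u,u) - h^T G^{-1} h, squared distance to span{phi(c_j)}."""
    h = np.array([kernel(u_new, c) for c in centers])
    G = np.array([[kernel(ci, cj) for cj in centers] for ci in centers])
    G += jitter * np.eye(len(centers))           # numerical safeguard
    return float(kernel(u_new, u_new) - h @ np.linalg.solve(G, h))
```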

Page 37: Kernel Adaptive Filtering - Signal processing

KLMS- Mackey-Glass Prediction

Mackey-Glass time series:  $\dot{x}(t) = -0.1\, x(t) + \dfrac{0.2\, x(t-30)}{1 + x(t-30)^{10}}$

[Learning curves: LMS (η = 0.2) versus KLMS (a = 1, η = 0.2)]

Page 38: Kernel Adaptive Filtering - Signal processing

Regularization worsens performance

Page 39: Kernel Adaptive Filtering - Signal processing

Performance Growth tradeoff

δ1 = 0.1, δ2 = 0.05, η = 0.1, a = 1

Page 40: Kernel Adaptive Filtering - Signal processing

KLMS- Nonlinear channel equalization

$f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \eta \sum_{j=1}^{i} e(j)\, \kappa(u(j), u)$

[Growing RBF network: the new center is $c_{m_i} = u(i)$ with coefficient $a_{m_i} = \eta\, e(i)$]

Channel model:  $z_t = s_t + 0.5\, s_{t-1}, \qquad r_t = z_t - 0.9\, z_t^2 + n_\sigma$

Page 41: Kernel Adaptive Filtering - Signal processing

Nonlinear channel equalization

Algorithms    Linear LMS (η=0.005)   KLMS (η=0.1, no regularization)   RN (regularized, λ=1)
BER (σ = .1)  0.162±0.014            0.020±0.012                       0.008±0.001
BER (σ = .4)  0.177±0.012            0.058±0.008                       0.046±0.003
BER (σ = .8)  0.218±0.012            0.130±0.010                       0.118±0.004

Algorithms              Linear LMS   KLMS   RN
Computation (training)  O(l)         O(i)   O(i^3)
Memory (training)       O(l)         O(i)   O(i^2)
Computation (test)      O(l)         O(i)   O(i)
Memory (test)           O(l)         O(i)   O(i)

$\kappa(u_i, u_j) = \exp(-0.1\, \|u_i - u_j\|^2)$

Why don't we need to explicitly regularize the KLMS?

Page 42: Kernel Adaptive Filtering - Signal processing

Self-regularization property of KLMS

Assume the data model $d(i) = \Omega^{o\,T} \varphi(i) + v(i)$; then for any unknown vector $\Omega^o$ the following inequality holds:

$\dfrac{\sum_{j=1}^{i} |e(j) - v(j)|^2}{\eta^{-1} \|\Omega^o\|^2 + \sum_{j=1}^{i-1} |v(j)|^2} < 1, \qquad \text{for all } i = 1, 2, \ldots, N$

as long as the matrix $\{\eta^{-1} I - \varphi(i)\varphi(i)^T\}$ is positive definite. So KLMS is $H^\infty$ robust.

The error norm is upper bounded, $\|e\|^2 \le \eta^{-1} \|\Omega^o\|^2 + 2\|v\|^2$, and the solution norm of KLMS is always upper bounded,

$\|\Omega_N\|^2 \le \sigma_1 \big(\|\Omega^o\|^2 + 2\|v\|^2\big)$

where $\sigma_1$ is the largest eigenvalue of $G_\varphi$; i.e. the algorithm is well posed in the sense of Hadamard.

Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008.

Page 43: Kernel Adaptive Filtering - Signal processing

Regularization Techniques

Learning from finite data is ill-posed and a priori

information to enforce Smoothness is needed.

The key is to constrain the solution norm

In Least Squares constraining the norm yields

In Bayesian modeling, the norm is the prior. (Gaussian process)

In statistical learning theory, the norm is associated with the

model capacity and hence the confidence of uniform

convergence! (VC dimension and structural risk minimization)

$J(\Omega) = \frac{1}{N} \sum_{i=1}^{N} \big(d(i) - \Omega^T \varphi(i)\big)^2, \quad \text{subject to } \|\Omega\|^2 \le C$   (norm constraint)

$J(\Omega) = \frac{1}{N} \sum_{i=1}^{N} \big(d(i) - \Omega^T \varphi(i)\big)^2 + \lambda \|\Omega\|^2$   (Gaussian distributed prior)

Page 44: Kernel Adaptive Filtering - Signal processing

Tikhonov Regularization

In numerical analysis the method is to constrain the condition number of the solution matrix (or its eigenvalues).

Assume the data model $d(i) = \Omega_0^T \varphi(i)$. The singular value decomposition of the data matrix $\Phi$ can be written

$\Phi = P\, S\, Q^T, \qquad S = \mathrm{diag}(s_1, s_2, \ldots, s_r, 0, \ldots, 0)$

The pseudo-inverse estimate of $\Omega_0$ is

$\hat{\Omega}_0 = P\, \mathrm{diag}(s_1^{-1}, \ldots, s_r^{-1}, 0, \ldots, 0)\, Q^T d$

which can still be ill-posed (very small $s_r$). Tikhonov regularized the least squares solution, $J(\Omega) = \|d - \Phi^T \Omega\|^2 + \lambda \|\Omega\|^2$, to penalize the solution norm and yield

$\hat{\Omega}_0 = P\, \mathrm{diag}\!\Big(\dfrac{s_1}{s_1^2 + \lambda}, \ldots, \dfrac{s_r}{s_r^2 + \lambda}, 0, \ldots, 0\Big)\, Q^T d$

Notice that if λ = 0, when $s_r$ is very small, $s_r/(s_r^2 + \lambda) = 1/s_r \to \infty$.

However if λ > 0, when $s_r$ is very small, $s_r/(s_r^2 + \lambda) = s_r/\lambda \to 0$.
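A sketch of the regularized solution via the SVD, assuming the transformed data are stacked as rows of a matrix `Phi` (names are illustrative):

```python
import numpy as np

def tikhonov_solve(Phi, d, lam):
    """Ridge solution of min ||Phi w - d||^2 + lam ||w||^2 via Phi = P S Q^T:
    each singular component is filtered by s_r/(s_r^2 + lam); lam=0 gives 1/s_r."""
    P, s, Qt = np.linalg.svd(Phi, full_matrices=False)
    filt = s / (s ** 2 + lam)
    return Qt.T @ (filt * (P.T @ d))
```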

Page 45: Kernel Adaptive Filtering - Signal processing

Tikhonov and KLMS

For finite data and using small stepsize theory:

Denote $\varphi_i = \varphi(u(i)) \in \mathbb{R}^m$ and $R_\varphi = \frac{1}{N}\sum_{i=1}^{N} \varphi_i \varphi_i^T = P \Lambda P^T$.

Assume the correlation matrix is singular, with $\lambda_1 \ge \cdots \ge \lambda_k > \lambda_{k+1} = \cdots = \lambda_m = 0$.

From LMS it is known that, along each eigendirection,

$E[\varepsilon_n(i)] = (1 - \eta\lambda_n)^i\, \varepsilon_n(0)$

$E[|\varepsilon_n(i)|^2] = \frac{\eta J_{\min}}{2} + (1 - \eta\lambda_n)^{2i}\Big(|\varepsilon_n(0)|^2 - \frac{\eta J_{\min}}{2}\Big)$

Define $\Omega(i) - \Omega_0 = \sum_{n=1}^{m} \varepsilon_n(i)\, P_n$, so with $\Omega(0) = 0$ (i.e. $\varepsilon_j(0) = -\Omega_0(j)$)

$E[\Omega(i)] = \sum_{j=1}^{M} \big[1 - (1 - \eta\lambda_j)^i\big]\, \Omega_0(j)\, P_j$

and $E[\|\Omega(i)\|^2]$ remains bounded provided $\eta < 1/\lambda_{\max}$.

Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008.

Page 46: Kernel Adaptive Filtering - Signal processing

Tikhonov and KLMS

In the worst case, substitute the optimal weight by the pseudo-inverse:

$E[\Omega(i)] = P\, \mathrm{diag}\Big(\big[1 - (1 - \eta s_1)^i\big] s_1^{-1}, \ldots, \big[1 - (1 - \eta s_r)^i\big] s_r^{-1}, 0, \ldots, 0\Big)\, Q^T d$

Regularization function for finite N in KLMS (the factor applied to each singular component of the solution):

No regularization:  $s_n^{-1}$

Tikhonov:  $\big[s_n^2 / (s_n^2 + \lambda)\big]\, s_n^{-1}$

PCA (truncated SVD):  $s_n^{-1}$ if $s_n > \mathrm{th}$, and 0 if $s_n \le \mathrm{th}$

KLMS:  $\big[1 - (1 - \eta\, s_n^2 / N)^N\big]\, s_n^{-1}$

[Plot: regularization function versus singular value for KLMS, Tikhonov, and truncated SVD]

The stepsize and N control the reg-function in KLMS.

Liu W., Principe J. The Well-posedness Analysis of the Kernel Adaline, Proc WCCI, Hong-Kong, 2008

Page 47: Kernel Adaptive Filtering - Signal processing

The minimum norm initialization for KLMS

The initialization $\Omega(0) = 0$ gives the minimum possible norm solution.

$\Omega_i = \sum_{n=1}^{m} c_n P_n, \qquad \|\Omega_i\|^2 = \sum_{n=1}^{k} |c_n|^2 + \sum_{n=k+1}^{m} |c_n|^2$

with $\lambda_1 \ge \cdots \ge \lambda_k > \lambda_{k+1} = \cdots = \lambda_m = 0$, so any component along the null-space directions only increases the norm.

[Illustration: solutions with components off the data subspace have larger norm]

Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008.

Page 48: Kernel Adaptive Filtering - Signal processing

KLMS and the Data Space

KLMS search is insensitive to the 0-eigenvalue directions.

With $J(i) = E\big[|d - \Omega_i^T \varphi|^2\big]$ and

$E[\varepsilon_n(i)] = (1 - \eta\lambda_n)^i\, \varepsilon_n(0), \qquad E[|\varepsilon_n(i)|^2] = \frac{\eta J_{\min}}{2} + (1 - \eta\lambda_n)^{2i}\Big(|\varepsilon_n(0)|^2 - \frac{\eta J_{\min}}{2}\Big)$

$J(i) = J_{\min} + \frac{\eta J_{\min}}{2} \sum_{n=1}^{m} \lambda_n + \sum_{n=1}^{m} \lambda_n \Big(|\varepsilon_n(0)|^2 - \frac{\eta J_{\min}}{2}\Big)(1 - \eta\lambda_n)^{2i}$

So if $\lambda_n = 0$, then $E[\varepsilon_n(i)] = \varepsilon_n(0)$ and $E[|\varepsilon_n(i)|^2] = |\varepsilon_n(0)|^2$: the 0-eigenvalue directions do not affect the MSE.

KLMS only finds solutions on the data subspace! It does not care about the null space!

Liu W., Pokarel P., Principe J., “The Kernel LMS Algorithm”, IEEE Trans. Signal Processing, Vol 56, # 2, 543 – 554, 2008.

Page 49: Kernel Adaptive Filtering - Signal processing

Energy Conservation Relation

Energy conservation in RKHS:

$\|\tilde{\Omega}(i)\|_F^2 + \dfrac{|e_a(i)|^2}{\kappa(u(i), u(i))} = \|\tilde{\Omega}(i-1)\|_F^2 + \dfrac{|e_p(i)|^2}{\kappa(u(i), u(i))}$

An upper bound on the step size guarantees mean square convergence.

Steady-state mean square performance:

$\lim_{i\to\infty} E\big[|e_a(i)|^2\big] = \dfrac{\eta\, \sigma_v^2}{2 - \eta}$

The fundamental energy conservation relation holds in RKHS!

[Plot: steady-state EMSE versus stepsize, simulation and theory]

Chen B., Zhao S., Zhu P., Principe J. Mean Square Convergence Analysis of the Kernel Least Mean Square Algorithm, submitted to IEEE Trans. Signal Processing

Page 50: Kernel Adaptive Filtering - Signal processing

Effects of Kernel Size

[Plots: steady-state EMSE versus kernel size σ (simulation and theory), and learning curves for kernel sizes σ = 0.2, 1.0 and 20]

Kernel size affects the convergence speed! (How to choose a suitable kernel size is still an open problem)

However, it does not affect the final misadjustment! (universal approximation with infinite samples)

Page 51: Kernel Adaptive Filtering - Signal processing

Part 3: Affine projection algorithms in

kernel space

Page 52: Kernel Adaptive Filtering - Signal processing

The big picture for gradient based learning

The big picture for gradient based learning

[Diagram: APA sits between LMS and RLS. With K = 1, APA, APA-Newton, and leaky APA reduce to LMS, normalized LMS, and leaky LMS; with K = i they become the Adaline/RLS family, including weighted and extended RLS. Kernelized versions of all of them exist (Frieb 1999, Kivinen 2004, Engel 2004); the extended RLS is a model with states.]

Liu W., Principe J., “Kernel Affine Projection Algorithms”, European J. of Signal Processing, ID 784292, 2008.

Page 53: Kernel Adaptive Filtering - Signal processing

Affine projection algorithms

Solve $\min_w J(w) = E|d - w^T u|^2$, which yields $w_0 = R_u^{-1} r_{du}$.

There are several ways to approximate this solution iteratively:

Gradient descent method:  $w(i) = w(i-1) + \eta\,[r_{du} - R_u\, w(i-1)]$, starting from a guess $w(0)$

Newton's recursion:  $w(i) = w(i-1) + \eta\,(R_u + \varepsilon I)^{-1}[r_{du} - R_u\, w(i-1)]$

LMS uses a stochastic gradient that approximates  $\hat{R}_u(i) = u(i)u(i)^T$, $\hat{r}_{du}(i) = d(i)\, u(i)$

Affine projection algorithms (APA) utilize better approximations.

Therefore APA is a family of online gradient based algorithms of intermediate complexity between the LMS and RLS.

Page 54: Kernel Adaptive Filtering - Signal processing

Affine projection algorithms

APA algorithms are of the general form, using the K most recent inputs and observations:

$U(i) = [u(i-K+1), \ldots, u(i)]_{L \times K}, \qquad d(i) = [d(i-K+1), \ldots, d(i)]^T$

$\hat{R}_u = \frac{1}{K}\, U(i)\, U(i)^T, \qquad \hat{r}_{du} = \frac{1}{K}\, U(i)\, d(i)$

Gradient:  $w(i) = w(i-1) + \eta\, U(i)\,[d(i) - U(i)^T w(i-1)]$

Newton:  $w(i) = w(i-1) + \eta\, [U(i)U(i)^T + \varepsilon I]^{-1} U(i)\,[d(i) - U(i)^T w(i-1)]$

Notice that

$[U(i)U(i)^T + \varepsilon I]^{-1} U(i) = U(i)\,[U(i)^T U(i) + \varepsilon I]^{-1}$

So

$w(i) = w(i-1) + \eta\, U(i)\,[U(i)^T U(i) + \varepsilon I]^{-1}\,[d(i) - U(i)^T w(i-1)]$
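A sketch of one APA (Newton-type) update in Python, assuming `U` stores the K most recent inputs as columns (names are illustrative):

```python
import numpy as np

def apa_newton_update(w, U, d, eta=0.5, eps=1e-3):
    """w(i) = w(i-1) + eta * U * [U^T U + eps I]^{-1} * (d - U^T w(i-1)).
    U is L-by-K with the K most recent inputs as columns; d holds K desired values."""
    e = d - U.T @ w                               # K a-priori errors
    G = U.T @ U + eps * np.eye(U.shape[1])        # regularized K-by-K matrix
    return w + eta * U @ np.linalg.solve(G, e)
```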

Page 55: Kernel Adaptive Filtering - Signal processing

Affine projection algorithms

If a regularized cost function is preferred,

$\min_w J(w) = E|d - w^T u|^2 + \lambda \|w\|^2$

the gradient method becomes

$w(i) = (1 - \eta\lambda)\, w(i-1) + \eta\, U(i)\,[d(i) - U(i)^T w(i-1)]$

and Newton's method

$w(i) = (1 - \eta\lambda)\, w(i-1) + \eta\,[U(i)U(i)^T + \lambda I]^{-1} U(i)\, d(i)$

or equivalently

$w(i) = (1 - \eta\lambda)\, w(i-1) + \eta\, U(i)\,[U(i)^T U(i) + \lambda I]^{-1} d(i)$

Page 56: Kernel Adaptive Filtering - Signal processing

Kernel Affine Projection Algorithms

KAPA 1,2 use the least squares cost, while KAPA 3,4 are regularized

KAPA 1,3 use gradient descent and KAPA 2,4 use Newton update

Note that KAPA 4 does not require the calculation of the error by

rewriting the error with the matrix inversion lemma and using the

kernel trick

Note that one does not have access to the weights, so need recursion

as in KLMS.

Care must be taken to minimize computations.


Page 57: Kernel Adaptive Filtering - Signal processing

KAPA-1

$f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \sum_{j=1}^{m_i} a_j\, \kappa(u_j, u)$

When the new center $c_{m_i} = u(i)$ is added, the newest coefficient is set to $a_{m_i} = \eta\, e(i)$ and the coefficients of the previous K-1 centers are each incremented by $\eta$ times the corresponding a-priori error of the current window.

[Network diagram: growing RBF structure with centers $c_1, \ldots, c_{m_i}$ and coefficients $a_1, \ldots, a_{m_i}$]

Page 58: Kernel Adaptive Filtering - Signal processing

KAPA-1

$f_i = f_{i-1} + \eta \sum_{j=i-K+1}^{i} e(i; j)\, \kappa(u(j), \cdot)$

$a_i(i) = \eta\, e(i; i)$

$a_j(i) = a_j(i-1) + \eta\, e(i; j), \qquad j = i-K+1, \ldots, i-1$

$a_j(i) = a_j(i-1), \qquad j = 1, \ldots, i-K$

$C(i) = \{C(i-1),\, u(i)\}$
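A sketch of this recursion in Python, assuming one center per sample and that the caller streams the data in order (class and variable names are illustrative):

```python
import numpy as np

class KAPA1:
    """KAPA-1 sketch: each update touches the coefficients of the K most
    recent centers with their a-priori errors e(i;j)."""
    def __init__(self, eta=0.1, h=1.0, K=10):
        self.eta, self.h, self.K = eta, h, K
        self.centers, self.alphas, self.history = [], [], []   # history: (u, d)

    def _kappa(self, x, y):
        return np.exp(-self.h * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

    def predict(self, u):
        return sum(a * self._kappa(c, u) for a, c in zip(self.alphas, self.centers))

    def update(self, u, d):
        self.history.append((np.asarray(u, dtype=float), d))
        window = self.history[-self.K:]              # K most recent pairs
        self.centers.append(np.asarray(u, dtype=float))
        self.alphas.append(0.0)                      # a_i(i) starts at 0
        errors = [dj - self.predict(uj) for uj, dj in window]   # e(i;j)
        start = len(self.alphas) - len(window)
        for k, e in enumerate(errors):
            self.alphas[start + k] += self.eta * e   # increment window coefficients
```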

Page 59: Kernel Adaptive Filtering - Signal processing

Error reusing to save computation

For KAPA-1, KAPA-2, and KAPA-3

To calculate K errors is expensive (kernel evaluations)

K times computations? No, save previous errors and use them

$e_i(k) = d(k) - \varphi_k^T \Omega_{i-1}, \qquad i - K + 1 \le k \le i$

$e_i(k) = d(k) - \varphi_k^T \Big(\Omega_{i-2} + \eta \sum_{j=i-K}^{i-1} e_{i-1}(j)\, \varphi_j\Big) = e_{i-1}(k) - \eta \sum_{j=i-K}^{i-1} e_{i-1}(j)\, \varphi_k^T \varphi_j$

We still need the error at the newest sample, which requires i kernel evaluations, so the complexity is $O(i + K^2)$.

Page 60: Kernel Adaptive Filtering - Signal processing

KAPA-4: Smoothed Newton's method.

There is no need to compute the error.

The topology can still be put in the same RBF framework.

Efficient ways to compute the inverse are necessary. The sliding window computation yields a complexity of O(K^2).

$\Phi_i = [\varphi_i, \varphi_{i-1}, \ldots, \varphi_{i-K+1}], \qquad d_i = [d(i), d(i-1), \ldots, d(i-K+1)]^T$

$w(i) = (1 - \eta\lambda)\, w(i-1) + \eta\, \Phi_i\,[\Phi_i^T \Phi_i + \lambda I]^{-1} d_i$

Page 61: Kernel Adaptive Filtering - Signal processing

How to invert the K-by-K matrix $(\lambda I + \Phi_i^T \Phi_i)$ and avoid $O(K^3)$?

$\tilde{d}(i) = \big(G(i) + \lambda I\big)^{-1} d(i)$

$a_k(i) = \eta\, \tilde{d}_k(i), \qquad k = i$

$a_k(i) = (1 - \eta)\, a_k(i-1) + \eta\, \tilde{d}_k(i), \qquad i - K + 1 \le k \le i - 1$

$a_k(i) = a_k(i-1), \qquad 1 \le k \le i - K$

Page 62: Kernel Adaptive Filtering - Signal processing

Sliding window Gram matrix inversion

$Gr_i = \Phi_i^T \Phi_i$,  with the sliding window  $\Phi_i = [\varphi_i, \varphi_{i-1}, \ldots, \varphi_{i-K+1}]$

1. Assume the previous regularized inverse is known and partition

$Gr_{i-1} + \lambda I = \begin{bmatrix} a & b^T \\ b & D \end{bmatrix}, \qquad (Gr_{i-1} + \lambda I)^{-1} = \begin{bmatrix} e & f^T \\ f & H \end{bmatrix}$

2. Removing the oldest sample gives  $D^{-1} = H - f f^T / e$

3. Adding the newest sample,

$Gr_i + \lambda I = \begin{bmatrix} D & h \\ h^T & g \end{bmatrix}, \qquad (Gr_i + \lambda I)^{-1} = \begin{bmatrix} D^{-1} + (D^{-1}h)(D^{-1}h)^T / s & -D^{-1}h / s \\ -(D^{-1}h)^T / s & 1/s \end{bmatrix}$

where $s = g - h^T D^{-1} h$ is the Schur complement of D.

Complexity is $O(K^2)$.
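The two block-inverse steps above translate into two small helper functions; a sketch in Python (names are illustrative):

```python
import numpy as np

def remove_oldest(A_inv):
    """Given inv([[a, b^T],[b, D]]) = [[e, f^T],[f, H]], return inv(D) = H - f f^T / e."""
    e = A_inv[0, 0]
    f = A_inv[1:, 0]
    return A_inv[1:, 1:] - np.outer(f, f) / e

def add_newest(D_inv, h, g):
    """Given inv(D), return inv([[D, h],[h^T, g]]) via the Schur complement s = g - h^T D^{-1} h."""
    Dh = D_inv @ h
    s = g - h @ Dh
    n = len(h)
    out = np.empty((n + 1, n + 1))
    out[:n, :n] = D_inv + np.outer(Dh, Dh) / s
    out[:n, n] = out[n, :n] = -Dh / s
    out[n, n] = 1.0 / s
    return out
```

Each call costs $O(K^2)$, which is what keeps the sliding-window recursion cheap.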

Page 63: Kernel Adaptive Filtering - Signal processing

Relations to other methods

Page 64: Kernel Adaptive Filtering - Signal processing

Recursive Least-Squares

The RLS algorithm estimates a weight vector w(i-1) by minimizing the cost function

$\min_w \sum_{j=1}^{i-1} \big|d(j) - w^T u(j)\big|^2$

The solution becomes

$w(i-1) = \big[U(i-1)\, U(i-1)^T\big]^{-1} U(i-1)\, d(i-1)$

and can be recursively computed as

$r(i) = 1 + u(i)^T P(i-1)\, u(i)$

$k(i) = P(i-1)\, u(i) / r(i)$

$e(i) = d(i) - w(i-1)^T u(i)$

$w(i) = w(i-1) + k(i)\, e(i)$

$P(i) = P(i-1) - k(i)\, u(i)^T P(i-1)$

where $P(i) = \big[U(i)\, U(i)^T\big]^{-1}$. Start with zero weights and $P(0) = \lambda^{-1} I$.

Page 65: Kernel Adaptive Filtering - Signal processing

Kernel Recursive Least-Squares

The KRLS algorithm estimates a weight function w(i) by minimizing

$\min_w \sum_{j=1}^{i} \big|d(j) - w^T \varphi(j)\big|^2 + \lambda \|w\|^2$

The solution in RKHS becomes

$w(i) = \Phi(i)\,\big[\lambda I + \Phi(i)^T \Phi(i)\big]^{-1} d(i) = \Phi(i)\, a(i), \qquad a(i) = Q(i)\, d(i), \quad Q(i) = \big[\lambda I + G(i)\big]^{-1}$

$Q(i)$ can be computed recursively as

$h(i) = \Phi(i-1)^T \varphi(i), \qquad z(i) = Q(i-1)\, h(i), \qquad r(i) = \lambda + \kappa(u(i), u(i)) - z(i)^T h(i)$

$Q(i) = r(i)^{-1} \begin{bmatrix} Q(i-1)\, r(i) + z(i)\, z(i)^T & -z(i) \\ -z(i)^T & 1 \end{bmatrix}$

And we compose back a(i) recursively:

$e(i) = d(i) - h(i)^T a(i-1), \qquad a(i) = \begin{bmatrix} a(i-1) - z(i)\, r(i)^{-1}\, e(i) \\ r(i)^{-1}\, e(i) \end{bmatrix}$

with initial conditions  $Q(1) = \big(\lambda + \kappa(u(1), u(1))\big)^{-1}, \quad a(1) = Q(1)\, d(1)$

Page 66: Kernel Adaptive Filtering - Signal processing

KRLS

$f_i(u) = \sum_{j=1}^{i} a_j(i)\, \kappa(u(j), u)$

When the new center $c_{m_i} = u(i)$ is added:

$a_{m_i} = r(i)^{-1}\, e(i)$

$a_j \leftarrow a_j - z_j(i)\, r(i)^{-1}\, e(i), \qquad j < m_i$

[Network diagram: growing RBF structure with centers $c_1, \ldots, c_{m_i}$ and coefficients $a_1, \ldots, a_{m_i}$]

Engel Y., Mannor S., Meir R. “The kernel recursive least square algorithm”, IEEE Trans. Signal

Processing, 52 (8), 2275-2285, 2004.

Page 67: Kernel Adaptive Filtering - Signal processing

KRLS

$f_i = f_{i-1} + r(i)^{-1}\, e(i)\Big[\kappa(u(i), \cdot) - \sum_{j=1}^{i-1} z_j(i)\, \kappa(u(j), \cdot)\Big]$

$a_i(i) = r(i)^{-1}\, e(i)$

$a_j(i) = a_j(i-1) - z_j(i)\, r(i)^{-1}\, e(i), \qquad j = 1, \ldots, i-1$

$C(i) = \{C(i-1),\, u(i)\}$
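A direct sketch of the KRLS recursion of the previous slides in Python, without sparsification (class and parameter names are illustrative):

```python
import numpy as np

class KRLS:
    """KRLS sketch: keeps Q(i) = (lam I + G(i))^{-1} and the coefficients a(i)."""
    def __init__(self, kernel, lam=0.1):
        self.kernel, self.lam = kernel, lam
        self.centers, self.Q, self.a = [], None, None

    def predict(self, u):
        if not self.centers:
            return 0.0
        k = np.array([self.kernel(c, u) for c in self.centers])
        return float(k @ self.a)

    def update(self, u, d):
        if not self.centers:
            self.Q = np.array([[1.0 / (self.lam + self.kernel(u, u))]])
            self.a = self.Q[:, 0] * d
            self.centers.append(u)
            return d
        h = np.array([self.kernel(c, u) for c in self.centers])
        z = self.Q @ h
        r = self.lam + self.kernel(u, u) - z @ h      # r(i)
        e = d - h @ self.a                            # a-priori error e(i)
        n = len(self.centers)
        Qn = np.empty((n + 1, n + 1))
        Qn[:n, :n] = self.Q + np.outer(z, z) / r      # block update of Q(i)
        Qn[:n, n] = Qn[n, :n] = -z / r
        Qn[n, n] = 1.0 / r
        self.Q = Qn
        self.a = np.concatenate([self.a - z * (e / r), [e / r]])
        self.centers.append(u)
        return e
```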

Page 68: Kernel Adaptive Filtering - Signal processing

Regularization

The well-posedness discussion for the KLMS holds for any other gradient descent method like KAPA-1 and KAPA-3

If Newton method is used, additional regularization is

needed to invert the Hessian matrix like in KAPA-2

and normalized KLMS

Recursive least squares embed the regularization in

the initialization

Page 69: Kernel Adaptive Filtering - Signal processing

Computation complexity

[Comparison: computational complexity on Mackey-Glass prediction with L = 10, K = 10, and K = 50 for the sliding-window KRLS]

Page 70: Kernel Adaptive Filtering - Signal processing

Simulation 1: noise cancellation

$u(i) = n(i) + 0.2\, u(i-1) + u(i-1)\, n(i-1) + 0.1\, n(i-1) + 0.4\, u(i-2) = H\big(n(i), n(i-1), u(i-1), u(i-2)\big)$

$n(i) \sim \text{uniform}\,[-0.5, 0.5]$

Page 71: Kernel Adaptive Filtering - Signal processing

Simulation 1: Noise Cancellation

$\kappa\big(u(i), u(j)\big) = \exp\big(-\|u(i) - u(j)\|^2\big), \qquad K = 10$

Page 72: Kernel Adaptive Filtering - Signal processing

Simulation 1:Noise Cancellation

[Plots: a segment (samples 2500-2600) of the noisy observation and the corresponding outputs of NLMS, KLMS-1, and KAPA-2]

Page 73: Kernel Adaptive Filtering - Signal processing

Simulation-2: nonlinear channel equalization

Channel model:  $z_t = s_t + 0.5\, s_{t-1}, \qquad r_t = z_t - 0.9\, z_t^2 + n_\sigma$;   K = 10, σ = 0.1

Page 74: Kernel Adaptive Filtering - Signal processing

Simulation-2: nonlinear channel equalization

Nonlinearity changed (inverted signs)

Page 75: Kernel Adaptive Filtering - Signal processing

Gaussian Processes

A Gaussian process is a stochastic process (a family of random variables) for which any finite collection is jointly Gaussian distributed. The family, however, is not necessarily indexed by time (as in time series).

For instance in regression, if we denote the output of a learning system by y(i) given the input u(i) for every i, the conditional probability is

$p\big(y(1), \ldots, y(n) \mid u(1), \ldots, u(n)\big) = \mathcal{N}\big(0,\; \sigma_n^2 I + G(i)\big)$

where $\sigma_n$ is the observation Gaussian noise and G(i) is the Gram matrix

$G(i) = \begin{bmatrix} \kappa(u(1), u(1)) & \cdots & \kappa(u(1), u(i)) \\ \vdots & \ddots & \vdots \\ \kappa(u(i), u(1)) & \cdots & \kappa(u(i), u(i)) \end{bmatrix}$

and $\kappa$ is the covariance function (symmetric and positive definite), just like the Gaussian kernel used in KLMS.

Gaussian processes can be used with advantage in Bayesian inference.

Page 76: Kernel Adaptive Filtering - Signal processing

Gaussian Processes and Recursive

Least-Squares

The standard linear regression model with Gaussian noise is

where the noise is IID, zero mean and variance

The likelihood of the observations given the input and weight vector is

To compute the posterior over the weight vector we need to specify the prior, here a Gaussian and use Bayes rule

Since the denominator is a constant, the posterior is shaped by the numerator, and it is approximately given by

with mean and covariance

Therefore, RLS computes the posterior in a Gaussian process one sample at a time.

$f(u) = w^T u, \qquad d = f(u) + v, \qquad p(w) = \mathcal{N}(0, \sigma_w^2 I)$

$p\big(d(i) \mid U(i), w\big) = \prod_{j=1}^{i} p\big(d(j) \mid u(j), w\big) = \mathcal{N}\big(U(i)^T w,\; \sigma_n^2 I\big)$

$p\big(w \mid d(i), U(i)\big) = \dfrac{p\big(d(i) \mid U(i), w\big)\, p(w)}{p\big(d(i) \mid U(i)\big)}$

$p\big(w \mid d(i), U(i)\big) \propto \exp\!\Big(-\tfrac{1}{2}\, \big(w - \bar{w}(i)\big)^T \Sigma(i)^{-1} \big(w - \bar{w}(i)\big)\Big)$

with mean and covariance

$\bar{w}(i) = \sigma_n^{-2}\big[\sigma_n^{-2} U(i) U(i)^T + \sigma_w^{-2} I\big]^{-1} U(i)\, d(i), \qquad \Sigma(i) = \big[\sigma_n^{-2} U(i) U(i)^T + \sigma_w^{-2} I\big]^{-1}$

Page 77: Kernel Adaptive Filtering - Signal processing

KRLS and Nonlinear Regression

It is easy to demonstrate that KRLS does in fact estimate online nonlinear regression with a Gaussian noise model, i.e.

$d = f(u) + v, \qquad f(u) = \Omega^T \varphi(u)$

where the noise is IID, zero mean and of variance $\sigma_n^2$.

By a similar derivation, the posterior mean and covariance are

$\bar{\Omega}(i) = \sigma_n^{-2}\big[\sigma_n^{-2} \Phi(i) \Phi(i)^T + \sigma_w^{-2} I\big]^{-1} \Phi(i)\, d(i), \qquad \Sigma(i) = \big[\sigma_n^{-2} \Phi(i) \Phi(i)^T + \sigma_w^{-2} I\big]^{-1}$

Although the weight function is not accessible, we can create predictions at any point in the space by the KRLS as

$\hat{E}[f(u)] = \sigma_w^2\, \varphi(u)^T \Phi(i)\,\big[\sigma_w^2\, \Phi(i)^T \Phi(i) + \sigma_n^2 I\big]^{-1} d(i)$

with variance

$\sigma^2\big(f(u)\big) = \sigma_w^2\, \varphi(u)^T \varphi(u) - \sigma_w^4\, \varphi(u)^T \Phi(i)\,\big[\sigma_w^2\, \Phi(i)^T \Phi(i) + \sigma_n^2 I\big]^{-1} \Phi(i)^T \varphi(u)$

Page 78: Kernel Adaptive Filtering - Signal processing

Part 4: Extended Recursive least

squares in kernel space

Page 79: Kernel Adaptive Filtering - Signal processing

Extended Recursive Least-Squares

STATE model:

$x_{i+1} = F\, x_i + n_i, \qquad d(i) = U_i^T x_i + v(i)$

Start with $w_{0|-1}$ and $P_{0|-1}$.

Special cases:

• Tracking model (F is a time varying scalar):  $x_{i+1} = \alpha\, x_i + n_i, \quad d(i) = u_i^T x_i + v(i)$

• Exponentially weighted RLS:  $x_{i+1} = \lambda^{-1/2}\, x_i, \quad d(i) = u_i^T x_i + v(i)$

• Standard RLS:  $x_{i+1} = x_i, \quad d(i) = u_i^T x_i + v(i)$

Notation:  $x_i$ is the state vector at time i;  $w_{i|i-1}$ is the state estimate at time i using data up to i-1.

Page 80: Kernel Adaptive Filtering - Signal processing

The recursive update equations

Recursive equations: initialize $w_{0|-1} = 0$, $P_{0|-1} = \lambda^{-1} I$, and iterate

$r_e(i) = 1 + u_i^T P_{i|i-1}\, u_i$   (conversion factor)

$k_{p,i} = \alpha\, P_{i|i-1}\, u_i\, /\, r_e(i)$   (gain factor)

$e(i) = d(i) - u_i^T w_{i|i-1}$   (error)

$w_{i+1|i} = \alpha\, w_{i|i-1} + k_{p,i}\, e(i)$   (weight update)

$P_{i+1|i} = |\alpha|^2 \big[P_{i|i-1} - P_{i|i-1}\, u_i\, u_i^T P_{i|i-1}\, /\, r_e(i)\big] + q\, I$

Notice that

$u_i^T w_{i+1|i} = \alpha\, u_i^T w_{i|i-1} + \alpha\, u_i^T P_{i|i-1}\, u_i\, e(i)\, /\, r_e(i)$

If we have transformed data, how do we calculate $\varphi(u_k)^T P_{i|i-1}\, \varphi(u_j)$ for any k, i, j?

Page 81: Kernel Adaptive Filtering - Signal processing

New Extended Recursive Least-squares

Theorem 1:

$P_{j|j-1} = \rho_{j-1}\, I - H_{j-1}^T\, Q_{j-1}\, H_{j-1}, \qquad \forall j$

where $\rho_{j-1}$ is a scalar, $H_{j-1} = [u_0, \ldots, u_{j-1}]^T$, and $Q_{j-1}$ is a j-by-j matrix.

Proof: the statement holds at the start, since $P_{0|-1} = \lambda^{-1} I$ (so $\rho_{-1} = \lambda^{-1}$ and $Q_{-1} = 0$); substituting it into the recursion for $P_{i+1|i}$ reproduces the same form one step later, so the result follows by mathematical induction.

Liu W., Principe J., “Extended Recursive Least Squares in RKHS”, in Proc. 1st Workshop on Cognitive Signal Processing, Santorini, Greece, 2008.

Page 82: Kernel Adaptive Filtering - Signal processing

New Extended Recursive Least-squares

Theorem 2:

$\hat{w}_{j|j-1} = H_{j-1}^T\, a_{j|j-1}, \qquad \forall j$

where $H_{j-1} = [u_0, \ldots, u_{j-1}]^T$ and $a_{j|j-1}$ is a j-by-1 vector.

Proof: the statement holds at the start, since $\hat{w}_{0|-1} = 0$ (so $a_{0|-1} = 0$); substituting it into the weight update $\hat{w}_{i+1|i} = \alpha\, \hat{w}_{i|i-1} + k_{p,i}\, e(i)$, together with Theorem 1, reproduces the same form one step later; by mathematical induction again!

Page 83: Kernel Adaptive Filtering - Signal processing

Extended RLS New Equations

Using Theorems 1 and 2, the recursion is rewritten entirely in terms of inner products of the inputs. Initialize $a_{0|-1} = 0$, $\rho_{-1} = \lambda^{-1}$, $Q_{-1} = 0$, and at iteration i compute

$k_{i-1,i} = H_{i-1}\, u_i, \qquad f_{i-1,i} = Q_{i-1}\, k_{i-1,i}$

$r_e(i) = \rho_{i-1}\, u_i^T u_i - k_{i-1,i}^T\, f_{i-1,i}$   (conversion factor)

$e(i) = d(i) - k_{i-1,i}^T\, a_{i|i-1}$   (error)

$a_{i+1|i} = \begin{bmatrix} \alpha\, a_{i|i-1} - \alpha\, f_{i-1,i}\, r_e(i)^{-1}\, e(i) \\ \alpha\, \rho_{i-1}\, r_e(i)^{-1}\, e(i) \end{bmatrix}$   (weight update)

$\rho_i = |\alpha|^2 \rho_{i-1} + q, \qquad Q_i = |\alpha|^2 \begin{bmatrix} Q_{i-1} + f_{i-1,i}\, f_{i-1,i}^T\, r_e(i)^{-1} & -\rho_{i-1}\, f_{i-1,i}\, r_e(i)^{-1} \\ -\rho_{i-1}\, f_{i-1,i}^T\, r_e(i)^{-1} & \rho_{i-1}^2\, r_e(i)^{-1} \end{bmatrix}$

Page 84: Kernel Adaptive Filtering - Signal processing

An important theorem

Assume a general nonlinear state-space model

$s(i+1) = g\big(s(i)\big), \qquad d(i) = h\big(s(i), u(i)\big) + v(i)$

It can be expressed as a linear state-space model in an RKHS,

$x(i+1) = A\, x(i), \qquad d(i) = x(i)^T \varphi\big(u(i)\big) + v(i)$

with $\varphi(u)^T \varphi(u') = \kappa(u, u')$.

Page 85: Kernel Adaptive Filtering - Signal processing

Extended Kernel Recursive Least-squares

Initialize $a_{0|-1} = 0$, $\rho_{-1} = \lambda^{-1}$, $Q_{-1} = 0$.

Update on weights:

$k_{i-1,i} = [\kappa(u_0, u_i), \ldots, \kappa(u_{i-1}, u_i)]^T, \qquad f_{i-1,i} = Q_{i-1}\, k_{i-1,i}$

$r_e(i) = \rho_{i-1}\, \kappa(u_i, u_i) - k_{i-1,i}^T\, f_{i-1,i}$

$e(i) = d(i) - k_{i-1,i}^T\, a_{i|i-1}$

$a_{i+1|i} = \begin{bmatrix} \alpha\, a_{i|i-1} - \alpha\, f_{i-1,i}\, r_e(i)^{-1}\, e(i) \\ \alpha\, \rho_{i-1}\, r_e(i)^{-1}\, e(i) \end{bmatrix}$

Update on the P matrix (through ρ and Q):

$\rho_i = |\alpha|^2 \rho_{i-1} + q, \qquad Q_i = |\alpha|^2 \begin{bmatrix} Q_{i-1} + f_{i-1,i}\, f_{i-1,i}^T\, r_e(i)^{-1} & -\rho_{i-1}\, f_{i-1,i}\, r_e(i)^{-1} \\ -\rho_{i-1}\, f_{i-1,i}^T\, r_e(i)^{-1} & \rho_{i-1}^2\, r_e(i)^{-1} \end{bmatrix}$

Page 86: Kernel Adaptive Filtering - Signal processing

Ex-KRLS

$f_i(u) = \langle \Omega_i, \varphi(u) \rangle_F = \sum_{j=1}^{i} a_j\, \kappa(u_j, u)$

[Growing RBF network with centers $c_1, \ldots, c_{m_i}$ and coefficients $a_1, \ldots, a_{m_i}$]

When the new center $c_{m_i} = u_i$ is added:

$a_{m_i} = \alpha\, \rho_{i-1}\, r_e(i)^{-1}\, e(i)$

$a_j \leftarrow \alpha\, \big(a_j - f_{i-1,i}(j)\, r_e(i)^{-1}\, e(i)\big), \qquad j = 1, \ldots, m_i - 1$

Page 87: Kernel Adaptive Filtering - Signal processing

Simulation-3: Lorenz time series

prediction

Page 88: Kernel Adaptive Filtering - Signal processing

Simulation-3: Lorenz time series

prediction (10 steps)

Page 89: Kernel Adaptive Filtering - Signal processing

Simulation 4: Rayleigh channel tracking

[Setup: the symbols s_t pass through a 5-tap Rayleigh multipath fading channel, a tanh nonlinearity, and additive noise to produce r_t; f_D = 100 Hz, sampling period 8x10^-5 s, σ = 0.005, 1,000 training symbols]

Page 90: Kernel Adaptive Filtering - Signal processing

Rayleigh channel tracking

Algorithms            MSE (dB), noise variance 0.001, fD = 50 Hz    MSE (dB), noise variance 0.01, fD = 200 Hz
ε-NLMS                -13.51                                        -9.39
RLS                   -14.25                                        -9.55
Extended RLS          -14.26                                        -10.01
Kernel RLS            -20.36                                        -12.74
Kernel extended RLS   -20.69                                        -13.85

$\kappa(u_i, u_j) = \exp(-0.1\, \|u_i - u_j\|^2)$

Page 91: Kernel Adaptive Filtering - Signal processing

Computation complexity

At time or iteration i:

Algorithms              Linear LMS   KLMS   KAPA       ex-KRLS
Computation (training)  O(l)         O(i)   O(i+K^2)   O(i^2)
Memory (training)       O(l)         O(i)   O(i+K)     O(i^2)
Computation (test)      O(l)         O(i)   O(i)       O(i)
Memory (test)           O(l)         O(i)   O(i)       O(i)

Page 92: Kernel Adaptive Filtering - Signal processing

Part 5: Active learning in kernel

adaptive filtering

Page 93: Kernel Adaptive Filtering - Signal processing

Active data selection

Why? The kernel trick may seem a “free lunch”!

The price we pay is memory and pointwise evaluations of the function.

Generalization (Occam's razor)

But remember we are working on an on-line scenario, so most of the methods out there need to be modified.

Page 94: Kernel Adaptive Filtering - Signal processing

Active data selection

The goal is to build a constant length (fixed budget) filter in RKHS. There are two complementary methods of achieving this goal:

Discard unimportant centers (pruning)

Accept only some of the new centers (sparsification)

Apart from heuristics, in either case a methodology to evaluate the importance of the centers for the overall nonlinear function approximation is needed.

Another requirement is that this evaluation should be no more expensive computationally than the filter adaptation.

Page 95: Kernel Adaptive Filtering - Signal processing

Previous Approaches – Sparsification

Novelty condition (Platt, 1991)

• Compute the distance to the current dictionary

$dis_1 = \min_{c_j \in D(i)} \| u(i+1) - c_j \|$

• If it is less than a threshold $\delta_1$, discard.

• If the prediction error  $e(i+1) = d(i+1) - \Omega(i)^T \varphi\big(u(i+1)\big)$  is larger than another threshold $\delta_2$, include the new center.

Approximate linear dependency (Engel, 2004)

• If the new input is a linear combination of the previous centers, discard:

$dis_2 = \min_{b} \Big\| \varphi(u(i+1)) - \sum_{j} b_j\, \varphi(c_j) \Big\|$

which is the Schur complement of the Gram matrix and fits KAPA 2 and 4 very well. The problem is the computational complexity.

Page 96: Kernel Adaptive Filtering - Signal processing

Previous Approaches – Pruning

Sliding Window (Vaerenbergh, 2010)

Impose $m_i < B$ in  $f_i(\cdot) = \sum_{j=1}^{m_i} a_j(i)\, \kappa(\cdot, c_j)$

Create the Gram matrix of size B+1 recursively from size B:

$G(i+1) = \begin{bmatrix} G(i) & h \\ h^T & \kappa(c_{B+1}, c_{B+1}) \end{bmatrix}, \qquad h = [\kappa(c_{B+1}, c_1), \ldots, \kappa(c_{B+1}, c_B)]^T$

Update the regularized inverse $Q(i) = (G(i) + \lambda I)^{-1}$ with the same block (Schur complement) formulas used in KAPA-2:

$z = Q(i)\, h, \qquad r = \lambda + \kappa(c_{B+1}, c_{B+1}) - z^T h, \qquad Q(i+1) = \begin{bmatrix} Q(i) + z z^T / r & -z / r \\ -z^T / r & 1/r \end{bmatrix}$

Downsize: reorder the centers so the oldest comes first and remove it with $H - f f^T / e$ (see KAPA-2); then $a(i+1) = Q(i+1)\, d(i+1)$ and $f_{i+1}(\cdot) = \sum_{j=1}^{B} a_j(i+1)\, \kappa(\cdot, c_j)$.

See also the Forgetron and the Projectron that provide error bounds for the approximation.

O. Dekel, S. Shalev-Shwartz, and Y. Singer, “The Forgetron: A kernel-based perceptron on a fixed budget,” in Advances

in Neural Information Processing Systems 18. Cambridge, MA: MIT Press, 2006, pp. 1342–1372.

F. Orabona, J. Keshet, and B. Caputo, “Bounded kernel-based online learning,” Journal of Machine Learning Research,

vol. 10, pp. 2643–2666, 2009.

Page 97: Kernel Adaptive Filtering - Signal processing

Problem statement

The learning system:  $y\big(u; T(i)\big)$

Already processed (your dictionary):  $D(i) = \{u(j), d(j)\}_{j=1}^{i}$

A new data pair arrives:  $\{u(i+1), d(i+1)\}$

How much new information does it contain? Is this the right question?

Or: how much information does it contain with respect to the learning system $y\big(u; T(i)\big)$?

Page 98: Kernel Adaptive Filtering - Signal processing

Information measure

Hartley and Shannon's definition of information answers "how much information does it contain?":

$I(i+1) = -\ln p\big(u(i+1), d(i+1)\big)$

Learning is unlike digital communications: the machine never knows the joint distribution!

When the same message is presented to a learning system, the information (the degree of uncertainty) changes because the system learned from the first presentation!

We need to bring back MEANING into information theory!

Page 99: Kernel Adaptive Filtering - Signal processing

Surprise as an information measure

Learning is very much like an experiment that we do

in the laboratory.

Fedorov (1972) proposed to measure the importance

of an experiment as the Kulback Leibler distance

between the prior (the hypothesis we have) and the

posterior (the results after measurement).

Mackay (1992) formulated this concept under a

Bayesian approach and it has become one of the key

concepts in active learning.

Page 100: Kernel Adaptive Filtering - Signal processing

Surprise as an information measure

$S_{T(i)}\big(u(i+1)\big) = CI(i+1) = -\ln p\big(u(i+1) \mid T(i)\big)$

More generally, the surprise of x to a learner with model q is  $I_S(x) = -\log q(x)$,  evaluated by the learning system $y\big(u; T(i)\big)$.

Page 101: Kernel Adaptive Filtering - Signal processing

Shannon versus Surprise

Shannon (absolute information)    Surprise (conditional information)
Objective                         Subjective
Receptor independent              Receptor dependent (on time and agent)
Message is meaningless            Message has meaning for the agent

Page 102: Kernel Adaptive Filtering - Signal processing

Evaluation of conditional information

(surprise)

Gaussian process theory gives

$CI(i+1) = -\ln\big[p\big(u(i+1), d(i+1) \mid T(i)\big)\big] = \ln\sqrt{2\pi} + \ln\sigma(i+1) + \dfrac{\big(d(i+1) - \hat{d}(i+1)\big)^2}{2\,\sigma^2(i+1)} - \ln\big[p\big(u(i+1) \mid T(i)\big)\big]$

where

$\hat{d}(i+1) = h(i+1)^T\,\big[\sigma_n^2 I + G(i)\big]^{-1} d(i)$

$\sigma^2(i+1) = \sigma_n^2 + \kappa\big(u(i+1), u(i+1)\big) - h(i+1)^T\,\big[\sigma_n^2 I + G(i)\big]^{-1} h(i+1)$
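A sketch of evaluating the conditional information of a candidate pair, assuming the memoryless uniform input model so the $-\ln p(u \mid T)$ term is a constant and can be dropped (function and variable names are illustrative):

```python
import numpy as np

def conditional_information(u_new, d_new, centers, d_past, kernel, sigma_n):
    """CI = 0.5*ln(2*pi) + ln(sigma) + e^2/(2*sigma^2), dropping the input term."""
    G = np.array([[kernel(a, b) for b in centers] for a in centers])
    h = np.array([kernel(c, u_new) for c in centers])
    w = np.linalg.solve(G + sigma_n ** 2 * np.eye(len(centers)), h)
    d_hat = w @ np.asarray(d_past)                       # predictive mean
    var = sigma_n ** 2 + kernel(u_new, u_new) - w @ h    # predictive variance
    e = d_new - d_hat
    return float(0.5 * np.log(2 * np.pi) + 0.5 * np.log(var) + e ** 2 / (2 * var))
```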

Page 103: Kernel Adaptive Filtering - Signal processing

Interpretation of conditional information

(surprise)

Prediction error $e(i+1) = d(i+1) - \hat{d}(i+1)$: a large error means large conditional information.

Prediction variance $\sigma^2(i+1)$: a small error with large variance means large CI; a large error with small variance also means large CI (abnormal).

Input distribution $p\big(u(i+1) \mid T(i)\big)$: a rare occurrence means large CI.

Page 104: Kernel Adaptive Filtering - Signal processing

Input distribution

Memoryless assumption:  $p\big(u(i+1) \mid T(i)\big) = p\big(u(i+1)\big)$

Memoryless uniform assumption:  $p\big(u(i+1) \mid T(i)\big) = \text{const.}$

Page 105: Kernel Adaptive Filtering - Signal processing

Unknown desired signal

Average the CI over the posterior distribution of the output:

$\overline{CI}(i+1) = \ln \sigma(i+1) - \ln p\big(u(i+1) \mid T(i)\big)$

Under the memoryless uniform assumption

$\overline{CI}(i+1) = \ln \sigma(i+1)$

This is equivalent to approximate linear dependency!

Page 106: Kernel Adaptive Filtering - Signal processing

Redundant, abnormal and learnable

Abnormal:   $CI(i+1) > T_1$

Learnable:  $T_1 \ge CI(i+1) \ge T_2$

Redundant:  $CI(i+1) < T_2$

We still need to find a systematic way to select these thresholds, which are hyperparameters.

Page 107: Kernel Adaptive Filtering - Signal processing

Active online GP regression (AOGR)

Compute conditional information

If redundant

Throw away

If abnormal

Throw away (outlier examples)

Controlled gradient descent (non-stationary)

If learnable

Kernel recursive least squares (stationary)

Extended KRLS (non-stationary)

Page 108: Kernel Adaptive Filtering - Signal processing

Simulation-5: nonlinear regression—learning

curve

Page 109: Kernel Adaptive Filtering - Signal processing

Simulation-5: nonlinear regression—

redundancy removal

T1 is wrong, should be T2

Page 110: Kernel Adaptive Filtering - Signal processing

Simulation-5: nonlinear regression–

most surprising data

Page 111: Kernel Adaptive Filtering - Signal processing

Simulation-5: nonlinear regression

Page 112: Kernel Adaptive Filtering - Signal processing

Simulation-5: nonlinear regression—

abnormality detection (15 outliers)

AOGR=KRLS

Page 113: Kernel Adaptive Filtering - Signal processing

Simulation-6: Mackey-Glass time series

prediction

AOGR=KRLS

Page 114: Kernel Adaptive Filtering - Signal processing

Simulation-7: CO2 concentration forecasting

Page 115: Kernel Adaptive Filtering - Signal processing

Quantized Kernel Least Mean Square

A common drawback of sparsification methods: the redundant input data are purely discarded!

Actually the redundant data are very useful and can be, for example, utilized to update the coefficients of the current network, although they are not so important for structure update (adding a new center).

Quantization approach: the input space is quantized; if the current quantized input has already been assigned a center, we don't need to add a new one, but update the coefficient of that center with the new information!

Intuitively, the coefficient update can enhance the utilization efficiency of that center, and hence may yield better accuracy and a more compact network.

Chen B., Zhao S., Zhu P., Principe J. Quantized Kernel Least Mean Square Algorithm, submitted to IEEE Trans. Neural

Networks

Page 116: Kernel Adaptive Filtering - Signal processing

Quantized Kernel Least Mean Square

Quantization in the Input Space:

$\Omega(0) = 0$

$e(i) = d(i) - \Omega(i-1)^T \varphi\big(u(i)\big)$

$\Omega(i) = \Omega(i-1) + \eta\, e(i)\, \varphi\big(Q[u(i)]\big)$

Quantization in the RKHS:

$f_0 = 0$

$e(i) = d(i) - f_{i-1}\big(u(i)\big)$

$f_i = f_{i-1} + \eta\, e(i)\, \kappa\big(Q[u(i)], \cdot\big)$

$Q[\cdot]$ is the quantization operator. The quantization method compresses the input (or feature) space and hence compacts the RBF structure of the kernel adaptive filter.

Page 117: Kernel Adaptive Filtering - Signal processing

Quantized Kernel Least Mean Square

The key problem is the vector quantization (VQ):

Information Theory? Information Bottleneck? ……

Most of the existing VQ algorithms, however, are not

suitable for online implementation because the codebook

must be supplied in advance (which is usually trained on

an offline data set), and the computational burden is

rather heavy.

A simple online VQ method:

1. Compute the distance between u(i) and the codebook C(i-1):

$dis\big(u(i), C(i-1)\big) = \min_{1 \le j \le \mathrm{size}(C(i-1))} \big\| u(i) - C_j(i-1) \big\|$

2. If $dis\big(u(i), C(i-1)\big) \le \varepsilon_q$, keep the codebook unchanged, $C(i) = C(i-1)$, and quantize u(i) onto the closest code-vector by updating its coefficient:

$j^* = \arg\min_{1 \le j \le \mathrm{size}(C(i-1))} \big\| u(i) - C_j(i-1) \big\|, \qquad a_{j^*}(i) = a_{j^*}(i-1) + \eta\, e(i)$

3. Otherwise, update the codebook, $C(i) = \{C(i-1), u(i)\}$, and quantize u(i) as itself.
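A sketch of QKLMS with this online VQ rule in Python, assuming a Gaussian kernel (class and parameter names are illustrative):

```python
import numpy as np

class QKLMS:
    """Quantized KLMS sketch: inputs closer than eps_q to an existing
    code-vector update that center's coefficient instead of growing the network."""
    def __init__(self, eta=0.5, h=1.0, eps_q=0.1):
        self.eta, self.h, self.eps_q = eta, h, eps_q
        self.centers, self.alphas = [], []

    def predict(self, u):
        if not self.centers:
            return 0.0
        C = np.asarray(self.centers)
        k = np.exp(-self.h * np.sum((C - u) ** 2, axis=1))
        return float(np.dot(self.alphas, k))

    def update(self, u, d):
        u = np.asarray(u, dtype=float)
        e = d - self.predict(u)
        if self.centers:
            dists = np.linalg.norm(np.asarray(self.centers) - u, axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= self.eps_q:       # quantize onto the closest center
                self.alphas[j] += self.eta * e
                return e
        self.centers.append(u)               # otherwise grow the codebook
        self.alphas.append(self.eta * e)
        return e
```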

Page 118: Kernel Adaptive Filtering - Signal processing

Quantized Kernel Least Mean Square

Quantized Energy Conservation Relation:

$\|\tilde{\Omega}(i)\|^2 + \dfrac{|e_{aq}(i)|^2}{\kappa\big(u(i), u(i)\big)} = \|\tilde{\Omega}(i-1)\|^2 + \dfrac{|e_{pq}(i)|^2}{\kappa\big(u(i), u(i)\big)}$

A Sufficient Condition for Mean Square Convergence: the step size must satisfy, at every iteration, an upper bound that involves the a-priori error, the quantized input direction, and the noise variance $\sigma_v^2$.

Steady-state Mean Square Performance: lower and upper bounds on $\lim_{i\to\infty} E\big[e_a^2(i)\big]$ are obtained; both reduce to $\dfrac{\eta\, \sigma_v^2}{2 - \eta}$ when the quantization size goes to zero.

Page 119: Kernel Adaptive Filtering - Signal processing

Quantized Kernel Least Mean Square

Static Function Estimation:

$d(i) = 0.2\Big[\exp\Big(-\dfrac{(u(i)-1)^2}{2}\Big) + \exp\Big(-\dfrac{(u(i)+1)^2}{2}\Big)\Big] + v(i)$

[Plots: EMSE (with lower and upper bounds, EMSE = 0.0171) and final network size versus the quantization factor]

Page 120: Kernel Adaptive Filtering - Signal processing

Quantized Kernel Least Mean Square

Short Term Lorenz Time Series Prediction

[Plots: network size and testing MSE versus iteration for QKLMS, NC-KLMS, and SC-KLMS]

Page 121: Kernel Adaptive Filtering - Signal processing

Quantized Kernel Least Mean Square

Short Term Lorenz Time Series Prediction

[Plots: network size and testing MSE versus iteration for QKLMS, NC-KLMS, SC-KLMS, and KLMS]

Page 122: Kernel Adaptive Filtering - Signal processing

Redefinition of On-line Kernel Learning

Notice how problem constraints affected the form of the

learning algorithms.

On-line Learning: A process by which the free parameters and the topology of a 'learning system' are adapted through a process of stimulation by the environment in which the system is embedded.

Error-correction learning + memory-based learning

What an interesting (biologically plausible?) combination.

Page 123: Kernel Adaptive Filtering - Signal processing

Impacts on Machine Learning

KAPA algorithms can be very useful in large scale

learning problems.

Just sample randomly the data from the data base and

apply on-line learning algorithms

There is an extra optimization error associated with these methods, but they can be easily fit to the machine constraints (memory, FLOPS) or the processing time constraints (best solution in x seconds).

Page 124: Kernel Adaptive Filtering - Signal processing

Information Theoretic Learning (ITL)

This class of algorithms can

be extended to ITL cost

functions and also beyond

Regression (classification,

Clustering, ICA, etc). See

IEEE

SP MAGAZINE, Nov 2006

Or ITL resource

www.cnel.ufl.edu