
PhD Candidature Presentation
In other words: What I did in the last two years

Andersen Ang

Mathématique et Recherche Opérationnelle, UMONS, Belgium
Email: manshun.ang@umons.ac.be    Homepage: angms.science

February 26, 2019

The works

Journal paper

A.-Gillis, Accelerating Nonnegative Matrix Factorization Algorithms using Extrapolation, Neural Computation, vol. 31 (2), pp 417-439, Feb 2019, MIT Press

Conference papers

Leplat-A.-Gillis, Minimum-Volume Rank-Deficient Non-negative Matrix Factorizations, to be presented at IEEE ICASSP 2019, Brighton, UK, May 2019

A.-Gillis, Volume regularized Non-negative Matrix Factorizations, IEEE WHISPERS 2018, Amsterdam, NL, 25 Sept 2018

Work in progress

A.-Gillis, Algorithms and Comparisons of Non-negative Matrix Factorization with Volume Regularization for Hyperspectral Unmixing, in preparation, to be submitted to IEEE JSTARS

Leplat-A.-Gillis, β-NMF for blind audio source separation with minimum volume regularization

Cohen-A., Accelerating Non-negative Canonical Polyadic Decomposition using extrapolation, to be submitted to GRETSI 2019 (in French!)

And numerous presentations (abstracts) at conferences, workshops, doctoral schools and seminars in BE, FR, NL, DE, IT, HK, e.g. SIAM-ALA18, ISMP2018, OR2018 ...


Non-negative Matrix Factorization

Given X ∈ ℝ^{m×n} (or ℝ_+^{m×n}), find two matrices W ∈ ℝ_+^{m×r} and H ∈ ℝ_+^{r×n} by solving

min_{W,H} f(W,H) = (1/2) ‖X − WH‖_F²   subject to   W ≥ 0, H ≥ 0,   (1)

where ≥ is taken element-wise.

Key points (see the report for references):

• Non-convex problem.

• No closed-form solution; numerical optimization algorithms are used to solve it.

• Non-negativity makes NMF NP-hard (as opposed to PCA); there are model modifications that make the problem solvable in polynomial time.

• Many applications in machine learning, data mining, signal processing.
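As a small illustration of the objective in (1), here is a minimal NumPy sketch; the variable names and the synthetic-data setup are illustrative, not taken from the paper:

```python
import numpy as np

def nmf_objective(X, W, H):
    """f(W, H) = 0.5 * ||X - W H||_F^2, the objective in (1)."""
    return 0.5 * np.linalg.norm(X - W @ H, 'fro') ** 2

# Synthetic data that admits an exact non-negative rank-r factorization.
rng = np.random.default_rng(0)
m, n, r = 50, 80, 5
W_true = rng.random((m, r))
H_true = rng.random((r, n))
X = W_true @ H_true            # non-negative by construction

print(nmf_objective(X, W_true, H_true))   # 0.0 at the ground truth
```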


Alternating minimization

The standard way to solve NMF :

··· → update W → update H → update W → update H → ···

with the goal of satisfying the descent condition:

f(W_k, H_k) ≤ f(W_k, H_{k−1}) ≤ f(W_{k−1}, H_{k−1}),   k ∈ ℕ,   (2)

where k is the iteration counter.

To achieve (2), use projected gradient descent (PGD):

Update W:   W_{k+1} = [W_k − α_k^W ∇_W f(W_k, H_k)]_+   (3)
Update H:   H_{k+1} = [H_k − α_k^H ∇_H f(W_{k+1}, H_k)]_+   (4)

where α_k is the step size and [·]_+ = max{·, 0}.

Fact: the sequence {W_k, H_k}_{k∈ℕ} produced by the PGD scheme (3)-(4) converges to a first-order stationary point of f.
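A minimal NumPy sketch of the alternating PGD scheme (3)-(4), assuming the common step-size choice α_k = 1/L with L the Lipschitz constant of the block gradient (the exact step-size rule used in the paper may differ):

```python
import numpy as np

def pgd_nmf(X, r, iters=200, seed=0):
    """Alternating projected gradient descent, scheme (3)-(4).

    Step sizes are 1/L per block, with L the largest eigenvalue of HH^T
    (W-update) or W^TW (H-update); this is a standard choice, not
    necessarily the one used in the paper."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        # W-update: grad_W f = (WH - X) H^T, then project onto W >= 0.
        L_w = np.linalg.norm(H @ H.T, 2)
        W = np.maximum(W - ((W @ H - X) @ H.T) / L_w, 0.0)
        # H-update: grad_H f = W^T (WH - X), then project onto H >= 0.
        L_h = np.linalg.norm(W.T @ W, 2)
        H = np.maximum(H - (W.T @ (W @ H - X)) / L_h, 0.0)
    return W, H
```

With the 1/L step size, each projected gradient step does not increase f for its own block, which is exactly the descent condition (2).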


Embedding extrapolation into the update

Update W:        W_{k+1} = [Y_k − α_k^Y ∇_Y f(Y_k, H_k)]_+
Extrapolate W:   Y_{k+1} = W_{k+1} + β_k^W (W_{k+1} − W_k)
Update H:        H_{k+1} = [G_k − α_k^H ∇_H f(W_{k+1}, G_k)]_+
Extrapolate H:   G_{k+1} = H_{k+1} + β_k^H (H_{k+1} − H_k),

where Y and G are the pairing variables of W and H, respectively¹.

¹ For initialization, Y_0 = W_0 and G_0 = H_0.
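A sketch of one pass of this extrapolated scheme, again assuming 1/L block step sizes; the β values are passed in, and their tuning is sketched on the following slides:

```python
import numpy as np

def extrapolated_pass(X, W, H, Y, G, beta_w, beta_h):
    """One pass of the extrapolated update: the gradient steps are taken
    at the pairing variables Y and G instead of at W and H."""
    # Update W from the extrapolated point Y, then extrapolate.
    L_w = np.linalg.norm(H @ H.T, 2)
    W_new = np.maximum(Y - ((Y @ H - X) @ H.T) / L_w, 0.0)
    Y_new = W_new + beta_w * (W_new - W)
    # Update H from the extrapolated point G, then extrapolate.
    L_h = np.linalg.norm(W_new.T @ W_new, 2)
    H_new = np.maximum(G - (W_new.T @ (W_new @ G - X)) / L_h, 0.0)
    G_new = H_new + beta_h * (H_new - H)
    return W_new, H_new, Y_new, G_new
```

As in the footnote, the pairing variables are initialized as Y_0 = W_0 and G_0 = H_0.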


The extrapolation parameter βk

Update W:        W_{k+1} = [Y_k − α_k^Y ∇_Y f(Y_k, H_k)]_+   (5)
Extrapolate W:   Y_{k+1} = W_{k+1} + β_k^W (W_{k+1} − W_k)   (6)
Update H:        H_{k+1} = [G_k − α_k^H ∇_H f(W_{k+1}, G_k)]_+   (7)
Extrapolate H:   G_{k+1} = H_{k+1} + β_k^H (H_{k+1} − H_k)   (8)

The parameter βk is the critical part of the scheme :

• βk is dynamically updated at each iteration k

• βk ∈ [0, 1].

• If β_k = 0, the scheme (5)-(8) reduces to the plain projected gradient scheme (3)-(4).

• Nesterov's acceleration, which is optimal in terms of convergence rate, gives an explicit closed-form formula for β_k in the convex setting.

• Since NMF is non-convex, it is not known how to determine β_k optimally.


A numerical scheme to tune β for NMF

The idea is to update β_k based on the increase or decrease of the objective function. Let e_k = f(W_k, H_k); then

β_{k+1} = min{γ β_k, β̄_k}   if e_k ≤ e_{k−1},
β_{k+1} = β_k / η             if e_k > e_{k−1},   (9)

where γ > 1, γ̄ > 1 and η > 1 are constants and β̄_0 = 1, with the ceiling update

β̄_{k+1} = min{γ̄ β̄_k, 1}   if e_k ≤ e_{k−1} and β̄_k < 1,
β̄_{k+1} = β_k                if e_k > e_{k−1}.   (10)
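A small sketch of the update rule (9)-(10); the specific values of γ, γ̄, η and the starting β below are illustrative placeholders, not the values recommended in the paper:

```python
def update_beta(beta, beta_bar, e_curr, e_prev,
                gamma=1.05, gamma_bar=1.01, eta=1.5):
    """One step of (9)-(10): grow beta (capped by the ceiling beta_bar)
    when the error decreases, otherwise shrink beta and set the ceiling
    to the value that just failed."""
    if e_curr <= e_prev:                        # error decreased
        beta_new = min(gamma * beta, beta_bar)
        beta_bar_new = min(gamma_bar * beta_bar, 1.0)
    else:                                       # error increased
        beta_new = beta / eta
        beta_bar_new = beta
    return beta_new, beta_bar_new

# Illustrative trace on a mock error sequence e_k:
beta, beta_bar = 0.5, 1.0
errors = [10.0, 9.0, 8.5, 9.2, 8.0]
for k in range(1, len(errors)):
    beta, beta_bar = update_beta(beta, beta_bar, errors[k], errors[k - 1])
    print(k, round(beta, 3), round(beta_bar, 3))
```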


The logic flow of updating βk

Case 1. The error decreases : ek ≤ ek−1

• It means the current β value is “good”

• We can be more ambitious with the extrapolation
  ◦ i.e., we increase the value of β
  ◦ How: multiply it by a growth factor γ > 1

    β_{k+1} = β_k γ

• Note that the growth of β cannot be indefinite
  ◦ i.e., we put a ceiling parameter β̄ to upper-bound the growth
  ◦ How: use a min

    β_{k+1} = min{β_k γ, β̄_k}

  ◦ β̄ itself is also updated dynamically, with a growth factor γ̄ and upper bound 1.


The logic flow of updating βk

Case 2. The error increases : ek > ek−1

• It means the current β value is “bad” (too large)

• We become less ambitious with the extrapolation
  ◦ i.e., we decrease the value of β
  ◦ How: divide it by a decay factor η > 1

    β_{k+1} = β_k / η

• As f is typically continuous and smooth, if β_k was too large, such a value of β will very likely also be too large at iteration k + 1
  ◦ i.e., we want to prevent β from growing back to β_k (the “bad” value) too soon
  ◦ How: we set the ceiling parameter

    β̄_{k+1} = β_k


A toy example

An example showing that the extrapolated scheme (E-PGD) has much faster convergence than the standard scheme (PGD).

This numerical scheme is found to be effective in accelerating NMF algorithms. See the paper for more examples and for comparisons with other acceleration schemes.

Chain structure

There are variations on the chain structure of the update, for example:

• Update W → extrapolate W → update H → extrapolate H

• Update W → extrapolate W → update H → extrapolate H → project H

• Update W → update H → extrapolate W → extrapolate H

For the comparison of these three schemes, see the paper.


Future work: to analyze why certain chain structures have better performance than others.


Tensor extension

Recent attempt: extend the idea of extrapolation to the tensor case; more precisely, to the Non-negative Canonical Polyadic Decomposition (NNCPD).

min_{U,V,W} f(U,V,W) = ‖Y − U ∗ V ∗ W‖ = ‖Y − Σ_{i=1}^r u_i ∗ v_i ∗ w_i‖   s.t.   U ≥ 0, V ≥ 0, W ≥ 0,

where ∗ denotes the outer product and u_i, v_i, w_i are the columns of U, V, W.

Preliminary experiments showed that the approach is very promising and is able to significantly accelerate NNCPD algorithms.

Unsolved problem: NNCPD has even higher variability in the chain structure.
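To make the NNCPD objective concrete, a minimal NumPy sketch that forms the rank-r CP reconstruction as a sum of outer products; the dimensions and names are illustrative:

```python
import numpy as np

def nncpd_objective(Y, U, V, W):
    """||Y - sum_i u_i * v_i * w_i||_F for non-negative factor matrices
    U (I x r), V (J x r), W (K x r); '*' is the outer product."""
    Y_hat = np.einsum('ir,jr,kr->ijk', U, V, W)   # rank-r CP reconstruction
    return np.linalg.norm(Y - Y_hat)

rng = np.random.default_rng(0)
I, J, K, r = 10, 12, 14, 3
U, V, W = (rng.random((d, r)) for d in (I, J, K))
Y = np.einsum('ir,jr,kr->ijk', U, V, W)           # exact non-negative CP tensor
print(nncpd_objective(Y, U, V, W))                # 0.0 at the ground truth
```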

Understanding the relationship between the data structure (rank, size of each mode) and the chain structure will be crucial.


Separable NMF relaxes NP-hardness of NMF

Geometrically, NMF describes a non-negative cone: the data points are encapsulated inside a non-negative cone generated by the basis W.

Under the separability assumption, i.e., when the basis W is also present within the data cloud, the NMF problem, then called Separable NMF (SNMF), becomes solvable in polynomial time.


Volume Regularized Non-negative Matrix Factorizations

Separable NMF is a well studied problem.

Separability assumption is quite strong.

So we relax it via minimum-volume NMF, or Volume-Regularized NMF (VRNMF):

argmin_{W,H} (1/2) ‖X − WH‖_F² + λ V(W)   s.t.   W ≥ 0, H ≥ 0, Hᵀ 1_r ≤ 1_n.

Geometrically, the goal is to find the underlying generating vertices of the data by fitting a non-negative convex hull with minimum volume.


VR-NMF

Four different volume functions V are studied :

• log-determinant: log det(WᵀW + δ I_r)

• Determinant: det(WᵀW)

• Frobenius norm: ‖W‖_F²

• Nuclear norm: ‖W‖_*

These functions are all non-decreasing functions of the singular values of W, so minimizing them indirectly minimizes the “volume” of the convex hull spanned by W (a code sketch of these four surrogates is given after the note below).

Note

• The “true” volume function of the convex hull of W exists, but it is computationally very expensive.

• Computing the exact volume of a convex polytope from its vertices in high dimension is a long-standing hard problem.
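A sketch of the four volume surrogates in NumPy; the value of δ used here is illustrative and may differ from the paper's choice:

```python
import numpy as np

def volume_surrogates(W, delta=1.0):
    """The four volume proxies; all are non-decreasing functions of the
    singular values of W. delta is an illustrative regularization value."""
    G = W.T @ W                               # r x r Gram matrix
    r = G.shape[0]
    return {
        'logdet':    np.log(np.linalg.det(G + delta * np.eye(r))),
        'det':       np.linalg.det(G),
        'frobenius': np.linalg.norm(W, 'fro') ** 2,
        'nuclear':   np.linalg.norm(W, 'nuc'),   # sum of singular values of W
    }
```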


VR-NMF

What has been done

• Algorithms for VR-NMF with the four volume functions

• Model comparisons between the different volume functions: logdet seems to be the better choice

• Proposed algorithms perform better than state-of-the-art algorithms


[Figure: ground truth and reconstructions (2D projection). Shown: data points x_i, the true vertices W_true and their hull, and the vertices/hulls recovered by SPA (Gillis 2014), Det (this work), and RVolMin (Fu 2016). Setting: (m, n, r) = (100, 3000, 4), p = [0.9, 0.8, 0.7, 0.6].]

Future work on VR-NMF

• Theoretical limit of VR-NMF: if the data points are concentrated in the center of the convex polygon, it is impossible to recover the vertices by VRNMF. A future study will analyze this theoretical limit of VRNMF: to come up with a phase transition boundary for such vertex recovery problems².

• Rank-deficient case: recently it was found that when the input factorization rank r is larger than the true underlying dimension of the data points, VRNMF with the log-determinant volume regularizer is still able to find the ground-truth vertices. Another research direction will be to understand why this is so.

² Such a characterization of the transition boundary exists when all the data points are equidistant from the vertices. However, a more general characterization of the transition boundary, when the data points are at different distances from the vertices, is still an open problem.


Application domains: hyperspectral imaging

Examples of applications of NMF to image segmentation of hyperspectral images.


Application domains : audio source separation

Sheet music of Bach's Prelude in C Major.
13 distinct notes: B3, C4, D4, E4, F#4, G4, A4, C5, D5, E5, F5, G5, A5.


W, H obtained from β-NMF with the logdet regularizer and r = 16 ≥ 13. It can be observed that, for an overestimated factorization rank r, the minimum-volume regularization automatically sets some components to zero (marked with the * symbol).


Last page – the works : past, present, future

Journal paper
• A.-Gillis, Accelerating Nonnegative Matrix Factorization Algorithms using Extrapolation, Neural Computation, vol. 31 (2), pp 417-439, Feb 2019, MIT Press

Conference papers
• Leplat-A.-Gillis, Minimum-Volume Rank-Deficient Non-negative Matrix Factorizations, to be presented at IEEE ICASSP 2019, Brighton, UK, May 2019

• A.-Gillis, Volume regularized Non-negative Matrix Factorizations, IEEE WHISPERS 2018, Amsterdam, NL, 25 Sept 2018

Work in progress
• A.-Gillis, Non-negative Matrix Factorization with Volume Regularizations for Asymmetric Non-separable Data, in preparation, to be submitted to IEEE JSTARS

• Leplat-A.-Gillis, β-NMF for blind audio source separation with minimum volume regularization

• Cohen-A., Accelerating Non-negative Canonical Polyadic Decomposition using extrapolation, to be submitted to GRETSI 2019 (in French!)

Future working directions
• Volume related
  ◦ Phase transition boundary of asymmetric non-separability
  ◦ Rank-deficient case
• Acceleration related
  ◦ Chain structure of the acceleration scheme
  ◦ Convergence
• Application related
  ◦ Other applications of interest (e.g. the “translator step” in the Brain-Computer Interface)

End of Presentation
