An introduction to dictionary learning


  • Adaptive representations Dimension reduction Inverse problems Conclusion

    An introduction to dictionary learning

    Pierre CHAINAIS

    Sept. 17th 2014

    P. Chainais - Centrale Lille/LAGIS -INRIA SequeL- REPAR Douai 2014

  • From high to low dimension

    - Data live in a high-dimensional space

    - Significant information lives in a low-dimensional space

    How to capture the "substantifique moelle" (the essential substance)?

    - dimension reduction? (K < p)

    - adaptive representation? (K < p, K = p, or K > p)

  • Motivations

    Learning a 'good' dictionary is useful for:

    - compression,

    - classification or segmentation,

    - denoising,

    - inpainting,

    - blind source separation,

    - recommendation systems...

    Two main approaches:

    - Parametric dictionaries: combining existing functions...

    - Non-parametric dictionaries: matrix factorization

  • Adaptive representation

    1. Mathematical construction:

    DCT, Fourier: stationary signals...

    Time-frequency atoms: non-stationary signals

    Wavelets: stationary + transients + multiscale

    Curvelets: wavelets + contours...

    2. Statistical learning:

    representative examples: vector quantization (K-means),

    orthonormal basis: PCA / SVD (Karhunen-Loève)

    a family of functions for linear decomposition = Dictionary

  • Dimension reduction

    Feature selection

    Select the most representative features (forward, backward...)

    + Advantage: interpretability

    Limitation: uses predefined features

    Other approach: build/learn a mapping to a new representation


  • Dimension reduction

    Vector quantization: clustering, K-means...

    Objective: find representative examples for groups of data points.

    11 vowels ⇒ K = 11 classes described by D = 10 features (cf. time-frequency analysis)

    [Figure: "Classification in Reduced Subspace", scatter plot of the vowel training data against Canonical Coordinate 1 and Canonical Coordinate 2.]

    FIGURE 4.11. Decision boundaries for the vowel training data, in the two-dimensional subspace spanned by the first two canonical variates. Note that in any higher-dimensional subspace, the decision boundaries are higher-dimensional affine planes, and could not be represented as lines.

    Classification based on M=2 Fisher discriminant components
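    As a concrete illustration of the vector quantization idea above, here is a minimal NumPy sketch of Lloyd's K-means algorithm. The random data and the values K = 11, D = 10 only mimic the dimensions of the vowel example; they are not the actual dataset.

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=K, replace=False)]   # random initialization
        for _ in range(n_iter):
            # assign each point to its nearest representative
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            labels = dists.argmin(axis=1)
            # recompute each representative as the mean of its group
            new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                    else centers[k] for k in range(K)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels

    # illustrative data: N = 500 points with D = 10 features, quantized onto K = 11 representatives
    X = np.random.default_rng(1).normal(size=(500, 10))
    centers, labels = kmeans(X, K=11)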

  • Vectorial quantization of color images

    Compression

    1 pixel = xn = (xR, xV, xB)  ⇒  3D clustering, using K colors only

    [Figure: quantized images for K = 2, K = 3, K = 10, and the original image]
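    A sketch of this color quantization, assuming scikit-learn is available; the random array below merely stands in for a real photograph.

    import numpy as np
    from sklearn.cluster import KMeans

    def quantize_colors(img, K):
        """Vector quantization of pixels: replace each (R, G, B) pixel by the nearest of K learned colors."""
        H, W, _ = img.shape
        pixels = img.reshape(-1, 3).astype(float)            # one 3D point per pixel
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
        quantized = km.cluster_centers_[km.labels_]           # map each pixel to its cluster's color
        return quantized.reshape(H, W, 3)

    # illustrative stand-in for a real image
    img = np.random.default_rng(0).uniform(0, 1, size=(64, 64, 3))
    img_q10 = quantize_colors(img, K=10)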

  • Principal Component Analysis (PCA)

How spread out are the data points?

    [Scatter plot of the data]

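    A minimal sketch of the question above: compute the principal directions and the variance of the data along each one. The 2D synthetic cloud is illustrative only.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated 2D cloud

    Xc = X - X.mean(axis=0)                    # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_var = s ** 2 / (len(X) - 1)      # variance of the data along each principal axis
    print("principal directions (rows):", Vt)
    print("variance along each direction:", explained_var)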

  • Principal Component Analysis (PCA)

    Example: face recognition

    - D = number of pixels per image, for instance 19x19 = 361,

    - xn ∈ R^D is an image of a face,

    - xni = intensity of the i-th pixel of image n,

    X_{D×N} ≈ U_{D×M} × Z_{M×N}

    Eigen-faces [Turk and Pentland, 1991]

    - d = number of pixels
    - each xi ∈ R^d is a face image
    - xji = intensity of the j-th pixel in image i

    X_{d×n} ≈ U_{d×k} Z_{k×n},   Z = ( z1 | · · · | zn )

    Idea: zi is a more "meaningful" representation of the i-th face than xi.
    The zi can be used for nearest-neighbor classification.
    Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k. Why no time savings for a linear classifier?

    Interest:

    - extraction of generic characteristics,

    - use the zj for classification (K-NN, ...),

    - dimension reduction ⇒ speed

    [Turk & Pentland 1991]
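    A sketch of the X ≈ U Z factorization used for eigen-faces, via a truncated SVD of centered data. The matrix X below is random and only stands in for a real face dataset; D = 361, N = 500, M = 40 are illustrative choices.

    import numpy as np

    D, N, M = 361, 500, 40                        # 19x19-pixel images, N faces, M components kept
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(D, N))        # stand-in for real face images, one per column

    mean_face = X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
    U_M = U[:, :M]                                # "eigen-faces": a D x M orthonormal basis
    Z = U_M.T @ (X - mean_face)                   # M x N codes: column z_i represents face i

    X_approx = mean_face + U_M @ Z                # rank-M reconstruction, X ≈ U Z
    print("relative reconstruction error:",
          np.linalg.norm(X - X_approx) / np.linalg.norm(X))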

  • Inverse problems in image processing

    Denoising

  • Inverse problems in image processing

    Deconvolution

  • Inverse problems & dictionary learning

    Main problem: ill-posed inverse problems: unknown complexity (diversity) / structure (content)

    Main purpose:

    - to propose a suitable model (structure)

    - to discover the number of degrees of freedom (complexity)

    Main tools to promote sparsity:

    - a wide range of penalized optimization formulations (L1...)

    - model selection deals with discovering complexity

    Main interest:

    - efficient tools and algorithms in optimization,

    - different approaches to promote (structured) sparsity,

    - theorems to control convergence properties...

  • Image processing and dictionary learning

    Redundant dictionaries

    (Excerpt from IEEE Signal Processing Magazine, March 2011)

    [...] dimension of the sampled data to the effective dimension of the underlying process without sensible penalty in the subsequent data analysis procedure.

An intuitive way to approach this dimensionality reduction problem is first to look at what generates the dimensionality gap between the physical processes and the observations. The most common reason for this gap is the difference between the representation of data defined by the sensor and the representation in the physical space. In some cases, this discrepancy is, for example, a simple linear transform of the representation space, which can be determined by the well-known principal component analysis (PCA) [1] method. It may however happen that the sensors observe simultaneously two or more processes with causes lying within different subspaces. Other methods such as independent component analysis (ICA) [2] are required to understand the different processes behind the observed data. ICA is able to separate the different causes or sources by analyzing the statistical characteristics of the data set and minimizing the mutual information between the observed samples. However, ICA techniques respect some orthogonality conditions such that the maximal number of causes is often limited to the signal dimension. In Figure 1(a), we show some examples of noisy images whose underlying causes are linear combinations of two English letters chosen from a dictionary in Figure 1(b). These images are 4 × 4 pixels, hence their dimensionality in the pixel space is 16, while the number of causes is 20 (total number of letters). When applied to 5,000 randomly chosen noisy samples of these letters, PCA finds a linear transform of the pixel space into another 16-dimensional space represented by vectors in Figure 1(c). This is done by finding the directions in the original space with the largest variance. However, this representation does not identify the processes that generate the data, i.e., it does not find our 20 letters. ICA [2] differs from PCA because it is able to separate sources not only with respect to the second order correlations in a data set, but also with respect to higher order statistics. However, since the maximal number of causes is equivalent to the signal dimension in the standard ICA, the subspace vectors found by ICA in the example of Figure 1(d) do not explain the underlying letters.

The obvious question is: Why should we constrain our sensors to observe only a limited number of processes? Why do we need to respect orthogonality constraints in the data representation subspace? There is no reason to believe that the number of all observable processes in nature is smaller than the maximal dimension in existing sensors. If we look for an example in a 128 × 128 dimensional space of face images for all the people in the world, we can imagine that all the images of a single person belong to the same subspace within our 16,384-dimensional space, but we cannot reasonably accept that the total number of people in the world is smaller than our space dimension. We conclude that the representation of data could be overcomplete, i.e., that the number of causes or the number of subspaces used for data description can be greater than the signal dimension.

Where does the dimensionality reduction occur in this case? The answer to this question lies in one of the most important principles in sensory coding—efficiency, as first outlined by Barlow [3]. Although the number of possible processes in the world is huge, the number of causes that our sensors observe at a single moment is much smaller: the observed processes are sparse in the set of all possible causes. In other words, although the number of representation subspaces is large, only few ones will contain data samples from sensor measurements. By identifying these few subspaces, we find the representation in the reduced space.

An important question arises here: given the observed data, how to determine the subspaces where the data lie? The choice of these subspaces is crucial for efficient dimensionality reduction, but it is not trivial. This question has triggered the emergence of a new and promising research field called dictionary learning. It focuses on the development of novel algorithms for building dictionaries of atoms or subspaces that provide efficient representations of classes of signals. Sparsity constraints are keys to most of the algorithms that solve the dictionary learning problems; they enforce the identification of the most important causes of the observed data and favor the accurate representation of the relevant information. Figure 1(e) shows that one of the first dictionary learning methods called sparse coding [4] succeeds in learning all 20 letters that generate 5,000 observations.


[FIG1] Learning underlying causes from a set of noisy observations of English letters. A subset of 20 noisy 4 × 4 images is shown in (a). These samples have been generated as linear combinations of two letters randomly chosen from the alphabet in (b), and they have been corrupted by additive Gaussian noise. When run on 5,000 such samples, PCA and ICA find the same number of components as the dimension of the signal. Therefore, they cannot find the underlying 20 letters. Sparse coding [4] learns an overcomplete dictionary of 20 components, thus it can separate these causes and find all 20 letters from the original alphabet. K-SVD [5] performs similarly, i.e., it finds almost all of the letters. However, since the implementation of K-SVD [5] uses MP for the sparse approximation step, it converges to a local minimum resulting in some repeated letters in the learned dictionary. (a) Noisy samples; (b) original causes; (c) PCA; (d) ICA; (e) sparse coding; and (f) K-SVD.

    (a) samples (b) original dict. (c) PCA (d) ICA (e) sparse code (f) K-SVD
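    The letters experiment of [FIG1] can be mimicked with an off-the-shelf learner. Below is a sketch using scikit-learn's DictionaryLearning on synthetic data; the random "letter" atoms, noise level, and 2-sparse coding are assumptions made for illustration, not the original experiment.

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)

    # 20 hypothetical "letter" atoms in a 16-dimensional (4 x 4 pixel) space
    true_atoms = rng.normal(size=(20, 16))
    true_atoms /= np.linalg.norm(true_atoms, axis=1, keepdims=True)

    # 5,000 noisy samples, each the sum of two randomly chosen atoms
    idx = rng.integers(0, 20, size=(5000, 2))
    Y = true_atoms[idx[:, 0]] + true_atoms[idx[:, 1]] + 0.05 * rng.normal(size=(5000, 16))

    # learn an overcomplete dictionary (20 atoms > 16 dimensions) under a 2-sparse constraint
    dl = DictionaryLearning(n_components=20, transform_algorithm='omp',
                            transform_n_nonzero_coefs=2, max_iter=50, random_state=0)
    dl.fit(Y)
    learned_atoms = dl.components_    # 20 x 16 array, to be compared with true_atoms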


  • Image processing and dictionary learning

    Seminal papers

    IEEE Transactions on Signal Processing, vol. 54, no. 11, November 2006

    K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation

    Michal Aharon, Michael Elad, and Alfred Bruckstein

    Abstract—In recent years there has been a growing interest in the study of sparse representation of signals. Using an overcomplete dictionary that contains prototype signal-atoms, signals are described by sparse linear combinations of these atoms. Applications that use sparse representation are many and include compression, regularization in inverse problems, feature extraction, and more. Recent activity in this field has concentrated mainly on the study of pursuit algorithms that decompose signals with respect to a given dictionary. Designing dictionaries to better fit the above model can be done by either selecting one from a prespecified set of linear transforms or adapting the dictionary to a set of training signals. Both of these techniques have been considered, but this topic is largely still open. In this paper we propose a novel algorithm for adapting dictionaries in order to achieve sparse signal representations. Given a set of training signals, we seek the dictionary that leads to the best representation for each member in this set, under strict sparsity constraints. We present a new method—the K-SVD algorithm—generalizing the K-means clustering process. K-SVD is an iterative method that alternates between sparse coding of the examples based on the current dictionary and a process of updating the dictionary atoms to better fit the data. The update of the dictionary columns is combined with an update of the sparse representations, thereby accelerating convergence. The K-SVD algorithm is flexible and can work with any pursuit method (e.g., basis pursuit, FOCUSS, or matching pursuit). We analyze this algorithm and demonstrate its results both on synthetic tests and in applications on real image data.

    Index Terms—Atom decomposition, basis pursuit, codebook, dictionary, FOCUSS, gain-shape VQ, K-means, K-SVD, matching pursuit, sparse representation, training, vector quantization.

    I. INTRODUCTION

    A. Sparse Representation of Signals

    RECENT years have witnessed a growing interest in the search for sparse representations of signals. Using an overcomplete dictionary matrix D ∈ R^{n×K} that contains K prototype signal-atoms for columns, a signal y ∈ R^n can be represented as a sparse linear combination of these atoms. The representation of y may either be exact, y = Dx, or approximate, y ≈ Dx, satisfying ||y − Dx||_p ≤ ε. The vector x ∈ R^K contains the representation coefficients of the signal. In approximation methods, typical norms used for measuring the deviation are the l_p-norms for p = 1, 2, and ∞. In this paper, we shall concentrate on the case of p = 2.

    If n < K and D is a full-rank matrix, an infinite number of solutions are available for the representation problem, hence constraints on the solution must be set. The solution with the fewest number of nonzero coefficients is certainly an appealing representation. This sparsest representation is the solution of either

    min_x ||x||_0   subject to   y = Dx,                  (1)

    or

    min_x ||x||_0   subject to   ||y − Dx||_2 ≤ ε,        (2)

    where ||·||_0 is the l_0 norm, counting the nonzero entries of a vector.

    Applications that can benefit from the sparsity and overcompleteness concepts (together or separately) include compression, regularization in inverse problems, feature extraction, and more. Indeed, the success of the JPEG2000 coding standard can be attributed to the sparsity of the wavelet coefficients of natural images [1]. In denoising, wavelet methods and shift-invariant variations that exploit overcomplete representation are among the most effective known algorithms for this task [2]–[5]. Sparsity and overcompleteness have been successfully used for dynamic range compression in images [6], separation of texture and cartoon content in images [7], [8], inpainting [9], and more.

    Extraction of the sparsest representation is a hard problem that has been extensively investigated in the past few years. We review some of the most popular methods in Section II. In all those methods, there is a preliminary assumption that the dictionary is known and fixed. In this paper, we address the issue of designing the proper dictionary in order to better fit the sparsity model imposed.

    B. The Choice of the Dictionary

    An overcomplete dictionary that leads to sparse representations can either be chosen as a prespecified set of functions or designed by adapting its content to fit a given set of signal examples.

    Choosing a prespecified transform matrix is appealing because it is simpler. Also, in many cases it leads to simple and fast algorithms for the evaluation of the sparse representation. This is indeed the case for overcomplete wavelets, curvelets, contourlets, steerable wavelet filters, short-time Fourier transforms, and more. Preference is typically given to tight frames that can easily be pseudoinverted. The success of such dictionaries in applications depends on how suitable they are to sparsely describe the signals in question. Multiscale analysis with oriented basis [...]

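    The sparse approximation problems (1)-(2) are typically attacked with pursuit algorithms. Below is a minimal orthogonal matching pursuit sketch, not the paper's code; the toy dictionary and sparse vector are illustrative.

    import numpy as np

    def omp(D, y, n_nonzero):
        """Greedy orthogonal matching pursuit for min ||x||_0 subject to y ~ D x."""
        residual, support = y.copy(), []
        x = np.zeros(D.shape[1])
        for _ in range(n_nonzero):
            support.append(int(np.argmax(np.abs(D.T @ residual))))     # most correlated atom
            coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)   # refit on the support
            residual = y - D[:, support] @ coef
        x[support] = coef
        return x

    # illustrative toy problem: recover a 3-sparse vector in a 20-atom dictionary over R^16
    rng = np.random.default_rng(0)
    D = rng.normal(size=(16, 20))
    D /= np.linalg.norm(D, axis=0)
    x_true = np.zeros(20)
    x_true[[2, 7, 13]] = [1.0, -0.5, 2.0]
    y = D @ x_true
    x_hat = omp(D, y, n_nonzero=3)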

  • Inverse problems and dictionary learning

    Restore the initial image ⇒ good representation = linear regression

    - H: damaging operator (blur, mask...)

    - D: dictionary (cosines, wavelets, learnt atoms...)

    - α: coefficients, X_i = Σ_j α_ij D_j

    Y = HX + n,   X = Dα

    and X is sparse on D, to be discovered... Prior: α is sparse

    (Embedded slide, M. Elad: "The Sparsity-Based Synthesis Model")

    We assume the existence of a synthesis dictionary D whose columns are the atom signals. Signals are modeled as sparse linear combinations of the dictionary atoms: x = Dα. We seek a sparsity of α, meaning that it is assumed to contain mostly zeros. This model is typically referred to as the synthesis sparse and redundant representation model for signals. This model became very popular and very successful in the past decade.


  • Inverse problems and dictionary learning

    Same setting as above: Y = HX + n, X = Dα, with a sparsity prior on α.

    Typical optimization approach:

    Y = (HX + Gaussian noise) + regularization

    (D, α) = argmin_{D,α} ||Y − H Dα||² + λ ||α||_{L0}

    where Dα = X̂ and the L0 penalty counts the K nonzero components.

    ⇓

    LASSO, Forward-Backward, proximal methods...

  • Inverse problems and dictionary learning

    Same setting as above: Y = HX + n, X = Dα, with a sparsity prior on α.

    Typical optimization approach:

    Y = (HX + Gaussian noise) + regularization

    (D, α) = argmin_{D,α} ||Y − H Dα||² + λ ||α||_{L1}

    where Dα = X̂ and the L1 penalty corresponds to a Laplace prior.

    ⇓

    LASSO, Forward-Backward, proximal methods...
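    With D and H fixed, the L1-penalized problem above is exactly what proximal methods such as ISTA / Forward-Backward address. Here is a minimal ISTA sketch for the sparse coding step; the denoising-style toy problem with H = identity and the step-size choice are assumptions made for illustration.

    import numpy as np

    def soft_threshold(v, t):
        """Proximal operator of t * ||.||_1."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def ista(Y, H, D, lam, n_iter=200):
        """Minimize 0.5 * ||Y - H D a||^2 + lam * ||a||_1 over a, with H and D fixed."""
        A = H @ D
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient of the smooth term
        a = np.zeros(A.shape[1])
        for _ in range(n_iter):
            grad = A.T @ (A @ a - Y)           # gradient of the quadratic data-fit term
            a = soft_threshold(a - grad / L, lam / L)
        return a

    # illustrative denoising-style problem: H = identity, random normalized dictionary
    rng = np.random.default_rng(0)
    p, K = 64, 128
    D = rng.normal(size=(p, K))
    D /= np.linalg.norm(D, axis=0)
    H = np.eye(p)
    a_true = np.zeros(K)
    a_true[rng.choice(K, size=5, replace=False)] = rng.normal(size=5)
    Y = H @ D @ a_true + 0.01 * rng.normal(size=p)
    a_hat = ista(Y, H, D, lam=0.05)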

  • Dictionary learning

    - Searching for an adaptive representation

    ... for dimension reduction (correlated atoms, K < p)

    ... for an orthonormal basis (PCA, K = p),

    ... for a redundant dict. / sparse representation (K > p)

    - Optimization problem:

    Y = (HX + Gaussian noise) + regularization

    (D, α) = argmin_{D,α} ||Y − H Dα||² + λ ||α||_{L0},   with Dα = X̂

    ⇓

    Alternate optimization (proximal methods...)

  • Dictionary learning

    Same slide with an L1 penalty:

    (D, α) = argmin_{D,α} ||Y − H Dα||² + λ ||α||_{L1},   with Dα = X̂

    ⇓

    Alternate optimization (proximal methods...)
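    A minimal sketch of the alternate optimization loop above: sparse coding with the dictionary fixed (greedy OMP here), then a least-squares dictionary update in the style of MOD rather than K-SVD. The sizes and the random training data are illustrative assumptions.

    import numpy as np

    def omp_code(D, y, n_nonzero):
        """Sparse coding step: greedy OMP for one training signal y (cf. the pursuit sketch earlier)."""
        residual, support = y.copy(), []
        for _ in range(n_nonzero):
            support.append(int(np.argmax(np.abs(D.T @ residual))))
            coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
            residual = y - D[:, support] @ coef
        x = np.zeros(D.shape[1])
        x[support] = coef
        return x

    def learn_dictionary(Y, K, n_nonzero=3, n_iter=20, seed=0):
        """Alternate optimization: sparse coding with D fixed, then a MOD least-squares update of D."""
        rng = np.random.default_rng(seed)
        D = rng.normal(size=(Y.shape[0], K))
        D /= np.linalg.norm(D, axis=0)
        for _ in range(n_iter):
            A = np.column_stack([omp_code(D, y, n_nonzero) for y in Y.T])   # K x N codes
            D = Y @ np.linalg.pinv(A)                                        # minimize ||Y - D A||_F over D
            D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)                # renormalize the atoms
        return D, A

    # illustrative training set: N = 1000 signals of dimension p = 64, redundant dictionary K = 128 > p
    rng = np.random.default_rng(1)
    Y = rng.normal(size=(64, 1000))
    D, A = learn_dictionary(Y, K=128, n_nonzero=3, n_iter=5)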

  • Dictionary learning

    - Searching for an adaptive representation

    to solve inverse problems, for compression, for classification, ...

    - Many open questions: optimize the size of the dictionary? optimize sparsity? ...

    - Many generalizations: distributed setting, task-driven dictionaries, multiscale dictionary, fast transforms, multispectral/hyperspectral, multimodal dictionaries...

    (Excerpt from IEEE Signal Processing Magazine, March 2011)

    [...] considerably affect the design of learning strategies as well as the approximation performance.

    APPLICATIONS OF DICTIONARY LEARNING

    Dictionary learning for sparse signal approximation has found successful applications in several domains. For example, it has been applied to medical imaging and representation of audio and visual data. We overview here some of the main applications in these directions.

    MEDICAL IMAGING

    Dictionary learning has the interesting potential to reveal a priori unknown statistics of certain types of signals captured by different measurement devices. An important example are medical signals, such as electroencephalogram (EEG), electrocardiography (ECG), magnetic resonance imaging (MRI), functional MRI (fMRI), and ultrasound tomography (UST), where different physical causes produce the observed signals. It is crucial, however, that representation, denoising, and analysis of these signals are performed in the right signal subspace, such that the underlying physical causes of the observed signals can be identified. Learning of components in ECG signals facilitates ventricular cancellation and atrial modeling in the ECG of patients suffering from atrial fibrillation [27]. Overcomplete dictionaries learned from MRI scans of breast tissues have been shown to provide an excellent representation space for reconstructing images of breast tissue obtained by the UST scanner [28], which drastically reduces the imaging cost compared to MRI. Moreover, standard breast screening techniques, such as the X-ray projection mammography and computed tomography, can potentially exploit highly sparse representations in learned dictionaries [29]. Analysis of other signals, such as neural signals obtained by EEG, multielectrode arrays, or two-photon microscopy, could also largely benefit from adapted representations obtained by dictionary learning methods.

    REPRESENTATION OF AUDIO AND VISUAL DATA

    Dictionary learning has introduced significant progress in denoising of speech [30] and images [5], and in audio coding and source separation [16], [31], where it is very important to capture the underlying causes or the most important constitutive components of the target signals. The probabilistic dictionary learning framework has also been proposed for modeling natural videos. These methods explicitly model the separation of the invariant signal part given by the image content and the varying part represented by the motion. Learning under these separation constraints can be achieved using the bilinear model [32], [33], or the phase coding model [34]. In addition to learning the dictionary elements for the visual content, these methods also learn the sparse components of the invariant part (e.g., translational motion).

    There exist many examples in nature where a physical process is observed or measured under different conditions. This results in sets of correlated signals whose common part corresponds to the underlying physical cause. However, different observation conditions introduce variability in the measured signals, such that the common cause is usually difficult to extract. Dictionary learning methods based on ML and MAP can be extended by modifying the objective function such that the learning procedures identify the proper subspace for the joint analysis of multiple signals. This permits to learn the underlying causes under different observation conditions. Such modified learning procedures have been applied to audio-visual signals [35] and to multiview imaging [36]. The synchrony between audio and visual signals is exploited in [35] to extract and learn the components of their generating cause, that is, human speech. A multimodal dictionary is learned with elements that have an audio part and a video part corresponding to the movement of the lips that generate the audio signal. An example of the learned atom for the word "one" is shown in Figure 4. One important contribution of this work certainly lies in its benefits towards understanding and modeling the integration of audio and visual sensory information in the cortex.

    In stereo vision, the same three-dimensional (3-D) scene is observed from different viewpoints, which produce correlated multiview images. Due to the projective properties of light rays, the correlation between multiview images has to comply with epipolar geometry constraints. Dictionaries can be learned such that they efficiently describe the content of natural stereo images and simultaneously permit to capture the geometric correlation between multiview images [36]. The correlation between images is modeled by the local atom transforms, which is made feasible by the use of geometric dictionaries built on scaling, rotation and shifts of a generating function. Learning is based on an ML objective that includes the probability that left image yL and right image yR are well represented by a dictionary F, and the probability that corresponding image components in different views satisfy the epipolar constraint

    F* = arg max_F [ log P(yL, yR, D = 0 | F) ],   (6)

    where D = 0 denotes the event when the epipolar geometry is satisfied. This ML objective leads to an energy minimization learning method, where the energy function has three terms: image approximation error term (for both stereo images), the sparsity term, and the multiview geometry term. Dictionary learning is performed in two steps: a sparse approximation step [...]

    [FIG4] Learned audio-visual atom (audio and video parts over time) representing the word "one." Figure used with permission from [35].

  • Ongoing work / REPAR / Centrale Lille / Signal-Image

    - PhD thesis of Hong Phuong Dang:

    Bayesian nonparametric methods for dictionary learning (2013-2016)

    - Post-doc of Sylvain Rousseau (co-supervised with Christelle Garnier):

    Sparse representation and multi-object tracking in video sequences (2014-2015)

    The goal of the post-doc is to explore the potential of sparse representation methods to push back the limits of particle filtering in high dimension (task 4 of the project). The targeted application is multi-object tracking in image sequences.

    - ANR project Bayesian Non Parametrics for Signal & Image processing (BNPSI, 2014-2018)

    Adaptive representations

    Dimension reduction: Feature selection, Clustering, Principal Component Analysis

    Inverse problems: Inverse problems in image processing

    Conclusion