Low-rank double dictionary learning from corrupted data for robust image classification

Yi Rong a,b, Shengwu Xiong a,∗∗, Yongsheng Gao b,∗
a School of Computer Science and Technology, Wuhan University of Technology, China
b School of Engineering, Griffith University, Australia
∗ Corresponding author. ∗∗ Co-corresponding author. E-mail addresses: [email protected] (S. Xiong), yongsheng.gao@griffith.edu.au (Y. Gao).

Pattern Recognition 72 (2017) 419–432. Journal homepage: www.elsevier.com/locate/patcog. http://dx.doi.org/10.1016/j.patcog.2017.06.038. © 2017 Elsevier Ltd. All rights reserved.
Article history: Received 1 November 2016; Revised 1 March 2017; Accepted 30 June 2017; Available online 5 July 2017.
Keywords: Low-rank dictionary learning; Class-specific dictionary; Class-shared dictionary; Image classification; Corrupted training samples; Robustness.

Abstract

In this paper, we propose a novel low-rank double dictionary learning (LRD2L) method for robust image classification tasks in which the training and testing samples are both corrupted. Unlike traditional dictionary learning methods, LRD2L simultaneously learns three components from corrupted training data: 1) a low-rank class-specific sub-dictionary for each class, to capture the most discriminative features of that class, 2) a low-rank class-shared dictionary, which models the common patterns shared by the data of different classes, and 3) a sparse error term to model the noise in the data. Through the low-rank class-shared dictionary and the noise term, the proposed method can effectively prevent the corruptions and noise in training samples from entering the low-rank class-specific sub-dictionaries, which are employed for correctly reconstructing and classifying testing images. Comparative experiments are conducted on three publicly available databases. Experimental results are encouraging, demonstrating the effectiveness of the proposed method and its superiority in performance over state-of-the-art dictionary learning methods.

1. Introduction

Image classification has been attracting much attention from researchers in the areas of pattern recognition and machine learning due to its importance in many real-world applications, such as computational forensics, face recognition and medical diagnosis [1–3]. By exploiting the label information of training images, supervised dictionary learning approaches [4–10,54] have achieved impressive performance in learning discriminative information from training samples for accurately classifying testing images, and are widely used in various applications [11–15]. However, in real-world applications, it is inevitable that the training and/or testing samples are corrupted (i.e., they contain interference and noise components that are not relevant to the classes). These corruptions can appear in various forms (such as occlusions, facial expressions, lighting variations, pose variations, and noise in face recognition tasks) and can cause a significant performance drop that makes a dictionary learning algorithm fail to function.

Recent efforts have been made towards developing robust image classification algorithms to handle the corruptions in image samples [16–19]. These methods are built on the strategy of simultaneously learning the reconstruction coefficients and a weight matrix for each testing sample, where the weight matrix is calculated from the reconstruction residual of the image sample.
Such a weight matrix can be employed to detect the error pixels in each testing sample and thus handle the corruptions in testing samples. Highly competitive performance is achieved when testing samples are corrupted. However, these methods cannot perform well when the training samples are corrupted, as the corruptions in the training data will be carried into the dictionaries, resulting in poor reconstruction of testing samples and thus a failure of classification.

Robust image classification when both training and testing samples are corrupted remains an unsolved practical problem, because the training data obtained in real-world applications are often not clean. The challenge of such a problem lies in the fact that the intra-class variations in the corrupted image samples are larger than their inter-class differences. Inspired by [20], we find that an image in robust image classification tasks often contains three types of information: (1) Class-specific information, i.e., the discriminative features owned by one class. (2) Class-shared information, i.e., the common patterns shared by different classes. For example, in a face recognition task, face images from different classes often share common illumination, pose, and disguise (such as wearing glasses) variations. These variations are not helpful for distinguishing different classes, but without them, images with similar variations cannot be well represented. (3) Sparse noise, which is useless for image classification and often enlarges the intra-class differences. If we can separate the class-specific information from the others and construct a dictionary that captures only such information, this dictionary will be immune to the corruptions and noise in the training samples and can correctly reconstruct and classify testing samples.


Based on the above observation, we propose a novel low-rank double dictionary learning (LRD2L) approach to separate these three types of information. More specifically, given a training sample set with labels, we propose to learn a low-rank class-specific sub-dictionary for each class, which captures the most discriminative features of that class. Simultaneously, a low-rank class-shared dictionary is constructed to represent the common patterns shared by images of different classes. An error term is introduced to approximate the sparse noise contained in the image samples. Through the class-shared dictionary and the noise term, the proposed method can effectively prevent the corruptions and noise in training samples from affecting the construction of the class-specific sub-dictionaries, which are then utilized to correctly reconstruct and classify new testing samples. To the best of our knowledge, the proposed low-rank double dictionary learning (LRD2L) method is the first attempt to integrate the low-rank matrix recovery technique with class-specific and class-shared dictionary learning. By separating the corruptions and sparse noise in the training samples through the class-shared dictionary and the sparse error term, the proposed method is invariant, in theory, to corruptions and noise in both training and testing data. A new alternating optimization algorithm is designed to effectively solve the optimization problem of the proposed method.

The rest of the paper is organized as follows. In Section 2, we briefly review the background and related work. The proposed low-rank double dictionary learning method and the design of its objective function are discussed in detail in Section 3. Section 4 presents our algorithms for practically solving the optimization problem of the proposed method. Experimental results on three public databases are reported in Section 5. Finally, the paper concludes in Section 6.

2. Background

Correctly learning a dictionary from training samples is of vital importance for effectively reconstructing and classifying testing samples in image classification tasks. Zhang and Li [15] proposed a discriminative K-SVD (DK-SVD) method for face recognition by integrating a classification error term into the objective function. The classification accuracy of a linear classifier and the reconstructive power of the dictionary are simultaneously considered, such that the learned dictionary is both reconstructive and discriminative. Jiang et al. [21] proposed a label consistent K-SVD (LC-KSVD) algorithm that associates the label information with the dictionary atoms. They introduced a binary code matrix to force the samples from the same classes to have similar representations, which enhances the discriminability of their coding coefficients. Ramirez et al. [8] proposed a novel classification and clustering framework built on dictionary learning, introducing a structured incoherence regularization term to make the sub-dictionaries as independent as possible. By imposing the Fisher criterion on the sparse coding coefficients, Yang et al. [22,23] proposed a Fisher discrimination dictionary learning (FDDL) method to make the coding coefficients have small within-class scatter and large between-class scatter. Both the reconstruction residual and the distances between the representation coefficients of a testing sample and the mean representation coefficients of each class are taken into account to classify the testing samples, which achieves a higher level of accuracy. Gu et al. [9] proposed a projective dictionary pair learning (DPL) method for image classification, which learns an analysis dictionary and a synthesis dictionary for each class to make the training and testing processes more efficient.

To overcome the limitation of the above algorithms, namely that they do not consider the common features shared by different classes, Zhou et al. [14] developed a novel joint dictionary learning (JDL) algorithm that learns one common dictionary and multiple category-specific dictionaries. By explicitly separating the common features from the category-specific patterns, the learned category-specific dictionaries can be more discriminative. Wang and Kong [20] proposed to learn a particular dictionary (called the particularity) for each class, and a common pattern pool (called the commonality) shared by all the classes. The commonality only complements the reconstruction of input images over the particular dictionary and does not contain any discriminative information. Therefore, the particularity captures the most discriminative features of each class and thus can be more compact and discriminative. Sun et al. [24] learned common and class-specific dictionaries based on a weighted group sparse coding framework, and achieved impressive performance on several public datasets. The group sparsity helps to capture the structural information of the data samples, which improves the discriminability of the learned dictionary. Gao et al. [25] and Zhang et al. [53] applied the common and class-specific dictionary learning strategy to a new fine-grained image categorization problem. By imposing both self-incoherence and cross-incoherence constraints on each dictionary, the performance of fine-grained classification is significantly improved.

The above algorithms work well on clean image data, but cannot generalize well when the image samples are corrupted. To handle the image corruption problem, Yang et al. [26] proposed a robust sparse coding (RSC) scheme to estimate the distribution of corruptions and learn a distribution-induced weight matrix for each testing sample. The weight matrix is used to detect the error pixels and control the contribution of these pixels in the reconstruction process. He et al. [27] developed a correntropy-based sparse model to maximize the correntropy between a testing sample and its reconstructed image. The corrupted pixels, which have small contributions to the correntropy, are assigned smaller weights, such that these corrupted pixels have less influence on the reconstruction residual of the testing sample. Qian et al. [16] and Yang et al. [28] adopted a nuclear norm regularization to describe the structural information of the errors in testing samples, which increases the accuracy of detecting error pixels. They obtained promising results in dealing with structural corruptions in testing images. Following the scheme of RSC, Wei and Wang [17,19] proposed to learn a robust auxiliary dictionary for undersampled face recognition, which can significantly enhance robustness to occlusions in testing samples. The maximum correntropy criterion [27] has also been integrated into conventional dictionary learning to construct robust dictionaries for dealing with outliers and noise in testing samples [18,29]. These algorithms achieve very promising performance when the testing samples are corrupted. However, they still cannot solve the problem of training sample corruption, because the corruptions in the training data will be carried into their dictionaries, resulting in poor reconstructions and subsequently misclassifications of testing samples.

Low-rank matrix recovery theory [55], in principle, has the potential to handle data corruptions in dictionary learning. Let Y = [Y_1, Y_2, ..., Y_C] ∈ ℝ^{d×N} be a training sample set, which consists of N training samples from C different classes. Let Y_i ∈ ℝ^{d×N_i} denote the data set containing all the training samples from the i-th class, where d is the dimension of the images and N_i is the number of training samples in the i-th class, which satisfies N = \sum_{i=1}^{C} N_i.

Suppose the training samples Y are contaminated by a corruption matrix E ∈ ℝ^{d×N}, and let the matrix Y_o ∈ ℝ^{d×N} denote the underlying clean sample matrix, i.e., Y = Y_o + E. The low-rank matrix recovery model aims to recover the clean sample matrix Y_o from the observed sample set Y. This goal can be achieved by solving the following nuclear norm regularized problem:

\min_{Y_o, E} \|Y_o\|_* + \alpha \|E\|_1 \quad \text{s.t.} \quad Y = Y_o + E,   (1)

where α > 0 is a trade-off parameter that balances the influence of the nuclear norm minimization term ‖Y_o‖_* and the error term ‖E‖_1, and ‖·‖_* is the nuclear norm of a matrix, i.e., the sum of the singular values of the matrix. The idea of solving the rank minimization problem in Eq. (1) was initially used in robust principal component analysis (RPCA) [30,31] to recover the underlying low-rank structure from corrupted data in a single subspace. To reveal the membership of samples more accurately, the low-rank representation (LRR) [32] method was developed to generalize RPCA from the single-subspace to the multi-subspace setting. By integrating rank minimization into the traditional sparse-representation-based dictionary learning framework, Ma et al. [33] and Li et al. [34] proposed discriminative low-rank dictionary learning methods for robust image classification and achieved impressive classification performance.
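For illustration, the recovery problem in Eq. (1) can be solved with an inexact augmented Lagrange multiplier scheme; the following is a minimal NumPy sketch (the default weight, step sizes and stopping rule are common choices and are assumptions, not the authors' implementation):

```python
import numpy as np

def soft_threshold(M, tau):
    # Entrywise shrinkage: proximal operator of the l1 norm.
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def rpca(Y, alpha=None, mu=1e-6, rho=1.1, tol=1e-7, max_iter=500):
    """Solve min ||Yo||_* + alpha*||E||_1  s.t.  Y = Yo + E, as in Eq. (1)."""
    d, n = Y.shape
    if alpha is None:
        alpha = 1.0 / np.sqrt(max(d, n))   # commonly used default weight
    Yo = np.zeros_like(Y)
    E = np.zeros_like(Y)
    T = np.zeros_like(Y)                   # Lagrange multiplier
    for _ in range(max_iter):
        Yo = svt(Y - E + T / mu, 1.0 / mu)                 # nuclear-norm step
        E = soft_threshold(Y - Yo + T / mu, alpha / mu)    # sparse-error step
        residual = Y - Yo - E
        T = T + mu * residual                              # multiplier update
        mu = min(rho * mu, 1e30)                           # penalty update
        if np.max(np.abs(residual)) < tol:
            break
    return Yo, E
```

The same shrinkage and singular value thresholding operators reappear as building blocks in the ALM solvers of Section 4.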

3. Low-rank double dictionary learning (LRD2L)

This section presents a novel low-rank double dictionary learning (LRD2L) method to address the corruption problem in training samples. We first analyze the properties of the class-specific information, the class-shared information, and the sparse noise in image samples. Based on this analysis, the objective function of LRD2L is designed, which will be used in the optimization algorithm presented in the next section.

3.1. The properties of the three components

The information contained in an image can be classified into three types, i.e., the class-specific information, the class-shared information and the sparse noise. Given a dataset Y = [Y_1, Y_2, ..., Y_C] (see Section 2), we propose to decompose Y into three components, such that each component only represents one type of information:

Y_i = \bar{Y}_i + \tilde{Y}_i + E_i, \quad \text{for } i = 1, 2, \ldots, C,   (2)

where \bar{Y}_i and \tilde{Y}_i are the class-specific component and the class-shared component of the sample images from the i-th class, and E_i represents the sparse noise in the i-th class. To appropriately separate these components from an image, we analyze their properties as follows.

If the images in the dataset are clean (i.e., each image only contains the class-specific information of its class), the images belonging to the same class tend to be drawn from the same subspace, while the images from different classes are drawn from different subspaces. Therefore, the column vectors in a class-specific component \bar{Y}_i, which correspond to images from the same class, are highly linearly correlated, whereas the column vectors in two different class-specific components \bar{Y}_i and \bar{Y}_j with i ≠ j, which correspond to images from different classes, are independent. Thus, the class-specific component matrix \bar{Y}_i of every class i, where i = 1, 2, ..., C, is expected to be low rank.

Since the class-shared components represent the common patterns across different classes, the class-shared components of different classes \tilde{Y}_i and \tilde{Y}_j are highly linearly correlated. Therefore, if we stack the class-shared components of all classes into one matrix, i.e., \tilde{Y} = [\tilde{Y}_1, \tilde{Y}_2, ..., \tilde{Y}_C], this matrix \tilde{Y} should be low rank. Note that the matrix \tilde{Y}_i of a single class, however, may not be low rank.

Because the noise contaminates a relatively small portion of the whole image and the noise in different images is statistically uncorrelated [35], the sparse noise component of each class E_i should be a sparse matrix.
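As a toy numerical illustration of these rank properties (synthetic data generated purely for intuition; this is not part of the proposed method):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_class, C = 100, 20, 5

# Class-specific parts: each class is drawn from its own low-dimensional subspace.
Y_spec = [rng.normal(size=(d, 3)) @ rng.normal(size=(3, n_per_class)) for _ in range(C)]
# Class-shared parts: all classes share one common subspace.
shared_basis = rng.normal(size=(d, 4))
Y_shared = [shared_basis @ rng.normal(size=(4, n_per_class)) for _ in range(C)]

print([np.linalg.matrix_rank(Ys) for Ys in Y_spec])   # ~3 each: low rank per class
print(np.linalg.matrix_rank(np.hstack(Y_shared)))     # ~4: low rank when stacked across classes
print(np.linalg.matrix_rank(np.hstack(Y_spec)))       # ~15: higher, since class subspaces are independent
```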

3.2. Building the objective function for LRD2L

In this study, we propose to learn a class-specific dictionary A = [A_1, A_2, ..., A_C] ∈ ℝ^{d×m_A} and a class-shared dictionary B ∈ ℝ^{d×m_B} to represent the class-specific and the class-shared components of the data, respectively. A_i ∈ ℝ^{d×m_{A_i}} is the sub-dictionary associated with the i-th class. According to dictionary learning theory, Eq. (2) can be rewritten as:

Y_i = A X_i + B Z_i + E_i, \quad \text{for } i = 1, 2, \ldots, C,   (3)

where X_i is the representation coefficient matrix of \bar{Y}_i over the dictionary A, and Z_i is the representation coefficient matrix of \tilde{Y}_i over the dictionary B.

Because both the class-specific component \bar{Y}_i of each class and the class-shared component across all classes \tilde{Y} = [\tilde{Y}_1, \tilde{Y}_2, ..., \tilde{Y}_C] have the low-rankness property, the class-specific sub-dictionary A_i of each class and the class-shared dictionary B should also be low-rank matrices. Hence we propose the following objective function for our low-rank double dictionary learning:

\min_{A_i, X_i, B, Z_i, E_i} \|A_i\|_* + \|B\|_* + \alpha (\|X_i\|_1 + \|Z_i\|_1) + \beta \|E_i\|_1
\quad \text{s.t.} \quad Y_i = A X_i + B Z_i + E_i, \ \text{for } i = 1, 2, \ldots, C,   (4)

where α and β are positive-valued parameters that balance the sparsity of the coefficient matrices and the weight of the noise, respectively. Since the noise component E_i of each class should be a sparse matrix, the l1-norm regularization ‖E_i‖_1 is used to model the sparse noise in each class.

We also adopt two terms to further enhance the discriminative power of the dictionaries. Firstly, since A_i is the sub-dictionary associated with the i-th class, it is supposed to represent the class-specific component of the i-th class well. Therefore, by vertically partitioning the coefficient matrix X_i as X_i = [X_{i1}; X_{i2}; ...; X_{iC}], where X_{ij} denotes the coefficients of Y_i corresponding to A_j, we can impose the constraint Y_i = A_i X_{ii} + B Z_i + E_i. Secondly, the class-specific components of different classes should not be correlated; therefore, for Y_j, the coefficients X_{ji} (i ≠ j) are expected to be zero matrices. To this end, an incoherence term

R(A_i) = \sum_{j=1, j \neq i}^{C} \|A_i X_{ji}\|_F^2, \quad i = 1, 2, \ldots, C,

is introduced into the objective function. Minimizing this term makes the correlation between Y_j and A_i (i ≠ j) as small as possible.

Considering both of the above two factors, the objective function of Eq. (4) is further improved for LRD2L as follows:

\min_{A_i, X_i, B, Z_i, E_i} \|A_i\|_* + \|B\|_* + \alpha (\|X_i\|_1 + \|Z_i\|_1) + \beta \|E_i\|_1 + \lambda R(A_i)
\quad \text{s.t.} \quad Y_i = A X_i + B Z_i + E_i, \quad Y_i = A_i X_{ii} + B Z_i + E_i,   (5)

where λ > 0 is a parameter that controls the contribution of the incoherence term, and X_{ii} denotes the coefficients of Y_i corresponding to A_i. We will present the algorithm for solving this optimization problem in Section 4.

Different from [20], which decomposes the image data into only two components, our method captures three types of information from the images. By considering the sparse noise in both the training and testing phases, the proposed LRD2L method is more robust to noise. In addition, the multi-subspace structural information in the data is ignored in [20], because the atoms in their dictionaries are processed independently. On the contrary, our method imposes low-rank constraints on the dictionaries to ensure that such structural information in the training images is correctly retained. Therefore, the dictionaries learned by LRD2L can decompose each component more accurately, and consequently LRD2L achieves higher levels of accuracy and robustness.

Algorithm 1. Solving problem (8) via ALM.

Input: data matrix Y_i, initial dictionary matrix D = [A, B], parameters α and β_1.
Initialization: E_i = T_1 = 0, P_i = H = T_2 = 0, μ = 10^{-6}, ρ = 1.1, ε = 10^{-8}, max_μ = 10^{30}.
While not converged and the maximal iteration number is not reached do
  1: Fix the others and update H by:
     H = argmin_H α‖H‖_1 + 〈T_2, P_i − H〉 + (μ/2)‖P_i − H‖_F^2.
  2: Fix the others and update P_i by:
     P_i = (I + D^T D)^{-1} (H + D^T Y_i − D^T E_i + (D^T T_1 − T_2)/μ).
  3: Fix the others and update E_i by:
     E_i = argmin_{E_i} β_1‖E_i‖_1 + 〈T_1, Y_i − D P_i − E_i〉 + (μ/2)‖Y_i − D P_i − E_i‖_F^2.
  4: Update the multipliers T_1 and T_2 by:
     T_1 = T_1 + μ(Y_i − D P_i − E_i), T_2 = T_2 + μ(P_i − H).
  5: Update the parameter μ by: μ = min(ρμ, max_μ).
  6: Check the convergence conditions: ‖Y_i − D P_i − E_i‖_∞ < ε, ‖P_i − H‖_∞ < ε.
End
Output: the coefficient matrix P_i = [X_i; Z_i] and the error matrix E_i.

4. Optimization of LRD2L

In this section, we propose an effective optimization algorithm to solve problem (5), in which the entire optimization problem is divided into three sub-problems that are solved iteratively: 1) updating the coefficients X_i and Z_i for class i (i = 1, 2, ..., C) with the dictionaries A, B and the other coefficients X_j and Z_j (j ≠ i) fixed; 2) updating the sub-dictionary A_i class by class with the other variables fixed, where the corresponding coefficients X_{ii} are also updated to satisfy the constraint Y_i = A_i X_{ii} + B Z_i + E_i; and 3) updating the class-shared dictionary B and the corresponding coefficients Z iteratively to satisfy the constraint Y_i = A X_i + B Z_i + E_i. Similar to [33], the sparse error E_i is updated in each sub-problem. Because the weight of E_i should be adjusted differently for the constraints Y_i = A_i X_{ii} + B Z_i + E_i and Y_i = A X_i + B Z_i + E_i, the parameter β in the second sub-problem (β_2) is set differently from that in the other two sub-problems (β_1).

4.1. Updating the coefficients X_i and Z_i

Given the dictionaries A and B, the coefficients X_i and Z_i are updated class by class, with all the other coefficients X_j and Z_j (j ≠ i) fixed. Thus, by ignoring the terms that are unrelated to X_i and Z_i, problem (5) is reduced to the following sparse coding problem:

\min_{X_i, Z_i, E_i} \alpha (\|X_i\|_1 + \|Z_i\|_1) + \beta_1 \|E_i\|_1
\quad \text{s.t.} \quad Y_i = A X_i + B Z_i + E_i.   (6)

Note that ‖[X_i; Z_i]‖_1 = ‖X_i‖_1 + ‖Z_i‖_1 and the above constraint can be rewritten as Y_i = [A, B][X_i; Z_i] + E_i. Therefore, by defining P_i = [X_i; Z_i], D = [A, B] and introducing an auxiliary variable H, problem (6) is converted to the equivalent problem:

\min_{P_i, E_i, H} \alpha \|H\|_1 + \beta_1 \|E_i\|_1
\quad \text{s.t.} \quad Y_i = D P_i + E_i, \quad P_i = H.   (7)

The above problem can be solved efficiently by the Augmented Lagrange Multipliers (ALM) method [36], which minimizes the augmented Lagrange function of problem (7):

\min_{P_i, E_i, H} \alpha \|H\|_1 + \beta_1 \|E_i\|_1 + \langle T_1, Y_i - D P_i - E_i \rangle + \langle T_2, P_i - H \rangle + \frac{\mu}{2} \left( \|Y_i - D P_i - E_i\|_F^2 + \|P_i - H\|_F^2 \right),   (8)

where T_1 and T_2 are Lagrange multipliers and μ > 0 is a positive penalty parameter. 〈A, B〉 = Tr(A^T B) is the sum of the diagonal elements of the matrix A^T B, and A^T is the transpose of the matrix A. The optimization procedure for problem (8) is shown in Algorithm 1, where I is an identity matrix. The optimization problems in steps 1 and 3 can be solved by the soft-thresholding (shrinkage) operator.
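To make these updates concrete, the following is a minimal NumPy sketch of one pass over steps 1–5 of Algorithm 1; it mirrors the closed-form solutions stated above, and the function and variable names are assumptions rather than the authors' released code:

```python
import numpy as np

def soft_threshold(M, tau):
    # Shrinkage operator solving min_X tau*||X||_1 + 0.5*||X - M||_F^2.
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def algorithm1_iteration(Y_i, D, P_i, E_i, T1, T2, mu, alpha, beta1, rho=1.1, mu_max=1e30):
    # Step 1: update H (l1 proximal step on P_i + T2/mu).
    H = soft_threshold(P_i + T2 / mu, alpha / mu)
    # Step 2: update P_i in closed form (regularized least squares).
    m = D.shape[1]
    P_i = np.linalg.solve(np.eye(m) + D.T @ D,
                          H + D.T @ (Y_i - E_i) + (D.T @ T1 - T2) / mu)
    # Step 3: update E_i (l1 proximal step on the fitting residual).
    E_i = soft_threshold(Y_i - D @ P_i + T1 / mu, beta1 / mu)
    # Step 4: update the Lagrange multipliers.
    T1 = T1 + mu * (Y_i - D @ P_i - E_i)
    T2 = T2 + mu * (P_i - H)
    # Step 5: increase the penalty parameter.
    mu = min(rho * mu, mu_max)
    return P_i, E_i, H, T1, T2, mu
```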

4.2. Updating the class-specific sub-dictionary A_i

With the learned coefficients X_i, the sub-dictionary A_i is updated class by class, and the corresponding coefficients X_{ii} are also updated to meet the constraint. The second sub-problem is then reduced to the following optimization problem:

\min_{A_i, X_{ii}, E_i} \|A_i\|_* + \alpha \|X_{ii}\|_1 + \beta_2 \|E_i\|_1 + \lambda R(A_i)
\quad \text{s.t.} \quad Y_i = A_i X_{ii} + B Z_i + E_i.   (9)

For mathematical brevity, we define Y_A = Y_i − B Z_i. Two auxiliary variables J and S are introduced, and problem (9) becomes the following equivalent optimization problem:

\min_{A_i, J, X_{ii}, S, E_i} \|J\|_* + \alpha \|S\|_1 + \beta_2 \|E_i\|_1 + \lambda R(A_i)
\quad \text{s.t.} \quad Y_A = A_i X_{ii} + E_i, \quad A_i = J, \quad X_{ii} = S.   (10)

Problem (10) can be solved through the ALM method by minimizing the following augmented Lagrange function:

\min_{A_i, J, X_{ii}, S, E_i} \|J\|_* + \alpha \|S\|_1 + \beta_2 \|E_i\|_1 + \lambda R(A_i) + \langle T_1, Y_A - A_i X_{ii} - E_i \rangle + \langle T_2, A_i - J \rangle + \langle T_3, X_{ii} - S \rangle + \frac{\mu}{2} \left( \|Y_A - A_i X_{ii} - E_i\|_F^2 + \|A_i - J\|_F^2 + \|X_{ii} - S\|_F^2 \right),   (11)

where T_1, T_2 and T_3 are Lagrange multipliers and μ > 0 is a positive penalty parameter. The derivative of the incoherence term R(A_i) = \sum_{j=1, j \neq i}^{C} \|A_i X_{ji}\|_F^2 with respect to A_i is

\frac{d R(A_i)}{d A_i} = 2 \sum_{j=1, j \neq i}^{C} A_i X_{ji} X_{ji}^T.

Hence, by defining the matrix V = \frac{2\lambda}{\mu} \sum_{j=1, j \neq i}^{C} X_{ji} X_{ji}^T, the optimization process for problem (11) is presented in Algorithm 2. The optimization problem in step 1 can be solved by using the singular value thresholding (SVT) operator [37]. As in [23,33], normalization operations are applied after updating J and A_i, respectively, to make the columns of the learned dictionary unit vectors.
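That step-1 update (see Algorithm 2 below) is a standard singular value thresholding operation followed by column normalization; a minimal NumPy sketch, with assumed function names and a small guard against zero-norm columns:

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def update_J(A_i, T2, mu):
    # Step 1 of Algorithm 2: nuclear-norm proximal step on A_i + T2/mu,
    # then normalize columns so the dictionary atoms stay unit length.
    J = svt(A_i + T2 / mu, 1.0 / mu)
    return J / np.maximum(np.linalg.norm(J, axis=0), 1e-12)
```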

Algorithm 2. Solving problem (11) via ALM.

Input: data matrix Y_A = Y_i − B Z_i, coefficient matrix X, parameters α, β_2, λ.
Initialization: E_i = T_1 = 0, A_i = J = T_2 = 0, S = T_3 = 0, μ = 10^{-6}, ρ = 1.1, ε = 10^{-8}, max_μ = 10^{30}, incoherence matrix V = (2λ/μ) \sum_{j=1, j≠i}^{C} X_{ji} X_{ji}^T.
While not converged and the maximal iteration number is not reached do
  1: Fix the others and update J by:
     J = argmin_J ‖J‖_* + 〈T_2, A_i − J〉 + (μ/2)‖A_i − J‖_F^2,
     and normalize the columns of J to unit vectors.
  2: Fix the others and update A_i by:
     A_i = (J + Y_A X_{ii}^T − E_i X_{ii}^T + (T_1 X_{ii}^T − T_2)/μ)(I + X_{ii} X_{ii}^T + V)^{-1},
     and normalize the columns of A_i to unit vectors.
  3: Fix the others and update S by:
     S = argmin_S α‖S‖_1 + 〈T_3, X_{ii} − S〉 + (μ/2)‖X_{ii} − S‖_F^2.
  4: Fix the others and update X_{ii} by:
     X_{ii} = (I + A_i^T A_i)^{-1}(S + A_i^T Y_A − A_i^T E_i + (A_i^T T_1 − T_3)/μ).
  5: Fix the others and update E_i by:
     E_i = argmin_{E_i} β_2‖E_i‖_1 + 〈T_1, Y_A − A_i X_{ii} − E_i〉 + (μ/2)‖Y_A − A_i X_{ii} − E_i‖_F^2.
  6: Update the multipliers T_1, T_2 and T_3 by:
     T_1 = T_1 + μ(Y_A − A_i X_{ii} − E_i), T_2 = T_2 + μ(A_i − J), T_3 = T_3 + μ(X_{ii} − S).
  7: Update the parameter μ by: μ = min(ρμ, max_μ).
  8: Check the convergence conditions: ‖Y_A − A_i X_{ii} − E_i‖_∞ < ε, ‖A_i − J‖_∞ < ε, ‖X_{ii} − S‖_∞ < ε.
End
Output: sub-dictionary matrix A_i, coefficient matrix X_{ii} and error matrix E_i.

4.3. Updating the class-shared dictionary B

When all the class-specific sub-dictionaries have been updated, we learn the class-shared dictionary B with all the other variables fixed. Hence, the objective function in (5) is reduced to

\min_{B, Z_i, E_i} \|B\|_* + \alpha \|Z_i\|_1 + \beta_1 \|E_i\|_1
\quad \text{s.t.} \quad Y_i = A X_i + B Z_i + E_i.   (12)

Different from the class-specific sub-dictionary A_i, which is associated only with an individual class, the class-shared dictionary B is related to image samples across all the classes. Therefore, when updating B, we need to consider the relationships of image samples across all the different classes together. By summing the objective function in (12) over all the classes, and noting that \sum_{i=1}^{C} \|Z_i\|_1 = \|[Z_1, Z_2, \ldots, Z_C]\|_1, problem (12) can be converted to the following optimization problem:

\min_{B, [Z_1, \ldots, Z_C], [E_1, \ldots, E_C]} \|B\|_* + \alpha \|[Z_1, Z_2, \ldots, Z_C]\|_1 + \beta_1 \|[E_1, E_2, \ldots, E_C]\|_1
\quad \text{s.t.} \quad [Y_1, Y_2, \ldots, Y_C] = A [X_1, X_2, \ldots, X_C] + B [Z_1, Z_2, \ldots, Z_C] + [E_1, E_2, \ldots, E_C].   (13)

Since X = [X_1, X_2, ..., X_C], Z = [Z_1, Z_2, ..., Z_C] and E = [E_1, E_2, ..., E_C], Eq. (13) can be reformulated as

\min_{B, Z, E} \|B\|_* + \alpha \|Z\|_1 + \beta_1 \|E\|_1
\quad \text{s.t.} \quad Y = A X + B Z + E.   (14)

Similar to the second sub-problem, we define Y_B = Y − A X and introduce two auxiliary variables Q and L, such that problem (14) is converted to the following equivalent optimization problem:

\min_{B, Q, Z, L, E} \|Q\|_* + \alpha \|L\|_1 + \beta_1 \|E\|_1
\quad \text{s.t.} \quad Y_B = B Z + E, \quad B = Q, \quad Z = L.   (15)

The above problem can be solved through the ALM method by minimizing its augmented Lagrange function:

\min_{B, Q, Z, L, E} \|Q\|_* + \alpha \|L\|_1 + \beta_1 \|E\|_1 + \langle T_1, Y_B - B Z - E \rangle + \langle T_2, B - Q \rangle + \langle T_3, Z - L \rangle + \frac{\mu}{2} \left( \|Y_B - B Z - E\|_F^2 + \|B - Q\|_F^2 + \|Z - L\|_F^2 \right).   (16)

We can observe that the objective function of problem (16) has the same form as that of problem (11), except without the incoherence term R(A_i). Therefore, by setting the parameter λ to zero, problem (16) can also be solved by Algorithm 2.

Algorithm 3. The complete algorithm of LRD2L.

Input: data matrix Y, initial dictionary matrix D = [A, B], parameters α, β_1, β_2, λ.
While the maximal iteration number is not reached do
  1: Fix the others and update X_i and Z_i class by class, using Algorithm 1;
  2: Fix the others and update A_i class by class, using Algorithm 2;
  3: Fix the others and update B by solving problem (16);
End
Output: class-specific dictionary A, class-shared dictionary B, coefficient matrix X, coefficient matrix Z and error matrix E.

4.4. Complete algorithm of LRD2L

With the algorithms for the above three sub-problems, the optimization procedures of (8), (11) and (16) need to be iterated a few times to reach the solution of LRD2L. The complete algorithm of LRD2L is summarized in Algorithm 3.

Similar to [39–41,52], the proposed objective function (5) of LRD2L is non-convex in the unknown variables. Since the proposed LRD2L algorithm can only converge to a local minimum, the initialization of each dictionary is important for achieving a desirable solution. As discussed in Section 3, we construct the sub-dictionary A_i to represent the class-specific information of each class, while the class-shared information and the sparse noise should not be captured by it. Therefore, similar to [39], by applying the singular value decomposition (SVD) to each training sample set Y_i, we can initialize the sub-dictionary A_i ∈ ℝ^{d×m_{A_i}} as follows:

A_i = U_i(1{:}d, 1) \, S_i(1, 1) \, V_i(1{:}m_{A_i}, 1)^T,
A_i(:, k) = A_i(:, k) / \|A_i(:, k)\|_2 \quad \text{for } k = 1, 2, \ldots, m_{A_i},   (17)

where U_i, S_i and V_i are the singular value decomposition matrices of Y_i, i.e., Y_i = U_i S_i V_i^T, and the normalization operation makes the columns of A_i unit vectors. Such an initialization scheme makes the sub-dictionary A_i capture the most significant class-specific information with the lowest rank (only rank 1), and maximally removes the effect of the class-shared information and the noise from A_i.
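For concreteness, Eq. (17) can be read as the following NumPy sketch, assuming Y_i stores the training samples of class i as columns and m_Ai ≤ N_i; this is an illustrative reading of the formula, not the authors' code:

```python
import numpy as np

def init_class_dictionary(Y_i, m_Ai):
    # Rank-1 SVD initialization of the class-specific sub-dictionary A_i (Eq. (17)).
    U, s, Vt = np.linalg.svd(Y_i, full_matrices=False)
    # Leading left singular vector times the first m_Ai entries of the leading
    # right singular vector, scaled by the largest singular value.
    A_i = s[0] * np.outer(U[:, 0], Vt[0, :m_Ai])
    # Normalize each atom (column) to unit length.
    A_i = A_i / np.maximum(np.linalg.norm(A_i, axis=0), 1e-12)
    return A_i
```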

For the class-shared dictionary B, as in [23], we initialize the atoms of B with the eigenvectors of Y that are associated with the largest m_B eigenvalues. Our experiments show that this initialization strategy leads to a desirable result. In addition, in each iteration, Algorithm 1 and Algorithm 2 reinitialize the associated variables, and their outputs are the local minima optimized from the zero matrices. Therefore, similar to the results in [49,51,52], the value of the objective function shown in Fig. 1 slightly fluctuates as the iteration number increases. However, after a few iterations, the objective function value tends to be stable and only varies within a very small range (see Fig. 1). It can be seen that only 10 to 15 iterations are needed to yield stable results. Therefore, for the rest of the experiments in this work, we set the maximal iteration number to 15.

4.5. Computational complexity analysis

In this section, we analyze the computational complexity of our approach. For updating the coefficients X_i and Z_i, the computations involve the matrix inversion and the matrix multiplications in Algorithm 1. Since the dictionary D = [A, B] is a d × (m_A + m_B) matrix, the computational complexity of the matrix inversion in step 2 is O((m_A + m_B)^3), and the matrix multiplications cost O(d N_i (m_A + m_B)).

For updating the sub-dictionary A_i using Algorithm 2, step 1 requires computing the SVD of a d × m_{A_i} matrix, whose complexity is O(d m_{A_i}^2). Since A_i ∈ ℝ^{d×m_{A_i}} and X_{ii} ∈ ℝ^{m_{A_i}×N_i}, the matrix inversions in steps 2 and 4 both cost O(m_{A_i}^3), and the matrix multiplication operations have a complexity of O(d N_i m_{A_i}). Similarly, for updating the class-shared dictionary B using Algorithm 2, the SVT operator in step 1 has a complexity of O(d m_B^2). The time costs of the matrix inversions and the matrix multiplications are O(m_B^3) and O(d N m_B), respectively.
Fig. 1. The curves of the objective function (5) versus the iteration number obtained on (a) the AR dataset and (b) the Extended YaleB dataset.

Considering the number of classes and the number of iterations, the overall complexity of the proposed algorithm is

C t_1 O(d N_i (m_A + m_B) + (m_A + m_B)^3) + C t_2 O(d N_i m_{A_i} + d m_{A_i}^2 + m_{A_i}^3) + t_3 O(d N m_B + d m_B^2 + m_B^3),

where t_1, t_2 and t_3 are the numbers of iterations needed for solving each sub-problem, respectively, and C is the number of classes. The entire optimization algorithm can be fast provided that the dictionary sizes m_A and m_B are not too large.

4.6. Classification based on LRD2L

In the previous sections, we have addressed the corruption and noise problem in the training samples at the dictionary learning stage. To handle the corruptions and the noise in testing samples at the testing stage, after learning the class-specific dictionary A = [A_1, A_2, ..., A_C] and the class-shared dictionary B, we encode a testing image y ∈ ℝ^d over the learned dictionaries by solving the following problem:

\min_{x, z, e} \|x\|_1 + \gamma \|z\|_1 + \omega \|e\|_1
\quad \text{s.t.} \quad y = A x + B z + e,   (18)

where x and z are the coding coefficient vectors of y in terms of A and B, respectively, and e is the error vector that approximates the sparse noise in y. This problem has a similar form to problem (6) and can be solved by the ALM algorithm as well. Then, for each class, the testing image y can be approximated as

y'_i = A \delta_i(x) + B z + e.   (19)

Denote x = [x_1; x_2; ...; x_C], where x_i is the coding coefficient sub-vector associated with the sub-dictionary A_i. δ_i(x) selects the coefficients corresponding to the i-th class, and is defined as δ_i(x) = [0; 0; ...; x_i; ...; 0]. Finally, y is classified to the class with the smallest reconstruction residual, i.e., identity(y) = argmin_i ‖y'_i − y‖_2^2.
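As a sketch of the decision rule in Eqs. (18)–(19), suppose x, z and e have already been obtained by an l1 solver (e.g., the ALM scheme of Section 4.1); the residual-based classification then looks as follows, where class_sizes gives the number of atoms in each A_i and the names are assumptions:

```python
import numpy as np

def classify(y, A, B, x, z, e, class_sizes):
    # Eq. (19): reconstruct y with each class's block of coefficients and
    # pick the class with the smallest reconstruction residual.
    residuals = []
    start = 0
    for m_Ai in class_sizes:
        delta_x = np.zeros_like(x)
        delta_x[start:start + m_Ai] = x[start:start + m_Ai]   # delta_i(x)
        y_hat = A @ delta_x + B @ z + e
        residuals.append(np.linalg.norm(y_hat - y) ** 2)
        start += m_Ai
    return int(np.argmin(residuals))
```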

5. Experimental results

To evaluate the effectiveness of the proposed LRD2L method, we conducted experiments on three publicly available databases for face and object recognition. The performance of our approach is compared with six state-of-the-art methods, including the sparse representation classifier (SRC) [35], Fisher discrimination dictionary learning (FDDL) [23], label consistent K-SVD [21] version 1 (LC-KSVD1) and version 2 (LC-KSVD2), DL-COPAR [20] and discriminative low-rank dictionary for sparse representation (DLRD_SR) [33]. We implemented the SRC and DLRD_SR algorithms; the codes of the other competing methods were downloaded from the authors' homepages. As in [40–42], the image samples used in this paper are normalized to unit length (i.e., the l2-norm of each image vector equals one). As the accuracy of our method is not sensitive to α and λ, we fix α = 0.1, λ = 1 for all the experiments in this paper. β_1, β_2 and the parameters in the benchmark algorithms are tuned manually for each experimental setting to the best possible performance of each competing method.
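The unit-length normalization mentioned above amounts to dividing each vectorized image by its l2 norm; a one-line NumPy illustration, assuming the images are stored as columns of a matrix Y:

```python
import numpy as np

def normalize_columns(Y):
    # Scale each image vector (column) to unit l2 norm, as in the experimental setup.
    return Y / np.maximum(np.linalg.norm(Y, axis=0, keepdims=True), 1e-12)
```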
Fig. 2. Example images of the AR dataset.

Table 1
Classification results of different methods on the AR dataset with ξ = 0 (β_1 = 0.1, β_2 = 0.015).

Methods (dictionary size)              | Classification accuracy
SRC (7 × 100 = 700)                    | 92.14%
FDDL (7 × 100 = 700)                   | 92.29%
LC-KSVD1 (7 × 100 = 700)               | 92.00%
LC-KSVD2 (7 × 100 = 700)               | 92.86%
DL-COPAR (7 × 100 + 7 = 707)           | 92.86%
DLRD_SR (7 × 100 = 700)                | 96.14%
Proposed LRD2L (5 × 100 + 200 = 700)   | 97.29%

5.1. Experiments on the AR dataset

The AR dataset [43] contains more than 4000 frontal-view face images of 126 subjects. The face images in the AR dataset are corrupted by different types of corruptions, such as illuminations, facial expressions and disguises (sunglasses and scarf). For each subject, there are 26 images captured in two different sessions with a two-week interval. Each session consists of 13 images: 3 images under different illumination conditions (right light on, left light on and both lights on), 3 images under different expression changes (smiling, angry and screaming), and 6 images with disguises (sunglasses and scarf in combination with illumination changes). Fig. 2 shows example images of one subject from the two sessions.

5.1.1. Tests on data without occlusion

In this experiment, we use a subset of the dataset, which contains 2600 images from the first 50 male individuals and the first 50 female individuals. For each individual, 7 images from session 1 with different expressions (including neutral) and illuminations are used as training samples, and the 7 images from session 2 under the same conditions are used as testing samples.

In both the above training and testing images, a varying percentage ξ (from 10% to 40%) of randomly selected pixels are replaced with noise uniformly distributed over [0, V_max], as in [33,38], where V_max is the maximal pixel value in the image. All the images are cropped and normalized to the size of 60 × 43 pixels. Fig. 3 shows the average classification accuracies of 5 repeated experiments. Table 1 lists the classification results of the different methods without added noise.
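The random pixel corruption used here can be generated as in the following small NumPy sketch of the protocol described above; the function name and interface are assumptions:

```python
import numpy as np

def corrupt_pixels(img, xi, rng=np.random.default_rng()):
    """Replace a fraction xi of randomly chosen pixels with uniform noise on [0, V_max]."""
    corrupted = img.astype(float).copy()
    v_max = corrupted.max()
    n_corrupt = int(round(xi * corrupted.size))
    idx = rng.choice(corrupted.size, size=n_corrupt, replace=False)
    corrupted.flat[idx] = rng.uniform(0.0, v_max, size=n_corrupt)
    return corrupted
```

For example, corrupt_pixels(img, 0.2) corresponds to the ξ = 20 setting.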

It can be seen that, without added noise (i.e., ξ = 0), our approach achieves a classification accuracy of 97.29%, which is 1.15% higher than the DLRD_SR method and more than 4.43% higher than the other five competing methods. When the noise level becomes ξ = 20, the accuracy improvements of the proposed LRD2L over the DLRD_SR method and the other competing methods increase to 4.60% and more than 21.02%, respectively. The improvements go up to 17.29% and more than 39.15% when ξ = 40. These results demonstrate the effectiveness and superiority of the proposed LRD2L over the state-of-the-art methods in recognizing faces from corrupted training and testing data.

Fig. 4 visualizes the image decomposition results obtained by DLRD_SR (shown in the red box) and the proposed LRD2L (shown in the blue box) on the AR dataset, under 20% random pixel corruption. It can be observed by comparing Fig. 4(b) and (d) that the images recovered by LRD2L have much less noise and preserve more class-shared information, such as facial expression and illumination variations. This results in a much stronger representative capability of the dictionary learned by LRD2L than that of DLRD_SR. Furthermore, because the proposed LRD2L separates the class-specific information and the class-shared information by learning different dictionaries, the class-specific components (see Fig. 4(f)) only contain the most discriminative features of each class, i.e., the identity information, while the common patterns (such as illuminations and expressions) are represented by the class-shared components (see Fig. 4(g)). Hence, the class-specific sub-dictionaries learned by LRD2L can be more discriminative and robust than the dictionaries learned by DLRD_SR.

5.1.2. Tests on data with occlusion

In addition to the random noise corruption, we further evaluate the effectiveness of the proposed approach in dealing with real-world disguises in both training and testing samples. Following the experimental settings in [38,44,45], we conduct comparative experiments under three different scenarios:

Sunglasses: In this scenario, all seven undisguised images and one randomly selected image with sunglasses from session 1 of each person are used to construct the training set. The seven undisguised images from session 2 and all the remaining images with sunglasses (the two remaining images from session 1 and all three images from session 2) of each person are used as testing samples. Thus, there are 8 training samples and 12 testing samples for each individual. The sunglasses cover about 20% of the face image.

Scarf: In this scenario, all seven undisguised images and one randomly selected image with scarf from session 1 of each person are chosen to construct the training set. The seven undisguised images from session 2 and all the remaining images with scarf are used as testing samples. Thus, there are also 8 training samples and 12 testing samples for each individual. The scarf covers about 40% of the face image.

Mixed: In this scenario, both the sunglasses and the scarf occlusions are considered. For each subject, all seven undisguised images, one randomly selected image with sunglasses, and one randomly selected image with scarf from session 1 are chosen as training samples. The remaining images are used as testing samples. Thus, there are 9 training samples and 17 testing samples for each individual.

Fig. 3. Classification results of different methods on the AR dataset under different percentages of random pixel corruption (ξ).

Fig. 4. Image decomposition examples of DLRD_SR [33] and LRD2L on the AR dataset, under 20% random pixel noise. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 5. Example images of the Extended YaleB dataset.

Table 2
Classification results of different methods on the AR dataset with different occlusions (β_1 = 0.2, β_2 = 0.01).

Methods (dictionary size)              | Sunglasses | Scarf  | Mixed
SRC (= number of training samples)     | 89.83%     | 89.17% | 88.59%
FDDL (8 × 100 = 800)                   | 90.08%     | 89.58% | 89.47%
LC-KSVD1 (8 × 100 = 800)               | 87.75%     | 86.53% | 86.24%
LC-KSVD2 (8 × 100 = 800)               | 89.36%     | 88.81% | 87.76%
DL-COPAR (8 × 100 + 8 = 808)           | 90.08%     | 89.17% | 88.24%
DLRD_SR (8 × 100 = 800)                | 94.08%     | 92.58% | 91.53%
Proposed LRD2L (5 × 100 + 200 = 700)   | 95.28%     | 94.23% | 93.59%

Table 3
Classification results of different methods on the Extended YaleB dataset with ξ = 0 (β_1 = 0.08, β_2 = 0.055).

Methods (dictionary size)              | Classification accuracy
SRC (30 × 38 = 1140)                   | 97.80%
FDDL (20 × 38 = 760)                   | 97.93%
LC-KSVD1 (20 × 38 = 760)               | 96.22%
LC-KSVD2 (20 × 38 = 760)               | 96.83%
DL-COPAR (20 × 38 + 20 = 780)          | 97.25%
DLRD_SR (20 × 38 = 760)                | 98.12%
Proposed LRD2L (15 × 38 + 200 = 770)   | 99.43%

Table 4
Classification results of different methods on the COIL-20 dataset with ξ = 0 (β_1 = 0.07, β_2 = 0.04).

Methods (dictionary size)              | Classification accuracy
SRC (10 × 20 = 200)                    | 87.50%
FDDL (10 × 20 = 200)                   | 87.98%
LC-KSVD1 (10 × 20 = 200)               | 90.68%
LC-KSVD2 (10 × 20 = 200)               | 90.87%
DL-COPAR (10 × 20 + 10 = 210)          | 90.42%
DLRD_SR (10 × 20 = 200)                | 89.63%
Proposed LRD2L (6 × 20 + 80 = 200)     | 92.05%

All the experiments are repeated 5 times. Table 2 reports the average classification accuracies of the different approaches under the different scenarios. Consistently, the proposed LRD2L obtains the best accuracies across all scenarios. More specifically, the proposed LRD2L outperforms DLRD_SR by 1.20% for the sunglasses scenario, 1.65% for the scarf scenario and 2.06% for the mixed scenario. Compared with the other benchmark methods, LRD2L achieves at least 5.20%, 4.65% and 4.12% improvements for the three scenarios, respectively. This demonstrates the effectiveness of the proposed low-rank double dictionary learning method.

5.2. Experiments on the Extended YaleB dataset

The Extended YaleB dataset [46] consists of 2414 near-frontal face images from 38 individuals, with each individual having around 59–64 images. The images are captured under various laboratory-controlled illumination conditions (some example images are shown in Fig. 5). For each subject, we randomly select 30 face images to build the training set, and the remaining images are used as testing samples. All the images are manually cropped and normalized to the size of 32 × 32 pixels. The experiments are repeated 5 times and the average classification accuracies are reported in Table 3.

From the experimental results shown in Table 3, it is observed that, with sufficient training samples representing the illumination variations in the testing data, SRC achieves better classification results than LC-KSVD. It can also be seen that FDDL, DL-COPAR and DLRD_SR obtain results comparable to SRC with smaller dictionaries. This demonstrates that dictionary learning methods can usually learn a more compact and discriminative dictionary than directly using the training samples as the dictionary. The proposed LRD2L obtains the highest classification accuracy of 99.43%, which improves by 1.31% over DLRD_SR and by at least 1.50% over the other benchmark methods.

To further evaluate the robustness to different levels of pixel corruption, 10%–40% of randomly selected pixels in all training and testing images are corrupted by random noise as in Section 5.1.1. For each class, 30 images are randomly selected as training samples and the rest are used as testing samples. Each experiment is repeated 5 times and the average classification accuracies under different levels of pixel corruption are shown in Fig. 6.

From Fig. 6, it can be seen that the proposed LRD2L consistently obtains the highest classification accuracies under varying noise corruption. When ξ = 20, the proposed LRD2L outperforms DLRD_SR and the other five methods by 11.17% and at least 23.25%, respectively. Such improvements increase to 17.49% and at least 30.92%, respectively, when the noise level increases to 40% (i.e., ξ = 40). This demonstrates the robustness and the effectiveness of the proposed LRD2L method compared to the other competing methods.

5.3. Experiments on the COIL-20 dataset

The COIL-20 object dataset [47] consists of 1440 grayscale images from 20 objects with various poses. Each object contains 72 images, which are captured at equally spaced views, i.e., one image is taken at every 5-degree interval. Fig. 7 shows an example set of images from one object with different viewpoints. In this experiment, as in [48,49], 10 images per class are randomly selected as training samples, while the remaining images are used as testing samples. All the images are manually cropped and resized to 32 × 32 pixels. All experiments are repeated 5 times and the average classification accuracies of the different approaches are reported in Table 4.

Page 10: Low-rank double dictionary learning from corrupted data ...

428 Y. Rong et al. / Pattern Recognition 72 (2017) 419–432

Fig. 6. Classification results of different methods on Extended YaleB dataset under different percentage of random pixel corruption ( ξ ).

Fig. 7. Example images of the COIL-20 dataset.

a

a

c

A

i

o

a

6

t

i

m

t

5

As shown in Table 4, DLRD_SR does not work as well as in the previous two experiments and only achieves a classification rate of 89.63%, which is lower than those of DL-COPAR and LC-KSVD. This may be because, without enough training data, the correlations between some training samples from the same class can be non-linear due to the pose variations in the dataset, which makes the low-rankness property of each class less obvious. Because of the low-rank regularization on each sub-dictionary, a portion of the class-shared information, i.e., the pose-variation information, is removed along with the sparse noise. Consequently, DLRD_SR cannot well represent the testing samples with similar pose variations, which results in a degraded classification accuracy. Different from DLRD_SR, by learning a class-shared dictionary to capture the pose variations, the proposed LRD2L achieves the highest classification accuracy of 92.05%, still outperforming the competing methods by margins between 1.18% and 4.55%.

To evaluate the robustness of LRD2L on the COIL-20 object dataset, we replace 10% to 40% of randomly selected pixels in all training and testing images with the pixel value 255, using the same strategy as in [49,50]. All experiments are repeated 5 times and the average classification accuracies of the different methods are shown in Fig. 8.
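Under the hypothetical `corrupt_pixels` helper sketched earlier, this COIL-20 test corresponds to passing a fixed fill value instead of random noise; `train_images` and `test_images` are again assumed NumPy arrays.

    # Illustrative call, reusing the corrupt_pixels sketch above:
    # 30% of the pixels in every image are replaced by the value 255.
    train_corrupted = corrupt_pixels(train_images, xi=30, fill=255)
    test_corrupted = corrupt_pixels(test_images, xi=30, fill=255)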

Again, the proposed LRD2L consistently outperforms the competing methods under different levels of pixel corruption. With 10% or more corruption, our approach performs better than DLRD_SR and DL-COPAR by over 3.82%, and obtains at least a 6% improvement over the other competing methods.

It is very encouraging to see that the proposed LRD2L consistently outperforms the state-of-the-art approaches in all the experiments, showing that the dictionaries learned by our approach are more discriminative and more robust to the corruptions in both training and testing samples.

Fig. 8. Classification results of different methods on COIL-20 dataset under different percentages of random pixel corruption (ξ).

Table 5
Classification results of LRD2L with and without R(A_i) on the three datasets with ξ = 0.

Dataset            LRD2L with R(A_i)    LRD2L without R(A_i)
AR                 97.29%               96.71%
Extended YaleB     99.43%               99.13%
COIL-20            92.05%               91.46%

Table 6
Average classification time of different methods on the three datasets with ξ = 0.

Method             Classification time (seconds)
                   AR         Extended YaleB    COIL-20
SRC                4058.47    4907.05           695.26
FDDL               4045.83    3214.39           683.12
LC-KSVD1           9.88       2.21              0.083
LC-KSVD2           9.86       2.20              0.075
DL-COPAR           26.35      39.60             0.37
DLRD_SR            122.74     104.36            39.59
Proposed LRD2L     178.13     136.55            56.70

5.4. Effect of β1 and β2

To analyze the influence of the parameters β1 and β2 on the performance of the proposed LRD2L, experiments are conducted by varying their values from 0.001 to 1. The classification accuracies on all three datasets are shown in Fig. 9. Note that, for each of the three databases, the classification accuracy of the proposed LRD2L consistently reaches its peak value and remains flat when β1 and β2 are in the range [0.01, 0.1], which indicates that our method is relatively insensitive to β1 and β2. When β1 and β2 become too small (less than 0.01), the accuracy decreases because some of the class-specific and class-shared information is incorrectly allocated to the sparse noise component, reducing the representation ability of each dictionary. On the other hand, the accuracy also drops when β1 and β2 become greater than 0.1, because the error matrix E then becomes too sparse to correctly capture the noise in the images.
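A straightforward way to reproduce this sensitivity study is a two-dimensional grid sweep over β1 and β2. The sketch below is a minimal outline under the assumption of a stand-in `train_and_classify` routine (to be replaced by the actual LRD2L training and classification pipeline) and an illustrative grid of values; neither is taken from the paper.

    import numpy as np

    def train_and_classify(beta1, beta2):
        """Stand-in for the full LRD2L train/classify pipeline.

        A real reproduction would learn the dictionaries with the given
        regularization weights and return the test accuracy; this stub
        only makes the sweep loop below executable.
        """
        return 0.0  # replace with the actual accuracy

    beta_values = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]  # assumed grid
    accuracy = np.zeros((len(beta_values), len(beta_values)))

    for i, b1 in enumerate(beta_values):
        for j, b2 in enumerate(beta_values):
            accuracy[i, j] = train_and_classify(beta1=b1, beta2=b2)

    best_i, best_j = np.unravel_index(np.argmax(accuracy), accuracy.shape)
    print("best (beta1, beta2):", beta_values[best_i], beta_values[best_j])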

5.5. Effect of the incoherence term R(A_i)

To evaluate the effect of the incoherence term R(A_i), experiments are conducted to compare the performance of LRD2L with and without R(A_i) on the three datasets. Their classification results are reported in Table 5. It can be seen that the incoherence term R(A_i) consistently improves the classification accuracy of LRD2L on all datasets. With R(A_i), the classification accuracy of LRD2L increases by 0.58%, 0.30% and 0.49% on the three datasets, respectively. This demonstrates that reducing the coherence between the class-specific sub-dictionaries of different classes through R(A_i) can effectively increase the discriminability of LRD2L.
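The exact form of R(A_i) is defined in the method section of the paper and is not repeated here. As a rough diagnostic of the effect being measured, the sketch below computes a common structural-incoherence quantity, the average squared Frobenius norm of cross-products between different class-specific sub-dictionaries (cf. the structured-incoherence ideas in [8,44]); the function name, input layout, and weighting are our own assumptions and may differ from the paper's R(A_i).

    import numpy as np

    def mean_cross_coherence(sub_dicts):
        """Average ||A_j^T A_i||_F^2 over all class pairs i != j.

        sub_dicts : list of (d, k_i) arrays, one class-specific sub-dictionary
        per class, columns assumed l2-normalized. Lower values indicate more
        mutually incoherent (and hence more class-discriminative) sub-dictionaries.
        """
        total, count = 0.0, 0
        for i, Ai in enumerate(sub_dicts):
            for j, Aj in enumerate(sub_dicts):
                if i != j:
                    total += np.linalg.norm(Aj.T @ Ai, 'fro') ** 2
                    count += 1
        return total / max(count, 1)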

5.6. Evaluation of classification time

We conducted experiments on the three datasets to compare the computational time of the proposed LRD2L algorithm with the benchmark methods. Table 6 summarizes the average classification time of all the methods on the three datasets. All the experiments are conducted on a 64-bit computer with an Intel i7-4710MQ 2.5 GHz CPU and 16 GB RAM under the MATLAB R2014a programming environment. As shown in Table 6, LC-KSVD costs much less time than the other methods, due to the efficiency of its classification scheme. The classification process of LRD2L is much faster than those of SRC and FDDL, and the time cost of LRD2L is close to that of DLRD_SR, because they use similar ALM schemes for classification.
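The timing comparison itself only requires wall-clock measurement around each classifier's test loop. The reported numbers were obtained in MATLAB, so absolute values from the sketch below (which assumes a hypothetical `classify` callable and a list of `test_samples`) would not be directly comparable, but the protocol is the same.

    import time

    def measure_classification_time(classify, test_samples):
        """Return (total wall-clock seconds, predictions) for classifying all test samples."""
        start = time.perf_counter()
        predictions = [classify(x) for x in test_samples]
        return time.perf_counter() - start, predictions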


Fig. 9. The classification accuracy versus β1 and β2 obtained on (a) AR dataset, (b) Extended YaleB dataset and (c) COIL-20 dataset.


6. Conclusion

In this paper, we presented a novel low-rank double dictionary learning (LRD2L) approach that can effectively handle corruptions in both training and testing data, to achieve robust image classification. The proposed LRD2L method incorporates the low-rank matrix recovery technique into class-specific and class-shared dictionary learning. More specifically, our method simultaneously learns a low-rank class-specific dictionary for each class to capture the class-specific information owned by that class, a low-rank class-shared dictionary to represent the common features shared by all classes, and a sparse error term to approximate the sparse noise in image data. Based on the analysis of the properties of these three types of information, by imposing the low-rank regularizations on each class-specific sub-dictionary and the class-shared dictionary, together with the sparse noise term, the proposed method can better prevent the corruptions and noise in training samples from affecting the construction of the class-specific sub-dictionaries. In this way, each low-rank class-specific dictionary captures only the discriminative features of its class for correctly reconstructing and classifying new testing samples. A new alternating optimization algorithm is developed to solve the optimization problem of the proposed method. Experiments on face and object recognition tasks demonstrate the effectiveness and superiority of the proposed LRD2L approach compared to the state-of-the-art benchmark methods, especially when the data are corrupted.

Acknowledgements

This work was supported in part by the National Key Research & Development Program of China (2016YFD0101903).

References

[1] M. Yang, L. Van Gool, L. Zhang, Sparse variation dictionary learning for face recognition with a single training sample per person, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 689–696.
[2] M. Yang, L. Zhang, J. Yang, D. Zhang, Metaface learning for sparse representation based face recognition, in: IEEE International Conference on Image Processing (ICIP), 2010, pp. 1601–1604.
[3] S. Li, H. Yin, L. Fang, Group-sparse representation with dictionary learning for medical image denoising and fusion, IEEE Trans. Biomed. Eng. 59 (12) (2012) 3450–3459.
[4] R. Rubinstein, A.M. Bruckstein, M. Elad, Dictionaries for sparse representation modeling, Proc. IEEE 98 (6) (2010) 1045–1057.
[5] Z. Jiang, G. Zhang, L.S. Davis, Submodular dictionary learning for sparse coding, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3418–3425.
[6] S. Bengio, F. Pereira, Y. Singer, D. Strelow, Group sparse coding, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 82–89.
[7] D. Zhang, P. Liu, K. Zhang, H. Zhang, Q. Wang, X. Jing, Class relatedness oriented-discriminative dictionary learning for multiclass image classification, Pattern Recognit. 59 (2016) 168–175.
[8] I. Ramirez, P. Sprechmann, G. Sapiro, Classification and clustering via dictionary learning with structured incoherence and shared features, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3501–3508.
[9] S. Gu, L. Zhang, W. Zuo, X. Feng, Projective dictionary pair learning for pattern classification, in: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 793–801.
[10] W. Liu, Z. Yu, Y. Wen, R. Lin, M. Yang, Jointly learning non-negative projection and dictionary with discriminative graph constraints for classification, in: British Machine Vision Conference (BMVC), 2016.
[11] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044.
[12] J. Zheng, Z. Jiang, Learning view-invariant sparse representations for cross-view action recognition, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3176–3183.
[13] J. Zheng, Z. Jiang, R. Chellappa, Cross-view action recognition via transferable dictionary learning, IEEE Trans. Image Process. 25 (6) (2016) 2542–2556.
[14] N. Zhou, Y. Shen, J. Peng, J. Fan, Learning inter-related visual dictionary for object recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3490–3497.
[15] Q. Zhang, B. Li, Discriminative K-SVD for dictionary learning in face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2691–2698.
[16] J. Qian, L. Luo, J. Yang, F. Zhang, Z. Lin, Robust nuclear norm regularized regression for face recognition with occlusion, Pattern Recognit. 48 (10) (2015) 3145–3159.
[17] C.-P. Wei, Y.-C.F. Wang, Undersampled face recognition via robust auxiliary dictionary learning, IEEE Trans. Image Process. 24 (6) (2015) 1722–1734.
[18] P. Zhu, M. Yang, L. Zhang, I.-Y. Lee, Local generic representation for face recognition with single sample per person, in: Asian Conference on Computer Vision (ACCV), 2014, pp. 34–50.
[19] C.-P. Wei, Y.-C.F. Wang, Undersampled face recognition with one-pass dictionary learning, in: IEEE International Conference on Multimedia and Expo (ICME), 2015.
[20] D. Wang, S. Kong, A classification-oriented dictionary learning model: explicitly learning the particularity and commonality across categories, Pattern Recognit. 47 (2) (2014) 885–898.
[21] Z. Jiang, Z. Lin, L.S. Davis, Label consistent K-SVD: learning a discriminative dictionary for recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2651–2664.
[22] M. Yang, L. Zhang, X. Feng, D. Zhang, Fisher discrimination dictionary learning for sparse representation, in: IEEE International Conference on Computer Vision (ICCV), 2011, pp. 543–550.
[23] M. Yang, L. Zhang, X. Feng, D. Zhang, Sparse representation based fisher discrimination dictionary learning for image classification, Int. J. Comput. Vis. 109 (3) (2014) 209–232.
[24] Y. Sun, Q. Liu, J. Tang, D. Tao, Learning discriminative dictionary for group sparse representation, IEEE Trans. Image Process. 23 (9) (2014) 3816–3828.
[25] S. Gao, I.W. Tsang, Y. Ma, Learning category-specific dictionary and shared dictionary for fine-grained image categorization, IEEE Trans. Image Process. 23 (2) (2014) 623–634.
[26] M. Yang, L. Zhang, J. Yang, D. Zhang, Robust sparse coding for face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 625–632.
[27] R. He, W.-S. Zheng, B.-G. Hu, Maximum correntropy criterion for robust face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1561–1576.
[28] J. Yang, L. Luo, J. Qian, Y. Tai, F. Zhang, Y. Xu, Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes, IEEE Trans. Pattern Anal. Mach. Intell. 39 (1) (2016) 156–171.
[29] P. Hao, S. Kamata, Maximum correntropy criterion for discriminative dictionary learning, in: IEEE International Conference on Image Processing (ICIP), 2013, pp. 4325–4329.
[30] J. Wright, A. Ganesh, S. Rao, Y. Peng, Y. Ma, Robust principal component analysis: exact recovery of corrupted low-rank matrices by convex optimization, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 2080–2088.
[31] E.J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis? J. ACM 58 (3) (2011).
[32] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 171–184.
[33] L. Ma, C. Wang, B. Xiao, W. Zhou, Sparse representation for face recognition based on discriminative low-rank dictionary learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2586–2593.
[34] L. Li, S. Li, Y. Fu, Learning low-rank and discriminative dictionary for image classification, Image Vis. Comput. 32 (10) (2014) 814–823.
[35] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[36] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, UIUC Technical Report UILU-ENG-09-2215, 2010, arXiv preprint arXiv:1009.5055.
[37] J.-F. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982.
[38] Y. Zhang, Z. Jiang, L.S. Davis, Learning structured low-rank representations for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 676–683.
[39] X. Jiang, J. Lai, Sparse and dense hybrid representation via dictionary decomposition for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 1067–1079.
[40] L. Zhuang, S. Gao, J. Tang, J. Wang, Z. Lin, Y. Ma, N. Yu, Constructing a nonnegative low-rank and sparse graph with data-adaptive features, IEEE Trans. Image Process. 24 (11) (2015) 3717–3728.
[41] S. Li, Y. Fu, Low-rank coding with b-matching constraint for semi-supervised classification, in: International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 1472–1478.
[42] K. Tang, R. Liu, Z. Su, J. Zhang, Structure-constrained low-rank representation, IEEE Trans. Neural Netw. Learn. Syst. 25 (12) (2014) 2167–2179.
[43] A.M. Martinez, The AR Face Database, CVC Technical Report, 1998.
[44] C.-F. Chen, C.-P. Wei, Y.-C.F. Wang, Low-rank matrix recovery with structural incoherence for robust face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2618–2625.
[45] W. Deng, J. Hu, J. Guo, In defense of sparsity based face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 399–406.
[46] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 643–660.
[47] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report No. CUCS-006-96, Department of Computer Science, Columbia University, New York, USA, 1996.
[48] S. Wang, Y. Fu, Locality-constrained discriminative learning and coding, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2015, pp. 17–24.
[49] S. Li, Y. Fu, Learning robust and discriminative subspace with low-rank constraints, IEEE Trans. Neural Netw. Learn. Syst. 27 (11) (2016) 2160–2173.
[50] F. Wu, X.-Y. Jing, X. You, D. Yue, R. Hu, J.-Y. Yang, Multi-view low-rank dictionary learning for image classification, Pattern Recognit. 50 (2016) 143–154.
[51] P. Zhou, C. Zhang, Z. Lin, Bilevel model based discriminative dictionary learning for recognition, IEEE Trans. Image Process. 26 (3) (2017) 1173–1187.
[52] Z. Feng, M. Yang, L. Zhang, Y. Liu, D. Zhang, Joint discriminative dimensionality reduction and dictionary learning for face recognition, Pattern Recognit. 46 (8) (2013) 2134–2143.
[53] C. Zhang, C. Liang, L. Li, J. Liu, Q. Huang, Q. Tian, Fine-grained image classification via low-rank sparse coding with general and class-specific codebooks, IEEE Trans. Neural Netw. Learn. Syst. (2016), doi:10.1109/TNNLS.2016.2545112.
[54] Z. Li, Z. Lai, Y. Xu, J. Yang, D. Zhang, A locality-constrained and label embedding dictionary learning algorithm for image classification, IEEE Trans. Neural Netw. Learn. Syst. 28 (2) (2017) 278–293.
[55] Q. Liu, Z. Lai, Z. Zhou, F. Kuang, Z. Jin, A truncated nuclear norm regularization method based on weighted residual error for matrix completion, IEEE Trans. Image Process. 25 (1) (2016) 316–330.


Yi Rong received the B.Sc. degree in computer science and technology from Huazhong University of Science and Technology, China, in 2012, and the M.Sc. degree in software engineering from Wuhan University of Technology, China, in 2014. He is a Ph.D. student at the School of Computer Science and Technology, Wuhan University of Technology, China. Currently, he is a visiting Ph.D. student with the School of Engineering, Griffith University, Australia. His research interests include digital image processing, pattern recognition, machine learning and computer vision.

Shengwu Xiong received the B.Sc. degree in computational mathematics from Wuhan University, China, in 1987, and the M.Sc. and Ph.D. degrees in computer software and theory from Wuhan University, China, in 1997 and 2003, respectively. Currently, he is a Professor with the School of Computer Science and Technology, Wuhan University of Technology, China. His research interests include intelligent computing, machine learning and pattern recognition.

Yongsheng Gao received the B.Sc. and M.Sc. degrees in electronic engineering from Zhejiang University, China, in 1985 and 1988, respectively, and the Ph.D. degree in computer engineering from Nanyang Technological University, Singapore. He is currently a Professor with the School of Engineering, Griffith University, Australia. He had been the Leader of the Biosecurity Group, Queensland Research Laboratory, National ICT Australia (ARC Centre of Excellence), a consultant of Panasonic Singapore Laboratories, and an Assistant Professor in the School of Computer Engineering, Nanyang Technological University, Singapore. His research interests include face recognition, biometrics, biosecurity, environmental informatics, image retrieval, computer vision, pattern recognition, and medical imaging.