
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. XX, NO. X, JANUARY 2020

Memory Efficient Class-Incremental Learning for Image Classification

Hanbin Zhao, Hui Wang, Yongjian Fu, Fei Wu, Xi Li*

Abstract—Under memory-resource-limited constraints, class-incremental learning (CIL) usually suffers from the “catastrophic forgetting” problem when updating the joint classification model on the arrival of newly added classes. To cope with the forgetting problem, many CIL methods transfer the knowledge of old classes by preserving some exemplar samples into a size-constrained memory buffer. To utilize the memory buffer more efficiently, we propose to keep more auxiliary low-fidelity exemplar samples rather than the original real high-fidelity exemplar samples. Such a memory-efficient exemplar preserving scheme makes the old-class knowledge transfer more effective. However, the low-fidelity exemplar samples are often distributed in a different domain away from that of the original exemplar samples, that is, a domain shift. To alleviate this problem, we propose a duplet learning scheme that seeks to construct domain-compatible feature extractors and classifiers, which greatly narrows down the above domain gap. As a result, these low-fidelity auxiliary exemplar samples have the ability to moderately replace the original exemplar samples with a lower memory cost. In addition, we present a robust classifier adaptation scheme, which further refines the biased classifier (learned with the samples containing distillation label knowledge about old classes) with the help of samples with pure true class labels. Experimental results demonstrate the effectiveness of this work against the state-of-the-art approaches. We will release the code, baselines, and training statistics for all models to facilitate future research.

Index Terms—Class-incremental Learning, Memory Efficient, Exemplar, Catastrophic Forgetting, Classification

I. INTRODUCTION

Recent years have witnessed a great development of incremental learning [1]–[19], which has a wide range of real-world applications with the capability of continual model learning. To handle a sequential data stream with time-varying new classes, class-incremental learning [20] has emerged as a technique for the resource-constrained classification problem, which dynamically updates the model with the new-class samples as well as a tiny portion of old-class information (stored in a limited memory buffer). In general, class-incremental learning aims to set up a joint classification model simultaneously covering the information from both new and old classes.

H. Zhao, H. Wang, and Y. Fu are with the College of Computer Science, Zhejiang University, Hangzhou 310027, China (e-mail: zhaohanbin, wanghui 17, [email protected]).

F. Wu is with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

X. Li* (corresponding author) is with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China, and also with the Alibaba-Zhejiang University Joint Institute of Frontier Technologies, Hangzhou 310027, China (e-mail: [email protected]).

The first two authors contribute equally.

Fig. 1. Illustration of resource-constrained class-incremental learning. First, a model is trained on the first available data (Data 1). After that, part of those data is stored in a limited memory. When new data (Data 2, Data 3, . . . ) arrives, the samples in the memory are extracted and used together with the new data to train the network so that it can correctly identify all the classes it has seen.

Such a joint model usually faces the forgetting problem [21]–[30] with the domination of new-class samples. To address the forgetting problem, many class-incremental learning approaches typically concentrate on the following two aspects: 1) how to efficiently utilize the limited memory buffer (e.g., select representative exemplar samples from old classes); and 2) how to effectively attach the new-class samples with old-class information (e.g., feature transferring from old classes to new classes, or a distillation label [31] on each sample from an old-class teacher model [32]). Therefore, we focus on effective class knowledge transfer and robust classifier updating for class-incremental learning within a limited memory buffer.

As for class knowledge transfer, a typical way is to preserve some exemplar samples into a memory buffer, which has a constrained size in practice. For maintaining a low memory cost of classification, existing approaches [20], [31] usually resort to reducing the number of exemplar samples from old classes, resulting in a learning performance drop. Motivated by this observation, we attempt to enhance the learning performance with a fixed memory buffer by increasing the number of exemplar samples while moderately reducing the fidelity of the exemplar samples. Our goal is to build a memory efficient class-incremental learning manner with low-fidelity exemplars. However, the normal exemplar-based class-incremental learning schemes [20], [31] cannot work well with low-fidelity exemplars, because there exists a domain gap between the original exemplar samples and their corresponding low-fidelity ones (with smaller memory sizes). Thus, a specific



learning scheme must be proposed to update the model while reducing the influence of domain shift. In our duplet learning scheme, when facing the samples of new classes, the low-fidelity exemplar samples are treated as auxiliary samples, resulting in a set of duplet sample pairs in the form of original samples and their corresponding auxiliary samples. Based on such duplet sample pairs, we construct a duplet-driven deep learner that aims to build domain-compatible feature extractors and classifiers to alleviate the domain-shift problem. With such a domain-compatible learning scheme, the low-fidelity auxiliary samples have the capability of moderately replacing the original high-fidelity samples, leading to more exemplar samples in the fixed memory buffer and a better learning performance.

After that, the duplet-driven deep learner is carried out over the new-class samples to generate their corresponding distillation label information of old classes, which makes the new-class samples inherit the knowledge of old classes. In this way, the label information on each new-class sample is composed of both distillation labels of old classes and true new-class labels. Hence, the overall classifier is incrementally updated with these two kinds of label information. Since the distillation label information is noisy, the classifier still has a small bias. Therefore, we propose a classifier adaptation scheme to correct the classifier. Specifically, we fix the feature extractor learned with knowledge distillation, and then adapt the classifier over samples with true class labels only (without any distillation label information). Finally, the corrected classifier is obtained as a more robust classifier.

In summary, the main contributions of this work are three-fold. First, we propose a novel memory-efficient duplet-driven scheme for resource-constrained class-incremental learning, which innovatively utilizes low-fidelity auxiliary samples for old-class knowledge transfer instead of the original real samples. With more exemplar samples in the limited memory buffer, the proposed learning scheme is capable of learning domain-compatible feature extractors and classifiers, which greatly reduces the influence of the domain gap between the auxiliary data domain and the original data domain. Second, we present a classifier adaptation scheme, which refines the overall biased classifier (after distilling the old-class knowledge into the model) by using pure true class labels for the samples while keeping the feature extractors fixed. Third, extensive experiments over benchmark datasets demonstrate the effectiveness of this work against the state-of-the-art approaches.

The rest of the paper is organized as follows. We first describe the related work in Section II, and then explain the details of our proposed strategy in Section III. In Section IV, we report the experiments that we conducted and discuss their results. Finally, we draw a conclusion and describe future work in Section V.

II. RELATED WORK

Recently, there have been a lot of research works on incremental learning with deep models [33]–[41]. These works can be roughly divided into three partially overlapping categories of common incremental learning strategies.

A. Rehearsal strategies

Rehearsal strategies [20], [31], [42]–[46] periodically replay past knowledge to the model from a limited memory buffer to strengthen the connections for previously learned memories. Selecting and preserving some exemplar samples of past classes into the size-constrained memory is a strategy to keep the old-class knowledge. A more challenging alternative is pseudo-rehearsal with generative models: some generative replay strategies [47]–[51] attempt to keep the domain knowledge of old data with a generative model, but using only generated samples does not give competitive results.

B. Regularization strategies

Regularization strategies extend the loss function with loss terms enabling the updated weights to retain past memories. The work in [32] preserves the model accuracy on old classes by encouraging the updated model to reproduce the scores of old classes for each image through a knowledge distillation loss. The strategy in [52] is to apply the knowledge distillation loss to incremental learning of object detectors. Other strategies [3], [53]–[55] use a weighted quadratic regularization loss to penalize moving important weights used for old tasks.

C. Architectural strategies

Architectural strategies [56]–[63] mitigate forgetting by fixing part of the model's architecture (e.g., layers, activation functions, parameters). PNN [57] combines parameter freezing and network expansion, and CWR [58] is proposed with a fixed number of shared parameters based on PNN.

Our work belongs to the first and the second categories. We focus on effective past-class knowledge transfer and robust classifier updating within a limited memory buffer. For effective class knowledge transfer with more exemplar samples, we innovatively design a memory-efficient exemplar preserving scheme and a duplet learning scheme that utilize the low-fidelity exemplar samples for knowledge transfer, instead of directly utilizing the original real samples. Moreover, the distillation label information of old classes on new-class samples obtained with knowledge distillation is usually noisy. Motivated by this observation, we further refine the biased classifier with a classifier adaptation scheme.

TABLE I
MAIN NOTATIONS AND SYMBOLS USED THROUGHOUT THE PAPER.

Notation    Definition
X_h         The sample set of class h
X̄_h         The auxiliary form of the sample set of class h
A_t         The added data of new classes at the t-th learning session
F_t         The deep image classification model at the t-th learning session
θ_{F_t}     The parameters of F_t
B_t         The feature extractor of F_t
C_t         The classifier of F_t
P̄^j_t       The auxiliary exemplar samples of old class j stored at the t-th learning session
R̄_t         The auxiliary exemplar samples of old classes preserved at previous t learning sessions
E           The mapping function of the encoder
D           The mapping function of the decoder


Fig. 2. Visualization of the low-fidelity auxiliary samples and the corresponding real samples with two different feature extractors by t-SNE. (a) The feature extractor is updated incrementally on the auxiliary data without our duplet learning scheme, and there is a large gap between the auxiliary data and the real data. (b) The feature extractor is updated incrementally on the auxiliary data with our duplet learning scheme, and the domain gap is reduced.

III. METHOD

A. Problem Definition

Before presenting our method, we first provide an illustration of the main notations and symbols used hereinafter (as shown in Table I) for a better understanding.

Class-incremental learning assumes that samples from one new class or a batch of new classes come at a time. For simplicity, we suppose that the sample sets in a data stream arrive in order (i.e., X_1, X_2, . . . ), and the sample set X_h contains the samples of class h (h ∈ {1, 2, . . . }). We consider the time interval from the arrival of the current batch of classes to the arrival of the next batch of classes as a class-incremental learning session [47], [56]. The batch of new-class data added at the t-th (t ∈ {1, 2, . . . }) learning session is represented as:

$$A_t = \begin{cases} X_1 \cup \cdots \cup X_k \cup \cdots \cup X_{n_1} & t = 1 \\ X_{n_{t-1}+1} \cup \cdots \cup X_k \cup \cdots \cup X_{n_t} & t \ge 2 \end{cases} \quad (1)$$
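To make the notation concrete, the short sketch below (our own illustration, not code from the paper) assembles A_t from per-class sample sets given the cumulative class counts n_1, n_2, . . . of Equation (1); the identifiers are hypothetical:

```python
def session_data(sample_sets, class_counts, t):
    """Build A_t (Eq. 1) from per-class sample sets.

    `sample_sets` maps the class index h (1-based) to its list of samples X_h;
    `class_counts` is the list [n_1, n_2, ...] of cumulative class counts.
    Both names are illustrative choices, not notation from the paper's code.
    """
    start = 0 if t == 1 else class_counts[t - 2]   # last class of session t-1
    end = class_counts[t - 1]                      # last class of session t
    batch = []
    for h in range(start + 1, end + 1):            # classes are indexed from 1
        batch.extend(sample_sets[h])
    return batch
```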

In an incremental learning environment with a limited memory buffer, previous samples of old classes cannot be stored entirely, and only a small number of exemplars selected from those samples are preserved in memory for old-class knowledge transfer [20], [31]. The memory buffer is dynamically updated at each learning session.

At the t-th session, after obtaining the new-class samples A_t, we access the memory M_{t−1} to extract the exemplar samples carrying old-class information. Let P^j_{t−1} denote the set of exemplar samples extracted from the memory at the t-th session for the old class j:

$$P_{t-1}^{j} = \{(x_1^j, y^j), \ldots, (x_{e_{t-1}}^j, y^j)\} \quad (2)$$

where e_{t−1} is the number of exemplar samples and y^j is the corresponding ground-truth label. P^j_{t−1} consists of the first e_{t−1} samples selected from the sorted list of samples of class j by herding [64]. M_{t−1} can then be rewritten as:

$$M_{t-1} = P_{t-1}^{1} \cup \cdots \cup P_{t-1}^{n_{t-1}} \quad (3)$$
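The paper selects exemplars by herding [64] without spelling out the procedure; the sketch below shows one common herding-style selection (in the spirit of iCaRL [20]), assuming L2-normalized feature vectors for one class are already available:

```python
import numpy as np

def herding_selection(features, m):
    """Pick m exemplars whose running mean best matches the class mean.

    A minimal sketch of herding-style exemplar selection; `features` is
    assumed to be an (N, d) array of L2-normalized feature vectors of one
    class (an assumption, not a detail stated in the paper).
    """
    class_mean = features.mean(axis=0)
    selected, current_sum = [], np.zeros_like(class_mean)
    for _ in range(m):
        # Mean obtained by adding each remaining candidate to the current set.
        candidate_means = (current_sum + features) / (len(selected) + 1)
        dists = np.linalg.norm(class_mean - candidate_means, axis=1)
        dists[selected] = np.inf          # never pick the same sample twice
        best = int(np.argmin(dists))
        selected.append(best)
        current_sum += features[best]
    return selected                       # indices, ordered by herding priority
```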

The objective is to train a new model F_t which has competitive classification performance on the test set of all the seen classes. F_t represents the deep image classification model at the t-th learning session, and the parameters of the model are denoted as θ_{F_t}. The output of F_t is defined as:

$$F_t(x) = [F_t^1(x), \ldots, F_t^{n_{t-1}}(x), F_t^{n_{t-1}+1}(x), \ldots, F_t^{n_t}(x)] \quad (4)$$

F_t is usually composed of a feature extractor B_t and a classifier C_t. After obtaining the model F_t, the memory buffer is updated and M_t is constructed with the exemplars in A_t and a subset of M_{t−1}.

B. Memory Efficient Class-incremental Learning

It is helpful for knowledge transfer to store a larger number of exemplars in the memory [31]. For effective knowledge transfer, we propose a memory efficient class-incremental learning manner, which utilizes more low-fidelity auxiliary exemplar samples to approximately replace the original real exemplar samples.

We present an encoder-decoder structure to transform an original high-fidelity real sample x into the corresponding low-fidelity sample x̄ (with a smaller memory size):

$$\bar{x} = D \circ E(x) \quad (5)$$

where E is the mapping function of the encoder and D is the mapping function of the decoder. Due to the loss of fidelity, the auxiliary sample code can be kept at a smaller memory cost than storing the corresponding real sample. We use r to represent the ratio between the memory cost of keeping a low-fidelity sample and that of keeping the corresponding high-fidelity real sample, i.e., r = size(E(x)) / size(x).
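As one concrete instantiation of E and D (an assumption on our part; the paper also uses PCA-based codes, see Section IV-C), the sketch below realizes E as bilinear downsampling and D as bilinear upsampling, so that the stored code occupies roughly a fraction r of the original pixels:

```python
import torch
import torch.nn.functional as F

def encode(x, r=1/6):
    """E: downsample a batch of images (N, C, H, W) so that the stored code
    uses roughly a fraction r of the original pixel count."""
    scale = r ** 0.5                      # r is interpreted as a ratio of pixel counts
    return F.interpolate(x, scale_factor=scale, mode='bilinear', align_corners=False)

def decode(code, size):
    """D: upsample the stored code back to the original spatial size."""
    return F.interpolate(code, size=size, mode='bilinear', align_corners=False)

x = torch.rand(8, 3, 32, 32)              # e.g. a batch of CIFAR-100 images
code = encode(x, r=1/6)                    # only this code is written to memory
x_bar = decode(code, size=x.shape[-2:])    # low-fidelity auxiliary sample x̄ = D∘E(x)
```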

Let P̄^j_{t−1} denote the set of exemplar auxiliary samples extracted from the memory at the t-th session for the old class j:

$$\bar{P}_{t-1}^{j} = \{(\bar{x}_1^j, y^j), \ldots, (\bar{x}_{\bar{e}_{t-1}}^j, y^j)\} \quad (6)$$

where ē_{t−1} is the number of exemplar auxiliary samples.


Fig. 3. Illustration of the process of class-incremental learning with our duplet learning scheme and classifier adaptation scheme. We use B_t and C_t (t ∈ N) to represent the feature extractor and the classifier, respectively, at the t-th learning session. For initialization, F_1 is trained from scratch with the set of duplet sample pairs (A_1, Ā_1); R̄_1 is then constructed and stored in the memory in the form of E(R_1). At the t-th learning session, we first train a new feature extractor B_t and a biased classifier on all the seen classes with our duplet learning scheme, and the exemplar auxiliary samples R̄_t for all the seen classes are then constructed. Finally, we update the classifier with the classifier adaptation scheme on R̄_t.

With the fixed-size memory buffer, ē_{t−1} is larger than e_{t−1}. We use R̄_{t−1} to represent all the exemplar auxiliary samples extracted from M_{t−1}:

$$\bar{R}_{t-1} = \bar{P}_{t-1}^{1} \cup \cdots \cup \bar{P}_{t-1}^{n_{t-1}} \quad (7)$$

M_{t−1} in our memory efficient class-incremental learning can then be represented as:

$$M_{t-1} = E(R_{t-1}) \quad (8)$$

where R_{t−1} represents the corresponding real samples of R̄_{t−1}.

At the t-th session, we use the exemplars R̄_{t−1} and the newly added data A_t to train the model F_t. A larger number of exemplars from M_{t−1} is helpful for learning the model. However, the domain gap between the auxiliary samples and their original versions (as shown in Figure 2(a)) often leads to poor performance. To fix this issue, we propose a duplet class-incremental learning scheme in the following subsection.

C. Duplet Class-incremental Learning Scheme

To reduce the influence of domain shift, we propose a duplet learning scheme, which trains the model using a set of duplet sample pairs in the form of original samples and their corresponding auxiliary samples.

A duplet sample pair (x, x̄, y) is constructed from an auxiliary sample (x̄, y) and the corresponding real sample (x, y). At the t-th learning session (as shown in Figure 3), we construct a set of duplet sample pairs with A_t and Ā_t, denoted as:

$$(A_t, \bar{A}_t) = \{(x_k, \bar{x}_k, y_k)\}_{k=1}^{|A_t|} \quad \text{s.t.}\ (x_k, y_k) \in A_t,\ (\bar{x}_k, y_k) \in \bar{A}_t \quad (9)$$

We train the model F_t with the duplet sample pairs of (A_t, Ā_t) and the exemplar auxiliary samples of old classes R̄_{t−1} by optimizing the objective function L_t:

$$L_t(\theta_{F_t}) = L_1^t(\theta_{F_t}) + L_{2\text{-}dup}^t(\theta_{F_t}) \quad (10)$$

where L^t_1 is the loss term for R̄_{t−1} and L^t_{2-dup} is the loss term for (A_t, Ā_t).

L^t_1: For preserving the model performance on old-class auxiliary samples, L^t_1 is defined as:

$$L_1^t(\theta_{F_t}) = \frac{1}{|\bar{R}_{t-1}|} \sum_{(\bar{x}, y) \in \bar{R}_{t-1}} l^t((\bar{x}, y); \theta_{F_t}) \quad (11)$$

where l^t is composed of a classification loss term l^t_cls and a knowledge distillation loss term l^t_dis for one sample, and is defined as:

$$l^t((x, y); \theta_{F_t}) = l_{cls}^t((x, y); \theta_{F_t}) + l_{dis}^t((x, y); \theta_{F_t}) \quad (12)$$


The classification loss l^t_cls for one training sample on the newly added classes is formulated as:

$$l_{cls}^t((x, y); \theta_{F_t}) = \sum_{k=n_{t-1}+1}^{n_t} \mathrm{Entropy}(F_t^k(x; \theta_{F_t}), \delta_{y=k}) \quad (13)$$

where δ_{y=k} is an indicator function, denoted as:

$$\delta_{y=k} = \begin{cases} 1 & y = k \\ 0 & y \neq k \end{cases} \quad (14)$$

and Entropy(·, ·) is a cross-entropy function, represented as:

$$\mathrm{Entropy}(\hat{y}, y) = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right] \quad (15)$$

The knowledge distillation loss l^t_dis is defined as:

$$l_{dis}^t((x, y); \theta_{F_t}) = \sum_{k=1}^{n_{t-1}} \mathrm{Entropy}\big(F_t^k(x; \theta_{F_t}), F_{t-1}^k(x; \theta_{F_{t-1}})\big) \quad (16)$$

Its aim is to make the output of the model F_t close to the distillation class labels F^k_{t−1}(x; θ_{F_{t−1}}) (k ∈ {1, 2, . . . , n_{t−1}}) of the previously learned model F_{t−1} on the old classes.

L^t_{2-dup}: For the new-class duplet sample pairs, L^t_{2-dup} encourages the output of the model on the real sample to be similar to that on the corresponding auxiliary sample, and is defined as:

$$L_{2\text{-}dup}^t(\theta_{F_t}) = \frac{1}{|A_t| + |\bar{A}_t|} \sum_{(x, \bar{x}, y) \in (A_t, \bar{A}_t)} \left[\, l^t((x, y); \theta_{F_t}) + l^t((\bar{x}, y); \theta_{F_t}) \,\right] \quad (17)$$

In general, we can obtain a domain-compatible feature extractor and a classifier on all the seen classes by optimizing the loss function in Equation (10). The domain gap in Figure 2(a) is decreased greatly with our duplet learning scheme, as shown in Figure 2(b).

D. Classifier Adaptation

Our duplet-driven deep learner is carried out over the new-class samples to generate their corresponding distillation label information of old classes through the distillation loss (as defined in Equation (16)), which makes the new-class samples inherit the knowledge of old classes. Since the distillation label knowledge is noisy, we propose a classifier adaptation scheme to refine the classifier over samples with true class labels only (without any distillation label information).

Taking the t-th learning session as an example, we can obtain a domain-compatible but biased classifier C_t by optimizing the objective function defined in Equation (10). Here, we fix the learned feature extractor B_t and continue to optimize the parameters of the classifier using only the true class label knowledge. The optimization is done using exclusively the auxiliary samples that are going to be stored in memory M_t. The objective function is formulated as below:

$$L_t(\theta_{C_t}) = \frac{1}{|\bar{R}_t|} \sum_{(\bar{x}, y) \in \bar{R}_t} \sum_{k=1}^{n_t} \mathrm{Entropy}\big(C_t^k \circ B_t(\bar{x}; \theta_{C_t}), \delta_{y=k}\big) \quad (18)$$

By minimizing this objective function, the classifier C_t is refined and achieves better performance on all the seen classes.
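A minimal sketch of the classifier adaptation step of Equation (18), freezing the feature extractor B_t and fine-tuning the classifier on the stored auxiliary exemplars with true labels only; the optimizer and schedule here are our own assumptions, not the paper's:

```python
import torch
import torch.nn.functional as F

def adapt_classifier(feature_extractor, classifier, exemplar_loader, n_seen,
                     epochs=30, lr=0.01):
    """Refine the biased classifier with true class labels only (Eq. 18),
    keeping B_t frozen; `exemplar_loader` yields (x̄, y) pairs from R̄_t."""
    feature_extractor.eval()                       # B_t is fixed
    for p in feature_extractor.parameters():
        p.requires_grad_(False)
    optim = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x_bar, y in exemplar_loader:
            with torch.no_grad():
                feats = feature_extractor(x_bar)
            logits = classifier(feats)             # C_t ∘ B_t(x̄)
            target = F.one_hot(y, n_seen).float()  # δ_{y=k}, no distillation labels
            loss = F.binary_cross_entropy_with_logits(logits, target)
            optim.zero_grad()
            loss.backward()
            optim.step()
    return classifier
```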

Algorithm 1: Training the model at the t-th session
Input: The added data of new classes A_t
Require: The exemplar auxiliary samples R̄_{t−1} and the parameters θ_{F_{t−1}} = {θ_{C_{t−1}}, θ_{B_{t−1}}}
1: Obtain the auxiliary samples Ā_t for the new classes from A_t using Equation (5);
2: Initialize θ_{B_t} and θ_{C_t} with θ_{B_{t−1}} and θ_{C_{t−1}}, respectively;
   /* The duplet learning scheme */
3: Obtain the optimal parameters θ_{B_t} of the feature extractor and a biased classifier by minimizing Equation (10);
4: Obtain the exemplar auxiliary samples R̄_t of all the seen classes from R̄_{t−1} and Ā_t (described in Section III-B);
   /* The classifier adaptation scheme */
5: Fix the parameters θ_{B_t} and adapt the biased classifier by minimizing Equation (18) to obtain the optimal parameters θ_{C_t};
Output: The auxiliary exemplar samples R̄_t and the parameters θ_{F_t} = {θ_{C_t}, θ_{B_t}}

Figure 3 illustrates the process of class-incremental learning with our duplet learning scheme and classifier adaptation in detail. For initialization, F_1 is trained from scratch with the set of duplet sample pairs (A_1, Ā_1); R̄_1 is then constructed and stored in the memory in the form of E(R_1). At the t-th learning session, we extract the auxiliary samples R̄_{t−1} for the previous classes and Ā_t for the new classes. First, we train a domain-compatible feature extractor B_t and a classifier on all the seen classes with our duplet learning scheme (described in Section III-C). Then the exemplar auxiliary samples R̄_t for all the seen classes are constructed from R̄_{t−1} and Ā_t (described in Section III-B). Finally, we update the classifier further with our classifier adaptation scheme (introduced in Section III-D). Algorithm 1 lists the steps for training the model in detail.

IV. EXPERIMENTS

A. Datasets

CIFAR-100 [29] is a labeled subset of the 80 million tiny images dataset for object recognition. This dataset contains 60,000 32×32 RGB images in 100 classes, with 500 images per class for training and 100 images per class for testing. ILSVRC [65] is a dataset for the ImageNet Large Scale Visual Recognition Challenge 2012. It contains 1.28 million training images and 50k validation images in 1000 classes.

B. Evaluation Protocol

We evaluate our method on the iCIFAR-100 benchmark and the iILSVRC benchmark proposed in [20]. On iCIFAR-100, in order to simulate a class-incremental learning process, we train all 100 classes in batches of 5, 10, 20, or 50 classes at a time, which means 5, 10, 20, or 50 classes of new data are added at each learning session. After each batch of classes is added, the accuracy is computed on a subset of the test data containing only those classes that have already been added. The results we report are the average accuracy without considering the accuracy of the first learning session, as it does not represent the incremental learning described in [31].
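The reported metric can be computed as below; the accuracy values in the usage line are purely illustrative and are not results from the paper:

```python
def average_incremental_accuracy(session_accuracies):
    """Average accuracy over learning sessions, skipping the first session,
    which trains only on the initial batch of classes and therefore does not
    involve incremental learning (evaluation protocol of [20], [31])."""
    return sum(session_accuracies[1:]) / len(session_accuracies[1:])

# illustrative per-session accuracies for a 10-session run (not paper results)
print(average_incremental_accuracy([0.85, 0.74, 0.70, 0.68, 0.66,
                                    0.64, 0.63, 0.62, 0.61, 0.60]))
```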


Fig. 4. Illustration of original real samples from CIFAR-100 and their auxiliary sample equivalents with different fidelities.

On ILSVRC, we use a subset of 100 classes which are trained in batches of 10 (iILSVRC-small) [20]. For a fair comparison, we take the same experimental setup as that of [20], which randomly selects 50 samples per class of ILSVRC as the test set and uses the rest as the training set. In addition, the evaluation method of the results is the same as that on iCIFAR-100.

C. Implementation Details

a) Data Preprocessing: On CIFAR-100, the only preprocessing we do is the same as iCaRL [20], including random cropping, data shuffling and per-pixel mean subtraction. On ILSVRC, the augmentation strategy we use is 224 × 224 random cropping and horizontal flipping.

b) Generating auxiliary samples of different fidelities: We generate auxiliary samples of different fidelities (as shown in Figure 4) with various kinds of encoder-decoder structures (e.g., PCA [66], downsampling and upsampling). The fidelity factors r we use (as defined in Section III-B) are as follows: for iCIFAR-100, we use 1/3, 1/6 and 1/12 with the PCA-based reduction, and 1/3 and 1/6 with downsampling. For iILSVRC, we conduct the experiments with r = 1/4 based on downsampling. In our experiments, we use a memory buffer of 2000 full samples. Since the sample fidelity and the number of exemplars are negatively correlated, our approach can store 2000/r samples.
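A sketch of one possible PCA-based encoder-decoder with cost ratio r (the paper does not specify its exact implementation); the closing comment spells out the resulting buffer capacity:

```python
import numpy as np

def fit_pca(train_images, r):
    """Fit a PCA basis whose code length is a fraction r of the input
    dimension; `train_images` is assumed to be an (N, H, W, C) array."""
    X = train_images.reshape(len(train_images), -1).astype(np.float64)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    k = max(1, int(r * X.shape[1]))
    return mean, Vt[:k]                       # principal directions define E and D

def encode(x, mean, components):              # E(x): k-dimensional code
    return (x.reshape(-1) - mean) @ components.T

def decode(code, mean, components, shape):    # D(code): reconstructed low-fidelity x̄
    return (code @ components + mean).reshape(shape)

# With a budget equivalent to 2000 real samples and r = 1/6, the same buffer
# holds 2000 / (1/6) = 12000 auxiliary exemplar codes.
```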

c) Training details: For iCIFAR-100, at each learning session we train a 32-layer ResNet [67] using SGD with a mini-batch size of 256 (128 duplet sample pairs, composed of 128 original samples and the 128 corresponding auxiliary samples) by the duplet learning scheme. The initial learning rate is set to 2.0 and is divided by 5 after 49 and 63 epochs. We train the network using a weight decay of 0.00001 and a momentum of 0.9. For the classifier adaptation, we use the auxiliary exemplar samples for normal training and the other parameters of the experiments remain the same. We implement our framework with the Theano package and use an NVIDIA TITAN 1080 Ti GPU to train the network. For iILSVRC, we train an 18-layer ResNet [67] with an initial learning rate of 2, divided by 5 after 20, 30, 40 and 50 epochs. The rest of the settings are the same as those on iCIFAR-100.
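For reference, the schedule above corresponds to the following optimizer setup, written in PyTorch for illustration even though the authors report using Theano:

```python
import torch

def make_optimizer(model):
    # SGD with the reported hyperparameters for iCIFAR-100
    optimizer = torch.optim.SGD(model.parameters(), lr=2.0,
                                momentum=0.9, weight_decay=1e-5)
    # learning rate divided by 5 after 49 and 63 epochs
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[49, 63], gamma=0.2)
    return optimizer, scheduler
```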

TABLE II
EVALUATION OF DIFFERENT METHODS USING EITHER AUXILIARY EXEMPLARS OR REAL EXEMPLARS. THE NUMBER OF CLASSES ADDED AT EACH SESSION IS 10.

Method          Exemplar     Average Accuracy
iCaRL           Real         60.79%
iCaRL           Auxiliary    50.86%
iCaRL-Hybrid1   Real         55.10%
iCaRL-Hybrid1   Auxiliary    44.86%
Ours.FC         Real         61.67%
Ours.FC         Auxiliary    67.04%
Ours.NCM        Real         61.97%
Ours.NCM        Auxiliary    66.95%

TABLE III
VALIDATION OF OUR DUPLET LEARNING SCHEME ON ICIFAR-100 WITH THE AUXILIARY SAMPLES OF DIFFERENT FIDELITIES. WITH OUR DUPLET LEARNING SCHEME (DUP), THE AVERAGE ACCURACY OF THE CLASS-INCREMENTAL MODEL IS ENHANCED BY MORE THAN 10% COMPARED TO THAT OF THE NORMAL LEARNING SCHEME IN [20] FOR ALL CASES.

Fidelity Factor r     Method                Average Accuracy
PCA 1/3               iCaRL-Hybrid1         44.86%
                      iCaRL-Hybrid1+DUP     +14.67%
PCA 1/6               iCaRL-Hybrid1         42.74%
                      iCaRL-Hybrid1+DUP     +17.39%
PCA 1/12              iCaRL-Hybrid1         40.90%
                      iCaRL-Hybrid1+DUP     +16.88%
Downsampling 1/3      iCaRL-Hybrid1         41.88%
                      iCaRL-Hybrid1+DUP     +16.41%
Downsampling 1/6      iCaRL-Hybrid1         40.29%
                      iCaRL-Hybrid1+DUP     +16.96%

D. Ablation Experiments

In this section, we first evaluate different methods using either auxiliary exemplars or real exemplars. Then we carry out two ablation experiments to validate our duplet learning and classifier adaptation schemes on iCIFAR-100, and we also conduct another ablation experiment to show the effect of the auxiliary sample's fidelity and the auxiliary exemplar data size for each class when updating the model. Finally, we evaluate the methods with memory buffers of different sizes.

1) Baseline: For the class-incremental learning problem, we consider three kinds of baselines: a) LWF.MC [32], which utilizes knowledge distillation for the incremental learning problem; b) iCaRL [20], which utilizes exemplars for old-class knowledge transfer together with a nearest-mean-of-exemplars classification strategy; and c) iCaRL-Hybrid1 [20], which also uses the exemplars but with a neural network classifier (i.e., a fully connected layer).

2) Using auxiliary exemplars or real exemplars: We evaluate different methods using either auxiliary exemplars or real exemplars, as shown in Table II. For a fair comparison, the memory cost for the auxiliary or the real exemplars is fixed.


TABLE IV
VALIDATION OF OUR CLASSIFIER ADAPTATION SCHEME ON ICIFAR-100 WITH THE AUXILIARY SAMPLES OF DIFFERENT FIDELITIES. THE MODEL'S ACCURACY IS IMPROVED AFTER UPDATING THE CLASSIFIER FURTHER WITH THE CLASSIFIER ADAPTATION SCHEME.

Fidelity Factor r     Method                     Average Accuracy
PCA 1/3               iCaRL-Hybrid1+DUP          59.53%
                      iCaRL-Hybrid1+DUP+CA       +7.51%
PCA 1/6               iCaRL-Hybrid1+DUP          60.13%
                      iCaRL-Hybrid1+DUP+CA       +3.93%
PCA 1/12              iCaRL-Hybrid1+DUP          57.77%
                      iCaRL-Hybrid1+DUP+CA       +0.63%
Downsampling 1/3      iCaRL-Hybrid1+DUP          58.29%
                      iCaRL-Hybrid1+DUP+CA       +2.63%
Downsampling 1/6      iCaRL-Hybrid1+DUP          57.25%
                      iCaRL-Hybrid1+DUP+CA       +0.68%

[Figure: accuracy vs. cost ratio r based on the PCA method, for Ours.FC, iCaRL, iCaRL-Hybrid1 and LWF.MC.]

Fig. 5. Performance of the model when varying the auxiliary sample's fidelity and the exemplar auxiliary data size. r = 1 indicates keeping the real exemplar samples in memory directly [20].

Our method is denoted by "Ours.FC" if we utilize the fully connected layer as the classifier, or "Ours.NCM" if utilizing the nearest-mean-of-exemplars classification strategy. Using auxiliary exemplars directly leads to a performance drop for both iCaRL and iCaRL-Hybrid1 because of the large domain gap between the auxiliary data and the real data (shown in Figure 2(a)). For our domain-invariant learning method, the average accuracy when using the auxiliary exemplars is about 5% higher than that when using the real exemplars. This suggests that, under the same memory buffer limitation, using auxiliary exemplars can further improve the performance compared with using the real exemplars, as long as the domain drift between them is reduced.

3) Validation of the duplet learning scheme: We evaluate our duplet learning scheme with auxiliary samples of different fidelities on the iCIFAR-100 benchmark. 10 new classes are added at each learning session. We utilize the "iCaRL-Hybrid1" method with the auxiliary samples trained by the normal learning scheme in [20] or by our duplet learning scheme (DUP).

[Figure: accuracy vs. memory size for Ours.FC, iCaRL, iCaRL-Hybrid1 and LWF.MC.]

Fig. 6. Average incremental accuracy on iCIFAR-100 with 10 classes per batch for memories of different sizes (expressed in the number of real exemplar samples). The average accuracy of the model with our scheme is higher than that of iCaRL, LWF.MC and iCaRL-Hybrid1 in all the cases.

As shown in Table III, we can observe that our scheme significantly improves the performance of the final model for various kinds of low-fidelity auxiliary samples compared with directly training the model. Moreover, the t-SNE [68] analysis in Figure 2(a) shows that normally there is a large gap between the auxiliary data and the real data without our duplet learning scheme. Figure 2(b) illustrates that our duplet learning scheme can actually reduce the domain drift and guarantee the effectiveness of the auxiliary data for preserving the model's performance on old classes.

4) Validation of the classifier adaptation scheme: We evaluate the performance of the final classifier after using our classifier adaptation scheme (CA). As shown in Table IV, the classifier adaptation scheme can further improve the classifier's accuracy. For the auxiliary samples of different fidelities based on PCA or downsampling, we observe that the performance improvement decreases along with the auxiliary samples' fidelity.

5) Balance of the auxiliary exemplar samples' size and fidelity: We examine the effect of varying the auxiliary sample's fidelity and the auxiliary exemplar data size for each class while the size of the limited memory buffer remains the same. Specifically, the fidelity decreases as the number of auxiliary samples increases, where the number of auxiliary samples is set to 2000/r. As shown in Figure 5, the average accuracy of the model first increases and then decreases with the decrease of the auxiliary samples' fidelity (i.e., the increase of the auxiliary samples' number), which means moderately reducing the samples' fidelity can improve the final model's performance with a limited memory buffer. When the fidelity of an auxiliary sample is decreased too much, the model's performance drops due to too much loss of class knowledge.


[Figure: four panels of accuracy vs. number of classes. Average accuracies: 5 classes per session: Ours 65.96%, BiC 62.96%, ETE 63.40%, iCaRL 57.44%, iCaRL-Hybrid1 47.55%, LWF.MC 29.53%. 10 classes: Ours 67.04%, BiC 63.86%, ETE 63.60%, iCaRL 60.79%, iCaRL-Hybrid1 55.10%, LWF.MC 39.10%. 20 classes: Ours 67.25%, BiC 65.14%, ETE 63.70%, iCaRL 63.33%, iCaRL-Hybrid1 61.18%, LWF.MC 46.11%. 50 classes: Ours 64.74%, BiC 63.92%, ETE 60.80%, iCaRL 62.44%, iCaRL-Hybrid1 60.96%, LWF.MC 51.79%.]

Fig. 7. The performance of different methods with incremental learning sessions of 5, 10, 20 and 50 classes on iCIFAR-100. The average accuracy over the incremental learning sessions is shown in parentheses for each method and is computed without considering the accuracy of the first learning session. Our class-incremental learning scheme with auxiliary samples obtains the best results in all the cases.

6) Fixed memory buffer size: We conduct the experiments with memory buffers of different sizes on iCIFAR-100, where the number of classes added at each learning session is 10. The size of the memory buffer is expressed in the number of real exemplar samples. As shown in Figure 6, all of the exemplar-based methods ("Ours.FC", "iCaRL" and "iCaRL-Hybrid1") benefit from a larger memory, which indicates that more samples of old classes are useful for keeping the performance of the model. The average accuracy of the model with our scheme is higher than that of iCaRL, LWF.MC and iCaRL-Hybrid1 in all the cases.

E. State-of-the-Art Performance Comparison

In this section, we evaluate the performance of our proposed scheme on the iCIFAR-100 and iILSVRC benchmarks against the state-of-the-art methods, including LWF.MC [32], iCaRL, iCaRL-Hybrid1 [20], ETE [31] and BiC [46].

For iCIFAR-100, we evaluate incremental learning sessions of 5, 10, 20 and 50 classes; Figure 7 summarises the results of the experiments. The memory size for all the evaluated methods is the same. We observe that our class-incremental learning scheme with auxiliary samples obtains the best results in all the cases. Compared with iCaRL and iCaRL-Hybrid1, we achieve a higher accuracy at each learning session. When the new-class data arrives, the accuracy of our scheme decreases slowly compared to ETE and BiC.

For iILSVRC-small, we evaluate the performance of our method with incremental learning sessions of 10 classes, and the results are shown in Figure 8. It can also be observed that our scheme obtains the highest accuracy at each incremental learning session among all the compared methods.

V. CONCLUSION

In this paper, we have presented a novel memory-efficient exemplar preserving scheme and a duplet learning scheme for resource-constrained class-incremental learning, which transfer the old-class knowledge with low-fidelity auxiliary samples rather than the original real samples.


[Figure: accuracy vs. number of classes. Average accuracies: Ours 90.82%, BiC 90.12%, ETE 89.03%, iCaRL 77.56%, iCaRL-Hybrid1 73.21%, LWF.MC 68.59%.]

Fig. 8. The performance of different methods with the incremental learning session of 10 classes on iILSVRC-small. The average accuracy over all the incremental learning sessions is shown in parentheses for each method.

We have also proposed a classifier adaptation scheme for updating the classifier, which refines the biased classifier with samples of pure true class labels. Our scheme has obtained better results than the state-of-the-art methods on several datasets. As part of our future work, we plan to explore a low-fidelity auxiliary sample selection scheme for inheriting more class information within a limited memory buffer.

ACKNOWLEDGMENT

The authors would like to thank Xin Qin for their valuablecomments and suggestions.

REFERENCES

[1] J. Xu and Z. Zhu, "Reinforced continual learning," in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 907–916.

[2] E. Belouadah and A. Popescu, "Deesil: Deep-shallow incremental learning," in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2018, pp. 151–157.

[3] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr, "Riemannian walk for incremental learning: Understanding forgetting and intransigence," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 532–547.

[4] A. Rannen, R. Aljundi, M. B. Blaschko, and T. Tuytelaars, "Encoder based lifelong learning," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1320–1328.

[5] R. Aljundi, P. Chakravarty, and T. Tuytelaars, "Expert gate: Lifelong learning with a network of experts," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3366–3375.

[6] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Networks, 2019.

[7] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, "Continual learning: A comparative study on how to defy forgetting in classification tasks," arXiv preprint arXiv:1909.08383, 2019.

[8] J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C.-C. J. Kuo, "Class-incremental learning via deep model consolidation," arXiv preprint arXiv:1903.07864, 2019.

[9] Y. Hao, Y. Fu, Y.-G. Jiang, and Q. Tian, "An end-to-end architecture for class-incremental object detection with knowledge distillation," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2019, pp. 1–6.

[10] Y. Hao, Y. Fu, and Y.-G. Jiang, "Take goods from shelves: A dataset for class-incremental object detection," in Proceedings of the International Conference on Multimedia Retrieval (ICMR). ACM, 2019, pp. 271–278.

[11] K. Lee, K. Lee, J. Shin, and H. Lee, "Overcoming catastrophic forgetting with unlabeled data in the wild," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 312–321.

[12] Q. Dong, S. Gong, and X. Zhu, "Imbalanced deep learning by minority class incremental rectification," IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 41, no. 6, pp. 1367–1381, 2018.

[13] H. Zhang, X. Xiao, and O. Hasegawa, "A load-balancing self-organizing incremental neural network," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 25, no. 6, pp. 1096–1105, 2013.

[14] B. Gu, V. S. Sheng, K. Y. Tay, W. Romano, and S. Li, "Incremental support vector learning for ordinal regression," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 26, no. 7, pp. 1403–1416, 2014.

[15] A. Penalver and F. Escolano, "Entropy-based incremental variational bayes learning of gaussian mixtures," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 23, no. 3, pp. 534–540, 2012.

[16] Z. Wang, H.-X. Li, and C. Chen, "Incremental reinforcement learning in continuous spaces via policy relaxation and importance weighting," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2019.

[17] Z. Yang, S. Al-Dahidi, P. Baraldi, E. Zio, and L. Montelatici, "A novel concept drift detection method for incremental learning in nonstationary environments," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2019.

[18] Y. Nakamura and O. Hasegawa, "Nonparametric density estimation based on self-organizing incremental neural network for large noisy data," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 28, no. 1, pp. 8–17, 2016.

[19] C. P. Chen, Z. Liu, and S. Feng, "Universal approximation capability of broad learning system and its structural variations," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 30, no. 4, pp. 1191–1204, 2018.

[20] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, "iCaRL: Incremental classifier and representation learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2001–2010.

[21] R. M. French, "Catastrophic forgetting in connectionist networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.

[22] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio, "An empirical investigation of catastrophic forgetting in gradient-based neural networks," in Proceedings of the International Conference on Learning Representations (ICLR), 2014.

[23] M. McCloskey and N. J. Cohen, "Catastrophic interference in connectionist networks: The sequential learning problem," in Psychology of Learning and Motivation. Elsevier, 1989, vol. 24, pp. 109–165.

[24] B. Pfulb and A. Gepperth, "A comprehensive, application-oriented study of catastrophic forgetting in DNNs," in Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[25] H. Ritter, A. Botev, and D. Barber, "Online structured laplace approximations for overcoming catastrophic forgetting," in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 3738–3748.

[26] X. Liu, M. Masana, L. Herranz, J. Van de Weijer, A. M. Lopez, and A. D. Bagdanov, "Rotate your networks: Better weight consolidation and less catastrophic forgetting," in Proceedings of the International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 2262–2268.

[27] J. Serra, D. Surís, M. Miron, and A. Karatzoglou, "Overcoming catastrophic forgetting with hard attention to the task," arXiv preprint arXiv:1801.01423, 2018.

[28] S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang, "Overcoming catastrophic forgetting by incremental moment matching," in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 4652–4662.

[29] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Citeseer, Tech. Rep., 2009.


[30] R. Coop, A. Mishtal, and I. Arel, "Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 24, no. 10, pp. 1623–1634, 2013.

[31] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari, "End-to-end incremental learning," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 233–248.

[32] Z. Li and D. Hoiem, "Learning without forgetting," IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 40, no. 12, pp. 2935–2947, 2017.

[33] C. P. Chen and Z. Liu, "Broad learning system: An effective and efficient incremental learning system without the need for deep architecture," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 29, no. 1, pp. 10–24, 2017.

[34] Y. Xing, F. Shen, and J. Zhao, "Perception evolution network based on cognition deepening model: Adapting to the emergence of new sensory receptor," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 27, no. 3, pp. 607–620, 2015.

[35] C. Hou and Z.-H. Zhou, "One-pass learning with incremental and decremental features," IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), vol. 40, no. 11, pp. 2776–2792, 2017.

[36] J.-Y. Park and J.-H. Kim, "Incremental class learning for hierarchical classification," IEEE Transactions on Cybernetics, vol. 50, no. 1, pp. 178–189, 2018.

[37] Y. Sun, K. Tang, Z. Zhu, and X. Yao, "Concept drift adaptation by exploiting historical knowledge," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 29, no. 10, pp. 4822–4832, 2018.

[38] G. Ding, Y. Guo, K. Chen, C. Chu, J. Han, and Q. Dai, "Decode: Deep confidence network for robust image classification," IEEE Transactions on Image Processing (TIP), vol. 28, no. 8, pp. 3752–3765, 2019.

[39] Y. Liu, F. Nie, Q. Gao, X. Gao, J. Han, and L. Shao, "Flexible unsupervised feature extraction for image classification," Neural Networks, vol. 115, pp. 65–71, 2019.

[40] Z. Lin, G. Ding, J. Han, and L. Shao, "End-to-end feature-aware label space encoding for multilabel classification with many classes," IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 29, no. 6, pp. 2472–2487, 2017.

[41] Y. Guo, G. Ding, J. Han, and Y. Gao, "Zero-shot learning with transferred samples," IEEE Transactions on Image Processing (TIP), vol. 26, no. 7, pp. 3277–3290, 2017.

[42] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny, "Efficient lifelong learning with A-GEM," in Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[43] E. Belouadah and A. Popescu, "IL2M: Class incremental learning with dual memory," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 583–592.

[44] Y. Xiang, Y. Fu, P. Ji, and H. Huang, "Incremental learning using conditional adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6619–6628.

[45] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, "Learning a unified classifier incrementally via rebalancing," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 831–839.

[46] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu, "Large scale incremental learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 374–382.

[47] C. He, R. Wang, S. Shan, and X. Chen, "Exemplar-supported generative reproduction for class incremental learning," in Proceedings of the British Machine Vision Conference (BMVC), 2018, pp. 3–6.

[48] H. Shin, J. K. Lee, J. Kim, and J. Kim, "Continual learning with deep generative replay," in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 2990–2999.

[49] F. Lavda, J. Ramapuram, M. Gregorova, and A. Kalousis, "Continual classification learning using generative models," arXiv preprint arXiv:1810.10612, 2018.

[50] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, Z. Zhang, and Y. Fu, "Incremental classifier learning with generative adversarial networks," arXiv preprint arXiv:1802.00853, 2018.

[51] G. M. van de Ven and A. S. Tolias, "Generative replay with feedback connections as a general strategy for continual learning," arXiv preprint arXiv:1809.10635, 2018.

[52] K. Shmelkov, C. Schmid, and K. Alahari, "Incremental learning of object detectors without catastrophic forgetting," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3400–3409.

[53] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences (PNAS), vol. 114, no. 13, pp. 3521–3526, 2017.

[54] F. Zenke, B. Poole, and S. Ganguli, "Continual learning through synaptic intelligence," in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 3987–3995.

[55] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell, "Progress & compress: A scalable framework for continual learning," arXiv preprint arXiv:1805.06370, 2018.

[56] R. Kemker and C. Kanan, "FearNet: Brain-inspired model for incremental learning," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[57] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," arXiv preprint arXiv:1606.04671, 2016.

[58] V. Lomonaco and D. Maltoni, "CORe50: A new dataset and benchmark for continuous object recognition," arXiv preprint arXiv:1705.03550, 2017.

[59] D. Maltoni and V. Lomonaco, "Continuous learning in single-incremental-task scenarios," arXiv preprint arXiv:1806.08568, 2018.

[60] A. Mallya, D. Davis, and S. Lazebnik, "Piggyback: Adapting a single network to multiple tasks by learning to mask weights," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 67–82.

[61] J. Rajasegaran, M. Hayat, S. Khan, F. S. Khan, and L. Shao, "Random path selection for incremental learning," 2019.

[62] X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong, "Learn to grow: A continual structure learning framework for overcoming catastrophic forgetting," in Proceedings of the International Conference on Machine Learning (ICML), 2019.

[63] S. Huang, V. Francois-Lavet, and G. Rabusseau, "Neural architecture search for class-incremental learning," arXiv preprint arXiv:1909.06686, 2019.

[64] M. Welling, "Herding dynamical weights to learn," in Proceedings of the International Conference on Machine Learning (ICML). ACM, 2009, pp. 1121–1128.

[65] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), 2012, pp. 1097–1105.

[66] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, p. 417, 1933.

[67] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[68] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research (JMLR), vol. 9, no. Nov, pp. 2579–2605, 2008.