
Probabilistic Knowledge Transfer for Lightweight Deep Representation Learning

Nikolaos Passalis, Maria Tzelepi, and Anastasios Tefas, Member, IEEE

Abstract— Knowledge-transfer (KT) methods allow for transferring the knowledge contained in a large deep learning model into a more lightweight and faster model. However, the vast majority of existing KT approaches are designed to handle mainly classification and detection tasks. This limits their performance on other tasks, such as representation/metric learning. To overcome this limitation, a novel probabilistic KT (PKT) method is proposed in this article. PKT is capable of transferring the knowledge into a smaller student model by keeping as much information as possible, as expressed through the teacher model. The ability of the proposed method to use different kernels for estimating the probability distribution of the teacher and student models, along with the different divergence metrics that can be used for transferring the knowledge, allows for easily adapting the proposed method to different applications. PKT outperforms several existing state-of-the-art KT techniques, while it is capable of providing new insights into KT by enabling several novel applications, as it is demonstrated through extensive experiments on several challenging data sets.

Index Terms— Knowledge transfer (KT), lightweight deep learning (DL), metric learning, neural network distillation, representation learning.

I. INTRODUCTION

DEEP learning (DL) allowed for tackling many difficult problems with great success [1], ranging from challenging computer vision problems [2] to complex reinforcement learning problems [3]. However, DL suffers from significant limitations, since neural networks are becoming increasingly complex, often composed of hundreds of layers [4], and requiring powerful and dedicated hardware for effectively deploying them. This limits their applications on embedded and mobile devices with limited computational resources, e.g., memory, processing power, and so on. These limitations shifted the research interest into developing more efficient models, which can effectively lower the computational and energy requirements of DL. To this end, several methods have been recently proposed, including the model-compression and quantization methods [5], which can reduce the number of bits needed to store the weights of a network and accelerate the required computations, as well as the lightweight and more efficient neural network architectures [6].

Manuscript received August 20, 2019; revised March 1, 2020; accepted May 15, 2020. This work was supported by Greece and the European Union [European Social Fund (ESF)] through the Operational Programme "Human Resources Development, Education and Lifelong Learning 2014-2020" in the context of the project "Lighweight Deep Learning Models for Signal and Information Analysis" under Grant MIS 5047925. (Corresponding author: Nikolaos Passalis.)

The authors are with the Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece (e-mail: [email protected]; [email protected]; [email protected]).

This article has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors.

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2020.2995884

Another promising line of research for further improving the performance of the lightweight DL models is the knowledge-transfer (KT) methods [7], [8], which are also known as the knowledge distillation (KD) or neural network distillation methods. KT works by modeling the knowledge, as encoded by a large DL model, and then transferring it into a smaller and faster model. The most straightforward way to perform KT is to train the smaller model, which is called the student model, to mimic the response (output) of a larger and more complex neural network, typically called the teacher model. In this way, KT allows for training more accurate student models that are able to generalize better than the models trained without KT, since the teacher implicitly captures more information for each data sample and the training classes. Note that this information is usually not available when training with traditional methods, since only hard binary labels are provided, instead of a smooth probability distribution over the classes. This allows KT to better regularize the training process, significantly improving the performance of the student network [9]. It is worth noting that KT methods are orthogonal to other approaches used for developing lightweight models, e.g., they can be used with both quantization methods and lightweight architectures, such as [6]. This allows KT to be combined with any other method, further improving the performance of the lightweight DL models.

Several KT methods have been proposed and successfully used for a wide variety of tasks, mainly related to classification [7], [8], [10] and object detection [11]. However, these approaches suffer from a series of limitations that prevent them from being used efficiently for other DL-related tasks, such as representation/metric learning [12]. First, most of the existing methods are not capable of efficiently transferring the knowledge when a different number of neurons is used, i.e., the layers have different dimensionalities. The main reason for this is that most KT methods are currently tailored toward classification tasks, where they are usually used to transfer the knowledge between the final classification layers of two networks, assuming that the dimensionality of these layers is the same for both networks. However, this renders most of the existing KT methods not suitable for representation learning tasks, such as text, image, and multimedia information retrieval [13]–[15]. At the same time, there are many other applications requiring accurate, yet lightweight feature extractors, e.g., reducing the communication overhead between the mobile devices and the cloud [16], protecting the privacy of the users by processing most of the data on the edge [17], [18], and so on.

Furthermore, most of the existing KT methods usually completely ignore the actual geometry of the teacher's feature space, e.g., manifolds that are possibly formed, the local structure of the space (as expressed by the similarities between the neighboring samples), and so on, since they directly regress the representation extracted from the teacher model. However, it has been demonstrated that using this information allows for improving the quality of the learned model/representation for many different domains and applications [19], [20].

These observations lead us to a number of interesting questions.

1) Can we use the existing KT methods for representation learning tasks and how do they perform on these tasks?

2) Is it possible to design a method that can directly recreate the geometry of the teacher's feature space using the student model? This would allow for effectively modeling the manifolds formed in the feature space of the teacher model and then recreating them into the student's lower dimensional feature space, possibly allowing for further improving the performance of the student.

3) Can handcrafted features, e.g., HoG [21], be used to transfer the knowledge encoded in their representation into a neural network? This process can allow for effectively exploiting the enormous amount of available unlabeled training data for training the DL models.

4) Finally, is it possible to model the gradual transformation of the feature space, through the various layers of the student model, in order to further improve the performance of the student model? If so, how should we match the layers between the student and the teacher, e.g., should we directly transfer the knowledge between all the layers of the student and the teacher or should we use a more adaptive and sophisticated approach to avoid overregularizing the network?

In this article, a probabilistic KT (PKT) method is proposed, allowing for overcoming the limitations of existing methods and providing an efficient approach for developing lightweight DL models for various representation/metric learning tasks. To this end, the proposed method matches the probability distribution of the data in the feature spaces formed by the teacher and student models, instead of merely regressing their actual representation, as shown in Fig. 1. PKT is motivated by the observation that matching the probability distributions in the feature space of the teacher and student models allows for maintaining the teacher's quadratic mutual information (QMI) [22] in the smaller student model. Even though a set of labels is employed for modeling the MI of the model, no labels are actually required for the KT process, rendering the proposed approach fully unsupervised. This is possible, since, as it is thoroughly described in Section III, the aforementioned process leads to the unsupervised modeling of the interactions between the data samples in the feature space as a probability distribution that expresses the affinity between the data samples.

Fig. 1. PKT. First, the teacher's knowledge and student's knowledge are modeled using two probability distributions. The knowledge is then transferred to the student by minimizing the divergence between these two probability distributions.

As we extensively demonstrate through several experiments, this process provides significant advantages over existing KT techniques. First, the proposed PKT method enables us to transfer directly the knowledge even when the output dimensionality of the networks does not match, without using any additional dimensionality-reduction layers. Furthermore, PKT allows for using the most appropriate kernel to model the feature space of the teacher and student models, providing a straightforward way to fine-tune the proposed method for different applications, e.g., information retrieval using angular metrics, such as the cosine similarity. Finally, it is demonstrated that the probability distributions can also be estimated or enhanced using any other information source, such as handcrafted feature extractors, intermediate neural layers, and supervised information or even qualitative information provided by users and/or domain experts. Thus, PKT constitutes a powerful and flexible KT tool that can be used to improve KT for existing scenarios as well as support a number of novel KT scenarios. The proposed method can effectively: 1) perform cross-modal KT; 2) transfer the knowledge encoded in the handcrafted feature extractors into the neural networks; 3) transfer the knowledge into multiple layers using a hierarchical ladder scheme; 4) learn representations robust to distribution shifts; and 5) improve the performance for a wide variety of tasks (retrieval, classification, and clustering). The proposed method is extensively evaluated and compared with other KT techniques using four challenging computer vision data sets and several evaluation setups [KT from deep neural networks, KT from handcrafted feature extractors, KT from different modalities (cross modal transfer), and hierarchical KT from multiple layers].

This article is an extended version of our previous work presented in [23]. We further extended our previous article by: 1) providing a multikernel formulation that improves the performance of the proposed method over different setups; 2) employing a ladder-based KT scheme that allows for effectively transferring the knowledge between multiple levels of two networks, despite the difficulties that often arise from this process; 3) providing an extensive stability and ablation study; 4) providing additional comparison with a recently proposed metric learning approach [24]; and 5) evaluating the proposed method in additional setups (classification, distribution shift, and unsupervised knowledge discovery) as well as additional data sets (STL-10 [25] and ILSVRC [26]). An open-source implementation of the proposed method, containing all the improvements proposed in this article, will be publicly available at https://github.com/passalis/deep_pkt.

The rest of this article is structured as follows. First, in Section II, the related works are discussed, while the proposed method is analytically presented in detail in Section III. Then, the proposed method is evaluated in Section IV. Finally, Section V concludes this article.

II. RELATED WORK

Research on the KT methods, which allow for effectively training smaller and faster models, is mainly driven by the increasing size and complexity of the DL models as well as the need to deploy them into various devices with limited computing capabilities, such as embedded and mobile devices. A large portion of the KT methods proposed in the literature employ the teacher model to generate soft-labels and then use these soft-labels for training the smaller student network [8], [9], [27]–[29]. Indeed, one of the earliest approaches to KT directly used the label distribution predicted by the teacher to train the student [27], while the well-known neural network distillation method [8] extended this approach by appropriately tuning the temperature of the softmax activation. Neural network distillation can indeed effectively regularize the smaller network, leading to better performance than directly training the network using hard binary labels [6], [8], [15]. Several extensions to this approach have been proposed: soft-labels can be used for pretraining a large network [30], for domain-adaptation tasks [28], compressing the posterior density in the Bayesian methods [31], or transferring the knowledge from recurrent neural networks into simpler models [29]. The regularization nature of neural network distillation is also highlighted in [9], where the knowledge was transferred from a smaller teacher model to a larger student network, allowing for training the student with fewer annotated data. These methods are mainly designed to handle the classification tasks and cannot be effectively applied for representation learning, as it is also demonstrated in Section IV. It is also worth noting that KT is closely related to the optimization methods [32]–[35], since it performs a relaxation of the original optimization problem, as well as to various loss functions proposed for metric learning, such as [36].

A quite different approach was proposed in [7], where the student network was trained by employing hints from the intermediate layers of the teacher model. To this end, projection matrices were employed to match the dimensionality between the teacher's and student's layers, since usually the student model is smaller. Even though this approach allows for transferring the knowledge between arbitrary layers, it can lead to a significant loss of information, due to using low-dimensional projections. A similar approach was also employed in [38], where instead of using hints, the flow of solution procedure (FSP) matrix was used to transfer the knowledge between the intermediate layers of various residual networks. It is worth noting that FSP, in contrast with hints, cannot be applied when the intermediate representations have different sizes and, as a result, cannot be used for representation learning. Finally, in [20], an embedding-based approach was proposed for transferring the knowledge, while in [24], a multidimensional-scaling-based method was used to learn a student that maintains the same distances as the teacher between the pairs of samples. However, the first method is tailored toward the classification tasks, while the latter leads to significantly worse performance compared with the proposed one (demonstrated in Section IV), since directly matching the distances in high-dimensional spaces is less effective than the proposed probabilistic formulation that can alleviate these limitations (especially when coupled with a carefully selected kernel function and divergence metric).

To the best of our knowledge, the proposed method is the first probabilistic KT approach that can be effectively used for a wide range of different representation learning tasks. Indeed, the proposed method can be employed in many different and novel scenarios, e.g., transferring the knowledge from handcrafted feature extractors, as it is demonstrated in Section IV. The proposed method does not require extensive hyperparameter tuning, such as the tedious tuning of the softmax temperature [8], while it is easy to implement and use. At the same time, its ability to estimate the probability distribution using different kernels, as well as to combine them, allows for learning more robust lightweight student models. Furthermore, the proposed method can directly handle layers of different dimensionalities, without requiring additional dimensionality-reduction layers [7] or less-efficient distance-based matching [24]. Finally, also note that a KT approach that employs an intermediate network was also proposed in [37]. However, compared with this approach, the proposed one is designed to facilitate multilayer KT.

III. PKT

First, the used notation and required background are briefly introduced in this section. Then, the proposed method is analytically described. Several design choices are discussed through this section, along with the extensions that allow for handling several different KT scenarios.

A. Notation and Background

Let T = {t_1, t_2, ..., t_N} be a transfer set of N objects. The transfer set is used to transfer the knowledge encoded in the teacher model into the student model. The notations x = f(t) and y = g(t, W) are used to refer to the teacher's and student's output representations, respectively. Note that W refers to the trainable parameters of the student model. The student model g(·) is then trained in order to "mimic" the behavior of f(·). Note that there is no constraint on what the functions f(·) and g(·) are, as long as the output of f(·) is known for every element of T and g(·) is a differentiable function. Actually, the first constraint can be relaxed even more, since, as it will be demonstrated later, it is enough to know just the pairwise similarities between the training samples for the teacher model. The continuous random variables X and Y are used to model the distribution of the representation extracted from the teacher and student models. In addition, a discrete random variable C is introduced to describe some higher level semantic features of the samples, e.g., class labels. Therefore, each vector x drawn from X is also associated with an attribute label c.

Mutual information (MI) is a measure of uncertainty regarding the class label c after observing the corresponding vector x [22]. Let P(c) be the probability of observing the class label c. The teacher's MI can be formally defined as

I(X, C) = \sum_{c} \int_{x} p(x, c) \log \frac{p(x, c)}{p(x) P(c)} \, dx    (1)

where p(x, c) is the joint probability density function of x and c. Then, QMI can be derived by replacing the Kullback–Leibler (KL) divergence between p(x, c) and the product of marginal probabilities that appear in (1) by the quadratic divergence measure [22]

I_T(X, C) = \sum_{c} \int_{x} \left( p(x, c) - p(x) P(c) \right)^2 dx.    (2)

By expanding (2), QMI can be more compactly expressed in terms of three quantities, called information potentials, as I_T(X, C) = V_{IN} + V_{ALL} - 2 V_{BTW}, where the corresponding potentials are defined as

V_{IN} = \sum_{c} \int_{x} p(x, c)^2 \, dx,
V_{ALL} = \sum_{c} \int_{x} \left( p(x) P(c) \right)^2 dx,
V_{BTW} = \sum_{c} \int_{x} p(x, c) \, p(x) P(c) \, dx.

The potential V_{IN} expresses the in-class interactions, the potential V_{ALL} the interactions between all the samples, while the potential V_{BTW} the interaction of each class against all the other samples, as further shown in (5)–(7).

The class prior probability for each class c_p can be estimated as P(c_p) = J_p / N, where J_p refers to the number of samples of the pth class and N is the size of the transfer set. Then, kernel density estimation (KDE) can be employed for estimating the joint probability density as

p(x, c_p) = p(x | c_p) P(c_p) = \frac{1}{N} \sum_{j=1}^{J_p} K(x, x_{pj}, \sigma^2)    (3)

where K(a, b, σ^2) is a symmetric kernel with width σ and the notation x_{pj} is used to refer to the jth sample of the pth class. The density of X is similarly estimated as

p(x) = \sum_{p=1}^{N_c} p(x, c_p) = \frac{1}{N} \sum_{j=1}^{N} K(x, x_j, \sigma^2).    (4)
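To make the estimation in (3) and (4) concrete, the following minimal NumPy sketch evaluates the joint and marginal densities for a single point x, assuming a standard Gaussian kernel; the array and function names are illustrative and are not taken from the paper or its released code.

```python
import numpy as np

def gaussian_kernel(a, b, sigma2):
    # K(a, b, sigma^2) = exp(-||a - b||_2^2 / (2 * sigma^2)), a standard symmetric kernel.
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma2))

def joint_density(x, X, labels, class_id, sigma2):
    # Eq. (3): p(x, c_p) = (1/N) * sum_{j=1}^{J_p} K(x, x_pj, sigma^2), summing only
    # over the samples of class c_p but normalizing by the full transfer-set size N.
    N = X.shape[0]
    return sum(gaussian_kernel(x, xj, sigma2) for xj in X[labels == class_id]) / N

def marginal_density(x, X, sigma2):
    # Eq. (4): p(x) = (1/N) * sum_{j=1}^{N} K(x, x_j, sigma^2).
    N = X.shape[0]
    return sum(gaussian_kernel(x, xj, sigma2) for xj in X) / N
```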

B. PKT

The knowledge can be transferred from the teacher model to the student model by maintaining the same amount of MI between the two random variables X and Y and the class labels, i.e., I(X, C) = I(Y, C). In the case of QMI, this implies that the information potentials of the student model should be equal to the corresponding information potentials of the teacher. Note that there might be other configurations with equal MI as well; however, it is enough to use one of them to transfer the knowledge.

The information potentials for the teacher model can be easily calculated using the probabilities estimated in Section III-A as

V_{IN}^{(t)} = \frac{1}{N^2} \sum_{p=1}^{N_c} \sum_{k=1}^{J_p} \sum_{l=1}^{J_p} K(x_{pk}, x_{pl}, 2\sigma_t^2),    (5)

V_{ALL}^{(t)} = \frac{1}{N^2} \left( \sum_{p=1}^{N_c} \left( \frac{J_p}{N} \right)^2 \right) \sum_{k=1}^{N} \sum_{l=1}^{N} K(x_k, x_l, 2\sigma_t^2),    (6)

and

V_{BTW}^{(t)} = \frac{1}{N^2} \sum_{p=1}^{N_c} \frac{J_p}{N} \sum_{j=1}^{J_p} \sum_{k=1}^{N} K(x_{pj}, x_k, 2\sigma_t^2)    (7)

where N_c is the total number of classes. The interaction between two samples t_i and t_j is measured using the kernel function K(x_i, x_j, σ^2) that expresses the similarity between them (as measured through the representation extracted from the teacher model). Similarly, the information potentials for the student network are defined as

V_{IN}^{(s)} = \frac{1}{N^2} \sum_{p=1}^{N_c} \sum_{k=1}^{J_p} \sum_{l=1}^{J_p} K(y_{pk}, y_{pl}, 2\sigma_s^2),

V_{ALL}^{(s)} = \frac{1}{N^2} \left( \sum_{p=1}^{N_c} \left( \frac{J_p}{N} \right)^2 \right) \sum_{k=1}^{N} \sum_{l=1}^{N} K(y_k, y_l, 2\sigma_s^2),

and

V_{BTW}^{(s)} = \frac{1}{N^2} \sum_{p=1}^{N_c} \frac{J_p}{N} \sum_{j=1}^{J_p} \sum_{k=1}^{N} K(y_{pj}, y_k, 2\sigma_s^2).

Different (and appropriately tuned) bandwidths σ_t and σ_s must be used for the teacher and student models, since the kernels are used to model the distribution in two different feature spaces. The bandwidth of the student model was set to 1, while the bandwidth of the teacher was determined according to the mean distance between the samples in the teacher's feature space. This strategy was used for all the conducted experiments, follows the experimental findings of other related approaches [39], and ensures that a meaningful probability estimation will be obtained for both models, i.e., that the kernel values will not collapse to either 0 (too small bandwidth) or 1 (too large bandwidth).
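As an illustration of (5)–(7) and of the bandwidth heuristic above, the following NumPy/SciPy sketch computes the three teacher potentials with a Gaussian kernel; the function and variable names are illustrative, and the kernel choice is an assumption rather than the paper's exact Table I definition.

```python
import numpy as np
from scipy.spatial.distance import cdist

def teacher_potentials(X, labels, sigma_t=None):
    """Information potentials (5)-(7) for teacher representations X (N x d)."""
    N = X.shape[0]
    if sigma_t is None:
        # Bandwidth heuristic described above: mean pairwise distance in the teacher space.
        sigma_t = cdist(X, X).mean()
    # Pairwise Gaussian kernel evaluated with width 2 * sigma_t^2, as in (5)-(7).
    K = np.exp(-cdist(X, X, "sqeuclidean") / (2.0 * (2.0 * sigma_t ** 2)))
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / N                     # class priors P(c_p) = J_p / N
    v_in = sum(K[np.ix_(labels == c, labels == c)].sum() for c in classes) / N ** 2
    v_all = (priors ** 2).sum() * K.sum() / N ** 2
    v_btw = sum(P * K[labels == c, :].sum() for c, P in zip(classes, priors)) / N ** 2
    return v_in, v_all, v_btw               # QMI estimate: v_in + v_all - 2 * v_btw
```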

It is easy to see that the most straightforward way to ensure that the information potentials will be equal among the two models is to require the similarity between each pair of points, as expressed through the employed kernel, to be equal

K(x_i, x_j, 2\sigma_t^2) = K(y_i, y_j, 2\sigma_s^2) \quad \forall i, j.    (8)

TABLE I
DIFFERENT KERNELS THAT CAN BE USED FOR TRANSFERRING THE KNOWLEDGE BETWEEN THE TEACHER AND STUDENT MODELS

Therefore, the problem of transferring the knowledge by maintaining the same MI between the models and a set of classes is reduced into matching the kernel values between different pairs of data. PKT, instead of matching the unnormalized kernel, proposes to minimize the divergence between the teacher's and student's conditional probability distributions

p_{i|j}^{(t)} = \frac{K(x_i, x_j, 2\sigma_t^2)}{\sum_{k=1, k \neq j}^{N} K(x_k, x_j, 2\sigma_t^2)} \in [0, 1]    (9)

and

p_{i|j}^{(s)} = \frac{K(y_i, y_j, 2\sigma_s^2)}{\sum_{k=1, k \neq j}^{N} K(y_k, y_j, 2\sigma_s^2)} \in [0, 1].    (10)

These probabilities express how probable it is for each sample to select each of its neighbors [39], modeling in this way the geometry of the feature space.

Several different kernel functions can be used to estimate the corresponding probabilities, while some of the possible choices are listed in Table I, where the kernels are applied on two vectors a and b and the notation ||·||_2 is used to refer to the l_2 norm of a vector. Among the most popular ones is the Gaussian kernel [40]. However, Gaussian kernels require carefully tuning their width, which is not a straightforward task. It is worth noting that several heuristics have been proposed to address specifically this limitation [41]. In the initial version of PKT [23], this issue was avoided by deriving an angular kernel that requires no domain-dependent tuning (cosine kernel). This kernel also fits various retrieval tasks especially well, since the cosine similarity is usually used for retrieval. However, as we experimentally found out, this kernel is not optimal for every task. Instead, l_2-based kernels, such as the Gaussian and T-student kernels, seem to perform better on such tasks, e.g., clustering and classification. This behavior is experimentally demonstrated in the results reported in Section IV.
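Since Table I is not reproduced here, the snippet below sketches three kernels consistent with the description above (an angular/cosine kernel mapped into [0, 1] that needs no bandwidth tuning, a Gaussian kernel, and a heavy-tailed T-student-style kernel); the exact functional forms are illustrative assumptions rather than the paper's definitions.

```python
import torch
import torch.nn.functional as F

def cosine_kernel(a):
    # Angular kernel: pairwise cosine similarities shifted and scaled into [0, 1];
    # no bandwidth parameter is required.
    a = F.normalize(a, p=2, dim=1)
    return 0.5 * (a @ a.t() + 1.0)

def gaussian_kernel(a, sigma2):
    # Gaussian kernel; its width sigma2 must be tuned carefully, as noted above.
    return torch.exp(-torch.cdist(a, a) ** 2 / (2.0 * sigma2))

def t_student_kernel(a):
    # Heavy-tailed, l2-based kernel in the spirit of the T-student kernel used by PKT.
    return 1.0 / (1.0 + torch.cdist(a, a) ** 2)
```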

Therefore, in this article, we propose using a hybrid objective that requires minimizing the divergence calculated using both the cosine kernel, which ensures the good performance of the learned representation for retrieval tasks, and the T-student kernel, which ensures the good performance of the method for classification tasks

\mathcal{L} = D\left( \mathcal{P}_c^{(t)}, \mathcal{P}_c^{(s)} \right) + D\left( \mathcal{P}_T^{(t)}, \mathcal{P}_T^{(s)} \right)    (11)

where D(·) is a divergence metric and the notations \mathcal{P}_c^{(t)} and \mathcal{P}_T^{(t)} are used to denote the conditional probabilities of the teacher calculated using the cosine and T-student kernels, respectively. The student probability distributions are denoted similarly by \mathcal{P}_c^{(s)} and \mathcal{P}_T^{(s)}. It is worth noting that the proposed method only requires one additional feedforward pass for the teacher model in order to extract the teacher's representation and calculate the loss \mathcal{L}. The complexity of calculating the loss is O(N_B^2 N_d), where N_B is the batch size and N_d is the size of the representation extracted from the teacher/student (or the maximum dimensionality among them, if they are different).

There are also several different choices for defining the divergence metric. In this article, a symmetric version of the KL divergence, the Jeffreys divergence [42], is used to this end

D_J\left( \mathcal{P}^{(t)} \| \mathcal{P}^{(s)} \right) = \int_{-\infty}^{+\infty} \left( \mathcal{P}^{(t)}(t) - \mathcal{P}^{(s)}(t) \right) \left( \log \mathcal{P}^{(t)}(t) - \log \mathcal{P}^{(s)}(t) \right) dt.    (12)

The final divergence function that can be readily used for training the model is defined as

D_J\left( \mathcal{P}^{(t)} \| \mathcal{P}^{(s)} \right) = \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \left( p_{j|i}^{(t)} - p_{j|i}^{(s)} \right) \left( \log p_{j|i}^{(t)} - \log p_{j|i}^{(s)} \right)    (13)

since the two distributions are sampled using a finite number of points. Note that other metrics, such as the KL divergence, can be employed to meet the requirements of each application. For example, KL is an asymmetric metric that can be used when higher weight should be given to minimizing the divergence for the neighboring pairs of points instead of distant ones.
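Putting (9)–(13) together, the following PyTorch sketch computes the batch-wise PKT objective of (11) with a Jeffreys-style divergence; it is a simplified illustration of the procedure described in the text (kernel matrix, conditional probabilities, symmetric divergence, summed over the cosine and T-student terms) and is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def _cosine_kernel(a):
    a = F.normalize(a, p=2, dim=1)
    return 0.5 * (a @ a.t() + 1.0)          # cosine similarities mapped into [0, 1]

def _t_student_kernel(a):
    return 1.0 / (1.0 + torch.cdist(a, a) ** 2)

def _cond_probs(K, eps=1e-8):
    # Eqs. (9)-(10): discard self-similarities and normalize over the remaining samples.
    K = K - torch.diag(torch.diag(K))
    return K / (K.sum(dim=0, keepdim=True) + eps)

def _jeffreys(P_t, P_s, eps=1e-8):
    # Eq. (13): symmetric (Jeffreys) divergence between the sampled distributions.
    return torch.sum((P_t - P_s) * (torch.log(P_t + eps) - torch.log(P_s + eps)))

def pkt_loss(teacher_feats, student_feats):
    # Combined objective of Eq. (11): cosine-kernel term + T-student-kernel term.
    loss = 0.0
    for kernel in (_cosine_kernel, _t_student_kernel):
        loss = loss + _jeffreys(_cond_probs(kernel(teacher_feats)),
                                _cond_probs(kernel(student_feats)))
    return loss
```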

To learn the parameters W of the student model g(t, W), gradient descent is used: \Delta W = -\eta \, (\partial \mathcal{L} / \partial W), where W is the matrix with the parameters of the student model and η is the employed learning rate. The derivative of the loss function with respect to the parameters of the model can be easily derived by observing that

\frac{\partial \mathcal{L}}{\partial W} = \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \frac{\partial \mathcal{L}}{\partial p_{j|i}^{(s)}} \sum_{l=1}^{N} \frac{\partial p_{j|i}^{(s)}}{\partial y_l} \frac{\partial y_l}{\partial W}    (14)

where \partial y_l / \partial W is just the derivative of the student's output with respect to its parameters. Furthermore, the conditional probabilities can be estimated using relatively small batches of the data at each iteration, e.g., batches of 128 samples, instead of using the whole data set. In this way, it is possible to accelerate the convergence of the method, while it was experimentally established that this batch-based approach does not negatively affect the learned representation. Note that to ensure that different sample pairs are used for each iteration, the transfer samples are shuffled after each epoch. Finally, note that the teacher model can be any model for which we can estimate the conditional probability p_{j|i}^{(t)}. Therefore, the source of knowledge for the teacher can range from representations extracted from neural layers and handcrafted features (where kernels are used to estimate the probabilities) to domain knowledge and label attributes. In the latter case, it is worth noting that only the pairwise similarities between the samples are required (allowing for transferring the knowledge even when there is no representation extracted from the teacher for each data sample).
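A hedged sketch of the batch-based optimization described above: the teacher is kept frozen, the transfer set is reshuffled every epoch so that different sample pairs are drawn, and the student is updated with gradient descent on the PKT loss (reusing the pkt_loss function from the previous sketch). The batch size of 128 follows the text, while the optimizer choice, the remaining hyperparameters, and the assumption that the transfer set yields (image, label) pairs are illustrative.

```python
import torch
from torch.utils.data import DataLoader

def train_pkt(student, teacher, transfer_set, epochs=30, batch_size=128, lr=1e-3):
    teacher.eval()                                  # the teacher is never updated
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    # shuffle=True reshuffles the transfer set after each epoch, so the batch-based
    # probability estimates are computed over different sample pairs every time.
    loader = DataLoader(transfer_set, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for batch, _labels in loader:               # labels are ignored: KT is unsupervised
            with torch.no_grad():
                x = teacher(batch)                  # one extra feedforward pass for the teacher
            y = student(batch)
            loss = pkt_loss(x, y)                   # PKT loss from the previous sketch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```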

C. Ladder PKT

The proposed method can be directly used to transfer the knowledge from multiple layers of two neural networks. This is expected to guide better the knowledge-transfer process, similar to the way that hints from multiple layers guide the neural network distillation process [7]. However, if the layers from and to which the knowledge will be transferred are not carefully selected, the student network can be overregularized, leading to worse performance than not using intermediate layers for the KT, as we also experimentally demonstrate in the ablation study provided in Section IV. Note that selecting the appropriate layers can be a quite difficult and tedious process that involves several experiments to find the best combination of layers, especially when the architectures of the student and teacher differ a lot.

Fig. 2. Ladder PKT: constructing an intermediate ladder network allows for transferring more effectively the knowledge to all the layers of the student model. First, the knowledge is transferred to the intermediate ladder network. Then, PKT is employed in a two-step process: 1) the knowledge is transferred between all the intermediate layers and 2) the final representation (extracted from the last layer of the networks) is further fine-tuned.

To overcome this limitation, we propose a simple, yet effective ladder-based approach for transferring the knowledge between all the layers of the student and an appropriately constructed model, as shown in Fig. 2. First, an intermediate network, called ladder network, with the same number of layers as the student model is employed. This allows for having a one-to-one matching between each layer of the student and ladder models, overcoming the need to carefully select and match the layers between the teacher and ladder models. At the same time, the ladder network should contain more parameters, i.e., the number of neurons/convolutional filters per layer must be increased, to ensure that enough knowledge will always be available to the ladder network (when compared with the student model).

TABLE II
COMPUTATIONAL COMPLEXITY OF THE MODELS USED FOR THE CONDUCTED EXPERIMENTS

First, the knowledge is transferred from the final layer of the teacher model to the final layer of the ladder network, by performing regular PKT. Then, the knowledge can be readily transferred from the ladder to the student using all the available layers, since the ladder and student networks have similar architecture, and, as a result, all the layers can be used without the risk of mismatching between the layers. Finally, instead of directly transferring the knowledge from all the layers at once, we propose using a cyclical schedule, i.e., select a random layer pair (of the same depth) at each training epoch and transfer the knowledge between the layers of this pair. This process provides greater flexibility and allows for better exploring the solution space. Then, the KT process continues by employing regular PKT to transfer the knowledge to the final representation layer. The complexity of calculating the loss in the case of multilayer KT is updated as O(K N_B^2 N_D), where K is the number of layers used for the KT and N_D refers to the maximum dimensionality of any intermediate layer.
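The cyclical multilayer schedule described above can be sketched as follows: at every epoch one depth is drawn at random and PKT is applied between the corresponding ladder and student layers, before a final pass of regular PKT on the last representation layer. The feature-extraction callables and the hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import random
import torch

def train_ladder_pkt(student, ladder_layers, student_layers, loader, epochs=30, lr=1e-3):
    """ladder_layers[k](batch) and student_layers[k](batch) return the (flattened)
    intermediate representations of the frozen ladder and the trainable student
    at depth k; both lists have the same length (one-to-one layer matching)."""
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        # Cyclical schedule: pick one random layer pair (of the same depth) per epoch.
        k = random.randrange(len(ladder_layers))
        for batch, _ in loader:
            with torch.no_grad():
                x = ladder_layers[k](batch)
            y = student_layers[k](batch)
            loss = pkt_loss(x, y)                   # PKT loss from the earlier sketch
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # Afterwards, regular PKT is applied between the final representation layers
    # (e.g., by calling train_pkt from the previous sketch with the two full models).
    return student
```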

IV. EXPERIMENTAL EVALUATION

The proposed method is extensively evaluated and compared with other baseline and state-of-the-art KT methods in this section. First, the proposed method is evaluated under a representation learning setup using the CIFAR-10 data set [43] as well as a challenging subset of the large-scale ILSVRC data set [26]. Then, the proposed method is evaluated under a distribution shift setup, using the STL-10 data set [25], as well as by transferring the knowledge encoded in the handcrafted feature extractors and different modalities using the SUN Attribute data set [44]. Finally, a sensitivity analysis is provided, where the effect of the employed kernel on the performance of the learned representations is evaluated. Please note that details regarding the experimental setup, such as the employed network architectures and optimization hyperparameters, are provided in the Supplementary Material. The computational complexity (time and space complexities) of the networks employed for the conducted experiments is summarized in Table II. Note that a wide variety of different networks and transfer setups is employed. No information regarding the teacher model is provided for the SUN Attribute data set, since handcrafted features are used for the KT for the experiments conducted using this data set.

A. Representation Learning Using CIFAR-10

TABLE III
CIFAR-10: RETRIEVAL EVALUATION

The experimental results using the CIFAR-10 data set are reported in Table III. A content-based image-retrieval setup was employed for the evaluation. The database contains the training images, while the test set images are used for evaluating the performance of each model. The (interpolated) mean average precision (mAP) at the standard 11-recall points and the top-k precision (abbreviated as "t-k") were used as the evaluation metrics [13]. Both the Euclidean distance (denoted by "e") and the cosine similarity (denoted by "c") were used for the retrieval process. In addition, note that the cosine similarity is actually equivalent to the Euclidean distance, when used to rank the similarity between a query and the documents in the database, when l_2 normalization is performed on the extracted feature vectors. In this way, these two metrics provide a way to evaluate the performance of the learned representation with and without using normalization.
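As a small illustration of the retrieval protocol described above, the sketch below ranks the database with either the cosine similarity or the Euclidean distance and reports the top-k precision; the mAP at the standard 11-recall points is omitted for brevity, and all names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def top_k_precision(queries, database, q_labels, db_labels, k=50, metric="c"):
    # "c": cosine similarity (equivalent to the Euclidean distance on l2-normalized
    #      vectors); "e": Euclidean distance on the raw representations.
    if metric == "c":
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        d = database / np.linalg.norm(database, axis=1, keepdims=True)
        ranks = np.argsort(-q @ d.T, axis=1)        # most similar first
    else:
        ranks = np.argsort(cdist(queries, database), axis=1)   # closest first
    hits = db_labels[ranks[:, :k]] == q_labels[:, None]
    return hits.mean()
```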

The penultimate layers were used to extract a representation from each image for both the teacher (512-D features) and student (128-D features) models. The proposed method was compared with two variants of the hint-based KT method [7], abbreviated as "Hint" (in the first one, a random projection was used, while in the second one, the projection was optimized along with the student model). Furthermore, the multidimensional-scaling-based method proposed in [24] was also evaluated (abbreviated as "MDS-T"), along with the plain distillation approach (where the knowledge was transferred from the final classification layer). For the rest of the evaluated methods, the knowledge was transferred from the penultimate layer of the teacher network (512 dimensions) to the corresponding layer of the student network (128 dimensions).

Several conclusions can be drawn from the results reported in Table III. First, the proposed method leads to significant improvements compared with both the baseline student network trained for classification (the mAP increases from 47.36% to 66.83%) as well as all the other evaluated methods, regardless of the metric used for the retrieval process. In addition, training from scratch seems to lead to slightly better solutions compared with continuing the training process using the pretrained student network for most methods (including the proposed one). Quite interestingly, using random projections for the hint-based method leads to consistently better results compared with optimizing the projection. This is perhaps because the projection is becoming a vital part of the network, and as a result, the learned representation loses part of its discrimination ability when the projection is removed.

TABLE IV
CIFAR-10: CLUSTERING EVALUATION

TABLE V
CIFAR-10: LADDER-TRANSFER EVALUATION

The quality of the learned representations was also evaluated using a knowledge discovery setup, where the representations were clustered into ten clusters using the k-means algorithm. Various clustering metrics were measured [45]: the adjusted Rand score (Rand), the adjusted MI, the V-measure (V-score), the Fowlkes–Mallows (FM) score, and the Calinski–Harabasz (CH) score. The experiments were conducted using the representation learned using the pretrained student (last block of Table III). The experimental results are reported in Table IV. It is worth noting that even though the MDS-T method did not perform well on the retrieval evaluation, it performed significantly better on the clustering evaluation. Nonetheless, the proposed method vastly outperformed all the other evaluated methods for all the evaluated clustering criteria.
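A short scikit-learn sketch of this clustering evaluation, assuming the learned representations and the ground-truth labels are available as arrays; the metric names match those reported in Table IV, while the wrapper function itself is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn import metrics

def clustering_scores(features, labels, n_clusters=10, seed=0):
    # Cluster the learned representations with k-means and score the assignments.
    pred = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(features)
    return {
        "Rand": metrics.adjusted_rand_score(labels, pred),
        "Adjusted MI": metrics.adjusted_mutual_info_score(labels, pred),
        "V-score": metrics.v_measure_score(labels, pred),
        "FM": metrics.fowlkes_mallows_score(labels, pred),
        "CH": metrics.calinski_harabasz_score(features, pred),
    }
```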

Finally, the proposed ladder-based transfer is evaluated in Table V. To evaluate the quality of the ladder transfer, we employed a student trained with the PKT method as the ladder network, while a new student network was created by removing half of the convolutional filters/neurons from each layer of the ladder network. Note that the proposed ladder-transfer method can also be applied to the Hint and MDS methods by transferring the knowledge between all the layers of the ladder and teacher networks. The proposed ladder approach indeed improves the performance of all the methods that can be used to support the ladder-based KT (hints with random projections did not converge, so the corresponding results were not included). Note that significant improvements were obtained when the proposed ladder-transfer method was combined with the MDS approach, for which the mAP increases from 31.99% to 36.59%. This highlights the importance of using the appropriate teacher models for training the students, regardless of the employed KT method. Again, the proposed method outperforms all the other evaluated methods, leading to higher mAP and top-50 precision.

B. Representation Learning Using ILSVRC

TABLE VI
IMAGENET SUBSET: RETRIEVAL EVALUATION

Next, the proposed method was evaluated under a more challenging setup (Table VI), where the knowledge was transferred for a specific subset of the Imagenet containing five classes related to recognizing household appliances ("microwave," "dish washer," "refrigerator," "washer," and "vacuum"). The teacher model was trained from scratch, instead of using a larger network pretrained on the whole Imagenet data set, since we found out that all the methods performed significantly better when the knowledge was transferred from a network specialized on the specific subset of classes. Despite the significantly different setup, e.g., different receptive fields for the convolutional layers, different number of classes, and so on, the results are similar to those reported for the CIFAR-10 data set. The proposed method outperforms all the other evaluated methods under all the evaluated metrics.

C. Distribution Shift Evaluation Using CIFAR-10 and STL-10

TABLE VII
CIFAR-10: RETRIEVAL EVALUATION USING THE STL-10 DATA SET

The models trained using the CIFAR-10 data set (KT using the pretrained model) were also evaluated using a more challenging setup, where the STL-10 data set, which contains the same classes as CIFAR-10 (except for one), was employed to evaluate the KT methods in a distribution shift scenario. The STL-10 images were resized to 32 × 32 pixels, in order to be compatible with the networks trained on CIFAR-10. The results are reported in Table VII. Two different scenarios were used: in the first one, the transfer set was the CIFAR-10 data set, while in the second one, the transfer set was the STL-10 data set. In both cases, no labels were used for the KT process, while the teacher ResNet-18 model was trained using the CIFAR-10 data set as before. The proposed method again outperforms all the other evaluated methods for both scenarios. As expected, using data from the same distribution for the KT process, i.e., using the training set of the STL-10 data set as the transfer set, leads to better retrieval precision for all the evaluated methods. Note again that no labels were needed for this process, allowing for using the whole unlabeled training set of the STL-10 data set (100 000 images).

TABLE VIII

SUN ATTRIBUTE: CROSS-MODAL KT

D. Cross-Modal KT

The proposed method was also evaluated on a cross-modal KT setup using the SUN Attribute data set [44] that contains more than 700 categories of scenes and 14 000 images, where each image is described by 102 discriminative textual attributes. The evaluation results are reported in Table VIII. First, the knowledge was transferred from 2 × 2 HoG features [44] into the student network. The proposed PKT method leads to 31.21% mAP, outperforming all the other methods and achieving almost the same performance as the original HoG features. Furthermore, the proposed method was also evaluated under a cross-modal KT setup, where the knowledge was transferred from the textual modality (expressed in the form of a list of textual attributes) into the student neural network that operates within the visual modality ("Attribute" features). Transferring the knowledge from the textual modality indeed improves the precision of the student network for all the evaluated methods, while the proposed PKT approach again outperforms the rest of the evaluated methods, demonstrating the flexibility of the proposed approach and its ability to support novel KT scenarios.

E. Sensitivity Analysis

TABLE IX
EFFECT OF THE KERNEL FUNCTION ON PKT

TABLE X
LADDER ABLATION STUDY

Finally, we evaluated the effect of the kernel used for estimating the probability distributions on the effectiveness of the KT process. The results are reported in Table IX, using the same setup as in Table III (30 training epochs were used). The "Combined" method refers to the proposed combined loss that employs both the cosine and T-student kernels, as described in Section III. The Gaussian kernel is almost always the worst performing kernel. The T-student kernel performs well on the classification tasks (1-nearest neighbor classification accuracy, "1-NN") as well as on the retrieval tasks using the Euclidean distance as the affinity metric. On the other hand, the cosine kernel leads to significantly better results for retrieval using the cosine similarity, but to significantly worse ones when used for retrieval using the Euclidean distance. These results motivated our choice of combining both the cosine and T-student losses. Indeed, the proposed combined approach significantly improved the worst case performance, almost matching the best results for each individual kernel, while outperforming both of them on the 1-NN classification accuracy. It is worth noting that even when training with the Gaussian/T-student kernel, the best retrieval precision is acquired when the cosine similarity is used instead of the Euclidean distance. This somewhat unexpected finding demonstrates that the l_2 normalization, which is involved in the cosine metric, can lead to improved retrieval precision. This finding is further confirmed when the same normalization is applied on the training, i.e., when the cosine or combined kernel is used, since, in this case, the obtained results are further improved.

Furthermore, an ablation study was conducted to evaluate the effectiveness of the proposed multilayer ladder-based KT. The results are reported in Table X. First, the effectiveness of the multilayer KT is evaluated using one (plain PKT), two, three, or four intermediate ladder layers. The precision steadily increases as more layers are employed for the KT process, raising the mAP by approximately 0.5% (average relative mAP increase) for each additional layer employed, leading to a total relative accumulative increase of about 1.5%. Furthermore, the effectiveness of the proposed cyclical training process was also evaluated, by simultaneously transferring the knowledge from all the four layers, instead of using the proposed cyclical training process. Indeed, the employed cyclical training process leads to slight, yet consistent, improvements over the baseline. Finally, to highlight further the effectiveness of employing a ladder architecture, multilayer PKT results are also provided using the ResNet teacher model, instead of the ladder network, where the output of each residual block was matched with each of the layers of the student network. Directly using the larger ResNet teacher, instead of the proposed ladder architecture, leads to a significant drop in the effectiveness of KT, since the mAP drops by more than 4% (relative decrease), demonstrating the effectiveness of the proposed ladder-based KT.

V. CONCLUSION

In this article, a novel probabilistic KT method that allows for transferring the knowledge contained in a large and complex neural network into a smaller and faster one was presented. PKT was capable of transferring the knowledge into a smaller student model by keeping as much information as possible, as expressed through the representations extracted from the teacher model. The proposed method is able to employ different kernels to estimate the probability distribution of the teacher and student models, as well as different divergence metrics, allowing for easily adapting to a wide range of different applications. The flexibility of the proposed method was demonstrated using extensive experiments on four different data sets using a wide variety of experimental setups. The robustness and powerful probabilistic formulation of the proposed method led to improved performance in all the evaluated scenarios, overcoming the limitations and significantly outperforming all the evaluated baseline and state-of-the-art KT methods. The experimental results also demonstrated the importance of using the appropriate intermediate layers for KT, since, in the case of using a ResNet-18 teacher, the efficiency of the multilayer PKT is actually lower than the baseline PKT. To this end, reinforcement learning [46] and/or attention mechanisms [47] can be employed, allowing to better select and/or pair the layers between the student and teacher models, further improving the efficiency of KT.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.

[2] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 11, pp. 3212–3232, Nov. 2019.

[3] M. Mahmud, M. S. Kaiser, A. Hussain, and S. Vassanelli, “Applications of deep learning and reinforcement learning to biological data,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2063–2079, Jun. 2018.

[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.

[5] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 4107–4115.

[6] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017, arXiv:1704.04861. [Online]. Available: http://arxiv.org/abs/1704.04861

[7] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–12.

[8] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in Proc. Neural Inf. Process. Syst. Deep Learn. Workshop, 2015, pp. 1–9.

[9] Z. Tang, D. Wang, and Z. Zhang, “Recurrent neural network training with dark knowledge transfer,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2016, pp. 5900–5904.

[10] Y. Chebotar and A. Waters, “Distilling knowledge from ensembles of neural networks for speech recognition,” in Proc. Interspeech, Sep. 2016, pp. 3439–3443.

[11] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, “Learning efficient object detection models with knowledge distillation,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 742–751.

[12] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in Proc. Int. Workshop Similarity-Based Pattern Recognit., 2015, pp. 84–92.

[13] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge, U.K.: Cambridge Univ. Press, 2008.

[14] M. Tzelepi and A. Tefas, “Deep convolutional learning for content based image retrieval,” Neurocomputing, vol. 275, pp. 2467–2478, Jan. 2018.

[15] P. Chitrakar, C. Zhang, G. Warner, and X. Liao, “Social media image retrieval using distilled convolutional neural network for suspicious e-Crime and terrorist account detection,” in Proc. IEEE Int. Symp. Multimedia (ISM), Dec. 2016, pp. 493–498.

[16] Y. Cao, Y. Chen, and D. Khosla, “Spiking deep convolutional neural networks for energy-efficient object recognition,” Int. J. Comput. Vis., vol. 113, no. 1, pp. 54–66, May 2015.

[17] R. Shokri and V. Shmatikov, “Privacy-preserving deep learning,” in Proc. 53rd Annu. Allerton Conf. Commun., Control, Comput. (Allerton), Sep. 2015, pp. 1310–1321.


[18] R. Fierimonte, S. Scardapane, A. Uncini, and M. Panella, “Fully decentralized semi-supervised learning via privacy-preserving matrix completion,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 11, pp. 2699–2711, Nov. 2017.

[19] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, Nov. 2006.

[20] N. Passalis and A. Tefas, “Unsupervised knowledge transfer using similarity embeddings,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 946–950, Mar. 2019.

[21] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2005, pp. 886–893.

[22] K. Torkkola, “Feature extraction by non-parametric mutual information maximization,” J. Mach. Learn. Res., vol. 3, pp. 1415–1438, Mar. 2003.

[23] N. Passalis and A. Tefas, “Learning deep representations with probabilistic knowledge transfer,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 268–284.

[24] L. Yu, V. O. Yazici, X. Liu, J. van de Weijer, Y. Cheng, and A. Ramisa, “Learning metrics from teachers: Compact networks for image embedding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2907–2916.

[25] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011, pp. 215–223.

[26] O. Russakovsky et al., “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.

[27] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 535–541.

[28] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4068–4076.

[29] W. Chan, N. Rosemary Ke, and I. Lane, “Transferring knowledge from a RNN to a DNN,” 2015, arXiv:1504.01483. [Online]. Available: http://arxiv.org/abs/1504.01483

[30] Z. Tang, D. Wang, Y. Pan, and Z. Zhang, “Knowledge transfer pre-training,” 2015, arXiv:1506.02256. [Online]. Available: http://arxiv.org/abs/1506.02256

[31] A. K. Balan, V. Rathod, K. P. Murphy, and M. Welling, “Bayesian dark knowledge,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 3438–3446.

[32] X. Dong, J. Shen, L. Shao, and M.-H. Yang, “Interactive cosegmentation using global and local energy optimization,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3966–3977, Nov. 2015.

[33] J. Shen, Z. Liang, J. Liu, H. Sun, L. Shao, and D. Tao, “Multiobject tracking by submodular optimization,” IEEE Trans. Cybern., vol. 49, no. 6, pp. 1990–2001, Jun. 2019.

[34] J. Shen, X. Dong, J. Peng, X. Jin, L. Shao, and F. Porikli, “Submodular function optimization for motion clustering and image segmentation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2637–2649, Sep. 2019.

[35] J. Shen, J. Peng, X. Dong, L. Shao, and F. Porikli, “Higher order energies for image segmentation,” IEEE Trans. Image Process., vol. 26, no. 10, pp. 4911–4922, Oct. 2017.

[36] X. Dong and J. Shen, “Triplet loss in siamese network for object tracking,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 459–474.

[37] S.-I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, “Improved knowledge distillation via teacher assistant,” 2019, arXiv:1902.03393. [Online]. Available: http://arxiv.org/abs/1902.03393

[38] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7130–7138.

[39] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.

[40] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. Hoboken, NJ, USA: Wiley, 2015.

[41] B. A. Turlach et al., Bandwidth Selection in Kernel Density Estimation: A Review. Louvain-la-Neuve, Belgium: Univ. catholique de Louvain, 1993.

[42] H. Jeffreys, “An invariant form for the prior probability in estimation problems,” Proc. Roy. Soc. London A, Math. Phys. Sci., vol. 186, no. 1007, pp. 453–461, 1946.

[43] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Univ. Toronto, Toronto, ON, Canada, Tech. Rep., 2009.

[44] G. Patterson and J. Hays, “SUN attribute database: Discovering, annotating, and recognizing scene attributes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2751–2758.

[45] J. Wu, H. Xiong, and J. Chen, “Adapting the right measures for K-means clustering,” in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining (KDD), 2009, pp. 877–886.

[46] X. Dong, J. Shen, W. Wang, Y. Liu, L. Shao, and F. Porikli, “Hyperparameter optimization for tracking with continuous deep Q-learning,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 518–527.

[47] W. Wang and J. Shen, “Deep visual attention prediction,” IEEE Trans. Image Process., vol. 27, no. 5, pp. 2368–2378, May 2018.

Nikolaos Passalis received the B.Sc. degree in informatics, the M.Sc. degree in information systems, and the Ph.D. degree in informatics from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2013, 2015, and 2018, respectively.

Since 2019, he has been a Post-Doctoral Researcher with the Aristotle University of Thessaloniki. He has (co)authored more than 55 journal articles and conference papers and contributed one chapter to one edited book. His research interests include deep learning, lightweight machine learning, information retrieval, and computational intelligence.

Maria Tzelepi received the B.Sc. degree in informatics and the M.Sc. degree in digital media-computational intelligence from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 2013 and 2016, respectively, where she is currently pursuing the Ph.D. degree with the Artificial Intelligence and Information Analysis Laboratory, Department of Informatics.

Her research interests include deep learning, pattern recognition, computer vision, and image analysis and retrieval.

Anastasios Tefas (Member, IEEE) received the B.Sc. and Ph.D. degrees in informatics from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1997 and 2002, respectively.

Since 2017, he has been an Associate Professor with the Department of Informatics, Aristotle University of Thessaloniki, where he was a Lecturer and an Assistant Professor from 2008 to 2017. From 2006 to 2008, he was an Assistant Professor with the Department of Information Management, Technological Institute of Kavala, Kavala, Greece. From 2003 to 2004, he was a temporary Lecturer with the Department of Informatics, University of Thessaloniki. From 1997 to 2002, he was a Researcher and Teaching Assistant with the Department of Informatics, University of Thessaloniki. He has participated in 18 research projects financed by national and European funds. He has coauthored 100 journal articles and 225 articles in international conferences and contributed eight chapters to edited books in his area of expertise. Over 5000 citations have been recorded to his publications, and his H-index is 36 according to Google Scholar. His current research interests include computational intelligence, deep learning, pattern recognition, statistical machine learning, digital signal and image analysis and retrieval, and computer vision.

Dr. Tefas is an Area Editor of the Signal Processing: Image Communication journal.
