
A Theoretical Analysis of Metric Hypothesis Transfer Learning

Michaël Perrot MICHAEL.PERROT@UNIV-ST-ETIENNE.FR
Amaury Habrard AMAURY.HABRARD@UNIV-ST-ETIENNE.FR

Université de Lyon, Université Jean Monnet de Saint-Etienne, Laboratoire Hubert Curien, CNRS, UMR5516, F-42000, Saint-Etienne, France.

Abstract

We consider the problem of transferring some a priori knowledge in the context of supervised metric learning approaches. While this setting has been successfully applied in some empirical contexts, no theoretical evidence exists to justify this approach. In this paper, we provide a theoretical justification based on the notion of algorithmic stability adapted to the regularized metric learning setting. We propose an on-average-replace-two-stability model allowing us to prove fast generalization rates when an auxiliary source metric is used to bias the regularizer. Moreover, we prove a consistency result from which we show the interest of considering biased weighted regularized formulations, and we provide a solution to estimate the associated weight. We also present some experiments illustrating the interest of the approach in standard metric learning tasks and in a transfer learning problem where few labelled data are available.

1. Introduction

Many machine learning problems, such as clustering, classification or ranking, require accurately comparing examples by means of distances or similarities. Designing a good metric for the task at hand is thus of crucial importance. Since manually tuning a metric is in general difficult and tedious, a recent trend consists in learning the metrics directly from data. This has led to the emergence of supervised metric learning; see (Bellet et al., 2013; Kulis, 2013) for up-to-date surveys. The underlying idea is to automatically infer the parameters of a metric in order to capture the idiosyncrasies of the data. In a supervised classification perspective, this is generally done by trying to satisfy pair-based constraints aiming at assigning a small (resp. large)

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

score to pairs of examples of the same class (resp. different class). Most of the existing work has notably focused on learning Mahalanobis-like distances of the form d_M(x, x′) = √((x − x′)ᵀM(x − x′)) where M is a positive semi-definite (PSD) matrix¹, the learned matrix being typically plugged into a k-Nearest Neighbor classifier, allowing one to achieve a better accuracy than the standard Euclidean distance.
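As a minimal illustration of the distance family above, the following sketch (toy vectors are assumptions, not data from the paper) computes d_M(x, x′) and checks the two special cases discussed in the footnote: M = I recovers the Euclidean distance and M = Σ⁻¹ the classical Mahalanobis distance.

```python
import numpy as np

def mahalanobis_like(x, x_prime, M):
    """d_M(x, x') = sqrt((x - x')^T M (x - x')) for a PSD matrix M."""
    diff = x - x_prime
    return float(np.sqrt(diff @ M @ diff))

x, x_prime = np.array([1.0, 2.0]), np.array([4.0, 6.0])

# M = I recovers the Euclidean distance: sqrt(3^2 + 4^2) = 5.
assert np.isclose(mahalanobis_like(x, x_prime, np.eye(2)), 5.0)

# M = Sigma^{-1} recovers the original Mahalanobis distance
# (illustrated with a toy diagonal covariance matrix Sigma).
Sigma = np.diag([4.0, 1.0])
d = mahalanobis_like(x, x_prime, np.linalg.inv(Sigma))
assert np.isclose(d, np.sqrt(3.0**2 / 4.0 + 4.0**2))
```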

Recently, there has been growing interest in methods able to take into account some background knowledge (Parameswaran & Weinberger, 2010; Cao et al., 2013; Bohné et al., 2014) for learning M. This is in particular the case for supervised regularized metric learning approaches where the regularizer is biased with respect to an auxiliary metric given under the form of a matrix. The main objective here is to make use of this a priori knowledge in a setting where only few labelled data are available to help learning. For example, in the context of learning a PSD matrix M plugged into a Mahalanobis-like distance as discussed above, let I be the identity matrix used as auxiliary knowledge; ‖M − I‖ is a biased regularizer often considered. This regularization can be interpreted as follows: learn M while trying to stay close to the Euclidean distance, or, from another standpoint, try to learn a matrix M which performs better than I. Other standard matrices can be used, such as Σ⁻¹, the inverse of the variance-covariance matrix; note that if we take the 0 matrix, we retrieve the classical unbiased regularization term.

Another useful setting arises when I is replaced by an auxiliary matrix M_S learned from another task. This corresponds to a transfer learning approach where the biased regularization can be interpreted as transferring the knowledge brought by M_S for learning M. This setting is appropriate when the distributions over the training and testing domains are different but related.

¹ Note that this distance is a generalization of some well-known distances: when M = I, I being the identity matrix, we retrieve the Euclidean distance; when M = Σ⁻¹, where Σ is the variance-covariance matrix of the data at hand, it actually corresponds to the original definition of a Mahalanobis distance.


Domain adaptation strategies (Ben-David et al., 2010) propose to make use of the relationship between the training examples, called the source domain, and the testing instances, called the target domain, to infer a model. However, it is sometimes not possible to have access to all the training examples, for example when new domains are acquired incrementally. In this context, transferring the information directly from the model learned on the source domain, without any other access to the source domain, is of crucial importance. In the context of this paper, we call this setting Metric Hypothesis Transfer Learning, in reference to the Hypothesis Transfer Learning model introduced in (Kuzborskij & Orabona, 2013) in the context of classical supervised learning.

Metric learning generally suffers from a lack of theoretical justifications; in particular, metric hypothesis transfer learning has never been investigated from a theoretical standpoint. In this paper, we propose to bridge this gap by providing a theoretical analysis showing that supervised regularized metric learning approaches using a biased regularization are well-founded. Our theoretical analysis is based on algorithmic stability arguments allowing one to derive generalization guarantees when a learning algorithm does not suffer too much from a little change in the training sample. As a first contribution, we introduce a new notion of stability called on-average-replace-two-stability that is well-suited to regularized metric learning formulations. This notion allows us to prove a high-probability generalization bound for metric hypothesis transfer learning achieving a fast convergence rate in O(1/n) in the context of admissible, k-lipschitz and convex losses. In a second step, we provide a consistency result from which we justify the interest of weighted biased regularization of the form ‖M − βM_S‖ where β is a parameter to set. From this result, we derive an approach for assessing this parameter without resorting to a costly parameter tuning procedure. We also provide an experimental study showing the effectiveness of transfer metric learning with weighted biased regularization in the presence of few labeled data, both on standard metric learning and transfer learning tasks.

This paper is organized as follows. Section 2 introduces some notations and definitions while Section 3 discusses some related work. Our theoretical analysis is presented in Section 4. We detail our experiments in Section 5 before concluding in Section 6.

2. Notations and Definitions

We start by introducing several notations and definitions that will be used throughout the paper. Let T be a domain equipped with a probability distribution D_T defined over X × Y, where X ⊆ R^d and Y is the label set. We consider metrics corresponding to distance functions X × X → R⁺ parameterized by a d × d positive semi-definite (PSD) matrix M, denoted M ⪰ 0. In the following, a metric will be represented by its matrix M. We also consider that we have access to some additional information under the form of an auxiliary d × d matrix M_S; throughout this paper we call this additional information the source metric or source hypothesis. We denote the Frobenius norm by ‖·‖_F; M_kl represents the value of the entry at index (k, l) in matrix M, [a]_+ = max(a, 0) denotes the hinge loss, and [n] the set {1, . . . , n} for any n ∈ N.

Let T = {z_i = (x_i, y_i)}ⁿ_{i=1} be a labeled training set drawn from D_T. We consider the following learning framework for biased regularized metric learning:

M∗ = arg min_{M ⪰ 0} L_T(M) + λ‖M − M_S‖²_F    (1)

where L_T(M) = (1/n²) Σ_{z,z′∈T} l(M, z, z′) stands for the empirical risk of a metric hypothesis M. Similarly, we denote the true risk by L_{D_T}(M) = E_{z,z′∼D_T} l(M, z, z′). In this work we only consider convex, k-lipschitz and (σ,m)-admissible losses, for which we recall the definitions below.
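Problem (1) can be sketched with a projected subgradient descent: alternate subgradient steps on the regularized empirical risk with a projection onto the PSD cone. The pairwise hinge loss below is the one studied in Section 4.3; the margins, step size, iteration count and toy data are illustrative assumptions, not the authors' exact solver.

```python
import numpy as np

def hinge_pair_loss_grad(M, z, zp, gamma_sim=0.5, gamma_dis=2.0):
    """Loss of Sec. 4.3 and a subgradient w.r.t. M (margins are assumptions)."""
    (x, y), (xp, yp) = z, zp
    yy = 1.0 if y == yp else -1.0
    gamma = gamma_sim if yy > 0 else gamma_dis
    d = x - xp
    score = yy * (d @ M @ d - gamma)
    if score > 0:
        return score, yy * np.outer(d, d)
    return 0.0, np.zeros_like(M)

def project_psd(M):
    # Project a symmetric matrix onto the PSD cone by clipping eigenvalues.
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def learn_metric(T, M_S, lam=1.0, lr=0.05, iters=100):
    """Projected subgradient descent on L_T(M) + lam * ||M - M_S||_F^2."""
    n = len(T)
    M = M_S.copy()
    for _ in range(iters):
        G = sum(hinge_pair_loss_grad(M, z, zp)[1] for z in T for zp in T)
        M = project_psd(M - lr * (G / n**2 + 2 * lam * (M - M_S)))
    return M

# Toy usage on three labeled points (illustrative data).
T = [(np.array([0.0, 0.0]), 0), (np.array([1.0, 0.0]), 0), (np.array([0.0, 3.0]), 1)]
M = learn_metric(T, np.eye(2))
assert np.allclose(M, M.T) and np.linalg.eigvalsh(M).min() >= -1e-8
```

The projection keeps the iterate in the feasible set M ⪰ 0 after every step, which is the standard way to handle the PSD constraint in a first-order method.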

Definition 1 (k-lipschitz continuity). A loss function l(M, z, z′) is k-lipschitz w.r.t. its first argument if, for any matrices M, M′ and any pair of examples z, z′, there exists k ≥ 0 such that:

|l(M, z, z′) − l(M′, z, z′)| ≤ k‖M − M′‖_F.

This property ensures that the deviation of the loss does not exceed the deviation between the matrices M and M′ scaled by a positive constant k.

Definition 2 ((σ,m)-admissibility). A loss function l(M, z, z′) is (σ,m)-admissible, w.r.t. M, if it is convex w.r.t. its first argument and if, for any two pairs of examples z1, z2 and z3, z4, we have:

|l(M, z1, z2) − l(M, z3, z4)| ≤ σ|y1y2 − y3y4| + m

where y_iy_j = 1 if y_i = y_j and −1 otherwise. Thus |y1y2 − y3y4| ∈ {0, 2}.

This property bounds the difference between the losses of two pairs of examples by a value only related to the labels plus a constant independent from M.
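The label term of Definition 2 can be checked exhaustively: the products y_iy_j only take values in {−1, +1}, so |y1y2 − y3y4| is 0 when the two pairs have the same agreement pattern and 2 otherwise.

```python
# Enumerate the possible values of y_i * y_j (same class -> +1, different -> -1)
prods = {1 * 1, 1 * -1, -1 * 1, -1 * -1}
# All possible values of |y1y2 - y3y4| across two pairs.
diffs = {abs(p - q) for p in prods for q in prods}
assert prods == {1, -1}
assert diffs == {0, 2}
```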

To derive our theoretical results, we make use of the notion of algorithmic stability, which allows one to provide generalization guarantees. A learning algorithm is stable if a slight modification of its input does not change its output much. In our analysis we use two definitions of stability. On the one hand, we introduce in Section 4.1 the notion of on-average-replace-two-stability, which is an adaptation to metric learning of the notion of on-average-replace-one-stability proposed in (Shalev-Shwartz & Ben-David, 2014) and recalled in Def. 3 below.


Definition 3 (On-average-replace-one-stability). Let ε : N → R be monotonically decreasing and U(n) be the uniform distribution over [n]. An algorithm A is on-average-replace-one-stable with rate ε(n) if, for any distribution D_T,

E_{T∼D_T^n, i∼U(n), z′∼D_T} [l(A(T^i), z_i) − l(A(T), z_i)] ≤ ε(n)

where A(T), respectively A(T^i), is the optimal solution of algorithm A when learning with training set T, respectively T^i. T^i is obtained by replacing the ith example of T by z′.

This property ensures that, given an example, learning with or without it will not imply a big change in the hypothesis prediction. Note that the property is required to hold on average over all the possible training sets of size n.

On the other hand, we consider an adaptation of the framework of uniform stability for metric learning proposed in (Jin et al., 2009) and recalled in Def. 4.

Definition 4 (Uniform stability). A learning algorithm has a uniform stability in K/n, with K ≥ 0 a constant, if ∀i,

sup_{z,z′∼D_T} |l(M∗, z, z′) − l(M^{i}∗, z, z′)| ≤ K/n

where M∗ is the matrix learned on the training set T and M^{i}∗ is the matrix learned on the training set T^i obtained by replacing the ith example of T by a new independent one.

Uniform stability requires that a small change in the training set does not imply a significant variation in the learned model's output. The constraint in O(1/n) over the supremum makes this property rather strong, since it considers a worst case over the possible pairs of examples to compare, whatever the training set. It is actually one of the most general algorithmic stability settings (Bousquet & Elisseeff, 2002).
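The intuition behind uniform stability (with K inversely proportional to λ, as in Th. 3) can be probed numerically. The sketch below swaps in a squared pairwise loss so the biased regularized minimizer has a closed form; data, targets and evaluation pairs are assumptions made only so the replace-one experiment is easy to run. Stronger regularization toward M_S should make the replace-one loss deviation smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

def features_and_targets(T):
    """Vectorize pairs: v = vec((x - x')(x - x')^T), target 0/1 by label agreement."""
    V, t = [], []
    for (x, y) in T:
        for (xp, yp) in T:
            diff = x - xp
            V.append(np.outer(diff, diff).ravel())
            t.append(0.0 if y == yp else 1.0)
    return np.array(V), np.array(t)

def fit(T, m_S, lam):
    """Closed-form minimizer of ||V m - t||^2 / n^2 + lam * ||m - m_S||^2."""
    V, t = features_and_targets(T)
    n2 = len(T) ** 2
    A = V.T @ V / n2 + lam * np.eye(V.shape[1])
    return np.linalg.solve(A, V.T @ t / n2 + lam * m_S)

def pair_loss(m, z, zp):
    diff = z[0] - zp[0]
    v = np.outer(diff, diff).ravel()
    return (v @ m - (0.0 if z[1] == zp[1] else 1.0)) ** 2

d = 3
T = [(rng.normal(size=d), i % 2) for i in range(15)]
z_new = (rng.normal(size=d), 1)
eval_pairs = [((rng.normal(size=d), i % 2), (rng.normal(size=d), (i + 1) % 2))
              for i in range(50)]
m_S = np.eye(d).ravel()

def stability(lam):
    m = fit(T, m_S, lam)
    m_i = fit([z_new] + T[1:], m_S, lam)   # T^i: first example replaced
    return max(abs(pair_loss(m, z, zp) - pair_loss(m_i, z, zp))
               for z, zp in eval_pairs)

# Larger lambda pulls both solutions toward m_S, hence a smaller deviation,
# consistent with K = O(1/lambda) in Theorem 3.
assert stability(100.0) < stability(0.1)
```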

3. Related Work

3.1. Metric Learning

Based on the pioneering approach of (Xing et al., 2002), metric learning aims at finding the parameters of a distance function by maximizing the distance between dissimilar examples (i.e., examples of different classes) while maintaining a small distance between similar ones (i.e., of the same class). Following this idea, one of the most famous approaches, called LMNN (Weinberger et al., 2005), proposes to learn a PSD matrix dedicated to improving the k-nearest neighbours algorithm. To do so, the authors force the metric to respect triplet-based local constraints of the form (z_i, z_j, z_k) where z_j and z_k belong to the neighbourhood of z_i, z_i and z_j being of the same class, and z_k being of the opposite class. The constraints impose that z_i should be closer to z_j than to z_k with respect to a margin ε. In ITML, (Davis et al., 2007) propose to use a LogDet divergence as a regularizer, allowing one to ensure an automatic enforcement of the PSD constraint. The idea is to force the learned matrix M to stay as close as possible to a good matrix M_S defined a priori (in general M_S is chosen as the identity matrix). Indeed, if this divergence is finite, the authors show that M is guaranteed to be PSD. This constraint over M can be interpreted as a biased regularization w.r.t. M_S.

The idea behind biased regularization has been successfully used in many metric learning approaches. For example, (Zha et al., 2009) have proposed to replace the identity matrix (M_S = I) originally used in ITML by matrices previously learned on so-called auxiliary data sets. Similarly, in (Parameswaran & Weinberger, 2010) the authors are interested in multi-task metric learning. They propose to learn one metric for each task and a global metric common to all the tasks. For this global metric, they consider a biased regularization of the form ‖M − I‖²_F where I is the identity matrix, but they do not study any other kind of source information. In (Cao et al., 2013), the authors use a similar biased regularization to learn a metric learning model for face recognition. As a last example, (Bohné et al., 2014) introduce a regularization of the form ‖M − βI‖_F where they learn both M and β. In our work, instead of optimizing these two parameters, we derive a theoretically founded algorithm to choose the optimal value of β beforehand.

3.2. Theoretical Frameworks in Metric Learning

Theoretically speaking, there are not many frameworks for metric learning. The goal of generalization guarantees is to show that the empirical estimation of the error of an algorithm does not deviate much from the true error. One of the main difficulties in deriving bounds for metric learning is the fact that, instead of considering examples drawn i.i.d. from a distribution, we consider pairs of examples which might not be independent. Building upon the framework of stability proposed in (Bousquet & Elisseeff, 2002), (Jin et al., 2009) propose one of the first studies of the generalization ability of a metric learning algorithm. Building upon this work, (Perrot et al., 2014) give theoretical guarantees for a local metric learning algorithm and (Bellet et al., 2012) derive generalization guarantees for a similarity learning algorithm. Other ways to derive generalization guarantees are to use the Rademacher complexity, as in (Cao et al., 2012; Guo & Ying, 2014), or the notion of algorithmic robustness (Bellet & Habrard, 2015).

3.3. Biased Regularization in Supervised Learning

Biased regularization has already been studied in non-metric-learning settings. For example, in (Kienzle & Chellapilla, 2006), the authors propose to use biased regularization to learn SVM classifiers. A first theoretical study of biased regularization in the context of regularized least squares has been proposed in (Kuzborskij & Orabona, 2013). Their study is based on a notion of hypothesis stability less general than the uniform stability used in our approach. In (Kuzborskij & Orabona, 2014), the authors derive generalization bounds based on the Rademacher complexity for regularized empirical risk minimization methods in a supervised learning setting. Their results show that if the true risk of the source hypothesis on the target domain is low, then the generalization rate can be improved. However, computing the true risk of the source hypothesis is not possible in practice. In our analysis, we derive a generalization bound which depends on the empirical risk and the complexity (w.r.t. the regularization term) of the source metric. This allows us to derive an algorithm minimizing the generalization bound by taking into account both the performance and the complexity of the source metric.

4. Contribution

We divide our contribution, consisting of a theoretical analysis of Alg. 1 given convex, k-lipschitz and (σ,m)-admissible losses, into three parts. First, we provide an on-average analysis for E_T[L_{D_T}(M∗)], where M∗ represents the metric learned with Alg. 1 using training set T. This analysis allows us to bound the expected loss over distribution D_T with respect to the loss of the auxiliary metric M_S over D_T. It shows that on average the learned metric tends to be better than the given source M_S, with a fast convergence rate in O(1/n). Second, we provide a consistency analysis of our framework leading to a standard convergence rate of O(1/√n) w.r.t. the empirical loss over T optimized in Alg. 1. In a third part, we specialize the previous consistency result to a specific loss and show that it is possible to refine our generalization bound so that it depends both on the complexity of our source metric M_S and on its empirical performance on the training set T. We then deduce an approach to weight the importance of the source hypothesis in order to optimize the generalization bound.

4.1. On average analysis

Def. 3 allows one to perform an on-average analysis of the expected loss; however, its formulation is not tailored to metric learning approaches that work with pairs of examples. Thus, we propose an adaptation of it, called on-average-replace-two-stability, allowing one to derive sharp bounds for metric learning.

Definition 5 (On-average-replace-two-stability). Let ε : N → R be monotonically decreasing and let U(n) be the uniform distribution over [n]. A metric learning algorithm is on-average-replace-two-stable with rate ε(n) if, for every distribution D_T:

E_{T∼D_T^n, i,j∼U(n), z1,z2∼D_T} [l(M^{ij}∗, z_i, z_j) − l(M∗, z_i, z_j)] ≤ ε(n)

where M∗, respectively M^{ij}∗, is the optimal solution when learning with the training set T, respectively T^{ij}. T^{ij} is obtained by replacing z_i, the ith example of T, by z1 to get a training set T^i and then by replacing z_j, the jth example of T^i, by z2.

Note that when this definition holds, it implies E_T[L_{D_T}(M∗) − L_T(M∗)] ≤ ε(n). The next theorem shows that our algorithm is on-average-replace-two-stable.

Theorem 1 (On-average-replace-two-stability). Given a training sample T of size n drawn i.i.d. from D_T, our algorithm is on-average-replace-two-stable with ε(n) = 8k²/(λn).

Proof. The proof of Th. 1 can be found in the supplementary material.

We can now bound the expected true risk of our algorithm.

Theorem 2 (On average bound). For any convex, k-lipschitz loss, we have:

E_{T∼D_T^n}[L_{D_T}(M∗)] ≤ L_{D_T}(M_S) + 8k²/(λn)

where the expected value is taken over size-n training sets.

Proof. We have:

E_T[L_{D_T}(M∗)]
= E_T[L_{D_T}(M∗)] + E_T[L_T(M∗)] − E_T[L_T(M∗)]
= E_T[L_T(M∗)] + E_T[L_{D_T}(M∗) − L_T(M∗)]
≤ E_T[L_T(M_S)] + 8k²/(λn).    (2)

Inequality 2 is obtained by noting that, from Th. 1, we have E_T[L_{D_T}(M∗) − L_T(M∗)] ≤ 8k²/(λn); then the convexity of our algorithm and the optimality of M∗ give L_T(M∗) ≤ L_T(M∗) + λ‖M∗ − M_S‖²_F ≤ L_T(M_S) + λ‖M_S − M_S‖²_F. Noting that E_T[L_T(M_S)] = L_{D_T}(M_S) gives Th. 2.

This bound shows that, with a sufficient number of examples and a fast convergence rate in O(1/n), we will on average obtain a metric which is at least as good as the source hypothesis. Thus, choosing a good source metric is key to learning well.
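The key step of the proof, L_T(M∗) ≤ L_T(M∗) + λ‖M∗ − M_S‖²_F ≤ L_T(M_S), only uses the optimality of M∗ and the fact that the regularizer vanishes at M_S. It can be checked on a one-dimensional toy instance (scalar "metric" m, squared loss, all numbers assumptions) where the regularized minimizer has a closed form:

```python
import numpy as np

# Toy 1-D instance: empirical risk L(m) = mean_k (a_k * m - b_k)^2,
# regularized objective f(m) = L(m) + lam * (m - m_S)^2.
a = np.array([1.0, 2.0, 0.5, 1.5])
b = np.array([0.8, 2.5, 0.2, 1.9])
lam, m_S = 0.5, 1.0

L = lambda m: np.mean((a * m - b) ** 2)
f = lambda m: L(m) + lam * (m - m_S) ** 2

# Closed-form minimizer of the quadratic objective f.
m_star = (np.mean(a * b) + lam * m_S) / (np.mean(a ** 2) + lam)

# Optimality of m_star gives f(m_star) <= f(m_S) = L(m_S), and since the
# regularizer is non-negative, L(m_star) <= f(m_star) <= L(m_S).
assert f(m_star) <= f(m_S) + 1e-12
assert L(m_star) <= L(m_S) + 1e-12
```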

4.2. Consistency analysis

We now provide a consistency analysis taking into account the empirical risk optimized in Alg. 1. We begin by showing, in the next theorem, that our algorithm is uniformly stable w.r.t. Def. 4.


Theorem 3 (Uniform stability). Given a training sample T of n examples drawn i.i.d. from D_T, our algorithm has a uniform stability in K/n with K = 4k²/λ.

Proof. The beginning of the proof follows closely the one proposed in (Bousquet & Elisseeff, 2002) and is postponed to the supplementary material for the sake of readability. We consider the end of the proof here. We have

B ≤ (4kt/n)‖∆M‖_F

where B = λ‖M − M_S‖²_F − λ‖M − t∆M − M_S‖²_F + λ‖M^i − M_S‖²_F − λ‖M^i + t∆M − M_S‖²_F. Setting t = 1/2 we have:

B = λ‖M − M_S‖²_F − λ‖M − (1/2)∆M − M_S‖²_F + λ‖M^i − M_S‖²_F − λ‖M^i + (1/2)∆M − M_S‖²_F

= λ Σ_{k,l} [ (M_kl − M_Skl)² − (M_kl − (1/2)(M_kl − M^i_kl) − M_Skl)² + (M^i_kl − M_Skl)² − (M^i_kl + (1/2)(M_kl − M^i_kl) − M_Skl)² ]

= λ Σ_{k,l} [ (M_kl − M_Skl)² − ((1/2)(M_kl − M_Skl) + (1/2)(M^i_kl − M_Skl))² + (M^i_kl − M_Skl)² − ((1/2)(M_kl − M_Skl) + (1/2)(M^i_kl − M_Skl))² ]

= λ Σ_{k,l} [ (1/2)((M_kl − M_Skl)² + (M^i_kl − M_Skl)² − 2(M_kl − M_Skl)(M^i_kl − M_Skl)) ]

= λ Σ_{k,l} [ (1/2)(M_kl − M^i_kl)² ] = (λ/2)‖∆M‖²_F.

Then we obtain

(λ/2)‖∆M‖²_F ≤ (4k/(2n))‖∆M‖_F ⇔ ‖∆M‖_F ≤ 4k/(λn).

Using the k-lipschitz continuity of the loss, we have:

sup_{z,z′} |l(M, z, z′) − l(M^i, z, z′)| ≤ k‖∆M‖_F ≤ 4k²/(λn).

Setting K = 4k²/λ concludes the proof.
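The algebraic identity at the heart of the proof, B = (λ/2)‖∆M‖²_F at t = 1/2, holds for arbitrary matrices and can be verified numerically (random matrices below are assumptions used only for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M   = rng.normal(size=(d, d))
M_i = rng.normal(size=(d, d))
M_S = rng.normal(size=(d, d))
lam = 0.7
dM = M - M_i                       # Delta M
fro2 = lambda A: np.sum(A ** 2)    # squared Frobenius norm

t = 0.5
B = lam * (fro2(M - M_S) - fro2(M - t * dM - M_S)
           + fro2(M_i - M_S) - fro2(M_i + t * dM - M_S))

# Both perturbed terms land on the same midpoint (M + M_i)/2 - M_S,
# which is what collapses B to (lam / 2) * ||Delta M||_F^2.
assert np.isclose(B, lam / 2 * fro2(dM))
```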

Using the fact that our algorithm is uniformly stable, we can derive generalization guarantees, as stated in Th. 4.

Theorem 4 (Generalization bound). With probability 1 − δ, for any matrix M learned with our K-uniformly-stable algorithm and for any convex, k-lipschitz and (σ,m)-admissible loss, we have:

L_{D_T}(M) ≤ L_T(M) + (4σ + 2m + c)√(ln(2/δ)/(2n)) + O(1/n)

where c is a constant linked to the k-lipschitz property of the loss.

Proof. The proof is available in the supplementary material.

This bound shows that, with a convergence rate in O(1/√n), the true risk of our algorithm is bounded above by the empirical risk, justifying the consistency of the approach. In the next section, we propose an extension of this analysis to include the performance of the source metric. This extension allows us to introduce a natural weighting of the source metric in order to improve the proposed bound.

4.3. Refinement with weighted source hypothesis

In this part we study a specific loss, namely

l(M, z, z′) = [yy′((x − x′)ᵀM(x − x′) − γ_{yy′})]_+

where yy′ = 1 if y = y′ and −1 otherwise. The convexity follows from the use of the hinge loss. In the next two lemmas, we show that this loss is k-lipschitz continuous and (σ,m)-admissible. The (σ,m)-admissibility result is of high importance because it allows us to introduce some information coming from the source matrix M_S.
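This loss can be implemented directly; the sketch below uses illustrative margin values γ for the same-class and different-class cases (the paper leaves the margins γ_{yy′} as parameters):

```python
import numpy as np

def pair_hinge_loss(M, z, zp, gamma_sim=0.5, gamma_dis=2.0):
    """l(M, z, z') = [yy'((x - x')^T M (x - x') - gamma_{yy'})]_+ .
    gamma_sim / gamma_dis are illustrative margins for same/different labels."""
    (x, y), (xp, yp) = z, zp
    yy = 1.0 if y == yp else -1.0
    gamma = gamma_sim if yy > 0 else gamma_dis
    diff = x - xp
    return max(yy * (diff @ M @ diff - gamma), 0.0)

M = np.eye(2)
# Same-class pair: penalized when the squared distance exceeds gamma_sim
# (here d^T M d = 2, so the loss is 2 - 0.5 = 1.5).
assert pair_hinge_loss(M, (np.array([0.0, 0.0]), 1), (np.array([1.0, 1.0]), 1)) == 1.5
# Different-class pair: penalized when the squared distance is below gamma_dis
# (here d^T M d = 1, so the loss is -(1 - 2) = 1.0).
assert pair_hinge_loss(M, (np.array([0.0, 0.0]), 1), (np.array([1.0, 0.0]), 0)) == 1.0
```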

Lemma 1 (k-lipschitz continuity). Let M and M′ be two matrices and z, z′ be two examples. Our loss l(M, z, z′) is k-lipschitz continuous with k = max_{x,x′} ‖x − x′‖².

Proof. The proof is available in the supplementary material.

Lemma 2 ((σ,m)-admissibility). Let z1, z2, z3, z4 be four examples and let M∗ be the optimal solution of Problem 1. The convex and k-lipschitz loss function l(M, z, z′) is (σ,m)-admissible with σ = max(γ_{y3y4}, γ_{y1y2}) and

m = 2 max_{x,x′} ‖x − x′‖² (√(L_T(M_S)/λ) + ‖M_S‖_F).

Proof. Let ε∗ = M∗ − M_S be the difference between the learned and the source metric. We first bound the Frobenius norm of ε∗ w.r.t. the performance of the source metric:

L_T(M∗) + λ‖M∗ − M_S‖²_F ≤ L_T(M_S) + λ‖M_S − M_S‖²_F
⇒ λ‖ε∗‖²_F ≤ L_T(M_S) ⇔ ‖ε∗‖_F ≤ √(L_T(M_S)/λ).

Now we can prove the (σ,m)-admissibility of our loss.

|l(M∗, z1, z2) − l(M∗, z3, z4)|
= |[y1y2((x1 − x2)ᵀM∗(x1 − x2) − γ_{y1y2})]_+ − [y3y4((x3 − x4)ᵀM∗(x3 − x4) − γ_{y3y4})]_+|
≤ |y1y2((x1 − x2)ᵀM∗(x1 − x2) − γ_{y1y2}) − y3y4((x3 − x4)ᵀM∗(x3 − x4) − γ_{y3y4})|    (3)
≤ |y1y2(x1 − x2)ᵀM∗(x1 − x2) − y3y4(x3 − x4)ᵀM∗(x3 − x4)| + |y3y4γ_{y3y4} − y1y2γ_{y1y2}|
≤ 2 max_{x,x′} ((x − x′)ᵀM∗(x − x′)) + |y3y4 − y1y2| max(γ_{y3y4}, γ_{y1y2})
≤ 2 max_{x,x′} ((x − x′)ᵀ(ε∗ + M_S)(x − x′)) + |y3y4 − y1y2| max(γ_{y3y4}, γ_{y1y2})
≤ 2 max_{x,x′} ‖x − x′‖² (‖ε∗‖_F + ‖M_S‖_F) + |y3y4 − y1y2| max(γ_{y3y4}, γ_{y1y2})    (4)
≤ 2 max_{x,x′} ‖x − x′‖² (√(L_T(M_S)/λ) + ‖M_S‖_F) + |y3y4 − y1y2| max(γ_{y3y4}, γ_{y1y2}).

Inequality 3 comes from the 1-lipschitz property of the hinge loss. We obtain inequality 4 by applying the Cauchy-Schwarz inequality and some classical norm properties.

Setting m = 2 max_{x,x′} ‖x − x′‖² (√(L_T(M_S)/λ) + ‖M_S‖_F) and σ = max(γ_{y3y4}, γ_{y1y2}) gives the lemma.

Using Lemmas 1 and 2 we can now derive, in Th. 5, a generalization bound associated with our specific loss.

Theorem 5 (Generalization bound). With probability 1 − δ, for any matrix M learned with Alg. 1, we have:

L_{D_T}(M) ≤ L_T(M) + (√(L_T(M_S)/λ) + ‖M_S‖_F + c_γ)√(ln(2/δ)/(2n)) + O(1/n)

where c_γ is a constant linked to the k-lipschitz property of the loss and the chosen margins.

Proof. The proof is the same as for Th. 4, replacing k, σ and m by their values.

As for Th. 4, the convergence rate is in O(1/√n). The term C(M_S) = √(L_T(M_S)/λ) + ‖M_S‖_F mainly depends on the quality of the source hypothesis M_S. The product C(M_S)·O(1/√n) means that as the number of examples available for learning increases, the quality of the source metric is of decreasing importance. A similar result has already been stated in domain adaptation or transfer learning in (Ben-David et al., 2010; Kuzborskij & Orabona, 2013), where it is shown that as the number of target examples increases, the necessity of having source examples decreases.

Given a source hypothesis M_S, it is possible to optimize it w.r.t. the bound derived in Th. 5. Indeed, note that C(M_S) corresponds to a trade-off between the complexity of the source metric and its performance on the training set: the lower the value of this term, the tighter the bound. Hence, we propose a way to minimize the generalization bound, and more specifically C(M_S), by adding a weighting parameter β ≥ 0 on the source metric M_S. This parameter is a way to control the trade-off between the complexity and the performance of the source metric. It can be assessed by means of the following optimization problem:

β∗ = arg min_β C(βM_S)    (5)

Note that the bound derived in Th. 5 holds whatever the value of M_S. Thus, replacing it with β∗M_S does not impact the theoretical study proposed in this section.

Interpretation of the value of β∗. We can distinguish three main cases. First, if the source hypothesis performs poorly on the training set at hand, we expect β∗ to be as small as possible to reduce the importance of M_S; in a sense, we tend to go back to the classical case where M_S = 0. Second, if the source hypothesis is complex and performs well, we expect β∗ to be rather small to reduce the complexity of the hypothesis while keeping a good performance on the training set. Third, if the source hypothesis is simple and performs well, we expect β∗ to be closer to one since M_S is already a good choice.

Learning β∗ Problem 5 is highly non-differentiable² and non-convex. However, it remains simple in the sense that we have only one parameter to assess, and we use a classical subgradient descent to solve it. Even though it is not convex, our empirical study shows no need to perform many restarts to obtain a good solution: we always found almost the same solution. As a consequence, we applied only one optimization procedure in our experiments.
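As an illustration, the scalar search over β can be sketched as follows. This is our own minimal sketch, not the authors' implementation: it assumes the hinge-type pair loss of the paper, takes C(M) as a trade-off of the form √(L_T(M)/λ) + ‖M‖_F, and uses a coarse grid search in place of the subgradient descent mentioned above; all function names are hypothetical.

```python
import numpy as np

def pair_loss(M, x, xp, y, yp, g_pos, g_neg):
    """Hinge-type pair loss [yy'((x - x')^T M (x - x') - gamma_{yy'})]_+ ."""
    d = x - xp
    s = 1.0 if y == yp else -1.0            # yy' for labels in {-1, +1}
    gamma = g_pos if s > 0 else g_neg       # positive / negative margin
    return max(0.0, s * (d @ M @ d - gamma))

def empirical_risk(M, X, y, g_pos, g_neg):
    """L_T(M) = (1/n^2) * sum of the pair loss over all ordered pairs."""
    n = len(X)
    return sum(pair_loss(M, X[i], X[j], y[i], y[j], g_pos, g_neg)
               for i in range(n) for j in range(n)) / n ** 2

def C(M, X, y, lam, g_pos, g_neg):
    """Bound term: trade-off between performance and complexity of M."""
    return (np.sqrt(empirical_risk(M, X, y, g_pos, g_neg) / lam)
            + np.linalg.norm(M, 'fro'))

def best_beta(M_S, X, y, lam, g_pos, g_neg, grid=np.linspace(0.0, 2.0, 201)):
    """beta* = argmin_beta C(beta * M_S), found by a coarse scalar search."""
    return min(grid, key=lambda b: C(b * M_S, X, y, lam, g_pos, g_neg))
```

A one-dimensional search is enough here because β is the only free parameter once M_S is fixed.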

In this section we presented a new framework for metric learning where one can use a source hypothesis to add some side information during the learning process. We have shown that our approach is consistent, with a convergence rate in O(1/√n). Furthermore, given a specific loss, we have shown that the use of a weighting parameter to control the importance of the source metric is theoretically founded. In the next part we empirically demonstrate that we can obtain competitive results both in a classical metric learning setting and in a domain adaptation setting.

5. Experiments

We propose an empirical study along two directions depending on the choice of the source metric. First, using some well-known distances as a source metric, we show that our framework performs well on classical supervised metric learning tasks of the UCI database and we empirically demonstrate the interest of learning the β parameter.

² To avoid this problem, we can use the classical relaxation with slack variables.



Baselines: 1-NN, ITML, LMNN. Our approach: IDENTITY, IDENTITY-B1, MAHALANOBIS, MAHALANOBIS-B1.

Dataset   1-NN           ITML           LMNN           IDENTITY       IDENTITY-B1    MAHALANOBIS    MAHALANOBIS-B1
Breast    95.31 ± 1.11   95.40 ± 1.37   95.60 ± 0.92   96.06 ± 0.77   95.75 ± 0.87   95.71 ± 0.84   94.76 ± 1.38
Pima      67.92 ± 1.95   68.13 ± 1.86   67.90 ± 2.05   67.87 ± 1.57   67.54 ± 1.99   68.37 ± 2.00   66.31 ± 2.37
Scale     78.73 ± 1.69   87.31 ± 2.35   86.20 ± 2.83   80.98 ± 1.51   80.82 ± 1.27   81.35 ± 1.17   80.88 ± 1.43
Wine      93.40 ± 2.70   93.82 ± 2.63   93.47 ± 1.80   95.42 ± 1.71   95.07 ± 1.68   94.31 ± 2.01   80.56 ± 5.75

Table 1. Results of the experiments conducted on the UCI datasets. Each value corresponds to the mean and standard deviation over 10 runs. For each dataset we highlight the best result using a bold font. Approaches with the suffix -B1 do not learn β; it is fixed to 1.

Second, we apply our framework to a semi-supervised Domain Adaptation task. We show that, using only source information through a learned metric, our method is able to compete with state-of-the-art algorithms.

Setup In all our experiments we use limited training datasets, making it difficult to apply any kind of cross-validation to set the parameters. Thus we propose to fix them as follows. First, the positive and negative margins are respectively set to the 5th and 95th percentiles of the possible pairwise distances over the training set, computed with the source metric, as proposed in (Davis et al., 2007). Next we set λ such that the two terms of Eq. 5 are equal, i.e. we balance the importance of the complexity and of the performance with respect to the source metric. The β parameter is then learned by solving Problem 5. In all the experiments we plug our metric into a 1-nearest neighbour classifier to classify the examples of the test set. Furthermore, the significance of the results is assessed with a paired samples t-test, considering that an approach is significantly better when the p-value is lower than 0.05.
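The margin choice described above can be sketched as follows, following the percentile heuristic of (Davis et al., 2007); the helper name is our own and the sketch assumes squared distances of the form (x − x')ᵀ M_S (x − x').

```python
import numpy as np

def percentile_margins(X, M_source, low=5, high=95):
    """Positive/negative margins set to the 5th and 95th percentiles of the
    pairwise distances (x - x')^T M_S (x - x') under the source metric."""
    n = len(X)
    dists = [(X[i] - X[j]) @ M_source @ (X[i] - X[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.percentile(dists, low), np.percentile(dists, high)
```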

5.1. Classical Supervised Metric Learning

First we start by conducting experiments on several UCI datasets (Lichman, 2013), namely breast, pima, scale and wine. We propose to consider three source metrics: (i) Zero: no source hypothesis, (ii) Identity: Euclidean distance, (iii) Mahalanobis: inverse of the variance-covariance matrix computed on the training set.

For the last two hypotheses we propose two experiments, one where we set β = 1 and one where we learn β by solving Problem 5. The goal of this experiment is to show the interest of automatically setting β. We consider a 1-nearest neighbour (1-NN) classifier using the Euclidean distance as the baseline and also report the results of two well-known metric learning algorithms, namely ITML (Davis et al., 2007) and LMNN (Weinberger et al., 2005). The results averaged over 10 runs are reported in Table 1. For each run we randomly draw a training set containing 20% of the data available for each class and we test the metric on the remaining 80% of the data.
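The evaluation protocol, a 1-NN classifier under a learned metric M, can be sketched as follows (our own minimal implementation, with a hypothetical helper name):

```python
import numpy as np

def metric_1nn_predict(M, X_train, y_train, X_test):
    """1-NN prediction under the squared distance d_M(x, x') = (x-x')^T M (x-x')."""
    preds = []
    for x in X_test:
        diffs = X_train - x                              # (n, d) differences
        d = np.einsum('nd,de,ne->n', diffs, M, diffs)    # d_M to every training point
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)
```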

These experiments highlight the interest of learning the β parameter. When we consider the performance of our approach with and without learning β, we mainly notice the following facts. First, learning β always leads to an improvement on all the datasets, and the final result is better than the 1-NN classifier. Second, learning β when considering the identity matrix as the source metric seems to be of limited interest, as the differences in accuracy are only significant for the wine dataset. This can be justified by the fact that, in this case, it only consists of a rescaling of the diagonal of the matrix and does not change much the behaviour of the distance. Finally, learning β when considering the variance-covariance matrix as the source metric leads to a significant improvement of the performance of the metric, except on the breast dataset. This is particularly true for the wine dataset, with a gain of nearly 14% in accuracy. It can be explained by the fact that, for this dataset, we are learning with fewer than 40 examples. Thus the original Mahalanobis distance does not carry as much information as in the other datasets and is of a lower quality. Learning β allows us to compensate for this drawback and to obtain results which are even better than ITML or LMNN.

5.2. Metric Learning for Semi-supervised Domain Adaptation

In this section we consider a Semi-supervised Domain Adaptation (DA) task with the Office-Caltech dataset. This dataset consists of four domains: Amazon (A), Caltech (C), DSLR (D) and Webcam (W), for which we consider 10 classes. This leads to 12 different adaptation problems when we alternatively take each domain as the source or the target dataset. In these experiments we use the same splits as the ones considered in (Hoffman et al., 2013), since they are freely available from the authors' website, and we follow their experimental setup. The results are averaged over 20 runs; for each run, 8 labelled source examples (20 if the source is Amazon) and 3 labelled target examples are selected. The data is normalized thanks to the z-score and the dimensionality is reduced to 20 thanks to a simple PCA. The results are presented in Table 2, where we compare the performance of our algorithm to 6 baselines: (i) 1-NNS: a 1-NN using the source examples, (ii) 1-NNT: a 1-NN using the target examples, (iii) LMNNT: a 1-NN on the target examples using the metric learned by LMNN on the source examples, (iv) ITMLT: a 1-NN on the target examples using the metric learned by ITML on the source examples, (v) MMDT: a DA method (Hoffman et al., 2013), (vi) GFK:


Baselines: 1-NNS, 1-NNT, LMNNT, ITMLT, MMDT, GFK. Our approach: MAHALANOBIS, ITML, LMNN.

Task     1-NNS          1-NNT          LMNNT          ITMLT           MMDT           GFK            MAHALANOBIS    ITML           LMNN
A → C    35.95 ± 1.30   31.92 ± 3.24   32.42 ± 3.03   32.56 ± 4.17    39.76 ± 2.25   37.81 ± 1.85   32.65 ± 3.76   32.93 ± 4.60   34.66 ± 3.66
A → D    33.58 ± 4.37   53.31 ± 4.31   49.96 ± 3.53   44.33 ± 8.18    54.25 ± 4.32   51.54 ± 3.55   54.69 ± 3.96   51.54 ± 4.03   54.72 ± 5.00
A → W    33.68 ± 3.60   66.25 ± 3.87   62.62 ± 4.49   58.17 ± 10.63   64.91 ± 5.71   59.36 ± 4.30   67.11 ± 5.11   64.09 ± 5.20   67.62 ± 5.18
C → A    37.37 ± 2.95   47.28 ± 4.15   42.97 ± 3.76   45.16 ± 7.60    51.05 ± 3.38   46.36 ± 2.94   50.15 ± 4.87   49.89 ± 5.25   50.36 ± 4.67
C → D    31.89 ± 5.77   54.17 ± 4.76   46.02 ± 6.54   48.07 ± 8.98    52.80 ± 4.84   58.07 ± 3.90   56.77 ± 4.63   53.78 ± 7.23   57.44 ± 4.48
C → W    28.60 ± 6.13   65.06 ± 6.27   55.79 ± 5.09   59.21 ± 9.71    62.75 ± 5.19   63.26 ± 5.89   64.64 ± 6.44   64.00 ± 6.08   65.11 ± 5.25
D → A    33.59 ± 1.77   47.81 ± 3.56   40.57 ± 3.79   45.06 ± 6.78    50.39 ± 3.40   40.77 ± 2.55   49.48 ± 4.41   49.11 ± 4.09   49.67 ± 4.00
D → C    31.16 ± 1.19   32.22 ± 2.98   27.96 ± 3.03   29.93 ± 4.84    35.70 ± 3.25   30.64 ± 1.98   32.90 ± 3.14   32.99 ± 3.58   33.84 ± 2.99
D → W    76.92 ± 2.18   66.19 ± 4.60   65.36 ± 3.82   66.74 ± 7.16    74.43 ± 3.10   74.98 ± 2.89   65.57 ± 4.52   66.38 ± 6.04   69.72 ± 3.78
W → A    32.19 ± 3.04   48.25 ± 3.52   41.69 ± 3.71   45.11 ± 5.72    50.56 ± 3.66   43.26 ± 2.34   50.80 ± 3.63   50.16 ± 4.32   50.92 ± 4.00
W → C    27.67 ± 2.58   30.74 ± 3.92   28.60 ± 3.41   28.99 ± 4.31    34.86 ± 3.62   29.95 ± 3.05   31.54 ± 3.60   31.40 ± 4.29   32.64 ± 3.52
W → D    64.61 ± 4.30   54.84 ± 5.17   56.89 ± 5.06   57.76 ± 7.03    62.52 ± 4.40   71.93 ± 4.07   57.17 ± 6.50   56.85 ± 5.51   61.14 ± 5.78
Mean     38.93 ± 3.26   49.84 ± 4.20   45.90 ± 4.11   46.76 ± 7.09    52.83 ± 3.93   50.66 ± 3.28   51.12 ± 4.55   50.26 ± 5.02   52.32 ± 4.36

Table 2. Metric Learning for Semi-Supervised Domain Adaptation. For the sake of readability we designate the considered domains by their initials; S → T stands for adaptation from the source domain S to the target domain T. Each time we consider the mean and standard deviation over 20 runs. For each task, the best result is highlighted with a bold font.

another DA approach (Gong et al., 2012).

The last two methods need the source sample, while in our case we only use a source metric learned from the source instances. For our biased regularization framework we consider 3 possible metrics learned on the source examples, namely (i) Mahalanobis, (ii) ITML and (iii) LMNN.
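The preprocessing used in this setup (z-score normalisation followed by PCA to 20 dimensions) can be sketched as follows; this is a minimal SVD-based version with a hypothetical helper name, not the authors' code.

```python
import numpy as np

def zscore_pca(X, dim=20):
    """Z-score each feature, then project onto the top `dim` principal components."""
    Xz = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # epsilon avoids division by 0
    _, _, Vt = np.linalg.svd(Xz, full_matrices=False)    # rows of Vt = principal axes
    return Xz @ Vt[:dim].T
```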

These results show that metric hypothesis transfer learning can perform well in a Semi-supervised Domain Adaptation setting. Indeed, we perform better than directly plugging the metrics learned by LMNN and ITML into a 1-nearest neighbour classifier. Moreover, we obtain accuracies which are competitive with state-of-the-art approaches like MMDT or GFK while using less information. If we compare our approach using LMNN as the source metric with MMDT, we note that MMDT is significantly better than our approach on 4 out of 12 tasks, while we are significantly better on 3, and 5 end in a draw. Hence we can conclude that our method presents a level of performance similar to MMDT. Similarly, if we compare our approach using LMNN as the source metric with GFK, we obtain that GFK is significantly better than our approach on 3 tasks, we are significantly better on 7, and 2 end in a draw. Hence, we can conclude that our approach performs better than GFK.

If we compare the performances of both ITML and LMNN as metrics used directly in a nearest neighbour classifier, one can intuitively expect ITML to be a better source hypothesis than LMNN. However, in practice, using the metric learned by LMNN as the source hypothesis yields better results. This suggests that using a learned source model that reasonably overfits the source sample can be of potential interest in a transfer learning context. Indeed, LMNN does not use a regularization term in its formulation and it is well known that LMNN is prone to overfitting. Since the parameter β penalizes the source metric w.r.t. its complexity, it may limit the impact of the source metric to what is needed for the transfer. Nevertheless, this point deserves further investigation.

6. Conclusion

In this paper we presented a new theoretical analysis of metric hypothesis transfer learning. This framework takes into account a source hypothesis to help learning by means of a biased regularization. This biased regularization can be interpreted in two ways: (i) when the source metric is an a priori known metric such as the identity matrix, the objective is to infer a new metric that performs better than the source metric; (ii) when the source metric has been learned from another domain, the formulation allows one to transfer the knowledge of the source metric to the new domain. This last interpretation refers to a transfer learning setting where the learner does not have access to source examples and can only make use of the source model in the presence of few labelled data.

Our analysis has shown that this framework is theoretically well founded and that a good source hypothesis can facilitate fast generalization in O(1/n). Moreover, we have provided a consistency analysis from which we have derived a generalization bound able to consider both the performance and the complexity of the source hypothesis. This has led to the use of a weighted source hypothesis to optimize the bound in a theoretically sound way.

As stated in (Kuzborskij & Orabona, 2014) in another context, our results stress the importance of choosing a good source hypothesis. However, choosing the best source metric from few labelled data is a difficult problem of crucial importance. One perspective could be to consider notions of reverse validation as used in some transfer learning/domain adaptation tasks (Bruzzone & Marconcini, 2010; Zhong et al., 2010). Another perspective would be to extend our framework to other settings and other kinds of regularizers.


References

Bellet, Aurelien and Habrard, Amaury. Robustness and generalization for metric learning. Neurocomputing, 151(1):259–267, 2015.

Bellet, Aurelien, Habrard, Amaury, and Sebban, Marc. Similarity learning for provably accurate sparse linear classification. In Proc. of the 29th International Conference on Machine Learning (ICML 2012), 2012.

Bellet, Aurelien, Habrard, Amaury, and Sebban, Marc. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.

Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.

Bohne, Julien, Ying, Yiming, Gentric, Stephane, and Pontil, Massimiliano. Large margin local metric learning. In Proc. of the 13th European Conference on Computer Vision (ECCV 2014), Part II, pp. 679–694, 2014.

Bousquet, Olivier and Elisseeff, Andre. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

Bruzzone, Lorenzo and Marconcini, Mattia. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):770–787, 2010.

Cao, Qiong, Guo, Zheng-Chu, and Ying, Yiming. Generalization bounds for metric and similarity learning. CoRR, abs/1207.5437, 2012.

Cao, Qiong, Ying, Yiming, and Li, Peng. Similarity metric learning for face recognition. In Proc. of the IEEE International Conference on Computer Vision (ICCV), pp. 2408–2415, 2013.

Davis, Jason V., Kulis, Brian, Jain, Prateek, Sra, Suvrit, and Dhillon, Inderjit S. Information-theoretic metric learning. In Proc. of the 24th International Conference on Machine Learning (ICML 2007), pp. 209–216, 2007.

Gong, Boqing, Shi, Yuan, Sha, Fei, and Grauman, Kristen. Geodesic flow kernel for unsupervised domain adaptation. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), pp. 2066–2073, 2012.

Guo, Zheng-Chu and Ying, Yiming. Guaranteed classification via regularized similarity learning. Neural Computation, 26(3):497–522, 2014. URL http://dx.doi.org/10.1162/NECO_a_00556.

Hoffman, Judy, Rodner, Erik, Donahue, Jeff, Saenko, Kate, and Darrell, Trevor. Efficient learning of domain-invariant image representations. CoRR, abs/1301.3224, 2013.

Jin, Rong, Wang, Shijun, and Zhou, Yang. Regularized distance metric learning: Theory and algorithm. In Advances in Neural Information Processing Systems 22 (NIPS 2009), pp. 862–870, 2009.

Kienzle, Wolf and Chellapilla, Kumar. Personalized handwriting recognition via biased regularization. In Proc. of the 23rd International Conference on Machine Learning (ICML 2006), pp. 457–464, 2006.

Kulis, Brian. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.

Kuzborskij, Ilja and Orabona, Francesco. Stability and hypothesis transfer learning. In Proc. of the 30th International Conference on Machine Learning (ICML 2013), pp. 942–950, 2013.

Kuzborskij, Ilja and Orabona, Francesco. Learning by transferring from auxiliary hypotheses. CoRR, abs/1412.1619, 2014.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Parameswaran, Shibin and Weinberger, Kilian Q. Large margin multi-task metric learning. In Advances in Neural Information Processing Systems 23 (NIPS 2010), pp. 1867–1875, 2010.

Perrot, Michael, Habrard, Amaury, Muselet, Damien, and Sebban, Marc. Modeling perceptual color differences by local metric learning. In Proc. of the 13th European Conference on Computer Vision (ECCV 2014), Part V, pp. 96–111, 2014.

Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms, chapter Regularization and Stability, pp. 137–149. Cambridge University Press, 2014.

Weinberger, Kilian Q., Blitzer, John, and Saul, Lawrence K. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems 18 (NIPS 2005), pp. 1473–1480, 2005.

Xing, Eric P., Ng, Andrew Y., Jordan, Michael I., and Russell, Stuart J. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 505–512, 2002.

Zha, Zheng-Jun, Mei, Tao, Wang, Meng, Wang, Zengfu, and Hua, Xian-Sheng. Robust distance metric learning with auxiliary knowledge. In Proc. of the 21st International Joint Conference on Artificial Intelligence (IJCAI 2009), pp. 1327–1332, 2009.

Zhong, ErHeng, Fan, Wei, Yang, Qiang, Verscheure, Olivier, and Ren, Jiangtao. Cross validation framework to choose amongst models and datasets for transfer learning. In Proc. of ECML/PKDD, volume 6323 of LNCS, pp. 547–562. Springer, 2010.


Supplementary Material

Michael Perrot and Amaury Habrard

1 Overview

This supplementary material is organised into three parts. In the first two parts we respectively state the proofs of the on-average and uniform stability analyses. In the last part, we show that the specific loss presented in the paper is k-lipschitz.

For the sake of readability we start by recalling our setting. Let T be a training set drawn from a distribution D_T over X × Y. We consider the following framework for biased regularization metric learning:

M_∗ = arg min_{M ⪰ 0} L_T(M) + λ‖M − M_S‖²_F    (1)

where L_T(M) = (1/n²) Σ_{z,z'∈T} l(M, z, z') stands for the empirical risk of hypothesis M. Similarly, we denote the true risk by L_{D_T}(M) = E_{z,z'∼D_T} l(M, z, z'). We only consider convex, k-lipschitz and (σ,m)-admissible losses.
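A minimal projected subgradient solver for Problem (1) could look as follows. This is our own sketch, not the authors' implementation: the step size, iteration count, and the use of eigenvalue clipping for the PSD constraint are our own choices, and the hinge-type pair loss of the paper is assumed.

```python
import numpy as np

def project_psd(M):
    """Project a matrix onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def fit_biased_metric(X, y, M_S, lam, g_pos, g_neg, lr=0.05, iters=100):
    """Subgradient descent on (1/n^2) sum [yy'((x-x')^T M (x-x') - gamma)]_+
    + lam * ||M - M_S||_F^2, with a PSD projection after each step."""
    n = len(X)
    M = project_psd(M_S.copy())
    for _ in range(iters):
        G = 2.0 * lam * (M - M_S)                # gradient of the biased regularizer
        for i in range(n):
            for j in range(n):
                d = X[i] - X[j]
                s = 1.0 if y[i] == y[j] else -1.0
                gamma = g_pos if s > 0 else g_neg
                if s * (d @ M @ d - gamma) > 0:  # hinge active: subgradient s * d d^T
                    G += s * np.outer(d, d) / n ** 2
        M = project_psd(M - lr * G)
    return M
```

Note that when M_S = 0 the solver reduces to the classical (unbiased) regularized formulation.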

2 On-average-replace-two-stability analysis

In this part we show that our algorithm is on-average-replace-two-stable. For the sake of completeness, we also recall the proof of the bound already presented in the paper.

First we show in the following lemma that our algorithm is strongly convex. Before proving this result, we recall thedefinition of strong convexity.

Definition 1. A function f is λ-strongly convex if for all w, u, and α ∈ [0, 1] we have:

f(αw + (1 − α)u) ≤ αf(w) + (1 − α)f(u) − (λ/2) α(1 − α)‖w − u‖².

We can now state the lemma.

Lemma 1. The algorithm presented in Eq.1 is 2λ-strongly convex.

Proof. First we show that the regularization term λ‖M − M_S‖²_F is 2λ-strongly convex in M:

λ‖αM + (1 − α)M' − M_S‖²_F
= λ‖α(M − M_S) + (1 − α)(M' − M_S)‖²_F
≤ λα‖M − M_S‖²_F + λ(1 − α)‖M' − M_S‖²_F − λα(1 − α)‖(M − M_S) − (M' − M_S)‖²_F    (2)
= λα‖M − M_S‖²_F + λ(1 − α)‖M' − M_S‖²_F − λα(1 − α)‖M − M'‖²_F.

Eq. 2 comes from the strong convexity of the squared Frobenius norm.

The regularization term is 2λ-strongly convex and L_T(M) is convex since it is a sum of convex functions. Thus our algorithm is 2λ-strongly convex because it is the sum of a 2λ-strongly convex function and a convex function.
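The 2λ-strong convexity of the regularizer can be checked numerically; for this quadratic, the strong-convexity inequality with parameter 2λ in fact holds with equality. The following self-contained check is our own construction.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.7
M_S = rng.normal(size=(4, 4))
f = lambda M: lam * np.linalg.norm(M - M_S, 'fro') ** 2

for _ in range(200):
    M, Mp = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    a = rng.uniform()
    lhs = f(a * M + (1 - a) * Mp)
    # 2*lam-strong convexity:
    # f(aM + (1-a)M') <= a f(M) + (1-a) f(M') - lam * a(1-a) * ||M - M'||_F^2
    rhs = (a * f(M) + (1 - a) * f(Mp)
           - lam * a * (1 - a) * np.linalg.norm(M - Mp, 'fro') ** 2)
    assert lhs <= rhs + 1e-9
```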

We can now show the on-average-replace-two-stability of our algorithm.


Theorem 1 (On-average-replace-two-stability). Given a training sample T of n examples drawn i.i.d. from D_T, our algorithm is on-average-replace-two-stable with ε(n) = 8k²/(λn).

Proof. Let M_∗, respectively M^{ij}_∗, be the optimal solution when learning with the training set T, respectively T^{ij}. Let z_k, z^i_k, z^{ij}_k respectively be the kth examples of the training sets T, T^i, T^{ij}. We have:

L_T(M^{ij}_∗) + λ‖M^{ij}_∗ − M_S‖²_F − (L_T(M_∗) + λ‖M_∗ − M_S‖²_F)
= L_{T^i}(M^{ij}_∗) + λ‖M^{ij}_∗ − M_S‖²_F − (L_{T^i}(M_∗) + λ‖M_∗ − M_S‖²_F)
  + (1/n²) Σ_k [l(M^{ij}_∗, z_k, z_i) − l(M_∗, z_k, z_i)] + (1/n²) Σ_l [l(M^{ij}_∗, z_i, z_l) − l(M_∗, z_i, z_l)]
  + (1/n²) Σ_k [l(M_∗, z^i_k, z^i_i) − l(M^{ij}_∗, z^i_k, z^i_i)] + (1/n²) Σ_l [l(M_∗, z^i_i, z^i_l) − l(M^{ij}_∗, z^i_i, z^i_l)]    (3)
= L_{T^{ij}}(M^{ij}_∗) + λ‖M^{ij}_∗ − M_S‖²_F − (L_{T^{ij}}(M_∗) + λ‖M_∗ − M_S‖²_F)
  + (1/n²) Σ_k [l(M^{ij}_∗, z_k, z_i) − l(M_∗, z_k, z_i)] + (1/n²) Σ_l [l(M^{ij}_∗, z_i, z_l) − l(M_∗, z_i, z_l)]
  + (1/n²) Σ_k [l(M_∗, z^i_k, z^i_i) − l(M^{ij}_∗, z^i_k, z^i_i)] + (1/n²) Σ_l [l(M_∗, z^i_i, z^i_l) − l(M^{ij}_∗, z^i_i, z^i_l)]
  + (1/n²) Σ_k [l(M^{ij}_∗, z^i_k, z^i_j) − l(M_∗, z^i_k, z^i_j)] + (1/n²) Σ_l [l(M^{ij}_∗, z^i_j, z^i_l) − l(M_∗, z^i_j, z^i_l)]
  + (1/n²) Σ_k [l(M_∗, z^{ij}_k, z^{ij}_j) − l(M^{ij}_∗, z^{ij}_k, z^{ij}_j)] + (1/n²) Σ_l [l(M_∗, z^{ij}_j, z^{ij}_l) − l(M^{ij}_∗, z^{ij}_j, z^{ij}_l)]    (4)
≤ L_{T^{ij}}(M^{ij}_∗) + λ‖M^{ij}_∗ − M_S‖²_F − (L_{T^{ij}}(M_∗) + λ‖M_∗ − M_S‖²_F) + 8k‖M^{ij}_∗ − M_∗‖_F / n    (5)
≤ 8k‖M^{ij}_∗ − M_∗‖_F / n    (6)

Equalities (3) and (4) are obtained by successively adding and removing similar terms. Inequality (5) is due to the k-lipschitz property of the loss: each of the eight sums contains at most n terms, each bounded by (1/n²) k‖M^{ij}_∗ − M_∗‖_F. Inequality (6) is obtained by noticing that M^{ij}_∗ is the minimizer of our algorithm when learning with the training set T^{ij}.

Furthermore, from the 2λ-strong convexity of our algorithm, proved in Lemma 1, we have:

L_T(M^{ij}_∗) + λ‖M^{ij}_∗ − M_S‖²_F − (L_T(M_∗) + λ‖M_∗ − M_S‖²_F) ≥ λ‖M^{ij}_∗ − M_∗‖²_F.

Thus we obtain:

λ‖M^{ij}_∗ − M_∗‖²_F ≤ 8k‖M^{ij}_∗ − M_∗‖_F / n
⇒ ‖M^{ij}_∗ − M_∗‖_F ≤ 8k/(λn)
⇒ |l(M^{ij}_∗, z_i, z_j) − l(M_∗, z_i, z_j)| ≤ k‖M^{ij}_∗ − M_∗‖_F ≤ 8k²/(λn).    (7)

The last inequality is obtained thanks to the k-lipschitz property of the loss. Taking the expectation of both sides gives the theorem.

Using the on-average-replace-two-stability property of our algorithm, we derive our first bound.


Theorem 2 (On average bound). For any convex, k-lipschitz loss, we have:

E_T[L_{D_T}(M_∗)] ≤ L_{D_T}(M_S) + 8k²/(λn)

where the expected value is taken over training sets of size n.

Proof. We have:

E_T[L_{D_T}(M_∗)] = E_T[L_T(M_∗)] + E_T[L_{D_T}(M_∗) − L_T(M_∗)]
≤ E_T[L_T(M_∗)] + 8k²/(λn)    (8)
≤ E_T[L_T(M_S)] + 8k²/(λn).    (9)

Inequality (8) is obtained by noting that from Th. 1 we have E_T[L_{D_T}(M_∗) − L_T(M_∗)] ≤ 8k²/(λn). Inequality (9) comes from the optimality of M_∗ for our regularized objective, which gives L_T(M_∗) ≤ L_T(M_∗) + λ‖M_∗ − M_S‖²_F ≤ L_T(M_S) + λ‖M_S − M_S‖²_F = L_T(M_S). Noting that E_T[L_T(M_S)] = L_{D_T}(M_S) gives the theorem.

3 Uniform stability analysis

In this second part, we show that our algorithm is uniformly stable before proving the generalization bound presented in the paper.

Theorem 3 (Uniform stability). Given a training sample T of n examples drawn i.i.d. from D_T, our algorithm has a uniform stability in K/n with K = 4k²/λ.

Proof. Let ΔM = M − M^i, where M is the optimal solution when learning with the set T and M^i is the optimal solution when learning with the set T^i. The empirical risk is convex as a sum of convex functions, thus for any t ∈ [0, 1]:

L_{T^i}(M − tΔM) − L_{T^i}(M) ≤ t(L_{T^i}(M^i) − L_{T^i}(M))
L_{T^i}(M^i + tΔM) − L_{T^i}(M^i) ≤ t(L_{T^i}(M) − L_{T^i}(M^i)).

Summing up the two inequalities gives:

L_{T^i}(M − tΔM) − L_{T^i}(M) + L_{T^i}(M^i + tΔM) − L_{T^i}(M^i) ≤ 0.    (10)

Since M and M^i are the minimizers of our regularized objective on T and T^i respectively, we also have:

L_T(M) + λ‖M − M_S‖²_F − L_T(M − tΔM) − λ‖M − tΔM − M_S‖²_F
+ L_{T^i}(M^i) + λ‖M^i − M_S‖²_F − L_{T^i}(M^i + tΔM) − λ‖M^i + tΔM − M_S‖²_F ≤ 0.    (11)

Summing inequalities (10) and (11) gives:

L_T(M) − L_{T^i}(M) + L_{T^i}(M − tΔM) − L_T(M − tΔM)
+ λ‖M − M_S‖²_F − λ‖M − tΔM − M_S‖²_F + λ‖M^i − M_S‖²_F − λ‖M^i + tΔM − M_S‖²_F ≤ 0.    (12)

Let B = λ‖M − M_S‖²_F − λ‖M − tΔM − M_S‖²_F + λ‖M^i − M_S‖²_F − λ‖M^i + tΔM − M_S‖²_F. We have:

B ≤ |−L_T(M) + L_{T^i}(M) − L_{T^i}(M − tΔM) + L_T(M − tΔM)|
= (1/n²) |Σ_j Σ_k [l(M − tΔM, z_j, z_k) − l(M − tΔM, z^i_j, z^i_k) + l(M, z^i_j, z^i_k) − l(M, z_j, z_k)]|.

Since T and T^i differ only by their ith example, only the terms involving index i remain:

B ≤ (1/n²) |Σ_j [l(M − tΔM, z_j, z_i) − l(M − tΔM, z^i_j, z^i_i) + l(M, z^i_j, z^i_i) − l(M, z_j, z_i)]
  + Σ_k [l(M − tΔM, z_i, z_k) − l(M − tΔM, z^i_i, z^i_k) + l(M, z^i_i, z^i_k) − l(M, z_i, z_k)]|
≤ (1/n²) (Σ_j |l(M − tΔM, z_j, z_i) − l(M, z_j, z_i)| + Σ_j |l(M, z^i_j, z^i_i) − l(M − tΔM, z^i_j, z^i_i)|
  + Σ_k |l(M − tΔM, z_i, z_k) − l(M, z_i, z_k)| + Σ_k |l(M, z^i_i, z^i_k) − l(M − tΔM, z^i_i, z^i_k)|)
≤ (nk/n²) (‖tΔM‖_F + ‖tΔM‖_F + ‖tΔM‖_F + ‖tΔM‖_F)    (k-lipschitz property of the loss)
= (4kt/n) ‖ΔM‖_F.

Furthermore, setting t = 1/2, we can compute B exactly. Since M − (1/2)ΔM = M^i + (1/2)ΔM = (1/2)(M + M^i), we have, entry-wise:

B = λ‖M − M_S‖²_F − λ‖M − (1/2)ΔM − M_S‖²_F + λ‖M^i − M_S‖²_F − λ‖M^i + (1/2)ΔM − M_S‖²_F
= λ Σ_{k,l} [(M_{kl} − M_{S,kl})² + (M^i_{kl} − M_{S,kl})² − 2((1/2)(M_{kl} − M_{S,kl}) + (1/2)(M^i_{kl} − M_{S,kl}))²]
= λ Σ_{k,l} (1/2) [(M_{kl} − M_{S,kl})² + (M^i_{kl} − M_{S,kl})² − 2(M_{kl} − M_{S,kl})(M^i_{kl} − M_{S,kl})]
= λ Σ_{k,l} (1/2) (M_{kl} − M^i_{kl})²
= (λ/2) ‖ΔM‖²_F.

Then we obtain:

(λ/2) ‖ΔM‖²_F ≤ (4k/(2n)) ‖ΔM‖_F ⇔ ‖ΔM‖_F ≤ 4k/(λn).

Using the k-lipschitz continuity of the loss, we have:

sup_{z,z'} |l(M, z, z') − l(M^i, z, z')| ≤ k‖ΔM‖_F ≤ 4k²/(λn).

Setting K = 4k²/λ concludes the proof.

We now recall the McDiarmid inequality (McDiarmid, 1989), used to prove our main theorem.

Theorem 4 (McDiarmid inequality). Let X_1, ..., X_n be n independent random variables taking values in X and let Z = f(X_1, ..., X_n). If for each 1 ≤ i ≤ n there exists a constant c_i such that

sup_{x_1,...,x_n,x'_i ∈ X} |f(x_1, ..., x_i, ..., x_n) − f(x_1, ..., x'_i, ..., x_n)| ≤ c_i,    ∀1 ≤ i ≤ n,

then for any ε > 0, Pr[|Z − E[Z]| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1}^n c_i²).
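As a sanity check, the inequality can be verified by Monte Carlo for the mean of n Bernoulli variables, where changing one variable moves Z by at most c_i = 1/n. This illustration is our own, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 100, 20000, 0.1

# Z = f(X_1, ..., X_n) = mean of n Bernoulli(1/2) variables, so c_i = 1/n and E[Z] = 1/2.
Z = rng.integers(0, 2, size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(Z - 0.5) >= eps)

# McDiarmid bound: 2 exp(-2 eps^2 / (n * (1/n)^2)) = 2 exp(-2 n eps^2)
bound = 2 * np.exp(-2 * n * eps ** 2)
assert empirical <= bound
```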

Using Th. 3, which states the uniform stability of our algorithm, and the McDiarmid inequality, we can derive our generalization bound. For this purpose, we replace Z by R_T = L_{D_T}(M_∗) − L_T(M_∗) in Theorem 4, and we need to bound E_T[R_T] and |R_T − R_{T^i}|, which is done in the following two lemmas.

Lemma 2. For any learning method of estimation error R_T satisfying a uniform stability in K/n, we have:

E_T[R_T] ≤ 2K/n.

Proof.

E_T[R_T] = E_T[L_{D_T}(M_∗) − L_T(M_∗)]
≤ E_{T,z,z'∼D_T} |l(M_∗, z, z') − (1/n²) Σ_i Σ_j l(M_∗, z_i, z_j)|
≤ E_{T,z,z'∼D_T} |(1/n²) Σ_i Σ_j [l(M_∗, z, z') − l(M^i_∗, z_i, z_j) + l(M^i_∗, z_i, z_j) − l(M_∗, z_i, z_j)]|
≤ E_{T,z,z'∼D_T} |(1/n²) Σ_i Σ_j [l(M^{ij}_∗, z_i, z_j) − l(M^i_∗, z_i, z_j) + l(M^i_∗, z_i, z_j) − l(M_∗, z_i, z_j)]|    (13)
≤ E_{T,z,z'∼D_T} [(1/n²) Σ_i Σ_j |l(M^{ij}_∗, z_i, z_j) − l(M^i_∗, z_i, z_j)| + (1/n²) Σ_i Σ_j |l(M^i_∗, z_i, z_j) − l(M_∗, z_i, z_j)|]    (14)
≤ 2K/n.    (15)

Inequality (13) comes from the fact that T, z, z' are drawn i.i.d. from the distribution D_T, and thus we do not change the expected value by replacing one example with another. Inequality (14) is obtained by applying the triangle inequality. The lemma follows by applying the property of uniform stability twice (Th. 3).

Lemma 3. For any matrix M_∗ learned by our algorithm using n training examples, and any loss function l satisfying the (σ,m)-admissibility, we have:

|R_T − R_{T^i}| ≤ (2K + 4σ + 2m)/n.

Proof.

|R_T − R_{T^i}| = |L_{D_T}(M_∗) − L_T(M_∗) − (L_{D_T}(M^i_∗) − L_{T^i}(M^i_∗))|
= |L_{D_T}(M_∗) − L_{D_T}(M^i_∗) + L_{T^i}(M^i_∗) − L_{T^i}(M_∗) + L_{T^i}(M_∗) − L_T(M_∗)|
≤ |L_{D_T}(M_∗) − L_{D_T}(M^i_∗)| + |L_{T^i}(M^i_∗) − L_{T^i}(M_∗)| + |L_{T^i}(M_∗) − L_T(M_∗)|    (16)
≤ 2K/n + |(1/n²) Σ_j Σ_k [l(M_∗, z^i_j, z^i_k) − l(M_∗, z_j, z_k)]|    (17)
= 2K/n + |(1/n²) Σ_j [l(M_∗, z^i_j, z^i_i) − l(M_∗, z_j, z_i)] + (1/n²) Σ_k [l(M_∗, z^i_i, z^i_k) − l(M_∗, z_i, z_k)]|    (18)
≤ 2K/n + (1/n²) Σ_j |l(M_∗, z^i_j, z^i_i) − l(M_∗, z_j, z_i)| + (1/n²) Σ_k |l(M_∗, z^i_i, z^i_k) − l(M_∗, z_i, z_k)|    (19)
≤ 2K/n + 2(2σ + m)/n.    (20)

Inequalities (16) and (19) are due to the triangle inequality. (17) comes from applying the uniform stability property (Th. 3) to the first two terms. (18) comes from the fact that T and T^i only differ by their ith example. (20) comes from the (σ,m)-admissibility of the loss and the fact that |y_1y_2 − y_3y_4| ≤ 2.

We are now ready to prove our generalization bound.

Theorem 5 (Generalization bound). With probability 1 − δ, for any matrix M learned with our K-uniformly stable algorithm and for any convex, k-lipschitz and (σ,m)-admissible loss, we have:

L_{D_T}(M) ≤ L_T(M) + (4σ + 2m + c) O(1/√n) + O(1/n)

where c is a constant linked to the k-lipschitz property of the loss.

Proof. Using the McDiarmid inequality (Th. 4) and Lemma 3, we have:

Pr[|R_T − E_T[R_T]| ≥ ε] ≤ 2 exp(−2ε² / Σ_{i=1}^n ((2K + 4σ + 2m)/n)²)
= 2 exp(−2ε² / ((1/n)(2K + 4σ + 2m)²)).

Then, by setting:

δ = 2 exp(−2ε² / ((1/n)(2K + 4σ + 2m)²))

we obtain:

ε = (2K + 4σ + 2m) √(ln(2/δ) / (2n))

and thus:

Pr[|R_T − E_T[R_T]| < ε] > 1 − δ.

Then, with probability 1 − δ:

R_T < E_T[R_T] + ε
⇔ L_{D_T}(M_∗) − L_T(M_∗) < E_T[R_T] + ε
⇒ L_{D_T}(M_∗) < L_T(M_∗) + 2K/n + (2K + 4σ + 2m) √(ln(2/δ) / (2n)).

The last inequality is obtained using Lem. 2 and replacing ε by its value.

4 Specific loss

We show the k-lipschitz property of our loss.

Lemma 4 (k-lipschitz continuity). Let M and M' be two matrices and z, z' be two examples. Our loss l(M, z, z') is k-lipschitz continuous with k = max_{x,x'} ‖x − x'‖².

Proof.

|l(M, z, z') − l(M', z, z')|
= |[yy'((x − x')ᵀM(x − x') − γ_{yy'})]_+ − [yy'((x − x')ᵀM'(x − x') − γ_{yy'})]_+|
≤ |yy'((x − x')ᵀM(x − x') − γ_{yy'}) − yy'((x − x')ᵀM'(x − x') − γ_{yy'})|
= |yy'(x − x')ᵀM(x − x') − yy'(x − x')ᵀM'(x − x')|
= |(x − x')ᵀ(M − M')(x − x')|    (since |yy'| = 1)
≤ ‖x − x'‖² ‖M − M'‖_F
≤ max_{x,x'} ‖x − x'‖² ‖M − M'‖_F.

The first inequality holds because the hinge function [·]_+ is 1-lipschitz; the second-to-last step uses |dᵀAd| ≤ ‖d‖² ‖A‖_F. Setting k = max_{x,x'} ‖x − x'‖² concludes the proof.
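The bound can be checked numerically on random inputs. This sanity check is our own; s = yy' and the margin γ are drawn arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(M, d, s, gamma):
    """[s * (d^T M d - gamma)]_+ with d = x - x' and s = yy'."""
    return max(0.0, s * (d @ M @ d - gamma))

for _ in range(1000):
    d = rng.normal(size=3)                                   # d = x - x'
    M, Mp = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
    s, gamma = rng.choice([-1.0, 1.0]), rng.uniform(0.0, 1.0)
    lhs = abs(loss(M, d, s, gamma) - loss(Mp, d, s, gamma))
    # Lipschitz bound: |l(M) - l(M')| <= ||x - x'||^2 * ||M - M'||_F
    rhs = np.linalg.norm(d) ** 2 * np.linalg.norm(M - Mp, 'fro')
    assert lhs <= rhs + 1e-9
```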

References

McDiarmid, Colin. On the method of bounded differences. In Surveys in Combinatorics, pp. 148–188. Cambridge University Press, 1989.
