Transfer Learning and Optimal Transport
Ievgen Redko
UMR CNRS 5516, Saint-Étienne
Ievgen Redko SciDoLySE 1 / 79
Credits and acknowledgments
Documents used for this talk:
D. Xu, K. Saenko, I. Tsang. Tutorial on Domain Transfer Learning for Vision Applications, CVPR'12.
S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI'13.
S. Ben-David. Towards Theoretical Understanding of Domain Adaptation Learning, workshop LNIID at ECML'09.
F. Sha and B. Kingsbury. Domain Adaptation in Machine Learning and Speech Recognition, Tutorial - Interspeech 2012.
K. Grauman. Adaptation for Objects and Attributes, workshop VisDA at ICCV'13.
J. Blitzer and H. Daumé III. Domain Adaptation, Tutorial - ICML 2010.
Acknowledgments: A. Habrard, Rémi Flamary, Nicolas Courty, Devis Tuia, Tien Nam Li, Marc Sebban
Outline
Introduction
Optimal transport for domain adaptation
  - Problem formulation
  - Regularization framework for domain adaptation
  - Numerical experiments
Mapping estimation for discrete optimal transport
  - Problem formulation
  - Application to domain adaptation
  - Application to seamless copy in images
Optimal transport for target shift
  - Motivation
  - Proposed model
  - Experimental results
Optimal transport for joint distribution adaptation
Other contributions
  - Differentially private OT
Conclusions
Introduction
Artificial Intelligence
Ultimate goal: build systems that can learn by exploring the world.
- Unfortunately, this is not easy (and, for now, nearly impossible)
Goals in AI
Intermediate goal: build systems that can classify and recognize well.
Solution: use machine learning (ML) methods, which achieve near-human performance.
Issues of Traditional ML
Issues:
- Near-human performance is achieved using lots of labeled data
- Some tasks do not have that much labeled data (biology, physics, etc.)
- Some data/tasks evolve with time
- There exist too many tasks!
Solution: transfer learning
+ Use systems built for different but related applications
Transfer Learning
Definition [Pan, TL-IJCAI'13 tutorial]: Ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks.
Example
We have labeled images from a Web image corpus. Is there a Person in unlabeled images from a Video corpus?
[Figure: Web images labeled "Person" / "no Person" → unlabeled video frame: is there a Person?]
Settings
Supervised learning vs Transfer learning
Supervised learning: training and test data come from the same domain. Transfer learning: they come from different domains.
Domains are modeled as probability distributions over an instance space
Tasks are associated with a domain (classification, regression, clustering, ...)
Goal
Improve a target predictive function in the target domain using knowledge from the source domain
A Taxonomy of Transfer Learning
“A survey on Transfer Learning” [Pan and Yang, TKDE 2010]
In this tutorial
We focus on domain adaptation for classification
How can we learn, using labeled data from a source distribution, a low-error classifier for another related target distribution?
Why?
“Hot topic” - tutorials at ICML 2010, CVPR 2012, Interspeech 2012; workshops at ICCV 2013, NIPS 2013, ECML 2014
Many real-world motivating examples
A toy problem: Inter-twinning moons
[Figure: inter-twinning moons for target rotations of 10°, 20°, 30°, 40°, 50°, 70°]
Intuition and motivation: computer vision
“Can we train classifiers with Flickr photos, as they have already been collected and annotated, and hope the classifiers still work well on mobile camera images?” [Gong et al., CVPR'12]
“object classifiers optimized on benchmark dataset often exhibit significant degradation in recognition accuracy when evaluated on another one” [Gong et al., ICML'13; Torralba et al., CVPR'11; Perronnin et al., CVPR'10]
“Hot topic” - Visual domain adaptation [Tutorial CVPR’12, ICCV’13]
Problems with data representations
[Xu,Saenko,Tsang, Domain Transfer Tutorial - CVPR’12]
Hard to predict what will change in the new domain
[Xu,Saenko,Tsang, Domain Transfer Tutorial - CVPR’12]
Natural Language Processing
Part-of-Speech Tagging: adapt a tagger learned from medical papers to a journal.
Texts are represented by “words” (bag of words)
Spam detection
Adapt a classifier from a mailbox of an office worker to that of a hippie musician
Sentiment analysis
Adapt a classifier predicting the preferences for books to those of DVDs
Electronics to video games [Pan-IJCAI’13 tutorial]
| Electronics | Video games |
| --- | --- |
| (1) Compact; easy to operate; very good picture quality; looks sharp! | (2) A very good game! It is action-packed and full of excitement. I am very much hooked on this game. |
| (3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp. | (4) Very realistic shooting action and good plots. We played this and were hooked. |
| (5) It is also quite blurry in very dark settings. I will never_buy HP again. | (6) It is so boring. I am extremely unhappy and will probably never_buy UbiSoft again. |
Source-specific: compact, sharp, blurry.
Target-specific: hooked, realistic, boring.
Domain-independent: good, excited, nice, never_buy, unhappy.
Other applications
Speech recognition [Tutorial at Interspeech’12]
Medicine
Biology
Time series
Wifi localization
Why transfer learning?
Why transfer learning? Smart people talk
A bit of vocabulary
Unsupervised Transfer Learning
= No labels in source and target domains
Unsupervised DA
= Presence of source labels, no target labels
Semi-supervised DA
= Presence of source labels, few target labels, and a lot of unlabeled data
Semi-supervised learning
= No distribution shift, few labeled data and a lot of unlabeled data from the same domain
Several key questions
1. How to estimate the distribution shift?
2. What are the generalization guarantees?

   $R_{P_T}(h) \;\leq\; R_{P_S}(h) \;+\; ?$
3. When is adaptation possible?
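Question 2 is classically answered with a bound of the Ben-David et al. form, recalled here as background (this is the standard statement, not a result derived in this talk):

```latex
R_{P_T}(h) \;\leq\; R_{P_S}(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(P_S, P_T)
  \;+\; \lambda^*,
\qquad
\lambda^* \;=\; \min_{h' \in \mathcal{H}} \big( R_{P_S}(h') + R_{P_T}(h') \big)
```

The target risk is thus controlled by the source risk, a divergence between the two domains, and the error of the best joint hypothesis; when $\lambda^*$ is large, no adaptation method can help, which is precisely question 3.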
4. How to design new algorithms?
3 main classes of algorithms
1. Instance-based methods
= Correct a sample bias by reweighting source labeled data: source instances close to target instances are more important
2. Feature-based methods
= Find a common space where source and target are close
3. Adjustment/Iterative methods
= Modify the model by incorporating pseudo-labeled information
Optimal transport for domain adaptation
The following slides are courtesy of R. Flamary(https://remi.flamary.com/biblio/presvannes2016.pdf)
Problem setup
[Figure: source domain (Amazon images, with labels) vs. target domain (DSLR images, no labels); after feature extraction, the decision function learned on the source is not working on the target]
Problems
Labels only in the source domain, and classification is in the target domain.
Classifier trained on the source data performs badly in the target domain
Optimal transport for domain adaptation
Assumptions: There exists a transport T between the source and the target domain.
The transport preserves the conditional distributions:
$P_s(y \mid x^s) = P_t(y \mid T(x^s))$.
3-step strategy:
1. Estimate the optimal transport between distributions.
2. Transport the training samples onto the target distribution.
3. Learn a classifier on the transported training samples.
[Figure: class 1 / class 2 dataset; classifier learned on the source samples, optimal transport of the samples, and classification on the transported samples]
Objective function
Optimization problem
$$\min_{\gamma \in \mathcal{P}} \; \langle \gamma, C \rangle_F + \lambda\,\Omega_s(\gamma) + \eta\,\Omega_c(\gamma),$$

where

$\Omega_s(\gamma)$ is the entropic regularization [Cuturi, 2013].

$\eta \geq 0$ and $\Omega_c(\cdot)$ is a DA regularization term.

Regularization helps avoid overfitting in high dimension and encodes additional information.
Entropic regularization
[Figure: optimal coupling matrix γ]

$$\Omega_s(\gamma) = \sum_{i,j} \gamma(i,j) \log \gamma(i,j)$$
Extremely efficient optimization scheme (Sinkhorn-Knopp).
The solution is no longer sparse.
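The Sinkhorn-Knopp scheme alternates two diagonal scalings of the Gibbs kernel $e^{-C/\lambda}$. A minimal NumPy sketch (the function name and the fixed iteration count are illustrative, not from the talk):

```python
import numpy as np

def sinkhorn(a, b, C, reg, n_iter=1000):
    """Entropic OT: min <gamma, C> + reg * sum(gamma * log(gamma)),
    solved by Sinkhorn-Knopp alternating diagonal scaling."""
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # scale columns to match the target marginal
        u = a / (K @ v)               # scale rows to match the source marginal
    return u[:, None] * K * v[None, :]

# Uniform marginals over 5 source and 7 target points, random cost matrix
rng = np.random.default_rng(0)
gamma = sinkhorn(np.full(5, 0.2), np.full(7, 1 / 7), rng.random((5, 7)), reg=0.1)
```

Each iteration costs only two matrix-vector products, which is what makes the entropic solver so much cheaper than exact linear-programming OT.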
Class-based regularization [Courty et al., 2016]
[Figure: optimal coupling matrix γ]

$$\Omega_c(\gamma) = \sum_j \sum_c \big\| \gamma(\mathcal{I}_c, j) \big\|_q^p,$$

where $\mathcal{I}_c$ indexes the source samples of class $c$.
Groups the components of γ using the source labels.
Target samples receive mass only from “same class” source samples.
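As a sketch, this penalty can be evaluated directly from a coupling and the source labels; the function name and the defaults for p and q are illustrative (Courty et al. use a non-convex p < 1, which favors couplings whose columns draw mass from a single class):

```python
import numpy as np

def class_group_penalty(gamma, labels, p=0.5, q=1.0):
    """Omega_c(gamma) = sum_j sum_c || gamma(I_c, j) ||_q^p,
    where I_c are the rows (source samples) with label c."""
    total = 0.0
    for c in np.unique(labels):
        block = gamma[labels == c, :]                            # rows of class c
        norms = np.sum(np.abs(block) ** q, axis=0) ** (1.0 / q)  # ||.||_q per target j
        total += np.sum(norms ** p)
    return total

labels = np.array([0, 0, 1, 1])
# Each target column served by a single class vs. mass spread over both classes
conc = np.array([[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]])
spread = np.full((4, 2), 0.125)
```

With p < 1 the concentrated coupling is cheaper than the spread one, which is exactly the group-sparsity effect described above.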
Laplacian regularization for sample displacement
[Figure: similarity graph with $S^s_{i,j} > 0$; effect of small vs. large λ]

$$\Omega_c(\gamma) = \frac{1}{N_s^2} \sum_{i,j} S^s_{i,j} \,\big\| (\hat{x}^s_i - x^s_i) - (\hat{x}^s_j - x^s_j) \big\|^2,$$

where $\hat{x}^s_i$ is the transported position of source sample $x^s_i$.
Proposed in [Ferradans et al., 2013] for color transfer in images.
Similar samples, as defined by $S^s$, have similar displacements.
The similarity graph $S^s$ can be built using the source labels.
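A direct NumPy evaluation of this penalty, assuming the transported source positions are available (all names are illustrative):

```python
import numpy as np

def laplacian_penalty(S, X_src, X_mapped):
    """Omega_c = (1 / Ns^2) * sum_{i,j} S_ij * || d_i - d_j ||^2,
    where d_i = X_mapped[i] - X_src[i] is the displacement of sample i."""
    ns = X_src.shape[0]
    D = X_mapped - X_src                       # per-sample displacements
    diff = D[:, None, :] - D[None, :, :]       # pairwise displacement gaps
    return float((S * (diff ** 2).sum(axis=2)).sum()) / ns ** 2

rng = np.random.default_rng(0)
X = rng.random((4, 2))
S = np.ones((4, 4))                            # fully connected similarity graph
rigid = laplacian_penalty(S, X, X + np.array([1.0, -2.0]))   # identical shift
```

If every sample moves by the same vector the penalty is zero; it grows as the displacements of similar samples diverge, which is the smoothness the regularizer enforces.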
Optimization problem
$$\min_{\gamma \in \mathcal{P}} \; \langle \gamma, C \rangle_F + \lambda\,\Omega_s(\gamma) + \eta\,\Omega_c(\gamma)$$
Special cases
η = 0: Sinkhorn-Knopp [Cuturi, 2013].
λ = 0 and Laplacian regularization: large quadratic program solved with conditional gradient [Ferradans et al., 2013].
Non-convex group lasso $\ell_p$-$\ell_1$: majorization-minimization with Sinkhorn-Knopp [Courty et al., 2014].
General framework with convex regularization $\Omega_c(\gamma)$
Can we use the efficient Sinkhorn-Knopp scaling to solve the global problem?
Yes, using generalized conditional gradient [Bredies et al., 2009].
Linearization of the second regularization term but not of the entropic regularization.
Barycentric mapping
How to transport the samples using the obtained coupling matrix?
Use barycentric mapping
$$\hat{x}^s_i = \operatorname*{arg\,min}_{x} \; \sum_j \gamma^*(i,j)\, c(x, x^t_j)$$

If $c(x, x')$ is the squared Euclidean distance, then

$$\hat{X}_S = \mathrm{diag}(\gamma^* \mathbf{1})^{-1}\, \gamma^* X_T$$
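For the squared Euclidean cost the mapping is therefore just a γ-weighted average of the target samples, e.g. (a sketch with illustrative names):

```python
import numpy as np

def barycentric_map(gamma, X_t):
    """Map each source sample to the gamma-weighted mean of the
    target samples (barycentric projection, squared Euclidean cost)."""
    w = gamma.sum(axis=1, keepdims=True)   # row marginals of the coupling
    return (gamma @ X_t) / w

X_t = np.array([[0.0, 0.0], [2.0, 2.0]])
gamma = np.array([[0.5, 0.0],             # source 0 -> target 0 only
                  [0.25, 0.25]])          # source 1 -> halfway between both
mapped = barycentric_map(gamma, X_t)
```

The transported samples always lie in the convex hull of the target samples, which is why the mapped source cloud sits on top of the target cloud.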
Simulated problem with controllable complexity
[Figure: two moons for target rotations of 10°, 30°, 50°, 70°]
Two moons problem [Germain et al., 2013]
Two entangled moons with a rotation between domains.
The rotation angle allows controlling the adaptation difficulty.
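A possible generator for this benchmark (an illustrative reconstruction; the exact moon parameterization and noise level used in the original experiments may differ):

```python
import numpy as np

def two_moons(n, noise=0.05, angle_deg=0.0, seed=None):
    """Two entangled half-moons; setting angle_deg rotates the whole
    dataset, which is how the target domain is produced."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, n)
    upper = np.c_[np.cos(t), np.sin(t)]               # class 0
    lower = np.c_[1.0 - np.cos(t), 0.5 - np.sin(t)]   # class 1
    X = np.vstack([upper, lower]) + rng.normal(0.0, noise, (2 * n, 2))
    y = np.r_[np.zeros(n), np.ones(n)]
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return X @ R.T, y

Xs, ys = two_moons(150, seed=0)                # source domain
Xt, yt = two_moons(150, angle_deg=40, seed=1)  # harder target: 40° rotation
```

Sweeping `angle_deg` from 10° to 90° reproduces an increasing-difficulty protocol of this kind.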
Results on the two moons dataset
| Method | 10° | 20° | 30° | 40° | 50° | 70° | 90° |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SVM (no adapt.) | 0 | 0.104 | 0.24 | 0.312 | 0.4 | 0.764 | 0.828 |
| DASVM | 0 | 0 | 0.259 | 0.284 | 0.334 | 0.747 | 0.820 |
| PBDA | 0 | 0.094 | 0.103 | 0.225 | 0.412 | 0.626 | 0.687 |
| OT-exact | 0 | 0.028 | 0.065 | 0.109 | 0.206 | 0.394 | 0.507 |
| OT-IT | 0 | 0.007 | 0.054 | 0.102 | 0.221 | 0.398 | 0.508 |
| OT-GL | 0 | 0 | 0 | 0.013 | 0.196 | 0.378 | 0.508 |
| OT-Lap | 0 | 0 | 0.004 | 0.062 | 0.201 | 0.402 | 0.524 |
Average prediction error for rotation angles from 10° to 90°.
Clear advantage of the optimal transport techniques.
Regularization helps (a lot) up to 40°.
Visual adaptation datasets
Digit recognition: MNIST vs USPS (10 classes, d=256, 2 domains).
Face recognition: PIE dataset (68 classes, d=1024, 4 domains).
Object recognition: Caltech-Office dataset (10 classes, d=800/4096, 4 domains).
Comparison on vision datasets
| Method | Digits ACC | Digits Nb best | Faces ACC | Faces Nb best | Objects ACC | Objects Nb best |
|---|---|---|---|---|---|---|
| 1NN | 48.66 | 0 | 26.22 | 0 | 28.47 | 0 |
| PCA | 42.94 | 0 | 34.55 | 0 | 37.98 | 0 |
| GFK | 52.56 | 0 | 26.15 | 0 | 39.21 | 0 |
| TSL | 47.22 | 0 | 36.10 | 0 | 42.97 | 1 |
| JDA | 57.30 | 0 | 56.69 | 7 | 44.34 | 1 |
| OT-exact | 49.96 | 0 | 50.47 | 0 | 36.69 | 0 |
| OT-IT | 59.20 | 0 | 54.89 | 0 | 42.30 | 0 |
| OT-Lap | 61.07 | 0 | 56.10 | 3 | 43.20 | 0 |
| OT-LpLq | 64.11 | 1 | 55.45 | 0 | 46.42 | 1 |
| OT-GL | 63.90 | 1 | 55.88 | 2 | 47.70 | 9 |
OT works very well on digit and object recognition.
Good but not best on face recognition (−0.5% w.r.t. JDA).
Ievgen Redko SciDoLySE 41 / 79
![Page 56: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/56.jpg)
Next step
Limits
Scales at least quadratically with the dataset size.
What about domains with different class proportions ? [Tuia et al., 2015]
Out-of-sample extension ?
Ievgen Redko SciDoLySE 42 / 79
![Page 57: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/57.jpg)
Mapping estimation for discreteoptimal transport
Ievgen Redko SciDoLySE 43 / 79
![Page 58: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/58.jpg)
Mapping estimation for discrete optimal transport
Why estimate the mapping ?
The out-of-sample problem.
Solving the optimization problem every time the dataset changes.
Transporting a very large number of samples.
Interpretability (depending on the mapping model).
How to estimate the mapping ?
Go back to the Monge formulation ? No !
Fit the barycentric mapping but also introduce smoothness.
Ievgen Redko SciDoLySE 44 / 79
![Page 64: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/64.jpg)
Mapping estimation
Problem formulation [Perrot et al., 2016]
argmin_{T ∈ H, γ ∈ P} f(γ, T) = λ_γ ⟨γ, C⟩_F + ‖T(X_s) − n_s γ X_t‖²_F + λ_T R(T)

(the three terms are, respectively, the OT loss, the mapping data-fitting term, and the mapping regularization), where

X_s and X_t are the source and target datasets,
T(·) is applied to each element (row) of the above matrices,
n_s γ X_t is the barycentric mapping of the source samples with uniform weights,
H is the space of transformations (more details later),
R(·) is a regularization term controlling the complexity of T.
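One alternating step of this objective can be sketched as follows (a numpy-only illustration with assumed names, not the reference implementation of [Perrot et al., 2016]: a basic Sinkhorn solver stands in for the γ-step, and only the linear family with R(T) = ‖L − I‖²_F is shown for the T-step):

```python
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.1, n_iter=200):
    # Entropic OT plan between histograms a and b for cost matrix C.
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def fit_linear_map(Xs, Xt, gamma, lam=1e-2):
    # T-step: minimize ||Xs L - ns * gamma Xt||_F^2 + lam * ||L - I||_F^2,
    # i.e. map each source point onto its barycentric image under the plan.
    ns, d = Xs.shape
    target = ns * gamma @ Xt   # barycentric image of each source sample
    L = np.linalg.solve(Xs.T @ Xs + lam * np.eye(d),
                        Xs.T @ target + lam * np.eye(d))
    return L
```

The ridge term pulls L toward the identity, so with little data the estimated map stays close to "do nothing", which is the role of λ_T R(T) in the slide's objective.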
Ievgen Redko SciDoLySE 45 / 79
![Page 65: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/65.jpg)
Mapping family H
Linear transformations
H = { T : ∀x ∈ Ω, T(x) = xᵀL }.
L is a d × d real matrix.
R(T) = ‖L − I‖²_F, where I is the identity matrix.
The update is a classical linear least-squares regression.
Ievgen Redko SciDoLySE 46 / 79
![Page 66: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/66.jpg)
Mapping family H
Nonlinear transformations
H = { T : ∀x ∈ Ω, T(x) = k_{X_s}(xᵀ)L }, with k_{X_s}(xᵀ) = (k(x, x^s_1), k(x, x^s_2), …, k(x, x^s_{n_s})).
k(·, ·) is a positive definite kernel.
L is an n_s × d real matrix.
The update is a classical kernel least-squares regression.
Ievgen Redko SciDoLySE 46 / 79
![Page 67: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/67.jpg)
Illustrative example
[Figure: 2D dataset under a linear mapping (top) and a nonlinear mapping (bottom); panels show source and target samples, the barycentric displacement, the T displacement, and the out-of-sample T]
Clearly a nonlinear mapping.
The mapping model controls the barycentric mapping.
Ievgen Redko SciDoLySE 47 / 79
![Page 68: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/68.jpg)
Domain adaptation : Caltech-Office dataset
| Task | 1NN | GFK | SA | OT | L1L2 | OTE | OTLin T | OTLin γ | OTLinB T | OTLinB γ | OTKer T | OTKer γ | OTKerB T | OTKerB γ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D → W | 89.5 | 93.3 | 95.6 | 77.0 | 95.7 | 95.7 | 97.3 | 97.3 | 97.3 | 97.3 | 98.4 | 98.5 | 98.5 | 98.5 |
| D → A | 62.5 | 77.2 | 88.5 | 70.8 | 74.9 | 74.8 | 85.7 | 85.7 | 85.8 | 85.8 | 89.9 | 89.9 | 89.5 | 89.5 |
| D → C | 51.8 | 69.7 | 79.0 | 68.1 | 67.8 | 68.0 | 77.2 | 77.2 | 77.4 | 77.4 | 69.1 | 69.2 | 69.3 | 69.3 |
| W → D | 99.2 | 99.8 | 99.6 | 74.1 | 94.4 | 94.4 | 99.4 | 99.4 | 99.8 | 99.8 | 97.2 | 97.2 | 96.9 | 96.9 |
| W → A | 62.5 | 72.4 | 79.2 | 67.6 | 71.3 | 71.3 | 81.5 | 81.5 | 81.4 | 81.4 | 78.5 | 78.3 | 78.5 | 78.8 |
| W → C | 59.5 | 63.7 | 55.0 | 63.1 | 67.8 | 67.8 | 75.9 | 75.9 | 75.4 | 75.4 | 72.7 | 72.7 | 65.1 | 63.3 |
| A → D | 65.2 | 75.9 | 83.8 | 64.6 | 70.1 | 70.5 | 80.6 | 80.6 | 80.4 | 80.5 | 65.6 | 65.5 | 71.9 | 71.5 |
| A → W | 56.8 | 68.0 | 74.6 | 66.8 | 67.2 | 67.3 | 74.6 | 74.6 | 74.4 | 74.4 | 66.4 | 64.8 | 70.0 | 68.9 |
| A → C | 70.1 | 75.7 | 79.2 | 70.4 | 74.1 | 74.3 | 81.8 | 81.8 | 81.6 | 81.6 | 84.4 | 84.4 | 84.5 | 84.5 |
| C → D | 75.9 | 79.5 | 85.0 | 66.0 | 69.8 | 70.2 | 87.1 | 87.1 | 87.2 | 87.2 | 70.1 | 70.0 | 78.6 | 78.6 |
| C → W | 65.2 | 70.7 | 74.4 | 59.2 | 63.8 | 63.8 | 78.3 | 78.3 | 78.5 | 78.5 | 80.0 | 80.4 | 73.5 | 73.4 |
| C → A | 85.8 | 87.1 | 89.3 | 75.2 | 76.6 | 76.7 | 89.9 | 89.9 | 89.7 | 89.7 | 82.4 | 82.2 | 83.6 | 83.5 |
| Mean | 70.3 | 77.8 | 81.9 | 68.6 | 74.5 | 74.6 | 84.1 | 84.1 | 84.1 | 84.1 | 79.6 | 79.4 | 80.0 | 79.7 |
Clear advantage of the mapping estimation methods.
Ievgen Redko SciDoLySE 48 / 79
![Page 69: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/69.jpg)
Seamless copy with gradient adaptation
Poisson image editing with gradient adaptation
Adapt the gradients from the source to the target domain :
Δf = div T_{s→t}(v) over Ω, with f|_{∂Ω} = f_t|_{∂Ω}.    (1)

T_{s→t} : R⁶ → R⁶ is the mapping between the gradients of the source and target images in the domain.
Ievgen Redko SciDoLySE 49 / 79
![Page 74: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/74.jpg)
Optimal transport for target shift
Ievgen Redko SciDoLySE 50 / 79
![Page 75: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/75.jpg)
Different reasons to adapt
Covariate shift
adapt when only the marginal distributions of inputs change
+ most popular scenario
– conditional distributions may differ in practice
[Figure: source vs. target input distributions]
Ievgen Redko SciDoLySE 51 / 79
![Page 77: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/77.jpg)
Covariate shift and optimal transport (Courty et al., 2014)
Ievgen Redko SciDoLySE 52 / 79
![Page 78: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/78.jpg)
Different reasons to adapt
Target shift
adapt when only the marginal distributions of outputs change
+ occurs in many real-world applications (imbalanced data)
+ few contributions in the literature
– assumes the same distribution of inputs
[Figure: source vs. target label distributions]
Ievgen Redko SciDoLySE 53 / 79
![Page 80: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/80.jpg)
Different reasons to adapt
Sample-selection bias, source component shift, domain shift etc.
rarely studied (and very difficult) cases
[Figure: source vs. target distributions under sample-selection bias]
Ievgen Redko SciDoLySE 54 / 79
![Page 81: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/81.jpg)
Wish list
General multi-source setting (more than 2 source domains)
Solving covariate shift between each source-target pair
Tackling target shift
Ievgen Redko SciDoLySE 55 / 79
![Page 82: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/82.jpg)
First insight
How to tackle target shift ? Take 1
Assume that the target and source distributions are defined as

P_T = Σ_{i=1}^{C} π_i^T P_i,    P_S^π = Σ_i π_i P_i,

with P_i being the distribution of class i ∈ {1, …, C}. We want to solve Problem 1 :

π⋆ = argmin_{π ∈ Δ_C} W(P_S^π, P_T)
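A minimal sketch of Problem 1 for two classes (assumed names, not the paper's algorithm: a grid search over the simplex with an entropic Sinkhorn approximation of W in place of a proper solver):

```python
import numpy as np

def sinkhorn_cost(a, b, C, reg=0.1, n_iter=300):
    # Entropic approximation of the OT cost <gamma, C> between histograms a, b.
    C = C / C.max()                       # normalize cost for numerical stability
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    G = u[:, None] * K * v[None, :]
    return float((G * C).sum())

def estimate_proportions(Xs, ys, Xt, grid=None):
    # Binary-class sketch: search pi on the simplex minimizing W(P_S^pi, P_T),
    # reweighting source samples class-wise so class 1 carries total mass pi.
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    b = np.full(len(Xt), 1.0 / len(Xt))
    costs = []
    for pi in grid:
        a = np.where(ys == 1, pi / (ys == 1).sum(), (1 - pi) / (ys == 0).sum())
        costs.append(sinkhorn_cost(a, b, C))
    return grid[int(np.argmin(costs))]
```

When the class-conditional distributions are well separated, the transport cost is dominated by the mass forced to cross between classes, so the minimizer sits at the target proportions, which is exactly what the uniqueness result on the next slide formalizes.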
Ievgen Redko SciDoLySE 56 / 79
![Page 84: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/84.jpg)
First insight
How to tackle target shift ? Take 1
We can prove the following :
Assume that ∀i, there is no α ∈ Δ_C with α_i = 0 such that P_i = Σ_j α_j P_j. Then, for any distribution P_T, the unique solution π∗ to Problem 1 is given by π^T.
Wasserstein distance is a good candidate for proportion estimation !
Ievgen Redko SciDoLySE 57 / 79
![Page 86: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/86.jpg)
Second insight
How to tackle target shift ? Take 2
Consider a multi-source setting with a weighted source distribution
We can show that, in the case of binary classification,

(error in target domain) ≤ (overall source error) + (distance between classes) × (distance between source and target class proportions).

More formally, for class distributions P_0 and P_1,

(distance between classes) × (distance between source and target class proportions) = dist(P_0, P_1) · |π_T − Σ_{j=1}^{N} α_j π_S^j|.
Reweight source samples to match target proportions !
Ievgen Redko SciDoLySE 58 / 79
![Page 90: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/90.jpg)
Some notations
Data matrices X^(k) ∼ μ^(k) from K source domains.
Data matrix X ∼ μ from the target domain.
Source class proportions h_c^(k) = ∫ μ_c^(k), with μ^(k) = Σ_{c=1}^{C} μ_c^(k).
Linear operators D_1^(k) and D_2^(k) that transform m^(k) to h_c^(k) and back.
Ievgen Redko SciDoLySE 59 / 79
![Page 91: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/91.jpg)
Objective function
Putting it all together
h⋆ = argmin_{h ∈ Δ_C, Γ} Σ_{k=1}^{K} λ_k KL(γ^(k) | ζ^(k))

s.t. ∀k, D_1^(k) γ^(k) 1_n = h.
A constrained Wasserstein barycenter problem
h⋆ reweights source instances to match source and target distributions
Efficient optimization with Bregman projections
Ievgen Redko SciDoLySE 60 / 79
![Page 92: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/92.jpg)
Classification in the target domain
How would we obtain target labels ?
Before : barycentric mapping
+ accurate alignment of samples
– computationally costly : aligning + learning a classifier

Alternative : label propagation

L = Σ_{k=1}^{K} λ_k D_1^(k) γ^(k)

+ directly obtains target labels
+ majority vote by source domains = can be seen as boosting
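The idea can be sketched for a single source domain as follows (assumed names; the multi-source L above simply sums such terms over k with weights λ_k, and the operators D_1^(k) are omitted here):

```python
import numpy as np

def propagate_labels(gamma, ys, n_classes):
    # Push one-hot source labels through the transport plan: each target
    # sample receives the class mass transported onto it, then takes argmax.
    Ys = np.eye(n_classes)[ys]      # (ns x C) one-hot source labels
    scores = gamma.T @ Ys           # (nt x C) transported class mass
    return scores.argmax(axis=1)
```

No classifier is trained in the target domain: the plan itself carries the labels across, which is why this route is cheaper than the barycentric-mapping-plus-classifier pipeline.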
Ievgen Redko SciDoLySE 61 / 79
![Page 93: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/93.jpg)
Illustration
Covariate shift DA mixes instances from different classes !
Ievgen Redko SciDoLySE 62 / 79
![Page 94: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/94.jpg)
Illustration
Proposed method handles target shift efficiently !
Ievgen Redko SciDoLySE 63 / 79
![Page 95: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/95.jpg)
Real-world data
Zurich Summer’ data set composed of 20 satellite images
4 classes : Roads, Buildings, Trees and Grass
17 source and 1 target domain
Average class proportions [0.25± 0.07, 0.4± 0.13, 0.22± 0.11, 0.13± 0.11]
Ievgen Redko SciDoLySE 64 / 79
![Page 96: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/96.jpg)
Examples of images
Input satellite images
Ievgen Redko SciDoLySE 65 / 79
![Page 97: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/97.jpg)
Examples of images
Input satellite images
Satellite images with 4 classes
Ievgen Redko SciDoLySE 65 / 79
![Page 98: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/98.jpg)
Classification results
| # of source domains | Average class proportions | # of source instances | No adaptation | OTDA-PT | OTDA-LP | MDA-Causal | JCPOT-LP | Target only |
|---|---|---|---|---|---|---|---|---|
| 2 | [0.17, 0.4, 0.16, 0.27] | 2,936 | 0.61 | 0.52 | 0.57 | 0.65 | 0.66 | 0.65 |
| 5 | [0.22, 0.39, 0.18, 0.21] | 6,716 | 0.62 | 0.55 | 0.6 | 0.66 | 0.68 | 0.64 |
| 8 | [0.25, 0.46, 0.17, 0.12] | 16,448 | 0.63 | 0.54 | 0.59 | 0.67 | 0.71 | 0.65 |
| 11 | [0.26, 0.48, 0.16, 0.1] | 21,223 | 0.63 | 0.54 | 0.58 | 0.67 | 0.72 | 0.673 |
| 14 | [0.26, 0.45, 0.19, 0.1] | 27,875 | 0.63 | 0.52 | 0.58 | 0.67 | 0.72 | 0.65 |
| 17 | [0.25, 0.42, 0.20, 0.13] | 32,660 | 0.63 | 0.5 | 0.59 | 0.67 | 0.73 | 0.61 |
Ievgen Redko SciDoLySE 66 / 79
![Page 99: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/99.jpg)
Optimal transport for jointdistribution adaptation
The following slides are courtesy of R. Flamary (OTML workshop, NIPS'17) :
http://otml17.marcocuturi.net/wp-content/uploads/2018/01/OTML_NIPS_2017.pdf
Ievgen Redko SciDoLySE 67 / 79
![Page 100: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/100.jpg)
Joint distribution and classifier estimation
Objectives of JDOT
• Model the transformation of labels (allowing changes of proportion/value).
• Learn an optimal target predictor with no labels on the target samples.
• The approach is theoretically justified.
Joint distributions and dataset
• We work with the joint feature/label distributions.
• Let Ω ∈ R^d be a compact measurable input space of dimension d and C the set of labels.
• Let P_s(X, Y) ∈ P(Ω × C) and P_t(X, Y) ∈ P(Ω × C) be the source and target joint distributions.
• We have access to an empirical sampling P̂_s = (1/N_s) Σ_{i=1}^{N_s} δ_{x_i^s, y_i^s} of the source distribution, defined by X_s = {x_i^s}_{i=1}^{N_s} and label information Y_s = {y_i^s}_{i=1}^{N_s}.
• But the target domain is defined only by an empirical distribution in the feature space, with samples X_t = {x_i^t}_{i=1}^{N_t}.
14 / 29
![Page 101: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/101.jpg)
Joint distribution OT (JDOT)
Proxy joint distribution
• Let f : Ω → C be a function from a given hypothesis class H.
• We define the following joint distribution that uses f as a proxy for y :

P_t^f = (x, f(x))_{x ∼ μ_t}    (5)

and its empirical counterpart P̂_t^f = (1/N_t) Σ_{i=1}^{N_t} δ_{x_i^t, f(x_i^t)}.

Learning with JDOT
We propose to learn the predictor f that minimizes :

min_f W_1(P̂_s, P̂_t^f) = inf_{γ ∈ Δ} Σ_{ij} D(x_i^s, y_i^s ; x_j^t, f(x_j^t)) γ_ij    (6)

• Δ is the transport polytope.
• D(x_i^s, y_i^s ; x_j^t, f(x_j^t)) = α‖x_i^s − x_j^t‖² + L(y_i^s, f(x_j^t)) with α > 0.
• We search for the predictor f that best aligns the joint distributions.
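The ground cost D in (6) can be sketched as follows (a numpy illustration with assumed names, for the regression case where the squared loss plays the role of L):

```python
import numpy as np

def jdot_cost(Xs, ys, Xt, f, alpha=1.0):
    # Pairwise JDOT ground cost D(x_i^s, y_i^s; x_j^t, f(x_j^t)):
    # alpha-weighted squared feature distance plus the loss between the
    # source label and the current prediction on the target sample.
    feat = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)    # ||xs_i - xt_j||^2
    loss = (ys[:, None] - f(Xt)[None, :]) ** 2                 # L(ys_i, f(xt_j))
    return alpha * feat + loss
```

The resulting N_s × N_t matrix is what the OT solver receives: α trades off matching in feature space against agreement between source labels and target predictions.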
15 / 29
![Page 102: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/102.jpg)
Optimization problem
min_{f ∈ H, γ ∈ Δ} Σ_{i,j} γ_{i,j} (α d(x_i^s, x_j^t) + L(y_i^s, f(x_j^t))) + λΩ(f)    (7)

Optimization procedure
• Ω(f) is a regularizer for the predictor f.
• We propose to use block coordinate descent (BCD) / Gauss-Seidel.
• Provably converges to a stationary point of the problem.

γ update for a fixed f
• Classical OT problem.
• Solved by the network simplex.
• Regularized OT can be used (add a term to problem (7)).

f update for a fixed γ

min_{f ∈ H} Σ_{i,j} γ_{i,j} L(y_i^s, f(x_j^t)) + λΩ(f)    (8)

• Weighted loss from all source labels.
• γ performs label propagation.
19 / 29
![Page 103: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/103.jpg)
Regression with JDOT
[Figure: toy regression. Panels: distributions, source/target models with samples, joint OT matrices (JDOT vs. OT links), and the model estimated with JDOT]
Least-squares regression with quadratic regularization
For a fixed γ the optimization problem is equivalent to

min_{f ∈ H} Σ_j (1/n_t) ‖ŷ_j − f(x_j^t)‖² + λ‖f‖²    (9)

• ŷ_j = n_t Σ_i γ_{i,j} y_i^s is a weighted average of the source target values.
• Note that this problem is linear instead of quadratic.
• Any solver can be used (linear, kernel ridge, neural network).
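For the linear case, one f-step of the BCD can be sketched as follows (assumed names; plain ridge regression stands in for whichever solver is plugged in):

```python
import numpy as np

def jdot_f_update(gamma, ys, Xt, lam=1e-3):
    # f-step of JDOT's BCD for the squared loss: build the propagated labels
    # y_hat_j = n_t * sum_i gamma_ij * y_i^s, then solve a ridge regression
    # of y_hat on the target features.
    nt, d = Xt.shape
    y_hat = nt * (gamma.T @ ys)
    w = np.linalg.solve(Xt.T @ Xt + lam * np.eye(d), Xt.T @ y_hat)
    return w, y_hat
```

Because the plan γ fixes ŷ, this step is an ordinary weighted regression on the target samples, which is exactly why problem (9) is easy once γ is frozen.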
20 / 29
![Page 104: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/104.jpg)
Classification with JDOT
(Figure: "Accuracy along BCD iterations" for α ∈ {0.1, 0.5, 1, 10, 50, 100}.)
Multiclass classification with hinge loss

For a fixed γ the optimization problem is equivalent to

min_{f_k∈H} Σ_{j,k} P_{j,k} L(1, f_k(x_j^t)) + (1 − P_{j,k}) L(−1, f_k(x_j^t)) + λ Σ_k ‖f_k‖²    (10)

• P is the class proportion matrix P = n_t γ^⊤ P^s.
• P^s and Y^s are defined from the source data with a one-vs-all strategy as

Y^s_{i,k} = 1 if y_i^s = k, −1 otherwise;    P^s_{i,k} = 1 if y_i^s = k, 0 otherwise,

with k ∈ {1, …, K} and K the number of classes.
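The one-vs-all matrices and the propagated class proportions can be built as follows (a small sketch of the construction above; labels are assumed to be encoded as 0, …, K−1 rather than 1, …, K):

```python
import numpy as np

def one_vs_all(ys, K):
    """Build P^s (one-hot) and Y^s (+1/-1 targets) from source labels."""
    Ps = np.zeros((len(ys), K))
    Ps[np.arange(len(ys)), ys] = 1.0
    Ys = 2.0 * Ps - 1.0
    return Ps, Ys

def class_proportions(G, Ps):
    """P = n_t * gamma^T P^s: row j gives class proportions for x_j^t."""
    nt = G.shape[1]
    return nt * G.T @ Ps
```

When the columns of γ sum to 1/n_t, each row of P sums to 1 and can be read as a soft label for the target point.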
![Page 105: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/105.jpg)
Caltech-Office classification dataset
| Domains | Base | SurK | SA | OT-IT | OT-MM | JDOT |
|---|---|---|---|---|---|---|
| caltech→amazon | 92.07 | 91.65 | 90.50 | 89.98 | 92.59 | 91.54 |
| caltech→webcam | 76.27 | 77.97 | 81.02 | 80.34 | 78.98 | 88.81 |
| caltech→dslr | 84.08 | 82.80 | 85.99 | 78.34 | 76.43 | 89.81 |
| amazon→caltech | 84.77 | 84.95 | 85.13 | 85.93 | 87.36 | 85.22 |
| amazon→webcam | 79.32 | 81.36 | 85.42 | 74.24 | 85.08 | 84.75 |
| amazon→dslr | 86.62 | 87.26 | 89.17 | 77.71 | 79.62 | 87.90 |
| webcam→caltech | 71.77 | 71.86 | 75.78 | 84.06 | 82.99 | 82.64 |
| webcam→amazon | 79.44 | 78.18 | 81.42 | 89.56 | 90.50 | 90.71 |
| webcam→dslr | 96.18 | 95.54 | 94.90 | 99.36 | 99.36 | 98.09 |
| dslr→caltech | 77.03 | 76.94 | 81.75 | 85.57 | 83.35 | 84.33 |
| dslr→amazon | 83.19 | 82.15 | 83.19 | 90.50 | 90.50 | 88.10 |
| dslr→webcam | 96.27 | 92.88 | 88.47 | 96.61 | 96.61 | 96.61 |
| Mean | 83.92 | 83.63 | 85.23 | 86.02 | 86.95 | 89.04 |
| Avg. rank | 4.50 | 4.75 | 3.58 | 3.00 | 2.42 | 2.25 |
• Classical dataset [Saenko et al., 2010] dedicated to visual adaptation.
• Feature extraction by a convolutional neural network [Donahue et al., 2014].
• Comparison with Surrogate Kernel [Zhang et al., 2013], Subspace Alignment [Fernando et al., 2013] and OT Domain Adaptation [Courty et al., 2016b].
• Parameters selected via reverse cross-validation [Zhong et al., 2010].
• SVM (Hinge loss) classifiers with a linear kernel.
• Best-ranked method, with a 2% average accuracy gain.
![Page 106: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/106.jpg)
Amazon Review Classification dataset
| Domains | NN | DANN | JDOT (mse) | JDOT (Hinge) |
|---|---|---|---|---|
| books→dvd | 0.805 | 0.806 | 0.794 | 0.795 |
| books→kitchen | 0.768 | 0.767 | 0.791 | 0.794 |
| books→electronics | 0.746 | 0.747 | 0.778 | 0.781 |
| dvd→books | 0.725 | 0.747 | 0.761 | 0.763 |
| dvd→kitchen | 0.760 | 0.765 | 0.811 | 0.821 |
| dvd→electronics | 0.732 | 0.738 | 0.778 | 0.788 |
| kitchen→books | 0.704 | 0.718 | 0.732 | 0.728 |
| kitchen→dvd | 0.723 | 0.730 | 0.764 | 0.765 |
| kitchen→electronics | 0.847 | 0.846 | 0.844 | 0.845 |
| electronics→books | 0.713 | 0.718 | 0.740 | 0.749 |
| electronics→dvd | 0.726 | 0.726 | 0.738 | 0.737 |
| electronics→kitchen | 0.855 | 0.850 | 0.868 | 0.872 |
| Mean | 0.759 | 0.763 | 0.783 | 0.787 |
• The dataset aims at predicting review sentiment across domains [Blitzer et al., 2006].
• Comparison with the Domain Adversarial Neural Network [Ganin et al., 2016a].
• The classifier f is a neural network with the same architecture as DANN.
• JDOT reaches better accuracy; the classification (Hinge) loss outperforms the mean squared error.
![Page 107: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/107.jpg)
Wifi localization regression dataset
| Domains | KRR | SurK | DIP | DIP-CC | GeTarS | CTC | CTC-TIP | JDOT |
|---|---|---|---|---|---|---|---|---|
| t1→t2 | 80.84±1.14 | 90.36±1.22 | 87.98±2.33 | 91.30±3.24 | 86.76±1.91 | 89.36±1.78 | 89.22±1.66 | 93.03±1.24 |
| t1→t3 | 76.44±2.66 | 94.97±1.29 | 84.20±4.29 | 84.32±4.57 | 90.62±2.25 | 94.80±0.87 | 92.60±4.50 | 90.06±2.01 |
| t2→t3 | 67.12±1.28 | 85.83±1.31 | 80.58±2.10 | 81.22±4.31 | 82.68±3.71 | 87.92±1.87 | 89.52±1.14 | 86.76±1.72 |
| hallway1 | 60.02±2.60 | 76.36±2.44 | 77.48±2.68 | 76.24±5.14 | 84.38±1.98 | 86.98±2.02 | 86.78±2.31 | 98.83±0.58 |
| hallway2 | 49.38±2.30 | 64.69±0.77 | 78.54±1.66 | 77.8±2.70 | 77.38±2.09 | 87.74±1.89 | 87.94±2.07 | 98.45±0.67 |
| hallway3 | 48.42±1.32 | 65.73±1.57 | 75.10±3.39 | 73.40±4.06 | 80.64±1.76 | 82.02±2.34 | 81.72±2.25 | 99.27±0.41 |
• The objective is to predict the position of a device on a discretized grid [Zhang et al., 2013].
• Same experimental protocol as [Zhang et al., 2013, Gong et al., 2016].
• Comparison with domain-invariant projection and its cluster-regularized version ([Baktashmotlagh et al., 2013], DIP and DIP-CC), generalized target shift ([Zhang et al., 2015], GeTarS), and conditional transferable components with its target-information-preservation regularization ([Gong et al., 2016], CTC and CTC-TIP).
• JDOT solves the adaptation problem for transfer across devices (10% accuracy gain on Hallway).
![Page 108: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/108.jpg)
Large scale JDOT Strategy

Large scale JDOT
• JDOT does not scale well to large datasets / deep learning.
• Use minibatches to compute the transport in the primal [Genevay et al., 2017].
• Evaluate batch-local couplings on (sufficiently large) pairs of random batches, sampled without replacement, from the source and target domains.
• Update f from these couplings.

Algorithm : Deep JDOT
Input: Source data Xs, ys; Target data Xt
for BCD iterations do
    for each source/target minibatch do
        Solve OT with the JDOT loss
        Perform label propagation on the minibatch
    end for
    Update the model f for one epoch
end for
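One BCD sweep of this minibatch strategy can be sketched in NumPy. This is an illustrative stand-in, not the DeepJDOT implementation: a real DeepJDOT works on deep features and updates f by SGD, while here the batch-local coupling is computed with a small Sinkhorn solver and only the propagated labels are returned.

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.05, n_iter=300):
    """Entropic regularized OT coupling on a minibatch cost matrix."""
    K = np.exp(-M / (reg * M.max()))
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def minibatch_sweep(Xs, ys, Xt, yt_pred, alpha=1.0, batch=16, seed=0):
    """Couple random source/target batches (sampled without replacement)
    with the JDOT cost and propagate source labels batch by batch."""
    rng = np.random.RandomState(seed)
    si, ti = rng.permutation(len(Xs)), rng.permutation(len(Xt))
    y_prop = np.array(yt_pred, dtype=float)
    marg = np.full(batch, 1.0 / batch)
    for k in range(min(len(si), len(ti)) // batch):
        sb, tb = si[k*batch:(k+1)*batch], ti[k*batch:(k+1)*batch]
        d = ((Xs[sb][:, None, :] - Xt[tb][None, :, :]) ** 2).sum(-1)
        L = (ys[sb][:, None] - yt_pred[tb][None, :]) ** 2
        G = sinkhorn(marg, marg, alpha * d + L)   # batch-local coupling
        y_prop[tb] = batch * G.T @ ys[sb]         # label propagation on batch
    return y_prop   # f is then refit on (Xt, y_prop) for one epoch
```

The model update on (Xt, y_prop) then plays the role of the "Update the model f for one epoch" step.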
![Page 109: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/109.jpg)
Large scale datasets
| Description | MNIST→USPS | USPS→MNIST | SVHN→MNIST | MNIST→MNIST-M |
|---|---|---|---|---|
| Source samples | 60000 | 9298 | 73257 | 60000 |
| Target samples | 9298 | 60000 | 60000 | 60000 |
| height/width | 16×16 | 16×16 | 32×32×3 | 28×28×3 |

• Four cross-domain digit datasets: MNIST, USPS, SVHN, MNIST-M.
• We consider a deep convolutional architecture.
• Dropout is used on the dense layers during training.
• The transport cost is computed in the raw image space.
![Page 110: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/110.jpg)
Experimental Results for large scale JDOT
| Methods | MNIST→USPS | USPS→MNIST | SVHN→MNIST | MNIST→MNIST-M |
|---|---|---|---|---|
| Source only (SO) | 86.18 | 58.73 | 53.15 | 59.52 |
| DeepCoral [Sun and Saenko, 2016] | 88.43 (22.0) | 85.02 (64.6) | 69.61 (35.6) | 62.18 (0.07) |
| MMD [Long and Wang, 2015] | 89.89 (36.3) | 79.19 (50.3) | 53.27 (0.01) | 52.53 (-19.1) |
| DANN [Ganin et al., 2016b] | 89.06 (28.2) | 87.03 (70.0) | 73.85∗ (44.7) | 76.63 (46.6) |
| ADDA [Tzeng et al., 2017] | 91.22 (49.3) | 79.98 (52.2) | 76.0∗ (49.4) | 79.16 (53.5) |
| DeepJDOT | 91.50 (52.01) | 91.21 (79.82) | 83.62 (65.85) | 67.84 (22.67) |
| Train on Target (TO) | 96.41 | 99.42 | 99.42 | 96.21 |
• Accuracy (in %) of the DA methods.
• The values in parentheses give the coverage gap (DA − SO)/(TO − SO) between SO (source only) and TO (the "golden" performance when the model is learnt on labelled target data).
• DeepJDOT is the best in 3 out of 4 DA problems.
• Plots show test performance along the BCD iterations.
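The coverage-gap values in parentheses can be recomputed directly from the table (small rounding differences with the slide are expected):

```python
def coverage(da, so, to):
    """Fraction (in %) of the SO -> TO gap closed by a DA method."""
    return 100.0 * (da - so) / (to - so)

# e.g. DeepJDOT on MNIST -> USPS: SO = 86.18, TO = 96.41, DA = 91.50
gap = coverage(91.50, 86.18, 96.41)   # close to the 52.01 reported in the table
```

A negative value (as for MMD on MNIST→MNIST-M) means the method performs worse than training on the source alone.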
![Page 111: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/111.jpg)
Other contributions :
theoretical guarantees for DA with OT
![Page 113: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/113.jpg)
Why does domain adaptation work ?

Intuition : if two domains are similar, then adaptation should be easy ?

Answer : yes, due to the following result [Ben David et al., 2007] :

error in the target domain (what we want to learn)
≤ error in the source domain (which we can learn well using standard ML)
+ how different the two domains are (a divergence between the two domains)
+ is it possible to adapt at all ? (a non-estimable term)

Can the Wasserstein distance be introduced into these bounds ?
![Page 114: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/114.jpg)
Several learning bounds

With the Wasserstein distance as a divergence term :

between the marginal distributions μS and μT [Redko et al., 2017, Shen et al., 2018] :

R_T(h) ≤ R_S(h) + W(μ_S, μ_T) + O(1/√n) + non-estimable term

between the joint distributions PS and PT [Courty et al., 2017] :

R_T(h) ≤ R_S(h) + W(P_S, P_T) + O(1/√n) + non-estimable term + probabilistic transfer Lipschitzness term
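The divergence term W(μ_S, μ_T) in the first bound can be estimated from unlabeled samples. In 1D the empirical 1-Wasserstein distance between equal-size samples reduces to matching sorted samples (a standard fact, shown here as an illustrative sketch):

```python
import numpy as np

def wasserstein_1d(xs, xt):
    """Empirical W1 between two equal-size 1D samples: in 1D the optimal
    coupling is monotone, so it suffices to match sorted samples."""
    assert len(xs) == len(xt)
    return float(np.mean(np.abs(np.sort(xs) - np.sort(xt))))
```

If this empirical divergence between source and target samples is small, the bound suggests that a low source risk transfers to the target.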
![Page 116: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/116.jpg)
Other contributions :
differentially private OT
![Page 118: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/118.jpg)
Privacy in Optimal Transport and Domain Adaptation
Challenge : performing optimal transport for domain adaptation

One needs access to both the source and the target data.

At least one party (source or target) has to reveal its data.

This raises privacy concerns.

How can we transfer knowledge with OT while protecting the privacy of the users ?
![Page 121: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/121.jpg)
Differential Privacy

ε-differential privacy [Dwork et al., 2006]
A randomized mechanism M : X^n → R^d is ε-differentially private (generalized later on to (ε, δ)-differential privacy) if, for any two datasets X, X′ ∈ X^n differing in a single element, and for any output t of M :

P(M(X) = t) ≤ e^ε P(M(X′) = t).

• Two close datasets correspond to close output distributions.
• M(X) does not leak much information about any individual point of X.
• Example : a Johnson-Lindenstrauss transform with some added noise.
Idea : Use Johnson-Lindenstrauss transform to obtain a new DP cost matrix !
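The defining inequality can be made concrete with the textbook randomized-response mechanism on a single bit (not from the slides; an assumed illustrative example): reporting the true bit with probability e^ε/(1+e^ε) is ε-differentially private, and the worst-case likelihood ratio between the outputs on two neighbouring datasets is exactly e^ε.

```python
import math
import numpy as np

def randomized_response(bit, eps, rng):
    """Report the true bit with prob e^eps / (1 + e^eps), else flip it."""
    keep = rng.rand() < math.exp(eps) / (1.0 + math.exp(eps))
    return bit if keep else 1 - bit

def worst_case_ratio(eps):
    """max_t P(M(X) = t) / P(M(X') = t) for neighbouring single-bit datasets."""
    p_keep = math.exp(eps) / (1.0 + math.exp(eps))
    return p_keep / (1.0 - p_keep)
```

The same ε-DP accounting underlies the noisy Johnson-Lindenstrauss idea used next for the cost matrix.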
![Page 122: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/122.jpg)
DPOT Algorithm
Input : Xs, Xt, and σ, ℓ > 0 (σ and ℓ known to both parties)

Step 1. Source generates a N(0, 1/ℓ)^{k×ℓ} matrix M (known to both parties) and a N(0, σ)^{k×ℓ} noise matrix ∆ (known only to the Source).
Step 2. Source sends M and X̃s + ∆, where X̃s = Xs M.
Step 3. Target computes C = c(X̃s + ∆, X̃t) − ℓσ², where X̃t = Xt M (ℓσ² is subtracted from each entry to cancel the bias caused by ∆).
Step 4. Solve OT with cost matrix C and return P and W (Xs, Xt).
(Diagram : the Source projects Xs with M into X̃s and adds the noise ∆ ; the Target projects Xt with M into X̃t ; the coupling matrix is computed from the noisy projected data.)
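Steps 1–3 can be sketched in NumPy. This is an illustrative reading of the protocol with two assumptions on the slide's notation: the noise ∆ has the shape of the projected source data, and σ denotes its standard deviation.

```python
import numpy as np

def dpot_cost(Xs, Xt, ell, sigma, seed=0):
    """Debiased squared-distance cost matrix from JL-projected, noised data."""
    rng = np.random.RandomState(seed)
    k = Xs.shape[1]
    M = rng.normal(0.0, np.sqrt(1.0 / ell), size=(k, ell))  # shared JL matrix
    Delta = rng.normal(0.0, sigma, size=(len(Xs), ell))     # Source-only noise
    Xs_proj = Xs @ M + Delta       # what the Source actually reveals
    Xt_proj = Xt @ M               # Target projects its own data
    C = ((Xs_proj[:, None, :] - Xt_proj[None, :, :]) ** 2).sum(-1)
    return C - ell * sigma**2      # cancel the bias E||Delta_i||^2 = ell*sigma^2
```

Since E‖(x − y)M‖² = ‖x − y‖² for this choice of M, the debiased matrix C estimates the squared-distance cost without the Source ever revealing Xs.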
![Page 126: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/126.jpg)
Conclusions
OT is a very powerful tool for domain adaptation.

A large variety of possible applications/scenarios (and more remain uncovered).

Quite computationally efficient (but costly regularization is needed).

Try it using the POT library !
![Page 127: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/127.jpg)
References I
Bredies, K., Lorenz, D. A., and Maass, P. (2009). A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193.

Courty, N., Flamary, R., and Tuia, D. (2014). Domain adaptation with regularized optimal transport. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transportation. In Neural Information Processing Systems (NIPS), pages 2292–2300.

Ferradans, S., Papadakis, N., Rabin, J., Peyré, G., and Aujol, J.-F. (2013). Regularized discrete optimal transport. In Scale Space and Variational Methods in Computer Vision, SSVM, pages 428–439.
![Page 128: Transfer Learning and Optimal Transportscidolyse.ens-lyon.fr/sites/default/files/2019-09/OT...Transfer Learning and Optimal Transport Ievgen Redko ievgen.redko@univ-st-etienne.fr UMR](https://reader034.fdocuments.us/reader034/viewer/2022042401/5f0ff28b7e708231d446ae36/html5/thumbnails/128.jpg)
References II
Germain, P., Habrard, A., Laviolette, F., and Morvant, E. (2013). A PAC-Bayesian Approach for Domain Adaptation with Specialization to Linear Classifiers. In ICML, pages 738–746, Atlanta, USA.

Perrot, M., Courty, N., Flamary, R., and Habrard, A. (2016). Mapping estimation for discrete optimal transport. In Neural Information Processing Systems (NIPS).

Tuia, D., Flamary, R., Rakotomamonjy, A., and Courty, N. (2015). Multitemporal classification without new labels: a solution with optimal transport. In 8th International Workshop on the Analysis of Multitemporal Remote Sensing Images.