Multi-Task Kernel Learning
J. Saketha Nath
CSE, IIT-Bombay
Saketh MLSS - IIT-Madras
Multi-Task Learning

Setting:
- Multiple related learning tasks, e.g. object recognition
- Exploit task relatedness for better generalization

The problem:
- Learn shared features across tasks
- If possible, sparse feature representations
A simple case...

Suppose: tasks share a few input features.

Formulation:

$$\min_{w,b,\xi}\;\; \frac{1}{2}\left(\sum_{f=1}^{d}\|w^f\|_2\right)^2 + C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}$$

$$\text{s.t.}\quad y_{ti}\left(w_t^\top x_{ti} - b_t\right) \ge 1 - \xi_{ti},\quad \xi_{ti} \ge 0$$
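As a concrete check of the objective above, here is a minimal NumPy sketch (the toy data and the helper `mtl_objective` are illustrative assumptions, not from the slides) that evaluates the joint hinge-loss objective for given task weights, with the slacks at their optimal values $\xi_{ti} = \max(0,\, 1 - y_{ti}(w_t^\top x_{ti} - b_t))$:

```python
import numpy as np

# Hypothetical helper (not from the slides): evaluates the slide's objective
# for T tasks sharing d input features.
def mtl_objective(W, b, Xs, ys, C=1.0):
    # W is d x T; column t is task t's weight vector w_t, row f is w^f.
    reg = 0.5 * np.sum(np.linalg.norm(W, axis=1)) ** 2   # (sum_f ||w^f||_2)^2 / 2
    slack = sum(np.maximum(0.0, 1.0 - y * (X @ W[:, t] - b[t])).sum()
                for t, (X, y) in enumerate(zip(Xs, ys)))
    return reg + C * slack

W = np.array([[1.0, 1.0],   # both tasks load only on shared feature 0
              [0.0, 0.0],
              [0.0, 0.0]])
b = np.zeros(2)
Xs = [np.array([[2.0, 0, 0], [-2.0, 0, 0]])] * 2
ys = [np.array([1.0, -1.0])] * 2
print(mtl_objective(W, b, Xs, ys))  # margins satisfied, so 0.5 * (sqrt(2))^2 = 1.0
```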
A simple case...

Suppose: tasks share a few input features.

Formulation (per-task $\ell_2$ regularization):

$$\min_{w,b,\xi}\;\; \frac{1}{2}\sum_{t=1}^{T}\|w_t\|_2^2 + C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}$$

$$\text{s.t.}\quad y_{ti}\left(w_t^\top x_{ti} - b_t\right) \ge 1 - \xi_{ti},\quad \xi_{ti} \ge 0$$
A simple case...

Suppose: tasks share a few input features.

Formulation (per-task $\ell_1$ regularization):

$$\min_{w,b,\xi}\;\; \frac{1}{2}\sum_{t=1}^{T}\|w_t\|_1^2 + C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}$$

$$\text{s.t.}\quad y_{ti}\left(w_t^\top x_{ti} - b_t\right) \ge 1 - \xi_{ti},\quad \xi_{ti} \ge 0$$
l1-l2 Regularizer

$$\sum_{f=1}^{d}\|w^f\|_2 \;\underbrace{\Longleftarrow}_{\ell_1}\; \begin{pmatrix}\|w^1\|_2\\ \vdots \\ \|w^d\|_2\end{pmatrix} \;\underbrace{\Longleftarrow}_{\ell_2}\; \begin{pmatrix} w_{11} & \dots & w_{T1}\\ \vdots & & \vdots\\ w_{1d} & \dots & w_{Td}\end{pmatrix}$$

Here the columns of the matrix are the task weight vectors $w_1,\dots,w_T$, and the $f$-th row $w^f$ collects feature $f$'s weights across all tasks: an $\ell_2$ norm within each feature row, then an $\ell_1$ norm across features.
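A minimal sketch (an illustration of the regularizer, not from the slides) of how the $\ell_1$-$\ell_2$ penalty scores the weight matrix, rewarding features that all tasks share and penalizing scattered, task-private features:

```python
import numpy as np

# W = [w_1 ... w_T] is d x T; row f holds feature f's weight in every task.
# The l1-l2 regularizer is the l1 norm of the vector of row-wise l2 norms.
def l1_l2(W):
    return np.sum(np.linalg.norm(W, axis=1))

shared = np.array([[3.0, 4.0],     # feature 0 used by both tasks
                   [0.0, 0.0],     # features 1,2 dropped by both: no penalty
                   [0.0, 0.0]])
scattered = np.array([[3.0, 0.0],  # same Frobenius norm, but each task
                      [0.0, 4.0],  # uses its own private feature
                      [0.0, 0.0]])
print(l1_l2(shared), l1_l2(scattered))  # 5.0 vs 7.0: sharing is cheaper
```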
Interpretation of lp regularization

Consider, in turn, $\min_{x:\|x\|_2\le 1} f(x)$, $\min_{x:\|x\|_1\le 1} f(x)$, and $\min_{x:\|x\|_\infty\le 1} f(x)$.

(The slides illustrate these with figures of the corresponding norm balls; the figures are not reproduced here.)
Interpretation of lp regularization

Summary:
- $1 \le p < 2$ promotes sparsity
- $p = 2$ induces robustness; rotation-invariant
- $2 < p < \infty$ promotes non-sparse combinations
- $p = \infty$ promotes equal weightages
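One way to see the $1 \le p < 2$ sparsity claim concretely (a sketch of my own; the routine follows the standard sort-based $\ell_1$-ball projection, and is not part of the slides): projecting onto the $\ell_1$ ball typically zeroes coordinates, while projecting onto the $\ell_2$ ball only rescales them.

```python
import numpy as np

def project_l1_ball(v, z=1.0):
    # Euclidean projection onto {x : ||x||_1 <= z} (standard sort-based method)
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

v = np.array([2.0, 0.5, 0.1])
print(project_l1_ball(v))       # [1. 0. 0.] -- small coordinates get zeroed
print(v / np.linalg.norm(v))    # l2 projection: every coordinate stays nonzero
```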
A Bit More Realistic Case...

Suppose [Argyriou et al., 08]:
- Tasks share a few (possibly learnt) features
- Rotationally transformed features

Formulation:

$$\min_{w,b,\xi,L}\;\; \frac{1}{2}\left(\sum_{f=1}^{d}\|w^f\|_2\right)^2 + C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}$$

$$\text{s.t.}\quad y_{ti}\left(w_t^\top L^\top x_{ti} - b_t\right) \ge 1 - \xi_{ti},\quad \xi_{ti} \ge 0,\quad L \in O^d$$
Multi-task Sparse Feature Learning (MTSFL) Formulation

Summary:
- Though non-convex, the global optimum can be obtained
- Can be kernelized
- Efficient alternate-minimization algorithm (one EVD per iteration)
- Achieves state-of-the-art performance on benchmarks

Discussion:
- Rotationally transformed features: too restrictive, yet essential for convexity
- Idea: enrich the input space itself. Multiple Kernel Learning (MKL)?
Central Idea

Pose the problem as that of learning a shared kernel.

Outline: two formulations:
- Learn a kernel shared across tasks (MK-MTFL): extension of standard MKL to the multi-task case
- Learn a sparse representation from the shared kernel (MK-MTSFL): extension of MTSFL to multiple base kernels
Notational stuff...

- $k_1,\dots,k_n$: base kernels
- $\phi_j(\cdot)$: implicit mapping associated with $k_j$
- $w_{tjf}$: $t$-th task, $j$-th kernel, $f$-th feature loading
- $w_{\cdot jf},\, w_{t\cdot f},\, w_{tj\cdot}$: the corresponding slices
- Linear model: $f_t(x) = \sum_{j=1}^{n} w_{tj\cdot}^\top \phi_j(x) - b_t$
MK-MTFL Formulation

Primal ($\ell_1$-$\ell_2$-$\ell_2$):

$$\min_{w,b,\xi}\;\; \frac{1}{2}\left(\sum_{j=1}^{n}\left(\sum_{t=1}^{T}\|w_{tj\cdot}\|_2^2\right)^{\frac{1}{2}}\right)^2 + C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}$$

$$\text{s.t.}\quad y_{ti}\left(\sum_{j=1}^{n} w_{tj\cdot}^\top\phi_j(x_{ti}) - b_t\right) \ge 1 - \xi_{ti},\quad \xi_{ti}\ge 0$$

Partial Dual:

$$\min_{\gamma\in\Delta_n}\; \max_{\alpha_t\in S_{m_t}(C)}\; \sum_{t=1}^{T}\left[\mathbf{1}^\top\alpha_t - \frac{1}{2}\alpha_t^\top Y_t\left(\sum_{j=1}^{n}\gamma_j K_{tj}\right)Y_t\alpha_t\right]$$
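The quantity $\sum_j \gamma_j K_{tj}$ in the dual is simply a convex combination of base Gram matrices. A small sketch (the base kernels and weights are illustrative assumptions, not from the slides) showing that the combination is again a valid PSD kernel:

```python
import numpy as np

# Combine n = 3 RBF base Gram matrices for one task with weights gamma
# from the simplex Delta_3; the result is still a PSD Gram matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))          # 5 training points, 3 raw features

def rbf_gram(X, width):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * width ** 2))

base = [rbf_gram(X, w) for w in (0.5, 1.0, 2.0)]
gamma = np.array([0.2, 0.5, 0.3])        # a point of Delta_3
K = sum(g * Kj for g, Kj in zip(gamma, base))
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))  # PSD (up to round-off): True
```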
MK-MTFL Formulation

Primal ($2\le p\le\infty$; $\ell_1$-$\ell_p$-$\ell_2$):

$$\min_{w,b,\xi}\;\; \frac{1}{2}\left(\sum_{j=1}^{n}\left(\sum_{t=1}^{T}\|w_{tj\cdot}\|_2^p\right)^{\frac{1}{p}}\right)^2 + C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}$$

$$\text{s.t.}\quad y_{ti}\left(\sum_{j=1}^{n} w_{tj\cdot}^\top\phi_j(x_{ti}) - b_t\right) \ge 1 - \xi_{ti},\quad \xi_{ti}\ge 0$$

Partial Dual ($\bar p = \frac{p}{p-2}$):

$$\min_{\gamma\in\Delta_n}\; \max_{\lambda_j\in\Delta_{T,\bar p}}\; \max_{\alpha_t\in S_{m_t}(C)}\; \sum_{t=1}^{T}\left[\mathbf{1}^\top\alpha_t - \frac{1}{2}\alpha_t^\top Y_t\left(\sum_{j=1}^{n}\frac{\gamma_j K_{tj}}{\lambda_{jt}}\right)Y_t\alpha_t\right]$$
MK-MTFL Formulation

Summary:
- Novel formulation for learning a shared kernel
- Extension of MKL to the multi-task case
- Tasks can be unequally reliable
- Efficient mirror-descent based algorithm
- Each step solves $T$ regular SVMs: $O\left(\sum_{t=1}^{T} m_t^2\, d\, n\right)$
MK-MTSFL Formulation

Primal ($1\le q\le 2$; $\ell_q$-$\ell_1$-$\ell_2$):

$$\min_{w,b,\xi,L}\;\; \frac{1}{2}\left(\sum_{j=1}^{n}\left(\sum_{f=1}^{d_j}\|w_{\cdot jf}\|_2\right)^q\right)^{\frac{2}{q}} + C\sum_{t=1}^{T}\sum_{i=1}^{m_t}\xi_{ti}$$

$$\text{s.t.}\quad y_{ti}\left(\sum_{j=1}^{n} w_{tj\cdot}^\top L_j^\top\phi_j(x_{ti}) - b_t\right) \ge 1 - \xi_{ti},\quad \xi_{ti}\ge 0,\quad L_j\in O^{d_j}$$

Partial Dual ($\bar q = \frac{q}{2-q}$):

$$\min_{Q}\; \sum_{t=1}^{T}\max_{\alpha_t\in S_{m_t}(C)}\; \mathbf{1}^\top\alpha_t - \frac{1}{2}\alpha_t^\top Y_t\left(\sum_{j=1}^{n} M_{tj}^\top Q_j M_{tj}\right)Y_t\alpha_t$$

$$\text{s.t.}\quad Q_j\succeq 0,\quad \sum_{j=1}^{n}\left(\operatorname{trace}(Q_j)\right)^{\bar q}\le 1$$
MK-MTSFL Formulation

Summary:
- Novel formulation for learning shared sparse feature representations
- Trace-norm constraints lead to low-rank matrices
- Extension of MTSFL [Argyriou et al., 08] to multiple base kernels
- Though non-convex, the global optimum can be efficiently obtained
- Efficient mirror-descent based algorithm
- Each step solves $T$ regular SVMs and $n$ EVDs of full matrices
- Faster convergence in practice than alternate minimization
Solving MK-MTSFL

Partial Dual:

$$\min_{Q}\; \overbrace{\sum_{t=1}^{T}\max_{\alpha_t\in S_{m_t}(C)}\; \mathbf{1}^\top\alpha_t - \frac{1}{2}\alpha_t^\top Y_t\left(\sum_{j=1}^{n} M_{tj}^\top Q_j M_{tj}\right)Y_t\alpha_t}^{g(Q)}$$

$$\text{s.t.}\quad Q_j\succeq 0,\quad \sum_{j=1}^{n}\left(\operatorname{trace}(Q_j)\right)^{\bar q}\le 1$$

- $g(Q)$ cannot be computed analytically
- Danskin's theorem provides $\nabla g(Q)$; this involves solving $T$ regular SVMs
Projected (Sub-)Gradient Descent

$\min_{x\in\mathcal{X}} f(x)$ ($f$ convex and Lipschitz, $\mathcal{X}$ compact)

At iteration $k$: $f$ is approximated by the linear function $\hat f(x) = f(x_k) + \nabla f(x_k)^\top(x-x_k)$, valid only when $\|x-x_k\|_2$ is small.

$$\begin{aligned}
x_{k+1} &= \arg\min_{x\in\mathcal{X}}\; s_k\nabla f(x_k)^\top(x-x_k) + \frac{1}{2}\|x-x_k\|_2^2\\
&= \arg\min_{x\in\mathcal{X}}\; \frac{1}{2}\left\|x-\left(x_k - s_k\nabla f(x_k)\right)\right\|_2^2\\
&= \Pi_{\mathcal{X}}\left(x_k - s_k\nabla f(x_k)\right)
\end{aligned}$$

- Convergence guarantees with suitable choices of step-sizes $(s_k)$
- "Optimal" for Euclidean geometry
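A minimal runnable instance of the update above (the toy problem is my own choice, not from the slides): minimize $\|x-c\|_2^2$ over the unit $\ell_2$ ball, whose solution is $c/\|c\|_2$ when $c$ lies outside the ball.

```python
import numpy as np

def project_l2_ball(x):
    # Projection onto {x : ||x||_2 <= 1}
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

c = np.array([3.0, 4.0])            # optimum of the toy problem is c/||c||_2
grad = lambda x: 2.0 * (x - c)      # gradient of f(x) = ||x - c||_2^2

x = np.zeros(2)
for k in range(1, 201):
    s_k = 0.1 / np.sqrt(k)          # a standard diminishing step-size
    x = project_l2_ball(x - s_k * grad(x))

print(x)  # -> [0.6, 0.8], i.e. c / ||c||_2
```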
Mirror Descent

Key Idea: replace the Euclidean regularizer $\frac{1}{2}\|x-x_k\|_2^2$ with a Bregman-divergence-based one, so that the per-step problem is easy:

$$x_{k+1} = \arg\min_{x\in\mathcal{X}}\; s_k\nabla f(x_k)^\top(x-x_k) + D_{x_k}(x)$$

Bregman Divergence: for a strongly convex $\omega(\cdot)$, $D_x(y) = \omega(y) - \omega(x) - \nabla\omega(x)^\top(y-x)$.

Common choices:
- $\mathcal{X}$ sphere: $\omega(x) = \frac{1}{2}\|x\|_2^2$
- $\mathcal{X}$ simplex: $\omega(x) = \sum_i x_i\log(x_i)$
- $\mathcal{X}$ spectrahedron: $\omega(x) = \operatorname{trace}(x\log(x))$
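For the simplex choice $\omega(x)=\sum_i x_i\log(x_i)$, the mirror-descent step has a closed form (exponentiated gradient). A small sketch on a toy linear objective (my own example, not from the slides):

```python
import numpy as np

# Entropic mirror descent on the simplex: the per-step problem
#   argmin_{x in simplex} s_k grad^T (x - x_k) + D_{x_k}(x)
# has the closed-form solution x_{k+1} proportional to x_k * exp(-s_k grad).
def md_simplex(grad, x0, steps=200, s=0.5):
    x = x0.copy()
    for _ in range(steps):
        x = x * np.exp(-s * grad(x))
        x /= x.sum()                 # normalization completes the step
    return x

c = np.array([3.0, 1.0, 2.0])        # minimize f(x) = c.x over the simplex
x = md_simplex(lambda x: c, np.ones(3) / 3.0)
print(np.round(x, 3))  # mass concentrates on argmin_i c_i, i.e. index 1
```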
Solving MK-MTSFL

Our finding: the entropy function $\operatorname{trace}(x\log(x))$ is good enough for our problem.

Per-step problem:

$$\min_{Q}\; \sum_{j=1}^{n}\left\{\operatorname{trace}(\zeta_j Q_j) + \operatorname{trace}(Q_j\log(Q_j))\right\}$$

$$\text{s.t.}\quad Q_j\succeq 0,\quad \sum_{j=1}^{n}\left(\operatorname{trace}(Q_j)\right)^{\bar q}\le 1$$

After EVDs of the $Q_j$:

$$\min_{\rho}\; \sum_{j=1}^{n}\left(\rho_j\log(\rho_j) + \rho_j\pi_j\right)\quad \text{s.t.}\quad \rho_j\ge 0,\ \sum_{j=1}^{n}\rho_j^{\bar q}\le 1$$
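For the special case $\bar q = 1$, the reduced $\rho$-problem has a closed form via the KKT conditions (a worked special case of my own; the slides treat general $\bar q$):

```python
import numpy as np

# Solve min_rho sum_j rho_j log(rho_j) + rho_j pi_j
#       s.t. rho_j >= 0, sum_j rho_j <= 1          (the q-bar = 1 case).
# Stationarity gives log(rho_j) + 1 + pi_j + mu = 0. If the unconstrained
# minimizer rho_j = exp(-pi_j - 1) is feasible we are done; otherwise the
# sum-constraint is active and rho is softmax(-pi).
def solve_rho(pi):
    rho = np.exp(-pi - 1.0)
    if rho.sum() > 1.0:
        e = np.exp(-pi)
        rho = e / e.sum()
    return rho

pi = np.array([0.1, 0.5, 2.0])
rho = solve_rho(pi)
print(np.round(rho, 4), round(rho.sum(), 4))  # smaller pi_j gets larger rho_j
```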
Simulations

Datasets:
- School: multi-task benchmark; prediction of student performance in various schools.
  - 139 regression tasks, 28 input features, 15 training examples per task
- Letters: OCR dataset; each letter considered as a task.
  - 9 binary classification tasks, 128 input features, 10 training examples per task
- Dermatology: bio-informatics dataset; predicting one of six skin diseases.
  - 15 binary classification tasks, 33 input features, 10 training examples per task
Simulations

Table: Comparison of generalization performance

| Dataset | SVM | MTSFL | MK-MTFL (p=2) | MK-MTFL (p=7) | MK-MTFL (p=Inf) | MK-MTSFL (q=1) | MK-MTSFL (q=1.5) | MK-MTSFL (q=1.99) |
|---|---|---|---|---|---|---|---|---|
| S | -45.88 | 13.94 | 10.76 | 13.80 | 10.52 | 14.07 | 13.80 | 13.94 |
| L | 74.89 | 75.54 | 78.28 | 78.30 | 78.31 | 76.38 | 76.93 | 74.57 |
| D | 8 | 6 | 0 | 0 | 0 | 8 | 7 | 5.33 |

Training times: MTSFL – 179 sec, MK-MTFL – 192 sec, MK-MTSFL – 15445 sec.
Conclusions

Two novel formulations for multi-task feature learning:
- Extension of MKL to the multi-task case (non-sparse): simple, good generalization, scalable
- Extension of MTSFL to multiple base kernels (sparse): better generalization than state-of-the-art

Efficient mirror-descent based algorithm with faster convergence.

Sparse representations may not always be desirable.
Questions?
Thank You