ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning...

54
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Transcript of ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning...

Page 1: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

ECS289: Scalable Machine Learning

Cho-Jui HsiehUC Davis

Oct 27, 2015

Page 2: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Outline

One versus all/One versus one

Ranking loss for multiclass/multilabel classification

Scaling to millions of labels

Page 3: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Multiclass Learning

n data points, L labels, d featuresInput: training data xi , yini=1:

Each xi is a d dimensional feature vectorEach yi ∈ 1, . . . , L is the corresponding labelEach training data belongs to one category

Goal: find a function to predict the correct label

f (x) ≈ y

Page 4: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Multi-label Problems

n data points, L labels, d features

Input: training data xi , yini=1:

Each xi is a d dimensional feature vectorEach yi ∈ 0, 1L is a label vector (or Yi ∈ 1, 2, . . . , L)

Example: yi = [0, 0, 1, 0, 0, 1, 1] (or Yi = 3, 6, 7)Each training data can belong to multiple categories

Goal: Given a testing sample x , predict the correct labels

Page 5: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Illustration

Multiclass: each row of L has exact one “1”

Multilabel: each row of L can have multiple ones

Page 6: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Measure the accuracy (multi-class)

Let xi , yimi=1 be a set of testing data for multi-class problems

Let z1, . . . , zm be the prediction for each testing data

Accuracy:

1

m

m∑i=1

I (yi = zi )

If the algorithm outputs a set of k potential labels for each sample:

Z1,Z2, . . . ,Zm

Each Zi is a set of k labels

Precision@k:1

m

m∑i=1

I (yi ∈ Zi )

Page 7: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Measure the accuracy (multi-class)

Let xi , yimi=1 be a set of testing data for multi-class problems

Let z1, . . . , zm be the prediction for each testing data

Accuracy:

1

m

m∑i=1

I (yi = zi )

If the algorithm outputs a set of k potential labels for each sample:

Z1,Z2, . . . ,Zm

Each Zi is a set of k labels

Precision@k:1

m

m∑i=1

I (yi ∈ Zi )

Page 8: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Example (multiclass)

Page 9: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Example (multiclass)

Page 10: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Example (multiclass)

Page 11: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Example (multiclass)

Page 12: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Measure the accuracy (multi-label)

Let xi , yimi=1 be a set of testing data for multi-label problems

For each testing data, the classifier outputs zi ∈ 0, 1L

We use Yi = t : (yi )t 6= 0 and Zi = t : (zi )t 6= 0 to denote thesubset of real and predictive labels for the i-th testing data.

If each Zi has k elements, then Precision@k is defined by:

1

m

m∑i=1

Zi ∩ Yi

k

Hamming loss: overall classification performance

1

m

m∑i=1

d(yi , zi ),

where d(yi , zi ) measures the number of places where yi and zi differ.ROC curve (assume the classifier predicts a ranking of labels for eachdata point)

Page 13: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Example (multilabel)

Page 14: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Example (multilabel)

Page 15: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Example (multilabel)

Page 16: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Traditional Approach

Many algorithms for binary classification

Idea: transform multi-class or multi-label problems to multiple binaryclassification problems

Two approaches:

One versus All (OVA)One versus One (OVO)

Page 17: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

One Versus All (OVA)

Multi-class/multi-label problems with L categories

Build L different binary classifiers

For the t-th classifier:

Positive samples: all the points in class t (xi : t ∈ yi)Negative samples: all the points not in class t (xi : t /∈ yi)ft(x): the decision value for the t-th classifier

(larger ft ⇒ higher probability that x in class t)

Prediction:f (x) = arg max

tft(x)

Example: using SVM to train each binary classifier.

Page 18: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

One Versus One (OVO)

Multi-class/multi-label problems with L categories

Build L(L− 1) different binary classifiers

For the (s, t)-th classifier:

Positive samples: all the points in class s (xi : s ∈ yi)Negative samples: all the points in class t (xi : t ∈ yi)fs,t(x): the decision value for this classifier

(larger fs,t(x) ⇒ label s has higher probability than label t)ft,s(x) = −fs,t(x)

Prediction:

f (x) = arg maxs

(∑t

fs,t(x)

)Example: using SVM to train each binary classifier.

Not for multilabel problems.

Page 19: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

OVA vs OVO

Prediction accuracy: depends on datasets

Computational time:

OVA needs to train L classifiers

OVO needs to train L(L− 1)/2 classifiers

Is OVA always faster than OVO?

Page 20: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

OVA vs OVO

Prediction accuracy: depends on datasets

Computational time:

OVA needs to train L classifiers

OVO needs to train L(L− 1)/2 classifiers

Is OVA always faster than OVO?

NO, depends on the time complexity of the binary classifier

If the binary classifier requires O(n) time for n samples:OVA and OVO have similar time complexity

If the binary classifier requires O(n1.xx) time:OVO is faster than OVA

LIBSVM (kernel SVM solver): OVO

LIBLINEAR (linear SVM solver): OVA

Page 21: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

OVA vs OVO

Prediction accuracy: depends on datasets

Computational time:

OVA needs to train L classifiers

OVO needs to train L(L− 1)/2 classifiers

Is OVA always faster than OVO?

NO, depends on the time complexity of the binary classifier

If the binary classifier requires O(n) time for n samples:OVA and OVO have similar time complexity

If the binary classifier requires O(n1.xx) time:OVO is faster than OVA

LIBSVM (kernel SVM solver): OVO

LIBLINEAR (linear SVM solver): OVA

Page 22: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Comparisons (accuracy)

For kernel SVM with RBF kernel

(See “A comparison of methods for multiclass support vector machines”,2002)

Page 23: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Comparisons (training time)

For kernel SVM (with RBF kernel)

(See “A comparison of methods for multiclass support vector machines”,2002)

Page 24: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Methods Using Instance-wise Ranking Loss

Page 25: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Main idea

OVA and OVO: decompose the problem by labels

However, the ranking of the labels for a testing sample is important

For multiclass classification, the score of yi should be larger than otherlabelsFor multilabel classification, the score of Yi should be larger than otherlabels

Both OVA and OVO decompose the problem into individual labels

⇒ they cannot capture the ranking information

Solve one combined optimization problem by minimizing theranking loss

Page 26: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Main idea

For simplicity, we assume a linear model

Model parameters: w1, . . . ,wL

For each data point x , compute the decision value for each label:

wT1 x , wT

2 x , . . . , wTL x

Prediction:y = arg max

twT

t x

For training data xi , yi is the true label, so we want

yi ≈ arg maxt

wTt xi ∀i

Page 27: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Main idea

Question: how to define the distance between XW and Y

Page 28: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Weston-Watkins Formulation

Proposed in Weston and Watkins, “Multi-class support vectormachines”. In ESANN, 1999.

minwt,ξti

1

2

L∑t=1

‖wt‖2 + Cn∑

i=1

L∑t=1

ξti

s.t. wTyi

xi −wTt xi ≥ 1− ξti , ξti ≥ 0 ∀t 6= yi , ∀i = 1, . . . , n

If point i is in class yi , for any other labels (t 6= yi ), we want

wTyi

xi −wTt xi ≥ 1

or we pay a penalty ξtiPrediction:

f (x) = arg maxt

wTt xi

Page 29: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Weston-Watkins Formulation (dual)

The dual problem of Weston-Watkins formulation:

minαt

1

2

L∑t=1

‖wt(α)‖+n∑

i=1

∑t 6=yi

αti

s.t.0 ≤ αti ≤ C , ∀t 6= yi , ∀i = 1, . . . , n

α1, . . . ,αL ∈ Rn: dual variables for each label

wt(α) = −∑

i αti xi , and αyi

i = −∑

t 6=yiαti

Can be solved by dual (block) coordinate descent

Page 30: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Crammer-Singer Formulation

Proposed in Carmmer and Singer, “On the algorithmic implementationof multiclass kernel-based vector machines”. JMLR, 2001.

minwt,ξti

1

2

L∑t=1

‖wt‖2 + Cn∑

i=1

ξi

s.t. wTyi

xi −wTt xi ≥ 1− ξi , ∀t 6= yi , ∀i = 1, . . . , n

ξi ≥ 0 ∀i = 1, . . . , n

If point i is in class yi , for any other labels (t 6= yi ), we want

wTyi

xi −wTt xi ≥ 1

For each point i , we only pay the largest penalty

Prediction:f (x) = arg max

twT

t xi

Page 31: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Crammer-Singer Formulation (dual)

The dual problem of Crammer-Singer formulation:

minαt

1

2

L∑t=1

‖wt(α)‖+n∑

i=1

∑t 6=yi

αti

s.t.

(αti ≤ C , ∀t,

∑t

αti = 0

), ∀i = 1, 2, . . . , n

α1, . . . ,αL ∈ Rn: dual variables for each label

wt(α) = −∑

i αti xi , and αyi

i = −∑

t 6=yiαti

Can be solved by dual (block) coordinate descent

Page 32: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Comparisons

(See “A Sequential Dual Method for Large Scale Multi-Class Linear SVMs”,KDD 2008)

Page 33: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Scaling to huge number of labels

Page 34: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Challenges: large number of categories

Multi-label (or multi-class) classification with large number of labels

Image classification—> 10000 labels

Recommending tags for articles: millions of labels (tags)

Page 35: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Challenges: large number of categories

Consider a problem with 1 million labels (L = 1, 000, 000)

One versus all: reduce to 1 million binary problems

Training: 1 million binary classification problems.

Need 694 days if each binary problem can be solved in 1 minute

Model size: 1 million models.

Need 1 TB if each model requires 1MB.

Prediction one testing data: 1 million binary prediction

Need 1000 secs if each binary prediction needs 10−3 secs.

Page 36: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Several Approaches

Label space reduction by Compressed Sensing (CS)

Feature space reduction (PCA)

Supervised latent feature model by matrix factorization

Page 37: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Compressed Sensing

If A ∈ Rn×d ,w ∈ Rd , y ∈ Rn, and

Aw = y

Given A and y, can we recover w?

Page 38: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Compressed Sensing

If A ∈ Rn×d ,w ∈ Rd , y ∈ Rn, and

Aw = y

Given A and y, can we recover w?

When n d :

Usually yes, because number of constraints number of variable

(over-determined linear system)

Page 39: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Compressed Sensing

If A ∈ Rn×d ,w ∈ Rd , y ∈ Rn, and

Aw = y

Given A and y, can we recover w?

When n d :

No, because number of constraints number of variable

(high dimensional regression, under-determined linear system, . . . )

Page 40: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Compressed Sensing

However, if we know w is a sparse vector (with ≤ s nonozeroelements), and A satisfies certain condition, then we can recover weven in the high dimensional setting.

Page 41: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Compressed Sensing

Conditions: w is s-sparse and A satisfies Restricted Isometry Property(RIP)

Example: all entries of A are generated i.i.d. from Gaussian N(0, σ). . .

w can be recoverd by O(s log d) samples

Lasso (`1-regularized linear regression)Orthogonal Mathcing Pursuit (OMP). . .

How is this related to multilabel classification?

Page 42: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Label Space Reduction by Compressed Sensing

Proposed in Hsu et al., “Multi-Label Prediction via CompressedSensing”. In NIPS 2009.

Main idea: reduce the label space from RL to Rk , where k L

zi = Myi where M ∈ Rk×L

If we can recover yi by knowing zi and M, then:

we only need to learn a function to map xi to zi :

f (xi ) ≈ zi

By compressed sensing:

yi can be recovered given xi and M even when k = O(s log L)s is the number of nonzeroes in yi : usually very small in practice

Page 43: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Label Space Reduction by Compressed Sensing

Proposed in Hsu et al., “Multi-Label Prediction via CompressedSensing”. In NIPS 2009.

Main idea: reduce the label space from RL to Rk , where k L

zi = Myi where M ∈ Rk×L

If we can recover yi by knowing zi and M, then:

we only need to learn a function to map xi to zi :

f (xi ) ≈ zi

By compressed sensing:

yi can be recovered given xi and M even when k = O(s log L)s is the number of nonzeroes in yi : usually very small in practice

Page 44: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Label Space Reduction by Compressed Sensing

Training:

Step 1: Construct M by i.i.d. Gaussian

Step 2: Compute zi = Myi for all i = 1, 2, . . . , n

Step 3: Learn a function f such that

f (xi ) ≈ zi

Prediction for a test sample xStep 1: compute z = f (x)

Step 2: compute y ∈ RL by solving a Lasso problem

Step 3: Threshold y to give the prediction

Page 45: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Label Space Reduction by Compressed Sensing

Reduce the label size from L to O(s log L)

Drawbacks:1 Slow prediction time (need to solve a Lasso problem for every prediction)2 Large error in practice

Page 46: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Another way to understand this algorithm

X ∈ Rn×d : data matrix, each row is a data point xiY ∈ Rn×L: label matrix, each row is yi

M ∈ RL×k : Gaussian random matrix

Goal: find a U matrix such that XU ≈ YM

Page 47: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Another way to understand this algorithm

X ∈ Rn×d : data matrix, each row is a data point xiY ∈ Rn×L: label matrix, each row is yi

M ∈ RL×k : Gaussian random matrix

M†: psuedo inverse of M

Goal: find a U matrix such that XUM† ≈ YMM†

Page 48: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Feature space reduction

Another way to improve speed: reduce the feature space

Dimension reduction from d dimensional space to k dimensional space

xi → Nxi = xi ,

where N ∈ Rk×d and k d

Matrix form:X = XNT

Multilabel learning on the reduced feature space:

Learn V such thatXV ≈ Y

Time complexity: reduced by a factor of n/k

Several ways to choose N: PCA, random projection, . . .

Page 49: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Another way to understand this algorithm

X ∈ Rn×d : data matrix, each row is a data point xiY ∈ Rn×L: label matrix, each row is yi

N ∈ Rk×d : matrix for dimensional reduction

Goal: find a V matrix such that XNTV ≈ Y

Page 50: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Supervised latent space model (by matrix factorization)

Label space reduction: find U such that

XUV T ≈ Y

Feature space reduction: find V such that

XUV T ≈ Y

How to improve?

Find best U and V simultaneously

Proposed in

Chen and Lin, “Feature-aware label space dimension reduction formulti-label classification”. In NIPS 2012.Xu et al., “Speedup Matrix Completion with Side Information:Application to Multi-Label Learning”. In NIPS 2013.Yu et al., “Large-scale Multi-label Learning with Missing Labels”. InICML 2014.

Page 51: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Low Rank Modeling for Multilabel Classification

Let X ∈ Rn×d be the data matrix, Y ∈ Rn×L be the 0-1 label matrix

Find U ∈ Rd×k and V ∈ RL×k such that

Y ≈ XUV T

Obtain U,V by solving the following optimization problem:

minU∈Rd×k ,V∈RL×k

‖Y − XUV T‖2F + λ1‖U‖2

F + λ2‖V ‖2F

where λ1, λ2 are the regularization parameters

Solve by alternating minimization.

Page 52: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Low Rank Modeling for Multilabel Classification

Time complexity:

O(nkL) for updating V

O(ndk) for updating U

Overall: O(nkL + ndk) per iteration

Original time complexity: O(ndL) per iteration

Page 53: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Extensions

Extend to general loss function:

minU∈Rd×k ,V∈RL×k

∑i ,j

dis(Yij , (XUVT )ij) + λ1‖U‖2

F + λ2‖V ‖2F

Can handle problems with missing data:

minU∈Rd×k ,V∈RL×k

∑i ,j∈Ω

dis(Yij , (XUVT )ij) + λ1‖U‖2

F + λ2‖V ‖2F

Is it similar to inductive matrix completion?

What’s the time complexity for prediction?

Page 54: ECS289: Scalable Machine Learningchohsieh/teaching/ECS289G... · ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015

Coming up

Next class: paper presentations on mutlilabel classification and matrixcompletion

Questions?