
Big Data is Low Rank

Madeleine Udell

Operations Research and Information Engineering, Cornell University

UC Davis, 11/5/2018

1 / 40

Data table

age   gender   state   income     education     · · ·
29    F        CT      $53,000    college       · · ·
57    ?        NY      $19,000    high school   · · ·
?     M        CA      $102,000   masters       · · ·
41    F        NV      $23,000    ?             · · ·
...   ...      ...     ...        ...

- detect demographic groups?
- find typical responses?
- identify related features?
- impute missing entries?

2 / 40

Data table

m examples (patients, respondents, households, assets)
n features (tests, questions, sensors, times)

        ⎡ A11  · · ·  A1n ⎤
    A = ⎢  ⋮     ⋱     ⋮  ⎥
        ⎣ Am1  · · ·  Amn ⎦

- ith row of A is the feature vector for the ith example
- jth column of A gives the values of the jth feature across all examples

3 / 40

Low rank model

given: m × n data table A, k ≪ m, n
find: X ∈ Rm×k, Y ∈ Rk×n for which

    X Y ≈ A,   i.e.,   xi yj ≈ Aij,

where

        ⎡ — x1 — ⎤              ⎡ |           | ⎤
    X = ⎢    ⋮   ⎥,        Y =  ⎢ y1  · · ·  yn ⎥
        ⎣ — xm — ⎦              ⎣ |           | ⎦

interpretation:
- X and Y are a (compressed) representation of A
- xiᵀ ∈ Rk is a point associated with example i
- yj ∈ Rk is a point associated with feature j
- the inner product xi yj approximates Aij
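As a concrete illustration of this notation (a hypothetical Julia snippet, not from the talk): build a rank-k matrix from factors X and Y and check that each entry is the inner product of a row of X with a column of Y.

using LinearAlgebra, Random

Random.seed!(0)
m, n, k = 6, 5, 2
X = randn(m, k)          # one length-k vector per example (rows of X)
Y = randn(k, n)          # one length-k vector per feature (columns of Y)
A = X * Y                # a rank-k matrix with Aij = xi ⋅ yj

i, j = 3, 4
@assert isapprox(A[i, j], dot(X[i, :], Y[:, j]))
@assert rank(A) == k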

4 / 40

Why?

- reduce storage; speed transmission
- understand (visualize, cluster)
- remove noise
- infer missing data
- simplify data processing

5 / 40

Outline

PCA

Generalized low rank models

Applications

Impute missing data

Dimensionality reduction

Causal inference

Automatic machine learning

Why low rank?

6 / 40

Principal components analysis

PCA: for A ∈ Rm×n,

minimize   ‖A − XY‖F² = ∑_{i=1}^{m} ∑_{j=1}^{n} (Aij − xi yj)²

with variables X ∈ Rm×k , Y ∈ Rk×n

- old roots [Pearson 1901, Hotelling 1933]
- least squares low rank fitting
- (analytical) solution via the SVD of A = UΣVᵀ (see the sketch below)
- (numerical) solution via alternating minimization
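A minimal Julia sketch of the SVD route (illustrative; the data and rank below are made up): keep the top k singular triples of A to get the best rank-k fit in Frobenius norm.

using LinearAlgebra, Random

# Best rank-k approximation of A in Frobenius norm, via the (truncated) SVD.
function pca_factors(A::AbstractMatrix, k::Int)
    F = svd(A)                                 # A = U * Diagonal(S) * V'
    X = F.U[:, 1:k] * Diagonal(F.S[1:k])       # m × k factor (one row per example)
    Y = F.V[:, 1:k]'                           # k × n factor (one column per feature)
    return X, Y
end

Random.seed!(0)
A = randn(100, 5) * randn(5, 20) + 0.01 * randn(100, 20)   # nearly rank-5 data table
X, Y = pca_factors(A, 5)
@show norm(A - X * Y) / norm(A)                # relative Frobenius error (small)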

7 / 40

Outline

PCA

Generalized low rank models

Applications

Impute missing data

Dimensionality reduction

Causal inference

Automatic machine learning

Why low rank?

8 / 40

Generalized low rank model

minimize   ∑_{(i,j)∈Ω} Lj(xi yj, Aij) + ∑_{i=1}^{m} ri(xi) + ∑_{j=1}^{n} r̃j(yj)

- loss functions Lj for each column
  - e.g., different losses for reals, booleans, categoricals, ordinals, . . .
- regularizers r : R1×k → R, r̃ : Rk → R
- observe only (i, j) ∈ Ω (other entries are missing)

Note: can be NP-hard to optimize exactly. . .

9 / 40

Matrix completion

observe Aij only for (i, j) ∈ Ω ⊂ {1, . . . , m} × {1, . . . , n}

minimize   ∑_{(i,j)∈Ω} (Aij − xi yj)² + λ ∑_{i=1}^{m} ‖xi‖₂² + λ ∑_{j=1}^{n} ‖yj‖₂²

two regimes:
- some entries missing: don't waste data; "borrow strength" from the entries that are not missing
- most entries missing: matrix completion still works!

Theorem ([Keshavan Montanari 2010])

If A has rank k′ ≤ k and |Ω| = O(nk′ log n) (and A is incoherent and Ω is chosen UAR), then matrix completion exactly recovers the matrix A with high probability.
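A small Julia sketch of one standard way to fit this objective, alternating regularized least squares over the observed entries (illustrative only; the data, rank, and λ below are placeholders):

using LinearAlgebra, Random

# Alternating least squares for quadratically regularized matrix completion.
# obs marks observed entries; unobserved entries of A are never touched.
function als_complete(A, obs, k; λ = 0.1, iters = 50)
    m, n = size(A)
    X, Y = randn(m, k), randn(k, n)
    for _ in 1:iters
        for i in 1:m                           # update row factor xi
            J = findall(obs[i, :])
            G = Y[:, J]
            X[i, :] = (G * G' + λ * I) \ (G * A[i, J])
        end
        for j in 1:n                           # update column factor yj
            Iobs = findall(obs[:, j])
            G = X[Iobs, :]
            Y[:, j] = (G' * G + λ * I) \ (G' * A[Iobs, j])
        end
    end
    return X, Y
end

Random.seed!(1)
Atrue = randn(60, 3) * randn(3, 40)            # rank-3 ground truth
obs = rand(60, 40) .< 0.3                      # observe ~30% of entries
X, Y = als_complete(Atrue, obs, 3)
@show norm(X * Y - Atrue) / norm(Atrue)        # error on all entries, incl. unobserved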

10 / 40

Maximum likelihood low rank estimation

Choose the loss function to maximize the (log) likelihood of the observations:

- gaussian noise: L(u, a) = (u − a)²
- laplacian (heavy-tailed) noise: L(u, a) = |u − a|
- gaussian + laplacian noise: L(u, a) = huber(u − a)
- poisson (count) noise: L(u, a) = exp(u) − au + a log a − a
- bernoulli (coin toss) noise: L(u, a) = log(1 + exp(−au))
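For concreteness, a hedged Julia sketch of these losses as plain functions (u is the model estimate xi yj, a the observed entry; the huber threshold of 1 is an arbitrary choice):

quadloss(u, a)     = (u - a)^2                        # gaussian noise
absloss(u, a)      = abs(u - a)                       # laplacian noise
huber(r; δ = 1.0)  = abs(r) <= δ ? r^2 / 2 : δ * (abs(r) - δ / 2)
huberloss(u, a)    = huber(u - a)                     # gaussian + laplacian noise
poissonloss(u, a)  = exp(u) - a * u + (a > 0 ? a * log(a) - a : 0.0)  # poisson (u = log rate)
logisticloss(u, a) = log(1 + exp(-a * u))             # bernoulli noise, a ∈ {-1, +1}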

11 / 40

Maximum likelihood low rank estimation works

Theorem (Template)

If |Ω| = O(n log n) samples, drawn UAR from the matrix entries, are observed according to a probabilistic model with parameter Z, then the solution of (appropriately) regularized maximum likelihood estimation is close to the true Z with high probability.

examples (not exhaustive!):

- additive gaussian noise [Candes Plan 2009]
- additive subgaussian noise [Keshavan Montanari Oh 2009]
- gaussian + laplacian noise [Xu Caramanis Sanghavi 2012]
- 0-1 (Bernoulli) observations [Davenport et al. 2012]
- entrywise exponential family distribution [Gunasekar Ravikumar Ghosh 2014]
- multinomial logit [Kallus Udell 2016]

12 / 40

Losses

minimize   ∑_{(i,j)∈Ω} Lj(xi yj, Aij) + ∑_{i=1}^{m} ri(xi) + ∑_{j=1}^{n} r̃j(yj)

choose loss L : R × F → R adapted to the data type F:

data type     loss                L(u, a)
real          quadratic           (u − a)²
real          absolute value      |u − a|
real          huber               huber(u − a)
boolean       hinge               (1 − ua)₊
boolean       logistic            log(1 + exp(−au))
integer       poisson             exp(u) − au + a log a − a
ordinal       ordinal hinge       ∑_{a′=1}^{a−1} (1 − u + a′)₊ + ∑_{a′=a+1}^{d} (1 + u − a′)₊
categorical   one-vs-all          (1 − ua)₊ + ∑_{a′≠a} (1 + ua′)₊
categorical   multinomial logit   −log( exp(ua) / ∑_{a′=1}^{d} exp(ua′) )

13 / 40

Regularizers

minimize   ∑_{(i,j)∈Ω} Lj(xi yj, Aij) + ∑_{i=1}^{m} ri(xi) + ∑_{j=1}^{n} r̃j(yj)

choose regularizers r, r̃ to impose structure:

structure     r(x)             r̃(y)
small         ‖x‖₂²            ‖y‖₂²
sparse        ‖x‖₁             ‖y‖₁
nonnegative   1(x ≥ 0)         1(y ≥ 0)
clustered     1(card(x) = 1)   0
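Each of these regularizers is easy to handle inside an alternating fit via its proximal operator (or projection). A hedged Julia sketch, not taken from the talk (v stands for one row of X or one column of Y, t for a step size):

prox_quad(v, t)   = v ./ (1 + 2t)                       # r(x) = ‖x‖₂²: shrink toward 0
prox_l1(v, t)     = sign.(v) .* max.(abs.(v) .- t, 0)   # r(x) = ‖x‖₁: soft-thresholding
prox_nonneg(v, t) = max.(v, 0)                          # r(x) = 1(x ≥ 0): projection
function prox_cluster(v, t)                             # r(x) = 1(card(x) = 1): keep only
    w = zero(v)                                         # the largest-magnitude entry
    i = argmax(abs.(v))
    w[i] = v[i]
    return w
end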

14 / 40

Outline

PCA

Generalized low rank models

Applications

Impute missing data

Dimensionality reduction

Causal inference

Automatic machine learning

Why low rank?

15 / 40

Impute missing data

impute the most likely true data Âij:

    Âij = argmin_a Lj(xi yj, a)

- implicit constraint: Âij ∈ Fj
- when Lj is the quadratic, ℓ1, or huber loss, Âij = xi yj
- if Fj ≠ R, argmin_a Lj(xi yj, a) ≠ xi yj in general
- e.g., for the hinge loss L(u, a) = (1 − ua)₊, Âij = sign(xi yj)
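A hedged Julia sketch of this imputation rule for a few data types (the column types and levels below are assumptions for illustration; u stands for xi yj):

impute_real(u)            = u                                  # quadratic, ℓ1, or huber loss
impute_bool(u)            = u >= 0 ? 1 : -1                    # hinge or logistic loss: sign(u)
impute_ordinal(u, levels) = levels[argmin(abs.(levels .- u))]  # ordinal hinge: nearest level

u = 2.3
@show impute_real(u) impute_bool(u) impute_ordinal(u, 1:5)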

16 / 40

Impute heterogeneous data

[Figure: heatmaps of a synthetic table with mixed data types; entries are removed, then recovered at rank 10 by quadratically regularized PCA (qpca) and by a GLRM, each shown with its error map.]

17 / 40


Hospitalizations are low rank

hospitalization data set

GLRM outperforms PCA

[Schuler Udell et al., 2016]

18 / 40

American community survey

2013 ACS:

- 3M respondents, 87 economic/demographic survey questions
  - income
  - cost of utilities (water, gas, electric)
  - weeks worked per year
  - hours worked per week
  - home ownership
  - looking for work
  - use of food stamps
  - education level
  - state of residence
  - . . .
- 1/3 of responses missing

19 / 40

Fitting a GLRM to the ACS

- construct a rank 10 GLRM with loss functions respecting data types
  - huber for real values
  - hinge loss for booleans
  - ordinal hinge loss for ordinals
  - one-vs-all hinge loss for categoricals
- scale losses and regularizers by 1/σj²
- fit the GLRM

in 2 lines of code:

glrm, labels = GLRM(A, 10, scale = true)   # rank 10 model; losses chosen per column data type, scaled by 1/σj²
X, Y = fit!(glrm)                          # fit the model; returns the factors X and Y

20 / 40

American community survey

most similar features (in demography space):

- Alaska: Montana, North Dakota
- California: Illinois, cost of water
- Colorado: Oregon, Idaho
- Ohio: Indiana, Michigan
- Pennsylvania: Massachusetts, New Jersey
- Virginia: Maryland, Connecticut
- Hours worked: weeks worked, education

21 / 40

Low rank models for dimensionality reduction¹

U.S. Wage & Hour Division (WHD) compliance actions:

company                 zip     violations   · · ·
Holiday Inn             14850   109          · · ·
Moosewood Restaurant    14850   0            · · ·
Cornell Orchards        14850   0            · · ·
Lakeside Nursing Home   14850   53           · · ·
...                     ...     ...

- 208,806 rows (cases) × 252 columns (violation info)
- 32,989 zip codes. . .

¹ labor law violation demo: https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.census.labor.violations.large.R

22 / 40

Low rank models for dimensionality reduction

ACS demographic data:

zip     unemployment   mean income   · · ·
94305   12%            $47,000       · · ·
06511   19%            $32,000       · · ·
60647   23%            $23,000       · · ·
94121   4%             $178,000      · · ·
...     ...            ...

- 32,989 rows (zip codes) × 150 columns (demographic info)
- GLRM embeds zip codes into (low dimensional) demography space

23 / 40

Low rank models for dimensionality reduction

Zip code features:

24 / 40

Low rank models for dimensionality reduction

build 3 sets of features to predict violations:

- categorical: expand zip code to a categorical variable
- concatenate: join tables on zip
- GLRM: replace zip code by low dimensional zip code features

fit a supervised (deep learning) model:

method        train error   test error   runtime
categorical   0.2091690     0.2173612    23.76
concatenate   0.2258872     0.2515906    4.47
GLRM          0.1790884     0.1933637    4.36

25 / 40

Causal inference with messy data

Income   College
5500     1
3000     0
5800     1
3300     0
4000     0

Gender   State   Age   IQ
M        CT      29    110
?        NY      57    ?
F        ?       ?     108
F        CA      41    99
?        NV      50    ?

[Diagram: fit a low rank model to the noisy, partially observed covariates, then estimate the treatment effect of college.]

[Kallus Mao Udell, 2018]

26 / 40

Matrix completion reduces error

[Figure: relative RMSE vs. p for binary covariates in high dimension, comparing Lasso, Ridge, OLS, MF (Proposal), and Oracle; and RMSE vs. number of proxies on the Twins data, comparing Match, LR, DR, and PS Match with and without matrix factorization (MF).]

[Kallus Mao Udell, 2018]

27 / 40

Low rank method for automatic machine learning

- yellow blocks are fully observed: results of each algorithm on each training dataset
- last row is a new dataset
- blue blocks are observed results of algorithms on the new dataset
- white blocks are unknown entries
- fit a low rank model to impute the white blocks

[Yang Akimoto Udell, 2018]

28 / 40

Low rank fit correctly identifies best algorithm type

[Yang Akimoto Udell, 2018]

29 / 40

Experiment design for timely model selection

Which algorithms to use to predict performance?

maximize (over v)   log det( ∑_{j=1}^{n} vj yj yjᵀ )
subject to          ∑_{j=1}^{n} vj tj ≤ τ
                    vj ∈ [0, 1]   ∀ j ∈ [n]

- tj: estimated runtime of each machine learning model
- τ: runtime budget

[Yang Akimoto Udell, 2018]
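One simple way to approximate this selection problem (a hypothetical heuristic, not the method of the paper) is greedy: repeatedly add the model whose feature vector yj gives the largest increase in log det per unit of estimated runtime, while the budget allows. A hedged Julia sketch:

using LinearAlgebra

# Y: k × n matrix whose columns are the yj; t: runtime estimates; τ: runtime budget.
function greedy_design(Y, t, τ; ridge = 1e-6)
    k, n = size(Y)
    chosen = Int[]
    M = ridge * Matrix(I, k, k)               # small ridge keeps log det finite
    budget = τ
    while true
        best, best_gain = 0, 0.0
        for j in 1:n
            (j in chosen || t[j] > budget) && continue
            gain = (logdet(M + Y[:, j] * Y[:, j]') - logdet(M)) / t[j]
            if gain > best_gain
                best, best_gain = j, gain
            end
        end
        best == 0 && break
        M += Y[:, best] * Y[:, best]'
        budget -= t[best]
        push!(chosen, best)
    end
    return chosen
end

Y = randn(5, 20); t = rand(20) .+ 0.1
@show greedy_design(Y, t, 1.0)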

30 / 40

Oboe: Time-constrained AutoML

[Figure: balanced error rate vs. runtime budget (s) on (a) OpenML and (b) UCI, and average rank vs. runtime budget on (c) OpenML and (d) UCI, comparing Oboe, auto-sklearn, and random. In (a) and (b), the shaded area spans the 25th–75th percentile; in (c) and (d), rank 1 is best and 3 is worst.]

[Yang Akimoto Udell, 2018]

31 / 40

Outline

PCA

Generalized low rank models

Applications

Impute missing data

Dimensionality reduction

Causal inference

Automatic machine learning

Why low rank?

32 / 40

Latent variable models

Suppose A ∈ Rm×n is generated by a latent variable model (LVM):

- αi ∼ 𝒜 iid, i = 1, . . . , m
- βj ∼ ℬ iid, j = 1, . . . , n
- Aij = g(αi, βj)

33 / 40

Latent variable models: examples

inner product:
- αi ∼ 𝒜 ⊆ Rk iid, i = 1, . . . , m
- βj ∼ ℬ ⊆ Rk iid, j = 1, . . . , n
- Aij = αiᵀ βj
- rank of A?  k

univariate:
- αi ∼ Unif(0, 1) iid, i = 1, . . . , m
- βj ∼ Unif(0, 1) iid, j = 1, . . . , n
- Aij = g(αi, βj)
- rank of A?
  - can be large!
  - but if g is analytic with sup_{x∈R} |g⁽ᵈ⁾(x)| ≤ M, an entrywise ε-approximation to A has rank O(log(1/ε))

34 / 40

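A quick numerical illustration of the univariate case above (not from the slides; the choice of g is arbitrary): draw αi, βj ∼ Unif(0, 1), apply an analytic g entrywise, and count how many singular values the truncated SVD needs before the entrywise error drops below ε. (The truncated SVD only gives an upper bound on the ε-rank, but the point, that very few terms suffice, still shows.)

using LinearAlgebra, Random

Random.seed!(0)
m, n = 500, 400
α, β = rand(m), rand(n)
g(a, b) = exp(a * b) / (1 + a + b)                 # some analytic latent variable model
A = [g(α[i], β[j]) for i in 1:m, j in 1:n]

F = svd(A)
ε = 1e-3
for r in 1:min(m, n)
    Ar = F.U[:, 1:r] * Diagonal(F.S[1:r]) * F.Vt[1:r, :]
    if maximum(abs.(A - Ar)) <= ε                  # entrywise (sup-norm) error
        println("ε-rank at ε = $ε is at most $r (min(m, n) = $(min(m, n)))")
        break
    end
end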

Nice latent variable models

We say an LVM is nice if

- the distributions 𝒜 and ℬ have bounded support
- g is piecewise analytic, and on each piece, for some M ∈ R,

      ‖D^µ g(α, β)‖ ≤ C M^|µ| ‖g‖.

(‖g‖ = sup_{x ∈ dom g} |g(x)| is the sup norm.)

Examples: g(α, β) = poly(α, β) or g(α, β) = exp(poly(α, β))

35 / 40

Rank of nice latent variable models?

Question: Suppose A ∈ Rm×n is drawn from a nice LVM. How does the rank of an ε-approximation to A change with m and n?

    minimize    Rank(X)
    subject to  ‖X − A‖∞ ≤ ε

Answer: the rank grows as O(log(m + n)/ε²)

Theorem (Udell and Townsend, 2017)

Nice latent variable models are of log rank.

36 / 40


Proof sketch

- For each α, expand g around β = 0 in its Taylor series:

      g(α, β) − g(α, 0) = ⟨∇g(α, 0), β⟩ + ⟨∇²g(α, 0), ββᵀ⟩ + · · ·

                        = [ ∇g(α, 0); vec(∇²g(α, 0)); · · · ]ᵀ [ β; vec(ββᵀ); · · · ]

  i.e., collect the terms depending on α into one long vector and the terms depending on β into another
- Notice: this is a very high dimensional inner product!
- apply the Johnson–Lindenstrauss lemma to reduce the dimension

37 / 40

Johnson Lindenstrauss Lemma

Lemma (The Johnson–Lindenstrauss Lemma)

Consider x1, . . . , xn ∈ RN. Pick 0 < ε < 1 and set r = ⌈8(log n)/ε²⌉. There is a linear map Q : RN → Rr such that for all 1 ≤ i, j ≤ n,

    (1 − ε)‖xi − xj‖² ≤ ‖Q(xi − xj)‖² ≤ (1 + ε)‖xi − xj‖².

(Here, ⌈a⌉ is the smallest integer larger than a.)

Lemma (Variant of the Johnson–Lindenstrauss Lemma)

Under the same assumptions, there is a linear map Q : RN → Rr such that for all 1 ≤ i, j ≤ n,

    |xiᵀxj − xiᵀQᵀQxj| ≤ ε(‖xi‖² + ‖xj‖² − xiᵀxj).
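A small Julia check of the lemma with a random Gaussian map (illustrative; the dimensions and ε are arbitrary):

using LinearAlgebra, Random

Random.seed!(0)
N, n, ε = 10_000, 100, 0.25
r = ceil(Int, 8 * log(n) / ε^2)
X = randn(N, n)                          # columns are the points x1, . . . , xn
Q = randn(r, N) / sqrt(r)                # random Gaussian map with E‖Qx‖² = ‖x‖²
QX = Q * X                               # project all points at once

# check how well pairwise squared distances are preserved
ratios = [norm(QX[:, i] - QX[:, j])^2 / norm(X[:, i] - X[:, j])^2
          for i in 1:n for j in (i+1):n]
@show r extrema(ratios)                  # typically within (1 − ε, 1 + ε)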

38 / 40


Summary

big data is low rank

- in social science
- in medicine
- in machine learning

we can exploit low rank to

- fill in missing data
- embed data in a vector space
- infer causality
- choose machine learning models

39 / 40

References

- Generalized Low Rank Models. M. Udell, C. Horn, R. Zadeh, and S. Boyd. Foundations and Trends in Machine Learning, 2016.
- Revealed Preference at Scale: Learning Personalized Preferences from Assortment Choices. N. Kallus and M. Udell. EC 2016.
- Discovering Patient Phenotypes Using Generalized Low Rank Models. A. Schuler, V. Liu, J. Wan, A. Callahan, M. Udell, D. Stark, and N. Shah. Pacific Symposium on Biocomputing (PSB), 2016.
- Low rank models for high dimensional categorical variables: labor law demo. A. Fu and M. Udell. https://github.com/h2oai/h2o-3/blob/master/h2o-r/demos/rdemo.census.labor.violations.large.R
- Causal Inference with Noisy and Missing Covariates via Matrix Factorization. N. Kallus, X. Mao, and M. Udell. NIPS 2018.
- OBOE: Collaborative Filtering for AutoML Initialization. C. Yang, Y. Akimoto, D. Kim, and M. Udell.
- Why are Big Data Matrices Approximately Low Rank? M. Udell and A. Townsend. SIMODS, to appear.

40 / 40