Animashree Anandkumar, Electrical Engineering and CS Dept, UC Irvine at MLconf SF - 11/13/15


Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods

Anima Anandkumar

U.C. Irvine


Learning with Big Data

Learning is finding a needle in a haystack

High dimensional regime: as data grows, more variables!

Useful information: low-dimensional structures.

Learning with big data: ill-posed problem.

Learning with big data: statistically and computationally challenging!

Optimization for Learning

Most learning problems can be cast as optimization.

Unsupervised Learning

Clustering: k-means, hierarchical, . . .

Maximum Likelihood Estimator: probabilistic latent variable models

Supervised Learning

Optimizing a neural network with respect to a loss function

[Figure: neural network with input, neuron, and output]


Convex vs. Non-convex Optimization

Progress is only the tip of the iceberg. The real world is mostly non-convex!

Images taken from https://www.facebook.com/nonconvex


Convex vs. Nonconvex Optimization

Convex: unique optimum (local = global). Non-convex: multiple local optima.

In high dimensions, possibly exponentially many local optima.

How to deal with non-convexity?

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Applications of Tensor Methods

4 Conclusion

Matrices and Tensors as Data Structures

Modern data: multi-modal, multi-relational data.

Matrices: pairwise relations. Tensors: higher order relations.

Multi-modal data figure from Lise Getoor's slides.

Tensor Representation of Higher Order Moments

Matrix: Pairwise Moments

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor.

E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].

For matrices: E[x ⊗ x] = E[x x^⊤].

Tensor: Higher Order Moments

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor.

E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
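As a concrete illustration (not from the talk), here is a minimal NumPy sketch of the empirical second- and third-order moment tensors; the data matrix X is a placeholder.

```python
import numpy as np

# Minimal sketch: empirical estimates of the pairwise and third-order
# moment tensors from samples x_1, ..., x_n in R^d (placeholder data).
n, d = 1000, 10
X = np.random.randn(n, d)                     # rows are samples

M2 = np.einsum('ni,nj->ij', X, X) / n         # E[x ⊗ x],     shape (d, d)
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n  # E[x ⊗ x ⊗ x], shape (d, d, d)

assert M2.shape == (d, d) and M3.shape == (d, d, d)
```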


Spectral Decomposition of Tensors

M2 = Σ_i λ_i u_i ⊗ v_i

Matrix: M2 = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + · · ·

M3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i

Tensor: M3 = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + · · ·

We have developed efficient methods to solve tensor decomposition.

Tensor Decomposition Methods

SVD on pairwise moments to obtain W.

Fast SVD via random projection.

Dimensionality reduction of third-order moments using W.

Power method on the reduced tensor T: v ↦ T(I, v, v) / ‖T(I, v, v)‖.

Guaranteed to converge to correct solution!
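A minimal sketch of this power iteration on a synthetic orthogonally decomposable tensor is given below; the function name and the synthetic construction are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

# Tensor power iteration v -> T(I, v, v) / ||T(I, v, v)||, illustrated on a
# synthetic orthogonally decomposable tensor.
def power_iteration(T, n_iter=100):
    d = T.shape[0]
    v = np.random.randn(d)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = np.einsum('ijk,j,k->i', T, v, v)    # T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue estimate T(v, v, v)
    return lam, v

# Build T = sum_i lambda_i u_i ⊗ u_i ⊗ u_i with orthonormal components.
d, k = 8, 3
U, _ = np.linalg.qr(np.random.randn(d, k))
lams = np.array([3.0, 2.0, 1.0])
T = np.einsum('i,ai,bi,ci->abc', lams, U, U, U)

lam_hat, v_hat = power_iteration(T)
# v_hat should recover one of the columns of U (a robust eigenvector of T).
```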

Summary on Tensor Methods


Tensor decomposition: iterative algorithm with tensor contractions.

Tensor contraction: T(W_1, W_2, W_3), where the W_i are matrices or vectors.
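For concreteness, a minimal NumPy sketch of such a contraction (assuming the standard multilinear-map convention) is shown below; the shapes and variable names are illustrative.

```python
import numpy as np

# Multilinear contraction:
#   T(W1, W2, W3)_{abc} = sum_{ijk} T_{ijk} W1_{ia} W2_{jb} W3_{kc},
# which reduces to T(I, v, v) when W1 = I and W2 = W3 = v.
d = 8
T = np.random.randn(d, d, d)
W1 = np.random.randn(d, 4)
W2 = np.random.randn(d, 5)
W3 = np.random.randn(d, 6)

contracted = np.einsum('ijk,ia,jb,kc->abc', T, W1, W2, W3)  # shape (4, 5, 6)

# Special case used in the power method: a vector-valued contraction.
v = np.random.randn(d)
Tv = np.einsum('ijk,j,k->i', T, v, v)
```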

Strengths of tensor methods

Guaranteed convergence to globally optimal solution!

Fast and accurate, orders of magnitude faster than previous methods.

Embarrassingly parallel and suited for distributed systems, e.g., Spark.

Exploit optimized BLAS operations on CPUs and GPUs.

Preliminary Results on Spark

Alternating Least Squares for Tensor Decomposition.

min_{w,A,B,C} ‖ T − Σ_{i=1}^{k} λ_i A(:, i) ⊗ B(:, i) ⊗ C(:, i) ‖²_F
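Below is a minimal single-machine sketch of the ALS updates for this objective; the Spark implementation in the linked repository distributes the row-wise updates, and the helper name als_step is my own.

```python
import numpy as np

# One ALS update: with B, C fixed, each row of A is an independent
# k-dimensional least-squares problem (the basis of the distributed version).
def als_step(T, A, B, C):
    # A <- T_(1) (C ⊙ B) [(B^T B) * (C^T C)]^{-1}
    M = np.einsum('ijk,jr,kr->ir', T, B, C)   # MTTKRP, shape (d, k)
    G = (B.T @ B) * (C.T @ C)                 # Gram matrix, shape (k, k)
    return np.linalg.solve(G, M.T).T

d, k = 20, 5
T = np.random.randn(d, d, d)
A = np.random.randn(d, k); B = np.random.randn(d, k); C = np.random.randn(d, k)

for _ in range(20):                           # cycle over the three factors
    A = als_step(T, A, B, C)
    B = als_step(T.transpose(1, 0, 2), B, A, C)
    C = als_step(T.transpose(2, 0, 1), C, A, B)

lam = np.linalg.norm(A, axis=0)               # weights λ_i after normalization
A_normalized = A / lam
```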

Update rows independently.

[Figure: tensor slices, together with factors B and C, distributed across workers 1, …, k]

Results on the NYTimes corpus (3 × 10^5 documents, 10^8 words):

Spark: 26 mins. MapReduce: 4 hrs.

https://github.com/FurongHuang/SpectralLDA-TensorSpark

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Applications of Tensor Methods

4 Conclusion

Tensor Methods for Learning LVMs

[Figure: latent variable models handled by tensor methods: GMM; HMM with hidden states h1, h2, h3 and observations x1, x2, x3; ICA with sources h1, …, hk and observations x1, …, xd; multiview and topic models]


Overview of Our Approach

Tensor-based Model Estimation

Unlabeled data → probabilistic admixture models → tensor decomposition → inference.

Tensor decomposition → correct model.

Highly parallelized, CPU/GPU/Cloud implementation.

Guaranteed global convergence for Tensor Decomposition.

Community Detection on Yelp and DBLP datasets

Facebook: n ∼ 20,000. Yelp: n ∼ 40,000. DBLP: n ∼ 1 million.

Tensor vs. variational inference.

[Figure: running time in seconds (log scale, 10^0 to 10^5) for the tensor and variational methods on the FB, YP, DBLP-sub, and DBLP datasets]

Tensor Methods for Training Neural Networks

Tremendous practical impact with deep learning.

Algorithm: backpropagation.

Highly non-convex optimization.

Toy Example: Failure of Backpropagation

[Figure: labeled input samples in the (x1, x2) plane with labels y = 1 and y = −1; goal: binary classification. Network: inputs x1, x2 pass through σ(·) units with weights w1, w2 to produce output y]

Our method: guaranteed risk bounds for training neural networks



Backpropagation vs. Our Method

Weights w2 are randomly drawn and fixed.

[Figure: backprop (quadratic) loss surface over w1(1) and w1(2)]

[Figure: loss surface for our method over w1(1) and w1(2)]


Feature Transformation for Training Neural Networks

Feature learning: Learn φ(·) from input data.

How to use φ(·) to train neural networks?

[Figure: pipeline x → φ(x) → y]

Invert E[φ(x)⊗ y] to Learn Neural Network Parameters

In general, E[φ(x)⊗ y] is a tensor.

What class of φ(·) is useful for training neural networks?

What algorithms can invert moment tensors to obtain parameters?

Score Function Transformations

Score function for x ∈ R^d with pdf p(·):

S_1(x) := −∇_x log p(x)

mth-order score function:

S_m(x) := (−1)^m ∇^(m) p(x) / p(x)

Input x ∈ R^d yields S_1(x) ∈ R^d, S_2(x) ∈ R^{d×d}, S_3(x) ∈ R^{d×d×d}.
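As an illustration (not from the talk), the score functions have a closed form when the input is Gaussian; the sketch below assumes x ~ N(mu, Sigma) and computes S_1 and S_2. For general densities, S_m would have to be estimated, e.g. from a fitted density model.

```python
import numpy as np

# Closed-form first- and second-order score functions for x ~ N(mu, Sigma).
def gaussian_scores(x, mu, Sigma):
    P = np.linalg.inv(Sigma)
    r = P @ (x - mu)
    S1 = r                   # S1(x) = -grad log p(x) = Sigma^{-1} (x - mu)
    S2 = np.outer(r, r) - P  # S2(x) = grad^2 p(x) / p(x)
    return S1, S2

d = 5
mu, Sigma = np.zeros(d), np.eye(d)
x = np.random.randn(d)
S1, S2 = gaussian_scores(x, mu, Sigma)
```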


NN-LiFT: Neural Network LearnIng using Feature Tensors

Input x ∈ R^d, score function S_3(x) ∈ R^{d×d×d}.

Estimate the cross-moment M3 using labeled data {(x_i, y_i)}:

M3 = (1/n) Σ_{i=1}^{n} y_i ⊗ S_3(x_i)

CP tensor decomposition of M3: the rank-1 components are the estimates of the first-layer weights.
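A minimal sketch of forming this cross-moment for scalar labels is shown below; the data and the helper name cross_moment are placeholders, and the rank-1 components could then be extracted with any CP decomposition routine (e.g., the power iteration sketched earlier).

```python
import numpy as np

# Form M3 = (1/n) * sum_i y_i * S3(x_i) for scalar labels y_i and
# precomputed third-order score tensors S3(x_i).
def cross_moment(Y, S3_list):
    # Y: (n,) labels; S3_list: iterable of (d, d, d) score tensors
    return sum(y * S3 for y, S3 in zip(Y, S3_list)) / len(Y)

n, d = 200, 5
Y = np.random.randn(n)                                   # placeholder labels
S3_list = [np.random.randn(d, d, d) for _ in range(n)]   # placeholder score tensors

M3 = cross_moment(Y, S3_list)
# lam, w_hat = power_iteration(M3)  # a rank-1 component estimates a first-layer weight
```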

Reinforcement Learning (RL) of POMDPs: Partially Observable Markov Decision Processes

Proposed Method

Consider memoryless policies. Episodic learning: indirect exploration.

Tensor methods: careful conditioning required for learning.

First RL method for POMDPs with logarithmic regret bounds.

[Figure: POMDP graphical model with hidden states x_i, observations y_i, rewards r_i, and actions a_i]

[Figure: average reward vs. number of trials (up to 7000) for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy]

Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azizzadenesheli, A. Lazaric, and A. Anandkumar, under preparation.

Convolutional Tensor Decomposition

[Figure: (a) convolutional dictionary model x = Σ_i f_i^* ∗ w_i^*; (b) reformulated model x = F^* w^*]

Cumulant = λ_1 (F_1^*)^{⊗3} + λ_2 (F_2^*)^{⊗3} + · · ·

Efficient methods for tensor decomposition with circulant constraints.

Convolutional Dictionary Learning through Tensor Factorization by F. Huang and A. Anandkumar, June 2015.
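As a small illustration of the reformulation (not the paper's algorithm), the sketch below checks that multiplying by a circulant matrix built from a filter f reproduces circular convolution, which is the identity behind rewriting x = f^* ∗ w^* as x = F^* w^*.

```python
import numpy as np

# Circulant matrix whose columns are circular shifts of the filter f,
# so that F @ w equals the circular convolution f ∗ w.
def circulant(f):
    n = len(f)
    return np.column_stack([np.roll(f, s) for s in range(n)])

n = 8
f = np.random.randn(n)        # filter f*
w = np.random.randn(n)        # activation/code w*

F = circulant(f)
x_matrix = F @ w                                              # x = F* w*
x_conv = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(w)))  # x = f* ∗ w* (circular)

assert np.allclose(x_matrix, x_conv)
```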

Hierarchical Tensor Decomposition

[Figure: hierarchical tensor decomposition, with tensor decompositions composed across multiple levels]

Relevant for learning latent tree graphical models.

We propose the first algorithm for integrated learning of the hierarchy and the components of the tensor decomposition.

Highly parallelized operations but with global consistency guarantees.

Scalable Latent Tree Model and its Application to Health Analytics by F. Huang, U. N. Niranjan, J. Perros, R. Chen, J. Sun, and A. Anandkumar, preprint, June 2014.

Outline

1 Introduction

2 Spectral Methods: Matrices and Tensors

3 Applications of Tensor Methods

4 Conclusion


Summary and Outlook

Summary

Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.

First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!

Outlook

Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks . . .

Unified framework for tractable non-convex methods with guaranteed convergence to global optima?

My Research Group and Resources

Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/