Big Data Analytics: Optimization and Randomization
Tianbao Yang†, Qihang Lin\, Rong Jin∗‡
Tutorial @ SIGKDD 2015, Sydney, Australia
†Department of Computer Science, The University of Iowa, IA, USA
\Department of Management Sciences, The University of Iowa, IA, USA
∗Department of Computer Science and Engineering, Michigan State University, MI, USA
‡Institute of Data Science and Technologies at Alibaba Group, Seattle, USA
August 10, 2015
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 1 / 234
URL
http://www.cs.uiowa.edu/~tyng/kdd15-tutorial.pdf
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 2 / 234
Some Claims
No
This tutorial is not an exhaustive literature survey
It is not a survey on different machine learning/data mining algorithms
Yes
It is about how to efficiently solve machine learning/data mining problems (formulated as optimization) for big data
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 3 / 234
Outline
Part I: Basics
Part II: Optimization
Part III: Randomization
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 4 / 234
Big Data Analytics: Optimization and Randomization
Part I: Basics
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 5 / 234
Basics Introduction
Outline
1 Basics
Introduction
Notations and Definitions
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 6 / 234
Basics Introduction
Three Steps for Machine Learning
Data → Model → Optimization
[Figure: distance to optimal objective vs. iterations, for convergence rates 0.5^T, 1/T^2, and 1/T]
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 7 / 234
Basics Introduction
Big Data Challenge
Big Data
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 8 / 234
Basics Introduction
Big Data Challenge
Big Model
60 million parameters
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 9 / 234
Basics Introduction
Learning as Optimization
Ridge Regression Problem:
min_{w∈R^d} (1/n) ∑_{i=1}^n (y_i − w^⊤x_i)²  [Empirical Loss]  +  (λ/2)‖w‖_2²  [Regularization]
x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 10 / 234
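A minimal NumPy sketch of the ridge regression problem above; the objective and the closed-form solution follow from setting the gradient to zero. The names X, y, lam are illustrative, not from the slides.

```python
# Hedged sketch: ridge regression objective and its closed-form minimizer.
import numpy as np

def ridge_objective(w, X, y, lam):
    """(1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2"""
    n = X.shape[0]
    residual = y - X @ w
    return residual @ residual / n + 0.5 * lam * (w @ w)

def ridge_closed_form(X, y, lam):
    """Setting the gradient to zero gives ((2/n) X^T X + lam*I) w = (2/n) X^T y."""
    n, d = X.shape
    A = 2.0 / n * (X.T @ X) + lam * np.eye(d)
    return np.linalg.solve(A, 2.0 / n * (X.T @ y))
```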
Basics Introduction
Learning as Optimization
Classification Problems:
min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(y_i w^⊤x_i) + (λ/2)‖w‖_2²
y_i ∈ {+1, −1}: label
Loss function ℓ(z), with z = y w^⊤x
1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2
2. Logistic Regression: ℓ(z) = log(1 + exp(−z))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 13 / 234
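A small sketch of the losses above as functions of the margin z = y w^⊤x; the function names are illustrative.

```python
# Hedged sketch of the classification losses listed on this slide.
import numpy as np

def hinge(z):            # max(0, 1 - z), non-smooth
    return np.maximum(0.0, 1.0 - z)

def squared_hinge(z):    # max(0, 1 - z)^2, smooth
    return np.maximum(0.0, 1.0 - z) ** 2

def logistic(z):         # log(1 + exp(-z)), smooth; log1p for numerical stability
    return np.log1p(np.exp(-z))
```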
Basics Introduction
Learning as Optimization
Feature Selection:
min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + λ‖w‖_1
ℓ_1 regularization: ‖w‖_1 = ∑_{i=1}^d |w_i|
λ controls the sparsity level
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 14 / 234
Basics Introduction
Learning as Optimization
Feature Selection using Elastic Net:
min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + λ(‖w‖_1 + γ‖w‖_2²)
Elastic net regularizer, more robust than the ℓ_1 regularizer
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 15 / 234
Basics Introduction
Learning as Optimization
Multi-class/Multi-task Learning:
min_W (1/n) ∑_{i=1}^n ℓ(Wx_i, y_i) + λ r(W),  W ∈ R^{K×d}
r(W) = ‖W‖_F² = ∑_{k=1}^K ∑_{j=1}^d W_{kj}²: Frobenius norm
r(W) = ‖W‖_* = ∑_i σ_i: nuclear norm (sum of singular values)
r(W) = ‖W‖_{1,∞} = ∑_{j=1}^d ‖W_{:j}‖_∞: ℓ_{1,∞} mixed norm
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 16 / 234
Basics Introduction
Learning as Optimization
Regularized Empirical Loss Minimization
min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)
Both ℓ and R are convex functions
Extensions to matrix cases are possible (sometimes straightforward)
Extensions to kernel methods can be combined with randomized approaches
Extensions to non-convex problems (e.g., deep learning) are in progress
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 17 / 234
Basics Introduction
Data Matrices and Machine Learning
The Instance-feature Matrix: X ∈ Rn×d
X = [x_1^⊤; x_2^⊤; · · · ; x_n^⊤]
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 18 / 234
Basics Introduction
Data Matrices and Machine Learning
The output vector: y = [y_1; y_2; · · · ; y_n] ∈ R^{n×1}
continuous y_i ∈ R: regression (e.g., house price)
discrete, e.g., y_i ∈ {1, 2, 3}: classification (e.g., species of iris)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 19 / 234
Basics Introduction
Data Matrices and Machine Learning
The Instance-Instance Matrix: K ∈ R^{n×n}
Similarity Matrix
Kernel Matrix
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 20 / 234
Basics Introduction
Data Matrices and Machine Learning
Some machine learning tasks are formulated on the kernel matrix
Clustering
Kernel Methods
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 21 / 234
Basics Introduction
Data Matrices and Machine Learning
The Feature-Feature Matrix: C ∈ Rd×d
Covariance Matrix
Distance Metric Matrix
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 22 / 234
Basics Introduction
Data Matrices and Machine Learning
Some machine learning tasks require the covariance matrix
Principal Component Analysis
Top-k Singular Value (Eigenvalue) Decomposition of the Covariance Matrix
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 23 / 234
Basics Introduction
Why is Learning from Big Data Challenging?
High per-iteration cost
High memory cost
High communication cost
Large iteration complexity
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 24 / 234
Basics Notations and Definitions
Outline
1 Basics
Introduction
Notations and Definitions
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 25 / 234
Basics Notations and Definitions
Norms
Vector x ∈ Rd
Euclidean vector norm: ‖x‖_2 = √(x^⊤x) = √(∑_{i=1}^d x_i²)
ℓ_p-norm of a vector: ‖x‖_p = (∑_{i=1}^d |x_i|^p)^{1/p}, where p ≥ 1
1. ℓ_2 norm: ‖x‖_2 = √(∑_{i=1}^d x_i²)
2. ℓ_1 norm: ‖x‖_1 = ∑_{i=1}^d |x_i|
3. ℓ_∞ norm: ‖x‖_∞ = max_i |x_i|
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 26 / 234
Basics Notations and Definitions
Matrix Factorization
Matrix X ∈ Rn×d
Singular Value Decomposition: X = UΣV^⊤
1. U ∈ R^{n×r}: orthonormal columns (U^⊤U = I): span the column space
2. Σ ∈ R^{r×r}: diagonal matrix, Σ_{ii} = σ_i > 0, σ_1 ≥ σ_2 ≥ · · · ≥ σ_r
3. V ∈ R^{d×r}: orthonormal columns (V^⊤V = I): span the row space
4. r ≤ min(n, d): largest value such that σ_r > 0: rank of X
5. U_kΣ_kV_k^⊤: top-k approximation
Pseudo-inverse: X^† = VΣ^{−1}U^⊤
QR factorization: X = QR (n ≥ d)
Q ∈ R^{n×d}: orthonormal columns
R ∈ R^{d×d}: upper triangular matrix
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 27 / 234
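A short NumPy sketch of the factorizations above (thin SVD, top-k approximation, pseudo-inverse, QR); the toy matrix and k are illustrative.

```python
# Hedged sketch: SVD, top-k approximation, pseudo-inverse and QR with NumPy.
import numpy as np

X = np.random.randn(100, 20)                        # toy n x d matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) V^T (thin SVD)
k = 5
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # top-k approximation U_k S_k V_k^T
X_pinv = Vt.T @ np.diag(1.0 / s) @ U.T              # pseudo-inverse V S^{-1} U^T (full-rank case)
Q, R = np.linalg.qr(X)                              # X = QR: Q orthonormal columns, R upper triangular
```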
Basics Notations and Definitions
Norms
Matrix X ∈ Rn×d
Frobenius norm: ‖X‖_F = √(tr(X^⊤X)) = √(∑_{i=1}^n ∑_{j=1}^d X_{ij}²)
Spectral norm (induced norm) of a matrix: ‖X‖_2 = max_{‖u‖_2=1} ‖Xu‖_2
‖X‖_2 = σ_1 (maximum singular value)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 28 / 234
Basics Notations and Definitions
Convex Optimization
min_{x∈X} f(x)
X is a convex domain: for any x, y ∈ X, their convex combination αx + (1 − α)y ∈ X
f(x) is a convex function
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 29 / 234
Basics Notations and Definitions
Convex Function
Characterization of Convex Function
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), ∀x, y ∈ X, α ∈ [0, 1]
f(x) ≥ f(y) + ∇f(y)^⊤(x − y), ∀x, y ∈ X
A local optimum is a global optimum
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 30 / 234
Basics Notations and Definitions
Convex vs Strongly Convex
Convex function:
f(x) ≥ f(y) + ∇f(y)^⊤(x − y), ∀x, y ∈ X
Strongly convex function:
f(x) ≥ f(y) + ∇f(y)^⊤(x − y) + (λ/2)‖x − y‖_2², ∀x, y ∈ X, where λ is the strong convexity constant
The global optimum is unique
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 31 / 234
Basics Notations and Definitions
Non-smooth function vs. Smooth function
Non-smooth function
Lipschitz continuous: e.g. absolute loss f(x) = |x|; |f(x) − f(y)| ≤ G‖x − y‖_2, where G is the Lipschitz constant
Subgradient: f(x) ≥ f(y) + ∂f(y)^⊤(x − y)
[Figure: |x| is non-smooth; a sub-gradient supports it at the kink]
Smooth function
e.g. logistic loss f(x) = log(1 + exp(−x))
‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2, where L is the smoothness constant
[Figure: log(1 + exp(−x)) with the tangent lower bound f(y) + f′(y)(x − y) and a quadratic function]
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 32 / 234
Basics Notations and Definitions
Next ...
min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)
Part II: Optimization
stochastic optimization
distributed optimization
Reduce Iteration Complexity: utilizing properties of functions
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 33 / 234
Basics Notations and Definitions
Next ...
Part III: Randomization
Classification, Regression
SVD, K-means, Kernel methods
Reduce Data Size: utilizing properties of data
Please stay tuned!
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 34 / 234
Optimization
Big Data Analytics: Optimization and Randomization
Part II: Optimization
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 35 / 234
Optimization (Sub)Gradient Methods
Outline
2 Optimization
(Sub)Gradient Methods
Stochastic Optimization Algorithms for Big Data
Stochastic Optimization
Distributed Optimization
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 36 / 234
Optimization (Sub)Gradient Methods
Learning as Optimization
Regularized Empirical Loss Minimization
min_{w∈R^d} F(w) := (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 37 / 234
Optimization (Sub)Gradient Methods
Convergence Measure
Most optimization algorithms are iterative:
w_{t+1} = w_t + Δw_t
Iteration Complexity: the number of iterations T(ε) needed to have F(w_T) − min_w F(w) ≤ ε (ε ≪ 1)
Convergence Rate: after T iterations, how good is the solution: F(w_T) − min_w F(w) ≤ ε(T)
[Figure: objective vs. iterations, showing the target accuracy ε reached after T iterations]
Total Runtime = Per-iteration Cost × Iteration Complexity
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 38 / 234
Optimization (Sub)Gradient Methods
More on Convergence Measure
Big O(·) notation: explicit dependence on T or ε
Convergence Rate                  | Iteration Complexity
linear: O(μ^T) (μ < 1)            | O(log(1/ε))
sub-linear: O(1/T^α), α > 0       | O(1/ε^{1/α})
Why are we interested in bounds?
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 39 / 234
Optimization (Sub)Gradient Methods
More on Convergence Measure
Convergence Rate                  | Iteration Complexity
linear: O(μ^T) (μ < 1)            | O(log(1/ε))
sub-linear: O(1/T^α), α > 0       | O(1/ε^{1/α})
[Figure: distance to optimum vs. iterations for 0.5^T (seconds), 1/T (minutes), and 1/T^{0.5} (hours)]
Theoretically, we consider
O(μ^T) ≺ O(1/T²) ≺ O(1/T) ≺ O(1/√T)
log(1/ε) ≺ 1/√ε ≺ 1/ε ≺ 1/ε²
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 43 / 234
Optimization (Sub)Gradient Methods
Non-smooth vs. Smooth
Non-smooth ℓ(z)
hinge loss: ℓ(w^⊤x, y) = max(0, 1 − y w^⊤x)
absolute loss: ℓ(w^⊤x, y) = |w^⊤x − y|
Smooth ℓ(z)
squared hinge loss: ℓ(w^⊤x, y) = max(0, 1 − y w^⊤x)²
logistic loss: ℓ(w^⊤x, y) = log(1 + exp(−y w^⊤x))
square loss: ℓ(w^⊤x, y) = (w^⊤x − y)²
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 44 / 234
Optimization (Sub)Gradient Methods
Strongly Convex vs. Non-strongly Convex
λ-strongly convex R(w)
ℓ_2 regularizer: (λ/2)‖w‖_2²
Elastic net regularizer: τ‖w‖_1 + (λ/2)‖w‖_2²
Non-strongly convex R(w)
unregularized problem: R(w) ≡ 0
ℓ_1 regularizer: τ‖w‖_1
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 45 / 234
Optimization (Sub)Gradient Methods
Gradient Method in Machine Learning
F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
Suppose ℓ(z) is smooth
Full gradient: ∇F(w) = (1/n) ∑_{i=1}^n ∇ℓ(w^⊤x_i, y_i) + λw
Per-iteration cost: O(nd)
Gradient Descent: w_t = w_{t−1} − γ_t ∇F(w_{t−1}), where γ_t is the step size
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 46 / 234
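A minimal sketch of the full gradient descent loop above for the ℓ_2-regularized logistic loss; the data names, step size and iteration count are illustrative choices, not prescribed by the slides.

```python
# Hedged sketch: gradient descent on F(w) with logistic loss, y in {-1,+1}.
import numpy as np

def full_gradient(w, X, y, lam):
    z = y * (X @ w)                                          # margins y_i w^T x_i
    grad_loss = -(y / (1.0 + np.exp(z))) @ X / X.shape[0]    # (1/n) sum of per-example gradients
    return grad_loss + lam * w

def gradient_descent(X, y, lam, step=0.1, T=100):
    w = np.zeros(X.shape[1])
    for _ in range(T):                                       # each iteration costs O(nd)
        w = w - step * full_gradient(w, X, y, lam)
    return w
```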
Optimization (Sub)Gradient Methods
Gradient Method in Machine Learning
F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
If λ = 0: R(w) is non-strongly convex; iteration complexity O(1/ε)
If λ > 0: R(w) is λ-strongly convex; iteration complexity O((1/λ) log(1/ε))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 47 / 234
Optimization (Sub)Gradient Methods
Accelerated Gradient Method
Accelerated Gradient Descent
w_t = v_{t−1} − γ_t ∇F(v_{t−1})
v_t = w_t + η_t(w_t − w_{t−1})   (momentum step)
w_t is the output and v_t is an auxiliary sequence.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 48 / 234
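A minimal sketch of the accelerated scheme above; grad_F is assumed to compute ∇F (e.g., the logistic-loss gradient from the earlier sketch), and the momentum weight η_t = (t−1)/(t+2) is one standard choice, not the slides' prescription.

```python
# Hedged sketch: Nesterov-style accelerated gradient descent.
import numpy as np

def accelerated_gradient(grad_F, d, step=0.1, T=100):
    w_prev = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, T + 1):
        w = v - step * grad_F(v)                          # gradient step at the auxiliary point
        v = w + (t - 1.0) / (t + 2.0) * (w - w_prev)      # momentum step
        w_prev = w
    return w_prev
```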
Optimization (Sub)Gradient Methods
Accelerated Gradient Method
F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
If λ = 0: R(w) is non-strongly convex; iteration complexity O(1/√ε), better than O(1/ε)
If λ > 0: R(w) is λ-strongly convex; iteration complexity O((1/√λ) log(1/ε)), better than O((1/λ) log(1/ε)) for small λ
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 49 / 234
Optimization (Sub)Gradient Methods
Deal with `1 regularizer
Consider a more general case
min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R′(w) + τ‖w‖_1, where R(w) = R′(w) + τ‖w‖_1
R′(w): λ-strongly convex and smooth
F′(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R′(w) denotes the smooth part
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 50 / 234
Optimization (Sub)Gradient Methods
Deal with `1 regularizer
Accelerated Gradient Descent
w_t = argmin_{w∈R^d} ∇F′(v_{t−1})^⊤w + (1/(2γ_t))‖w − v_{t−1}‖_2² + τ‖w‖_1   (proximal mapping)
v_t = w_t + η_t(w_t − w_{t−1})
The proximal mapping has a closed-form solution: soft-thresholding
Iteration complexity and runtime remain unchanged.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 52 / 234
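A minimal sketch of the proximal (accelerated) gradient step above: the proximal mapping for τ‖w‖_1 is element-wise soft-thresholding. grad_Fp is assumed to return ∇F′(w), and the step/momentum choices are illustrative.

```python
# Hedged sketch: accelerated proximal gradient with soft-thresholding.
import numpy as np

def soft_threshold(u, thresh):
    """argmin_w 0.5*||w - u||^2 + thresh*||w||_1, applied element-wise."""
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def accelerated_proximal_gradient(grad_Fp, d, tau, step=0.1, T=100):
    w_prev = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, T + 1):
        u = v - step * grad_Fp(v)                     # gradient step on the smooth part F'
        w = soft_threshold(u, step * tau)             # proximal mapping for tau*||w||_1
        v = w + (t - 1.0) / (t + 2.0) * (w - w_prev)  # momentum step
        w_prev = w
    return w_prev
```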
Optimization (Sub)Gradient Methods
Sub-Gradient Method in Machine Learning
F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
Suppose ℓ(z) is non-smooth
Full sub-gradient: ∂F(w) = (1/n) ∑_{i=1}^n ∂ℓ(w^⊤x_i, y_i) + λw
Sub-Gradient Descent: w_t = w_{t−1} − γ_t ∂F(w_{t−1})
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 53 / 234
Optimization (Sub)Gradient Methods
Sub-Gradient Method
F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
If λ = 0: R(w) is non-strongly convex; iteration complexity O(1/ε²)
If λ > 0: R(w) is λ-strongly convex; iteration complexity O(1/(λε))
No efficient acceleration scheme in general
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 54 / 234
Optimization (Sub)Gradient Methods
Problem Classes and Iteration Complexity
min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)
Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth  | Smooth
Non-strongly convex  | O(1/ε²)     | O(1/√ε)
λ-strongly convex    | O(1/(λε))   | O((1/√λ) log(1/ε))
Per-iteration cost: O(nd), too high if n or d are large.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 55 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Outline
2 Optimization
(Sub)Gradient Methods
Stochastic Optimization Algorithms for Big Data
Stochastic Optimization
Distributed Optimization
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 56 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Stochastic First-Order Method by Data Sampling
Stochastic Gradient Descent (SGD)
Stochastic Variance Reduced Gradient (SVRG)
Stochastic Average Gradient Algorithm (SAGA)
Stochastic Dual Coordinate Ascent (SDCA)
Accelerated Proximal Coordinate Gradient (APCG)
Assumption: ‖xi‖ ≤ 1 for any i
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 57 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Basic SGD (Nemirovski & Yudin (1978))
F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
Full sub-gradient: ∂F(w) = (1/n) ∑_{i=1}^n ∂ℓ(w^⊤x_i, y_i) + λw
Randomly sample i ∈ {1, . . . , n}
Stochastic sub-gradient: ∂ℓ(w^⊤x_i, y_i) + λw
E_i[∂ℓ(w^⊤x_i, y_i) + λw] = ∂F(w)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 58 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Basic SGD (Nemirovski & Yudin (1978))
Applicable in all settings!
min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
sample: i_t ∈ {1, . . . , n}
update: w_t = w_{t−1} − γ_t (∂ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1})
output: ŵ_T = (1/T) ∑_{t=1}^T w_t
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 59 / 234
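A minimal sketch of the basic SGD loop above for the hinge loss plus ℓ_2 regularizer; the step size γ_t = 1/(λt) is one common choice for the strongly convex case, not mandated by the slides.

```python
# Hedged sketch: basic SGD with hinge loss and averaged output.
import numpy as np

def sgd_hinge(X, y, lam, T):
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        i = np.random.randint(n)                       # sample i_t
        g = lam * w
        if 1.0 - y[i] * (X[i] @ w) > 0:                # subgradient of the hinge loss
            g = g - y[i] * X[i]
        w = w - (1.0 / (lam * t)) * g                  # O(d) update
        w_sum += w
    return w_sum / T                                   # averaged output w_hat_T
```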
Optimization Stochastic Optimization Algorithms for Big Data
Basic SGD (Nemirovski & Yudin (1978))
F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
If λ = 0: R(w) is non-strongly convex; iteration complexity O(1/ε²)
If λ > 0: R(w) is λ-strongly convex; iteration complexity O(1/(λε))
Exactly the same as sub-gradient descent!
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 60 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Total Runtime
Per-iteration cost: O(d)
Much lower than the full gradient method
e.g. hinge loss (SVM): stochastic gradient ∂ℓ(w^⊤x_{i_t}, y_{i_t}) = −y_{i_t} x_{i_t} if 1 − y_{i_t} w^⊤x_{i_t} > 0, and 0 otherwise
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 61 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Total Runtime
min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)
Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth  | Smooth
Non-strongly convex  | O(1/ε²)     | O(1/ε²)
λ-strongly convex    | O(1/(λε))   | O(1/(λε))
For SGD, only strong convexity helps; smoothness does not make any difference!
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 62 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Full Gradient V.S. Stochastic Gradient
The full gradient method needs fewer iterations; the stochastic gradient method has lower cost per iteration
For small ε, use the full gradient; if satisfied with large ε, use the stochastic gradient
The full gradient method can be accelerated; the stochastic gradient method cannot
The full gradient method's iteration complexity depends on smoothness and strong convexity; the stochastic gradient method's iteration complexity depends only on strong convexity
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 63 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
Applicable when ℓ(z) is smooth and R(w) is λ-strongly convex
Stochastic gradient: g_{i_t}(w) = ∇ℓ(w^⊤x_{i_t}, y_{i_t}) + λw
E_{i_t}[g_{i_t}(w)] = ∇F(w), but Var[g_{i_t}(w)] ≠ 0 even if w = w_⋆
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 64 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
Compute the full gradient at a reference point w̃:
∇F(w̃) = (1/n) ∑_{i=1}^n ∇ℓ(w̃^⊤x_i, y_i) + λw̃
Stochastic variance reduced gradient:
g̃_{i_t}(w) = ∇F(w̃) − g_{i_t}(w̃) + g_{i_t}(w)
E_{i_t}[g̃_{i_t}(w)] = ∇F(w)
Var[g̃_{i_t}(w)] → 0 as w, w̃ → w_⋆
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 65 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
At the optimal solution w_⋆: ∇F(w_⋆) = 0
It does not mean g_{i_t}(w) → 0 as w → w_⋆
However, we have g̃_{i_t}(w) = ∇F(w̃) − g_{i_t}(w̃) + g_{i_t}(w) → 0 as w, w̃ → w_⋆
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 66 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
Iterate s = 1, . . . , T − 1:
  Let w_0 = w̃_s and compute ∇F(w̃_s)
  Iterate t = 1, . . . , K:
    g̃_{i_t}(w_{t−1}) = ∇F(w̃_s) − g_{i_t}(w̃_s) + g_{i_t}(w_{t−1})
    w_t = w_{t−1} − γ_t g̃_{i_t}(w_{t−1})
  w̃_{s+1} = (1/K) ∑_{t=1}^K w_t
output: w̃_T
K = O(1/λ)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 67 / 234
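A minimal sketch of the SVRG loop above; grad_i(w, i) is assumed to return the per-example gradient g_i(w) = ∇ℓ(w^⊤x_i, y_i) + λw, and the step size, K and T are illustrative.

```python
# Hedged sketch: SVRG with an outer loop over reference points.
import numpy as np

def svrg(grad_i, n, d, step, K, T):
    w_ref = np.zeros(d)                                            # reference point w~
    for _ in range(T):
        full_grad = sum(grad_i(w_ref, i) for i in range(n)) / n    # grad F(w~), O(nd)
        w = w_ref.copy()
        w_sum = np.zeros(d)
        for _ in range(K):                                         # K = O(1/lam) inner steps
            i = np.random.randint(n)
            g = full_grad - grad_i(w_ref, i) + grad_i(w, i)        # variance-reduced gradient
            w = w - step * g
            w_sum += w
        w_ref = w_sum / K                                          # new reference point
    return w_ref
```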
Optimization Stochastic Optimization Algorithms for Big Data
SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)
Per-iteration cost: O(d(n + 1/λ))
Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth | Smooth
Non-strongly convex  | N.A.       | N.A.
λ-strongly convex    | N.A.       | O(log(1/ε))
Total runtime: O(d(n + 1/λ) log(1/ε))
Use a proximal mapping for the ℓ_1 regularizer
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 68 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SAGA (Defazio et al. (2014))
min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
A new version of SAG (Roux et al. (2012))
Applicable when ℓ(z) is smooth
Strong convexity is not required.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 69 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SAGA (Defazio et al. (2014))
SAGA also reduces the variance of the stochastic gradient, but with a different technique
SVRG uses gradients at the same point w̃:
g̃_{i_t}(w) = ∇F(w̃) − g_{i_t}(w̃) + g_{i_t}(w), where ∇F(w̃) = (1/n) ∑_{i=1}^n ∇ℓ(w̃^⊤x_i, y_i) + λw̃
SAGA uses gradients at different points w̃_1, w̃_2, · · · , w̃_n:
g̃_{i_t}(w) = G − g_{i_t}(w̃_{i_t}) + g_{i_t}(w), where the average gradient G = (1/n) ∑_{i=1}^n [∇ℓ(w̃_i^⊤x_i, y_i) + λw̃_i]
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 70 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SAGA (Defazio et al. (2014))
Initialize the average gradient G_0:
G_0 = (1/n) ∑_{i=1}^n g_i, with g_i = ∇ℓ(w_0^⊤x_i, y_i) + λw_0
Stochastic variance reduced gradient:
g̃_{i_t}(w_{t−1}) = G_{t−1} − g_{i_t} + (∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1})
w_t = w_{t−1} − γ_t g̃_{i_t}(w_{t−1})
Update the average gradient:
G_t = (1/n) ∑_{i=1}^n g_i, after setting g_{i_t} = ∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1}
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 71 / 234
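A minimal sketch of the SAGA update above; grad_i(w, i) is assumed to return the per-example gradient g_i(w), and the O(nd) table of stored gradients is kept explicitly.

```python
# Hedged sketch: SAGA with an explicit table of stored per-example gradients.
import numpy as np

def saga(grad_i, n, d, step, T):
    w = np.zeros(d)
    table = np.array([grad_i(w, i) for i in range(n)])    # stored g_i, O(nd) memory
    G = table.mean(axis=0)                                # average gradient G_0
    for _ in range(T):
        i = np.random.randint(n)
        g_new = grad_i(w, i)
        g = G - table[i] + g_new                          # variance-reduced gradient
        w = w - step * g
        G = G + (g_new - table[i]) / n                    # O(d) update of the average
        table[i] = g_new
    return w
```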
Optimization Stochastic Optimization Algorithms for Big Data
SAGA: efficient update of averaged gradient
G_t and G_{t−1} differ only in g_i for i = i_t
Before we update g_{i_t}, we update
G_t = (1/n) ∑_{i=1}^n g_i = G_{t−1} − (1/n) g_{i_t} + (1/n)(∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1})
Computation cost: O(d)
To implement SAGA, we have to store and update all of g_1, g_2, . . . , g_n
Requires extra memory space O(nd)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 72 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SAGA (Defazio et al. (2014))
Per-iteration cost: O(d)
Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth | Smooth
Non-strongly convex  | N.A.       | O(n/ε)
λ-strongly convex    | N.A.       | O((n + 1/λ) log(1/ε))
Total runtime (strongly convex): O(d(n + 1/λ) log(1/ε)). Same as SVRG!
Use a proximal mapping for the ℓ_1 regularizer
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 73 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Compare the Runtime of SGD and SVRG/SAGA
Smooth but non-strongly convex:
SGD: O(d/ε²)
SAGA: O(dn/ε)
Smooth and strongly convex:
SGD: O(d/(λε))
SVRG/SAGA: O(d(n + 1/λ) log(1/ε))
For small ε, use SVRG/SAGA; if satisfied with large ε, use SGD
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 74 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Conjugate Duality
Define ℓ_i(z) ≡ ℓ(z, y_i)
Conjugate function: ℓ_i^*(α) ⟺ ℓ_i(z)
ℓ_i(z) = max_{α∈R} [αz − ℓ_i^*(α)],  ℓ_i^*(α) = max_{z∈R} [αz − ℓ_i(z)]
E.g. hinge loss: ℓ_i(z) = max(0, 1 − y_i z)
ℓ_i^*(α) = αy_i if −1 ≤ αy_i ≤ 0, and +∞ otherwise
E.g. squared hinge loss: ℓ_i(z) = max(0, 1 − y_i z)²
ℓ_i^*(α) = α²/4 + αy_i if αy_i ≤ 0, and +∞ otherwise
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 75 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SDCA (Shalev-Shwartz & Zhang (2013))
Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008))
Applicable when R(w) is λ-strongly convex
Smoothness is not required
From the primal problem to the dual problem (with z = w^⊤x_i):
min_w (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2²
= min_w (1/n) ∑_{i=1}^n max_{α_i∈R} [α_i(w^⊤x_i) − ℓ_i^*(α_i)] + (λ/2)‖w‖_2²
= max_{α∈R^n} (1/n) ∑_{i=1}^n −ℓ_i^*(α_i) − (λ/2)‖(1/(λn)) ∑_{i=1}^n α_i x_i‖_2²
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 76 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SDCA (Shalev-Shwartz & Zhang (2013))
Solve the dual problem:
max_{α∈R^n} (1/n) ∑_{i=1}^n −ℓ_i^*(α_i) − (λ/2)‖(1/(λn)) ∑_{i=1}^n α_i x_i‖_2²
Sample i_t ∈ {1, . . . , n}. Optimize α_{i_t} while fixing the others.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 77 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SDCA (Shalev-Shwartz & Zhang (2013))
Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n α_i^t x_i
Change of variable α_i → Δα_i:
max_{Δα∈R^n} (1/n) ∑_{i=1}^n −ℓ_i^*(α_i^t + Δα_i) − (λ/2)‖(1/(λn))(∑_{i=1}^n α_i^t x_i + ∑_{i=1}^n Δα_i x_i)‖_2²
⟺ max_{Δα∈R^n} (1/n) ∑_{i=1}^n −ℓ_i^*(α_i^t + Δα_i) − (λ/2)‖w_t + (1/(λn)) ∑_{i=1}^n Δα_i x_i‖_2²
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 78 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SDCA (Shalev-Shwartz & Zhang (2013))
Dual Coordinate Updates
Δα_{i_t} = argmax_{Δα_{i_t}∈R} −(1/n) ℓ_{i_t}^*(−α_{i_t}^t − Δα_{i_t}) − (λ/2)‖w_t + (1/(λn)) Δα_{i_t} x_{i_t}‖_2²
α_{i_t}^{t+1} = α_{i_t}^t + Δα_{i_t}
w_{t+1} = w_t + (1/(λn)) Δα_{i_t} x_{i_t}
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 79 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SDCA updates
Closed-form solution for Δα_i: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013))
e.g. square loss: Δα_i = (y_i − w_t^⊤x_i − α_i^t) / (1 + ‖x_i‖_2²/(λn))
Per-iteration cost: O(d)
Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 80 / 234
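A minimal sketch of the SDCA loop above for the square loss, using the closed-form Δα_i from this slide and keeping the primal solution w = (1/(λn)) ∑ α_i x_i consistent; names are illustrative.

```python
# Hedged sketch: SDCA for the square loss with closed-form coordinate updates.
import numpy as np

def sdca_square_loss(X, y, lam, T):
    n, d = X.shape
    alpha = np.zeros(n)
    w = X.T @ alpha / (lam * n)                 # primal solution w = (1/(lam*n)) sum alpha_i x_i
    for _ in range(T):
        i = np.random.randint(n)                # sample i_t
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + (X[i] @ X[i]) / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)           # keep w consistent with alpha, O(d)
    return w
```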
Optimization Stochastic Optimization Algorithms for Big Data
SDCA
Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth     | Smooth
Non-strongly convex  | N.A.           | N.A.
λ-strongly convex    | O(n + 1/(λε))  | O((n + 1/λ) log(1/ε))
Total runtime (smooth loss): O(d(n + 1/λ) log(1/ε)). The same as SVRG and SAGA!
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 81 / 234
Optimization Stochastic Optimization Algorithms for Big Data
SVRG vs. SDCA vs. SGD
ℓ_2-regularized logistic regression with λ = 10^{−4}
MNIST data (Johnson & Zhang, 2013)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 82 / 234
Optimization Stochastic Optimization Algorithms for Big Data
APCG (Lin et al. (2014))
Recall the acceleration scheme for the full gradient method: an auxiliary sequence (β_t) and a momentum step
Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n β_i^t x_i
Dual coordinate updates:
Δβ_{i_t} = argmax_{Δβ_{i_t}∈R} −(1/n) ℓ_{i_t}^*(−β_{i_t}^t − Δβ_{i_t}) − (λ/2)‖w_t + (1/(λn)) Δβ_{i_t} x_{i_t}‖_2²
α_{i_t}^{t+1} = β_{i_t}^t + Δβ_{i_t}
β^{t+1} = α^{t+1} + η_t(α^{t+1} − α^t)   (momentum step)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 83 / 234
Optimization Stochastic Optimization Algorithms for Big Data
APCG (Lin et al. (2014))
Per-iteration cost: O(d)
Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth         | Smooth
Non-strongly convex  | N.A.               | N.A.
λ-strongly convex    | O(n + √(n/(λε)))   | O((n + √(n/λ)) log(1/ε))
Compared to SDCA, APCG has a shorter runtime when λ is very small.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 84 / 234
Optimization Stochastic Optimization Algorithms for Big Data
APCG V.S. SDCA
squared hinge loss SVM, real data
dataset  | number of samples n | number of features d | sparsity
rcv1     | 20,242              | 47,236               | 0.16%
covtype  | 581,012             | 54                   | 22%
news20   | 19,996              | 1,355,191            | 0.04%
F(w_t) − F(w_⋆) vs. the number of passes over the data
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 85 / 234
Optimization Stochastic Optimization Algorithms for Big Data
APCG vs. SDCA (Lin et al. (2014))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 86 / 234
Optimization Stochastic Optimization Algorithms for Big Data
For general R(w)
Dual Problem:
max_{α∈R^n} (1/n) ∑_{i=1}^n −ℓ_i^*(α_i) − R^*((1/(λn)) ∑_{i=1}^n α_i x_i)
R^* is the conjugate of R
Sample i_t ∈ {1, . . . , n}. Optimize α_{i_t} while fixing the others.
Can still be updated in O(d) in many cases (Shalev-Shwartz & Zhang (2013))
The iteration complexity and runtime of SDCA and APCG remain unchanged.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 87 / 234
Optimization Stochastic Optimization Algorithms for Big Data
APCG for primal problem
min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2² + τ‖w‖_1
Suppose d ≫ n. Per-iteration cost O(d) is too high
Apply APCG to the primal instead of the dual problem
Sample over features instead of data
Per-iteration cost becomes O(n)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 88 / 234
Optimization Stochastic Optimization Algorithms for Big Data
APCG for primal problem
min_{w∈R^d} F(w) = (1/2)‖Xw − y‖_2² + (λ/2)‖w‖_2² + τ‖w‖_1
X = [x_1, x_2, · · · , x_n]
Full gradient: ∇F(w) = X^⊤(Xw − y) + λw
Partial gradient: ∇_i F(w) = x_i^⊤(Xw − y) + λw_i
Proximal Coordinate Gradient (PCG) (Nesterov (2012)):
w_i^t = argmin_{w_i∈R} ∇_i F(w^{t−1}) w_i + (1/(2γ_t))(w_i − w_i^{t−1})² + τ|w_i|   (proximal mapping), if i = i_t
w_i^t = w_i^{t−1}, otherwise
∇_i F(w^t) can be updated in O(n)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 89 / 234
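A minimal sketch of the (non-accelerated) proximal coordinate gradient step above, keeping the residual Xw − y up to date so each partial gradient costs O(n); the step size and iteration count are illustrative.

```python
# Hedged sketch: proximal coordinate gradient for 0.5||Xw-y||^2 + (lam/2)||w||^2 + tau*||w||_1.
import numpy as np

def soft_threshold(u, thresh):
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)

def prox_coordinate_gradient(X, y, lam, tau, step, T):
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                              # residual Xw - y
    for _ in range(T):
        i = np.random.randint(d)               # sample a coordinate/feature
        grad_i = X[:, i] @ r + lam * w[i]      # partial gradient, O(n)
        w_new = soft_threshold(w[i] - step * grad_i, step * tau)   # proximal mapping
        r += (w_new - w[i]) * X[:, i]          # O(n) residual update
        w[i] = w_new
    return w
```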
Optimization Stochastic Optimization Algorithms for Big Data
APCG for primal problem
APCG accelerates PCG using an auxiliary sequence (v_t) and a momentum step
APCG:
w_i^t = argmin_{w_i∈R} ∇_i F(v^{t−1}) w_i + (1/(2γ_t))(w_i − v_i^{t−1})² + τ|w_i|, if i = i_t; w_i^t = w_i^{t−1} otherwise
v^t = w^t + η_t(w^t − w^{t−1})
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 90 / 234
Optimization Stochastic Optimization Algorithms for Big Data
APCG for primal problem
Per-iteration cost: O(n)
Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth | Smooth
Non-strongly convex  | N.A.       | O(d/√ε)
λ-strongly convex    | N.A.       | O((d/√λ) log(1/ε))
n ≫ d: apply APCG to the dual problem. d ≫ n: apply APCG to the primal problem.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 91 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Which Algorithm to Use
Satisfied with large ε: SGD
For small ε, ℓ(z) ≡ ℓ(z, y):
R(w) \ ℓ(z)          | Non-smooth | Smooth
Non-strongly convex  | SGD        | SAGA
λ-strongly convex    | APCG       | APCG
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 92 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Summary
Table: Iteration complexity; per-iteration cost O(d)
smooth | str-cvx | SGD     | SAGA                | SDCA                | APCG
No     | No      | 1/ε²    | N.A.                | N.A.                | N.A.
Yes    | No      | 1/ε²    | n/ε                 | N.A.                | N.A.
No     | Yes     | 1/(λε)  | N.A.                | n + 1/(λε)          | n + √(n/(λε))
Yes    | Yes     | 1/(λε)  | (n + 1/λ) log(1/ε)  | (n + 1/λ) log(1/ε)  | (n + √(n/λ)) log(1/ε)

Table: Iteration complexity; per-iteration cost O(d(n + 1/λ))
smooth | str-cvx | SVRG
No     | No      | N.A.
Yes    | No      | N.A.
No     | Yes     | N.A.
Yes    | Yes     | log(1/ε)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 93 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Summary
               | SGD  | SVRG   | SAGA  | SDCA | APCG
Memory         | O(d) | O(d)   | O(dn) | O(d) | O(d)
Parameters     | γ_t  | γ_t, K | γ_t   | None | η_t
absolute       | ✓    | ✗      | ✗     | ✓    | ✓
hinge          | ✓    | ✗      | ✗     | ✓    | ✓
square         | ✓    | ✓      | ✓     | ✓    | ✓
squared hinge  | ✓    | ✓      | ✓     | ✓    | ✓
logistic       | ✓    | ✓      | ✓     | ✓    | ✓
λ > 0          | ✓    | ✓      | ✓     | ✓    | ✓
λ = 0          | ✓    | ✓      | ✓     | ✗    | ✗
Primal         | ✓    | ✓      | ✓     | ✗    | ✓
Dual           | ✗    | ✗      | ✗     | ✓    | ✓
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 94 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Outline
2 Optimization
(Sub)Gradient Methods
Stochastic Optimization Algorithms for Big Data
Stochastic Optimization
Distributed Optimization
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 95 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Big Data and Distributed Optimization
Distributed optimization: data distributed over a cluster of multiple machines
Moving all data to a single machine suffers from low network bandwidth and limited disk or memory
Communication vs. computation: RAM access ≈ 100 nanoseconds; a standard network connection ≈ 250,000 nanoseconds
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 96 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Distributed Data
N data points are partitioned and distributed to m machines: [x_1, x_2, . . . , x_N] = S_1 ∪ S_2 ∪ · · · ∪ S_m
Machine j only has access to S_j. W.L.O.G.: |S_j| = n = N/m
S1 S2 S3 S4 S5 S6
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 97 / 234
Optimization Stochastic Optimization Algorithms for Big Data
A simple solution: Average Solution
Global problem:
w_⋆ = argmin_{w∈R^d} F(w) = (1/N) ∑_{i=1}^N ℓ(w^⊤x_i, y_i) + R(w)
Machine j solves a local problem:
ŵ_j = argmin_{w∈R^d} f_j(w) = (1/n) ∑_{i∈S_j} ℓ(w^⊤x_i, y_i) + R(w)
S_1 S_2 S_3 S_4 S_5 S_6
ŵ_1 ŵ_2 ŵ_3 ŵ_4 ŵ_5 ŵ_6
Center computes: w̄ = (1/m) ∑_{j=1}^m ŵ_j.  Issue: w̄ will not converge to w_⋆
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 98 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Mini-Batch SGD: Average Stochastic Gradient
Machine j samples i_t ∈ S_j and constructs a stochastic gradient:
g_j(w_{t−1}) = ∂ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + ∂R(w_{t−1})
S_1 S_2 S_3 S_4 S_5 S_6
g_1(w_{t−1}) g_2(w_{t−1}) g_3(w_{t−1}) g_4(w_{t−1}) g_5(w_{t−1}) g_6(w_{t−1})
Center computes: w_t = w_{t−1} − γ_t (1/m) ∑_{j=1}^m g_j(w_{t−1})   (mini-batch stochastic gradient)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 99 / 234
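A minimal sketch simulating one round of the mini-batch SGD scheme above: each "machine" j holds a partition S_j, draws one local example, and the center averages the m stochastic gradients. grad_example(w, x, y) is an assumed helper returning the per-example (sub)gradient.

```python
# Hedged sketch: one round of distributed mini-batch SGD (simulated in-process).
import numpy as np

def minibatch_sgd_round(w, partitions, grad_example, step):
    """partitions: list of (X_j, y_j) blocks, one per machine."""
    grads = []
    for X_j, y_j in partitions:                        # done in parallel in practice
        i = np.random.randint(X_j.shape[0])            # local sample i_t in S_j
        grads.append(grad_example(w, X_j[i], y_j[i]))  # g_j(w_{t-1})
    return w - step * np.mean(grads, axis=0)           # center averages and updates
```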
Optimization Stochastic Optimization Algorithms for Big Data
Total Runtime
Single machine:
Total Runtime = Per-iteration Cost × Iteration Complexity
Distributed optimization:
Total Runtime = (Communication Time Per Round + Local Runtime Per Round) × Rounds of Communication
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 100 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Mini-Batch SGD
Applicable in all settings!
Communication time per round: increases with m in a complicated way.
Local runtime per round: O(1)
Suppose R(w) is λ-strongly convex: rounds of communication O(1/(mλε))
Suppose R(w) is non-strongly convex: rounds of communication O(1/(mε²))
More machines reduce the rounds of communication but increase the communication time per round.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 101 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Distributed SDCA (Yang, 2013; Ma et al., 2015)
Only works when R(w) = (λ/2)‖w‖_2² with λ > 0. (No ℓ_1)
Global dual problem:
max_{α∈R^N} (1/N) ∑_{i=1}^N −ℓ_i^*(α_i) − (λ/2)‖(1/(λN)) ∑_{i=1}^N α_i x_i‖_2²
α = [α_{S_1}, α_{S_2}, · · · , α_{S_m}]
Machine j solves a local dual problem only over α_{S_j}:
max_{α_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −ℓ_i^*(α_i) − (λ/2)‖(1/(λN)) ∑_{i=1}^N α_i x_i‖_2²
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 102 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DisDCA (Yang, 2013),CoCoA+ (Ma et al., 2015)
The center maintains a primal solution: w_t = (1/(λN)) ∑_{i=1}^N α_i^t x_i
Change of variable α_i → Δα_i:
max_{Δα_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −ℓ_i^*(α_i^t + Δα_i) − (λ/2)‖(1/(λN))(∑_{i=1}^N α_i^t x_i + ∑_{i∈S_j} Δα_i x_i)‖_2²
⟺ max_{Δα_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −ℓ_i^*(α_i^t + Δα_i) − (λ/2)‖w_t + (1/(λN)) ∑_{i∈S_j} Δα_i x_i‖_2²
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 103 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DisDCA (Yang, 2013),CoCoA+ (Ma et al., 2015)
Machine j approximately solves:
Δα_{S_j}^t ≈ argmax_{Δα_{S_j}∈R^n} (1/N) ∑_{i∈S_j} −ℓ_i^*(α_i^t + Δα_i) − (λ/2)‖w_t + (1/(λN)) ∑_{i∈S_j} Δα_i x_i‖_2²
α_{S_j}^{t+1} = α_{S_j}^t + Δα_{S_j}^t,  Δw_j^t = (1/(λN)) ∑_{i∈S_j} Δα_i^t x_i
S_1 S_2 S_3 S_4 S_5 S_6
Δw_1^t Δw_2^t Δw_3^t Δw_4^t Δw_5^t Δw_6^t
Center computes: w_{t+1} = w_t + ∑_{j=1}^m Δw_j^t
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 104 / 234
Optimization Stochastic Optimization Algorithms for Big Data
CoCoA+ (Ma et al., 2015)
Local objective value:
G_j(Δα_{S_j}, w_t) = (1/N) ∑_{i∈S_j} −ℓ_i^*(α_i^t + Δα_i) − (λ/2)‖w_t + (1/(λN)) ∑_{i∈S_j} Δα_i x_i‖_2²
Solve Δα_{S_j}^t by any local solver as long as
(max_{Δα_{S_j}} G_j(Δα_{S_j}, w_t) − G_j(Δα_{S_j}^t, w_t)) ≤ Θ (max_{Δα_{S_j}} G_j(Δα_{S_j}, w_t) − G_j(0, w_t))
0 < Θ < 1
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 105 / 234
Optimization Stochastic Optimization Algorithms for Big Data
CoCoA+ (Ma et al., 2015)
Suppose ℓ(z) is smooth, R(w) is λ-strongly convex and SDCA is the local solver:
Local runtime per round: O((1/λ + N/m) log(1/Θ))
Rounds of communication: O((1/(1 − Θ)) (1/λ) log(1/ε))
Suppose ℓ(z) is non-smooth, R(w) is λ-strongly convex and SDCA is the local solver:
Local runtime per round: O((1/λ + N/m) (1/Θ))
Rounds of communication: O((1/(1 − Θ)) (1/(λε)))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 106 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Distributed SDCA in Practice
Choice of Θ (how long do we run the local solver?)
Choice of m (how many machines to use?)
Fast machines but slow network: use small Θ and small m
Fast network but slow machines: use large Θ and large m
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 107 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DiSCO (Zhang & Xiao, 2015)
The rounds of communication of dual SDCA do not depend on m.
DiSCO: Distributed Second-Order method (Zhang & Xiao, 2015)
The rounds of communication of DiSCO depend on m.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 108 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DiSCO (Zhang & Xiao, 2015)
Global problem:
min_{w∈R^d} F(w) = (1/N) ∑_{i=1}^N ℓ(w^⊤x_i, y_i) + R(w)
Local problem:
min_{w∈R^d} f_j(w) = (1/n) ∑_{i∈S_j} ℓ(w^⊤x_i, y_i) + R(w)
The global problem can be written as:
min_{w∈R^d} F(w) = (1/m) ∑_{j=1}^m f_j(w)
Applicable when f_j(w) is smooth, λ-strongly convex and self-concordant
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 109 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Newton Direction
At the optimal solution w_⋆: ∇F(w_⋆) = 0
We hope moving w_t along −v_t leads to ∇F(w_t − v_t) = 0
Taylor expansion: ∇F(w_t − v_t) ≈ ∇F(w_t) − ∇²F(w_t)v_t = 0
Such a v_t is called a Newton direction
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 110 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Newton Method
Find a Newton direction v_t by solving ∇²F(w_t)v_t = ∇F(w_t)
Then update w_{t+1} = w_t − γ_t v_t
Requires solving a d × d linear system. Costly when d > 1000.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 111 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DiSCO (Zhang & Xiao, 2015)
Inexact Newton Method
Find an inexact Newton direction v_t using Preconditioned Conjugate Gradient (PCG) (Golub & Ye, 1997):
‖∇²F(w_t)v_t − ∇F(w_t)‖_2 ≤ ε_t
Then update w_{t+1} = w_t − γ_t v_t
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 112 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DiSCO (Zhang & Xiao, 2015)
PCG
Keep computing v ←− P^{−1} × ∇²F(w_t) × v iteratively until v becomes an inexact Newton direction.
Preconditioner: P = ∇²f_1(w_t) + μI
μ: a tuning parameter such that ‖∇²f_1(w_t) − ∇²F(w_t)‖_2 ≤ μ
f_1 is "similar" to F so that P is a good local preconditioner.
μ = O(1/√n) = O(√(m/N))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 113 / 234
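A minimal sketch of preconditioned conjugate gradient for the inexact Newton direction: solve ∇²F(w_t) v = ∇F(w_t) given only Hessian-vector products hess_vec (which DiSCO evaluates distributedly) and a preconditioner matrix P held on machine 1. This is a generic textbook PCG, not the exact routine of the paper; all names are illustrative.

```python
# Hedged sketch: PCG for H v = g using Hessian-vector products and preconditioner P.
import numpy as np

def pcg(hess_vec, g, P, tol=1e-6, max_iter=100):
    d = g.shape[0]
    v = np.zeros(d)
    r = g - hess_vec(v)                       # residual of H v = g
    z = np.linalg.solve(P, r)                 # apply the preconditioner P^{-1}
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Hp = hess_vec(p)                      # distributed Hessian-vector product in DiSCO
        alpha = rz / (p @ Hp)
        v += alpha * p
        r -= alpha * Hp
        if np.linalg.norm(r) <= tol:          # ||H v - g|| <= eps_t
            break
        z = np.linalg.solve(P, r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return v
```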
Optimization Stochastic Optimization Algorithms for Big Data
DiSCO (Zhang & Xiao, 2015)
DiSCO computes ∇²F(w_t) × v distributedly:
∇²F(w_t) × v = ∇²f_1(w_t) × v (machine 1) + ∇²f_2(w_t) × v (machine 2) + · · · + ∇²f_m(w_t) × v (machine m)
Then, compute P^{−1} × ∇²F(w_t) × v only on machine 1
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 114 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DiSCO (Zhang & Xiao, 2015)
For high-dimensional data (e.g. d ≥ 1000), computing P^{−1} × ∇²F(w_t) × v exactly is costly.
Instead, use SDCA on machine 1 to solve
P^{−1} × ∇²F(w_t) × v ≈ argmin_{u∈R^d} (1/2)u^⊤Pu − u^⊤∇²F(w_t)v
Local runtime: O(N/m + (1 + μ)/(λ + μ))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 119 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DiSCO (Zhang & Xiao, 2015)
Suppose SDCA is the local solver:
Local runtime per round: O(N/m + (1 + μ)/(λ + μ))
Rounds of communication: O(√(μ/λ) log(1/ε))
Choice of m (how many machines to use?) (μ = O(√(m/N)))
Fast machines but slow network: use small m
Fast network but slow machines: use large m
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 120 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DSVRG (Lee et al., 2015)
A distributed version of SVRG using a “round-robin” scheme
Assumes the user can control the distribution of the data before the algorithm starts.
Applicable when fj(w) is smooth and λ-strongly convex
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 121 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DSVRG (Lee et al., 2015)
Iterate s = 1, . . . ,T − 1:
  Let w0 = w̃s and compute ∇F (w̃s)   [easy to distribute]
  Iterate t = 1, . . . ,K:   [hard to distribute]
    g̃it (wt−1) = ∇F (w̃s)− git (w̃s) + git (wt−1)
    wt = wt−1 − γt g̃it (wt−1)
  w̃s+1 = (1/K) ∑_{t=1}^K wt
Output: w̃T
Each machine can only sample from its own data Sj. However, Eit∈Sj [git (wt−1)] ≠ ∇F (wt−1)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 122 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DSVRG (Lee et al., 2015)
Solution:
Store a second set of data Rj on machine j, sampled with replacement from x1, x2, . . . , xn before the algorithm starts.
Construct the stochastic gradient git (wt−1) by sampling it ∈ Rj and removing it from Rj afterwards.
Eit∈Rj [git (wt−1)] = ∇F (wt−1)
When Rj = ∅, pass wt to next machine.
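A single-machine numpy sketch of this scheme under simplifying assumptions made for illustration only: a least-squares loss, a pre-sampled pool R drawn with replacement, and one index popped from R per inner step to form the variance-reduced gradient.

```python
import numpy as np

# Sketch: variance-reduced (SVRG-style) steps drawing fresh indices from a pre-sampled pool R.
rng = np.random.default_rng(0)
n, d, lam, gamma = 500, 20, 1e-2, 0.02
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.05 * rng.standard_normal(n)

def grad_i(w, i):                              # g_i(w): gradient of the i-th component
    return (X[i] @ w - y[i]) * X[i] + lam * w

def full_grad(w):                              # full gradient of the objective
    return X.T @ (X @ w - y) / n + lam * w

w_ref = np.zeros(d)                            # reference point (the "w tilde" above)
mu = full_grad(w_ref)                          # full gradient at the reference point
R = list(rng.integers(0, n, size=200))         # pool R, sampled with replacement beforehand
w = w_ref.copy()
while R:                                       # pop one fresh index per inner step
    i = R.pop()
    g = mu - grad_i(w_ref, i) + grad_i(w, i)   # variance-reduced stochastic gradient
    w = w - gamma * g
print("residual before / after:", np.linalg.norm(X @ w_ref - y), np.linalg.norm(X @ w - y))
```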
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 123 / 234
(Figures: illustrations of the full gradient step and the stochastic gradient steps in DSVRG.)
Optimization Stochastic Optimization Algorithms for Big Data
DSVRG (Lee et al., 2015)
Suppose |Rj | = r for all j.
Local runtime per round: O(N/m + r)
Rounds of communication: O(1/(rλ) log(1/ε))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 129 / 234
Optimization Stochastic Optimization Algorithms for Big Data
DSVRG (Lee et al., 2015)
Choice of m (how many machines to use?)
Fast machines but slow network: use small m
Fast network but slow machines: use large m
Choice of r (how many data points to pre-sample into Rj?)
The larger, the better
Required machine memory: |Sj |+ |Rj | = N/m + r
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 130 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Other Distributed Optimization Methods
ADMM (Boyd et al., 2011; Ozdaglar, 2015)
Rounds of communication: O(network-graph-dependent term × (1/√λ) log(1/ε))
DANE (Shamir et al., 2014)
Approximates the Newton direction with a different approach from DiSCO
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 131 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Summary
ℓ(z) is smooth and R(w) is λ-strongly convex.

alg.                 Mini-SGD       Dist-SDCA            DiSCO
Runtime per round    O(1)           O(1/λ + N/m)         O(N/m)
Rounds of comm.      O(1/(λmε))     O((1/λ) log(1/ε))    O((m^{1/4}/(√λ N^{1/4})) log(1/ε))

Table: assume µ = O(√(m/N))

alg.                 DSVRG (full grad.)    DSVRG (stoch. grad.)
Runtime per round    O(N/m)                O(r)
Rounds of comm.      O(log(1/ε))           O(1/(rλ) log(1/ε))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 132 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Summary
ℓ(z) is non-smooth and R(w) is λ-strongly convex.

alg.                 Mini-SGD       Dist-SDCA       DiSCO    DSVRG
Runtime per round    O(1)           O(1/λ + N/m)    N.A.     N.A.
Rounds of comm.      O(1/(λmε))     O(1/(λε))       N.A.     N.A.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 133 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Summary
R(w) is non-strongly convex.

alg.                 Mini-SGD      Dist-SDCA    DiSCO    DSVRG
Runtime per round    O(1)          N.A.         N.A.     N.A.
Rounds of comm.      O(1/(mε²))    N.A.         N.A.     N.A.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 134 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Which Algorithm to Use
Algorithm to use, ℓ(z) ≡ ℓ(z, y):

                               Non-smooth ℓ     Smooth ℓ
R(w)   Non-strongly convex     Mini-SGD         DSVRG
       λ-strongly convex       Dist-SDCA        DiSCO/DSVRG

Between DiSCO and DSVRG:
Use DiSCO for small λ, e.g., λ < 10⁻⁵
Use DSVRG for large λ, e.g., λ > 10⁻⁵
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 135 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Distributed Machine Learning Systems and Library
Petuum: http://petuum.github.io
Apache Spark: http://spark.apache.org/
Parameter Server: http://parameterserver.org/
Birds: http://cs.uiowa.edu/˜tyng/software.html
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 136 / 234
Optimization Stochastic Optimization Algorithms for Big Data
Thank You! Questions?
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 137 / 234
Randomized Dimension Reduction
Big Data Analytics: Optimization and Randomization
Part III: Randomization
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 138 / 234
Randomized Dimension Reduction
Outline
1 Basics
2 Optimization
3 Randomized Dimension Reduction
4 Randomized Algorithms
5 Concluding Remarks
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 139 / 234
Randomized Dimension Reduction
Random Sketch
Approximate a large data matrix
by a much smaller sketch
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 140 / 234
Randomized Dimension Reduction
The Framework of Randomized Algorithms
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 141 / 234
Randomized Dimension Reduction
Why randomized dimension reduction?
Efficient
Robust (e.g., dropout)
Formal Guarantees
Can explore parallel algorithms
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 145 / 234
Randomized Dimension Reduction
Randomized Dimension Reduction
Johnson-Lindenstauss (JL) transforms
Subspace embeddings
Column sampling
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 146 / 234
Randomized Dimension Reduction
JL Lemma
JL Lemma (Johnson & Lindenstrauss, 1984)
For any 0 < ε, δ < 1/2, there exists a probability distribution on m × d real matrices A and a small universal constant c > 0 such that for any fixed x ∈ Rd, with probability at least 1− δ,
|‖Ax‖²₂ − ‖x‖²₂| ≤ c √(log(1/δ)/m) ‖x‖²₂
or, for m = Θ(ε⁻² log(1/δ)), with probability at least 1− δ,
|‖Ax‖²₂ − ‖x‖²₂| ≤ ε ‖x‖²₂
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 147 / 234
Randomized Dimension Reduction
Embedding a set of points into low dimensional space
Given a set of points x1, . . . , xn ∈ Rd, we can embed them into a low-dimensional space Ax1, . . . ,Axn ∈ Rm such that the pairwise distance between any two points is well preserved:
(1− ε) ‖xi − xj‖²₂ ≤ ‖Axi − Axj‖²₂ = ‖A(xi − xj)‖²₂ ≤ (1 + ε) ‖xi − xj‖²₂
In other words, to preserve all pairwise Euclidean distances up to 1± ε, only m = Θ(ε⁻² log(n²/δ)) dimensions are needed (a union bound over the n² pairs).
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 148 / 234
Randomized Dimension Reduction
JL transforms: Gaussian Random Projection
Gaussian Random Projection (Dasgupta & Gupta, 2003): A ∈ Rm×d
Aij ∼ N (0, 1/m)
m = Θ(ε⁻² log(1/δ))
Computational cost of AX, where X ∈ Rd×n:
mnd for dense matrices; nnz(X)·m for sparse matrices
The computational cost is very high (could be as high as solving many problems)
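A small numpy sketch of the Gaussian random projection (synthetic data; sizes are illustrative): it draws Aij ∼ N(0, 1/m) and checks how well squared norms are preserved.

```python
import numpy as np

# Sketch: Gaussian random projection A with A_ij ~ N(0, 1/m); checks norm preservation.
rng = np.random.default_rng(0)
d, m, n = 10000, 400, 50
X = rng.standard_normal((d, n))                   # columns are the points x_1, ..., x_n
A = rng.standard_normal((m, d)) / np.sqrt(m)      # A_ij ~ N(0, 1/m)
AX = A @ X                                        # dense cost ~ m * n * d
ratios = (AX ** 2).sum(axis=0) / (X ** 2).sum(axis=0)
print("squared-norm ratios range: [%.3f, %.3f]" % (ratios.min(), ratios.max()))
```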
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 149 / 234
Randomized Dimension Reduction
Accelerate JL transforms: using discrete distributions
Using discrete distributions (Achlioptas, 2003):
Pr(Aij = ±1/√m) = 0.5, or
Pr(Aij = ±√(3/m)) = 1/6, Pr(Aij = 0) = 2/3
Database friendly: replaces multiplications by additions and subtractions
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 150 / 234
Randomized Dimension Reduction
Accelerate JL transforms: using the Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform:
Motivation: can we simply use a random sampling matrix P ∈ Rm×d that randomly selects m coordinates out of d coordinates (scaled by √(d/m))?
Unfortunately, by the Chernoff bound,
|‖Px‖²₂ − ‖x‖²₂| ≤ (√d ‖x‖∞/‖x‖₂) √(3 log(2/δ)/m) ‖x‖²₂
Unless √d ‖x‖∞/‖x‖₂ ≤ c, random sampling does not work
The remedy is given by the randomized Hadamard transform
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 151 / 234
Randomized Dimension Reduction
Randomized Hadamard transform
Hadamard transform: H ∈ Rd×d, H = (1/√d) H_{2^k}, where
H_1 = [1], H_2 = [ 1  1 ; 1  −1 ], H_{2^k} = [ H_{2^{k−1}}  H_{2^{k−1}} ; H_{2^{k−1}}  −H_{2^{k−1}} ]
‖Hx‖2 = ‖x‖2 and H is orthogonal
Computational cost of Hx: d log(d)
Randomized Hadamard transform: HD, where D ∈ Rd×d is a diagonal matrix with Pr(Dii = ±1) = 0.5
HD is orthogonal and ‖HDx‖2 = ‖x‖2
Key property: √d ‖HDx‖∞/‖HDx‖2 ≤ √(log(d/δ)) w.h.p. 1− δ
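A minimal sketch of applying HD to a vector with a fast Walsh-Hadamard transform (assumes d is a power of two); it checks that the norm is preserved and reports the flatness quantity from the key property above.

```python
import numpy as np

# Sketch: randomized Hadamard transform HD x computed via a fast Walsh-Hadamard transform.
def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d); returns H x with H = H_{2^k} / sqrt(d)."""
    x = x.copy()
    d, h = len(x), 1
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(d)

rng = np.random.default_rng(0)
d = 1024
x = rng.standard_normal(d)
D = rng.choice([-1.0, 1.0], size=d)          # diagonal of D, Pr(D_ii = +/-1) = 0.5
hdx = fwht(D * x)                            # HD x
print("norm preserved:", np.isclose(np.linalg.norm(hdx), np.linalg.norm(x)))
print("sqrt(d) * ||HDx||_inf / ||HDx||_2 =",
      np.sqrt(d) * np.abs(hdx).max() / np.linalg.norm(hdx))
```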
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 152 / 234
Randomized Dimension Reduction
Accelerate JL transforms: using the Hadamard transform (I)
Fast JL transform based on the randomized Hadamard transform (Tropp, 2011):
A = √(d/m) PHD
yields
|‖Ax‖²₂ − ‖x‖²₂| ≤ √(3 log(2/δ) log(d/δ)/m) ‖x‖²₂
m = Θ(ε⁻² log(1/δ) log(d/δ)) suffices for 1± ε
The additional factor log(d/δ) can be removed
Computational cost of AX: O(nd log(m))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 153 / 234
Randomized Dimension Reduction
Accelerate JL transforms: using a sparse matrix (I)
Random hashing (Dasgupta et al., 2010)
A = HD, where D ∈ Rd×d and H ∈ Rm×d
Random hashing: h : {1, . . . , d} → {1, . . . ,m}
Hij = 1 if h(j) = i: a sparse matrix (each column has only one non-zero entry)
D ∈ Rd×d: a diagonal matrix with Pr(Dii = ±1) = 0.5
[Ax]j = ∑_{i: h(i)=j} xi Dii
Technically speaking, random hashing does not satisfy the JL lemma
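A short numpy sketch of random hashing applied without forming A explicitly (h and the signs of D are drawn at random; sizes are illustrative); squared norms are approximately preserved for vectors that are not too spiky, in line with the properties on the next slide.

```python
import numpy as np

# Sketch: random hashing A = HD applied via the hash function h and the sign diagonal D.
rng = np.random.default_rng(0)
d, m = 50000, 2000
h = rng.integers(0, m, size=d)               # h : {1..d} -> {1..m}
sgn = rng.choice([-1.0, 1.0], size=d)        # diagonal of D, Pr(D_ii = +/-1) = 0.5

def hash_map(x):
    out = np.zeros(m)
    np.add.at(out, h, sgn * x)               # [Ax]_j = sum_{i: h(i)=j} x_i D_ii
    return out

x = rng.standard_normal(d)
print("||x||^2    :", x @ x)
print("||HD x||^2 :", hash_map(x) @ hash_map(x))
```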
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 154 / 234
Randomized Dimension Reduction
Accelerate JL transforms: using a sparse matrix (I)
Key properties: E[〈HDx1,HDx2〉] = 〈x1, x2〉, and norm preserving, |‖HDx‖²₂ − ‖x‖²₂| ≤ ε‖x‖²₂, only when ‖x‖∞/‖x‖2 ≤ 1/√c
Apply the randomized Hadamard transform P first: Θ(c log(c/δ)) blocks of the randomized Hadamard transform give
‖Px‖∞/‖Px‖2 ≤ 1/√c
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 155 / 234
Randomized Dimension Reduction
Accelerate JL transforms: using a sparse matrix (II)
Sparse JL transform based on block random hashing (Kane & Nelson, 2014)
A = [ (1/√s) Q1 ; . . . ; (1/√s) Qs ]  (s blocks stacked vertically)
Each Qi ∈ Rv×d is an independent random hashing (HD) matrix
Set v = Θ(ε⁻¹) and s = Θ(ε⁻¹ log(1/δ))
Computational cost of AX: O((nnz(X)/ε) log(1/δ))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 156 / 234
Randomized Dimension Reduction
Randomized Dimension Reduction
Johnson-Lindenstauss (JL) transforms
Subspace embeddings
Column sampling
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 157 / 234
Randomized Dimension Reduction
Subspace Embeddings
Definition: a subspace embedding, given some parameters 0 < ε, δ < 1 and k ≤ d, is a distribution D over matrices A ∈ Rm×d such that for any fixed linear subspace W ⊆ Rd with dim(W ) = k it holds that
Pr_{A∼D} (∀x ∈W , ‖Ax‖2 ∈ (1± ε)‖x‖2) ≥ 1− δ
It implies: if U ∈ Rd×k is an orthogonal matrix (contains the orthonormal bases of W ), then
AU ∈ Rm×k is of full column rank
The singular values of AU lie in (1± ε), i.e., (1− ε)² ≤ σi(U>A>AU) ≤ (1 + ε)²
These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 158 / 234
Randomized Dimension Reduction
Subspace Embeddings
From a JL transform to a subspace embedding (Sarlos, 2006). Let A ∈ Rm×d be a JL transform. If
m = O( k log(k/(δε)) / ε² )
then w.h.p. 1− δ, A ∈ Rm×d is a subspace embedding w.r.t. a k-dimensional subspace of Rd
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 159 / 234
Randomized Dimension Reduction
Subspace Embeddings
Making block random hashing a subspace embedding (Nelson & Nguyen, 2013).
A = [ (1/√s) Q1 ; . . . ; (1/√s) Qs ]
Each Qi ∈ Rv×d is an independent random hashing (HD) matrix
Set v = Θ(kε⁻¹ log⁵(k/δ)) and s = Θ(ε⁻¹ log³(k/δ))
W.h.p. 1− δ, A ∈ Rm×d with m = Θ(k log⁸(k/δ)/ε²) is a subspace embedding w.r.t. a k-dimensional subspace of Rd
Computational cost of AX: O((nnz(X)/ε) log³(k/δ))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 160 / 234
Randomized Dimension Reduction
Sparse Subspace Embedding (SSE)
Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013)
A = HD, where D ∈ Rd×d and H ∈ Rm×d
m = Ω(k²/ε²) suffices for a subspace embedding with probability 2/3
Computational cost of AX: O(nnz(X))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 161 / 234
Randomized Dimension Reduction
Randomized Dimensionality Reduction
Johnson-Lindenstauss (JL) transforms
Subspace embeddings
Column (Row) sampling
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 162 / 234
Randomized Dimension Reduction
Column sampling
Column subset selection (feature selection)
More interpretable
Uniform sampling usually does not work (not a JL transform)
Non-oblivious sampling (data-dependent sampling):
leverage-score sampling
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 163 / 234
Randomized Dimension Reduction
Leverage-score sampling (Drineas et al., 2006)
Let X ∈ Rd×n be a rank-k matrix, X = UΣV>, with U ∈ Rd×k, Σ ∈ Rk×k
Leverage scores: ‖Ui∗‖²₂, i = 1, . . . , d
Let pi = ‖Ui∗‖²₂ / ∑_{i=1}^d ‖Ui∗‖²₂, i = 1, . . . , d
Let i1, . . . , im ∈ {1, . . . , d} denote m indices selected following pi
Let A ∈ Rm×d be the sampling-and-rescaling matrix:
Akj = 1/√(m pj) if j = ik, and 0 otherwise
AX ∈ Rm×n is a small sketch of X
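A numpy sketch of leverage-score sampling for a synthetic rank-k matrix (illustrative sizes): compute the top-k left singular vectors, sample row indices with probabilities pi, rescale, and compare X>X with (AX)>(AX).

```python
import numpy as np

# Sketch: leverage-score row sampling of a rank-k matrix X in R^{d x n}.
rng = np.random.default_rng(0)
d, n, k, m = 2000, 100, 10, 300
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))   # rank-k data matrix

U, _, _ = np.linalg.svd(X, full_matrices=False)
U = U[:, :k]                                 # top-k left singular vectors
lev = (U ** 2).sum(axis=1)                   # leverage scores ||U_{i*}||_2^2
p = lev / lev.sum()                          # sampling probabilities p_i

idx = rng.choice(d, size=m, replace=True, p=p)
scale = 1.0 / np.sqrt(m * p[idx])            # rescaling 1/sqrt(m p_j)
AX = scale[:, None] * X[idx, :]              # the sketch AX in R^{m x n}
err = np.linalg.norm(X.T @ X - AX.T @ AX) / np.linalg.norm(X.T @ X)
print("relative error of X^T X vs (AX)^T (AX):", err)
```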
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 164 / 234
Randomized Dimension Reduction
Properties of Leverage-score sampling
When m = Θ((k/ε²) log(2k/δ)), w.h.p. 1− δ:
AU ∈ Rm×k is of full column rank
σ²ᵢ(AU) ≥ 1− ε ≥ (1− ε)²
σ²ᵢ(AU) ≤ 1 + ε ≤ (1 + ε)²
Leverage-score sampling performs like a subspace embedding (only for U, the top singular vector matrix of X)
Computational cost: computing the top-k SVD of X is expensive
Randomized algorithms to compute approximate leverage scores exist
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 165 / 234
Randomized Dimension Reduction
When does uniform sampling make sense?
Coherence measure: µk = (d/k) max_{1≤i≤d} ‖Ui∗‖²₂
Uniform sampling is valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nystrom method usually uses uniform sampling (Gittens, 2011)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 166 / 234
Randomized Algorithms Randomized Classification (Regression)
Outline
4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel methods
  Randomized Low-rank Matrix Approximation
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 167 / 234
Randomized Algorithms Randomized Classification (Regression)
Classification
Classification problems:
min_{w∈Rd} (1/n) ∑_{i=1}^n ℓ(yi w>xi) + (λ/2)‖w‖²₂
yi ∈ {+1,−1}: label
Loss function ℓ(z), z = y w>x:
1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1− z)^p, where p = 1, 2
2. Logistic regression: ℓ(z) = log(1 + exp(−z))
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 168 / 234
Randomized Algorithms Randomized Classification (Regression)
Randomized Classification
For large-scale, high-dimensional problems, the computational cost of optimization is O((nd + dκ) log(1/ε)).
Use a random reduction A ∈ Rd×m (m ≪ d) to reduce X ∈ Rn×d to X̂ = XA ∈ Rn×m. Then solve
min_{u∈Rm} (1/n) ∑_{i=1}^n ℓ(yi u>x̂i) + (λ/2)‖u‖²₂
JL transforms
Sparse subspace embeddings
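A minimal sketch of this reduction, under assumptions made for illustration only: synthetic low-rank data (the regime of the guarantees on the next slide) and a regularized least-squares classifier standing in for the loss ℓ.

```python
import numpy as np

# Sketch: randomized feature reduction for classification on synthetic low-rank data.
rng = np.random.default_rng(0)
n, d, r, m, lam = 2000, 5000, 20, 200, 1e-3
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))   # rank-r data matrix
y = np.sign(X @ rng.standard_normal(d))                         # labels from a linear model

A = rng.standard_normal((d, m)) / np.sqrt(m)     # random reduction A in R^{d x m}, m << d
Xh = X @ A                                       # reduced data X_hat = X A in R^{n x m}

# Regularized least-squares classifier in the reduced space (stand-in for the loss above).
u = np.linalg.solve(Xh.T @ Xh / n + lam * np.eye(m), Xh.T @ y / n)
print("training accuracy in the reduced space:", np.mean(np.sign(Xh @ u) == y))
```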
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 169 / 234
Randomized Algorithms Randomized Classification (Regression)
Randomized Classification
Two questions:
Is there any performance guarantee?
The margin is preserved if the data is linearly separable (Balcan et al., 2006), as long as m ≥ (12/ε²) log(6m/δ)
The generalization performance is preserved if the data matrix is of low rank and m = Ω(k·poly(log(k/(δε)))/ε²) (Paul et al., 2013)
How to recover an accurate model in the original high-dimensional space?
Dual Recovery (Zhang et al., 2014) and Dual Sparse Recovery (Yang et al., 2015)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 170 / 234
Randomized Algorithms Randomized Classification (Regression)
The Dual Problem
Using the Fenchel conjugate
ℓ∗i(αi) = max_z { αi z − ℓ(z, yi) }
Primal: w∗ = arg min_{w∈Rd} (1/n) ∑_{i=1}^n ℓ(w>xi, yi) + (λ/2)‖w‖²₂
Dual: α∗ = arg max_{α∈Rn} −(1/n) ∑_{i=1}^n ℓ∗i(αi) − (1/(2λn²)) α>XX>α
From dual to primal: w∗ = −(1/(λn)) X>α∗
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 171 / 234
Randomized Algorithms Randomized Classification (Regression)
Dual Recovery for Randomized Reduction
From the dual formulation: w∗ lies in the row space of the data matrix X ∈ Rn×d
Dual Recovery: ŵ∗ = −(1/(λn)) X>α̂∗, where
α̂∗ = arg max_{α∈Rn} −(1/n) ∑_{i=1}^n ℓ∗i(αi) − (1/(2λn²)) α>X̂X̂>α
and X̂ = XA ∈ Rn×m
Subspace embedding A with m = Θ(r log(r/δ)ε⁻²)
Guarantee: under a low-rank assumption on the data matrix X (e.g., rank(X ) = r), with high probability 1− δ,
‖ŵ∗ −w∗‖2 ≤ (ε/(1− ε)) ‖w∗‖2
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 172 / 234
Randomized Algorithms Randomized Classification (Regression)
Dual Sparse Recovery for Randomized Reduction
Assume the optimal dual solution α∗ is sparse (i.e., the number of support vectors is small)
Dual Sparse Recovery: ŵ∗ = −(1/(λn)) X>α̂∗, where
α̂∗ = arg max_{α∈Rn} −(1/n) ∑_{i=1}^n ℓ∗i(αi) − (1/(2λn²)) α>X̂X̂>α − (τ/n)‖α‖1
where X̂ = XA ∈ Rn×m
JL transform A with m = Θ(s log(n/δ)ε⁻²)
Guarantee: if α∗ is s-sparse, with high probability 1− δ,
‖ŵ∗ −w∗‖2 ≤ ε‖w∗‖2
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 173 / 234
Randomized Algorithms Randomized Classification (Regression)
Dual Sparse Recovery
RCV1 text data, n = 677,399 and d = 47,236
(Figures: relative dual error and relative primal error (ℓ2 norm) vs. τ, for λ = 0.001 and m = 1024, 2048, 4096, 8192.)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 174 / 234
Randomized Algorithms Randomized Least-Squares Regression
Outline
4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel methods
  Randomized Low-rank Matrix Approximation
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 175 / 234
Randomized Algorithms Randomized Least-Squares Regression
Least-squares regression
Let X ∈ Rn×d with d ≪ n and b ∈ Rn. The least-squares regression problem is to find w∗ such that
w∗ = arg min_{w∈Rd} ‖Xw− b‖2
Computational cost: O(nd²)
Goal of randomized algorithms: o(nd²)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 176 / 234
Randomized Algorithms Randomized Least-Squares Regression
Randomized Least-squares regression
Let A ∈ Rm×n be a random reduction matrix. Solve
ŵ∗ = arg min_{w∈Rd} ‖A(Xw− b)‖2 = arg min_{w∈Rd} ‖AXw− Ab‖2
Computational cost: O(md²) + reduction time
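A sketch-and-solve illustration using the sparse embedding (random hashing of rows) from the dimension-reduction part, so that forming the sketch costs roughly O(nnz(X)); the sizes and the choice of m are illustrative assumptions.

```python
import numpy as np

# Sketch-and-solve least squares: hash the n rows into m buckets with random signs,
# solve the small m x d problem, and compare residuals with the exact solution.
rng = np.random.default_rng(0)
n, d, m = 100000, 50, 2500
X = rng.standard_normal((n, d))
b = X @ rng.standard_normal(d) + rng.standard_normal(n)

h = rng.integers(0, m, size=n)               # hash bucket of each row
s = rng.choice([-1.0, 1.0], size=n)          # random sign of each row
SX, Sb = np.zeros((m, d)), np.zeros(m)
np.add.at(SX, h, s[:, None] * X)             # SX = A X
np.add.at(Sb, h, s * b)                      # Sb = A b

w_exact, *_ = np.linalg.lstsq(X, b, rcond=None)
w_sketch, *_ = np.linalg.lstsq(SX, Sb, rcond=None)
print("exact residual   :", np.linalg.norm(X @ w_exact - b))
print("sketched residual:", np.linalg.norm(X @ w_sketch - b))
```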
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 177 / 234
Randomized Algorithms Randomized Least-Squares Regression
Randomized Least-squares regression
Theoretical guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):
‖Xŵ∗ − b‖2 ≤ (1 + ε)‖Xw∗ − b‖2
Total time: O(nnz(X) + d³ log(d/ε)ε⁻²)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 178 / 234
Randomized Algorithms Randomized K-means Clustering
Outline
4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel methods
  Randomized Low-rank Matrix Approximation
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 179 / 234
Randomized Algorithms Randomized K-means Clustering
K-means Clustering
Let x1, . . . , xn ∈ Rd be a set of data points.
K-means clustering aims to solve
min_{C1,...,Ck} ∑_{j=1}^k ∑_{xi∈Cj} ‖xi − µj‖²₂
Computational cost: O(ndkt), where t is the number of iterations.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 180 / 234
Randomized Algorithms Randomized K-means Clustering
Randomized Algorithms for K-means Clustering
Let X = (x1, . . . , xn)> ∈ Rn×d be the data matrix.
High-dimensional data — random sketch: X̂ = XA ∈ Rn×m, m ≪ d
Approximate K-means:
min_{C1,...,Ck} ∑_{j=1}^k ∑_{x̂i∈Cj} ‖x̂i − µ̂j‖²₂
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 181 / 234
Randomized Algorithms Randomized K-means Clustering
Randomized Algorithms for K-means Clustering
For the random sketch, JL transforms and sparse subspace embeddings all work:
JL transform: m = O(k log(k/(εδ))/ε²)
Sparse subspace embedding: m = O(k²/(ε²δ))
ε relates to the approximation accuracy
The analysis of the approximation error for K-means can be formulated as constrained low-rank approximation (Cohen et al., 2015):
min_{Q>Q=I} ‖X − QQ>X‖²_F
where Q is orthonormal.
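A small sketch (synthetic, well-separated clusters; a bare-bones Lloyd iteration instead of a library k-means) comparing the clustering cost obtained on the full data with the cost obtained after a Gaussian random sketch, both costs evaluated in the original space.

```python
import numpy as np

# Sketch: k-means (Lloyd's iterations) on randomly projected data vs. on the full data.
rng = np.random.default_rng(0)
n, d, m, k = 1000, 1000, 50, 5
centers = 5 * rng.standard_normal((k, d))
X = centers[rng.integers(0, k, size=n)] + rng.standard_normal((n, d))

def lloyd(Z, k, iters=20):
    C = Z[rng.choice(len(Z), k, replace=False)]             # initialize from data points
    for _ in range(iters):
        lab = np.argmin(((Z[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([Z[lab == j].mean(0) if np.any(lab == j) else C[j] for j in range(k)])
    return lab

def cost(Z, lab, k):                                        # k-means cost of a labeling
    return sum(((Z[lab == j] - Z[lab == j].mean(0)) ** 2).sum()
               for j in range(k) if np.any(lab == j))

A = rng.standard_normal((d, m)) / np.sqrt(m)                # random sketch X_hat = X A
lab_full = lloyd(X, k)
lab_sketch = lloyd(X @ A, k)
print("cost (clustered on full data):", cost(X, lab_full, k))
print("cost (clustered on sketch)   :", cost(X, lab_sketch, k))
```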
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 182 / 234
Randomized Algorithms Randomized Kernel methods
Outline
4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel methods
  Randomized Low-rank Matrix Approximation
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 183 / 234
Randomized Algorithms Randomized Kernel methods
Kernel methods
Kernel function: κ(·, ·); a set of examples x1, . . . , xn
Kernel matrix: K ∈ Rn×n with Kij = κ(xi , xj)
K is a PSD matrix
Computational and memory costs: Ω(n²)
Approximation methods:
The Nystrom method
Random Fourier features
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 184 / 234
Randomized Algorithms Randomized Kernel methods
The Nystrom method
Let A ∈ Rn×ℓ be a uniform sampling matrix.
B = KA ∈ Rn×ℓ
C = A>B = A>KA
The Nystrom approximation (Drineas & Mahoney, 2005): K̂ = BC†B>
Computational cost: O(ℓ³ + nℓ²)
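A numpy sketch of the Nystrom approximation for an RBF kernel with ℓ uniformly sampled landmark columns (illustrative sizes; the bandwidth is scaled with d so the kernel entries are non-trivial); the exact kernel is formed only to measure the error.

```python
import numpy as np

# Sketch: Nystrom approximation K_hat = B C^+ B^T with l uniformly sampled columns.
rng = np.random.default_rng(0)
n, d, l, gamma = 1500, 20, 100, 1.0
X = rng.standard_normal((n, d))

def rbf(U, V):
    sq = (U ** 2).sum(1)[:, None] + (V ** 2).sum(1)[None, :] - 2 * U @ V.T
    return np.exp(-np.maximum(sq, 0) / (2 * gamma ** 2 * d))

idx = rng.choice(n, size=l, replace=False)    # uniform column sampling
B = rbf(X, X[idx])                            # B = K A   (n x l)
C = B[idx]                                    # C = A^T K A   (l x l)
K_hat = B @ np.linalg.pinv(C) @ B.T           # Nystrom approximation

K = rbf(X, X)                                 # exact kernel, only for the comparison
print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```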
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 185 / 234
Randomized Algorithms Randomized Kernel methods
The Nystrom based kernel machine
The dual problem:
arg max_{α∈Rn} −(1/n) ∑_{i=1}^n ℓ∗i(αi) − (1/(2λn²)) α>BC†B>α
Solve it like a linear method with X̂ = BC^{−1/2} ∈ Rn×ℓ:
arg max_{α∈Rn} −(1/n) ∑_{i=1}^n ℓ∗i(αi) − (1/(2λn²)) α>X̂X̂>α
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 186 / 234
Randomized Algorithms Randomized Kernel methods
Random Fourier Features (RFF)
Bochner’s theorem: a shift-invariant kernel κ(x, y) = κ(x− y) is a valid kernel if and only if κ(δ) is the Fourier transform of a non-negative measure, i.e.,
κ(x− y) = ∫ p(ω) e^{−jω>(x−y)} dω
RFF (Rahimi & Recht, 2008): generate ω1, . . . , ωm ∈ Rd following p(ω). For an example x ∈ Rd, construct
x̂ = (cos(ω>₁x), sin(ω>₁x), . . . , cos(ω>ₘx), sin(ω>ₘx))> ∈ R2m
RBF kernel exp(−‖x− y‖²₂/(2γ²)): p(ω) = N (0, γ⁻²I)
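A minimal RFF sketch for the RBF kernel exp(−‖x−y‖²₂/(2γ²)), sampling ω ∼ N(0, γ⁻²I); the inner product of the random features approximates the kernel value (dimensions are illustrative).

```python
import numpy as np

# Sketch: random Fourier features for the RBF kernel exp(-||x - y||^2 / (2 * gamma^2)).
rng = np.random.default_rng(0)
d, m, gamma = 10, 500, 2.0
x, y = rng.standard_normal(d), rng.standard_normal(d)

W = rng.standard_normal((m, d)) / gamma          # omega_1, ..., omega_m ~ N(0, I / gamma^2)
def features(z):
    return np.concatenate([np.cos(W @ z), np.sin(W @ z)]) / np.sqrt(m)

exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * gamma ** 2))
print("exact kernel value:", exact)
print("RFF approximation :", features(x) @ features(y))
```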
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 188 / 234
Randomized Algorithms Randomized Kernel methods
The Nystrom method vs RFF (Yang et al., 2012)
Functional approximation framework:
The Nystrom method: data-dependent bases
RFF: data-independent bases
In certain cases (e.g., large eigen-gap, skewed eigenvalue distribution), the generalization performance of the Nystrom method is better than that of RFF
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 189 / 234
Randomized Algorithms Randomized Low-rank Matrix Approximation
Outline
4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel methods
  Randomized Low-rank Matrix Approximation
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 191 / 234
Randomized Algorithms Randomized Low-rank Matrix Approximation
Randomized low-rank matrix approximation
Let X ∈ Rn×d. The goal is to obtain
UΣV> ≈ X
where U ∈ Rn×k and V ∈ Rd×k have orthonormal columns and Σ ∈ Rk×k is a diagonal matrix with nonnegative entries
k is the target rank
The best rank-k approximation: Xk = UkΣkV>k
Approximation error:
‖UΣV> − X‖ξ ≤ (1 + ε)‖UkΣkV>k − X‖ξ
where ξ = F or ξ = 2
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 192 / 234
Randomized Algorithms Randomized Low-rank Matrix Approximation
Why low-rank approximation?
Applications in data mining and machine learning:
PCA
Spectral clustering
· · ·
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 193 / 234
Randomized Algorithms Randomized Low-rank Matrix Approximation
Why randomized algorithms?
Deterministic algorithms:
Truncated SVD: O(nd min(n, d))
Rank-revealing QR factorization: O(ndk)
Krylov subspace methods (e.g., the Lanczos algorithm): O(kTmult + (n + d)k²), where Tmult denotes the cost of a matrix-vector product
Randomized algorithms:
Speed can be faster (e.g., O(nd log(k)))
Output is more robust (e.g., Lanczos requires sophisticated modifications)
Can be pass efficient
Can exploit parallel algorithms
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 194 / 234
Randomized Algorithms Randomized Low-rank Matrix Approximation
Randomized algorithms for low-rank matrix approximation
The basic randomized algorithm for approximating X ∈ Rn×d (Halko et al., 2011):
1 Obtain a small sketch Y = XA ∈ Rn×m
2 Compute Q ∈ Rn×m that contains an orthonormal basis of col(Y)
3 Compute the SVD of Q>X = ŨΣV>
4 Approximation: X ≈ UΣV>, where U = QŨ
Explanation: if col(XA) captures the top-k column space of X well, i.e.,
‖X − QQ>X‖ ≤ ε
then ‖X − UΣV>‖ ≤ ε
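A compact numpy sketch of these four steps (Gaussian test matrix, oversampling p = 10, synthetic nearly low-rank data — all assumptions made for illustration); it reports the relative Frobenius error of the resulting rank-k approximation.

```python
import numpy as np

# Sketch: basic randomized low-rank approximation (range finder + small SVD).
rng = np.random.default_rng(0)
n, d, k, p = 3000, 1500, 20, 10
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))         # nearly rank-k matrix

A = rng.standard_normal((d, k + p))              # random test matrix with oversampling p
Y = X @ A                                        # 1. small sketch Y = XA
Q, _ = np.linalg.qr(Y)                           # 2. orthonormal basis of col(Y)
U_small, S, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)   # 3. SVD of the small matrix
U = Q @ U_small                                  # 4. lift back: X ~ U S V^T
X_k = (U[:, :k] * S[:k]) @ Vt[:k]                # truncate to rank k
print("relative Frobenius error:", np.linalg.norm(X - X_k) / np.linalg.norm(X))
```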
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 195 / 234
Randomized Algorithms Randomized Low-rank Matrix Approximation
Randomized algorithms for low-rank matrix approximation
Three questions:
1 What is the value of m?
m = k + p, where p is the oversampling parameter. In practice p = 5 or 10 gives superb results
2 What is the computational cost?
With the subsampled randomized Hadamard transform: can be as fast as O(nd log(k) + k²(n + d))
3 What is the quality?
Theoretical guarantees (see the appendix); practically, very accurate
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 196 / 234
Randomized Algorithms Randomized Low-rank Matrix Approximation
Randomized algorithms for low-rank matrix approximation
Other things:
Use power iteration to reduce the error: use (XX>)^q X
Can use sparse JL transform / subspace embedding matrices (Frobenius-norm guarantee only)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 198 / 234
Concluding Remarks
Outline
1 Basics
2 Optimization
3 Randomized Dimension Reduction
4 Randomized Algorithms
5 Concluding Remarks
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 199 / 234
Concluding Remarks
How to address big data challenge?
Optimization perspective: improve convergence rates by exploiting properties of the functions
stochastic optimization (e.g., SDCA, SVRG, SAGA)
distributed optimization (e.g., DisDCA)
Randomization perspective: reduce the data size by exploiting properties of the data
randomized feature reduction (e.g., reduce the number of features)
randomized instance reduction (e.g., reduce the number of instances)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 200 / 234
Concluding Remarks
How can we address big data challenge?
Optimization perspective: improve convergence rates by exploiting properties of the functions
Pro: can obtain the optimal solution
Con: high computational/communication costs
Randomization perspective: reduce the data size by exploiting properties of the data
Pro: fast
Con: a recovery error remains
Can we combine the benefits of the two techniques?
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 201 / 234
Concluding Remarks
Combine Randomization and Optimization (Yang et al.,2015)
Use randomization (Dual Sparse Recovery) to obtain a good initial solution
Initialize distributed optimization (DisDCA) with it to reduce the cost of computation/communication
Observation: 1 or 2 epochs of computation (1 or 2 communications) suffice to obtain the same performance as pure optimization
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 202 / 234
Concluding Remarks
Big Data Experiments
KDDcup data: n = 8,407,752, d = 29,890,095, 10 machines, m = 1024
(Figures: testing error (%) and running time (s) on the kdd dataset for DSRR, DSRR-Rec, DSRR-DisDCA-1, DSRR-DisDCA-2, and DisDCA.)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 203 / 234
Concluding Remarks
Thank You! Questions?
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 204 / 234
Concluding Remarks
References I
Achlioptas, Dimitris. Database-friendly random projections:Johnson-Lindenstrauss with binary coins. Journal of Computer andSystem Sciences, 66(4):671 – 687, 2003.
Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. Kernels asfeatures: on kernels, margins, and low-dimensional mappings. MachineLearning, 65(1):79–94, 2006.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 205 / 234
Concluding Remarks
References II
Cohen, Michael B., Elder, Sam, Musco, Cameron, Musco, Christopher,and Persu, Madalina. Dimensionality reduction for k-means clusteringand low rank approximation. In Proceedings of the Forty-SeventhAnnual ACM on Symposium on Theory of Computing (STOC), pp.163–172, 2015.
Dasgupta, Anirban, Kumar, Ravi, and Sarlos, Tamas. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC ’10, pp. 341–350, 2010.
Dasgupta, Sanjoy and Gupta, Anupam. An elementary proof of a theoremof Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. Saga: A fastincremental gradient method with support for non-strongly convexcomposite objectives. In NIPS, 2014.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 206 / 234
Concluding Remarks
References III
Drineas, Petros and Mahoney, Michael W. On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2005.
Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Samplingalgorithms for l2 regression and applications. In ACM-SIAM Symposiumon Discrete Algorithms (SODA), pp. 1127–1136, 2006.
Drineas, Petros, Mahoney, Michael W., Muthukrishnan, S., and Sarlos,Tamas. Faster least squares approximation. Numerische Mathematik,117(2):219–249, February 2011.
Gittens, Alex. The spectral norm error of the naive nystrom extension.CoRR, 2011.
Golub, Gene H. and Ye, Qiang. Inexact preconditioned conjugate gradientmethod with inner-outer iteration. SIAM J. Sci. Comput, 21:1305–1320,1997.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 207 / 234
Concluding Remarks
References IV
Halko, Nathan, Martinsson, Per Gunnar., and Tropp, Joel A. Findingstructure with randomness: Probabilistic algorithms for constructingapproximate matrix decompositions. SIAM Review, 53(2):217–288, May2011.
Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, andSundararajan, S. A dual coordinate descent method for large-scale linearsvm. In ICML, pp. 408–415, 2008.
Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descentusing predictive variance reduction. In NIPS, pp. 315–323, 2013.
Johnson, William and Lindenstrauss, Joram. Extensions of Lipschitzmappings into a Hilbert space. In Conference in modern analysis andprobability (New Haven, Conn., 1982), volume 26, pp. 189–206. 1984.
Kane, Daniel M. and Nelson, Jelani. Sparser johnson-lindenstrausstransforms. Journal of the ACM, 61:4:1–4:23, 2014.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 208 / 234
Concluding Remarks
References V
Lee, Jason, Ma, Tengyu, and Lin, Qihang. Distributed stochastic variancereduced gradient methods. Technical report, UC Berkeley, 2015.
Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximalcoordinate gradient method and its application to regularized empiricalrisk minimization. In NIPS, 2014.
Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtarik,Peter, and Takac, Martin. Adding vs. averaging in distributedprimal-dual optimization. In ICML, 2015.
Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebraalgorithms via sparser subspace embeddings. CoRR, abs/1211.1002,2012.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 209 / 234
Concluding Remarks
References VI
Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebraalgorithms via sparser subspace embeddings. In 54th Annual IEEESymposium on Foundations of Computer Science (FOCS), pp. 117–126,2013.
Nemirovski, A. and Yudin, D. On Cesari’s convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Math. Dokl., 19:341–362, 1978.
Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scaleoptimization problems. SIAM Journal on Optimization, 22:341–362,2012.
Ozdaglar, Asu. Distributed multiagent optimization linear convergencerate of admm. Technical report, MIT, 2015.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 210 / 234
Concluding Remarks
References VII
Paul, Saurabh, Boutsidis, Christos, Magdon-Ismail, Malik, and Drineas,Petros. Random projections for support vector machines. In Proceedingsof the International Conference on Artificial Intelligence and Statistics(AISTATS), pp. 498–506, 2013.
Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernelmachines. In Advances in Neural Information Processing Systems 20,pp. 1177–1184, 2008.
Recht, Benjamin. A simpler approach to matrix completion. JournalMachine Learning Research (JMLR), pp. 3413–3430, 2011.
Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochasticgradient method with an exponential convergence rate forstrongly-convex optimization with finite training sets. CoRR, 2012.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 211 / 234
Concluding Remarks
References VIII
Sarlos, Tamas. Improved approximation algorithms for large matrices viarandom projections. In 47th Annual IEEE Symposium on Foundations ofComputer Science (FOCS), pp. 143–152, 2006.
Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascentmethods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.
Shamir, Ohad, Srebro, Nathan, and Zhang, Tong. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.
Tropp, Joel A. Improved analysis of the subsampled randomized hadamardtransform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011.
Tropp, Joel A. User-friendly tail bounds for sums of random matrices.Found. Comput. Math., 12(4):389–434, August 2012. ISSN 1615-3375.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 212 / 234
Concluding Remarks
References IX
Xiao, L. and Zhang, T. A proximal stochastic gradient method withprogressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
Yang, Tianbao. Trading computation for communication: Distributedstochastic dual coordinate ascent. NIPS’13, pp. –, 2013.
Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nystrom method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems (NIPS), pp. 485–493, 2012.
Yang, Tianbao, Zhang, Lijun, Jin, Rong, and Zhu, Shenghuo. Theory ofdual-sparse regularized randomized reduction. In Proceedings of the32nd International Conference on Machine Learning, (ICML), pp.305–314, 2015.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 213 / 234
Concluding Remarks
References X
Zhang, Lijun, Mahdavi, Mehrdad, and Jin, Rong. Linear convergence withcondition number independent access of full gradients. In NIPS, pp.980–988. 2013.
Zhang, Lijun, Mahdavi, Mehrdad, Jin, Rong, Yang, Tianbao, and Zhu,Shenghuo. Random projections for classification: A recovery approach.IEEE Transactions on Information Theory (IEEE TIT), 60(11):7300–7316, 2014.
Zhang, Yuchen and Xiao, Lin. Communication-efficient distributedoptimization of self-concordant empirical loss. In ICML, 2015.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 214 / 234
Appendix
Examples of Convex functions
ax + b, Ax + b
x², ‖x‖²₂
exp(ax), exp(w>x)
log(1 + exp(ax)), log(1 + exp(w>x))
x log(x), ∑ᵢ xi log(xi)
‖x‖p, p ≥ 1; ‖x‖²p
maxᵢ(xi)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 215 / 234
Appendix
Operations that preserve convexity
Nonnegative scaling: a · f (x) where a ≥ 0
Sum: f (x) + g(x)
Composition with an affine function: f (Ax + b)
Point-wise maximum: maxᵢ fᵢ(x)
Examples:
Least-squares regression: ‖Ax− b‖²₂
SVM: (1/n)∑_{i=1}^n max(0, 1− yiw>xi) + (λ/2)‖w‖²₂
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 216 / 234
Appendix
Smooth Convex function
Smooth: e.g., the logistic loss f (x) = log(1 + exp(−x))
‖∇f (x)−∇f (y)‖2 ≤ L‖x − y‖2, where L > 0 is the smoothness constant
The second-order derivative is upper bounded: ‖∇2f (x)‖2 ≤ L
(Figure: the logistic loss log(1 + exp(−x)), the tangent f(y) + f′(y)(x− y) at y, and a quadratic function.)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 217 / 234
Appendix
Strongly Convex function
Strongly convex: e.g., the squared Euclidean norm f (x) = (1/2)‖x‖²₂
‖∇f (x)−∇f (y)‖2 ≥ λ‖x − y‖2, where λ > 0 is the strong convexity constant
The second-order derivative is lower bounded: ‖∇2f (x)‖2 ≥ λ
(Figure: x², with curves labeled “gradient” and “smooth”.)
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 218 / 234
Appendix
Smooth and Strongly Convex function
Smooth and strongly convex: e.g., the quadratic function f (z) = (1/2)(z − 1)²
λ‖x − y‖2 ≤ ‖∇f (x)−∇f (y)‖2 ≤ L‖x − y‖2, with L ≥ λ > 0
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 219 / 234
Appendix
Chernoff bound
Let X1, . . . ,Xn be independent random variables with 0 ≤ Xi ≤ 1. Let X = X1 + . . .+ Xn and µ = E[X ]. Then
Pr(X ≥ (1 + ε)µ) ≤ exp(−ε²µ/(2 + ε))
Pr(X ≤ (1− ε)µ) ≤ exp(−ε²µ/2)
or
Pr(|X − µ| ≥ εµ) ≤ 2 exp(−ε²µ/(2 + ε)) ≤ 2 exp(−ε²µ/3)
where the last inequality holds when 0 < ε ≤ 1
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 220 / 234
Appendix
Theoretical Guarantee of RA for low-rank approximation
X = U [ Σ1  0 ; 0  Σ2 ] [ V>₁ ; V>₂ ]
X ∈ Rm×n: the target matrix; Σ1 ∈ Rk×k, V1 ∈ Rn×k
A ∈ Rn×ℓ: random reduction matrix
Y = XA ∈ Rm×ℓ: the small sketch
Key inequality (with Ω1 = V>₁A and Ω2 = V>₂A):
‖(I − PY )X‖² ≤ ‖Σ2‖² + ‖Σ2Ω2Ω†₁‖²
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 221 / 234
Appendix
Gaussian Matrices
G is a standard Gaussian matrix; U and V are orthonormal matrices
U>GV follows the standard Gaussian distribution
E[‖SGT‖²_F] = ‖S‖²_F ‖T‖²_F
E[‖SGT‖] ≤ ‖S‖‖T‖_F + ‖S‖_F‖T‖
Concentration for a function of a Gaussian matrix: suppose h is a Lipschitz function on matrices,
h(X )− h(Y ) ≤ L‖X − Y ‖F
Then Pr(h(G) ≥ E[h(G)] + Lt) ≤ e^{−t²/2}
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 222 / 234
Appendix
Analysis for Randomized Least-square regression
Let X = UΣV> and w∗ = arg min_{w∈Rd} ‖Xw− b‖2
Let Z = ‖Xw∗ − b‖2, ω = b− Xw∗, and Xw∗ = Uα
ŵ∗ = arg min_{w∈Rd} ‖A(Xw− b)‖2
Since b− Xw∗ = b− X (X>X )†X>b = (I − UU>)b, we have Xŵ∗ − Xw∗ = Uβ. Then
‖Xŵ∗ − b‖²₂ = ‖Xw∗ − b‖²₂ + ‖Xŵ∗ − Xw∗‖²₂ = Z² + ‖β‖²₂
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 223 / 234
Appendix
Analysis for Randomized Least-square regression
AU(α+ β) = AXŵ∗ = AX (AX )†Ab = P_{AX}(Ab) = P_{AU}(Ab)
P_{AU}(Ab) = P_{AU}(A(ω + Uα)) = AUα+ P_{AU}(Aω)
Hence
U>A>AUβ = (AU)>(AU)(AU)†Aω = (AU)>(AU)((AU)>AU)⁻¹(AU)>Aω
where we use that AU has full column rank. Then
U>A>AUβ = U>A>Aω
‖β‖²₂/2 ≤ ‖U>A>AUβ‖²₂ = ‖U>A>Aω‖²₂ ≤ ε′²‖U‖²_F‖ω‖²₂
where the last inequality uses the approximate matrix product result shown on the next slide. Since ‖U‖²_F ≤ d, setting ε′ = √(ε/d) suffices.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 224 / 234
Appendix
Approximate Matrix Products
Given X ∈ Rn×d and Y ∈ Rd×p, let A ∈ Rm×d be one of the following matrices:
a JL transform matrix with m = Θ(ε⁻² log((n + p)/δ))
the sparse subspace embedding with m = Θ(ε⁻²)
a leverage-score sampling matrix based on pᵢ ≥ ‖Xᵢ∗‖²₂ / ‖X‖²F and m = Θ(ε⁻²)
Then, with high probability 1 − δ,

‖XA>AY − XY‖F ≤ ε‖X‖F‖Y‖F
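A small numpy sketch (added here) checking the approximate matrix product bound, using a dense Gaussian JL matrix as one admissible choice of A; all sizes are made up.

import numpy as np

rng = np.random.default_rng(0)
n, d, p, m = 200, 100, 50, 2000
X = rng.standard_normal((n, d))
Y = rng.standard_normal((d, p))

A = rng.standard_normal((m, d)) / np.sqrt(m)   # dense Gaussian JL matrix, E[A^T A] = I
err = np.linalg.norm(X @ A.T @ A @ Y - X @ Y, 'fro')
print(err / (np.linalg.norm(X, 'fro') * np.linalg.norm(Y, 'fro')))  # the achieved epsilon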
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 225 / 234
Appendix
Analysis for Randomized Least-square regression
Requirements on A ∈ Rm×n:
1. Subspace embedding: AU has full column rank
2. Matrix product approximation with ε′ = √(ε/d)
Order of m needed for each requirement:
JL transforms: 1. O(d log(d)), 2. O(d log(d)ε⁻¹) ⇒ m = O(d log(d)ε⁻¹)
Sparse subspace embedding: 1. O(d²), 2. O(dε⁻¹) ⇒ m = O(d²ε⁻¹)
If we use an SSE A1 ∈ Rm1×n followed by a JL transform A2 ∈ Rm2×m1:

‖A2A1(Xw2∗ − b)‖₂ ≤ (1 + ε)‖A1(Xw1∗ − b)‖₂ ≤ (1 + ε)‖A1(Xw∗ − b)‖₂ ≤ (1 + ε)²‖Xw∗ − b‖₂

with m1 = O(d²ε⁻²) and m2 = O(d log(d)ε⁻¹), where w2∗ is the optimal solution using A2A1, w1∗ is the optimal solution using A1, and w∗ is the original optimal solution.
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 226 / 234
Appendix
Randomized Least-squares regression
Theoretical Guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):

‖Xw̃∗ − b‖₂ ≤ (1 + ε)‖Xw∗ − b‖₂

If A is a fast JL transform with m = Θ(ε⁻¹d log(d)): total time O(nd log(m) + d³ log(d)ε⁻¹)
If A is a sparse subspace embedding with m = Θ(d²ε⁻¹): total time O(nnz(X) + d⁴ε⁻¹)
If A = A1A2 combines a fast JL transform (m1 = Θ(ε⁻¹d log(d))) and an SSE (m2 = Θ(d²ε⁻²)): total time O(nnz(X) + d³ log(d/ε)ε⁻²)
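A minimal sketch-and-solve example (added here); a dense Gaussian sketching matrix stands in for the fast JL and sparse embeddings analyzed above, and the data are made up. It compares the residual of the sketched solution with the exact least-squares residual.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 20, 400
X = rng.standard_normal((n, d))
b = X @ rng.standard_normal(d) + rng.standard_normal(n)

w_exact, *_ = np.linalg.lstsq(X, b, rcond=None)           # w* = argmin ||Xw - b||_2

A = rng.standard_normal((m, n)) / np.sqrt(m)              # Gaussian stand-in for the sketch
w_sketch, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)  # argmin ||A(Xw - b)||_2

r_exact = np.linalg.norm(X @ w_exact - b)
r_sketch = np.linalg.norm(X @ w_sketch - b)
print(r_sketch / r_exact)   # should be close to 1, i.e. a (1 + eps) approximation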
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 227 / 234
Appendix
Matrix Chernoff bound
Lemma (Matrix Chernoff (Tropp, 2012))
Let X be a finite set of PSD matrices of dimension k, and suppose that max_{X∈X} λmax(X) ≤ B. Sample X1, . . . , Xℓ independently from X. Compute

µmax = ℓ λmax(E[X1]),  µmin = ℓ λmin(E[X1])

Then

Pr{ λmax(∑_{i=1}^ℓ Xi) ≥ (1 + δ)µmax } ≤ k [ e^δ / (1 + δ)^{1+δ} ]^{µmax/B}

Pr{ λmin(∑_{i=1}^ℓ Xi) ≤ (1 − δ)µmin } ≤ k [ e^{−δ} / (1 − δ)^{1−δ} ]^{µmin/B}
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 228 / 234
Appendix
To simplify the usage of the Matrix Chernoff bound, we note that

[ e^{−δ} / (1 − δ)^{1−δ} ]^µ ≤ exp(−µδ²/2)

[ e^δ / (1 + δ)^{1+δ} ]^µ ≤ exp(−µδ²/3),  δ ≤ 1

[ e^δ / (1 + δ)^{1+δ} ]^µ ≤ exp(−µδ log(δ)/2),  δ > 1
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 229 / 234
Appendix
Noncommutative Bernstein Inequality
Lemma (Noncommutative Bernstein Inequality (Recht, 2011))
Let Z1, . . . , ZL be independent zero-mean random matrices of dimension d1 × d2. Suppose

τj² = max{ ‖E[Zj Zj>]‖₂, ‖E[Zj> Zj]‖₂ }

and ‖Zj‖₂ ≤ M almost surely for all j. Then, for any ε > 0,

Pr( ‖∑_{j=1}^L Zj‖₂ > ε ) ≤ (d1 + d2) exp[ −(ε²/2) / (∑_{j=1}^L τj² + Mε/3) ]
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 230 / 234
Appendix
Randomized Algorithms for K-means Clustering
K-means: ∑_{j=1}^k ∑_{xi∈Cj} ‖xi − µj‖²₂ = ‖X − CC>X‖²F

where C ∈ Rn×k is the scaled cluster indicator matrix such that C>C = I.

Constrained Low-rank Approximation (Cohen et al., 2015):

min_{P∈S} ‖X − PX‖²F

where S is any set of rank-k orthogonal projection matrices P = QQ> with orthonormal Q ∈ Rn×k

Low-rank Approximation: S is the set of all rank-k orthogonal projection matrices, and the minimizer is P∗ = UkUk>
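A short numerical check (added here) of the identity between the k-means objective and ‖X − CC>X‖²F for the scaled cluster indicator matrix C; the data and the clustering are made up.

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 60, 5, 3
X = rng.standard_normal((n, d))
labels = rng.integers(0, k, size=n)            # an arbitrary clustering

# scaled cluster indicator matrix: C[i, j] = 1/sqrt(|C_j|) if x_i is in cluster j
C = np.zeros((n, k))
for j in range(k):
    idx = np.where(labels == j)[0]
    C[idx, j] = 1.0 / np.sqrt(len(idx))

kmeans_cost = sum(np.linalg.norm(X[labels == j] - X[labels == j].mean(axis=0), 'fro')**2
                  for j in range(k))
matrix_cost = np.linalg.norm(X - C @ C.T @ X, 'fro')**2
print(kmeans_cost, matrix_cost)                # the two values agree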
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 231 / 234
Appendix
Randomized Algorithms for K-means Clustering
Define
P∗ = arg min_{P∈S} ‖X − PX‖²F   (computed on the original data X)
P̃∗ = arg min_{P∈S} ‖X̃ − PX̃‖²F   (computed on the dimension-reduced data X̃)

Guarantee on the approximation:

‖X − P̃∗X‖²F ≤ (1 + ε)/(1 − ε) · ‖X − P∗X‖²F
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 232 / 234
Appendix
Properties of Leverage-score sampling
We prove the properties using the Matrix Chernoff bound. Let Ω = AU.

Ω>Ω = (AU)>(AU) = ∑_{j=1}^m (1/(m p_{ij})) u_{ij} u_{ij}>

Let Xi = (1/(m pi)) ui ui>. Then E[Xi] = (1/m) Ik, so λmax(E[Xi]) = λmin(E[Xi]) = 1/m,
and λmax(Xi) ≤ max_i ‖ui‖²₂/(m pi) = k/m. Applying the Matrix Chernoff bound to the minimum and maximum eigenvalues, we have

Pr( λmin(Ω>Ω) ≤ (1 − ε) ) ≤ k exp(−mε²/(2k)) ≤ k exp(−mε²/(3k))

Pr( λmax(Ω>Ω) ≥ (1 + ε) ) ≤ k exp(−mε²/(3k))
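A sketch (added) of the leverage-score sampling construction used in this argument: sample m rows of U with probabilities pᵢ = ‖Uᵢ∗‖²₂/k, rescale by 1/√(m pᵢ), and check that the eigenvalues of Ω>Ω concentrate around 1. The subspace is randomly generated for illustration.

import numpy as np

rng = np.random.default_rng(0)
n, k, m = 5000, 10, 1500

# an orthonormal basis U (n x k) of a random subspace
U, _ = np.linalg.qr(rng.standard_normal((n, k)))

# leverage scores and sampling probabilities p_i = ||U_i*||_2^2 / k (they sum to 1)
lev = np.sum(U**2, axis=1)
p = lev / k

idx = rng.choice(n, size=m, p=p)                 # sample m rows with replacement
Omega = U[idx] / np.sqrt(m * p[idx])[:, None]    # rows rescaled by 1/sqrt(m p_i); Omega = AU
eigs = np.linalg.eigvalsh(Omega.T @ Omega)
print(eigs.min(), eigs.max())                    # both should lie in a narrow interval around 1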
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 233 / 234
Appendix
When does uniform sampling make sense?

Coherence measure: µk = (d/k) max_{1≤i≤d} ‖Ui∗‖²₂

When µk ≤ τ and m = Θ( (kτ/ε²) log(2k/δ) ), then with high probability 1 − δ, for A formed by uniform sampling (and scaling):
AU ∈ Rm×k has full column rank
σi²(AU) ≥ (1 − ε) ≥ (1 − ε)²
σi²(AU) ≤ (1 + ε) ≤ (1 + ε)²

Valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nyström method usually uses uniform sampling (Gittens, 2011)
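A short sketch (added) computing the coherence measure µk = (d/k) max_i ‖Ui∗‖²₂ for the top-k left singular subspace of a matrix; small values are what justify plain uniform sampling. The matrix here is a made-up Gaussian example, which typically has low coherence.

import numpy as np

rng = np.random.default_rng(0)
d, k = 2000, 10

# a matrix whose top-k left singular subspace we examine
X = rng.standard_normal((d, 300))
U, _, _ = np.linalg.svd(X, full_matrices=False)
Uk = U[:, :k]

row_norms = np.sum(Uk**2, axis=1)         # leverage scores of the rows
mu_k = (d / k) * row_norms.max()          # coherence; mu_k lies between 1 and d/k
print(mu_k)                               # small values favor uniform sampling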
Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 234 / 234