Optimization for Data Science
AMIES – ILB
S. Gaïffas
Motivations
1. Partnership with Caisse Nationale d'Assurance Maladie (CNAM)
World’s largest electronic health records database
Extremely complex data preprocessing
Pharmacovigilance: detect potentially dangerous drugs
2. Time-oriented machine learning
High-frequency financial signals
Social networks
“Causality maps”
Motivation 1. CNAM Project
Context
Electronic health records: SNIIRAM + PMSI
Extremely complex database: 800 SQL tables, 500 TB, all in a closed SAS-Oracle ecosystem (Exadata)

All health-care reimbursements of the French population (with diagnoses, prescriptions, hospital stays, etc.)
Applications with a strong social impact
Goals
Pharmacovigilance: automatically detect potentially dangerous drugs(screening)
Examples: some anti-diabetics and bladder cancer, drug changes and fractures (in elderly persons)
Team
6 engineers
Administration, big data development
Motivation 1. CNAM Project
Big data cluster
“Scalable” architecture
4 masters
15 slaves
240 cores
1.9 TB RAM

480 TB (120 hard drives)
HDFS
Spark (mostly spark.sql), Scala

Only open-source technology
Motivation 1. CNAM project
Administration of two duplicate clusters (CNAM and X)
Understanding of the data
“Flattening” of the data
All from “raw” CSV extracts...
Work/Code management
Production in “agile” mode
Confluence + JIRA + GitHub
Motivation 1. CNAM Project
Types of results
For antidiabetics and bladder cancer
New model for longitudinal data for “self-controlled case series”
Validation: blind detection of a well-known adverse effect of some drug (suppressed from the French market in 2011)
Motivation 2. Time-oriented machine learning
From raw timestamped events, we want to quantify interactions:
Motivation 2. Time-oriented machine learning
Hawkes process
$N = [N_1, \ldots, N_d]^\top$ where $N_i(t) = \sum_{k \geq 1} \mathbf{1}_{t_k^i \leq t}$

$N_i$ jumps each time $i$ does something (e.g. tweet, price up or down)

Model: $N_i$ has an intensity $\lambda_i$ given by

$$\lambda_i(t) = \mu_i(t) + \sum_{j=1}^{d} \int_{(0,t)} \varphi_{ij}(t - s) \, dN_j(s)$$

$\lambda_1(t)$ and corresponding ticks with $d = 1$ and $\varphi_{11}(t) = e^{-t}$
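As a toy illustration (our own sketch, not from the talk), the one-dimensional intensity with the exponential kernel above can be evaluated directly; `hawkes_intensity` and the event times are hypothetical names and data:

```python
import numpy as np

def hawkes_intensity(t, events, mu=1.0):
    """Intensity lambda_1(t) of a 1-dimensional Hawkes process with
    constant baseline mu and exponential kernel phi_11(s) = exp(-s)."""
    past = events[events < t]             # only events strictly before t contribute
    return mu + np.exp(-(t - past)).sum()

# Each event excites the process; the excitation decays exponentially.
events = np.array([0.5, 1.0, 1.2])
print(hawkes_intensity(0.0, events))   # no past events yet: baseline 1.0
print(hawkes_intensity(1.3, events))   # excited by the three past events
```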
Motivation 2. Time-oriented machine learning
Achab, Bacry, Gaïffas, Mastromatteo, Muzy, Uncovering Causality from Multivariate Hawkes Integrated Cumulants, ICML 2017
Granger causality estimation via integrated cumulants
Highly non-convex problem
Application on social networks and financial datasets (MemeTracker and DAX data):

| Method | ODE   | GC    | ADM4  | NPHC  |
|--------|-------|-------|-------|-------|
| Err    | 0.162 | 0.19  | 0.092 | 0.071 |
| Corr   | 0.07  | 0.053 | 0.081 | 0.095 |
| Time   | 2944  | 2780  | 2217  | 38    |
Motivation 2. Time-oriented machine learning
Lead/lag + flow prediction
Software: tick library
Python 3 and C++11
Open-source (BSD-3 License)
pip install tick (on macOS and Linux...)
https://x-datainitiative.github.io/tick
Statistical learning for time-dependent models
Point processes (Poisson, Hawkes), survival analysis, GLMs (parallelized, sparse, etc.)
A strong simulation and optimization toolbox
Partnership with Intel (use-case of a new processor with 180 cores)
Contributors welcome!
Software: tick library
Optimization for data science
You want to train a logistic regression with ridge penalization:
$$\operatorname*{argmin}_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + e^{-y_i x_i^\top w}\big) + \frac{\lambda}{2} \|w\|_2^2 \right\}$$
You have many ways to do it:
Gradient descent
Coordinate descent
Quasi-newton (BFGS)
Stochastic gradient descent
Dual, primal-dual methods
...
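As a concrete reference point (a minimal sketch; `ridge_logistic_objective` is our own name, not tick's API), the objective above can be evaluated with NumPy:

```python
import numpy as np

def ridge_logistic_objective(w, X, y, lam):
    """(1/n) * sum_i log(1 + exp(-y_i * x_i^T w)) + (lam/2) * ||w||_2^2."""
    margins = y * (X @ w)                        # y_i * x_i^T w
    loss = np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + e^{-m})
    return loss + 0.5 * lam * (w @ w)

# Sanity check: at w = 0 every margin is 0, so the loss is log(2).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
print(ridge_logistic_objective(np.zeros(5), X, y, lam=0.1))  # ≈ 0.6931
```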
Optimization for data science
You’re likely to get very different performances:
Many linear methods for supervised learning can be written as the following problem:

$$\operatorname*{argmin}_{w \in \mathbb{R}^d} \left\{ \frac{1}{n} \sum_{i=1}^{n} f_i(w) + \lambda g(w) \right\}$$

where

$f_i(w)$ = "loss" of the model weights $w$ on the $i$-th data point

$g$ is a penalization

Examples where $f_i(w) = \ell(y_i, x_i^\top w)$:

Linear regression: $\ell(y, y') = \frac{1}{2}(y - y')^2$

Logistic regression: $\ell(y, y') = \log(1 + e^{-y y'})$

Hinge loss (SVM): $\ell(y, y') = (1 - y y')_+$

And let's define $f = \frac{1}{n} \sum_{i=1}^{n} f_i$. NB: the goodness-of-fit is an average
“Simplest” algorithm: gradient descent
Input: starting point $w^0$, learning rate $\eta > 0$

For $k = 1, 2, \ldots$ until converged do

$$w^k \leftarrow \operatorname{prox}_{\eta g}\big(w^{k-1} - \eta \nabla f(w^{k-1})\big)$$

Return last $w^k$
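A minimal NumPy sketch of this scheme (all names are ours; the least-squares/ridge example is just for illustration):

```python
import numpy as np

def proximal_gradient_descent(grad_f, prox_g, w0, eta, n_iter=2000):
    """w_k <- prox_{eta*g}(w_{k-1} - eta * grad_f(w_{k-1}))."""
    w = w0.copy()
    for _ in range(n_iter):
        w = prox_g(w - eta * grad_f(w), eta)
    return w

# Example: f(w) = (1/2n)||Xw - y||^2 with ridge g(w) = (lam/2)||w||^2,
# whose prox is a simple shrinkage: prox_{eta*g}(v) = v / (1 + eta*lam).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)); y = rng.normal(size=50); lam = 0.1
grad_f = lambda w: X.T @ (X @ w - y) / len(y)
prox_g = lambda v, eta: v / (1.0 + eta * lam)
w_hat = proximal_gradient_descent(grad_f, prox_g, np.zeros(3), eta=0.1)
```

A fixed point of the update satisfies $\nabla f(w) + \lambda w = 0$, i.e. it is a stationary point of the full ridge-penalized objective.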
What if sample size n is large?
Each iteration of a full gradient method has complexity O(nd)
You need to wait some time before doing anything...
Idea: we want to minimize an average of losses...
If I choose $I$ uniformly at random in $\{1, \ldots, n\}$, then

$$\mathbb{E}[\nabla f_I(w)] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w) = \nabla f(w)$$

$\nabla f_I(w)$ is an unbiased but very noisy estimate of the full gradient $\nabla f(w)$

Stochastic Gradient Descent: Robbins and Monro (1951)
Stochastic Gradient Descent
Input: starting point $w^0$, sequence of learning rates $\{\eta_t\}_{t \geq 0}$

For $t = 1, 2, \ldots$ until convergence do

Sample $i_t$ uniformly in $\{1, \ldots, n\}$

$$w^t \leftarrow w^{t-1} - \eta_t \nabla f_{i_t}(w^{t-1})$$

Return last $w^t$
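A bare-bones sketch of the loop above (our names; noiseless least-squares toy data):

```python
import numpy as np

def sgd(grad_fi, n, w0, t_max=10_000, eta0=1.0):
    """Plain SGD with decaying step sizes eta_t = eta0 / (1 + t)."""
    w = w0.copy()
    for t in range(1, t_max + 1):
        i = np.random.randint(n)                 # sample i_t uniformly
        w -= (eta0 / (1 + t)) * grad_fi(w, i)
    return w

# Least-squares example: f_i(w) = (1/2)(x_i^T w - y_i)^2
np.random.seed(0)
X = np.random.normal(size=(200, 3))
w_star = np.array([1.0, -2.0, 0.5])
y = X @ w_star                                   # noiseless targets
grad_fi = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = sgd(grad_fi, n=200, w0=np.zeros(3))
```

Each step costs $O(d)$: it touches a single row of X, as opposed to a full pass for gradient descent.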
Remarks
Each iteration has complexity O(d) instead of O(nd) for full gradient methods
Actually faster for sparse datasets (lazy or delayed updates)
Very fast in the early iterations (first passes on the data)
Very slow convergence to a precise minimizer
The problem
Put $X = \nabla f_I(w)$ with $I$ uniformly chosen at random in $\{1, \ldots, n\}$.

In SGD we use $X = \nabla f_I(w)$ as an approximation of $\mathbb{E}X = \nabla f(w)$.

How to reduce $\operatorname{var} X$?
Recent results improve this:
Bottou and LeCun (2005)
Shalev-Shwartz et al (2007, 2009)
Nesterov et al. (2008, 2009)
Bach et al. (2011, 2012, 2014, 2015)
T. Zhang et al. (2014, 2015)
An idea
Reduce it by finding $C$ such that $\mathbb{E}C$ is "easy" to compute and such that $C$ is highly correlated with $X$.

Put $Z_\alpha = \alpha(X - C) + \mathbb{E}C$ for $\alpha \in [0, 1]$. We have

$$\mathbb{E}Z_\alpha = \alpha \mathbb{E}X + (1 - \alpha) \mathbb{E}C$$

and

$$\operatorname{var} Z_\alpha = \alpha^2 \big(\operatorname{var} X + \operatorname{var} C - 2 \operatorname{cov}(X, C)\big)$$

Standard variance reduction: $\alpha = 1$, so that $\mathbb{E}Z_\alpha = \mathbb{E}X$ (unbiased)
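Numerically the effect is easy to see on a toy pair (X, C) (synthetic scalars, not gradients; all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# X: noisy quantity of interest with E[X] = 3; C: control variate that is
# highly correlated with X and whose expectation E[C] = 3 is known exactly.
shared = rng.normal(size=100_000)
X = 3.0 + shared + 0.1 * rng.normal(size=100_000)
C = 3.0 + shared

alpha = 1.0
Z = alpha * (X - C) + 3.0     # standard (unbiased) variance reduction

print(X.var(), Z.var())        # var Z = var(X - C), far smaller than var X
```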
Variance reduction of the gradient
In the iterations of SGD, replace $\nabla f_{i_t}(w^{t-1})$ by

$$\alpha \big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\tilde w)\big) + \nabla f(\tilde w)$$

where $\tilde w$ is an "old" value of the iterate, namely use

$$w^t \leftarrow w^{t-1} - \eta \Big(\alpha \big(\nabla f_{i_t}(w^{t-1}) - \nabla f_{i_t}(\tilde w)\big) + \nabla f(\tilde w)\Big)$$

Several cases:

$\alpha = 1/n$: SAG (Bach et al. 2013)

$\alpha = 1$: SVRG (T. Zhang et al. 2015)

$\alpha = 1$: SAGA (Bach et al. 2014)
Stochastic Variance Reduced Gradient (SVRG)

Phase size typically chosen as $m = n$ or $m = 2n$. If $F = f + g$ with $g$ prox-capable, use

$$w_k^{t+1} \leftarrow \operatorname{prox}_{\eta g}\Big(w_k^t - \eta \big(\nabla f_i(w_k^t) - \nabla f_i(\tilde w_k) + \nabla f(\tilde w_k)\big)\Big)$$

SAGA

If $F = f + g$ with $g$ prox-capable, use

$$w^t \leftarrow \operatorname{prox}_{\eta g}\Big(w^{t-1} - \eta \Big(\nabla f_{i_t}(w^{t-1}) - g^{t-1}(i_t) + \frac{1}{n} \sum_{i=1}^{n} g^{t-1}(i)\Big)\Big)$$

Important remark

In these algorithms, the step-size $\eta$ is kept constant

Leads to linearly convergent algorithms, with a numerical complexity comparable to SGD!
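For the smooth case ($g \equiv 0$, so the prox disappears) the SVRG loop can be sketched as follows; names and the toy data are ours:

```python
import numpy as np

def svrg(grad_fi, full_grad, n, w0, eta, n_phases=20, m=None):
    """SVRG: per phase, freeze a reference point w_ref and its full
    gradient, then do m inner steps with the variance-reduced direction
    grad_fi(w, i) - grad_fi(w_ref, i) + full_grad(w_ref)."""
    m = m or n                         # phase size m = n, a common choice
    w = w0.copy()
    for _ in range(n_phases):
        w_ref = w.copy()
        mu = full_grad(w_ref)          # one full pass over the data per phase
        for _ in range(m):
            i = np.random.randint(n)
            w -= eta * (grad_fi(w, i) - grad_fi(w_ref, i) + mu)
    return w

# Least-squares toy problem
np.random.seed(0)
X = np.random.normal(size=(200, 3))
w_star = np.array([1.0, -2.0, 0.5])
y = X @ w_star
grad_fi = lambda w, i: (X[i] @ w - y[i]) * X[i]
full_grad = lambda w: X.T @ (X @ w - y) / len(y)
w_hat = svrg(grad_fi, full_grad, n=200, w0=np.zeros(3), eta=0.01)
```

Note the constant step size `eta`, in contrast with the decaying schedule plain SGD needs.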
Theoretical knowledge
What is the order of the computational cost $K$ required to achieve $\varepsilon$-precision:

$$F(w^K) - F(w^\star) \leq \varepsilon \, ?$$

where $w^\star$ is a minimizer of $F$. If

$$\mu \leq \lambda_{\min}(\nabla^2 f) \leq \lambda_{\max}(\nabla^2 f) \leq L,$$

we know that, with $\kappa = \frac{L}{\mu}$:

Gradient descent: $K = \Theta\big(d \times n \times \kappa \times \log\big(\frac{1}{\varepsilon}\big)\big)$

SGD: $K = \Theta\big(d \times \kappa \times \frac{1}{\varepsilon}\big)$

If

$$\mu \leq \lambda_{\min}(\nabla^2 f) \quad \text{and} \quad \lambda_{\max}(\nabla^2 f_i) \leq L_i \ \text{for all } i = 1, \ldots, n,$$

we know that, with $\kappa = \frac{\max_i L_i}{\mu}$:

SAG, SAGA, SVRG, SDCA: $K = \Theta\big(d \times (n + \kappa) \times \log\big(\frac{1}{\varepsilon}\big)\big)$
Algorithms comparison
Asynchronous mode
All the above algorithms can be strongly parallelized, using an asynchronous mode involving "lock-free" updates.

Lock-free SGD: apply in parallel

sample $i$ uniformly in $\{1, \ldots, n\}$

$w \leftarrow w - \eta \nabla f_i(w)$

without locking $w$ (hence allowing collisions)
References (with generalizations and variance-reduction)
Niu et al. (2011), Hsieh et al. (2015), Reddi et al. (2015), Mania et al. (2015), Zhao and Li (2016), Leblond, Pedregosa, Lacoste-Julien (2017)
[from Leblond, Pedregosa, Lacoste-Julien (2017)]
The Poisson problem: non-smooth objectives
Great! So, let’s use these powerful tools for our problems about:
Health (Poisson-type models)
Hawkes processes (Social networks, Finance)
Our general optimization problem is

$$\min_{w \in \mathbb{R}^d} \ \psi^\top w + \frac{1}{n} \sum_{i=1}^{n} f_i(w^\top x_i) + \lambda g(w)$$

subject to $x_i^\top w > 0$ for all $i = 1, \ldots, n$,

with $f_i(u) = -y_i \log(u)$ and $y_i > 0$.
Example 1. Poisson regression (linear link)
$$\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \big(w^\top x_i - y_i \log(w^\top x_i)\big) + g(w)$$

subject to $x_i^\top w > 0$ for all $i = 1, \ldots, n$.
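A direct transcription of this constrained objective (our own sketch, not tick's API) makes the difficulty visible: the objective is only defined on the open set where $x_i^\top w > 0$:

```python
import numpy as np

def poisson_objective(w, X, y, g=lambda w: 0.0):
    """(1/n) * sum_i (x_i^T w - y_i * log(x_i^T w)) + g(w), with the
    convention that the value is +inf outside the feasible set."""
    u = X @ w
    if np.any(u <= 0):                 # constraint x_i^T w > 0 violated
        return np.inf
    return np.mean(u - y * np.log(u)) + g(w)

# Toy data with positive features, so that any w > 0 is feasible
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.5, size=(50, 3))
w_true = np.array([1.0, 0.5, 0.2])
y = rng.poisson(X @ w_true).astype(float)
print(poisson_objective(np.ones(3), X, y))    # finite: feasible point
print(poisson_objective(-np.ones(3), X, y))   # inf: infeasible point
```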
The Poisson problem: non-smooth objectives
Example 2. Hawkes process’ log-likelihood
$$\min_{w \in \mathbb{R}^d} \ \psi^\top w - \sum_{i=1}^{d} \sum_{k=1}^{N^i} \log(w^\top x_{i,k})$$

subject to $x_{i,k}^\top w > 0$ for all $i = 1, \ldots, d$ and $k = 1, \ldots, N^i$,

for some choice of $x_{i,k}$, with $n = \sum_{i=1}^{d} N^i$.
Problem
$-\log$ is non-smooth! (its gradient is not even bounded). Standard theory is useless. Linear rates?

In practice: hard to tune the step-size
The Poisson problem: non-smooth objectives
Most algorithms won't even converge (example on a non-pathological Hawkes process, with sparse/small parameters)
[Figure: reached precision (from $10^{-10}$ to $10^2$, log scale) vs. number of iterations (0 to 50) for L-BFGS-B, Ista, Fista, SCPG, SVRG and SDCA]
The Poisson problem: non-smooth objectives
Idea (work with M. Bompaire)
Use a slightly modified Stochastic Dual Coordinate Ascent (T. Zhang et al. 2015):

$$\sup_{\alpha \in \mathbb{R}^n : \, \alpha_i > 0} \ \frac{1}{n} \sum_{i=1}^{n} -f_i^*(-\alpha_i) - \lambda g^*\Big(\frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i - \frac{1}{\lambda} \psi\Big)$$

with the primal-dual relation:

$$w = \nabla g^*\Big(\frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i - \frac{1}{\lambda} \psi\Big),$$

where $f_i^*$ and $g^*$ are the convex conjugates of $f_i$ and $g$. Since $f_i(u) = -y_i \log(u)$, we have $f_i^*(v) = -y_i - y_i \log\big(\frac{-v}{y_i}\big)$.
The Poisson problem: non-smooth objectives
Algorithm.

Input: starting point $\alpha^{(0)}$

Put $w^{(0)} = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i^{(0)} x_i - \frac{1}{\lambda} \psi$

For $t = 1, 2, \ldots, T$ do:

Randomly pick $i$

Find $\alpha_i$ that maximizes

$$y_i + y_i \log\Big(\frac{\alpha_i}{y_i}\Big) - \frac{\lambda n}{2} \Big\| w^{(t-1)} + \frac{1}{\lambda n} \big(\alpha_i - \alpha_i^{(t-1)}\big) x_i \Big\|_2^2 \quad (1)$$

(explicit solution)

$\alpha^{(t)} \leftarrow \alpha^{(t-1)} + \Delta \alpha_i e_i$

$w^{(t)} \leftarrow w^{(t-1)} + (\lambda n)^{-1} \Delta \alpha_i x_i$
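The "explicit solution" comes from the first-order condition of the one-coordinate maximization above: setting the derivative in $\alpha_i$ to zero gives a quadratic equation whose positive root is the update. A sketch of that computation (our own derivation and naming, under the setting above):

```python
import numpy as np

def sdca_alpha_update(w, x_i, alpha_old, y_i, lam, n):
    """Maximizer of y_i + y_i*log(alpha/y_i)
       - (lam*n/2) * ||w + (alpha - alpha_old) * x_i / (lam*n)||^2.
    Stationarity gives s*alpha^2 + b*alpha - y_i = 0 with the constants
    below; since s > 0 and y_i > 0, the positive root is the maximizer."""
    s = (x_i @ x_i) / (lam * n)
    b = w @ x_i - alpha_old * s
    return (-b + np.sqrt(b * b + 4.0 * s * y_i)) / (2.0 * s)

# Tiny numerical check of stationarity: y_i/alpha - (b + s*alpha) = 0
w = np.array([0.2, -0.1]); x_i = np.array([1.0, 2.0])
alpha = sdca_alpha_update(w, x_i, alpha_old=0.5, y_i=1.0, lam=0.1, n=10)
```

Note the discriminant $b^2 + 4 s y_i > b^2$ guarantees a strictly positive root, so the dual constraint $\alpha_i > 0$ is maintained automatically.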
Contribution
A new state-of-the-art for Poisson regression and Hawkes processes
Provable linear rates (using self-concordance)
The Poisson problem: non-smooth objectives
The Poisson problem in tick
Parallelized version of the algorithm in tick (in development)
Conclusion
Optimization for machine learning: many recent developments
Distributed / parallel / lock-free
Variance reduction
Results also for non-convex problems
But many problems from statistical learning don't fit these frameworks

Such as the ones mentioned here... there is still a lot to do
Thank you!