Gradient Descent - edX (transcript of lecture slides)
Gradient Descent
Linear Regression Optimization
Goal: Find w* that minimizes f(w) = ||Xw - y||_2^2
• A closed-form solution exists
• Gradient descent is iterative (intuition: go downhill!)
Scalar objective: f(w) = ||wx - y||_2^2 = Σ_{j=1}^n (w x^(j) - y^(j))^2
Gradient Descent
Start at a random point
(Figure: f(w) plotted against w, with starting point w0 and minimum w*)
Gradient Descent
Start at a random point
Repeat:
  Determine a descent direction
  Choose a step size
  Update
Until stopping criterion is satisfied
(Figure: successive iterates w0, w1, w2, ... descending toward w*)
Where Will We Converge?
• Convex f(w): any local minimum is a global minimum
• Non-convex g(w): multiple local minima may exist
Least Squares, Ridge Regression, and Logistic Regression are all convex!
Choosing Descent Direction (1D)
We can only move in two directions, and the negative slope is the direction of descent!
• Slope positive ⇒ go left!
• Slope negative ⇒ go right!
• Slope zero ⇒ done!
Update Rule: w_{i+1} = w_i - α_i (df/dw)(w_i)
  (α_i is the step size; -(df/dw)(w_i) is the negative slope)
Choosing Descent Direction
In R^d we can move anywhere, and the negative gradient is the direction of steepest descent!
2D example: function values are shown in black/white (black represents higher values); arrows are gradients.
("Gradient2" by Sarang. Licensed under CC BY-SA 2.5 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Gradient2.svg#/media/File:Gradient2.svg)
Update Rule: w_{i+1} = w_i - α_i ∇f(w_i)
  (α_i is the step size; -∇f(w_i) is the negative gradient)
Gradient Descent for Least Squares
Update Rule: w_{i+1} = w_i - α_i (df/dw)(w_i)
Scalar objective: f(w) = ||wx - y||_2^2 = Σ_{j=1}^n (w x^(j) - y^(j))^2
Derivative (chain rule): (df/dw)(w) = 2 Σ_{j=1}^n (w x^(j) - y^(j)) x^(j)
Scalar update: w_{i+1} = w_i - α_i Σ_{j=1}^n (w_i x^(j) - y^(j)) x^(j)   (the 2 is absorbed into α)
Vector update: w_{i+1} = w_i - α_i Σ_{j=1}^n (w_i^T x^(j) - y^(j)) x^(j)
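The scalar update can be sketched in a few lines of NumPy. The toy data, starting point, and base step size below are illustrative assumptions (not values from the lecture); the decreasing schedule α_i = α/(n√i) is the common choice mentioned on the step-size slide.

```python
import numpy as np

def gd_least_squares(x, y, w0=0.0, alpha=0.1, iters=500):
    """Scalar gradient descent for f(w) = sum_j (w*x_j - y_j)^2.

    Uses w_{i+1} = w_i - alpha_i * sum_j (w_i x^(j) - y^(j)) x^(j),
    with the factor of 2 absorbed into alpha and the decreasing
    step size alpha_i = alpha / (n * sqrt(i+1)).
    """
    n = len(x)
    w = w0
    for i in range(iters):
        grad = np.sum((w * x - y) * x)   # sum of summands (w x^(j) - y^(j)) x^(j)
        w = w - (alpha / (n * np.sqrt(i + 1))) * grad
    return w

# Toy data generated for illustration: y = 2x exactly, so the minimizer is w* = 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x
w = gd_least_squares(x, y, alpha=0.1, iters=500)
```

With noiseless data the iterates contract toward w* = 2 at every step, since each step size here keeps the update factor strictly between 0 and 1.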
Choosing Step Size
Theoretical convergence results exist for various step sizes.
• Too small: converge very slowly
• Too big: overshoot and can even diverge
• A good compromise: reduce the step size over time
A common step size is α_i = α / (n √i), where α is a constant, n is the number of training points, and i is the iteration number.
Parallel Gradient Descent for Least Squares
Vector update: w_{i+1} = w_i - α_i Σ_{j=1}^n (w_i^T x^(j) - y^(j)) x^(j)
Compute the summands in parallel! (Note: all workers must have w_i.)
Example: n = 6, 3 workers; each worker stores two of the points x^(1), ..., x^(6) (O(nd) distributed storage).
map: each worker computes (w_i^T x^(j) - y^(j)) x^(j) for its local points (O(nd) distributed computation, O(d) local storage)
reduce: sum the summands to get Σ_{j=1}^n (w_i^T x^(j) - y^(j)) x^(j) and form w_{i+1} (O(d) local computation, O(d) local storage)
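This map/reduce pattern can be simulated in plain Python, with lists standing in for the three workers. Everything below (data, partitioning, step size, function names) is an illustrative assumption, not code from the course.

```python
import numpy as np
from functools import reduce

def gradient_summand(w, x, y):
    """map step: one summand (w_i^T x^(j) - y^(j)) x^(j), an O(d) vector."""
    return (w @ x - y) * x

def parallel_gradient_step(w, partitions, alpha):
    """One iteration: each 'worker' maps over its local (x, y) points,
    then the O(d)-sized summands are summed in the reduce step."""
    summands = [gradient_summand(w, x, y)
                for part in partitions for (x, y) in part]
    return w - alpha * reduce(lambda a, b: a + b, summands)

# n = 6 points in d = 2 dimensions, spread over 3 "workers" as on the slide.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [1.0, -1.0], [2.0, 1.0], [0.5, -0.5]])
w_true = np.array([1.0, -2.0])
y = X @ w_true                        # noiseless labels, so the minimizer is w_true
partitions = [list(zip(X[i:i + 2], y[i:i + 2])) for i in (0, 2, 4)]

w = np.zeros(2)                       # every worker must hold the current w
for _ in range(200):
    w = parallel_gradient_step(w, partitions, alpha=0.05)
```

In a real cluster the map would run on the workers and only the d-dimensional summands (and the updated w) would cross the network, matching the O(d) communication noted on the slide.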
Gradient Descent Summary
Pros:
• Easily parallelized
• Cheap at each iteration
• Stochastic variants can make things even cheaper
Cons:
• Slow convergence (especially compared with closed-form)
• Requires communication across nodes!
Communication Hierarchy
Communication Hierarchy
CPU: ~2 billion cycles/sec per core. Clock speeds are not changing, but the number of cores is growing with Moore's Law.
Communication Hierarchy
CPU ↔ RAM: 50 GB/s. RAM capacity: 10-100 GB, growing with Moore's Law.
Communication Hierarchy
CPU ↔ RAM: 50 GB/s; RAM ↔ Disk: 100 MB/s. Disk capacity: 1-2 TB, growing exponentially, but speed is not.
Communication Hierarchy: Network
CPU ↔ RAM: 50 GB/s; disk: 100 MB/s.
• Nodes in the same rack (via a top-of-rack switch): 10 Gbps (~1 GB/s)
• Nodes in other racks: 3 Gbps
Summary: Access rates fall sharply with distance; there is a 50× gap between memory and network!
• RAM: 50 GB/s
• Local disks: 1 GB/s
• Same rack: 1 GB/s
• Different racks: 0.3 GB/s
Must be mindful of this hierarchy when developing parallel algorithms!
Distributed ML: Communication Principles
Communication Hierarchy
Access rates fall sharply with distance.
• Parallelism makes computation fast
• Network makes communication slow
(RAM: 50 GB/s; local disks: 1 GB/s; same rack: 1 GB/s; different racks: 0.3 GB/s)
Must be mindful of this hierarchy when developing parallel algorithms!
2nd Rule of Thumb: Perform parallel and in-memory computation
Persisting in memory reduces communication, especially for iterative computation (gradient descent).
Scale-up (powerful multicore machine):
• No network communication
• Expensive hardware; eventually hit a wall
2nd Rule of Thumb: Perform parallel and in-memory computation
Scale-out (distributed, e.g., cloud-based):
• Need to deal with network communication
• Commodity hardware; scales to massive problems
2nd Rule of Thumb: Perform parallel and in-memory computation
In the scale-out setting, persist the training data in memory across iterations.
3rd Rule of Thumb: Minimize network communication
Q: How should we leverage distributed computing while mitigating network communication?
First observation: we need to store and potentially communicate data, model, and intermediate objects.
• A: Keep large objects local
3rd Rule of Thumb: Minimize Network Communication - Stay Local
Example: linear regression, big n and small d
• Solve via the closed form (not iterative!)
• Communicate O(d^2) intermediate data
• Compute locally on data (data parallel)
Example: n = 6, 3 workers; each worker stores two of the points x^(1), ..., x^(6)
map: each worker computes the outer product x^(j) (x^(j))^T for its local points
reduce: sum the outer products and invert: (Σ_j x^(j) (x^(j))^T)^(-1)
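A serial NumPy sketch of this closed-form pattern, with loops standing in for the workers (all names and data are illustrative). Note that the full closed form w = (X^T X)^(-1) X^T y also needs the right-hand side X^T y, which is accumulated the same way as the outer products; the slide leaves that part implicit.

```python
import numpy as np

def closed_form_map(x, y):
    """map step: each point contributes its d x d outer product x x^T,
    plus x * y for the right-hand side (implicit on the slide)."""
    return np.outer(x, x), x * y

def closed_form_reduce(partitions):
    """reduce step: sum the O(d^2) contributions, then solve the small
    d x d system locally: w = (sum_j x^(j) x^(j)^T)^{-1} (sum_j x^(j) y^(j))."""
    d = partitions[0][0][0].shape[0]
    xtx, xty = np.zeros((d, d)), np.zeros(d)
    for part in partitions:          # each "worker"
        for x, y in part:            # map over local points
            xx, xy = closed_form_map(x, y)
            xtx += xx
            xty += xy
    return np.linalg.solve(xtx, xty)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [1.0, -1.0], [2.0, 1.0], [0.5, -0.5]])
w_true = np.array([3.0, 0.5])
y = X @ w_true
partitions = [list(zip(X[i:i + 2], y[i:i + 2])) for i in (0, 2, 4)]
w = closed_form_reduce(partitions)
```

Only O(d^2) data ever needs to cross the network, which is why this stays cheap when d is small regardless of how big n is.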
3rd Rule of Thumb: Minimize Network Communication - Stay Local
Example: linear regression, big n and big d
• Gradient descent: communicate w_i each iteration
• O(d) communication is OK for fairly large d
• Compute locally on data (data parallel)
map: each worker computes (w_i^T x^(j) - y^(j)) x^(j) for its local points
reduce: sum to get Σ_{j=1}^n (w_i^T x^(j) - y^(j)) x^(j) and form w_{i+1}
3rd Rule of Thumb: Minimize Network Communication - Stay Local
Example: hyperparameter tuning for ridge regression with small n and small d
• Data is small, so we can communicate it
• 'Model' is a collection of regression models corresponding to different hyperparameters
• Train each model locally (model parallel)
3rd Rule of Thumb: Minimize Network Communication - Stay Local
Example: linear regression, big n and huge d
• Gradient descent
• O(d) communication is slow with hundreds of millions of parameters
• Distribute data and model (data and model parallel)
• Often rely on sparsity to reduce communication
3rd Rule of Thumb: Minimize Network Communication
Q: How should we leverage distributed computing while mitigating network communication?
First observation: we need to store and potentially communicate data, model, and intermediate objects.
• A: Keep large objects local
Second observation: ML methods are typically iterative.
• A: Reduce the number of iterations
3rd Rule of Thumb: Minimize Network Communication - Reduce Iterations
Distributed computing properties:
• Parallelism makes computation fast
• Network makes communication slow
Distributed iterative algorithms must compute and communicate.
• In Bulk Synchronous Parallel (BSP) systems, e.g., Apache Spark, we strictly alternate between the two
Idea: design algorithms that compute more and communicate less.
• Do more computation at each iteration
• Reduce the total number of iterations
3rd Rule of Thumb: Minimize Network Communication - Reduce Iterations
Extreme: divide-and-conquer
• Fully process each partition locally; communicate only the final result
• Single iteration; minimal communication
• Approximate results
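One common instantiation of divide-and-conquer for least squares is to solve each partition exactly and then average the local models. The lecture does not specify the combine step, so the averaging below is a hedged assumption, and the result is approximate in general (exact only in special cases).

```python
import numpy as np

def local_solve(part_X, part_y):
    """Fully process one partition locally: exact least squares on local data."""
    w, *_ = np.linalg.lstsq(part_X, part_y, rcond=None)
    return w

def divide_and_conquer(parts):
    """Single round: each worker communicates only its d-dimensional model;
    the combine step here averages the local models (approximate in general)."""
    local_models = [local_solve(X, y) for X, y in parts]
    return sum(local_models) / len(local_models)

# Illustrative data: 90 noisy points split across 3 "workers".
rng = np.random.default_rng(2)
X = rng.normal(size=(90, 3))
w_true = np.array([1.0, 2.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=90)
parts = [(X[i:i + 30], y[i:i + 30]) for i in (0, 30, 60)]
w = divide_and_conquer(parts)
```

Only one d-dimensional vector per worker crosses the network, the minimum possible, at the cost of an approximate answer.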
3rd Rule of Thumb: Minimize Network Communication - Reduce Iterations
Less extreme: mini-batch
• Do more work locally than gradient descent before communicating
• Exact solution, but diminishing returns with larger batch sizes
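A minimal sketch of one reading of "do more work locally": each update processes a batch of points rather than a single point, so more computation happens per update and, in a distributed setting, per communication round. The data, batch size, and step size are illustrative assumptions.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, alpha=0.1, epochs=50, seed=0):
    """Mini-batch gradient descent for least squares: each update uses the
    averaged gradient over `batch_size` points, trading more local work
    per update for fewer updates (and fewer synchronization rounds)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = (X[idx] @ w - y[idx]) @ X[idx]   # sum of batch summands
            w = w - alpha * grad / len(idx)         # averaged batch gradient
        # in a BSP system, workers would synchronize w here
    return w

# Illustrative noiseless data, so the iterates approach w_true.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
w_true = np.array([-1.0, 4.0])
y = X @ w_true
w = minibatch_gd(X, y, batch_size=10)
```

Larger batches mean fewer synchronization points per pass over the data, but, as the slide notes, the per-point benefit of each update shrinks as the batch grows.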
3rd Rule of Thumb: Minimize Network Communication - Reduce Iterations
Latency: cost to send a message (independent of size)
Throughput: how many bytes per second can be read

Latency:
• Memory: 1e-4 ms
• Hard disk: 10 ms
• Network (same datacenter): 0.25 ms
• Network (US to Europe): >5 ms

We can amortize latency!
• Send larger messages
• Batch their communication
• E.g., train multiple models together
1st Rule of Thumb: Computation and storage should be linear (in n, d)
2nd Rule of Thumb: Perform parallel and in-memory computation
3rd Rule of Thumb: Minimize network communication
Lab Preview
Goal: Predict a song's release year from audio features
Supervised learning pipeline: Obtain Raw Data → Split Data → Feature Extraction → Supervised Learning → Evaluation → Predict
(The full dataset is split into training, validation, and test sets; the trained model is evaluated for prediction accuracy and applied to new entities.)
Obtain Raw Data: Millionsong Dataset from the UCI ML Repository
• Explore features
• Shift labels so that they start at 0 (for interpretability)
• Visualize data
Split Data: Create training, validation, and test sets
Feature Extraction:
• Initially use raw features
• Subsequently compare with quadratic features
Supervised Learning: Least squares regression
• First implement gradient descent from scratch
• Then use the MLlib implementation
• Visualize performance by iteration
Evaluation (Part 1): Hyperparameter tuning
• Use grid search to find good values for the regularization and step size hyperparameters
• Evaluate using RMSE
• Visualize the grid search
Evaluation (Part 2): Evaluate final model
• Evaluate using RMSE
• Compare to a baseline model that returns the average song year in the training data
Predict: The final model could be used to predict the song year for new songs (we won't do this, though)