A Practical Guide to Deep Learning at the Department of Mathematics
Vegard Antun (UiO)
March 19, 2019
Layout of the talk
Part I: Computer resources, the Linux operating system, large-scale computations.
Part II: Neural networks, mathematical framework, practical example.
Computer resources

[Diagram: a CPU with its cache, main memory, and hard drive]
[Figure: Memory hierarchies (INF1060, Pål Halvorsen, University of Oslo). Approximate access times, with a human-scale analogy in parentheses:]

  Registers:                 ~0.3 ns  (< 1 s)
  On-die memory (caches):    ~1 ns    (~2 s)
  Main memory:               ~50 ns   (~1.5 minutes)
  Secondary storage (disks): ~5 ms    (~3.5 months)
  Tertiary storage (tapes):  slower still
Computer resources

[Diagram: as above, but with a GPU and its own dedicated memory added alongside the CPU, cache, memory, and hard drive]
Time measurements
Total time for 10 epochs on CIFAR10, batch size 10.
• CPU: 8 min, 35 sec
• GPU: 53 sec (≈ 10 times faster)
[Bar chart: seconds (0-20) to load 50 MR scans (each 40 MB) on nam-shub, comparing loading over the network, from local disk, and from RAM]
Operating systems (OS)
[Diagram: the operating system as a layer on top of the hardware]
The Linux Filesystem Hierarchy
The uppermost directory in the Linux file system is /
[ ~ ]$ ls
Desktop Downloads Pictures www_docs
Documents pc WINDOWS
[ ~ ]$ pwd
/mn/sarpanitu/ansatte-u4/vegarant
[ ~ ]$ cd /
[ / ]$ ls
admin etc lib misc opt sbin tf usit
bin hf lib64 mn proc site tmp usr
boot home local mnt rh srv ub uv
dev ifi med net root sv uio var
div jus media odont run sys use
Some important directories
• /bin  Most basic executable files (ls, cp, cd)
• /lib  Libraries used by the executables
• /boot  Files related to the boot loader
• /dev  All devices, e.g. /dev/random, /dev/null, /dev/pts/0
• /etc  Configuration files, e.g. /etc/hostname, /etc/passwd
• /home/username  Your home folder ~/ (not on the UiO system)
• /root  Home directory of the root user
• /tmp  Temporary files; not preserved across reboots
• /usr  Read-only user data; multi-user applications
• /var  Variable files, i.e. files that change during execution
Environment variables
A variable with a name and a value, used by one or more applications. To view them all, type env.
Some important environment variables
• PATH  All directories searched for executables
• PYTHONPATH  All directories searched for Python modules
• HOME  Your home directory, i.e. the location of ~/
• EDITOR  Default editor
• TF_CPP_MIN_LOG_LEVEL  Verbosity level for TensorFlow
Environment variables - Example
[ ~ ]$ echo $PYTHONPATH
/path/to/module1:/path/to/module2
[ ~ ]$
[ ~ ]$ export PYTHONPATH=$PYTHONPATH:/path/to/new_module
[ ~ ]$
[ ~ ]$ echo $PYTHONPATH
/path/to/module1:/path/to/module2:/path/to/new_module
The ~/.bashrc
The scripting language you type in the terminal is called Bash (Bourne Again SHell).
We often want the environment to persist between logins. Set defaults in the files:
• ~/.bashrc  Run each time you open a terminal on your computer.

[ ~ ]$ cat ~/.bashrc
export PYTHONPATH=$PYTHONPATH:/path/to/new_module
export TF_CPP_MIN_LOG_LEVEL=1
alias la='ls -a --color=auto'
alias ll='ls -lh --color=auto'
# Describes the command line prompt
PS1='[ \h \w ]$ '
The ~/.bashrc and ~/.bash_profile files
• ~/.bashrc  Run each time you open a terminal on your computer.
• ~/.bash_profile  Run each time you log in remotely.
Having two different settings in ~/.bashrc and ~/.bash_profile is often inconvenient. To use only the ~/.bashrc file, place the following lines in your ~/.bash_profile:

[ ~ ]$ cat .bash_profile
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

Note: Files starting with '.' do not show up when you type ls. To see these files, type ls -a.
Login to remote machines via SSH
Log in to the university's network from a personal Linux or Mac computer:

[ ~ ]$ ssh -X [email protected]

The -X option enables X11 forwarding, i.e. you can open GUI-based applications.
Once you are logged in, you can continue to the desired computer by typing

[ ~ ]$ ssh -X computername
[ ~ ]$ # Example: logging in to the hadad computer
[ ~ ]$ ssh -X hadad
Login to remote machines via SSH
Next we will see how to make this procedure require less typing!
SSH config file
Create the file ~/.ssh/config and add the following lines:

host uio
    hostname login.math.uio.no
    user your_username
    ForwardX11 no

You can then log on to the university's network with

ssh -X uio

We assume this setup in the rest of this presentation.
SSH keys
To make the UiO passwords secure, they often require a lot of typing. SSH keys provide an easy way to maintain high security while typing shorter passwords.
Generate and set up an SSH key
[ ~ ]$ ssh-keygen -t rsa -b 4096 -C "[email protected]"

This command creates two files:
• ~/.ssh/id_rsa  Private key. Do not share it.
• ~/.ssh/id_rsa.pub  Public key. Can be shared with anyone.

Copy the public key to the remote host (UiO):

[ ~ ]$ ssh-copy-id -i ~/.ssh/id_rsa.pub <username>@login.math.uio.no
SSH and jump connections
Your computer → login.math.uio.no → math computer

• A jump connection sends the SSH traffic directly through an intermediate computer, like a regular router.
• You avoid some typing, and you do not allocate a terminal on the jump computer.
• Only allows a single jump.
SSH and jump connections
To use a jump connection, add the following to your ~/.ssh/config:

# Setup for the math computers; this example uses belet-ili
Host belet-ili
    Hostname belet-ili.uio.no
    ProxyJump [email protected]
    User vegarant

Or you can specify the jump connection directly:

ssh -J <username>@login.math.uio.no <username>@<hostname>.uio.no
Terminal window managers
• Common choices are tmux or screen; a minimal tmux workflow is sketched below.
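As a sketch (the session name is illustrative), a typical workflow looks like:

[ ~ ]$ tmux new -s train          # start a named session
[ ~ ]$ python3 my_script.py       # run something inside it
# detach with Ctrl-b d; the process keeps running on the machine
[ ~ ]$ tmux attach -t train       # re-attach later, even from a new login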
Monitor CPU usage
• Use the htop command to view CPU usage and process priorities.
Reducing the priority of your process
• Linux processes can have "niceness" values in {−20, ..., 19}, where a smaller value gives higher priority.
• Negative nice values can only be given by the root user/administrator.
• The default priority of any process you start is 0, i.e. you will typically reduce the priority.

[ ~ ]$ nice -n 19 python3 my_python_script.py &
Monitor GPU usage
• All of our GPUs are from Nvidia. To view their current usage, use nvidia-smi.
• To run this command every 5 seconds, use the watch command:

[ ~ ]$ watch -n 5 nvidia-smi
[ ~ ]$ # or use
[ ~ ]$ nvidia-smi -l 5
GPU resources at Dep. of Mathematics
Name         GPU              CPU cores   Mem.     Scratch
nam-shub-01  4 × RTX 2080 Ti  28          128 GB   30 GB
zadkiel      1 × RTX 2080     4           16 GB    −
belet-ili    1 × GTX 1080     4           16 GB    −
cleopatra    1 × GTX 1080     4           16 GB    −
euphrosyne   1 × GTX 1080     4           16 GB    −
hadad        1 × GTX 1080     4           16 GB    −
AI HUB
• An experimental service for machine learning provided by USIT, to gain experience with hardware and software for deep learning.
• Reserved for students on weekdays (Mon-Fri) from 09:00 to 17:00.
• You need to log in via Abel (add SSH keys as before).

Name  GPU              CPU cores   Mem.     Non-persistent scratch
ml1   4 × RTX 2080 Ti  28          128 GB   17 TB
ml2   4 × RTX 2080 Ti  28          128 GB   17 TB
ml3   4 × RTX 2080 Ti  28          128 GB   17 TB

• AI mailing list: [email protected]
Deep learning frameworks
• Many older frameworks: MatConvNet, Caffe, Theano, ...
• For most scientists, TensorFlow (and perhaps PyTorch) is the preferred option.
Tensorflow
• Developed by Google, and has a large community.
• Relatively well documented.
• Has APIs in Python, JavaScript, C++, Java, Go, and Swift.
• Models can be deployed into applications, such as websites and phones.
How to run Tensorflow?
• There is no unified way to do this on all systems.
• The machines ml1, ml2 and ml3 have TensorFlow v1.12 and PyTorch v1.0. Just type python3 to get started.
• On the math computers we use the module system (and possibly Singularity):

module avail                      # See which modules are available
module load tensorflow/<version>  # Load tensorflow
module rm tensorflow/<version>    # Unload tensorflow
module list                       # View loaded modules

• ML software is located under python-ml/<version> and tensorflow/<version>. Do not load both.
Singularity
• Singularity (similar to Docker) is a container with a minimal operating system.
• It shares the kernel with the host operating system, so the CPU overhead is almost none.
• You can install whatever software you like within the container, along with the necessary libraries.
• It makes reproducible research much easier!
• Check out Tormod Landet's excellent guide to Singularity: http://folk.uio.no/tormodla/singularity/
• On the math computers, precompiled Singularity images are located at /mn/sarpanitu/singularity/images/Machine_learning (see the example below).
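As a sketch of how such an image might be used (the image filename <image>.img is a placeholder; the --nv flag exposes the host's Nvidia GPU inside the container):

[ ~ ]$ singularity exec --nv \
      /mn/sarpanitu/singularity/images/Machine_learning/<image>.img \
      python3 my_script.py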
Neat commands
• ag or ack – Search for a pattern in each source file in the tree from the current directory and downwards.
• fzf – Fuzzy finder. Search for filenames in the tree from the current directory and downwards.
• which <command> – E.g. which python gives the location of the program python.
• nohup nice -n 19 python -u my_script.py > output.txt & – Start a process that isn't shut down when you exit the login shell.
File permissions
On UNIX systems, access can be given to a user, a group, or all. The three types of permissions are read, write and execute.

[ ~/some/directory ]$ ls -l
drwxrwxr-x. 1 vegarant vegarant 4096 Oct 26 10:53 my_dir
-rwxrwxr-x. 1 vegarant vegarant 8448 Oct 26 10:53 my_file
-rw-r--r--. 1 vegarant vegarant  108 Oct 26 10:52 my_file.c
Reading the first entry: d (directory), rwx (user), rwx (group), r-x (all), vegarant (username), vegarant (group name), 4096 (size), Oct 26 10:53 (last modified), my_dir (name).
[ ~/some/directory ]$ # Make directory private
[ ~/some/directory ]$ chmod 700 my_dir
[ ~/some/directory ]$ ls -l
drwx------. 1 vegarant vegarant 4096 Oct 26 10:53 my_dir
-rwxrwxr-x. 1 vegarant vegarant 8448 Oct 26 10:53 my_file
-rw-r--r--. 1 vegarant vegarant  108 Oct 26 10:52 my_file.c
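Each octal digit in the chmod mode encodes read (4), write (2) and execute (1) for user, group and all, respectively, so 700 gives rwx to the user and nothing to anyone else. Two more illustrative examples:

[ ~/some/directory ]$ chmod 644 my_file.c  # rw-r--r--: 6 = 4+2, 4 = read only
[ ~/some/directory ]$ chmod 755 my_dir     # rwxr-xr-x: 7 = 4+2+1, 5 = 4+1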
Part II
Neural networks, mathematical framework, practical example.
Neural Network
Definition 1
Let NN_{N,L,d} with N = (c = N_{L+1}, N_L, ..., N_2, N_1 = d) denote the set of all L-layer neural networks. That is, all mappings f : R^d → R^c of the form

    f(x) = W_L(... ρ(W_2(ρ(W_1(x)))) ...),  x ∈ R^d,

where W_j z = A_j z + b_j with A_j ∈ R^{N_{j+1} × N_j} and b_j ∈ R^{N_{j+1}}, and ρ : R → R is a non-linear function that acts elementwise on a vector.
Choices of ρ
ρ : R → R acts elementwise on a vector.

    Sigmoid:     ρ(x) = 1/(1 + e^{-x})
    tanh:        ρ(x) = tanh(x)
    ReLU:        ρ(x) = max(0, x)
    Leaky ReLU:  ρ(x) = x if x ≥ 0, αx if x < 0
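All of these are one-liners in numpy; a minimal sketch (the leaky-ReLU slope α = 0.01 is an illustrative choice):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)

# tanh is simply np.tanh; all of these act elementwise on arrays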
Choices of ρ
    Max pooling:                   ρ(x_1, ..., x_N) = (max{x_1, x_2}, ..., max{x_{N-1}, x_N})
    Average pooling (linear map):  ρ(x_1, ..., x_N) = ((x_1 + x_2)/2, ..., (x_{N-1} + x_N)/2)
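As written, both maps act on overlapping pairs of neighbouring entries; a sketch in numpy:

import numpy as np

def max_pool(x):
    # (max{x1, x2}, ..., max{x_{N-1}, x_N}) over neighbouring pairs
    return np.maximum(x[:-1], x[1:])

def avg_pool(x):
    # ((x1 + x2)/2, ..., (x_{N-1} + x_N)/2); note that this map is linear
    return (x[:-1] + x[1:]) / 2.0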
Neural Network (Alternative definition)
Directed acyclic graph
    x
    z1 = A1 x + b1
    z2 = ρ1(z1)
    z3 = A2 z2 + b2
    z4 = A3 x + b3
    z5 = ρ2(z4)
    z6 = z3 + z5
    z7 = ρ3(z6)
    Output: z7

Note that z4 acts directly on the input x, and z6 sums the two branches (a skip connection).
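A minimal numpy sketch of this graph, assuming ReLU for all the ρ_i and compatible matrix shapes (both assumptions are illustrative):

import numpy as np

def rho(z):                  # ReLU, standing in for rho_1, rho_2, rho_3
    return np.maximum(0.0, z)

def f(x, A1, b1, A2, b2, A3, b3):
    z1 = A1 @ x + b1         # first branch: affine map of the input
    z2 = rho(z1)
    z3 = A2 @ z2 + b2
    z4 = A3 @ x + b3         # second branch acts directly on x
    z5 = rho(z4)
    z6 = z3 + z5             # the two branches are summed (skip connection)
    return rho(z6)           # z7, the output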
What is machine learning?
Machine learning model
• Training set: S = (z_1, ..., z_m) ⊂ Z, where each z_i is drawn i.i.d. from an unknown probability distribution D over Z ⊂ R^d.
• Function class: F, a class of functions/hypotheses.
• Cost function: C : F × Z → R.
• Risk: R_D(f) := E_{z∼D} C(f, z), where z ∼ D is independent of S.
• Goal: Find a "good" hypothesis f̂ ∈ F based on S such that R_D(f̂) is small.
Shalev-Shwartz & Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
Examples
Binary classification
• Training set: {(x_i, y_i)}_{i=1}^m ⊂ R^d × {0, 1}.
• Function class: F can be a set of linear classifiers, neural networks, or decision trees.
• Cost function: C(f, (x_i, y_i)) = 1_{y_i ≠ f(x_i)} (the 0-1 loss).

Linear regression
• Training set: {(x_i, y_i)}_{i=1}^m ⊂ R^d × R.
• Function class: F = {⟨·, θ⟩ : θ ∈ R^{d+1}}.
• Cost function: C(f, (x_i, y_i)) = (y_i − ⟨[x_i, 1], θ⟩)².

Clustering
• Training set: S = {z_i}_{i=1}^m ⊂ R^d.
• Function class: F = {T = {T_1, ..., T_k} : partition of S with centers (c_1, ..., c_k)}.
• Cost function: C(T, z_i) = ‖z_i − c_j‖ for z_i ∈ T_j.
Machine learning model
• Risk: R_D(f) := E_{z∼D} C(f, z), where z ∼ D is independent of S.
• Goal: Find a "good" hypothesis f̂ ∈ F based on S such that R_D(f̂) is small. Notice: we cannot evaluate R_D(f), since D is unknown.

Empirical Risk Minimization
Approximate R_D(f) by

    R_S(f) = (1/|S|) Σ_{z∈S} C(f, z).

We seek to find f♯ ∈ argmin_{f∈F} R_S(f).
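As a sketch of the idea (reusing the squared loss from the linear-regression example above; the names are illustrative):

import numpy as np

def empirical_risk(predict, S):
    # S is a list of (x, y) pairs; predict is a candidate hypothesis f
    return np.mean([(y - predict(x))**2 for (x, y) in S])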
Bias-Complexity tradeoff
Let

    ε_approx = min_{f∈F} R_D(f)  and  f♯ ∈ argmin_{f∈F} R_S(f).

Then

    R_D(f♯) = ε_approx + (R_D(f♯) − ε_approx),

where ε_approx is the approximation error and R_D(f♯) − ε_approx is the estimation error.
Empirical Risk Minimization for Neural Networks
• Training set: {(x_i, y_i)}_{i=1}^m ⊂ R^d × R^c.
• Function class: F = NN_{N,L,d}, parametrized by the weights θ = (vec(A_1), b_1, ..., vec(A_L), b_L), i.e. f(·, θ) : R^d → R^{N_{L+1}}.
• Cost function: C(f, (x_i, y_i)) = d(f(x_i, θ), y_i), where d : R^c × R^c → R_+ is problem dependent.

1. θ ∈ R^p is often referred to as the weights.
2. Define the loss function

       L(θ) = Σ_{i=1}^n d(f(x_i, θ), y_i).

3. Try to find

       θ ∈ argmin_{θ∈R^p} L(θ)

   using (stochastic) gradient descent.
Convex Optimization – Boyd & Vandenberghe
"Nonlinear optimization (or nonlinear programming) is the term used to describe an optimization problem when the objective or constraint functions are not linear, but not known to be convex. Sadly, there are no effective methods for solving the general nonlinear programming problem (1.1). Even simple looking problems with as few as ten variables can be extremely challenging, while problems with a few hundreds of variables can be intractable. Methods for the general nonlinear programming problem therefore take several different approaches, each of which involves some compromise."

    minimize    f_0(x),  x ∈ R^n
    subject to  f_i(x) ≤ b_i,  i = 1, ..., m        (1.1)

Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
Convex Optimization – Boyd & Vandenberghe
From the section on local optimization approaches to nonlinear optimization:

"Roughly speaking, local optimization methods are more art than technology. Local optimization is a well developed art, and often very effective, but it is nevertheless an art."

Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
Gradient Descent for Neural Networks
• Recall that we want to minimize

      L(θ) = Σ_{i=1}^n d(f(x_i, θ), y_i).

  Gradient descent gives the iterations

      θ_{k+1} = θ_k − α_k ∇L(θ_k)

  for some step length α_k > 0.

• What happens to the computational cost if n is very large, say n ≈ 1 200 000?
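A minimal numpy sketch of these iterations for a least-squares loss (the step length and iteration count are illustrative). Note that every step touches all n samples, which is exactly the cost issue the question above points to:

import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=100):
    # Minimize L(theta) = sum_i (y_i - theta^T x_i)^2, with X of shape (n, d)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = -2.0 * X.T @ (y - X @ theta)  # gradient of the full loss
        theta = theta - alpha * grad         # one gradient-descent step
    return theta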
Stochastic Gradient Descent for Neural Networks
• Create a partition {T_1, ..., T_k} of the indices {1, ..., n}, where each |T_j| ≤ s.
• Let

      G_j(θ) = Σ_{i∈T_j} ∇_θ C(f(x_i, θ), y_i).

• Perform the updates (a numpy sketch follows below):

  1: t = 0
  2: for e = 1, ..., M do
  3:     for j = 1, ..., k do
  4:         θ_{t+1} = θ_t − α_t G_j(θ_t)
  5:         t = t + 1
  6: return θ_{kM}
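A numpy sketch of the loop above for the same least-squares loss (batch size, step length and epoch count are illustrative):

import numpy as np

def sgd(X, y, s=10, alpha=0.01, epochs=5):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):                            # e = 1, ..., M
        idx = np.random.permutation(n)
        for T in np.array_split(idx, max(n // s, 1)):  # batches T_1, ..., T_k
            XT, yT = X[T], y[T]
            grad = -2.0 * XT.T @ (yT - XT @ theta)     # G_j(theta)
            theta = theta - alpha * grad
    return theta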
Alternative update rules
GD with momentum, 0 < γ < 1:

    v_{t+1} = γ v_t + η G_j(θ_t)
    θ_{t+1} = θ_t − v_{t+1}

Individual scaling of the different parameters (Adagrad, RMSprop, Adam):

    θ_{t+1} = θ_t − D_t G_j(θ_t),

where D_t is a diagonal matrix depending on some or all of the previously computed gradients.
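The momentum update as a sketch, meant to replace the plain update in the inner loop of the SGD code above (γ and η are illustrative hyperparameters):

import numpy as np

def momentum_step(theta, v, grad, gamma=0.9, eta=0.01):
    # v accumulates a decaying sum of the past gradients
    v_new = gamma * v + eta * grad
    return theta - v_new, v_new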
Tensorflow
import tensorflow as tf
import numpy as np
Most important tensors
• tf.Variable (Must be initialized. Gradients can be taken with respect to it.)
• tf.placeholder (Input to the network.)
• tf.constant (Constant values.)
• tf.Tensor (Output of an operation.)
Important Attributes
• shape (Default is None, i.e. not specified.)
• dtype (tf.float32, tf.int32, ...)
• name (Will be assigned a name if not specified.)
[Diagram: x and A feed into z1 = Ax; z1 and b feed into z2 = z1 + b]

• A: tf.Variable
• x: tf.placeholder
• z1: tf.Tensor
• b: tf.Variable, tf.placeholder or tf.constant
• z2: tf.Tensor
Tensorflow
# Nodes in a graph
a = tf.Variable(initial_value=np.random.randn(1, 3),
                name='weights', dtype=tf.float32)
b = tf.Variable(initial_value=[0], name='bias',
                dtype=tf.float32)

print(a)
print(b)

$ python3 program_name.py
<tf.Variable 'weights:0' shape=(1, 3) dtype=float32_ref>
<tf.Variable 'bias:0' shape=(1,) dtype=float32_ref>
Linear regression
# Code generating all the data
N = 50
a_true = np.array([[4., -5, 3]], dtype=np.float32)
b_true = np.array([2], dtype=np.float32)

x_data = np.concatenate((np.random.randn(1, N),
                         np.random.uniform(size=[1, N]),
                         np.random.chisquare(df=3.0, size=(1, N))))
noise = 0.01*np.random.randn(1, N)
labels = np.dot(a_true, x_data) + b_true # + noise

    a = (4, −5, 3)^T,  b = 2,  x_i ∈ R^3,  i = 1, ..., N
    x_i^T a + b = y_i,  i = 1, ..., N
Tensorflow
# Nodes in a graph
a = tf.Variable(initial_value=np.random.randn(1, 3),
                name='weights', dtype=tf.float32)
b = tf.Variable(initial_value=[0], name='bias',
                dtype=tf.float32)
X = tf.placeholder(dtype=tf.float32, name='data',
                   shape=[3, N])

prediction = tf.linalg.matmul(a, X) + b # TF graph

print(X)
print(prediction)

$ python3 program_name.py
Tensor("data:0", shape=(3, 50), dtype=float32)
Tensor("add:0", shape=(1, 50), dtype=float32)
Tensorflow – Sessions
• Graphs only define the function you would like to compute.
• To execute a graph (function), open a tf.Session().
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)  # All variables must be initialized
    # All relevant placeholders go into the feed_dict
    pred = sess.run(prediction, feed_dict={X: x_data})
    a_start = sess.run(a)
    print(a_start)
    print(pred)  # pred is a numpy array with values a*data + b

$ python3 program_name.py
[[-0.9025026   0.6354202  -0.09739944]]
[[-0.86136425  0.6985589   0.51153713  1.2961135
 ...
  0.91275173 -1.0157912  -0.41740212  0.45071918
  0.3727951  -0.81552047]]
Tensorflow – Gradient Descent
Y = tf.placeholder(dtype=tf.float32, name='label',
                   shape=[1, N])

# Compute sum_{i} (y[i] - prediction[i])^2
loss = tf.reduce_sum(tf.pow(prediction - Y, 2))

nbr_epochs = 100
step_length = 0.01  # often called the learning rate

optimizer = tf.train.GradientDescentOptimizer(
                step_length).minimize(loss)

with tf.Session() as sess:
    sess.run(init)  # All variables must be initialized
    for epoch in range(nbr_epochs):
        # Do a gradient descent step
        sess.run(optimizer, feed_dict={X: x_data,
                                       Y: labels})
    a_pred, b_pred = sess.run([a, b])
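As a quick sanity check (a sketch reusing the variables above), the learned parameters should approach the values that generated the data:

print(a_pred)  # should be close to a_true = [[4, -5, 3]]
print(b_pred)  # should be close to b_true = [2]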
NeurIPS (earlier NIPS)
Submitted papers
• 2016: 2406 submissions
• 2017: 3240 submissions
• 2018: ~4900 submissions
Source: Twitter