Differentiable Functional Programming
Atılım Güneş Baydin
University of Oxford
http://www.robots.ox.ac.uk/~gunes/
F#unctional Londoners Meetup, April 28, 2016
About me
Current (from 11 April 2016): Postdoctoral researcher, Machine Learning Research Group, University of Oxford
http://www.robots.ox.ac.uk/~parg/
Previously: Brain and Computation Lab, National University of Ireland Maynooth
http://www.bcl.hamilton.ie/
Working primarily with F#, on algorithmic differentiation, functional programming, and machine learning
Today’s talk
Derivatives in computer programs
Differentiable functional programming
DiffSharp + Hype libraries
Two demos
Derivatives in computer programs: how do we compute them?
Manual differentiation
f(x) = sin(exp x)
let f x = sin (exp x)
Calculus 101: differentiation rules
d(fg)/dx = (df/dx) g + f (dg/dx)
d(af + bg)/dx = a (df/dx) + b (dg/dx)
...
f′(x) = cos(exp x) × exp x
let f' x = (cos (exp x)) * (exp x)
Manual differentiation: it can get complicated
f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)²
(4th iteration of the logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x)
let f x =
    64.0*x * (1.0-x) * ((1.0 - 2.0*x) ** 2.0) * ((1.0 - 8.0*x + 8.0*x*x) ** 2.0)
f′(x) = 128x(1−x)(−8+16x)(1−2x)²(1−8x+8x²) + 64(1−x)(1−2x)²(1−8x+8x²)² − 64x(1−2x)²(1−8x+8x²)² − 256x(1−x)(1−2x)(1−8x+8x²)²
let f' x =
    128.0*x * (1.0-x) * (-8.0+16.0*x) * (1.0-2.0*x)**2.0 * (1.0-8.0*x+8.0*x*x)
    + 64.0 * (1.0-x) * (1.0-2.0*x)**2.0 * (1.0-8.0*x+8.0*x*x)**2.0
    - 64.0*x * (1.0-2.0*x)**2.0 * (1.0-8.0*x+8.0*x*x)**2.0
    - 256.0*x * (1.0-x) * (1.0-2.0*x) * (1.0-8.0*x+8.0*x*x)**2.0
Symbolic differentiation
Computer algebra packages help: Mathematica, Maple, Maxima
But it has some serious drawbacks
Symbolic differentiation: we get “expression swell”
Logistic map l_{n+1} = 4 l_n (1 − l_n), l_1 = x

n | l_n | d/dx l_n
1 | x | 1
2 | 4x(1−x) | 4(1−x) − 4x
3 | 16x(1−x)(1−2x)² | 16(1−x)(1−2x)² − 16x(1−2x)² − 64x(1−x)(1−2x)
4 | 64x(1−x)(1−2x)²(1−8x+8x²)² | 128x(1−x)(−8+16x)(1−2x)²(1−8x+8x²) + 64(1−x)(1−2x)²(1−8x+8x²)² − 64x(1−2x)²(1−8x+8x²)² − 256x(1−x)(1−2x)(1−8x+8x²)²

[Plot: number of terms in l_n and in d/dx l_n, growing rapidly with n from 1 to 5]
Symbolic differentiation: we are limited to closed-form formulae
You can find the derivative of math expressions:
f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)²
But not of algorithms, branching, control flow:
let f x n =
    if n = 1 then x
    else
        let mutable v = x
        for i = 1 to n - 1 do
            v <- 4.0 * v * (1.0 - v)  // logistic map step
        v
Numerical differentiation
A very common hack: use the limit definition of the derivative
df/dx = lim_{h→0} (f(x + h) − f(x)) / h
to approximate the numerical value of the derivative
let diff f x =
    let h = 0.00001
    (f (x + h) - f x) / h
Again, some serious drawbacks
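As a quick sketch of the trade-off, the O(h) forward difference above can be compared with an O(h²) central difference on the logistic-map polynomial from earlier (plain F#; the step size and evaluation point are illustrative choices):

```fsharp
// Forward difference: O(h) truncation error
let diffForward f x =
    let h = 1e-5
    (f (x + h) - f x) / h

// Central difference: O(h^2) truncation error
let diffCentral f x =
    let h = 1e-5
    (f (x + h) - f (x - h)) / (2.0 * h)

// 4th iteration of the logistic map, as before
let f x = 64.0*x*(1.0-x)*(1.0-2.0*x)**2.0*(1.0-8.0*x+8.0*x*x)**2.0

printfn "forward: %f" (diffForward f 0.2)
printfn "central: %f" (diffCentral f 0.2)
```

The central difference cancels the first-order error term, so for the same h it lands much closer to the true derivative, but neither eliminates round-off error for small h.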
Numerical differentiation
We must select a proper value of h, and we face approximation errors
[Plot: error E(h, x*) vs step size h from 10⁻¹⁷ to 10⁻¹; round-off error dominates for small h, truncation error for large h]
Computed using
E(h, x*) = | (f(x* + h) − f(x*)) / h − f′(x*) |
with f(x) = 64x(1−x)(1−2x)²(1−8x+8x²)² and x* = 0.2
Numerical differentiation
Better approximations exist:
Higher-order finite differences, e.g.
∂f(x)/∂x_i = (f(x + h e_i) − f(x − h e_i)) / (2h) + O(h²)
Richardson extrapolation
Differential quadrature
but they increase rapidly in complexity and never completely eliminate the error
Numerical differentiation
Poor performance: for f : Rⁿ → R, approximating the gradient ∇f = (∂f/∂x_1, ..., ∂f/∂x_n) using
∂f(x)/∂x_i ≈ (f(x + h e_i) − f(x)) / h,  0 < h ≪ 1
requires repeating the function evaluation n times to get ∇f
Algorithmic differentiation (AD)
Algorithmic differentiation
Also known as automatic differentiation (Griewank & Walther, 2008)
Gives numeric code that computes the function AND its derivatives at a given point
f(a, b):
    c = a * b
    d = sin c
    return d

f'(a, a', b, b'):
    (c, c') = (a*b, a'*b + a*b')
    (d, d') = (sin c, c' * cos c)
    return (d, d')
Derivatives propagated at the elementary operation level, as a side effect, at the same time the function itself is computed
→ Prevents the “expression swell” of symbolic derivatives
Full expressive capability of the host language
→ Including conditionals, looping, branching
Function evaluation traces
All numeric evaluations are sequences of elementary operations: a “trace,” also called a “Wengert list” (Wengert, 1964)

f(a, b):
    c = a * b
    if c > 0
        d = log c
    else
        d = sin c
    return d

f(2, 3)

(primal)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return d

(tangent)
a = 2
a' = 1
b = 3
b' = 0
c = a * b = 6
c' = a' * b + a * b' = 3
d = log c = 1.791
d' = c' * (1 / c) = 0.5
return d, d'

i.e., a Jacobian-vector product J_f (1, 0)|_(2,3) = ∂/∂a f(a, b)|_(2,3) = 0.5
This is called the forward (tangent) mode of AD
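The forward mode traced above can be sketched in plain F# with dual numbers, carrying a primal and a tangent through each elementary operation (a minimal illustration of the idea, not the DiffSharp implementation):

```fsharp
// Dual number: primal value P and tangent (derivative) part T
type Dual =
    { P : float; T : float }
    static member (+) (a: Dual, b: Dual) = { P = a.P + b.P; T = a.T + b.T }
    static member (*) (a: Dual, b: Dual) = { P = a.P * b.P; T = a.T * b.P + a.P * b.T }

let dLog (a: Dual) = { P = log a.P; T = a.T / a.P }
let dSin (a: Dual) = { P = sin a.P; T = a.T * cos a.P }

// f(a, b) from the trace: c = a * b; d = log c if c > 0, else sin c
let f (a: Dual) (b: Dual) =
    let c = a * b
    if c.P > 0.0 then dLog c else dSin c

// Partial derivative w.r.t. a at (2, 3): seed a with tangent 1, b with 0
let d = f { P = 2.0; T = 1.0 } { P = 3.0; T = 0.0 }
printfn "value = %f, da = %f" d.P d.T  // value ≈ 1.792, da = 0.5
```

One forward pass computes the primal and one directional derivative together, with control flow handled for free by the host language.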
Function evaluation traces

f(a, b):
    c = a * b
    if c > 0
        d = log c
    else
        d = sin c
    return d

f(2, 3)

(primal)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
return d

(adjoint)
a = 2
b = 3
c = a * b = 6
d = log c = 1.791
d' = 1
c' = d' * (1 / c) = 0.166
b' = c' * a = 0.333
a' = c' * b = 0.5
return d, a', b'

i.e., a transposed Jacobian-vector product J_f^T (1)|_(2,3) = ∇f|_(2,3) = (0.5, 0.333)
This is called the reverse (adjoint) mode of AD
Backpropagation is just a special case of the reverse mode: code your neural network objective computation, apply reverse AD
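The reverse mode can likewise be sketched in plain F#: record the elementary operations on a tape during the forward sweep, then propagate adjoints backwards (again an illustration of the idea, not DiffSharp's implementation):

```fsharp
// Tape of nodes; each node records its parents and the local partial
// derivative of the node's output with respect to each parent
type Node = { Parents : (int * float) list }

let tape = ResizeArray<Node>()

// An input variable: a node with no parents
let variable () =
    tape.Add { Parents = [] }
    tape.Count - 1

// Elementary ops push a node carrying d(output)/d(input) for every input
let mul (i, x) (j, y) =
    tape.Add { Parents = [ (i, y); (j, x) ] }
    (tape.Count - 1, x * y)

let logOp (i, x) =
    tape.Add { Parents = [ (i, 1.0 / x) ] }
    (tape.Count - 1, log x)

// Reverse sweep: seed the output adjoint with 1, accumulate backwards
let adjoints (output: int) =
    let adj : float[] = Array.zeroCreate tape.Count
    adj.[output] <- 1.0
    for k in tape.Count - 1 .. -1 .. 0 do
        for (p, dv) in tape.[k].Parents do
            adj.[p] <- adj.[p] + adj.[k] * dv
    adj

// f(a, b) = log (a * b), evaluated at (2, 3)
let a = variable ()
let b = variable ()
let c = mul (a, 2.0) (b, 3.0)
let d = logOp c
let adj = adjoints (fst d)
printfn "da = %f, db = %f" adj.[a] adj.[b]  // da = 0.5, db ≈ 0.333
```

A single backward sweep yields the adjoints of all inputs at once, which is exactly what backpropagation exploits.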
How is this useful?
Forward vs reverse
In the extreme cases:
for F : R → Rᵐ, forward AD can compute all (∂F_1/∂x, ..., ∂F_m/∂x)
for f : Rⁿ → R, reverse AD can compute ∇f = (∂f/∂x_1, ..., ∂f/∂x_n)
in just one evaluation
In general, for f : Rⁿ → Rᵐ, the Jacobian J ∈ R^(m×n) takes
O(n × time(f)) with forward AD
O(m × time(f)) with reverse AD
Reverse mode performs better when n ≫ m
How is this useful?
Traditional application domains of AD in industry and academia (Corliss et al., 2002):
Computational fluid dynamics
Atmospheric chemistry
Engineering design optimization
Computational finance
Functional AD
or
”Differentiable functional programming”
AD and functional programming
AD has been around since the 1960s (Wengert, 1964; Speelpenning, 1980; Griewank, 1989)
The foundations for AD in a functional framework (Siskind and Pearlmutter, 2008; Pearlmutter and Siskind, 2008)
With research implementations:
R6RS-AD: https://github.com/qobi/R6RS-AD
Stalingrad: http://www.bcl.hamilton.ie/~qobi/stalingrad/
Alexey Radul’s DVL: https://github.com/axch/dysvunctional-language
Recently, my DiffSharp library: http://diffsharp.github.io/DiffSharp/
Differentiable functional programming
Deep learning: neural network models are assembled from building blocks and trained with backpropagation
Traditional:
Feedforward
Convolutional
Recurrent
Differentiable functional programming
Newer additions: make algorithmic elements continuous and differentiable → enables use in deep learning
[Figure: NTM on copy task (Graves et al., 2014)]
Neural Turing Machine (Graves et al., 2014) → can infer algorithms: copy, sort, recall
Stack-augmented RNN (Joulin & Mikolov, 2015)
End-to-end memory network (Sukhbaatar et al., 2015)
Stack, queue, deque (Grefenstette et al., 2015)
Discrete interfaces (Zaremba & Sutskever, 2015)
Differentiable functional programming
Stacking of many layers, trained through backpropagation
[Figure: layer diagrams of AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014); ResNet, 152 layers (deep residual learning) (ILSVRC 2015)]
(He, Zhang, Ren, Sun. “Deep Residual Learning for Image Recognition.” 2015. arXiv:1512.03385)
Differentiable functional programming
One way of viewing deep learning systems is “differentiable functional programming”
Two main characteristics:
Differentiability → optimization
Chained function composition → successive transformations → successive levels of distributed representations (Bengio, 2013) → the chain rule of calculus propagates derivatives
The bigger picture
In a functional interpretation:
Weight-tying or multiple applications of the same neuron (e.g., ConvNets and RNNs) resemble function abstraction
Structural patterns of composition resemble higher-order functions (e.g., map, fold, unfold, zip)
The bigger picture
Even when you have complex compositions, differentiability ensures that they can be trained end-to-end with backpropagation
(Vinyals, Toshev, Bengio, Erhan. “Show and tell: a neural image caption generator.” 2014. arXiv:1411.4555)
The bigger picture
Christopher Olah’s blog post (September 3, 2015): http://colah.github.io/posts/2015-09-NN-Types-FP/
“The field does not (yet) have a unifying insight or narrative”
David Dalrymple’s essay (January 2016): http://edge.org/response-detail/26794
“The most natural playground ... would be a new language that can run back-propagation directly on functional programs.”
AD in a functional framework is a manifestation of this vision.
DiffSharp
The ambition
Deeply embedded AD (forward and/or reverse) as part of the language infrastructure
Rich API of differentiation operations as higher-order functions
High-performance matrix operations for deep learning (GPU support, model and data parallelism): gradients, Hessians, Jacobians, directional derivatives, matrix-free Hessian- and Jacobian-vector products
I have been working on these issues with Barak Pearlmutter and created DiffSharp: http://diffsharp.github.io/DiffSharp/
DiffSharp
“Generalized AD as a first-class function in an augmented λ-calculus” (Pearlmutter and Siskind, 2008)
Forward, reverse, and any nested combination thereof, instantiated according to usage scenario
Nested lambda expressions with free-variable references:
min (λx . (f x) + min (λy . g x y))
let m = min (fun x -> (f x) + min (fun y -> g x y))
Must handle “perturbation confusion” (Manzyuk et al., 2012):
d/dx ( x (d/dy (x + y)|_{y=1}) )|_{x=1}  ?=  1
let d = diff (fun x -> x * (diff (fun y -> x + y) 1.)) 1.
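Perturbation confusion can be seen concretely with a naive dual-number implementation that uses a single, untagged tangent slot (a deliberately broken sketch in plain F#; DiffSharp avoids this by distinguishing nested perturbations):

```fsharp
// Naive dual numbers: one untagged tangent slot
type Dual =
    { P : float; T : float }
    static member (+) (a: Dual, b: Dual) = { P = a.P + b.P; T = a.T + b.T }
    static member (*) (a: Dual, b: Dual) = { P = a.P * b.P; T = a.T * b.P + a.P * b.T }

// diff seeds the tangent with 1 and reads the tangent of the result
let diff (f: Dual -> Dual) (x: float) = (f { P = x; T = 1.0 }).T

// d/dx (x * (d/dy (x + y))) at x = 1, y = 1: the correct answer is 1
let d =
    diff (fun x ->
        let inner = diff (fun y -> x + y) 1.0  // x's perturbation leaks into the inner diff
        x * { P = inner; T = 0.0 }) 1.0

printfn "%f" d  // 2.0: wrong, the two nested perturbations were confused
```

The inner derivative picks up x's outer perturbation as well as y's, returning 2 instead of 1, and the error propagates to the final result; tagging each level of differentiation is what makes nesting safe.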
DiffSharp
Higher-order differentiation API

Op.             Value               Type signature                                  AD      Num.  Sym.
f : R → R
  diff          f′                  (R → R) → R → R                                 ✓, F    A     ✓
  diff'         (f, f′)             (R → R) → R → (R × R)                           ✓, F    A     ✓
  diff2         f′′                 (R → R) → R → R                                 ✓, F    A     ✓
  diff2'        (f, f′′)            (R → R) → R → (R × R)                           ✓, F    A     ✓
  diff2''       (f, f′, f′′)        (R → R) → R → (R × R × R)                       ✓, F    A     ✓
  diffn         f^(n)               N → (R → R) → R → R                             ✓, F          ✓
  diffn'        (f, f^(n))          N → (R → R) → R → (R × R)                       ✓, F          ✓
f : R^n → R
  grad          ∇f                  (R^n → R) → R^n → R^n                           ✓, R    A     ✓
  grad'         (f, ∇f)             (R^n → R) → R^n → (R × R^n)                     ✓, R    A     ✓
  gradv         ∇f · v              (R^n → R) → R^n → R^n → R                       ✓, F    A
  gradv'        (f, ∇f · v)         (R^n → R) → R^n → R^n → (R × R)                 ✓, F    A
  hessian       Hf                  (R^n → R) → R^n → R^(n×n)                       ✓, R-F  A     ✓
  hessian'      (f, Hf)             (R^n → R) → R^n → (R × R^(n×n))                 ✓, R-F  A     ✓
  hessianv      Hf v                (R^n → R) → R^n → R^n → R^n                     ✓, F-R  A
  hessianv'     (f, Hf v)           (R^n → R) → R^n → R^n → (R × R^n)               ✓, F-R  A
  gradhessian   (∇f, Hf)            (R^n → R) → R^n → (R^n × R^(n×n))               ✓, R-F  A     ✓
  gradhessian'  (f, ∇f, Hf)         (R^n → R) → R^n → (R × R^n × R^(n×n))           ✓, R-F  A     ✓
  gradhessianv  (∇f · v, Hf v)      (R^n → R) → R^n → R^n → (R × R^n)               ✓, F-R  A
  gradhessianv' (f, ∇f · v, Hf v)   (R^n → R) → R^n → R^n → (R × R × R^n)           ✓, F-R  A
  laplacian     tr(Hf)              (R^n → R) → R^n → R                             ✓, R-F  A     ✓
  laplacian'    (f, tr(Hf))         (R^n → R) → R^n → (R × R)                       ✓, R-F  A     ✓
f : R^n → R^m
  jacobian      Jf                  (R^n → R^m) → R^n → R^(m×n)                     ✓, F/R  A     ✓
  jacobian'     (f, Jf)             (R^n → R^m) → R^n → (R^m × R^(m×n))             ✓, F/R  A     ✓
  jacobianv     Jf v                (R^n → R^m) → R^n → R^n → R^m                   ✓, F    A
  jacobianv'    (f, Jf v)           (R^n → R^m) → R^n → R^n → (R^m × R^m)           ✓, F    A
  jacobianT     Jf^T                (R^n → R^m) → R^n → R^(n×m)                     ✓, F/R  A     ✓
  jacobianT'    (f, Jf^T)           (R^n → R^m) → R^n → (R^m × R^(n×m))             ✓, F/R  A     ✓
  jacobianTv    Jf^T v              (R^n → R^m) → R^n → R^m → R^n                   ✓, R
  jacobianTv'   (f, Jf^T v)         (R^n → R^m) → R^n → R^m → (R^m × R^n)           ✓, R
  jacobianTv''  (f, Jf^T (·))       (R^n → R^m) → R^n → (R^m × (R^m → R^n))         ✓, R
  curl          ∇ × f               (R^3 → R^3) → R^3 → R^3                         ✓, F    A     ✓
  curl'         (f, ∇ × f)          (R^3 → R^3) → R^3 → (R^3 × R^3)                 ✓, F    A     ✓
  div           ∇ · f               (R^n → R^n) → R^n → R                           ✓, F    A     ✓
  div'          (f, ∇ · f)          (R^n → R^n) → R^n → (R^n × R)                   ✓, F    A     ✓
  curldiv       (∇ × f, ∇ · f)      (R^3 → R^3) → R^3 → (R^3 × R)                   ✓, F    A     ✓
  curldiv'      (f, ∇ × f, ∇ · f)   (R^3 → R^3) → R^3 → (R^3 × R^3 × R)             ✓, F    A     ✓

(F: forward, R: reverse; combinations such as R-F denote nested modes; A: numerical approximation available; Sym.: symbolic version available)
DiffSharp
Matrix operations: http://diffsharp.github.io/DiffSharp/api-overview.html
High-performance OpenBLAS backend by default; work on a CUDA-based GPU backend underway
Support for 64- and 32-bit floats (faster on many systems)
Benchmarking tool: http://diffsharp.github.io/DiffSharp/benchmarks.html
A growing collection of tutorials: gradient-based optimization algorithms, clustering, Hamiltonian Monte Carlo, neural networks, inverse kinematics
Hype
Hype
http://hypelib.github.io/Hype/
An experimental library for “compositional machine learning and hyperparameter optimization”, built on DiffSharp
A robust optimization core:
highly configurable functional modules
SGD, conjugate gradient, Nesterov, AdaGrad, RMSProp, Newton’s method
Use nested AD for gradient-based hyperparameter optimization (Maclaurin et al., 2015)
Researching the differentiable functional programming paradigm for machine learning
Hype
Extracts from Hype neural network code: use higher-order functions, don’t think about gradients or backpropagation
https://github.com/hypelib/Hype/blob/master/src/Hype/Neural.fs
Hype
Extracts from Hype optimization code
https://github.com/hypelib/Hype/blob/master/src/Hype/Optimize.fs
Optimization and training as higher-order functions → can be composed, nested
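The idea of optimization as a higher-order function can be sketched in plain F#: a gradient-descent routine takes the gradient itself as a function argument, so it composes with any source of derivatives (a toy illustration; Hype's actual API differs):

```fsharp
// Gradient descent as a higher-order function: the gradient is an argument,
// so the optimizer composes with AD, closed-form, or numerical derivatives
let gradientDescent (grad: float[] -> float[]) (lr: float) (steps: int) (x0: float[]) =
    let mutable x = Array.copy x0
    for _ in 1 .. steps do
        x <- Array.map2 (fun xi gi -> xi - lr * gi) x (grad x)
    x

// Minimize f(x, y) = (x - 3)^2 + (y + 1)^2 with its closed-form gradient
let grad (v: float[]) = [| 2.0 * (v.[0] - 3.0); 2.0 * (v.[1] + 1.0) |]

let xMin = gradientDescent grad 0.1 100 [| 0.0; 0.0 |]
printfn "%A" xMin  // approximately [| 3.0; -1.0 |]
```

Because the optimizer is just a function from a gradient function to a minimizer, it can itself be passed around, composed, or nested inside another optimization, which is the pattern Hype builds on.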
Hype
User doesn’t need to think about derivatives; they are instantiated within the optimization code
Hype
But they can use derivatives within their models, if needed:
→ input sensitivities
→ complex objective functions
→ adaptive PID controllers
→ integrating differential equations
Thanks to nested generalized AD:
you can optimize components that are internally using differentiation
resulting higher-order derivatives propagate via forward/reverse AD as needed
Hype
We also provide a Torch-like API for neural networks
A cool thing: thanks to AD, we can freely code any F# function as a layer; it just works
Hype
http://hypelib.github.io/Hype/feedforwardnets.html
We also have some nice additions for F# interactive
Roadmap
Transformation-based, context-aware AD
F# quotations (Syme, 2006) give us a direct path for deeply embedding AD
Currently experimenting with GPU backends (CUDA, ArrayFire, Magma)
Generalizing to tensors (for elegant implementations of, e.g., ConvNets)
Demos
Thank You!

References
• Baydin AG, Pearlmutter BA, Radul AA, Siskind JM (Submitted) Automatic differentiation in machine learning: a survey [arXiv:1502.05767]
• Baydin AG, Pearlmutter BA, Siskind JM (Submitted) DiffSharp: automatic differentiation library [arXiv:1511.07727]
• Bengio Y (2013) Deep learning of representations: looking forward. Statistical Language and Speech Processing. LNCS 7978:1–37 [arXiv:1404.7456]
• Graves A, Wayne G, Danihelka I (2014) Neural Turing machines [arXiv:1410.5401]
• Grefenstette E, Hermann KM, Suleyman M, Blunsom P (2015) Learning to transduce with unbounded memory [arXiv:1506.02516]
• Griewank A, Walther A (2008) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia [DOI 10.1137/1.9780898717761]
• He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition [arXiv:1512.03385]
• Joulin A, Mikolov T (2015) Inferring algorithmic patterns with stack-augmented recurrent nets [arXiv:1503.01007]
• Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning [arXiv:1502.03492]
• Manzyuk O, Pearlmutter BA, Radul AA, Rush DR, Siskind JM (2012) Confusion of tagged perturbations in forward automatic differentiation of higher-order functions [arXiv:1211.4892]
• Pearlmutter BA, Siskind JM (2008) Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM TOPLAS 30(2):7 [DOI 10.1145/1330017.1330018]
• Siskind JM, Pearlmutter BA (2008) Nesting forward-mode AD in a functional framework. Higher-Order and Symbolic Computation 21(4):361–76 [DOI 10.1007/s10990-008-9037-1]
• Sukhbaatar S, Szlam A, Weston J, Fergus R (2015) Weakly supervised memory networks [arXiv:1503.08895]
• Syme D (2006) Leveraging .NET meta-programming components from F#: integrated queries and interoperable heterogeneous execution. 2006 Workshop on ML. ACM
• Vinyals O, Toshev A, Bengio S, Erhan D (2014) Show and tell: a neural image caption generator [arXiv:1411.4555]
• Wengert RE (1964) A simple automatic derivative evaluation program. Communications of the ACM 7(8):463–4
• Zaremba W, Sutskever I (2015) Reinforcement learning neural Turing machines [arXiv:1505.00521]