Transcript of Deep Learning Part I

  • Deep Learning Part I

    Yin Li, [email protected]

    University of Wisconsin, Madison

    Some of the slides from Yingyu Liang, Marc'Aurelio Ranzato and others

  • Neural Networks / Deep Learning

    • What type of functions shall we consider for f?

    [Figure: an input image (chair) is mapped to features plus a decision, f(x; θ)]

    Proposal: Composing a set of (nonlinear) functions g

    f(x; θ) = g_1(… g_{n−1}(g_n(x; θ_n), θ_{n−1}) …, θ_1)

    Example: a = sigmoid(Wᵀx + b) = g(x; W, b)

  • Neural network as a computational graph

    • A two-layer neural network
    • Forward propagation vs. backward propagation

    [Computational graph: x → (∗ W(1)) → (+ b(1)) → (∗ W(2)) → (+ b(2)) → a]
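A minimal NumPy sketch of the forward pass through this two-layer graph, following the per-layer form a = σ(Wᵀx + b) used on these slides; the layer sizes, the sigmoid choice, and all names below are illustrative assumptions, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_forward(x, W1, b1, W2, b2):
    # Layer 1: linear operation followed by a nonlinear activation
    h = sigmoid(W1.T @ x + b1)          # shape: (m1,)
    # Layer 2: same pattern, producing the output a
    a = sigmoid(W2.T @ h + b2)          # shape: (m2,)
    return a

# Example with made-up sizes: n = 4 inputs, m1 = 3 hidden units, m2 = 2 outputs
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)
print(two_layer_forward(x, W1, b1, W2, b2))
```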

  • What prevents us from learning a deep network?

    • Say 100 layers …

    • Way too many parameters
    • a = sigmoid(Wᵀx + b) = g(x; W, b)
    • x ∈ R^n, W ∈ R^{n×m}, b ∈ R^m, a ∈ R^m
    • If you have a high-dimensional input (e.g., an image)

    • Gradient descent does not quite work any more …
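To make the parameter count concrete with hypothetical numbers (not from the slides): flattening a 200×200×3 image gives n = 120,000 inputs, so a single fully connected layer with m = 1,000 outputs already needs a weight matrix W ∈ R^{n×m} with 120,000 × 1,000 = 1.2 × 10⁸ entries, before any of the remaining layers of a 100-layer network are counted.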

  • Deep learning: a sketch

    Deep Learning: Composing a set of (nonlinear) functions g

    f(x; θ) = g_1(… g_{n−1}(g_n(x; θ_n), θ_{n−1}) …, θ_1)

    Each function g is represented by a layer of a neural network

    • General form of each layer: a = σ(Wᵀx + b) = g(x; W, b)

    • σ is the activation function

    • Key element: Linear operations + Nonlinear activations

  • Deep learning: a sketch

    Deep Learning: Composing a set of (nonlinear) functions g

    f(x; θ) = g_1(… g_{n−1}(g_n(x; θ_n), θ_{n−1}) …, θ_1)

    Each function g is represented by a layer of a neural network

    • Key element: Linear operations + Nonlinear activations: σ(Wᵀx + b)

    [Dimensions: Wᵀ is m×n, x is n×1, b is m×1, so σ(Wᵀx + b) is m×1]

  • How to get deep networks to work?

    Deep Learning: Composing a set of (nonlinear) functions g

    f(x; θ) = g_1(… g_{n−1}(g_n(x; θ_n), θ_{n−1}) …, θ_1)

    Each function g is represented by a layer of a neural network

    • Key element: σ(Wᵀx + b)

    • Which activation function to use?

    • What linear function to use?

    • The design of the network …

  • The choice of activation function

    Sigmoid

    g(x) = 1/(1 + exp(−x))

    g′(x) = g(x)(1 − g(x))
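A minimal NumPy sketch of the sigmoid and its derivative as written above (function names are my own):

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # g'(x) = g(x) * (1 - g(x))
    g = sigmoid(x)
    return g * (1.0 - g)

# The gradient peaks at 0 and vanishes for large |x| (saturation)
print(sigmoid_grad(np.array([-10.0, 0.0, 10.0])))  # ~[4.5e-05, 0.25, 4.5e-05]
```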

  • The choice of activation function

    • Saturated neurons “kill” the gradients

    • Exponential function is expensive

    Sigmoid

    g(x) = 1/(1 + exp(−x))

    g′(x) = g(x)(1 − g(x))

  • The choice of activation function

    • Does not saturate (in the positive region)

    • Very computationally efficient

    • Converges much faster than sigmoid in practice

    • Differentiable?

    ReLU (Rectified Linear Unit)

    f(x) = max(0, x)

  • The choice of activation function

    ReLU (Rectified Linear Unit)

    f(x) = max(0, x)

    • Does not saturate (in the positive region)

    • Very computationally efficient

    • Converges much faster than sigmoid in practice

    • Differentiable? Yes, if we fix f′(0)

  • The choice of activation function

    ReLU (Rectified Linear Unit)

    f(x) = max(0, x)

    • Does not saturate (in the positive region)

    • Very computationally efficient

    • Converges much faster than sigmoid in practice

    • Differentiable? Yes, if we fix f′(0) (see the sketch below)

    • Zero gradient in the negative region
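A minimal NumPy sketch of ReLU and the derivative convention mentioned above; fixing f′(0) = 0 here is one common choice, assumed for illustration:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0; at x = 0 we simply
    # fix f'(0) = 0 (an arbitrary but common convention).
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]  <- zero gradient in the negative region
```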

  • The choice of activation function

    Sigmoid

    tanh

    ReLU

    Leaky ReLU

    Maxout

    ELU
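A sketch of most of the activations listed above, with assumed default hyperparameters (the leaky-ReLU slope and the ELU α are common choices, not values from the slides):

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0.0, x)
def leaky_relu(x, slope=0.01):      # small slope for x < 0 is a hyperparameter
    return np.where(x > 0, x, slope * x)
def elu(x, alpha=1.0):              # smooth, saturating negative region
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
# (Maxout is different: it takes the max over several linear functions of x,
#  so it needs extra weights rather than a fixed elementwise formula.)

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu, elu):
    print(f.__name__, np.round(f(x), 3))
```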

  • How to get deep networks to work?

    Deep Learning: Composing a set of (nonlinear) functions g

    f(x; θ) = g_1(… g_{n−1}(g_n(x; θ_n), θ_{n−1}) …, θ_1)

    Each function g is represented by a layer of a neural network

    • Key element: σ(Wᵀx + b)

    • Which activation function to use?

    • What linear function to use?

    • The design of the network …

  • Convolution layer

    • Use convolution in place of general matrix multiplication

    for a specific kind of weight matrix 𝑾𝑾

    • Strong empirical application performance

    a = σ(Wᵀx + b)

  • Convolution

  • Convolution: discrete version

    • Given arrays u_t and w_t, their convolution is a function s_t

    • Written as

    • When u_t or w_t is not defined, it is assumed to be 0

    s_t = Σ_{a = −∞}^{+∞} u_a · w_{t−a}

    written as s = u ∗ w, or s_t = (u ∗ w)_t
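A direct NumPy sketch of this definition (the helper name conv1d_full is mine); NumPy's np.convolve computes the same "full" convolution and is used as a cross-check:

```python
import numpy as np

def conv1d_full(u, w):
    # s_t = sum_a u_a * w_{t-a}, treating out-of-range entries as 0
    u, w = np.asarray(u, float), np.asarray(w, float)
    s = np.zeros(len(u) + len(w) - 1)
    for t in range(len(s)):
        for a in range(len(u)):
            if 0 <= t - a < len(w):
                s[t] += u[a] * w[t - a]
    return s

u = [1.0, 2.0, 3.0]
w = [10.0, 20.0]
print(conv1d_full(u, w))   # [10. 40. 70. 60.]
print(np.convolve(u, w))   # NumPy's built-in gives the same result
```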

  • Illustration 1

    u = [a, b, c, d, e, f],  w = [z, y, x]

    s_3 = u_1·w_2 + u_2·w_1 + u_3·w_0 = xb + yc + zd
    (kernel entries w_2, w_1, w_0 aligned with u_1, u_2, u_3)

  • Illustration 1

    s_4 = u_2·w_2 + u_3·w_1 + u_4·w_0 = xc + yd + ze
    (w_2, w_1, w_0 aligned with u_2, u_3, u_4)

  • Illustration 1

    s_5 = u_3·w_2 + u_4·w_1 + u_5·w_0 = xd + ye + zf
    (w_2, w_1, w_0 aligned with u_3, u_4, u_5)

  • Illustration 1: boundary case

    s_6 = u_4·w_2 + u_5·w_1 = xe + yf
    (w_0 would multiply u_6, which is not defined and so is treated as 0)
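A quick numerical check of the four outputs above, plugging made-up numbers in for the symbols and comparing against np.convolve:

```python
import numpy as np

# u = [a, b, c, d, e, f], w = [z, y, x]
a, b, c, d, e, f = 1, 2, 3, 4, 5, 6
x, y, z = 10, 20, 30

u = np.array([a, b, c, d, e, f], float)
w = np.array([z, y, x], float)

s = np.convolve(u, w)            # full convolution, entries s_0 ... s_7
print(s[3], x*b + y*c + z*d)     # s_3            -> 200.0 200
print(s[4], x*c + y*d + z*e)     # s_4            -> 260.0 260
print(s[5], x*d + y*e + z*f)     # s_5            -> 320.0 320
print(s[6], x*e + y*f)           # s_6 (boundary) -> 170.0 170
```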

  • Illustration 1 as matrix multiplication

    The same convolution written as a matrix-vector product, with a banded matrix built from the kernel:

    [ y z 0 0 0 0 ]   [ a ]   [ s_1 ]
    [ x y z 0 0 0 ]   [ b ]   [ s_2 ]
    [ 0 x y z 0 0 ] · [ c ] = [ s_3 ]
    [ 0 0 x y z 0 ]   [ d ]   [ s_4 ]
    [ 0 0 0 x y z ]   [ e ]   [ s_5 ]
    [ 0 0 0 0 x y ]   [ f ]   [ s_6 ]
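A sketch that builds this banded matrix in NumPy and checks that its product with u reproduces the convolution (the helper name and the length-3 kernel assumption are mine):

```python
import numpy as np

def conv_as_matrix(w, n):
    """Banded matrix whose product with u (length n) gives s_1 ... s_n
    for a length-3 kernel w = [z, y, x], as in the illustration."""
    z, y, x = w
    M = np.zeros((n, n))
    for i in range(n):
        if i - 1 >= 0: M[i, i - 1] = x
        M[i, i] = y
        if i + 1 < n:  M[i, i + 1] = z
    return M

u = np.array([1, 2, 3, 4, 5, 6], float)
w = np.array([30, 20, 10], float)      # [z, y, x]
M = conv_as_matrix(w, len(u))
print(M @ u)                           # s_1 ... s_6
print(np.convolve(u, w)[1:-1])         # same values from the full convolution
```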

  • Illustration 2: two dimensional case

    Input (3×4):
    a b c d
    e f g h
    i j k l

    Kernel (2×2):
    w x
    y z

    Output (top-left entry): wa + bx + ey + fz

  • Illustration 2

    Input:
    a b c d
    e f g h
    i j k l

    Kernel:
    w x
    y z

    Output entries so far: wa + bx + ey + fz,  bw + cx + fy + gz

  • Illustration 2

    Input:
    a b c d
    e f g h
    i j k l

    Kernel (or filter):
    w x
    y z

    Feature map (first two entries): wa + bx + ey + fz,  bw + cx + fy + gz
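A minimal sketch of this 2D sliding-window operation with made-up numbers standing in for a..l and w, x, y, z; as in the illustration, the kernel is not flipped, which is the cross-correlation form that deep-learning "convolution" layers typically compute:

```python
import numpy as np

def conv2d_valid(inp, kernel):
    # Slide the kernel over the input, take elementwise products and sum,
    # exactly as in the illustration ("valid" positions only, no padding).
    H, W = inp.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(inp[i:i+kh, j:j+kw] * kernel)
    return out

inp = np.arange(1, 13, dtype=float).reshape(3, 4)   # a..l -> 1..12
kernel = np.array([[1.0, 2.0],                      # w x
                   [3.0, 4.0]])                     # y z
print(conv2d_valid(inp, kernel))
# Top-left entry = 1*1 + 2*2 + 3*5 + 4*6 = 44, i.e. wa + bx + ey + fz
```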

  • Slides Credit: Deep Learning Tutorial by Marc’Aurelio Ranzato

    • Each convolution kernel is a local pattern detector

    • Use many of them in a convolutional layer!

  • Convolutional neural networks

    • Strong empirical application performance

    • Convolutional networks: neural networks that use convolution in place of general matrix multiplication in at least one of their layers

  • Advantage: sparse interaction

    Figure from Deep Learning, by Goodfellow, Bengio, and Courville

    Fully connected layer: m × n edges
    (m output nodes, n input nodes)

  • Advantage: sparse interaction

    Figure from Deep Learning, by Goodfellow, Bengio, and Courville

    Convolutional layer: ≤ m × k edges
    (m output nodes, n input nodes, k = kernel size)

  • Advantage: sparse interaction

    Figure from Deep Learning, by Goodfellow, Bengio, and Courville

    Multiple convolutional layers: larger receptive field

  • 2D convolution: spatial dimensions

    7×7 input (spatially); assume a 3×3 filter


  • 2D convolution: spatial dimensions

    7×7 input (spatially); a 3×3 filter => 5×5 output

  • 2D convolution: spatial dimensions

    7×7 input (spatially); a 3×3 filter applied with stride 2


  • 2D convolution: spatial dimensions

    7×7 input (spatially); a 3×3 filter applied with stride 2 => 3×3 output
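In general (with no padding), an N×N input convolved with an F×F filter at stride S gives an output of spatial size (N − F)/S + 1, which matches the two cases above: (7 − 3)/1 + 1 = 5 and (7 − 3)/2 + 1 = 3. A quick check (the helper name is mine):

```python
def conv_output_size(n, f, stride=1):
    # (N - F) / stride + 1, valid only when (n - f) divides evenly by stride
    assert (n - f) % stride == 0, "filter does not tile the input evenly"
    return (n - f) // stride + 1

print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
```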
