CS 4803 / 7643: Deep Learning

CS 4803 / 7643: Deep Learning. Dhruv Batra, Georgia Tech. Topics: Automatic Differentiation; (Finish) Forward mode vs Reverse mode AD; Patterns in backprop.

Transcript of CS 4803 / 7643: Deep Learning

Page 1: CS 4803 / 7643: Deep Learning

CS 4803 / 7643: Deep Learning

Dhruv Batra

Georgia Tech

Topics:
– Automatic Differentiation
– (Finish) Forward mode vs Reverse mode AD
– Patterns in backprop

Page 2: CS 4803 / 7643: Deep Learning

Administrivia
• HW1 Reminder
– Due: 09/09, 11:59pm
• HW2 out on 9/10
• Schedule: https://www.cc.gatech.edu/classes/AY2021/cs7643_fall/
• Project discussion next class

(C) Dhruv Batra 2

Page 3: CS 4803 / 7643: Deep Learning

Recap from last time

Page 4: CS 4803 / 7643: Deep Learning

How do we compute gradients?
• Analytic or “Manual” Differentiation
• Symbolic Differentiation
• Numerical Differentiation
• Automatic Differentiation
– Forward mode AD
– Reverse mode AD
• aka “backprop”
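As a rough illustration of the first and third options, the sketch below compares a hand-derived gradient with a central-difference estimate. The function f(x1, x2) = x1*x2 + sin(x1), the step size h, and all helper names are assumptions made for this example; they are not taken from the slides.

```python
import math

def f(x1, x2):
    # Hypothetical example function, chosen only for illustration.
    return x1 * x2 + math.sin(x1)

def analytic_grad(x1, x2):
    # "Manual" differentiation: derivatives worked out by hand.
    return (x2 + math.cos(x1), x1)

def numerical_grad(func, x1, x2, h=1e-6):
    # Central differences: two function evaluations per input variable.
    d1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    d2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return (d1, d2)

exact = analytic_grad(2.0, 3.0)
approx = numerical_grad(f, 2.0, 3.0)
```

The numerical estimate is approximate and needs extra function evaluations for every input, which is one reason it tends to be used for gradient checking rather than for training.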

Page 5: CS 4803 / 7643: Deep Learning

Vector/Matrix Derivatives Notation

Page 6: CS 4803 / 7643: Deep Learning

Vector/Matrix Derivatives Notation

Page 7: CS 4803 / 7643: Deep Learning

Vector Derivative Example

Page 8: CS 4803 / 7643: Deep Learning

Extension to Tensors

Page 9: CS 4803 / 7643: Deep Learning

Chain Rule: Composite Functions

Page 10: CS 4803 / 7643: Deep Learning

Chain Rule: Scalar Case

Page 11: CS 4803 / 7643: Deep Learning

Chain Rule: Vector Case

Page 12: CS 4803 / 7643: Deep Learning

Chain Rule: Jacobian view

Page 13: CS 4803 / 7643: Deep Learning

Chain Rule: Graphical view

Page 14: CS 4803 / 7643: Deep Learning

Chain Rule: Cascaded

Page 15: CS 4803 / 7643: Deep Learning

Deep Learning = Differentiable Programming

• Computation = Graph
– Input = Data + Parameters
– Output = Loss
– Scheduling = Topological ordering
• Auto-Diff
– A family of algorithms for implementing chain-rule on computation graphs

Page 16: CS 4803 / 7643: Deep Learning

Directed Acyclic Graphs (DAGs)
• Exactly what the name suggests
– Directed edges
– No (directed) cycles
– Underlying undirected cycles okay

Page 17: CS 4803 / 7643: Deep Learning

Directed Acyclic Graphs (DAGs)
• Concept
– Topological Ordering
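A topological ordering lists the nodes so that every edge points from an earlier node to a later one, which is what makes it usable as a schedule for the forward pass. A minimal sketch using Kahn's algorithm follows; the node names and the small example graph are hypothetical and chosen only for illustration.

```python
from collections import deque

def topological_order(nodes, edges):
    """Return the nodes in an order where every edge goes from earlier to later."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    # Nodes with no incoming edges (the graph inputs) can be scheduled first.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so no topological order exists")
    return order

# Hypothetical graph: x1 and x2 feed a multiply, x1 feeds a sine, both feed an add.
nodes = ["x1", "x2", "mul", "sin", "add"]
edges = [("x1", "mul"), ("x2", "mul"), ("x1", "sin"), ("mul", "add"), ("sin", "add")]
order = topological_order(nodes, edges)
```

Note that this example graph contains an undirected cycle (x1, mul, add, sin) but no directed one, so an ordering exists.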

Page 18: CS 4803 / 7643: Deep Learning

Computational Graphs
• Notation

Page 19: CS 4803 / 7643: Deep Learning

Deep Learning = Differentiable Programming

• Computation = Graph
– Input = Data + Parameters
– Output = Loss
– Scheduling = Topological ordering
• Auto-Diff
– A family of algorithms for implementing chain-rule on computation graphs

Page 20: CS 4803 / 7643: Deep Learning

Forward mode AD

Page 21: CS 4803 / 7643: Deep Learning

Reverse mode AD

Page 22: CS 4803 / 7643: Deep Learning

Plan for Today
• Automatic Differentiation
– (Finish) Forward mode vs Reverse mode AD
– Backprop
– Patterns in backprop

Page 23: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Page 24: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Page 25: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Page 26: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Page 27: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another input variable x3?

Page 28: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another input variable x3?
A: more sophisticated graph; d+1 “passes” for d+1 variables
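The one-pass-per-input behaviour can be seen in a small dual-number sketch of forward mode AD. It assumes the graph computes f(x1, x2) = x1*x2 + sin(x1), which matches the nodes in the figure (*, sin, +) but is not stated in the transcript; the Dual class and helper names are hypothetical.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        # Sum rule: derivatives add.
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_sin(a):
    # Chain rule through sin.
    return Dual(math.sin(a.value), math.cos(a.value) * a.deriv)

def f(x1, x2):
    # Assumed example function: x1 * x2 + sin(x1).
    return x1 * x2 + dual_sin(x1)

def forward_mode_grad(func, inputs):
    # One full pass through the graph per input variable:
    # seed that input's derivative with 1 and all others with 0.
    grad = []
    for i in range(len(inputs)):
        duals = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(inputs)]
        grad.append(func(*duals).deriv)
    return grad

grad = forward_mode_grad(f, [2.0, 3.0])
```

Each additional input variable adds one more iteration to the loop in `forward_mode_grad`, which is the extra "pass" the answer refers to.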

Page 29: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another output variable f2?

Page 30: CS 4803 / 7643: Deep Learning

Example: Forward mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another output variable f2?
A: more sophisticated graph; d “passes” for d variables

Page 31: CS 4803 / 7643: Deep Learning

Example: Reverse mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Page 32: CS 4803 / 7643: Deep Learning

Example: Reverse mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Page 33: CS 4803 / 7643: Deep Learning

Gradients add at branches

[Figure: graph with a + node]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 34: CS 4803 / 7643: Deep Learning

Duality in Fprop and Bprop

[Figure: FPROP vs BPROP; a SUM (+) in one direction pairs with a COPY in the other]
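The duality can be checked on a tiny case. In the sketch below, y = x * x is written so that x is explicitly copied into two branches on the forward pass; on the backward pass the two branch gradients are summed. The second pair of functions shows the reverse pairing for a sum node. All function names are hypothetical.

```python
def copy_forward(x):
    # FPROP: COPY. The same value is sent down two branches.
    a, b = x, x
    y = a * b
    return y, (a, b)

def copy_backward(cache, dy=1.0):
    a, b = cache
    da = dy * b      # gradient arriving from branch a
    db = dy * a      # gradient arriving from branch b
    dx = da + db     # BPROP: SUM. Gradients add at the branch point.
    return dx

def sum_forward(a, b):
    # FPROP: SUM.
    return a + b

def sum_backward(ds=1.0):
    # BPROP: COPY. The upstream gradient is passed unchanged to both inputs.
    return ds, ds

y, cache = copy_forward(3.0)
dx = copy_backward(cache)
da, db = sum_backward(2.0)
```

For x = 3 this gives dx = 6, matching the derivative 2x of x squared.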

Page 35: CS 4803 / 7643: Deep Learning

Example: Reverse mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Page 36: CS 4803 / 7643: Deep Learning

Example: Reverse mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another input variable x3?

Page 37: CS 4803 / 7643: Deep Learning

Example: Reverse mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another input variable x3?
A: more sophisticated graph; single “backward prop”
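A hand-written sketch of reverse mode on the same kind of graph shows why one backward pass is enough regardless of the number of inputs. It assumes f(x1, x2) = x1*x2 + sin(x1), which is consistent with the nodes in the figure but not stated in the transcript; the function and variable names are hypothetical.

```python
import math

def forward(x1, x2):
    # Forward pass: evaluate nodes in topological order and keep what
    # the backward pass will need.
    a = x1 * x2          # * node
    b = math.sin(x1)     # sin node
    f = a + b            # + node
    cache = (x1, x2)
    return f, cache

def backward(cache, df=1.0):
    # Backward pass: visit nodes in reverse topological order,
    # starting from df/df = 1 at the output.
    x1, x2 = cache
    da = df                       # + node passes the gradient to both inputs
    db = df
    dx1 = da * x2                 # * node: gradient w.r.t. x1 is the other input
    dx2 = da * x1                 # * node: gradient w.r.t. x2 is the other input
    dx1 += db * math.cos(x1)      # sin node; x1 branches, so contributions add
    return dx1, dx2

f_value, cache = forward(2.0, 3.0)
dx1, dx2 = backward(cache)
```

Both partial derivatives come out of the single call to `backward`; an extra input x3 would add lines inside that function rather than another pass.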

Page 38: CS 4803 / 7643: Deep Learning

Example: Reverse mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another output variable f2?

Page 39: CS 4803 / 7643: Deep Learning

Example: Reverse mode AD

[Figure: computation graph with inputs x1, x2 and nodes *, sin( ), +]

Q: What happens if there’s another output variable f2?
A: more sophisticated graph; c “backward props” for c variables

Page 40: CS 4803 / 7643: Deep Learning

Forward Pass vs Forward mode AD vs Reverse Mode AD

[Figure: three copies of the computation graph with inputs x1, x2 and nodes *, sin( ), +, one each for the forward pass, forward mode AD, and reverse mode AD]

Page 41: CS 4803 / 7643: Deep Learning

Forward mode vs Reverse Mode
• What are the differences?

[Figure: two copies of the computation graph with inputs x1, x2 and nodes *, sin( ), +, one for forward mode and one for reverse mode]

Page 42: CS 4803 / 7643: Deep Learning

Forward mode vs Reverse Mode
• What are the differences?
• Which one is faster to compute?
– Forward or backward?

Page 43: CS 4803 / 7643: Deep Learning

Forward mode vs Reverse Mode
• x → Graph → L
• Intuition of Jacobian
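One way to make the Jacobian intuition concrete, assuming the graph is a simple chain of n modules with intermediate values h_1, ..., h_n (this notation is an assumption and does not appear on the slides):

```latex
\frac{\partial L}{\partial x}
  = J_n \, J_{n-1} \cdots J_2 \, J_1,
\qquad
J_k = \frac{\partial h_k}{\partial h_{k-1}}, \quad h_0 = x, \quad h_n = L

\text{Forward mode: } J_n \bigl( \cdots \bigl( J_2 \, ( J_1 \, e_i ) \bigr) \bigr)
\quad \text{(one pass per input direction } e_i,\ i = 1, \dots, d \text{)}

\text{Reverse mode: } \bigl( \bigl( ( 1 \cdot J_n ) \, J_{n-1} \bigr) \cdots \bigr) J_1
\quad \text{(one pass for a scalar loss } L \text{)}
```

Both orderings give the same product. With x in R^d and a scalar L, forward mode needs d Jacobian-vector passes while reverse mode needs a single vector-Jacobian pass, at the price of storing the intermediate values from the forward pass.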

Page 44: CS 4803 / 7643: Deep Learning

Forward mode vs Reverse Mode
• What are the differences?
• Which one is faster to compute?
– Forward or backward?
• Which one is more memory efficient (less storage)?
– Forward or backward?

Page 45: CS 4803 / 7643: Deep Learning

Plan for Today
• Automatic Differentiation
– (Finish) Forward mode vs Reverse mode AD
– Backprop
– Patterns in backprop

Page 46: CS 4803 / 7643: Deep Learning

Neural Network Computation Graph

Figure Credit: Andrea Vedaldi

Page 47: CS 4803 / 7643: Deep Learning

Backprop

Figure Credit: Andrea Vedaldi

Page 48: CS 4803 / 7643: Deep Learning

Computational Graph

Any DAG of differentiable modules is allowed!

Slide Credit: Marc'Aurelio Ranzato

Page 49: CS 4803 / 7643: Deep Learning

Key Computation: Forward-Prop

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 50: CS 4803 / 7643: Deep Learning

Key Computation: Back-Prop

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 51: CS 4803 / 7643: Deep Learning

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 52: CS 4803 / 7643: Deep Learning

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 53: CS 4803 / 7643: Deep Learning

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 54: CS 4803 / 7643: Deep Learning

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 55: CS 4803 / 7643: Deep Learning

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 56: CS 4803 / 7643: Deep Learning

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Page 57: CS 4803 / 7643: Deep Learning

Neural Network Training
• Step 1: Compute Loss on mini-batch [F-Pass]
• Step 2: Compute gradients wrt parameters [B-Pass]
• Step 3: Use gradient to update parameters

Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
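The three steps can be sketched on a toy problem. The example below fits a one-dimensional linear model with a mean-squared-error loss and plain gradient descent; the model, data, learning rate, and all names are assumptions made for illustration and are not the network shown on the slides.

```python
def train_step(w, b, xs, ys, lr=0.1):
    n = len(xs)

    # Step 1 [F-Pass]: compute predictions and the loss on the mini-batch.
    preds = [w * x + b for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n

    # Step 2 [B-Pass]: compute gradients of the loss w.r.t. the parameters.
    dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    db = sum(2 * (p - y) for p, y in zip(preds, ys)) / n

    # Step 3: use the gradients to update the parameters.
    return w - lr * dw, b - lr * db, loss

# Hypothetical mini-batch generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, xs, ys)
    losses.append(loss)
```

Here the same batch is reused at every step for simplicity; in practice each step would draw a fresh mini-batch.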

Page 58: CS 4803 / 7643: Deep Learning

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Backpropagation: a simple example

Page 59: CS 4803 / 7643: Deep Learning

Backpropagation: a simple example

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 60: CS 4803 / 7643: Deep Learning

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 61: CS 4803 / 7643: Deep Learning

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Q: What is an add gate?

Page 62: CS 4803 / 7643: Deep Learning

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

add gate: gradient distributor

Page 63: CS 4803 / 7643: Deep Learning

add gate: gradient distributor

Q: What is a max gate?

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 64: CS 4803 / 7643: Deep Learning

add gate: gradient distributor

max gate: gradient router

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 65: CS 4803 / 7643: Deep Learning

add gate: gradient distributor

max gate: gradient router

Q: What is a mul gate?

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 66: CS 4803 / 7643: Deep Learning

add gate: gradient distributor

max gate: gradient router

mul gate: gradient switcher

Patterns in backward flow

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
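The three patterns follow directly from the local derivatives of each gate. A minimal sketch for two-input gates is shown below; the function names are hypothetical, and routing ties in the max gate to the first input is an arbitrary choice made here.

```python
def add_backward(x, y, dout):
    # add gate: gradient distributor. Both inputs receive the upstream gradient.
    return dout, dout

def max_backward(x, y, dout):
    # max gate: gradient router. Only the larger input receives the gradient.
    # Ties are sent to x here, which is an arbitrary choice.
    return (dout, 0.0) if x >= y else (0.0, dout)

def mul_backward(x, y, dout):
    # mul gate: gradient switcher. Each input receives the upstream gradient
    # scaled by the other input's value.
    return dout * y, dout * x

add_grads = add_backward(3.0, -4.0, 2.0)
max_grads = max_backward(3.0, -4.0, 2.0)
mul_grads = mul_backward(3.0, -4.0, 2.0)
```

With inputs 3 and -4 and an upstream gradient of 2, the add gate returns (2, 2), the max gate returns (2, 0), and the mul gate returns (-8, 6).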

Page 67: CS 4803 / 7643: Deep Learning

Gradients add at branches

[Figure: graph with a + node]

Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Page 68: CS 4803 / 7643: Deep Learning

Duality in Fprop and Bprop

[Figure: FPROP vs BPROP; a SUM (+) in one direction pairs with a COPY in the other]