
Lecture 8: The Topology of Neural Nets and Deep Learning

May 16, 2019

Gunnar Carlsson

Stanford University


Perceptrons

- Input variables are initially Boolean, as is the output
- Output = sign(∑_i w_i x_i)
- Given an output function on data, “train” the weights w_i to best approximate a given Boolean function (sketched below)
- Ultimately generalize to include continuous functions as well as Boolean ones
- Couple perceptrons together to get networks of perceptrons
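
A minimal sketch of a single perceptron in Python/NumPy, assuming the classic mistake-driven training rule; the Boolean AND example and all names here are illustrative, not code from the lecture.

```python
import numpy as np

def predict(w, x):
    """Perceptron output: the sign of the weighted sum sum_i w_i * x_i."""
    return np.sign(np.dot(w, x))

def train(X, y, epochs=20, lr=1.0):
    """Adjust weights so sign(w . x) approximates target labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if predict(w, xi) != yi:      # on a mistake, nudge w toward the example
                w += lr * yi * xi
    return w

# Boolean AND with inputs in {-1, +1}; the first column is a constant bias input.
X = np.array([[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]])
y = np.array([-1, -1, -1, 1])
w = train(X, y)
print([int(predict(w, x)) for x in X])    # [-1, -1, -1, 1]
```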


Activation Functions


Neural Networks

- The idea is to hook several perceptrons together to create a more complicated and expressive computational unit
- The connection pattern can be encoded by a directed graph
- Each node is called a neuron. It has a state, which can be Boolean or continuous
- The state is updated based on the graph connections


Neural Networks

- An architecture is a choice of graph structure
- Weights are assigned to the edges
- Different architectures suit different types of data and different types of problems
- Training is done by performing a stochastic version of gradient descent on the weights (sketched below)
- This is a many-variable optimization problem
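
A sketch of the training loop, assuming a simple linear model and squared-error loss (our choices, for concreteness): each step estimates the gradient on a random mini-batch and moves the weights downhill.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, X, y):
    """Gradient of the squared-error loss for a linear model y_hat = X @ w."""
    residual = X @ w - y
    return 2 * X.T @ residual / len(y)

def sgd(X, y, lr=0.1, batch_size=4, steps=500):
    w = rng.normal(size=X.shape[1])                 # random initial weights
    for _ in range(steps):
        idx = rng.integers(0, len(y), batch_size)   # random mini-batch
        w -= lr * loss_grad(w, X[idx], y[idx])      # noisy downhill step
    return w

X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
print(np.round(sgd(X, y), 2))   # approximately [ 1.  -2.   0.5]
```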


Feed Forward Networks

- Designed to compute composites of simple functions, i.e. of functions computed by individual perceptrons (see the sketch below)
- Neurons are divided into layers
- The collection of layers is given a total ordering
- There is an edge from neuron A to neuron B if and only if the layer containing B immediately follows the layer containing A
- One optimizes the fit of a function of this form to a given function over the space of all possible weights
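
A sketch of a feed forward network as a composite of per-layer functions; the layer sizes, tanh activation, and random weights are illustrative assumptions.

```python
import numpy as np

def layer(W, x):
    return np.tanh(W @ x)          # one layer: weighted sums, then a nonlinearity

def feed_forward(weights, x):
    """Compose the layers in their total ordering: f = f_n ∘ ... ∘ f_1."""
    for W in weights:
        x = layer(W, x)
    return x

rng = np.random.default_rng(1)
sizes = [5, 8, 8, 2]               # input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
print(feed_forward(weights, rng.normal(size=5)))
```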


Hopfield Networks and Boltzmann Machines

- Fully connected, although one may enforce that some edges are given zero weight
- The idea is to learn to recognize examples, or a distribution on a set
- The examples are local minima of an energy function
- A dynamical system is created which flows to the minima of the energy function (sketched below)
- An initial point which is close to one of the examples naturally flows to the corresponding local minimum
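
A sketch of a Hopfield network under the standard Hebbian storage rule (an assumption; the slide does not fix one): stored ±1 patterns become local minima of the energy E(s) = -1/2 sᵀWs, and asynchronous updates flow downhill to them.

```python
import numpy as np

def store(patterns):
    """Hebbian weights for an array of ±1 patterns (one per row)."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0)                 # no self-connections
    return W

def energy(W, s):
    return -0.5 * s @ W @ s

def recall(W, s, sweeps=5):
    """Asynchronous updates: each flip never increases the energy."""
    s = s.copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

patterns = np.array([[1, 1, 1, 1, -1, -1, -1, -1],
                     [1, -1, 1, -1, 1, -1, 1, -1]])
W = store(patterns)
noisy = patterns[0].copy()
noisy[0] = -1                              # corrupt one bit
print(recall(W, noisy))                    # flows back to the stored example
```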


Convolutional Neural Networks


Locality


- Locality is a restriction on the architecture (see the sketch below)
- It uses the geometry of the feature space (grid geometry) to restrict the connections
- This means that the features “learned” will be local, i.e. will involve only neighborhoods in the picture
- The restriction means models are smaller and learning is faster than in the fully connected situation
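
A sketch of the locality restriction as a connectivity mask on an 8×8 grid (the sizes are our own): each unit may connect only to its 3×3 neighborhood, which cuts the weight count sharply.

```python
import numpy as np

side = 8                                    # 8x8 pixel grid
n = side * side

# mask[p, q] is True when pixel q lies in the 3x3 neighborhood of pixel p,
# so only those connections carry weights.
mask = np.zeros((n, n), dtype=bool)
for i in range(side):
    for j in range(side):
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ii, jj = i + di, j + dj
                if 0 <= ii < side and 0 <= jj < side:
                    mask[i * side + j, ii * side + jj] = True

print("fully connected weights:", n * n)    # 4096
print("local weights:", int(mask.sum()))    # 484
```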


Homogeneity or “Convolutionality”


- A restriction on the assignment of weights, not on the architecture (sketched below)
- This means training will be a constrained optimization problem
- It means that recognizing a cat is done the same way regardless of the position of the cat
- Exploits symmetries in the space of features
- The same idea can be exploited for other feature geometries
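
A sketch of the weight constraint: one shared 3×3 weight patch slides over the whole image, so the same feature produces the same response wherever it sits. The plain-loop implementation is illustrative.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) with a single shared kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[1, 1] = image[4, 3] = 1.0            # the same feature at two positions
kernel = np.ones((3, 3)) / 9.0             # one shared 3x3 weight patch
response = conv2d(image, kernel)
print(np.round(response, 2))               # equal peak responses at both locations
```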


Pooling


- Pooling creates a lower-resolution, more abstract version (sketched below)
- Such layers are interspersed between convolutional layers
- This mimics our understanding of how the human visual pathway works
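
A minimal sketch of 2×2 max pooling, one common choice of pooling operation (the slide does not fix one): each output keeps the strongest response in its window, giving a lower-resolution summary of the feature map.

```python
import numpy as np

def max_pool(x, k=2):
    """Non-overlapping k x k max pooling of a 2-D feature map."""
    h, w = x.shape
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool(x))    # [[ 5  7], [13 15]]
```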


Problems with Deep Learning

- Long training time
- Requires a lot of data
- Models are often very large
- Models don't generalize well
- Adversarial examples


Adversarial Examples


Do CNNs Work Like the Mammalian Visual Pathway?


Do CNNs Reflect Natural Image Statistics?

[Figure: primary and secondary visual pathway regions]


Creating Data Sets from CNNs

- Consider convolutional neural nets with 9 connections (a 3×3 patch) for each pixel
- Construct 9-vectors for each pixel (sketched below)
- Many grids in the first and second layers
- Many experiments
- Apply the same methodology as in Lecture 4
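
A sketch of the data-set construction under our reading of the slides: flatten each learned 3×3 weight patch into a 9-vector, then mean-center and normalize it as in Lecture 4. The random weight array stands in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 3, 3))        # stand-in for 64 learned 3x3 filters

vectors = weights.reshape(len(weights), 9)   # one 9-vector per filter
vectors = vectors - vectors.mean(axis=1, keepdims=True)   # mean-center each vector
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors = vectors / norms                    # normalize onto the unit sphere in R^9
print(vectors.shape)                         # (64, 9): ready for Mapper/persistence
```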


Findings: MNIST

MNIST for layer 1 of a depth two net, non-localized estimator


Findings: MNIST

MNIST for layer 1 of a depth two net, localized estimator


Findings: Cifar10

Layer 1 of depth two net, reduced to gray scale


Findings: Cifar10

Second layer, reduced to gray scale


Findings: Cifar10

1D barcode for a tightly thresholded data set for the 2nd layer, reduced to gray scale


Findings: Cifar10

Mapper representations over the number of iterations, tightly density thresholded, reduced to gray scale


Findings: Cifar10

1st layer, coarse density thresholding, color retained


Findings: Cifar10

1st layer, looser density thresholding, more localized density estimator, color retained


Findings: Cifar10

2nd layer, fine density thresholding, color retained


Findings: VGG16

Mapper findings from each of 13 layers, same density thresholding, relatively local estimator


Experiments

- Create new features by forming dot products with patches modeled on the primary circle (sketched below)
- Add these to the existing “raw” features
- Obtain a 3.5x speedup in training for the SVHN data set
- Obtain improved generalization for a model computed on MNIST and applied to SVHN
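
A sketch of the feature construction; the templates below discretize the primary circle as oriented linear ramps on a 3×3 patch, which is our own concrete choice, and the dot products are appended to the raw features.

```python
import numpy as np

def circle_templates(n_angles=8):
    """3x3 patches cos(θ)x + sin(θ)y at evenly spaced orientations, normalized."""
    coords = np.array([(x, y) for y in (-1, 0, 1) for x in (-1, 0, 1)], float)
    templates = []
    for theta in np.linspace(0, np.pi, n_angles, endpoint=False):
        t = coords @ np.array([np.cos(theta), np.sin(theta)])   # oriented ramp
        t -= t.mean()
        templates.append(t / np.linalg.norm(t))
    return np.stack(templates)               # (n_angles, 9)

def circle_features(patches):
    """Dot products of flattened 3x3 image patches with the circle templates."""
    return patches @ circle_templates().T

rng = np.random.default_rng(0)
patches = rng.normal(size=(100, 9))          # stand-in for real image patches
features = np.hstack([patches, circle_features(patches)])   # raw + new features
print(features.shape)                        # (100, 17)
```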


Building Architectures

- A key ingredient in constructing convolutional neural nets is the geometry of the feature space, i.e. the grid of pixels
- Are there other geometries that might be useful?
- We can build geometries on feature spaces using the primary circle or the Klein bottle
- The Mapper construction gives compressed geometric structures on the set of features of a general data set


Building Architectures: Feed Forward Systems

- Given two sets X and Y, a relation between X and Y is a subset R ⊆ X × Y
- Relations can be composed just like functions (sketched below)
- The graph of a function is an example
- There is a category CR whose objects are sets and whose morphisms from X to Y are the relations in X × Y
- An abstract feed forward system of depth n is a functor from the category
  n = 1 → 2 → ⋯ → n
  to CR
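
A small sketch of relations and their composition in Python, which is all the structure the category CR needs; an abstract feed forward system of depth n is then just a chain of such relations between the layer sets F(1), …, F(n).

```python
def compose(R, S):
    """Composite of R ⊆ X × Y and S ⊆ Y × Z: pairs (x, z) linked through some y."""
    return {(x, z) for (x, y1) in R for (y2, z) in S if y1 == y2}

def graph_of(f, X):
    """The graph of a function f : X → Y, viewed as a relation."""
    return {(x, f(x)) for x in X}

X = {1, 2, 3}
R = graph_of(lambda x: x + 1, X)          # functions are special relations
S = {(2, 'a'), (3, 'a'), (4, 'b')}        # a genuinely many-valued relation
print(compose(R, S))                      # {(1, 'a'), (2, 'a'), (3, 'b')}
```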


Building Architectures: Construct Computation Graph from Feed Forward Systems

- For an abstract feed forward system F : n → CR, we form a directed graph (sketched below)
- The vertex set is ∐_{i=1}^n F(i)
- For vertices v ∈ F(i) and w ∈ F(i+1), there is an edge from v to w if and only if (v, w) is in the relation F(i → i+1)
- There are no edges from v to w unless there is an i so that v ∈ F(i) and w ∈ F(i+1)
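
A sketch of the construction: tagging each element with its layer index implements the disjoint union, and the edges come exactly from the relations F(i → i+1). The example layers and relations are made up.

```python
def computation_graph(layers, relations):
    """layers: list of sets F(1..n); relations[i]: the relation F(i) → F(i+1)."""
    vertices = {(i, v) for i, layer in enumerate(layers) for v in layer}
    edges = {((i, v), (i + 1, w))
             for i, R in enumerate(relations)
             for (v, w) in R}              # edge iff (v, w) ∈ F(i → i+1)
    return vertices, edges

layers = [{'x1', 'x2'}, {'h1', 'h2', 'h3'}, {'out'}]
relations = [{('x1', 'h1'), ('x1', 'h2'), ('x2', 'h2'), ('x2', 'h3')},
             {('h1', 'out'), ('h2', 'out'), ('h3', 'out')}]
V, E = computation_graph(layers, relations)
print(len(V), len(E))                      # 6 vertices, 7 edges
```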


Building Architectures: Important Relations

- If Γ is a graph, then we have a relation R_Γ : V_Γ → V_Γ defined by (v, w) ∈ R_Γ if and only if (v, w) is an edge in Γ
- If (X, d) is a metric space and r > 0 is a threshold, then R_d(r) : X → X is the relation given by (x, x′) ∈ R_d(r) if and only if d(x, x′) ≤ r (sketched below)
- If f : X → Y is a function, then the graph {(x, f(x)) | x ∈ X} is a relation from X to Y
- If Γ1 and Γ2 are two Mapper models of the same data set, then there is a naturally defined relation R(Γ1, Γ2) from V_Γ1 to V_Γ2 given by (v, w) ∈ R(Γ1, Γ2) if and only if the collections corresponding to v and w have a data point in common
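
A sketch of two of these relations in code, assuming Euclidean distance for R_d(r) and a dictionary from vertices to point sets for the Mapper-overlap relation (both encodings are our own).

```python
import numpy as np

def metric_relation(X, r):
    """All pairs (i, j) with ||X[i] - X[j]|| <= r: the relation R_d(r)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return {(i, j) for i in range(len(X)) for j in range(len(X)) if d[i, j] <= r}

def mapper_relation(clusters1, clusters2):
    """(v, w) related iff the point collections for v and w intersect."""
    return {(v, w) for v, pts_v in clusters1.items()
                   for w, pts_w in clusters2.items() if pts_v & pts_w}

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
print(len(metric_relation(X, r=1.0)))

c1 = {'a': {1, 2, 3}, 'b': {4, 5}}
c2 = {'u': {3, 4}, 'v': {6}}
print(mapper_relation(c1, c2))             # {('a', 'u'), ('b', 'u')}
```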


Building Architectures

- The idea is to use “geometry” on the feature space to construct neural network architectures
- Geometry can arise in numerous ways
- A priori geometry: the grid in the case of image convolutional networks, linear arrays for time series and text
- Geometries found by research, e.g. the primary circle or the Klein bottle in the case of natural images
- Bespoke geometries obtained by taking Mapper models of the column spaces of data matrices


Building Architectures

- Symmetries can be used to construct other convolutional structures
- Example: rotations acting by multiplication by roots of unity on a discretization of the circle (sketched below)
- Analogues of the pooling constructions also exist for circles and Klein bottles
- The sphere occurs in 3D voxelized images; the icosahedron is a symmetric discretization
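
A sketch of convolution on a discretized circle: indices wrap around, and the check at the end confirms equivariance under rotation, the symmetry being exploited. The signal and kernel are illustrative.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Convolution equivariant under rotations of the discretized circle."""
    n, k = len(signal), len(kernel)
    return np.array([sum(kernel[j] * signal[(i + j) % n] for j in range(k))
                     for i in range(n)])   # indices wrap around the circle

signal = np.array([0., 1., 0., 0., 0., 0., 0., 0.])   # feature on an 8-point circle
kernel = np.array([1., 2., 1.])
out = circular_conv(signal, kernel)
shifted = np.roll(signal, 3)                           # rotate the input
print(np.allclose(np.roll(out, 3), circular_conv(shifted, kernel)))   # True
```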
