Fundamentals of Deep Learning

Nikhil Buduma

Boston

Fundamentals of Deep LearningDesigning Next Generation Artificial Intelligence

Algorithms

978-1-491-92561-4

[FILL IN]

Fundamentals of Deep Learningby Nikhil Buduma

Copyright © 2015 Nikhil Buduma. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc. , 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions arealso available for most titles ( http://safaribooksonline.com ). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected] .

Editors: Mike Loukides and Shannon CuttProduction Editor: FILL IN PRODUCTION EDI‐TORCopyeditor: FILL IN COPYEDITOR

Proofreader: FILL IN PROOFREADERIndexer: FILL IN INDEXERInterior Designer: David FutatoCover Designer: Karen MontgomeryIllustrator: Rebecca Demarest

November 2015: First Edition

Revision History for the First Edition2015-06-12 First Early Release2015-07-23 Second Early Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491925614 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Fundamentals of Deep Learning, thecover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author(s) have used good faith efforts to ensure that the information andinstructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibil‐ity for errors or omissions, including without limitation responsibility for damages resulting from the useof or reliance on this work. Use of the information and instructions contained in this work is at your ownrisk. If any code samples or other technology this work contains or describes is subject to open sourcelicenses or the intellectual property rights of others, it is your responsibility to ensure that your usethereof complies with such licenses and/or rights.

http://safaribooksonline.com

http://oreilly.com/catalog/errata.csp?isbn=9781491925614

Table of Contents

Preface Title. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

1. The Neural Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Building Intelligent Machines 9The Limits of Traditional Computer Programs 10The Mechanics of Machine Learning 11The Neuron 15Expressing Linear Perceptrons as Neurons 17Feed-forward Neural Networks 18Linear Neurons and their Limitations 21Sigmoid, Tanh, and ReLU Neurons 21Softmax Output Layers 23Looking Forward 24

2. Training Feed-Forward Neural Networks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25The Cafeteria Problem 25Gradient Descent 27The Delta Rule and Learning Rates 29Gradient Descent with Sigmoidal Neurons 31The Backpropagation Algorithm 33Stochastic and Mini-Batch Gradient Descent 36Test Sets, Validation Sets, and Overfitting 38Preventing Overfitting in Deep Neural Networks 45Summary 49

3. Implementing Neural Networks in Theano. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51What is Theano and why are we using it? 51Installing Theano 52

v

Basic Algebra in Theano 53Theano Graph Structures 54Shared Variables and Side-Effects 56Randomness in Theano 57Computing Derivatives Symbolically 58Expressing a Logistic Regression Network in Theano 59Using Theano to Train a Logistic Regression Network 66Multilayer Models in Theano 75Summary 84

vi | Table of Contents

Preface Title

Congratulations on starting your new project! We’ve added some skeleton files foryou, to help you get started, but you can delete any or all of them, as you like. In thefile called chapter.html, we’ve added some placeholder content showing you how tomarkup and use some basic book elements, like notes, sidebars, and figures.

vii

CHAPTER 1

The Neural Network

Building Intelligent MachinesThe brain is the most incredible organ in the human body. It dictates the way we per‐ceive every sight, sound, smell, taste, and touch. It enables us to store memories,experience emotions, and even dream. Without it, we would be primitive organ‐isms, incapable of anything other than the simplest of reflexes. The brain is, inher‐ently, what makes us intelligent.

The infant brain only weighs a single pound, but somehow, it solves problems thateven our biggest, most powerful supercomputers find impossible. Within a matter ofdays after birth, infants can recognize the faces of their parents, discern discreteobjects from their backgrounds, and even tell apart voices. Within a year, they’vealready developed an intuition for natural physics, can track objects even when theybecome partially or completely blocked, and can associate sounds with specific mean‐ings. And by early childhood, they have a sophisticated understanding of grammarand thousands of words in their vocabularies.

For decades, we’ve dreamed of building intelligent machines with brains like ours -robotic assistants to clean our homes, cars that drive themselves, microscopes thatautomatically detect diseases. But building these artificially intelligent machinesrequires us to solve some of the most complex computational problems we have evergrappled with, problems that our brains can already solve in a manner of microsec‐onds. To tackle these problems, we’ll have to develop a radically different way of pro‐gramming a computer using techniques largely developed over the past decade. This

9

is an extremely active field of artificial computer intelligence often referred to as deeplearning.

The Limits of Traditional Computer ProgramsWhy exactly are certain problems so difficult for computers to solve? Well it turnsout, traditional computer programs are designed to be very good at two things: 1)performing arithmetic really fast and 2) explicitly following a list of instructions. So ifyou want to do some heavy financial number crunching, you’re in luck. Traditionalcomputer programs can do just the trick. But let’s say we want to do somethingslightly more interesting, like write a program to automatically read someone’s hand‐writing.

Figure 1-1. Image from MNIST handwritten digit dataset

Although every digit in Figure 1-1 is written in a slightly different way, we can easilyrecognize every digit in the first row as a zero, every digit in the second row as a one,etc. Let’s try to write a computer program to crack this task. What rules could we useto tell a one digit from another?

Well we can start simple! For example, we might state that we have a zero if our imageonly has a single closed loop. All the examples in Figure 1-1 seem to fit this bill, but

10 | Chapter 1: The Neural Network

this isn’t really a sufficient condition. What if someone does’t perfectly close the loopon their zero? And, as in Figure 1-2, how do you distinguish a messy zero from a six?

Figure 1-2. A zero that’s difficult to distinguish from a six algorithmically

You could potentially establish some sort of cutoff for the distance between the start‐ing point of the loop and the ending point, but it’s not exactly clear where we shouldbe drawing the line. But this dilemma is only the beginning of our worries. How dowe distinguish between threes and fives? Or between fours and nines? We can addmore and more rules, or features, through careful observation and months of trial anderror, but it’s quite clear that this isn’t going to be an easy process.

There many other classes of problems that fall into this same category: object recog‐nition, speech comprehension, automated translation, etc. We don’t know what pro‐gram to write because we don’t know how it’s done by our brains. And even if we didknow how to do it, the program might be horrendously complicated.

The Mechanics of Machine LearningTo tackle these classes of problems we’ll have to use a very different kind of approach.A lot of the things we learn in school growing up have a lot in common with tradi‐tional computer programs. We learn how to multiply numbers, solve equations, andtake derivatives by internalizing a set of instructions. But the things we learn at anextremely early age, the things we find most natural, are learned by example, not byformula.

For example, when we were two years old, our parents didn’t teach us how to recog‐nize a dog by measuring the shape of its nose or the contours of its body. We learned

The Mechanics of Machine Learning | 11

to recognize a dog by being shown multiple examples and being corrected when wemade the wrong guess. In other words, when we were born, our brains provided uswith a model that described how we would be able to see the world. As we grew up,that model would take in our sensory inputs and make a guess about what we’re expe‐riencing. If that guess was confirmed by our parents, our model would be reinforced.If our parents said we were wrong, we’d modify our model to incorporate this newinformation. Over our lifetime, our model becomes more and more accurate as weassimilate billions of examples. Obviously all of this happens subconsciously, withoutus even realizing it, but we can use this to our advantage nonetheless.

Deep learning is a subset of a more general field of artificial intelligencecalled machine learning, which is predicated on this idea of learning from example. Inmachine learning, instead of teaching a computer the a massive list rules to solve theproblem, we give it a model with which it can evaluate examples and a small set ofinstructions to modify the model when it makes a mistake. We expect that, overtime, a well-suited model would be able to solve the problem extremely accurately.

Let’s be a little bit more rigorous about what this means so we can formulate this ideamathematically. Let’s define our model to be a function h �, θ . The input � is anexample expressed in vector form. For example, if � were a greyscale image, the vec‐tor’s components would be pixel intensities at each position, as shown in Figure 1-3.

Figure 1-3. The process of vectorizing an image for a machine learning algorithm

The input θ is a vector of the parameters that our model uses. Our machine learningprogram tries to perfect the values of these parameters as it is exposed to more andmore examples. We’ll see this in action in more detail in a future section.

To develop a more intuitive understanding for machine learning models, let’s walkthrough a quick example. Let’s say we wanted to figure out wanted to determine how


to predict exam performance based on the number of hours of sleep we get and thenumber of hours we study the previous day. We collect a lot of data, and for each datapoint � = x1 x2

T, we record the number of hours of sleep we got (x1), the number ofhours we spent studying (x2), and whether we performed above average or below theclass average. Our goal, then, might be to learn a model h �, θ with parameter vectorθ = θ0 θ1 θ2

T such that:

h �, θ =

−1 if �T ·θ1

θ2+ θ0 < 0

1 if �T ·θ1

θ2+ θ0 ≥ 0

In other words, we guess that the blueprint for our model h �, θ is as describedabove (geometrically, this particular blueprint describes a linear classifier that dividesthe coordinate plane into two halves). Then, we want to learn a parameter vec‐tor θ such that our model makes the right predictions (-1 if we perform below aver‐age, and 1 otherwise) given an input example �. This is model is called a linearperceptron. Let’s assume our data is as shown in Figure 1-4.

Then it turns out, by selecting θ = −24 3 4 T, our machine learning model makesthe correct prediction on every data point:

h �, θ =−1 if 3x1 + 4x2 − 24 < 0

1 if 3x1 + 4x2 − 24 ≥ 0

An optimal parameter vector θ positions the classifier so that we make as many cor‐rect predictions as possible. In most cases, there are many (or even infinitelymany) possible choices for θ that are optimal. Fortunately for us, most of the timethese alternatives are so close to one another that the difference in their performanceis negligible. If this is not the case, we may want to collect more data to narrow ourchoice of θ.

The Mechanics of Machine Learning | 13

Figure 1-4. Sample data for our exam predictor algorithm and a potential classifier

This is pretty cool, but there are still some pretty significant questions that remain.First off, how do we even come up with an optimal value for the parameter vec‐tor θ in the first place? Solving this problem requires a technique commonly knownas optimization. An optimizer aims to maximize the performance of a machine learn‐ing model by iteratively tweaking its parameters until the error is minimized. We’llbegin to tackle this question of learning parameter vectors in more detail in the nextchapter, when we describe the process of gradient descent. In later chapters, we’ll tryto find ways to make this process even more efficient.

Second, it’s quite clear that this particular model (the linear perceptron model) isquite limited in the relationships it can learn. For example, the distributions of datashown in Figure 1-5 cannot be described well by a linear perceptron.


Figure 1-5. As our data takes on more complex forms, we need more complex models todescribe them

But these situations are only the tip of the iceberg. As we move onto much morecomplex problems such as object recognition and text analysis, our data not onlybecomes extremely high dimensional, but the relationships we want to capture alsobecome highly nonlinear. To accommodate this complexity, recent research inmachine learning has attempted to build models that highly resemble the structuresutilized by our brains. It’s essentially this body of research, commonly referred toas deep learning, that has had spectacular success in tackling problems in computervision and natural language processing. These algorithms not only far surpass otherkinds of machine learning algorithms, but also rival (or even exceed!) the accuraciesachieved by humans.

The NeuronThe foundational unit of the human brain is the neuron. A tiny piece of of the brain,about the size of grain of rice, contains over 10,000 neurons, each of which forms anaverage of 6,000 connections with other neurons. It’s this massive biological networkthat enables us to experience the world around us. Our goal in this section will be touse this natural structure to build machine learning models that solve problems in ananalogous way.

At its core, the neuron is optimized to receive information from other neurons, pro‐cess this information in a unique way, and send its result to other cells. This process issummarized in Figure 1-6. The neuron receives its inputs along antennae-like struc‐tures called dendrites. Each of these incoming connections is dynamically strength‐ened or weakened based on how often it is used (this is how we learn new concepts!),and it’s the strength of each connection that determines the contribution of the inputto the neuron’s output. After being weighted by the strength of their respective con‐

The Neuron | 15

nections, the inputs are summed together in the cell body. This sum is then trans‐formed into a new signal that’s propagated along the cell’s axon and sent off to otherneurons.

Figure 1-6. A functional description of a biological neuron’s structure

We can translate this functional understanding of the neurons in our brain into anartificial model that we can represent on our computer. Such a model is describedin Figure 1-7. Just as in biological neurons, our artificial neuron takes in some num‐ber of inputs, x1, x2, . . . , xn, each of which is multiplied by a specific weight,w1, w2, . . . , wn. These weighted inputs are, as before, summed together to producethe logit of the neuron, z = ∑i = 0

n wixi. In many cases, the logit also includes a bias,which is a constant (not shown in figure). The logit is then passed through a func‐tion f to produce the output y = f z . This output can be transmitted to other neu‐rons.

Figure 1-7. Schematic for a neuron in an artificial neural net


In Example 1-1, we show how a neuron might be implemented in Python. A fewquick notes on implementation. Throughout this book, we’ll be constantly using acouple of libraries to make our lives easier. One of these is NumPy, a fundamentallibrary for scientific computing. Among other things, NumPy will allow us to quicklymanipulate matrices and vectors with ease. In Example 1-1, NumPy enables us topainlessly take the dot product of two vectors (inputs and self.weights). Anotherlibrary that we will use further down the road is Theano. Theano integrates closelywith NumPy and allows us to define, optimize, and evaluate mathematical expres‐sions. These two libraries will serve as a foundation for tools we explore in futurechapters, so it’s worth taking some time to gain some familiarity with them.

Example 1-1. Neuron Implementation

import numpy as np

###################################################### Assume inputs and weights are 1-dimensional numpy ## arrays and bias is a number ######################################################

class Neuron: def __init__(self, weights, bias, function): self.weights = weights self.bias = bias self.function = function

def forward(self, inputs): logit = np.dot(inputs, self.weights) + self.bias output = self.function(logit) return output

Expressing Linear Perceptrons as NeuronsIn the previous section we talked about how using machine learning models to cap‐ture the relationship between success on exams and time spent studying and sleeping.To tackle this problem, we constructed a linear perceptron classifier that divided theCartesian coordinate plane into two halves:

h �, θ =−1 if 3x1 + 4x2 − 24 < 0

1 if 3x1 + 4x2 − 24 ≥ 0

Expressing Linear Perceptrons as Neurons | 17

As shown in Figure 1-4, this is an optimal choice for θ because it correctly classifiesevery sample in our dataset. Here, we show that our model h is easily using a neuron.Consider the neuron depicted in Figure 1-8. The neuron has two inputs, a bias, anduses the function:

f z =−1 if z < 01 if z ≥ 0

It’s very easy to show that our linear perceptron and the neuronal model are perfectlyequivalent. And in general, it’s quite simple to show singular neurons are strictlymore expressive than linear perceptrons. In other words, every linear perceptron canbe expressed as a single neuron, but single neurons can also express models that can‐not be expressed by any linear perceptron.

Figure 1-8. Expressing our exam performance perceptron as a neuron

Feed-forward Neural NetworksAlthough single neurons are more powerful than linear perceptrons, they’re notnearly expressive enough to solve complicated learning problems. There’s a reasonour brain is made of more than one neuron. For example, it is impossible for a singleneuron to differentiate hand-written digits. So to tackle much more complicatedtasks, we’ll have to take our machine learning model even further.

The neurons in the human brain are organized in layers. In fact the human cerebralcortex (the structure responsible for most of human intelligence) is made of six lay‐ers. Information flows from one layer to another until sensory input is converted into


conceptual understanding. For example, the bottom-most layer of the visual cortexreceives raw visual data from the eyes. This information is processed by each layerand passed onto the next until, in the sixth layer, we conclude whether we are lookingat a cat, or a soda can, or an airplane.

Figure 1-9. A simple example of a feed-forward neural network with 3 layers (input, onehidden, and output) and 3 neurons per layer

Borrowing these concepts, we can construct an artificial neural network. A neuralnetwork comes about when we start hooking up neurons to each other, the inputdata, and to the output nodes, which correspond to the network’s answer to a learn‐ing problem. Figure 1-9 demonstrates a simple example of an artificial neural net‐work. The bottom layer of the network pulls in the input data. The top layer ofneurons (output nodes) computes our final answer. The middle layer(s) of neuronsare called the hidden layers, and we let wi, j

k be the weight of the connection betweenthe ith neuron in the kth layer with the jth neuron in the k + 1st layer. These weightsconstitute our parameter vector, θ, and just as before, our ability to solve problemswith neural networks depends on finding the optimal values to plug into θ.

Feed-forward Neural Networks | 19

We note that in this example, connections only traverse from a lower layer to a higherlayer. There are no connections between neurons in the same layer, and there are noconnections that transmit data from a higher layer to a lower layer. These neural net‐works are called feed-forward networks, and we start by discussing these networksbecause they are the simplest to analyze. We present this analysis (specifically, theprocess of selecting the optimal values for the weights) in the next chapter. Morecomplicated connectivities will be addressed in later chapters.

In the final sections, we’ll discuss the major types of layers that are utilized in feed-forward neural networks. But before we proceed, here’s a couple of important notesto keep in mind:

1. As we mentioned above, the layers of neurons that lie sandwiched between thefirst layer of neurons (input layer) and the last layer of neurons (output layer), arecalled the hidden layers. This is where most of the magic is happening when theneural net tries to solve problems. Whereas (as in the handwritten digit example)we would previously have to spend a lot of time identifying useful features, thehidden layers automate this process for us. Often times, taking a look at the activ‐ities of hidden layers can tell you a lot about the features the network has auto‐matically learned to extract from the data.

2. Although, in this example, every layer has the same number of neurons, this isneither necessary nor recommended. More often than not, hidden layers oftenhave fewer neurons than the input layer to force the network to learn compressedrepresentations of the original input. For example, while our eyes obtain rawpixel values from our surroundings, our brain thinks in terms of edges and con‐tours. This is because the hidden layers of biological neurons in our brain forceus to come up with better representations for everything we perceive.

3. It is not required that every neuron has its output connected to the inputs of allneurons in the next layer. In fact, selecting which neurons to connect to whichother neurons in the next layer is an art that comes from experience. We’ll dis‐cuss this issue in more depth as we work through various examples of neural net‐works.

4. The inputs and outputs are vectorized representations. For example, you mightimagine a neural network where the inputs are the individual pixel RGB values inan image represented as a vector (refer to Figure 1-3). The last layer might have 2neurons which correspond to the answer to our problem: 1, 0 if the image con‐tains a dog, 0, 1 if the image contains a cat, 1, 1 if it contains both, and 0, 0 ifit contains neither.


Linear Neurons and their LimitationsMost neuron types are defined by the function f they apply to their logit z. Let’s firstconsider layers of neurons that use a linear function in the form of f z = az + b. Forexample, a neuron that attempts to estimate a cost of a meal in a cafeteria would use alinear neuron where a = 1 and b = 0. In other words, using f z = z and weightsequal to the price of each item, the linear neuron in Figure 1-10, would take in someordered triple of servings of burgers, fries, and sodas, and output the price of thecombination.

Figure 1-10. An example of a linear neuron

Linear neurons are easy to compute with, but they run into serious limitations. Infact, it can be shown that any feed-forward neural network consisting of only linearneurons can be expressed as a network with no hidden layers. This is problematicbecause as we discussed before, hidden layers are what enable us to learn importantfeatures from the input data. In other words, in order to learn complex relationships,we need to use neurons that employ some sort of nonlinearity.

Sigmoid, Tanh, and ReLU NeuronsThere are three major types of neurons that are used in practice that introduce nonli‐nearities in their computations. The first of these is the sigmoid neuron, which usesthe function:

Linear Neurons and their Limitations | 21

f z = 1

1 + e−z

Figure 1-11. The output of a sigmoid neuron as z varies

Intuitively, this means that when the logit is very small, the output of a logistic neu‐ron is very close to 0. When the logit is very large, the output of the logistic neuron isclose to 1. In between these two extremes, the neuron assumes an s-shape, as shownin Figure 1-11.

Figure 1-12. The output of a tanh neuron as z varies


Tanh neurons use a similar kind of s-shaped nonlinearity, but instead of ranging from0 to 1, the output of tanh neurons range from -1 to 1. As one would expect, theyuse f z = tanh z . The resulting relationship between the output y and the logit z isdescribed by Figure 1-12. When s-shaped nonlinearities are used, the tanh neuron isoften preferred over the sigmoid neuron because it is zero-centered.

A different kind of nonlinearity is used by the restricted linear unit (ReLU) neuron. Ituses the function f z = max 0, z , resulting in a characteristic hockey stick shapedresponse as shown in Figure 1-13.

Figure 1-13. The output of a ReLU neuron as z varies

The ReLU has recently become the neuron of choice for many tasks (especially incomputer vision) because of a number of reasons, despite some drawbacks. We’ll dis‐cuss these reasons in Chapter 5 as well as strategies to combat the potential pitfalls.

Softmax Output LayersOften times, we want our output vector to be a probability distribution over a set ofmutually exclusive labels. For example, let’s say we want to build a neural network torecognize handwritten digits from the MNIST data set. Each label (0 through 9) ismutually exclusive, but it’s unlikely that we will be able to recognize digits with 100%confidence. Using a probability distribution gives us a better idea of how confidentwe are in our predictions. As a result, the desired output vector is of the form below,where ∑i = 0

9 pi = 1:

Softmax Output Layers | 23

p0 p1 p2 p3 . . . p9

This is achieved by using a special output layer called a softmax layer. Unlike in otherkinds of layers, the output of a neuron in a softmax layer depends on the outputs ofall of the other neurons in its layer. This is because we require the sum of all the out‐puts to be equal to 1. Letting zi be the logit of the ith softmax neuron, we can achievethis normalization by setting its output to:

yi = ezi

∑ j ez j

A strong prediction would have a single entry in the vector close to 1 while theremaining entries were close to 0. A weak prediction would have multiple possiblelabels that are more or less equally likely.

Looking ForwardIn this chapter, we’ve built a basic intuition for machine learning and neural net‐works. We’ve talked about the basic structure of a neuron, how feed-forward neuralnetworks work, and the importance of nonlinearity in tackling complex learningproblems. In the next chapter we will begin to build the mathematical backgroundnecessary to train a neural network to solve problems. Specifically, we will talk aboutfinding optimal parameter vectors, best practices while training neural networks, andmajor challenges. In future chapters, we will take these foundational ideas to buildmore specialized neural architectures.


CHAPTER 2

Training Feed-Forward Neural Networks

The Cafeteria Problem We’re beginning to understand how we can tackle some interesting problems usingdeep learning, but one big question still remains - how exactly do we figure out whatthe parameter vectors (the weights for all of the connections in our neural network)should be? This is accomplished by a process commonly referred to as training. Dur‐ing training, we show the neural net a large number of training examples and itera‐tively modify the weights to minimize the errors we make on the training examples.After enough examples, we expect that our neural network will be quite effective atsolving the task it’s been trained to do.

Figure 2-1. This is the neuron we want to train for the Dining Hall Problem

25

Let’s continue with the example we mentioned in the previous chapter involving a lin‐ear neuron. As a brief review, every single day, we purchase a meal from the dininghall consisting of burgers, fries, and sodas. We buy some number of servings for eachitem. We want to be able to predict how much a meal is going to cost us, but the itemsdon’t have price tags. The only thing the cashier will tell us is the total price of themeal. We want to train a single linear neuron to solve this problem. How do we do it?

One idea is to be smart about picking our training cases. For one meal we could buyonly a single serving of burgers, for another we could only buy a single serving offries, and then for our last meal we could buy a single serving of soda. In general,choosing smart training cases is a very good idea. There’s lots of research that showsthat by engineering a clever training set, you can make your neural network a lotmore effective. The issue with this approach is that in real situations, it rarely evergets you close to the solution. For example, there’s no clear analog of this strategy inimage recognition. It’s just not a practical solution.

Instead, we try to motivate a solution that works well in general. Let’s say we have abunch of training examples. Then we can calculate what the neural network will out‐put on the ith training example using the simple formula in the diagram. We want totrain the neuron so that we pick the optimal weights possible - the weights that mini‐mize the errors we make on the training examples. In this case, let’s say we want tominimize the square error over all of the training examples that we encounter. Moreformally, if we know that t i is the true answer for the ith training example and y i isthe value computed by the neural network, we want to minimize the value of theerror function E:

E = 12 ∑i t i − y i 2

The squared error is zero when our model makes a perfectly correct prediction onevery training example. Moreover, the closer E is to 0, the better our model is. As aresult, our goal will be to select our parameter vector θ (the values for all the weightsin our model) such that E is as close to 0 as possible.

Now at this point you might be wondering why we need to bother ourselves witherror functions when we can treat this problem as a system of equations. After all, wehave a bunch of unknowns (weights) and we have a set of equations (one for each

26 | Chapter 2: Training Feed-Forward Neural Networks

training example). That would automatically give us an error of 0 assuming that wehave a consistent set of training example.

That’s a smart observation, but the insight unfortunately doesn’t generalize well.Remember that although we’re using a linear neuron here, linear neurons aren’t usedvery much in practice because they’re constrained in what they can learn. And themoment we start using nonlinear neurons like the sigmoidal, tanh, or ReLU neuronswe talked about at the end of the previous chapter, we can no longer set up a systemof equations! Clearly we need a better strategy to tackle the training process.

Gradient DescentLet’s visualize how we might minimize the squared error over all of the trainingexamples by simplifying the problem. Let’s say our linear neuron only has two inputs(and thus only two weights, w1 and w2). Then we can imagine a 3-dimensional spacewhere the horizontal dimensions correspond to the weights w1 and w2, and the verti‐cal dimension corresponds to the value of the error function E. In this space, pointsin the horizontal plane correspond to different settings of the weights, and the heightat those points corresponds to the incurred error. If we consider the errors we makeover all possible weights, we get a surface in this 3-dimensional space, in particular, aquadratic bowl as shown in Figure 2-2.

Gradient Descent | 27

Figure 2-2. The quadratic error surface for a linear neuron

We can also conveniently visualize this surface as a set of elliptical contours, wherethe minimum error is at the center of the ellipses. In this setup, we are working in a 2-dimensional plane where the dimensions correspond to the two weights. Contourscorrespond to settings of w1 and w2 that evaluate to the same value of E. The closerthe contours are to each other, the steeper the slope. In fact it turns out that the direc‐tion of the steepest descent is always perpendicular to the contours. This direction isexpressed as a vector known as the gradient.


Figure 2-3. Visualizing the error surface as a set of contours

Now we can develop a high-level strategy for how to find the values of the weightsthat minimizes the error function. Suppose we randomly initialize the weights of ournetwork so we find ourselves somewhere on the horizontal plane. By evaluating thegradient at our current position, we can find the direction of steepest descent and wecan take a step in that direction. Then we’ll find ourselves at a new position that’scloser to the minimum than we were before. We can re-evaluate the direction ofsteepest descent by taking the gradient at this new position and taking a step in thisnew direction. It’s easy to see that, as shown in Figure 2-3, following this strategy willeventually get us to the point of minimum error. This algorithm is known as gradientdescent, and we’ll use it to tackle the problem of training individual neurons and themore general challenge of training entire networks.

The Delta Rule and Learning RatesBefore we derive the exact algorithm for training our cafeteria neuron, we make aquick note on hyperparameters. In addition to the weight parameters defined in ourneural network, learning algorithms also require a couple of additional parameters tocarry out the training process. One of these so-called hyperparameters is the learningrate.

The Delta Rule and Learning Rates | 29

In practice at each step of moving perpendicular to the contour, we need to deter‐mine how far we want to walk before recalculating our new direction. This distanceneeds to depend on the steepness of the surface. Why? The closer we are to the mini‐mum, the shorter we want to step forward. We know we are close to the minimum,because the surface is a lot flatter, so we can use the steepness as an indicator of howclose we are to the minimum. However, if our error surface is rather mellow, trainingcan potentially take a large amount of time. As a result, we often multiply the gradientby a factor �, the learning rate. Picking the learning rate is a hard problem. As we justdiscussed, if we pick a learning rate that’s too small, we risk taking too long during thetraining process. But if we pick a learning rate that’s too big, we’ll mostly likely startdiverging away from the minimum. In the next chapter, we’ll learn about variousoptimization techniques that utilize adaptive learning rates to automate the process ofselecting learning rates.

Figure 2-4. Convergence is difficult when our learning rate is too large

Now, we are finally ready to derive the delta rule for training our linear neuron. Inorder to calculate how to change each weight, we evaluate the gradient, which isessentially the partial derivative of the error function with respect to each of theweights. In other words, we want:

Δwk = − � ∂E∂wk


= − � ∂∂wk

12 ∑i t i − y i 2

= ∑i� t i − y i ∂yi∂wk

= ∑i�xki t i − y i

Applying this method of changing the weights at every iteration, we are finally able toutilize gradient descent.

Gradient Descent with Sigmoidal NeuronsIn this section and the next, we will deal with training neurons and neural networksthat utilize nonlinearites. We use the sigmoidal neuron as a model, and leave the deri‐vations for other nonlinear neurons as an exercise for the reader. For simplicity, weassume that the neurons do not use a bias term, although our analysis easily extendsto this case. We merely need to assume that the bias is a weight on an incoming con‐nection whose input value is always one.

Let’s recall the mechanism by which logistic neurons compute their output valuefrom their inputs:

z = ∑k wkxk

y = 1

1 + e−z

The neuron computes the weighted sum of its inputs, the logit, z. It then feeds its logitinto the input function to compute y, its final output. Fortunately for us, these func‐tions have very nice derivatives, which makes learning easy! For learning, we want tocompute the gradient of the error function with respect to the weights. To do so, westart by taking the derivative of the logit with respect to the inputs and the weights.

∂z

∂wk= xk

∂z∂xk

= wk

Gradient Descent with Sigmoidal Neurons | 31

Also, quite surprisingly, the derivative of the output with respect to the logit is quitesimple if you express it in terms of the output.

dydz = e−z

1 + e−z 2

= 1

1 + e−ze−z

1 + e−z

= 1

1 + e−z 1 − 1

1 + e−z

= y 1 − y

We then use the chain rule to get the derivative of the output with respect to eachweight:

∂y

∂wk= dy

dz∂z

∂wk= xky 1 − y

Putting all of this together, we can now compute the derivative of the error functionwith respect to each weight:

∂E∂wk

= ∑i∂E

∂y i∂y i

∂wk= − ∑i xk

i y i 1 − y i t i − y i

Thus, the final rule for modifying the weights becomes:

Δwk = ∑i�xki y i 1 − y i t i − y i

As you may notice, the new modification rule is just like the delta rule, except withextra multiplicative terms included to account for the logistic component of the sig‐moidal neuron.


The Backpropagation AlgorithmNow we’re finally ready to tackle the problem of training multilayer neural networks(instead of just single neurons). So what’s the idea behind backpropagation? We don’tknow what the hidden units ought to be doing, but what we can do is compute howfast the error changes as we change a hidden activity. From there, we can figure outhow fast the error changes when we change the weight of an individual connec‐tion. Essentially we’ll be trying to find the path of steepest descent! The only catch isthat we’re going to be working in an extremely high dimensional space. We start bycalculating the error derivatives with respect to a single training example.

Each hidden unit can affect many output units. Thus, we’ll have to combine manyseparate effects on the error in an informative way. Our strategy will be one ofdynamic programming. Once we have the error derivatives for one layer of hiddenunits, we’ll use them to compute the error derivatives for the activites of the layerbelow. And once we find the error derivatives for the activities of the hidden units, it’squite easy to get the error derivatives for the weights leading into a hidden unit. We’llredefine some notation for ease of discussion and refer to the following diagram:

The Backpropagation Algorithm | 33

Figure 2-5. Reference diagram for the derivation of the backpropagation algorithm

The subscript we use will refer to the layer of the neuron. The symbol y will refer tothe activity of a neuron, as usual. Similarly the symbol z will refer to the logit of theneuron. We start by taking a look at the base case of the dynamic programming prob‐lem. Specifically, we calculate the error function derivatives at the output layer:

E = 12 ∑ j ∈ output t j − y j

2 ∂E∂yj

= − t j − y j


Now we tackle the inductive step. Let’s presume we have the error derivatives forlayer j. We now aim to calculate the error derivatives for the layer below it, layer i. Todo so, we must accumulate information about how the output of a neuron inlayer i affects the logits of every neuron in layer j. This can be done as follows, usingthe fact that the partial derivative of the logit with respect to the incoming outputdata from the layer beneath is merely the weight of the connection wi j:

∂E∂yi

= ∑ j∂E∂z j

dz jdyi

= ∑ j wi j∂E∂z j

Furthermore, we observe the following:

∂E∂z j

= ∂E∂yj

dyjdz j

= y j 1 − y j∂E∂yj

Combining these two together, we can finally express the error derivatives of layer i interms of the error derivatives of layer j:

∂E∂yi

= ∑ j wi jy j 1 − y j∂E∂yj

Then once we’ve gone through the whole dynamic programming routine, having fil‐led up the table appropriately with all of our partial derivatives (of the error functionwith respect to the hidden unit activities), we can then determine how the errorchanges with respect to the weights. This gives us how to modify the weights aftereach training example:

∂E∂wij

=∂z j

∂wij

∂E∂z j

= yiy j 1 − y j∂E∂yj

Finally to complete the algorithm, we, just as before, merely sum up the partial deriv‐atives over all the training examples in our dataset. This gives us the following modi‐fication formula:

The Backpropagation Algorithm | 35

Δwi j = − ∑k ∈ dataset�yik y j

k 1 − y jk ∂E k

∂yjk

This completes our description of the backpropagation algorithm!

Stochastic and Mini-Batch Gradient DescentIn the algorithms we’ve described above, we’ve been using a version of gradientdescent known as batch gradient descent. The idea behind batch gradient descent isthat we use our entire dataset to compute the error surface and then follow the gradi‐ent to take the path of steepest descent. For a simple quadratic error surface, thisworks quite well. But in most cases, our error surface may be a lot more complicated.Let’s consider the scenario in Figure 2-6 for illustration.

Figure 2-6. Batch gradient descent is sensitive to local minima

We only have a single weight, and we use random initialization and batch gradientdescent to find its optimal setting. The error surface, however, has a spurious localminimum, and if we get unlucky, we might get stuck in a non-optimal minima.

Another potential approach is stochastic gradient descent, where at each iteration, ourerror surface is estimated only with respect to a single example. This approach isillustrated by Figure 2-7, where instead of a single static error surface, our error sur‐


face is dynamic. As a result, descending on this stochastic surface significantlyimproves our ability to avoid local minima.

Figure 2-7. The stochastic error surface fluctuates with respect to the batch error surface,enabling local-minima avoidance

The major pitfall of stochastic gradient descent, however, is that looking at the errorincurred one example at a time may not be a good enough approximation of the errorsurface. This, in turn, could potentially make gradient descent take a significantamount of time. One way to combat this problem is using mini-batch gradientdescent. In mini-batch gradient descent, at every iteration, we compute the error sur‐face with respect to some subset of the total dataset (instead of just a single example).This subset is called a mini-batch, and in addition to the learning rate, mini-batch sizeis another hyperparameter. Mini-batches strike a balance between the efficiency ofbatch gradient descent and the local-minima avoidance afforded by stochastic gradi‐ent descent. In the context of backpropagation, our weight update step becomes:

Δwi j = − ∑k ∈ mini − batch�yik y j

k 1 − y jk ∂E k

∂yjk

This is identical to what we derived in the previous section, but instead of summingover all the examples in the dataset, we sum over the examples in the current mini-batch.

Stochastic and Mini-Batch Gradient Descent | 37

Test Sets, Validation Sets, and OverfittingOne of the major issues with artificial neural networks is that the models are quitecomplicated. For example, let’s consider a neural network that’s pulling data from animage from the MNIST database (28 by 28 pixels), feeds into two hidden layers with30 neurons, and finally reaches a soft-max layer of 10 neurons. The total number ofparameters in the network is nearly 25,000. This can be quite problematic, and tounderstand why, let’s take a look at the example data in Figure 2-8.

Figure 2-8. Two potential models that might describe our dataset - a linear model vs. adegree 12 polynomial

Using the data, we train two different models - a linear model and a degree 12 poly‐nomial. Which curve should we trust? The line which gets almost no training exam‐ple correctly? Or the complicated curve that hits every single point in the dataset? Atthis point we might trust the linear fit because it seems much less contrived. But justto be sure, let’s add more data to our dataset! The result is shown in Figure 2-9.


Figure 2-9. Evaluating our model on new data indicates that the linear fit is a much bet‐ter model than the degree 12 polynomial

Now the verdict is clear, the linear model is not only subjectively better, but now alsoquantitatively performs better as well (measured using the squared error metric). Butthis leads to a very interesting point about training and evaluating machine learningmodels. By building a very complex model, it’s quite easy to perfectly fit our dataset.But when we evaluate such a complex model on new data, it performs very poorly. Inother words, the model does not generalize well. This is a phenomenon called overfit‐ting, and it is one of the biggest challenges that a machine learning engineer mustcombat. This becomes an even more significant issue in deep learning, where ourneural networks have large numbers of layers containing many neurons. The numberof connections in these models is astronomical, reaching the millions. As a result,overfitting is commonplace.

Let’s see how this looks in the context of a neural network. Let’s say we have a neuralnetwork with two inputs, a soft-max output of size two, and a hidden layer with 3, 6,or 20 neurons. We train these networks using mini-batch gradient descent (batch size10), and the results, visualized using the ConvnetJS library, are show in Figure 2-10.

Test Sets, Validation Sets, and Overfitting | 39

Figure 2-10. A visualization of neural networks with 3, 6, and 20 neurons (in that order)in their hidden layer.

It’s already quite apparent from these images that as the number of connections inour network increases, so does our propensity to overfit to the data. We can similarlysee the phenomenon of overfitting as we make our neural networks deep. Theseresults are shown in Figure 2-11, where we use networks that have 1, 2, or 4 hiddenlayers of 3 neurons each.

Figure 2-11. A visualization of neural networks with 1, 2, and 4 hidden layers (in thatorder) of 3 neurons each.

This leads to three major observations. First, the machine learning engineer is alwaysworking with a direct trade-off between overfitting and model complexity. If themodel isn’t complex enough, it may not be powerful enough to capture all of the use‐ful information necessary to solve a problem. However, if our model is very complex(especially if we have a limited amount of data at our disposal), we run the risk ofoverfitting. Deep learning takes the approach of solving very complex problems withcomplex models and taking additional countermeasures to prevent overfitting. We’llsee a lot of these measures in this chapter as well as in later chapters.

Second, it is very misleading to evaluate a model using the data we used to train it.Using the example in Figure 2-8, this would falsely suggest that the degree 12 polyno‐


mial model is preferable to a linear fit. As a result, we almost never train our modelon the entire dataset. Instead, as shown in Figure 2-12 we split up our data intoa training set and a test set.

Figure 2-12. We often split our data into non-overlapping training and test sets in orderto fairly evaluate our model

This enables us to make a fair evaluation of our model by directly measuring howwell it generalizes on new data it has not yet seen. In the real world, large datasets arehard to come by, so it might seem like a waste to not use all of the data at our disposalduring the training process. As a result, it may be very tempting to reuse training datafor testing or cut corners while compiling test data. Be forewarned. If the test set isn’twell constructed, we won’t be able draw any meaningful conclusions about ourmodel.

Third, it’s quite likely that while we’re training our data, there’s a point in time whereinstead of learning useful features, we start overfitting to the training set. As a result,we want to be able to stop the training process as soon as we start overfitting to pre‐vent poor generalization. To do this, we divide our training process into epochs. Anepoch is a single iteration over the entire training set. In other words, if we have atraining set of size d and we are doing mini-batch gradient descent with batch size b,then an epoch would be equivalent to d

b model updates. At the end of each epoch, wewant to measure how well our model is generalizing. To do this, we use an addi‐tional validation set, which is shown in Figure 2-13. At the end of an epoch, the vali‐dation set will tell us how the model does on data it has yet to see. If the accuracy onthe training set continues to increase while the accuracy on the validation set stays thesame (or decreases), it’s a good sign that it’s time to stop training because we’re over‐fitting.


Figure 2-13. In deep learning we often include a validation set to prevent overfitting dur‐ing the training process.

With this in mind, before we jump into describing the various ways to directly com‐bat overfitting, let’s outline the workflow we use when building and training deeplearning models. The workflow is described in detail in Figure 2-14. It is a tad intri‐cate, but it’s critical to understand the pipeline in order to ensure that we’re properlytraining our neural networks.

First we define our problem rigorously. This involves determining our inputs, thepotential outputs, and the vectorized representations of both. For instance, let’s sayour goal was to train a deep learning model to identify cancer. Our input would be anRBG image, which can be represented as a vector of pixel values. Our output wouldbe a probability distribution over three mutually exclusive possibilities: 1) normal, 2)benign tumor (a cancer that has yet to metastasize), or 3) malignant tumor (a cancerthat has already metastasized to other organs).

After we build define our problem, we need to build a neural network architecture tosolve it. Our input layer would have to be of appropriate size to accept the raw datafrom the image, and our output layer would have to be a softmax of size 3. We willalso have to define the internal architecture of the network (number of hidden layers,the connectivities, etc.). We’ll further discuss the architecture of image recognitionmodels when we talk about convolutional neural networks in chapter 4. At thispoint, we also want to collect a significant amount of data for training or model. Thisdata would probably be in the form of uniformly sized pathological images that have


been labeled by a medical expert. We shuffle and divide this data up into separatetraining, validation, and test sets.


Figure 2-14. Detailed workflow for training and evaluating a deep learning model


Finally, we’re ready to begin gradient descent. We train the model on our training setfor an epoch at a time. At the end of each epoch, we ensure that our error on thetraining set and validation set is decreasing. When one of these stops to improve, weterminate and make sure we’re happy with the model’s performance on the test data.If we’re unsatisfied, we need to rethink our architecture. If our training set error stop‐ped improving, we probably need to do a better job of capturing the important fea‐tures in our data. If our validation set error stopped improving, we probably need totake measures to prevent overfitting.

If, however, we are happy with the performance of our model on the training data,then we can measure its performance on the test data, which the model has neverseen before this point. If it is unsatisfactory, that means that we need more data in ourdataset because the test set seems to consist of example types that weren’t well repre‐sented in the training set. Otherwise, we are finished!

Preventing Overfitting in Deep Neural NetworksThere are several techniques that have been proposed to prevent overfitting duringthe training process. In this section, we’ll discuss these techniques in detail.

One method of combatting overfitting is called regularization. Regularization modi‐fies the objective function that we minimize by adding additional terms that penalizelarge weights. In other words, we change the objective function so that itbecomes Error + λ f θ , where f θ grows larger as the components of θ grow largerand λ is the regularization strength (another hyperparameter). The value we choosefor λ determines how much we want to protect against overfitting. A λ = 0 impliesthat we do not take any measures against the possibility of overfitting. If λ is too large,then our model will prioritize keeping θ as small as possible over trying to find theparameter values that perform well on our training set. As a result, choosing λ is avery important task and can require some trial and error.

The most common type of regularization is L2 regularization. It can be implementedby augmenting the error function with the squared magnitude of all weights in theneural network. In other words, for every weight w in the neural network, weadd 1

2 λw2 to the error function. The L2 regularization has the intuitive interpretationof heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Thishas the appealing property of encouraging the network to use all of its inputs a littlerather than using only some of its inputs a lot. Of particular note is that during thegradient descent update, sing the L2 regularization ultimately means that every

Preventing Overfitting in Deep Neural Networks | 45

weight is decayed linearly to zero. Expressed succinctly in NumPy, this is equivalentto the line: W += -lambda * W. Because of this phenomenon, L2 regularization is alsocommonly referred to as weight decay.

We can visualize the effects of L2 regularization using ConvnetJs. Similar to above, weuse a neural network with two inputs, a soft-max output of size two, and a hiddenlayer with 20 neurons. We train the networks using mini-batch gradient descent(batch size 10) and regularization strengths of 0.01, 0.1, and 1. The results can be seenin Figure 2-15.

Figure 2-15. A visualization of neural networks trained with regularization strengths of0.01, 0.1, and 1 (in that order).

Another common type of regularization is L1 regularization. Here, we add theterm λ w for every weight w in the neural network. The L1 regularization has theintriguing property that it leads the weight vectors to become sparse during optimiza‐tion (i.e. very close to exactly zero). In other words, neurons with L1 regularizationend up using only a small subset of their most important inputs and become quiteresistant to noise in the inputs. In comparison, weight vectors from L2 regularizationare usually diffuse, small numbers. L1 regularization is very useful when you want tounderstand exactly which features are contributing to a decision. If this level of fea‐ture analysis isn’t necessary, we prefer to use L2 regularization because it empiricallyperforms better.

Max norm constraints have a similar goal of attempting to restrict θ from becomingtoo large, but they do this more directly. Max norm constraints enforce an absoluteupper bound on the magnitude of the incoming weight vector for every neuron anduse projected gradient descent to enforce the constraint. In other words, anytime agradient descent step moved the incoming weight vector such that w 2 > c, weproject the vector back onto the ball (centered at the origin) with radius c. Typicalvalues of c are 3 and 4. One of the nice properties is that the parameter vector cannot


grow out of control (even if the learning rates are too high) because the updates to theweights are always bounded.

Dropout is a very different kind of method for preventing overfitting that can often beused in lieu of other techniques. While training, dropout is implemented by onlykeeping a neuron active with some probability p (a hyperparameter), or setting it tozero otherwise. Intuitively, this forces the network to be accurate even in the absenceof certain information. It prevents the network from becoming too dependent on anyone (or any small combination) of neurons. Expressed more mathematically, it pre‐vents overfitting by providing a way of approximately combining exponentially manydifferent neural network architectures efficiently. The process of dropout is expressedpictorially in Figure 2-16.

Figure 2-16. Dropout sets each neuron in the network as inactive with some randomprobability during each mini-batch of training.

Dropout is pretty intuitive to understand, but there are some important intricacies toconsider. We illustrate these considerations through Python code. Let’s assume we areworking with a 3-layer ReLU neural network.

Example 2-1. Naïve Dropout Implementation

import numpy as np

# Let p = probability of keeping a hidden unit active # A larger p means less dropout (p =1 --> no dropout)

Preventing Overfitting in Deep Neural Networks | 47

network.p = 0.5

def train_step(network, X): # forward pass for a 3-layer neural network

Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1) # first dropout mask Dropout1 = (np.random.rand(*Layer1.shape) < network.p # first drop! Layer1 *= Dropout1

Layer2 = np.maximum(0, np.dot(network.W2, Layer1) + network.b2) # second dropout mask Dropout2 = (np.random.rand(*Layer2.shape) < network.p # second drop! Layer2 *= Dropout2

Output = np.dot(network.W3, Layer2) + network.b3

# backward pass: compute gradients... (not shown) # perform parameter update... (not shown) def predict(network, X): # NOTE: we scale the activations Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1) * network.p Layer2 = np.maximum(0, np.dot(network.W2, Layer) + network.b2) * network.p Output = np.dot(network.W3, Layer2) + network.b3 return Output

One of the things we’ll realize is that during test-time (the predict function), wemultiply the output of each layer by network.p. Why do we do this? Well, we we’d likethe outputs of neurons during test-time to be equivalent to their expected outputs attraining time. For example, if p = 0 . 5, neurons must halve their outputs at test timein order to have the same (expected) output they would have during training. This iseasy to see because a neuron’s output is set to 0 with probability 1 − p. This meansthat if a neuron’s output prior to dropout was x, then after dropout, the expected out‐put would be � output = px + 1 − p · 0 = px.

The naïve implementation of dropout is undesirable because it requires scaling ofneuron outputs at test-time. Test-time performance is extremely critical to modelevaluation, so it’s always preferable to use inverted dropout, where the scaling occursat training time instead of at test time. This has the additional appealing property thatthe predict code can remain the same whether or not dropout is used. In otherwords, only the train_step code would have to be modified.


Example 2-2. Inverted Dropout Implementation

import numpy as np

# Let network.p = probability of keeping a hidden unit active # A larger network.p means less dropout (network.p == 1 --> no dropout)

network.p = 0.5

def train_step(network, X): # forward pass for a 3-layer neural network

Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1) # first dropout mask, note that we divide by p Dropout1 = ((np.random.rand(*Layer1.shape) < network.p) / network.p # first drop! Layer1 *= Dropout1

Layer2 = np.maximum(0, np.dot(network.W2, Layer1) + network.b2) # second dropout mask, note that we divide by p Dropout2 = ((np.random.rand(*Layer2.shape) < network.p) / network.p # second drop! Layer2 *= Dropout2

Output = np.dot(network.W3, Layer2) + network.b3

# backward pass: compute gradients... (not shown) # perform parameter update... (not shown) def predict(network, X): Layer1 = np.maximum(0, np.dot(network.W1, X) + network.b1) Layer2 = np.maximum(0, np.dot(network.W2, Layer) + network.b2) Output = np.dot(network.W3, Layer2) + network.b3 return Output

SummaryIn this chapter, we’ve learned all of the basics involved in training feedforward neuralnetworks. We’ve talked about gradient descent, the backpropagation algorithm, aswell as various methods we can use to prevent overfitting. In the next chapter, we’llput these lessons into practice when we use the Theano library to efficiently imple‐ment our first neural networks. Then in Chapter 4, we’ll return to the problem ofoptimizing objective functions for training neural networks and design algorithms tosignificantly improve performance. These improvements will enable us to processmuch more data, which means we’ll be able to build more comprehensive models.

Summary | 49

CHAPTER 3

Implementing Neural Networks in Theano

What is Theano and why are we using it?One of the most attractive aspects of the Python programming language is that it’svery easy to use. There’s very little boilerplate, it’s expressive, and it’s great for rapidprototyping. That’s why Python is one of the most popular languages amongresearchers and why we use Python in this book. The biggest drawback, however, isits performance. Python is slow, and that’s a problem for the deep learning engineer.Deep learning models are very computationally intensive and use a lot of data, so ifwe write inefficient code, our model could take weeks or even months to train com‐pletely.

In order to write more efficient Python programs, developers have created a numberof libraries to optimize numerical computations. These libraries includeNumPy, numexpr, and Cython. Although these libraries are effective at solving theproblems they’re built for, they often turn our elegant program into a gargantuanmess. And for the uninitiated, it’s just as likely that poor integrations with these libra‐ries will just make things worse.

Theano was built as a solution to this core problem of achieving simplicity and per‐formance simultaneously. Theano lets us to define, optimize, and evaluate mathemat‐ical expressions, especially ones with multi-dimensional arrays. It often achievesspeeds comparable to hand-crafted C programs for problems involving largeamounts of data, such as training deep neural networks. It can also far surpass tradi‐tional C programs by many orders of magnitude by utilizing GPU resources. Theano

51

manages data flow and computation under the hood so that we don’t have to worryabout it when writing code.

Theano allows us to define variables symbolically. We’ll talk about what this entails inthe next section, but intuitively, it means that using Theano feels a lot like being inalgebra class. Theano can also perform symbolic differentiation, which means thatonce we define our cost functions, Theano can compute the appropriate gradients onits own. This makes our life as a deep learning engineer much simpler.

Installing TheanoWe’ll walk through the basics of installing Theano in this section. If you run intotrouble or would like to enable bleeding-edge features, you should check out thedetailed installation instructions online at http://deeplearning.net/software/theano/install.html.

First, we have to make sure that we have certain pre-requisites already installed onour machine to successfully install Theano:

• Python >= 2.6• g++, python-dev• NumPy >= 1.6.2• SciPy >= 0.11• A BLAS installation (with Level 3 functionality and the development headers)• nose (to run Theano’s test suite)• pydot (to make pictures of Theano computation graphs)

We’ll also need a couple of additional libraries if we would like to use the GPU withour installation:

• NVIDIA CUDA drivers and SDK• libgpuarray

Once all of the pre-requisites are in order, installing Theano is as simple as runningthe following command from the terminal:

$ pip install Theano

Then in the Python or iPython interpreter, we should be able to run the Theano testsuite:

>>> import theano>>> theano.test()

52 | Chapter 3: Implementing Neural Networks in Theano

http://deeplearning.net/software/theano/install.html

http://deeplearning.net/software/theano/install.html

http://www.python.org/

http://numpy.scipy.org/

http://scipy.org/

http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

http://somethingaboutorange.com/mrl/projects/nose/

https://code.google.com/p/pydot/

http://developer.nvidia.com/object/gpucomputing.html

http://deeplearning.net/software/libgpuarray/installation.html

In order to set up Theano to use a GPU (this is optional), we need to make a coupleof modifications to the .theanorc file. First, we must add a [cuda] section as followswith the path to the NVIDIA CUDA root directory:

[cuda]root = /path/to/cuda/root

Then, we need to change the device option in the [global] section to name the GPUdevice on our computer and also set the default floating point computations tofloat32:

[global]device = gpufloatX = float32

If our computer has multiple GPUs, device = gpu selects one of the GPUs (usuallygpu0). If you want to choose one of the GPUs specifically, you can instead spec‐ify device=gpuX, with X the the corresponding GPU index (0, 1, 2, ...). To ensure thatthe GPU is working as expected, we follow the instructions at http://deeplearn‐ing.net/software/theano/tutorial/using_gpu.html.

If everything goes smoothly, we’re now ready to get started using Theano! In the restof this chapter, we’ll be understanding how to express operations in Theano, howTheano optimizes these operations under the hood, and then we’ll start implement‐ing some of our neural network models in Theano.

Basic Algebra in TheanoLet’s start off by trying something very simple. Let’s build a Theano module that willadd two numbers for us. Here’s some code that will do this for us:

>>> import theano.tensor as T>>> from theano import function>>> a = T.dscalar('a')>>> b = T.dscalar('b')>>> c = a + b>>> f = function(inputs=[a, b], outputs=c)

Let’s try to understand how this code works! First, let’s take a close look at the firsttwo lines after the theano imports:

a = T.dscalar('a')b = T.dscalar('b')

These lines create two symbols (variables) named a and b that represent the quantitiesthat we want to add. These symbols are of type dscalar, which is equivalent to a 64-bit

Basic Algebra in Theano | 53

http://deeplearning.net/software/theano/tutorial/using_gpu.html

http://deeplearning.net/software/theano/tutorial/using_gpu.html

float. Other common types that we will be using are dvector (a vector of 64-bitfloats), dmatrix (a matrix of 64-bit floats), and dtensor3 (a 3-dimensional tensor, e.g.representing a color image, of 64-bit floats).

The next line defines a new symbol named c in terms of a and b:

c = a + b

We can confirm that Theano has the correct representation for c by using the pretty-print out functionality:

>>> from theano import pp>>> print pp(c)'(a + b)'

Finally, we use these definitions for a, b, and c to build a function:

f = function(inputs=[a, b], outputs=c)

The inputs argument to function is a variable or list of variables that represents thevalues we feed into the function. The outputs argument is a variable or list of vari‐ables that represents what is returned. When this command is executed, Theano ana‐lyzes the relationships between all of the symbols involved in the function andcompiles some optimized code to perform the calculation. Once it’s finished, we canrefer to f like a normal Python function:

>>> f(7, 3)array(10.0)

In the next section we’ll talk some more about the optimizations that Theano auto‐matically finishes for us under the hood by discussing Theano’s symbolic graph struc‐tures.

Theano Graph StructuresTo illustrate the optimizations that Theano performs under the hood, we need tounderstand Theano’s graph structures. As we define new symbols and express themin terms of other symbols via various operations, Theano generates a graph to keeptrack of their relationships. For example, let’s take a look at what happens when wedefine a symbol to be the sum of the cubes of two other symbols (specifically, twovectors):

>>> import theano.tensor as T>>> from theano import function>>> from theano.printing import pydotprint>>> a = T.dvector('a')


>>> b = T.dvector('b')>>> c = a**3 + b**3>>> f = theano.function(inputs=[a,b], outputs=c)>>> f([1, 2, 3], [2, 3, 4])array([ 9., 35., 91.])

To visualize the graph structure generated for c, we can use the following command:

pydotprint(c, outfile="unopt.png", var_with_name_simple=True)

The resulting graph in Figure 3-1 explains how c is generated from inputs a and b.Let’s try to understand what this graph means. First, to raise each element of a andeach element of b to the power of 3, the scalar value must be “broadcasted” in orderto match the dimensions of the vectors (this is a NumPy operation). This is achievedby the DimShuffle operations in the graph. Then the outputs of each DimShuffle arecombined with a and b through an element-wise pow operation. Finally, the output ofeach pow is fed into an element-wise add to produce c (the node colored blue).

Figure 3-1.Theano’s unoptimized graph structure for c in terms of a and b

Now to see the power of Theano’s under the hood optimizations, we can similarlyprint out the graph structure for the compiled function f:

pydotprint(f, outfile="opt.png", var_with_name_simple=True)

The result is shown in Figure 3-2. All of the various operations from the unoptimizedgraph are compressed into a single element operation. We also notice that the powoperation is nowhere to be found. Instead, the more efficient square (sqr) operationis used. With this reconfigured graph structure, Theano can now generate extremely

Theano Graph Structures | 55

efficient code to compute f instead of having to rely one Python’s inefficient datamanagement and computation.

Figure 3-2. Theano’s optimized graph structure for c in terms of a and b

With this understanding of how Theano works, we’ll be better equipped to debug andprofile Theano programs (e.g. identify bottlenecks, missing optimizations, etc.). Inthe following sections, we’ll learn more about more specialized Theano variables.

Shared Variables and Side-EffectsSo far, we’ve talked about functions that are “stateless.” In other words, they hold nomemory of previous operations. In many cases, however, we require “stateful” com‐putation. For example, if we’re training a machine learning model, we’d like everyiteration of training to modify the parameter vector, and we’d like these modifica‐tions, or side-effects, to be remembered. Theano represents internal function statethrough a special data type knowns as a shared variable.

To illustrate this concept, let’s imagine we’re using Theano to provide some computa‐tionally intensive service to a customer. Say, for example, that the server has computesa linear perceptron for sentiment classification, and the customer would like to knowwhether his feature vector has positive or negative sentiment. In this situation, wewant to not only produce the result for the customer, but we also want to keep trackof the number of times we provided the customer the service (so we know how muchto charge them). Theano allows us to do this as follows:

>>> import theano.tensor as T >>> from theano import function>>> from theano import shared>>> state = shared(0)>>> query = T.dvector('query')>>> W = T.dvector('W') # model parameter vector


>>> result = T.dot(W, query) > 0>>> sentiment = function(inputs=[query], ..: outputs=result, ..: updates=[(state, state + 1)],..: givens={..: W : np.array([1, -2, 3, -0.5, 1])..: })

The output of the sentiment function should be zero if the sentiment is negative andone if the sentiment is positive. We pass the parameter vector W as one of the givensfor our Theano function because its expression is fixed with respect to the inputs (it’snot some arbitrary user-chosen value). This allows Theano to include the expressionin the compiled graph structure, enabling additional optimization. We’ll use this trickto efficiently pass our training set to the Theano function responsible for training ourmodel. Moreover, we see that the state increments by 1 every time the function iscalled. This is achieved by passing a set of side-effects to our function’s updates.Specifically, our update instructions the function to replace state with the valuestate + 1. The action of the sentiment function can be seen in an interactivePython shell:

>>> state.get_value()array(0)>>> sentiment([1, 6, 0, 9, 0])array(0, dtype=int8)>>> state.get_value()array(1)>>> sentiment([1, -6, 0, -9, 0])array(1, dtype=int8)>>> state.get_value()array(2)

We note that we can also share the state variable with other functions if we wantother functions to update its value. We can also choose to reset the value of the vari‐able by calling state.set_value(0). We’ll use the shared variable feature of Theanoextensively as we build our own machine learning models in this book.

Randomness in TheanoBecause Theano requires us to define everything symbolically a priori and then com‐piles these expressions to produce a function, generating pseudorandom numbers is alittle bit more involved than it is in NumPy, but it isn’t too difficult. The way we canthink about introducing randomness into a Theano graph structure is by declaring arandom variable. Theano will allocate a NumPy RandomStream object (a randomnumber generator) for each random variable, and utilize it as necessary. We call thisTheano construct (and the sequence of pseudorandom numbers it generates) a ran‐dom steam. At their core, random streams are shared variables, so they behave simi‐larly to the state variable from the previous section.

Randomness in Theano | 57

Let’s start off by looking at a simple example. We instantiated shared random numbergenerator shared_rng with an arbitrary seed value. We can then simulate rolling afair die as follows:

>>> import theano.tensor as T>>> from theano.tensor.shared_randomstreams import RandomStreams >>> from theano import function>>> shared_rng = RandomStreams(seed=123)>>> die = T.floor(shared_rng.uniform() * 6) + 1>>> roll = function(inputs=[], outputs=die)>>> roll()array(1.0)>>> roll()array(5.0)>>> roll()array(3.0)>>> roll()array(2.0)

In this case, die is a new random variable generated from a uniform distributionextracted from shared_rng. Every time we call roll(), we generate a new numberbetween 1 and 6 inclusive. But let’s suppose we want our function to select a randomnumber between 1 and 6 inclusive and fix it, instead of invoking shared_rng everytime it is called. We can do that by passing the special flag no_default_updates toour function at compile time. As a quick note, we must be sure to instantiate a newrandom variable for roll_once() in order to make sure that the roll() androll_once() functions do not interfere with each other.

>>> fixed_die = T.floor(shared_rng.uniform() * 6) + 1>>> roll_once = function(inputs=[], outputs=fixed_die, no_default_updates=True)>>> roll_once()array(5.0)>>> roll_once() # same value as beforearray(5.0)>>> roll() # test potential interferencearray(6.0)>>> roll_once() # interference test passes (should call multiple times)array(5.0)

These examples may seem simplistic, but generating randomness is a critical part ofmany deep learning models and we will use random streams extensively when imple‐menting dropout layers in more complicated networks.

Computing Derivatives Symbolically The final feature of Theano that we’ll cover in this chapter is symbolic differentiation.This is one of the reasons why Theano is such a popular library among machine


learning researchers. We don’t have to manually write out the gradient updates forour objective functions. Theano does this automatically for us!

Let’s start off with a simple example. Let’s say that we have a func‐tion f � = x1, x2 = x1

2 + x22. Our goal is to compute the the gradient of this function:

∇ f = ∂∂x1

x12 + x2

2 , ∂∂x2

x12 + x2

2 = 2x1, 2x2

So if we wanted to compute ∇ f at the point � = 1, 2 , we can just plug it into theformula above to get 2, 4 . We can do the same computation elegantly in Theanowith the following code snippet:

>>> import theano.tensor as T>>> from theano import function>>> x = T.dvector('x')>>> sum_squares = T.sum(x ** 2)>>> gradient = T.grad(sum_squares, x)>>> grad_f = function(inputs=[x], outputs=gradient)>>> grad_f([1,2])array([ 2., 4.])

In general, given any scalar function f and any vector or matrix �, Theano can effi‐ciently compute ∂ f

∂� , even if f has multiple inputs.

Expressing a Logistic Regression Network in TheanoNow that we’ve developed all of the basic concepts of Theano, let’s build a simple neu‐ral network model to tackle the MNIST dataset. As you may recall, our goal is toidentify handwritten digits from 28 x 28 black and white images. The first networkthat we’ll build implements a machine learning model known as logistic regression.

On a high level, logistic regression is a method by which we can calculate the proba‐bility that an input belongs to one of the target classes. In our case, we’ll compute theprobability that a given input image is a 0, 1, ..., or 9. Our model uses amatrix W representing the weights of the connections in the network as well as a vec‐tor b corresponding to the biases to estimate whether a input x belongs toclass i using the softmax expression we talked about earlier:

Expressing a Logistic Regression Network in Theano | 59

P y = i x = so f tmaxi Wx + b = eWix + bi

∑ j eW jx + bj

Our goal is to learn the values for W and b that most effectively classify our inputs asaccurately as possible. Pictorially, we can express the logistic regression network asshown below in Figure 3-3 (bias connections not shown to reduce clutter).

Figure 3-3. Interpreting logistic regression as a primitive neural network

You’ll notice that the network interpretation for logistic regression is rather primitive.It doesn’t any hidden layers, meaning that it is limited in its ability to learn complexrelationships! We have a output softmax of size 10 because we have 10 possible out‐comes for each input. Moreover, we have an input layer of size 784, one input neuronfor every pixel in the image! As we’ll see, the model makes decent headway towardscorrectly classifying our dataset, but there’s lots of room for improvement. Over thecourse of the rest of this chapter and Chapter 5, we’ll try to significantly improve ouraccuracy. But first, let’s look at how we can implement the logistic network in Theanoso we can train it on our computer!

We begin by taking a look at how we represent, on a high level, the logistic network inTheano. The network receives some input, which it multiplies by the weights of theconnections (represented by W). This result, combined with the bias terms then goesinto a softmax (for which Theano has a built in function that we can use out of thebox). At the end of the softmax, we just look for the bin with the highest probability.


We can express this functional procedure succinctly in Theano with the followingcode.

def __init__(self, input, input_dim, output_dim): """ We first initialize the logistic network object with some important information.

PARAM input : theano.tensor.TensorType A symbolic variable that we'll use to represent one minibatch of our dataset

PARAM input_dim : int This will represent the number of input neurons in our model

PARAM ouptut_dim : int This will represent the number of neurons in the output layer (i.e. the number of possible classifications for the input) """ # We initialize the weight matrix W of size (input_dim, output_dim) self.W = theano.shared( value=np.zeros((input_dim, output_dim)), name='W', borrow=True )

# We initialize a bias vector for the neurons of the output layer self.b = theano.shared( value=np.zeros(output_dim), name='b', borrow='True' )

# Symbolic description of how to compute class membership probabilities self.output = T.nnet.softmax(T.dot(input, self.W) + self.b)

# Symbolic description of the final prediction self.predicted = T.argmax(self.output, axis=1)

Let’s talk about how this code works in more detail. The initialization takes in fourparameters. The first is a symbolic variable, denoted input, corresponding to amatrix containing all of the training/validation/test examples. Each row of the matrixcorresponds to an example in our dataset. In our case, we’ll be using a vectorizedform of the MNIST images. We also need to keep track of the dimension of the input,input_dim, and the number of possible classes in the output, output_dim, so weknow how big our network is. The connections are represented by the matrix self.W,which has input_dim rows and output_dim columns. Specifically, the component inthe ith row and the jth column of the weight matrix, i.e. self.W[i][j], corresponds tothe weight of the connection between the ith input neuron and the jth output neuron


in the softmax layer. The final two lines of code merely express the logistic networkmodel in Theano and select the digit with the highest probability. Note that formatrix inputs, the T.dot function computes the matrix product. Moreover, althoughthe output of T.dot is a matrix, we can still add the vector self.b to it because The‐ano broadcasts self.b to the right dimensions (stacks the vector up multiple times toenable point-wise addition of self.b with every row of T.dot(input, self.W)).Moreover, both the T.nnet.softmax and T.argmax (with axis=1) functions operateon each row of their inputs. Consequently, self.prediction is a vector correspond‐ing to the model’s prediction for each example.

Now we have that we’ve developed a symbolic expression for the model’s predictiongiven a dataset of input images, we’ll need to develop a measure of how well ourmodel performs! In other words, we’re going to develop a symbolic expression for theobjective function. In this example, our objective function is going to have two com‐ponents. The core component is a measure known as the negative log likelihood. Let’sdenote the prediction made by the machine learning model as y and the ith trainingexample and its label as x i and y i respectively. The the negative log likelihood canbe expressed as:

−∑i log P y = y i ∣ x i , W, b

In other words, we want to maximize the negated sum of the log probabilities thatour model assigns to the correct answer. The second component of the objectivefunction is the L2 regularization component we discussed in the previous chapter. Bydefault we set λ = 0, but we can tweak this parameter while training our model if wefind ourselves overfitting to the training data. The code to express this cost functionis shown below.

def logistic_network_cost(self, y, lambda_l2=0): """ Here we express the cost incurred by an example given the correct distribution

PARAM y : theano.tensor.TensorType These are the correct answers, and we compute the cost with respect to this ground truth (over the entire minibatch). This means that y is of size (minibatch_size, output_dim)

PARAM lambda : float This is the L2 regularization parameter that we use to penalize large values for components of W, thus discouraging potential overfitting """


# Calculate the log probabilities of the softmax output log_probabilities = T.log(self.output)

# We use these log probabilities to compute the negative log likelihood negative_log_likelihood = -T.mean(log_probabilities[T.arange(y.shape[0]), y]) # Compute the L2 regularization component of the cost function l2_regularization = lambda_l2 * (self.W ** 2).sum() # Return a symbolic description of the cost function return negative_log_likelihood + l2_regularization

Let’s take a moment to understand how this Theano snippet functions. The expres‐sion requires two parameters. The first parameter is the set of true labels, stored inthe symbolic Theano variable y. The optional regularization parameter islambda_l2 and is set to 0 by default. In order to grab the correct log probabilities, weuse some Python array magic to pull out the correct entry of the log_probabilities matrix with the expression log_probabilities[T.arange(y.shape[0]), y].We then compute the regularization component by taking the sum of the squares ofall the connections between the input and softmax layers. This is simply expressed inTheano as (self.W ** 2).sum(). We can then return the sum of these two compo‐nents as a symbolic description of the objective function we aim to optimize. Thisexpression will be critical for us as we train our logistic network model.

The final piece of the logistic regression network is a symbolic expression for evaluat‐ing how well our model performs on a dataset. Fortunately, this is the simplest part ofthe model. The code to accomplish this task is shown below.

def error_rate(self, y): """ Here we return the error rate of the model over a set of given labels (perhaps in a minibatch, in the validation set, or the test set)

PARAM y : theano.tensor.TensorType These are the correct answers, and we compute the cost with respect to this ground truth (over the entire minibatch). This means that y is of size (minibatch_size, output_dim) """

# Make sure y is of the correct dimension assert y.ndim == self.predicted.ndim

# Make sure that ys contains values of the correct data type (ints) assert y.dtype.startswith('int')

# Return the error rate on the data return T.mean(T.neq(self.predicted, y))


The only parameter is a symbolic variable y that holds the true labels for the dataset.After asserting that this input is of the correct input and holds valid labels (the labelvalues must be integers!), we can count up the number of times the model makes amistake. The error rate is expressed in Theano as T.mean(T.neq(self.predicted,y)). This completes our Theano representation of a logistic regression network! Thefull LogisticNetwork class is included here with extensive comments for reference.

"""We will use this class to represent a simple logistic regressionclassifier. We'll represent this in Theano as a neural network with no hidden layers. This is our first attempt at building a neural network model to solve interesting problems. Here, we'lluse this class to crack the MNIST handwritten digit dataset problem,but this class has been constructed so that it can be reappropriatedto any use!

References: - textbooks: "Pattern Recognition and Machine Learning", Christopher M. Bishop, section 4.3.2 - websites: http://deeplearning.net/tutorial, Lisa Lab"""

import numpy as npimport theano.tensor as T import theano

class LogisticNetwork(object): """ The logistic regression class is described by two parameters (which we will want to learn). The first is a weight matrix. We'll refer to this weight matrix as W. The second is a bias vector b. Refer to the text if you want to learn more about how this network works. Let's get started! """

def __init__(self, input, input_dim, output_dim): """ We first initialize the logistic network object with some important information.

PARAM input : theano.tensor.TensorType A symbolic variable that we'll use to represent one minibatch of our dataset

PARAM input_dim : int This will represent the number of input neurons in our model

PARAM ouptut_dim : int This will represent the number of neurons in the output layer (i.e. the number of possible classifications for the input) """


# We initialize the weight matrix W of size (input_dim, output_dim) self.W = theano.shared( value=np.zeros((input_dim, output_dim)), name='W', borrow=True )


# Symbolic description of how to compute class membership probabilities self.output = T.nnet.softmax(T.dot(input, self.W) + self.b)

# Symbolic description of the final prediction self.predicted = T.argmax(self.output, axis=1)

def logistic_network_cost(self, y, lambda_l2=0): """ Here we express the cost incurred by an example given the correct distribution

PARAM y : theano.tensor.TensorType These are the correct answers, and we compute the cost with respect to this ground truth (over the entire minibatch). This means that y is of size (minibatch_size, output_dim)

PARAM lambda : float This is the L2 regularization parameter that we use to penalize large values for components of W, thus discouraging potential overfitting """ # Calculate the log probabilities of the softmax output log_probabilities = T.log(self.output)

# We use these log probabilities to compute the negative log likelihood negative_log_likelihood = -T.mean(log_probabilities[T.arange(y.shape[0]), y]) # Compute the L2 regularization component of the cost function l2_regularization = lambda_l2 * (self.W ** 2).sum() # Return a symbolic description of the cost function return negative_log_likelihood + l2_regularization

def error_rate(self, y): """ Here we return the error rate of the model over a set of given labels (perhaps in a minibatch, in the validation set, or the test set)

PARAM y : theano.tensor.TensorType


These are the correct answers, and we compute the cost with respect to this ground truth (over the entire minibatch). This means that y is of size (minibatch_size, output_dim) """

# Make sure y is of the correct dimension assert y.ndim == self.predicted.ndim

# Make sure that y contains values of the correct data type (ints) assert y.dtype.startswith('int')

# Return the error rate on the data return T.mean(T.neq(self.predicted, y))

Using Theano to Train a Logistic Regression NetworkNow that we have a working logistic regression network, we’ll use Theano to train iton the MNIST dataset. As in the previous section, we’ll take the code one snippet at atime and understand how and why it works conceptually. Then we’ll end the sectionby presenting the complete Python script.

After downloading our dataset and any appropriate pre-processing, we’ll have to pre‐pare our dataset so that Theano can utilize it efficiently. Specifically, we need todeclare the datasets as shared variables. This is especially important if we’re using theGPU to train our model. If a variable is not shared, it will be copied into the GPUmemory at every use. This is extremely inefficient, especially when we’re dealing withlarge datasets. We can declare all our MNIST datasets as shared with the followingcode snippet.

def shared_dataset(data_xy): """ We store the data in a shared variable because it allows Theano to copy it into GPU memory (if GPU utilization is enabled). By default, if a variable is not shared, it is moved to GPU at every use. This results in a big performance hit because that means the data will be copied one minibatch at a time. Instead, if we use shared variables, we don't have to worry about copying data repeatedly. """

data_x, data_y = data_xy shared_x = shared(np.asarray(data_x, dtype=config.floatX), borrow=True) shared_y = shared(np.asarray(data_y, dtype='int32'), borrow=True) return shared_x, shared_y

# We now instantiate the shared datasetstraining_set_x , training_set_y = shared_dataset(training_set)validation_set_x, validation_set_y = shared_dataset(validation_set)test_set_x, test_set_y = shared_dataset(test_set)


Now we need to set up several symbolic variables so we can build our finalized train‐ing, validation, and test functions. Using Theano’s symbolic functionality, we can ach‐ieve all of this boilerplate with just 10 lines of Theano code.

# Lets compute the number of minibatches for training, validation, and testingn_training_batches = training_set_x.get_value(borrow=True).shape[0] / BATCH_SIZEn_validation_batches = validation_set_x.get_value(borrow=True).shape[0] / BATCH_SIZEn_test_batches = test_set_x.get_value(borrow=True).shape[0] / BATCH_SIZE

# Now it's time for us to build the model! #Let's start of with an index to the minibatch we're usingindex = T.lscalar()

# Generate symbolic variables for the input (a minibatch)x = T.dmatrix('x')y = T.ivector('y')

# Construct the logistic network model# Keep in mind MNIST image is of size (28, 28)# Also number of output class is is 10 (digits 0, 1, ..., 9)model = logistic_network.LogisticNetwork(input=x, input_dim=28*28, output_dim=10)

# Obtain a symbolic expression for the objective function# EXPERIMENT!!! Play around with L2 regression parameter!objective = model.logistic_network_cost(y, lambda_l2=0.0001)

# Obtain a symbolic expression for the error incurrederror = model.error_rate(y)

# Compute symbolic gradients of objective with respect to model parametersdW, db = T.grad(objective, model.W), T.grad(objective, model.b)

In the first three lines, we start off by determining how many minibatches are in eachdataset and storing the values in n_training_batches, n_validation_batches, andn_test_batches. In line 4, we declare the symbolic scalar variable index in order tokeep track of which minibatch we are on during the training process. In lines 4-5, wedeclare symbolic variables to store the current minibatch examples in x and theirassociated labels (ground truth) in y. Now we’re ready to pull out the symbolicexpressions we defined in the LogisticRegression class we defined in the previoussection. In line 6, we instantiate a LogisticRegression object in model. In line 7 and8 we pull out the objective and error expressions we defined. And finally, in line 10,we bring out the pull power of Theano by expressing the error derivatives in just asingle line of code, maintained symbolically in the variables dW and db.

With this boilerplate out of the way, we’re ready to declare the functions that we willneed to train, validate, and test our model on a minibatch of data. The code for thispurpose is shown below.

Using Theano to Train a Logistic Regression Network | 67

# Compile theano function for training with a minibatchtrain_model = function( inputs=[index], outputs=objective, updates=[ (model.W, model.W - LEARNING_RATE * dW), (model.b, model.b - LEARNING_RATE * db) ], givens={ x : training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE], y : training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE] } )

# Compile theano functions for validation and testingvalidate_model = function( inputs=[index], outputs=error, givens={ x : validation_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE], y : validation_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE] } )

test_model = function( inputs=[index], outputs=error, givens={ x : test_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE], y : test_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE] } )

The Theano code for these functions is mostly self explanatory. All of the functionstake in the index representing a minibatch from the dataset. The output oftrain_model is the value of the objective function on that minibatch while the out‐puts of validate_model and test_model is a measure of the error incurred. More‐over, the train_model performs the required minibatch stochastic gradient descentupdates on the connection weights W and the softmax layer biases b. You may havenoticed that we decided to pass the datasets as givens instead of as inputs. Asdescribed above, this allows us to include the datasets in the graph optimizations,which turns out to be more performant than passing them in as variable inputs.

The final code snippet is the piece of Python that is responsible for actually perform‐ing the training process. It is also the most involved because, although we define aconstant N_EPOCHS to pre-determine how long training should occur, we also utilize aconcept called early stopping. As you may recall from the previous chapter, we always


use a validation set in order to ensure that we aren’t overfitting to the training dataduring the training process. Early stopping is an algorithmic mechanism by which wecan use the performance on the validation data as an indicator of whether our modelis experiencing useful learning or if it is merely overfitting. The code for this processis shown below and we will describe it in extensive detail so we understand how itworks.

# Let's set up the early stopping parameters (based on the validation set)# Must look at this many examples no matter whatpatience = 5000

# Wait this much longer if a new best is found patience_increase = 2

# This is when an improvement is significantimprovement_threshold = 0.995

# We go through this number of minbatches before we check on the validation setvalidation_freq = min(n_training_batches, patience / 2)

# We keep of the best loss on the validation set herebest_loss = np.inf

# We also keep track of the epoch we are inepoch = 0

# A boolean flag that propagates when patience has been exceededexceeded_patience = False

# Now we're ready to start training the modelprint "... TRAINING MODEL ..."start_time = time.clock()while (epoch < N_EPOCHS) and not exceeded_patience: epoch = epoch + 1 for minibatch_index in xrange(n_training_batches): minibatch_objective = train_model(minibatch_index) iteration = (epoch - 1) * n_training_batches + minibatch_index

if (iteration + 1) % validation_freq == 0: # Compute loss on validation set validation_losses = [validate_model(i) for i in xrange(n_validation_batches)] validation_loss = np.mean(validation_losses)

print 'epoch %i, minibatch %i/%i, validation error: %f %%' % ( epoch, minibatch_index + 1, n_training_batches, validation_loss * 100 )

if validation_loss < best_loss:


if validation_loss < best_loss * improvement_threshold: patience = max(patience, iteration * patience_increase) best_loss = validation_loss

if patience <= iteration: exceeded_patience = True breakend_time = time.clock()

# Let's compute how well we do on the test settest_losses = [test_model(i) for i in xrange(n_test_batches)]test_loss = np.mean(test_losses)

Early stopping works by terminating training if the validation error ceases toimprove. To determine whether we need to stop training early, we require a few ofimportant parameters. First, we need to decide how patient we will be when waitingfor an improvement in validation error. This is defined by the patience parameter. Inother words, we must wait a minimum of patience iterations (i.e. training steps onminibatches) before we decide that training isn’t worth it. If we achieve a new best, weincrease the number of iterations we need to wait by multiplying the current iterationby a predefined patience_increase factor. And finally, to determine if a new best issignificant, it must satisfy a predefined improvement_threshold. We compute thevalidation error either once every epoch or every time we complete patience/2 mini‐batches, whichever is smaller.

To illustrate these concepts, let’s walk through a two examples. Let’s set patience =5000 and patience_increase = 2. In the first situation, we we do not achieve a sig‐nificant improvement in validation error after completing patience/2 or 2500 itera‐tions. This means that our value for patience never gets increased. Consequently,after 5000 iterations, our training algorithm terminates.

In the second example, we’ll set patience and patience_increase to the same valuesas before. This time however, our model experiences a significant improvement atiteration 3000. This means that our patience will be increased to 6000. If we experi‐ence no significant improvements during iterations 3001-6000, our algorithm willterminate. On the other hand, if a significant improvement in validation error isachieved, we will continue to resize patience until eventually, we max out N_EPOCHSor we finally do exhaust our patience.

After completing the training process, we use the test_model function to evaluate ourerror on the test set. We can report this error as the accuracy of this model. We’vefinally completed writing and training our first model in Theano! Using the parame‐


ters in the code below, we are able to achieve a test error of approximately 7.6%. Feelfree to experiment with the hyperparameters to see if you can force the logisticregression network to perform any better!

"""We'll now use the LogisticNetwork object we built in logistic_network.py in order to tackle the MNIST dataset challenge. We will use minibatch gradientdescent to train this simplistic network model.


__docformat__ = 'restructedtext en'

import cPickleimport gzipimport osimport timeimport urllibfrom theano import function, shared, configimport theano.tensor as T import numpy as npimport logistic_network

# Let's start off by defining some constants# EXPERIMENT!!! Play around the the learning rate!LEARNING_RATE = 0.2N_EPOCHS = 1000DATASET = 'mnist.pkl.gz'BATCH_SIZE = 600

# Time to check if we have the data and if we don't, let's download it print "... LOADING DATA ..."

data_path = os.path.join( os.path.split(__file__)[0], "..", "data", DATASET )

if (not os.path.isfile(data_path)): import urllib origin = ( 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz' ) print 'Downloading data from %s' % origin urllib.urlretrieve(origin, data_path)


# Time to build our modelsprint "... BUILDING MODEL ..."

# Load the datasetdata_file = gzip.open(data_path, 'rb')training_set, validation_set, test_set = cPickle.load(data_file)data_file.close()

# Define a quick function to established a shared dataset (for efficiency)

def shared_dataset(data_xy): """ We store the data in a shared variable because it allows Theano to copy it into GPU memory (if GPU utilization is enabled). By default, if a variable is not shared, it is moved to GPU at every use. This results in a big performance hit because that means the data will be copied one minibatch at a time. Instead, if we use shared variables, we don't have to worry about copying data repeatedly. """






# Construct the logistic network model# Keep in mind MNIST image is of size (28, 28)# Also number of output class is is 10 (digits 0, 1, ..., 9)model = logistic_network.LogisticNetwork(input=x, input_dim=28*28, output_dim=10)

# Obtain a symbolic expression for the objective function# EXPERIMENT!!! Play around with L2 regression parameter!objective = model.logistic_network_cost(y, lambda_l2=0.0001)



# Compute symbolic gradients of objective with respect to model parametersdW, db = T.grad(objective, model.W), T.grad(objective, model.b)

# Compile theano function for training with a minibatchtrain_model = function( inputs=[index], outputs=objective, updates=[ (model.W, model.W - LEARNING_RATE * dW), (model.b, model.b - LEARNING_RATE * db) ], givens={ x : training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE], y : training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE] } )



# Let's set up the early stopping parameters (based on the validation set)

# Must look at this many examples no matter whatpatience = 5000











if validation_loss < best_loss: if validation_loss < best_loss * improvement_threshold: patience = max(patience, iteration * patience_increase) best_loss = validation_loss



# Print out the results!print '\n'print 'Optimization complete with best validation score of %f %%' % (best_loss * 100)print 'And with a test score of %f %%' % (test_loss * 100)print '\n'print 'The code ran for %d epochs and for a total time of %.1f seconds' % (epoch, end_time - start_time)print '\n'


Multilayer Models in TheanoIn the previous section, we built a simple logistic regression network with no hiddenlayers. With a 93% accuracy rate, the model does quite well, but is still not nearly aseffective as a human is. In this section we will attempt to create a more powerfulmodel, a feed forward neural network with a hidden layer, in order to construct amore accurate classifier. In the process, we’ll understand how to build models withmultiple layers, a characteristic that is fundamental to deep learning models.

In order to accomplish this, we’ll be building on many of the constructs we developedin previous sections. Specifically, we’ll still be using the same output softmax layerand the same early stopping strategy for training our feed forward neural network. Inthis section we’ll discuss the unique components of this model and develop a concep‐tual understanding of how the code is built step by step.

We start by looking at the structure of the hidden layers of our neural network. Thehidden layers utilize a tanh nonlinearity, which is generally considered to be moreeffective than the sigmoidal nonlinearity because it is zero-centered. As described inGlorot and Bengio 2010, we also initialize the weights by randomly sampling from anarrow uniform distribution instead of setting all of the weights initially to zero. Thisis because, if every neuron in the network computes the same output, then they willalso all compute the same gradients during backpropagation and undergo the exactsame parameter updates. The Python code responsible for this initialization is shownbelow.

"""We will use this class to represent a tanh hidden layer. This will be a building block for a simplefeedforward neural network.



class HiddenLayer(object): """ The hidden layer class is described by two parameters (which we will want to learn). The first is a incoming weight matrix. We'll refer to this weight matrix as W. The second is a bias vector b. Refer to the text if you want to learn more about how

Multilayer Models in Theano | 75

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

this layer works. Let's get started! """

def __init__(self, input, input_dim, output_dim, random_gen): """ We first initialize the hidden layer object with some important information.

PARAM input : theano.tensor.TensorType A symbolic variable that we'll use to describe incoming data from the previous layer

PARAM input_dim : int This will represent the number of neurons in the previous layer

PARAM ouptut_dim : int This will represent the number of neurons in the hidden layer

PARAM random_gen : numpy.random.RandomState A random number generator used to properly initialize the weights. For a tanh activation function, the literature suggests that the incoming weights should be sampled from the uniform distribution [-sqrt(6./(input_dim + output_dim)), sqrt(6./(input_dim + output_dim)] """

# We initialize the weight matrix W of size (input_dim, output_dim) self.W = theano.shared( value=np.asarray( random_gen.uniform( low=-np.sqrt(6. / (input_dim + output_dim)), high=np.sqrt(6. / (input_dim + output_dim)), size=(input_dim, output_dim) ), dtype=theano.config.floatX ), name='W', borrow=True )


# Symbolic description of the incoming logits logit = T.dot(input, self.W) + self.b

# Symbolic description of the outputs of the hidden layer neurons self.output = T.tanh(logit)


We can now stack these hidden layers in order to construct a feedforward neural net‐work. The initialization takes a random generator random_gen to perform the weightinitialization as well as a list of hidden_layer_sizes in order to specify the sizes ofeach of the hidden layers in the neural network. We hook up the input of the firsthidden layer to a minibatch from our dataset and the output of the last hidden layerto the input of self.softmax_layer. Every one of the other hidden layers has itsinputs drawing from the outputs of the layer below it and its outputs transmitting tothe inputs of the layer above it. These hidden layers are maintained in the listself.hidden_layers as internal state. The objective function, as before, has both anegative log likelihood component and an L2 regularization component. This time,however, computing the sum of the squares of the weights of all connections in thenetwork is slightly more involved. The error function is the same as what we had forthe logistic network model. The Python code for the feed forward neural network isshown below.

"""We will use this class to represent a tanh hidden layer. This will be a building block for a simplefeedforward neural network.



class HiddenLayer(object): """ The hidden layer class is described by two parameters (which we will want to learn). The first is a incoming weight matrix. We'll refer to this weight matrix as W. The second is a bias vector b. Refer to the text if you want to learn more about how this layer works. Let's get started! """

def __init__(self, input, input_dim, output_dim, random_gen): """ We first initialize the hidden layer object with some important information.

PARAM input : theano.tensor.TensorType A symbolic variable that we'll use to describe incoming data from the previous layer

PARAM input_dim : int This will represent the number of neurons in the previous layer


PARAM ouptut_dim : int This will represent the number of neurons in the hidden layer

PARAM random_gen : numpy.random.RandomState A random number generator used to properly initialize the weights. For a tanh activation function, the literature suggests that the incoming weights should be sampled from the uniform distribution [-sqrt(6./(input_dim + output_dim)), sqrt(6./(input_dim + output_dim)] """

# We initialize the weight matrix W of size (input_dim, output_dim) self.W = theano.shared( value=np.asarray( random_gen.uniform( low=-np.sqrt(6. / (input_dim + output_dim)), high=np.sqrt(6. / (input_dim + output_dim)), size=(input_dim, output_dim) ), dtype=theano.config.floatX ), name='W', borrow=True )


# Symbolic description of the incoming logits logit = T.dot(input, self.W) + self.b

# Symbolic description of the outputs of the hidden layer neurons self.output = T.tanh(logit)

The Python script responsible for instantiating and training the feedforward networkis quite similar to how we trained the logistic network model. Here, we take a closerlook at the differences before we present the entire script.

# Construct the logistic network model# Keep in mind MNIST image is of size (28, 28)# Also number of output class is is 10 (digits 0, 1, ..., 9)model = feed_forward_network.FeedForwardNetwork( random_gen=random_gen, input=x, input_dim=28*28, output_dim=10, hidden_layer_sizes=[500] )


# Obtain a symbolic expression for the objective function# EXPERIMENT!!! Play around with L2 regression parameter!objective = model.feed_forward_network_cost(y, lambda_l2=0.0001)


# Compute symbolic gradients of objective with respect to model parametersupdates = []for hidden_layer in model.hidden_layers: dW = T.grad(objective, hidden_layer.W) db = T.grad(objective, hidden_layer.b) updates.append((hidden_layer.W, hidden_layer.W - LEARNING_RATE * dW)) updates.append((hidden_layer.b, hidden_layer.b - LEARNING_RATE * db))

dW = T.grad(objective, model.softmax_layer.W)db = T.grad(objective, model.softmax_layer.b)updates.append((model.softmax_layer.W, model.softmax_layer.W - LEARNING_RATE * dW))updates.append((model.softmax_layer.b, model.softmax_layer.b - LEARNING_RATE * db))

# Compile theano function for training with a minibatchtrain_model = function( inputs=[index], outputs=objective, updates=updates, givens={ x : training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE], y : training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE] } )

As you’ll notice, our network only has a single hidden layer with 500 tanh neurons, asdeclared by the line hidden_layer_sizes=[500]. Moreover, we need to compileupdates for the connections and biases over all of the layers in our feed forward net‐work. This incurs some additional complexity, but all of the gradient updates can becompiled into a single variable, updates, and passed to the train_model function.For completeness, we present the full training script below. The model performsmuch better than the logistic network, incurring a test error of only 1.65%!

"""We'll now use the LogisticNetwork object we built in feed_forward_network.py in order to tackle the MNIST dataset challenge. We will use minibatch gradientdescent to train this simplistic network model.


__docformat__ = 'restructedtext en'


import cPickleimport gzipimport osimport timeimport urllibfrom theano import function, shared, configimport theano.tensor as T import numpy as npimport feed_forward_network

# Let's start off by defining some constants# EXPERIMENT!!! Play around the the learning rate!LEARNING_RATE = 0.01N_EPOCHS = 1000DATASET = 'mnist.pkl.gz'BATCH_SIZE = 20

# Time to check if we have the data and if we don't, let's download it print "... LOADING DATA ..."

data_path = os.path.join( os.path.split(__file__)[0], "..", "data", DATASET )

if (not os.path.isfile(data_path)): import urllib origin = ( 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz' ) print 'Downloading data from %s' % origin urllib.urlretrieve(origin, data_path)

# Time to build our modelsprint "... BUILDING MODEL ..."

# Load the datasetdata_file = gzip.open(data_path, 'rb')training_set, validation_set, test_set = cPickle.load(data_file)data_file.close()

# Define a quick function to established a shared dataset (for efficiency)

def shared_dataset(data_xy): """ We store the data in a shared variable because it allows Theano to copy it into GPU memory (if GPU utilization is enabled). By default, if a variable is not shared, it is moved to GPU at every use. This results in a big performance


hit because that means the data will be copied one minibatch at a time. Instead, if we use shared variables, we don't have to worry about copying data repeatedly. """






# Create a random number generator for seeding weight initializationrandom_gen = np.random.RandomState(1234)

# Construct the logistic network model# Keep in mind MNIST image is of size (28, 28)# Also number of output class is is 10 (digits 0, 1, ..., 9)model = feed_forward_network.FeedForwardNetwork( random_gen=random_gen, input=x, input_dim=28*28, output_dim=10, hidden_layer_sizes=[500] )

# Obtain a symbolic expression for the objective function# EXPERIMENT!!! Play around with L2 regression parameter!objective = model.feed_forward_network_cost(y, lambda_l2=0.0001)


# Compute symbolic gradients of objective with respect to model parametersupdates = []for hidden_layer in model.hidden_layers:


dW = T.grad(objective, hidden_layer.W) db = T.grad(objective, hidden_layer.b) updates.append((hidden_layer.W, hidden_layer.W - LEARNING_RATE * dW)) updates.append((hidden_layer.b, hidden_layer.b - LEARNING_RATE * db))

dW = T.grad(objective, model.softmax_layer.W)db = T.grad(objective, model.softmax_layer.b)updates.append((model.softmax_layer.W, model.softmax_layer.W - LEARNING_RATE * dW))updates.append((model.softmax_layer.b, model.softmax_layer.b - LEARNING_RATE * db))

# Compile theano function for training with a minibatchtrain_model = function( inputs=[index], outputs=objective, updates=updates, givens={ x : training_set_x[index * BATCH_SIZE : (index + 1) * BATCH_SIZE], y : training_set_y[index * BATCH_SIZE : (index + 1) * BATCH_SIZE] } )



# Let's set up the early stopping parameters (based on the validation set)

# Must look at this many examples no matter whatpatience = 10000











if validation_loss < best_loss: if validation_loss < best_loss * improvement_threshold: patience = max(patience, iteration * patience_increase) best_loss = validation_loss



# Print out the results!print '\n'print 'Optimization complete with best validation score of %f %%' % (best_loss * 100)print 'And with a test score of %f %%' % (test_loss * 100)print '\n'


print 'The code ran for %d epochs and for a total time of %.1f seconds' % (epoch, end_time - start_time)print '\n'

SummaryIn this chapter, we learned more about using Theano as a library for expressing andtraining machine learning models. We discussed many internal features of Theano,including shared variables, symbolic differentiation, and graph optimizations. In thefinal sections, we used this understanding to train a logistic network model and ageneralized feedforward neural network using stochastic gradient descent. Althoughthe logistic network model made many errors on the MNIST dataset, our feedfor‐ward network performed very well, making only an average of 1.65 errors out ofevery 100 digits.

In the next section, we’ll begin to grapple with many of the problems that arise as webegin to make our networks deeper. While deep models afford us the power to tacklemore difficult problems, they are also notoriously difficult to train with vanilla sto‐chastic gradient descent. To tackle these issues, we’ll delve in to modern optimizationtheory to come up with better methods of training deep learning models.


Fundamentals of Deep Learning

Documents

Transcript of Fundamentals of Deep Learning