Unsupervised Feature Learning

Unsupervised Feature Learning: A Literature Review

By: Amgad Muhammad & Mohamed EL Fadly

Outline

• Background

• Problem Definition

• Unsupervised Feature Learning

• Our Work

• Sparse Auto-encoder

• Preprocessing: PCA and Whitening

• Self-Taught Learning and Unsupervised Feature Learning

• References

Background

• Machine learning is one of the cornerstone fields of Artificial Intelligence, in which machines learn to act autonomously and to react to new situations without being explicitly programmed for them.

• Machine learning has seen numerous successes, but applying learning algorithms today often means spending a long time hand-engineering the input feature representation. This is true for many problems in vision, audio, NLP, robotics, and other areas.

• There are several families of learning algorithms, among them [1]:

1) Supervised learning

2) Unsupervised learning

Problem Definition

• The target of a supervised learning method is one of the following:

• Regression

• Classification

• The first step in training a machine with a supervised learning method is collecting the data set, which in most cases is a difficult and expensive process.

• The alternative approach is to measure and use everything, which leads to other problems, e.g. noisy data [2].

Unsupervised feature learning

• The unsupervised feature learning approach learns higher-level representations of unlabeled data by detecting patterns in it, using algorithms such as sparse coding [3].

• It is a self-taught learning framework developed to transfer knowledge from unlabeled data, which is much easier to obtain, so that it can be used as a preprocessing step to enhance supervised inductive models.

• This framework was developed to tackle present issues in the supervised learning model and to increase its accuracy regardless of the domain of interest (vision, sound, or text) [4].

Our Work

• We present some of the methods for unsupervised feature learning and deep learning, each of which automatically learns a good representation of the input from unlabeled data.

• We concentrate on the following algorithms, detailed in the following slides:

• Sparse Autoencoder

• PCA and Whitening

• Self-Taught Learning

• We also focus on applying these algorithms to learn features from images.

Sparse Autoencoders


Sparse Auto-encoder

An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation to a set of unlabeled training examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$, setting the target values to be equal to the inputs [6],

i.e. it uses $y^{(i)} = x^{(i)}$.

Autoencoder [6]

Neural Network

Before going further into the details of the algorithm, we briefly review neural networks.

To describe neural networks, we begin with the simplest possible one: a network that comprises a single "neuron." We use the following diagram to denote a single neuron [5].

Single Neuron [8]

Neural Network

This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a +1 intercept term), and outputs

$h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$,

where $f : \mathbb{R} \to \mathbb{R}$ is called the activation function [5].

Sigmoid Activation Function

The activation function can be [8]:

1) Sigmoid function: $f(z) = \frac{1}{1 + e^{-z}}$, with outputs in the range (0, 1)

Sigmoid Function [8]

Tanh Activation Function

2) Tanh function: $f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with outputs in the range (-1, 1)

Tanh Function [8]
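The single-neuron computation and the two activation functions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the weights, bias, and inputs are made-up values chosen for the example, not taken from the slides.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs, activation=sigmoid):
    """Compute f(sum_i W_i * x_i + b) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Illustrative (made-up) parameters for a 3-input neuron.
w = [0.5, -0.25, 0.1]
b = 0.0
x = [1.0, 2.0, 0.0]

print(neuron_output(w, b, x))                        # sigmoid activation
print(neuron_output(w, b, x, activation=math.tanh))  # tanh activation
```

Swapping the `activation` argument is the only change needed to move between the sigmoid and tanh variants of the neuron.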

Neural Network Model

• A neural network is built by connecting many of our simple "neurons," so that the output of one neuron can be the input of another. For example, here is a small neural network.

• The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer. The middle layer of nodes is called the hidden layer, because its values are not observed in the training set [8].

Small Neural Network [8]

Neural Network Model

Neural network parameters are:

• $(W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where we write $W^{(l)}_{ij}$ to denote the parameter (or weight) associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$.

• $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$.

• $a^{(l)}_i$ denotes the activation (meaning output value) of unit $i$ in layer $l$.

Given a fixed setting of the parameters $W, b$, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number.

Specifically, the computation that this neural network represents is given by [8]

$a^{(2)} = f\left(W^{(1)} x + b^{(1)}\right)$

$h_{W,b}(x) = a^{(3)} = f\left(W^{(2)} a^{(2)} + b^{(2)}\right)$

where $f$ is applied element-wise.
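The forward computation of such a network can be sketched as follows. This is a minimal standard-library illustration that assumes a sigmoid activation and a hypothetical 3-3-1 network with made-up parameters; `W[i][j]` plays the role of the weight from unit j in one layer to unit i in the next.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(W, b, a_prev):
    """Return the activations a_i = f(sum_j W[i][j] * a_prev[j] + b[i])."""
    return [
        sigmoid(sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i)
        for row, b_i in zip(W, b)
    ]


def forward(params, x):
    """Propagate input x through the hidden layer and the output layer."""
    W1, b1, W2, b2 = params
    a2 = layer_forward(W1, b1, x)   # hidden-layer activations
    a3 = layer_forward(W2, b2, a2)  # output-layer activations, h_{W,b}(x)
    return a2, a3


# Hypothetical parameters: 3 inputs, 3 hidden units, 1 output unit.
W1 = [[0.0, 0.0, 0.0] for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0]]
b2 = [-1.5]

a2, h = forward((W1, b1, W2, b2), [1.0, 2.0, 3.0])
print(a2, h)
```

With all-zero first-layer weights every hidden unit outputs sigmoid(0), which makes the result easy to check by hand.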

Autoencoders and Sparsity

• The autoencoder tries to learn a function $h_{W,b}(x) \approx x$. In other words, it tries to learn an approximation to the identity function, so that its output $\hat{x}$ is similar to $x$.

• Placing constraints on the network, such as limiting the number of hidden units or imposing a sparsity constraint on the hidden units, leads it to discover interesting structure in the data, even if the number of hidden units is large [6].

Autoencoders and Sparsity Algorithm

Assumptions:

1. We would like the neurons to be inactive most of the time. A neuron is "active" (or "firing") if its output value is close to 1, and "inactive" if its output value is close to 0; this assumes the activation function is the sigmoid function [6].

2. Recall that $a^{(2)}_j$ denotes the activation of hidden unit $j$ in the autoencoder.

3. We write $a^{(2)}_j(x)$ to denote the activation of this hidden unit when the network is given a specific input $x$.

4. Let $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a^{(2)}_j(x^{(i)})$ be the average activation of hidden unit $j$ (averaged over the training set of $m$ examples).

Objective:

We would like to (approximately) enforce the constraint $\hat{\rho}_j = \rho$, where $\rho$ is a sparsity parameter, typically a small value close to zero [6].

Autoencoders and Sparsity Algorithm –cont’d

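• To achieve this, we add an extra penalty term to our optimization objective that penalizes $\hat{\rho}_j$ deviating significantly from $\rho$.

• The penalty term is $\sum_{j=1}^{s_2} \left[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_j} \right]$, where $s_2$ is the number of neurons in the hidden layer and the index $j$ sums over the hidden units in the network [6].

• It can also be written $\sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho) \log\frac{1-\rho}{1-\hat{\rho}_j}$ is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat{\rho}_j$ [6].

• The KL divergence is a standard function for measuring how different two distributions are.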

Autoencoders and Sparsity Algorithm –cont’d

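• The KL penalty function has the following property: $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$, and otherwise it increases monotonically as $\hat{\rho}_j$ diverges from $\rho$.

• For example, if we plot $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ for a range of values of $\hat{\rho}_j$ (with $\rho = 0.2$), we see that the KL divergence reaches its minimum of 0 at $\hat{\rho}_j = \rho$ and approaches $\infty$ as $\hat{\rho}_j$ approaches 0 or 1.

• Thus, minimizing this penalty term has the effect of causing $\hat{\rho}_j$ to be close to $\rho$ [6].

KL Function [6]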
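The behaviour of the KL penalty described above can be checked numerically. The sketch below uses only the standard library and the same sparsity parameter as the slide (0.2); the sample values of the average activation are arbitrary choices for illustration.

```python
import math


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat).

    Both arguments must lie strictly between 0 and 1.
    """
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_penalty(rho, rho_hats):
    """Sum of the KL terms over all hidden units."""
    return sum(kl_bernoulli(rho, r) for r in rho_hats)


rho = 0.2
for rho_hat in (0.01, 0.1, 0.2, 0.5, 0.99):
    print(f"rho_hat={rho_hat:<5} KL={kl_bernoulli(rho, rho_hat):.4f}")
```

The printed values are smallest at an average activation of 0.2 and grow as the average activation moves toward 0 or 1, matching the shape of the plotted curve.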

Autoencoders and Sparsity Algorithm – Cont’d

• The overall cost function is now $J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $J(W,b)$ is the neural network cost function and $\beta$ controls the weight of the sparsity penalty term.

• The term $\hat{\rho}_j$ also depends (implicitly) on $W, b$, because it is the average activation of hidden unit $j$, and the activation of a hidden unit depends on the parameters $W, b$ [6].
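As a rough sketch of how the overall cost could be evaluated, the code below computes the average hidden activations and adds the weighted KL penalty to a reconstruction cost. It assumes that J(W,b) is the average squared reconstruction error (any weight-decay term is left out for brevity), and the tiny network and data set are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, b, a_prev):
    return [sigmoid(sum(w * a for w, a in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)]


def sparse_cost(W1, b1, W2, b2, data, rho, beta):
    """Return (J_sparse, rho_hat) for an autoencoder on unlabeled data."""
    m = len(data)
    hidden = [layer(W1, b1, x) for x in data]
    outputs = [layer(W2, b2, a) for a in hidden]
    # Assumed J(W,b): average of 0.5 * ||x_hat - x||^2 (no weight decay).
    J = sum(0.5 * sum((o - t) ** 2 for o, t in zip(out, x))
            for out, x in zip(outputs, data)) / m
    # Average activation of each hidden unit over the training set.
    rho_hat = [sum(a[j] for a in hidden) / m for j in range(len(b1))]
    penalty = sum(rho * math.log(rho / r)
                  + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - r))
                  for r in rho_hat)
    return J + beta * penalty, rho_hat


# Made-up example: 2 inputs, 1 hidden unit, 2 outputs, all-zero parameters.
W1, b1 = [[0.0, 0.0]], [0.0]
W2, b2 = [[0.0], [0.0]], [0.0, 0.0]
data = [[0.5, 0.5], [0.5, 0.5]]

cost, rho_hat = sparse_cost(W1, b1, W2, b2, data, rho=0.2, beta=3.0)
print(cost, rho_hat)
```

In this example the reconstruction is exact, so the whole cost comes from the sparsity penalty: the hidden unit has an average activation of 0.5, well above the target of 0.2.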

Autoencoder Implementation

• We implemented a sparse autoencoder and trained it on 8×8 image patches using the L-BFGS optimization algorithm.

Step 1: Generate training set

The first step is to generate a training set of image patches.

A random sample of 200 patches from the dataset.
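One way to generate such a training set is to cut patches at random positions from randomly chosen images. The sketch below is only an illustration: the function name `sample_patches`, the fixed seed, and the synthetic 32×32 stand-in images are assumptions, since the slides do not describe the sampling code or the image source.

```python
import random


def sample_patches(images, num_patches, patch_size=8, seed=0):
    """Cut square patches at random locations and flatten each to a list."""
    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        img = rng.choice(images)
        rows, cols = len(img), len(img[0])
        r = rng.randrange(rows - patch_size + 1)  # top-left row of the patch
        c = rng.randrange(cols - patch_size + 1)  # top-left column
        patch = [img[r + i][c + j]
                 for i in range(patch_size)
                 for j in range(patch_size)]
        patches.append(patch)
    return patches


# Synthetic stand-in for real images: three 32x32 grids with values in [0, 1].
images = [[[((r * c + k) % 256) / 255.0 for c in range(32)]
           for r in range(32)]
          for k in range(3)]

patches = sample_patches(images, 200)
print(len(patches), len(patches[0]))
```

Each flattened 8×8 patch has 64 values, which matches the 64 input units of the autoencoder described in the following steps.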

Autoencoder Implementation

Step 2: Sparse autoencoder objective

Compute the sparse autoencoder cost function $J_{\text{sparse}}(W,b)$ and its derivatives with respect to the different parameters.

Step 3: Train the sparse autoencoder

After computing $J_{\text{sparse}}$ and its derivatives, we minimize $J_{\text{sparse}}$ with respect to its parameters, and thereby train our sparse autoencoder. We trained it with the L-BFGS algorithm. Our neural network has 64 input units, 25 hidden units, and 64 output units.

Autoencoder Implementation Results

After training, the sparse autoencoder successfully learned a set of edge detectors.

CPU                          Intel Core i7 quad-core processor, 2.7 GHz
RAM                          6 GB
Training set                 200 patches of 8×8 pixels
Neural network for training  64 input units, 25 hidden units, and 64 output units

Autoencoder Implementation Results

Training time                39 seconds
Expected time [1]            Less than a minute

Principal Component Analysis – PCA

• PCA is a dimensionality reduction technique that eliminates highly correlated variables without sacrificing much of the detail in the data [7].

PCA – Example

• We are given the 2D data example shown below.

• This data has already been pre-processed using mean normalization.

• We want to find the principal directions of variation.

2D data example [8]

PCA – Example (Cont’d)

As depicted, $u_1$ is the direction of strongest variation and $u_2$ is the direction of second-strongest variation [8].

2D data example with directions $u_1$ and $u_2$ [8]

PCA – Math

• To compute the principal directions of variation, we first compute the covariance matrix $\Sigma$ of the data points as follows:

• $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^T$, where $m$ is the number of points.

• It can be shown that $u_1$ is the top eigenvector of $\Sigma$ and $u_2$ is the second eigenvector [8].

PCA – Math

• Having calculated the eigenvectors, we construct a matrix $U = [u_1\ u_2\ \cdots\ u_n]$ whose columns are the eigenvectors.

• Retaining all the eigenvectors of the covariance matrix in $U$ makes $U$ a rotation matrix for the input $x$: $x_{\text{rot}} = U^T x$ [8].

2D data example [8]
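For 2D data the covariance matrix is 2×2, so its eigenvectors have a closed form and the PCA steps can be sketched without a linear algebra library. The code below is an illustration under that assumption: the four zero-mean points are made up, and a real implementation would typically call a numerical eigenvalue or SVD routine instead.

```python
import math


def covariance_2d(points):
    """Entries (s_xx, s_xy, s_yy) of the covariance of zero-mean 2D points."""
    m = len(points)
    sxx = sum(x * x for x, _ in points) / m
    sxy = sum(x * y for x, y in points) / m
    syy = sum(y * y for _, y in points) / m
    return sxx, sxy, syy


def eig_sym_2x2(a, b, c):
    """Eigenvalues and unit eigenvectors of the matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    lam1, lam2 = mid + rad, mid - rad  # lam1 >= lam2
    if b != 0.0:
        v = (lam1 - c, b)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    u1 = (v[0] / norm, v[1] / norm)
    u2 = (-u1[1], u1[0])  # perpendicular to u1
    return (lam1, lam2), (u1, u2)


def rotate(point, u1, u2):
    """Project a point onto the eigenvector basis: x_rot = U^T x."""
    x, y = point
    return (u1[0] * x + u1[1] * y, u2[0] * x + u2[1] * y)


# Made-up zero-mean data, spread mostly along the diagonal.
points = [(2.0, 2.0), (-2.0, -2.0), (1.0, -1.0), (-1.0, 1.0)]
sxx, sxy, syy = covariance_2d(points)
(lam1, lam2), (u1, u2) = eig_sym_2x2(sxx, sxy, syy)
rotated = [rotate(p, u1, u2) for p in points]
print(lam1, lam2, u1, u2)
```

After rotation, the first coordinate of each point measures its position along the strongest direction of variation and the second coordinate its position along the weaker one.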

PCA – Dimensionality Reduction

• If we decide to retain only the top $k$ eigenvectors in the matrix $U$, we can reduce the dimension of the input from $n$ to $k$ [7].

• How do we choose $k$?

PCA – Dimensionality Reduction

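• To decide on $k$, we consider the percentage of variance retained (PVR).

• A covariance matrix $\Sigma$ with $n$ eigenvectors has $n$ eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and retaining the top $k$ components gives $\mathrm{PVR} = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}$.

• For images, for example, it is common to choose $k$ so that the PVR stays above a high threshold [7].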
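Choosing k from the eigenvalues can be sketched as a running sum. The function name `choose_k`, the default threshold of 0.99, and the eigenvalues below are assumptions made for this illustration; the slide does not state the threshold the authors used.

```python
def choose_k(eigenvalues, threshold=0.99):
    """Smallest k whose top-k eigenvalues retain at least `threshold` variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    if total <= 0.0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return k
    return len(lams)


# Made-up eigenvalues of a 5-dimensional covariance matrix.
lams = [6.0, 3.0, 0.95, 0.04, 0.01]
print(choose_k(lams))       # with the assumed 0.99 threshold
print(choose_k(lams, 0.5))  # a much looser threshold keeps fewer components
```

Because the eigenvalues are sorted in decreasing order, the first k that reaches the threshold is also the smallest k that does so.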

Whitening

• In PCA, we removed highly correlated features to reduce the dimensionality of the features.

• The next step is to make sure that all the features have unit variance.

• With PCA, we transformed our input $x$ into $x_{\text{rot}} = U^T x$.

• By scaling every feature $x_{\text{rot},i}$ by $1/\sqrt{\lambda_i}$, where $\lambda_i$ is the eigenvalue corresponding to component $i$, we get features with unit variance [8].
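The scaling step can be sketched as follows. The rotated points and eigenvalues are made-up values for a 2D example whose components have variances 4 and 1; the optional `epsilon` argument is an assumption added here as a common safeguard against dividing by eigenvalues close to zero, and it defaults to 0 so that the formula above is applied unchanged.

```python
import math


def pca_whiten(x_rot, eigenvalues, epsilon=0.0):
    """Scale each rotated component i by 1 / sqrt(lambda_i + epsilon)."""
    return [xi / math.sqrt(lam + epsilon)
            for xi, lam in zip(x_rot, eigenvalues)]


# Made-up rotated data: component variances are 4 and 1 respectively.
s8, s2 = math.sqrt(8.0), math.sqrt(2.0)
rotated = [(s8, 0.0), (-s8, 0.0), (0.0, -s2), (0.0, s2)]
eigenvalues = (4.0, 1.0)

whitened = [pca_whiten(p, eigenvalues) for p in rotated]
variances = [sum(p[i] ** 2 for p in whitened) / len(whitened)
             for i in range(2)]
print(variances)
```

Both components of the whitened data end up with (approximately) unit variance, which is the property whitening is meant to deliver.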

Self-Taught Learning


Self-Taught learning and Unsupervised feature learning

Given an unlabeled data set, we can train a sparse autoencoder to extract features that give us a better, more condensed representation of the data.

Neural Network [8]

• Once training is done, the network can produce better features to represent the input, using the activations of its hidden layer [8].

Input layer of Neural Network [8]

• Given a labeled training set $\{(x_l^{(1)}, y^{(1)}), \ldots, (x_l^{(m_l)}, y^{(m_l)})\}$, we feed each input $x_l^{(i)}$ to the autoencoder to get the activation $a_l^{(i)}$ of the trained hidden layer [8].

• Given the activations $a_l^{(i)}$, we can either replace the labeled training data with $\{(a_l^{(1)}, y^{(1)}), \ldots, (a_l^{(m_l)}, y^{(m_l)})\}$, or concatenate the features extracted by the autoencoder with the original labeled features: $\{((x_l^{(1)}, a_l^{(1)}), y^{(1)}), \ldots, ((x_l^{(m_l)}, a_l^{(m_l)}), y^{(m_l)})\}$ [8].
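The two options, replacing the inputs with hidden activations or concatenating the two, can be sketched as below. The helper names, the `mode` argument, and the all-zero first-layer parameters are hypothetical; in practice `W1` and `b1` would come from the trained sparse autoencoder.

```python
import math


def hidden_activation(W1, b1, x):
    """Sigmoid activations of the autoencoder's hidden layer for input x."""
    return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
            for row, b in zip(W1, b1)]


def extract_features(W1, b1, labeled, mode="replace"):
    """Build a new labeled set from (x, y) pairs.

    mode="replace":     features are the hidden activations a
    mode="concatenate": features are x followed by a
    """
    result = []
    for x, y in labeled:
        a = hidden_activation(W1, b1, x)
        features = a if mode == "replace" else list(x) + a
        result.append((features, y))
    return result


# Hypothetical trained parameters: 3 inputs, 2 hidden units.
W1 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
labeled = [([1.0, 2.0, 3.0], 0), ([4.0, 5.0, 6.0], 1)]

replaced = extract_features(W1, b1, labeled, mode="replace")
concatenated = extract_features(W1, b1, labeled, mode="concatenate")
print(replaced[0], concatenated[0])
```

Replacement shrinks each example to the number of hidden units, while concatenation keeps the original features alongside the learned ones at the cost of a larger input to the classifier.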


Self-Taught Learning Application

• We used the self-taught learning paradigm with the sparse autoencoder and a softmax classifier to build a classifier for handwritten digits.

• The goal is to distinguish between the digits 0 to 4. We use the digits 5 to 9 as our "unlabeled" dataset, and a labeled dataset of the digits 0 to 4 to train the softmax classifier.

Self-Taught Learning Implementation

Step 1: Generate the input and test data sets

We used the datasets from the MNIST Handwritten Digit Database for this project.

Step 2: Train the sparse autoencoder

We used the unlabeled data (the digits 5 to 9) to train a sparse autoencoder. After training, a visualization of the learned features shows patterns that resemble pen strokes.

Step 3: Extracting features

After the sparse autoencoder is trained, we use it to extract features from the handwritten digit images.

Step 4: Training and testing the softmax (multinomial logistic regression) model

We train a softmax classifier using the training set features and labels, and finally compute the predictions and the accuracy.
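The prediction and accuracy computation in Step 4 can be sketched as follows. This illustration covers only the scoring side: the parameters `theta` and `bias` are made-up stand-ins for a classifier that has already been trained, and the three feature vectors and labels are invented for the example.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(theta, bias, features):
    """Return the index of the most probable class for one feature vector."""
    scores = [sum(t * f for t, f in zip(row, features)) + b
              for row, b in zip(theta, bias)]
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical trained parameters for 2 classes and 2 features.
theta = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
test_features = [[2.0, 0.0], [0.0, 3.0], [1.0, 0.5]]
test_labels = [0, 1, 1]

preds = [predict(theta, bias, f) for f in test_features]
print(preds, accuracy(preds, test_labels))
```

In the self-taught setting the feature vectors passed to `predict` would be the hidden-layer activations extracted in Step 3 rather than raw pixel values.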

Self-Taught Learning Setup Environment

CPU                      Intel Core i7 quad-core processor, 2.7 GHz
RAM                      6 GB
Training set             60,000 examples from the MNIST database
Unlabeled set            29,404 examples
Supervised training set  15,298 examples
Supervised testing set   15,298 examples

Self-Taught Learning Results

After training is complete, visualizing the learned features shows pen-stroke patterns like those in the image below:

Self-Taught Learning Analysis

We compared the outputs of our application with the outputs of the Stanford course tutorial [8].

                             Our classifier    Tutorial's classifier
Training time                16 minutes        25 minutes
Classifier score (accuracy)  98.208916%        98%

Future Work

We propose parallelizing our code, or running the training on a GPU, for example; this should improve performance and reduce the time needed to train the classifier.

References

[1] Taiwo Oladipupo Ayodele. New Advances in Machine Learning. InTech, 2010.

[2] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. Supervised machine learning: A review of classification techniques. 31:249-268, 2007.

[3] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801-808, 2006.

[4] Bruno A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, 1996.

[5] Simon O. Haykin. "Multilayer Perceptron," in Neural Networks and Learning Machines, 3rd ed. Prentice Hall, 2009.

[6] Andrew Ng. CS294A lecture notes, topic: "Sparse autoencoder." Stanford University, Jan. 11, 2011. Available: http://www.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf. [Accessed Dec. 10, 2013].

[7] Aapo Hyvärinen, Jarmo Hurri, and Patrik O. Hoyer. "Principal components and whitening," in Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, Vol. 39. Springer-Verlag, 2009, pp. 97-137.

[8] Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, and Caroline Suen. "UFLDL Tutorial," April 7, 2013. [Online]. Available: http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial. [Accessed Dec. 10, 2013].

Thank You!
