Unsupervised Feature Learning: A Literature Review
By: Amgad Muhammad & Mohamed EL Fadly
Outline
• Background
• Problem Definition
• Unsupervised Feature Learning
• Our Work
• Sparse Auto-encoder
• Preprocessing: PCA and Whitening
• Self-Taught Learning and Unsupervised Feature Learning
• References
Background
• Machine learning is one of the cornerstone fields of Artificial Intelligence, in which machines learn to act autonomously and react to new situations without being pre-programmed.
• Machine learning has seen numerous successes, but applying learning algorithms today often means spending a long time hand-engineering the input feature representation. This is true for many problems in vision, audio, NLP, robotics, and other areas.
• There are many types of learning algorithm, among them [1]:
1) Supervised learning
2) Unsupervised learning
Problem Definition
• Supervised learning targets two main kinds of task:
• Regression
• Classification
• The first step in training a machine with supervised learning is collecting the data set, which in most cases is a difficult and expensive process.
• The alternative approach is to measure and use everything, which leads to other problems, e.g. noisy data. [2]
Unsupervised feature learning
• The unsupervised feature learning approach learns a higher-level representation of the features of unlabeled data by detecting patterns with various algorithms, e.g. the sparse coding algorithm. [3]
• It is a self-taught learning framework developed to transfer knowledge from unlabeled data, which is much easier to obtain, and to use that knowledge as a preprocessing step to enhance supervised inductive models.
• This framework was developed to tackle present issues in the supervised learning model and to increase its accuracy regardless of the domain of interest (vision, sound, or text). [4]
Our Work
• We present some of the methods for unsupervised feature learning and deep learning, each of which automatically learns a good representation of the input from unlabeled data.
• We concentrate on the following algorithms, with more details in the following slides:
• Sparse Autoencoder
• PCA and Whitening
• Self-Taught Learning
• We also focus on applying these algorithms to learn features from images.
Sparse Autoencoders
Sparse Auto-encoder
An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation to a set of unlabeled training examples $\{x^{(1)}, x^{(2)}, \ldots\}$, setting the target values to be equal to the inputs, i.e. it uses $y^{(i)} = x^{(i)}$. [6]
Autoencoder [6]
Neural Network
Before we get further into the details of the algorithm, we need to quickly go through neural networks.
To describe neural networks, we begin with the simplest possible one: a network that comprises a single "neuron." We use the following diagram to denote a single neuron. [5]
Single Neuron [8]
Neural Network
This "neuron" is a computational unit that takes as input $x_1, x_2, x_3$ (and a +1 intercept term), and outputs
$h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$,
where $f$ is called the activation function. [5]
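As a rough illustration of the single-neuron computation, here is a minimal Python sketch. It assumes a sigmoid activation (one of the two choices covered in this deck); the weights, bias, and input values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b, f=sigmoid):
    """Compute f(sum_i w_i * x_i + b) for one neuron."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return f(z)


# Hypothetical inputs and parameters (not taken from the slides).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0
print(neuron_output(x, w, b))
```

With these values the weighted sum is zero, so the neuron sits exactly at the midpoint of the sigmoid.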
Sigmoid Activation Function
The activation function can be: [8]
1) Sigmoid function: $f(z) = \frac{1}{1 + e^{-z}}$, with output in the range (0, 1)
Sigmoid Function [8]
Tanh Activation Function
2) Tanh function: $f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with output in the range (-1, 1)
Tanh Function [8]
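The two activation functions can be compared side by side with a short Python sketch. The function names are illustrative, and tanh is written out from its definition rather than taken from `math.tanh` so that the formula is visible.

```python
import math


def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """f(z) = (e^z - e^-z) / (e^z + e^-z); output lies in (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


# Evaluate both activations at a few sample points.
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z))
```

The two are closely related: tanh is a rescaled sigmoid, tanh(z) = 2·sigmoid(2z) − 1.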
Neural Network Model
• A neural network is put together by hooking together many of our simple "neurons," so that the output of one neuron can be the input of another. For example, the figure shows a small neural network.
• The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer. The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. [8]
Small Neural Network [8]
Neural Network Model
Neural network parameters are:
• $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where $W^{(l)}_{ij}$ denotes the parameter (or weight) associated with the connection between unit j in layer l and unit i in layer l+1.
• $b^{(l)}_i$ is the bias associated with unit i in layer l+1.
• $a^{(l)}_i$ denotes the activation (meaning output value) of unit i in layer l.
Given a fixed setting of the parameters W, b, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number.
Specifically, the computation that this neural network represents is given by [8]:
$a^{(2)} = f\left(W^{(1)} x + b^{(1)}\right)$
$h_{W,b}(x) = a^{(3)} = f\left(W^{(2)} a^{(2)} + b^{(2)}\right)$
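The forward computation for a network with parameters (W(1), b(1), W(2), b(2)) can be sketched in a few lines of Python. This is only an illustration: it assumes a sigmoid activation, a 3-3-1 layout (three inputs, three hidden units, one output), and made-up parameter values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(W, b, a):
    """Return f(W a + b) for one layer; W is a list of rows, one row per output unit."""
    return [
        sigmoid(sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i)
        for row, b_i in zip(W, b)
    ]


def forward(x, W1, b1, W2, b2):
    """Forward pass: a2 are the hidden activations, a3 is the network output."""
    a2 = layer(W1, b1, x)
    a3 = layer(W2, b2, a2)
    return a2, a3


# Hypothetical parameters for a 3-3-1 network.
W1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1], [0.5, 0.0, -0.5]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.0]
hidden, output = forward([1.0, 2.0, 3.0], W1, b1, W2, b2)
print(hidden, output)
```

Row i of each weight matrix holds the weights feeding unit i of the next layer, which matches the index convention for $W^{(l)}_{ij}$.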
Autoencoders and Sparsity
• The autoencoder tries to learn a function $h_{W,b}(x) \approx x$. In other words, it tries to learn an approximation to the identity function, so that its output $\hat{x}$ is similar to $x$.
• Placing constraints on the network, such as limiting the number of hidden units or imposing a sparsity constraint on the hidden units, leads it to discover interesting structure in the data, even if the number of hidden units is large. [6]
Autoencoders and Sparsity Algorithm
Assumptions:
1. We want the neurons to be inactive most of the time. A neuron is "active" (or "firing") if its output value is close to 1, and "inactive" if its output value is close to 0; this assumes the activation function is the sigmoid function. [6]
2. Recall that $a^{(2)}_j$ denotes the activation of hidden unit j in the autoencoder.
3. We write $a^{(2)}_j(x)$ to denote the activation of this hidden unit when the network is given a specific input $x$.
4. Let $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a^{(2)}_j(x^{(i)})$ be the average activation of hidden unit j (averaged over the training set).
Objective:
We would like to (approximately) enforce the constraint $\hat{\rho}_j = \rho$, where $\rho$ is a sparsity parameter, a small value close to zero. [6]
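The average activation of each hidden unit is just a column mean over the training set. A minimal sketch, assuming the hidden activations have already been computed and stored as one row per training example (the numbers below are invented for illustration):

```python
def average_activations(hidden_activations):
    """hidden_activations[i][j] is the activation of hidden unit j on example i.

    Returns a list whose j-th entry is the average activation of unit j.
    """
    m = len(hidden_activations)
    n_hidden = len(hidden_activations[0])
    return [
        sum(row[j] for row in hidden_activations) / m
        for j in range(n_hidden)
    ]


# Hypothetical activations: 4 training examples, 2 hidden units.
acts = [
    [0.0, 1.0],
    [0.5, 1.0],
    [0.25, 0.0],
    [0.25, 0.0],
]
rho_hat = average_activations(acts)
print(rho_hat)
```

A unit whose average is far above the sparsity parameter is firing too often and would be penalized.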
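Autoencoders and Sparsity Algorithm – cont’d
• To achieve this, we add an extra penalty term to our optimization objective that penalizes $\hat{\rho}_j$ deviating significantly from $\rho$:
$\sum_{j=1}^{s_2} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]$
Here $s_2$ is the number of neurons in the hidden layer, and the index j sums over the hidden units in the network. [6]
• It can also be written $\sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}$ is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat{\rho}_j$. [6]
• KL divergence is a standard function for measuring how different two distributions are.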
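The sparsity penalty translates almost directly into code. The sketch below is illustrative: the function names are our own, and the average activations passed in are invented values.

```python
import math


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_penalty(rho, rho_hats):
    """Sum of KL(rho || rho_hat_j) over all hidden units j."""
    return sum(kl_bernoulli(rho, r) for r in rho_hats)


# Hypothetical average activations for three hidden units.
print(sparsity_penalty(0.2, [0.2, 0.25, 0.6]))
```

Only the second and third units contribute here, because the first one already matches the target sparsity.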
Autoencoders and Sparsity Algorithm – cont’d
• The KL penalty function has the following property: $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$, and otherwise it increases monotonically as $\hat{\rho}_j$ diverges from $\rho$.
• For example, if we plot $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ for a range of values of $\hat{\rho}_j$ (with $\rho = 0.2$), we see that the KL divergence reaches its minimum of 0 at $\hat{\rho}_j = \rho$ and approaches ∞ as $\hat{\rho}_j$ approaches 0 or 1.
• Thus, minimizing this penalty term has the effect of causing $\hat{\rho}_j$ to be close to $\rho$. [6]
KL Function [6]
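The shape of the KL curve can be checked numerically instead of plotting it. This sketch fixes the sparsity parameter at 0.2, as in the slide, and scans a grid of candidate average activations; the grid spacing is an arbitrary choice.

```python
import math


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


rho = 0.2
# Candidate average activations 0.01, 0.02, ..., 0.99 (0 and 1 are excluded
# because the divergence is unbounded there).
grid = [i / 100 for i in range(1, 100)]
values = [kl_bernoulli(rho, r) for r in grid]

best = grid[values.index(min(values))]
print("minimum at", best)
print("near 0:", values[0], "near 1:", values[-1])
```

The minimum sits at the sparsity parameter itself, and the values at both ends of the grid are much larger than anything near the middle.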
Autoencoders and Sparsity Algorithm – cont’d
• The overall cost function is now $J_{\text{sparse}}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $J(W,b)$ is the neural network cost function and $\beta$ controls the weight of the sparsity penalty term.
• The term $\hat{\rho}_j$ depends on W, b also, because it is the average activation of hidden unit j, and the activation of a hidden unit depends on the parameters W, b. [6]
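Combining the pieces, the sparse cost is the ordinary network cost plus a weighted KL penalty. The sketch below takes the base cost as a precomputed number, since its exact form is not spelled out in these slides; the values of rho and beta are placeholders, not the settings used in our experiments.

```python
import math


def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparse_cost(base_cost, rho_hats, rho, beta):
    """J_sparse = J + beta * sum_j KL(rho || rho_hat_j).

    base_cost is J(W, b), assumed to be computed elsewhere.
    """
    return base_cost + beta * sum(kl_bernoulli(rho, r) for r in rho_hats)


# Hypothetical numbers: base cost 1.5, two hidden units, placeholder rho and beta.
print(sparse_cost(1.5, [0.05, 0.30], rho=0.05, beta=3.0))
```

When every hidden unit already has average activation equal to rho, the penalty vanishes and the sparse cost reduces to the base cost.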
Autoencoder Implementation
• We implemented a sparse autoencoder, trained on 8×8 image patches using the L-BFGS optimization algorithm.
Step 1: Generate training set
The first step is to generate a training set by sampling 8×8 patches from the image dataset.
A random sample of 200 patches from the dataset.
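Patch sampling of this kind can be sketched as follows. This is not our original implementation: the images here are small synthetic grids standing in for the real dataset, and the function name, the seed, and the choice to flatten each patch into a 64-element list are assumptions made for the example.

```python
import random


def sample_patches(images, num_patches, patch_size=8, seed=0):
    """Draw random patch_size x patch_size patches, each flattened row by row."""
    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        img = rng.choice(images)
        rows, cols = len(img), len(img[0])
        # Pick the top-left corner so the patch fits inside the image.
        r = rng.randrange(rows - patch_size + 1)
        c = rng.randrange(cols - patch_size + 1)
        patch = [img[r + i][c + j]
                 for i in range(patch_size)
                 for j in range(patch_size)]
        patches.append(patch)
    return patches


# Synthetic stand-in data: two 16x16 "images" with intensities in [0, 1].
images = [
    [[((r * 16 + c + k) % 256) / 255.0 for c in range(16)] for r in range(16)]
    for k in range(2)
]
patches = sample_patches(images, num_patches=200)
print(len(patches), len(patches[0]))
```

Each flattened patch has 64 values, which matches the 64 input units of the autoencoder.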
Autoencoder Implementation
Step 2: Sparse autoencoder objective
Compute the sparse autoencoder cost function $J_{\text{sparse}}(W,b)$ and the corresponding derivatives of $J_{\text{sparse}}$ with respect to the different parameters.
Step 3: Train the sparse autoencoder
After computing $J_{\text{sparse}}$ and its derivatives, we minimize $J_{\text{sparse}}$ with respect to its parameters, and thereby train our sparse autoencoder. We trained our sparse autoencoder with the L-BFGS algorithm. Our neural network for training has 64 input units, 25 hidden units, and 64 output units.
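An optimizer such as L-BFGS typically works on a single flat parameter vector, so it helps to know how many parameters a 64-25-64 autoencoder has. The sketch below only does that bookkeeping; the dictionary layout and function names are illustrative and are not taken from our code.

```python
import math


def autoencoder_shapes(n_input, n_hidden):
    """Shapes of W1, b1, W2, b2 for an autoencoder whose output size equals its input size."""
    return {
        "W1": (n_hidden, n_input),
        "b1": (n_hidden,),
        "W2": (n_input, n_hidden),
        "b2": (n_input,),
    }


def num_parameters(shapes):
    """Total number of scalar parameters across all arrays."""
    return sum(math.prod(shape) for shape in shapes.values())


shapes = autoencoder_shapes(n_input=64, n_hidden=25)
print(num_parameters(shapes))  # 1600 + 25 + 1600 + 64 = 3289
```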
Autoencoder Implementation Results
After training, the sparse autoencoder successfully learned a set of edge detectors.
CPU: Intel Core i7 quad-core processor, 2.7 GHz
RAM: 6 GB
Training set: 200 image patches, 8×8
Neural network for training: 64 input units, 25 hidden units, and 64 output units
Training time: 39 seconds
Expected time [1]: less than a minute
Principal Component Analysis – PCA
• PCA is a dimensionality reduction technique used to eliminate highly correlated variables without sacrificing much detail. [7]
PCA – Example
• Consider the 2D data example.
• This data has already been pre-processed using mean normalization.
• We want to find the principal directions of variation.
2D data example [8]
PCA – Example (Cont’d)
As depicted, $u_1$ is the strongest direction of variation and $u_2$ is the second strongest direction of variation. [8]
2D data example with $u_1$ and $u_2$ [8]
PCA – Math
• To compute the principal directions of variation, we first compute the covariance matrix $\Sigma$ of the data points as follows:
• $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} \left(x^{(i)}\right)^T$, where m is the number of points.
• It can be proven that $u_1$ is the top eigenvector of $\Sigma$ and $u_2$ is the second eigenvector. [8]
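For 2D data the covariance matrix is 2×2 and symmetric, so its eigenvectors have a closed form. The sketch below uses that shortcut to stay dependency-free; a real implementation would more likely call a linear algebra library. The four data points are made up and already have zero mean.

```python
import math


def covariance_2d(points):
    """Entries (sxx, sxy, syy) of Sigma = (1/m) * sum_i x_i x_i^T for zero-mean 2D points."""
    m = len(points)
    sxx = sum(x * x for x, _ in points) / m
    sxy = sum(x * y for x, y in points) / m
    syy = sum(y * y for _, y in points) / m
    return sxx, sxy, syy


def eigen_sym_2x2(sxx, sxy, syy):
    """Eigenvalues (largest first) and unit eigenvectors of [[sxx, sxy], [sxy, syy]]."""
    mean = (sxx + syy) / 2.0
    radius = math.hypot((sxx - syy) / 2.0, sxy)
    lam1, lam2 = mean + radius, mean - radius
    if sxy != 0.0:
        v1 = (sxy, lam1 - sxx)
    elif sxx >= syy:
        v1 = (1.0, 0.0)
    else:
        v1 = (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    u1 = (v1[0] / norm, v1[1] / norm)
    u2 = (-u1[1], u1[0])  # perpendicular to u1
    return (lam1, lam2), (u1, u2)


# Hypothetical zero-mean data stretched along the diagonal.
points = [(2.0, 2.0), (-2.0, -2.0), (1.0, -1.0), (-1.0, 1.0)]
(lam1, lam2), (u1, u2) = eigen_sym_2x2(*covariance_2d(points))
print(lam1, lam2, u1, u2)
```

For this data the top eigenvector points along the diagonal, which is the direction in which the points are most spread out.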
PCA – Math
• Having calculated the eigenvectors, we construct a matrix $U = [u_1\ u_2\ \cdots\ u_n]$ whose columns are the eigenvectors.
• Retaining all the eigenvectors of the covariance matrix in $U$ makes $U$ a rotation matrix for the input: $x_{\text{rot}} = U^T x$. [8]
2D data example [8]
PCA – Dimensionality Reduction
• If we retain only the top k eigenvectors in the matrix $U$, we can reduce the dimension of the input from n to k. [7]
• How would we choose k?
PCA – Dimensionality Reduction
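• To decide on k, we consider the percentage of variance retained (PVR).
• A covariance matrix $\Sigma$ with n eigenvectors has n eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and $\text{PVR} = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{n} \lambda_j}$.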
• For images, for example, it is common to choose the smallest k for which $\text{PVR} \ge 0.99$. [7]
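Choosing k from the percentage of variance retained can be sketched as a running sum over the sorted eigenvalues. The threshold is passed in as a parameter; the 0.99 used in the example call and the eigenvalues themselves are illustrative values.

```python
def choose_k(eigenvalues, target):
    """Smallest k whose top-k eigenvalues retain at least `target` of the total variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for k, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= target:
            return k
    return len(lams)


# Hypothetical eigenvalues of a covariance matrix.
print(choose_k([95.0, 4.5, 0.5], target=0.99))
```

Here the first component alone retains 95% of the variance, so a second component is needed to reach the threshold.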
Whitening
• In PCA, we got rid of highly correlated features to reduce the feature dimensionality.
• The next step is to make sure that all the features have unit variance.
• With PCA, we transformed our input $x$ into $x_{\text{rot}} = U^T x$.
• By scaling every feature $x_{\text{rot},i}$ by $1/\sqrt{\lambda_i}$, where $\lambda_i$ is the corresponding eigenvalue for component i, we get features with unit variance. [8]
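Whitening can be checked on a tiny example: rotate each point onto the eigenvectors, divide by the square root of the matching eigenvalue, and confirm that every resulting feature has unit variance. The data, eigenvectors, and eigenvalues below are hypothetical but mutually consistent (the eigenpairs belong to the covariance of these four zero-mean points).

```python
import math


def pca_whiten(points, eigvecs, eigvals):
    """Project each point onto the unit eigenvectors, then scale by 1/sqrt(eigenvalue)."""
    whitened = []
    for x in points:
        x_rot = [sum(u_d * x_d for u_d, x_d in zip(u, x)) for u in eigvecs]
        whitened.append([r / math.sqrt(lam) for r, lam in zip(x_rot, eigvals)])
    return whitened


def variance(values):
    """Variance of zero-mean values."""
    return sum(v * v for v in values) / len(values)


s = math.sqrt(0.5)
points = [(2.0, 2.0), (-2.0, -2.0), (1.0, -1.0), (-1.0, 1.0)]
eigvecs = [(s, s), (-s, s)]  # u1, u2
eigvals = [4.0, 1.0]         # lambda1, lambda2

white = pca_whiten(points, eigvecs, eigvals)
var1 = variance([w[0] for w in white])
var2 = variance([w[1] for w in white])
print(var1, var2)
```

This sketch divides by the square root of each eigenvalue directly, so it assumes none of them is zero.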
Self-Taught Learning
Self-Taught Learning and Unsupervised Feature Learning
Given an unlabeled data set, we can train a sparse autoencoder to extract features that give us a better, more condensed representation of the data.
Neural Network [8]
• Once training is done, the network is ready to find better features to represent the input, using the activations of the network's hidden layer. [8]
• Given a labeled training set $\{(x_l^{(1)}, y^{(1)}), (x_l^{(2)}, y^{(2)}), \ldots\}$, we feed the autoencoder the input $x_l$ to get the activation $a_l$ of the trained hidden layer. [8]
• Given the activation $a_l$, we can either replace the labeled training input $x_l$ with $a_l$, or concatenate the features extracted by the autoencoder with the original labeled features to get $(x_l, a_l)$. [8]
Input layer of Neural Network [8]
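The two ways of using the learned features, replacing the input or concatenating with it, can be sketched as below. The weights stand in for a trained autoencoder's first layer and are invented for the example, as are the function names and the `mode` parameter.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_activations(x, W1, b1):
    """Activations of the (already trained) hidden layer for input x."""
    return [
        sigmoid(sum(w * x_i for w, x_i in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]


def build_features(x, W1, b1, mode="replace"):
    """Return the activations alone ("replace") or the input followed by them ("concat")."""
    a = hidden_activations(x, W1, b1)
    if mode == "replace":
        return a
    if mode == "concat":
        return list(x) + a
    raise ValueError("mode must be 'replace' or 'concat'")


# Hypothetical trained weights: 3 inputs, 2 hidden units.
W1 = [[0.2, -0.1, 0.0], [0.0, 0.3, -0.3]]
b1 = [0.0, 0.1]
x = [1.0, 2.0, 3.0]
print(build_features(x, W1, b1, mode="replace"))
print(build_features(x, W1, b1, mode="concat"))
```

Replacing gives a feature vector as long as the hidden layer; concatenating keeps the raw input and appends the learned features.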
Self-Taught Learning Application
• We used the self-taught learning paradigm with the sparse autoencoder and a softmax classifier to build a classifier for handwritten digits.
• The goal is to distinguish between the digits 0 to 4. We use the digits 5 to 9 as our "unlabeled" dataset; we then use a labeled dataset of the digits 0 to 4 to train the softmax classifier.
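The data split described here can be sketched as a simple filter on the labels. The example uses toy stand-in records rather than real MNIST images, and it assumes the labeled half is divided evenly into training and testing portions.

```python
def split_self_taught(examples):
    """Split (image, label) pairs into unlabeled, labeled-train and labeled-test sets.

    Digits 5-9 have their labels discarded and serve as unlabeled data;
    digits 0-4 keep their labels and are split in half.
    """
    unlabeled = [image for image, label in examples if label >= 5]
    labeled = [(image, label) for image, label in examples if label <= 4]
    half = len(labeled) // 2
    return unlabeled, labeled[:half], labeled[half:]


# Toy stand-in data: 20 fake "images" with labels cycling through 0-9.
examples = [("img%d" % i, i % 10) for i in range(20)]
unlabeled, train, test = split_self_taught(examples)
print(len(unlabeled), len(train), len(test))
```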
Self-Taught Learning Implementation
Step 1: Generate the input and test data sets
We used the datasets from the MNIST Handwritten Digit Database for this project.
Step 2: Train the sparse autoencoder
We used the unlabeled data (the digits 5 to 9) to train a sparse autoencoder. Once training is complete, visualizing the learned features shows pen-stroke-like patterns.
Step 3: Extract features
After the sparse autoencoder is trained, we use it to extract features from the handwritten digit images.
Step 4: Train and test the softmax classifier
We train a softmax classifier using the training set features and labels, and finally compute the predictions and the accuracy.
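The final prediction and accuracy step can be sketched as follows. Training the softmax weights is omitted; the weight matrix `theta`, the feature vectors, and the labels are all invented for illustration.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, theta):
    """Index of the most probable class; theta[c] is the weight vector for class c."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in theta]
    probs = softmax(scores)
    return probs.index(max(probs))


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical two-class weights and three test examples.
theta = [[1.0, 0.0], [0.0, 1.0]]
test_features = [[0.9, 0.1], [0.2, 0.7], [0.6, 0.4]]
test_labels = [0, 1, 1]
preds = [predict(f, theta) for f in test_features]
print(preds, accuracy(preds, test_labels))
```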
Self-Taught Learning Setup Environment
CPU: Intel Core i7 quad-core processor, 2.7 GHz
RAM: 6 GB
Training set: 60,000 examples from the MNIST database
Unlabeled set: 29,404 examples
Supervised training set: 15,298 examples
Supervised testing set: 15,298 examples
Self-Taught Learning Results
After training is complete, a visualization of the learned features shows pen-stroke-like patterns.
Self-Taught Learning Analysis
We compared our application's outputs with the outputs of the Stanford course tutorial [8].
                             Our classifier   Tutorial’s classifier
Training time                16 minutes       25 minutes
Classifier score (accuracy)  98.208916%       98%
Future Work
We propose parallelizing our code or running the training on a GPU, for example; this should improve performance and decrease the time needed to train the classifier.
References
[1] Taiwo Oladipupo Ayodele. New Advances in Machine Learning. InTech, 2010.
[2] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. Supervised machine learning: A review of classification techniques. 31:249-268, 2007.
[3] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems, pages 801-808, 2006.
[4] Bruno A. Olshausen et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, 1996.
[5] Simon O. Haykin, “Multilayer Perceptron,” in Neural Networks and Learning Machines, 3rd ed., Prentice Hall, 2009.
[6] Andrew Ng. CS294A Lecture notes, Topic: “Sparse autoencoder,” Stanford University, Jan. 11, 2011. Available: http://www.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf. [Accessed Dec. 10, 2013].
[7] Aapo Hyvärinen, Jarmo Hurri, and Patrik O. Hoyer, “Principal components and whitening,” in Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, Vol. 39, Springer-Verlag, 2009, pp. 97-137.
[8] Andrew Ng, Jiquan Ngiam, Chuan Yu Foo, Yifan Mai, and Caroline Suen, “UFLDL Tutorial,” April 7, 2013. [Online]. Available: http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial. [Accessed Dec. 10, 2013].
Thank You!