Deep Learning Bkk 03 03



    Stacking RBMs and Auto-encoders

    for Deep Architectures

    References: [Bengio, 2009], [Vincent et al., 2008]

    2011/03/03


    Introduction

    Deep architectures for various levels of representations

    Implicitly learn representations

    Layer-by-layer unsupervised training

    Generative model

    Stack Restricted Boltzmann Machines (RBMs) to form a Deep Belief Network (DBN)

    Discriminative model

    Stack Auto-encoders (AEs)

    Multi-layered classifier


    Generative Model

    Given a training set $\{x_i\}_{i=1}^{n}$,

    Construct a generative model that produces samples from the same distribution

    Start with sigmoid belief networks

    Need parameters for each component of the top-most layer, i.e. Bernoulli priors
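    As a hedged sketch of the sigmoid belief network referred to here (standard notation following [Bengio, 2009]; the symbols $h^k$, $W^k$, $b^k$ are assumed, not from the slide):

    $P(x, h^1, \ldots, h^l) = P(h^l) \prod_{k=1}^{l} P(h^{k-1} \mid h^k), \quad h^0 = x$

    $P(h^{k-1}_i = 1 \mid h^k) = \mathrm{sigm}\big(b^{k-1}_i + \sum_j W^k_{ij} h^k_j\big)$

    with an independent Bernoulli prior $P(h^l_i = 1) = p_i$ on each component of the top-most layer.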


    Deep Belief Network

    Same as sigmoid BN, but with different top-layer structure

    Use RBM to model the top layer

    Restricted Boltzmann Machine: (More on next slide)

    Divided into hidden and visible layers (2 levels)

    Connections form a bipartite graph

    Called Restricted because there are no connections among same-layer units
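    The resulting joint, as a sketch in the same assumed notation (this is the standard DBN factorization in [Bengio, 2009]):

    $P(x, h^1, \ldots, h^l) = P(h^{l-1}, h^l) \prod_{k=1}^{l-1} P(h^{k-1} \mid h^k)$

    where $P(h^{l-1}, h^l)$ is the top-level RBM and the remaining factors are sigmoid belief network layers, with $h^0 = x$.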


    Restricted Boltzmann Machines

    Energy-based model for hidden-visible joint distribution

    Or express it as a distribution over the visible variables:
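    A hedged reconstruction of the standard formulation (following [Bengio, 2009], with weight matrix W and biases b, c, matching the parameters $\theta = \{W, b, c\}$ used later):

    $E(x, h) = -b^{\top} x - c^{\top} h - h^{\top} W x$

    $P(x, h) = \frac{e^{-E(x, h)}}{Z}, \qquad Z = \sum_{x, h} e^{-E(x, h)}$

    $P(x) = \frac{1}{Z} \sum_{h} e^{-E(x, h)}$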


    RBMs (Contd)

    How posteriors factorize: notice how the energy is of the form

    Then the posterior factorizes over the hidden units (see the sketch below)
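    A hedged reconstruction in the notation of [Bengio, 2009] (the symbols $\beta$ and $\gamma_j$ are assumed): the energy decomposes with one term per hidden unit,

    $E(x, h) = -\beta(x) + \sum_{j} \gamma_j(x, h_j)$

    so the terms involving different hidden units separate and

    $P(h \mid x) = \prod_{j} P(h_j \mid x)$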


    More on Posteriors

    Using the same factorization trick, we can compute the posterior:

    Posterior on visible units can be derived similarly

    Due to factorization, Gibbs sampling is easy:

    This is just the sigmoid function for binomial h
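    Concretely, for binary units with the energy above (a sketch; the convention that $W_{ji}$ connects hidden unit j to visible unit i is assumed):

    $P(h_j = 1 \mid x) = \mathrm{sigm}\big(c_j + \sum_i W_{ji} x_i\big)$

    $P(x_i = 1 \mid h) = \mathrm{sigm}\big(b_i + \sum_j W_{ji} h_j\big)$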


    Training RBMs

    Given parameters θ = {W, b, c}

    Compute log-likelihood gradient for steepest ascent method

    The first term is OK, but the second term is intractable due to the partition function

    Use k-step Gibbs sampling to approximately sample for the second term

    k=1 performs well empirically

    $\frac{\partial \log p(x)}{\partial \theta} = -\sum_{h} p(h \mid x)\,\frac{\partial E(x, h)}{\partial \theta} + \sum_{\tilde{x}, \tilde{h}} p(\tilde{x}, \tilde{h})\,\frac{\partial E(\tilde{x}, \tilde{h})}{\partial \theta}$
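    A minimal sketch of one CD-k update in NumPy, assuming the binary-RBM conditionals above; the function name, learning rate, and exact update form are illustrative rather than taken from the slides:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd_k_update(W, b, c, x, k=1, lr=0.01, rng=np.random.default_rng()):
        # W: (n_hidden, n_visible), b: visible bias, c: hidden bias, x: one binary sample.
        # Positive phase: posterior over hidden units given the data.
        ph_data = sigmoid(c + W @ x)

        # Negative phase: k steps of block Gibbs sampling starting from x.
        v = x.copy()
        for _ in range(k):
            h = (rng.random(c.shape) < sigmoid(c + W @ v)).astype(float)
            v = (rng.random(b.shape) < sigmoid(b + W.T @ h)).astype(float)
        ph_model = sigmoid(c + W @ v)

        # Stochastic approximation of the log-likelihood gradient:
        # data-dependent term minus sampled model term, applied as steepest ascent.
        W += lr * (np.outer(ph_data, x) - np.outer(ph_model, v))
        b += lr * (x - v)
        c += lr * (ph_data - ph_model)
        return W, b, c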


    Training DBNs

    Every time we see a sample x, we lower the energy of the distribution at that point

    Start from the bottom layer, move up, and train each layer unsupervised

    Each layer has its own set of parameters

    *Q(·) is the RBM posterior for the hidden variables


    How to sample from DBNs

    1. Sample a visible $h^{l-1}$ from the top-level RBM (using Gibbs)

    2. For k = l-1 down to 1: sample $h^{k-1} \sim P(\cdot \mid h^k)$ from the DBN model

    3. $x = h^0$ is the final sample
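    A sketch of this procedure in NumPy; the parameter layout (a top-level RBM (W, b, c) plus a list of top-down generative weights) is an assumption for illustration:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def sample_dbn(top_rbm, down_layers, n_gibbs=1000, rng=np.random.default_rng()):
        # top_rbm: (W, b, c) of the top-level RBM over (h^{l-1}, h^l);
        # down_layers: [(W_k, b_k), ...] generative weights for P(h^{k-1} | h^k), top to bottom.
        W, b, c = top_rbm

        # 1. Gibbs-sample a "visible" h^{l-1} from the top-level RBM.
        v = (rng.random(b.shape) < 0.5).astype(float)
        for _ in range(n_gibbs):
            h = (rng.random(c.shape) < sigmoid(c + W @ v)).astype(float)
            v = (rng.random(b.shape) < sigmoid(b + W.T @ h)).astype(float)

        # 2. Propagate down through the sigmoid belief network layers.
        sample = v
        for W_k, b_k in down_layers:
            sample = (rng.random(b_k.shape) < sigmoid(b_k + W_k @ sample)).astype(float)

        # 3. The bottom-layer sample is x = h^0.
        return sample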


    Discriminative Model

    Receives an input x to classify

    Unlike DBNs, which didn't have inputs

    A multi-layer neural network should do

    Use auto-encoders to discover compact representations

    Use denoising AEs to add robustness to corruption


    Auto-encoders

    A neural network where Input = Output

    Hence its name, "auto"

    But has one hidden layer for the input representation

    [Figure: auto-encoder with d-dimensional input x, d'-dimensional hidden representation y (d' < d is necessary to avoid learning the identity function), and reconstruction z]


    AE Mechanism

    Parameterize each layer with parameters θ = {W, b}

    Aim to reconstruct the input by minimizing reconstruction error

    where,

    Can train in an unsupervised way

    for any x in the training set, train the AE to reconstruct x

    $y = f_{\theta}(x) = s(Wx + b)$

    $z = g_{\theta'}(y) = s(W'y + b')$

    $L(x, z) = \lVert x - z \rVert^{2}$
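    A minimal NumPy sketch of these equations (the sigmoid s and squared error follow the slide; the function name is illustrative):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def ae_reconstruction_error(x, W, b, W_prime, b_prime):
        y = sigmoid(W @ x + b)                # encoder: y = f_theta(x) = s(Wx + b)
        z = sigmoid(W_prime @ y + b_prime)    # decoder: z = g_theta'(y) = s(W'y + b')
        return np.sum((x - z) ** 2)           # L(x, z) = ||x - z||^2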


    Denoising Auto-encoders

    Also need to be robust to missing data

    Same structure as regular AE

    But train against corrupted inputs

    Arbitrarily remove a fixed portion of the input components

    Rationale: learning the latent structure is important for rebuilding the missing data

    The hidden layer will learn the structural representation
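    A sketch of the corruption step, mirroring the auto-encoder above: zero out a fixed fraction of components but measure the error against the clean x (zero-masking noise as in [Vincent et al., 2008]; names and the 0.25 fraction are illustrative):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_reconstruction_error(x, W, b, W_prime, b_prime,
                                 corrupt_frac=0.25, rng=np.random.default_rng()):
        x_tilde = x.copy()
        drop = rng.choice(x.size, size=int(corrupt_frac * x.size), replace=False)
        x_tilde[drop] = 0.0                   # arbitrarily remove a fixed portion of components
        y = sigmoid(W @ x_tilde + b)          # encode the corrupted input
        z = sigmoid(W_prime @ y + b_prime)    # decode
        return np.sum((x - z) ** 2)           # error is measured against the clean input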


    Training Stacked DAEs

    Stack the DAEs to form a deep architecture

    Take each DAE's hidden layer

    This hidden layer becomes the next layer of the deep network

    Training is simple. Given training set $\{(x_i, y_i)\}$,

    Initialize each layer (sequentially) in an unsupervised fashion

    Each layer's output is fed as input to the next layer

    Finally, fine-tune the entire architecture with supervised learning using the training set
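    A sketch of the greedy layer-wise procedure; train_dae is a placeholder for a routine that fits one DAE (e.g. by gradient descent on the error above) and returns its learned encoder weights, so everything here is illustrative:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def pretrain_stack(X, hidden_sizes, train_dae):
        # X: (n_samples, d) inputs; hidden_sizes: layer widths, bottom to top.
        params, H = [], X
        for size in hidden_sizes:
            W, b = train_dae(H, size)      # unsupervised: fit a DAE on the current representation
            params.append((W, b))
            H = sigmoid(H @ W.T + b)       # this hidden layer's output feeds the next layer
        return params, H

    # H (the top-level representation) together with the labels y_i would then be used to
    # fine-tune the whole stack with supervised learning (e.g. backpropagation).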


    References

    [Bengio, 2009] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, Vol. 2, No. 1, 2009.

    [Vincent et al., 2008] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of ICML 2008.