why deep learning works: perspectives from theoretical chemistry





Problem: How can SGD possibly work? Aren’t Neural Nets non-Convex?!


can Spin Glass models suggest why ?

what other models are out there ?

expected observed ?

Random Energy Model (REM)

Temperature, regularization and the glass transition

extending REM: Spin Glass of Minimal Frustration

protein folding analogy: Funneled Energy Landscapes

example: Dark Knowledge

Recent work: Spin Glass models for Deep Nets

condensed matter theory is about qualitative analogies

we may seek a toy model

a mean field theory

a phenomenological description

What problem is Deep Learning solving ?


minimize cross-entropy

Problem: What is a good theoretical model for deep networks ?


p-spherical spin glass

LeCun … 2015

L: Hamiltonian (energy function)

X: Gaussian random variables

w: real-valued spins, subject to a spherical constraint

H >= 3 (plays the role of p)

can be solved analytically, simulated easily
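A minimal sketch of "simulated easily", for p = 3. The normalization N^{-(p-1)/2} and the constraint sum(w_i^2) = N are assumed conventions, and the names `spherical_spins` and `p3_energy` are illustrative, not taken from the cited work.

```python
import numpy as np

def spherical_spins(n, rng):
    """Draw a random spin vector w with sum(w_i^2) = n (the spherical constraint)."""
    w = rng.standard_normal(n)
    return w * np.sqrt(n) / np.linalg.norm(w)

def p3_energy(couplings, w):
    """p = 3 Hamiltonian: N^{-(p-1)/2} * sum_{ijk} X_ijk w_i w_j w_k."""
    n = w.shape[0]
    return np.einsum("ijk,i,j,k->", couplings, w, w, w) / n ** ((3 - 1) / 2)

rng = np.random.default_rng(0)
n = 30
couplings = rng.standard_normal((n, n, n))  # X: i.i.d. Gaussian couplings
w = spherical_spins(n, rng)
energy = p3_energy(couplings, w)
print(f"energy per spin: {energy / n:.3f}")
```

Because p = 3 is odd, flipping every spin flips the sign of the energy, which is a quick sanity check on the implementation.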

What is a spin glass ?


Frustration: constraints that cannot all be satisfied at once

J = X = weights

S = w = spins

Energetically: all spins should be paired
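The smallest example of frustration is three Ising spins on a triangle with antiferromagnetic couplings: each pair wants to be anti-aligned, but no assignment achieves that for all three pairs. A sketch that enumerates all 8 states (the triangle and the coupling values are an illustrative choice, not from the slides):

```python
from itertools import product

# Antiferromagnetic couplings on a triangle: every pair wants opposite spins.
J = {(0, 1): -1, (1, 2): -1, (0, 2): -1}

def energy(spins):
    """Ising energy E = -sum_{i<j} J_ij * s_i * s_j."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in J.items())

def unsatisfied(spins):
    """Count bonds whose constraint is violated (J_ij * s_i * s_j < 0)."""
    return sum(1 for (a, b), j in J.items() if j * spins[a] * spins[b] < 0)

states = list(product([-1, 1], repeat=3))
e_min = min(energy(s) for s in states)
ground_states = [s for s in states if energy(s) == e_min]
print(e_min, len(ground_states), min(unsatisfied(s) for s in states))
```

The lowest energy is -1 rather than the unfrustrated -3, it is shared by 6 degenerate states, and every state leaves at least one bond unsatisfied.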

why p-spherical spin glass ?


crudely: deep networks (effectively) have no local minima !

[Figure: critical points of the energy landscape, ordered by index k. Labels: floor / ground state; local minima; k = 1, k = 2, k = 3 critical points; saddle points]

the critical points are ordered


any local minimum will do; the ground state is a state of overtraining

good generalization


Early Stopping: to avoid the ground state ?

it’s easy to find the ground state; it’s hard to generalize ?

Current Interpretation


• finding the ground state is easy (sic); generalizing is hard

• finding the ground state is irrelevant: any local minimum will do

• the ground state is a state of overtraining

recent p-spherical spin glass results


actually: recent results (2013) on the behavior (distribution of critical points, concentration of the means) of an isotropic random function on a high dimensional manifold

require: the variables actually concentrate on their means; the weights are drawn from an isotropic random function

related to: old results on TAP solutions (1977); # critical points ~ TAP complexity

avoid local minima? : increase Temperature

harder problem: low Temp behavior of spin glass

What problem is Deep Learning solving ?


minimize cross-entropy of output layer

entropic effects : not just min energy

more like min free energy (divergence)

Statistical Physics and Information Theory: Neri Merhav

the p -> infinity limit of the p-spherical spin glass

A related approach: Random Energy Model (REM)

Random Energy Model (REM)


ground state is governed by Extreme Value Statistics

old result from protein folding theory
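A quick numerical sketch of the extreme-value statement. It assumes Derrida's convention for the REM (2^N independent Gaussian levels, zero mean, variance N/2 with J = 1), for which the leading large-N estimate of the ground state is -N sqrt(ln 2). The function name `rem_energies` is illustrative.

```python
import numpy as np

def rem_energies(n_spins, rng):
    """Sample the 2^N independent Gaussian energy levels of a REM instance.

    Assumed convention: zero mean, variance N/2 (coupling scale J = 1).
    """
    return rng.normal(0.0, np.sqrt(n_spins / 2.0), size=2 ** n_spins)

rng = np.random.default_rng(0)
n_spins = 18
energies = rem_energies(n_spins, rng)
e_ground = energies.min()
e_predicted = -n_spins * np.sqrt(np.log(2.0))  # leading-order extreme-value estimate
print(f"sampled ground state: {e_ground:.2f}, large-N estimate: {e_predicted:.2f}")
```

At this small N the sampled minimum sits somewhat above the asymptotic estimate; the difference is a finite-size (logarithmic) correction.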

REM: What is Temperature ?


We can use statistical mechanics to analyze known algorithms

I don’t mean in the traditional sense of algorithmic analysis

take Ej as the objective = loss function + regularizer

study Z: form a mean field theory; take limits N -> inf, T -> 0
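For a finite list of objective values E_j, Z and the free energy F = -T ln Z can be computed directly. A sketch, with made-up values of E_j standing in for "loss + regularizer" evaluated at a few candidate solutions:

```python
import numpy as np

def free_energy(energies, temperature):
    """F = -T * ln Z with Z = sum_j exp(-E_j / T), via a stable log-sum-exp."""
    scaled = -np.asarray(energies, dtype=float) / temperature
    shift = scaled.max()
    log_z = shift + np.log(np.exp(scaled - shift).sum())
    return -temperature * log_z

# Hypothetical objective values E_j for a handful of candidate solutions.
energies = np.array([0.3, 1.2, 0.9, 2.5, 0.31])
for t in (10.0, 1.0, 0.01):
    print(f"T = {t:5.2f}  F = {free_energy(energies, t):.4f}")
```

As T -> 0 the free energy approaches the minimum energy, so the T -> 0 limit recovers plain minimization; at finite T the entropy of near-optimal states also contributes.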

REM: What is Temperature ?


let E(T) be the effective energy

E(T) = E/T ~ sum of weights*activations

as T -> 0, the effective energies E(T) diverge; the weights explode

Temperature is a proxy for weight constraints

T sets the Energy Scale
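A sketch of how T sets the energy scale: the Boltzmann weights depend only on E/T, so lowering T has the same effect as scaling up the energies. The energy levels below are hypothetical.

```python
import numpy as np

def boltzmann(energies, temperature):
    """Boltzmann weights p_j = exp(-E_j / T) / Z over the effective energies E_j / T."""
    scaled = -np.asarray(energies, dtype=float) / temperature
    weights = np.exp(scaled - scaled.max())  # subtract the max for stability
    return weights / weights.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability states."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

energies = np.array([0.0, 0.5, 1.0, 2.0])  # hypothetical energy levels
for t in (100.0, 1.0, 0.01):
    p = boltzmann(energies, t)
    print(f"T = {t:6.2f}  |E/T| max = {np.abs(energies / t).max():7.2f}  entropy = {entropy(p):.4f}")
```

At high T the distribution is nearly uniform (entropy close to ln 4); at low T the effective energies blow up and the distribution collapses onto the single lowest state.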

Temperature: as Weight Constraints


• traditional weight regularization

• max norm constraints (e.g. with dropout)

• batch norm regularization (2015)

we avoid situations when the weights explode

in deep networks, we temper the weights and the distribution of the activations (i.e. local entropy)
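As one concrete instance of a weight constraint, a max-norm projection can be sketched as below. The bound c = 3.0 and the layer shape are arbitrary illustrative choices, and `max_norm` is a hypothetical helper rather than any library's API.

```python
import numpy as np

def max_norm(weights, c=3.0):
    """Rescale each column (one unit's incoming weights) so its L2 norm is at most c."""
    norms = np.linalg.norm(weights, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return weights * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((784, 10)) * np.array([0.01] * 5 + [1.0] * 5)  # small and large columns
w_clipped = max_norm(w, c=3.0)
print(np.linalg.norm(w, axis=0).round(2))
print(np.linalg.norm(w_clipped, axis=0).round(2))
```

Columns already inside the ball are left untouched; only the exploding ones are pulled back to norm c, which keeps the effective energy scale bounded.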

REM: a toy model for real Glasses


the glass transition is not well understood

but it is believed that entropy collapse ‘drives’ the glass transition

what is a real (structural) Glass ?


Sand + Fire = Glass

what is a real (structural) Glass ?


all liquids can be made into glasses if we cool them fast enough

the glass transition is not a normal phase transition; it is not the melting point

arrangement of atoms is amorphous; not completely random

different cooling rates produce different glassy states

universal phenomena; not universal physics

molecular details affect the thermodynamics

REM: the Glass Transition


Entropy collapses when T <~ Tc

Phase Diagram: entropy density, energy density, free energy density
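For reference, the standard REM densities can be written out explicitly. This assumes Derrida's convention (2^N levels, Gaussian energies of variance N/2, J = 1, k_B = 1), with e, s, f the energy, entropy and free energy per spin:

```latex
\begin{align}
s(e) &= \ln 2 - e^{2}, \qquad |e| \le \sqrt{\ln 2} \\
\frac{1}{T} &= \frac{\partial s}{\partial e} = -2e
  \;\Rightarrow\; e(T) = -\frac{1}{2T} \\
s(T) &= \ln 2 - \frac{1}{4T^{2}}, \qquad
f(T) = e - Ts = -T\ln 2 - \frac{1}{4T} \qquad (T > T_c) \\
s(T_c) &= 0 \;\Rightarrow\; T_c = \frac{1}{2\sqrt{\ln 2}} \\
s(T) &= 0, \qquad e(T) = f(T) = -\sqrt{\ln 2} \qquad (T \le T_c)
\end{align}
```

The entropy density reaches zero at T_c and stays there: below T_c the system is frozen into the lowest levels, which is the entropy collapse referred to above.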

REM: Dynamics on the Energy Landscape


let us assume some states trap the solver for some time;

of course, there is a great effort to design solvers that can avoid traps

Energy Landscapes: and Protein Folding


let us assume some states trap the solver in state E(j) for a short time

and the transitions E(j) -> E(j-1) are governed by finite, reversible transitions (i.e. SGD oscillates back and forth for a while)

classic result(s): for T near the glass Temp (Tc) the traversal times are slower than exponential !

in a physical system, like a protein or polymer, it would take longer than the known lifetime of the universe to find the ground (folded) state

Protein Folding: the Levinthal Paradox


folding could take longer than the known lifetime of the universe ?
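The usual back-of-the-envelope version of the paradox, as arithmetic. All numbers are illustrative assumptions (100 residues, 3 conformations each, 1e13 conformations sampled per second), not values from the slides.

```python
# Illustrative assumptions (not from the slides): 100 residues,
# 3 conformations per residue, 1e13 conformations sampled per second.
n_residues = 100
conformations_per_residue = 3
samples_per_second = 1e13
seconds_per_year = 3.15e7
age_of_universe_years = 1.38e10

n_states = conformations_per_residue ** n_residues
search_years = n_states / samples_per_second / seconds_per_year
print(f"states: {n_states:.2e}, exhaustive search: {search_years:.2e} years")
print(f"ratio to age of universe: {search_years / age_of_universe_years:.2e}")
```

Under these assumptions an exhaustive search takes many orders of magnitude longer than the age of the universe, so real folding cannot be a random search over a flat landscape.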

Old analogy between Protein folding and Hopfield Associative Memories

Natural pattern recognition could

• use a mechanism with a glass Temp (Tc) that is as low as possible

• avoid the glass transition entirely, via energetics

Nature (i.e. folding) can not operate this way !

Protein Folding: around the Levinthal Paradox

Spin Glasses: Minimizing Frustration


31calculation | consulting why deep learning works


Spin Glasses: vs Disordered FerroMagnets



the Spin Glass of Minimal Frustration



REM + strongly correlated ground state = no glass transition


Training a model induces an energy gap, with few local minima
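A toy sketch of what an energy gap buys. Take REM levels (assumed convention: variance N/2, J = 1) and add one hypothetical "designed" state well below the random band, then compare the Boltzmann occupancy of the lowest state with and without it. The gap size and temperature are arbitrary illustrative choices.

```python
import numpy as np

def occupancy_of_lowest(energies, temperature):
    """Boltzmann probability of the lowest-energy state."""
    scaled = -(energies - energies.min()) / temperature
    return 1.0 / np.exp(scaled).sum()

rng = np.random.default_rng(0)
n_spins = 16
temperature = 0.7  # somewhat above the REM glass temperature 1/(2*sqrt(ln 2)) ~ 0.60
random_levels = rng.normal(0.0, np.sqrt(n_spins / 2.0), size=2 ** n_spins)

# Hypothetical "trained" landscape: one extra state placed well below the random band.
gap_state = -1.5 * n_spins * np.sqrt(np.log(2.0))
with_gap = np.append(random_levels, gap_state)

p_no_gap = occupancy_of_lowest(random_levels, temperature)
p_gap = occupancy_of_lowest(with_gap, temperature)
print(f"lowest-state occupancy without gap: {p_no_gap:.3f}, with gap: {p_gap:.3f}")
```

With the gap, the system sits almost entirely in the designed state at a temperature where the purely random landscape still spreads its weight over many levels.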

Energy Funnels: Entropy vs Energy



there is a tradeoff between Energy and Entropy minimization

Energy Landscape Theory of Protein Folding



there is a tradeoff between Energy and Entropy minimization

Avoids the glass transition by having more favorable energetics

Levinthal paradox

glassy surface

vanishing gradients

Energy Landscape Theory of Protein Folding

funneled landscape

rugged convexity

energy / entropy tradeoff

Dark Knowledge: an Energy Funnel ?


784 -> 800 -> 800 -> 10 MLP on MNIST


10,000 test cases, 10 classes

99 errors

same entropy (capacity); better loss function

fit to ensemble soft-max probabilities

146 errors

784 -> 800 -> 800 -> 10
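A minimal sketch of "fit to ensemble soft-max probabilities": the small net is trained against the ensemble's temperature-softened outputs instead of hard labels. The logits and the temperature T = 5 below are made up for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax of logits / T; larger T gives softer probabilities."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between the teacher's soft targets and the student's outputs."""
    targets = softmax(teacher_logits, temperature)
    log_q = np.log(softmax(student_logits, temperature))
    return float(-(targets * log_q).sum())

teacher = np.array([9.0, 4.0, 1.0, -2.0])  # hypothetical ensemble logits
student = np.array([5.0, 1.0, 2.0, 0.0])   # hypothetical small-net logits
print(softmax(teacher, 1.0).round(4), softmax(teacher, 5.0).round(4))
print(distillation_loss(student, teacher, 5.0), distillation_loss(teacher, teacher, 5.0))
```

Raising T exposes the relative probabilities of the wrong classes, and the loss is smallest when the student reproduces the teacher's whole distribution, not just its top label.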

Adversarial Deep Nets: an Energy Funnel ?


Discriminator learns a complex loss function

Generator: fake data

Discriminator: fake vs real ?

Random Energy Model (REM): simpler theoretical model

Glass Transition: temperature ~ weight constraints

extending REM: Spin Glass of Minimal Frustration

possible examples: Dark Knowledge

Funneled Energy Landscapes

Adversarial Deep Nets


