Transcript of "A signal propagation perspective for pruning neural networks at initialization" (ICLR 2020 Spotlight), slides: ...namhoon/doc/slides-spp-iclr2020.pdf (21 pages)

Page 1: Title

A signal propagation perspective for pruning neural networks at initialization

Namhoon Lee¹, Thalaiyasingam Ajanthan², Stephen Gould², Philip Torr¹

¹University of Oxford, ²Australian National University

ICLR 2020 Spotlight presentation

Pages 2-7: Motivation (one slide, built up incrementally)

[Figure: the typical prune-retrain pipeline of Han et al. 2015]

● A typical pruning approach requires training steps (Han et al. 2015; Liu et al. 2019).

● Pruning can instead be done efficiently at initialization, prior to training, based on connection sensitivity (Lee et al., 2019).

● The initial random weights are drawn from appropriately scaled Gaussians (Glorot & Bengio, 2010).

● It remains unclear exactly why pruning at initialization is effective.

● Our take ⇒ a signal propagation perspective.

Pages 8-13: Initialization & connection sensitivity

[Figure: layerwise sparsity patterns and connection sensitivity (CS) scores]

● (Linear) Parameters are pruned uniformly throughout the network. → Learning capability is secured.

● (tanh) More parameters are pruned in the later layers. → This is critical for high-sparsity pruning.

● CS scores decrease towards the later layers. → Choosing the top salient parameters globally yields a network in which the surviving parameters are distributed non-uniformly, becoming sparser towards the end.

● The CS metric can be decomposed as $\partial L / \partial c_j = w_j \cdot \partial L / \partial w_j$, i.e., weight times gradient. → It is necessary to ensure reliable gradients! (See the scoring sketch below.)
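This decomposition is easy to compute in practice. Below is a minimal sketch, assuming a PyTorch model and a single mini-batch; the helper names, the weights-only filter, and the 10% keep ratio are illustrative assumptions rather than the authors' exact setup. It scores every weight by $|w_j \cdot \partial L / \partial w_j|$ and keeps the globally top-scoring connections:

```python
# Hypothetical sketch of connection-sensitivity (CS) scoring in the style
# of SNIP (Lee et al., 2019); names and keep ratio are illustrative.
import torch
import torch.nn as nn

def connection_sensitivity(model: nn.Module, loss_fn, x, y):
    """Score each weight by |w * dL/dw|, i.e., dL/dc_j = w_j * dL/dw_j."""
    loss = loss_fn(model(x), y)
    weights = [p for p in model.parameters() if p.dim() > 1]  # weight tensors only
    grads = torch.autograd.grad(loss, weights)
    return [(w * g).abs() for w, g in zip(weights, grads)]

def global_topk_masks(scores, keep_ratio=0.1):
    """Keep the globally highest-scoring weights across all layers."""
    flat = torch.cat([s.flatten() for s in scores])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()  # k-th largest score
    return [(s >= threshold).float() for s in scores]
```

Because scores are compared globally, layers whose CS scores are uniformly small (the later layers, per the slide) end up pruned more aggressively.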

Page 14: A signal propagation perspective for ICLR 2020 Spotlight ...namhoon/doc/slides-spp-iclr2020.pdf · A signal propagation perspective for pruning neural networks at initialization Namhoon

Proposition 1 (Gradients in terms of Jacobians).

For a feed-forward network, the gradients satisfy: , where denotes the error signal, is the Jacobian from layer to the output layer , and refers to the derivative of nonlinearity.

Layerwise dynamical isometry for faithful gradients

Page 15: A signal propagation perspective for ICLR 2020 Spotlight ...namhoon/doc/slides-spp-iclr2020.pdf · A signal propagation perspective for pruning neural networks at initialization Namhoon

Proposition 1 (Gradients in terms of Jacobians).

For a feed-forward network, the gradients satisfy: , where denotes the error signal, is the Jacobian from layer to the output layer , and refers to the derivative of nonlinearity.

Definition 1 (Layerwise dynamical isometry).

Let be the Jacobian matrix of layer . The network is said to satisfy layerwise dynamical isometry if the singular values of are concentrated near 1 for all layers; i.e., for a given , the singular value satisfies for all .

Layerwise dynamical isometry for faithful gradients
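To make Definition 1 concrete, here is a minimal sketch, assuming a fully-connected tanh network with fan-in-scaled Gaussian weights and a random 90%-sparsity mask (the layer sizes and masking scheme are illustrative assumptions). It forms each layerwise Jacobian $J^{l-1,l} = D^{l}(C^{l} \odot W^{l})$ at a random input and inspects its singular values:

```python
# Hypothetical sketch: checking layerwise dynamical isometry of a masked
# fully-connected tanh network; sizes and sparsity are illustrative.
import numpy as np

rng = np.random.default_rng(0)
widths = [784, 300, 300, 10]          # assumed layer sizes
Ws = [rng.normal(0.0, np.sqrt(1.0 / n_in), (n_out, n_in))
      for n_in, n_out in zip(widths[:-1], widths[1:])]
Cs = [(rng.random(W.shape) < 0.1).astype(float) for W in Ws]  # 90% sparsity masks

x = rng.normal(size=widths[0])
for l, (W, C) in enumerate(zip(Ws, Cs), start=1):
    h = (C * W) @ x                       # pre-activation of layer l
    D = np.diag(1.0 - np.tanh(h) ** 2)    # derivative of tanh at h
    J = D @ (C * W)                       # layerwise Jacobian J^{l-1,l}
    sv = np.linalg.svd(J, compute_uv=False)
    print(f"layer {l}: mean JSV = {sv.mean():.3f}")  # near 1 => isometry
    x = np.tanh(h)                        # propagate to the next layer
```

Singular values concentrated near 1 indicate layerwise dynamical isometry; heavy or random pruning typically drags them towards 0, which is exactly the degradation measured on the next slide.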

Pages 16-19: Signal propagation and trainability

[Figure: signal propagation (Jacobian singular values) and trainability (training curves at 90% sparsity)]

● Jacobian singular values (JSV) decrease as sparsity increases. → Pruning weakens signal propagation.

● JSV drop rapidly with random pruning compared to connection sensitivity (CS) based pruning. → CS pruning preserves signal propagation better.

● Signal propagation correlates with trainability. → The better a network propagates signals, the faster it converges during training.

● Enforce approximate isometry: optimize the surviving weights so that each masked layer is close to orthogonal, i.e., minimize $\| (C^{l} \odot W^{l})^{\top} (C^{l} \odot W^{l}) - I \|_{F}$ over $W^{l}$, where $C^{l}$ is the pruning mask. → Restore signal propagation and improve training! (See the sketch below.)
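A minimal sketch of this optimization, assuming the orthogonality penalty above is minimized with Adam, layer by layer, over a fixed pruning mask; the optimizer choice, step count, and learning rate are illustrative assumptions:

```python
# Hypothetical sketch: pull a masked weight matrix towards orthogonality by
# minimizing the squared Frobenius penalty ||(C*W)^T (C*W) - I||_F^2.
import torch

def enforce_approximate_isometry(W: torch.Tensor, C: torch.Tensor,
                                 steps: int = 1000, lr: float = 1e-2):
    W = W.clone().requires_grad_(True)
    opt = torch.optim.Adam([W], lr=lr)
    eye = torch.eye(W.shape[1])
    for _ in range(steps):
        opt.zero_grad()
        Wm = C * W                                # apply the fixed pruning mask
        loss = ((Wm.t() @ Wm - eye) ** 2).sum()   # squared Frobenius penalty
        loss.backward()
        opt.step()
    return (C * W).detach()                       # masked, near-orthogonal weights

# Illustrative usage on a random 300x300 layer with a 90%-sparsity mask:
W0 = torch.randn(300, 300) / 300 ** 0.5
C0 = (torch.rand(300, 300) < 0.1).float()
W_iso = enforce_approximate_isometry(W0, C0)
```

The mask $C$ stays fixed, so sparsity is preserved; only the surviving weights move, pulling the masked matrix towards orthogonality and its singular values towards 1.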

Page 20: Validations and extensions

● Modern networks

● Non-linearities

● Architecture sculpting

● Pruning without supervision

● Transfer of sparsity

Page 21: Summary

● The initial random weights have a critical impact on pruning.

● Layerwise dynamical isometry ensures faithful signal propagation.

● Pruning breaks dynamical isometry and degrades the trainability of a neural network. Yet, enforcing approximate isometry can recover signal propagation and enhance trainability.

● A range of experiments verify the effectiveness of the signal propagation perspective.