Lecture 3: The Principle of Sparsity
Transcript of Lecture 3: The Principle of Sparsity
[§4, §8, §11.2, and §∞.3 of “The Principles of Deep Learning Theory (PDLT),” arXiv:2106.10165]
Problems 1, 2, & 3

Fully-trained network output, Taylor-expanded around initialization:

• Problem 1: too many terms in general.
• Problem 2: complicated mapping from initial distributions over model parameters to statistics at initialization.
• Problem 3: complicated dynamics, carrying statistics at initialization to statistics after training.

Dan has covered Dynamics; Sho will cover Statistics.
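For reference, the expansion on these slides is, schematically (single output; θ⋆ the fully-trained parameters, θ the initialization; following PDLT):

$$ z(x;\theta^\star) \;=\; z(x;\theta) \;+\; \sum_{\mu}\left(\theta^\star-\theta\right)_{\mu}\frac{dz}{d\theta_{\mu}} \;+\; \frac{1}{2}\sum_{\mu,\nu}\left(\theta^\star-\theta\right)_{\mu}\left(\theta^\star-\theta\right)_{\nu}\frac{d^{2}z}{d\theta_{\mu}\,d\theta_{\nu}} \;+\;\cdots $$

Problem 1 is that, in general, nothing truncates this series.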
Sho will cover Statistics, the mapping from initial distributions over model parameters to statistics at initialization, for WIDE & DEEP neural networks:

Lecture 3: The Principle of Sparsity, deriving recursions.
Lecture 4: The Principle of Criticality, solving recursions.
Outline

1. Neural Networks 101
2. One-Layer Neural Networks
3. Two-Layer Neural Networks
4. Deep Neural Networks
1. Neural Networks 101
Neural Networks

Layer by layer, the network turns inputs into preactivations, applies the activation function neuron by neuron, and feeds the result forward; the final-layer preactivations are the network output. Hats denote quantities at initialization.

Biases and weights (model parameters) are independently (& symmetrically) distributed, with variances set by the initialization hyperparameters; the weight variance is scaled by 1/(width) so that there is a good wide limit.
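In the PDLT conventions these slides follow, the preactivations and the initialization distribution are

$$ z_i^{(l+1)}(x) \;=\; b_i^{(l+1)} + \sum_{j=1}^{n_l} W_{ij}^{(l+1)}\,\sigma\!\left(z_j^{(l)}(x)\right), \qquad z_i^{(1)}(x) \;=\; b_i^{(1)} + \sum_{j=1}^{n_0} W_{ij}^{(1)} x_j, $$

$$ \mathbb{E}\!\left[b_i^{(l)} b_j^{(l)}\right] = \delta_{ij}\,C_b, \qquad \mathbb{E}\!\left[W_{i_1 j_1}^{(l)} W_{i_2 j_2}^{(l)}\right] = \delta_{i_1 i_2}\delta_{j_1 j_2}\,\frac{C_W}{n_{l-1}}, $$

with σ the activation function and n_l the width of layer l; the 1/n_{l-1} in the weight variance is the “good wide limit” scaling.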
One Aside on Gradient Descent [Cf. Andrea’s “S” matrix]

Taylor expansion: expanding the gradient-descent update of the network output in the parameter update, the first-order term defines the Neural Tangent Kernel (NTK) [Cf. Dan’s lectures], the second-order term the differential of the NTK (dNTK), and the third-order term the ddNTK.
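Schematically, for a single output (following PDLT; λ_{μν} is the learning-rate tensor introduced on the next slide):

$$ dz \;=\; \sum_\mu \frac{dz}{d\theta_\mu}\,d\theta_\mu \;+\; \frac{1}{2}\sum_{\mu,\nu}\frac{d^2 z}{d\theta_\mu\, d\theta_\nu}\,d\theta_\mu\, d\theta_\nu \;+\;\cdots, \qquad d\theta_\mu \;=\; -\,\eta\sum_\nu \lambda_{\mu\nu}\,\frac{d\mathcal{L}}{d\theta_\nu}. $$

Substituting the update, the first-order term is governed by the NTK,

$$ H(x_1, x_2) \;\equiv\; \sum_{\mu,\nu}\lambda_{\mu\nu}\,\frac{dz(x_1)}{d\theta_\mu}\,\frac{dz(x_2)}{d\theta_\nu}, $$

the second-order term by the dNTK, and the third-order term by the ddNTK.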
Neural Tangent Kernel (NTK) [Cf. Andrea’s “S” matrix]

Diagonal, group-by-group, learning rate: one learning rate per parameter group (λ_b for biases, λ_W for weights, layer by layer), with the weight learning rates rescaled by 1/(width) so that the NTK, too, has a good wide limit.
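To make the definition concrete, here is a minimal NumPy sketch (not from the slides; all hyperparameter values are made up) that computes the empirical NTK of a one-hidden-layer tanh network by contracting explicit parameter gradients with a diagonal, group-by-group learning-rate tensor:

```python
import numpy as np

# Empirical NTK of z = b2 + W2 @ tanh(b1 + W1 @ x), with one learning rate
# per parameter group: lam_b for biases, lam_W (1/n-rescaled) for weights.

rng = np.random.default_rng(0)
n0, n1 = 10, 4096                       # input and hidden widths
C_b, C_W = 0.1, 1.0                     # initialization hyperparameters
lam_b, lam_W = 0.5, 1.0                 # per-group learning rates, before 1/n rescaling

b1 = rng.normal(0.0, np.sqrt(C_b), n1)
W1 = rng.normal(0.0, np.sqrt(C_W / n0), (n1, n0))
b2 = rng.normal(0.0, np.sqrt(C_b), 1)
W2 = rng.normal(0.0, np.sqrt(C_W / n1), (1, n1))

def grads(x):
    """Gradients of the scalar output with respect to each parameter group."""
    s = np.tanh(b1 + W1 @ x)            # hidden activations sigma(z^(1))
    dz_db2 = np.ones(1)
    dz_dW2 = s                          # dz/dW2_{0j} = sigma(z_j)
    dz_db1 = W2[0] * (1.0 - s**2)       # chain rule: W2_{0j} * sigma'(z_j)
    dz_dW1 = np.outer(dz_db1, x)        # dz/dW1_{jk} = (dz/db1_j) * x_k
    return dz_db1, dz_dW1, dz_db2, dz_dW2

def ntk(x1, x2):
    """H(x1, x2) = sum over groups of lambda_group * (dz(x1)/dtheta).(dz(x2)/dtheta)."""
    lams = (lam_b, lam_W / n0, lam_b, lam_W / n1)   # 1/n-rescaled weight learning rates
    return sum(lam * np.sum(g1 * g2)
               for lam, g1, g2 in zip(lams, grads(x1), grads(x2)))

x = rng.normal(size=n0)
print("H(x, x) =", ntk(x, x))
```

With the 1/n rescalings of both the weight variances and the weight learning rates, H(x, x) stays O(1) as the hidden width n1 grows.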
Two Pedagogical Simplifications

1. Single input; drop sample indices.
2. Layer-independent hyperparameters; drop layer indices from them.

[See “PDLT” (arXiv:2106.10165) for the more general cases.]
2. One-Layer Neural Networks
Statistics of z^(1)

“Wick contraction” of the Gaussian biases and weights gives the first-layer statistics: z^(1) is Gaussian, with two-point function

$$ \mathbb{E}\!\left[z_i^{(1)} z_j^{(1)}\right] = \delta_{ij}\, K^{(1)}, \qquad K^{(1)} = C_b + C_W\, \frac{x \cdot x}{n_0}. $$

• Neurons don’t talk to each other; they are statistically independent.
• We marginalized over/integrated out b^(1) and W^(1).
• Two interpretations: (i) outputs of one-layer networks; or (ii) preactivations in the first layer of deeper networks.
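A quick Monte-Carlo check of the Wick-contraction result, with toy values assumed:

```python
import numpy as np

# Over an ensemble of initializations, E[z_i^(1) z_j^(1)] = delta_ij * K^(1)
# with K^(1) = C_b + C_W * (x . x) / n0, and distinct neurons are uncorrelated.

rng = np.random.default_rng(1)
n0, n1, ensemble = 8, 3, 200_000
C_b, C_W = 0.1, 1.0
x = rng.normal(size=n0)

b = rng.normal(0.0, np.sqrt(C_b), (ensemble, n1))
W = rng.normal(0.0, np.sqrt(C_W / n0), (ensemble, n1, n0))
z = b + W @ x                            # first-layer preactivations, one row per draw

print("theory K^(1):", C_b + C_W * (x @ x) / n0)
print("empirical E[z_i z_j]:\n", np.round(z.T @ z / ensemble, 3))   # ~ K^(1) * identity
```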
Statistics of H^(1) (the first-layer NTK)

• “Deterministic”: it doesn’t depend on any particular initialization; you always get the same number.
• “Frozen”: it cannot evolve during training; no representation learning.

Hence, in the first layer:

• No representation learning.
• No algorithm dependence.
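Indeed, since $\partial z_i^{(1)}/\partial b_j^{(1)} = \delta_{ij}$ and $\partial z_i^{(1)}/\partial W_{jk}^{(1)} = \delta_{ij}\, x_k$, the first-layer NTK is

$$ H^{(1)} \;=\; \lambda_b + \lambda_W \sum_{j=1}^{n_0} x_j\, x_j, $$

which contains no model parameters at all: deterministic, and therefore frozen.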
Statistics of One-Layer Neural Networks

Linear dynamics: with a deterministic, frozen NTK, gradient descent acts linearly on the network output.
Simple solution: the statistics after training follow in closed form from the statistics at initialization.

• Same trivial statistics for infinite-width neural networks of any fixed depth.
• No representation learning, no algorithm dependence; not a good model of deep learning.

We must study deeper networks of finite width!
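In one toy variable (all numbers made up), the frozen-NTK dynamics and its closed-form solution look like this:

```python
import numpy as np

# With a frozen NTK H, gradient descent on the MSE loss L = (z - y)^2 / 2 gives
#   z_{t+1} = z_t - eta * H * (z_t - y)  =>  z_t = y + (1 - eta*H)^t * (z_0 - y).

H, eta = 1.7, 0.3        # frozen NTK value, global learning rate
z0, y = 0.9, -0.4        # output at initialization, training target

z = z0
for t in range(1, 6):
    z = z - eta * H * (z - y)                          # one gradient-descent step
    print(t, z, y + (1.0 - eta * H) ** t * (z0 - y))   # matches the closed form
```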
3. Two-Layer Neural Networks
Statistics of z^(2)

Wick-contract the second-layer biases and weights, then arrange the result in terms of first-layer statistics.

• Recursive: second-layer statistics are given in terms of first-layer statistics.
• The 1/n width-scaling was important.
The four-point function takes a few more Wick contractions; the result is a nearly-Gaussian distribution for z^(2), with a connected four-point function suppressed by 1/n. [Cf. the Gaussian distribution in the first layer, whose connected correlators all vanish.]
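For a single input, in PDLT §4 conventions, the two results are

$$ \mathbb{E}\!\left[z_i^{(2)} z_j^{(2)}\right] = \delta_{ij}\, K^{(2)}, \qquad K^{(2)} = C_b + C_W \left\langle \sigma(z)^2 \right\rangle_{K^{(1)}}, $$

where ⟨·⟩_K denotes an average over z ∼ N(0, K), and

$$ \mathbb{E}\!\left[z_{i_1}^{(2)} z_{i_2}^{(2)} z_{i_3}^{(2)} z_{i_4}^{(2)}\right]_{\text{conn.}} = \frac{1}{n_1}\left(\delta_{i_1 i_2}\delta_{i_3 i_4} + \delta_{i_1 i_3}\delta_{i_2 i_4} + \delta_{i_1 i_4}\delta_{i_2 i_3}\right) V^{(2)}, \qquad V^{(2)} = C_W^2\left[\left\langle \sigma(z)^4 \right\rangle_{K^{(1)}} - \left\langle \sigma(z)^2 \right\rangle_{K^{(1)}}^2\right]. $$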
• Gaussian in the infinite-width limit: too simple; specified by one number (one matrix, the kernel, more generally).
• Sparse description at O(1/n): specified by two numbers (two tensors more generally, one of them having four sample indices).
• Interacting neurons at finite width.
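A numerical illustration of the finite-width interaction (toy setup assumed: tanh activation, C_b = 0, C_W = 1, one fixed input): every second-layer neuron sees the same random first layer, so the per-network kernel estimate fluctuates across initializations at O(1/n), the same 1/n that suppresses the connected four-point function:

```python
import numpy as np

# The shared first layer couples second-layer neurons: the per-network estimate
#   Khat = (1/n1) * sum_j sigma(zhat_j^(1))^2
# fluctuates across initializations with variance ~ 1/n1.

rng = np.random.default_rng(2)
n0, ensemble = 4, 10_000
x = rng.normal(size=n0)

for n1 in (16, 64, 256):
    W1 = rng.normal(0.0, np.sqrt(1.0 / n0), (ensemble, n1, n0))
    Khat = np.mean(np.tanh(W1 @ x) ** 2, axis=1)   # one kernel estimate per draw
    print(n1, n1 * np.var(Khat))                   # ~ constant => Var(Khat) ~ 1/n1
```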
Statistics of H^(2) (the second-layer NTK)

The second-layer NTK splits into two pieces:

• 1st piece, the same as before: the direct contribution of the second-layer biases and weights; here again the 1/n width-scaling (now of the weight learning rate) was important.
• 2nd piece, chain rule: the first-layer NTK propagated forward through the first layer.

Putting things together, NTK forward equation. Unlike the first-layer NTK, the result is:

• “Stochastic”: it fluctuates from instantiation to instantiation.
• “Defrosted”: it can evolve during training.

Fun for the weekend (solutions in §8, §11.2, and §∞.3): work out the corresponding finite-width statistics of the NTK fluctuations, the dNTK, and the ddNTK*; these are what make Representation Learning possible. [*For smooth activation functions.]
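The forward equation referenced above, in a single-input form at leading order (PDLT §8 conventions; λ̃_W is the width-rescaled weight learning rate):

$$ \overline{H}^{(2)} \;=\; \underbrace{\lambda_b + \tilde{\lambda}_W \left\langle \sigma(z)^2 \right\rangle_{K^{(1)}}}_{\text{1st piece}} \;+\; \underbrace{C_W \left\langle \sigma'(z)^2 \right\rangle_{K^{(1)}} H^{(1)}}_{\text{2nd, chain-rule piece}} \;+\; O(1/n); $$

at finite width the stochastic NTK fluctuates around this mean at O(1/n).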
Statistics of Two-Layer Neural Networks

• Two interpretations: (i) outputs, NTK, … of a two-layer network; or (ii) preactivations, mid-layer NTK, … in the second layer of a deeper network.
• Neurons do talk to each other; they are statistically dependent.
• Yes representation learning (and yes algorithm dependence); they can now capture the rich dynamics of real, finite-width neural networks.

But what is being amplified by deep learning?
4. Deep Neural Networks
Statistics of z^(l+1)

Same moves as in the second layer: Wick-contract the layer-(l+1) parameters, arrange in terms of layer-l statistics, and keep the leading terms at large width.
This yields layer-to-layer recursions:

• Two-point: K^(l+1) in terms of K^(l).
• Four-point: V^(l+1) in terms of V^(l) and K^(l).
• NTK mean: H̄^(l+1) in terms of H̄^(l) and K^(l).
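For a single input these recursions read (PDLT §4 and §8; the four-point recursion is schematic, up to conventions):

$$ K^{(l+1)} = C_b + C_W \left\langle \sigma(z)^2 \right\rangle_{K^{(l)}}, \qquad \overline{H}^{(l+1)} = \lambda_b + \tilde{\lambda}_W \left\langle \sigma(z)^2 \right\rangle_{K^{(l)}} + C_W \left\langle \sigma'(z)^2 \right\rangle_{K^{(l)}} \overline{H}^{(l)}, $$

$$ V^{(l+1)} = C_W^2\left[\left\langle \sigma(z)^4 \right\rangle_{K^{(l)}} - \left\langle \sigma(z)^2 \right\rangle_{K^{(l)}}^2\right] + \chi_\parallel\!\left(K^{(l)}\right)^2 V^{(l)}, \qquad \chi_\parallel(K) \equiv C_W \frac{d}{dK}\left\langle \sigma(z)^2 \right\rangle_K, $$

so each finite-width vertex is sourced anew at every layer and propagated forward from the layer below.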
Statistics of H^(l+1)

NTK forward equation: a 1st trivial piece from the layer-(l+1) parameters, plus a 2nd chain-rule piece that propagates the NTK up from the layer below.
Statistics of z^(l+1), H^(l+1), and beyond

The two-point, four-point, and NTK-mean recursions above; similarly for:

• NTK fluctuations (§8)
• dNTK (§11.2)
• ddNTK (§∞.3)
The Principle of Sparsity for WIDE Neural Networks

From statistics at initialization (Sho) to statistics after training (Dan):

• Infinite width: specified by the kernel and the NTK mean alone.
• Large-but-finite width at O(1/n): specified, in addition, by the four-point vertex, the NTK fluctuations, the dNTK, and the ddNTK.
All determined through recursion relations. (RG-flow interpretation: §4.6.)
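A minimal sketch of what “determined through recursion relations” means in practice (tanh activation and all hyperparameter values assumed; Gaussian expectations by Gauss-Hermite quadrature), iterating the single-input kernel and mean-NTK recursions over depth:

```python
import numpy as np

# Iterate K^{(l+1)} = C_b + C_W <sigma(z)^2>_K  and
#         H^{(l+1)} = lam_b + lam_W <sigma(z)^2>_K + C_W <sigma'(z)^2>_K H^{(l)}
# with <f(z)> for z ~ N(0, K) evaluated by Gauss-Hermite quadrature.

t, w = np.polynomial.hermite_e.hermegauss(51)    # probabilists' Hermite nodes/weights
w = w / w.sum()                                  # normalize to a unit-Gaussian measure

def gauss_avg(f, K):
    """<f(z)> for z ~ N(0, K)."""
    return np.dot(w, f(np.sqrt(K) * t))

C_b, C_W = 0.0, 1.0          # initialization hyperparameters
lam_b, lam_W = 0.5, 1.0      # learning-rate hyperparameters (already width-rescaled)
K, H = 1.0, lam_b + lam_W    # first-layer values for an input with x.x/n0 = 1

for layer in range(2, 11):
    s2 = gauss_avg(lambda z: np.tanh(z) ** 2, K)                  # <sigma(z)^2>
    ds2 = gauss_avg(lambda z: (1.0 - np.tanh(z) ** 2) ** 2, K)    # <sigma'(z)^2>
    K = C_b + C_W * s2                       # kernel recursion
    H = lam_b + lam_W * s2 + C_W * ds2 * H   # mean-NTK recursion
    print(layer, round(K, 4), round(H, 4))
```

How K^(l) and H̄^(l) flow with depth, and how to tune C_b and C_W so they stay well-behaved, is the subject of the next lecture.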
Next Lecture: Solving Recursions, “The Principle of Criticality,” for DEEP Neural Networks.
One more thing…