Transcript of "Some Thoughts about Learning Predictions Online"

Page 1:

Some Thoughts about Learning Predictions Online

Martha White, TTT 2019

Page 2:

Online Prediction Learning

• Constant stream of data

• Goal: Predict target y, given input x

• Standard prediction problem, but

• input sequence is correlated (e.g., Markov chain, time series)

• predicting many things

• might add new predictions as time passes

[Figure: parameter space with solution manifolds for tasks p1, p2 and p3 and the joint training solution W; the left panel shows the solution manifolds for an unconstrained representation, the right panel the solution manifolds for an ideal representation for continual learning.]

Figure 2: Effect of the representation on continual learning, for a problem where targets are generated from three different distributions p1(Y|x), p2(Y|x) and p3(Y|x). The representation results in different solution manifolds for the three distributions; we depict two different possibilities here. We show the learning trajectory when training incrementally on data generated first by p1, then p2 and p3. On the left, the online updates interfere, jumping between distant points on the manifolds. On the right, the online updates either generalize appropriately (for parallel manifolds) or avoid interference because the manifolds are orthogonal.

regression and classification problems. We analyze the representations learned with our objective and find that they are highly sparse. We then show that this idea complements and improves on other continual learning strategies, like Meta Experience Replay [Riemer et al., 2019], which can learn more effectively from our learned representations.

2 Problem Formulation

A Continual Learning Prediction (CLP) problem consists of an unending stream of samples

(X_1, Y_1), (X_2, Y_2), ..., (X_t, Y_t), ...

for inputs X_t and prediction targets Y_t, from sets X and Y respectively.¹ The random vector Y_t is sampled according to an unknown distribution p(Y | X_t). We assume the process X_1, X_2, ..., X_t, ... has a marginal distribution µ : X → [0, 1) that reflects how often each input is observed. This assumption allows for a variety of correlated sequences. For example, X_t could be sampled from a distribution potentially dependent on past variables X_{t-1} and X_{t-2}. The targets Y_t, however, are dependent only on X_t, and not on past X_i.

The goal of the agent is to learn a function f : X → Y, to predict targets y from inputs x. This function can be parameterized, with parameter vectors θ and W, as

f_{θ,W}(x) = g_W(φ_θ(x))   for representation φ_θ : X → R^d    (1)

with g_W : R^d → Y a task-specific prediction function learned on a shared representation φ_θ(x). For example, if the agent is making m predictions, then we could have W = [w_1, ..., w_m], with prediction vector g_W(φ_θ(x)) = tanh(φ_θ(x) W) composed of predictions tanh(φ_θ(x) w_i) for each prediction target i.
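The parameterization in Eq. (1) is just a shared representation feeding several task-specific heads. Below is a minimal sketch of that structure, with tanh outputs as in the example above; the class name, the single hidden layer, and the ReLU nonlinearity are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class SharedPredictor(nn.Module):
    """f_{theta,W}(x) = g_W(phi_theta(x)) with one tanh prediction per target."""
    def __init__(self, n_inputs: int, d: int, m: int):
        super().__init__()
        # phi_theta : X -> R^d, the shared representation
        self.phi = nn.Sequential(nn.Linear(n_inputs, d), nn.ReLU())
        # W = [w_1, ..., w_m], one weight vector per prediction target
        self.W = nn.Parameter(torch.zeros(d, m))

    def forward(self, x):
        # g_W(phi_theta(x)) = tanh(phi_theta(x) W)
        return torch.tanh(self.phi(x) @ self.W)
```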

A variety of continual problems can be represented by this formulation. One example is an online regression problem, such as predicting the next spatial location for a robot given the current location. Current classification benchmarks in continual learning can also be represented by this CLP formalism. One class is selected, with iid samples of X_t only from data for that class. This sequence is correlated, because the process consistently reveals samples only from one class until switching to another class. The CLP formulation also allows for targets Y_t that are dependent on a history of the most recent k observations. This can be obtained by defining each X_t to be the last k observations. The overlap between X_t and X_{t-1} does not violate the assumptions on the correlated sequence of inputs. Finally,

¹This definition encompasses the continual learning problem where the tuples also include task descriptors T_t [Lopez-Paz and Ranzato, 2017]. T_t in the tuple (X_t, T_t, Y_t) can simply be considered as part of the inputs.


Page 3:

Why this setting matters

• It reflects how we really get data

• Even if you

• do not want the agent to update online (e.g., safety)

• or can store and update with all of your data

• You still get data online; it can be good to remember that

Page 4:

Desired Outcomes

• Generalization: learning on observed samples enables accurate predictions on unobserved (but related) samples

• Faster learning: learning on observed samples enables you to learn faster on new samples

• Minimal forgetting: maintain learning on all observed data

• updating on recent samples does not ruin accuracy on older samples

Page 5:

How can we achieve this? Learn a representation that makes it easier to learn a function with these properties

[Diagram: a Representation Learning Network (RLN) with parameters θ maps inputs x_1, ..., x_n to a learned representation r_1, ..., r_d; a Prediction Learning Network with parameters W maps the representation to the output y.]

Page 6:

Or more simply for this talk

[Diagram: the same architecture, simplified: the RLN produces the learned representation r_1, ..., r_d, and a single linear PLN with weights w produces the output y.]

Page 7:

Learning functions or representations?

• Why do we talk about learning a representation?

• These three goals can be achieved just by thinking about the function itself directly

• Recall goals: Generalization, Faster Learning, Minimize Interference

• NN could implicitly learn a representation anyway

Page 8:

Comment 1

• Neural network solutions are likely under-constrained

• Learning a function to minimize a loss could

• produce an “interesting” representation (implicitly)

• OR it could produce features that mean very little

Page 9:

Hypothesis 1

If we are going to talk about representation learning, then we should learn representations explicitly

[Diagram repeated: the RLN produces the learned representation r_1, ..., r_d, and the PLN with weights w produces the output y.]

Page 10:

Consequences

• Consider different strategies for training representations

• Representations can be learned slowly, as a background process

• Representations could be learned using generate-and-test

• Representations can be learned using different objectives than the primary objective to minimize the error

Page 11:

Some of our work

• Meta-learned Representations for Continual Learning, or MRCL (with Khurram)

• Two-timescale Networks (with Wes, Somjit, Ajin)

Page 12:

MRCL

[Diagram: the Representation Learning Network (RLN), which is only updated with the meta-update, maps inputs x_1, ..., x_n to the learned representation r_1, ..., r_d; the Prediction Learning Network (PLN), which is updated online, maps the representation to the output y.]

*Paper on arXiv, Meta-learning representations for continual learning
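The split on this slide (RLN updated only by the meta-update, PLN updated online) suggests a MAML-style training loop. The sketch below illustrates that pattern only; it is not the authors' implementation. The function and variable names, the squared loss, and the single linear PLN are assumptions, and `meta_opt` is assumed to hold only the RLN parameters.

```python
import torch

def mrcl_style_meta_step(rln, w, trajectory, meta_batch, inner_lr, meta_opt):
    # Inner loop: the PLN weights are updated online on a correlated trajectory,
    # starting from w; the updates are kept differentiable (create_graph=True).
    w_fast = w.detach().clone().requires_grad_(True)
    for x_t, y_t in trajectory:                      # one (1, n_inputs) sample at a time
        inner_loss = ((rln(x_t) @ w_fast - y_t) ** 2).mean()
        grad_w, = torch.autograd.grad(inner_loss, w_fast, create_graph=True)
        w_fast = w_fast - inner_lr * grad_w
    # Outer loop: evaluate the post-update PLN on held-out samples and push the
    # meta-gradient into the RLN parameters only (those held by meta_opt).
    x_meta, y_meta = meta_batch
    meta_loss = ((rln(x_meta) @ w_fast - y_meta) ** 2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
    return meta_loss.item()
```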

Page 13:

Two-timescale Networks

• Train representation with related prediction problems

[Diagram: inputs x_1, ..., x_n feed a network that produces the representation r_1, ..., r_d; auxiliary prediction heads with weights v predict related targets y, and these auxiliary losses are what train the representation.]

*Paper at ICLR 2019, Two-Timescale Networks for Nonlinear Value Function Approximation
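A rough sketch of the two-timescale idea on this slide: the representation is trained slowly from related (auxiliary) prediction problems, while the prediction of interest is a linear function of the features updated quickly on top of them. This is only an illustration under simple assumptions (squared losses, plain SGD, one auxiliary target, names like `rln` and `aux_head` invented here); see the ICLR paper for the actual algorithm.

```python
import torch

def two_timescale_step(rln, aux_head, w, x, y, y_aux, slow_lr=1e-4, fast_lr=1e-1):
    # Slow timescale: the representation (and auxiliary head) are trained on the
    # auxiliary prediction problem.
    aux_loss = ((aux_head(rln(x)) - y_aux) ** 2).mean()
    aux_loss.backward()
    with torch.no_grad():
        for p in list(rln.parameters()) + list(aux_head.parameters()):
            p -= slow_lr * p.grad
            p.grad = None
    # Fast timescale: the prediction of interest is linear in the (detached) features
    # and uses a much larger step size; w is a tensor with requires_grad=True.
    feats = rln(x).detach()
    fast_loss = ((feats @ w - y) ** 2).mean()
    grad_w, = torch.autograd.grad(fast_loss, w)
    with torch.no_grad():
        w -= fast_lr * grad_w
    return fast_loss.item()
```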

Page 14:

Comment 2

• Representation learning only makes sense if you will be learning more in the future

• Conversely, it usually does not make sense for a single prediction problem on a batch of data

• Representation learning is a second-order problem

Page 15:

Consequence

• Experimental design to test representation learning needs to account for learning the representation

• e.g., design environment where more predictions are added as time passes

• e.g., introduce non-stationarity

• e.g., allow for a pre-training phase, to simulate using previous learning for new learning

Page 16:

Comment 3

• Online prediction is a problem setting, not a solution approach

• Batch is not the opposite of Online

Page 17:

Consequence

• We should be open to appropriate batch approaches

• Batch updating (say by storing data) could be part of the solution to the Online Prediction Problem

• Experience replay could be part of the solution (or some variant of it)

Page 18:

Comment 4

• Sample efficiency and minimizing interference are linked

• A small mini-batch is not representative of the whole space, even in an iid setting

• If a representation minimizes interference, each mini-batch update should mostly improve the estimate

• If a representation does not minimize interference, improvement happens across (more) mini-batch updates

Page 19:

Visualization

[Figure repeated: parameter space with solution manifolds for p1, p2 and p3 and the joint training solution W, for an unconstrained representation (left) and an ideal representation for continual learning (right).]

Page 20:

Hypothesis

Might want to consider strategies used to mitigate interference for online updating, even for iid data.

Page 21:

Comment 5

• Mitigating interference in updates relates to orthogonality between feature vectors

∇ℓ_i(β)ᵀ ∇ℓ_j(β) ≈ 0,   β = [θ, w]

Page 22:

Comment 5

• Mitigating interference in updates relates to orthogonality between feature vectors

∇f_β(x_i) = [φ_θ(x_i), ∇φ_θ(x_i)]

∇ℓ_i(β) = δ_i ∇f_β(x_i)

∇ℓ_i(β)ᵀ ∇ℓ_j(β) ≈ 0,   β = [θ, w]

Page 23:

Comment 5

• Mitigating interference in updates relates to orthogonality between feature vectors

∇f_β(x_i) = [φ_θ(x_i), ∇φ_θ(x_i)]

∇ℓ_i(β) = δ_i ∇f_β(x_i),   with δ_i the derivative of the loss with respect to the prediction

∇ℓ_i(β)ᵀ ∇ℓ_j(β) = δ_i δ_j ∇f_β(x_i)ᵀ ∇f_β(x_j) ≈ 0

φ_θ(x_i)ᵀ φ_θ(x_j) ≈ 0
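The quantity on this slide is easy to measure directly: the interference between two online updates is the dot product of their loss gradients. A small sketch, assuming PyTorch, a model that maps inputs to scalar predictions, and a squared loss; the function name is illustrative.

```python
import torch

def gradient_interference(model, x_i, y_i, x_j, y_j):
    """Return grad l_i(beta)^T grad l_j(beta): > 0 transfer, < 0 interference, ~ 0 neither."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_i = torch.autograd.grad(((model(x_i) - y_i) ** 2).mean(), params)
    g_j = torch.autograd.grad(((model(x_j) - y_j) ** 2).mean(), params)
    return sum((a * b).sum() for a, b in zip(g_i, g_j)).item()
```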

Page 24:

Comment 6

• Finding nearly orthogonal features is equivalent to finding nearly orthogonal feature vectors

argmin_θ Σ_{j,k} ( E[φ_{θ,j}(X) φ_{θ,k}(X)] − δ_{jk} )²
   = argmin_θ E[ (φ_θ(X)ᵀ φ_θ(U))² − ‖φ_θ(X)‖₂² − ‖φ_θ(U)‖₂² ]

where δ_{jk} = 1 if j = k and 0 if j ≠ k, and X and U are independent samples.
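The right-hand side only needs pairs of independently drawn inputs, so it can be estimated from two mini-batches and used as an auxiliary loss. A minimal sketch, assuming PyTorch and a hypothetical feature network `phi`; pairing the rows of two shuffled batches stands in for independent draws of X and U.

```python
import torch

def orthogonality_loss(phi, x_batch, u_batch):
    """Estimate E[(phi(X)^T phi(U))^2 - ||phi(X)||^2 - ||phi(U)||^2] from paired samples."""
    fx = phi(x_batch)              # shape (batch, d)
    fu = phi(u_batch)              # shape (batch, d)
    inner = (fx * fu).sum(dim=1)   # phi(X)^T phi(U) for each pair
    return (inner ** 2 - fx.pow(2).sum(dim=1) - fu.pow(2).sum(dim=1)).mean()
```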

Page 25:

Comment 7

• Orthogonal non-negative features are likely sparse

• If φ(x) is non-negative,

• E[φ_j(X) φ_k(X)] is near zero for any j ≠ k, only if a small number of features are active (instance sparsity)

• φ(x)ᵀ φ(u) is small only if there is little overlap in activation between vectors (lifetime sparsity)
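Both kinds of sparsity are easy to check empirically for a non-negative feature matrix. A small illustrative sketch; the function name and the activity threshold are assumptions.

```python
import numpy as np

def sparsity_stats(Phi, eps=1e-6):
    """Phi: non-negative feature matrix, rows = samples, columns = features."""
    active = Phi > eps
    instance_sparsity = 1.0 - active.mean(axis=1)   # per sample: fraction of inactive features
    lifetime_sparsity = 1.0 - active.mean(axis=0)   # per feature: fraction of samples it is inactive for
    # Average phi(x)^T phi(u) over distinct pairs of samples (activation overlap)
    G = Phi @ Phi.T
    overlap = (G.sum() - np.trace(G)) / (Phi.shape[0] * (Phi.shape[0] - 1))
    return instance_sparsity.mean(), lifetime_sparsity.mean(), overlap
```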

Page 26:

Question: How do we get good generalization?

• Do we build in constraints on our networks?

• Do we use lots of data/predictions? How much is enough?