
An Overview of Other Topics, Conclusions, and Perspectives

Piyush Rai

Machine Learning (CS771A)

Nov 11, 2016


Plan for today..

Survey of some other topics

Sparse Modeling

Time-series modeling

Reinforcement Learning

Multitask/Transfer Learning

Active Learning

Bayesian Learning

Conclusion and take-aways..


Sparse Modeling


Sparse Modeling

Many ML problems can be written in the following general form

y = Xw + ε

where y is N × 1, X is N × D, w is D × 1, and ε denotes observation noise

Often we expect/want w to be sparse (i.e., at most, say s ≪ D, nonzeros)

Becomes especially important when D ≫ N

Examples: Sparse regression/classification, sparse matrix factorization, compressive sensing, dictionary learning, and many others.


Sparse Modeling

Ideally, our goal will be to solve the following problem

arg min_w ||y − Xw||^2   s.t.   ||w||_0 ≤ s

Note: ||w||_0 is known as the ℓ0 norm of w

In general, an NP-hard problem: a combinatorial optimization problem with ∑_{i=1}^{s} (D choose i) possible locations of the nonzeros in w

Also, the constraint ||w||_0 ≤ s is non-convex (figure on the next slide).

A huge body of work on solving these problems. Primarily in two categories

Convex-ify the problem (e.g., replace the ℓ0 norm by the ℓ1 norm)

arg min_w ||y − Xw||^2  s.t.  ||w||_1 ≤ t      OR      arg min_w ||y − Xw||^2 + λ||w||_1

(note: the ℓ1 norm makes the objective non-differentiable at 0, but there are many ways to handle this)

Use non-convex optimization methods, e.g., iterative hard thresholding (rough idea: use gradient descent to solve for w and set the D − s smallest entries to zero in every iteration; basically a projected GD method; see the sketch below)

†See "Optimization Methods for ℓ1 Regularization" by Schmidt et al (2009)
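A minimal NumPy sketch of the iterative hard thresholding idea above, on a synthetic problem with a known sparsity level s (the step size, iteration count, and problem sizes are illustrative choices, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, s = 50, 200, 5                          # D >> N, s-sparse ground truth
X = rng.standard_normal((N, D))
w_true = np.zeros(D)
w_true[rng.choice(D, s, replace=False)] = rng.standard_normal(s)
y = X @ w_true + 0.01 * rng.standard_normal(N)

def iht(X, y, s, step=None, iters=500):
    """Iterative hard thresholding: gradient step on ||y - Xw||^2,
    then keep only the s largest-magnitude entries of w."""
    N, D = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2    # conservative step from the spectral norm
    w = np.zeros(D)
    for _ in range(iters):
        w = w + step * X.T @ (y - X @ w)          # gradient descent step
        keep = np.argsort(np.abs(w))[-s:]         # indices of the s largest entries
        w_new = np.zeros(D)
        w_new[keep] = w[keep]                     # project onto ||w||_0 <= s
        w = w_new
    return w

w_hat = iht(X, y, s)
print("support recovered:", set(np.nonzero(w_hat)[0]) == set(np.nonzero(w_true)[0]))
```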


Sparsity using the ℓ1 Norm

Why does the ℓ1 norm give sparsity? There are many explanations. An informal one:

The error-function contours are more likely to first touch the ℓ1 constraint region at a corner on the coordinate axes (where some coordinates are exactly zero) than in the ℓ2 case

Another explanation: Between the ℓ2 and ℓ1 norms, ℓ1 is "closer" to the ℓ0 norm (in fact, the ℓ1 norm is the closest convex approximation to the ℓ0 norm); a small numerical illustration follows below
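A tiny NumPy illustration of the same intuition, using the proximal operators of the ℓ1 and ℓ2 penalties (an illustrative framing, not from the slides): the ℓ1 prox (soft thresholding) sets small coordinates exactly to zero, while the ℓ2 prox only shrinks them:

```python
import numpy as np

w = np.array([-1.5, -0.3, 0.05, 0.4, 2.0])    # a mix of small and large coordinates
lam = 0.5                                      # penalty strength (illustrative)

# prox of lam*||w||_1: soft thresholding -> exact zeros for |w_i| <= lam
prox_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# prox of (lam/2)*||w||_2^2: uniform shrinkage -> entries shrink but never hit zero
prox_l2 = w / (1.0 + lam)

print("l1 prox:", prox_l1)    # small entries become exactly 0
print("l2 prox:", prox_l2)    # small entries stay nonzero
```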


Learning from Time-Series Data


Modeling Time-Series Data

The input is a sequence of (non-i.i.d.) examples y_1, y_2, ..., y_T

The problem may be supervised or unsupervised, e.g.,

Forecasting: Predict y_{T+1}, given y_1, y_2, ..., y_T

Cluster the examples or perform dimensionality reduction

Evolution of time-series data can be attributed to several factors

Teasing apart these factors of variation is also an important problem


Auto-regressive Models

Auto-regressive (AR): Regress each example on p previous examples

y_t = c + ∑_{i=1}^{p} w_i y_{t−i} + ε_t : an AR(p) model

Moving Average (MA): Regress each example on p previous stochastic errors

y_t = c + ε_t + ∑_{i=1}^{p} w_i ε_{t−i} : an MA(p) model

Auto-regressive Moving Average (ARMA): Regress each example on p previous examples and q previous stochastic errors

y_t = c + ε_t + ∑_{i=1}^{p} w_i y_{t−i} + ∑_{i=1}^{q} v_i ε_{t−i} : an ARMA(p, q) model
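As a small sketch, an AR(p) model can be fit by ordinary least squares on lagged values; the simulated series and the choice p = 2 below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 500, 2
# simulate an AR(2) process: y_t = 0.6*y_{t-1} - 0.2*y_{t-2} + noise
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.1 * rng.standard_normal()

# lagged design matrix: row for time t is [1, y_{t-1}, ..., y_{t-p}]
Xlag = np.column_stack([np.ones(T - p)] + [y[p - i:T - i] for i in range(1, p + 1)])
target = y[p:]

coef, *_ = np.linalg.lstsq(Xlag, target, rcond=None)
c, w = coef[0], coef[1:]
print("estimated c:", c, "estimated AR weights:", w)

# one-step forecast of y_{T+1} from the last p observations (most recent first)
y_next = c + w @ y[-1:-p - 1:-1]
print("forecast of the next value:", y_next)
```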


State-Space Models

Assume that each observation y_t in the time-series is generated by a low-dimensional latent factor x_t (one-hot or continuous)

Basically, a generative latent factor model: y_t = g(x_t) and x_t = f(x_{t−1}), where g and f are probability distributions

Very similar to PPCA/FA, except that the latent factor x_t depends on x_{t−1}

Some popular SSMs: Hidden Markov Models (one-hot latent factor x_t), Kalman Filters (real-valued latent factor x_t)

Note: Models like RNN/LSTM are also similar, except that these are not generative (but can be made generative)
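A minimal sketch of the Kalman-filter case (a linear-Gaussian SSM) in one dimension; the transition coefficient and noise variances are made-up values, just to show the predict/update recursion:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
a, q, r = 0.95, 0.1, 0.5          # transition coeff, process noise var, obs noise var (illustrative)

# simulate x_t = a*x_{t-1} + process noise,  y_t = x_t + observation noise
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + np.sqrt(q) * rng.standard_normal()
    y[t] = x[t] + np.sqrt(r) * rng.standard_normal()

# Kalman filter: track the posterior mean m and variance P of x_t given y_1..y_t
m, P = 0.0, 1.0
means = []
for t in range(T):
    m_pred = a * m                     # predict step
    P_pred = a * a * P + q
    K = P_pred / (P_pred + r)          # Kalman gain
    m = m_pred + K * (y[t] - m_pred)   # update step with observation y_t
    P = (1 - K) * P_pred
    means.append(m)

print("RMSE of filtered mean vs true state:",
      np.sqrt(np.mean((np.array(means) - x) ** 2)))
```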


Reinforcement Learning


Reinforcement Learning

A paradigm for interactive learning or “learning by doing”

Different from supervised learning. Supervision is “implicit”

The learner (agent) performs “actions” and gets “rewards”

Goal: Learn a policy that maximizes the agent's cumulative expected reward (the policy tells what action the agent should take next)

Order in which data arrives matters (sequential, non-i.i.d. data)

Agent's actions affect the subsequent data it receives

Many applications: Robotics and control, computer game playing (e.g., Atari, Go), online advertising, financial trading, etc.


Markov Decision Processes

Markov Decision Process (MDP) gives a way to formulate RL problems

MDP formulation assumes the agent knows its state in the environment

Partially Observed MDP (POMDP) can be used when the state is unknown

Assume there are K possible states and a set of actions at any state

A transition model p(s_{t+1} = ℓ | s_t = k, a_t = a) = P_a(k, ℓ)

Each P_a is a K × K matrix of transition probabilities (one for each action a)

A reward function R(s_t = k, a_t = a)

Goal: Find a "policy" π(s_t = k) which returns the optimal action for s_t = k (a small value-iteration sketch for a known MDP follows below)

(P_a, R) and π can be estimated in an alternating fashion

Estimating P_a and R requires some training data. Can be done even when the state space is continuous (requires solving a function approximation problem)
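The value-iteration sketch referenced above, for the case where (P_a, R) are known; the two-state, two-action MDP and the discount factor are made up for illustration:

```python
import numpy as np

# A tiny made-up MDP: K = 2 states, 2 actions.
# P[a] is the K x K transition matrix for action a; R[k, a] is the reward.
P = np.array([[[0.9, 0.1],       # action 0
               [0.2, 0.8]],
              [[0.5, 0.5],       # action 1
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9                       # discount factor

V = np.zeros(2)
for _ in range(1000):
    # Q(k, a) = R(k, a) + gamma * sum_l P_a(k, l) * V(l)
    Q = R + gamma * np.einsum('akl,l->ka', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)         # pi(k) = greedy action in state k
print("optimal values:", V, "policy:", policy)
```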


Multitask/Transfer Learning


Multitask/Transfer Learning

An umbrella term for settings where we want to learn multiple models that could potentially benefit from sharing information with each other

Each learning problem is a "task"

Note: We rarely know how the different tasks are related to each other

Have to learn the relatedness structure as well

For M tasks, we will jointly learn M models, say w_1, w_2, ..., w_M

Need to jointly regularize the models to make them similar to each other based on their (learned) degree/way of relatedness (a simple shared-mean regularization sketch follows below)

Some other related problem settings: Domain Adaptation, Covariate Shift
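The shared-mean regularization sketch referenced above: one simple (assumed) choice is to penalize each task's weight vector for deviating from the tasks' mean weight vector; a plain gradient-descent implementation for M linear regression tasks:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, D, lam = 5, 30, 10, 1.0
# synthetic related tasks: each true w_m is a noisy copy of a common vector
w_common = rng.standard_normal(D)
Xs = [rng.standard_normal((N, D)) for _ in range(M)]
ys = [X @ (w_common + 0.1 * rng.standard_normal(D)) + 0.1 * rng.standard_normal(N)
      for X in Xs]

# objective: sum_m ||y_m - X_m w_m||^2 + lam * sum_m ||w_m - w_bar||^2
W = np.zeros((M, D))
step = 1e-3
for _ in range(2000):
    w_bar = W.mean(axis=0)                 # treat the mean as fixed in each sweep
    for m in range(M):
        grad = -2 * Xs[m].T @ (ys[m] - Xs[m] @ W[m]) + 2 * lam * (W[m] - w_bar)
        W[m] -= step * grad

print("avg distance of task weights to their mean:",
      np.mean(np.linalg.norm(W - W.mean(axis=0), axis=1)))
```

Larger λ ties the tasks more strongly together; λ = 0 recovers independent per-task fits.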


Covariate Shift

Note: The input or the feature vector x is also known as “covariate”

Suppose training inputs come from a distribution p_tr(x)

Suppose test inputs come from a different distribution p_te(x), i.e., p_tr(x) ≠ p_te(x)

The following loss function can be used to handle this issue:

∑_{n=1}^{N} [ p_te(x_n) / p_tr(x_n) ] ℓ(y_n, f(x_n)) + R(f)

Why will this work? Well, because

E_{(x,y)∼p_te}[ ℓ(y, x, w) ] = E_{(x,y)∼p_tr}[ (p_te(x, y) / p_tr(x, y)) ℓ(y, x, w) ]

If p(y|x) doesn't change and only p(x) changes, then p_te(x, y) / p_tr(x, y) = p_te(x) / p_tr(x)

Can actually estimate the ratio without estimating the densities (a huge body of work on this problem)
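A minimal sketch of importance-weighted training under covariate shift, assuming the density ratio p_te(x)/p_tr(x) is known exactly (here it is, because both densities are chosen by hand; in practice it would be estimated):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# training inputs from p_tr = N(0, 1); test inputs of interest from p_te = N(1, 0.5^2)
x_tr = rng.normal(0.0, 1.0, N)
y_tr = np.sin(x_tr) + 0.1 * rng.standard_normal(N)

# importance weights w_n = p_te(x_n) / p_tr(x_n)
w = gauss_pdf(x_tr, 1.0, 0.5) / gauss_pdf(x_tr, 0.0, 1.0)

# weighted least squares for a simple linear model f(x) = a*x + b
X = np.column_stack([x_tr, np.ones(N)])
coef_weighted = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y_tr))
coef_plain = np.linalg.lstsq(X, y_tr, rcond=None)[0]
print("importance-weighted fit:", coef_weighted)
print("unweighted fit:         ", coef_plain)
```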


Active Learning


Active Learning

Allow the learner to ask for the most informative training examples
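A minimal sketch of one common query strategy, uncertainty sampling, on a synthetic pool; the use of scikit-learn's LogisticRegression and the two-blob data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# synthetic pool of points from two Gaussian blobs
X_pool = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y_pool = np.array([0] * 200 + [1] * 200)           # oracle labels (hidden from the learner)

labeled = [0, 200]                                  # one seed example from each class

for _ in range(20):                                 # 20 active-learning queries
    clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    probs = clf.predict_proba(X_pool)[:, 1]
    uncertainty = 1.0 - np.abs(probs - 0.5) * 2.0   # highest near the decision boundary
    uncertainty[labeled] = -np.inf                  # never re-query labeled points
    query = int(np.argmax(uncertainty))             # most informative (most uncertain) point
    labeled.append(query)                           # "ask the oracle" for its label

print("accuracy on the full pool:", clf.score(X_pool, y_pool))
```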


Bayesian Learning


Optimization vs Inference

Learning as Optimization

Parameter θ is a fixed unknown

Minimize a “loss” and find a point estimate (best answer) for θ, given data X

θ̂ = arg min_{θ∈Θ} Loss(X; θ)    or    arg max_θ log p(X|θ)   (Maximum Likelihood)    or    arg max_θ log p(X|θ) p(θ)   (Maximum-a-Posteriori Estimation)

Learning as (Bayesian) Inference

Treat the parameter θ as a random variable with a prior distribution p(θ)

Infer a posterior distribution over the parameters using Bayes rule

p(θ|X) = p(X|θ) p(θ) / p(X)  ∝  Likelihood × Prior

Posterior becomes the new prior for the next batch of observed data

No “fitting”, so no overfitting!
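A small Beta-Bernoulli coin example contrasting the two views; the Beta(2, 2) prior and the 7-heads/3-tails data are made up for illustration:

```python
# data: 7 heads out of 10 tosses
n_heads, n_tails = 7, 3
a0, b0 = 2.0, 2.0                       # Beta(2, 2) prior on the heads probability theta

# point estimates (optimization view)
theta_mle = n_heads / (n_heads + n_tails)                            # maximize log p(X|theta)
theta_map = (n_heads + a0 - 1) / (n_heads + n_tails + a0 + b0 - 2)   # maximize log p(X|theta)p(theta)

# full posterior (Bayesian inference view): Beta(a0 + heads, b0 + tails) by conjugacy
a_post, b_post = a0 + n_heads, b0 + n_tails
post_mean = a_post / (a_post + b_post)
post_var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))

print("MLE:", theta_mle, "MAP:", theta_map)
print("posterior mean:", post_mean, "posterior variance:", post_var)
```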


Why be Bayesian?

Can capture/quantify the uncertainty (or “variance”) in θ via the posterior

Can make predictions by averaging over the posterior

p(y|x, X, Y)   (the predictive posterior)   =   ∫ p(y|x, θ) p(θ|X, Y) dθ

Many other benefits (wait for next semester :) )
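Continuing the (illustrative) coin example, the predictive posterior averages p(y|θ) over the posterior; below it is computed in closed form and approximated by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
a_post, b_post = 9.0, 5.0               # Beta posterior from the coin example above

# p(y = heads | data) = integral of theta * p(theta | data) d(theta)
exact = a_post / (a_post + b_post)      # closed form: the posterior mean

# Monte Carlo approximation: average p(y | theta) over posterior samples
theta_samples = rng.beta(a_post, b_post, size=100_000)
mc_estimate = np.mean(theta_samples)

print("exact predictive prob of heads:", exact)
print("Monte Carlo estimate:", mc_estimate)
```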


Conclusion and Take-aways


Conclusion and Take-aways

Most learning problems can be cast as optimizing a regularized loss function

Probabilistic viewpoint: Most learning problems can be cast as doing MLE/MAP on a probabilistic model of the data

Negative log-likelihood (NLL) = loss function, log-prior = regularizer

More sophisticated models can be constructed with this basic understanding: Just think of the appropriate loss function/probability model for the data, and the appropriate regularizer/prior

Always start with simple models. Linear models can be really powerful given a good feature representation.

Learn to first diagnose a learning algorithm rather than trying new ones

No free lunch. No learning algorithm is “universally” good.


Thank You!
