Foundations of Machine Learning
Reinforcement Learning
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Reinforcement Learning
Agent exploring environment.
Interactions with the environment: see the figure below.
Problem: find an action policy that maximizes cumulative reward over the course of the interactions.
[Figure: agent-environment loop; the agent sends an action to the environment, and the environment returns the next state and a reward.]
Key Features
Contrast with supervised learning:
• no explicit labeled training data.
• distribution defined by actions taken.
Delayed rewards or penalties.
RL trade-off:
• exploration (of unknown states and actions) to gain more reward information; vs.
• exploitation (of known information) to optimize reward.
Applications
Robot control, e.g., Robocup Soccer Teams (Stone et al., 1999).
Board games, e.g., TD-Gammon (Tesauro, 1995).
Elevator scheduling (Crites and Barto, 1996).
Ads placement.
Telecommunications.
Inventory management.
Dynamic radio channel assignment.
This Lecture
Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem
Markov Decision Process (MDP)
Definition: a Markov Decision Process is defined by:
• a set of decision epochs $\{0, \ldots, T\}$.
• a set of states $S$, possibly infinite.
• a start state or initial state $s_0 \in S$.
• a set of actions $A$, possibly infinite.
• a transition probability $\Pr[s' \mid s, a]$: distribution over destination states $s' = \delta(s, a)$.
• a reward probability $\Pr[r' \mid s, a]$: distribution over rewards returned $r' = r(s, a)$.
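To make the definition concrete, here is a minimal sketch (not part of the lecture) of a finite MDP encoded as transition and reward tables; the class name and array layout are illustrative choices used by the later sketches as well.

import numpy as np

class FiniteMDP:
    """A finite MDP: P[s, a, s'] is Pr[s' | s, a], R[s, a] is E[r(s, a)]."""
    def __init__(self, P, R, gamma):
        assert np.allclose(P.sum(axis=2), 1.0), "each P[s, a] must be a distribution"
        self.P, self.R, self.gamma = P, R, gamma
        self.n_states, self.n_actions = R.shape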
Model
State observed at time $t$: $s_t \in S$.
Action taken at time $t$: $a_t \in A$.
State reached: $s_{t+1} = \delta(s_t, a_t)$.
Reward received: $r_{t+1} = r(s_t, a_t)$.
[Figure: trajectory $s_t \xrightarrow{a_t / r_{t+1}} s_{t+1} \xrightarrow{a_{t+1} / r_{t+2}} s_{t+2}$, alongside the agent-environment loop.]
MDPs - Properties
Finite MDPs: $S$ and $A$ finite sets.
Finite horizon when $T < \infty$.
Reward $r(s, a)$: often a deterministic function.
Example - Robot Picking up Balls
[Figure: MDP with states start and other; transitions labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], carry/[.5, R3], carry/[.5, -1], pickup/[1, R2].]
Policy
Definition: a policy is a mapping $\pi: S \to A$.
Objective: find policy $\pi$ maximizing expected return.
• finite horizon return: $\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big)$.
• infinite horizon return: $\sum_{t=0}^{+\infty} \gamma^t r\big(s_t, \pi(s_t)\big)$.
Theorem: for any finite MDP, there exists an optimal policy $\pi^*$ (for any start state).
Policy Value
Definition: the value of a policy $\pi$ at state $s$ is
• finite horizon: $V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{T-1} r\big(s_t, \pi(s_t)\big) \,\middle|\, s_0 = s\right]$.
• infinite horizon: discount factor $\gamma \in [0, 1)$, $V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r\big(s_t, \pi(s_t)\big) \,\middle|\, s_0 = s\right]$.
Problem: find policy $\pi$ with maximum value for all states.
Policy Evaluation
Analysis of policy value:
$$V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r\big(s_t, \pi(s_t)\big) \,\middle|\, s_0 = s\right] = \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\left[\sum_{t=0}^{+\infty} \gamma^t r\big(s_{t+1}, \pi(s_{t+1})\big) \,\middle|\, s_0 = s\right] = \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\big[V_\pi(\delta(s, \pi(s)))\big].$$
Bellman equations (system of linear equations):
$$V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s').$$
Bellman Equation - Existence and Uniqueness
Notation:
• transition probability matrix $\mathbf{P}_{s,s'} = \Pr[s' \mid s, \pi(s)]$.
• value column matrix $\mathbf{V} = V_\pi(s)$.
• expected reward column matrix: $\mathbf{R} = \mathbb{E}[r(s, \pi(s))]$.
Theorem: for a finite MDP, Bellman's equation admits a unique solution given by
$$\mathbf{V}_0 = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}.$$
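The closed form above maps directly to a linear solve. A sketch, assuming the hypothetical FiniteMDP container introduced earlier:

import numpy as np

def evaluate_policy(mdp, policy):
    """Solve the Bellman system (I - gamma * P_pi) V = R_pi for V."""
    n = mdp.n_states
    P_pi = mdp.P[np.arange(n), policy]   # row s: transition distribution under policy[s]
    R_pi = mdp.R[np.arange(n), policy]   # entry s: expected reward E[r(s, policy(s))]
    return np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, R_pi)

Using np.linalg.solve rather than forming the inverse explicitly is the standard numerically safer choice.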
Bellman Equation - Existence and Uniqueness
Proof: Bellman's equation can be rewritten as $\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}$.
• $\mathbf{P}$ is a stochastic matrix, thus $\|\mathbf{P}\|_\infty = \max_s \sum_{s'} |\mathbf{P}_{s s'}| = \max_s \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1$.
• This implies that $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$: the eigenvalues of $\gamma \mathbf{P}$ are all less than one in magnitude and $(\mathbf{I} - \gamma \mathbf{P})$ is invertible.
Notes: general shortest-distance problem (MM, 2002).
Optimal Policy
Definition: policy $\pi^*$ with maximal value for all states $s \in S$.
• value of $\pi^*$ (optimal value): $\forall s \in S,\ V_{\pi^*}(s) = \max_\pi V_\pi(s)$.
• optimal state-action value function: expected return for taking action $a$ at state $s$ and then following the optimal policy:
$$Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma\, \mathbb{E}[V^*(\delta(s, a))] = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s').$$
Optimal Values - Bellman Equations
Property: the following equalities hold:
$$\forall s \in S,\ V^*(s) = \max_{a \in A} Q^*(s, a).$$
Proof: by definition, for all $s$, $V^*(s) \leq \max_{a \in A} Q^*(s, a)$.
• If for some $s$ we had $V^*(s) < \max_{a \in A} Q^*(s, a)$, then the maximizing action would define a better policy.
Thus,
$$V^*(s) = \max_{a \in A} \left\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \right\}.$$
This Lecture
Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem
Known Model
Setting: environment model known.
Problem: find optimal policy.
Algorithms:
• value iteration.
• policy iteration.
• linear programming.
Value Iteration Algorithm
ValueIteration(V0)
 1  V ← V0        ▷ V0 arbitrary value
 2  while ‖V − Φ(V)‖ ≥ ((1 − γ)/γ) ε do
 3      V ← Φ(V)
 4  return Φ(V)
where
$$\Phi(V) = \max_\pi \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi V\}, \quad \text{i.e.,} \quad \Phi(V)(s) = \max_{a \in A} \left\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s') \right\}.$$
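A direct sketch of this loop on the hypothetical FiniteMDP container from earlier; the vectorized backup computes Φ(V) for all states at once.

import numpy as np

def value_iteration(mdp, eps=1e-6):
    """Iterate V <- Phi(V) until ||Phi(V) - V||_inf < (1 - gamma) * eps / gamma."""
    V = np.zeros(mdp.n_states)
    threshold = (1.0 - mdp.gamma) * eps / mdp.gamma
    while True:
        Q = mdp.R + mdp.gamma * (mdp.P @ V)  # Q[s, a] = E[r(s, a)] + gamma * sum_s' P V
        V_new = Q.max(axis=1)                # Phi(V)(s) = max_a Q[s, a]
        if np.max(np.abs(V_new - V)) < threshold:
            return V_new
        V = V_new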
VI Algorithm - Convergence
Theorem: for any initial value $V_0$, the sequence defined by $V_{n+1} = \Phi(V_n)$ converges to $V^*$.
Proof: we show that $\Phi$ is $\gamma$-contracting for $\|\cdot\|_\infty$; the existence and uniqueness of the fixed point then follow.
• For any $s \in S$, let $a^*(s)$ be the maximizing action defining $\Phi(V)(s)$. Then, for $s \in S$ and any $U$,
$$\Phi(V)(s) - \Phi(U)(s) \leq \Phi(V)(s) - \left( \mathbb{E}[r(s, a^*(s))] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, U(s') \right) = \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, [V(s') - U(s')] \leq \gamma \sum_{s' \in S} \Pr[s' \mid s, a^*(s)]\, \|V - U\|_\infty = \gamma \|V - U\|_\infty.$$
Complexity and Optimality
Complexity: convergence in $O(\log \frac{1}{\epsilon})$. Observe that
$$\|V_{n+1} - V_n\|_\infty \leq \gamma \|V_n - V_{n-1}\|_\infty \leq \gamma^n \|\Phi(V_0) - V_0\|_\infty.$$
Thus, $\gamma^n \|\Phi(V_0) - V_0\|_\infty \leq \frac{(1 - \gamma)\epsilon}{\gamma} \Rightarrow n = O\big(\log \frac{1}{\epsilon}\big)$.
$\epsilon$-Optimality: let $V_{n+1}$ be the value returned. Then,
$$\|V^* - V_{n+1}\|_\infty \leq \|V^* - \Phi(V_{n+1})\|_\infty + \|\Phi(V_{n+1}) - V_{n+1}\|_\infty \leq \gamma \|V^* - V_{n+1}\|_\infty + \gamma \|V_{n+1} - V_n\|_\infty.$$
Thus,
$$\|V^* - V_{n+1}\|_\infty \leq \frac{\gamma}{1 - \gamma} \|V_{n+1} - V_n\|_\infty \leq \epsilon.$$
VI Algorithm - Example
[Figure: two-state MDP; from state 1, action a self-loops with [3/4, 2] and goes to state 2 with [1/4, 2], and action b goes to state 2 with [1, 2]; from state 2, action d goes to state 1 with [1, 3] and action c self-loops with [1, 2].]
$$V_{n+1}(1) = \max\left\{ 2 + \gamma \left( \tfrac{3}{4} V_n(1) + \tfrac{1}{4} V_n(2) \right),\ 2 + \gamma V_n(2) \right\}$$
$$V_{n+1}(2) = \max\left\{ 3 + \gamma V_n(1),\ 2 + \gamma V_n(2) \right\}.$$
For $V_0(1) = -1$, $V_0(2) = 1$, $\gamma = 1/2$: $V_1(1) = V_1(2) = 5/2$.
But $V^*(1) = 14/3$, $V^*(2) = 16/3$.
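As a sanity check of the fixed point, a sketch encoding this example with the earlier FiniteMDP and value_iteration helpers (state 1 maps to index 0, state 2 to index 1; action index 0 is a/d and index 1 is b/c, an encoding chosen here purely for illustration):

import numpy as np

P = np.array([[[0.75, 0.25],   # state 1, action a
               [0.00, 1.00]],  # state 1, action b
              [[1.00, 0.00],   # state 2, action d
               [0.00, 1.00]]]) # state 2, action c
R = np.array([[2.0, 2.0],
              [3.0, 2.0]])
mdp = FiniteMDP(P, R, gamma=0.5)
print(value_iteration(mdp))    # approximately [4.6667, 5.3333] = [14/3, 16/3]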
Policy Iteration Algorithm
PolicyIteration(π0)
 1  π ← π0        ▷ π0 arbitrary policy
 2  π′ ← nil
 3  while (π ≠ π′) do
 4      V ← Vπ    ▷ policy evaluation: solve (I − γPπ)V = Rπ.
 5      π′ ← π
 6      π ← argmaxπ {Rπ + γPπV}   ▷ greedy policy improvement.
 7  return π
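A sketch in the same style, reusing the hypothetical evaluate_policy helper from the policy-evaluation slide:

import numpy as np

def policy_iteration(mdp):
    """Alternate exact policy evaluation with greedy policy improvement."""
    policy = np.zeros(mdp.n_states, dtype=int)       # arbitrary initial policy
    while True:
        V = evaluate_policy(mdp, policy)             # solve (I - gamma P_pi) V = R_pi
        improved = (mdp.R + mdp.gamma * (mdp.P @ V)).argmax(axis=1)
        if np.array_equal(improved, policy):         # greedy step changed nothing
            return policy, V
        policy = improved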
PI Algorithm - Convergence
Theorem: let $(V_n)_{n \in \mathbb{N}}$ be the sequence of policy values computed by the algorithm; then,
$$V_n \leq V_{n+1} \leq V^*.$$
Proof: let $\pi_{n+1}$ be the policy improvement at the $n$th iteration; then, by definition,
$$\mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} V_n \geq \mathbf{R}_{\pi_n} + \gamma \mathbf{P}_{\pi_n} V_n = V_n.$$
• therefore, $\mathbf{R}_{\pi_{n+1}} \geq (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}}) V_n$.
• note that $(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1}$ preserves ordering: $\mathbf{X} \geq 0 \Rightarrow (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} \mathbf{X} = \sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{n+1}})^k \mathbf{X} \geq 0$.
• thus, $V_{n+1} = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} \mathbf{R}_{\pi_{n+1}} \geq V_n$.
Notes
Two consecutive policy values can be equal only at the last iteration.
The total number of possible policies is $|A|^{|S|}$; thus, this is the maximal possible number of iterations.
• best upper bound known: $O\big(\frac{|A|^{|S|}}{|S|}\big)$.
PI Algorithm - Example
[Figure: same two-state MDP as in the VI example.]
Initial policy: $\pi_0(1) = b$, $\pi_0(2) = c$.
Evaluation:
$$V_{\pi_0}(1) = 1 + \gamma V_{\pi_0}(2) \qquad V_{\pi_0}(2) = 2 + \gamma V_{\pi_0}(2).$$
Thus,
$$V_{\pi_0}(1) = \frac{1 + \gamma}{1 - \gamma} \qquad V_{\pi_0}(2) = \frac{2}{1 - \gamma}.$$
VI and PI Algorithms - Comparison
Theorem: let $(U_n)_{n \in \mathbb{N}}$ be the sequence of policy values generated by the VI algorithm, and $(V_n)_{n \in \mathbb{N}}$ the one generated by the PI algorithm. If $U_0 = V_0$, then,
$$\forall n \in \mathbb{N},\ U_n \leq V_n \leq V^*.$$
Proof: we first show that $\Phi$ is monotonic. Let $U$ and $V$ be such that $U \leq V$ and let $\pi$ be the policy such that $\Phi(U) = \mathbf{R}_\pi + \gamma \mathbf{P}_\pi U$. Then,
$$\Phi(U) \leq \mathbf{R}_\pi + \gamma \mathbf{P}_\pi V \leq \max_{\pi'} \{\mathbf{R}_{\pi'} + \gamma \mathbf{P}_{\pi'} V\} = \Phi(V).$$
VI and PI Algorithms - Comparison
• The proof is by induction on $n$. Assume $U_n \leq V_n$; then, by the monotonicity of $\Phi$,
$$U_{n+1} = \Phi(U_n) \leq \Phi(V_n) = \max_\pi \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi V_n\}.$$
• Let $\pi_{n+1}$ be the maximizing policy: $\pi_{n+1} = \operatorname{argmax}_\pi \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi V_n\}$.
• Then,
$$\Phi(V_n) = \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} V_n \leq \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} V_{n+1} = V_{n+1}.$$
Notes
In view of the comparison theorem, the PI algorithm converges in a smaller number of iterations than the VI algorithm.
But each iteration of the PI algorithm requires computing a policy value, i.e., solving a system of linear equations, which is more expensive than an iteration of the VI algorithm.
Primal Linear Program
LP formulation: choose $\alpha(s) > 0$, with $\sum_s \alpha(s) = 1$:
$$\min_V \sum_{s \in S} \alpha(s) V(s)$$
$$\text{subject to } \forall s \in S, \forall a \in A,\ V(s) \geq \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s').$$
Parameters:
• number of rows: $|S||A|$.
• number of columns: $|S|$.
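This LP can be handed to an off-the-shelf solver. A sketch using scipy.optimize.linprog on the hypothetical FiniteMDP container; since linprog minimizes c·x subject to A_ub·x ≤ b_ub, each constraint is flipped to γP[s, a]·V − V(s) ≤ −E[r(s, a)]:

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(mdp, alpha=None):
    """Primal LP: minimize sum_s alpha(s) V(s) subject to V >= Bellman backup."""
    n, k = mdp.n_states, mdp.n_actions
    if alpha is None:
        alpha = np.full(n, 1.0 / n)              # any alpha > 0 summing to 1
    A_ub = mdp.gamma * mdp.P.reshape(n * k, n)   # one row per (s, a) pair
    A_ub[np.arange(n * k), np.repeat(np.arange(n), k)] -= 1.0
    b_ub = -mdp.R.reshape(n * k)
    res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    return res.x                                 # the optimal values V*(s)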
Dual Linear Program
LP formulation:
$$\max_x \sum_{s \in S, a \in A} \mathbb{E}[r(s, a)]\, x(s, a)$$
$$\text{subject to } \forall s' \in S,\ \sum_{a \in A} x(s', a) = \alpha(s') + \gamma \sum_{s \in S, a \in A} \Pr[s' \mid s, a]\, x(s, a)$$
$$\forall s \in S, \forall a \in A,\ x(s, a) \geq 0.$$
Parameters: more favorable number of rows.
• number of rows: $|S|$.
• number of columns: $|S||A|$.
This Lecture
Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem
Problem
Unknown model:
• transition and reward probabilities not known.
• realistic scenario in many practical problems, e.g., robot control.
Training information: sequence of immediate rewards based on actions taken.
Learning approaches:
• model-free: learn policy directly.
• model-based: learn model, use it to learn policy.
Problem
How do we estimate reward and transition probabilities?
• use equations derived for policy value and Q-functions.
• but, equations given in terms of some expectations.
• instance of a stochastic approximation problem.
Stochastic Approximation
Problem: find the solution of $x = H(x)$ with $x \in \mathbb{R}^N$, while
• $H(x)$ cannot be computed, e.g., $H$ is not accessible;
• an i.i.d. sample of $m$ noisy observations $H(x_i) + w_i$ is available, $i \in [1, m]$, with $\mathbb{E}[w] = 0$.
Idea: algorithm based on an iterative technique:
$$x_{t+1} = (1 - \alpha_t) x_t + \alpha_t [H(x_t) + w_t] = x_t + \alpha_t [H(x_t) + w_t - x_t].$$
• more generally, $x_{t+1} = x_t + \alpha_t D(x_t, w_t)$.
Mean Estimation
Theorem: let $X$ be a random variable taking values in $[0, 1]$ and let $x_0, \ldots, x_m$ be i.i.d. values of $X$. Define the sequence $(\mu_m)_{m \in \mathbb{N}}$ by
$$\mu_{m+1} = (1 - \alpha_m)\mu_m + \alpha_m x_m \quad \text{with } \mu_0 = x_0.$$
Then, for $\alpha_m \in [0, 1]$ with $\sum_{m \geq 0} \alpha_m = +\infty$ and $\sum_{m \geq 0} \alpha_m^2 < +\infty$,
$$\mu_m \xrightarrow{\text{a.s.}} \mathbb{E}[X].$$
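A quick numeric illustration (my own, not from the slides): with the schedule α_m = 1/(m + 1), which satisfies both conditions, the update reduces to the running average and the estimate settles near E[X].

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100_000)  # i.i.d. draws of X in [0, 1], E[X] = 1/2
mu = x[0]                                # mu_0 = x_0
for m in range(1, len(x)):
    alpha = 1.0 / (m + 1)                # sum alpha_m diverges, sum alpha_m^2 converges
    mu = (1.0 - alpha) * mu + alpha * x[m]
print(mu)                                # close to 0.5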
Proof
Proof: by the independence assumption, for $m \geq 0$,
$$\operatorname{Var}[\mu_{m+1}] = (1 - \alpha_m)^2 \operatorname{Var}[\mu_m] + \alpha_m^2 \operatorname{Var}[x_m] \leq (1 - \alpha_m) \operatorname{Var}[\mu_m] + \alpha_m^2.$$
• We have $\alpha_m \to 0$ since $\sum_{m \geq 0} \alpha_m^2 < +\infty$.
• Let $\epsilon > 0$ and suppose there exists $N \in \mathbb{N}$ such that for all $m \geq N$, $\operatorname{Var}[\mu_m] \geq \epsilon$. Then, for $m \geq N$,
$$\operatorname{Var}[\mu_{m+1}] \leq \operatorname{Var}[\mu_m] - \alpha_m \epsilon + \alpha_m^2,$$
which implies
$$\operatorname{Var}[\mu_{m+N}] \leq \operatorname{Var}[\mu_N] - \epsilon \sum_{n=N}^{m+N} \alpha_n + \sum_{n=N}^{m+N} \alpha_n^2 \xrightarrow[m \to \infty]{} -\infty,$$
contradicting $\operatorname{Var}[\mu_{m+N}] \geq 0$.
Mean Estimation
• Thus, for all $\epsilon > 0$, there exists $N \in \mathbb{N}$ such that $\forall m \geq N$, $\alpha_m \leq \epsilon$. Choose $m_0 \geq N$ large enough so that $\operatorname{Var}[\mu_{m_0}] < \epsilon$. Then,
$$\operatorname{Var}[\mu_{m_0+1}] \leq (1 - \alpha_{m_0})\epsilon + \epsilon\, \alpha_{m_0} = \epsilon.$$
• Therefore, $\operatorname{Var}[\mu_m] \leq \epsilon$ for all $m \geq m_0$ ($L^2$ convergence).
Notes
Special case: $\alpha_m = \frac{1}{m}$.
• strong law of large numbers.
Connection with stochastic approximation.
TD(0) Algorithm
Idea: recall Bellman's linear equations giving $V$:
$$V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s') = \mathbb{E}_{s'}\big[r(s, \pi(s)) + \gamma V_\pi(s') \mid s\big].$$
Algorithm: temporal difference (TD).
• sample new state $s'$.
• update ($\alpha$ depends on the number of visits of $s$):
$$V(s) \leftarrow (1 - \alpha)V(s) + \alpha[r(s, \pi(s)) + \gamma V(s')] = V(s) + \alpha[\underbrace{r(s, \pi(s)) + \gamma V(s') - V(s)}_{\text{temporal difference of } V \text{ values}}].$$
TD(0) Algorithm
TD(0)()
 1  V ← V0        ▷ initialization.
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          r′ ← Reward(s, π(s))
 6          s′ ← NextState(π, s)
 7          V(s) ← (1 − α)V(s) + α[r′ + γV(s′)]
 8          s ← s′
 9  return V
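A minimal Python mirror of this pseudocode; the env.reset() / env.step(action) → (next_state, reward) interface is an assumption of this sketch, not part of the lecture.

import collections

def td0(env, policy, gamma, alpha, n_epochs, steps_per_epoch):
    """TD(0) sketch: tabular V stored in a defaultdict (V0 = 0)."""
    V = collections.defaultdict(float)
    for _ in range(n_epochs):
        s = env.reset()                          # SelectState()
        for _ in range(steps_per_epoch):
            s_next, r = env.step(policy[s])      # Reward and NextState
            # V(s) <- (1 - alpha) V(s) + alpha [r + gamma V(s')]
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V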
Q-Learning Algorithm
Idea: assume deterministic rewards. Then,
$$Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') = \mathbb{E}_{s'}\big[r(s, a) + \gamma \max_{a' \in A} Q^*(s', a')\big].$$
Algorithm: $\alpha \in [0, 1]$ depends on the number of visits.
• sample new state $s'$.
• update:
$$Q(s, a) \leftarrow \alpha Q(s, a) + (1 - \alpha)\big[r(s, a) + \gamma \max_{a' \in A} Q(s', a')\big].$$
Q-Learning Algorithm
(Watkins, 1989; Watkins and Dayan, 1992)
Q-Learning(π)
 1  Q ← Q0        ▷ initialization, e.g., Q0 = 0.
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          a ← SelectAction(π, s)   ▷ policy π derived from Q, e.g., ε-greedy.
 6          r′ ← Reward(s, a)
 7          s′ ← NextState(s, a)
 8          Q(s, a) ← Q(s, a) + α[r′ + γ maxa′ Q(s′, a′) − Q(s, a)]
 9          s ← s′
10  return Q
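A tabular sketch under the same assumed env interface as the TD(0) code, with an ε-greedy behavior policy; all parameter names are illustrative.

import collections
import random

def q_learning(env, n_actions, gamma, alpha, epsilon, n_epochs, steps_per_epoch):
    """Tabular Q-learning sketch (Q0 = 0, fixed learning rate alpha)."""
    Q = collections.defaultdict(float)
    for _ in range(n_epochs):
        s = env.reset()
        for _ in range(steps_per_epoch):
            if random.random() < epsilon:        # epsilon-greedy action selection
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[s, b])
            s_next, r = env.step(a)
            target = r + gamma * max(Q[s_next, b] for b in range(n_actions))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q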
Notes
Can be viewed as a stochastic formulation of the value iteration algorithm.
Convergence for any policy, so long as states and actions are visited infinitely often.
How should the action be chosen at each iteration? Maximize reward? Explore other actions? Q-learning is an off-policy method: no control over the policy.
Policies
Epsilon-greedy strategy:
• with probability $1 - \epsilon$, greedy action from $s$;
• with probability $\epsilon$, random action.
Epoch-dependent strategy (Boltzmann exploration):
$$p_t(a \mid s, Q) = \frac{e^{\frac{Q(s, a)}{\tau_t}}}{\sum_{a' \in A} e^{\frac{Q(s, a')}{\tau_t}}},$$
• $\tau_t \to 0$: greedy selection.
• larger $\tau_t$: random action.
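Both strategies are a few lines each. A sketch operating on one row of a tabular Q (the function names and the numpy RNG are my choices):

import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Greedy action with probability 1 - epsilon, uniform otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def boltzmann(q_row, tau, rng):
    """Sample a with probability proportional to exp(Q(s, a) / tau)."""
    logits = np.asarray(q_row) / tau
    logits -= logits.max()                 # subtract max for numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return int(rng.choice(len(q_row), p=p))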
Convergence of Q-Learning
Theorem: consider a finite MDP. Assume that for all $s \in S$ and $a \in A$, $\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty$, with $\alpha_t(s, a) \in [0, 1]$. Then, the Q-learning algorithm converges to the optimal value $Q^*$ (with probability one).
• note: the conditions on $\alpha_t(s, a)$ impose that each state-action pair is visited infinitely many times.
SARSA: On-Policy Algorithm
SARSA(π)
 1  Q ← Q0        ▷ initialization, e.g., Q0 = 0.
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      a ← SelectAction(π(Q), s)   ▷ policy π derived from Q, e.g., ε-greedy.
 5      for each step of epoch t do
 6          r′ ← Reward(s, a)
 7          s′ ← NextState(s, a)
 8          a′ ← SelectAction(π(Q), s′)   ▷ policy π derived from Q, e.g., ε-greedy.
 9          Q(s, a) ← Q(s, a) + αt(s, a)[r′ + γQ(s′, a′) − Q(s, a)]
10          s ← s′
11          a ← a′
12  return Q
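A sketch under the same assumed env interface; compared with the q_learning sketch, the Q-value of the action actually selected at s′ enters the target instead of the maximum over actions.

import collections
import random

def sarsa(env, n_actions, gamma, alpha, epsilon, n_epochs, steps_per_epoch):
    """Tabular SARSA sketch: on-policy target r + gamma * Q(s', a')."""
    Q = collections.defaultdict(float)

    def select(s):                               # epsilon-greedy derived from Q
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda b: Q[s, b])

    for _ in range(n_epochs):
        s = env.reset()
        a = select(s)
        for _ in range(steps_per_epoch):
            s_next, r = env.step(a)
            a_next = select(s_next)              # next action from the current policy
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            s, a = s_next, a_next
    return Q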
Notes
Differences with Q-learning:
• the update involves two state-action pairs: the current one and the next one.
• the maximum Q-value at the next state is not used; instead, the Q-value of the new action actually selected at the next state is.
SARSA: name derived from the sequence of quantities used in the update: state, action, reward, (next) state, (next) action.
TD(λ) Algorithm
Idea:
• TD(0) or Q-learning only use the immediate reward.
• use multiple steps ahead instead; for $n$ steps:
$$R_t^n = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}), \qquad V(s) \leftarrow V(s) + \alpha\big(R_t^n - V(s)\big).$$
• TD(λ) uses $R_t^\lambda = (1 - \lambda)\sum_{n=0}^{\infty} \lambda^n R_t^n$.
Algorithm:
$$V(s) \leftarrow V(s) + \alpha\big(R_t^\lambda - V(s)\big).$$
TD(λ) Algorithm
TD(λ)()
 1  V ← V0        ▷ initialization.
 2  e ← 0
 3  for t ← 0 to T do
 4      s ← SelectState()
 5      for each step of epoch t do
 6          s′ ← NextState(π, s)
 7          δ ← r(s, π(s)) + γV(s′) − V(s)
 8          e(s) ← λe(s) + 1
 9          for u ∈ S do
10              if u ≠ s then
11                  e(u) ← γλe(u)
12              V(u) ← V(u) + αδe(u)
13          s ← s′
14  return V
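A sketch with standard accumulating eligibility traces under the same assumed env interface; note it uses the common ordering where every trace decays by γλ after the value update, a slight reordering of the pseudocode above.

import collections

def td_lambda(env, policy, gamma, lam, alpha, n_epochs, steps_per_epoch):
    """TD(lambda) sketch with accumulating eligibility traces."""
    V = collections.defaultdict(float)
    for _ in range(n_epochs):
        e = collections.defaultdict(float)       # eligibility traces, reset per epoch
        s = env.reset()
        for _ in range(steps_per_epoch):
            s_next, r = env.step(policy[s])
            delta = r + gamma * V[s_next] - V[s] # temporal difference
            e[s] += 1.0                          # accumulate trace at current state
            for u in list(e):
                V[u] += alpha * delta * e[u]     # credit all recently visited states
                e[u] *= gamma * lam              # decay every trace
            s = s_next
    return V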
TD-Gammon (Tesauro, 1995)
Large state space or costly actions: use a regression algorithm to estimate Q for unseen values.
Backgammon:
• large number of positions: 30 pieces, 24-26 locations.
• large number of moves.
TD-Gammon: used neural networks.
• non-linear form of TD(λ); 1.5M games played.
• almost as good as world-class humans (master level).
This Lecture
Markov Decision Processes (MDPs)
Planning
Learning
Multi-armed bandit problem
Multi-Armed Bandit Problem
Problem: a gambler must decide which arm of an $N$-slot machine to pull to maximize the total reward in a series of trials (Robbins, 1952).
• stochastic setting: $N$ lever reward distributions.
• adversarial setting: reward selected by an adversary aware of all the past.
Applications
Clinical trials.
Adaptive routing.
Ads placement on pages.
Games.
Multi-Armed Bandit Game
For $t = 1$ to $T$ do
• adversary determines outcome $y_t \in Y$.
• player selects probability distribution $p_t$ and pulls lever $I_t \in \{1, \ldots, N\}$, $I_t \sim p_t$.
• player incurs loss $L(I_t, y_t)$ (adversary is informed of $p_t$ and $I_t$).
Objective: minimize regret
$$\text{Regret}(T) = \sum_{t=1}^{T} L(I_t, y_t) - \min_{i=1,\ldots,N} \sum_{t=1}^{T} L(i, y_t).$$
Notes
Player is informed only of the loss (or reward) corresponding to his own action.
Adversary knows the past but not the action selected.
Stochastic setting: loss vector $(L(1, y_t), \ldots, L(N, y_t))$ drawn according to some distribution $D = D_1 \otimes \cdots \otimes D_N$. Regret definition modified by taking expectations.
Exploration/Exploitation trade-off: playing the best arm found so far versus seeking to find an arm with a better payoff.
Notes
Equivalent views:
• special case of learning with partial information.
• one-state MDP learning problem.
Simple strategy: $\epsilon_t$-greedy: play the arm with best empirical reward with probability $1 - \epsilon_t$, a random arm with probability $\epsilon_t$.
Exponentially Weighted Average
Algorithm: Exp3, defined for $\eta, \gamma > 0$ by
$$p_{i,t} = (1 - \gamma) \frac{\exp\big(-\eta \sum_{s=1}^{t-1} \widehat{l}_{i,s}\big)}{\sum_{i'=1}^{N} \exp\big(-\eta \sum_{s=1}^{t-1} \widehat{l}_{i',s}\big)} + \frac{\gamma}{N}, \quad \text{with } \forall i \in [1, N],\ \widehat{l}_{i,t} = \frac{L(I_t, y_t)}{p_{I_t,t}} 1_{I_t = i}.$$
Guarantee: expected regret of $O\big(\sqrt{NT \log N}\big)$.
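A sketch of the sampling loop (the array layout, seeding, and names are my choices; the losses are revealed one coordinate at a time, which is why the importance-weighted estimate is needed):

import numpy as np

def exp3(losses, eta, gamma, seed=0):
    """Exp3 sketch: losses is a (T, N) array of L(i, y_t); only the pulled
    arm's loss is used, mimicking the bandit feedback model."""
    rng = np.random.default_rng(seed)
    T, N = losses.shape
    cum_est = np.zeros(N)                        # cumulative estimated losses
    pulls = []
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))   # shift for stability
        p = (1.0 - gamma) * w / w.sum() + gamma / N
        i = int(rng.choice(N, p=p))
        pulls.append(i)
        cum_est[i] += losses[t, i] / p[i]        # unbiased importance-weighted estimate
    return pulls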
Exponentially Weighted Average
Proof: similar to the one for the Exponentially Weighted Average algorithm, with the additional observation that
$$\mathbb{E}_{I_t \sim p_t}\big[\widehat{l}_{i,t}\big] = \sum_{j=1}^{N} p_{j,t}\, \frac{L(j, y_t)}{p_{j,t}}\, 1_{j = i} = L(i, y_t).$$
References
• Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, 2 vols. Athena Scientific, Belmont, MA, 2007.
• Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
• Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.
• Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.
• Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
References
• Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3), 1995.
• Christopher J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.
• Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4), 1992.
Appendix
Stochastic Approximation
Problem: find the solution of $x = H(x)$ with $x \in \mathbb{R}^N$, while
• $H(x)$ cannot be computed, e.g., $H$ is not accessible;
• an i.i.d. sample of $m$ noisy observations $H(x_i) + w_i$ is available, $i \in [1, m]$, with $\mathbb{E}[w] = 0$.
Idea: algorithm based on an iterative technique:
$$x_{t+1} = (1 - \alpha_t) x_t + \alpha_t [H(x_t) + w_t] = x_t + \alpha_t [H(x_t) + w_t - x_t].$$
• more generally, $x_{t+1} = x_t + \alpha_t D(x_t, w_t)$.
Supermartingale Convergence
Theorem: let $X_t$, $Y_t$, $Z_t$ be non-negative random variables such that $\sum_{t=0}^{\infty} Y_t < \infty$. If the condition $\mathbb{E}\big[X_{t+1} \mid \mathcal{F}_t\big] \leq X_t + Y_t - Z_t$ holds, then,
• $X_t$ converges to a limit (with probability one).
• $\sum_{t=0}^{\infty} Z_t < \infty$.
Convergence Analysis
Convergence of $x_{t+1} = x_t + \alpha_t D(x_t, w_t)$ to $x^*$, with history $\mathcal{F}_t$ defined by
$$\mathcal{F}_t = \{(x_{t'})_{t' \leq t}, (\alpha_{t'})_{t' \leq t}, (w_{t'})_{t' < t}\}.$$
Theorem: let $\Psi: x \mapsto \frac{1}{2}\|x - x^*\|_2^2$ and assume that:
• $\exists K_1, K_2:\ \mathbb{E}\big[\|D(x_t, w_t)\|_2^2 \mid \mathcal{F}_t\big] \leq K_1 + K_2 \Psi(x_t)$;
• $\exists c:\ \nabla\Psi(x_t)^\top \mathbb{E}\big[D(x_t, w_t) \mid \mathcal{F}_t\big] \leq -c\, \Psi(x_t)$;
• $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$.
Then, $x_t \xrightarrow{\text{a.s.}} x^*$.
Convergence Analysis
Proof: since $\Psi$ is a quadratic function,
$$\Psi(x_{t+1}) = \Psi(x_t) + \nabla\Psi(x_t)^\top (x_{t+1} - x_t) + \tfrac{1}{2}(x_{t+1} - x_t)^\top \nabla^2\Psi(x_t)(x_{t+1} - x_t).$$
Thus,
$$\mathbb{E}\big[\Psi(x_{t+1}) \mid \mathcal{F}_t\big] = \Psi(x_t) + \alpha_t \nabla\Psi(x_t)^\top \mathbb{E}\big[D(x_t, w_t) \mid \mathcal{F}_t\big] + \frac{\alpha_t^2}{2} \mathbb{E}\big[\|D(x_t, w_t)\|^2 \mid \mathcal{F}_t\big]$$
$$\leq \Psi(x_t) - \alpha_t c\, \Psi(x_t) + \frac{\alpha_t^2}{2}\big(K_1 + K_2 \Psi(x_t)\big)$$
$$= \Psi(x_t) + \frac{\alpha_t^2 K_1}{2} - \left( \alpha_t c - \frac{\alpha_t^2 K_2}{2} \right) \Psi(x_t).$$
By the supermartingale convergence theorem, $\Psi(x_t)$ converges and
$$\sum_{t=0}^{\infty} \left( \alpha_t c - \frac{\alpha_t^2 K_2}{2} \right) \Psi(x_t) < \infty,$$
where the coefficient is non-negative for large $t$. Since $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\Psi(x_t)$ must converge to 0.