
Balancing Exploration and Exploitation Ratio in Reinforcement Learning

Ozkan Ozcan (1st Lt, TuAF)
oozcan@nps.edu

Outline

• Introduction
• Abstract
• Background
• Approaches
• Study
  – Methods
  – Analysis
• Results
• Future Work


Introduction

• Reinforcement Learning is a method by which agents can learn how to maximize rewards from their environment by managing exploration and exploitation strategies and reinforcing actions that led to a reward in the past.

• Balancing the ratio of exploration and exploitation is an important problem in reinforcement learning. The agent can choose to explore its environment and try new actions in search of better ones to adopt in the future, or exploit and adopt actions it has already tested.


Introduction

• Neither pure exploration nor pure exploitation gives the best results. The agent must find a balance between the two to obtain the best utility.


Abstract

• Methods to control this ratio in a manner that mimics human behavior are required for use in cognitive social simulations, which seek to constrain agent learning mechanisms in a manner similar to that observed in human cognition.

• In my thesis I used traditional and novel methods for adjusting the exploration and exploitation ratio of agents in cognitive social simulation.


Abstract

• There are different algorithms in Reinforcement Learning, such as Q-learning, which can be combined with different action selection techniques, such as Boltzmann selection. In my thesis I focused particularly on the Q-learning algorithm and Boltzmann selection.


Background

• Reinforcement Learning (RL)
  – Every round of the simulation, each agent is faced with a choice of possible actions.
  – The agent pairs its current state with the possible actions; these are called State-Action pairs.
  – Each State-Action pair has an associated Q-value, initially set equal to the default utility and later updated by the reinforcement algorithm (a minimal Q-learning sketch follows).
  – Using the Q-values of all the actions for its current state, the agent applies a selection algorithm to make a choice.
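To make this bookkeeping concrete, here is a minimal Q-learning sketch in Python. The learning rate, discount factor, and default utility are hypothetical placeholder values chosen for illustration, not settings from the thesis.

```python
from collections import defaultdict

# Hypothetical hyperparameters -- illustrative values, not taken from the thesis.
ALPHA = 0.1            # learning rate
GAMMA = 0.9            # discount factor
DEFAULT_UTILITY = 0.0  # initial Q-value for every State-Action pair

# Q-table: every (state, action) pair starts at the default utility.
Q = defaultdict(lambda: DEFAULT_UTILITY)

def q_update(state, action, reward, next_state, possible_actions):
    """One-step Q-learning update for a single State-Action pair."""
    best_next = max(Q[(next_state, a)] for a in possible_actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example: an agent in state "s0" took action "left" and received a reward of 1.
q_update("s0", "left", reward=1.0, next_state="s1",
         possible_actions=["left", "right"])
print(Q[("s0", "left")])  # 0.1, starting from the default utility of 0
```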


Environments

• Two-Armed Bandit

• Grid World (GW)

• Cultural Geography Model (CG)


Approaches

• Temperature (Exploration-Exploitation Ratio) – Controls how likely the agent is to select an action with a higher utility. A higher temperature means the agent is more likely to explore; a lower temperature means it is more likely to exploit (see the Boltzmann selection sketch below).

• I examined two methods for dynamically controlling the ratio of exploration and exploitation.
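A minimal sketch of Boltzmann (softmax) selection, showing how the temperature parameter trades exploration against exploitation. The Q-values and temperature values used here are illustrative only.

```python
import math
import random

def boltzmann_select(q_values, temperature):
    """Choose an action index with probability proportional to exp(Q / T)."""
    # Subtract the max Q-value for numerical stability before exponentiating.
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

# Illustrative Q-values for two actions (e.g., the two bandit arms).
q = [1.0, 2.0]
print(boltzmann_select(q, temperature=10.0))  # high T: nearly uniform -> explore
print(boltzmann_select(q, temperature=0.1))   # low T: almost always arm 1 -> exploit
```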


Study


• Temperature will be varied dynamically.
• Technique 1: Time Based Search then Converge
  – Start with a high temperature and cool it over time, so the agents are exploratory initially and then exploitative after some period of time.
• Technique 2: Aggregate Utility Driven Exploration
  – Same as above, except the agents now cool based on their current utility.

Technique 1: Time Based Search then Converge

When we decrease the Exploit Start Time, the agent starts behaving greedily sooner. By analyzing this technique, we can determine the best exploit start time for a given scenario length. (A rough schedule sketch follows.)
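One plausible realization of this schedule is sketched below. The parameter names and the starting and minimum temperatures are assumptions for illustration; the exact cooling function used in the thesis may differ.

```python
def time_based_temperature(step, exploit_start_time, t_high=10.0, t_low=0.1):
    """Search-then-converge: cool linearly until the Exploit Start Time, then exploit.

    t_high and t_low are illustrative values, not settings from the thesis.
    """
    if step >= exploit_start_time:
        return t_low
    return t_high + (step / exploit_start_time) * (t_low - t_high)

# Example: temperature over a run with an Exploit Start Time of 400 steps.
for step in (0, 100, 200, 300, 400):
    print(step, round(time_based_temperature(step, exploit_start_time=400), 2))
```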

Technique 1: Two-Armed Bandit

[Charts: Expected Utility vs. Exploit Start Time (0–400) for scenario lengths (ranges) of 100–1000; expected utility binned from ≤ 150 to > 550.]

Technique 1: Grid World Example

[Charts: Expected Utility vs. Exploit Start Time (0–500) for scenario lengths (ranges) of 500–1500; expected utility binned from ≤ 0.600 to > 1.800.]

Technique 1: Cultural Geography Model

Technique 2: Aggregate Utility Driven Exploration

This technique is based on the agent's current utility. When the current utility is low, the temperature rises, making the agent more likely to select exploratory actions. (A rough schedule sketch follows.)
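A hedged sketch of this idea, assuming a single "Exploit Utility" threshold parameter as in the charts that follow. The functional form and the temperature bounds are illustrative assumptions, not the thesis implementation.

```python
def utility_based_temperature(current_utility, exploit_utility,
                              t_high=10.0, t_low=0.1):
    """Aggregate-utility-driven schedule (illustrative form).

    While the agent's current utility is below the Exploit Utility threshold,
    the temperature stays high (more exploration); once the threshold is
    reached, the temperature drops and the agent exploits.
    """
    if current_utility >= exploit_utility:
        return t_low
    shortfall = (exploit_utility - current_utility) / exploit_utility
    return t_low + shortfall * (t_high - t_low)

# Example: an agent whose utility climbs toward an Exploit Utility of 2.0.
for u in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(u, round(utility_based_temperature(u, exploit_utility=2.0), 2))
```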

Technique 2: Two-Armed Bandit

[Charts: Expected Utility vs. Exploit Utility (0–3) for scenario lengths (ranges) of 100–1000; expected utility binned from ≤ 150 to > 550.]

Technique 2: Grid World Example

[Charts: Expected Utility vs. Exploit Utility (0–3.5) in different ranges, for scenario lengths of 500–4500; expected utility binned from ≤ 0.500 to > 2.250.]

Technique 2: Cultural Geography Model

[Charts: Expected Utility vs. Exploit Utility (0–4) for scenario lengths (ranges) of 180, 360, and 540; Expected Utility for Action 1 binned from ≤ -0.850 to > -0.450.]

Results

• In dynamic environments, using a dynamic Temperature yields better utilities.

• Expected Utility is related to both the scenario length and the dynamic Temperature schedule.

• By characterizing this relationship, we can say which Temperature schedule gives the best utility for a given simulation and scenario length, and thereby make the agents learn faster (a sweep of this kind is sketched below).
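To make the kind of analysis behind these results concrete, here is a small, self-contained sweep on a two-armed bandit that reruns the same scenario with different Exploit Start Times and compares the total utility. The arm payoffs, schedule parameters, and number of repetitions are hypothetical, so the numbers will not reproduce the thesis charts; only the shape of the experiment is the point.

```python
import math
import random

def boltzmann(q_values, temperature):
    """Boltzmann (softmax) action selection over the current Q-values."""
    m = max(q_values)
    weights = [math.exp((q - m) / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

def temperature(step, exploit_start, t_high=10.0, t_low=0.1):
    """Time-based search-then-converge schedule (illustrative values)."""
    if step >= exploit_start:
        return t_low
    return t_high + (step / exploit_start) * (t_low - t_high)

def run_bandit(scenario_length, exploit_start, alpha=0.1):
    """Run one two-armed-bandit scenario and return the total reward collected."""
    # Hypothetical arm payoffs: arm 1 is better on average than arm 0.
    arms = [lambda: random.gauss(1.0, 1.0), lambda: random.gauss(2.0, 1.0)]
    q = [0.0, 0.0]
    total = 0.0
    for step in range(scenario_length):
        a = boltzmann(q, temperature(step, exploit_start))
        r = arms[a]()
        q[a] += alpha * (r - q[a])  # bandit form of the Q-value update
        total += r
    return total

random.seed(0)
scenario_length = 400
for exploit_start in (50, 100, 200, 300, 400):
    avg = sum(run_bandit(scenario_length, exploit_start) for _ in range(20)) / 20
    print(f"Exploit Start Time {exploit_start:>3}: average utility {avg:.1f}")
```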

Future Work

The next step in adapting this model to more closely match human behavior will be to develop a technique that:
• Combines time-based and utility-based learning
• Allows an increase in temperature due to agitation of the agent, such as a change in perceived happiness
(A hypothetical combined sketch follows.)
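Purely as a hypothetical illustration of how such a combination might look (this is not an implementation from the thesis), the sketch below takes the larger of a time-based and a utility-based temperature and adds an agitation term that re-raises the temperature when the agent's perceived happiness drops.

```python
def combined_temperature(step, exploit_start, current_utility, exploit_utility,
                         happiness_drop=0.0, agitation_gain=5.0,
                         t_high=10.0, t_low=0.1):
    """Hypothetical combined schedule (illustrative only, not from the thesis).

    - time term:      cools linearly until exploit_start, then stays at t_low
    - utility term:   stays high while current utility is below the threshold
    - agitation term: a drop in perceived happiness re-raises the temperature
    """
    if step >= exploit_start:
        time_term = t_low
    else:
        time_term = t_high + (step / exploit_start) * (t_low - t_high)
    shortfall = max(0.0, (exploit_utility - current_utility) / exploit_utility)
    utility_term = t_low + shortfall * (t_high - t_low)
    agitation = agitation_gain * max(0.0, happiness_drop)
    return max(time_term, utility_term) + agitation

# Example: late in the run, low utility plus a happiness drop keeps exploration alive.
print(combined_temperature(step=350, exploit_start=300,
                           current_utility=0.5, exploit_utility=2.0,
                           happiness_drop=0.4))
```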

Questions?