FeUdal Networks for Hierarchical Reinforcement Learning
by Artem Bachysnkyi
Computational Neuroscience Seminar, University of Tartu, 3 May 2017
Reinforcement learning
The basic reinforcement learning model consists of:
• a set of environment and agent states S
• a set of actions A of the agent
• policies of transitioning from states to actions
• rules that determine the scalar immediate reward of a transition
• rules that describe what the agent observes.
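As a minimal sketch of this interaction loop (using the Gymnasium API and the CartPole environment purely as illustrative stand-ins; nothing here is specific to the paper):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")           # any environment with states, actions, rewards
observation, info = env.reset()

done = False
while not done:
    action = env.action_space.sample()  # a random stand-in for the policy pi(a|s)
    # the environment responds with a new observation and a scalar reward
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated      # continue until the terminal state
env.close()
```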
ATARI games
Standard approach
• use an action-repeat heuristic, where each action translates into several consecutive actions in the environment (see the sketch below)
• not applicable in non-Markovian environments that require memory
• cannot learn from the weak reward signal
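A minimal sketch of the action-repeat heuristic as an environment wrapper (the class is hypothetical; it assumes the Gymnasium-style `step` signature used above):

```python
class ActionRepeat:
    """Translate each agent action into k consecutive environment actions."""

    def __init__(self, env, k=4):
        self.env = env
        self.k = k

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.k):
            obs, reward, terminated, truncated, info = self.env.step(action)
            total_reward += reward              # accumulate reward over the repeats
            if terminated or truncated:
                break
        return obs, total_reward, terminated, truncated, info
```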
Feudal reinforcement learning intuition
• levels of hierarchy within an agent communicate via explicit goals
• goals can be generated in a top-down fashion
• goal setting can be decoupled from goal achievement
Manager-Worker model
Manager:
• sets goals at a lower temporal resolution

Worker:
• operates at a higher temporal resolution
• produces primitive actions
• follows the goals, driven by an intrinsic reward
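A schematic sketch of the two temporal resolutions (all names and the toy goal/action types are illustrative assumptions, not the paper's code):

```python
import random

C = 10  # the Manager emits one goal per C Worker steps (illustrative value)

def manager_goal(state):
    # Placeholder: the real Manager outputs a direction in latent state space.
    return random.choice(["left", "right"])

def worker_action(state, goal):
    # Placeholder: the real Worker conditions a primitive action on the goal.
    return 0 if goal == "left" else 1

goal = None
for t in range(100):                     # stand-in for one episode
    state = t                            # stand-in for the current observation
    if t % C == 0:                       # lower temporal resolution: Manager
        goal = manager_goal(state)
    action = worker_action(state, goal)  # higher temporal resolution: Worker
```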
Main proposals
• a consistent, end-to-end differentiable model
• approximate transition policy gradient update for training the Manager
• use of goals that are directional rather than absolute
• dilated LSTM for the Manager RNN design
FuN model description
$h^M, h^W$ – internal states of the Manager and the Worker
$U_t$ – Worker's output
$\phi$ – maps $g_t$ into $w_t$
$\pi$ – vector of probabilities over primitive actions
$s_t$ – latent state representation
$g_t$ – goal vector
$x_t$ – observation from the environment
$z_t$ – shared intermediate representation
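How these pieces fit together can be sketched as a single forward pass (NumPy; the random matrices stand in for the learned modules, and all sizes are assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_actions = 16, 8, 4                    # latent size, embedding size, actions

x_t = rng.normal(size=32)                     # observation from the environment
W_percept = rng.normal(size=(d, 32))          # stand-in for f_percept
W_Mspace = rng.normal(size=(d, d))            # stand-in for f_Mspace
W_Mrnn = rng.normal(size=(d, d))              # stand-in for the Manager RNN f_Mrnn
W_Wrnn = rng.normal(size=(n_actions * k, d))  # stand-in for the Worker RNN f_Wrnn
phi = rng.normal(size=(k, d))                 # phi: maps g_t into w_t

z_t = W_percept @ x_t                         # shared intermediate representation
s_t = W_Mspace @ z_t                          # latent state representation
g_t = W_Mrnn @ s_t                            # goal vector (recurrent state omitted)
g_t /= np.linalg.norm(g_t)                    # goals are directions: unit norm
w_t = phi @ g_t                               # goal embedding
U_t = (W_Wrnn @ z_t).reshape(n_actions, k)    # Worker's output
logits = U_t @ w_t
pi = np.exp(logits - logits.max())
pi /= pi.sum()                                # probabilities over primitive actions
```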
Learning
Learning steps:
1. receive an observation from the environment
2. select an action from a finite set
3. the environment responds with a new observation and a scalar reward
4. the process continues until the terminal state is reached
Learning

Bad idea:
train the feudal network end-to-end using a policy gradient algorithm operating on the actions taken by the Worker

Good idea:
independently train the Manager to predict advantageous directions in state space and to intrinsically reward the Worker for following these directions
The agent's goal
Maximize the discounted return

$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

where $\gamma \in (0, 1]$ is a discount factor and $r_t$ is the reward from the environment at time $t$.
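For concreteness, a small helper that computes these returns from a reward sequence (plain Python, not from the paper; the indexing convention is simplified so that $R_t$ starts at $r_t$):

```python
def discounted_return(rewards, gamma=0.99):
    """Return [R_0, R_1, ...] with R_t = r_t + gamma * R_{t+1}."""
    R = 0.0
    returns = []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return returns[::-1]

# discounted_return([0, 0, 1], gamma=0.9) -> [0.81, 0.9, 1.0]
```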
The agent's behaviour is defined by its action-selection policy π. FuN produces a distribution over possible actions.
Manager's update rule

$\nabla g_t = A^M_t \nabla_\theta d_{\cos}(s_{t+c} - s_t, g_t(\theta))$

where
$V^M_t(x_t, \theta)$ – value function estimate from the internal critic
$d_{\cos}(\alpha, \beta) = \alpha^{T}\beta / (|\alpha||\beta|)$ – cosine similarity between two vectors
$A^M_t = R_t - V^M_t(x_t, \theta)$ – advantage function
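In an autodiff framework this rule is typically implemented as a surrogate loss whose gradient matches the update above (a PyTorch-style sketch; the function name and tensor layout are assumptions):

```python
import torch.nn.functional as F

def manager_loss(s_t, s_tc, g_t, advantage_M):
    """Surrogate loss for the Manager's update rule.

    s_t, s_tc:   latent states at times t and t + c (treated as constants)
    g_t:         the Manager's goal, a function of the Manager's parameters
    advantage_M: A_t^M = R_t - V_t^M(x_t, theta), treated as a constant
    """
    direction = (s_tc - s_t).detach()            # advantageous direction in state space
    cos = F.cosine_similarity(direction, g_t, dim=-1)
    return -(advantage_M.detach() * cos).mean()  # minimizing ascends the cosine term
```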
Worker's intrinsic reward

$r^I_t = \frac{1}{c}\sum_{i=1}^{c} d_{\cos}(s_t - s_{t-i}, g_{t-i})$

where
$c$ – horizon
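A direct NumPy transcription of this reward (assuming `states` and `goals` are sequences of vectors indexed by time, with $t \ge c$):

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def intrinsic_reward(states, goals, t, c):
    """r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})."""
    return sum(cosine(states[t] - states[t - i], goals[t - i])
               for i in range(1, c + 1)) / c
```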
The Worker's policy
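From the FuN paper, the last $c$ goals are pooled and embedded by the linear map $\phi$ (which has no biases), and the Worker's policy is the product of its output $U_t$ with this embedding:

$w_t = \phi\Big(\sum_{i=t-c}^{t} g_i\Big), \qquad \pi_t = \mathrm{SoftMax}(U_t w_t)$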
Advantage actor-critic

Advantage function
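From the paper, the Worker is trained with an advantage actor-critic update on the combination of environment and intrinsic rewards, where $\alpha$ weights the intrinsic reward and $V^D_t$ is the internal critic's value estimate:

$\nabla \pi_t = A^D_t \nabla_\theta \log \pi(a_t \mid x_t; \theta), \qquad A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta)$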
Architecture details
$f^{percept}$ – convolutional neural network:
1. 16 8x8 filters, stride 4
2. 32 4x4 filters, stride 2
3. a fully connected layer with 256 hidden units
* each layer is followed by a rectifier non-linearity

$f^{Mspace}$ – another fully connected layer
$f^{Wrnn}$ – standard LSTM
$f^{Mrnn}$ – dilated LSTM
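A PyTorch sketch of $f^{percept}$ exactly as listed above (the 4x84x84 input shape is an assumption borrowed from the standard ATARI preprocessing, not stated on the slide):

```python
import torch.nn as nn

f_percept = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 1. 16 8x8 filters, stride 4
    nn.ReLU(),                                   # rectifier after each layer
    nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 2. 32 4x4 filters, stride 2
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),                  # 3. fully connected, 256 units
    nn.ReLU(),
)
```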
Dilated LSTM

State of the network with $r$ separate groups of sub-states: $h = \{\hat h^i\}_{i=1}^{r}$.

At time $t$, the index $t \,\%\, r$ indicates which group of cores is updated:

$\hat h^{t\%r}_t,\; g_t = \mathrm{LSTM}(s_t, \hat h^{t\%r}_{t-1}; \theta^{\mathrm{LSTM}})$

At each time step only the corresponding part of the state is updated, and the output is pooled across the previous $c$ outputs. This allows the $r$ groups of cores inside the dLSTM to preserve memories over long periods.

* In the experiments $r = 10$.
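A sketch of these mechanics (PyTorch; the wrapper class is illustrative, and for brevity it pools across the $r$ group states rather than the previous $c$ outputs as described above):

```python
import torch
import torch.nn as nn

class DilatedLSTM(nn.Module):
    """One LSTM cell with r separate groups of sub-states (cores)."""

    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.r = r

    def init_state(self, batch_size):
        h = self.cell.hidden_size
        return ([torch.zeros(batch_size, h) for _ in range(self.r)],
                [torch.zeros(batch_size, h) for _ in range(self.r)])

    def forward(self, s_t, state, t):
        hs, cs = state
        i = t % self.r                     # only group t % r is updated at step t
        hs[i], cs[i] = self.cell(s_t, (hs[i], cs[i]))
        g_t = torch.stack(hs).mean(dim=0)  # pool the output across the groups
        return g_t, (hs, cs)
```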
Experiments: ATARI

Experiments: Montezuma's revenge
https://www.youtube.com/watch?v=_zbg9rs5QZY
Experiments: Non-match and T-maze

Experiments: Water maze

Experiments: Transition policy gradient

Experiments: Temporal resolution

Experiments: Dilated LSTM agent baseline