Value iteration networks

Value Iteration Networks A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel Dept. of Electrical Engineering and Computer Sciences, UC Berkeley Presenter: Keisuke Fujimoto (Twitter @peisuke)

Transcript of Value iteration networks

Page 1: Value iteration networks

Value Iteration Networks
A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel

Dept. of Electrical Engineering and Computer Sciences, UC Berkeley

Presenter: Keisuke Fujimoto (Twitter @peisuke)

Page 2: Value iteration networks

Value Iteration Networks

Purpose: Machine-learning-based robot path planning. The planner works in new environments not included in the training dataset.

Strategy: Prediction of the optimal action. The method learns the reward of each place and action so as to collect good rewards.

Result: Planning on 28x28 grid maps; applicable to continuous-control robots.

[Figure: inputs (map, pose, velocity, goal) mapped to an output action.]


Page 3: Value iteration networks

Background
Target: Autonomous robots
• Manipulation robots, navigation robots, transfer robots

Problem:
• Reinforcement learning cannot work outside of the training environments.

[Figure: a manipulation robot reaching a target object and a navigation robot reaching a goal.]

Page 4: Value iteration networks

Contribution

• Value Iteration Networks (VIN)
  • Model-free training
  • It does not require robot dynamics models.

• Generalized action prediction in new environments
  • It can work outside of the training environments.

• Key approach
  • Represents value-iteration planning by a CNN
  • Prediction of a reward map and computation of the sum of future rewards.

Page 5: Value iteration networks

Overview of VIN
Input: State of the robot (pose, velocity), goal, map (left fig.)
Output: Action (direction, motor torque)

Strategy: Determination of the optimal action using predicted rewards (right fig.).

[Figure: state (left) and predicted rewards (right).]

Page 6: Value iteration networks

Reward propagation

• The action can be determined from the sum of future rewards generated by reward propagation (a minimal sketch follows the figure below).

[Figure: one-step propagation example. Left: the map and the reward derived from it (-10 on obstacle cells, +1 at the goal). Middle/right: the values propagated by a left-move and an up-move action, e.g. the cell next to the goal receives 0.9 and cells next to obstacles receive -9.]
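For concreteness, here is a minimal NumPy sketch of one such propagation step on a small grid, assuming a reward of -10 on obstacles, +1 at the goal, and a discount factor of 0.9 (an illustration only, not the paper's or the presenter's code):

```python
import numpy as np

# Assumed 3x3 reward map: -10 on obstacle cells, +1 at the goal, 0 on free cells.
R = np.array([[-10., -10., -10.],
              [-10., -10.,   1.],
              [-10., -10.,   0.]])

def backup(V, R, gamma=0.9):
    """One value-iteration backup over the moves {left, right, up, down}:
    Q(s, a) = R(s) + gamma * V(next cell), V(s) = max_a Q(s, a).
    Stepping off the map keeps the current cell's value (a simple boundary choice)."""
    pad = np.pad(V, 1, mode="edge")            # replicate the borders
    neighbors = np.stack([pad[1:-1,  :-2],     # value of the cell to the left
                          pad[1:-1, 2:  ],     # ... to the right
                          pad[ :-2, 1:-1],     # ... above
                          pad[2:  , 1:-1]])    # ... below
    Q = R + gamma * neighbors
    return Q.max(axis=0)

V = backup(R, R)   # one propagation step starting from the raw reward map
print(V)           # values near the goal rise, values near obstacles stay low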

Page 7: Value iteration networks

Determination of the action

• The optimal action at a cell that the reward has propagated to is the action with the maximum value (middle fig.).

• The optimal action is determined from the propagated reward (right fig.; see the action-selection sketch after the figure below).

[Figure: the left-move and up-move value maps are combined by a max operation; after reward propagation the values decay smoothly away from the goal (1, 0.9, 0.8, 0.7, ...), and the robot at its current pose follows the increasing values.]
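A tiny sketch of this greedy action-selection step on the propagated values from the figure (the robot position is assumed; in the actual VIN the policy 𝜓 is learned from the cropped values rather than hand-coded):

```python
import numpy as np

# Propagated value map from the example above (goal value 1 near the top right).
V = np.array([[-10., -10.,  -9.,  -8., -10.],
              [-10., -10.,  -9.,   1.,  0.9],
              [ -9., -10., -10.,  0.9,  0.8],
              [ -8.,  -9.,  -9.,  0.8,  0.7],
              [ -7.,  -8.,  -8.,  0.7,  0.6]])
pose = (4, 2)                 # assumed current robot cell (row, col)

moves = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

def greedy_action(V, pose):
    """Pick the move whose destination cell has the highest propagated value."""
    H, W = V.shape
    best_name, best_value = None, -np.inf
    for name, (dr, dc) in moves.items():
        r, c = pose[0] + dr, pose[1] + dc
        if 0 <= r < H and 0 <= c < W and V[r, c] > best_value:
            best_name, best_value = name, V[r, c]
    return best_name

print(greedy_action(V, pose))  # -> "right": toward the increasing values leading to the goal
```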

Page 8: Value iteration networks

Value Iteration Module
• Reward propagation with a convolutional neural network
• The input is the reward map and the output is the sum-of-future-rewards map
• Q is the hidden (per-action) reward map, V is the sum-of-future-rewards (value) map

[Figure: the reward map is convolved and max-pooled over the action channels to produce the output value map.]
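As an illustration of this idea (not the actual network), the module can be sketched with per-action convolution kernels over the reward and value maps; here the kernels are random placeholders standing in for the learned weights:

```python
import numpy as np
from scipy.signal import convolve2d

def vi_module(R, K=20, n_actions=4, seed=0):
    """Sketch of the VI module: Q = conv over [reward map, value map] with one
    output channel per action, V = max over the action channels, repeated K times.
    The kernels are random placeholders; in the network they are learned."""
    rng = np.random.default_rng(seed)
    w_R = rng.normal(size=(n_actions, 3, 3))    # conv weights applied to the reward map
    w_V = rng.normal(size=(n_actions, 3, 3))    # conv weights applied to the value map
    V = np.zeros_like(R)
    for _ in range(K):                          # K recurrences of the conv + max step
        Q = np.stack([convolve2d(R, w_R[a], mode="same")
                      + convolve2d(V, w_V[a], mode="same")
                      for a in range(n_actions)])        # Q: (n_actions, H, W)
        V = Q.max(axis=0)                                # V: (H, W)
    return Q, V
```

With K large enough relative to the map size (20 for the 28x28 maps, per the Grid-World slide), value information can propagate from the goal to every cell.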

Page 9: Value iteration networks

Value Iteration Networks

• Deep architecture of Value Iteration Networks
• The input is the map and the state; f_R predicts the reward map
• The attention module crops the value map around the robot position
• 𝜓 outputs the optimal action
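Tying these components together, a rough end-to-end sketch of one forward pass (random placeholder weights; the real network learns f_R, the VI convolutions, and 𝜓 end-to-end, whereas here f_R is reduced to a per-channel weighting and 𝜓 to a linear softmax):

```python
import numpy as np
from scipy.signal import convolve2d

def vin_forward(obstacle_map, goal_map, pose, K=20, n_actions=4, seed=0):
    """Sketch of one forward pass: f_R -> VI module -> attention -> psi."""
    rng = np.random.default_rng(seed)
    # f_R: combine the input channels (obstacle map, goal map) into a reward map.
    R = rng.normal() * obstacle_map + rng.normal() * goal_map
    # VI module: K steps of convolution followed by a max over action channels.
    w_R = rng.normal(size=(n_actions, 3, 3))
    w_V = rng.normal(size=(n_actions, 3, 3))
    V = np.zeros_like(R)
    for _ in range(K):
        Q = np.stack([convolve2d(R, w_R[a], mode="same")
                      + convolve2d(V, w_V[a], mode="same")
                      for a in range(n_actions)])
        V = Q.max(axis=0)
    # Attention: crop a 3x3 patch of the value map around the robot pose.
    r, c = pose
    patch = np.pad(V, 1, mode="edge")[r:r + 3, c:c + 3].ravel()
    # psi: linear layer over the cropped values, softmax over the action set.
    logits = rng.normal(size=(n_actions, patch.size)) @ patch
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Example: a 28x28 free map with a single goal cell, robot at the centre.
grid = np.zeros((28, 28)); goal = np.zeros_like(grid); goal[3, 24] = 1.0
print(vin_forward(grid, goal, pose=(14, 14)))   # distribution over the 4 moves
```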

Page 10: Value iteration networks

Attention function
• The attention module crops a subset of the values around the current robot pose.
• The optimal action depends only on the values relative to the current robot pose.
• Thanks to this attention module, predicting the optimal action becomes easy (a cropping sketch follows the figure below).

[Figure: from the full value map, the attention module selects the 3x3 area of values around the robot's current position.]
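A minimal sketch of this cropping operation, assuming a grid value map and a (row, col) robot position (the window size k and the edge-padding choice are assumptions):

```python
import numpy as np

def attention_crop(V, pose, k=1):
    """Crop a (2k+1) x (2k+1) window of the value map V around the robot's
    grid position pose = (row, col), replicating the map edges where needed."""
    padded = np.pad(V, k, mode="edge")
    r, c = pose
    return padded[r:r + 2 * k + 1, c:c + 2 * k + 1]
```

Applied with k = 1 to the 5x5 value map in the figure above, this returns the 3x3 block of values shown as the selected area.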

Page 11: Value iteration networks

Grid-World Domain
Environment:
• Occupancy grid maps; test sizes range from 8x8 to 28x28
• The number of recurrences is 20 for the 28x28 maps
• The training dataset is 5000 maps, 7 trajectories each

Network architecture: (see figure below)

Compared methods: CNN-based Deep Q-Network, direct action prediction using an FCN

[Figure: pipeline Map + Goal → CNN → Reward map → VI module → Attention (with the current position) → FC layer → Action. Specs noted on the slide: a 3-layer net with 150 hidden nodes, 10 channels in the Q-layer, 80 parameters.]

Page 12: Value iteration networks

Results of the Grid-World Domain

[Figure: the predicted path, the predicted reward map, and the resulting sum-of-future-rewards (value) map.]

Page 13: Value iteration networks

Mars Rover Navigation
Environment:
• Navigating the surface of Mars with a rover.
• The path is predicted from the surface image alone, without obstacle information.
• The success rate is 90.3%.

[Figure: red points mark sharp elevation changes; at prediction time, VIN does not use this elevation information.]

Page 14: Value iteration networks

Continuous Control
Environment:
• Application to a continuous control space.
• The grid size is 28x28.
• The input is the position and velocity (floating-point values).
• The output is 2D continuous control parameters.

[Figure: comparison of the final distance to the goal.]

This result is from the authors' presentation.

Page 15: Value iteration networks

WebNav Challenge
Environment:
• Navigate website links to find a query.
• Features: average word embeddings.
• Uses an approximate graph for planning.

Evaluation:
• Success rate within the top-4 predictions
• Test set 1: start from the index page
• Test set 2: start from a random page

Result:

Page 16: Value iteration networks

Conclusion
Purpose:
• Machine-learning-based robot path planning.

Method:
• Learn the reward of each place and predict the action using the propagated reward.

Result:
• VIN policies learn an approximate planning computation relevant for solving the task.
• This generalizes from grid-worlds to continuous control, and even to navigating Wikipedia links.

Page 17: Value iteration networks

Code: https://github.com/peisuke/vin
This code is implemented in Chainer!

Twitter: @peisuke

We are hiring!! https://www.wantedly.com/companies/abeja