Value iteration networks

Value Iteration Networks A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel Dept. of Electrical Engineering and Computer Sciences, UC Berkeley Presenter: Keisuke Fujimoto (Twitter @peisuke)

Transcript of Value iteration networks

Page 1: Value iteration networks

Value Iteration Networks
A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel

Dept. of Electrical Engineering and Computer Sciences, UC Berkeley

Presenter: Keisuke Fujimoto (Twitter @peisuke)

Page 2: Value iteration networks

Value Iteration Networks

Purpose: Machine-learning-based robot path planning. The planner works in new environments not included in the training dataset.

Strategy: Prediction of the optimal action. The method learns the reward of each place and action so as to collect good rewards.

Result: Planning on 28x28 grid maps; applicable to continuous-control robots.

[Figure: inputs (map, pose, velocity, goal) mapped to an output action.]


Page 3: Value iteration networks

Background
Target: Autonomous robots
• Manipulation robots, navigation robots, transfer robots

Problem:
• Reinforcement learning cannot work outside of the training environments.

[Figure: a manipulation robot reaching a target object and a navigation robot reaching a goal.]

Page 4: Value iteration networks

Contribution

• Value Iteration Networks (VIN)
  • Model-free training
  • It does not require robot dynamics models.

• Generalized action prediction in new environments
  • It can work outside of the training environments.

• Key approach
  • Represents value-iteration planning by a CNN
  • Prediction of a reward map and computation of the sum of future rewards.

Page 5: Value iteration networks

Overview of VIN
Input: State of the robot (pose, velocity), goal, map (left fig.)
Output: Action (direction, motor torque)

Strategy: Determination of the optimal action using predicted rewards (right fig.).

[Figure: state (left) and predicted rewards (right).]

Page 6: Value iteration networks

Reward propagation

• The action can be determined from the sum of future rewards generated by reward propagation (a minimal sketch follows the figure below).

[Figure: one-step propagation example. Left: the map and the reward derived from it (-10 on obstacle cells, +1 at the goal). Middle/right: the values propagated by a left-move and an up-move action, e.g. the cell next to the goal receives 0.9 and cells next to obstacles receive -9.]
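For concreteness, here is a minimal NumPy sketch of one such propagation step on a small grid, assuming a reward of -10 on obstacles, +1 at the goal, and a discount factor of 0.9 (an illustration only, not the paper's or the presenter's code):

```python
import numpy as np

# Assumed 3x3 reward map: -10 on obstacle cells, +1 at the goal, 0 on free cells.
R = np.array([[-10., -10., -10.],
              [-10., -10.,   1.],
              [-10., -10.,   0.]])

def backup(V, R, gamma=0.9):
    """One value-iteration backup over the moves {left, right, up, down}:
    Q(s, a) = R(s) + gamma * V(next cell), V(s) = max_a Q(s, a).
    Stepping off the map keeps the current cell's value (a simple boundary choice)."""
    pad = np.pad(V, 1, mode="edge")            # replicate the borders
    neighbors = np.stack([pad[1:-1,  :-2],     # value of the cell to the left
                          pad[1:-1, 2:  ],     # ... to the right
                          pad[ :-2, 1:-1],     # ... above
                          pad[2:  , 1:-1]])    # ... below
    Q = R + gamma * neighbors
    return Q.max(axis=0)

V = backup(R, R)   # one propagation step starting from the raw reward map
print(V)           # values near the goal rise, values near obstacles stay low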

Page 7: Value iteration networks

Determination of the action

• The optimal action at a cell that the reward has propagated to is the action with the maximum value (middle fig.).

• The optimal action is determined from the propagated reward (right fig.; see the action-selection sketch after the figure below).

[Figure: the left-move and up-move value maps are combined by a max operation; after reward propagation the values decay smoothly away from the goal (1, 0.9, 0.8, 0.7, ...), and the robot at its current pose follows the increasing values.]
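A tiny sketch of this greedy action-selection step on the propagated values from the figure (the robot position is assumed; in the actual VIN the policy 𝜓 is learned from the cropped values rather than hand-coded):

```python
import numpy as np

# Propagated value map from the example above (goal value 1 near the top right).
V = np.array([[-10., -10.,  -9.,  -8., -10.],
              [-10., -10.,  -9.,   1.,  0.9],
              [ -9., -10., -10.,  0.9,  0.8],
              [ -8.,  -9.,  -9.,  0.8,  0.7],
              [ -7.,  -8.,  -8.,  0.7,  0.6]])
pose = (4, 2)                 # assumed current robot cell (row, col)

moves = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

def greedy_action(V, pose):
    """Pick the move whose destination cell has the highest propagated value."""
    H, W = V.shape
    best_name, best_value = None, -np.inf
    for name, (dr, dc) in moves.items():
        r, c = pose[0] + dr, pose[1] + dc
        if 0 <= r < H and 0 <= c < W and V[r, c] > best_value:
            best_name, best_value = name, V[r, c]
    return best_name

print(greedy_action(V, pose))  # -> "right": toward the increasing values leading to the goal
```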

Page 8: Value iteration networks

Value Iteration Module
• Reward propagation with a convolutional neural network
• The input is the reward map and the output is the sum-of-future-rewards map
• Q is the hidden (per-action) reward map, V is the sum-of-future-rewards (value) map

[Figure: the reward map is convolved and max-pooled over the action channels to produce the output value map.]
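As an illustration of this idea (not the actual network), the module can be sketched with per-action convolution kernels over the reward and value maps; here the kernels are random placeholders standing in for the learned weights:

```python
import numpy as np
from scipy.signal import convolve2d

def vi_module(R, K=20, n_actions=4, seed=0):
    """Sketch of the VI module: Q = conv over [reward map, value map] with one
    output channel per action, V = max over the action channels, repeated K times.
    The kernels are random placeholders; in the network they are learned."""
    rng = np.random.default_rng(seed)
    w_R = rng.normal(size=(n_actions, 3, 3))    # conv weights applied to the reward map
    w_V = rng.normal(size=(n_actions, 3, 3))    # conv weights applied to the value map
    V = np.zeros_like(R)
    for _ in range(K):                          # K recurrences of the conv + max step
        Q = np.stack([convolve2d(R, w_R[a], mode="same")
                      + convolve2d(V, w_V[a], mode="same")
                      for a in range(n_actions)])        # Q: (n_actions, H, W)
        V = Q.max(axis=0)                                # V: (H, W)
    return Q, V
```

With K large enough relative to the map size (20 for the 28x28 maps, per the Grid-World slide), value information can propagate from the goal to every cell.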

Page 9: Value iteration networks

Value Iteration Networks

• Deep architecture of Value Iteration Networks
• The input is the map and the state; f_R predicts the reward map
• The attention module crops the value map around the robot position
• 𝜓 outputs the optimal action
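Tying these components together, a rough end-to-end sketch of one forward pass (random placeholder weights; the real network learns f_R, the VI convolutions, and 𝜓 end-to-end, whereas here f_R is reduced to a per-channel weighting and 𝜓 to a linear softmax):

```python
import numpy as np
from scipy.signal import convolve2d

def vin_forward(obstacle_map, goal_map, pose, K=20, n_actions=4, seed=0):
    """Sketch of one forward pass: f_R -> VI module -> attention -> psi."""
    rng = np.random.default_rng(seed)
    # f_R: combine the input channels (obstacle map, goal map) into a reward map.
    R = rng.normal() * obstacle_map + rng.normal() * goal_map
    # VI module: K steps of convolution followed by a max over action channels.
    w_R = rng.normal(size=(n_actions, 3, 3))
    w_V = rng.normal(size=(n_actions, 3, 3))
    V = np.zeros_like(R)
    for _ in range(K):
        Q = np.stack([convolve2d(R, w_R[a], mode="same")
                      + convolve2d(V, w_V[a], mode="same")
                      for a in range(n_actions)])
        V = Q.max(axis=0)
    # Attention: crop a 3x3 patch of the value map around the robot pose.
    r, c = pose
    patch = np.pad(V, 1, mode="edge")[r:r + 3, c:c + 3].ravel()
    # psi: linear layer over the cropped values, softmax over the action set.
    logits = rng.normal(size=(n_actions, patch.size)) @ patch
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

# Example: a 28x28 free map with a single goal cell, robot at the centre.
grid = np.zeros((28, 28)); goal = np.zeros_like(grid); goal[3, 24] = 1.0
print(vin_forward(grid, goal, pose=(14, 14)))   # distribution over the 4 moves
```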

Page 10: Value iteration networks

Attention function
• The attention module crops a subset of the values around the current robot pose.
• The optimal action depends only on the values relative to the current robot pose.
• Thanks to this attention module, predicting the optimal action becomes easy (a cropping sketch follows the figure below).

[Figure: from the full value map, the attention module selects the 3x3 area of values around the robot's current position.]
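A minimal sketch of this cropping operation, assuming a grid value map and a (row, col) robot position (the window size k and the edge-padding choice are assumptions):

```python
import numpy as np

def attention_crop(V, pose, k=1):
    """Crop a (2k+1) x (2k+1) window of the value map V around the robot's
    grid position pose = (row, col), replicating the map edges where needed."""
    padded = np.pad(V, k, mode="edge")
    r, c = pose
    return padded[r:r + 2 * k + 1, c:c + 2 * k + 1]
```

Applied with k = 1 to the 5x5 value map in the figure above, this returns the 3x3 block of values shown as the selected area.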

Page 11: Value iteration networks

Grid-World Domain
Environment:
• Occupancy grid maps; test sizes range from 8x8 to 28x28
• The number of recurrences is 20 for the 28x28 maps
• The training dataset is 5000 maps, 7 trajectories each

Network architecture: (see figure below)

Compared methods: CNN-based Deep Q-Network, direct action prediction using an FCN

[Figure: pipeline Map + Goal → CNN → Reward map → VI module → Attention (with the current position) → FC layer → Action. Specs noted on the slide: a 3-layer net with 150 hidden nodes, 10 channels in the Q-layer, 80 parameters.]

Page 12: Value iteration networks

Results of the Grid-World Domain

[Figure: the predicted path, the predicted reward map, and the resulting sum-of-future-rewards (value) map.]

Page 13: Value iteration networks

Mars Rover Navigation
Environment:
• Navigating the surface of Mars with a rover.
• The path is predicted from the surface image alone, without obstacle information.
• The success rate is 90.3%.

[Figure: red points mark sharp elevation changes; at prediction time, VIN does not use this elevation information.]

Page 14: Value iteration networks

Continuous Control
Environment:
• Application to a continuous control space.
• The grid size is 28x28.
• The input is the position and velocity (floating-point values).
• The output is 2D continuous control parameters.

[Figure: comparison of the final distance to the goal.]

This result is from the authors' presentation.

Page 15: Value iteration networks

WebNav Challenge
Environment:
• Navigate website links to find a query.
• Features: average word embeddings.
• Uses an approximate graph for planning.

Evaluation:
• Success rate within the top-4 predictions
• Test set 1: start from the index page
• Test set 2: start from a random page

Result:

Page 16: Value iteration networks

Conclusion
Purpose:
• Machine-learning-based robot path planning.

Method:
• Learn the reward of each place and predict the action using the propagated reward.

Result:
• VIN policies learn an approximate planning computation relevant for solving the task.
• This generalizes from grid-worlds to continuous control, and even to navigating Wikipedia links.

Page 17: Value iteration networks

Code: https://github.com/peisuke/vin
This code is implemented in Chainer!

Twitter: @peisuke

We are hiring!! https://www.wantedly.com/companies/abeja