Reinforcement Learning for CPS Safety Engineering (koclab.cs.ucsb.edu/cpsed/files/Green1.pdf)
Reinforcement Learning for CPS Safety Engineering
Sam Green, Çetin Kaya Koç, Jieliang Luo
University of California, Santa Barbara
Motivations
Safety-critical duties desired of CPS?
• Autonomous vehicle control: UAVs, passenger vehicles, delivery trucks
• Automatically responding to, or preventing, damage
• Industrial robot control for use around humans
• Large process automation
  • E.g., optimization of a factory
Reinforcement Learning
Georgia Tech, https://www.youtube.com/watch?v=f2at-cqaJMM
DeepMind, https://arxiv.org/abs/1707.02286
Machine Learning
• Supervised • Unsupervised • Reinforcement
Introduction to RL
• A computational approach to learning from interaction
• Established in the 1980s
• Objective is to take actions to maximize a reward (or minimize a cost)
• Seen as a path toward Artificial General Intelligence
• RL is at the intersection of:
  • Psychology
  • Control Theory
  • Computer Science / AI
• Resurgence with the advent of deep learning methods
[Mnih et al. Asynchronous Methods for Deep Reinforcement Learning, 2016]
Advances in RL since 2015
[Timeline figure: RL milestones from 2015 and 2016]
Terminology
• Agent – The thing we are learning to control
• Environment – All the factors affecting the agent
• Action – Performed by the agent in an attempt to effect change on the environment
• Reward – Returned by the environment to the agent after the agent takes an action; used to help the agent learn
  • AKA the negative cost
[R. Sutton and A. Barto. Reinforcement Learning: An Introduction. 2016]
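The terms above fit together in a single interaction loop: the agent observes the state, takes an action, and the environment returns a reward. A minimal sketch using a hypothetical one-dimensional environment (not from the slides) where the agent must walk to x = 5:

```python
class Environment:
    """Hypothetical 1-D walk: the agent's goal is to reach x = 5."""
    def __init__(self):
        self.x = 0  # state

    def step(self, action):
        # action is -1 (left) or +1 (right); reward is -1 per step (i.e., a cost)
        self.x += action
        done = self.x >= 5
        return self.x, -1, done

env = Environment()
state, total_reward, done = 0, 0, False
while not done:
    action = +1                                # a trivial "always go right" policy
    state, reward, done = env.step(action)     # environment returns state + reward
    total_reward += reward

print(total_reward)  # 5 steps from 0 to 5, at -1 per step: -5
```

Maximizing the total reward here is equivalent to minimizing the number of steps, which is exactly the "reward = negative cost" framing above.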
Markov Decision Process
• What RL solves
• Environments where the agent's decisions depend only on the present:
  • An object in flight
  • Self-driving car
  • Manufacturing process
  • Robot control
• It's not that the past doesn't matter, but the laws of physics guarantee certain things, e.g., momentum
• Methods also exist to solve approximate MDPs
Example: Student Markov Chain
[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf]
Start here at the beginning of each episode
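Sampling such a chain needs nothing more than a transition matrix and the Markov property. A minimal sketch with a hypothetical 3-state chain (the probabilities are illustrative, not the ones from Silver's student example):

```python
import random

# Hypothetical transition matrix: P[i][j] = probability of moving from state i to j.
# State 2 is absorbing and ends the episode.
P = [
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],
]

def sample_episode(start=0, seed=0):
    """Follow the chain from `start` until the absorbing state is reached."""
    rng = random.Random(seed)
    state, path = start, [start]
    while state != 2:
        # Markov property: the next state depends only on the current state.
        state = rng.choices(range(3), weights=P[state])[0]
        path.append(state)
    return path

episode = sample_episode()
print(episode)  # a path starting at state 0 and ending in the absorbing state 2
```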
RL for CPS Safety Engineering
• Interdisciplinary nature makes RL interesting for CPS engineering
  • AI, ML (Math, Statistics)
  • Mechanical design and simulation (ME, Physics, CS)
  • Programming and implementation (CS, EE)
Mountain Car Example
• Agent is an underpowered car with 3 actions:
  • Backward, Neutral, Forward
• Reward := -1 per time step
• Implicit goal := Reach the flag as fast as possible
• State := x-position and velocity
Canonical example: Mountain Car
[R. Sutton and A. Barto. Reinforcement Learning: An Introduction. 2016]
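The mountain-car dynamics are compact enough to write out. A sketch of the standard update used in Sutton & Barto's version of the problem, illustrating why the car is "underpowered":

```python
import math

def step(position, velocity, action):
    """One mountain-car physics step. `action` is -1 (backward), 0 (neutral), +1 (forward)."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))          # velocity is clamped
    position = max(-1.2, min(0.6, position + velocity)) # position is clamped
    done = position >= 0.5                              # the flag
    return position, velocity, done, -1                 # reward is -1 per step

# Full throttle forward from near the valley floor fails: gravity (the cos term)
# beats the engine, so the car must first rock backward to build momentum.
pos, vel = -0.5, 0.0
done = False
for _ in range(200):
    pos, vel, done, _ = step(pos, vel, +1)
print(pos, done)  # still short of the flag after 200 steps of pure "forward"
```

This is what makes the example interesting: the optimal behavior (drive away from the goal first) is not obvious from the reward alone.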
Model-Free Control via Policy-Based RL
• A simple physics model determines the behavior of the car
  • Captures the position of the car on the hill
  • Captures the effect of limited engine power
• Using a physics model simplifies the approach
  • Use an efficient traditional controller
• But in many scenarios the model is unavailable or too complex
  • Amazon package-delivery drone
• Solve mountain car using a sophisticated method as a toy example
  • Directly train a neural-network-based policy
RL Terminology and Notation
• 𝑆𝑡 – State of the environment at time 𝑡
  • x-axis position and velocity
• 𝐴𝑡 – Action taken by the agent at time 𝑡
  • Backward, Neutral, Forward
• 𝜋 – The policy function; returns the next action to take. Stochastic in this example
• 𝜃 – A parameter vector for the policy; i.e., the weights learned in a neural network
Putting everything together: 𝐴𝑡+1 ∼ 𝜋𝜃(𝐴𝑡, 𝑆𝑡) = 𝑃(𝐴𝑡 | 𝑆𝑡, 𝜃)
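Because the policy is stochastic, taking an action means drawing a sample from the distribution 𝑃(𝐴𝑡 | 𝑆𝑡, 𝜃). A minimal sketch, with a hypothetical hand-written rule standing in for a learned network:

```python
import random

def pi_theta(state):
    """Hypothetical stochastic policy: returns P(action | state) over
    Backward, Neutral, Forward. A real policy computes this from theta."""
    position, velocity = state
    # Toy heuristic in place of a trained network: push in the direction of motion.
    if velocity >= 0:
        return {"B": 0.1, "N": 0.2, "F": 0.7}
    return {"B": 0.7, "N": 0.2, "F": 0.1}

rng = random.Random(0)
probs = pi_theta((-0.5, 0.01))
action = rng.choices(list(probs), weights=list(probs.values()))[0]
print(action)  # one of "B", "N", "F", sampled according to probs
```

Sampling (rather than always taking the argmax) is what lets the agent explore during learning.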
The policy 𝜋𝜃
• 𝜋𝜃 is often approximated
• Deep neural networks are powerful for approximation
• We will use gradient ascent to optimize the DNN
The policy function 𝜋𝜃, approximated by a NN
• State information at time 𝑡:
  • Position and Velocity
• Action options at time 𝑡:
  • Forward acceleration
  • Neutral
  • Backward acceleration
[Diagram: neural network 𝜋𝜃 with inputs Position and Velocity; outputs Prob(F), Prob(N), Prob(B)]
Reward function
• At every time step, take an action
  • Forward, neutral, backward
• Each action has a reward of -1
• Train the agent to reach the flag in the minimum number of time steps
Example: Markov Reward Process
[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/MDP.pdf]
Start here at the beginning of each episode
How to train the NN?
• Small networks can be effectively trained with genetic algorithms
• Genetic algorithms work poorly with large networks (the parameter space is too large)
• Gradient-ascent optimization works with large parameter spaces

[Diagram: neural network 𝜋𝜃 with inputs Position and Velocity; outputs Prob(F), Prob(N), Prob(B)]
Monte-Carlo Policy Gradient (REINFORCE)
• Find a DNN parameter vector 𝜃 such that 𝜋𝜃 maximizes the reward
• For every episode, until the flag is reached:
  • Get state information (position & velocity) from the environment
  • Feed the NN with the state information
  • The NN will output a probability for (F)orward, (N)eutral, and (B)ackward
  • Randomly select action F, N, or B (using the above probabilities)
  • Store the state information and action taken
• Once the flag is reached:
  • Assign the most reward to the last action … least reward to the first action
  • Update 𝜃 s.t. actions made at the end are more probable
[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html]
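The loop above can be sketched end-to-end. To keep it short, this toy uses a one-step episode (a hypothetical two-armed bandit) instead of mountain car, with a softmax policy and the REINFORCE update 𝜃 ← 𝜃 + α·G·∇𝜃 log 𝜋𝜃(a):

```python
import math, random

rng = random.Random(0)
theta = [0.0, 0.0]                 # one logit per action: the policy parameters

def pi(theta):
    """Softmax over action logits."""
    e = [math.exp(t) for t in theta]
    s = sum(e)
    return [x / s for x in e]

alpha = 0.1
for episode in range(500):
    probs = pi(theta)
    a = rng.choices([0, 1], weights=probs)[0]  # sample an action from the policy
    G = 1.0 if a == 0 else 0.0                 # return: action 0 is the good one
    # REINFORCE: for a softmax policy, d(log pi(a))/d(theta_i) = 1[i == a] - p_i
    for i in range(2):
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[i] += alpha * G * grad_log       # gradient *ascent* on reward

p_good = pi(theta)[0]
print(p_good)  # probability of the rewarded action grows well above 0.9
```

The same structure applies to mountain car; the episodes are just longer and the returns are the per-step rewards summed to the end of the episode.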
Monte-Carlo Policy Gradient
• The method leverages techniques created for supervised learning
  • Inputs := the state information (position, velocity)
  • Predictions := the forward, neutral, or backward action taken
  • Labels ("ground truth") := After the episode is over, assign the most value to the last actions and the least value to the first actions
• Run many episodes; after each episode finishes (the flag is reached), strengthen the network such that the last moves become more probable
[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html]
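The "most value to the last actions" labeling falls out of the reward-to-go computation: with a reward of -1 per step, later actions have fewer remaining steps and thus a larger (less negative) return. A minimal sketch:

```python
def rewards_to_go(rewards, gamma=1.0):
    """Return G_t = r_t + gamma*r_{t+1} + ... for each time step t,
    computed in one backward pass over the episode's rewards."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

# A 5-step episode with reward -1 per step, as in mountain car:
returns = rewards_to_go([-1, -1, -1, -1, -1])
print(returns)  # [-5.0, -4.0, -3.0, -2.0, -1.0] -- the last action gets the most
```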
Gradient ascent
• Gradient algorithms find a local extremum
• At the end of each episode, adjust each parameter in 𝜃 s.t. actions made near the end are strengthened
• How much, and in which direction, to move each parameter is determined by the backpropagation method
[Surface plot: episode rewards as a function of parameters 𝜃1 and 𝜃2]
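Climbing such a reward surface over (𝜃1, 𝜃2) can be illustrated on a toy function whose maximum is known. A sketch using the hypothetical surface R(𝜃) = -(𝜃1 - 1)² - (𝜃2 + 2)², which peaks at (1, -2):

```python
def reward(t1, t2):
    return -(t1 - 1.0) ** 2 - (t2 + 2.0) ** 2   # maximum of 0 at (1, -2)

def grad(t1, t2):
    # Analytic gradient of the surface; for a DNN, backpropagation
    # computes the analogous quantities automatically.
    return -2.0 * (t1 - 1.0), -2.0 * (t2 + 2.0)

t1, t2, alpha = 0.0, 0.0, 0.1
for _ in range(100):
    g1, g2 = grad(t1, t2)
    t1 += alpha * g1        # move *up* the gradient: ascent, not descent
    t2 += alpha * g2

print(round(t1, 3), round(t2, 3))  # converges to the local maximum near (1, -2)
```

The "+=" (rather than "-=") in the update is the only difference from the gradient descent used to minimize a loss in supervised learning.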
Caveats
• Deep RL is usually slow to learn
• Transferring knowledge from one problem to another is difficult
• The reward function can be complex
Safety and Security Considerations
• DNNs are black-box models
  • It is possible to give an input which causes the DNN to produce a wild output
• Efforts exist to mitigate this limitation
  • E.g., Constrained Policy Optimization
Constrained Policy Optimization
• Schoolbook RL specifies only the reward function
• Problem: while an agent is learning, it may try anything
  • Potentially unsafe when training is in a physical environment
• Constraints can be added to the objective function
[Achiam et al. "Constrained Policy Optimization", 2017]
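One simple way to fold a constraint into the objective (a penalty-based relative of Constrained Policy Optimization, not Achiam et al.'s exact algorithm) is a Lagrangian term: maximize expected reward minus λ times the amount by which the expected safety cost exceeds its limit. A sketch of that modified objective:

```python
def constrained_objective(expected_reward, expected_cost, cost_limit, lam):
    """Penalized objective: the reward term is reduced when the policy's
    expected safety cost exceeds the limit. lam >= 0 is the multiplier."""
    return expected_reward - lam * max(0.0, expected_cost - cost_limit)

# A policy within its safety budget is untouched; an unsafe one is penalized,
# even though its raw reward is higher:
safe = constrained_objective(expected_reward=10.0, expected_cost=0.5,
                             cost_limit=1.0, lam=5.0)
unsafe = constrained_objective(expected_reward=12.0, expected_cost=3.0,
                               cost_limit=1.0, lam=5.0)
print(safe, unsafe)  # 10.0 2.0 -- the optimizer now prefers the safe policy
```

Gradient ascent on this penalized objective steers learning away from unsafe behavior during training, which is the motivation stated above.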
Current Efforts
Developing RL for Quadcopter Control
• Good case study for complex autonomous CPS:
  • Collision avoidance
  • Target tracking
  • Package delivery
• Using open-source firmware and hardware
Using Microsoft AirSim for 1st-order learning
[S. Shah et al. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. 2017]
Conclusions
• RL is a generalizable method to tackle many CPS decision-making problems
  • High-capacity models can make sophisticated decisions
• Good approach for CPS education because of its interdisciplinary nature
• Open problems remain when using black-box functions for safety applications
Questions?