Slide 1: Security in Multiagent Systems by Policy Randomization

Praveen Paruchuri, Milind Tambe, Fernando Ordonez (University of Southern California)
Sarit Kraus (Bar-Ilan University, Israel; University of Maryland, College Park)
Slide 2: Motivation: The Prediction Game
A UAV (Unmanned Aerial Vehicle) flies between four regions. Can you predict the UAV's flight pattern?

Pattern 1: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...
Pattern 2: 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, ... (as generated by a four-sided die)

Could you predict Pattern 2 even if given 100 of its numbers?

Randomization decreases predictability, which increases security.
[Figure: four patrol regions in a 2x2 grid: Region 1, Region 2, Region 3, Region 4]
Slide 3: Problem Definition
Problem: increase security by decreasing predictability for an agent or agent team acting in uncertain, adversarial environments.
- Even if the policy is given to the adversary, it remains secure.
- Efficient algorithms for the reward/randomness tradeoff.
Assumptions for the agent/agent team:
- The adversary is unobservable.
- The adversary's actions, capabilities, and payoffs are unknown.

Assumptions for the adversary:
- Knows the agents' plan/policy.
- Exploits action predictability.
- Can see the agent's state (or belief state).
Slide 4: Solution Technique
Technique developed: intentional policy randomization in an MDP/POMDP framework.
- Sequential decision making.
- MDP: Markov Decision Process.
- POMDP: Partially Observable MDP.

Increasing security means solving a multi-criteria problem for the agents:
- Maximize action unpredictability (policy randomization).
- Maintain reward above a threshold (quality constraints).
Slide 5: Domains
Scheduled activities at airports, such as security checks and refueling:
- Observable by anyone.
- Randomization of schedules is helpful.

A UAV or UAV team patrolling a humanitarian mission:
- The adversary aims to disrupt the mission: it can disrupt food supplies, harm refugees, shoot down UAVs, etc.
- Randomize the UAV patrol policy.
Slide 6: My Contributions
Two main contributions.

Single-agent case:
- Formulate as a nonlinear program with an entropy-based metric.
- Convert to a linear program called BRLP (binary search for randomization via LP).
- Randomize single-agent policies while keeping reward above a threshold.

Multi-agent case: RDR (Rolling Down Randomization).
- Randomized policies for decentralized POMDPs.
- Threshold on team reward.
Slide 7: MDP-Based Single-Agent Case
An MDP is a tuple <S, A, P, R>:
- S: set of states
- A: set of actions
- P: transition function
- R: reward function
Basic terms used:
- x(s,a): expected number of times action a is taken in state s.
- Policy (as a function of the MDP flows):

$$\hat{\pi}(s,a) = \frac{x(s,a)}{\sum_{a' \in A} x(s,a')}$$
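As a concrete illustration, here is a minimal sketch (not from the talk) of normalizing flow variables into a policy; the example MDP and its flow values are made up:

```python
import numpy as np

def policy_from_flows(x):
    """pi_hat(s, a) = x(s, a) / sum_a' x(s, a'): normalize the flow
    out of each state into a probability distribution over actions."""
    x = np.asarray(x, dtype=float)
    state_flow = x.sum(axis=1, keepdims=True)
    return np.divide(x, state_flow, out=np.zeros_like(x), where=state_flow > 0)

# Hypothetical flows for a 2-state, 2-action MDP.
print(policy_from_flows([[0.6, 0.2], [0.1, 0.1]]))  # rows: [0.75 0.25], [0.5 0.5]
```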
Slide 8: Entropy: A Measure of Randomness
Randomness, or information content, is measured by entropy (Shannon 1948). Two entropy metrics for an MDP:
- Additive entropy: add the entropies of each state (π is a function of x).
- Weighted entropy: weigh each state's entropy by its contribution to total flow.
$$H_A(x) = -\sum_{s \in S} \sum_{a \in A} \hat{\pi}(s,a) \log \hat{\pi}(s,a)$$

$$H_W(x) = -\sum_{s \in S} \frac{\sum_{a \in A} x(s,a)}{\sum_{j \in S} \alpha_j} \sum_{a \in A} \hat{\pi}(s,a) \log \hat{\pi}(s,a)$$

where $\alpha_j$ is the initial flow of the system.
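A small sketch of both metrics computed from flows, assuming log base 2; the example flows and unit initial flows are hypothetical, not from the talk:

```python
import numpy as np

def entropies(x, alpha):
    """Additive entropy H_A and weighted entropy H_W of the policy
    induced by flow variables x[s, a], per the formulas above."""
    x = np.asarray(x, dtype=float)
    state_flow = x.sum(axis=1, keepdims=True)       # sum_a x(s, a)
    pi = np.divide(x, state_flow, out=np.zeros_like(x), where=state_flow > 0)
    plogp = pi * np.log2(np.maximum(pi, 1e-12))     # 0 log 0 treated as 0
    h_state = -plogp.sum(axis=1)                    # per-state policy entropy
    h_additive = h_state.sum()
    weights = state_flow.ravel() / np.sum(alpha)    # state's share of total flow
    return h_additive, (weights * h_state).sum()

# Hypothetical flows for a 2-state, 2-action MDP with unit initial flows.
print(entropies([[0.5, 0.5], [0.25, 0.75]], alpha=[1.0, 1.0]))
```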
Slide 9: Tradeoff: Reward vs. Entropy
Nonlinear program: maximize entropy subject to reward above a threshold. The objective (entropy) is nonlinear:

$$\begin{aligned} \max\;& H_W(x) \\ \text{s.t.}\;& Ax = \alpha \\ & x \ge 0 \\ & R^\top x \ge R_{\min} \end{aligned}$$

BRLP (binary search for randomization via LP): a linear program with no entropy calculation in the objective; entropy enters only implicitly, as a function of the flows.
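To make the tradeoff concrete, here is a hedged sketch of this nonlinear program on a toy single-step MDP using SciPy's SLSQP solver; the matrix A, flows alpha, rewards R, and the 70% threshold are all illustrative assumptions, not the authors' setup:

```python
import numpy as np
from scipy.optimize import minimize

# Toy single-step MDP, all numbers hypothetical. Flow variables are
# flattened as [x(s0,a0), x(s0,a1), x(s1,a0), x(s1,a1)].
A = np.array([[1.0, 1.0, 0.0, 0.0],     # flow conservation: sum_a x(s,a) = alpha_s
              [0.0, 0.0, 1.0, 1.0]])
alpha = np.array([1.0, 1.0])
R = np.array([1.0, 0.0, 0.8, 0.2])      # reward per unit flow
R_min = 0.7 * 1.8                       # 70% of the deterministic optimum (1.8)

def neg_weighted_entropy(x):
    """-H_W(x); minimized, so H_W is maximized."""
    x = x.reshape(2, 2)
    state_flow = x.sum(axis=1, keepdims=True)
    pi = x / np.maximum(state_flow, 1e-12)
    plogp = pi * np.log2(np.maximum(pi, 1e-12))
    weights = state_flow.ravel() / alpha.sum()
    return (weights * plogp.sum(axis=1)).sum()

res = minimize(neg_weighted_entropy, x0=np.full(4, 0.5), method="SLSQP",
               bounds=[(0.0, None)] * 4,
               constraints=[{"type": "eq", "fun": lambda x: A @ x - alpha},
                            {"type": "ineq", "fun": lambda x: R @ x - R_min}])
print(res.x.reshape(2, 2), -res.fun)    # randomized flows and achieved H_W
```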
Slide 10: BRLP
- Input: a high-entropy policy and a target reward (n% of the maximum reward).
- Polynomial-time convergence.
- Monotonicity: entropy decreases (or stays constant) as the required reward increases; controlled through β.
- The input can be any high-entropy policy. One such input is the uniform policy: equal probability for all actions out of every state.
Slide 11: LP for Binary Search
The policy is expressed as a function of β and $\hat{x}(s,a)$, the flows of the high-entropy input policy.

Linear program:

$$\begin{aligned} \max\;& R^\top x \\ \text{s.t.}\;& Ax = \alpha \\ & x \ge 0 \\ & x(s,a) \ge \beta\,\hat{x}(s,a) \quad \forall s \in S,\, a \in A \end{aligned}$$
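A sketch of how BRLP's binary search might look on the same toy MDP, using SciPy's linprog; the uniform-policy flows x_hat, the target reward, and the fixed iteration count are assumptions of mine, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linprog

# Same toy MDP as in the previous sketch (hypothetical numbers).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
alpha = np.array([1.0, 1.0])
R = np.array([1.0, 0.0, 0.8, 0.2])
x_hat = np.full(4, 0.5)        # flows of the uniform (high-entropy) input policy
target = 0.7 * 1.8             # target reward

def lp_beta(beta):
    """LP(beta): max R.x  s.t.  Ax = alpha,  x >= beta * x_hat."""
    res = linprog(-R, A_eq=A, b_eq=alpha,
                  bounds=[(lb, None) for lb in beta * x_hat])
    return res.x, -res.fun

# Reward is non-increasing in beta, so binary-search for the largest
# beta whose optimal reward still meets the target.
lo, hi = 0.0, 1.0
for _ in range(30):
    mid = (lo + hi) / 2
    if lp_beta(mid)[1] >= target:
        lo = mid
    else:
        hi = mid

x, reward = lp_beta(lo)
print(x.reshape(2, 2), reward)
```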
Slide 12: BRLP in Action
[Figure: binary search on β. β = 1 reproduces the maximum-entropy input flows $\hat{x}(s,a)$; β = 0 yields the deterministic maximum-reward policy. Intermediate values (e.g., β = .5) are probed until the target reward is reached.]
Slide 13: Results (Averaged over 10 MDPs)
[Figure: average weighted entropy vs. reward threshold (%) for BRLP, the H_W(x) method, the H_A(x) method, and the maximum-entropy bound.]
Highest entropy: the expected-entropy method, with a 10% average gain over BRLP. Fastest: BRLP, with a 7-fold average speedup over the expected-entropy method.
[Figure: execution time (sec) vs. reward threshold (%) for BRLP, the H_W(x) method, and the H_A(x) method.]
Slide 14: Multi-Agent Case: Problem
Maximize entropy for agent teams subject to a reward threshold.

For the agent team:
- A decentralized POMDP framework is used.
- Agents know the initial joint belief state.
- No communication is possible between agents.

For the adversary:
- Knows the agents' policy.
- Exploits action predictability.
- Can calculate the agents' belief state.
Slide 15: RDR: Rolling Down Randomization
Input: the best (local or global) deterministic policy, the percent of reward loss allowed, and the d parameter, which sets the number of turns each agent gets.
- Example: d = .5 => number of steps = 1/d = 2.
- Each agent gets one turn (in the two-agent case).
- A single-agent MDP problem is solved at each step.

On agent 1's turn: fix the policies of the other agents (agent 2) and find a randomized policy that
- maximizes joint entropy (w1 * Entropy(agent 1) + w2 * Entropy(agent 2)), and
- maintains joint reward above the threshold.
Slide 16: RDR: d = .5
Rolling down from the maximum reward to 80% of the maximum reward:
- Agent 1's turn: maximize joint entropy subject to joint reward > 90%; reward rolls down to 90%.
- Agent 2's turn: maximize joint entropy subject to joint reward > 80%.
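The control flow might be sketched as follows; this is only my reading of the turn-taking loop described above, and `randomize_step` is a hypothetical placeholder for the single-agent maximize-joint-entropy step, not the authors' code:

```python
# Hedged sketch of RDR's turn-taking loop only, not the authors' code.
# `randomize_step(agent, policies, threshold)` is a hypothetical callable
# standing in for the single-agent step described above: hold the other
# agents' policies fixed and return a policy for `agent` that maximizes
# joint entropy while keeping joint reward above `threshold`.
def rdr(policies, max_reward, loss_fraction, d, randomize_step):
    n_steps = round(1 / d)                             # e.g. d = .5 -> 2 turns
    step_loss = loss_fraction * max_reward / n_steps   # reward conceded per turn
    threshold = max_reward
    for step in range(n_steps):
        agent = step % len(policies)                   # agents alternate turns
        threshold -= step_loss                         # 100% -> 90% -> 80% ...
        policies[agent] = randomize_step(agent, policies, threshold)
    return policies
```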
Slide 17: Experimental Results: Reward Threshold vs. Weighted Entropy (Averaged over 10 Instances)
[Figure: weighted entropy vs. reward threshold (%) for time horizons T = 2 and T = 3, with the corresponding maximum-entropy bounds (T = 2 Max, T = 3 Max).]
Slide 18: Summary
- Intentional randomization is the main focus.
- Single-agent case: the BRLP algorithm was introduced.
- Multi-agent case: the RDR algorithm was introduced.
- A multi-criteria problem is solved that maximizes entropy while maintaining reward above the threshold.
Slide 19: Thank You

Any comments or questions?
Slide 21: Difference between Safety and Security
Security: the ability of the system to deal with threats that are intentionally caused by other intelligent agents and/or systems.

Safety: a system's ability to deal with any other threats to its goals.
Slide 22: Probing Results: Single-Agent Case
[Figure: number of observations the adversary needs vs. entropy, under three probing models: Observe All, Observe Select, Observe Noisy.]
Slide 23: Probing Results: Multi-Agent Case
[Figure: joint number of observations vs. joint entropy, under the same three probing models: Observe All, Observe Select, Observe Noisy.]
Slide 24: Define POMDP
Slide 25: Define Distributed POMDP
A Dec-POMDP is a tuple <S, A, P, Ω, O, R>, where:
- S: set of states.
- A: joint action set <a1, a2, ..., an>.
- P: transition function.
- Ω: set of joint observations.
- O: observation function, the probability of a joint observation given the current state and the previous joint action; the agents' observations are independent of each other.
- R: immediate, joint reward.

A Dec-MDP is a Dec-POMDP with the restriction that at each time step the agents' observations together uniquely determine the state.
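For concreteness, the tuple could be captured in code like this minimal sketch; representing P, O, and R as callables is my own modeling choice, not part of the definition:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Minimal container mirroring the Dec-POMDP tuple above.
@dataclass
class DecPOMDP:
    states: Sequence[str]                # S
    joint_actions: Sequence[tuple]       # A: joint actions <a1, ..., an>
    transition: Callable                 # P(s' | s, joint_action)
    joint_observations: Sequence[tuple]  # Omega
    observation: Callable                # O(joint_obs | s', joint_action)
    reward: Callable                     # R(s, joint_action): immediate joint reward
```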
Slide 26: Counterexample: Entropy
Suppose the adversary shoots down the UAV by targeting its most probable action; the probability of that action is called the hit rate.

Assume the UAV has 3 actions and consider two possible probability distributions (log base 2, small δ > 0):
- H(1/2, 1/2, 0) = 1, hit rate = 1/2.
- H(1/2 - δ, 1/4 + δ, 1/4) ≈ 3/2, hit rate = 1/2 - δ.

The second distribution has higher entropy but a lower hit rate.
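The slide's numbers are easy to check; a small script (δ = 0.01 is an arbitrary choice of mine) reproduces the entropies and hit rates:

```python
import math

def entropy(p):
    """Shannon entropy in bits (log base 2)."""
    return -sum(q * math.log2(q) for q in p if q > 0)

delta = 0.01  # any small positive delta
for p in [(0.5, 0.5, 0.0), (0.5 - delta, 0.25 + delta, 0.25)]:
    print(entropy(p), max(p))  # entropy and hit rate: (1.0, 0.5), (~1.5, 0.49)
```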
Slide 27: d-Parameter & Comments on Results
Effect of the d parameter (averaged over 10 instances). RDR average runtime in seconds, with entropy in parentheses, for T = 2:

Reward Threshold   d = 1        d = .5        d = .25       d = .125
90%                .67 (.59)    1.73 (.74)    3.47 (.75)    7.07 (.75)
50%                .67 (1.53)   1.47 (2.52)   3.4 (2.62)    7.47 (2.66)

Conclusions:
- Greater tolerance of reward loss => higher entropy.
- Reaching maximum entropy is harder than in the single-agent case.
- Lower miscoordination cost implies higher entropy.
- A d parameter of .5 is good for practical purposes.
Slide 28: Example Where the Uniform Policy Is Not Best
Slide 29: Entropies
For the uniform policy: 1 + (1/2)·1 + 2·(1/4)·1 + 4·(1/8)·1 = 2.5.

For a policy that is deterministic at the start and uniform afterwards: 0 + 1·1 + 2·(1/2)·1 + 4·(1/4)·1 = 3.

Hence, uniform policies need not always be optimal: the deterministic first step pushes more flow into the later states, which then contribute more weighted entropy.
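These sums can be checked directly; the grouping below (states per level × flow per state × per-state entropy) is my reading of the example, since the underlying MDP figure is not in the transcript:

```python
# Each term: (# states at a level) * (flow per state) * (per-state entropy).
uniform          = 1*1.0*1 + 1*0.5*1 + 2*0.25*1 + 4*0.125*1
det_then_uniform = 1*1.0*0 + 1*1.0*1 + 2*0.5*1  + 4*0.25*1
print(uniform, det_then_uniform)  # 2.5 3.0
```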