Increasing Security through Communication and Policy Randomization in Multiagent Systems
Praveen Paruchuri, Milind Tambe, Fernando Ordonez
University of Southern California
Sarit Kraus
Bar-Ilan University, Israel
University of Maryland, College Park
Motivation: The Prediction Game
A UAV (Unmanned Aerial Vehicle) flies between the 4 regions.
Can you predict the UAV's flight pattern?
Pattern 1: 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, ...
Pattern 2: 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, ... (as generated by a 4-sided die)
Can you predict pattern 2 even if 100 of its numbers are given?
Randomization decreases predictability and increases security.
Region 1 Region 2
Region 3 Region 4
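As a side illustration of the prediction game, the sketch below (hypothetical code, not from the talk) quantifies predictability with Shannon entropy: per-symbol entropy looks similar for both patterns, but the conditional entropy given the previous region exposes that pattern 1 is fully predictable.

```python
import random
from collections import Counter
from math import log2

def entropy(seq):
    """Shannon entropy (bits) of the empirical symbol distribution."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

def conditional_entropy(seq):
    """Empirical entropy of the next symbol given the previous one."""
    pairs = Counter(zip(seq, seq[1:]))
    prev = Counter(seq[:-1])
    n = len(seq) - 1
    return -sum(c / n * log2(c / prev[a]) for (a, b), c in pairs.items())

# Pattern 1: the deterministic cycle 1, 2, 3, 4 repeated.
pattern1 = [1, 2, 3, 4] * 25

# Pattern 2: each region drawn uniformly, like rolling a 4-sided die.
rng = random.Random(0)
pattern2 = [rng.randint(1, 4) for _ in range(100)]

print(entropy(pattern1))              # 2.0: per-symbol distribution is uniform
print(conditional_entropy(pattern1))  # 0.0: next region is certain
print(conditional_entropy(pattern2))  # close to 2 bits: unpredictable
```

The seed and sequence lengths are arbitrary; the point is only that pattern 1's conditional entropy collapses to zero.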
Problem Definition
Problem: increase security by decreasing predictability for an agent team acting in adversarial environments.
- The policy remains secure even if it is given to the adversary
- The environment is stochastic and observable (MDP-based)
- Communication is limited
- Goal: efficient algorithms for the reward/randomization/communication tradeoff
Assumptions
Assumptions for the agent team:
- The adversary is unobservable
- The adversary's actions, capabilities, and payoffs are unknown
- Communication is encrypted (safe)

Assumptions for the adversary:
- Knows the agents' plan/policy
- Exploits action predictability
- Can see the agents' state
Solution Technique
Technique developed: intentional policy randomization in a CMDP (Constrained Markov Decision Process) based framework:
- Sequential decision making
- Limited communication resources

Increasing security means solving a multi-criteria problem for the agents:
- Maximize action unpredictability (policy randomization)
- Keep expected reward above a threshold (quality constraints)
- Keep communication usage below a threshold (resource constraints)
Domains
Scheduled activities at airports (security checks, refueling, etc.) can be observed by adversaries; randomizing the schedules helps.

A UAV team patrolling a humanitarian mission: an adversary may disrupt the mission (disrupt food supplies, harm refugees, shoot down UAVs, etc.); randomize the UAV patrol policy.
Our Contributions
1. Randomized policies for multiagent CMDPs (MCMDPs)
2. Solving miscoordination: randomized policies in team settings may not be implementable (the reward constraint gets violated)

Optimization problem:
  Maximize policy randomization
  subject to: expected team reward > threshold
              communication resource usage < threshold
Miscoordination: Effect of Randomization
Meeting tomorrow: 9am with probability 0.4, 10am with probability 0.6.
The agents communicate to coordinate, but communication is limited.
If each agent randomizes independently, the joint probabilities are:

Agent 1 \ Agent 2   9am (.4)   10am (.6)
9am (.4)              .16        .24
10am (.6)             .24        .36

The off-diagonal entries (.24 each) should have been 0: the miscoordination violates the reward threshold.
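The joint probabilities above can be reproduced in a few lines; this is a minimal sketch (hypothetical code, not from the talk) assuming each agent draws its meeting time independently:

```python
# Joint outcome probabilities when the two agents pick the meeting time
# independently (9am with prob 0.4, 10am with prob 0.6).
times = {"9am": 0.4, "10am": 0.6}

joint = {(t1, t2): p1 * p2
         for t1, p1 in times.items()
         for t2, p2 in times.items()}

# The agents meet only when both pick the same time.
coordinated = sum(p for (t1, t2), p in joint.items() if t1 == t2)
miscoordinated = 1.0 - coordinated

print(round(coordinated, 2))     # 0.52
print(round(miscoordinated, 2))  # 0.48
```

Independent randomization miscoordinates 48% of the time, which is why the team reward constraint gets violated without communication.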
Communication Issue
Goal: generate randomized, implementable policies under limited communication.

The communication problem: M coordination points and N units of communication; generate the best communication policy (the communication policy itself can also be randomized).

Approach: transform the MCMDP into an implementable MCMDP, with a solution algorithm for the transformed MCMDP.
MCMDP: Formally Defined
An MCMDP (for the 2-agent case) is a tuple <S, A, P, R, C1, C2, T1, T2, N, Q> where:
  S, A, R - joint states, actions, rewards
  P  - transition function
  Ck - cost vector for resource k
  Tk - threshold on expected resource-k consumption
  N  - joint communication cost vector
  Q  - threshold on communication costs

Basic terms used:
  x(s,a) - expected number of times action a is taken in state s
  Policy (as a function of x): π̂(s,a) = x(s,a) / Σ_{a'∈A} x(s,a')
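The relation between the occupancy measure x(s,a) and the policy π̂(s,a) = x(s,a) / Σ_{a'} x(s,a') can be sketched as follows; the dict-based representation and toy numbers are illustrative assumptions, not from the paper:

```python
def policy_from_occupancy(x):
    """Derive the randomized policy pi(s, a) = x(s, a) / sum_a' x(s, a')
    from an occupancy measure x mapping (state, action) -> expected visits."""
    totals = {}
    for (s, a), v in x.items():
        totals[s] = totals.get(s, 0.0) + v
    return {(s, a): v / totals[s] for (s, a), v in x.items()}

# Toy occupancy measure: one state visited 4 times in expectation.
x = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
pi = policy_from_occupancy(x)
print(pi)  # {('s1', 'left'): 0.25, ('s1', 'right'): 0.75}
```

Normalizing per state turns expected visit counts into per-state action probabilities, which is what the entropy objectives below operate on.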
Entropy: Measure of Randomness
Randomness (information content) is quantified using entropy (Shannon 1948). Entropy for a CMDP:

  Additive entropy - add the entropies of each state:
    H_A(x) = -Σ_{s∈S} Σ_{a∈A} π̂(s,a) log π̂(s,a)

  Weighted entropy - weight each state by its contribution to the total flow, where α_j is the initial flow of the system:
    H_W(x) = -Σ_{s∈S} (Σ_{a∈A} x(s,a) / Σ_{j∈S} α_j) Σ_{a∈A} π̂(s,a) log π̂(s,a)
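The weighted entropy above can be computed directly from an occupancy measure; here is a minimal sketch (hypothetical code and toy numbers, assuming the total initial flow α is passed in):

```python
from math import log

def weighted_entropy(x, alpha_total):
    """Weighted entropy H_W(x): each state's policy entropy is weighted
    by that state's share of the total flow, sum_a x(s,a) / sum_j alpha_j."""
    flow = {}  # per-state flow: sum_a x(s, a)
    for (s, a), v in x.items():
        flow[s] = flow.get(s, 0.0) + v
    h = 0.0
    for (s, a), v in x.items():
        if v > 0:
            p = v / flow[s]  # pi_hat(s, a)
            h -= (flow[s] / alpha_total) * p * log(p)
    return h

# Toy case: uniform policy in s1, deterministic policy in s2.
x = {("s1", "a1"): 0.5, ("s1", "a2"): 0.5, ("s2", "a1"): 1.0}
print(round(weighted_entropy(x, alpha_total=1.0), 4))  # 0.6931, i.e. ln 2
```

Only s1 contributes: a deterministic state has zero entropy, so maximizing H_W pushes flow toward states where the policy can afford to randomize.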
Issue 1: Randomized Policy Generation
Non-linear program: maximize entropy, with reward above threshold and communication below threshold:

  max  H_W(x)
  s.t. Ax = α
       R·x ≥ R_min
       N·x ≤ Q
       x ≥ 0

This obtains the required randomization, but appends communication for every action.

Issue 2: Generate the Communication Policy
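In the special case of a single state (so x reduces to an action distribution and Ax = α becomes Σ_a x_a = 1), maximum entropy subject to an expected-reward constraint has a Gibbs-form solution p_i ∝ exp(λ·r_i). The sketch below (hypothetical, stdlib-only; the rewards and threshold are made up) finds λ by bisection rather than solving the general NLP:

```python
from math import exp, log

# Toy single-state instance: three actions with made-up rewards.
# The uniform distribution gives expected reward 4/3 < r_min, so the
# reward constraint is tight at the entropy-maximizing solution.
r = [3.0, 1.0, 0.0]
r_min = 1.5

def gibbs(lam):
    """Max-entropy distribution with multiplier lam: p_i ∝ exp(lam * r_i)."""
    w = [exp(lam * ri) for ri in r]
    z = sum(w)
    return [wi / z for wi in w]

def expected_reward(p):
    return sum(pi * ri for pi, ri in zip(p, r))

# Expected reward is increasing in lam; bisect to make the constraint tight.
lo, hi = 0.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if expected_reward(gibbs(mid)) < r_min:
        lo = mid
    else:
        hi = mid

p = gibbs(hi)
entropy = -sum(pi * log(pi) for pi in p)
print([round(pi, 3) for pi in p], round(entropy, 3))
```

The solution stays close to uniform while meeting the reward threshold, which is exactly the reward/randomization tradeoff the program formalizes.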
Issue 2: Transformed MCMDP
For each state and each joint action:
- transitions between the original state and new intermediate states, and between the new states and the original target states;
- introduce C (communicate) and NC (no-communicate) versions of each individual action, adding the corresponding new states.

[Figure: state S1 with joint actions a1b1, a1b2, a2b1, a2b2; each of agent A's actions splits into C(a1)/NC(a1) and C(a2)/NC(a2) branches leading to the new intermediate states.]
Non-linear Constraints
Need to introduce non-linear constraints: for each original state, and for each new state introduced by a no-communication action, the conditional probabilities of the corresponding actions must be equal.

Example: P(b1 | A1) = P(b1 | A2) and P(b2 | A1) = P(b2 | A2), where A1 and A2 are the new states reached without a communication action (unobservable to agent B); the states reached by a communication action are observable.
Non-Linear constraints: Handling Miscoordination
If only NC (no-communication) actions are taken, agent B has no hint of the state, so its actions must be made independent of the source state: the probability of action b1 from one such state must equal the probability of b1 from the other.

Meeting scenario: irrespective of agent A's plan, if agent B's plan is 20% 9am and 80% 10am, then B's choice is independent of A's.
Miscoordination is avoided: the actions are independent of the state.
Experimental Results
[Figure: 3-D chart "Entropy vs Reward vs Communication". X-axis: communication threshold (0 to 6); Y-axis: reward threshold (>1 to >11); Z-axis: weighted entropy (0 to 2).]
Experimental Conclusions
As the reward threshold decreases, entropy increases.
As communication increases, the agents coordinate better; this coordination is invisible to the adversary, so the agents coordinate better to fool the adversary.
Increased communication leads to higher entropy.
Summary
Randomized Policies in Multiagent MDP settings
Developed NLP to maximize weighted entropy with reward and communication constraints.
Provided transformation algorithm to explicitly reason about communication actions.
Showed that communication increases security.
Thank You
Any questions?