Distributed Intrusion Detection in Partially Observable Markov Decision Processes
Mathematical & Computer Sciences Department
ABSTRACT
The problem of decentralized control occurs frequently in realistic domains where agents have to cooperate to achieve a universal goal. Planning for a domain-level joint strategy takes into account the uncertainty of the underlying environment in computing near-optimal joint strategies that can handle the intrinsic domain uncertainty. However, uncertainty related to agents deviating from the recommended joint policy is not taken into consideration. We focus on hostile domains, where the goal is to quickly identify deviations from planned behavior by any compromised agents. There is a growing need to develop techniques that enable the system to recognize and recover from such deviations. We discuss the problem from the intruder's perspective and then present a distributed intrusion detection scheme that can detect a particular type of attack.
Categories and Subject Descriptors
I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence

General Terms

Keywords
INTRODUCTION
The problem of decentralized control can be effectively modeled as a decentralized partially observable Markov Decision Process (DEC-POMDP) [1], where the agents in the model are the decision makers. The uncertainty related to each agent's action spans not only the uncertainty of the underlying environment but also the decision making of the other agents in the system. In hostile domains, malfunctioning of the agents due to internal failure or even due to enemy intrusion cannot be discounted. There is a growing need to develop multiagent systems that can address such situations and can efficiently compensate for such setbacks. We analyze the problem from the perspective of the malicious agent and come up with one possible attack that it can launch on the system. We then use decision-theoretic paradigms to obtain a solution to such attacks and propose reputation update schemes that distributively identify such malicious agents. We substantiate our claims with results from the popular multiagent tiger domain [2].
DEC-POMDP MODEL
A DEC-POMDP is given by a tuple <I, S, {A_i}, {Ω_i}, O, P, R, b_0> where I is the finite set of agents indexed by 1...n; S is a finite set of states; A_i is a finite set of actions available to agent i and A = ×_{i∈I} A_i is the set of joint actions, where a = <a_1, ..., a_n> denotes a joint action; Ω_i is a finite set of observations available to agent i and Ω = ×_{i∈I} Ω_i is the set of joint observations, where o = <o_1, ..., o_n> denotes a joint observation; O is the observation function given by O(s, a_1, ..., a_n, o_1, ..., o_n), the probability of observing the joint observation (o_1, ..., o_n) when transitioning to state s after taking joint action (a_1, ..., a_n); P is the set of Markovian state transition probabilities, where P(s, a, s') denotes the probability of taking action a in state s and reaching state s'; R : S × A → ℝ is the joint reward function; and b_0 is the initial belief state for all agents. We assume that the agents' observations are independent. Thus the observation function can be represented as O = ×_{i∈I} O_i, where O_i(s, a_1, ..., a_n, o_i) is the probability that agent i observes o_i given that the joint action <a_1, ..., a_n> resulted in state s. The decision problem spans a finite horizon T.
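For concreteness, the tuple above can be carried as a single container type at implementation time. The sketch below is a minimal illustration, assuming dictionary-based tables for P, O_i and R and integer agent indices; the class and field names are ours, not the paper's.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Minimal container for the DEC-POMDP tuple <I, S, {A_i}, {Omega_i}, O, P, R, b0>.
# The dictionary-based tables and field names are illustrative choices, not
# notation taken from the paper.
@dataclass
class DecPOMDP:
    agents: List[int]                                   # I, indexed 1..n
    states: List[str]                                   # S
    actions: Dict[int, List[str]]                       # A_i per agent; A = product of the A_i
    observations: Dict[int, List[str]]                  # Omega_i per agent
    # O_i(s', joint action, o_i): per-agent observation probabilities
    # (observation independence is assumed, as in the model section).
    obs_fn: Dict[int, Dict[Tuple[str, Tuple[str, ...], str], float]]
    # P(s, joint action, s'): Markovian transition probabilities.
    trans_fn: Dict[Tuple[str, Tuple[str, ...], str], float]
    # R(s, joint action): joint reward.
    reward_fn: Dict[Tuple[str, Tuple[str, ...]], float]
    b0: Dict[str, float]                                # common initial belief over S
    horizon: int                                        # finite horizon T

    def joint_obs_prob(self, s_next, joint_action, joint_obs):
        """O(s', a, o) factored as the product of the independent O_i terms."""
        prob = 1.0
        for agent, obs in zip(self.agents, joint_obs):
            prob *= self.obs_fn[agent].get((s_next, joint_action, obs), 0.0)
        return prob
```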
POLICY REPRESENTATION AND RUNNING BELIEF STATE
The policy for agent i, π_i, is represented by a policy tree (see Figure 1). Each node corresponds to an action and each edge corresponds to an observation that the agent makes at that time interval. We assume that there is a centralized planner that computes the policy tree for each agent. We introduce the concept of the running belief state of an agent as an execution-time estimate of an agent over the physical states and the observation histories of the other agents in the system. The running belief state of agent i at time interval t is given by RB_i^t : S × o_{-i}^t → [0, 1], where o_{-i}^t are the t'th observation histories of the other agents. We define Bel_i^t as the set of all such possible combinations of physical states and observation histories that have a positive probability in RB_i^t. Thus Bel_i^t can be defined as Bel_i^t = {b | RB_i^t(b) > 0}.

Figure 1: Policy tree.
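Such a policy tree reduces to a small recursive structure: an action per node and one child per observation. The sketch below is illustrative (the PolicyNode class and action_for_history helper are our names) and is reused by the later sketches.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Illustrative encoding of a policy tree: each node stores the action taken at
# that time interval, with one child per possible observation (the edges in
# Figure 1).
@dataclass
class PolicyNode:
    action: str
    children: Dict[str, "PolicyNode"] = field(default_factory=dict)  # observation -> subtree

def action_for_history(root: PolicyNode, obs_history: Tuple[str, ...]) -> str:
    """pi_i(o_i^t): follow the edges labelled by the agent's observation history
    and return the action prescribed at the node that is reached."""
    node = root
    for obs in obs_history:
        node = node.children[obs]
    return node.action
```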
Running Belief State Update
The agents update RB_i^t and Bel_i^t with each execution step. The update occurs in two phases. The first phase occurs before the agent takes its t+1'th step action: the agent exhaustively projects every belief point in its current Bel_i^t forward through the committed joint policy. We call this estimate the priori estimate, RB_i^priori:

RB_i^priori(s', (o_{-i}^t, o_{-i}^{t+1})) = α_1 · Σ_{(s, o_{-i}^t) ∈ Bel_i^t} RB_i^t(s, o_{-i}^t) · P(s, <π_i(o_i^t), π_{-i}(o_{-i}^t)>, s') · O_i(s', <π_i(o_i^t), π_{-i}(o_{-i}^t)>, o_i^{t+1}) · O_{-i}(s', <π_i(o_i^t), π_{-i}(o_{-i}^t)>, o_{-i}^{t+1})    (1)

where α_1 is a normalizing constant and s, s' ∈ S. Bel_i^priori is calculated accordingly. The second phase of the update happens after the agent takes its t+1'th step action and obtains the joint reward given by the random variable R^{t+1}. The agent can now calculate the posteriori estimate RB_i^posteriori:

RB_i^posteriori(b) = α_2 · RB_i^priori(b) · Pr(R^{t+1} = r | b, <π_i(o_i^{t+1}), π_{-i}(o_{-i}^{t+1})>),  b ∈ Bel_i^priori    (2)

where α_2 is another normalizing constant and r is the reward received at the t+1'th time step. The second probability term in Equation 2 returns a value of 1 or 0 depending on whether it is possible to get the reward r after taking the committed joint action at that belief point; Bel_i^posteriori is updated accordingly.

Agent i knows its own physical state and, using RB_i^t, it tries to approximately estimate the state of all agents j, j ∈ I, j ≠ i. It can be shown that Bel_i^t is always an over-estimate of the actual world state.
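The two update phases can be sketched as follows over the illustrative DecPOMDP and PolicyNode containers introduced earlier: the priori phase enumerates candidate next observations of the other agents as in Equation 1, and the posteriori phase keeps only reward-consistent belief points as in Equation 2. Function names, the belief encoding and the exact timing conventions are our assumptions.

```python
import itertools

def joint_action_at(model, policies, i, own_hist, others, others_hists):
    """Joint action prescribed by the committed joint policy at one belief point."""
    acts = {i: action_for_history(policies[i], own_hist)}
    for j, hist in zip(others, others_hists):
        acts[j] = action_for_history(policies[j], hist)
    return tuple(acts[j] for j in model.agents)

def priori_update(model, rb, policies, i, own_hist, own_next_obs):
    """First phase (Equation 1): project every (s, o_-i^t) in Bel_i^t forward
    under the committed joint policy, weighting by P, O_i and O_-i, then
    normalize (alpha_1).  `rb` maps (state, others_histories) -> probability,
    where others_histories is a tuple of per-agent observation-history tuples."""
    others = [j for j in model.agents if j != i]
    new_rb = {}
    for (s, others_hists), w in rb.items():
        a = joint_action_at(model, policies, i, own_hist, others, others_hists)
        for s_next in model.states:
            base = (w * model.trans_fn.get((s, a, s_next), 0.0)
                      * model.obs_fn[i].get((s_next, a, own_next_obs), 0.0))
            if base == 0.0:
                continue
            # Enumerate the possible next observations of the other agents.
            for next_obs in itertools.product(*(model.observations[j] for j in others)):
                p = base
                for j, oj in zip(others, next_obs):
                    p *= model.obs_fn[j].get((s_next, a, oj), 0.0)
                if p > 0.0:
                    key = (s_next, tuple(h + (o,) for h, o in zip(others_hists, next_obs)))
                    new_rb[key] = new_rb.get(key, 0.0) + p
    z = sum(new_rb.values())                             # alpha_1
    return {k: v / z for k, v in new_rb.items()} if z else new_rb

def posteriori_update(model, rb_priori, policies, i, own_hist, reward):
    """Second phase (Equation 2): keep only belief points at which the received
    joint reward is attainable under the committed joint policy (the 0/1 term),
    then renormalize (alpha_2).  `own_hist` is agent i's observation history at
    the step the reward was obtained."""
    others = [j for j in model.agents if j != i]
    kept = {}
    for (s, others_hists), w in rb_priori.items():
        a = joint_action_at(model, policies, i, own_hist, others, others_hists)
        if model.reward_fn.get((s, a)) == reward:
            kept[(s, others_hists)] = w
    z = sum(kept.values())                               # alpha_2
    return {k: v / z for k, v in kept.items()} if z else kept
```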
Anticipated set of rewards
The anticipated set of rewards of agent i at time t, η_i^t, is the set of all possible rewards that it can get by taking the t'th action of the committed joint policy: η_i^t = {R(s, <π_i(o_i^t), π_{-i}(o_{-i}^t)>) | (s, o_{-i}^t) ∈ Bel_i^priori}. Note that the agent has to use the Bel_i^priori estimate, as the anticipated rewards are calculated before the t'th action has been taken. If an agent i receives a joint reward r ∉ η_i^t at time t, it concludes that some agent(s) have deviated from their committed policies.
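Computing η_i^t and the deviation test then amounts to a scan over the support of the priori estimate; the sketch below reuses the helpers from the earlier sketches and its names are illustrative.

```python
def anticipated_rewards(model, rb_priori, policies, i, own_hist):
    """eta_i^t: every joint reward consistent with some belief point in
    Bel_i^priori (the support of the priori estimate) under the committed
    joint policy."""
    others = [j for j in model.agents if j != i]
    eta = set()
    for (s, others_hists), w in rb_priori.items():
        if w > 0.0:                      # Bel_i^priori = positive-probability points
            a = joint_action_at(model, policies, i, own_hist, others, others_hists)
            eta.add(model.reward_fn.get((s, a)))
    return eta

def deviation_detected(model, rb_priori, policies, i, own_hist, received_reward):
    """Agent i concludes that some agent has deviated whenever the received
    joint reward is not a member of eta_i^t."""
    return received_reward not in anticipated_rewards(model, rb_priori, policies, i, own_hist)
```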
We assume that there can be compromised agents in the system which want to have a covert harmful impact on the system. A malicious agent k can only avoid suspicion if it takes actions which are consistent with the anticipated reward estimates of the other agents. The moment k takes an action that produces a joint reward r ∉ η_i^t, the other agents conclude that some agent(s) must have deviated from the committed joint policy. However, k can choose from the set of actions associated with the different observations of the particular node of the policy tree it is in and thereby avoid detection. In Figure 1, if k is at branch b1 of π_k at time t, it chooses between the actions a1 and a2 pertaining to observations o1 and o2 respectively; k has no incentive to choose an action from another node at that time, since there is no guarantee that the joint reward received would be a member of η_i^t, ∀i ≠ k. We refer to such agents as simple malicious agents. A simple malicious agent is deterministic if it always takes the action corresponding to the least probable observation of the subtree of the policy tree it is in, or stochastic if it takes any action corresponding to the observations of the subtree with equal probability.
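Both flavors of simple malicious agent can be sketched as an action-selection rule over the current policy-tree node. The sketch below assumes, for illustration only, that the deterministic attacker can rank observations by their true probabilities given the state and the previous joint action.

```python
import random

def simple_malicious_action(model, k, node, state, joint_action_prev,
                            deterministic=True, rng=random):
    """Action choice of a simple malicious agent k: it stays within its current
    policy-tree node but acts as if it had made a different observation.  The
    deterministic variant picks the least probable observation given the last
    joint action; the stochastic variant picks uniformly among the node's edges.
    Assuming knowledge of the true state and previous joint action is an
    illustrative simplification for the deterministic case."""
    if deterministic:
        faked = min(model.observations[k],
                    key=lambda o: model.obs_fn[k].get((state, joint_action_prev, o), 0.0))
    else:
        faked = rng.choice(list(node.children))
    return node.children[faked].action, faked
```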
Reputation management
To identify deviations from committed plans by compromised agents, each agent maintains an updated reputation for every other agent in the system. This reputation management scheme runs in all agents in a completely distributed fashion. Each agent i maintains a set V_i = {R_ji}, where R_ji is the reputation of the jth agent as computed by agent i, ∀j ≠ i, and R_ji is updated in each iteration. For the update, agent i selects Bel_max^t, a subset of the beliefs most convincing to i, and based on Bel_max^t it tries to reason about the last observational transition that each of the other agents has made. The reputation R_ji is penalized in proportion to the difference between the probability of observing the most probable observation given the t-1'th joint action and the probability of observing the inferred t'th observation given the same, weighted by a monotonically decreasing function κ. Note that a simple malicious agent k would often fake observations; this inconsistency would gradually be reflected in the Bel_i^t of i, yield a larger difference and hence a sharp drop of R_ki, and thus facilitate faster detection.
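Since the exact update rule is not reproduced above, the sketch below should be read only as one plausible form consistent with the description: the penalty is the observation-probability gap averaged over Bel_max^t and weighted by a decreasing κ. Both the penalty form and the default κ are assumptions, as are the function and parameter names.

```python
def update_reputation(model, reputations, j, joint_action_prev, bel_max, inferred_obs_j,
                      kappa=lambda rep: 1.0 / (1.0 + max(rep, 0.0))):
    """Illustrative reputation update performed by agent i for agent j: R_ji is
    penalized by the gap between the probability of j's most probable
    observation under the previous joint action and the probability of the
    observation j appears to have acted on, averaged over the most convincing
    belief points Bel_max^t and weighted by a monotonically decreasing kappa.
    The paper's exact update rule is not reproduced here; this penalty form and
    the default kappa are assumptions."""
    gap = 0.0
    for (state, _) in bel_max:
        probs = {o: model.obs_fn[j].get((state, joint_action_prev, o), 0.0)
                 for o in model.observations[j]}
        gap += max(probs.values()) - probs.get(inferred_obs_j, 0.0)
    gap /= max(len(bel_max), 1)
    reputations[j] -= kappa(reputations[j]) * gap
    return reputations[j]
```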
EXPERIMENTAL RESULTS
For our experiments we use a three-agent version of the Tiger problem [2] and refer to the agents as Agent1, Agent2 and Agent3. The observation function for each agent is generated randomly over each pass of the run. We assume domains where the probability of one observation, given a state and joint action, is significantly higher than that of the other observations. All results presented are averaged over ten runs. The initial policies for each agent are generated randomly for each iteration. We do not compute the optimal joint policy, since the reputation mechanism works for any committed joint policy. The results presented are based on the calculations made by Agent1.
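One way to realize the randomly generated observation functions with a single dominant observation is sketched below; the dominance level of 0.85 and the helper name are illustrative choices.

```python
import itertools
import random

def random_dominant_obs_fn(model, agent_id, dominance=0.85, rng=random):
    """Randomly generate an observation function O_i in which, for every
    (state, joint action) pair, one randomly chosen observation carries most of
    the probability mass, as assumed in the experimental setup."""
    obs = model.observations[agent_id]
    joint_actions = list(itertools.product(*(model.actions[j] for j in model.agents)))
    table = {}
    for s in model.states:
        for a in joint_actions:
            dominant = rng.choice(obs)
            p_rest = (1.0 - dominance) / (len(obs) - 1) if len(obs) > 1 else 0.0
            for o in obs:
                table[(s, a, o)] = dominance if o == dominant else p_rest
    return table
```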
In the first simulation, Agent3 is a deterministic simple malicious agent whereas Agent2 takes actions based on its true observations. Figure 2 shows the individual reputation plots of the two agents in the system. The drop in reputation for Agent2 is due to the domain effect, but the much steeper drop for Agent3 is far beyond the normal drop due to the domain-level uncertainties. In the next simulation, we changed Agent2 to a deterministic simple malicious agent as well. Figure 3 shows the corresponding reputation plots, and we see that the reputations of both agents show a sharp fall.

Figure 2: Agent3 deterministically takes actions corresponding to the wrong observation.

Figure 3: Both Agent2 and Agent3 deterministically take actions corresponding to the wrong observation.

In the next set of experiments, in an attempt to avoid detection, the simple malicious agents stochastically choose the action pertaining to the wrong observation. Again, in the first case we assume that Agent3 is the only malicious agent in the system. Figure 4 shows the reputation plot for the corresponding case. The steeper drop in the reputation value for Agent3 clearly identifies that Agent3 has been taking more actions pertaining to wrong observations, so Agent1 can conclude that either the agent has faulty observational sensors or it has been corrupted by external influence. In the next simulation, both Agent2 and Agent3 behave as malicious agents that stochastically take actions corresponding to the wrong observation. Figure 5 shows the corresponding reputation plots of the two agents. The sharp dip in the reputations of the two agents signifies an abnormality in the system.

Figure 4: Agent3 stochastically takes actions corresponding to the wrong observation.

Figure 5: Both Agent2 and Agent3 stochastically take actions corresponding to the wrong observation.
CONCLUSION AND FUTURE WORK
In this paper we dealt with the problem of detecting agents that deviate from a committed joint policy. Assuming that the agents can be compromised, we discussed one possible type of deviation from prior commitments. We presented a distributed reputation update mechanism for uniquely detecting such malicious agents. Our scheme is based on tracking execution-time estimates of the belief state of the system and the anticipated rewards. As future work we would like to explore more challenging scenarios dealing with malicious agents, such as colluding agents that try to compromise the system.

Acknowledgment: US National Science Foundation award IIS-0209208 partially supported this work.

REFERENCES
[1] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Math. Oper. Res., 27(4):819-840, 2002.
[2] R. Nair, M. Tambe, M. Yokoo, D. V. Pynadath, and S. Marsella. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.