## 01t0083o_aamas07_0565_ecd6c80d79be396baaf4a0d81eeb89d8.dvi

**Distributed Intrusion Detection in Partially Observable**
**Markov Decision Processes**
Mathematical & Computer Sciences Department

**ABSTRACT**
malfunctioning of the agents due to internal failure or even
The problem of decentralized control occurs frequently in
due to enemy intrusion cannot be discounted. There is a
realistic domains where agents have to cooperate to achieve
growing need to develop multiagent systems that can ad-
a universal goal. Planning for domain-level joint strategy
dress such situations and can eﬃciently compensate for such
takes into account the uncertainty of the underlying en-
setbacks. We analyze the problem from the perspective of
vironment in computing near-optimal joint-strategies that
the malicious agent and come up with one possible attack
can handle the intrinsic domain uncertainty. However, un-
that it can launch on the system. We then use decision the-
certainty related to agents deviating from the recommended
oretic paradigms to obtain a solution to such attacks and
joint-policy is not taken into consideration. We focus on
propose reputation update schemes that distributively iden-
hostile domains, where the goal is to quickly identify devi-
tify such malicious agents. We substantiate our claims with
ations from planned behavior by any compromised agents.

results from the popular multiagent tiger domain [2].

There is a growing need to develop techniques that enablethe system to recognize and recover from such deviations.

**DEC-POMDP MODEL**
We discuss the problem from the intruder’s perspective and
A DEC-POMDP is given by a tuple < I, S, {Ai}, {Ωi}, O,
then present a distributed intrusion detection scheme that
P, R, b0 > where I is the ﬁnite set of agents indexed by
can detect a particular type of attack.

1 . . . n, S is a ﬁnite set of states, Ai is a ﬁnite set of actionsavailable to agent i and A = ×i∈IAi is the set of joint actions

**Categories and Subject Descriptors**
where a =< a1, . . . , an > denotes a joint action, Ωi is a ﬁnite
I.2.11 [

**Artiﬁcial Intelligence**]: Distributed Artiﬁcial In-

set of observations available to agent i and Ω = ×i∈IΩi is the
set of joint observations where o =< o1, . . . , on > denotesa joint observation, O is the observation function given by
O(s, a1, . . . , an, o1, . . . , on), the probability of observing the

**General Terms**
joint observation (o1, . . . , on) when transitioning to state s
after taking joint action (a1, . . . , an), P is the set of Marko-vian state transition probabilities where P (s, a, s ) denotes

**Keywords**
the probability of taking action a in state s and reachingstate s , R : S × A →
b0 is the initial belief state for all agents. We assume thatthe agent’s observations are independent. Thus the obser-

**INTRODUCTION**
vation function can be represented as O = ×i∈IOi where
The problem of decentralized control can be eﬀectively
Oi(s, a1, . . . , an, oi) is the probability that agent i observes
modeled as a decentralized partially observable Markov De-
oi given the joint-action < a1, . . . , an > resulted in state s.

cision Process (DEC-POMDP) [1] where the agents in the
The decision problem spans over a ﬁnite horizon T .

model are the decision makers. The uncertainty related toeach agent’s action does not only span over the uncertainty

**POLICY REPRESENTATION AND RUN-**
of the underlying environment but also on the decision mak-

**NING BELIEF STATE**
ing of the other agents in the system. In hostile domains,
The policy for agent i, πi is represented by a policy tree
(see Figure 1). Each node corresponds to an action and eachedge corresponds to an observation that the agent makes atthat time interval. We assume that there is a centralized
Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are
planner that computes the policy tree for each agent. We
not made or distributed for proﬁt or commercial advantage and that copies
introduce the concept of running belief state of an agent
bear this notice and the full citation on the ﬁrst page. To copy otherwise, to
as an execution time estimate of an agent over the physical
republish, to post on servers or to redistribute to lists, requires prior speciﬁc
states and the observation histories of the other agents in the
system. The running belief state of agent i at time interval

*AAMAS’07 *May 14–18 2007, Honolulu, Hawai’i, USA.

Copyright 2007 IFAAMAS .

t is given by RBti : S × ot−i → [0, 1] where ot−i are the t’th
observation histories of other agents. We deﬁne Belt
set of all such possible combinations of physical states and
observation histories that have a positive probability in RBt
Thus Belti can be deﬁned as Belti = {b|RBti(b) > 0}.

**Running Belief State Update**
The agents update RBti and Belti with each execution
step. The update occurs in two phases. The ﬁrst exhaustive
Belti. We call this estimate the priori estimate, RBpriori

**Figure 1: Policy tree.**
P (s, < πi(oti), π−i(ot−i) >, s ) × Oi(s , < πi(oti), π−i(ot−i) >,
k has no incentive to choose an action from another node at
) × O−i(s , < πi(oti), π−i(ot−i) >, ot+1
that the joint reward received would be a member of ηti, ∀i,
where α1 is a normalizing constant and s, s ∈ S. Belpriori
= k. We refer to such agents as simple malicious agents. A
is calculated accordingly. The second phase of the update
simple malicious agent can be deterministic if it always takes
happens after the agent takes its t+1 ’th step action and
the action corresponding to the least probable observation
obtains joint reward given by the random variable Rt+1. The
of the subtree of the policy tree it is in or stochastic if it
agent can now calculate the posteriori estimate RBposteriori
takes any action corresponding to the observations of thesubtree with equal probability.

**Reputation management**
To identify deviation from committed plans from com-
promised agents, each agent maintains updated reputation
where α2 is another normalizing constant and r is the re-
about all other agents in the system. This reputation man-
ward received at the t+1 ’th time step. The second probabil-
agement scheme runs in all agents in completely distributed
ity term in Equation 2 returns a value of 1 or 0 depending
fashion. Each agent i maintains a set Vi = {Rji} where Rji
on whether it is possible to get the reward r after taking
is the reputation of the jth agent as computed by agent i,
∀j = i, and is updated in each iteration by:
ingly. Agent i knows its own physical state and using RBt
i , it tries to approximately estimate the state of all
j, j ∈ I, j = i. It can be shown that Belti is always an
over-estimate of the actual world state.

**Anticipated set of rewards**
is a set of all possibles rewards that it can get by taking the
is a subset of beliefs most convincing to i. Based on Beltmax
i tries to reason the last observational transition that each
i at time t. The anticipated set of rewards
of the other agents have made. The numerator of the sec-
i at time t is ηti = {R(s, < πi(oti), π−i(ot−i) >)|(s, ot−i) ∈
ond expression in Equation 3 gives the diﬀerence between
}. Note that the agent has to use the Belpriori
the probability of observing the most probable observation
estimate as the anticipated rewards are calculated before
given t-1 ’th joint-action with the probability of observing
the t’th action has been taken. If an agent i receives a
the t’th observation given the same. Note, a simple mali-
∈ ηti at time t, it concludes that some agent(s)
cious agent k would often fake observations and this incon-
have deviated from their committed policies.

sistency would gradually reﬂect in the Belti of i and result
We assume that there can be compromised agents in the
in a higher value for the numerator. This would result in a
system which want to have a covert harmful impact on the
sharp drop of Rki. The function κ is monotonically decreas-
system. A malicious agent k can only avoid suspicion if
it takes actions which are consistent with the anticipated
i and thus facilitates faster detection.

reward estimates of other agents. The moment k takes anaction that produces a joint reward r /
that some agent(s) must have deviated from the committed
For our experiments we use a three agent version of the
joint policy. k can choose from the set of actions available
Tiger problem [2] and refer to the agents as Agent1, Agent2
with the diﬀerent observations for the particular node of
and Agent3. The observation function for each agent is gen-
policy tree it is in and thereby avoid detection. We refer to
erated randomly over each pass of the run. We assume do-
such agents as simple malicious agents. In Figure 1, if k is
mains, where probability of one observation given a state
at branch b1 of πk at time t , it chooses between the actions
and joint-action is signiﬁcantly higher than the other obser-
a1 and a2 pertaining to observations o1 and o2 respectively.

vations. All results presented are averaged over ten runs.

*The Sixth Intl. Joint Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS 07)*
**Figure 2: Agent3 deterministically takes actions cor-**
**Figure 3: Both agent2 and agent3 deterministically**
**responding to wrong observation.**
**takes actions corresponding to wrong observation.**
The initial policies for each agent is generated randomly for
each iteration. We do without calculating the optimal joint
policy as the reputation mechanism works well for any com-
mitted joint policy. The results presented are based on thecalculations made by Agent1.

In the ﬁrst simulation, Agent3 is a deterministic simple
malicious agent whereas Agent2 takes actions based on its
true observations. Figure 2 shows the individual reputation
plots of the two agents in the system. The drop in reputa-tion for Agent2 is due to the domain eﬀect but the much
steeper drop for Agent3 is far from the normal drop due to
the domain level uncertainties. In the next simulation, wechanged Agent3 to a deterministic simple malicious agent.

**Figure 4: Agent3 stochastically takes actions corre-**
Figure 3 shows the corresponding reputation plots and we

**sponding to wrong observation.**
see that the reputation of both the agents show a sharp fall.

In the next set of experiments, in an attempt to avoid de-
tection, the simple malicious agents stochastically chooses
the action pertaining to the wrong observation. Again in theﬁrst case we assume that Agent3 is the only malicious agent
in the system. Figure 4 is the reputation plot for the cor-responding case. The steeper drop in the reputation value
for Agent3 clearly identiﬁes that Agent3 have been taking
more actions pertaining to wrong observations. So Agent1
can conclude that either the agent has faulty observational
sensors or it has been corrupted by external inﬂuence. Inthe next simulation, both Agent2 and Agent3 behave as ma-
licious agents that stochastically take actions corresponding
to the wrong observation. Figure 5 shows the correspond-ing reputation plot of the two agents. The sharp dip in the

**Figure 5: Both Agent2 and Agent3 stochastically**
reputation of the two agents signify an abnormality in the

**take actions corresponding to wrong observation.**
**REFERENCES**
**CONCLUSION AND FUTURE WORK**
[1] D. S. Bernstein, R. Givan, N. Immerman, and
In this paper we dealt with the problem of detecting agents
S. Zilberstein. The complexity of decentralized control
that deviate from committed joint-policy. Assuming that
of markov decision processes. Math. Oper. Res.,
the agents can be compromised, we discussed one possible
type of deviations from prior commitments. We presented
[2] R. Nair, M. Tambe, M. Yokoo, D. V. Pynadath, and
a distributed reputation update mechanism of uniquely de-
S. Marsella. Taming decentralized pomdps: Towards
tecting such malicious agents. Our scheme is based on track-
eﬃcient policy computation for multiagent settings. In
ing of the execution time estimates of the belief state of the
Proceedings of the Eighteenth International Joint
system and the anticipated rewards. As future work we
conference on Artiﬁcial Intelligence (IJCAI), 2003.

would also like to explore more challenging scenarios deal-ing with malicious agents such as colluding agents that tryto compromise the system.

**Acknowledgment: **US National Science Foundation award

IIS- 0209208 partially supported this work.

*The Sixth Intl. Joint Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS 07)*
Source: http://www.aamas-conference.org/Proceedings/aamas07/html/pdf/AAMAS07_0565_ecd6c80d79be396baaf4a0d81eeb89d8.pdf

GLOUCESTER COUNTY PUBLIC SCHOOLS Limited Flexible Spending Account (“FSA”) Who should enroll in the Limited FSA Plan? If you are participating in the Health Savings Account, you must enroll in the Limited FSA Plan if you wish to participate under an FSA Plan for the renewing Plan Year that runs from October 1, 2013 through September 30, 2014. Although you are not required

On a recent bus trip to Toronto, following the route of the 1837 rebels from Lloydtown, we were asked to re-create the experience in our mind'seye. To imagine the effort of a 35 mile hike, to pass farmers’ fields andsmall inns, to feel the privation of thirst or hunger and the anxiety of potential conflict that they must have experienced on this long walk. A tough task,considering the amo