Bayesian Network


Transcript of Bayesian Network

Page 1: Bayesian Network

BAYESIAN NETWORK

Submitted By: Faisal Islam, Srinivasan Gopalan, Vaibhav Mittal, Vipin Makhija

Prof. Anita Wasilewska, State University of New York at Stony Brook

Page 2: Bayesian Network

References

[1] Jiawei Han: "Data Mining: Concepts and Techniques", ISBN 1-55860-489-8, Morgan Kaufmann Publishers.

[2] Stuart Russell, Peter Norvig: "Artificial Intelligence – A Modern Approach", Pearson Education.

[3] Kandasamy, Thilagavati, Gunavati: "Probability, Statistics and Queueing Theory", Sultan Chand Publishers.

[4] D. Heckerman: "A Tutorial on Learning with Bayesian Networks", in "Learning in Graphical Models", ed. M.I. Jordan, The MIT Press, 1998.

[5] http://en.wikipedia.org/wiki/Bayesian_probability

[6] http://www.construction.ualberta.ca/civ606/myFiles/Intro%20to%20Belief%20Network.pdf

[7] http://www.murrayc.com/learning/AI/bbn.shtml

[8] http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html

[9] http://en.wikipedia.org/wiki/Bayesian_belief_network

Page 3: Bayesian Network

CONTENTS

HISTORY
CONDITIONAL PROBABILITY
BAYES THEOREM
NAÏVE BAYES CLASSIFIER
BELIEF NETWORK
APPLICATION OF BAYESIAN NETWORK
PAPER ON CYBER CRIME DETECTION

Page 4: Bayesian Network

HISTORY

Bayesian probability was named after Reverend Thomas Bayes (1702–1761).

He proved a special case of what is currently known as Bayes' Theorem.

The term "Bayesian" came into use around the 1950s.

Pierre-Simon, Marquis de Laplace (1749–1827) independently proved a generalized version of Bayes' Theorem.

http://en.wikipedia.org/wiki/Bayesian_probability

Page 5: Bayesian Network

HISTORY (Cont.)

1950s – New knowledge in Artificial Intelligence
1958 – Genetic Algorithms by Friedberg (Holland and Goldberg, ~1985)
1965 – Fuzzy Logic by Zadeh at UC Berkeley
1970 – Bayesian Belief Networks at Stanford University (Judea Pearl, 1988)

The ideas proposed above were not fully developed until later. BBNs became popular in the 1990s.

http://www.construction.ualberta.ca/civ606/myFiles/Intro%20to%20Belief%20Network.pdf

Page 6: Bayesian Network

HISTORY (Cont.)

Current uses of Bayesian Networks:
Microsoft's printer troubleshooter
Diagnosing diseases (Mycin)
Predicting oil and stock prices
Controlling the space shuttle
Risk analysis – schedule and cost overruns

Page 7: Bayesian Network

CONDITIONAL PROBABILITY

Probability: How likely is it that an event will happen?

Sample space S
An element of S is an elementary event.
An event A is a subset of S.
P(A) ≥ 0 and P(S) = 1.

Events A and B:
P(A|B) – the probability that event A occurs given that event B has already occurred.

Example: There are 2 baskets. B1 has 2 red balls and 5 blue balls. B2 has 4 red balls and 3 blue balls. What is the probability of picking a red ball from basket 1?

Page 8: Bayesian Network

CONDITIONAL PROBABILITY

The question above asks for P(red ball | basket 1).

Intuitively, the answer is the probability of a red ball restricted to the sample space of basket 1 only.

So the answer is 2/7.

The equations used to solve it:

P(A|B) = P(A∩B) / P(B)  [definition of conditional probability / product rule]

P(A,B) = P(A) × P(B)  [if A and B are independent]

How do you solve P(basket 2 | red ball)?

Page 9: Bayesian Network

BAYES THEOREM

A special case of Bayes' Theorem:

P(A∩B) = P(B) × P(A|B)

P(B∩A) = P(A) × P(B|A)

Since P(A∩B) = P(B∩A),

P(B) × P(A|B) = P(A) × P(B|A)

=> P(A|B) = [P(A) × P(B|A)] / P(B)

[Diagram: events A and B in the sample space]

P(A|B) = P(B|A) P(A) / P(B)

Page 10: Bayesian Network

BAYES THEOREM

Solution to P(basket 2 | red ball):

P(basket 2 | red ball) = [P(b2) × P(r|b2)] / P(r)

= [(1/2) × (4/7)] / (6/14)

= 2/3 ≈ 0.67

where P(r) = P(r|b1)P(b1) + P(r|b2)P(b2) = (2/7)(1/2) + (4/7)(1/2) = 6/14.

Page 11: Bayesian Network

BAYES THEOREM

Example 2: A medical cancer diagnosis problem

There are 2 possible outcomes of a diagnosis: +ve, -ve. We know 0.8% of the world population has cancer. The test gives a correct +ve result 98% of the time and a correct -ve result 97% of the time.

If a patient's test returns +ve, should we diagnose the patient as having cancer?

Page 12: Bayesian Network

BAYES THEOREM

P(cancer) = 0.008        P(-cancer) = 0.992
P(+ve|cancer) = 0.98     P(-ve|cancer) = 0.02
P(+ve|-cancer) = 0.03    P(-ve|-cancer) = 0.97

Using Bayes' formula:
P(cancer|+ve) = P(+ve|cancer) × P(cancer) / P(+ve) = (0.98 × 0.008) / P(+ve) = 0.0078 / P(+ve)
P(-cancer|+ve) = P(+ve|-cancer) × P(-cancer) / P(+ve) = (0.03 × 0.992) / P(+ve) = 0.0298 / P(+ve)

Since 0.0298 > 0.0078, the patient most likely does not have cancer.
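
The same computation can be scripted in a few lines. The sketch below (Python; our own illustration, not part of the original slides — the function name `posterior` is ours) normalizes prior × likelihood over the two hypotheses:

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of hypotheses.

    prior: dict hypothesis -> P(h)
    likelihood: dict hypothesis -> P(evidence | h)
    """
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())          # P(+ve), the normalizing constant
    return {h: joint[h] / evidence for h in joint}

prior = {"cancer": 0.008, "no cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "no cancer": 0.03}

print(posterior(prior, likelihood_pos))
# {'cancer': ~0.21, 'no cancer': ~0.79}  -> diagnose "no cancer"
```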

Page 13: Bayesian Network

BAYES THEOREM

General form of Bayes' Theorem:

Given mutually disjoint events E1, E2, …, En with P(Ei) ≠ 0 (i = 1, 2, …, n),

P(Ei|A) = [P(Ei) × P(A|Ei)] / Σj P(Ej) × P(A|Ej),    i = 1, 2, …, n

Page 14: Bayesian Network

BAYES THEOREM

Example:

There are 3 boxes. B1 has 2 white, 3 black and 4 red balls. B2 has 3 white, 2 black and 2 red balls. B3 has 4 white, 1 black and 3 red balls. A box is chosen at random and 2 balls are drawn; one is white and the other is red. What is the probability that they came from the first box?

Page 15: Bayesian Network

BAYES THEOREM

Let E1, E2, E3 denote the events of choosing B1, B2, B3 respectively. Let A be the event that the 2 balls selected are white and red.

P(E1) = P(E2) = P(E3) = 1/3

P(A|E1) = [2C1 × 4C1] / 9C2 = 2/9

P(A|E2) = [3C1 × 2C1] / 7C2 = 2/7

P(A|E3) = [4C1 × 3C1] / 8C2 = 3/7

Page 16: Bayesian Network

BAYES THEOREM

P(E1|A) = [P(E1) × P(A|E1)] / Σj P(Ej) × P(A|Ej) = 0.23727

P(E2|A) = 0.30509

P(E3|A) = 1 – (0.23727 + 0.30509) = 0.45764
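
As a check, the general formula can be applied mechanically. The sketch below (Python; our own illustration, not from the slides) computes the three posteriors for the box example using exact fractions:

```python
from fractions import Fraction
from math import comb

# Ball counts per box: (white, black, red)
boxes = {
    "B1": (2, 3, 4),
    "B2": (3, 2, 2),
    "B3": (4, 1, 3),
}

prior = Fraction(1, 3)          # each box is equally likely to be chosen

# P(A | Ei): draw one white and one red ball out of two drawn from box i
likelihood = {
    name: Fraction(w * r, comb(w + b + r, 2))       # (wC1 * rC1) / nC2
    for name, (w, b, r) in boxes.items()
}

evidence = sum(prior * likelihood[name] for name in boxes)   # denominator of Bayes' rule
posterior = {name: float(prior * likelihood[name] / evidence) for name in boxes}

print(posterior)   # {'B1': ~0.237, 'B2': ~0.305, 'B3': ~0.458}
```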

Page 17: Bayesian Network

BAYESIAN CLASSIFICATION

Why use Bayesian classification?

Probabilistic learning: Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.

Incremental: Each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

Page 18: Bayesian Network

BAYESIAN CLASSIFICATION

Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities.

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Page 19: Bayesian Network

NAÏVE BAYES CLASSIFIER

A simplifying assumption: attributes are conditionally independent given the class.

This greatly reduces the computation cost – only the class distribution needs to be counted.

Page 20: Bayesian Network

NAÏVE BAYES CLASSIFIER

The probabilistic model of the NBC is to find the probability of a certain class given multiple disjoint (assumed) events.

The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1, a2, …, an>. The learner is asked to predict the target value, or classification, for this new instance.

Page 21: Bayesian Network

NAÏVE BAYES CLASSIFIER

Abstractly, the probability model for a classifier is a conditional model P(C|F1, F2, …, Fn) over a dependent class variable C with a small number of outcomes (classes), conditioned on several feature variables F1, …, Fn.

Naïve Bayes classification rule:

classify(F1, F2, …, Fn) = argmax over c of [P(C=c) × P(F1|C=c) × P(F2|C=c) × … × P(Fn|C=c)] / P(F1, F2, …, Fn)

Since P(F1, F2, …, Fn) is common to all classes, we do not need to evaluate the denominator for comparisons.

Page 22: Bayesian Network

NAÏVE BAYES CLASSIFIER – Tennis Example

[Table: the standard PlayTennis training set – 14 instances with attributes Outlook, Temperature, Humidity, Wind and target PlayTennis]

Page 23: Bayesian Network

NAÏVE BAYES CLASSIFIER

Problem:

Use the training data above to classify the following instances:

a) <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>

b) <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong>

Page 24: Bayesian Network

NAÏVE BAYES CLASSIFIER

Answer to (a):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60

Page 25: Bayesian Network

NAÏVE BAYES CLASSIFIER

P(yes) × P(sunny|yes) × P(cool|yes) × P(high|yes) × P(strong|yes) = 0.0053

P(no) × P(sunny|no) × P(cool|no) × P(high|no) × P(strong|no) = 0.0206

So the class for this instance is 'no'. We can normalize the probability by:

0.0206 / (0.0206 + 0.0053) = 0.795
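
A compact way to check these numbers is to script the naive Bayes product directly. The sketch below (Python; our own illustration built from the counts listed above, not code from the slides) scores instance (a):

```python
from math import prod

# Priors and conditional probabilities read off the counts on the previous slides
# (standard PlayTennis data: 9 "yes" and 5 "no" examples).
prior = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

instance = ["sunny", "cool", "high", "strong"]      # instance (a)

# Unnormalized naive Bayes score: P(class) * product of P(attribute value | class)
score = {c: prior[c] * prod(cond[c][v] for v in instance) for c in prior}
print(score)                                        # {'yes': ~0.0053, 'no': ~0.0206}

# Normalizing gives the posterior for the winning class
print(score["no"] / sum(score.values()))            # ~0.795
```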

Page 26: Bayesian Network

NAÏVE BAYES CLASSIFIER

Answer to (b):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60

Page 27: Bayesian Network

NAÏVE BAYES CLASSIFIER

Estimating probabilities:

In the previous example, P(overcast|no) = 0, which causes the product

P(no) × P(overcast|no) × P(cool|no) × P(high|no) × P(strong|no) = 0.0

This causes problems when comparing classes, because the other probabilities are not considered at all. We can avoid this difficulty by using the m-estimate.

Page 28: Bayesian Network

NAÏVE BAYES CLASSIFIER

M-estimate formula:

(c + k) / (n + m), where c/n is the original fraction used before, k = 1, and m is the equivalent sample size.

Using this method, our new probability values are given below.
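
A minimal sketch of this smoothing (Python; our own illustration — here m is taken to be the number of possible values of the attribute, which reproduces the smoothed values on the next slide):

```python
def m_estimate(count, total, num_values, k=1):
    """Smoothed probability (count + k) / (total + m), with m = num_values * k."""
    return (count + k) / (total + num_values * k)

# P(Outlook=overcast | PlayTennis=no): raw count 0 out of 5, and Outlook has 3 values
print(m_estimate(0, 5, 3))    # 0.125 instead of 0, so the product no longer collapses
```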

Page 29: Bayesian Network

NAÏVE BAYES CLASSIFIER

New answer to (b):
P(PlayTennis=yes) = 10/16 = 0.63
P(PlayTennis=no) = 6/16 = 0.37
P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42
P(Outlook=overcast|PlayTennis=no) = 1/8 = 0.13
P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33
P(Temperature=cool|PlayTennis=no) = 2/8 = 0.25
P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36
P(Humidity=high|PlayTennis=no) = 5/7 = 0.71
P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36
P(Wind=strong|PlayTennis=no) = 4/7 = 0.57

Page 30: Bayesian Network

NAÏVE BAYES CLASSIFIER

P(yes) × P(overcast|yes) × P(cool|yes) × P(high|yes) × P(strong|yes) = 0.011

P(no) × P(overcast|no) × P(cool|no) × P(high|no) × P(strong|no) = 0.00486

So the class of this instance is 'yes'.

Page 31: Bayesian Network

NAÏVE BAYES CLASSIFIER

The conditional probability values of all the attributes with respect to the class are pre-computed and stored on disk.

This prevents the classifier from computing the conditional probabilities every time it runs.

The stored data can be reused to reduce the latency of the classifier.

Page 32: Bayesian Network

BAYESIAN BELIEF NETWORK

In the naïve Bayes classifier we make the assumption of class conditional independence, that is, given the class label of a sample, the values of the attributes are conditionally independent of one another.

However, there can be dependencies between values of attributes. To handle this we use a Bayesian belief network, which provides a joint conditional probability distribution.

A Bayesian network is a form of probabilistic graphical model. Specifically, a Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables.

Page 33: Bayesian Network
Page 34: Bayesian Network

BAYESIAN BELIEF NETWORK

A Bayesian network is a representation of the joint distribution over all the variables represented by nodes in the graph. Let the variables be X(1), …, X(n).

Let parents(A) be the parents of the node A. Then the joint distribution for X(1) through X(n) is represented as the product of the probability distributions P(Xi|Parents(Xi)) for i = 1 to n. If X has no parents, its probability distribution is said to be unconditional; otherwise it is conditional.

Page 35: Bayesian Network

BAYESIAN BELIEF NETWORK

[Figure: network over Cloudy, Sprinkler, Rain and Wet Grass, with its conditional probability tables]

Page 36: Bayesian Network

BAYESIAN BELIEF NETWORK

By the chain rule of probability, the joint probability of all the nodes in the graph above is:

P(C, S, R, W) = P(C) × P(S|C) × P(R|C) × P(W|S,R)

where W = Wet Grass, C = Cloudy, R = Rain, S = Sprinkler.

Example: P(W ∩ -R ∩ S ∩ C)

= P(W|S,-R) × P(-R|C) × P(S|C) × P(C)

= 0.9 × 0.2 × 0.1 × 0.5 = 0.009
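
The factorization can be evaluated mechanically once the CPTs are fixed. The sketch below (Python; our own illustration — the CPT values are the ones commonly used for this textbook network and are consistent with the 0.9 × 0.2 × 0.1 × 0.5 shown above) computes the same joint probability:

```python
# Assumed CPTs for the Cloudy/Sprinkler/Rain/WetGrass network.
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}             # P(S=true | C)
P_R_given_C = {True: 0.8, False: 0.2}             # P(R=true | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # P(W=true | S, R)

def joint(c, s, r, w):
    """P(C=c, S=s, R=r, W=w) = P(c) * P(s|c) * P(r|c) * P(w|s,r)."""
    p = P_C if c else 1 - P_C
    p *= P_S_given_C[c] if s else 1 - P_S_given_C[c]
    p *= P_R_given_C[c] if r else 1 - P_R_given_C[c]
    p *= P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return p

print(joint(c=True, s=True, r=False, w=True))     # 0.5 * 0.1 * 0.2 * 0.9 = 0.009
```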

Page 37: Bayesian Network

BAYESIAN BELIEF NETWORK

What is the probability of wet grass on a given day, P(W)?

P(W) = P(W|S,R) × P(S) × P(R) +
       P(W|S,-R) × P(S) × P(-R) +
       P(W|-S,R) × P(-S) × P(R) +
       P(W|-S,-R) × P(-S) × P(-R)

(This treats S and R as independent, which is an approximation here, since both depend on C.)

Here P(S) = P(S|C) × P(C) + P(S|-C) × P(-C)
     P(R) = P(R|C) × P(C) + P(R|-C) × P(-C)

P(W) = 0.5985

Page 38: Bayesian Network

Advantages of the Bayesian Approach

Bayesian networks can readily handle incomplete data sets.

Bayesian networks allow one to learn about causal relationships.

Bayesian networks readily facilitate use of prior knowledge.

Page 39: Bayesian Network

APPLICATIONS OF BAYESIAN NETWORKS

Page 40: Bayesian Network

Sources/References

Naive Bayes Spam Filtering Using Word-Position-Based Attributes – http://www.ceas.cc/papers-2005/144.pdf
by Johan Hovold, Department of Computer Science, Lund University, Box 118, 221 00 Lund, Sweden [E-mail [email protected]]. [Presented at CEAS 2005, Second Conference on Email and Anti-Spam, July 21–22, at Stanford University.]

Tom Mitchell, "Machine Learning", Tata McGraw-Hill.

"A Bayesian Approach to Filtering Junk E-Mail", Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz; Computer Science Department, Stanford University, and Microsoft Research, Redmond, WA. [Presented at the AAAI Workshop on Learning for Text Categorization, July 1998, Madison, Wisconsin.]

Page 41: Bayesian Network

Problem???

A real-world Bayesian network application: learning to classify text.

Instances are text documents. We might wish to learn the target concept "electronic news articles that I find interesting," or "pages on the World Wide Web that discuss data mining topics."

In both cases, if a computer could learn the target concept accurately, it could automatically filter the large volume of online text documents to present only the most relevant documents to the user.

Page 42: Bayesian Network

TECHNIQUE

Learning how to classify text, based on the naive Bayes classifier. It is a probabilistic approach and is among the most effective algorithms currently known for learning to classify text documents.

The instance space X consists of all possible text documents. We are given training examples of some unknown target function f(x), which can take on any value from some finite set V.

We will consider the target function of classifying documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two classes.

Page 43: Bayesian Network

Design issues

How to represent an arbitrary text document in terms of attribute values.

How to estimate the probabilities required by the naive Bayes classifier.

Page 44: Bayesian Network

Approach

Our approach to representing arbitrary text documents is disturbingly simple: given a text document, such as this paragraph, we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Thus, the current paragraph would be described by 111 attribute values, corresponding to the 111 word positions. The value of the first attribute is the word "our," the value of the second attribute is the word "approach," and so on. Notice that long text documents will require a larger number of attributes than short documents. As we shall see, this will not cause us any trouble.

Page 45: Bayesian Network

ASSUMPTIONS

Assume we are given a set of 700 training documents that a friend has classified as dislike and another 300 she has classified as like.

We are now given a new document and asked to classify it.

Let us assume the new text document is the preceding paragraph.

Page 46: Bayesian Network

We know P(like) = 0.3 and P(dislike) = 0.7 in the current example.

We need to estimate the terms P(ai = wk | vj) (here we introduce wk to indicate the kth word in the English vocabulary).

Estimating the class conditional probabilities (e.g., P(ai = "our" | dislike)) is more problematic, because we must estimate one such probability term for each combination of text position, English word, and target value.

There are approximately 50,000 distinct words in the English vocabulary, 2 possible target values, and 111 text positions in the current example, so we must estimate 2 × 111 × 50,000 ≈ 10 million such terms from the training data.

We make an assumption that reduces the number of probabilities that must be estimated.

Page 47: Bayesian Network

We shall assume the probability of encountering a specific word wk (e.g., "chocolate") is independent of the specific word position being considered (e.g., a23 versus a95).

We estimate the entire set of probabilities P(a1 = wk|vj), P(a2 = wk|vj), … by the single position-independent probability P(wk|vj).

The net effect is that we now require only 2 × 50,000 distinct terms of the form P(wk|vj).

We adopt the m-estimate, with uniform priors and with m equal to the size of the word vocabulary:

P(wk|vj) = (nk + 1) / (n + |Vocabulary|)

where n is the total number of word positions in all training examples whose target value is vj, nk is the number of times word wk is found among these n word positions, and |Vocabulary| is the total number of distinct words (and other tokens) found within the training data.

Page 48: Bayesian Network

Final Algorithm

LEARN_NAIVE_BAYES_TEXT(Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk|vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).

1. Collect all words, punctuation, and other tokens that occur in Examples
   Vocabulary ← the set of all distinct words and tokens occurring in any text document from Examples

2. Calculate the required P(vj) and P(wk|vj) probability terms
   For each target value vj in V do
     docsj ← the subset of documents from Examples for which the target value is vj
     P(vj) ← |docsj| / |Examples|
     Textj ← a single document created by concatenating all members of docsj
     n ← total number of distinct word positions in Textj
     for each word wk in Vocabulary
       nk ← number of times word wk occurs in Textj
       P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

CLASSIFY_NAIVE_BAYES_TEXT(Doc)
Return the estimated target value for the document Doc. ai denotes the word found in the ith position within Doc.
   positions ← all word positions in Doc that contain tokens found in Vocabulary
   Return vNB, where vNB = argmax over vj in V of P(vj) × Π over i in positions of P(ai|vj)

Page 49: Bayesian Network

During learning, the procedure LEARN_NAIVE_BAYES_TEXT examines all training documents to extract the vocabulary of all words and tokens that appear in the text, then counts their frequencies among the different target classes to obtain the necessary probability estimates. Later, given a new document to be classified, the procedure CLASSIFY_NAIVE_BAYES_TEXT uses these probability estimates to calculate vNB according to the equation above. Note that any words appearing in the new document that were not observed in the training set are simply ignored by CLASSIFY_NAIVE_BAYES_TEXT.
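
A minimal, runnable sketch of the two procedures (Python; our own illustration of the algorithm above, using toy documents rather than real training data):

```python
from collections import Counter
from math import log

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, class_label). Returns (priors, cond_probs, vocabulary)."""
    vocabulary = {tok for doc, _ in examples for tok in doc.split()}
    classes = {label for _, label in examples}
    priors, cond = {}, {}
    for vj in classes:
        docs_j = [doc for doc, label in examples if label == vj]
        priors[vj] = len(docs_j) / len(examples)
        tokens = " ".join(docs_j).split()            # Text_j as one long token list
        counts = Counter(tokens)
        n = len(tokens)
        # m-estimate with uniform priors: (nk + 1) / (n + |Vocabulary|)
        cond[vj] = {wk: (counts[wk] + 1) / (n + len(vocabulary)) for wk in vocabulary}
    return priors, cond, vocabulary

def classify_naive_bayes_text(doc, priors, cond, vocabulary):
    positions = [tok for tok in doc.split() if tok in vocabulary]   # unseen words are ignored
    # Sum of logs instead of a product of probabilities, to avoid underflow on long documents
    return max(priors, key=lambda vj: log(priors[vj]) + sum(log(cond[vj][tok]) for tok in positions))

examples = [("great article loved it", "like"),
            ("boring article hated it", "dislike"),
            ("loved the insightful article", "like")]
priors, cond, vocab = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("loved this insightful piece", priors, cond, vocab))   # like
```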

Page 50: Bayesian Network

Effectiveness of the Algorithm

Problem: classifying Usenet news articles. The target classification for an article is the name of the Usenet newsgroup in which the article appeared.

In the experiment described by Joachims (1996), 20 electronic newsgroups were considered. 1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents. The naive Bayes algorithm was then applied using two-thirds of these 20,000 documents as training examples, and performance was measured over the remaining third.

The 100 most frequent words were removed (these include words such as "the" and "of"), and any word occurring fewer than three times was also removed. The resulting vocabulary contained approximately 38,500 words.

The accuracy achieved by the program was 89%.

The 20 newsgroups:

comp.graphics             misc.forsale         soc.religion.christian   alt.atheism

comp.os.ms-windows.misc   rec.autos            talk.politics.guns       sci.space

comp.sys.ibm.pc.hardware  rec.sport.baseball   talk.politics.mideast    sci.crypt

comp.windows.x            rec.motorcycles      talk.politics.misc       sci.electronics

comp.sys.mac.hardware     rec.sport.hockey     talk.religion.misc       sci.med

Page 51: Bayesian Network

APPLICATIONS

A newsgroup posting service that learns to assign documents to the appropriate newsgroup.

The NEWSWEEDER system – a program for reading netnews that allows the user to rate articles as he or she reads them. NEWSWEEDER then uses these rated articles (i.e., its learned profile of user interests) to suggest the most highly rated new articles each day.

Naive Bayes spam filtering using word-position-based attributes.

Page 52: Bayesian Network

Thank you!

Page 53: Bayesian Network

Bayesian Learning Networks Approach to Cybercrime Detection

Page 54: Bayesian Network

Bayesian Learning Networks Approach to Cybercrime Detection

N S ABOUZAKHAR, A GANI and G MANSON
The Centre for Mobile Communications Research (C4MCR),
University of Sheffield, Sheffield
Regent Court, 211 Portobello Street,
Sheffield S1 4DP, UK
[email protected]@dcs.shef.ac.uk
[email protected]@dcs.shef.ac.uk

M ABUITBEL and D KING
The Manchester School of Engineering,
University of Manchester
IT Building, Room IT 109,
Oxford Road,
Manchester M13 9PL, UK
[email protected]@man.ac.uk

Page 55: Bayesian Network

REFERENCES

1. David J. Marchette, Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint, 2001, Springer-Verlag, New York, Inc., USA.
2. Heckerman, D. (1995), A Tutorial on Learning with Bayesian Networks, Technical Report MSR-TR-95-06, Microsoft Corporation.
3. Michael Berthold and David J. Hand, Intelligent Data Analysis: An Introduction, 1999, Springer, Italy.
4. http://www.ll.mit.edu/IST/ideval/data/data_index.html, accessed on 01/12/2002.
5. http://kdd.ics.uci.edu/, accessed on 01/12/2002.
6. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2000, Morgan Kaufmann, USA.
7. http://www.bayesia.com, accessed on 20/12/2002.

Page 56: Bayesian Network

Motivation behind the paper

Growing dependence of modern society on telecommunication and information networks.

The increase in the number of networks interconnected to the Internet has led to an increase in security threats and cyber crimes.

Page 57: Bayesian Network

Structure of the paper

In order to detect distributed network attacks as early as possible, a probabilistic approach based on Bayesian networks, currently under research and development, has been proposed.

Page 58: Bayesian Network

Where can this model be utilized?

Learning agents which deploy the Bayesian network approach are considered to be a promising and useful tool in determining suspicious early events of Internet threats.

Page 59: Bayesian Network

Before we look at the details given in the paper, let's understand what Bayesian networks are and how they are constructed.

Page 60: Bayesian Network

Bayesian Networks

A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.

Syntax:
a set of nodes, one per variable
a directed, acyclic graph (link ≈ "directly influences")
a conditional distribution for each node given its parents: P(Xi | Parents(Xi))

In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.

Page 61: Bayesian Network

Some conventions

Variables are depicted as nodes.

Arcs represent probabilistic dependence between variables.

Conditional probabilities encode the strength of the dependencies.

Missing arcs imply conditional independence.

Page 62: Bayesian Network

Semantics

The full joint distribution is defined as the product of the local conditional distributions:

P(X1, …, Xn) = Π (i = 1 to n) P(Xi | Parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)

= P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
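
The sketch below (Python; our own illustration — the CPT numbers are the usual textbook values for the burglary/earthquake/alarm example, which this slide does not list explicitly) evaluates that product:

```python
# Assumed CPTs for the burglary (b) / earthquake (e) / alarm (a) / JohnCalls (j) / MaryCalls (m) network.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(JohnCalls=true | Alarm)
P_M = {True: 0.70, False: 0.01}                      # P(MaryCalls=true | Alarm)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = P_J[True] * P_M[True] * P_A[(False, False)] * (1 - P_B) * (1 - P_E)
print(p)    # ≈ 0.000628
```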

Page 63: Bayesian Network

Example of Construction of a BN

Page 64: Bayesian Network

Back to the discussion of the paper.

Page 65: Bayesian Network

Description

This paper shows how a Bayesian network probabilistically detects communication network attacks, allowing for generalization of Network Intrusion Detection Systems (NIDSs).

Page 66: Bayesian Network

Goal

How well does our model detect or classify attacks and respond to them later on?

The system requires the estimation of two quantities:
The probability of detection (PD)
The probability of false alarm (PFA)

It is not possible to simultaneously achieve a PD of 1 and a PFA of 0.

Page 67: Bayesian Network

Input Dataset

The 2000 DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Labs, provided the necessary dataset.

Sample dataset

Page 68: Bayesian Network

Construction of the Network

The following figure shows the Bayesian network that has been automatically constructed by the learning algorithms of BayesiaLab. The target variable, activity_type, is directly connected to the variables that heavily contribute to its knowledge, such as service and protocol_type.

Page 69: Bayesian Network
Page 70: Bayesian Network

Data Gathering

MIT Lincoln Labs set up an environment to acquire several weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. The generated raw dataset contains about a few million connection records.

Page 71: Bayesian Network

Mapping the simple Bayesian network that we saw to the one used in the paper

Page 72: Bayesian Network

Observation 1:

As shown in the next figure, the most probable activity corresponds to a smurf attack (52.90%), an ecr_i (ECHO_REPLY) service (52.96%) and an icmp protocol (53.21%).

Page 73: Bayesian Network
Page 74: Bayesian Network

Observation 2:

What would happen if the probability of receiving ICMP protocol packets were increased? Would the probability of having a smurf attack increase?

Setting the protocol to its ICMP value increases the probability of a smurf attack from 52.90% to 99.37%.

Page 75: Bayesian Network
Page 76: Bayesian Network

Observation 3:

Let's look at the problem from the opposite direction. If we set the probability of a portsweep attack to 100%, then the values of some associated variables will inevitably vary.

We note from Figure 4 that the probabilities of the TCP protocol and the private service have increased from 38.10% to 97.49% and from 24.71% to 71.45% respectively. We can also notice an increase in the REJ and RSTR flags.

Page 77: Bayesian Network
Page 78: Bayesian Network

How do the previous examples How do the previous examples work??work??

PROPOGATIONPROPOGATION

Data

Data

Page 79: Bayesian Network

Benefits of the Bayesian Model

The benefit of using Bayesian IDSs is the ability to adjust the IDS's sensitivity. This allows us to trade off between accuracy and sensitivity.

Furthermore, the automatic detection of network anomalies by learning allows distinguishing normal activities from abnormal ones.

It also allows network security analysts to see the amount of information being contributed by each variable in the detection model to the knowledge of the target node.

Page 80: Bayesian Network

Performance evaluation

Page 81: Bayesian Network

Thank you!

QUESTIONS OR QUERIES