Appendix A
Probability, Distribution Theory,
and Statistical Inference
A.1 Definitions
We define an event as any set of possible outcomes of a random experiment. In turn, an outcome is the result of a single trial of a random experiment. A random experiment is an experiment that has an outcome that is not completely predictable, that is, we can repeat the experiment under the same conditions and potentially get a different result. The sample space (sometimes called universe) is the set of all possible outcomes for one trial of the random experiment and is often denoted by Ω.¹ The symbol ∅ denotes the set containing no outcomes, or the empty set.

¹ The letter U is also sometimes used for this, and we may use either interchangeably.
Using the above definitions, we can now define the union, intersection, and complement of events, using two events A and B:

• The union of two events A and B is the set of outcomes that are in either A or B or both, and is denoted by A ∪ B.
• The intersection of two events A and B is the set of outcomes that are in both A and B simultaneously and is denoted by A ∩ B.

• The complement of an event A is the set of outcomes that are not in A and is denoted by Ā or A′.

• An event B is a subset of an event A if all of the outcomes of B are also outcomes of A.

• Two events A and B which have no outcomes in common are mutually exclusive, or disjoint. In this case the occurrence of A, without loss of generality, precludes the occurrence of event B.

[In the original, each definition is accompanied by a Venn diagram within the universe Ω.]
A.2 Axioms of Probability
A probability is a real number between 0 and 1 that is assigned to each of the events
that an experiment generates, with higher values indicating that the event is more
likely to occur. Assigning a value of 1 to an event means that the event is certain
to occur while assigning a value of 0 means that the event can never occur. The
following axioms must be satisfied for any consistent probability assignment:
1. 0 ≤ P(A) ≤ 1 for any event A.
2. P(Ω) = 1.
3. If A and B are mutually exclusive events, that is, A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B). In the more general case, P(A ∪ B ∪ C ∪ ⋯) = P(A) + P(B) + P(C) + ⋯
We can now use these axioms to prove the following rules:
1. P(∅) = 0. Since Ω ∪ ∅ = Ω and Ω ∩ ∅ = ∅, by axiom 3, 1 = 1 + P(∅).
2. P(Ā) = 1 − P(A). Since Ω = A ∪ Ā and A ∩ Ā = ∅, by axiom 3, 1 = P(A) + P(Ā).
3. P(A ∪ B) = P(A) + P(B) − P(A ∩ B). We can write A ∪ B = A ∪ (Ā ∩ B), where the two parts are disjoint, so by axiom 3, P(A ∪ B) = P(A) + P(Ā ∩ B). Similarly, B = (A ∩ B) ∪ (Ā ∩ B), where the two parts are disjoint, so by axiom 3, P(B) = P(A ∩ B) + P(Ā ∩ B). Substituting for P(Ā ∩ B) in the first equation gives P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

A visualization of this rule from the Venn diagram shows us that P(A) + P(B) would actually count P(A ∩ B) twice; thus we subtract it once from the sum.
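Rule 3 is straightforward to check by simulation; the following minimal R sketch uses two illustrative events defined on a simulated die roll and estimates both sides of the identity:

```r
set.seed(42)
n <- 100000
rolls <- sample(1:6, n, replace = TRUE)

# Two illustrative events on a single die roll:
# A = "roll is even", B = "roll is at least 4"
A <- rolls %% 2 == 0
B <- rolls >= 4

pA   <- mean(A)       # P(A)
pB   <- mean(B)       # P(B)
pAB  <- mean(A & B)   # P(A ∩ B)
pAuB <- mean(A | B)   # P(A ∪ B)

# Both entries should agree (up to simulation error)
c(direct = pAuB, via_rule3 = pA + pB - pAB)
```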
A.3 Joint Probability
Consider a random experiment in which Ω includes two events A and B. We define the joint probability as the probability of both events occurring simultaneously.² The implication of this is that the set of outcomes occurs in both event A and event B, which, on the Venn diagram, is the quantity A ∩ B. That is, the joint probability of the events A and B is P(A ∩ B). But what happens if the events are independent of each other? In this case, P(A ∩ B) = P(A) × P(B).
We note here that independent events and disjoint events are not identical.
In probability, two events are independent if they have no effect on each other:
the occurrence of one has no impact on the occurrence, or lack thereof, of the other.
Thus, independence is a property that arises from the probabilities assigned to the
² Actually, we really mean simultaneously on the same iteration or repetition of the random experiment.
events and their intersection. Disjoint, or mutually exclusive, events are events which have no outcomes in common: such events with non-zero probabilities cannot be independent since their intersection is the empty set, with P(∅) = 0, which can never equal the product of the probabilities of the two events.
We call the probability of an event A, in the joint probability setting, its marginal probability, and this is calculated as P(A ∩ B) + P(A ∩ B̄) using the axioms of probability. (The confirmation of this is left as an exercise for the reader.)
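Both points can be illustrated with a short R sketch using a hypothetical two-coin experiment: the joint probability of two independent events factors into the product of their probabilities, and the marginal probability of A is recovered from its two disjoint parts.

```r
set.seed(1)
n <- 100000
# Hypothetical experiment: toss two fair coins
coin1 <- sample(c("H", "T"), n, replace = TRUE)
coin2 <- sample(c("H", "T"), n, replace = TRUE)

A <- coin1 == "H"   # event A: first coin shows heads
B <- coin2 == "H"   # event B: second coin shows heads

# Independence: P(A ∩ B) = P(A) * P(B)
c(joint = mean(A & B), product = mean(A) * mean(B))

# Marginal probability: P(A) = P(A ∩ B) + P(A ∩ ¬B)
c(marginal = mean(A), via_parts = mean(A & B) + mean(A & !B))
```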
A.4 Conditional Probability
If we are told that event A has occurred, then we know that everything outside of A (that is, A′) is no longer possible. We can consider the reduced universe Ωᵣ = A since we only need to consider outcomes within A. Consider a second event B. If we know that event A has occurred, will this affect the probability of our second event B occurring? That is, what is the conditional probability of B occurring given that A has occurred? The only part of the event B which will be relevant is that part which is also a part of A, that is, A ∩ B. Given that we know that A has occurred, P(Ωᵣ) = 1, by the second axiom of probability. Therefore, the probability of B given A, written P(B|A), is the probability of A ∩ B multiplied by the scaling factor 1/P(A) due to our reduced universe Ωᵣ:

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)} \tag{A.1}$$
Note that this equation tells us that P(B|A) ∝ P(A ∩ B). In the event that A and B are independent events, P(B|A) = P(A ∩ B)/P(A) = P(A)P(B)/P(A) = P(B), which states that knowledge about the event A does not affect the probability of the event B occurring when A and B are independent events.

Although it is tempting to consider P(A|B) in the analogous way and write P(A|B) = P(A ∩ B)/P(B), we need to remember that B is an unobservable event. That is, when we calculate the conditional probability – the probability of B occurring given that A occurred – we are not observing either the occurrence or the nonoccurrence of B. Thus, the probability of A occurring will be dependent on which one of B or B̄ has occurred: P(A) is conditional on the occurrence or nonoccurrence of event B. Rewriting the conditional probability formula as

$$P(A \cap B) = P(B) \times P(A \mid B) \tag{A.2}$$
we get the rule known as the multiplication rule for probability. Although this may
seem trivial, it provides us with a conditional probability rule for any observable
event given an unobservable event and allows us to find the joint probability.
Equation A.2 shows the relationship when B occurs and, similarly,

$$P(A \cap \bar{B}) = P(\bar{B}) \times P(A \mid \bar{B})$$

covers the case when B does not occur.
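A small R sketch makes the multiplication rule concrete; the numbers are purely illustrative, with B an unobservable event (say, disease status) and A an observable one (say, a positive test result):

```r
# Illustrative probabilities only
p_B      <- 0.10   # P(B): unobservable event
p_A_B    <- 0.90   # P(A | B)
p_A_notB <- 0.05   # P(A | ¬B)

# Multiplication rule (Eq. A.2) gives the joint probabilities
p_AB    <- p_B * p_A_B             # P(A ∩ B)
p_AnotB <- (1 - p_B) * p_A_notB    # P(A ∩ ¬B)

# Marginal probability of the observable event A
p_A <- p_AB + p_AnotB
c(P_AB = p_AB, P_AnotB = p_AnotB, P_A = p_A)
```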
A.5 Bayes’ Theorem
Although we discuss Bayes’ theorem at various points in the book, we include it
here for completeness.
From its definition, the marginal probability of an event A is calculated by summing the probabilities of its disjoint parts, P(A) = P(A ∩ B) + P(A ∩ B̄), and substituting this into Eq. A.1 gives

$$P(B \mid A) = \frac{P(A \cap B)}{P(A \cap B) + P(A \cap \bar{B})}$$

Using the multiplication rule, we can find the joint probabilities and rewrite the equation to give Bayes' theorem for an event A:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})} \tag{A.3}$$
In A.3 and our definition for P(A), we used B and B̄, which comprise the universe – Ω = B ∪ B̄ – and B and B̄ are disjoint. We say that B and B̄ partition the universe. This is the simplest case. Often we will have more than two events which partition the universe. Consider the case where we have n events such that B₁ ∪ B₂ ∪ ⋯ ∪ Bₙ = Ω and Bᵢ ∩ Bⱼ = ∅ for all i, j, where i = 1,…,n, j = 1,…,n, i ≠ j.

[Venn diagrams: the universe Ω partitioned into B₁, …, B₄, with an observable event A overlapping each part, and A decomposed into the corresponding pieces A ∩ B₁, …, A ∩ B₄.]
An observable event A will therefore be partitioned into n parts: A = (A ∩ B₁) ∪ (A ∩ B₂) ∪ ⋯ ∪ (A ∩ Bₙ), where A ∩ Bᵢ and A ∩ Bⱼ are disjoint because Bᵢ and Bⱼ are disjoint. Thus,
$$P(A) = \sum_{i=1}^{n} P(A \cap B_i) \qquad \text{(Law of Total Probability)}$$

Using the multiplication rule gives

$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)$$
where each conditional probability is P(Bᵢ|A) = P(A ∩ Bᵢ)/P(A) and the multiplication rule is used to determine the joint probability in the numerator, P(A ∩ Bᵢ) = P(Bᵢ) × P(A|Bᵢ), to give

$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\,P(B_j)}$$
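As a minimal illustration in R (the priors and likelihoods are hypothetical), the posterior probabilities P(Bᵢ|A) follow directly from the priors P(Bᵢ) and the likelihoods P(A|Bᵢ):

```r
# Bayes' theorem over a partition B1, ..., Bn of the universe.
# 'prior' holds P(B_i); 'lik' holds P(A | B_i).
posterior <- function(prior, lik) {
  stopifnot(length(prior) == length(lik),
            abs(sum(prior) - 1) < 1e-9)  # priors must partition Ω
  joint <- prior * lik   # multiplication rule: P(A ∩ B_i)
  joint / sum(joint)     # denominator is P(A) by total probability
}

# Illustrative three-way partition
posterior(prior = c(0.5, 0.3, 0.2),
          lik   = c(0.10, 0.40, 0.70))
```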
A.6 Estimators
Consider a random sample from a distribution where we are tossing a thumb tack (or drawing pin) and where the random variable X takes a value of 1 if the tack lands on its side and 0 otherwise. Then in tossing the tack we are sampling from a distribution of X which has the probability distribution p₀ = P(Xᵢ = 0) = 1 − p, p₁ = P(Xᵢ = 1) = p, where p is the probability that the tack lands on its side.
We will not typically know the value of p, and so a problem we often face is to estimate the value of p using a random sample. In such cases, we know the distribution type, but it depends upon the values of one or more unknown parameters. In our example, the unknown parameter is p. We therefore use a random sample to obtain estimates for the values of these parameters. In the general case, we have a random variable X whose distribution is of a known type, but which depends upon the value of some unknown parameter θ. We wish to obtain a numerical value for θ, and use the procedure of obtaining a random sample from the distribution of X and then using the value of an appropriate statistic to estimate θ.
Thus, formally, an estimator for a parameter θ is a statistic (random variable) whose value is used to estimate θ. We typically use the notation θ̂ to denote an estimator for θ. On occasion, it will be obvious what an estimator is for the unknown parameter. In our thumb tack example above, an obvious procedure would be to toss the tack n times, counting the number of times the tack lands on its side. If it lands on its side a total of a times, that is, the proportion of tosses which result in the tack landing on its side, then we could use

$$\hat{p} = \frac{a}{n}$$
as our estimator for p. However, the choice is not always so clear. In fact, there may be several choices for an estimator. For example, a random variable with probability distribution

$$f(x) = \begin{cases} \dfrac{1}{\theta} e^{-x/\theta} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

could use either the sample mean or the sample standard deviation of a random sample of size n to estimate θ. An important question is which is the best estimator for θ. If θ̂ is an estimator for θ, we obviously wish to know how good an estimator it is.
An intuitive question to help us answer this is: how close is the estimated value to the real value? We use the mean to help us with this. The mean of a random variable X with probability function pₓ is defined as

$$E(X) = \sum_x x \, p_x$$

where the summation is over all possible values of X. The notation E(X) is read as the expected value, or expectation, of X.
If E(θ̂) = θ, we say that θ̂ is an unbiased estimator for θ. If E(θ̂) ≠ θ, then we define the bias as given by

$$\text{bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$$

and say that θ̂ is a biased estimator for θ. If X is a random variable with mean μ and variance σ², then the sample mean X̄ and the sample variance S² are unbiased estimators for μ and σ², respectively.

If we consider

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

as an estimator for σ², then E(S²) = ((n − 1)/n)σ², and so bias(S²) = −σ²/n, which explains why

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

is usually used as the estimator of the population variance.
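A quick R simulation (the sample size, number of replicates, and true variance are arbitrary illustrative choices) makes this bias visible:

```r
set.seed(7)
n      <- 5        # a small sample size makes the bias obvious
reps   <- 100000
sigma2 <- 4        # true variance of the sampling distribution

biased <- replicate(reps, {
  x <- rnorm(n, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / n          # divide by n: the biased form
})
unbiased <- replicate(reps, var(rnorm(n, sd = sqrt(sigma2))))
# var() divides by n - 1, the unbiased form

# Expect roughly (n-1)/n * sigma2 = 3.2 versus sigma2 = 4
c(mean(biased), mean(unbiased))
```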
The variance of a random variable X is defined by Var(X) = E[(X − μ)²], where μ = E(X). The symbol σ² is reserved for the variance of a distribution, with S² being used for the variance of a set of data. This is a measure of the dispersion of the probability distribution.
We say that an unbiased estimator θ̂₁ is a more efficient (better) estimator for θ than a second unbiased estimator θ̂₂ if Var(θ̂₁) < Var(θ̂₂), and the relative efficiency of θ̂₂ with respect to θ̂₁ is Var(θ̂₁)/Var(θ̂₂). This fraction is often expressed as a percentage, [Var(θ̂₁)/Var(θ̂₂)] × 100. Thus, the relative efficiency is greater than 1 (or 100%) if θ̂₂ is more efficient than θ̂₁ and less than 1 if θ̂₁ is more efficient than θ̂₂.

The criterion outlined above for comparing estimators can only be used when all the estimators under consideration are unbiased. However, under certain conditions, we may prefer to use a biased estimator. We illustrate this by returning to our thumb tack example above where we selected p̂ = a/n as our estimator, which we will now denote by p̂₁. Let us define a second estimator as follows:

$$\hat{p}_2 = \frac{a+1}{n+2}$$

We won't explain the origin of this estimator since we're using it for illustration purposes. Instead, we note that this does in fact occur as an estimator in certain more complex models. Its expectation is

$$E(\hat{p}_2) = \frac{E(a)+1}{n+2} = \frac{np+1}{n+2}$$

In order for p̂₂ to be unbiased, we would need p = 1/2; otherwise it is biased. However, if n is large, the bias is small. Further, the value of Var(p̂₂) is also small when n is large. Thus, the estimate is close to the true value of p. So how can we compare two such estimators? One common method is to use the mean square error. The mean square error (MSE) of an estimator θ̂ for a parameter θ is defined as
$$\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]$$

As can be seen below, it is related to other quantities we have introduced above:

$$\begin{aligned}
\text{MSE}(\hat{\theta}) &= E\left[(\hat{\theta} - \theta)^2\right] \\
&= E\left[\hat{\theta}^2 - 2\theta\hat{\theta} + \theta^2\right] \\
&= E(\hat{\theta}^2) - 2\theta E(\hat{\theta}) + \theta^2 \\
&= \left\{E(\hat{\theta}^2) - [E(\hat{\theta})]^2\right\} + [E(\hat{\theta})]^2 - 2\theta E(\hat{\theta}) + \theta^2 \\
&= \text{Var}(\hat{\theta}) + \left[E(\hat{\theta}) - \theta\right]^2 \\
&= \text{Var}(\hat{\theta}) + \left[\text{bias}(\hat{\theta})\right]^2
\end{aligned}$$
Thus, MSE(θ̂) = Var(θ̂) + [bias(θ̂)]². Note also that if θ̂ is unbiased, then MSE(θ̂) = Var(θ̂). Therefore, we can think of the MSE as being a generalization of the variance that allows for bias.

An estimator θ̂, based on a sample of size n, for a parameter θ is said to be a consistent estimator for θ if

E(θ̂) → θ and Var(θ̂) → 0 as n → ∞; equivalently, MSE(θ̂) → 0.
This is an important consideration because when an estimator is consistent, it
becomes increasingly likely that the estimate is close to the parameter’s actual
value as the sample size increases.
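This comparison is easy to run in R; the true p, sample size, and replicate count below are arbitrary illustrative choices:

```r
set.seed(11)
p    <- 0.3       # hypothetical true probability
n    <- 20
reps <- 100000

a  <- rbinom(reps, size = n, prob = p)  # successes in n tosses
p1 <- a / n                             # unbiased estimator
p2 <- (a + 1) / (n + 2)                 # biased alternative

mse <- function(est) mean((est - p)^2)
# For small n, the biased p2 often achieves the lower MSE
c(MSE_p1 = mse(p1), MSE_p2 = mse(p2))
```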
A.7 The Law of Large Numbers
The Law of Large Numbers states that for repeated, independent trials with the same probability p of success in each trial, the percentage of successes is increasingly likely to be close to the chance of success as the number of trials increases.

Formally, for every fixed positive amount ε > 0, the chance that the percentage of successes differs from the probability p of success by more than ε tends to zero as the number of trials n tends to infinity.
Another way of thinking about this is to consider an experiment where the
outcome is a random variable, and where the experiment is conducted repeatedly.
The outcome for different repetitions is independent from any other. The law of
large numbers says that as the number of independent repetitions increases, the
average of the observed outcomes approaches the average of all possible outcomes.
This can be easily illustrated by using the example of a die being rolled. The average of all possible outcomes is the average value of the six numbers: (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. Obviously, 3.5 is not a possible, observable outcome for any individual experiment since it is not on the face of a die. If the observed result we see for the ith experiment is denoted xᵢ, the average of the observed outcomes is

$$\frac{1}{n} \sum_{i=1}^{n} x_i$$
If x₁ = 3 and x₂ = 5, then the average of the observed outcomes is (3 + 5)/2 = 4; if x₃ = 1, the average of the observed outcomes is (3 + 5 + 1)/3 = 3, and so on. We can make the value as close as we like to 3.5 by increasing the number of repetitions of the experiment.
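The running average is simple to watch in R (a minimal sketch; the seed and number of rolls are arbitrary):

```r
set.seed(3)
n     <- 10000
rolls <- sample(1:6, n, replace = TRUE)   # simulated die rolls

# Running average of the observed outcomes after each roll
running <- cumsum(rolls) / seq_len(n)

# The average drifts toward the expected value 3.5
running[c(1, 2, 3, 10, 100, 1000, 10000)]
```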
Appendix B
Databases and Software Projects
A plethora of software products exist to support data mining efforts. These products range from core technical infrastructure, such as the underlying database management systems and the facilities within them that support mining models, through general-purpose products that help with the various stages of the mining effort, to those products which offer very specific and complete support for particular subject areas such as microarray analysis. To include an exhaustive list of products is simply not feasible and would certainly be completely out of date by the time this book is published.
Instead, we have included a number of products and databases that we have found useful in our experiences and which, we believe, will continue to be available, developed, and supported. We include these software products and databases, with the kind agreement of their respective authors, to illustrate the various mining techniques we describe.
All of the products or databases we use are listed here for easy reference by the
reader. We have kept the information to a minimum, preferring to point the reader
to the official web site or reference since the versions, licensing details, and scope of
these products or databases may change drastically between the time of writing and
when the reader accesses the software or data.
Wherever we have been granted permission, the actual version of the software/
database used in this book can be located at www.lsdm.org/thirdparty.
The descriptions of the products and databases contained in this appendix are
from the various vendors and not from the authors.
The nature of the Internet is such that URLs accurate at the time of writing may
no longer be accessible. We have tried to provide enough context for a search in the
event that the URL does not work.
B.1 Software Projects
B.1.1 Alyuda NeuroIntelligence
http://www.alyuda.com/neural-networks-software.htm
This is a neural network software application designed to assist experts in solving real-world problems. NeuroIntelligence features only proven algorithms and techniques. It is fast and easy to use.
B.1.2 AmiGO
www.godatabase.org
This tool provides a mechanism for searching the gene ontology database, which
comprises a controlled vocabulary of terms for biological concepts and for genes
and gene products which have been annotated using the controlled vocabulary. For
example, an inquiry using “protein kinase activity” returned 93 matches in the
database.³

³ A sample run performed on May 5, 2007.

See also “The Gene Ontology.”
B.1.3 AutoClass@IJM
http://ytat2.ijm.univ-paris-diderot.fr/AutoclassAtIJM.html
AutoClass@IJM is a freely available (to academia) computational resource with a web interface to AutoClass, a powerful unsupervised Bayesian classification system developed at NASA's Ames Research Center. End users upload their datasets through a web interface; computations are then queued on the cluster server. When the clustering is completed, a URL to the results is sent back to the user by e-mail.
B.1.4 Babelomics
http://babelomics.bioinfo.cipf.es/
An integrative platform for the analysis of transcriptomics, proteomics, and genomic data with advanced functional profiling, integrating primary (normalization,
calls, etc.) and secondary (signatures, predictors, associations, TDTs, clustering,
etc.) analysis tools within an environment that allows relating genomic data and/or
interpreting them by means of different functional enrichment or gene set methods.
B.1.5 BioCarta
www.biocarta.com
This is an interactive, web-based resource for life scientists that includes information on gene function, proteomic pathways, ePosters, and research reagents.
B.1.6 Bioconductor
www.bioconductor.org
The Bioconductor project was started in 2001 with the overall objective of developing a set of tools for analyzing and understanding genomic data. Gentleman et al. (2001) is the official FAQ for this project, available at www.bioconductor.org/docs/faq, and the reader is referred to this page for more details on the project's goals and objectives.
The version of Bioconductor used in this book is version 1.9.
B.1.7 Biomedical Informatics Research Network (BIRN)
The Biomedical Informatics Research Network (BIRN) is a national initiative to
advance biomedical research through data sharing and online collaboration. Funded
by the National Center for Research Resources (NCRR), a component of the US
National Institutes of Health (NIH), BIRN provides data-sharing infrastructure,
software tools, strategies, and advisory services – all from a single source.
http://www.birncommunity.org/
B.1.8 BioPAX
www.biopax.org
This is a data exchange format for biological pathway data.
B.1.9 BioTapestry
http://www.biotapestry.org/
BioTapestry is an interactive tool for building, visualizing, and simulating genetic
regulatory networks.
B.1.10 BLAST
http://www.ncbi.nlm.nih.gov/BLAST
B.1.11 BLAT
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
B.1.12 BrainMaker
http://www.calsci.com/BrainIndex.html
This is neural network software that lets you use your computer for business and marketing forecasting; stock, bond, commodity, and futures prediction; pattern recognition; medical diagnosis; sports handicapping; and almost any activity where you need special insight.
B.1.13 Cambridge Structure Database (CSD)
The Cambridge Structural Database (CSD) records bibliographic, chemical, and
crystallographic information for organic molecules and metal-organic compounds
whose 3D structures have been determined using X-ray diffraction and neutron
diffraction. Results are recorded for single crystal studies and powder diffraction
studies, which yield 3D atomic coordinate data for at least all non-hydrogen atoms.
In some cases, the Cambridge Crystallographic Data Centre (CCDC) is unable to obtain coordinates, and incomplete entries are archived to the CSD. Crystal structure data arising from publications in the open literature and from private deposits are included.
http://www.ccdc.cam.ac.uk/products/csd/
B.1.14 CDC National Center for Health Statistics
http://www.cdc.gov/nchs/datawh.htm
B.1.15 Clone|Gene ID Converter
http://idconverter.bioinfo.cnio.es/
This is a tool for converting gene, clone, or protein IDs to other IDs, which can be
used for small queries or for tens of thousands of IDs (typically from a microarray
experiment). Most of the conversions are pre-generated every time the databases
are updated in order to get a fast answer for each query.
B.1.16 Clustal
http://www.clustal.org
B.1.17 Cognos
http://www-01.ibm.com/software/data/cognos/
B.1.18 COPASI: Biological Network Simulator
http://www.copasi.org/tiki-view_articles.php
COPASI is a software application for simulation and analysis of biochemical
networks and their dynamics. COPASI is a stand-alone program that supports
models in the SBML standard and can simulate their behavior using ODEs or
Gillespie’s stochastic simulation algorithm; arbitrary discrete events can be
included in such simulations.
COPASI carries out several analyses of the network and its dynamics and has
extensive support for parameter estimation and optimization. COPASI provides
means to visualize data in customizable plots, histograms, and animations of
network diagrams.
B.1.19 cMAP
cmap.nci.nih.gov
B.1.20 CompuCell3D
CompuCell is an open-source software modeling environment and PDE solver. It is
largely used for cellular modeling (foams, tissues, etc.); however, efforts are being
made to include fluid simulation capabilities. Created in collaboration between
groups at IU and Notre Dame, CompuCell provides an easy user interface for
complex cellular modeling.
https://simtk.org/home/compucell3d
http://www.compucell3d.org/
B.1.21 Cytoscape
http://www.cytoscape.org/
Cytoscape is an open-source bioinformatics software platform for visualizing molecular interaction networks and biological pathways and for integrating these networks with annotations, gene expression profiles, and other data. Cytoscape was originally designed for biological research; it is now a general platform supporting complex network analysis and visualization. Its use of plug-ins to provide additional features supporting new file formats, profiling analysis, layouts, scripting, or database connectivity, to name but a few domains, allows this tool to serve as a more general-purpose tool.
B.1.22 Database for Annotation, Visualization, and Integrated Discovery (DAVID)
http://david.abcc.ncifcrf.gov/home.jsp
Database for Annotation, Visualization, and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. Its functionality includes the ability to identify enriched biological themes, particularly GO terms; discover enriched functionally related gene groups; cluster redundant annotation terms; visualize genes on BioCarta and KEGG pathway maps; display related many-genes-to-many-terms relationships on 2-D views; search for other functionally related genes not in the user-provided list; list interacting proteins; explore gene names in batch; link gene–disease associations; highlight protein functional domains and motifs; redirect the user to related literature; and convert gene identifiers from one type to another.
B.1.23 EasyNN Plus
http://www.easynn.com/
This is a low-cost, intuitive software product for developing neural networks.
B.1.24 EMBOSS
http://www.hgmp.mrc.ac.uk/Software/EMBOSS
B.1.25 Entrez
www.ncbi.nlm.nih.gov/entrez
B.1.26 Evoker
Evoker is a graphical tool for visualizing genotype intensity data in order to assess
genotype calls as part of quality control procedures for genome-wide association
studies. It provides a solution to the computational and storage problems related to
being able to work with the huge volumes of data generated by such projects by
implementing a compact, binary format that allows rapid access to data, even with
hundreds of thousands of observations.
http://www.sanger.ac.uk/resources/software/evoker/
B.1.27 Flapjack
New software tools for graphical genotyping and haplotype visualization are
required that can routinely handle the large data volumes generated by high
throughput SNP and comparable genotyping technologies. Flapjack is a new
visualization tool to facilitate analysis of these data types. Its visualizations are
rendered in real time allowing for rapid navigation and comparisons between lines,
markers, and chromosomes.
Based on the input of map, genotype, and trait data, Flapjack is able to provide a
number of alternative graphical genotype views with individual alleles colored by
state, frequency, or similarity to a given standard line. Flapjack supports a range of
interactions with the data, including graphically moving lines or markers around the
display, insertions or deletions of data, and sorting or clustering of lines by either
genotype similarity to other lines or by trait scores. Any map-based information
such as QTL positions can be aligned against graphical genotypes to identify
associated haplotypes.
All results are saved in an XML-based project format and can also be exported as
raw data or graphically as image files. We have devised efficient data storage
structures that provide high-speed access to any subset of the data, resulting in
fast visualization regardless of the size of the data.
http://bioinf.scri.ac.uk/flapjack/
B.1.28 GEO
www.ncbi.nlm.nih.gov/geo
B.1.29 Gene Ontology (GO)
The Gene Ontology (GO) is a bioinformatics initiative aimed at standardizing the
representation of gene and gene product attributes (RNA or proteins resulting from
the expression of a gene) across all species by developing a controlled vocabulary,
annotating genes and gene products, assimilating and disseminating the
annotations, and providing tools to access the data, such as the AmiGO browser.
www.geneontology.org
B.1.30 g:Profiler
http://biit.cs.ut.ee/gprofiler/gconvert.cgi
This is a gene identifier tool that allows conversion of genes, proteins, microarray
probes, standard names, and various database identifiers.
B.1.31 Graphviz
www.graphviz.org
This is an open-source graph visualization software.
B.1.32 HMMER
http://hmmer.wustl.edu
B.1.33 Java Data Mining (JDM) Toolkit
http://www.jcp.org/en/jsr/detail?id=247
B.1.34 ID Converter
http://biodb.jp/#ids
This is a tool for converting data IDs for biological molecules that are used in a
database into other, corresponding data IDs that are used in other databases.
B.1.35 iModel
http://www.biocompsystems.com/products/imodel/
B.1.36 KnowledgeMiner
http://www.knowledgeminer.com/
KnowledgeMiner (yX) for Excel is a knowledge mining tool that works with data
stored in Microsoft Excel for building predictive and descriptive models from this
data autonomously and easily.
B.1.37 Libcurl
curl.haxx.se
B.1.38 MatchMiner
http://discover.nci.nih.gov/matchminer/MatchMinerLookup.jsp
This is a set of tools that enables the user to translate between disparate ids for the
same gene using data from the UCSC, LocusLink, Unigene, OMIM, Affymetrix,
and Jackson data sources to determine how different ids relate. Supported id types
include gene symbols and names, IMAGE and FISH clones, GenBank accession
numbers, and UniGene cluster ids.
B.1.39 MATLAB
http://www.mathworks.com/products/matlab/
MATLAB® probably needs no introduction to the majority of readers. It is a high-
level technical computing language and interactive environment for algorithm
development, data visualization, data analysis, and numeric computation. Using
the MATLAB product, you can solve technical computing problems faster than
with traditional programming languages, such as C, C++, and Fortran.
You can use MATLAB in a wide range of applications, including signal and
image processing, communications, control design, test and measurement, financial
modeling and analysis, and computational biology. Add-on toolboxes (collections
of special-purpose MATLAB functions, available separately) extend the MATLAB
environment to solve particular classes of problems in these application areas.
MATLAB provides a number of features for documenting and sharing your
work. You can integrate your MATLAB code with other languages and
applications, and distribute your MATLAB algorithms and applications.
Of more specific interest is the range of toolkits that are available within the
MATLAB architecture to support specific needs.
Image Processing Toolkit
http://www.mathworks.com/products/image/#thd1
Image Processing Toolbox™ provides a comprehensive set of reference-standard
algorithms and graphical tools for image processing, analysis, visualization, and
algorithm development. You can perform image enhancement, image deblurring,
feature detection, noise reduction, image segmentation, spatial transformations, and
image registration.
B.1.40 MemBrain
http://www.membrain-nn.de/main_en.htm
MemBrain is a powerful graphical neural network editor and simulator for
Microsoft Windows, supporting neural networks of arbitrary size and architecture.
B.1.41 Microarray Databases
• Gene Expression Omnibus – NCBI: http://www.ncbi.nlm.nih.gov/geo/
• Stanford Microarray database: http://smd.stanford.edu/
• GeneNetwork system: http://www.genenetwork.org/
• ArrayExpress at EBI: http://www.ebi.ac.uk/arrayexpress/
• UNC Microarray database: https://genome.unc.edu/
• Genevestigator database: https://www.genevestigator.com/
• caArray at NCI: http://array.nci.nih.gov/caarray/
• UPenn RAD database: http://www.cbil.upenn.edu/RAD
• UNC modENCODE Microarray database: https://genome.unc.edu:8443/nimblegen
• ArrayTrack: http://www.fda.gov/ScienceResearch/BioinformaticsTools/Arraytrack/
• MUSC database: http://proteogenomics.musc.edu/ma/musc_madb.php?page=home&act=manage
• UPSC-BASE: http://www.upscbase.db.umu.se/
B.1.42 Molecular Signatures Database (MSigDB)
http://www.broadinstitute.org/gsea/msigdb/index.jsp
This is a collection of annotated gene sets for use with GSEA software.
B.1.43 NAR
www3.oup.co.uk/nar/database/c
B.1.44 Neural Network Toolbox (MATLAB)
http://www.mathworks.com/products/neuralnet/
Neural Network Toolbox™ extends MATLAB® with tools for designing,
implementing, visualizing, and simulating neural networks. Neural networks are
invaluable for applications where formal analysis would be difficult or impossible,
such as pattern recognition and nonlinear system identification and control. Neural
Network Toolbox software provides comprehensive support for many proven
network paradigms, as well as graphical user interfaces (GUIs) that enable you to
design and manage your networks. The modular, open, and extensible design of the
toolbox simplifies the creation of customized functions and networks.
B.1.45 NeuralWorks
http://www.neuralware.com/index.jsp
NeuralWorks Predict is an integrated, state-of-the-art tool for rapidly creating and deploying prediction and classification applications. Predict combines neural network technology with genetic algorithms, statistics, and fuzzy logic to automatically find optimal or near-optimal solutions for a wide range of problems. Predict incorporates years of modeling and analysis experience gained from working with customers faced with a wide variety of analysis and interpretation problems.
B.1.46 NeuroSolutions
http://www.nd.com/
This leading-edge neural network development software combines a modular, icon-based network design interface with an implementation of advanced learning procedures, such as conjugate gradients and backpropagation through time.
B.1.47 NeuroXL
http://www.neuroxl.com/
NeuroXL Classifier is a fast, powerful, and easy-to-use neural network software tool for classifying data in Microsoft Excel. Designed to aid experts in real-world data mining and pattern recognition tasks, it hides the underlying complexity of neural network processes while providing graphs and statistics for the user to easily understand results. NeuroXL Classifier uses only proven algorithms and techniques, and integrates seamlessly with Microsoft Excel.

OLSOFT Neural Network Library is a class library for creating, training, and using Back Propagation neural networks and SOFM (Self-Organizing Feature Map) networks. The library makes integration of neural network functionality into your own applications easy and seamless, enabling your programs to handle data analysis, classification, and forecasting needs.
B.1.48 Open Biological and Biomedical Ontologies(OBO Foundry)
This is a collaborative experiment involving developers of science-based
ontologies who are establishing a set of principles for ontology development with
the goal of creating a suite of orthogonal interoperable reference ontologies in the
biomedical domain.
http://www.obofoundry.org/
B.1.49 Omegahat
www.omegahat.org
B.1.50 OMIM
www.ncbi.nlm.nih.gov/entrez
B.1.51 OpenDiseaseModels.Org
OpenDiseaseModels.org is an open-source disease/systems model development
project. Analogous to open-source software development projects, the goal of this
effort is to develop better, more useful models in a transparent and public collabo-
rative forum.
http://www.opendiseasemodels.org/
B.1.52 Perl
http://strawberryperl.com/
http://www.activestate.com/activeperl/
Perl is a programming language that has seen a particularly widespread adoption in
the bioinformatics space. It is available on all the usual platforms.
B.1.53 PIPE
http://sourceforge.net/projects/pipe2/
PIPE is a standards-compliant tool with which you can create, model, and analyze Petri nets. It is the active fork of the Platform Independent Petri net Editor project, which originated at Imperial College London.
B.1.54 PNK
http://www2.informatik.hu-berlin.de/top/pnk/index.html
PNK provides an infrastructure for realizing ideas for analyzing, simulating, and verifying Petri nets.
B.1.55 PNK2e
http://page.mi.fu-berlin.de/trieglaf/PNK2e/index.html
This is a software environment for the modeling and simulation of biological
processes that uses Stochastic Petri Nets (SPNs), a graphical representation of
Markov Jump Processes.
B.1.56 Proper
Most databases employ the relational model for data storage. To use this data in a
propositional learner, a propositionalization step has to take place. Similarly, the
data has to be transformed to be amenable to a multi-instance learner. The Proper
Toolbox contains an extended version of RELAGGS, the Multi-Instance Learning
Kit MILK, and can also combine the multi-instance data with aggregated data from
RELAGGS. RELAGGS was extended to handle arbitrarily nested relations and to
work with both primary keys and indices. For MILK, the relational model is
flattened into a single table and this data is fed into a multi-instance learner.
REMILK finally combines the aggregated data produced by RELAGGS and the
multi-instance data, flattened for MILK, into a single table that is once again the
input for a multi-instance learner. Several well-known datasets are used for
experiments which highlight the strengths and weaknesses of the different
approaches. (Abstract from Reutemann et al. 2004)
http://www.cs.waikato.ac.nz/ml/proper/
B.1.57 PseAAC
http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/
This site allows you to generate pseudo amino acid compositions, according to Chou (2001).
B.1.58 PubMed
www.ncbi.nlm.nih.gov/entrez
B.1.59 R
www.r-project.org
R is an open-source analytical environment that is not only comprehensive in its
own right but is widely enhanced and extended through packages that have been
developed. At the time of writing, several hundred add-on packages are available
for R that cover a wide array of subjects and disciplines.
B.1.60 R Packages
We have used many R packages throughout this book and gratefully appreciate
the skill and time that the various authors have spent developing these packages.
In particular, we have used several packages extensively and highlight these herein.
These packages can all be accessed using the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org.
B.1.60.1 odfWeave
The Sweave function combines R code and LaTeX so that the output of the code is embedded in the processed document. The odfWeave package was created so that the functionality of Sweave can be used to generate documents that the end user can easily edit. The markup language used is the Open Document Format (ODF), which is an open, non-proprietary format that encompasses text documents, presentations, and spreadsheets.
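A minimal usage sketch (the file names are hypothetical): the source document contains Sweave-style code chunks, and odfWeave() writes an editable ODF document with the results embedded.

```r
library(odfWeave)

# 'report_src.odt' (hypothetical) mixes text with R code chunks
# marked up Sweave-style (<<...>>= ... @); odfWeave() executes the
# chunks and writes the woven, editable result to 'report_out.odt'.
odfWeave("report_src.odt", "report_out.odt")
```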
B.1.61 RapidMiner
http://rapid-i.com/content/view/181/190/
RapidMiner provides comprehensive data mining capabilities, including
• Data Integration, Analytical ETL, Data Analysis, and Reporting in one single
suite
• Powerful but intuitive graphical user interface for the design of analysis
processes
• Repositories for process, data, and metadata handling
• Only solution with metadata transformation: forget trial and error and inspect
results already during design time
• Only solution which supports on-the-fly error recognition and quick fixes
• Complete and flexible: hundreds of data loading, data transformation, data
modeling, and data visualization methods
RapidMiner Image Processing Extension

http://spl.utko.feec.vutbr.cz/en/component/content/article/21-zpracovani-medicinskych-signalu/46-image-processing-extension-for-rapidminer-5
This add-on package provides capabilities including
• Local image features extraction
• Segment feature extraction
• Global image feature extraction
• Extracting features from single/multiple image(s)
• Detect a template in image (rotation invariant)
• Point of interest detection
• Image comparison
• Image transforms
• Color mode transforms
• Noise reduction
• Image segmentation
• Object detection and object detector training (Haar-like features)
B.1.62 Reactome
www.reactome.org
B.1.63 Resourcerer
pga.tigr.org/tigr-scripts/magic/r1.pl
B.1.64 RStudio
RStudio™ is a new integrated development environment (IDE) for R. RStudio
combines an intuitive user interface with powerful coding tools to help you get the
most out of R.
http://www.rstudio.org/
B.1.65 Systems Biology Graphical Notation (SBGN)
The Systems Biology Graphical Notation (SBGN) project is an effort to standardize
the graphical notation used in maps of biochemical and cellular processes studied in
systems biology.
Standardizing the visual representation is crucial for more efficient and accurate
transmission of biological knowledge between different communities in research,
education, publishing, and more. When biologists are as familiar with the notation
as electronics engineers are familiar with the notation of circuit schematics, they
can save the time and effort required to familiarize themselves with different
notations, and instead spend more time thinking about the biology being depicted.
http://sbgn.org/
B.1.66 Systems Biology Markup Language (SBML)
http://sbml.org/
A free and open interchange format for computer models of biological processes.
SBML is useful for models of metabolism, cell signaling, and more. It has been in
development by an international community since the year 2000.
B.1.67 Systems Biology Workbench (SBW)
http://sys-bio.org/sbwWiki/doku.php?id=sysbio:sbw
This is an open-source framework connecting heterogeneous software applications.
SBW is made up of two kinds of components:
• Modules: These are the applications that a user would use. We have a wide
collection of model editing, model simulation, and model analysis tools.
• Framework: The software framework that allows developers to cross program-
ming language boundaries and connect application modules to form new
applications.
B.1.68 W3C
www.w3.org
B.1.69 Weka
This is a machine learning toolkit that includes an implementation of an SVM
classifier. Weka can be used both interactively through a graphical interface and as
a software library. (The SVM implementation is called “SMO.” It can be found in
the Weka Explorer GUI, under the “functions” category.)
http://www.cs.waikato.ac.nz/ml/weka/
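A minimal sketch of driving Weka's SMO from R, assuming the RWeka package (an R/Weka bridge) is installed; the iris data is used purely as a stand-in:

```r
library(RWeka)

# Fit Weka's SMO support vector classifier to the iris data
model <- SMO(Species ~ ., data = iris)

# 10-fold cross-validated performance estimate
evaluate_Weka_classifier(model, numFolds = 10)
```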
B.2 Data Sources
B.2.1 BioModels Database
This is a repository of peer-reviewed, published, computational models. These
mathematical models are primarily from the field of systems biology, but more
generally are of biological interest. This resource allows biologists to store, search,
and retrieve published mathematical models. In addition, models in the database
can be used to generate sub-models, can be simulated online, and can be converted
between different representational formats. This resource also features program-
matic access via web services.
http://www.ebi.ac.uk/biomodels-main/
B.2.2 DDBJ
http://www.ddbj.nig.ac.jp
B.2.3 EMBL
http://www.ebi.ac.uk/embl/index.html
B.2.4 GenBank
http://www.ncbi.nlm.nih.gov/GenBank
B.2.5 Pfam
http://pfam.wustl.edu
B.2.6 PROSITE
http://us.expasy.org/prosite
B.2.7 SWISS-PROT
http://us.expasy.org/sprot
B.3 Support Vector Machines
Support vector machines (SVMs) have seen a particular growth in the tools and
software available to support this valuable supervised learning paradigm. We list a
number of products available at the time of writing.⁴
The Kernel-Machine Library (GNU) – C++ template library for support vector
machines.
Lush – a Lisp-like interpreted/compiled language with C/C++/Fortran interfaces
that has packages to interface to a number of different SVM implementations.
Interfaces to LASVM, LIBSVM, mySVM, SVQP, SVQP2 (SVQP3 in future) are
available. Leverage these against Lush’s other interfaces to machine learning,
hidden Markov models, numerical libraries (LAPACK, BLAS, GSL), and built-in
vector/matrix/tensor engine.
4 As available from the Wikipedia entry http://en.wikipedia.org/wiki/Support_vector_machine
SVMlight – a popular implementation of the SVM algorithm by Thorsten Joachims;
it can be used to solve classification, regression, and ranking problems.
SVMProt – Protein Functional Family Prediction.
LIBSVM – A library for support vector machines, Chih-Chung Chang and Chih-Jen
Lin.
YALE – a powerful machine learning toolbox containing wrappers for SVMLight,
LibSVM, and MySVM in addition to many evaluation and preprocessing methods.
LS-SVMLab – MATLAB/C SVM toolbox; well-documented, many features.
Gist – implementation of the SVM algorithm with feature selection.
Weka – a machine learning toolkit that includes an implementation of an SVM
classifier; Weka can be used both interactively through a graphical interface and as
a software library. (The SVM implementation is called “SMO.” It can be found in
the Weka Explorer GUI, under the “functions” category.)
OSU SVM – MATLAB implementation based on LIBSVM.
Torch – C++ machine learning library with SVM.
Shogun – Large-Scale Machine Learning Toolbox that provides several SVM
implementations (like libSVM, SVMlight) under a common framework and
interfaces to Octave, MATLAB, Python, and R.
Spider – machine learning library for MATLAB.
kernlab – Kernel-based machine learning library for R.
e1071 – machine learning library for R.
SimpleSVM – SimpleSVM toolbox for MATLAB.
SVM and Kernel Methods MATLAB Toolbox
PCP – C program for supervised pattern classification; includes LIBSVM wrapper.
TinySVM – a small SVM implementation, written in C++.
pcSVM is an object-oriented SVM framework written in C++ and provides
wrapping to Python classes. The site provides a stand-alone demo tool for
experimenting with SVMs.
PyML – a Python machine learning package; includes SVM, nearest-neighbor classifiers, ridge regression, multi-class methods (one-against-one and one-against-rest), feature selection (filter methods, RFE, multiplicative update), model selection, and classifier testing (cross-validation, error rates, ROC curves, statistical tests for comparing classifiers).
Algorithm::SVM – Perl bindings for the libsvm support vector machine library.
SVM Classification Applet – performs classification on any given dataset and gives
10-fold cross-validation error rate.
Appendix C
The Patient Dataset
C.1 Introduction
At various points in the book we refer to a patient dataset. We define this dataset in
this appendix rather than in the main body of the book so as not to lose too much
momentum in the book itself.
The patient dataset definition, together with some artificially created data, is
available from the companion website along with some example code for various
data mining techniques.
The datasets defined below are not intended to be viewed as complete, or even as
necessarily valuable, but simply serve to provide us with a meaningful data
environment with which to illustrate some of the data mining concepts we discuss
within the book.
For the purposes of illustration of the data mining techniques, we have
denormalized the patient dataset into two datasets – one for the patient, and one
for the tests performed on specimens/isolates from those patients. A normalized
version (to mimic a transactional environment) and a star-schema (to mimic a data
warehouse environment) are available on the companion website.
C.2 Supporting Datasets
In addition to the dataset generated from our work, we would typically use a
number of datasets that come from other sources. For example, in chapter two we
use a body mass index dataset to allow us to calculate weight-related categories.
Where so used, we indicate these in the main text.
C.3 Patient Dataset Definition
• Id (Integer; PK) – Unique key value for this record in the dataset. Every record will have a unique value.
• Patient Label (String) – Some identification for the patient. This may be their name, initials, or other “human-readable” attribute. Note that this is not an identifier in the pure sense, but is simply a text label.
• MRN (String) – Medical record number.
• Patient Id (Integer) – A unique identifier assigned to each patient. This is an integral value and may simply be used as a sequence, where the next available number is assigned to the next unique patient without any particular ordering.
• Context Id (Integer) – A unique identifier assigned to a particular context associated with this data. Contexts will typically be some project or study, or other high-level grouping for the data. In this book, the context is synonymous with a project.
• Encounter Id (Integer) – A unique identifier assigned to each encounter between a patient and a healthcare professional within a specific context. The “within a specific context” constraint is important since it is likely that this value will be a sequence within a project/study.
• Age (Integer) – The patient's age.
• Weight (Double) – The patient's weight, measured in kilograms.
• Height (Double) – The patient's height, measured in meters.
• Gender (Character) – The patient's gender. Allowed values are M(ale), F(emale), N(ot provided), and ‘ ’ or NULL.
• Smoker (Boolean) – TRUE/FALSE value indicating whether the patient is a smoker or not.
• Activity Factor (Integer) – A scalar value from 0 to 10 indicating the level of activity/exercise in the patient's daily life. 0 indicates immobility, whereas 10 indicates an extreme level of activity such as for professional athletes.
C.4 Test Dataset Definition
• Id (Integer; PK) – Unique key value for this record in the dataset. Every record will have a unique value.
• Patient Id (Integer; FK) – The patient id value from the patient dataset for the patient associated with this test result.
• Test Id (String) – Some unique identifier for the test that is consistent within the data environment. For example, LDL-C might be a test id for measurement of LDL cholesterol. This can then be compared across instances.
• Test Description (String) – A description of the test.
• Test Result Value String (String) – The actual test result. This attribute will be populated based on the test result domain being a string or character value.
• Test Result Value (Double) – The actual test result. This attribute will be populated based on the test result domain being an integer or numeric value.
• Test Result Domain (Character) – A value indicating the actual result value's domain. Values are S(tring), I(nteger), N(umeric), and C(haracter). This can be used to convert the values to their appropriate types.
• Interpretation (String) – The interpretive result, if any, associated with this test.
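As an illustrative sketch (the file and column names are hypothetical renderings of the definitions above), the two denormalized datasets can be related in R through the Patient Id foreign key:

```r
# Hypothetical CSV exports of the two denormalized datasets
patients <- read.csv("patient.csv")  # one row per patient record
tests    <- read.csv("test.csv")     # one row per test result

# Relate test results to patients through the Patient Id key
combined <- merge(patients, tests, by = "Patient.Id",
                  suffixes = c(".patient", ".test"))

# Example derived attribute: BMI from weight (kg) and height (m)
patients$BMI <- patients$Weight / patients$Height^2
head(combined)
```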
Appendix D
The Clinical Trial and Data Mining
D.1 Introduction
Throughout our book we have referred to the clinical trial as an archetype for data
mining. This is for some intuitive reasons along with some which may not be quite
as intuitive to our readers.
It is definitely true to say that much of the data generated in the healthcare and life sciences arenas relates to the pharmaceutical product or medical device industries, and by our focus on these industries, we are certainly not intending to dismiss any of the other vitally important research and development disciplines that additionally form the foundation for the life sciences today. We instead have used the pharmaceutical development process because many more people outside of these disciplines have some knowledge of it: the process is front and center in the public psyche through advertising, public information on the safety of drugs and devices, and other media outlets.
D.2 The Clinical Trial Process
D.2.1 Statistics of a Phase III Clinical Trial
We now present some of the basic statistics of a phase III clinical trial.
• Number of patients (start of trial): 2,500
• Number of visits per patient: 12
• Number of data attributes collected per patient per visit, in bytes (characters): 4,000
• Total: 114.44 MB
In the above list, the data is the maximum amount of data that might be captured; it relies on several assumptions that typically do not hold in reality. The most significant of these is the assumption that every patient is included in the trial from the beginning to the very end. This does not typically happen, as patients drop out, are excluded, or, unfortunately, die before the trial ends. Since the largest driver of the data volume is obviously the number of patients involved in the trial, we will assume that only 25% of the patients initially enrolled see the trial through. This would result in 28.61 MB of data.
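The arithmetic behind these figures is straightforward to reproduce; a quick R check (taking 1 MB = 1024² bytes):

```r
patients <- 2500
visits   <- 12
bytes    <- 4000   # bytes captured per patient per visit

total <- patients * visits * bytes   # 120,000,000 bytes
total / 1024^2                       # ≈ 114.44 MB at full enrollment

0.25 * total / 1024^2                # ≈ 28.61 MB if 25% complete the trial
```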
References
Chou K-C (2001) Prediction of protein cellular attributes using pseudo amino acid composition.
Proteins 43:246–255
Gentleman RC et al (2001) The Bioconductor FAQ. www.bioconductor.org/docs/faq
Reutemann P, Pfahringer B, Frank E (2004) A toolbox for learning from relational data with
propositional and multi-instance learners. In: 17th Australian joint conference on Artificial
Intelligence (AI2004). Springer, Berlin
Index
A
Accuracy, 5, 10, 12, 25, 42, 70, 80, 120,
133–135, 142, 192, 211, 217, 233, 240,
260, 278, 327, 332, 352, 364, 365,
370–373, 396, 407, 429, 458, 460, 467,
472, 474, 484, 489–497, 514, 515, 585
Aggregation, 21, 41, 110–113, 202, 209–211,
225, 379, 497, 544
Analytical entity, 193, 194
Apriori algorithm, 146, 148–150, 372, 447, 587
ARFF, 86, 333
Artificial neural networks, 396, 399–425, 577
Association, 4, 6, 8, 11, 14, 18, 25, 38, 42, 46,
69–73, 82, 86, 87, 105, 119, 121,
145–152, 161, 170–172, 183, 283–285,
289, 297, 339, 364, 365, 372, 397, 447,
548, 561, 562, 566, 568, 576, 580
Association analysis, 42, 46
Association rules, 69–73, 82, 145–152, 183,
364, 372, 447
Asymmetry, 75, 238, 272–273
Attribute, 4, 36, 87, 125, 192, 236, 304, 365,
459, 514, 564
B
Back-propagation, 396, 412–416, 423–425,
469, 477
Bayes factor, 312–315
Bayesian, 14, 29, 183, 186, 237, 240, 241,
303–360, 456, 457, 469, 471–477, 480,
549, 577
Bayesian belief networks, 337–340
Bayesian classification, 327–337, 469
Bayesian modeling, 471–477
Bayesian reasoning, 322–327
Bayes’ theorem, 293, 304, 309–312, 317, 320,
327, 331, 339, 471
Bias, 10, 43–45, 47, 55, 129, 196, 222, 229,
239, 260, 269, 341, 351, 404, 411,
459, 495
Bias-variance decomposition, 55
Binary tree algorithm, 436–437
Binning, 211–215
Boolean logic, 440–448
Box plot, 154, 159–161, 272
Branching, 133–142, 144, 145, 532
Branching factor, 128
C
C4.5, 79, 138, 587
Causality, 8, 11
Characteristic, 24–26, 34, 36, 38, 42, 45, 48,
52, 57, 65, 70, 77, 81, 95, 102, 113, 133,
144, 145, 170, 216, 243, 261, 270, 286,
293, 294, 298, 325, 327, 351, 371, 375,
383, 386, 396, 399, 405, 424, 425, 431,
460, 472, 478, 482, 492, 512, 513, 515,
531, 540, 546, 549, 550, 564, 590
Characterization, 45, 351, 363
Classification, 4, 35, 87, 127, 241, 308, 364,
455–498, 513, 544, 587
Classification and regression trees, 129–145
Classification rules, 66–69, 72, 127–129, 371,
464, 465
Cluster analysis, 18, 47–48, 266, 364, 365, 396,
397, 446, 447
Clustering, 4, 14, 25, 73–74, 76, 77, 81, 119,
168, 169, 177, 187, 194, 211, 215,
371–373, 383, 396, 397, 426–431, 436,
446, 448, 456, 458–460, 473, 476, 477,
631
481, 483, 485, 486, 498, 514, 532, 533,
535, 544, 546, 576, 577
Concept, 2, 33–82, 85, 132, 194, 236, 304, 364,
456, 502, 553, 585
Concept hierarchy, 38, 52–55, 205, 206,
215, 227
Conditional random field (CRF), 351–355
Confidence, 13, 26, 42, 70, 75, 114, 145–148,
176, 238, 250, 259, 260, 278–282, 293,
341, 372, 394, 492
Confidence intervals (CI), 259, 260, 278–282,
293
Contingency tables, 127, 277, 278
Correlation, 2, 3, 6, 8, 11, 18, 24, 33, 35, 44, 45,
55, 75, 145, 161, 169, 186, 194, 219,
221, 224, 238, 241, 242, 259–265, 283,
284, 289, 295, 296, 314, 321, 322, 329,
396, 460, 480, 486–488, 531, 563
Co-training, 373, 374, 377–381, 387
Coverage, 4, 42, 70, 145, 208, 209, 260, 372,
467, 511
Cross-validation, 132, 143–144, 382, 493, 494
D
Data architecture, 2, 4, 12, 13, 29, 82,
85–122, 193
Database for Annotation, Visualization, and
Integrated Discovery (DAVID), 208,
209, 297
Databases, 2–4, 11, 12, 14, 16, 24, 26, 29,
30, 34, 53, 85, 86, 94, 96, 98, 99,
103–105, 110, 111, 116–118, 148,
171, 176, 181, 197, 202–204,
206–208, 218, 222, 228, 266, 290,
294, 295, 297, 315, 327, 328, 349,
352, 378–380, 433, 447, 458,
502–504, 507–512, 520, 538, 545,
546, 553, 563–565, 567, 570, 576
Data curation, 111, 115, 504
Data format, 13, 22–23, 172, 198, 504
Data integration, 102–106, 504–508, 545
Data mart, 29, 110–112, 228
Data mining
ethics, 10–12
process, 15, 19, 24, 43
results, 23, 24, 125–189
Data modeling, 85–122, 203
Data plot, 154–158
Data preparation, 19, 21–23, 93, 194–195,
459–460
Data reduction, 194, 212, 214, 215, 224–227
Data warehouse, 3, 4, 17, 29, 37, 53,
110–113, 115–116, 126, 191, 192,
207, 228, 230, 508
DAVID.See Database for Annotation,Visualization, and Integrated Discovery
(DAVID)
Decision tables, 125, 127–129
Decision trees, 23, 38, 44, 63–68, 79–81,
127–129, 215, 327, 332, 371, 448, 457,
463–466, 469, 588
Denormalization, 40, 201–206
Dependent entity, 87, 88
Diffusion map, 426–431
Dimensionality, 49–51, 74, 80, 114, 211, 217,
394, 426, 480, 481, 497
Discretization, 62, 193, 194, 212, 215, 227
Discrimination, 45
Distance, 48, 76–79, 154, 162, 198, 199, 216,
272, 340, 369, 382, 383, 385, 389–391,
416, 417, 422, 423, 426, 446, 447, 459,
470, 480, 484, 486–488, 498, 531–535,
537, 554
Dynamical systems, 202, 546–548,
567, 580
E
Enrichment analysis, 294–297
Entity, 46, 86–92, 95, 102–103, 105–109, 120,
180, 181, 184, 193, 194, 196, 218, 374,
379, 380, 569, 570, 576
Entrez, 99, 176, 208, 508–510, 566, 569
Entropy, 75, 134, 137–138, 140, 145, 238,
269, 270, 351, 467
Error, 10, 36, 86, 138, 211, 239, 333, 367, 461,
512, 546
Expectation-maximization (EM) algorithm, 119, 223, 346, 373–377, 382, 448, 587
External consistency, 195, 197, 206
Extract, 5, 17, 22, 91, 100, 120, 126, 132,
137, 204, 206–209, 217, 264, 363,
365, 378, 384, 466, 479, 505, 553,
567, 568, 576, 579
F
False discovery rate (FDR), 296–300
Feature selection, 51, 79–81, 364, 515
Feed-forward network, 400, 411–413, 477
Frequency polygon, 154
Functional networks, 404, 411, 568
Fuzzy logic, 394, 440–448
G
Gaussian mixture, 375–377
Gene expression analysis, 394–398, 430
Gene identifier cross-referencing,
208–209
Gene ontology (GO), 95–96, 116–118,
152, 176, 294, 509, 512, 544,
563–565, 567, 577, 578
Generalization, 4, 37, 43–45, 52, 81, 82,
87–89, 226–227, 390, 393, 423, 469,
493, 501, 565
Gene selection, 397, 514, 515
Genome-wide association studies (GWAS),
171, 172, 297, 580
Genomic data, 169, 170, 503, 563
Genotype visualization, 171–172
Gini, 75, 134–140, 142, 145, 238,
269–272
Graph theory, 130, 339, 554–563
H
Heat map, 166–170, 510
Heterogeneity, 75, 86, 104, 105, 238,
269–270, 363
Hidden Markov models (HMM), 49,
342–356, 358, 359, 374, 377, 387,
480, 538
Histogram, 153–154, 268, 277, 296,
298, 354
Hypothesis testing, 239–265, 300
I
Independent entity, 87, 88
Information gain, 134, 137–139
Input representation, 39–40
Instance, 10, 37, 87, 126, 193, 246, 304, 364,
457, 514, 544, 587
Integrating annotations, 505
Interestingness, 26–28
Internal consistency, 192, 195–197
Interpretive data, 93
K
Kernels, 50, 368, 383–386, 388, 391–394,
417, 426, 428, 431, 549
K-means, 49, 74, 375, 448, 477, 481–486,
498, 587
K-nearest neighbor (k-NN), 198, 446, 457,
470, 587
Kurtosis, 75, 238, 273–274
L
Learning algorithm, 49, 50, 73, 183, 382, 383,
385, 388, 391, 399, 405, 407, 408, 425,
448, 481, 588
Least squares, 74, 76, 286–290, 461
Linear model, 55, 74–75, 81, 130, 157, 161,
282, 286, 463, 549
Linear regression, 74, 212, 283–290, 460–463,
477, 488
M
Many-to-many relationship, 87, 88
Marginal probability, 309, 310, 471, 477
Maximum a posteriori (MAP) estimation, 235,
293–294, 374
Maximum likelihood estimation (MLE), 75,
76, 290–294, 374, 376
Measures of central tendency, 266–276
Metadata, 105, 215–217
MIAME. See Minimal information about a microarray experiment (MIAME)
Microarray databases, 563–565
Microarrays, 8, 9, 33, 121, 122, 171, 393,
395–397, 447, 457, 502, 505, 513–515,
537, 538, 544, 546, 548, 567
Minimal information about a microarray
experiment (MIAME), 505, 565
Minimum confidence threshold, 146
Minimum support threshold, 146, 148
Mining by navigation, 561–562
Missing data, 21, 40–41, 55, 58, 62, 82, 94,
104, 194, 199, 217–219, 222, 223,
228, 231, 232, 366, 367, 460, 474
Missing values, 22, 40, 41, 62, 66, 128, 129,
199, 217–223, 267, 374, 467
Model, 2, 34, 85, 126, 192, 236, 305, 364,
456, 501, 543, 585, 586
Modeling, 13, 19, 22, 23, 29, 51, 56–58, 66,
85–122, 126, 203, 216, 217, 230, 233,
237, 241, 282, 314, 322, 326, 347, 352,
355, 377, 438–440, 457, 461, 471–477,
502, 546–549, 552, 578, 580, 586
Monte Carlo strategies, 56–60
Motifs, 57, 94, 295, 297, 373, 387, 431–436,
480, 516–520, 528, 539, 577
Multiple sequence alignment, 99, 343
N
Naïve Bayes, 80, 304, 587
Naïve Bayes classifier, 183, 304, 309, 311, 312, 332–337
Nearest neighbor, 81, 382, 383, 469–470, 588
Neighbors, 76–79, 99, 188, 434, 470
Network analysis, 176, 513, 577
Network (graph)-based analysis, 431
Network motif, 431–436, 577
Network visualization, 172–175
Neural networks, 3, 46, 49, 241, 327, 332,
364, 370, 391, 392, 394, 396, 404,
411, 424, 436, 456, 457, 460, 474,
477–481, 492, 498, 549, 577, 588
Noisy data, 12, 211–215
Normal form (NF), 90, 91, 120, 201, 208
Normalization, 6, 49, 50, 90, 91, 121, 194, 196,
198, 200–201, 204, 406, 426, 510, 545
Null hypothesis, 75, 164, 238, 241–245, 250,
254, 256, 259, 263
O
Odds ratio, 312–315
One-to-many relationship, 87, 88, 203, 207
Ontologies, 95, 96, 105, 116, 152, 512, 564,
565, 567, 577
Operational data store (ODS), 110–113
Outlier, 4, 21, 22, 38, 48–49, 154, 163,
211, 239, 258, 268, 462, 488, 493,
510, 544, 576
Over-fitting, 12, 26, 27, 51–52, 81, 134, 142,
218, 290, 371, 393, 394, 458, 466,
469, 474, 476, 502
Over-training, 50, 368
P
Pairwise sequence alignment, 521–527
Parameter estimation, 340–341
Partial ordering, 53
Patterns, 2, 34, 102, 126, 194, 236, 324, 364,
457, 502, 543, 586
Perceptron, 391, 407–409, 411–412, 423,
424, 477
Petri nets (PNs), 437–440, 503, 549
Phylogenetic trees, 174, 188, 531, 537
Phylogenetic tree visualization, 177–178
Position map, 170–171
Posterior distribution, 304, 340
Prediction, 3, 4, 18, 27, 28, 34, 46–47, 52, 57, 127, 132, 144, 170, 240, 322, 335, 352–354, 356, 365, 372, 386–389, 395–397, 416, 422, 455–498, 501, 509, 514, 515, 543, 546, 553, 576
Principal component analysis, 51, 56, 80, 159, 169, 170, 194, 224, 225, 426, 477, 546
Prior distribution, 304, 324, 475
Prior probability, 305, 306, 308, 311, 312, 319, 324–326, 340, 471, 472, 477
Probabilistic Boolean networks (PBN), 548–550
Protein, 28, 33, 93, 172, 207, 290, 313, 386,
478, 502, 544
Protein identifier cross-referencing, 208–209
Proteomic data, 8, 503–504, 514
P-value, 200, 243–245, 255, 261–265, 296–299
Q
Qualitative and quantitative data, 90–93
R
1R, 16, 41, 60–63, 82, 237, 304, 465–469
Radial basis function (RBF), 50, 393, 416, 417
Random walk, 426–431
Receiver operating characteristic (ROC), 386, 486, 492
Reducing (the) dimensionality, 80, 481
Regression, 27, 35, 74, 129–145, 156, 157, 161,
166, 183, 186, 211, 212, 216, 238, 239,
241, 259, 283–287, 289, 290, 364, 389,
393, 407, 447, 457, 460, 577, 587
Regression analysis, 27, 46, 141, 282–290,
456–458
Regular expressions, 480, 516–520, 589
Regulatory networks, 431, 547–549, 552,
576, 580
Reinforcement learning, 425
Relational data model, 85, 94
Relevance, 80–82, 216, 242, 300, 473
Relevance analysis, 47, 194, 459
S
Sampling error, 239
Scatterplot, 154, 157, 161–162, 283–289
Self-organizing map (SOM), 49, 418, 419,
422, 423
Semi-supervised learning (SSL), 373–383
Sequence data, 28, 34, 93–95, 207, 215, 319,
386, 430, 492, 503–505, 508, 509, 520,
521, 528, 531, 536, 589
Shortest common superstring (SCS), 574, 575
Similarity, 14, 47, 48, 73, 76, 77, 94, 99, 173,
368–369, 381–388, 392, 393, 426, 428,
459, 464, 470, 478, 479, 486–488, 498,
510, 527, 539
Singular value decomposition (SVD), 51
Skewness, 75, 238, 258, 274
Spectral clustering, 426–431
Standardization, 6, 194, 198, 222, 512, 545
Star schema, 86, 109, 193
Statistical inference, 10, 237, 239–265,
269, 471
Statistical significance, 44, 242, 260, 300, 515
Structural risk minimization (SRM), 393–395
Subtype, 87–89, 354, 373, 397, 458, 509
Supertype, 88, 89
Supervised learning, 37, 216, 363–366, 368,
370–372, 389, 395, 399, 407–423, 425,
448, 473
Support, 2, 12, 13, 17, 19–21, 25, 29, 30, 38,
42, 70, 82, 86, 103, 110, 121, 126,
145–147, 162, 172, 175–177, 195, 196,
207, 228, 239, 251, 262, 268, 297, 320,
335, 351, 372, 378, 388, 390, 426, 429,
430, 434, 443, 444, 456, 471, 477, 479,
502, 508, 511, 512, 548, 552, 566
Support vector machines (SVMs), 46, 369,
384, 387–398, 456, 457, 469, 492, 587
Systems biology, 7, 97, 152, 153, 184, 498,
501, 502, 543–581
Systems biology graphical notation (SBGN),
183–185
Systems biology markup language (SBML),
176, 178–182, 186, 548
T
Temporal models, 544, 578
Text mining, 4, 97, 119, 120, 217, 377, 544,
546, 565–571, 575
Time dimension, 112–113, 230
Tools, 2, 56, 95, 126, 207, 236, 323, 374, 456,
504, 546, 586
Total ordering, 53, 54
Training, 23, 35, 91, 129, 260, 312, 364, 457,
502, 544
Training data, 35–37, 39, 43, 52, 282, 329,
333–335, 337, 345, 346, 353, 371,
373, 393, 413, 457, 467, 469, 480,
494, 502, 515
Transactional data model, 110–112
Transform and load, 17, 206–209
Transformation, 22, 82, 103, 104, 110, 126,
181, 191, 194, 198–200, 207, 274,
380, 412
Tree pruning, 133, 142
Twoing rule, 134, 137
U
Unsupervised learning, 37, 49, 50, 73,
363–366, 368, 372–373, 396, 399,
406, 407, 418, 423, 458, 459, 473,
477, 481, 544
V
Validation, 14, 23, 24, 29, 52, 69, 111,
132, 143, 216, 327, 371, 458, 460,
465, 468, 469, 478, 490, 494–496,
498, 534
Variation, 55, 75, 116, 170–172, 231, 238,
289, 296, 355, 396, 493, 509, 515,
528, 544, 580