Probability Theory Full Version


  • 8/3/2019 Probability Theory Full Version


    Probability Theory Summary

    1. Basic Concepts

    1.1 Set Theory

    Before we venture into the depths of probability, we start our journey with a field trip on set theory.

    1.1.1 What is a set?

Sets are collections of elements or samples. Sets are denoted by capital letters (A, B). Elements are denoted by i. The set containing all elements is the sample space Ω. The set with no elements is called the empty set ∅. Now let's discuss some notation. If we say that A = {1, 2, . . ., n}, then we say that A consists of the elements 1, 2, . . ., n. i ∈ A means that i is in A, whereas i ∉ A means that i is not in A.

    1.1.2 Comparing sets

We can compare and manipulate sets in many ways. Let's take a look. We say that A is a subset of B (denoted by A ⊂ B) if every element in A is also in B. When (at least) one element in A is not in B, then A ⊄ B. If A ⊂ B and B ⊂ A (they consist of the same elements), then A and B are equal: A = B. Sets are said to be disjoint if they have no common elements.

We can't just add up sets. But we can take the intersection and the union of two sets. The intersection of A and B (denoted by A ∩ B) consists of all elements i that are in both A and B. (So i ∈ A ∩ B if i ∈ A and i ∈ B.) On the other hand, the union of A and B (denoted by A ∪ B) consists of all elements i that are in either A or B, or both. (So i ∈ A ∪ B if i ∈ A or i ∈ B.)

The set difference A \ B consists of all elements that are in A, but not in B. There is also the complement of a set A, denoted by Aᶜ. This consists of all elements that are not in A. Note that Aᶜ = Ω \ A and A \ B = A ∩ Bᶜ. There's one last thing we need to define. A partition of Ω is a collection of subsets Ai, such that

• the subsets Ai are disjoint: Ai ∩ Aj = ∅ for i ≠ j;
• the union of the subsets equals Ω: A1 ∪ A2 ∪ . . . ∪ An = Ω.
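The operations above map directly onto Python's built-in set type. The following sketch (with an arbitrary six-element sample space of my choosing) checks the identities from this section; note that a complement only makes sense relative to an explicitly given Ω.

```python
# A sketch of the set operations above, using Python's built-in set type.
# The sample space Omega must be given explicitly to form complements.
Omega = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}
B = {3, 4}

assert A <= Omega                 # A is a subset of Omega
assert A & B == {3}               # intersection A ∩ B
assert A | B == {1, 2, 3, 4}      # union A ∪ B
assert A - B == {1, 2}            # set difference A \ B
A_c = Omega - A                   # complement of A: {4, 5, 6}
assert A - B == A & (Omega - B)   # A \ B = A ∩ Bᶜ

# {1, 2}, {3, 4}, {5, 6} form a partition of Omega:
parts = [{1, 2}, {3, 4}, {5, 6}]
assert all(p & q == set() for p in parts for q in parts if p is not q)
union = set().union(*parts)
assert union == Omega
```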

    1.2 Introduction to Probability

It's time to look at probability now. Probability is all about experiments and their outcomes. What can we say about those outcomes?

    1.2.1 Definitions

Some experiments always have the same outcome. These experiments are called deterministic. Other experiments, like throwing a die, can have different outcomes. There's no way of predicting the outcome. We do, however, know that when throwing the die many times, there is a certain regularity in the outcomes. That regularity is what probability is all about.


A trial is a single execution of an experiment. The possible outcomes of such an experiment are denoted by i. Together, they form the probability space Ω. (Note the similarity with set theory!) An event A is a set of outcomes. So A ⊂ Ω. We say that Ω is the sure event and ∅ is the impossible event. Now what is the probability P(A) of an event A? There are many definitions of probability, out of which the axiomatic definition is mostly used. It consists of three axioms. These axioms are rules which P(A) must satisfy.

1. P(A) is a nonnegative number: P(A) ≥ 0.

2. The probability of the sure event is 1: P(Ω) = 1.

3. If A and B have no outcomes in common (so if A ∩ B = ∅), then the probability of A ∪ B equals the sum of the probabilities of A and B: P(A ∪ B) = P(A) + P(B).

    1.2.2 Properties of probability

From the three axioms, many properties of the probability can be derived. Most of them are, in fact, quite logical.

• The probability of the impossible event is zero: P(∅) = 0.
• The probability of the complement of A satisfies P(Aᶜ) = 1 − P(A).
• If A ⊂ B, then B is equally or more likely than A: P(A) ≤ P(B).
• The probability of an event A is always between 0 and 1: 0 ≤ P(A) ≤ 1.
• For any events A and B, there is the relation P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Using the probability, we can also say something about events. We say two events A and B are mutually independent if

P(A ∩ B) = P(A)P(B). (1.2.1)

Similarly, a series of n events A1, . . ., An are called mutually independent if any combination of events Ai, Aj, . . ., Ak (with i, j, . . ., k being distinct numbers between 1 and n) satisfies

P(Ai ∩ Aj ∩ . . . ∩ Ak) = P(Ai)P(Aj) · · · P(Ak). (1.2.2)

    1.2.3 Conditional Probability

Sometimes we already know some event B happened, and we want to know what the chances are that event A also happened. This is the conditional probability of A, given B, and is denoted by P(A|B). It is defined as

P(A|B) = P(A ∩ B) / P(B). (1.2.3)

The conditional probability satisfies the three axioms of probability, and thus also all the other rules. However, using this conditional probability, we can derive some more rules. First, there is the product rule, stating that

P(A1 ∩ A2 ∩ . . . ∩ An) = P(A1) P(A2|A1) P(A3|A2 ∩ A1) . . . P(An|An−1 ∩ . . . ∩ A2 ∩ A1). (1.2.4)

Another rule, which is actually quite important, is the total probability rule. Let's suppose we have a partition B1, . . ., Bn of Ω. The total probability rule states that

P(A) = Σᵢ₌₁ⁿ P(A ∩ Bi) = Σᵢ₌₁ⁿ P(Bi) P(A|Bi). (1.2.5)

By combining this rule with the definition of conditional probability, we find another rule. This rule is called Bayes' rule. It says that

P(Bj|A) = P(A|Bj) P(Bj) / Σᵢ₌₁ⁿ P(A|Bi) P(Bi). (1.2.6)
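As a numeric sketch of the total probability rule and Bayes' rule: the partition and probability values below are made-up illustration numbers, not taken from the text. Exact fractions keep the arithmetic transparent.

```python
from fractions import Fraction as F

# Hypothetical numbers: a partition B1, B2 of the sample space, with
# P(Bi) and P(A|Bi) chosen purely for illustration.
P_B = [F(1, 4), F(3, 4)]           # P(B1), P(B2); they sum to 1
P_A_given_B = [F(1, 2), F(1, 10)]  # P(A|B1), P(A|B2)

# Total probability rule (1.2.5): P(A) = sum of P(Bi) P(A|Bi).
P_A = sum(pb * pa for pb, pa in zip(P_B, P_A_given_B))
assert P_A == F(1, 5)   # 1/8 + 3/40 = 1/5

# Bayes' rule (1.2.6): P(B1|A) = P(A|B1) P(B1) / P(A).
P_B1_given_A = P_A_given_B[0] * P_B[0] / P_A
assert P_B1_given_A == F(5, 8)
```

Note how observing A raises the probability of B1 from 1/4 to 5/8, because A is much more likely under B1 than under B2.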


    2. Random Variables and Distributions

    2.1 Random Variable Definitions

Suppose we know all the possible outcomes of an experiment, and their probabilities. What can we do with them? Not much, yet. What we need are some tools. We will now introduce these tools.

    2.1.1 Random variables

It is often convenient to attach a number to each outcome i. This number is called a random variable and is denoted by x(i) or simply x. You can see the random variable as a number, which can take different values. For example, when flipping a coin we can say that x(head) = 0 and x(tail) = 1. So x is now a number that can be either 0 or 1.

Random variables can generally be split up in two categories: discrete and continuous random variables. A random variable x is discrete if it takes a finite or countably infinite set of possible values. (Countability refers to the degree of infinity: the sets of natural numbers N and rational numbers Q are countable, while the set of real numbers R is not.)

The two types of random variables have fundamental differences, so in the coming chapters we will often explicitly mention whether a rule/definition applies to discrete or continuous random variables.

    2.1.2 Probability mass function

Let's look at the probability that x = x for some number x. This probability depends on the random variable function x(i) and the number x. It is denoted by

Px(x) = P(x = x). (2.1.1)

The function Px(k) is called the probability mass function (PMF). It, however, only exists for discrete random variables. For continuous random variables Px(k) = 0 (by definition).

    2.1.3 Cumulative distribution function

Now let's take a look at the probability that x ≤ x for some x. This is denoted by

Fx(x) = P(x ≤ x). (2.1.2)

The function Fx(x) is called the cumulative distribution function (CDF) of the random variable x. The CDF has several properties. Let's name a few.

• The limits of Fx(x) are given by

lim x→−∞ Fx(x) = 0 and lim x→+∞ Fx(x) = 1. (2.1.3)

• Fx(x) is increasing: if x1 ≤ x2, then Fx(x1) ≤ Fx(x2).
• P(x > x) = 1 − Fx(x).
• P(x1 < x ≤ x2) = Fx(x2) − Fx(x1).

The CDF exists for both discrete and continuous random variables. For discrete random variables, the function Fx(x) takes the form of a staircase function: its graph consists of a series of horizontal lines. For continuous random variables the function Fx(x) is continuous.
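The staircase shape is easy to see by building the CDF of a fair six-sided die from its PMF; a minimal sketch, with the outcomes and probabilities of the die as the only assumption:

```python
from fractions import Fraction as F

# PMF of a fair six-sided die: each outcome 1..6 has probability 1/6.
pmf = {k: F(1, 6) for k in range(1, 7)}

def cdf(x):
    """Fx(x) = P(outcome <= x): sum the PMF over all outcomes <= x."""
    return sum(p for k, p in pmf.items() if k <= x)

assert cdf(0) == 0          # limit behaviour below the smallest outcome
assert cdf(3) == F(1, 2)
assert cdf(3.5) == F(1, 2)  # flat between outcomes: the staircase shape
assert cdf(6) == 1          # limit behaviour at the top
# P(x1 < outcome <= x2) = Fx(x2) - Fx(x1):
assert cdf(4) - cdf(2) == F(1, 3)
```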


    2.1.4 Probability density function

For continuous random variables there is a continuous CDF. From it, we can derive the probability density function (PDF), which is defined as

fx(x) = dFx(x)/dx, so that Fx(x) = ∫₋∞ˣ fx(t) dt. (2.1.4)

Since the CDF Fx(x) is always increasing, we know that fx(x) ≥ 0. The PDF does not exist for discrete random variables.

2.2 Discrete Distribution Types

There are many distribution types. We'll be looking at discrete distributions in this part, while continuous distributions will be examined in the next part. But before we even start examining any distributions, we have to increase our knowledge of combinations. We use the following paragraph for that.

    2.2.1 Permutations and combinations

Suppose we have n elements and want to order them. In how many ways can we do that? The answer to that is

n! = n · (n − 1) · . . . · 2 · 1. (2.2.1)

Here n! means n factorial. But what if we only want to order k items out of a set of n items? The number of ways is called the number of permutations and is

n! / (n − k)! = n · (n − 1) · . . . · (n − k + 1). (2.2.2)

Sometimes the ordering doesn't matter. What if we just want to select k items out of a set of n items? In how many ways can we do that? This result is the number of combinations (the binomial coefficient) and is

C(n, k) = n! / (k! (n − k)!) = (n · (n − 1) · . . . · (n − k + 1)) / (k · (k − 1) · . . . · 2 · 1). (2.2.3)
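Python's standard library ships these three counting functions directly; the sketch below checks equations (2.2.1) through (2.2.3) for small (arbitrarily chosen) n and k.

```python
import math

# n! orderings of n items (2.2.1), n!/(n-k)! ordered selections of k
# items (2.2.2), and C(n, k) unordered selections (2.2.3).
n, k = 5, 2
assert math.factorial(n) == 120
assert math.perm(n, k) == 20   # 5 * 4 ordered pairs
assert math.comb(n, k) == 10   # 20 / 2! unordered pairs

# The identities from the text:
assert math.perm(n, k) == math.factorial(n) // math.factorial(n - k)
assert math.comb(n, k) == math.perm(n, k) // math.factorial(k)
```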

    2.2.2 The binomial distribution and related distributions

Now we will examine some types of discrete distributions. The most important function for discrete distributions is the probability mass function (PMF) Px(k). So we will find it for several distribution types.

Suppose we have an experiment with two outcomes: success and failure. The chance for success is always just p. We do the experiment n times. The random variable x denotes the number of successes. We now have

Px(k) = P(x = k) = C(n, k) pᵏ (1 − p)ⁿ⁻ᵏ. (2.2.4)

    This distribution is called the binomial distribution.

Sometimes we want to know the probability that we need exactly k trials to obtain r successes. In other words, the r-th success should occur in the k-th trial. The random variable x now denotes the number of trials needed. In this case we have

Px(k) = P(x = k) = C(k − 1, r − 1) pʳ (1 − p)ᵏ⁻ʳ. (2.2.5)


    This distribution is called the negative binomial distribution.

We can also ask ourselves: how many trials do we need if we only want one success? This is simply the negative binomial distribution with r = 1. We thus have

Px(k) = P(x = k) = p (1 − p)ᵏ⁻¹. (2.2.6)

This distribution is called the geometric distribution.
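The three PMFs (2.2.4) through (2.2.6) can be sketched directly; the function names are mine, and the numeric parameters are arbitrary illustration values.

```python
import math

def binomial_pmf(k, n, p):
    """(2.2.4): P(k successes in n trials with success chance p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def neg_binomial_pmf(k, r, p):
    """(2.2.5): P(the r-th success occurs on trial k)."""
    return math.comb(k - 1, r - 1) * p ** r * (1 - p) ** (k - r)

def geometric_pmf(k, p):
    """(2.2.6): P(the first success occurs on trial k)."""
    return p * (1 - p) ** (k - 1)

# The geometric distribution is the negative binomial with r = 1:
assert abs(geometric_pmf(4, 0.3) - neg_binomial_pmf(4, 1, 0.3)) < 1e-12
# A PMF sums to 1 over its support (k = 0..n for the binomial):
assert abs(sum(binomial_pmf(k, 10, 0.3) for k in range(11)) - 1.0) < 1e-12
```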

    2.2.3 Other discrete distributions

Let's discuss some other discrete distributions. A random variable x follows a Poisson distribution with parameter λ > 0 if

Px(k) = e^(−λ) λᵏ / k!. (2.2.7)

This distribution is an approximation of the binomial distribution if np = λ, p → 0 and n → ∞.

A random variable x has a uniform distribution if

Px(k) = 1/n, (2.2.8)

where n is the number of possible outcomes of the experiment. In this case every outcome is equally likely.
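As a numeric sanity check of the Poisson limit mentioned above, the sketch below holds np = λ fixed while n grows; the values k = 3 and λ = 2 are arbitrary choices.

```python
import math

def poisson_pmf(k, lam):
    """(2.2.7): P(k) = e^(-lambda) * lambda^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hold n*p = lambda fixed while n grows and p shrinks: the binomial
# PMF approaches the Poisson PMF (here checked at k = 3, lambda = 2).
lam = 2.0
gaps = [abs(binomial_pmf(3, n, lam / n) - poisson_pmf(3, lam))
        for n in (10, 100, 1000)]
assert gaps[2] < gaps[1] < gaps[0]   # the approximation improves with n
```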

    A random variable has a Bernoulli distribution (with parameter p) if

Px(k) = p for k = 1, and Px(k) = 1 − p for k = 0. (2.2.9)

Finally there is the hypergeometric distribution, for which

Px(k) = C(r, k) C(m − r, n − k) / C(m, n). (2.2.10)

    2.3 Continuous Distribution Types

It's time we switch to continuous distributions. The most important function for continuous distributions is the probability density function (PDF) fx(x). We will find it for several distribution types.

    2.3.1 The normal distribution

    We start with the most important distribution type there is: the normal distribution (also calledGaussian distribution). A random variable x is a normal random variable (denoted by x N(x,

    2x)) if

    fx(x) =1

    2xe

    12 (

    xxx

    )2

    . (2.3.1)

Here x̄ and σx are, respectively, the mean and the standard deviation. (We will discuss them in the next part.) It follows that the cumulative distribution function (CDF) is

Fx(x) = (1 / (σx √(2π))) ∫₋∞ˣ e^(−(1/2) ((t − x̄)/σx)²) dt. (2.3.2)


The above integral doesn't have an analytical solution. To get a solution anyway, use is made of the standard normal distribution. This is simply the normal distribution with parameters x̄ = 0 and σx = 1. So,

Φ(z) = P(z ≤ z) = (1 / √(2π)) ∫₋∞ᶻ e^(−t²/2) dt. (2.3.3)

There are a lot of tables in which you can simply insert z and retrieve Φ(z). To get back to the variable x, you make use of the transformation

z = (x − x̄) / σx, or equivalently x = σx z + x̄. (2.3.4)
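Instead of a printed table, Φ(z) can also be evaluated with the error function, via the identity Φ(z) = ½(1 + erf(z/√2)); this identity is standard but not stated in the text. A minimal sketch, combining it with the standardization (2.3.4):

```python
import math

def Phi(z):
    """Standard normal CDF via the error function:
    Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), standardizing with z = (x - mu) / sigma."""
    return Phi((x - mu) / sigma)

assert Phi(0.0) == 0.5                # symmetry about zero
assert abs(Phi(1.96) - 0.975) < 1e-3  # the familiar table value
# The mean of any normal distribution sits at the 50% point:
assert abs(normal_cdf(10.0, 10.0, 2.0) - 0.5) < 1e-12
```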

    2.3.2 Other continuous distributions

There is also a continuous uniform distribution. A random variable x has a uniform distribution (denoted by x ~ U(a, b)) on the interval (a, b) if

fx(x) = 1 / (b − a) for a ≤ x ≤ b, and fx(x) = 0 otherwise. (2.3.5)

A random variable has an exponential distribution (with parameter λ) if

fx(x) = λ e^(−λx) for x ≥ 0, and fx(x) = 0 for x < 0. (2.3.6)

Finally, a random variable has a gamma distribution if

fx(x) = (bᵃ / Γ(a)) x^(a−1) e^(−bx) for x ≥ 0, and fx(x) = 0 for x < 0, (2.3.7)

where Γ is the gamma function, given by

Γ(a) = ∫₀^∞ x^(a−1) e^(−x) dx. (2.3.8)

    2.4 Important parameters

Certain parameters apply to all distribution types. They say something about the distribution. Let's take a look at what parameters there are.

    2.4.1 The mean

The mean is the expected (average) value of a random variable x. It is denoted by E(x) = x̄. For discrete distributions we have

E(x) = x̄ = Σᵢ₌₁ⁿ xi Px(xi), (2.4.1)

with x1, . . ., xn the possible outcomes. For continuous distributions we have

E(x) = x̄ = ∫₋∞^∞ x fx(x) dx. (2.4.2)

    By the way, E(. . .) is the mathematical expectation operator. It is subject to the rules of linearity, so

    E(ax + b) = aE(x) + b, (2.4.3)

    E(g1(x) + . . . + gn(x)) = E(g1(x)) + . . . + E(gn(x)). (2.4.4)


    2.4.2 The variance

The variance or dispersion of a random variable is denoted by σx². Here σx is the standard deviation. If x is discrete, then the variance is given by

σx² = D(x) = E((x − x̄)²) = Σᵢ₌₁ⁿ (xi − x̄)² Px(xi). (2.4.5)

If x is continuous, then it is given by

σx² = D(x) = E((x − x̄)²) = ∫₋∞^∞ (x − x̄)² fx(x) dx. (2.4.6)

Here D(. . .) is the mathematical dispersion operator. It can be shown that σx² can also be found (for both discrete and continuous random variables) using

σx² = E(x²) − x̄². (2.4.7)

Note that in general E(x²) ≠ x̄². The value E(x) = x̄ is called the first moment, while E(x²) is called the second moment. The variance σx² is called the second central moment.

This is all very nice to know, but what is it good for? Let's take a look at that. The variance σx² tells something about how far values are away from the mean x̄. In fact, Chebyshev's inequality states that for every ε > 0 we have

P(|x − x̄| ≥ ε) ≤ σx² / ε². (2.4.8)
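A small sketch that computes the mean and variance of a fair die with exact fractions, checks that (2.4.5) and (2.4.7) agree, and verifies Chebyshev's inequality for the (arbitrarily chosen) value ε = 2:

```python
from fractions import Fraction as F

# Fair die: outcomes 1..6, each with probability 1/6.
outcomes = range(1, 7)
p = F(1, 6)

mean = sum(x * p for x in outcomes)               # E(x) = 7/2
second_moment = sum(x * x * p for x in outcomes)  # E(x^2) = 91/6
var = second_moment - mean ** 2                   # (2.4.7): E(x^2) - mean^2
# The definition (2.4.5) gives the same value:
assert var == sum((x - mean) ** 2 * p for x in outcomes)
assert var == F(35, 12)

# Chebyshev (2.4.8) with eps = 2: only outcomes 1 and 6 deviate by >= 2,
# so the left-hand side is 1/3, which is indeed at most 35/48.
eps = 2
lhs = sum(p for x in outcomes if abs(x - mean) >= eps)
assert lhs <= var / F(eps) ** 2
```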

    2.4.3 Other moments

After the first and the second moment, there is of course also the third (central) moment, being

E((x − x̄)³). (2.4.9)

The third central moment is a measure of the symmetry around the center (the skewness). For symmetrical distributions this third central moment is 0.

The fourth (central) moment E((x − x̄)⁴) is a measure of how peaked a distribution is (the kurtosis). The kurtosis of the normal distribution is 3. If the kurtosis of a distribution is less than 3 (so the distribution is less peaked than the normal distribution), then the distribution is platykurtic. Otherwise it is leptokurtic.

    2.4.4 Median and mode

Finally there are the median and the mode. The median is the value x for which Fx(x) = 1/2. So half of the possible outcomes have a value lower than x and the other half have values higher than x.

The mode is the value x for which (for discrete distributions) Px(x) or (for continuous distributions) fx(x) is at a maximum. So you can see the mode as the value x which is most likely to occur.


    3. Multiple Random Variables

    3.1 Random Vectors

    Previously we have only dealt with one random variable. Now suppose we have more random variables.

    What distribution functions can we then define?

    3.1.1 Joint and marginal distribution functions

Let's suppose we have n random variables x1, x2, . . ., xn. We can put them in a so-called random vector x = [x1, x2, . . ., xn]ᵀ. The joint distribution function (also called the simultaneous distribution function) Fx(x) is then defined as

Fx(x1, x2, . . ., xn) = Fx(x) = P(x1 ≤ x1, x2 ≤ x2, . . ., xn ≤ xn). (3.1.1)

(You should read the commas in the above equation as "and" or, equivalently, as the intersection operator ∩.) From this joint distribution function, we can derive the marginal distribution function Fxi(xi) for the random variable xi. It can be found by inserting ∞ in the joint distribution function for every xj other than xi. In an equation this becomes

Fxi(xi) = Fx(∞, ∞, . . ., ∞, xi, ∞, . . ., ∞). (3.1.2)

The marginal distribution function can always be derived from the joint distribution function using the above method. The opposite is, however, not always true. It often isn't possible to derive the joint distribution function from the marginal distribution functions.

    3.1.2 Density functions

Just like for random variables, we can also distinguish discrete and continuous random vectors. A random vector is discrete if its random variables xi are discrete. Similarly, it is continuous if its random variables are continuous.

    For discrete random vectors the joint (mass) distribution function Px(x) is given by

    Px(x) = P(x1 = x1, x2 = x2, . . . , xn = xn). (3.1.3)

For continuous random vectors, there is the joint density function fx. It can be derived from the joint distribution function Fx(x) according to

fx(x1, x2, . . ., xn) = fx(x) = ∂ⁿFx(x1, x2, . . ., xn) / (∂x1 ∂x2 . . . ∂xn). (3.1.4)

    3.1.3 Independent random variables

In the first chapter of this summary, we learned how to check whether a series of events A1, . . ., An are independent. We can also check whether a series of random variables are independent. This is the case if

P(x1 ≤ x1, x2 ≤ x2, . . ., xn ≤ xn) = P(x1 ≤ x1) P(x2 ≤ x2) . . . P(xn ≤ xn). (3.1.5)

If this is indeed the case, then we can derive the joint distribution function Fx(x) from the marginal distribution functions Fxi(xi). This goes according to

Fx(x) = Fx1(x1) Fx2(x2) . . . Fxn(xn) = Πᵢ₌₁ⁿ Fxi(xi). (3.1.6)


    3.2 Covariance and Correlation

Sometimes it may look like there is a relation between two random variables. If this is the case, you might want to take a look at the covariance and the correlation of these random variables. We will now take a look at what they are.

    3.2.1 Covariance

Let's suppose we have two random variables x1 and x2. We also know their joint density function fx1,x2(x1, x2). The covariance of x1 and x2 is defined as

C(x1, x2) = E((x1 − x̄1)(x2 − x̄2)) = ∫₋∞^∞ ∫₋∞^∞ (x1 − x̄1)(x2 − x̄2) fx1,x2(x1, x2) dx1 dx2 = E(x1 x2) − x̄1 x̄2. (3.2.1)

The operator C(. . . , . . .) is called the covariance operator. Note that C(x1, x2) = C(x2, x1). We also have C(x1, x1) = D(x1) = σx1².

If the random variables x1 and x2 are independent, then it can be shown that E(x1 x2) = E(x1) E(x2) = x̄1 x̄2. It directly follows that C(x1, x2) = 0. The opposite, however, isn't always true.
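A minimal sketch of (3.2.1) for a discrete pair, using a made-up joint PMF of my choosing; the two forms of the covariance (the defining expectation and E(x1 x2) − x̄1 x̄2) are checked against each other.

```python
from fractions import Fraction as F

# A made-up joint PMF for a discrete pair (x1, x2); probabilities sum to 1.
joint = {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 8), (1, 1): F(3, 8)}
assert sum(joint.values()) == 1

E_x1 = sum(a * p for (a, b), p in joint.items())        # 1/2
E_x2 = sum(b * p for (a, b), p in joint.items())        # 5/8
E_x1x2 = sum(a * b * p for (a, b), p in joint.items())  # 3/8

# C(x1, x2) = E(x1 x2) - E(x1) E(x2), the last form of (3.2.1):
cov = E_x1x2 - E_x1 * E_x2   # 3/8 - 5/16 = 1/16
# The definition E((x1 - mean1)(x2 - mean2)) agrees:
assert cov == sum((a - E_x1) * (b - E_x2) * p for (a, b), p in joint.items())
```

The covariance here is positive (1/16), reflecting that x2 = 1 is more likely when x1 = 1 than when x1 = 0.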

But the covariance operator has more uses. Suppose we have random variables x1, x2, . . ., xn. Let's define a new random variable z as z = x1 + x2 + . . . + xn. How can we find the variance of z? Perhaps we can add up all the variances of xi? Well, not exactly, but we are close. We can find σz² using

σz² = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ C(xi, xj) = Σᵢ₌₁ⁿ σxi² + 2 Σ (over 1 ≤ i < j ≤ n) C(xi, xj).