EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Page 1: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

EMAT 20205 Data Analysis

WEEK -2 Nello Cristianini

Page 2: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Axioms of Probability

• The probability law (assigning a number to each event E) must satisfy the following axioms:

– Nonnegativity: $P(E) \ge 0$ for every event $E$

– Additivity: if E and F are two disjoint events, then the probability of their union satisfies: $P(E \cup F) = P(E) + P(F)$

– Normalization: the probability of the entire sample space $\Omega$ is equal to 1: $P(\Omega) = 1$

Page 3: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Some comments…

• The maximum value for the probability of an event is 1 (probability of the entire sample space)

• This means that that event is CERTAIN

• $P(\Omega)=1$ means: the outcome will be one of the possible outcomes (obviously) (e.g.: the dice roll will certainly give outcome 1 or 2 or 3 or 4 or 5 or 6)

Page 4: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Comments…

• An event E is IMPOSSIBLE if it has zero probability P(E)=0

• An event is CERTAIN if it has probability P(E)=1

• The interesting things happen in between …

Page 5: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Comments on Additivity…

– Additivity: if E and F are two disjoint events, then the probability of their union satisfies: $P(E \cup F) = P(E) + P(F)$

– Probability of E or F is P(E) + P(F)

– E.g. in dice roll: probability of 1 or 2 is P(1)+P(2)

Page 6: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Consequences

• If we use a sample space $\Omega = \{O_1, O_2, O_3, O_4, \dots, O_n\}$, the probabilities of the outcomes $O_i$ must satisfy $P(O_1)+P(O_2)+\dots+P(O_n)=1$

• We will write this sum as: $\sum_{i=1}^{n} P(O_i) = 1$

Page 7: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Consequences

From this axiom we can see that the probability of the empty event is 0: $P(\emptyset) = 0$

(so: there MUST be an outcome, think of dice roll example)

$1 = P(\Omega) = P(\Omega \cup \emptyset) = P(\Omega) + P(\emptyset) = 1 + P(\emptyset)$

Page 8: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Probability Law

• We have seen 3 axioms that must be satisfied by the probability assignment to the outcomes (sample space) and some of their consequences

• BUT: who gives us the probabilities ?

• They are largely an arbitrary design choice (although we will see practical methods)

Page 9: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Example

• Think again of the case of the dice roll.

• Given our knowledge of physics, and the symmetry of a dice, we see no reason why a certain outcome should be more likely than another. So we want: P(1)=P(2)=P(3)=P(4)=P(5)=P(6)

• The normalization axiom gives P(…)=1/6 for each of them

• We can then use these probabilities and the axioms to compute probabilities of more complex events

Page 10: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Example

• Coin toss.

• Again: no reason to prefer one outcome over another, so: P(H)=P(T)=1/2

• Unless …

Page 11: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Frequency information…

• Unless we actually know that specific coin (or dice) and we know the exact frequency of the outcomes in the last few thousand experiments

• Possibly the coin is not fair, and we observe 80% head, 20% tail outcomes …

• We can incorporate this in the model, assigning P(H)=0.8 P(T)=0.2

In the first case we have used our knowledge of the situation; in the second case we have estimated the probabilities by using frequencies
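
As a minimal sketch of the frequency approach, the Python snippet below estimates outcome probabilities as relative frequencies. The function name `estimate_probabilities` and the record of 1000 tosses are illustrative assumptions, not data from the course.

```python
from collections import Counter

def estimate_probabilities(observations):
    """Return the relative frequency of each outcome in the observations."""
    counts = Counter(observations)
    total = len(observations)
    return {outcome: count / total for outcome, count in counts.items()}

# Hypothetical record of 1000 tosses of an unfair coin: 800 heads, 200 tails.
tosses = ["H"] * 800 + ["T"] * 200
probs = estimate_probabilities(tosses)
print(probs)
```

The estimates are nonnegative and sum to 1 by construction, so they satisfy the axioms.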

Page 12: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Probabilistic Model of Coin Toss

• Sample space is: $\Omega = \{H, T\}$

• Possible events are all subsets: {H,T}, {H}, {T}, $\emptyset$ (empty)

• Fair coin P({H})=P({T})=0.5

• P({H,T})=P({H})+P({T})=1

• $P(\emptyset)=0$

• So we have assigned a probability to EACH possible event based on the probabilities on the outcomes, in a way to satisfy all axioms

Page 13: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Model: Toss of Three Coins

• Sample space (8 possible outcomes): $\Omega$ = {HHH, HHT, HTH, HTT, TTT, THH, THT, TTH}

• We assume they are all equally likely, so we assign to each of them probability 1/8

• The probability law should assign probabilities to EVERY POSSIBLE EVENT

Page 14: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Tossing Three Coins

• A possible event: exactly 2 heads occur

• How many outcomes are in this event? Three: {HHT, HTH, THH}

• These are 3 disjoint events, so their union has probability equal to the sum of their probabilities: P({HHT, HTH, THH}) = P({HHT}) + P({HTH}) + P({THH}) = 1/8 + 1/8 + 1/8 = 3/8

Page 15: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Tossing Three Coins

• We can calculate similarly the probability of all possible events, and this gives a probability law that satisfies the axioms.

• We can see that obtaining 3 heads has probability 1/8, less than observing 2 heads (3/8), and so on …

Page 16: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Probability law for finite sample spaces

• For finite sample spaces, we specify the probability law by just assigning probabilities to the individual outcomes

• Often the outcomes are equiprobable; then P(E) = (number of outcomes in E) / (total number of outcomes)
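
The counting rule can be sketched in Python for the three-coin model. This is only an illustration under the equiprobable assumption; the helper name `prob` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 8 sequences of three tosses, assumed equally likely.
sample_space = ["".join(t) for t in product("HT", repeat=3)]

def prob(event):
    """P(E) = number of outcomes in E / total number of outcomes."""
    return Fraction(len(event), len(sample_space))

two_heads = [o for o in sample_space if o.count("H") == 2]
three_heads = [o for o in sample_space if o.count("H") == 3]

print(prob(two_heads))    # 3/8
print(prob(three_heads))  # 1/8
```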

Page 17: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Continuous Sample Space

• In the case of the dart and target, things are different …

• If each outcome is a point, its probability cannot be bigger than zero, else the total probability will exceed one

• Solution: outcomes must be (infinitesimally) small areas, not points

• Do not worry too much about this for now

Page 18: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Properties of Probability Law

• If $E \subseteq F$, then $P(E) \le P(F)$

• $P(E \cup F) = P(E) + P(F) - P(E \cap F)$

• $P(E \cup F) \le P(E) + P(F)$

Assume area of set = probability of event!

Page 19: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Using Probabilistic Models

• Say we want to model an uncertain situation (e.g. an experiment)

• We first decide a sample space and a probability law. This step is somewhat arbitrary, and fully specifies the model.

• Then operating within the model we derive the probabilities of the events of interest, or other properties. This is fully unambiguous.

Page 20: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Example

• We want to choose a day in 2009 when to organize a picnic

• We want to avoid: rain, cold and traffic

• These are three possible events (day=rain; day=cold; day=traffic), not mutually exclusive …

Page 21: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

R,C,T R,nC,T nR,nC,T

R,nC,nT R,nC,T R,nC,T

R,nC,T R,nC,T R,nC,T R,nC,T R,C,T R,nC,T

nR,C,nT R,nC,T R,C,T R,nC,T R,C,T R,nC,T

R,nC,T R,nC,T R,nC,T R,C,T R,nC,T R,nC,T

nR,C,nT nR,C,nT nR,C,nT nR,C,nT nR,C,nT R,C,T

Assume this is a generic month. A random day will have values for R, C, T … we can compute the probability for R (rain), or for nT (not traffic); but also for R AND T …

Page 22: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

[Figure: the same 30-day calendar as on Page 21]

Event: RAIN

Page 23: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

[Figure: the same 30-day calendar as on Page 21]

Event: COLD

Page 24: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

[Figure: the same 30-day calendar as on Page 21]

Event: TRAFFIC

Page 25: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Unions and Intersections of Events

• We may want to calculate the probability of randomly selecting a day that is both not-rainy and not-cold

• Today we talk of probabilities of COMBINATIONS of events

Page 26: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Intersection of Events

• Probability that BOTH events occur simultaneously

• We DEFINE A NEW EVENT consisting of the outcomes that are in both events E and F and we calculate its probability

• New event: $G = E \cap F$

• The probability of both events occurring is $P(E \cap F)$

Page 27: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Intersection of Events

• The probability of this event is the sum of the probabilities of the outcomes that are both in E and in F (e.g.: fraction of days that are both R and T)

• Two events are mutually exclusive (or disjoint) if their intersection is empty (e.g.: R and nR are disjoint)

Page 28: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

[Figure: the 30-day calendar of Page 21, with the events "rain" and "cold" marked]

Event: Rain and cold

Page 29: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Union of Events

• We want to calculate the probability that at least one of the events E and F occurs

• This is the probability of the union event $G = E \cup F$

• The probability of G is the sum of the probabilities of the outcomes that are in either E or F (e.g. fraction of days that are either R or C)

Page 30: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

[Figure: the 30-day calendar of Page 21, with the events "rain" and "cold" marked]

Event: Rain OR cold

Page 31: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Other combinations …

• We can consider the probability of being in E and not in F by considering the probability of being in E and in $F^c$ (the complement of F)

Page 32: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Dice Example…

• Event E = {1,2,3} outcome is small (at most 3)

• Event F = {2,4,6} outcome is even number

• Probability of being either even OR small ?

• Probability of being even AND small ?
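
One way to answer the two questions is to build the union and intersection as sets and count their outcomes. The sketch below assumes a fair dice (all six outcomes equiprobable); the helper name `prob` is hypothetical.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # sample space of one dice roll
E = {1, 2, 3}                # outcome is small
F = {2, 4, 6}                # outcome is even

def prob(event):
    """Probability of an event under equiprobable outcomes."""
    return Fraction(len(event), len(omega))

p_or = prob(E | F)    # even OR small: outcomes {1, 2, 3, 4, 6}
p_and = prob(E & F)   # even AND small: outcome {2}

print(p_or)   # 5/6
print(p_and)  # 1/6
```

The two results are linked by P(E ∪ F) = P(E) + P(F) − P(E ∩ F) = 1/2 + 1/2 − 1/6.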

Page 33: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Outcome    Days   Probability
R,C,T      6      6/30
nR,C,T     0      0/30
R,nC,T     16     16/30
nR,nC,T    1      1/30
R,C,nT     0      0/30
nR,C,nT    6      6/30
R,nC,nT    1      1/30
nR,nC,nT   0      0/30

[Figure: the same 30-day calendar as on Page 21]

Page 34: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Important …

• Calculate the joint probabilities from the table …

• P(R,C,nT)=0/30

• P(R,C,T)=6/30

• P(R,C)=P(R,C,nT)+P(R,C,T)=6/30
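
The same marginalisation can be sketched in Python by summing over all outcomes that carry the required labels. The dictionary below copies the day counts from the table; the helper name `prob` is an illustrative assumption.

```python
from fractions import Fraction

# Day counts for each (rain, cold, traffic) combination, from the table.
counts = {
    ("R", "C", "T"): 6,    ("nR", "C", "T"): 0,
    ("R", "nC", "T"): 16,  ("nR", "nC", "T"): 1,
    ("R", "C", "nT"): 0,   ("nR", "C", "nT"): 6,
    ("R", "nC", "nT"): 1,  ("nR", "nC", "nT"): 0,
}
total = sum(counts.values())  # 30 days

def prob(*labels):
    """Sum the probabilities of all outcomes containing every given label."""
    days = sum(n for outcome, n in counts.items()
               if all(label in outcome for label in labels))
    return Fraction(days, total)

print(prob("R", "C"))  # P(R,C) = P(R,C,nT) + P(R,C,T)
print(prob("R"))       # marginal probability of rain
print(prob("C"))       # marginal probability of cold
```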

Page 35: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Conditional Probability

• What is the probability of rain in this month? (count all rainy days and divide by 30)

– P(R) = #R / #Days

• What is the probability of rain given that it is cold ?

Page 36: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Conditional Probability

• Outcomes of experiment: days

• Being a cold day is an event

• Being a rainy day is an event

• Probability of being cold AND rainy ?

• Cold AND NOT rainy ?

• NOW: Is it more likely to be cold in rainy days ?

• What about: COLD ‘given that’ it is RAINY ?

Page 37: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

• P(d is rainy | d is cold)

• P(d is cold) = 12/30

• P(d is rainy) = 23/30

• P(d is rainy and cold)=6/30

Outcome    Days   Probability
R,C,T      6      6/30
nR,C,T     0      0/30
R,nC,T     16     16/30
nR,nC,T    1      1/30
R,C,nT     0      0/30
nR,C,nT    6      6/30
R,nC,nT    1      1/30
nR,nC,nT   0      0/30

Page 38: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Is it more likely to have rain in cold days ?

• P(rain)=23/30

• What is the rain probability IN THE COLD DAYS ?

• Probability of rain given cold is … P(rain|cold)= P(rain AND cold)/P(cold)

• P(rain|cold)= 6/12=0.5
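
A small Python sketch of this calculation, using the probabilities quoted above. The function name `conditional` is hypothetical.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """Return P(E|F) = P(E and F) / P(F); undefined when P(F) = 0."""
    if p_given == 0:
        raise ValueError("cannot condition on an event of probability 0")
    return p_joint / p_given

p_rain = Fraction(23, 30)
p_cold = Fraction(12, 30)
p_rain_and_cold = Fraction(6, 30)

p_rain_given_cold = conditional(p_rain_and_cold, p_cold)
print(p_rain_given_cold)           # 1/2
print(p_rain_given_cold < p_rain)  # True: in this month rain is less likely on cold days
```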

Page 39: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Definition

• We define conditional probability of E given F: $P(E \mid F) = \frac{P(E \cap F)}{P(F)}$

• Given that F is true, what is the probability of E ?

• In a way, restrict to the case when only F exists, F is the universe here …

Page 40: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Conditional probability

• We can consider the conditional probability P(E|F) as a new probability law defined on a new universe, F

• P(F|F)=1

• All other axioms also remain valid …

Page 41: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Properties of Conditional Probability

• It satisfies all the axioms to be a probability law

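$P(\Omega \mid F) = \frac{P(\Omega \cap F)}{P(F)} = \frac{P(F)}{P(F)} = 1$

$P(E_1 \cup E_2 \mid F) = \dots = P(E_1 \mid F) + P(E_2 \mid F)$ for disjoint $E_1$, $E_2$

$P(F \mid F) = 1$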

Page 42: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Properties of Conditional Probability

• Definition: $P(E \mid F) = \frac{P(E \cap F)}{P(F)}$

• This can be seen as a new probability law in the restricted universe F

• For finite sample spaces: $P(E \mid F) = \frac{\#(E \cap F)}{\#F} = \frac{\text{elements in } E \cap F}{\text{elements in } F}$

Page 43: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independent Events

• We define 2 independent events as follows:

Independent events: P(E|F)=P(E)

Page 44: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independent Events

• 2 independent events: rain and Monday

• 2 dependent events: rain and January

• 2 dependent (?) events: traffic and Monday

• 2 independent events: January and Monday

In theory (not sure about our finite dataset)

Page 45: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bayes Theorem

• Calculation

• P(cold)=12/30=2/5

• P(traffic)=23/30

• P(cold AND traffic)=6/30=1/5

• P(cold|traffic)=6/23

• P(traffic|cold)=1/2

[Figure: the same 30-day calendar as on Page 21]

Page 46: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bayes Theorem

• Calculation

• P(cold)=12/30=2/5

• P(traffic)=23/30

• P(cold AND traffic)=6/30=1/5

• P(cold|traffic)=6/23

• P(traffic|cold)=1/2

P(cold|traffic)P(traffic)=P(cold AND traffic)

P(traffic|cold)P(cold)=P(traffic AND cold)

6/23 * 23/30 = 1/5

½ * 2/5 = 1/5

P(cold|traffic)P(traffic)= P(traffic|cold)P(cold)

Page 47: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bayes Theorem

P(cold|traffic)P(traffic)= P(traffic|cold)P(cold)

P(cold|traffic)= P(traffic|cold)P(cold)/P(traffic)

$P(E \mid F) = \frac{P(F \mid E)\,P(E)}{P(F)}$
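
As a hedged illustration, the sketch below applies the theorem to the rain and cold events, using P(rain|cold)=1/2, P(cold)=12/30 and P(rain)=23/30 from the earlier slides, and checks the answer against the direct definition. The function name `bayes` is hypothetical.

```python
from fractions import Fraction

def bayes(p_f_given_e, p_e, p_f):
    """Return P(E|F) = P(F|E) * P(E) / P(F)."""
    if p_f == 0:
        raise ValueError("P(F) must be positive")
    return p_f_given_e * p_e / p_f

p_cold = Fraction(12, 30)
p_rain = Fraction(23, 30)
p_rain_given_cold = Fraction(1, 2)

# Invert the conditioning: from P(rain|cold) to P(cold|rain).
p_cold_given_rain = bayes(p_rain_given_cold, p_cold, p_rain)
print(p_cold_given_rain)  # 6/23

# Direct check: P(cold AND rain) / P(rain), with P(cold AND rain) = 6/30.
direct = Fraction(6, 30) / p_rain
```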

Page 48: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independent Events

• P(E|F)=P(E)

• E independent of F

Page 49: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independent Events

• Since it was: $P(E \mid F) = \frac{P(E \cap F)}{P(F)}$

• And we are assuming: $P(E \mid F) = P(E)$

• it follows that for independent events: $P(E \cap F) = P(E)\,P(F)$
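
The product rule gives a direct test for independence. The sketch below checks it on the roll of two fair dice; the events chosen and the helper names `prob` and `independent` are illustrative assumptions, not examples from the slides.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equiprobable ordered pairs from two dice.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

def independent(e, f):
    """True when P(E and F) equals P(E) * P(F)."""
    return prob(e & f) == prob(e) * prob(f)

first_even = {o for o in omega if o[0] % 2 == 0}
sum_is_7 = {o for o in omega if sum(o) == 7}
sum_is_8 = {o for o in omega if sum(o) == 8}

print(independent(first_even, sum_is_7))  # True
print(independent(first_even, sum_is_8))  # False
```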

Page 50: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independent Events

• If E and F are independent, so are E and $F^c$

Page 51: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independence of 3 events…

• E,F,G are independent if every subset of these 3 events is independent…

• E,F are independent

• E,G are independent

• F,G are independent

• And: P(E,F,G)=P(E)P(F)P(G)

Page 52: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independent Events

• We can decompose joint probabilities: P(E,F,G)=P(E)P(F)P(G) if they are independent

• Otherwise, we should write: P(E,F,G)=P(E|F,G)P(F|G)P(G)

Page 53: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bernoulli Trials

• Toss a coin N times …

• Probability of starting with H = ½

• Probability of starting with HH = ½ · ½

• …

• Probability of N consecutive H = $(½)^N$
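
A brute-force check of this formula, as a sketch assuming a fair coin and independent tosses: enumerate every sequence of N tosses and count the one that is all heads. The function name `prob_all_heads` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def prob_all_heads(n):
    """Probability of n consecutive heads, by enumerating all 2**n sequences."""
    outcomes = list(product("HT", repeat=n))
    favourable = [o for o in outcomes if all(t == "H" for t in o)]
    return Fraction(len(favourable), len(outcomes))

for n in range(1, 6):
    print(n, prob_all_heads(n), Fraction(1, 2) ** n)
```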

Page 54: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

MATLAB INTERLUDE

• INTERSECT Set intersection.

• INTERSECT(A,B) when A and B are vectors returns the values common to both A and B. The result will be sorted. A and B can be cell arrays of strings.

Page 55: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

MATLAB INTERLUDE

• UNION Set union.

UNION(A,B) when A and B are vectors returns the combined values from A and B but with no repetitions. The result will be sorted.

Page 56: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

MATLAB INTERLUDE

• FIND Find indices of nonzero elements.

• I = FIND(X) returns the linear indices corresponding to the nonzero entries of the array X. X may be a logical expression.

So you can find elements in a set with a given property, and make a new set…

Page 57: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

MATLAB INTERLUDE

• LENGTH Length of vector.

LENGTH(X) returns the length of vector X. It is equivalent to MAX(SIZE(X)) for non-empty arrays and 0 for empty ones.

Page 58: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

MATLAB INTERLUDE

• You can use these set commands to count the elements in various sets, and hence to compute probabilities…

Page 59: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Topics

• Modeling with Random Variables

• Discrete Random Variables

• Events and

• Probability Mass Function

• Examples of RV:

– Bernoulli

– Binomial

– Geometric

• The concept of Expectation…

Page 60: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• We have studied Probabilistic Models in general, the notions of outcome, sample space and event.

• Now an important special case: in many probabilistic models the outcomes are NUMBERS, or can be associated to numbers

Page 61: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Examples of numerical outcome:

– How many people showed up today ?

– How many are sitting next to a statistics major?

– How many days of rain in January ?

– Temperature on a given day ?

• OR we can ASSOCIATE numerical values to non-numerical outcomes …

Page 62: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Associating numerical values to non-numerical outcomes …

HOMEWORK EXPERIMENT

• Outcome: the homework

• Sample space: set of all possible answers you COULD have given

• Associated numerical value: the GRADE

Page 63: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Easier model: multiple choice quiz

• 10 questions, 3 choices each (A,B,C)

• Experiment: give the test to a student

• Outcome: a string of 10 symbols

• Sample space: set of all possible 10-symbol strings

• Numeric value: the grade assigned to each string (some form of distance to ‘correct string’)

Page 64: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• We call RANDOM VARIABLE a real-valued function of the outcome of an experiment

• Given an experiment, and the corresponding set of possible outcomes, a random variable associates a particular number with each outcome

Page 65: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Example:

– Sample space = {AAA, AAB, AAC, ….}

– Random variable: AAA → 3, AAB → 2, AAC → 3, …

• This could be a model of grading a test

Page 66: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Why are RANDOM VARIABLES important ?

• They allow us to model uncertain situations in a quantitative way. We will talk about: the EXPECTED temperature on January 25, or the EXPECTED number of students that will pass the test, etc. …

• We can also talk about expected deviations from this estimate …

Page 67: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES (continuous vs discrete)

• A random variable is called discrete if its range (the set of values it can take) is finite or COUNTABLY infinite

• It is called continuous – for example - if its range is the real axis (but we will not deal with this case today)

Page 68: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Examples of discrete random variables:

– Number of things (number of ‘tails’ in 1000 coin tosses)

– Number of minutes this class will last

– Roll of 2 dice, sum or product of the outputs is a discrete random variable

Page 69: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• The 2- dice example

• Let us call: A=*, B=**, C=***, D=****, E=*****, F=******

Page 70: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Let us consider the following random variable N associated to one dice: N(A)=1, N(B)=2, N(C)=3, N(D)=4, N(E)=5, N(F)=6

Page 71: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Sample space of the 2 dice experiment:

AA, AB, AC, AD, AE, AF,
BA, BB, BC, BD, BE, BF,
CA, CB, CC, CD, CE, CF,
DA, DB, DC, DD, DE, DF,
EA, EB, EC, ED, EE, EF,
FA, FB, FC, FD, FE, FF

Page 72: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Sum random variable: AA → 1+1=2 = S(AA), AB → 1+2=3 = S(AB), …, FF → 6+6=12 = S(FF)

• Range of random variable: {2,3,4,5,6,7,8,9,10,11,12}

Page 73: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Similarly we can define the random variable PRODUCT, etc …

• So after the same experiment (rolling 2 dice) we may define different random variables (sum, absolute difference, product, max, min, etc … of the two individual outcomes …)

• Whatever attaches a numeric value to the OUTCOME of the experiment is a RANDOM VARIABLE

Page 74: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES: important concepts

• A discrete random variable is a real valued function of the outcome of the experiment that can take a finite or countably infinite number of values

• A function of a discrete random variable defines another random variable

• We will define MEAN and VARIANCE of a random variable

• We will define independence and all other concepts we defined in the previous classes

Page 75: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• For discrete random variables we will define PROBABILITY MASS FUNCTIONS, that are probability laws that assign a probability to each possible numerical value the random variable can assume

• It will be analogous to what we have done so far …

Page 76: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES: notation

• We will denote by uppercase letters (X) the random variable, by lowercase letters (x) the actual value it assumes in a given experiment

• So we will talk about the probability that X=x, for example … and we will write it: P({X=x})

Page 77: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

RANDOM VARIABLES

• Look at the website of the course, where we publish the statistics about the past homeworks

• Random variable: GRADE, G

• A particular grade: “g”

• For example we can talk about P({G=27})

Page 83: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Probability Mass Function (PMF)

• The most important way to characterize a random variable is through the probabilities of the values that it can take

• For the random variable X, these are given by the PMF of X, denoted $p_X$.

• If x is any possible value of X, the probability mass of x, $p_X(x)$, is the probability of the event {X=x}, consisting of all outcomes that give rise to a value of X equal to x

• $p_X(x) = P(\{X=x\})$

Page 84: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• Example: experiment = tossing 2 fair coins

• Random Variable X = number of heads obtained (range = {0,1,2})

• Compute the PMF of X

• $p_X(x) = \begin{cases} 1/4 & \text{if } x=0 \\ 1/2 & \text{if } x=1 \\ 1/4 & \text{if } x=2 \\ 0 & \text{otherwise (= impossible)} \end{cases}$

Page 85: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• Event x=0 corresponds to outcome TT

• Event x=1 corresponds to outcomes HT or TH

• Event x=2 corresponds to outcome HH

• Each outcome has probability ¼ hence the probabilities given before …

• (grouping outcomes based on value of random variable = a way to define events )

Page 86: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• Some properties: the events corresponding to each value of the random variable must be disjoint, and form a partition of the sample space

• From probability axioms we obtain: $\sum_x p_X(x) = 1$

Page 87: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• By a similar argument, we have for any set S of possible values of X: $P(X \in S) = \sum_{x \in S} p_X(x)$

In the coin example before, we can say: probability of at least 1 head is ¾ (sum of prob. of 1 head + prob. of 2 heads)

Page 88: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

[Figure: bar chart of a PMF; horizontal axis: outcome (1 to 8); vertical axis: probability (0 to 0.3)]

Page 91: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Functions of Random Variables

• One can generate new random variables as functions of random variables

Page 92: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

CALCULATION OF PMF OF A RANDOM VARIABLE X

• For each possible value x of X:

• Collect all the possible outcomes that give rise to the event {X=x}

• Add their probabilities to obtain pX(x)

• THIS IS IMPORTANT !!
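
The recipe can be sketched in Python for the sum S of two fair dice: group the 36 equiprobable outcomes by the value of S and add their probabilities. Variable names here are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# All 36 outcomes of rolling two dice, each with probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(outcomes))

# For each value s, add up the probabilities of the outcomes with a + b = s.
pmf = defaultdict(Fraction)
for a, b in outcomes:
    pmf[a + b] += p_outcome

for s in sorted(pmf):
    print(s, pmf[s])
```

The resulting masses sum to 1, since the events {S=s} partition the sample space.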

Page 93: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Example

• Probability of having HW grade of at least 30 ?

• Prob G=30 + prob G=31 + … + prob G=40

• Each probability: count number of outcomes, divide by total sample space size

Page 94: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• The PMF of a random variable provides us with several numbers: the probabilities of all possible values of X

• We would like to summarize this in few numbers that represent the PMF

• One such number is the EXPECTATION

Page 95: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• Expected value of X:weighted average of all possible values of X (using probabilities as weights)

Page 96: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• Suppose you roll a dice many times, and each time you receive as many dollars as the outcome of the dice-roll …

• How much money would you ‘expect’ for each roll ?

• We need to specify these terms …

Page 97: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• Suppose you roll the dice K times, and $K_i$ is the number of times the outcome is “i”, so that $\sum_{i=1,\dots,6} K_i = K$

• Sample space = {1,2,3,4,5,6}

• The total amount of money you receive is: $1K_1 + 2K_2 + 3K_3 + 4K_4 + 5K_5 + 6K_6 = \sum_{i=1}^{6} i\,K_i$

Page 98: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

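• The total amount in K rolls is: $1K_1 + 2K_2 + 3K_3 + 4K_4 + 5K_5 + 6K_6 = \sum_{i=1}^{6} i\,K_i$

• So the amount per roll is: $A = \frac{1K_1 + 2K_2 + 3K_3 + 4K_4 + 5K_5 + 6K_6}{K} = \frac{\sum_{i=1}^{6} i\,K_i}{K} = \sum_{i=1}^{6} i\,\frac{K_i}{K}$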

Page 99: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• If we have been rolling the dice many times (= K is very large), we can approximate the probability of an outcome with its frequency: $p_i = K_i/K$

• Then we can write the expected amount of money as: $A = 1p_1 + 2p_2 + 3p_3 + 4p_4 + 5p_5 + 6p_6 = \sum_{i=1}^{6} i\,p_i$

Page 100: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• We define the expected value (expectation, or mean) of a random variable X, with PMF pX, by

$E[X] = \sum_x x\,p_X(x)$
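
A minimal Python sketch of this definition, with a PMF stored as a dictionary from values to probabilities (a representation chosen here for illustration). It is applied to the fair dice of the money example and to the number of heads in two coin tosses.

```python
from fractions import Fraction

def expectation(pmf):
    """E[X] = sum over x of x * p_X(x), for a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Fair dice: each face has probability 1/6.
dice = {x: Fraction(1, 6) for x in range(1, 7)}

# Number of heads in two fair coin tosses.
heads = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

print(expectation(dice))   # 7/2, i.e. 3.5 dollars per roll
print(expectation(heads))  # 1
```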

Page 101: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• Remark: we can consider this as the ‘center of gravity’ of the distribution

[Figure: bar chart of a PMF; horizontal axis: outcome (1 to 8); vertical axis: probability (0 to 0.3)]

Page 102: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance

• Other important quantity to describe PMF.

• Expectation: we know the ‘average’ behavior of the random variable

• But: how often does the random variable deviate from the average behavior ?

Page 103: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance

• Let us create a NEW random variable describing the deviation of X from its mean E[X], and let us study it …

• What is the expected value of the random variable $(X-E[X])^2$ ?

Page 104: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance

• New random variable: $(X-E[X])^2$

• Its expectation: $E[(X-E[X])^2] = \mathrm{Var}(X)$ is called ‘the variance of X’

• It is always nonnegative

• Provides a measure of dispersion of X around its mean

Page 105: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance

[Figure: two bar charts of PMFs over outcomes 1 to 8; one vertical axis runs from 0 to 0.4, the other from 0 to 0.2]

Page 106: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance

• Another related measure of dispersion is the standard deviation of X, defined as the square root of the variance

• From a practical viewpoint, the STD is easier to use because it has the same units as X (i.e.: if X is in meters, STD will be in meters, Var(X) in square meters)

$\sigma_X = \sqrt{\mathrm{var}(X)}$

Page 107: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Calculation of Variance

• Can just study expectation of R.V. $Z=(X-E[X])^2$

• X=…

• Z=…

• Var(X)=E[Z]=…

Page 108: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expected Value of Functions of Random Variables

• Let X be a random variable with PMF $p_X(x)$, and let g(X) be a function of X

The expected value of the random variable g(X) is:

$$E[g(X)] = \sum_x g(x)\, p_X(x)$$

Page 109: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance

• So the variance can be calculated as:

$$\mathrm{Var}(X) = E[(X - E[X])^2] = \sum_x (x - E[X])^2\, p_X(x)$$

Page 110: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Properties of Mean and Variance

• Let X be a random variable and let us consider the linear function: Y=aX+b where a,b are given scalars. Then:

• $E[Y] = aE[X] + b$

• $\mathrm{Var}(Y) = a^2 \cdot \mathrm{Var}(X)$

• This holds ONLY if g(X) is linear: in general E[g(X)] is not g(E[X]) !!
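The rule for E[g(X)] and the two linear properties can be checked numerically. This is a minimal sketch under assumed values (a fair die, a = 3, b = -2); the helper names `expectation` and `variance` are illustrative.

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * p_X(x); g defaults to the identity."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf, g=lambda x: x):
    """Var(g(X)) = E[(g(X) - E[g(X)])^2]."""
    m = expectation(pmf, g)
    return expectation(pmf, lambda x: (g(x) - m) ** 2)

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # assumed example: fair die
a, b = 3, -2                                    # assumed scalars

def linear(x):
    return a * x + b

mean_x, var_x = expectation(pmf), variance(pmf)
mean_y, var_y = expectation(pmf, linear), variance(pmf, linear)

# Linear case: both properties hold exactly
print(mean_y == a * mean_x + b)   # True
print(var_y == a**2 * var_x)      # True

# Nonlinear case g(x) = x^2: E[g(X)] differs from g(E[X])
mean_sq = expectation(pmf, lambda x: x * x)
print(mean_sq, mean_x**2)         # 91/6 49/4
```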

Page 111: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

A useful relation (variance as a function of moments)

• $\mathrm{Var}(X) = E[(X - E[X])^2]$

• $\mathrm{Var}(X) = E[X^2] - (E[X])^2$

• Proof: see later slides for the full proof … Use the relation

$$\mathrm{Var}(X) = E[(X - E[X])^2] = \sum_x (x - E[X])^2\, p_X(x)$$

Page 114: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

A useful relation…

• $\mathrm{Var}(X) = E[(X - E[X])^2]$

• $\mathrm{Var}(X) = E[X^2] - (E[X])^2$

• Proof: either as homework or with the TAs … Use the relation

$$\mathrm{Var}(X) = E[(X - E[X])^2] = \sum_x (x - E[X])^2\, p_X(x)$$

Page 115: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance Calculation

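• $\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$

$$\begin{aligned}
\mathrm{var}(X) &= \sum_x (x - E[X])^2\, p_X(x) \\
&= \sum_x \left(x^2 - 2xE[X] + (E[X])^2\right) p_X(x) \\
&= \sum_x x^2\, p_X(x) - 2E[X] \sum_x x\, p_X(x) + (E[X])^2 \sum_x p_X(x) \\
&= E[X^2] - 2(E[X])^2 + (E[X])^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}$$

We will use this a lot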

Page 116: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Covariance of 2 RVs

• In probability theory and statistics, covariance is a measure of how much two variables change together (variance is a special case of the covariance when the two variables are identical).

• If two variables tend to vary together (that is, when one of them is above its expected value, the other tends to be above its expected value too), then the covariance between the two variables will be positive. On the other hand, if the other variable tends to be below its expected value when the first is above its expected value, then the covariance between the two variables will be negative.

• [From Wikipedia]

Page 117: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Covariance of 2 RVs

• The covariance between two real-valued random variables X and Y, with expected values E[X] = m and E[Y] = n, is defined as

• Cov(X, Y) = E[(X - m)(Y - n)]

Page 118: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

In Matlab

• COV Covariance matrix.

• COV(X), if X is a vector, returns the variance. For matrices, where each row is an observation, and each column a variable, COV(X) is the covariance matrix.

• DIAG(COV(X)) is a vector of variances for each column, and SQRT(DIAG(COV(X))) is a vector of standard deviations.

• COV(X,Y), where X and Y are matrices with the same number of elements, is equivalent to COV([X(:) Y(:)]).

Page 119: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Correlation Coefficient

From wikipedia

Page 120: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Correlation Coefficient Between 2 Random Variables

• CORRCOEF Correlation coefficients.

• R=CORRCOEF(X) calculates a matrix R of correlation coefficients for an array X, in which each row is an observation and each column is a variable.

• R=CORRCOEF(X,Y), where X and Y are column vectors, is the same as R=CORRCOEF([X Y]).

• If C is the covariance matrix, C = COV(X), then CORRCOEF(X) is the matrix whose (i,j)'th element is C(i,j)/SQRT(C(i,i)*C(j,j)).
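For readers without Matlab, the same quantities can be sketched in plain Python. This is not the Matlab implementation: the function names are hypothetical, the data are made up, and the division by n - 1 is an assumption chosen to mirror what I understand to be Matlab's default normalization for COV.

```python
import math

def sample_cov(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs, ys):
    """Correlation coefficient C(i,j) / sqrt(C(i,i) * C(j,j))."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

# Made-up observations: ys rises with xs, zs falls with xs
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
zs = [8.0, 6.0, 4.0, 2.0]

print(sample_cov(xs, ys) > 0, sample_cov(xs, zs) < 0)  # True True
print(round(sample_corr(xs, ys), 6), round(sample_corr(xs, zs), 6))  # 1.0 -1.0
```

The sign of the covariance matches the description on the covariance slide: positive when the variables move together, negative when they move in opposite directions.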

Page 121: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.
Page 122: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

EXTRA MATERIAL BELOW THIS POINT

• WHAT FOLLOWS IS EXTRA MATERIAL FOR REFERENCE

• Not covered in class 1 of week 2 (refers to class 2 of week 2)

Page 123: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.
Page 124: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.
Page 125: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.
Page 126: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bernoulli Random Variable

• Consider the toss of a (generally not fair) coin, probability H = p; prob T = 1-p

• The BERNOULLI random variable is an RV that takes the value 1 or 0 depending on whether the outcome is H or T (remember: an RV is a function of the outcome)

• X=1 if outcome is H; X=0 if outcome is T

Page 127: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bernoulli Random Variable

• The PMF of this Bernoulli RV is:

$$p_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}$$

• Very important RV in modeling any generic situation with just 2 outcomes, e.g. outcome of the football match on Sunday, …

Page 128: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Random Variable

• Experiment = n coin tosses, each one with prob(H)=p; prob(T)=1-p

• The random variable X is the number of heads in the n-toss sequence

• We refer to X as a BINOMIAL RANDOM VARIABLE WITH PARAMETERS n AND p

Page 129: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Random Variable

• The PMF of X consists of the binomial probabilities we have seen some time ago …

• Two parts:

– probability of a sequence with k heads and n-k tails

– number of sequences with k heads and n-k tails

$$p_X(k) = P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Page 130: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Random Variable

• The normalization property can be written as

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1$$

• We will study this more in the future …
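The binomial PMF and its normalization can be checked directly. In this sketch `binomial_pmf` is a hypothetical helper name, and n = 5, p = 1/3 are arbitrary illustrative values.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, Fraction(1, 3)  # assumed example values
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

total = sum(pmf.values())
print(total)   # 1
print(pmf[0])  # 32/243, the probability of no heads: (2/3)^5
```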

Page 131: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Random Variable

• We repeatedly toss the same coin as before.

• RV: number of tosses needed for the first head to come up …

• TTTTTTTH

• TTH

• H

• TTTTTTTTTTTTTTTTTTTTTTTTTTTTH

Page 132: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Random Variable

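• PMF:

$$p_X(k) = (1-p)^{k-1}\, p, \quad k = 1, 2, \ldots$$

two parts: probability of the ‘prefix’ of k-1 tails, and probability of the final H

• Normalization:

$$\sum_{k=1}^{\infty} p_X(k) = \sum_{k=1}^{\infty} (1-p)^{k-1} p = p \sum_{k=0}^{\infty} (1-p)^k = p \cdot \frac{1}{1 - (1-p)} = 1$$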

Page 133: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Random Variable

• This can model the process of you trying to connect with the modem to an internet service provider … (how many fails before 1 success ?)

Page 134: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Poisson Random Variable

Page 135: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Functions of Random Variables

• One can generate new random variables as functions of random variables

Page 136: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• The PMF of a random variable provides us with several numbers: the probabilities of all possible values of X

• We would like to summarize this in few numbers that represent the PMF

• One such number is the EXPECTATION

Page 137: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation

• Expected value of X: weighted average of all possible values of X (using probabilities as weights)

• Next time we will develop this and other concepts…

Page 138: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Conclusion

• Random Variables

• Probability Mass Functions

• How to calculate PMFs

• Bernoulli

• Binomial

• Geometric

• Poisson ?

Page 139: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.
Page 140: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Probability Mass Function (PMF)

• The most important way to characterize a random variable is through the probabilities of the values that it can take

• For the random variable X, these are given by the PMF of X, denoted $p_X$.

• If x is any possible value of X, the probability mass of x, $p_X(x)$, is the probability of the event {X=x}, consisting of all outcomes that give rise to a value of X equal to x

• $p_X(x) = P(\{X = x\})$

Page 141: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• Example: experiment = tossing 2 fair coins

• Random Variable X = number of heads obtained (range = {0,1,2})

• Each outcome has probability ¼, hence the probabilities follow from these events:

• Event X=0 corresponds to outcome TT

• Event X=1 corresponds to outcomes HT or TH

• Event X=2 corresponds to outcome HH

Page 142: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• Compute the PMF of X

$$p_X(x) = \begin{cases} 1/4 & \text{if } x = 0 \\ 1/2 & \text{if } x = 1 \\ 1/4 & \text{if } x = 2 \\ 0 & \text{otherwise (= impossible)} \end{cases}$$
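The PMF of the two-coin example can also be obtained by enumerating the sample space and counting, which is the general recipe when all outcomes are equally likely. The variable names below are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of two tosses: HH, HT, TH, TT (each with probability 1/4)
outcomes = list(product("HT", repeat=2))

# X = number of heads; count how many outcomes map to each value of X
counts = Counter(outcome.count("H") for outcome in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}

print(sorted(pmf.items()))  # values 0, 1, 2 with probabilities 1/4, 1/2, 1/4
print(sum(pmf.values()))    # 1
```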

Page 143: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

[Figure: bar chart of a PMF over outcomes 1 to 8; vertical axis from 0 to 0.3]

Page 144: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• Some properties: since the events corresponding to each value of the random variable must be disjoint, and form a partition of the sample space, from the probability axioms we obtain:

$$\sum_x p_X(x) = 1$$

[Figure: bar chart of a PMF over outcomes 1 to 8; vertical axis from 0 to 0.3]

Page 145: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

PMF

• By a similar argument, we have for any set S of possible values of X:

$$P(X \in S) = \sum_{x \in S} p_X(x)$$

In the coin example before, we can say: the probability of at least 1 head is ¾ (sum of prob. of 1 head + prob. of 2 heads)

[Figure: bar chart of a PMF over outcomes 1 to 8; vertical axis from 0 to 0.3]

Page 147: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bernoulli Random Variable

• Consider the toss of a (generally not fair) coin, probability H = p; prob T = 1-p

• The BERNOULLI random variable is an RV that takes the value 1 or 0 depending on whether the outcome is H or T (remember: an RV is a function of the outcome)

• X=1 if outcome is H; X=0 if outcome is T

Page 148: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Bernoulli Random Variable

• The PMF of this Bernoulli RV is:

$$p_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}$$

• Very important RV in modeling any generic situation with just 2 outcomes, e.g. outcome of the football match on Sunday, …

Page 149: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Mean and Variance

• $E[X] = 1 \cdot p + 0 \cdot (1-p) = p$

• $E[X^2] = 1^2 \cdot p + 0^2 \cdot (1-p) = p$

• $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)$

Page 150: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Uniform Distribution: dice roll

• … see later slides …

Page 159: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Two Important Series

• We do not derive them here. We will apply these to calculations of variance…

$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} \qquad \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}$$

[Figure: triangular arrangements of X marks illustrating the sums]

Page 160: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

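$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$$

$$\sum_{i=a}^{b} i = \sum_{i=1}^{b} i - \sum_{i=1}^{a-1} i = \frac{b(b+1)}{2} - \frac{(a-1)a}{2} = \frac{b^2 - a^2 + b + a}{2} = \frac{(b-a)(b+a) + (b+a)}{2} = \frac{(b-a+1)(a+b)}{2}$$

[Figure: arrangement of X marks illustrating the sum from a to b]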

Page 161: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Uniform Distribution: dice roll

Discrete Uniform PMF over [a,b] (case of the dice rolls)

$$p_X(k) = \begin{cases} \dfrac{1}{b-a+1} & \text{if } k = a, a+1, \ldots, b \\ 0 & \text{otherwise} \end{cases}$$

Page 162: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Uniform

$$p_X(k) = \begin{cases} \dfrac{1}{b-a+1} & \text{if } k = a, a+1, \ldots, b \\ 0 & \text{otherwise} \end{cases}$$

The expectation is:

$$E[X] = \frac{a+b}{2}$$

This can be seen directly, since the PMF is symmetric around (a+b)/2. Or use the series given before...

Dice example: 1+2+3+4+5+6 = 21. Direct computation of expectation: 21/6 = 3.5. Formula says: (1+6)/2 = 3.5

Page 163: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance of Discrete Uniform

• We first study the case where a=1; b=n [the general case will reduce to this]

• We will use the relation: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$

$$E[X^2] = \frac{1}{n} \sum_{k=1}^{n} k^2 = \frac{1}{6}(n+1)(2n+1)$$

Can verify this by induction, or just believe it

Page 164: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance of Discrete Uniform

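$$E[X^2] = \frac{1}{n} \sum_{k=1}^{n} k^2 = \frac{1}{6}(n+1)(2n+1)$$

$$E[X] = \frac{a+b}{2} = \frac{n+1}{2}$$

$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$

$$\mathrm{Var}(X) = \frac{1}{6}(n+1)(2n+1) - \frac{1}{4}(n+1)^2 = \frac{n^2 - 1}{12}$$

Notice: we are still working with the special case a=1; b=n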

Page 165: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance of Discrete Uniform

• Now we can study the general case: by SHIFTING a distribution, its variance does not change (so we can study [a,b] case by studying variance of [1,b-a+1] case)

• So: setting n=b-a+1 in the previous equation gives the general case

Page 166: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Variance of Discrete Uniform

• Example: if I get $1 for each point on the dice, I can expect 3.5 dollars at each roll, with a standard deviation of sqrt(35/12) ≈ 1.7

$$E[X] = \frac{a+b}{2} \qquad \mathrm{var}(X) = \frac{(b-a)(b-a+2)}{12}$$
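The closed-form mean and variance of the discrete uniform distribution can be compared against a direct computation from the PMF. This is only a sketch; the helper names are hypothetical and the ranges [1, 6] and [3, 10] are chosen for illustration.

```python
from fractions import Fraction

def uniform_pmf(a, b):
    """Discrete uniform PMF over the integers a, a+1, ..., b."""
    n = b - a + 1
    return {k: Fraction(1, n) for k in range(a, b + 1)}

def mean_and_variance(pmf):
    """Return (E[X], Var(X)) using Var(X) = E[X^2] - (E[X])^2."""
    mean = sum(x * p for x, p in pmf.items())
    second_moment = sum(x * x * p for x, p in pmf.items())
    return mean, second_moment - mean**2

for a, b in [(1, 6), (3, 10)]:
    mean, var = mean_and_variance(uniform_pmf(a, b))
    # Compare with the closed-form expressions from the slides
    assert mean == Fraction(a + b, 2)
    assert var == Fraction((b - a) * (b - a + 2), 12)

dice_mean, dice_var = mean_and_variance(uniform_pmf(1, 6))
print(dice_mean, dice_var)  # 7/2 35/12
```

The second range also illustrates the shifting argument: [3, 10] has the same variance as [1, 8].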

Page 167: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Random Variable

• Experiment = n coin tosses, each one with prob(H)=p; prob(T)=1-p

• The random variable X is the number of heads in the n-toss sequence

• We refer to X as a BINOMIAL RANDOM VARIABLE WITH PARAMETERS n AND p

Page 168: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Random Variable

• The PMF of X consists of the binomial probabilities we have seen some time ago …

• Two parts:

– probability of a sequence with k heads and n-k tails

– number of sequences with k heads and n-k tails

$$p_X(k) = P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Page 169: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Random Variable

• The normalization property can be written as

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1$$

• We will study this more in the future …

Page 170: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

QUESTION

• There are 94 students

• Each has probability 1/3 to get an A

• The number of students that get an A is a random variable

• What is its mean ? (how many are expected to get an A)

Page 171: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Mean of the Binomial

• If we want the mean of the binomial, we first need to learn how to handle JOINT PMFs of MULTIPLE RANDOM VARIABLES

Page 172: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

JOINT PMFs of MULTIPLE RANDOM VARIABLES

• Consider 2 discrete random variables, X and Y associated with the same experiment

• The probabilities of the values that X and Y can take are captured by the JOINT PMF of X and Y, written $p_{X,Y}$

• $p_{X,Y}(x,y) = P(X = x, Y = y)$

Page 173: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

JOINT PMF of 2 RV

• (if we consider the pair X,Y as a random variable, all ideas transfer …)

• If A is an event (set of pairs (x,y) that have a certain property) then

$$P((X,Y) \in A) = \sum_{(x,y) \in A} p_{X,Y}(x,y)$$

Page 174: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

students

• Consider the random variable $X_i$ that is 1 if student “i” gets an A, and 0 otherwise

$$E[X] = E\left[\sum_i X_i\right] = \sum_i E[X_i]$$

• With n students, each with probability p, this is np
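For the question with 94 students and p = 1/3, the shortcut np can be compared with the mean computed directly from the binomial PMF. This sketch assumes the students' results are independent, so that the number of A grades is binomial; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, p = 94, Fraction(1, 3)

# Mean computed term by term from the binomial PMF
mean_from_pmf = sum(
    k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)
)

# Mean by linearity of expectation: n indicator variables, each with mean p
mean_by_linearity = n * p

print(mean_from_pmf == mean_by_linearity)  # True
print(float(mean_by_linearity))            # about 31.33 students
```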

Page 175: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Conclusion

• Mean of Random Variables

• Variance of Random Variables

• Properties, relations for variance and moments

• Bernoulli

• Discrete Uniform, …

• General Methods for variance calculation

Page 176: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.
Page 177: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

topics

• Some probability distributions

• Some real applications: decision making; modeling clashes between ants

• Modeling the distribution of ‘ping’ times …

Page 178: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Marginalization

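• For a fixed value y,

$$P\{Y = y\} = \sum_x P\{X = x \text{ and } Y = y\}$$

and similarly for a fixed value x,

$$P\{X = x\} = \sum_y P\{X = x \text{ and } Y = y\}$$

• Using the definition of conditional probability, we have:

$$P\{X = x \mid Y = y\} = \frac{P\{X = x \text{ and } Y = y\}}{P\{Y = y\}}$$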
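Marginalization and conditioning can be sketched on a small joint PMF. The joint table below is an assumed example (two fair coin tosses, X = 1 if the first toss is a head, Y = total number of heads), and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

quarter = Fraction(1, 4)
# Assumed joint PMF p(x, y): outcomes TT, TH, HT, HH, each with probability 1/4
joint = {(0, 0): quarter, (0, 1): quarter, (1, 1): quarter, (1, 2): quarter}

def marginal(joint, axis):
    """Sum the joint PMF over the other variable (axis 0 gives X, axis 1 gives Y)."""
    out = defaultdict(Fraction)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_x_given_y(joint, x, y):
    """P{X = x | Y = y} = P{X = x and Y = y} / P{Y = y}."""
    return joint.get((x, y), Fraction(0)) / marginal(joint, 1)[y]

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
print(p_y[1])                              # 1/2
print(conditional_x_given_y(joint, 1, 1))  # 1/2
print(conditional_x_given_y(joint, 1, 2))  # 1
```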

Page 179: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Random Variables

• Joint probability

• Conditional probability

• Independence

Page 180: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Joint Probability

• It is common for several random variables to be defined on the same sample space. If X and Y are random variables, the function

f(x,y) = Pr{X = x and Y = y} is the joint probability mass function of X and Y.

Page 181: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Independent Random Variables

• We define two random variables X and Y to be independent if for all x and y, the events X = x and Y = y are independent or, equivalently, if for all x and y, we have Pr{X = x and Y = y} = Pr{X = x} Pr{Y = y}.

Page 182: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Functions of Random Variables

• Given a set of random variables defined over the same sample space, one can define new random variables as sums, products, or other functions of the original variables.

Page 183: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expected value of a random variable

• The simplest and most useful summary of the distribution of a random variable is the "average" of the values it takes on. The expected value (or, synonymously, expectation or mean) of a discrete random variable X is

$$E[X] = \sum_x x\, p_X(x)$$

Page 184: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation of joint RVs

• Given random variables X and Y, and given their joint PMF P{X=x and Y=y}, what is the expectation of their product, E[XY] ?

• Easy if they are independent …

Page 185: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Expectation of Joint Independent RVs

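$$\begin{aligned}
E[XY] &= \sum_x \sum_y xy\, P\{X = x \text{ and } Y = y\} \\
&= \sum_x \sum_y xy\, P\{X = x\}\, P\{Y = y\} \\
&= \left(\sum_x x\, P\{X = x\}\right) \left(\sum_y y\, P\{Y = y\}\right) \\
&= E[X]\, E[Y]
\end{aligned}$$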
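The product rule for independent random variables can be checked on two independent fair dice, and contrasted with a dependent pair. This is a sketch with made-up variable names; the choice Y = X for the dependent case is an assumption for illustration.

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}

# Independent dice: the joint PMF is the product of the marginals
joint = {(x, y): die[x] * die[y] for x, y in product(die, die)}

e_x = sum(x * p for x, p in die.items())
e_xy_independent = sum(x * y * p for (x, y), p in joint.items())
print(e_xy_independent == e_x * e_x)  # True: 49/4 on both sides

# Fully dependent case Y = X: the product rule fails
e_xy_dependent = sum(x * x * p for x, p in die.items())
print(e_xy_dependent, e_x * e_x)      # 91/6 49/4
```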

Page 186: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

In general…

• In general, when n random variables X1, X2, . . . , Xn are mutually independent,

E[X1 X2 · · · Xn] = E[X1] E[X2] · · · E[Xn] .

Page 187: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

More about independent RVs…

• When X and Y are independent random variables,

• Var[X + Y] = Var[X] + Var[Y].

• (whereas for ANY random variables, independent or not, the expectation of the sum is the sum of their expectations, that is,

• E[X + Y] = E[X] + E[Y] )

Page 188: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

The Geometric Distribution

• A coin flip is an instance of a Bernoulli trial, which is defined as an experiment with only two possible outcomes: success, which occurs with probability p, and failure, which occurs with probability q = 1 - p.

• When we speak of Bernoulli trials collectively, we mean that the trials are mutually independent and that each has the same probability p for success.

• Two important distributions arise from Bernoulli trials: the geometric distribution and the binomial distribution.

Page 189: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Distribution

• Take a sequence of Bernoulli trials, each with a probability p of success and a probability q = 1 - p of failure.

• How many trials occur before we obtain a success?

Page 190: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Distribution

• Let the random variable X be the number of trials needed to obtain a success. Then X has values in the range {1, 2, . . .}, and $\Pr\{X = k\} = q^{k-1} p$ (for k larger than 0), since we have k - 1 failures before the one success.

• A probability distribution satisfying this equation is said to be a geometric distribution.

Page 191: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Distribution

• This is the geometric distribution (picture taken from Cormen, Leiserson and Rivest’s book on Algorithms)

• In this case, the coin has probability p = 1/3 of success and a probability q = 1 - p of failure

Page 192: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric distribution

• Expectation: we can use the relation

$$\sum_{k=0}^{\infty} k x^k = \frac{x}{(1-x)^2}$$

• This holds when the summation is infinite and |x| < 1

Page 193: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Distribution

$$E[X] = \sum_{k=1}^{\infty} k q^{k-1} p = \frac{p}{q} \sum_{k=0}^{\infty} k q^k = \frac{p}{q} \cdot \frac{q}{(1-q)^2} = 1/p$$

With p = 1/3, the expectation of the distribution is 1/p = 3.

Page 194: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Geometric Distribution

• The variance, which can be calculated similarly, is $\mathrm{Var}[X] = q/p^2$

• Example: repeatedly roll two dice until we obtain either a seven or an eleven. Of the 36 possible outcomes, 6 yield a seven and 2 yield an eleven. Thus, the probability of success is p = 8/36 = 2/9, and we must roll 1/p = 9/2 = 4.5 times on average to obtain a seven or eleven.

• NEXT WEEK we will implement things like this ….
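The seven-or-eleven example can be reproduced by enumerating the 36 outcomes of two dice and then applying the geometric formulas 1/p and q/p². This is an illustrative sketch with made-up variable names, not the implementation used in the course.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice
rolls = list(product(range(1, 7), repeat=2))
successes = [r for r in rolls if sum(r) in (7, 11)]

p = Fraction(len(successes), len(rolls))  # probability of success per roll
q = 1 - p

mean_rolls = 1 / p    # expected number of rolls until the first success
var_rolls = q / p**2  # variance of the number of rolls

print(p, mean_rolls, var_rolls)  # 2/9 9/2 63/4
```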

Page 195: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

BINOMIAL DISTRIBUTION

• How many successes occur during n Bernoulli trials, where a success occurs with probability p and a failure with probability q = 1 - p?

Page 196: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Distribution

• Define the random variable X to be the number of successes in n trials. Then X has values in the range {0, 1, . . . , n}, and for k = 0, . . . , n,

$$P\{X = k\} = \binom{n}{k} p^k q^{n-k}$$

• since there are $\binom{n}{k}$ ways to pick which k of the n trials are successes, and the probability that each occurs is $p^k q^{n-k}$. A probability distribution satisfying this equation is said to be a binomial distribution.

Page 197: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Distribution

• Let $X_i$ be the random variable describing the number of successes in the ith trial. Then $E[X_i] = p \cdot 1 + q \cdot 0 = p$, and by linearity of expectation, the expected number of successes for n trials is

$$E[X] = E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} p = np$$

Page 198: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Distribution

• We can proceed similarly for the variance, exploiting the relation $\mathrm{Var}[X] = E[X^2] - E^2[X]$

• Since $X_i$ only takes on the values 0 and 1, we have $E[X_i^2] = E[X_i] = p$

• And hence $\mathrm{Var}[X_i] = p - p^2 = pq$.

• Then we can use independence to move from $\mathrm{Var}[X_i]$ to the variance of the binomial …

$$\mathrm{Var}[X] = \mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i] = \sum_{i=1}^{n} pq = npq$$
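The results np and npq can be checked against moments computed directly from the binomial PMF. The function name `binomial_moments` and the values n = 10, p = 1/4 are assumptions for this sketch.

```python
from fractions import Fraction
from math import comb

def binomial_moments(n, p):
    """Return (mean, variance) computed term by term from the binomial PMF."""
    q = 1 - p
    pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    second_moment = sum(k * k * pk for k, pk in enumerate(pmf))
    return mean, second_moment - mean**2

n, p = 10, Fraction(1, 4)  # assumed example values
mean, var = binomial_moments(n, p)

print(mean == n * p)           # True: mean is np = 5/2
print(var == n * p * (1 - p))  # True: variance is npq = 15/8
```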

Page 199: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Distribution

• The binomial distribution increases as k runs from 0 to n until it reaches the mean np, and then it decreases.

• Picture from Cormen, Leiserson and Rivest’s book

Page 200: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Distribution

Page 201: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Conclusion

• Conditional PMF in RVs

• Independence

• Expectation and Variance for RVs

• Geometric distribution

• Binomial distribution

next: we will implement all of these ideas…

Page 202: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

• EXTRA MATERIAL (NOT COVERED IN CLASS)

Page 203: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Cards

♠ ♣ ♥ ♦

Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King

Page 204: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.
Page 205: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Counting …

• Probability of Generating a growing sequence of cards … (1,2,3,4,5,6,7,8,9,…)

• Probability of starting with a 1 * probability of having a 2 * …* probability of having a king…

Page 206: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

COUNTING METHODS

• How many ways to obtain K heads and N-K tails in N coin tosses ?

• How many ways to have a 4-of-a-kind ?

Page 207: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Basic Counting

• Two experiments are performed. The first one can have any one of N possible outcomes, the second one any of M possible outcomes.

Then there are M×N possible outcomes for the two experiments considered together

Page 208: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Basic Counting

• How many different arrangements of the letters A,B,C are possible ?

• ABC, ACB, BAC, BCA, CAB, CBA

• Each arrangement is known as a PERMUTATION.

• There are 6 possible permutations of a set of 3 objects

• There are N! permutations of a set of N objects: N! = N(N-1)(N-2)…3*2*1

Page 209: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Combinations

• How many different groups of M objects can I form from a total of N objects ? (e.g. how many groups of 5 cards can I form from a deck of 52 ?)

• (there are 52 ways to select the first; 51 to select the second; … but we are counting each group each time we see one of its possible orderings… we need to correct for this …)

• (52*51*50*49*48)/(5*4*3*2*1)

Page 210: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Combinations

• Ways of choosing k elements out of a set of n elements:

$$\binom{n}{k} = \frac{n(n-1)\cdots(n-k+1)}{k!} = \frac{n!}{(n-k)!\,k!}$$

Page 211: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Combinations and Permutations

• How many ways to put N balls in K boxes ?

• Example: OOO11O1O1OOO11

• 1 is the boundary of a box: we will use G = (K-1) 1s

• O is a ball: we will use N Os

• There are (N+G)! orderings of these symbols

• Correct for permutations of the 1s and of the Os: (N+G)!/(N!G!)

• If we define M = N+G, this is M!/((M-G)!G!). Same as before…

In the example: G=6; K=7; N=8; M=14
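The balls-in-boxes count can be verified by brute force for small cases. This sketch assumes the balls are indistinguishable and the boxes are distinguishable, which is how I read the OOO11O1O1OOO11 encoding; the function names are hypothetical.

```python
from itertools import product
from math import comb

def ways_balls_in_boxes(n_balls, k_boxes):
    """(N + G)! / (N! G!) with G = K - 1 box boundaries."""
    g = k_boxes - 1
    return comb(n_balls + g, g)

def brute_force(n_balls, k_boxes):
    """Count all tuples of box occupancies that add up to n_balls."""
    return sum(
        1
        for counts in product(range(n_balls + 1), repeat=k_boxes)
        if sum(counts) == n_balls
    )

print(ways_balls_in_boxes(3, 3), brute_force(3, 3))  # 10 10
print(ways_balls_in_boxes(8, 7))                     # 3003 (N=8, K=7 as in the example)
```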

Page 212: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

COUNTING METHODS

• Combinations VS permutations

• How many sets of 3 numbers out of 10 ?

• How many ordered sets of 3 numbers ?

Page 213: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Pascal’s Triangle

Page 214: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Pascal’s Triangle

Page 215: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Binomial Coefficient and Pascal’s Triangle

• A number in the triangle can be found by nCr (n Choose r) where n is the number of the row and r is the element in that row. For example, in row 3, 1 is the zeroth element, 3 is element number 1, the next three is the 2nd element, and the last 1 is the 3rd element. The formula for nCr is:

$$nCr = \frac{n!}{r!\,(n-r)!}$$

Page 216: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Examples

• How many ways to select 5 cards from the deck ?

• How many ways to have 4 equal cards in a set of 5 ?

• Probability of selecting 5 cards containing a poker ?

Page 217: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Poker Probabilities

• Deck of 52 cards, ranked: ace, king, queen, jack, 10, 9, 8, 7, 6, 5, 4, 3, 2 (and ace again: it can be either high or low)

• 4 suits: spades, hearts, diamonds and clubs

• 5 card draw; 5 cards make up a poker hand

• The highest hand wins

• Hands are ranked as follows:

Page 218: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Poker Probabilities

• Royal flush: 10, J, Q, K, A of the same suit

• Four of a kind: 4 cards of the same RANK

• Full house: 3 cards of the same rank + 2 cards of the same rank

• Flush: 5 cards of the same suit

• …

Page 219: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Poker Probabilities

• How many poker hands ? 2,598,960

$$\binom{52}{5} = 2{,}598{,}960$$

Page 220: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Poker Probabilities

• How many combinations of royal flush? 4 (probability: 0.00000154)

• How many combinations of 4-of-a-kind ? 624

– 13 choices for the rank of the 4 cards (the same for all)

– 12 choices for the rank of the 1 card left: 13 × 12 = 156

– 1 choice in assigning the suit to the 4-of-a-kind (no choice)

– 4 choices for the suit of the card left

– TOTAL: 13 × 12 × 4 = 624
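These counts are easy to reproduce with the standard library. The sketch below recomputes the number of hands and the two probabilities; the variable names are illustrative.

```python
from math import comb

total_hands = comb(52, 5)        # number of 5-card hands
royal_flushes = 4                # one per suit
four_of_a_kind = 13 * 12 * 4     # rank of the four, rank of the fifth card, its suit

p_royal = royal_flushes / total_hands
p_four = four_of_a_kind / total_hands

print(total_hands)         # 2598960
print(four_of_a_kind)      # 624
print(round(p_royal, 8))   # 1.54e-06
print(round(p_four, 6))    # 0.00024
```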

Page 221: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

• Consider a number of experiments with poker cards

• Write down SAMPLE SPACE

• Count possible outcomes for each experiment (see book, or handouts)

♠ ♣ ♥ ♦

Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King

Page 222: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Kind of questions…

• Probability of having King of ♣ at first draw ?

• Probability of having 4 kings ?

• Probability of having any set of 4 equal cards ?

When we ask you to write the sample space for the 5-card experiment, we do not mean to list all of the outcomes (there are about 2.5 million), just to show you know what the sample space is: e.g. {all hands of 5 cards}, or {{2S, 2C, 2D, 2H, 3S}, … {KS, KC, KD, KH, AC}, …}

Page 223: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

How to do the homework…

• Always: write down probabilistic model

• Use one of the 3 formulae we have for COUNTING number of events of a certain type, or of outcomes

• Use definitions like: P(event)=# outcomes in event / #possible outcomes

Page 224: EMAT 20205 Data Analysis WEEK -2 Nello Cristianini.

Combinations

• Ways of choosing k elements out of a set of n elements:

$$\binom{n}{k} = \frac{n(n-1)\cdots(n-k+1)}{k!} = \frac{n!}{(n-k)!\,k!}$$

• HOW MANY COMMITTEES OF 5 PEOPLE CAN WE MAKE OUT OF A CLASS OF 10 PEOPLE ?
