Scales and probability measures

The states of a random variable can be given on different scales

1) Nominal scale

A scale where the states have no numerical interrelationships

Example: The colour of a sampled pill from a seizure of suspected illicit drug pills

Each state can be assigned a probability > 0

2) Numerical scale

a) Discrete states

(i) Ordinal scale

A scale where the states can be put in ascending order

Example: Classification of a dental cavity as small, medium-sized or large

Each state can be assigned a probability > 0

Once probabilities have been assigned, it is also meaningful to interpret statements such as “at most”, “at least”, “smaller than” and “larger than”.

If we denote the random variable by X, the assigned state probabilities are written Pr (X = a ), and we can also interpret Pr (X ≤ a ), Pr (X ≥ a ), Pr (X < a ) and Pr (X > a )

(ii) Interval scale

An ordinal scale where the distance between two consecutive states is the same no matter where in the scale we are.

Example: The number of gunshot residue particles found on the hands of a person suspected to have fired a gun.

The distance between 5 and 4 is the same as the distance between 125 and 124

Probabilities are assigned and interpreted the same way as for an ordinal scale.

Interval scale discrete random variables very often fit into a family of discrete probability distributions where the assignment consists of choosing one or several parameters.

Probabilities can be written in parametric form using a probability mass function. If X denotes the random variable and θ the parameter(s):

$$ p(x \mid \theta) = \Pr(X = x \mid \theta) $$

Examples:

Binomial distribution:

$$ p(x \mid n, \pi) = \binom{n}{x} \pi^{x} (1-\pi)^{n-x} = \frac{n!}{x!\,(n-x)!}\, \pi^{x} (1-\pi)^{n-x}; \quad x = 0, 1, \ldots, n $$

Typical application: The number of “successes” out of n independent trials where for each trial the assigned probability of success is π

Poisson distribution:

$$ p(x \mid \lambda) = \frac{\lambda^{x}}{x!}\, e^{-\lambda}; \quad x = 0, 1, \ldots $$

Typical application: Count data, e.g. the number of times an event occurs in a fixed time period, where λ is the expected number of counts in that period.
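As a minimal sketch of how the binomial and Poisson probability mass functions are evaluated, the Python snippet below implements the two formulas with the standard library only. The function names and the parameter values (10 trials, success probability 0.3, expected count 2.5) are illustrative assumptions and do not come from the slides.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, prob):
    """Pr(X = x) for the number of successes in n independent trials."""
    return comb(n, x) * prob**x * (1 - prob)**(n - x)

def poisson_pmf(x, lam):
    """Pr(X = x) for a count with expected value lam."""
    return lam**x * exp(-lam) / factorial(x)

# Hypothetical parameter values, chosen only for illustration.
n, prob, lam = 10, 0.3, 2.5

# The state probabilities sum to one over the whole range of states.
binomial_total = sum(binomial_pmf(x, n, prob) for x in range(n + 1))

# An "at most" statement is a sum of state probabilities: Pr(X <= 2).
poisson_at_most_2 = sum(poisson_pmf(x, lam) for x in range(3))

print(binomial_pmf(3, n, prob), binomial_total, poisson_at_most_2)
```

Choosing the parameters is the whole "assignment" here: once they are fixed, every statement of the kind Pr(X ≤ a) or Pr(X > a) follows by summation.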

b) Continuous states

(i) Interval scale

This scale is of the same kind as for discrete states

Example: Daily temperature in Celsius degrees

However, a probability > 0 cannot be assigned to a particular state

Instead probabilities can be assigned to intervals of states

The whole range of states has probability one.

The probability of an interval of states depends on the assigned probability density function for the range of states.

Denote the random variable by X. It is thus only meaningful to assign probabilities like Pr ( a < X < b ) [which is equal to Pr ( a ≤ X ≤ b ) ].

Such probabilities are obtained by integrating the assigned density function (see further below)

(ii) Ratio scale

An interval scale with a well-defined zero state.

Example: Measurements of weight and length

The probability measure is the same as for continuous interval scale random variables

The probability density function and probabilities:

The random variable, X, is almost always assumed to belong to a family of continuous probability distributions

The density function is then specified in parametric form:

$$ f(x \mid \theta) $$

and probabilities for intervals of states are computed as integrals:

$$ \Pr(a < X < b) = \int_{a}^{b} f(x \mid \theta)\, dx $$

Examples:

1) Normal distribution, N (165, 6.4), proxy for the length of a randomly selected adult woman

$$ f(x \mid 165, 6.4) = \frac{1}{6.4\sqrt{2\pi}}\; e^{-\frac{(x-165)^{2}}{2 \cdot 6.4^{2}}} $$

[Figure: the density f(x | 165, 6.4) plotted against x for x from 130 to 190]

E.g. Pr(150 < X < 160) is calculated as the area under the curve between 150 and 160 (i.e. an integral)

[Figure: the density f(x | 165, 6.4) plotted against x for x from 130 to 190]
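The area between 150 and 160 can be sketched numerically as follows. This is only an illustration using Python's standard library: it reads 6.4 as the standard deviation of N (165, 6.4), and the variable names and the number of integration steps are assumptions made for the example.

```python
from statistics import NormalDist

# N(165, 6.4) from the example, with 6.4 read as the standard deviation.
dist = NormalDist(mu=165, sigma=6.4)

# Pr(150 < X < 160) as a difference of the cumulative distribution function,
# which is the integral of the density between the two end points.
prob_150_160 = dist.cdf(160) - dist.cdf(150)

# Cross-check: midpoint-rule integration of the density over (150, 160).
steps = 10_000
width = (160 - 150) / steps
riemann = sum(dist.pdf(150 + (i + 0.5) * width) for i in range(steps)) * width

print(round(prob_150_160, 4), round(riemann, 4))
```

Both routes give a probability of roughly 0.21, i.e. about one in five under this model.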

2) Gamma distribution [Gamma (k, θ)]

$$ f(x \mid k, \theta) = \frac{\theta^{k}}{\Gamma(k)}\, x^{k-1} e^{-\theta x}; \quad x \ge 0 $$

with k = 2 and θ = 4 (might be a proxy for the time until death for an organism)

[Figure: the density f(x | 2, 4) plotted against x for x from 0 to 3]

E.g. the probability that the time exceeds 0.5 (for the scaling used) is Pr ( X > 0.5) and is the area under the curve from 0.5 to infinity.

[Figure: the density f(x | 2, 4) plotted against x for x from 0 to 3]
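A hedged sketch of the tail probability Pr ( X > 0.5): the Python code below assumes that the second parameter of Gamma (2, 4) acts as a rate (this matches the plotted density, which peaks at about 1.47 near x = 0.25) and uses the closed form that exists for integer shape k. The function names are illustrative.

```python
from math import exp, factorial, gamma

def gamma_pdf(x, k, rate):
    """Gamma density with shape k and rate parameter `rate`."""
    return rate**k / gamma(k) * x**(k - 1) * exp(-rate * x)

def gamma_tail_integer_shape(x, k, rate):
    """Pr(X > x) for integer shape k (closed form, Erlang case)."""
    return exp(-rate * x) * sum((rate * x)**j / factorial(j) for j in range(k))

k, rate = 2, 4
tail = gamma_tail_integer_shape(0.5, k, rate)

# Cross-check: one minus the midpoint-rule integral of the density on (0, 0.5).
steps = 10_000
width = 0.5 / steps
below = sum(gamma_pdf((i + 0.5) * width, k, rate) for i in range(steps)) * width

print(round(tail, 4), round(1 - below, 4))
```

For k = 2 the closed form reduces to 3·e^(-2), so the probability that the time exceeds 0.5 is about 0.41.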

Probability and Likelihood

Two synonyms?

An event can be likely or probable, which for most people would be the same.

Yet, the definitions of probability and likelihood are different.

In a simplified form:

• The probability of an event measures the degree of belief that this event is true and is used for reasoning about not yet observed events

• The likelihood of an event is a measure of how likely that event is in light of another observed event.

• Both use probability calculus

More formally…

Consider the unobserved event A and the observed event B.

There are probabilities for both, representing the degrees of belief for these events in general:

$$ \Pr(A \mid I), \quad \Pr(B \mid I) $$

However, as B is observed we might be interested in

$$ \Pr(A \mid B, I) $$

which measures the updated degree of belief that A is true once we know that B holds. Still a probability, though.

How interesting is

$$ \Pr(B \mid A, I)\,? $$

Pr (B | A, I ) might look meaningless to consider as we have actually observed B.

However, it says something about A.

We have observed B and if A is relevant for B we may compare Pr (B | A, I ) with Pr (B | Ā, I ) .

Now, even if we have not observed A or Ā, one of them must be true (as a consequence of A and B being relevant for each other)

If Pr (B | A, I ) > Pr (B | Ā, I ) we may conclude that A is more likely to have occurred than Ā, or better phrased: “A is a better explanation of why B has occurred than Ā”

Pr (B | A, I ) is called the likelihood of A given the observed B (and Pr (B | Ā, I ) is the likelihood of Ā ).

Note! This is different from the conditional probability of A given B: Pr (A | B, I )

Extending…

• The likelihood represents what the observed event(s) or data says about A

• The probability represents what the model says about A (with or without conditioning on data)

The likelihood need not be a strict probability expression.

If the data consist of continuous measurements (interval or ratio scale), no distinct probability can be assigned to a specific value, but the specific value might be the event of interest.

Instead, the randomness of an event is measured through the probability density function

$$ f(x); \quad f(x \mid A) $$

where x (usually) stands for a specific value.

Example:

Suppose you have observed a person at quite a long distance and you try to determine the sex of this person.

Your observations are the following:

1) The person was short in length

2) The person’s skull was shaved

Based on observation 1 only your provisional conclusion would be that it is a woman.

This is so because women in general are shorter than men. The likelihood for the event “It is a woman” is the density for women’s lengths evaluated at the length of this person.

[Figure: Histograms and densities for length (red: women, blue: men). Horizontal axis: length, 140 to 220; vertical axis: likelihood; the person's length is marked on the horizontal axis]

Based on observation 2 only your provisional conclusion would be that it is a man.

This is so because more men than women have shaved skulls.

The likelihood here for the event “It is a woman” is the proportion of women that have shaved skulls.

Note that this is different from considering the proportion of women among persons that have the same length as the person of interest.

Note that this is different from considering the proportion of women among persons with shaved skulls.

What if we combine observations 1 and 2?

Provided we can assume that a person’s length is not relevant for whether the person’s skull is shaved or not, the likelihood for “It is a woman” in view of the combined observations is the product of the individual likelihoods.

Note that it would be even more problematic to consider the proportion of women among those persons that have the same length as the person of interest and a shaved skull.

This might lead to a combined likelihood that is equally large for both events (“It is a woman” and “It is a man”).

The general definition of likelihood:

Assume we have a number of unobserved events A, B, … and some observed data.

The observed data can be one specific value (or state) of a variable, x, or a collection of values (states)

A probability model can be used that can either

• assign a distinct probability to the observed data Pr(x | I ) . This is the case when there is either an enumerable set of possible values or when the observed data is a continuous interval of values

or

• evaluate the density of the observed data f (x | I ). This is the case when x is a continuous variable

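The likelihood of A given the data is

$$
L(A \mid \mathit{Data}, I) =
\begin{cases}
\Pr(\mathit{Data} \mid A, I), & \text{if the values of } x \text{ are enumerable or } \mathit{Data} \text{ is an interval of values of } x \\
f(x \mid A, I), & \text{if } x \text{ is continuous and } \mathit{Data} \text{ is one distinct value of } x
\end{cases}
$$

The likelihood ratio of A versus B given the data is

$$ LR = \frac{L(A \mid \mathit{Data}, I)}{L(B \mid \mathit{Data}, I)} $$

LR > 1 ⇒ A is a better explanation than B for the observed data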

Example:

Return to the example with detection of dye on bank notes.

Unobserved event is A = “Dye is present”

Observed event, Data is B = “Method gives positive result”

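$$ L(A \mid \mathit{Data}, I) = \Pr(B \mid A, I) = 0.99, \qquad L(\bar{A} \mid \mathit{Data}, I) = \Pr(B \mid \bar{A}, I) = 0.02 $$

$$ LR = \frac{L(A \mid \mathit{Data}, I)}{L(\bar{A} \mid \mathit{Data}, I)} = \frac{\Pr(B \mid A, I)}{\Pr(B \mid \bar{A}, I)} = \frac{0.99}{0.02} = 49.5 $$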

A positive result makes the event “Dye is present” a better explanation than the event “Dye is absent”

Potential danger in mixing things up:

When we say that an event is the more likely one in light of data we do not say that this event has the highest probability.

Using the likelihood as a measure of how likely an event is amounts to inference to the best explanation.
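The distinction can be made concrete with the dye example. In the sketch below, the values 0.99 and 0.02 are taken from the example, whereas the prior probability of 0.001 for “Dye is present” is a purely hypothetical assumption introduced for illustration; the function names are also illustrative. The posterior probability is obtained from Bayes' theorem on odds form.

```python
def likelihood_ratio(pr_b_given_a, pr_b_given_not_a):
    """LR of A versus not-A given the observed event B."""
    return pr_b_given_a / pr_b_given_not_a

def posterior_probability(prior_probability, lr):
    """Posterior Pr(A | B) from posterior odds = LR * prior odds."""
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = lr * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Likelihoods from the dye example: Pr(B | A, I) and Pr(B | not-A, I).
lr = likelihood_ratio(0.99, 0.02)

# Hypothetical prior: dye assumed present on 1 bank note in 1000.
posterior = posterior_probability(0.001, lr)

print(round(lr, 1), round(posterior, 3))
```

Under this assumed prior, “Dye is present” is by far the better explanation of the positive result (LR = 49.5), yet its posterior probability stays below 5%.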

Logics: Implication:

A ⇒ B

• If A is true then B is true, i.e. Pr(B | A, I ) = 1

• If B is false then A is false, i.e. Pr(A | B̄, I ) = 0

• If B is true we cannot say anything about whether A is true or not (implication is different from equivalence)

“Probabilistic implication”:

• If A is true then B may be true, i.e. Pr(B | A, I ) > 0

• If B is false then A may still be true, i.e. Pr(A | B̄, I ) > 0

• If B is true then we may decide which of A and Ā is the best explanation

Inference to the best explanation:

• B is observed

• A1, A2, … , Am are potential alternative explanations to B

• If for each j ≠ k Pr(B | Ak , I ) > Pr(B | Aj , I ) then Ak is considered the best explanation for B and is provisionally accepted
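The selection rule can be sketched in a few lines of Python. The likelihood values below are hypothetical, and the choice to return None on a tie is an assumption made here because the rule asks for a strict inequality against every other alternative.

```python
def best_explanation(likelihoods):
    """Return the explanation A_k with the largest Pr(B | A_k, I).

    Returns None when the largest likelihood is shared by two or more
    alternatives, since no single explanation is then strictly best.
    """
    ranked = sorted(likelihoods.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

# Hypothetical likelihoods Pr(B | A_k, I) for three alternative explanations.
candidates = {"A1": 0.10, "A2": 0.45, "A3": 0.02}
chosen = best_explanation(candidates)

print(chosen)
```

The result is only provisionally accepted: adding a further alternative with a larger likelihood changes the choice.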

Bayesian hypothesis testing

In an inferential setup we may work with propositions or hypotheses.

A hypothesis is a central component in the building of science and forensic science is no exception. Successive falsification of hypotheses (cf. Popper) is an important component of crime investigation.

The “standard situation” would be that we have two hypotheses:

H0 The forwarded hypothesis

H1 The alternative hypothesis

These must be mutually exclusive

Classical statistical hypothesis testing

(Neyman J. and Pearson E.S. , 1933)

The two hypotheses are different explanations to the Data. Each hypothesis provides model(s) for Data

The purpose is to use Data to try to falsify H0

Type-I-error: Falsifying a true H0

Type-II-error: Not falsifying a false H0

Size or significance level: α = Pr(Type-I-error)

If each hypothesis provides one and only one model for Data, the hypotheses are referred to as simple and

Power: 1 – Pr(Type-II-error) = 1 – β

Most powerful test for simple hypotheses (Neyman-Pearson lemma):

$$ \text{Reject (falsify) } H_0 \text{ when } \frac{L(H_1 \mid \mathit{Data})}{L(H_0 \mid \mathit{Data})} \ge A $$

where A > 0 is chosen so that

$$ P\!\left( \frac{L(H_1 \mid \mathit{Data})}{L(H_0 \mid \mathit{Data})} \ge A \;\middle|\; H_0 \right) = \alpha $$

This minimises β for fixed α.

Note that the probability is taken with respect to Data, i.e. with respect to the probability model each hypothesis provides for Data.

Extension to composite hypotheses: Uniformly most powerful test (UMP)

Example: A seizure of pills, suspected to be Ecstasy, is sampled for the purpose of investigating whether the proportion of Ecstasy pills is “around” 80% or “around” 50%.

In a sample of 50 pills, 39 proved to be Ecstasy pills

As the forwarded hypothesis we can formulate

H0: Around 80% of the pills in the seizure are Ecstasy

and as the alternative hypothesis

H1: Around 50% of the pills in the seizure are Ecstasy

The likelihoods of the two hypotheses are

L (H0 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 80%

L (H1 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 50%

Assuming a large seizure these probabilities can be calculated using a binomial model Bin(50, p ), where H0 states that p = p0 = 0.8 and H1 states that p = p1 = 0.5.

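In generic form, if we have obtained x Ecstasy pills out of n sampled:

$$ L(H_0 \mid \mathit{Data}) = L(H_0 \mid x, n) = \binom{n}{x}\, p_0^{x} (1-p_0)^{n-x} $$

$$ L(H_1 \mid \mathit{Data}) = L(H_1 \mid x, n) = \binom{n}{x}\, p_1^{x} (1-p_1)^{n-x} $$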

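The Neyman-Pearson lemma now states that the most powerful test is of the form

$$
\begin{aligned}
\frac{L(H_1 \mid \mathit{Data})}{L(H_0 \mid \mathit{Data})} \ge A
&\;\Leftrightarrow\; \frac{p_1^{x}(1-p_1)^{n-x}}{p_0^{x}(1-p_0)^{n-x}} \ge A \\
&\;\Leftrightarrow\; x \ln\frac{p_1}{p_0} + (n-x)\ln\frac{1-p_1}{1-p_0} \ge \ln A \\
&\;\Leftrightarrow\; x \left[ \ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0} \right] \ge \ln A - n \ln\frac{1-p_1}{1-p_0} \\
&\;\Leftrightarrow\; x \le \frac{\ln A - n \ln\frac{1-p_1}{1-p_0}}{\ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0}} = B(n)
\end{aligned}
$$

since

$$ p_1 < p_0 \;\Rightarrow\; \ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0} < 0 $$

Hence, H0 should be rejected in favour of H1 as soon as x ≤ B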

How to choose B?

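Normally, we would set the significance level α and then find B so that

$$ P(X \le B \mid H_0) \le \alpha $$

If α is chosen to be 0.05 we can search the binomial distribution valid under H0 for a value B such that

$$ \sum_{k=0}^{B} P(X = k \mid H_0) \le 0.05 \;\Rightarrow\; \sum_{k=0}^{B} \binom{50}{k}\, 0.8^{k}\, 0.2^{50-k} \le 0.05 $$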

MSExcel:

BINOM.INV(50;0.8;0.05) returns the lowest value of B for which the sum is at least 0.05: 35

BINOM.DIST(35;50;0.8;TRUE) returns 0.06072208

BINOM.DIST(34;50;0.8;TRUE) returns 0.030803423

Choose B = 34. Since x = 39 > 34 we cannot reject H0
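The same search can be sketched without a spreadsheet. The Python code below is an illustration using only the standard library: it looks for the largest B whose cumulative probability under H0 does not exceed the significance level, and then reports the actual size and the power against H1. The variable and function names are assumptions made for the example.

```python
from math import comb

def binomial_cdf(b, n, p):
    """Pr(X <= b) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(b + 1))

n, p0, p1, alpha = 50, 0.8, 0.5, 0.05

# Largest B with Pr(X <= B | H0) <= alpha.
critical_b = max(b for b in range(n + 1) if binomial_cdf(b, n, p0) <= alpha)

size = binomial_cdf(critical_b, n, p0)    # actual Pr(Type-I error)
power = binomial_cdf(critical_b, n, p1)   # Pr(reject H0 | H1 is true)
reject_h0 = 39 <= critical_b              # observed x = 39

print(critical_b, round(size, 6), reject_h0)
```

Because the binomial distribution is discrete, the attained size (about 0.031) is smaller than the nominal 0.05.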

Drawbacks with the classical approach

• “Isolated” falsification (or no falsification) – Tests using other data but with the same hypotheses cannot be easily combined

• Data alone “decides”. Small amounts of data ⇒ low power

• Difficulties in interpretation:

When H0 is rejected, it means

“If we repeat the collection of data under (in principle) identical circumstances then

$$ \frac{L(H_1 \mid \mathit{Data})}{L(H_0 \mid \mathit{Data})} \ge A $$

in (at most) 100α % of all cases”

Can we (always) repeat the collection of data?

• “Falling off the cliff” – What is the difference between “just rejecting” and “almost rejecting” ?

The Bayesian Approach

There is always a process that leads to the formulation of the hypotheses. There exists a prior probability for each of them:

$$ p_0 = P(H_0) = P(H_0 \mid I), \qquad p_1 = P(H_1) = P(H_1 \mid I), \qquad p_0 + p_1 = 1 $$

More simply expressed as prior odds for the hypothesis H0:

$$ \mathrm{Odds}(H_0 \mid I) = \frac{p_0}{p_1} = \frac{P(H_0 \mid I)}{P(H_1 \mid I)} $$

Non-informative priors: p0 = p1 = 0.5 gives prior odds = 1

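Data should help us calculate posterior odds

$$ \mathrm{Odds}(H_0 \mid \mathit{Data}, I) = \frac{P(H_0 \mid \mathit{Data}, I)}{P(H_1 \mid \mathit{Data}, I)} = \frac{q_0}{q_1} $$

$$ \Rightarrow\; q_0 = P(H_0 \mid \mathit{Data}, I) = \frac{\mathrm{Odds}(H_0 \mid \mathit{Data}, I)}{\mathrm{Odds}(H_0 \mid \mathit{Data}, I) + 1} $$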

The “hypothesis testing” is then a judgement upon whether q0 is

• small enough to make us believe in H1

• large enough to make us believe in H0

i.e. no pre-setting of the decision direction is made

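The odds ratio (posterior odds/prior odds) is known as the Bayes factor:

$$ B = \frac{\mathrm{Odds}(H_0 \mid \mathit{Data}, I)}{\mathrm{Odds}(H_0 \mid I)} = \frac{P(H_0 \mid \mathit{Data}, I) \,/\, P(H_1 \mid \mathit{Data}, I)}{P(H_0 \mid I) \,/\, P(H_1 \mid I)} $$

How can we obtain the posterior odds?

$$ \mathrm{Odds}(H_0 \mid \mathit{Data}, I) = B \cdot \mathrm{Odds}(H_0 \mid I) $$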

Hence, if we know the Bayes factor, we can calculate the posterior odds (since we can always set the prior odds)

1. Both hypotheses are simple, i.e. give one and only one model each for Data

a) Distinct probabilities can be assigned to Data

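Bayes’ theorem on odds-form then gives

$$ \frac{P(H_0 \mid \mathit{Data}, I)}{P(H_1 \mid \mathit{Data}, I)} = \frac{P(\mathit{Data} \mid H_0, I)}{P(\mathit{Data} \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)} $$

Hence, the Bayes factor is

$$ B = \frac{P(\mathit{Data} \mid H_0, I)}{P(\mathit{Data} \mid H_1, I)} $$

The probabilities of the numerator and denominator respectively can be calculated (estimated) using the model provided by the respective hypothesis.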

b) Data is the observed value x of a continuous (possibly multidimensional) random variable

It can be shown that

$$ \frac{P(H_0 \mid \mathit{Data}, I)}{P(H_1 \mid \mathit{Data}, I)} = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)} $$

where f (x | H0, I ) and f (x | H1, I ) are the probability density functions given by the models specified by H0 and H1 .

Hence, the Bayes factor is

$$ B = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)} $$

Known (or estimated) density functions under each model can then be used to calculate the Bayes factor

In both cases we can see that the Bayes factor is a likelihood ratio, since the numerator and denominator are the likelihoods of the respective hypotheses.

$$ B = \frac{L(H_0 \mid \mathit{Data}, I)}{L(H_1 \mid \mathit{Data}, I)} $$

Example Ecstasy pills revisited

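The likelihoods for the hypotheses are

$$ L(H_0 \mid \mathit{Data}) = \binom{50}{39}\, 0.8^{39}\, 0.2^{11} \approx 0.1271082 $$

$$ L(H_1 \mid \mathit{Data}) = \binom{50}{39}\, 0.5^{39}\, 0.5^{11} \approx 3.317678 \times 10^{-5} $$

$$ \Rightarrow\; B = \frac{0.1271082}{3.317678 \times 10^{-5}} \approx 3831 $$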

Hence, Data are 3831 times more probable if H0 is true compared to if H1 is true.

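Assume we have no particular belief in either of the two hypotheses prior to obtaining the data

$$ \mathrm{Odds}(H_0) = 1 \;\Rightarrow\; \mathrm{Odds}(H_0 \mid \mathit{Data}) = 3831 \cdot 1 \;\Rightarrow\; P(H_0 \mid \mathit{Data}) = \frac{3831}{3831 + 1} \approx 0.9997 $$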

Hence, upon the analysis of data we can be 99.97% certain that H0 is true.

Note however that it may be unrealistic to assume only two possible proportions of Ecstasy pills in the seizure!
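The numbers of this example can be reproduced with a short Python sketch. It uses the binomial model Bin(50, p ) and the even prior odds stated in the example; the function and variable names are illustrative assumptions.

```python
from math import comb

def binomial_likelihood(p, x, n):
    """Likelihood of proportion p given x Ecstasy pills out of n sampled."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

x, n = 39, 50
lik_h0 = binomial_likelihood(0.8, x, n)   # H0: p = 0.8
lik_h1 = binomial_likelihood(0.5, x, n)   # H1: p = 0.5

# For two simple hypotheses the Bayes factor is the likelihood ratio.
bayes_factor = lik_h0 / lik_h1

# Even prior odds, then posterior odds = Bayes factor * prior odds.
prior_odds = 1.0
posterior_odds = bayes_factor * prior_odds
posterior_h0 = posterior_odds / (posterior_odds + 1)

print(round(bayes_factor), round(posterior_h0, 4))
```

The binomial coefficient cancels in the ratio, so the Bayes factor depends only on the two proportions and the counts 39 and 11.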

2. The hypothesis H0 is simple but the hypothesis H1 is composite, i.e. it provides several models for Data (several explanations)

The various models of H1 would (in general) provide different likelihoods for the different explanations. We cannot come up with one unique likelihood for H1.

If, in addition, the different explanations have different prior probabilities, we have to weigh the different likelihoods with these.

If the composition in H1 is in the form of a set of discrete alternatives, the Bayes factor can be written

$$ B = \frac{L(H_0 \mid \mathit{Data})}{\sum_i L(H_{1i} \mid \mathit{Data})\, P(H_{1i} \mid H_1)} $$

where P(H1i | H1) is the conditional prior probability that H1i is true given that H1 is true (relative prior), and the sum is over all alternatives H11 , H12 , …

If the relative priors are (fairly) equal the denominator reduces to the average likelihood of the alternatives.

If the likelihoods of the alternatives are equal the denominator reduces to that likelihood since the relative priors sum to one.

If the composition is defined by a continuously valued parameter, θ, we must use the conditional prior density of θ given that H1 is true, p(θ | H1), and integrate the likelihood with respect to that density.

The Bayes factor can be written

$$ B = \frac{L(H_0 \mid \mathit{Data})}{\int L(\theta \mid \mathit{Data})\, p(\theta \mid H_1)\, d\theta} $$
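For the case of discrete alternatives, a minimal Python sketch could look as follows. It reuses the Ecstasy data (39 out of 50) with H0: proportion 0.8; the three alternative proportions under H1 and their relative priors are hypothetical values chosen only to illustrate the weighting, and the function names are assumptions.

```python
from math import comb

def binomial_likelihood(p, x, n):
    """Likelihood of proportion p given x successes out of n sampled."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def bayes_factor_simple_vs_discrete(p0, alternatives, x, n):
    """B = L(H0 | Data) / sum_i L(H1i | Data) * P(H1i | H1)."""
    weighted = sum(binomial_likelihood(p, x, n) * w
                   for p, w in alternatives.items())
    return binomial_likelihood(p0, x, n) / weighted

x, n = 39, 50

# Hypothetical alternatives under H1: proportion -> relative prior P(H1i | H1).
alternatives = {0.4: 0.25, 0.5: 0.50, 0.6: 0.25}

b_composite = bayes_factor_simple_vs_discrete(0.8, alternatives, x, n)

# With a single alternative of relative prior 1 the simple-vs-simple case returns.
b_single = bayes_factor_simple_vs_discrete(0.8, {0.5: 1.0}, x, n)

print(b_composite, b_single)
```

Since the denominator is a weighted average of the alternatives' likelihoods, the composite Bayes factor always lies between the smallest and the largest of the Bayes factors against the individual alternatives.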

3. Both hypotheses are composite, i.e. each provides several models for Data (several explanations)

This gives different sub-cases, depending on whether the compositions in the hypotheses are discrete or according to a continuously valued parameter.

The “discrete-discrete” case gives the Bayes factor

$$ B = \frac{\sum_j L(H_{0j} \mid \mathit{Data})\, P(H_{0j} \mid H_0)}{\sum_i L(H_{1i} \mid \mathit{Data})\, P(H_{1i} \mid H_1)} $$

and the “continuous-continuous” case gives the Bayes factor

$$ B = \frac{\int L(\theta \mid \mathit{Data})\, p(\theta \mid H_0)\, d\theta}{\int L(\theta \mid \mathit{Data})\, p(\theta \mid H_1)\, d\theta} $$

where p(θ | H0 ) is the conditional prior density of θ given that H0 is true

Example Ecstasy pills revisited again

Assume a more realistic case where we shall use a sample from the seizure to investigate whether the proportion of Ecstasy pills is higher than 80%.

H0: Proportion > 0.8

H1: Proportion ≤ 0.8

i.e. both are composite

We further assume, like before, that we have no particular belief in either of the two hypotheses. The prior density for the proportion θ can thus be defined as

$$
p(\theta) =
\begin{cases}
0.5 / 0.8 = 0.625, & 0 \le \theta \le 0.8 \\
0.5 / 0.2 = 2.5, & 0.8 < \theta \le 1
\end{cases}
$$

The likelihood function is (irrespective of the hypotheses)

$$ L(\theta \mid \mathit{Data}) = \binom{50}{39}\, \theta^{39} (1-\theta)^{11} $$

The conditional prior densities under each hypothesis become uniform over each interval of potential values of θ ( (0.8, 1] and [0, 0.8] ), i.e. p(θ | H0) = 1/0.2 and p(θ | H1) = 1/0.8.

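The Bayes factor is

$$
B = \frac{\int L(\theta \mid \mathit{Data})\, p(\theta \mid H_0)\, d\theta}{\int L(\theta \mid \mathit{Data})\, p(\theta \mid H_1)\, d\theta}
= \frac{\frac{1}{0.2}\int_{0.8}^{1} \binom{50}{39}\, \theta^{39}(1-\theta)^{11}\, d\theta}{\frac{1}{0.8}\int_{0}^{0.8} \binom{50}{39}\, \theta^{39}(1-\theta)^{11}\, d\theta}
= 4 \cdot \frac{\int_{0.8}^{1} \theta^{39}(1-\theta)^{11}\, d\theta}{\int_{0}^{0.8} \theta^{39}(1-\theta)^{11}\, d\theta}
$$

where 1/0.2 and 1/0.8 are the uniform conditional prior densities under H0 and H1, and their ratio gives the factor 4.

How do we solve these integrals?

The Beta distribution:

A random variable is said to have a Beta distribution with parameters a and b if its probability density function is

$$ f(x) = C\, x^{a-1}(1-x)^{b-1}; \quad 0 \le x \le 1, \qquad \text{with } C = \left( \int_{0}^{1} x^{a-1}(1-x)^{b-1}\, dx \right)^{-1} = \frac{1}{\mathrm{B}(a, b)} $$

Hence, we can identify the integrals of the Bayes factor as proportional to different probabilities of the same beta distribution

$$
\frac{\int_{0.8}^{1} \theta^{39}(1-\theta)^{11}\, d\theta}{\int_{0}^{0.8} \theta^{39}(1-\theta)^{11}\, d\theta}
= \frac{\int_{0.8}^{1} C\, \theta^{40-1}(1-\theta)^{12-1}\, d\theta}{\int_{0}^{0.8} C\, \theta^{40-1}(1-\theta)^{12-1}\, d\theta}
$$

namely a beta distribution with parameters a = 40 and b = 12

> num <- 1 - pbeta(0.8, 40, 12)
> den <- pbeta(0.8, 40, 12)
> num
[1] 0.314754
> den
[1] 0.685246
> ratio <- num / den
> ratio
[1] 0.45933

Hence, the ratio of the two integrals is 0.45933 and the Bayes factor is B = 4 · 0.45933 ≈ 1.84

With even prior odds (Odds(H0) = 1) we get the posterior odds equal to the Bayes factor and the posterior probability of H0 is

$$ P(H_0 \mid \mathit{Data}) = \frac{1.84}{1.84 + 1} \approx 0.65 $$

Data do not provide us with clear evidence against either of the hypotheses.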
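The two Beta(40, 12) probabilities can be cross-checked without R. The Python sketch below is an illustration that relies on a standard identity for integer parameters, Pr(Y ≤ x) = Pr(Z ≥ a) with Y ~ Beta(a, b) and Z ~ Bin(a + b − 1, x); the function and variable names are assumptions. It also weights the two integrals by the uniform conditional prior densities 1/0.2 (on (0.8, 1]) and 1/0.8 (on [0, 0.8]), as the definition of the Bayes factor for two composite hypotheses requires.

```python
from math import comb

def beta_cdf_integer(x, a, b):
    """Pr(Y <= x) for Y ~ Beta(a, b) with integer a and b.

    Uses the identity Pr(Y <= x) = Pr(Z >= a) with Z ~ Bin(a + b - 1, x).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

den = beta_cdf_integer(0.8, 40, 12)   # Beta probability of [0, 0.8]
num = 1 - den                         # Beta probability of (0.8, 1]
ratio = num / den                     # ratio of the two integrals

# Weight by the uniform conditional prior densities under H0 and H1.
bayes_factor = (num / 0.2) / (den / 0.8)

# Even prior odds: posterior odds equal the Bayes factor.
posterior_h0 = bayes_factor / (bayes_factor + 1)

print(round(ratio, 5), round(bayes_factor, 3), round(posterior_h0, 3))
```

The unweighted ratio of the integrals is about 0.459; applying the conditional prior densities multiplies it by 4, which gives a Bayes factor of about 1.84 and a posterior probability of H0 of about 0.65 under even prior odds.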
