Scales and probability measures
The states of a random variable can be given on different scales
1) Nominal scale
A scale where the states have no numerical interrelationships
Example: The colour of a sampled pill from a seizure of suspected illicit drug pills
Each state can be assigned a probability > 0
2) Numerical scale
a) Discrete states
(i) Ordinal scale
A scale where the states can be put in ascending order
Example: Classification of a dental cavity as small, medium-sized or large
Each state can be assigned a probability > 0
Once probabilities have been assigned it is also meaningful to interpret statements such as “at most”, “at least”, “smaller than”, “larger than”, …
If we denote the random variable by X, the assigned state probabilities are written Pr(X = a), and we can also interpret Pr(X ≤ a), Pr(X ≥ a), Pr(X < a) and Pr(X > a)
(ii) Interval scale
An ordinal scale where the distance between two consecutive states is the same no matter where in the scale we are.
Example: The number of gunshot residue particles found on the hands of a person suspected to have fired a gun.
The distance between 5 and 4 is the same as the distance between 125 and 124
Probabilities are assigned and interpreted the same way as for an ordinal scale.
Interval scale discrete random variables very often fit into a family of discrete probability distributions, where the assignment consists of choosing one or several parameters.
Probabilities can be written in parametric form using a probability mass function, e.g. if X denotes the random variable:
$p(x \mid \theta) = \Pr(X = x \mid \theta)$
Examples:
Binomial distribution:
$p(x \mid n, \pi) = \frac{n!}{x!\,(n-x)!} \, \pi^x (1-\pi)^{n-x}; \quad x = 0, 1, \ldots, n$
Typical application: The number of “successes” out of n independent trials, where for each trial the assigned probability of success is $\pi$
Poisson distribution:
$p(x \mid \lambda) = \frac{\lambda^x}{x!} \, e^{-\lambda}; \quad x = 0, 1, \ldots$
Typical application: Count data, e.g. the number of times an event occurs in a fixed time period, where $\lambda$ is the expected number of counts in that period.
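The two probability mass functions can be evaluated directly. The sketch below uses only the Python standard library; the function names and the parameter values (10 trials with success probability 0.3, expected count 2.5) are illustrative assumptions, not values from these slides.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, pi):
    """Pr(X = x) for X ~ Binomial(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


def poisson_pmf(x, lam):
    """Pr(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)


# Hypothetical parameter values, chosen only for illustration.
print(binomial_pmf(3, 10, 0.3))   # probability of exactly 3 successes in 10 trials
print(poisson_pmf(2, 2.5))        # probability of exactly 2 counts

# The state probabilities of a discrete distribution sum to one.
print(sum(binomial_pmf(x, 10, 0.3) for x in range(11)))
```

Statements such as Pr(X ≤ a) are then obtained by summing the mass function over the relevant states.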
b) Continuous states
(i) Interval scale
This scale is of the same kind as for discrete states
Example: Daily temperature in Celsius degrees
However, a probability > 0 cannot be assigned to a particular state
Instead probabilities can be assigned to intervals of states
The whole range of states has probability one.
The probability of an interval of states depends on the assigned probability density function for the range of states.
Denote the random variable by X. It is thus only meaningful to assign probabilities like Pr(a < X < b) [which is equal to Pr(a ≤ X ≤ b)].
Such probabilities are obtained by integrating the assigned density function (see further below)
(ii) Ratio scale
An interval scale with a well-defined zero state.
Example: Measurements of weight and length
The probability measure is the same as for continuous interval scale random variables
The probability density function and probabilities:
The random variable, X, is almost always assumed to belong to a family of continuous probability distributions
The density function is then specified in parametric form:
$f(x \mid \theta)$
and probabilities for intervals of states are computed as integrals:
$\Pr(a < X < b) = \int_a^b f(x \mid \theta)\,dx$
Examples:
1) Normal distribution, N(165, 6.4), a proxy for the length of a randomly selected adult woman
$f(x \mid 165, 6.4) = \frac{1}{6.4\sqrt{2\pi}} \, e^{-\frac{(x-165)^2}{2 \cdot 6.4^2}}$
[Figure: density f(x | 165, 6.4) plotted for x from 130 to 190]
E.g. Pr(150 < X < 160) is calculated as the area under the curve between 150 and 160 (i.e. an integral)
[Figure: the same density, illustrating the area between 150 and 160]
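The area between 150 and 160 can be computed without numerical integration, because the normal distribution function can be expressed through the error function. A minimal sketch, assuming (as the density above indicates) that 6.4 is the standard deviation; the function names are illustrative.

```python
from math import erf, sqrt


def normal_cdf(x, mean, sd):
    """Pr(X <= x) for X ~ N(mean, sd), where sd is the standard deviation."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))


def interval_probability(a, b, mean, sd):
    """Pr(a < X < b), i.e. the area under the density between a and b."""
    return normal_cdf(b, mean, sd) - normal_cdf(a, mean, sd)


p = interval_probability(150, 160, 165, 6.4)
print(round(p, 3))
```

With these parameters the probability comes out at roughly 0.21.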
2) Gamma distribution [Gamma(k, θ)]
$f(x \mid k, \theta) = \frac{\theta^k}{\Gamma(k)} \, x^{k-1} e^{-\theta x}; \quad x \ge 0$
with k = 2 and θ = 4 (might be a proxy for the time until death for an organism)
[Figure: density f(x | 2, 4) plotted for x from 0 to 3]
E.g. the probability that the time exceeds 0.5 (for the scaling used) is Pr(X > 0.5), which is the area under the curve from 0.5 to infinity.
[Figure: the same density, illustrating the area to the right of 0.5]
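For an integer shape parameter k the tail area has a closed form (the Erlang case), so Pr(X > 0.5) can be evaluated directly. The sketch assumes that the second parameter, 4, acts as a rate, which is consistent with the plotted density peaking near 1.4; the function name is illustrative.

```python
from math import exp, factorial


def gamma_tail_integer_shape(x, k, rate):
    """Pr(X > x) for a Gamma distribution with positive integer shape k.

    Assumes the rate parametrisation f(x) = rate**k * x**(k-1) * exp(-rate*x) / (k-1)!.
    """
    return sum((rate * x) ** j * exp(-rate * x) / factorial(j) for j in range(k))


p = gamma_tail_integer_shape(0.5, 2, 4)
print(round(p, 3))
```

For k = 2 the sum reduces to (1 + 4x)·exp(−4x), which is about 0.41 at x = 0.5.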
Probability and Likelihood
Two synonyms?
An event can be likely or probable, which for most people would be the same.
Yet, the definitions of probability and likelihood are different.
In a simplified form:
• The probability of an event measures the degree of belief that this event is true and is used for reasoning about not yet observed events
• The likelihood of an event is a measure of how likely that event is in light of another observed event.
• Both use probability calculus
More formally…
Consider the unobserved event A and the observed event B.
There are probabilities for both, representing the degrees of belief for these events in general:
$\Pr(A \mid I), \quad \Pr(B \mid I)$
However, as B is observed we might be interested in
$\Pr(A \mid B, I)$
which measures the updated degree of belief that A is true once we know that B holds. Still a probability, though.
How interesting is
$\Pr(B \mid A, I)$?
Pr (B | A, I ) might look meaningless to consider as we have actually observed B.
However, it says something about A.
We have observed B and if A is relevant for B we may compare Pr (B | A, I ) with Pr (B | Ā, I ) .
Now, even if we have not observed A or Ā, one of them must be true (as a consequence of A and B being relevant for each other)
If Pr(B | A, I) > Pr(B | Ā, I) we may conclude that A is more likely to have occurred than Ā, or better phrased: “A is a better explanation of why B has occurred than Ā”
Pr (B | A, I ) is called the likelihood of A given the observed B (and Pr (B | Ā, I ) is the likelihood of Ā ).
Note! This is different from the conditional probability of A given B: Pr (A | B, I )
Extending…
• The likelihood represents what the observed event(s) or data say about A
• The probability represents what the model says about A (with or without conditioning on data)
The likelihood need not be a strict probability expression.
If the data consist of continuous measurements (interval or ratio scale), no distinct probability can be assigned to a specific value, but the specific value might be the event of interest.
Instead, the randomness of an event is measured through the probability density function
$f(x \mid \theta)$; $f(x \mid A)$
where x (usually) stands for a specific value.
Example:
Suppose you have observed a person at quite a long distance and you try to determine the sex of this person.
Your observations are the following:
1) The person was short in length
2) The person’s skull was shaved
Based on observation 1 only your provisional conclusion would be that it is a woman.
This is so because women in general are shorter than men. The likelihood for the event “It is a woman” is the density for women’s lengths evaluated at the length of this person.
[Figure: Histograms and densities for length (red: women, blue: men); horizontal axis: length, vertical axis: likelihood; the person's length is marked]
Based on observation 2 only your provisional conclusion would be that it is a man.
This is so because more men than women have shaved skulls.
The likelihood here for the event “It is a woman” is the proportion of women that have shaved skulls.
Note that the likelihood from observation 1 is different from the proportion of women among persons that have the same length as the person of interest.
Likewise, the likelihood from observation 2 is different from the proportion of women among persons with shaved skulls.
What if we combine observations 1 and 2?
Provided we can assume that a person’s length is not relevant for whether the person’s skull is shaved or not, the likelihood for “It is a woman” in view of the combined observations is the product of the individual likelihoods.
Note that it would be even more problematic to consider the proportion of women among those persons that have the same length as the person of interest and a shaved skull.
This might lead to a combined likelihood that is equally large for both events (“It is a woman” and “It is a man”).
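The combination of the two observations can be sketched numerically. Apart from the N(165, 6.4) model for women's lengths used earlier, every number below is a hypothetical placeholder (the observed length, the model for men's lengths and the proportions with shaved skulls); the point is only how a density value and a proportion multiply into one likelihood per event.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd) at x, where sd is the standard deviation."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))


# Hypothetical inputs, for illustration only.
observed_length = 168.0
length_model = {"woman": (165.0, 6.4), "man": (180.0, 7.0)}  # (mean, sd)
shaved_proportion = {"woman": 0.005, "man": 0.05}

likelihood_length = {}
combined_likelihood = {}
for sex, (mean, sd) in length_model.items():
    # Observation 1: a density evaluated at the observed length.
    likelihood_length[sex] = normal_pdf(observed_length, mean, sd)
    # Observation 2: a proportion; multiplying assumes length and shaving are unrelated.
    combined_likelihood[sex] = likelihood_length[sex] * shaved_proportion[sex]

for sex in length_model:
    print(sex, likelihood_length[sex], shaved_proportion[sex], combined_likelihood[sex])
```

With these placeholder values observation 1 alone favours “It is a woman”, observation 2 alone favours “It is a man”, and the product favours “It is a man”; other inputs could tip the balance the other way or make the two combined likelihoods nearly equal.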
The general definition of likelihood:
Assume we have a number of unobserved events A, B, … and some observed data.
The observed data can be one specific value (or state) of a variable, x, or a collection of values (states)
A probability model can be used that can either
• assign a distinct probability to the observed data Pr(x | I ) . This is the case when there is either an enumerable set of possible values or when the observed data is a continuous interval of values
or
• evaluate the density of the observed data f (x | I ). This is the case when x is a continuous variable
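The likelihood of A given the data is
$L(A \mid Data, I) = \begin{cases} \Pr(Data \mid A, I), & \text{if the values of } x \text{ are enumerable or } Data \text{ is an interval of values of } x \\ f(x \mid A, I), & \text{if } x \text{ is continuous and } Data \text{ is one distinct value of } x \end{cases}$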
The likelihood ratio of A versus B given the data is
$LR = \frac{L(A \mid Data, I)}{L(B \mid Data, I)}$
LR > 1 ⇒ A is a better explanation than B for the observed data
Example:
Return to the example with detection of dye on bank notes.
Unobserved event is A = “Dye is present”
Observed event, Data is B = “Method gives positive result”
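$L(A \mid Data, I) = \Pr(B \mid A, I) = 0.99$
$L(\bar{A} \mid Data, I) = \Pr(B \mid \bar{A}, I) = 0.02$
$LR = \frac{L(A \mid Data, I)}{L(\bar{A} \mid Data, I)} = \frac{\Pr(B \mid A, I)}{\Pr(B \mid \bar{A}, I)} = \frac{0.99}{0.02} = 49.5$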
A positive result makes the event “Dye is present” a better explanation than the event “Dye is absent”
Potential danger in mixing things up:
When we say that an event is the more likely one in light of data we do not say that this event has the highest probability.
Using the likelihood as a measure of how likely an event is amounts to inference to the best explanation.
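The difference between “better explanation” and “highest probability” can be made concrete with the dye example. The likelihoods 0.99 and 0.02 are those given above; the prior probability of 0.001 for “Dye is present” is a hypothetical value introduced only for this sketch, and the update uses Bayes' theorem on odds form.

```python
def posterior_probability(prior, likelihood_a, likelihood_not_a):
    """Pr(A | B, I) from Pr(A | I) and the two likelihoods, via odds."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = (likelihood_a / likelihood_not_a) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


likelihood_ratio = 0.99 / 0.02          # from the example: 49.5
hypothetical_prior = 0.001              # assumed, not from the slides
posterior = posterior_probability(hypothetical_prior, 0.99, 0.02)

print(likelihood_ratio)
print(round(posterior, 4))
```

Under this assumed prior, “Dye is present” is clearly the better explanation of the positive result (LR = 49.5), yet its posterior probability stays below 5%, so it is not the more probable event.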
Logic: Implication:
$A \Rightarrow B$
• If A is true then B is true, i.e. $\Pr(B \mid A, I) = 1$
• If B is false then A is false, i.e. $\Pr(A \mid \bar{B}, I) = 0$
• If B is true we cannot say anything about whether A is true or not (implication is different from equivalence)
“Probabilistic implication”:
$A \overset{\Pr}{\Rightarrow} B$
• If A is true then B may be true, i.e. $\Pr(B \mid A, I) > 0$
• If B is false then A may still be true, i.e. $\Pr(A \mid \bar{B}, I) > 0$
• If B is true then we may decide which of A and Ā is the best explanation
Inference to the best explanation:
• B is observed
• A1, A2, … , Am are potential alternative explanations to B
• If Pr(B | Ak, I) > Pr(B | Aj, I) for each j ≠ k, then Ak is considered the best explanation for B and is provisionally accepted
Bayesian hypothesis testing
In an inferential setup we may work with propositions or hypotheses.
A hypothesis is a central component in the building of science and forensic science is no exception. Successive falsification of hypotheses (cf. Popper) is an important component of crime investigation.
The “standard situation” would be that we have two hypotheses:
H0 The forwarded hypothesis
H1 The alternative hypothesis
These must be mutually exclusive
Classical statistical hypothesis testing
(Neyman J. and Pearson E.S. , 1933)
The two hypotheses are different explanations to the Data. Each hypothesis provides model(s) for Data
The purpose is to use Data to try to falsify H0
Type-I-error: Falsifying a true H0
Type-II-error: Not falsifying a false H0
Size or significance level: α = Pr(Type-I-error)
If each hypothesis provides one and only one model for Data:
Power: 1 – Pr(Type-II-error) = 1 – β
The hypotheses are then referred to as simple
Most powerful test for simple hypotheses (Neyman-Pearson lemma):
Reject (falsify) $H_0$ when $\frac{L(H_1 \mid Data)}{L(H_0 \mid Data)} \ge A$
where A > 0 is chosen so that
$P\left( \frac{L(H_1 \mid Data)}{L(H_0 \mid Data)} \ge A \,\middle|\, H_0 \right) = \alpha$
This minimises β for fixed α.
Note that the probability is taken with respect to Data, i.e. with respect to the probability model each hypothesis provides for Data.
Extension to composite hypotheses: Uniformly most powerful test (UMP)
Example: A seizure of pills, suspected to be Ecstasy, is sampled for the purpose of investigating whether the proportion of Ecstasy pills is “around” 80% or “around” 50%.
In a sample of 50 pills, 39 proved to be Ecstasy pills
As the forwarded hypothesis we can formulate
H0: Around 80% of the pills in the seizure are Ecstasy
and as the alternative hypothesis
H1: Around 50% of the pills in the seizure are Ecstasy
The likelihoods of the two hypotheses are
L (H0 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 80%
L (H1 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 50%
Assuming a large seizure these probabilities can be calculated using a binomial model Bin(50, p ), where H0 states that p = p0 = 0.8 and H1 states that p = p1 = 0.5.
In generic form, if we have obtained x Ecstasy pills out of n sampled:
$L(H_0 \mid Data) = L(H_0 \mid x, n) = \binom{n}{x} p_0^x (1 - p_0)^{n-x}$
$L(H_1 \mid Data) = L(H_1 \mid x, n) = \binom{n}{x} p_1^x (1 - p_1)^{n-x}$
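The Neyman-Pearson lemma now states that the most powerful test is of the form
$\frac{L(H_1 \mid Data)}{L(H_0 \mid Data)} \ge A \iff \frac{p_1^x (1-p_1)^{n-x}}{p_0^x (1-p_0)^{n-x}} \ge A$
$\iff x \ln\frac{p_1}{p_0} + (n - x) \ln\frac{1-p_1}{1-p_0} \ge \ln A$
$\iff x \left( \ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0} \right) \ge \ln A - n \ln\frac{1-p_1}{1-p_0}$
$\iff x \le \frac{\ln A - n \ln\frac{1-p_1}{1-p_0}}{\ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0}} = B$
since $p_1 < p_0 \Rightarrow \ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0} < 0$
Hence, H0 should be rejected in favour of H1 as soon as x ≤ B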
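The equivalence between the likelihood-ratio rule and the rule “x ≤ B” can be checked by brute force over all possible counts. In the sketch below n, p0 and p1 are those of the example, while the cut-off A = 1 is an arbitrary value picked only to illustrate the algebra; the function names are illustrative.

```python
from math import log


def likelihood_ratio(x, n, p0, p1):
    """L(H1 | Data) / L(H0 | Data); the binomial coefficients cancel."""
    return (p1 / p0) ** x * ((1 - p1) / (1 - p0)) ** (n - x)


def count_threshold(A, n, p0, p1):
    """The bound B such that the ratio is >= A exactly when x <= B (requires p1 < p0)."""
    numerator = log(A) - n * log((1 - p1) / (1 - p0))
    denominator = log(p1 / p0) - log((1 - p1) / (1 - p0))
    return numerator / denominator


n, p0, p1 = 50, 0.8, 0.5
A = 1.0  # hypothetical cut-off, for illustration only

B = count_threshold(A, n, p0, p1)
rejected_by_ratio = [x for x in range(n + 1) if likelihood_ratio(x, n, p0, p1) >= A]
rejected_by_count = [x for x in range(n + 1) if x <= B]

print(B)
print(rejected_by_ratio == rejected_by_count)
```

Both rules select the same set of counts, which is why the test can be stated in terms of x alone.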
How to choose B?
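Normally, we would set the significance level α and then find B so that
$P(X \le B \mid H_0) \le \alpha$
If α is chosen to be 0.05, we can search the binomial distribution valid under H0 for a value B such that
$P(X \le B \mid H_0) = \sum_{k=0}^{B} P(X = k \mid H_0) = \sum_{k=0}^{B} \binom{50}{k} \, 0.8^k \, 0.2^{50-k} \le 0.05$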
MS Excel:
BINOM.INV(50;0.8;0.05) returns the smallest value of B for which the cumulative sum is at least 0.05: 35
BINOM.DIST(35;50;0.8;TRUE) = 0.06072208
BINOM.DIST(34;50;0.8;TRUE) = 0.030803423
Choose B = 34. Since x = 39 > 34, we cannot reject H0
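The same search can be done without a spreadsheet. The sketch below is a plain-Python equivalent of the Excel lookup, not a reproduction of Excel's own algorithm; the function names are illustrative.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def critical_value(n, p0, alpha):
    """Largest B with P(X <= B | H0) <= alpha, or None if no such B exists."""
    best = None
    for b in range(n + 1):
        if binomial_cdf(b, n, p0) <= alpha:
            best = b
        else:
            break  # the cumulative probability only grows with b
    return best


B = critical_value(50, 0.8, 0.05)
print(B, binomial_cdf(B, 50, 0.8), binomial_cdf(B + 1, 50, 0.8))

observed = 39
print("reject H0" if observed <= B else "cannot reject H0")
```

The search stops at B = 34, where the cumulative probability is still below 0.05, and the observed count of 39 lies outside the rejection region.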
Drawbacks with the classical approach
• “Isolated” falsification (or no falsification) – Tests using other data but with the same hypotheses cannot be easily combined
• Data alone “decides”. Small amounts of data ⇒ low power
• Difficulties in interpretation:
When H0 is rejected, it means
“If we repeat the collection of data under (in principle) identical circumstances, and H0 is true, then
$\frac{L(H_1 \mid Data)}{L(H_0 \mid Data)} \ge A$
in (at most) 100α % of all cases”
Can we (always) repeat the collection of data?
• “Falling off the cliff” – What is the difference between “just rejecting” and “almost rejecting” ?
The Bayesian Approach
There is always a process that leads to the formulation of the hypotheses. There exists a prior probability for each of them:
$p_0 = P(H_0) = P(H_0 \mid I)$
$p_1 = P(H_1) = P(H_1 \mid I)$
$p_1 = 1 - p_0$
More simply expressed as prior odds for the hypothesis H0:
$Odds(H_0 \mid I) = \frac{p_0}{p_1} = \frac{P(H_0 \mid I)}{P(H_1 \mid I)}$
Non-informative priors: p0 = p1 = 0.5 gives prior odds = 1
Data should help us calculate posterior odds
$Odds(H_0 \mid Data, I) = \frac{P(H_0 \mid Data, I)}{P(H_1 \mid Data, I)} = \frac{q_0}{q_1}$
$q_0 = P(H_0 \mid Data, I) = \frac{Odds(H_0 \mid Data, I)}{Odds(H_0 \mid Data, I) + 1}$
The “hypothesis testing” is then a judgement upon whether q0 is
• small enough to make us believe in H1
• large enough to make us believe in H0
i.e. no pre-setting of the decision direction is made
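The odds ratio (posterior odds/prior odds) is known as the Bayes factor:
$B = \frac{Odds(H_0 \mid Data, I)}{Odds(H_0 \mid I)} = \frac{P(H_0 \mid Data, I) \,/\, P(H_1 \mid Data, I)}{P(H_0 \mid I) \,/\, P(H_1 \mid I)}$
How can we obtain the posterior odds?
$Odds(H_0 \mid Data, I) = B \cdot Odds(H_0 \mid I)$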
Hence, if we know the Bayes factor, we can calculate the posterior odds (since we can always set the prior odds)
1. Both hypotheses are simple, i.e. give one and only one model each for Data
a) Distinct probabilities can be assigned to Data
Bayes’ theorem on odds form then gives
$\frac{P(H_0 \mid Data, I)}{P(H_1 \mid Data, I)} = \frac{P(Data \mid H_0, I)}{P(Data \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)}$
Hence, the Bayes factor is
$B = \frac{P(Data \mid H_0, I)}{P(Data \mid H_1, I)}$
The probabilities of the numerator and denominator respectively can be calculated (estimated) using the model provided by the respective hypothesis.
b) Data is the observed value x of a continuous (possibly multidimensional) random variable
It can be shown that
$\frac{P(H_0 \mid Data, I)}{P(H_1 \mid Data, I)} = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)}$
where f(x | H0, I) and f(x | H1, I) are the probability density functions given by the models specified by H0 and H1.
Hence, the Bayes factor is
$B = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)}$
Known (or estimated) density functions under each model can then be used to calculate the Bayes factor
In both cases we can see that the Bayes factor is a likelihood ratio, since the numerator and denominator are the likelihoods of the respective hypotheses.
$B = \frac{L(H_0 \mid Data, I)}{L(H_1 \mid Data, I)}$
Example Ecstasy pills revisited
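The likelihoods for the hypotheses are
$L(H_0 \mid Data) = \binom{50}{39} \, 0.8^{39} \, 0.2^{11} \approx 0.1271082$
$L(H_1 \mid Data) = \binom{50}{39} \, 0.5^{39} \, 0.5^{11} \approx 3.3177 \times 10^{-5}$
$B = \frac{0.1271082}{3.3177 \times 10^{-5}} \approx 3831$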
Hence, Data are 3831 times more probable if H0 is true compared to if H1 is true.
Assume we have no particular belief in either of the two hypotheses prior to obtaining the data:
$Odds(H_0) = 1$
$Odds(H_0 \mid Data) = 3831 \cdot 1 = 3831$
$P(H_0 \mid Data) = \frac{3831}{3831 + 1} \approx 0.9997$
Hence, upon the analysis of data we can be 99.97% certain that H0 is true.
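The numbers of this example can be reproduced in a few lines. The sketch computes the two binomial likelihoods, their ratio (the Bayes factor) and the posterior probability under even prior odds; the function names are illustrative.

```python
from math import comb


def binomial_likelihood(p, x, n):
    """Probability of x Ecstasy pills among n sampled when the seizure proportion is p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def posterior_probability(bayes_factor, prior_odds=1.0):
    """P(H0 | Data) from the Bayes factor and the prior odds for H0."""
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


x, n = 39, 50
likelihood_h0 = binomial_likelihood(0.8, x, n)
likelihood_h1 = binomial_likelihood(0.5, x, n)
bayes_factor = likelihood_h0 / likelihood_h1
posterior_h0 = posterior_probability(bayes_factor, prior_odds=1.0)

print(likelihood_h0, likelihood_h1)
print(round(bayes_factor), round(posterior_h0, 4))
```

Changing `prior_odds` shows how strongly the conclusion depends on the prior: the Bayes factor stays the same while the posterior probability moves.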
Note however that it may be unrealistic to assume only two possible proportions of Ecstasy pills in the seizure!
2. The hypothesis H0 is simple but the hypothesis H1 is composite, i.e. it provides several models for Data (several explanations)
The various models of H1 would (in general) provide different likelihoods for the different explanations. We cannot come up with one unique likelihood for H1.
If in addition, the different explanations have different prior probabilities we have to weigh the different likelihoods with these.
If the composition in H1 is in the form of a set of discrete alternatives, the Bayes factor can be written
$B = \frac{L(H_0 \mid Data)}{\sum_i L(H_{1i} \mid Data) \cdot P(H_{1i} \mid H_1)}$
where P(H1i | H1) is the conditional prior probability that H1i is true given that H1 is true (relative prior), and the sum is over all alternatives H11, H12, …
If the relative priors are (fairly) equal the denominator reduces to the average likelihood of the alternatives.
If the likelihoods of the alternatives are equal the denominator reduces to that likelihood since the relative priors sum to one.
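A sketch of the discrete composite case, reusing the pill data (39 Ecstasy pills out of 50). The three alternatives under H1 (proportions 0.4, 0.5 and 0.6 with equal relative priors) are hypothetical and serve only to show how the denominator becomes a weighted average of likelihoods; the function names are illustrative.

```python
from math import comb


def binomial_likelihood(p, x, n):
    """Probability of x Ecstasy pills among n sampled when the seizure proportion is p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def bayes_factor_simple_vs_discrete(p0, alternatives, x, n):
    """Bayes factor for a simple H0 against a discrete composite H1.

    alternatives: list of (proportion, relative prior) pairs; the priors must sum to 1.
    """
    if abs(sum(weight for _, weight in alternatives) - 1.0) > 1e-9:
        raise ValueError("relative priors must sum to one")
    denominator = sum(binomial_likelihood(p, x, n) * weight for p, weight in alternatives)
    return binomial_likelihood(p0, x, n) / denominator


# Hypothetical alternatives under H1, for illustration only.
hypothetical_h1 = [(0.4, 1 / 3), (0.5, 1 / 3), (0.6, 1 / 3)]
print(bayes_factor_simple_vs_discrete(0.8, hypothetical_h1, 39, 50))

# With a single alternative the expression reduces to the simple-versus-simple case.
print(bayes_factor_simple_vs_discrete(0.8, [(0.5, 1.0)], 39, 50))
```

Because the alternative 0.6 explains the data far better than 0.4 or 0.5, it dominates the average and pulls the Bayes factor well below the simple-versus-simple value.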
If the composition is defined by a continuously valued parameter, θ, we must use the conditional prior density of θ given that H1 is true, p(θ | H1), and integrate the likelihood with respect to that density.
The Bayes factor can be written
$B = \frac{L(H_0 \mid Data)}{\int L(\theta \mid Data) \, p(\theta \mid H_1) \, d\theta}$
3. Both hypotheses are composite, i.e. each provides several models for Data (several explanations)
This gives different sub-cases, depending on whether the compositions in the hypotheses are discrete or according to a continuously valued parameter.
The “discrete-discrete” case gives the Bayes factor
$B = \frac{\sum_j L(H_{0j} \mid Data) \cdot P(H_{0j} \mid H_0)}{\sum_i L(H_{1i} \mid Data) \cdot P(H_{1i} \mid H_1)}$
and the “continuous-continuous” case gives the Bayes factor
$B = \frac{\int L(\theta \mid Data) \, p(\theta \mid H_0) \, d\theta}{\int L(\theta \mid Data) \, p(\theta \mid H_1) \, d\theta}$
where p(θ | H0) is the conditional prior density of θ given that H0 is true
Example Ecstasy pills revisited again
Assume a more realistic case where we, from a sample of the seizure, shall investigate whether the proportion of Ecstasy pills, θ, is higher than 80%.
H0: θ > 0.8    H1: θ ≤ 0.8
i.e. both are composite
We further assume, as before, that we have no particular belief in either of the two hypotheses. The prior density for θ can thus be defined as
$p(\theta) = \begin{cases} 0.5 / 0.8 = 0.625, & 0 \le \theta \le 0.8 \\ 0.5 / 0.2 = 2.5, & 0.8 < \theta \le 1 \end{cases}$
The likelihood function is (irrespective of the hypotheses)
$L(\theta \mid Data) = \binom{50}{39} \, \theta^{39} (1 - \theta)^{11}$
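The conditional prior densities under each hypothesis become uniform over each interval of potential values of θ: p(θ | H0) = 1/0.2 = 5 on (0.8, 1] and p(θ | H1) = 1/0.8 = 1.25 on [0, 0.8].
The Bayes factor is
$B = \frac{\int L(\theta \mid Data) \, p(\theta \mid H_0) \, d\theta}{\int L(\theta \mid Data) \, p(\theta \mid H_1) \, d\theta} = \frac{5 \int_{0.8}^{1} \binom{50}{39} \theta^{39} (1-\theta)^{11} \, d\theta}{1.25 \int_{0}^{0.8} \binom{50}{39} \theta^{39} (1-\theta)^{11} \, d\theta} = 4 \cdot \frac{\int_{0.8}^{1} \theta^{39} (1-\theta)^{11} \, d\theta}{\int_{0}^{0.8} \theta^{39} (1-\theta)^{11} \, d\theta}$
How do we solve these integrals?
The Beta distribution:
A random variable is said to have a Beta distribution with parameters a and b if its probability density function is
$f(x) = C \cdot x^{a-1} (1-x)^{b-1}; \quad 0 \le x \le 1$, with $C = \frac{1}{\int_0^1 x^{a-1} (1-x)^{b-1} \, dx} = \frac{1}{B(a, b)}$
Hence, we can identify the integrals of the Bayes factor as proportional to different probabilities of the same beta distribution
$\frac{\int_{0.8}^{1} \theta^{39} (1-\theta)^{11} \, d\theta}{\int_{0}^{0.8} \theta^{39} (1-\theta)^{11} \, d\theta} = \frac{\int_{0.8}^{1} C \cdot \theta^{40-1} (1-\theta)^{12-1} \, d\theta}{\int_{0}^{0.8} C \cdot \theta^{40-1} (1-\theta)^{12-1} \, d\theta}$
namely a beta distribution with parameters a = 40 and b = 12
> num <- 1 - pbeta(0.8, 40, 12)
> den <- pbeta(0.8, 40, 12)
> num
[1] 0.314754
> den
[1] 0.685246
> num/den
[1] 0.45933
Hence, the ratio of the integrals is 0.45933. The uniform conditional prior densities (5 and 1.25) do not cancel, so the Bayes factor is B = 4 · 0.45933 ≈ 1.84
With even prior odds (Odds(H0) = 1) the posterior odds equal the Bayes factor, and the posterior probability of H0 is
$P(H_0 \mid Data) = \frac{1.84}{1.84 + 1} \approx 0.65$
Data do not provide us with clear evidence against either of the hypotheses.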
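The Beta(40, 12) probabilities can also be obtained without R. For integer parameters the Beta distribution function equals a binomial tail probability, P(Y ≤ x) = P(Bin(a + b − 1, x) ≥ a), which the sketch below uses; the function name is illustrative. The code keeps the ratio of the two integrals separate from the Bayes factor: the uniform conditional prior densities 1/0.2 and 1/0.8 over (0.8, 1] and [0, 0.8] scale that ratio by a factor of 4.

```python
from math import comb


def beta_cdf_integer_params(x, a, b):
    """P(Y <= x) for Y ~ Beta(a, b) with positive integer a and b.

    Uses the identity P(Y <= x) = P(Bin(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


below = beta_cdf_integer_params(0.8, 40, 12)   # normalised integral over [0, 0.8]
above = 1.0 - below                            # normalised integral over (0.8, 1]
integral_ratio = above / below

# Conditional prior densities: uniform on (0.8, 1] under H0 and on [0, 0.8] under H1.
density_h0 = 1 / 0.2
density_h1 = 1 / 0.8
bayes_factor = (density_h0 * above) / (density_h1 * below)
posterior_h0 = bayes_factor / (1.0 + bayes_factor)   # even prior odds

print(round(above, 6), round(below, 6), round(integral_ratio, 5))
print(round(bayes_factor, 2), round(posterior_h0, 2))
```

The first line of output should agree with the pbeta values, and the second gives the Bayes factor and the posterior probability of H0 under even prior odds.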