Post on 30-Mar-2019
Statistics of Contingency Tables - Extension to I x J
Stat 557
Heike Hofmann
Outline
• Testing Independence
• Local Odds Ratios
• Concordance & Discordance
• Intro to GLMs
Simpson’s paradox
• Simpson’s paradox: the marginal association between X and Y is opposite to the conditional associations between X and Y for each level of Z
• due to: a very strong marginal association between X and Z or between Y and Z
Conditional Odds Ratios
• X, Y are conditionally independent at level k of Z, if the conditional log odds ratio at that level is 0 (i.e. the conditional odds ratio is 1)
• X, Y are conditionally independent given Z, if all conditional log odds ratios are 0. (Does not imply marginal independence)
• X, Y have homogeneous association, if the conditional odds ratios are the same at every level of Z.
Testing Independence
• An odds ratio of 1 indicates independence; a confidence interval helps to judge the deviation from independence, but the CI is only an approximation.
• Alternative solution: table tests
Testing independence
• null hypothesis: $H_0: \pi_{ij} = \pi_{i\cdot} \cdot \pi_{\cdot j} \quad \forall i, j$
• Score Test (Pearson, 1900): $X^2 = \sum_{i,j} \frac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$
• Likelihood-Ratio Test: $G^2 = 2 \sum_{i,j} n_{ij} \log\left(\frac{n_{ij}}{\hat\mu_{ij}}\right)$
• here $\hat\mu_{ij} = n_{i\cdot}\, n_{\cdot j} / n$ are the expected counts under the null hypothesis
• both $X^2$ and $G^2$ have the same limiting distribution, $\chi^2_{(I-1)(J-1)}$
Example: Cholesterol/Heart Disease
• 1329 patients of same age/sex

                          Coronary Disease
Cholesterol (mg/l)     present        absent
≤ 220                  y11 = 20       y12 = 553
> 220                  y21 = 72       y22 = 684
Cholesterol/Heart Disease
• Expected Values under independence

                          Coronary Disease
Cholesterol (mg/l)     present      absent       Total
≤ 220                   39.67       533.33         573
> 220                   52.33       703.67         756
Total                   92         1237           1329
Cholesterol/Heart Disease
• log-likelihood ratio test: $G^2 = 19.8$
• Pearson score test: $X^2 = 18.4$
• with df = (2-1)*(2-1) = 1, both statistics lie far above the 5% critical value $\chi^2_{1,0.05} = 3.84$: independence seems to be violated
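The two test statistics for the cholesterol table can be recomputed with a short sketch. This is an illustrative Python example, not code from the lecture; the function name `independence_tests` is made up here, and the expected counts are assumed to be the usual row total times column total over n.

```python
import math

# Observed counts from the cholesterol example
# (rows: cholesterol <= 220, > 220; columns: disease present, absent).
observed = [[20, 553], [72, 684]]

def independence_tests(table):
    """Return (expected counts, Pearson X^2, likelihood-ratio G^2, df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected counts under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    cells = [(o, e)
             for o_row, e_row in zip(table, expected)
             for o, e in zip(o_row, e_row)]
    x2 = sum((o - e) ** 2 / e for o, e in cells)
    # Cells with a zero count contribute 0 to G^2, so they are skipped.
    g2 = 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, x2, g2, df

expected, x2, g2, df = independence_tests(observed)
print([[round(e, 2) for e in row] for row in expected])
print(round(x2, 1), round(g2, 1), df)
```

Up to rounding, this should reproduce the expected values and the two statistics reported on the slides.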
Extensions to I x J Contingency Tables
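• Each set of four cells forming a rectangle yields one odds ratio, ad/(bc):
      a   b
      c   d
• Local Odds Ratio: use only neighboring cells, $\theta_{ij} = \frac{\pi_{ij}\,\pi_{i+1,j+1}}{\pi_{i,j+1}\,\pi_{i+1,j}}$ for $i = 1, ..., I-1$ and $j = 1, ..., J-1$
• the (I-1)(J-1) local odds ratios form a minimal sufficient set: every other odds ratio in the table is a product of local odds ratios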
Local Odds Ratios
• Study on Marijuana use (based on parental use)

                         student
parent        never    occasional    regular
neither         141        54           40
one              68        44           51
both             17        11           19

• evidence of association?
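As a rough illustration of local odds ratios, the following Python sketch computes the sample versions for the marijuana table, using observed counts in place of the cell probabilities. The helper name `local_odds_ratios` is hypothetical and not part of the lecture material.

```python
# Counts from the marijuana table
# (rows: parent use neither/one/both; columns: student use never/occasional/regular).
counts = [
    [141, 54, 40],
    [68, 44, 51],
    [17, 11, 19],
]

def local_odds_ratios(table):
    """Sample local odds ratios n[i][j]*n[i+1][j+1] / (n[i][j+1]*n[i+1][j])."""
    return [
        [table[i][j] * table[i + 1][j + 1] / (table[i][j + 1] * table[i + 1][j])
         for j in range(len(table[0]) - 1)]
        for i in range(len(table) - 1)
    ]

ratios = local_odds_ratios(counts)
for row in ratios:
    print([round(r, 2) for r in row])
```

A 3 x 3 table gives (3-1)(3-1) = 4 local odds ratios; values above 1 point towards a positive association between neighboring categories.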
Example: Marijuana Use
• Student by Parent Use
• [Figure: product plot of student use by parent use (neither, one, both)]
• positive association?
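library(productplots)  # provides prodplot(); mj is assumed to hold the columns student, parent and count
prodplot(mj, count~student+parent, c("vspine","hspine"), subset=level==2)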
Summaries of I x J Tables (ordinal variables)
• Concordance/Discordance: for each pair of subjects count the number of concordant and discordant pairs, where
• a pair is concordant, if the subject ranked higher on X is also ranked higher on Y
• a pair is discordant, if the subject ranked higher on X is ranked lower on Y

        X=1    X=2    ...    X=I
Y=1     π11    π12    ...    π1I
Y=2     π21    π22    ...    π2I
...     ...    ...           ...
Y=J     πJ1    πJ2    ...    πJI

• concordance to (i,j): cells with <i, <j or >i, >j
• discordance to (i,j): cells with <i, >j or >i, <j
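Concordance/Discordance
\[ \Pi_C = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij} \cdot \sum_{h>i} \sum_{k>j} \pi_{hk} = \sum_{i,j} \pi_{ij}\, \pi^c_{ij} \]
where $\pi^c_{ij}$ is the total probability of the cells concordant to (i,j). Analogously, $\Pi_D = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij} \cdot \sum_{h>i} \sum_{k<j} \pi_{hk} = \sum_{i,j} \pi_{ij}\, \pi^d_{ij}$, where $\pi^d_{ij}$ is the total probability of the cells discordant to (i,j).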
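Gamma Statistic
• let $\Pi_C$, $\Pi_D$ be the probabilities for concordance and discordance, resp.
• \[ \gamma = \frac{\Pi_C - \Pi_D}{\Pi_C + \Pi_D} \]
• the estimate $\hat\gamma$ is approx. normal with asymptotic variance
\[ \sigma^2(\hat\gamma)_\infty = \frac{16}{n(\Pi_C + \Pi_D)^4} \sum_{i=1}^{I} \sum_{j=1}^{J} \pi_{ij} \left(\Pi_C \pi^d_{ij} - \Pi_D \pi^c_{ij}\right)^2 \]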
Marijuana Example:
• C=2·(141·(44+51+11+19)+ 68·(11+19)+ 54·(51+19)+ 44·19)=48562
• D=2·(17·(54+40+44+51)+ 68·(54+40)+ 11·(40+51)+ 44·40)=24732
• γ = (C − D)/(C + D) = 23830/73294 = 0.3251
• sd(γ) = 0.0654
• fairly strong indication of a positive association: γ lies about 5 standard deviations above 0
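The counts C and D, the gamma statistic and its standard deviation can be checked with a small sketch. This Python example is only an illustration: the function name `gamma_statistic` is made up, and the standard error is assumed to come from plugging sample proportions into the asymptotic variance formula of the Gamma Statistic slide.

```python
import math

# Marijuana table (rows: parent use; columns: student use), both ordinal.
counts = [
    [141, 54, 40],
    [68, 44, 51],
    [17, 11, 19],
]

def gamma_statistic(table):
    """Return (C, D, gamma, standard error) for an ordinal two-way table."""
    rows, cols = len(table), len(table[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    n = sum(map(sum, table))
    conc = {}  # number of observations concordant to cell (i, j)
    disc = {}  # number of observations discordant to cell (i, j)
    for i, j in cells:
        conc[i, j] = sum(table[h][k] for h, k in cells if (h - i) * (k - j) > 0)
        disc[i, j] = sum(table[h][k] for h, k in cells if (h - i) * (k - j) < 0)
    # C and D count ordered pairs, which matches the factor 2 on the slide.
    c_total = sum(table[i][j] * conc[i, j] for i, j in cells)
    d_total = sum(table[i][j] * disc[i, j] for i, j in cells)
    gamma = (c_total - d_total) / (c_total + d_total)
    # Plug-in estimates of Pi_C and Pi_D, then the asymptotic variance.
    pc, pd = c_total / n**2, d_total / n**2
    s = sum((table[i][j] / n) * (pc * disc[i, j] / n - pd * conc[i, j] / n) ** 2
            for i, j in cells)
    se = math.sqrt(16 * s / (n * (pc + pd) ** 4))
    return c_total, d_total, gamma, se

C, D, gamma, se = gamma_statistic(counts)
print(C, D, round(gamma, 4), round(se, 4))
```

The quadruple loop is quadratic in the number of cells, which is fine for small tables like this one.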
Generalized Linear Models
• Three components: random, systematic, link
• Random part: distribution of observations y
• Systematic component: linear combination of explanatory variables
• link function: link between random and systematic component
Exponential Family
• Two parameter family, φ > 0
• density $f(y_i; \theta_i, \phi) = \exp\left(\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)\right)$
• θi is the natural parameter, φ is called the dispersion parameter.
• E[Y] = b'(θ), Var[Y] = b''(θ) a(φ)
Normal Distribution
\[ f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \exp\left(\frac{-x^2 + 2\mu x - \mu^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)\right) \]
• define θ = 2µ, φ = 2σ², then
\[ f(x; \theta, \phi) = \exp\left(\frac{\theta x - (\theta/2)^2}{\phi} - \frac{x^2}{\phi} - \frac{1}{2}\log(\phi\pi)\right) \]
• this is of the exponential family form
\[ f(x; \theta, \phi) = \exp\left(\frac{\theta x - b(\theta)}{a(\phi)} + c(x, \phi)\right) \]
with $b(\theta) = (\theta/2)^2$, $a(\phi) = \phi$ and $c(x, \phi) = -x^2/\phi - \frac{1}{2}\log(\phi\pi)$
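A quick numerical check of this reparametrization is sketched below. It is only an illustration: the function names and the values of µ and σ are arbitrary choices made here, and b, a and c are read off from the exponent with θ = 2µ, φ = 2σ².

```python
import math

def normal_density(x, mu, sigma):
    """Usual normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def normal_expfam(x, theta, phi):
    """Same density in exponential family form with theta = 2*mu, phi = 2*sigma^2."""
    b = (theta / 2) ** 2                               # b(theta)
    c = -x**2 / phi - 0.5 * math.log(phi * math.pi)    # c(x, phi)
    return math.exp((theta * x - b) / phi + c)         # a(phi) = phi

mu, sigma = 1.5, 0.8            # arbitrary illustration values
theta, phi = 2 * mu, 2 * sigma**2

max_gap = max(abs(normal_density(x, mu, sigma) - normal_expfam(x, theta, phi))
              for x in [-2.0, 0.0, 1.5, 3.0])
mean = theta / 2                # b'(theta) = theta / 2
variance = 0.5 * phi            # b''(theta) * a(phi) = phi / 2
print(max_gap, mean, variance)
```

The two forms should agree up to floating point error, and b'(θ) and b''(θ) a(φ) should return µ and σ².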
Binomial Distribution
\[ P(X = x) = \binom{n}{x} \pi^x (1-\pi)^{n-x} \]
• with $\theta = \log\frac{\pi}{1-\pi}$ we have $\log(1-\pi) = -\log(1 + e^\theta)$ and
\[ P\left(X = x;\ \theta = \log\frac{\pi}{1-\pi},\ \phi = 1\right) = \exp\left(x\theta - n\log(1 + e^\theta) + \log\binom{n}{x}\right) \]
Poisson Distribution
\[ P(X = x; \lambda) = e^{-\lambda} \frac{\lambda^x}{x!} \]
\[ P(X = x;\ \theta = \log\lambda,\ \phi = 1) = \exp\left(x\theta - e^\theta - \log(x!)\right) \]
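The binomial and Poisson exponential family forms can be compared against the usual probability functions in the same way. The sketch below uses arbitrary illustration values for n, π and λ, and the function names are not taken from the lecture.

```python
import math

def binomial_pmf(x, n, pi):
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)

def binomial_expfam(x, n, theta):
    """Binomial pmf with natural parameter theta = log(pi / (1 - pi))."""
    return math.exp(x * theta - n * math.log(1 + math.exp(theta))
                    + math.log(math.comb(n, x)))

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam**x / math.factorial(x)

def poisson_expfam(x, theta):
    """Poisson pmf with natural parameter theta = log(lambda)."""
    return math.exp(x * theta - math.exp(theta) - math.lgamma(x + 1))  # lgamma(x+1) = log(x!)

n, pi, lam = 10, 0.3, 2.5       # arbitrary illustration values
theta_binomial = math.log(pi / (1 - pi))
theta_poisson = math.log(lam)

binomial_gap = max(abs(binomial_pmf(x, n, pi) - binomial_expfam(x, n, theta_binomial))
                   for x in range(n + 1))
poisson_gap = max(abs(poisson_pmf(x, lam) - poisson_expfam(x, theta_poisson))
                  for x in range(15))
print(binomial_gap, poisson_gap)
```

Both gaps should be at the level of floating point error, which supports b(θ) = n log(1 + e^θ) for the binomial and b(θ) = e^θ for the Poisson distribution.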
Systematic Component
• Let X1, X2, ..., Xp be explanatory variables
• systematic component: $\eta = \sum_{j=1}^{p} \beta_j X_j$
• with parameters β1, β2, ..., βp
[Figure: two mosaic plots titled "Hair Eye Color", split by Hair (Black, Brown, Red, Blond), Eye (Brown, Blue, Hazel, Green) and sex (Male, Female); one of them is shaded by standardized residuals in the bins <-4, -4:-2, -2:0, 0:2, 2:4, >4]
A test of independence, though, gives $G^2 = 179.79$, which is highly significant. Shading the tiles according to the value of Pearson’s residuals gives the mosaic on the right-hand side. Here, we can identify the cells (female, blonde, blue eyes) and (female, brown, blue eyes) as the cells mainly responsible for the deviation from independence (there are “too many” blue-eyed blondes and “too few” brown-eyed brown heads).
3 Generalized Linear Models
Generalized Linear Models (GLMs) consist of three components:
random component: response variable Y with independent observations y1, y2, ..., yN. Y has a distribution in the natural exponential family, i.e. if the distribution of Y has parameter θ, then we can write the density in the following form:
\[ f(y_i; \theta_i) = a(\theta_i)\, b(y_i) \exp(y_i Q(\theta_i)), \]
where Q(θ) is the natural parameter.
systematic component: let X1, ..., Xp be explanatory variables, where xij is the value of variable j for subject i. The systematic component is then a linear combination of the Xj:
\[ \eta_i = \sum_{j=1}^{p} \beta_j x_{ij} \]
This corresponds to the design matrix of a standard linear model.
link function: the link function builds the connection between the systematic and the random component: let µi be the expected value of Yi, i.e. E[Yi] = µi, then
\[ g(\mu_i) = \eta_i = \sum_{j} \beta_j x_{ij} \]
for i = 1, ..., N, where g is a differentiable, monotonic function.
g(µ) = µ is the identity link, g(µ) = Q(θ) is called the canonical link.
Link Function
• g() is called the link function of the GLM, if
• g is differentiable and monotonic,
• g(E[Y]) = ∑ Xjβj
• g is called the natural link, if g(E[Y]) = θ
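To make the role of the link function concrete, here is a small sketch that maps a linear predictor back to the mean with the inverse link. The logit and log links are used because θ = log(π/(1−π)) and θ = log λ are the natural parameters of the binomial and Poisson families; the coefficient and covariate values are hypothetical and chosen only for illustration.

```python
import math

def logit(mu):
    """Natural link for a binomial proportion: log(mu / (1 - mu))."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

def log_link(mu):
    """Natural link for a Poisson mean: log(mu)."""
    return math.log(mu)

def inv_log(eta):
    return math.exp(eta)

def linear_predictor(x, beta):
    """Systematic component: sum over j of beta_j * x_j."""
    return sum(b * xj for b, xj in zip(beta, x))

beta = [-1.0, 0.5]   # hypothetical coefficients
x = [1.0, 2.0]       # hypothetical covariates; the first entry acts as an intercept

eta = linear_predictor(x, beta)
prob = inv_logit(eta)   # mean on the probability scale under a logit link
rate = inv_log(eta)     # mean on the rate scale under a log link
print(eta, prob, rate)
```

Both links are differentiable and monotonic, so each value of the linear predictor corresponds to exactly one mean.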