Statistics of Contingency Tables - Extension to I x J

Stat 557

Heike Hofmann

Outline

• Testing Independence

• Local Odds Ratios

• Concordance & Discordance

• Intro to GLMs

Simpson’s paradox

• Simpson’s paradox: the marginal association between X and Y is opposite to the conditional associations between X and Y for each level of Z

• due to: very strong marginal association between X and Z or Y and Z

Conditional Odds Ratios

• X, Y are conditionally independent at level k of Z, if the conditional log odds ratio at that level is 0 (i.e. the conditional odds ratio is 1)

• X, Y are conditionally independent given Z, if all conditional log odds ratios are 0. (Does not imply marginal independence.)

• X, Y have homogeneous association, if the conditional odds ratios are the same at every level of Z.
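The difference between conditional and marginal odds ratios can be checked numerically. The following Python sketch uses invented counts (purely hypothetical, chosen so that Simpson's paradox appears) for a 2 x 2 x 2 table; the helper name `odds_ratio` and the variable names are assumptions of this sketch, not part of the lecture.

```python
# Hypothetical 2 x 2 tables of X (rows) by Y (columns), one per level of Z.
# The counts are invented for illustration only.
tables_by_z = {
    "Z=1": [[90, 10], [800, 200]],
    "Z=2": [[300, 700], [20, 80]],
}

def odds_ratio(table):
    """Odds ratio (a*d)/(b*c) of a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Conditional odds ratios: one per level of Z.
conditional = {z: odds_ratio(t) for z, t in tables_by_z.items()}

# Marginal table: add the counts over the levels of Z.
marginal_table = [
    [sum(t[i][j] for t in tables_by_z.values()) for j in range(2)]
    for i in range(2)
]
marginal = odds_ratio(marginal_table)

for z, theta in conditional.items():
    print(f"conditional odds ratio at {z}: {theta:.3f}")
print(f"marginal odds ratio: {marginal:.3f}")
```

In this made-up table both conditional odds ratios are above 1 while the marginal odds ratio is below 1, because X and Y are each strongly associated with Z.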

Testing Independence

• An odds ratio of 1 indicates independence; a confidence interval helps to judge deviation from independence, but the CI is only an approximation.

• Alternative solution: table tests

Testing independence

• null hypothesis: $\pi_{ij} = \pi_{i\cdot} \, \pi_{\cdot j} \quad \forall i, j$

• Score Test (Pearson, 1900): $X^2 = \sum_{i,j} \frac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$

• Likelihood-Ratio Test: $G^2 = 2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{\hat\mu_{ij}}$

• here $\hat\mu_{ij} = n_{i\cdot} n_{\cdot j} / n$ are the estimated expected counts under independence

• both $X^2$ and $G^2$ have the same limiting distribution, $\chi^2_{(I-1)(J-1)}$

Example: Cholesterol/Heart Disease

• 1329 patients of same age/sex

                         Coronary Disease
                         present      absent
Cholesterol ≤ 220        y11 = 20     y12 = 553
Cholesterol > 220        y21 = 72     y22 = 684

Cholesterol/Heart Disease

• Expected Values under independence

                         Coronary Disease
                         present      absent       Total
Cholesterol ≤ 220        39.67        533.33       573
Cholesterol > 220        52.33        703.67       756
Total                    92           1237         1329

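Cholesterol/Heart Disease

• H0: P(heart disease | Cholesterol ≤ 220) = P(heart disease | Cholesterol > 220)

• log-likelihood ratio test $G^2 = 19.8$

• Pearson score test $X^2 = 18.4$

• with df = (2-1)*(2-1) = 1 the critical values are $\chi^2_{1,0.05} = 3.84$ and $\chi^2_{1,0.01} = 6.63$, so independence seems to be violated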
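The test statistics for the cholesterol table can be recomputed from the observed counts. The Python sketch below is a minimal illustration using only the standard library; the function names are assumptions of this sketch. For 1 degree of freedom the upper tail of the chi-square distribution is obtained from the normal distribution via `math.erfc`, which would not carry over to other degrees of freedom.

```python
import math

# Observed counts from the cholesterol example
# (rows: cholesterol <= 220, > 220; columns: coronary disease present, absent).
observed = [[20, 553], [72, 684]]

def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_x2(table):
    """Pearson score statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

def likelihood_ratio_g2(table):
    """Likelihood-ratio statistic: 2 * sum of observed * log(observed / expected)."""
    expected = expected_counts(table)
    return 2 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # empty cells contribute 0 to the sum
    )

def chi2_upper_tail_df1(x):
    """P(chi-square with 1 df > x), using chi2_1 = Z^2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2))

x2 = pearson_x2(observed)
g2 = likelihood_ratio_g2(observed)
print(f"X2 = {x2:.1f}, p = {chi2_upper_tail_df1(x2):.2g}")
print(f"G2 = {g2:.1f}, p = {chi2_upper_tail_df1(g2):.2g}")
```

Both statistics round to the values quoted on the slide (18.4 and 19.8), and the expected counts match the table of expected values.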

Extensions to I x J Contingency Tables

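• Each set of four cells forming a rectangle yields one odds ratio

• Local Odds Ratio: use only neighboring cells, $\theta_{ij} = \frac{\pi_{ij} \, \pi_{i+1,j+1}}{\pi_{i,j+1} \, \pi_{i+1,j}}$ for $i = 1, ..., I-1$ and $j = 1, ..., J-1$

• the (I-1)(J-1) local odds ratios form a minimal sufficient set: all other odds ratios can be computed from them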

Local Odds Ratios

• Study on Marijuana use (based on parental use)

                 student
parent       never    occasional    regular
neither      141      54            40
one          68       44            51
both         17       11            19

• evidence of association?
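The local odds ratios of the marijuana table can be computed directly from neighboring cells. The Python sketch below assumes that rows are parent use (neither, one, both) and columns are student use (never, occasional, regular), which is the layout implied by the concordance computation later in the deck; the function name `local_odds_ratios` is an assumption of this sketch.

```python
# Marijuana use counts: rows = parent use (neither, one, both),
# columns = student use (never, occasional, regular).
counts = [
    [141, 54, 40],
    [68, 44, 51],
    [17, 11, 19],
]

def local_odds_ratios(table):
    """Return the (I-1) x (J-1) matrix of local odds ratios
    n[i][j] * n[i+1][j+1] / (n[i][j+1] * n[i+1][j])."""
    n_rows, n_cols = len(table), len(table[0])
    return [
        [
            table[i][j] * table[i + 1][j + 1] / (table[i][j + 1] * table[i + 1][j])
            for j in range(n_cols - 1)
        ]
        for i in range(n_rows - 1)
    ]

local = local_odds_ratios(counts)
for row in local:
    print([round(theta, 3) for theta in row])

# Any other odds ratio is a product of local ones. For example, the odds
# ratio of the four corner cells equals the product of all local odds ratios.
corner = counts[0][0] * counts[2][2] / (counts[0][2] * counts[2][0])
product = 1.0
for row in local:
    for theta in row:
        product *= theta
print(round(corner, 4), round(product, 4))
```

Three of the four local odds ratios are above 1 and one equals 1, which points towards a positive association between parent and student use.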

Example: Marijuana Use

• Student by Parent Use

[Figure: plot of student marijuana use by parent use (neither, one, both)]

• positive association?

Example: Marijuana Use

prodplot(mj, count~student+parent, c("vspine","hspine"), subset=level==2)

Summaries of I x J Tables (ordinal variables)

• Concordance/Discordance: for each pair of subjects decide whether the pair is concordant or discordant, and count the number of pairs of each kind, where

• a pair is concordant, if the subject ranked higher on X is also ranked higher on Y

• a pair is discordant, if the subject ranked higher on X is ranked lower on Y

        X=1    X=2    ...    X=I
Y=1     π11    π12    ...    π1I
Y=2     π21    π22    ...    π2I
...
Y=J     πJ1    πJ2    ...    πJI

Concordance/Discordance

• cells concordant to (i,j): (<i, <j) or (>i, >j)

• cells discordant to (i,j): (<i, >j) or (>i, <j)

• $\Pi_C = 2 \sum_{i} \sum_{j} \pi_{ij} \Big( \sum_{h>i} \sum_{k>j} \pi_{hk} \Big)$

• $\Pi_D = 2 \sum_{i} \sum_{j} \pi_{ij} \Big( \sum_{h>i} \sum_{k<j} \pi_{hk} \Big)$

Gamma Statistic

• let $\Pi_C$, $\Pi_D$ be the probabilities for concordance and discordance, resp.

• $\gamma = \frac{\Pi_C - \Pi_D}{\Pi_C + \Pi_D}$

• $\hat\gamma$ is approx. normal with

$\sigma^2(\hat\gamma)_\infty = \frac{16}{n (\Pi_C + \Pi_D)^4} \sum_{i,j} \pi_{ij} \left( \Pi_C \pi^{(d)}_{ij} - \Pi_D \pi^{(c)}_{ij} \right)^2$

where $\pi^{(c)}_{ij}$ and $\pi^{(d)}_{ij}$ are the total probabilities of the cells concordant and discordant to (i,j)

Let X, Y be two categorical variables, with I, J categories respectively and X ∈ {x1, x2, ..., xI}, Y ∈ {y1, y2, ..., yJ}.

Then the pair (X, Y) is a categorical variable with IJ outcomes. The counts are collected in the table

X\Y    y1     y2     ...    yJ
x1     n11    n12    ...    n1J    n1.
x2     n21    n22    ...    n2J    n2.
...    ...    ...    ...    ...    ...
xI     nI1    nI2    ...    nIJ    nI.
       n.1    n.2    ...    n.J    n


Marijuana Example:

• C=2·(141·(44+51+11+19)+ 68·(11+19)+ 54·(51+19)+ 44·19)=48562

• D=2·(17·(54+40+44+51)+ 68·(54+40)+ 11·(40+51)+ 44·40)=24732

• γ = (C − D)/(C + D) = 23830/73294 = 0.3251

• sd(γ) = 0.0654

• fairly strong indication of a positive association (γ is about 5 standard deviations above 0)
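The numbers of the marijuana example can be reproduced with a short computation. The Python sketch below counts, for every cell, the observations concordant and discordant to it, and then plugs the observed proportions into γ and its asymptotic variance. Written in counts, with C_ij and D_ij the counts concordant and discordant to cell (i,j), the plug-in variance becomes 16 Σ n_ij (C·D_ij − D·C_ij)² / (C + D)⁴; this rewriting and all function names are assumptions of this sketch.

```python
import math

# Marijuana use counts: rows = parent use (neither, one, both),
# columns = student use (never, occasional, regular).
counts = [
    [141, 54, 40],
    [68, 44, 51],
    [17, 11, 19],
]

def concordant_discordant(table):
    """For every cell, sum the counts of cells concordant and discordant to it."""
    n_rows, n_cols = len(table), len(table[0])
    conc = [[0] * n_cols for _ in range(n_rows)]
    disc = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(n_rows):
                for k in range(n_cols):
                    if (h < i and k < j) or (h > i and k > j):
                        conc[i][j] += table[h][k]
                    elif (h < i and k > j) or (h > i and k < j):
                        disc[i][j] += table[h][k]
    return conc, disc

def gamma_statistic(table):
    """Return C, D, the gamma estimate and its estimated standard deviation."""
    conc, disc = concordant_discordant(table)
    cells = [(i, j) for i in range(len(table)) for j in range(len(table[0]))]
    # Each pair is counted from both of its members, matching the factor 2.
    c_total = sum(table[i][j] * conc[i][j] for i, j in cells)
    d_total = sum(table[i][j] * disc[i][j] for i, j in cells)
    gamma = (c_total - d_total) / (c_total + d_total)
    variance = 16 * sum(
        table[i][j] * (c_total * disc[i][j] - d_total * conc[i][j]) ** 2
        for i, j in cells
    ) / (c_total + d_total) ** 4
    return c_total, d_total, gamma, math.sqrt(variance)

C, D, gamma, sd = gamma_statistic(counts)
print(C, D, round(gamma, 4), round(sd, 4))
```

The counts C and D agree with the hand computation on the slide, and γ and sd(γ) agree with the quoted values after rounding.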

Generalized Linear Models

• Three components: random, systematic, link

• Random part: distribution of observations y

• Systematic component: linear combination of explanatory variables

• link function: link between random and systematic component

Exponential Family

• Two parameter family, φ > 0

• density f(yi; θi, φ) = exp ((yiθi − b(θi))/a(φ) + c(yi, φ))

• θi is the natural parameter, φ is called dispersion parameter.

• E[Y] = b'(θ)

• Var[Y] = b''(θ) a(φ)

Normal Distribution

• define $\theta = 2\mu$, $\phi = 2\sigma^2$

$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) = \exp\left( (-x^2 + 2\mu x - \mu^2)/(2\sigma^2) - 1/2 \log(2\pi\sigma^2) \right)$

$f(x; \theta, \phi) = \exp\left( \frac{\theta x - (\theta/2)^2}{\phi} - x^2/\phi - 1/2 \log(\phi\pi) \right)$

$f(x; \theta, \phi) = \exp\left( \frac{\theta x - b(\theta)}{a(\phi)} + c(x, \phi) \right)$

Binomial Distribution

$P(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n-x}$

$\log(1 - \pi) = -\log(1 + e^\theta)$

$P(X = x; \theta = \log\frac{\pi}{1 - \pi}, \phi = 1) = \exp\left( x\theta - n \log(1 + e^\theta) + \log\binom{n}{x} \right)$

Poisson Distribution

$P(X = x; \lambda) = e^{-\lambda} \frac{\lambda^x}{x!}$

$P(X = x; \theta = \log\lambda, \phi = 1) = \exp\left( x\theta - e^\theta - \log(x!) \right)$
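The exponential-family forms of the normal, binomial and Poisson distributions can be checked numerically against their usual formulas. The Python sketch below is a minimal check under the parametrizations used on the slides (θ = 2µ, φ = 2σ² for the normal; θ = log(π/(1−π)) for the binomial; θ = log λ for the Poisson). The parameter values are arbitrary illustrative choices and the function names are assumptions of this sketch.

```python
import math

def poisson_pmf(x, lam):
    """Standard form e^(-lambda) * lambda^x / x!."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def poisson_expfam(x, theta):
    """Exponential-family form with theta = log(lambda), b(theta) = e^theta."""
    return math.exp(x * theta - math.exp(theta) - math.lgamma(x + 1))

def binomial_pmf(x, n, pi):
    """Standard form choose(n, x) * pi^x * (1 - pi)^(n - x)."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)

def binomial_expfam(x, n, theta):
    """Form with theta = log(pi / (1 - pi)), b(theta) = n * log(1 + e^theta)."""
    return math.exp(
        x * theta - n * math.log(1 + math.exp(theta)) + math.log(math.comb(n, x))
    )

def normal_pdf(x, mu, sigma):
    """Standard normal density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_expfam(x, theta, phi):
    """Form with theta = 2*mu, phi = 2*sigma^2, b(theta) = (theta/2)^2, a(phi) = phi."""
    return math.exp(
        (theta * x - (theta / 2) ** 2) / phi - x ** 2 / phi - 0.5 * math.log(phi * math.pi)
    )

def derivative(f, t, h=1e-4):
    """Central finite difference, used to check E[Y] = b'(theta)."""
    return (f(t + h) - f(t - h)) / (2 * h)

lam, n, pi, mu, sigma = 2.5, 10, 0.3, 1.0, 2.0  # arbitrary illustrative values

# Largest absolute difference between the two forms over a few points.
max_diff = max(
    max(abs(poisson_pmf(x, lam) - poisson_expfam(x, math.log(lam))) for x in range(20)),
    max(
        abs(binomial_pmf(x, n, pi) - binomial_expfam(x, n, math.log(pi / (1 - pi))))
        for x in range(n + 1)
    ),
    max(
        abs(normal_pdf(x, mu, sigma) - normal_expfam(x, 2 * mu, 2 * sigma ** 2))
        for x in (-3.0, 0.0, 1.0, 4.5)
    ),
)

# E[Y] = b'(theta): lambda for the Poisson, n * pi for the binomial.
poisson_mean = derivative(math.exp, math.log(lam))
binomial_mean = derivative(
    lambda t: n * math.log(1 + math.exp(t)), math.log(pi / (1 - pi))
)
print(max_diff, poisson_mean, binomial_mean)
```

The two forms agree up to floating-point error, and the numerical derivatives of b(θ) recover the means λ and nπ.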

Systematic Component

• Let X1, X2, ..., Xp be explanatory variables

• systematic component: $\eta = \sum_{j=1}^{p} \beta_j X_j$

• with parameters β1, β2, ..., βp

Hair Eye Color

[Figure: two mosaic plots of Hair by Eye Color by sex (Hair: Black, Brown, Red, Blond; Eye: Brown, Blue, Hazel, Green; sex: Male, Female). One plot is shaded by standardized residuals in the bins <-4, -4:-2, -2:0, 0:2, 2:4, >4.]

A test of independence, though, gives $G^2 = 179.79$ (which is highly significant). Using shading according to the value of Pearson’s residuals, we get the mosaic on the right hand side. Here, we can identify the cells (female, blonde, blue eyes) and (female, brown, blue eyes) as the cells mainly responsible for the deviation from independence (there are “too many” blue-eyed blondes and “too few” blue-eyed brown heads).

3 Generalized Linear Models

Generalized Linear Models (GLMs) consist of three components:

random component: response variable Y with independent observations y1, y2, ..., yN. Y has a distribution in the natural exponential family, i.e. let the distribution of Y have parameter θ, then we can write the density in the following form:

$f(y_i; \theta_i) = a(\theta_i) \, b(y_i) \exp(y_i Q(\theta_i))$,

where Q(θ) is the natural parameter.

systematic component: let X1, ..., Xp be explanatory variables, where xij is the value of variable j for subject i. The systematic component is then a linear combination in Xj:

$\eta_i = \sum_j \beta_j x_{ij}$

This is equivalent to a design matrix in a standard linear model.

link function: The link function builds the connection between the systematic and the random component: let µi be the expected value of Yi, i.e. E[Yi] = µi, then

$g(\mu_i) = \eta_i = \sum_j \beta_j x_{ij}$

for i = 1, ..., N, where g is a differentiable, monotonic function.

g(µ) = µ is the identity link, g(µ) = Q(θ) is called the canonical link.

Link Function

• g() is called the link function of the GLM, if

• g is differentiable and monotonic,

• g(E[Y]) = ∑ Xjβj

• g is called the natural (canonical) link, if g(E[Y]) = θ
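How a link function connects the linear predictor to the mean can be sketched in a few lines of Python. The coefficients and covariate values below are hypothetical and chosen only for illustration; the sketch applies the inverse of the logit link (natural link for a binomial proportion), the log link (natural link for the Poisson) and the identity link to the same linear predictor. All names are assumptions of this sketch.

```python
import math

def logit(mu):
    """Natural link for a binomial proportion: log(mu / (1 - mu))."""
    return math.log(mu / (1 - mu))

def inverse_logit(eta):
    """Inverse of the logit link, mapping the linear predictor to (0, 1)."""
    return 1 / (1 + math.exp(-eta))

def linear_predictor(x, beta):
    """Systematic component: sum over j of beta_j * x_j."""
    return sum(b * xj for b, xj in zip(beta, x))

# Hypothetical coefficients and covariate rows, for illustration only.
beta = [-1.0, 0.5]                              # intercept and one slope
x_rows = [[1.0, 0.0], [1.0, 2.0], [1.0, 4.0]]   # first column is the intercept

etas = [linear_predictor(x, beta) for x in x_rows]

# g(mu) = eta, so mu is obtained by applying the inverse link to eta.
means_logit = [inverse_logit(eta) for eta in etas]   # binomial proportion
means_log = [math.exp(eta) for eta in etas]          # Poisson mean
means_identity = list(etas)                          # identity link

for eta, p, lam in zip(etas, means_logit, means_log):
    print(f"eta = {eta:+.1f}  logit link: mu = {p:.4f}  log link: mu = {lam:.4f}")
```

Because each link is monotonic, the ordering of the linear predictor is preserved in the means, while the logit and log links keep the means inside (0, 1) and (0, ∞) respectively.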
