Loglinear Models for Categorical Data - YorkU Math … Models for Categorical Data outline Lecture...
Transcript of Loglinear Models for Categorical Data - YorkU Math … Models for Categorical Data outline Lecture...
Loglinear Models for Categorical Data
Sex: Male
Adm
it?: Y
es
Sex: Female
Adm
it?: N
o
1198 1493
557 1278
High
2
3
Low
High 2 3 Low
Rig
ht
Ey
e G
ra
de
Left Eye Grade
Unaided distant vision data
4.4
-3.1
2.3
-5.9
-2.2
7.0
Black Brown Red Blond
Bro
wn
Ha
ze
l G
ree
n
Blu
e
Michael FriendlyYork University, <[email protected]>
SPIDA 2004
Loglinear Models for Categorical Data outline
Lecture Outline
Introduction
Graphical methods for Categorical Data
Loglinear models: Prelimaries
Overview of loglinear models
Visualizing contingency tables and loglinear models
2 × 2 tables
Larger tables
n-way tables
Structured tables
Ordinal variables
Square tables
SPIDA 2004 1 Michael Friendly
Loglinear Models for Categorical Data outline
Resources
SAS macro programs, from SAS System for Statistical Graphics (Friendly, 1991):
http://www.math.yorku.ca/SCS/sssg/
http://www.math.yorku.ca/SCS/sasmac/
SAS macro programs, from Visualizing Categorical Data (Friendly, 2000):
http://www.math.yorku.ca/SCS/vcd/
Workshop notes: http://www.math.yorku.ca/SCS/spida/loglin/
SCS Short Course notes, Categorical Data Analysis with Graphics
http://www.math.yorku.ca/SCS/Courses/grcat/
SPIDA 2004 2 Michael Friendly
Loglinear Models for Categorical Data overview2
Graphical Methods for Categorical Data
If I can’t picture it, I can’t understand it. Albert Einstein
Getting information from a table is like extracting sunlight from a cucumber.
Farquhar & Farquhar, 1891
Tables vs. Graphs
Tables are best suited for look-up— read off exact numbers
Graphs are better for showing patterns, trends, anomalies, making
comparisons
Visual presentation as communication: what do you want to say?
SPIDA 2004 3 Michael Friendly
Loglinear Models for Categorical Data overview2
Visual metaphors
Quantitative data: magnitude ∼ position along an axis
Frequency data: count ∼ area (Friendly, 1995)
Sex: Male
Adm
it?: Y
es
Sex: Female
Adm
it?: N
o
1198 1493
557 1278
Fourfold display for 2×2 table
A
B
C
D
E
F
Male Female Admitted Rejected
Model: (DeptGender)(Admit)
Mosaic plot for 3-way table
SPIDA 2004 4 Michael Friendly
LoglinearModelsforCategoricalDataoverview2
PrinciplesofGraphicalDisplays
Effectordering(FriendlyandKwan,2003)—Intablesandgraphs,sort
unorderedfactorsaccordingtotheeffectsyouwanttosee/show.
Auto data: Alpha order
Displa
Gratio
Hroom
Length
MPG
Price
Rep77
Rep78
Rseat
Trunk
Turn
Weight
Weig
htTurn
Tru
nk
Rseat
Rep78
Rep77
Pric
e M
PG
LengthH
room
Gra
tioDis
pla
Auto data: PC2/1 order
Gratio
MPG
Rep78
Rep77
Price
Hroom
Trunk
Rseat
Length
Weight
Displa
Turn
Turn
Dis
plaW
eig
ht
LengthR
seat
Tru
nk
Hro
om
Pric
e Rep77
Rep78
MP
G
Gra
tio
“Corrgrams:Exploratorydisplaysforcorrelationmatrices”(Friendly,2002)
SPIDA20045MichaelFriendly
Loglinear Models for Categorical Data overview2
Comparisons— Make visual comparisons easy
• Visual grouping— connect with lines, make key comparisons contiguous
• Baselines— compare data to model against a line, preferably horizontal
Fre
quen
cy
0
25
50
75
100
125
150
175
Number of Occurrences
0 1 2 3 4 5 6
Sqr
t(fr
eque
ncy)
-2
0
2
4
6
8
10
12
Number of Occurrences
0 1 2 3 4 5 6
SPIDA 2004 6 Michael Friendly
Loglinear Models for Categorical Data overview2
Small multiples— combine stratified graphs into coherent displays (Tufte, 1983)
e.g., scatterplot matrix for quantitative data: all pairwise scatterplots
Prestige
14.8
87.2
Educ
6.38
15.97
Income
611
25879
Women
0
97.51
SPIDA 2004 7 Michael Friendly
Loglinear Models for Categorical Data overview2
e.g., mosaic matrix for quantitative data: all pairwise mosaic plots
Admit
Male Female
Adm
it
Reje
ct
A B C D E F
Adm
it
Reje
ct
Admit Reject
Male
Fem
ale
Gender
A B C D E F
Male
Fem
ale
Admit Reject
A B
C
D
E
F
Male Female
A B
C
D
E
F
Dept
SPIDA 2004 8 Michael Friendly
Loglinear Models for Categorical Data prelim
Loglinear models: Preliminaries
Categorical data:
Variables with a finite number of discrete values
• Binary: pass/fail, live/die, vote yes/vote no
• Nominal: unordered categories
• Religion: Christian, Jewish, Muslim, ...
• Party preference: Conservative, Liberal, NDP, Bloc Quebecois, Green
Party
• Ordinal: ordered categories
• Education: Primary, Secondary, University, Post-graduate
• Income ranges: 0-20, 20-30, 30-40, ...
• Likert scale: strongly disagree · · · strongly agree
• Interval:
• Counts: number of children– 0, 1, 2, ...
SPIDA 2004 9 Michael Friendly
Loglinear Models for Categorical Data prelim
Loglinear models: Preliminaries
Data representations:
Case form: One row per case (observation), one column per variable—
standard data table
(Quantitative data is always represented in case form)
Table form: A set of n variables represented as an n-way table
Table 1: Admissions to Berkeley graduate programs
Admitted Rejected Total
Males 1198 1493 2691
Females 557 1278 1835
Total 1755 2771 4526
SPIDA 2004 10 Michael Friendly
Loglinear Models for Categorical Data prelim
3+-way tables
Table 2: Berkeley data: 3-way table
Male Female
Dept Admit Reject Admit Reject
A 512 313 89 19
B 353 207 17 8
C 120 205 202 391
D 138 279 131 244
E 53 138 94 299
F 22 351 24 317
SPIDA 2004 11 Michael Friendly
Loglinear Models for Categorical Data prelim
Frequency form: One row per combination of variable categories
1 Obs Admit Gender Dept count2
3 1 Admitted Male A 5124 2 Admitted Male B 3535 3 Admitted Male C 1206 4 Admitted Male D 1387 5 Admitted Male E 538 6 Admitted Male F 229 7 Admitted Female A 89
10 8 Admitted Female B 1711 9 Admitted Female C 20212 10 Admitted Female D 13113 11 Admitted Female E 9414 12 Admitted Female F 2415 ... ... ... ... ...16 23 Rejected Female E 29917 24 Rejected Female F 317
SPIDA 2004 12 Michael Friendly
Loglinear Models for Categorical Data prelim
Loglinear models: Preliminaries
Data transformations:
All SAS procedures take input in case form.
Most SAS procedures take input in frequency form— need to use a FREQ or
WEIGHT statement to identify the frequency (count) variable.
Categorical variables can be represented by numeric or character values
• Character variables are ordered alphabetically by default— often not what
you want (e.g., Low, Medium, High → High, Low, Medium)
• Often easiest to use numeric values, and create a SAS format to attach value
labels:1 proc format;2 value inc 1=’Low’ 2=’Medium’ 3=’High’;3 proc freq;4 tables educ * income;5 format income inc.;
SPIDA 2004 13 Michael Friendly
Loglinear Models for Categorical Data prelim
Recoding categories
• Categories can be collapsed (frequencies summed) using SAS formats1 proc format;2 value edu 1-8=’Primary’ 9-12=’Secondary’3 12-16=’University’ 16-high=’Post-grad’;4 proc freq;5 tables educ * income;6 format educ edu. income inc.;
Reordering categories
• Many SAS procedures allow several different ways to order the categories of
a discrete variable
• ORDER=INTERNAL— by numeric or character value (default).
• ORDER=DATA— by order of appearance in the dataset.
• ORDER=FORMATTED— by order of formatted value.
• ORDER=FREQ— by order of frequency of occurrence.
•The table macro
• Collapse or recode variables
SPIDA 2004 14 Michael Friendly
Loglinear Models for Categorical Data overview
Loglinear models: Overview
Related perspectivesLoglinear models can be developed as an analog of classical ANOVA andregression models, where multiplicative relations (under independence) arere-expressed in additive form as models for log(frequency).Alternatively, but somewhat more generally, loglinear models are also GLMs forlog(frequency), with a Poisson distribution for the cell counts.
Two-way tables: Loglinear approachFor two discrete variables, A and B, suppose a multinomial sample of totalsize n over the IJ cells of a two-way I × J contingency table, with cellfrequencies nij , and cell probabilities πij = nij/n.• The table variables are statistically independent when the cell (joint)
probability equals the product of the marginal probabilities,Pr(A = i & B = j) = Pr(A = i) × Pr(B = j), or,
πij = πi+π+j .
• An equivalent model in terms of expected frequencies, mij = nπij is
mij = n mi+ m+j .
SPIDA 2004 15 Michael Friendly
Loglinear Models for Categorical Data overview
• This multiplicative model can be expressed in additive form as a model for
log mij ,
log mij = log n + log mi+ + log m+j . (1)
• By anology with ANOVA models, the independence model (1) can be
expressed as
log mij = µ + λAi + λB
j , (2)
where µ is the grand mean of log mij and the parameters λAi and λB
j
express the marginal frequencies of variables A and B, and are typically
defined so that∑
i λAi =
∑j λB
j = 0.
Dependence between the table variables is expressed by adding association
parameters, λABij , giving the saturated model,
log mij = µ + λAi + λB
j + λABij . (3)
• The saturated model fits the table perfectly (mij = nij ): there are as many
parameters as cell frequencies.
• A global test for association tests H0 : λABij = 0.
• For ordinal variables, the λABij may be structured more simply, giving tests
for ordinal association.
SPIDA 2004 16 Michael Friendly
Loglinear Models for Categorical Data overview
Two-way tables: GLM approach
In the GLM approach, the vector of cell frequencies, n = {nij} is specified to
have a Poisson distribution with means m = {mij} given by
log m = Xβ
where X is a known design (model) matrix and β is a column vector
containing the unknown λ parameters.
For example, for a 2 × 2 table, the saturated model (3) with the usual zero-sum
constraints can be represented as
log m11
log m12
log m21
log m22
=
1 1 1 1
1 1 −1 −1
1 −1 1 −1
1 −1 −1 1
µ
λA1
λB1
λAB11
Note that only the linearly independent parameters are represented.
λA2 = −λA
1 , because λA1 + λA
2 = 0, and so forth.
An advantage of the GLM formulation is that it makes it somewhat easier to
express models with ordinal or quantitative variables.
SPIDA 2004 17 Michael Friendly
Loglinear Models for Categorical Data overview
Three-way Tables
Saturated model: For a 3-way table, of size I × J × K for variables
A, B, C , the saturated loglinear model includes associations between all pairs of
variables, as well as a 3-way association term, λABCijk
log mijk = µ + λAi + λB
j + λCk + λAB
ij + λACik + λBC
jk + λABCijk . (4)
As before, one-way terms (λAi , λB
j , λCk ) pertain to differences in the marginal
frequencies of the table variables.
Two-way terms (λABij , λAC
ik , λBCjk ) pertain to the partial association for each
pair of variables, controlling for the remaining variable.
The three-way term, λABCijk allows the partial association between any pair of
variables to vary over the categories of the third variable.
Such models are usually hierarchical : the presence of a high-order term, such
as λABCijk implies that all low-order relatives are automatically included.
Thus, a short-hand notation for a loglinear model lists only the high-order
terms, i.e., model (4) ≡ [ABC]
SPIDA 2004 18 Michael Friendly
Loglinear Models for Categorical Data overview
Reduced models: The usual goal is to fit the smallest model (fewest
high-order terms) that is sufficient to explain/describe the observed frequencies.
All representative subset models are shown in the table below
Table 3: Log-linear Models for Three-Way Tables
Model Model symbol Interpretation
Mutual independence [A][B][C] A ⊥ B ⊥ C
Joint independence [AB][C] (A B) ⊥ C
Conditional independence [AC][BC] (A ⊥ B) |CAll two-way associations [AB][AC][BC] homogeneous association
Saturated model [ABC] interaction
e.g.,
[AB][C] ≡ log mijk = µ + λAi + λB
j + λCk + λAB
ij
SPIDA 2004 19 Michael Friendly
Loglinear Models for Categorical Data overview
Assessing goodness of fit
Goodness of fit of a specified model may be tested by the likelihood ratio G2,
G2 = 2∑
i
ni log(ni/mi) , (5)
or the Pearson χ2,
χ2 =∑
i
(ni − mi)2
mi, (6)
with degrees of freedom = # cells - # estimated parameters.
E.g., for the model of mutual independence, [A][B][C], df =
IJK − (I − 1) − (J − 1) − (K − 1) = (I − 1)(J − 1)(K − 1)The terms summed in (5) and (6) are the squared cell residuals
Other measures of balance goodness of fit against parsimony, e.g., Akaike’s
Information Criterion (smaller is better)
AIC = G2 − 2df or AIC = G2 + 2 # parameters
SPIDA 2004 20 Michael Friendly
Loglinear Models for Categorical Data vismethods
Visualizing Contingency tables
Two-way tables
2 × 2 tables — Visualize odds ratio (FFOLD macro)2 × 2 × k tables — Homogeneity of associationr × 3 tables — Trilinear plots (TRIPLOT macro)r × c tables — Visualize association (SIEVE program)r × c tables — Visualize association (MOSAIC macro)Square r × r tables — Visualize agreement (AGREE program)
n-way tables
Fit loglinear models, visualize lack-of-fit — (MOSAIC macro)Test & visualize partial association — (MOSAIC macro)Visualize pairwise association — (MOSMAT macro)Visualize conditional association — (MOSMAT macro)Visualize loglinear structure — (MOSMAT macro)
Correspondence analysis and MCA — (CORRESP macro)
SPIDA 2004 21 Michael Friendly
Loglinear Models for Categorical Data 2by2
2 × 2 tables
2 × 2 tables provide the simplest context to illustrate loglinear models and theirrelations to other tests and modeling techniques.
Example: Bickel et al. (1975): admissions to graduate depatments at U. C.Berkeley in 1973— Aggregate data for the six largest departments:
Table 4: Admissions to Berkeley graduate programs
Admitted Rejected Total % Admitted
Males 1198 1493 2691 44.52
Females 557 1278 1835 30.35
Total 1755 2771 4526 38.78
Equivalent questions:
Proportions: Are the proportions of males and females admitted the same?
• Method: Test of independent proportions, H0 : p1 = p2
• Answer:z = p1−p2√
p(1−p)(1/n1+1/n2)= .4452−.3035√
.3878(.6163)(1/2691+1/1835)= 9.62
SPIDA 2004 22 Michael Friendly
Loglinear Models for Categorical Data 2by2
2 × 2 tables
Odds: Are the odds of admission the same for males and females?
• Method: Odds ratio, H0 : θ = Odds(Admit |Male)
Odds(Admit |Female)= 1
• Answer: θ = Odds(Admit |Male)
Odds(Admit |Female)= 1198/1493
557/1276 = 1.84
• → Males 84% more likely to be admitted.
Is admission independent of gender?
• Method: χ2 test of independence, H0 : pij = pi·p·j
• Answer: χ2(1) =
∑ ∑ (nij−mij)2
mij= 92.21, p < 0.0001
Does cell frequency depend only on marginal frequencies of gender and
admission?
• Method: Loglinear model, log(mij) = µ + λAi + λG
j + λAGij
• Answer: H0 : λAGij = 0 LR G2
(1) = 93.45, p < 0.0001
SPIDA 2004 23 Michael Friendly
Loglinear Models for Categorical Data 2by2
Standard analysis: PROC FREQ
proc freq data=berkeley;weight freq;tables gender*admit / chisq;
Output:
Statistics for Table of gender by admit
Statistic DF Value Prob------------------------------------------------------Chi-Square 1 92.2053 <.0001Likelihood Ratio Chi-Square 1 93.4494 <.0001Continuity Adj. Chi-Square 1 91.6096 <.0001Mantel-Haenszel Chi-Square 1 92.1849 <.0001Phi Coefficient 0.1427
How to visualize and interpret?
SPIDA 2004 24 Michael Friendly
Loglinear Models for Categorical Data 2by2
Fourfold displays for 2 × 2 tables
Quarter circles: radius ∼ √nij ⇒ area ∼ frequency
Independence: Adjoining quadrants ≈ alignOdds ratio: ratio of areas of diagonally opposite cellsConfidence rings: Visual test of H0 : θ = 1 ↔ adjoining rings overlap
Sex: Male
Adm
it?: Y
es
Sex: Female
Adm
it?: N
o
1198 1493
557 1278
Confidence rings do not overlap: θ 6= 1
SPIDA 2004 25 Michael Friendly
Loglinear Models for Categorical Data 2by2
Fourfold displays for 2 × 2 × k tables
Data in Table 4 had been pooled over departmentsStratified analysis: one fourfold display for each departmentEach 2 × 2 table standardized to equate marginal frequenciesShading: highlight departments for which Ha : θi 6= 1
Sex: Male
Ad
mit?
: Y
es
Sex: Female
Ad
mit?
: N
o
512 313
89 19
Department: A Sex: Male
Ad
mit?
: Y
es
Sex: Female
Ad
mit?
: N
o
353 207
17 8
Department: B Sex: Male
Ad
mit?
: Y
es
Sex: Female
Ad
mit?
: N
o
120 205
202 391
Department: C
Sex: Male
Ad
mit?
: Y
es
Sex: Female
Ad
mit?
: N
o
138 279
131 244
Department: D Sex: Male
Ad
mit?
: Y
es
Sex: Female
Ad
mit?
: N
o
53 138
94 299
Department: E Sex: Male
Ad
mit?
: Y
es
Sex: Female
Ad
mit?
: N
o
22 351
24 317
Department: F
Only one department (A) shows association; θA = 0.349 → women(0.349)−1 = 2.86 times as likely as men to be admitted.
SPIDA 2004 26 Michael Friendly
Loglinear Models for Categorical Data 2by2
What happened here?
Simpson’s paradox:
Aggregate data are misleading because they falsely assume men and women
apply equally in each field.
But:
Large differences in admission rates across departments.
Men and women apply to these departments differentially.
Women applied in large numbers to departments with low admission rates.
(This ignores possibility of structural bias against women: differential funding of
fields to which women are more likely to apply.)
Other graphical methods can show these effects.
SPIDA 2004 27 Michael Friendly
Loglinear Models for Categorical Data 2by2
The FOURFOLD program and the FFOLD macro
The FOURFOLD program is written in SAS/IML.The FFOLD macro provides a simpler interface.Printed output: (a) significance tests for individual odds ratios, (b) tests ofhomogeneity of association (here, over departments) and (c) conditionalassociation (controlling for department).
Plot by department:berk4f.sas
1 %include catdata(berkeley);2
3 %ffold(data=berkeley,4 var=Admit Gender, /* panel variables */5 by=Dept, /* stratify by dept */6 down=2, across=3, /* panel arrangement */7 htext=2); /* font size */
Aggregate data: first sum over departments, using the TABLE macro:
8 %table(data=berkeley, out=berk2,9 var=Admit Gender, /* omit dept */
10 weight=count, /* frequency variable */11 order=data);12 %ffold(data=berk2, var=Admit Gender);
SPIDA 2004 28 Michael Friendly
Loglinear Models for Categorical Data 2by2
The FOURFOLD program and the FFOLD macro
Printed output:
Odds (Admitted|Male) / (Admitted|Female)
Odds Ratio Log Odds SE(LogOdds) Z Pr>|Z|
A 0.349 -1.052 0.263 -4.005 0.000B 0.803 -0.220 0.438 -0.503 0.615C 1.133 0.125 0.144 0.868 0.385D 0.921 -0.082 0.150 -0.546 0.585E 1.222 0.200 0.200 1.000 0.317F 0.828 -0.189 0.305 -0.619 0.536
SPIDA 2004 29 Michael Friendly
Loglinear Models for Categorical Data 2by2
For a 2 × 2 × k table, additional tests are also of interest:
Homogeneity of odds ratios: H0 : θ1 = θ2 = · · · θk (This is equivalent to theloglinear model of no 3-way association, [AG][AD][DG].)
Conditional independence: Admit ⊥ Gender |Dept
Test of Homogeneity of Odds Ratios (no 3-Way Association)TEST CHISQ DF PROB
Homogeneity of Odds Ratios 20.204 5 0.0011
Conditional Independence of Admit and Gender | Dept(assuming Homogeneity)
TEST CHISQ DF PROB
Likelihood-Ratio 1.531 1 0.2159Cochran-Mantel-Haenszel 1.429 1 0.2320
SPIDA 2004 30 Michael Friendly
Loglinear Models for Categorical Data twoway
Two-way frequency tables
Table 5: Hair-color eye-color data
Eye Hair Color
Color Black Brown Red Blond Total
Green 5 29 14 16 64
Hazel 15 54 14 10 93
Blue 20 84 17 94 215
Brown 68 119 26 7 220
Total 108 286 71 127 592
SPIDA 2004 31 Michael Friendly
Loglinear Models for Categorical Data twoway
Two-way frequency tables: PROC FREQ
proc freq data=haireye;weight count;tables Eye * HAIR / chisq;
Output:
Statistics for Table of Eye by HAIR
Statistic DF Value Prob------------------------------------------------------Chi-Square 9 138.2898 <.0001Likelihood Ratio Chi-Square 9 146.4436 <.0001Mantel-Haenszel Chi-Square 1 28.2923 <.0001Phi Coefficient 0.4833
SPIDA 2004 32 Michael Friendly
Loglinear Models for Categorical Data twoway
Two-way frequency tables: PROC GENMOD
We can fit the same model as a Poisson GLM for count:
proc genmod data=haireye;class hair eye;model count = hair eye hair*eye / dist=poisson type3;
Output:
LR Statistics For Type 3 Analysis
Chi-Source DF Square Pr > ChiSq
hair 3 143.46 <.0001eye 3 58.57 <.0001hair*eye 9 146.44 <.0001
How to visualize and interpret?
SPIDA 2004 33 Michael Friendly
Loglinear Models for Categorical Data twoway
Two-way frequency tables: Sieve diagrams
count ∼ areaWhen row/col variables are independent, nij ∼ ni+n+j
⇒ each cell can be represented as a rectangle, with area = height × width ∼frequency, nij
Green
Hazel
Blue
Brown
Black Brown Red Blond
Eye
Co
lor
Hair Color
Expected frequencies: Hair Eye Color Data11.7 30.9 7.7 13.7
17.0 44.9 11.2 20.0
39.2 103.9 25.8 46.1
40.1 106.3 26.4 47.2
64
93
215
220
108 286 71 127 592
SPIDA 2004 34 Michael Friendly
Loglinear Models for Categorical Data twoway
Sieve diagrams
Height/width ∼ marginal frequencies, ni+, n+j
Area ∼ expected frequency, ∼ ni+n+j
Shading ∼ observed frequency, nij , color: sign(nij − mij).Independence: Shown when density of shading is uniform.
Green
Hazel
Blue
Brown
Black Brown Red Blond
Eye
Co
lor
Hair Color
5 29 14 16
15 54 14 10
20 84 17 94
68 119 26 7
SPIDA 2004 35 Michael Friendly
Loglinear Models for Categorical Data twoway
Sieve diagrams
Effect ordering: Reorder rows/cols to make the pattern coherent
Blue
Green
Hazel
Brown
Black Brown Red Blond
Eye
Co
lor
Hair Color
20 84 17 94
5 29 14 16
15 54 14 10
68 119 26 7
SPIDA 2004 36 Michael Friendly
Loglinear Models for Categorical Data twoway
Sieve diagrams
Vision classification data for 7477 women
High
2
3
Low
High 2 3 Low
Rig
ht
Eye G
rad
e
Left Eye Grade
Unaided distant vision data
SPIDA 2004 37 Michael Friendly
Loglinear Models for Categorical Data nway2
Mosaic displays and Log-linear Models
Hartigan and Kleiner (1981), Friendly (1994, 1999):
Width ∼ one set of marginals, ni+
Height ∼ relative proportions of other variable, pj | i = nij/ni+
⇒ area ∼ frequency, nij = ni+pj | i
1198
1493
557
1278
Male Female
Adm
itte
dR
eje
cte
dModel: (Gender)(Admit)
Sta
ndard
ized
resid
uals
:
<-4
-4:-
2-2
:-0
0:2
2:4
>4
SPIDA 2004 38 Michael Friendly
Loglinear Models for Categorical Data nway2
Shading: Sign and magnitude of Pearson χ2 residual,dij = (nij − mij)/
√mij (or L.R. G2)
Sign: − negative in red; + positive in blueMagnitude: intensity of shading: |dij | > 0, 2, 4, . . .
Independence: Rows ≈ align, or cells are empty!
E.g., aggregate Berkeley data, independence model:
1198
1493
557
1278
Male Female
Adm
itte
dR
eje
cte
dModel: (Gender)(Admit)
Sta
ndard
ized
resid
uals
:
<-4
-4:-
2-2
:-0
0:2
2:4
>4
SPIDA 2004 39 Michael Friendly
Loglinear Models for Categorical Data nway2
Mosaic displays
Departments × Gender:
Did departments differ in the total number of applicants?
Did men and women apply differentially to departments?
A
B
C
D
E
F
Male Female
Model: (Dept)(Gender)
Model [Dept] [Gender]: G2(5) =
1220.6.Note: Departments ordered A–F byoverall rate of admission.
SPIDA 2004 40 Michael Friendly
Loglinear Models for Categorical Data nway2
Mosaic displays: Hair color and eye color
4.4
-3.1
2.3
-5.9
-2.2
7.0
Black Brown Red Blond
Bro
wn
Hazel G
reen B
lue
Dark hair goes with dark eyes, light
hair with light eyes
Red hair, hazel eyes an exception?
Effect ordering: Rows/cols permuted
by CA Dimension 1
SPIDA 2004 41 Michael Friendly
Loglinear Models for Categorical Data nway2
Mosaic displays for multiway tables
Generalizes to n-way tables: divide cells recursively
Can fit any log-linear model (e.g., 3-way),
Table 6: Log-linear Models for Three-Way Tables
Model Model symbol Independence interpretation
Mutual independence [A][B][C] A ⊥ B ⊥ C
Joint independence [AB][C] (A B) ⊥ C
Conditional independence [AC][BC] (A ⊥ B) |CAll two-way associations [AB][AC][BC] (none)
Saturated model [ABC] (none)
e.g., the model for conditional independence (A ⊥ C |B):
[AB][BC] ≡ log mijk = µ + λAi + λB
j + λCk + λAB
ij + λBCjk
Each mosaics shows:
DATA (size of tiles)(some) marginal frequencies (spacing → visual grouping)RESIDUALS (shading) — what associations have been omitted?
SPIDA 2004 42 Michael Friendly
Loglinear Models for Categorical Data nway2
E.g., Joint independence, [DG][A] (null model, Admit as response) [G2(11) = 877.1]:
A
B
C
D
E
F
Male Female Admitted Rejected
Model: (DeptGender)(Admit)
SPIDA 2004 43 Michael Friendly
Loglinear Models for Categorical Data nway2
Mosaic displays for multiway tables
Visual fitting:
Pattern of lack-of-fit (residuals) → “better” model— smaller residuals“cleaning the mosaic” → “better” model— empty cellsbest done interactively!
-4.2 4.2 4.2 -4.2A
B
C
D
E
F
Male Female Admitted Rejected
Model: (DeptGender)(DeptAdmit)
E.g., Add [Dept Admit] association →Conditional independence:
Fits poorly, overall (G2(6) = 21.74)
But, only in Department A!
SPIDA 2004 44 Michael Friendly
Loglinear Models for Categorical Data nway2
Sequential plots and models
Mosaic for an n-way table → hierarchical decomposition of association in a way
analogous to sequential fitting in regression
Joint cell probabilities are decomposed as
pijk`··· =
{v1v2}︷ ︸︸ ︷pi × pj|i × pk|ij︸ ︷︷ ︸
{v1v2v3}
× p`|ijk × · · · × pn|ijk···
First 2 terms → mosaic for v1 and v2
First 3 terms → mosaic for v1, v2 and v3
· · ·Sequential models of joint independence → additive decomposition of the total
association, G2[v1][v2]...[vp] (mutual independence),
G2[v1][v2]...[vp] = G2
[v1][v2]+G2
[v1v2][v3]+G2
[v1v2v3][v4]+ · · ·+G2
[v1...vp−1][vp]
SPIDA 2004 45 Michael Friendly
Loglinear Models for Categorical Data nway2
Sequential plots and models: Example
Hair color x Eye color marginal table (ignoring Sex)
Black Brown Red Blond
Bro
wn
H
azel
Gre
en
Blu
e
(Hair)(Eye), G2 (9) = 146.44
SPIDA 2004 46 Michael Friendly
Loglinear Models for Categorical Data nway2
Sequential plots and models: Example
3-way table, Joint Independence Model [Hair Eye] [Sex]
Black Brown Red Blond
Bro
wn
H
azel
Gre
en
Blu
e
M F
(HairEye)(Sex), G2 (15) = 19.86
SPIDA 2004 47 Michael Friendly
Loglinear Models for Categorical Data nway2
Sequential plots and models: Example
3-way table, Mutual Independence Model [Hair] [Eye] [Sex]
Black Brown Red Blond
Bro
wn
H
azel
Gre
en
Blu
e
M F
(Hair)(Eye)(Sex), G2 (24) = 166.30
SPIDA 2004 48 Michael Friendly
Loglinear Models for Categorical Data nway2
Sequential plots and models: Example
Black Brown Red Blond
Bro
wn
H
azel
Gre
en
Blu
e
(Hair)(Eye), G2 (9) = 146.44
[Hair] [Eye]
G2(9) = 146.44
+
Black Brown Red Blond
Bro
wn
H
azel
Gre
en
Blu
e
M F
(HairEye)(Sex), G2 (15) = 19.86
[Hair Eye] [Sex]
G2(15) = 19.86
=
Black Brown Red Blond
Bro
wn
H
azel
Gre
en
Blu
e
M F
(Hair)(Eye)(Sex), G2 (24) = 166.30
[Hair] [Eye] [Sex]
G2(24) = 166.60
SPIDA 2004 49 Michael Friendly
Loglinear Models for Categorical Data nway2
Mosaic matrices
Analog of scatterplot matrix for categorical data (Friendly, 1999)
Shows all p(p − 1) pairwise views in a coherent display
Each pairwise mosaic shows bivariate (marginal) relation
Fit: marginal independence
Residuals: show marginal associations
Direct visualization of the “Burt” matrix analyzed in multiple correspondence
analysis for p categorical variables
Hair
Brown Haz Grn Blue
Bla
ck
Bro
wn R
ed Blo
nd
Male Female
Bla
ck
Bro
wn
Red Blo
nd
Black Brown Red Blond
Bro
wn
Haz
Grn
Blu
e
Eye
Male Female
Bro
wn
Haz
Grn
B
lue
Black Brown Red Blond
Male
F
em
ale
Brown Haz Grn Blue
Male
F
em
ale
Sex
SPIDA 2004 50 Michael Friendly
Loglinear Models for Categorical Data nway2
Hair
Brown Haz Grn Blue
Bla
ck
Bro
wn R
ed Blo
nd
Male Female
Bla
ck
Bro
wn
Red Blo
nd
Black Brown Red Blond
Bro
wn
Haz
Grn
Blu
e
Eye
Male Female
Bro
wn
Haz
Grn
B
lue
Black Brown Red Blond
Male
F
em
ale
Brown Haz Grn Blue
Male
F
em
ale
Sex
SPIDA 2004 51 Michael Friendly
Loglinear Models for Categorical Data nway2
Berkeley data:
Admit
Male Female
Adm
it
Rej
ect
A B C D E F
Adm
it
Rej
ect
Admit Reject
Mal
e
Fem
ale
Gender
A B C D E F
Mal
e
Fem
ale
Admit Reject
A
B
C
D
E
F
Male Female
A
B
C
D
E
F
Dept
SPIDA 2004 52 Michael Friendly
Loglinear Models for Categorical Data nway2
Partial association, Partial mosaics
Stratified analysis:How does the association between two (or more) variables vary over levels ofother variables?Mosaic plots for the main variables show partial association at each level of theother variables.E.g., Hair color, Eye color BY Sex ↔ TABLES sex * hair * eye;
2.8
-2.1
-3.3
3.3
Black Brown Red Blond
Bro
wn
Ha
zel
Gre
en
B
lue
Sex: Male
3.5
-2.3 -2.5
-4.9
-2.0
6.4
Black Brown Red Blond
Bro
wn
Ha
zel
Gre
en
Blu
e
Sex: Female
SPIDA 2004 53 Michael Friendly
Loglinear Models for Categorical Data nway2
Partial association, Partial mosaics
Stratified analysis:For models of partial independence, A ⊥ B at each level of (controlling for) CA ⊥ B |Ck, partial G2s add to the overall G2 for conditional independence,
G2A⊥B |C =
∑
k
G2A⊥B |C(k)
Table 7: Partial and Overall conditional tests, Hair ⊥ Eye |Sex
Model df G2 p-value
[Hair][Eye] | Male 9 44.445 0.000
[Hair][Eye] | Female 9 112.233 0.000
[Hair][Eye] | Sex 18 156.668 0.000
SPIDA 2004 54 Michael Friendly
Loglinear Models for Categorical Data nway2
Software for Mosaic Displays
Demonstration web applet:http://www.math.yorku.ca/SCS/Online/mosaics/
Runs the current version of mosaics via a cgi-scriptCan run sample data, upload a data file, enter data in a form.Choose model fitting and display options (not all supported).
SPIDA 2004 55 Michael Friendly
Loglinear Models for Categorical Data mossoft
SPIDA 2004 56 Michael Friendly
Loglinear Models for Categorical Data mossoft
SPIDA 2004 57 Michael Friendly
Loglinear Models for Categorical Data mossoft
SPIDA 2004 58 Michael Friendly
Loglinear Models for Categorical Data mossoft
Software for Mosaic Displays
SAS software & documentation:http://www.math.yorku.ca/SCS/mosaics.htmlhttp://www.math.yorku.ca/SCS/vcd/
Examples: Many in VCD and on web site
SAS/IML modules: mosaics.sas SAS/IML program
Enter frequency table directly in SAS/IML, or read from a SAS dataset.Most flexible:• Select, collapse, reorder, re-label table levels using SAS/IML statements• Specify structural 0s, fit specialized models (e.g., quasi-independence)• Interface to models fit using PROC GENMOD
SPIDA 2004 59 Michael Friendly
Loglinear Models for Categorical Data mossoft
Software for Mosaic Displays
Macro interface: mosaic macro, table macro, mosmat macro
mosaic macroEasiest to use:• Direct input from a SAS dataset• No knowledge of SAS/IML required• Reorder table variables; collapse, reorder table levels with table macro• Convenient interface to partial mosaics (BY=)
table macroCreate frequency table from raw dataCollapse, reorder table categoriesRe-code table categories using SAS formats, e.g., 1=’Male’ 2=’Female’
mosmat macroMosaic matrices— analog of scatterplot matrix (Friendly, 1999)
SPIDA 2004 60 Michael Friendly
Loglinear Models for Categorical Data mossoft
mosaic macro example: Berkeley data
berkeley.sas1 title ’Berkeley Admissions data’;2 proc format;3 value admit 1="Admitted" 0="Rejected" ;4 value dept 1="A" 2="B" 3="C" 4="D" 5="E" 6="F";5 value $sex ’M’=’Male’ ’F’=’Female’;6 data berkeley;7 do dept = 1 to 6;8 do gender = ’M’, ’F’;9 do admit = 1, 0;
10 input freq @@;11 output;12 end; end; end;13 /* -- Male -- - Female- */14 /* Admit Rej Admit Rej */15 datalines;16 512 313 89 19 /* Dept A */17 353 207 17 8 /* B */18 120 205 202 391 /* C */19 138 279 131 244 /* D */20 53 138 94 299 /* E */21 22 351 24 317 /* F */22 ;
SPIDA 2004 61 Michael Friendly
Loglinear Models for Categorical Data mossoft
Data set berkeley:
dept gender admit freq
1 M 1 5121 M 0 3131 F 1 891 F 0 192 M 1 3532 M 0 2072 F 1 172 F 0 83 M 1 1203 M 0 2053 F 1 2023 F 0 3914 M 1 1384 M 0 2794 F 1 1314 F 0 2445 M 1 535 M 0 1385 F 1 945 F 0 2996 M 1 226 M 0 3516 F 1 246 F 0 317
SPIDA 2004 62 Michael Friendly
Loglinear Models for Categorical Data mossoft
mosaic macro example: Berkeley data
mosaic9m.sas1 goptions hsize=7in vsize=7in;2
3 %include catdata(berkeley);4
5 *-- apply character formats to numeric table variables;6 %table(data=berkeley,7 var=Admit Gender Dept,8 weight=freq,9 char=Y, format=admit admit. gender $sex. dept dept.,
10 order=data, out=berkeley);11
12 %mosaic(data=berkeley,13 vorder=Dept Gender Admit, /* reorder variables */14 plots=2:3, /* which plots? */15 fittype=joint, /* fit joint indep. */16 split=H V V, htext=3); /* options */
SPIDA 2004 63 Michael Friendly
Loglinear Models for Categorical Data mossoft
mosaic macro example: Berkeley data
A
B
C
D
E
F
Male Female
Model: (Dept)(Gender)
Two-way, Dept. by Gender
A
B
C
D
E
F
Male Female Admitted Rejected
Model: (DeptGender)(Admit)
Three-way, Dept. by Gender by Admit
SPIDA 2004 64 Michael Friendly
Loglinear Models for Categorical Data mossoft
mosmat macro: Mosaic matrices
mosmat9m.sas1 %include catdata(berkeley);2 %mosmat(data=berkeley,3 vorder=Admit Gender Dept, sort=no);
Admit
Male Female
Adm
it
Reje
ct
A B C D E F
Adm
it
Reje
ct
Admit Reject
Male
Fem
ale
Gender
A B C D E F
Male
Fem
ale
Admit Reject
A B
C
D
E
F
Male Female
A B
C
D
E
F
Dept
SPIDA 2004 65 Michael Friendly
Loglinear Models for Categorical Data mossoft
Partial mosaics
mospart3.sas1 %include catdata(hairdat3s);2 %gdispla(OFF);3 %mosaic(data=haireye,4 vorder=Hair Eye Sex, by=Sex,5 htext=2, cellfill=dev);6 %gdispla(ON);7 %panels(rows=1, cols=2); /* make 2 figs -> 1 */
2.8
-2.1
-3.3
3.3
Black Brown Red Blond
Bro
wn
Ha
zel
Gre
en
B
lue
Sex: Male
3.5
-2.3 -2.5
-4.9
-2.0
6.4
Black Brown Red Blond
Bro
wn
Ha
zel
Gre
en
Blu
e
Sex: Female
SPIDA 2004 66 Michael Friendly
Loglinear Models for Categorical Data corresp
Correspondence analysis and MCA
Correspondence analysis (CA): Analog of PCA for frequency data:
account for maximum % of χ2 in few (2-3) dimensions
finds scores for row (xim) and column (yjm) categories on these dimensions
uses Singular Value Decomposition of residuals from independence,
dij = (nij − mij)/√
mij
dij√n
=M∑
m=1
λm xim yjm
optimal scaling: each pair of scores, xim) and column (yjm, have highest
possible correlation (= λm).
plots of the row (xim) and column (yjm) scores show associations
MCA: Extends CA to n-way tables, but only uses bivariate associations (like
mosaic matrix)
SPIDA 2004 67 Michael Friendly
Loglinear Models for Categorical Data corresp
Hair color, Eye color data:
Brown Blue
Hazel
Green
BLACK
BROWN
RED
BLOND
* Eye color * HAIR COLORD
im 2
(9.
5%)
-0.5
0.0
0.5
Dim 1 (89.4%)
-1.0 -0.5 0.0 0.5 1.0
Interpretation: row/column points “near” each other are positively associated
Dim 1: 89.4% of χ2 (dark ↔ light)
Dim 2: 9.5% of χ2 (RED/Green vs. others)
SPIDA 2004 68 Michael Friendly
Loglinear Models for Categorical Data corresp
PROC CORRESP and the CORRESP macro
Two forms of input dataset:dataset in contingency table form – column variables are levels of one factor,observations (rows) are levels of the other.
Obs Eye BLACK BROWN RED BLOND
1 Brown 68 119 26 72 Blue 20 84 17 943 Hazel 15 54 14 104 Green 5 29 14 16
Raw category responses (case form), or cell frequencies (frequency form),classified by 2 or more factors (e.g., output from PROC FREQ)
Obs Eye HAIR Count
1 Brown BLACK 682 Brown BROWN 1193 Brown RED 264 Brown BLOND 7...
15 Green RED 1416 Green BLOND 16
SPIDA 2004 69 Michael Friendly
Loglinear Models for Categorical Data corresp
PROC CORRESP and the CORRESP macro
PROC CORRESP
Handles 2-way CA, extensions to n-way tables, and MCA
Many options for scaling row/column coordinates and output statistics
OUTC= option → output dataset for plotting (PROC CORRESP doesn’t do plots
itself)
CORRESP macro
Uses PROC CORRESP for analysis
Produces labeled plots of the category points in either 2 or 3 dimensions
Many graphic options; can equate axes automatically
See: http://www.math.yorku.ca/vcd/corresp.html
SPIDA 2004 70 Michael Friendly
Loglinear Models for Categorical Data corresp
Example: Hair and Eye Color
Input the data in contingency table form
corresp2a.sas · · ·
1 data haireye;2 input EYE $ BLACK BROWN RED BLOND ;3 datalines;4 Brown 68 119 26 75 Blue 20 84 17 946 Hazel 15 54 14 107 Green 5 29 14 168 ;
SPIDA 2004 71 Michael Friendly
Loglinear Models for Categorical Data corresp
Example: Hair and Eye Color
Using PROC CORRESP directly— labeled printer plot
proc corresp data=haireye outc=coord short;id eye; /* row variable */var black brown red blond; /* col variables */
proc plot data=coord vtoh=2; /* plot step */plot dim2 * dim1 = ’*’ $eye/ box haxis=by .1 vaxis=by .1; /* plot options */
Using the CORRESP macro— labeled high-res plot
%corresp (data=haireye,id=eye, /* row variable */var=black brown red blond, /* col variables */dimlab=Dim); /* options */
SPIDA 2004 72 Michael Friendly
Loglinear Models for Categorical Data corresp
Example: Hair and Eye Color
Printed output:
The Correspondence Analysis Procedure
Inertia and Chi-Square Decomposition
Singular Principal Chi-Values Inertias Squares Percents 18 36 54 72 90
----+----+----+----+----+---0.45692 0.20877 123.593 89.37% *************************0.14909 0.02223 13.158 9.51% ***0.05097 0.00260 1.538 1.11%
------- -------0.23360 138.29 (Degrees of Freedom = 9)
Row CoordinatesDim1 Dim2
Brown -.492158 -.088322Blue 0.547414 -.082954Hazel -.212597 0.167391Green 0.161753 0.339040
Column CoordinatesDim1 Dim2
BLACK -.504562 -.214820BROWN -.148253 0.032666RED -.129523 0.319642BLOND 0.835348 -.069579
SPIDA 2004 73 Michael Friendly
Loglinear Models for Categorical Data corresp
Example: Hair and Eye Color
Output dataset(selected variables):
Obs _TYPE_ EYE DIM1 DIM2
1 INERTIA . .2 OBS Brown -0.49216 -0.088323 OBS Blue 0.54741 -0.082954 OBS Hazel -0.21260 0.167395 OBS Green 0.16175 0.339046 VAR BLACK -0.50456 -0.214827 VAR BROWN -0.14825 0.032678 VAR RED -0.12952 0.319649 VAR BLOND 0.83535 -0.06958
Row and column points are distinguished by the _TYPE_ variable: OBS vs. VAR
SPIDA 2004 74 Michael Friendly
Loglinear Models for Categorical Data corresp
Example: Hair and Eye Color
Graphic output from CORRESP macro:
Brown Blue
Hazel
Green
BLACK
BROWN
RED
BLOND
* Eye color * HAIR COLOR
Dim
2 (
9.5%
)
-0.5
0.0
0.5
Dim 1 (89.4%)
-1.0 -0.5 0.0 0.5 1.0
Top legend produced with Annotate data set and the INANNO= option to theCORRESP macro
SPIDA 2004 75 Michael Friendly
Loglinear Models for Categorical Data corresp
Multi-way tables
Stacking approach: van der Heijden and de Leeuw (1985)—
three-way table, of size I × J × K can be sliced and stacked as a two-waytable, of size (I × J) × K
K
J
I
I x J x K table
K
J
J
J
(I J )x K table
The variables combined are treated “interactively”Each way of stacking corresponds to a loglinear model• (I × J) × K → [AB][C]• I × (J × K) → [A][BC]• J × (I × K) → [B][AC]
SPIDA 2004 76 Michael Friendly
Loglinear Models for Categorical Data corresp
Multi-way tables: Stacking
PROC CORRESP: Use TABLES statement and option CROSS=ROW orCROSS=COL. E.g., for model [A B] [C],
proc corresp cross=row;tables A B, C;weight count;
CORRESP macro: Can use / instead of ,
%corresp(options=cross=row,tables=A B/ C,weight count);
SPIDA 2004 77 Michael Friendly
Loglinear Models for Categorical Data corresp
Example: Suicide Rates
Suicide rates in West Germany, by Age, Sex and Method of suicide
Sex Age POISON GAS HANG DROWN GUN JUMP
M 10-20 1160 335 1524 67 512 189M 25-35 2823 883 2751 213 852 366M 40-50 2465 625 3936 247 875 244M 55-65 1531 201 3581 207 477 273M 70-90 938 45 2948 212 229 268
F 10-20 921 40 212 30 25 131F 25-35 1672 113 575 139 64 276F 40-50 2224 91 1481 354 52 327F 55-65 2283 45 2014 679 29 388F 70-90 1548 29 1355 501 3 383
CA of the [Age Sex] by [Method] table:
Shows associations between the Age-Sex combinations and Method
Ignores association between Age and Sex
SPIDA 2004 78 Michael Friendly
Loglinear Models for Categorical Data corresp
Example: Suicide Rates
suicide5.sas · · ·
1 %include catdata(suicide);2 *-- equate axes!;3 axis1 order=(-.7 to .7 by .7) length=6.5 in label=(a=90 r=0);4 axis2 order=(-.7 to .7 by .7) length=6.5 in;5 %corresp(data=suicide, weight=count,6 tables=%str(age sex, method),7 options=cross=row short,8 vaxis=axis1, haxis=axis2);
Output:
Inertia and Chi-Square Decomposition
Singular Principal Chi-Values Inertias Squares Percents 12 24 36 48 60
----+----+----+----+----+---0.32138 0.10328 5056.91 60.41% *************************0.23736 0.05634 2758.41 32.95% **************0.09378 0.00879 430.55 5.14% **0.04171 0.00174 85.17 1.02%0.02867 0.00082 40.24 0.48%
------- -------0.17098 8371.28 (Degrees of Freedom = 45)
SPIDA 2004 79 Michael Friendly
Loglinear Models for Categorical Data corresp
10-20 * F
10-20 * M
25-35 * F25-35 * M
40-50 * F 40-50 * M
55-65 * F
55-65 * M
70-90 * F
70-90 * M
Drown
Gas
Gun
Hang
Jump
Poison
Dim
en
sio
n 2
(3
3.0
%)
-0.7
0.0
0.7
Dimension 1 (60.4%)-0.7 0.0 0.7
SPIDA 2004 80 Michael Friendly
Loglinear Models for Categorical Data corresp
Compare with mosaic display:
MA
LE
F
EM
AL
E
10-20 25-35 40-50 55-65 >65
Gu
n
Ga
s
Ha
ng
P
ois
on
Ju
mp
Dro
wn
Suicide data - Model (SexAge)(Method)
Sta
nd
ard
ize
dre
sid
ua
ls:
<
-8-8
:-4
-4:-
2-2
:-0
0:2
2:4
4:8
>8
SPIDA 2004 81 Michael Friendly
Loglinear Models for Categorical Data structured
More structured tables
Ordered categories
Tables with ordered categories allow more parsimonious representations of
association
Can represent λABij by a small number of parameters
→ more focused and more powerful tests of lack of independence
Square tables
For square I × I tables, where row and column variables have the same
categories:
• Can ignore diagonal cells, where association is expected and test remaining
association (quasi-independence)
• Can test whether association is symmetric around the diagonal cells.
SPIDA 2004 82 Michael Friendly
Loglinear Models for Categorical Data structured
Ordered categories
Ordinal scores
In many cases it may be reasonable to assign numeric scores, {ai} to an
ordinal row variable and/or numeric scores, {bi} to an ordinal column variable.
Typically, scores are equally spaced and sum to zero, {ai} = i − (I + 1)/2,
e.g., {ai} = {−1, 0, 1} for I=3.
Linear-by-Linear (Uniform) Association: When both variables are
ordinal, the simplest model posits that any association is linear in both variables.
λABij = γ aibj
This model adds one additional parameter to the independence model (γ = 0).
It is similar to
For integer scores, the local log odds ratio for any contiguous 2 × 2 table are
all equal, log θij = γThis is a model of uniform association
SPIDA 2004 83 Michael Friendly
Loglinear Models for Categorical Data structured
Row Effects and Column Effects: When only one variable is assigned
scores, we have the row effects model or the column effects model.
E.g., in the row effects model, the row variable (A) is treated as nominal, while
the column variable (B) is assigned ordered scores {bj}.
log mij = µ + λAi + λB
j + αibj
where the row parameters, αi, are defined so they sum to zero.
This model has (I − 1) more parameters than the independence model.
A Row Effects + Column Effects model allows both variables to be ordered, but
not necessarily with linear scores.
Fitting models for ordinal variables
Create numeric variables for category scores
PROC GENMOD: Use as quantitative variables in MODEL statement, but not
listed as CLASS variables
SPIDA 2004 84 Michael Friendly
Loglinear Models for Categorical Data structured
Example: Mental impairment and parents’ SES
Srole et al. (1978) Data on mental health status of ∼1600 young NYC residents in
relation to parents’ SES.
Mental health: Well, mild symptoms, moderate symptoms, Impaired
SES: 1 (High) – 6 (Low)
Mental Parents’ SES
health High 2 3 4 5 Low
1: Well 64 57 57 72 36 21
2: Mild 94 94 105 141 97 71
3: Moderate 58 54 65 77 54 54
4: Impaired 46 40 60 94 78 71
SPIDA 2004 85 Michael Friendly
Loglinear Models for Categorical Data structured
2.6 2.0 0.6 -2.4 -4.0
0.7 -1.2
-1.0 -0.6 1.2
-2.6 -3.0 -1.1 0.5 2.4 3.3
Wel
l
Mild
Mod
erat
eIm
paire
dM
enta
l
High 2 3 4 5 Low SES
Mental Impairment and SES: Independence
1.0 -0.6 -1.6
-1.5 1.0
1.3 0.7 0.5 -1.3 -1.3
0.8 -1.1 0.5
Wel
l
Mild
Mod
erat
eIm
paire
dM
enta
l
High 2 3 4 5 Low SES
Linear x Linear
Visual assessment:
Residuals from the independence model show an opposite-corner pattern. This is
consistent with both:
Linear × linear model: equi-spaced scores for both Mental and SES
Row effects model: equi-spaced scores for SES, ordered scores for Mental
SPIDA 2004 86 Michael Friendly
Loglinear Models for Categorical Data structured
Statistical assesment:
Table 8: Mental health data: Goodness-of-fit statistics for ordinal loglinear models
Model G2 df Pr(> G2) AIC AIC-best
Independence 47.4178 15 0.00003 65.4178 35.5227
Col effects (SES) 6.8293 10 0.74145 34.8293 4.9342
Row effects (mental) 6.2808 12 0.90127 30.2808 0.3856
Lin x Lin 9.8951 14 0.76981 29.8951 0.0000
Both the Row Effects and Linear × linear models are significantly better than the
Independence model
AIC indicates a slight preference for the Linear × linear model
In the Linear × linear model, the estimate of the coefficient of aibj is
γ = 0.0907 = log θ, so θ = exp(0.0907) = 1.095.
⇒ each step down the SES scale increases the odds of being classified one step
poorer in mental health by 9.5%.
SPIDA 2004 87 Michael Friendly
Loglinear Models for Categorical Data structured
Fitting these models with PROC GENMOD:mentgen2.sas
1 %include catdata(mental);2 data mental;3 set mental;4 m_lin = mental; *-- copy m_lin and s_lin for;5 s_lin = ses; *-- use non-CLASS variables;6
7 title ’Independence model’;8 proc genmod data=mental;9 class mental ses;
10 model count = mental ses / dist=poisson obstats residuals;11 format mental mental. ses ses.;12 ods output obstats=obstats;13 %mosaic(data=obstats, vorder=Mental SES, resid=stresdev,14 title=Mental Impairment and SES: Independence, split=H V);
Row Effects model:mentgen2.sas
16 proc genmod data=mental;17 class mental ses;18 model count = mental ses mental*s_lin / dist=poisson obstats;19 ...
Linear × linear model:mentgen2.sas
21 proc genmod data=mental;22 class mental ses;23 model count = mental ses m_lin*s_lin / dist=poisson obstats;
SPIDA 2004 88 Michael Friendly
Loglinear Models for Categorical Data square
Square-ish tables
Tables where two (or more) variables have the same category levels:
Employment categories of related persons (mobility tables)Multiple measurements over time (panel studies; longitudinal data)Repeated measures on the same individuals under different conditionsRelated/repeated measures are rarely independent, but may have simplerforms than general association
E.g., vision data: Left and right eye acuity grade for 7477 women
Hig
h
2
3
Low
Rig
htE
ye
High 2 3 Low LeftEye
Independence, G2(9)=6671.5
SPIDA 2004 89 Michael Friendly
Loglinear Models for Categorical Data square
Quasi-Independence
Related/repeated measures are rarely independent— most observations often
fall on diagonal cells.
Quasi-independence ignores diagonals, tests independence in remaining cells
(λij = 0 for i 6= j).
The model dedicates one parameter (δi) to each diagonal cell, fitting them
exactly,
log mij = µ + λAi + λB
j + δi I(i = j)
where I(·) is the indicator function.
This model may be fit as a GLM by including indicator variables for each
diagonal cell.
DIAG 4 rows 4 cols
1 0 0 00 2 0 00 0 3 00 0 0 4
SPIDA 2004 90 Michael Friendly
Loglinear Models for Categorical Data square
Using PROC GENMOD· · · mosaic10g.sas
1 title ’Quasi-independence model (women)’;2 proc genmod data=women;3 class RightEye LeftEye diag;4 model Count = LeftEye RightEye diag /5 dist=poisson link=log obstats residuals;6 ods output obstats=obstats;7 %mosaic(data=obstats, vorder=RightEye LeftEye, ...);
Mosaic:
Hig
h
2
3
Low
Rig
htE
ye
High 2 3 Low LeftEye
Quasi-Independence, G2(5)=199.1
SPIDA 2004 91 Michael Friendly
Loglinear Models for Categorical Data square
Symmetry
Tests whether the table is symmetric around the diagonal, i.e., mij = mji
As a loglinear model, symmetry is
log mij = µ + λAi + λB
j + λABij ,
subject to the conditions λAi = λB
j and λABij = λAB
ji .
This model may be fit as a GLM by including indicator variables with equal
values for symmetric cells, and indicators for the diagonal cells (fit exactly)
SYMMETRY 4 rows 4 cols)
1 12 13 1412 2 23 2413 23 3 3414 24 34 4
SPIDA 2004 92 Michael Friendly
Loglinear Models for Categorical Data square
Using PROC GENMOD· · · mosaic10g.sas
1 proc genmod data=women;2 class symmetry;3 model Count = symmetry /4 dist=poisson link=log obstats residuals;5 ods output obstats=obstats;6 %mosaic(data=obstats, vorder=RightEye LeftEye, ...);
Mosaic:
Hig
h
2
3
Low
Rig
htE
ye
High 2 3 Low LeftEye
Symmetry, G2(6)=19.25
SPIDA 2004 93 Michael Friendly
Loglinear Models for Categorical Data square
Quasi-Symmetry
Symmetry is often too restrictive: ⇒ equal marginal frequencies (λAi = λB
i )
PROC GENMOD: Use the usual marginal effect parameters + symmetry:· · · mosaic10g.sas
1 proc genmod data=women;2 class LeftEye RightEye symmetry;3 model Count = LeftEye RightEye symmetry /4 dist=poisson link=log obstats residuals;5 ods output obstats=obstats;
Hig
h
2
3
Low
Rig
htE
ye
High 2 3 Low LeftEye
Quasi-Symmetry, G2(3)=7.27
SPIDA 2004 94 Michael Friendly
Loglinear Models for Categorical Data square
Table 9: Summary of models fit to vision data
Model G2 df Pr(> G2) AIC AIC - min(AIC)
Independence 6671.51 9 0.00000 6685.51 6656.23
Linear*Linear 1818.87 8 0.00000 1834.87 1805.59
Row+Column Effects 1710.30 4 0.00000 1734.30 1705.02
Quasi-Independence 199.11 5 0.00000 221.11 191.83
Symmetry 19.25 6 0.00376 39.25 9.97
Quasi-Symmetry 7.27 3 0.06375 33.27 3.99
Ordinal Quasi-Symmetry 7.28 5 0.20061 29.28 0.00
Only the quasi-symmetry models provide an acceptable fit
The ordinal quasi-symmetry model is most parsimonious
SPIDA 2004 95 Michael Friendly
Loglinear Models for Categorical Data loglin
ReferencesBickel, P. J., Hammel, J. W., and O’Connell, J. W. Sex bias in graduate admissions: Data from
Berkeley. Science, 187:398–403, 1975.
Friendly, M. SAS System for Statistical Graphics. SAS Institute, Cary, NC, 1st edition, 1991.
Friendly, M. Mosaic displays for multi-way contingency tables. Journal of the AmericanStatistical Association, 89:190–200, 1994.
Friendly, M. Conceptual and visual models for categorical data. The American Statistician, 49:153–160, 1995.
Friendly, M. Extending mosaic displays: Marginal, conditional, and partial views of categoricaldata. Journal of Computational and Graphical Statistics, 8(3):373–395, 1999.
Friendly, M. Visualizing Categorical Data. SAS Institute, Cary, NC, 2000.
Friendly, M. Corrgrams: Exploratory displays for correlation matrices. The American Statistician,56(4):316–324, 2002.
Friendly, M. and Kwan, E. Effect ordering for data displays. Computational Statistics and DataAnalysis, 43(4):509–539, 2003.
Hartigan, J. A. and Kleiner, B. Mosaics for contingency tables. In Eddy, W. F., editor, ComputerScience and Statistics: Proceedings of the 13th Symposium on the Interface, pp. 268–273.Springer-Verlag, New York, NY, 1981.
Srole, L., Langner, T. S., Michael, S. T., Kirkpatrick, P., Opler, M. K., and Rennie, T. A. C. MentalHealth in the Metropolis: The Midtown Manhattan Study. NYU Press, New York, 1978.
SPIDA 2004 96 Michael Friendly
Loglinear Models for Categorical Data loglin
Tufte, E. R. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 1983.
van der Heijden, P. G. M. and de Leeuw, J. Correspondence analysis used complementary tologlinear analysis. Psychometrika, 50:429–447, 1985.
SPIDA 2004 97 Michael Friendly