Categorical Data
![Page 1: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/1.jpg)
Categorical Data
Frühling Rijsdijk (1) & Caroline van Baal (2)
(1) IoP, London; (2) Vrije Universiteit, A’dam
Twin Workshop, Boulder Tuesday March 2, 2004
![Page 2: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/2.jpg)
Aims
• Introduce Categorical Data
• Define liability and describe assumptions of the liability model
• Show how heritability of liability can be estimated from categorical twin data
• Practical exercises
![Page 3: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/3.jpg)
Categorical data
The measuring instrument can only discriminate between two or a few ordered categories, e.g. absence or presence of a disease
Data therefore take the form of counts, i.e. the number of individuals within each category
![Page 4: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/4.jpg)
Univariate Normal Distribution of Liability
Assumptions:
(1) Underlying normal distribution of liability
(2) The liability distribution has 1 or more thresholds (cut-offs)
![Page 5: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/5.jpg)
The standard Normal distribution
Liability is a latent variable and its scale is arbitrary; its distribution is therefore assumed to be a Standard Normal Distribution (SND) or z-distribution:
• mean (μ) = 0 and SD (σ) = 1
• z-values are the number of SDs away from the mean
• area under the curve translates directly to probabilities > Normal Probability Density function (Φ)
[Figure: standard normal curve, z from -3 to 3; 68% of the area lies between -1 and 1]
![Page 6: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/6.jpg)
Standard Normal Cumulative Probability in right-hand tail
(For negative z values, areas are found by symmetry)

z0     Area    %
0      .50     50%
.2     .42     42%
.4     .35     35%
.6     .27     27%
.8     .21     21%
1      .16     16%
1.2    .12     12%
1.4    .08     8%
1.6    .06     6%
1.8    .036    3.6%
2      .023    2.3%
2.2    .014    1.4%
2.4    .008    .8%
2.6    .005    .5%
2.8    .003    .3%
2.9    .002    .2%

Area = P(z ≥ z0)
[Figure: standard normal curve, z from -3 to 3, showing the right-hand tail area beyond z0]
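The tail areas in this table can be recomputed directly. A minimal Python sketch, assuming only the standard library; the function name `upper_tail` is illustrative and not part of the workshop material:

```python
import math

def upper_tail(z0: float) -> float:
    """P(z >= z0) for a standard normal variable."""
    return 0.5 * math.erfc(z0 / math.sqrt(2.0))

# A few rows of the table
for z0 in (0.0, 1.0, 2.0):
    print(f"z0 = {z0:.1f}  area = {upper_tail(z0):.3f}")

# Symmetry: the area below -z0 equals the area above z0
print(f"P(z <= -1) = {1.0 - upper_tail(-1.0):.3f}")
```

Small differences from the slide (e.g. at z0 = 1.6) come down to rounding.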
![Page 7: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/7.jpg)
When we have one variable it is possible to find a z-value (threshold) on the SND such that the proportion above it exactly matches the observed proportion in the sample
• e.g. if, in a sample of 1000 individuals, 150 have met the criteria for a disorder (15%), the z-value is 1.04
[Figure: standard normal curve, z from -3 to 3, with the threshold at 1.04]
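The threshold for an observed prevalence is the inverse of the normal cumulative distribution. A short Python sketch using the standard library's `statistics.NormalDist`; the helper name is hypothetical:

```python
from statistics import NormalDist

def threshold_from_prevalence(prevalence: float) -> float:
    """z-value that leaves `prevalence` of the standard normal above it."""
    return NormalDist().inv_cdf(1.0 - prevalence)

# 150 affected out of 1000 individuals (15%)
z = threshold_from_prevalence(150 / 1000)
print(round(z, 2))  # 1.04
```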
![Page 8: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/8.jpg)
Two categorical traits
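When we have two categorical traits, the data are represented in a Contingency Table, containing cell counts that can be translated into proportions

0 = absent, 1 = present

Cell labels (rows: Trait1, columns: Trait2):
              Trait2 = 0   Trait2 = 1
Trait1 = 0    00           01
Trait1 = 1    10           11

Counts (proportions of the 716 observations):
              Trait2 = 0   Trait2 = 1
Trait1 = 0    545 (.76)    75 (.10)
Trait1 = 1    56 (.08)     40 (.06)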
![Page 9: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/9.jpg)
Categorical Data for twins:
When the measured trait is dichotomous, i.e. a disorder is either present or absent in an unselected sample of twins:
cell a: number of pairs concordant for unaffected
cell d: number of pairs concordant for affected
cell b/c: number of pairs discordant for the disorder

0 = unaffected, 1 = affected

            Twin2 = 0   Twin2 = 1
Twin1 = 0   00 (a)      01 (b)
Twin1 = 1   10 (c)      11 (d)
![Page 10: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/10.jpg)
Joint Liability Model for twin pairs
• Assumed to follow a bivariate normal distribution
• The shape of a bivariate normal distribution is determined by the correlation between the traits
• Expected proportions under the distribution can be calculated by numerical integration with mathematical subroutines
![Page 11: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/11.jpg)
Bivariate Normal
[Figure: bivariate normal density surfaces for R = .00 and R = .90]
![Page 12: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/12.jpg)
Bivariate Normal (R=0.6) partitioned at threshold 1.4 (z-value) on both liabilities
![Page 13: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/13.jpg)
Expected Proportions of the BN, for R=0.6, Th1=1.4, Th2=1.4

             Liab 2 = 0   Liab 2 = 1
Liab 1 = 0   .87          .05
Liab 1 = 1   .05          .03
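These expected proportions can be approximated with a one-dimensional numerical integration. The Python sketch below is only an illustration of the idea, not the routine Mx uses: it conditions on the first liability and applies Simpson's rule. All function names, the step count and the integration span are assumptions made here.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def both_above(t1, t2, r, steps=2000, span=10.0):
    """P(L1 > t1, L2 > t2) by Simpson's rule over L1.

    Given L1 = x, L2 is normal with mean r*x and SD sqrt(1 - r^2),
    so the inner integral over L2 is a univariate tail area.
    Assumes |r| < 1 and an even number of steps.
    """
    sd = math.sqrt(1.0 - r * r)
    h = span / steps
    total = 0.0
    for i in range(steps + 1):
        x = t1 + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * phi(x) * upper_tail((t2 - r * x) / sd)
    return total * h / 3.0

def cell_proportions(t1, t2, r):
    """Expected proportions (a, b, c, d) of the 2 x 2 table."""
    d = both_above(t1, t2, r)
    b = upper_tail(t2) - d   # liability 1 below, liability 2 above
    c = upper_tail(t1) - d   # liability 1 above, liability 2 below
    a = 1.0 - b - c - d
    return a, b, c, d

print([round(p, 2) for p in cell_proportions(1.4, 1.4, 0.6)])
```

Rounded to two decimals, the result should agree with the table for R=0.6 and thresholds of 1.4.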
![Page 14: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/14.jpg)
            Twin2 = 0   Twin2 = 1
Twin1 = 0   00 (a)      01 (b)
Twin1 = 1   10 (c)      11 (d)

Correlated dimensions:
The correlation (shape) and the two thresholds determine the relative proportions of observations in the 4 cells of the CT.
Conversely, the sample proportions in the 4 cells can be used to estimate the correlation and the thresholds.
[Figure: bivariate liability distribution partitioned by the two thresholds into regions a, b, c and d]
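The "conversely" step can be sketched in Python. This is a hedged illustration rather than the Mx algorithm: thresholds are taken from the marginal proportions and the correlation is found by bisection so that the expected proportion in cell d matches the observed one. For a 2 x 2 table the model has as many parameters as free proportions, so this matching should coincide with the maximum-likelihood solution. The counts come from the two-trait example table; all function names are hypothetical.

```python
import math
from statistics import NormalDist

def both_above(t1, t2, r, steps=2000, span=10.0):
    """P(L1 > t1, L2 > t2) under a bivariate normal, by Simpson's rule."""
    sd = math.sqrt(1.0 - r * r)
    h = span / steps
    total = 0.0
    for i in range(steps + 1):
        x = t1 + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        tail = 0.5 * math.erfc((t2 - r * x) / (sd * math.sqrt(2.0)))
        total += weight * density * tail
    return total * h / 3.0

def tetrachoric(a, b, c, d):
    """Return (correlation, threshold 1, threshold 2) for a 2 x 2 table."""
    n = a + b + c + d
    t1 = NormalDist().inv_cdf((a + b) / n)  # row variable scored 0
    t2 = NormalDist().inv_cdf((a + c) / n)  # column variable scored 0
    target = d / n
    lo, hi = -0.999, 0.999
    for _ in range(60):
        mid = (lo + hi) / 2.0
        # The both-above probability increases with the correlation
        if both_above(t1, t2, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0, t1, t2

r, t1, t2 = tetrachoric(545, 75, 56, 40)
print(f"r = {r:.2f}, thresholds = {t1:.2f}, {t2:.2f}")
```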
![Page 15: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/15.jpg)
Twin Models
• A variance decomposition (A, C, E) can be applied to liability, where the correlations in liability are determined by the path model
• This leads to an estimate of the heritability of the liability
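As a rough illustration of how the path model ties the variance components to the twin correlations in liability, the Python sketch below uses the standard ACE expectations (rMZ = a² + c², rDZ = ½a² + c²). The inverse function is the simple algebraic rearrangement, shown only for intuition; in the workshop the components are estimated by model fitting. The function names are assumptions, and the numbers are those of the simulated data in Practical Exercise 1.

```python
def expected_twin_correlations(a2: float, c2: float) -> tuple:
    """Expected MZ and DZ liability correlations under an ACE model."""
    r_mz = a2 + c2
    r_dz = 0.5 * a2 + c2
    return r_mz, r_dz

def components_from_correlations(r_mz: float, r_dz: float) -> tuple:
    """Solve the two expectations for a2 and c2; e2 is the remainder."""
    a2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return a2, c2, e2

# h2 = .40, c2 = .20 should give rmz = .60 and rdz = .40
print(expected_twin_correlations(0.40, 0.20))
print(components_from_correlations(0.60, 0.40))
```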
![Page 16: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/16.jpg)
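ACE Liability Model
[Figure: path diagram. For Twin 1 and Twin 2, latent factors A, C and E load on liability L, which is cut by a threshold into Unaffected / Affected. The C factors correlate 1 across twins; the A factors correlate 1 (MZ) or .5 (DZ).]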
![Page 17: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/17.jpg)
Summary
It is possible to estimate a correlation between categorical traits from simple counts (CT), because of the assumptions we make about their joint distributions
![Page 18: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/18.jpg)
How can we fit ordinal data in Mx?
Summary statistics: CT
• Mx has a built-in fit function for the maximum-likelihood analysis of 2-way Contingency Tables
• analyses limited to only two variables
Raw data analyses
• multivariate
• handles missing data
• moderator variables
![Page 19: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/19.jpg)
Model-fitting to CT
Mx has a built-in fit function for the maximum-likelihood analysis of 2-way Contingency Tables
The Fit Function is twice the log-likelihood of the observed frequency data calculated as:
$$2 \ln L = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \ln(p_{ij})$$
n_ij is the observed frequency in cell ij
p_ij is the expected proportion in cell ij
![Page 20: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/20.jpg)
Expected proportions
Are calculated by numerical integration of the bivariate normal over two dimensions: the liabilities for twin1 and twin2
e.g. the probability that both twins are affected:
$$\int_{T_1}^{\infty} \int_{T_2}^{\infty} \Phi(L_1, L_2; \mathbf{0}, \Sigma)\, dL_2\, dL_1$$
Φ is the bivariate normal probability density function, L1 and L2 are the liabilities of twin1 and twin2, with means 0, and Σ is the correlation matrix of the two liabilities.
![Page 21: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/21.jpg)
For example: for a correlation of .9 and thresholds (z-values) of 1, the probability that both twins are above threshold (proportion d) is around .12
The probability that both twins are below threshold (proportion a) is given by another integral function with reversed boundaries:
$$\int_{-\infty}^{T_1} \int_{-\infty}^{T_2} \Phi(L_1, L_2; \mathbf{0}, \Sigma)\, dL_2\, dL_1$$
and is around .80 in this example
![Page 22: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/22.jpg)
χ² statistic:
log-likelihood of the data under the model subtracted from log-likelihood of the observed frequencies themselves:
$$2 \ln L = 2 \sum_{i=1}^{r} \sum_{j=1}^{c} n_{ij} \ln\left(\frac{n_{ij}}{n_{tot}}\right)$$
![Page 23: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/23.jpg)
The model’s failure to predict the observed data, i.e. a badly fitting model, is reflected in a significant χ²
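The χ² described on these slides can be sketched in Python as the difference between twice the log-likelihood of the observed frequencies themselves and twice the log-likelihood under the model's expected proportions. The counts are those of the two-trait example table; the "model" used here (zero correlation, thresholds fixed by the margins) is only an assumed illustration of a badly fitting model, and the function names are hypothetical.

```python
import math

def twice_log_lik(counts, props):
    """2 ln L = 2 * sum of n_ij * ln(p_ij), skipping empty cells."""
    return 2.0 * sum(n * math.log(p) for n, p in zip(counts, props) if n > 0)

def chi_square(counts, expected_props):
    """Saturated 2 ln L minus the model's 2 ln L."""
    n_tot = sum(counts)
    observed_props = [n / n_tot for n in counts]
    return twice_log_lik(counts, observed_props) - twice_log_lik(counts, expected_props)

counts = [545, 75, 56, 40]           # cells 00, 01, 10, 11
n_tot = sum(counts)
row1 = (counts[2] + counts[3]) / n_tot   # proportion with Trait1 = 1
col1 = (counts[1] + counts[3]) / n_tot   # proportion with Trait2 = 1
# Expected proportions if the two liabilities were uncorrelated
independence = [(1 - row1) * (1 - col1), (1 - row1) * col1,
                row1 * (1 - col1), row1 * col1]

print(f"chi-square under zero correlation: {chi_square(counts, independence):.1f}")
```

A model that reproduces the observed proportions exactly gives a χ² of zero; the further the expected proportions drift from the observed ones, the larger the statistic.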
![Page 24: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/24.jpg)
Model-fitting to Raw Ordinal Data
Zyg   ordinal response 1   ordinal response 2
1     0                    0
1     0                    0
1     0                    1
2     1                    0
2     0                    0
1     1                    1
2     .                    1
2     0                    .
2     0                    1
![Page 25: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/25.jpg)
Model-fitting to Raw Ordinal Data
The likelihood of a vector of ordinal responses is given by the Expected Proportion in the corresponding cell of the multivariate normal (MN) distribution
Expected proportions are calculated by numerical integration of the MN over n dimensions. In this example there are two: the liabilities for twin1 and twin2
![Page 26: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/26.jpg)
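Expected proportion for each response pattern (twin1 twin2):
(0 0): $$\int_{-\infty}^{T_1} \int_{-\infty}^{T_2} \Phi(x_1, x_2)\, dx_2\, dx_1$$
(0 1): $$\int_{-\infty}^{T_1} \int_{T_2}^{\infty} \Phi(x_1, x_2)\, dx_2\, dx_1$$
(1 0): $$\int_{T_1}^{\infty} \int_{-\infty}^{T_2} \Phi(x_1, x_2)\, dx_2\, dx_1$$
(1 1): $$\int_{T_1}^{\infty} \int_{T_2}^{\infty} \Phi(x_1, x_2)\, dx_2\, dx_1$$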
![Page 27: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/27.jpg)
$$\int_{T_1}^{\infty} \int_{T_2}^{\infty} \Phi(x_1, x_2)\, dx_2\, dx_1$$
Φ is the MN pdf, which is a function of Σ, the correlation matrix of the variables
By maximizing the likelihood of the data under a MN distribution, the ML estimates of the correlation matrix and the thresholds are obtained
![Page 28: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/28.jpg)
Practical Exercise 1
Simulated data for 625 MZ and 625 DZ pairs (h2 = .40, c2 = .20, e2 = .40 > rmz = .60, rdz = .40)
Dichotomized: 0 = bottom 88%, 1 = top 12%
This corresponds to a threshold (z-value) of 1.18

Observed counts:
MZ      0     1          DZ      0     1
0       508   48         0       497   59
1       35    34         1       49    20

Raw ORD File: bin.dat
Scripts: tetracor.mx and ACEbin.mx
![Page 29: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/29.jpg)
Practical Exercise 2
Same simulated data
Categorized: 0 = bottom 22%, 1 = mid 66%, 2 = top 12%
This corresponds to thresholds (z-values) of -0.75 and 1.18

Observed counts:
MZ      0     1     2          DZ      0     1     2
0       80    58    1          0       63    74    2
1       68    302   47         1       71    289   57
2       1     34    34         2       4     45    20

Raw ORD File: cat.dat
Adjust the correlation and ACE script
![Page 30: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/30.jpg)
Threshold Specification in Mx
2 Categories
Threshold Matrix: 1 x 2
T(1,1) T(1,2)   threshold for twin1 & twin2

3 Categories
Threshold Matrix: 2 x 2
T(1,1) T(1,2)   threshold 1 for twin1 & twin2
T(2,1) T(2,2)   threshold 2 for twin1 & twin2
![Page 31: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/31.jpg)
Threshold Specification in Mx
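3 Categories
nthresh=2 nvar=2
Matrix T: nthresh nvar (2 x 2)
T(1,1) T(1,2)   threshold 1 for twin1 & twin2
T(2,1) T(2,2)   increment

L LOW nthresh nthresh
Value 1 L 1 1 to L nthresh nthresh

Threshold Model L*T /
$$\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{pmatrix} = \begin{pmatrix} t_{11} & t_{12} \\ t_{11} + t_{21} & t_{12} + t_{22} \end{pmatrix}$$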
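The effect of pre-multiplying T by a lower-triangular matrix of ones can be checked with a few lines of Python. This sketch uses plain lists rather than Mx; the increment of 1.93 is an illustrative value chosen here so that the second threshold lands on 1.18, matching the thresholds quoted for Practical Exercise 2.

```python
def lower_ones(n):
    """n x n lower-triangular matrix of ones (the L matrix)."""
    return [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Row 1: first threshold for twin1 and twin2; row 2: increment
T = [[-0.75, -0.75],
     [1.93, 1.93]]

thresholds = matmul(lower_ones(2), T)
for row in thresholds:
    print([round(t, 2) for t in row])
```

As long as the increments are kept positive, each row of L*T is larger than the one above it, so the thresholds stay in ascending order.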
![Page 32: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/32.jpg)
Using Frequency Weights
The 1250-line data file (bin.dat) can be summarized like this:

Zyg   ordinal response 1   ordinal response 2   FREQ
1     0                    0                    508
1     0                    1                    48
1     1                    0                    35
1     1                    1                    34
2     0                    0                    497
2     0                    1                    59
2     1                    0                    49
2     1                    1                    20
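Collapsing a raw file into frequency-weighted rows is a simple counting step. A minimal Python sketch, assuming the raw records are already parsed into (zyg, response 1, response 2) tuples; the helper name is hypothetical and the toy rows are the complete records from the raw-data example earlier in the slides.

```python
from collections import Counter

def to_frequency_rows(rows):
    """Collapse identical (zyg, resp1, resp2) rows into rows with a count."""
    counts = Counter(rows)
    return [(*pattern, freq) for pattern, freq in sorted(counts.items())]

raw = [(1, 0, 0), (1, 0, 0), (1, 0, 1), (2, 1, 0),
       (2, 0, 0), (1, 1, 1), (2, 0, 1)]

for row in to_frequency_rows(raw):
    print(*row)
```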
![Page 33: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/33.jpg)
Using Frequency Weights
G1: Data and model for MZ correlation
DAta NGroups=2 NInput_vars=4 Missing=.
Ordinal File=binF.dat
Labels zyg bin1 bin2 freq
SELECT IF zyg = 1
SELECT bin1 bin2 freq /
DEFINITION freq /

Begin Matrices;
 R STAN 2 2 FREE            !Correlation matrix
 T FULL nthresh nvar FREE   !thresh tw1, thresh tw2
 L LOW nthresh nthresh
 F FULL 1 1
End matrices;
Value 1 L 1 1 to L nthresh nthresh   ! initialize L

COV R /            !Predicted Correlation matrix for MZ pairs
Thresholds L*T /   !to ensure t1<t2<t3 etc.
FREQ F /
![Page 34: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/34.jpg)
Example: 2 variables measured in twins:
x has 2 categories > 0 below Tx1, 1 above Tx1
y has 3 categories > 0 below Ty1, 1 between Ty1 and Ty2, 2 above Ty2
Ordinal response vector (x1, y1, x2, y2), for example (1 2 0 1)

The likelihood of this vector of observations is the Expected Proportion in the corresponding cell of the MN:
$$\int_{T_{x1}}^{\infty} \int_{T_{y2}}^{\infty} \int_{-\infty}^{T_{x1}} \int_{T_{y1}}^{T_{y2}} \Phi(x_1, y_1, x_2, y_2)\, dy_2\, dx_2\, dy_1\, dx_1$$
![Page 35: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/35.jpg)
Proband-Ascertained Samples
For rare disorders (e.g. Schizophrenia), selecting a random sample of twins will lead to the vast majority of pairs being unaffected.
A more efficient design is to ascertain twin pairs through a register of affected individuals.
When an affected twin (the proband) is identified, the cotwin is followed up to see if he or she is also affected.
There are several types of ascertainment
![Page 36: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/36.jpg)
Types of ascertainment

Complete Ascertainment
             Twin 1 = 0   Twin 1 = 1
Twin 2 = 0   00 (a)       01 (b)
Twin 2 = 1   10 (c)       11 (d)

Single Ascertainment
              Proband = 0   Proband = 1
Co-twin = 0   00 (a)        01 (b)
Co-twin = 1   10 (c)        11 (d)
![Page 37: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/37.jpg)
Ascertainment Correction
Omission of certain classes from observation leads to an increase of the likelihood of observing the remaining individuals
Mx corrects for incomplete ascertainment by dividing the likelihood by the proportion of the population remaining after ascertainment
CT from ascertained data can be analysed in Mx by simply substituting a -1 for the missing cells

CTable 2 2
-1 11
-1 13
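The idea of dividing by the proportion remaining can be illustrated in Python. This is a sketch of the principle only, not of Mx internals: the expected proportions of the cells that can still be observed are rescaled so that they sum to one. The proportions are taken from the R=0.6, threshold 1.4 example; treating cell a (concordant unaffected) as the unobserved cell is an assumption made here for a complete-ascertainment illustration, and the function name is hypothetical.

```python
def ascertained_proportions(props, observed):
    """Rescale expected proportions to the cells that remain observable.

    props    : expected population proportions, one per cell
    observed : booleans, False for cells lost through ascertainment
    """
    remaining = sum(p for p, keep in zip(props, observed) if keep)
    return [p / remaining if keep else 0.0 for p, keep in zip(props, observed)]

props = [0.87, 0.05, 0.05, 0.03]            # cells a, b, c, d
observed = [False, True, True, True]        # assumption: cell a is never seen
corrected = ascertained_proportions(props, observed)
print([round(p, 3) for p in corrected])
```

Each remaining cell's probability goes up after the correction, which is the "increase of the likelihood" the slide refers to.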
![Page 38: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/38.jpg)
Summary
For a 2 x 2 CT: 3 observed statistics, 3 parameters (1 correlation, 2 thresholds), df=0. Any pattern of observed frequencies can be accounted for, so there is no goodness-of-fit test of the normal distribution assumption.
This problem is solved when we have a CT which is at least 3 x 2: df>0. A significant χ² reflects departure from normality.
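The degrees-of-freedom count generalizes to any r x c table: the free cell proportions minus one correlation and the thresholds on each dimension. A small Python sketch of that bookkeeping; the function name is hypothetical.

```python
def ct_degrees_of_freedom(n_rows: int, n_cols: int) -> int:
    """df for fitting one correlation plus thresholds to an r x c table."""
    observed_statistics = n_rows * n_cols - 1          # free cell proportions
    parameters = 1 + (n_rows - 1) + (n_cols - 1)       # correlation + thresholds
    return observed_statistics - parameters

for shape in ((2, 2), (3, 2), (3, 3)):
    print(shape, ct_degrees_of_freedom(*shape))
```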
![Page 39: Categorical Data](https://reader035.fdocuments.us/reader035/viewer/2022062301/56815006550346895dbdd9f2/html5/thumbnails/39.jpg)
Summary
Power to detect certain effects increases with increasing number of categories > continuous data most powerful
For raw ordinal data analyses, the first category must be coded 0!
Threshold specifications when analyzing CT are different