Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

download Comparing Group Means: The T-test and One-way  ANOVA Using STATA, SAS, and SPSS

of 179

Transcript of Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    1/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 1

    http://www.indiana.edu/~statmath

    Comparing Group Means: The T-test and One-way

    ANOVA Using STATA, SAS, and SPSS

    Hun Myoung Park

    This document summarizes the method of comparing group means and illustrates how to

    conduct the t-test and one-way ANOVA using STATA 9.0, SAS 9.1, and SPSS 13.0.

    1. Introduction

    2. Univariate Samples3. Paired (dependent) Samples

    4. Independent Samples with Equal Variances

    5. Independent Samples with Unequal Variances6. One-way ANOVA, GLM, and Regression

    7. Conclusion

    1. Introduction

    The t-test and analysis of variance (ANOVA) compare group means. The mean of a variable to

    be compared should be substantively interpretable. A t-test may examine gender differences in

    average salary or racial (white versus black) differences in average annual income. The left-hand side (LHS) variable to be tested should be interval or ratio, whereas the right-hand side

    (RHS) variable should be binary (categorical).

    1.1

    T-test and ANOVA

    While the t-test is limited to comparing means of two groups, one-way ANOVA can comparemore than two groups. Therefore, the t-test is considered a special case of one-way ANOVA.

    These analyses do not, however, necessarily imply any causality (i.e., a causal relationship

    between the left-hand and right-hand side variables). Table 1 compares the t-test and one-wayANOVA.

    Table 1. Comparison between the T-test and One-way ANOVA

    T-test One-way ANOVALHS (Dependent) Interval or ratio variable Interval or ratio variable

    RHS (Independent) Binary variable with only two groups Categorical variableNull Hypothesis

    21 = ...321 === Prob. Distribution* T distribution F distribution

    * In the case of one degree of freedom on numerator, F=t2.

    The t-test assumes that samples are randomly drawn from normally distributed populationswith unknown population means. Otherwise, their means are no longer the best measures of

    central tendency and the t-test will not be valid. The Central Limit Theorem says, however, that

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    2/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 2

    http://www.indiana.edu/~statmath

    the distributions of 1y and 2y are approximately normal when N is large. When 3021 + nn , in

    practice, you do not need to worry too much about the normality assumption.

    You may numerically test the normality assumption using the Shapiro-Wilk W (N

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    3/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 3

    http://www.indiana.edu/~statmath

    Figure 1. Two Types of Data Arrangement

    Variable Group Variable1 Variable2xx

    00

    xx

    yy

    yy

    11

    The data set used here is adopted from J. F. Fraumenis study on cigarette smoking and cancer

    (Fraumeni 1968). The data are per capita numbers of cigarettes sold by 43 states and the

    District of Columbia in 1960 together with death rates per hundred thousand people fromvarious forms of cancer. Two variables were added to categorize states into two groups. See the

    appendix for the details.

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    4/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 4

    http://www.indiana.edu/~statmath

    2. Univariate Samples

    The univariate-sample or one-sample t-test determines whether an unknown population mean

    differs from a hypothesized value cthat is commonly set to zero: cH =:0 . The t statistic

    follows Students T probability distribution with n-1 degrees of freedom, )1(~

    = nts

    cyt

    y,

    whereyis a variable to be tested and nis the number of observations.1

    Suppose you want to test if the population mean of the death rates from lung cancer is 20 per

    100,000 people at the .01 significance level. Note the default significance level used in mostsoftware is the .05 level.

    2.1 T-test in STATA

    The . t t est command conducts t-tests in an easy and flexible manner. For a univariate sampletest, the command requires that a hypothesized value be explicitly specified. The l evel ( )

    option indicates the confidence level as a percentage. The 99 percent confidence level is

    equivalent to the .01 significance level.

    . ttest lung=20, level(99)

    One- sampl e t t est- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Var i abl e | Obs Mean St d. Er r . St d. Dev. [ 99% Conf . I nt er val ]- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    l ung | 44 19. 65318 . 6374133 4. 228122 17. 93529 21. 37108- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    mean = mean( l ung) t = - 0. 5441

    Ho: mean = 20 degr ees of f r eedom = 43

    Ha: mean < 20 Ha: mean ! = 20 Ha: mean > 20Pr ( T < t ) = 0. 2946 Pr ( | T| > | t | ) = 0. 5892 Pr ( T > t ) = 0. 7054

    STATA first lists descriptive statistics of the variable l ung. The mean and standard deviation

    of the 44 observations are 19.653 and 4.228, respectively. The t statistic is -.544 = (19.653-20)

    / .6374. Finally, the degrees of freedom are 43 =44-1.

    There are three t-tests at the bottom of the output above. The first and third are one-tailed tests,

    whereas the second is a two-tailed test. The t statistic -.544 and its large p-value do not reject

    the null hypothesis that the population mean of the death rate from lung cancer is 20 at the .01

    level. The mean of the death rate may be 20 per 100,000 people. Note that the hypothesizedvalue 20 falls into the 99 percent confidence interval 17.935-21.371.

    2

    1n

    yy

    i= ,1

    )( 22

    =

    n

    yys

    i, and standard error

    n

    ssy = .

    2The 99 percent confidence interval of the mean is 6374.*695.2653.192 = ysty , where the 2.695 is

    the critical value with 43 degree of freedom at the .01 level in the two-tailed test.

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    5/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 5

    http://www.indiana.edu/~statmath

    If you just have the aggregate data (i.e., the number of observations, mean, and standard

    deviation of the sample), use the . t t e s t i command to replicate the t-test above. Note thehypothesized value is specified at the end of the summary statistics.

    . ttesti 44 19.65318 4.228122 20, level(99)

    2.2 T-test Using the SAS TTEST Procedure

    The TTEST procedure conducts various types of t-tests in SAS. The H0option specifies a

    hypothesized value, whereas the ALPHA indicates a significance level. If omitted, the default

    values zero and .05 respectively are assumed.

    PROCTTESTH0=20ALPHA=.01DATA=masil.smoking;

    VARlung;

    RUN;

    The TTEST Procedure

    Statistics

    Lower CL Upper CL Lower CL Upper CL

    Variable N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

    lung 44 17.935 19.653 21.371 3.2994 4.2281 5.7989 0.6374

    T-Tests

    Variable DF t Value Pr > |t|

    lung 43 -0.54 0.5892

    The TTEST procedure reports descriptive statistics followed by a one-tailed t-test. You may

    have a summary data set containing the values of a variable (lung) and their frequencies

    (count). The FREQ option of the TTEST procedure provides the solution for this case.

    PROCTTESTH0=20ALPHA=.01DATA=masil.smoking;

    VARlung;

    FREQcount;

    RUN;

    2.3 T-test Using the SAS UNIVARIATE and MEANS Procedures

    The SAS UNIVARIATE and MEANS procedures also conduct a t-test for a univariate-sample.

    The UNIVARIATE procedure is basically designed to produces a variety of descriptive

    statistics of a variable. Its MU0option tells the procedure to perform a t-test using the

    hypothesized value specified. The VARDEF=DFspecifies a divisor (degrees of freedom) used in

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    6/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 6

    http://www.indiana.edu/~statmath

    computing the variance (standard deviation).3The NORMALoption examines if the variable is

    normally distributed.

    PROCUNIVARIATEMU0=20VARDEF=DF NORMALALPHA=.01DATA=masil.smoking;

    VARlung;

    RUN;

    The UNIVARIATE Procedure

    Variable: lung

    Moments

    N 44 Sum Weights 44

    Mean 19.6531818 Sum Observations 864.74

    Std Deviation 4.22812167 Variance 17.8770129

    Skewness -0.104796 Kurtosis -0.949602

    Uncorrected SS 17763.604 Corrected SS 768.711555

    Coeff Variation 21.5136751 Std Error Mean 0.63741333

    Basic Statistical Measures

    Location Variability

    Mean 19.65318 Std Deviation 4.22812

    Median 20.32000 Variance 17.87701

    Mode . Range 15.26000

    Interquartile Range 6.53000

    Tests for Location: Mu0=20

    Test -Statistic- -----p Value------

    Student's t t -0.5441 Pr > |t| 0.5892

    Sign M 1 Pr >= |M| 0.8804

    Signed Rank S -36.5 Pr >= |S| 0.6752

    Tests for Normality

    Test --Statistic--- -----p Value------

    Shapiro-Wilk W 0.967845 Pr < W 0.2535

    Kolmogorov-Smirnov D 0.086184 Pr > D >0.1500

    Cramer-von Mises W-Sq 0.063737 Pr > W-Sq >0.2500

    Anderson-Darling A-Sq 0.382105 Pr > A-Sq >0.2500

    Quantiles (Definition 5)

    Quantile Estimate

    100% Max 27.270

    3The VARDEF=Nuses N as a divisor, while VARDEF=WDFspecifies the sum of weights minus one.

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    7/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 7

    http://www.indiana.edu/~statmath

    99% 27.270

    95% 25.950

    90% 25.450

    75% Q3 22.815

    50% Median 20.320

    25% Q1 16.285

    Quantiles (Definition 5)

    Quantile Estimate

    10% 14.110

    5% 12.120

    1% 12.010

    0% Min 12.010

    Extreme Observations

    -----Lowest---- ----Highest----

    Value Obs Value Obs

    12.01 39 25.45 16

    12.11 33 25.88 1

    12.12 30 25.95 27

    13.58 10 26.48 18

    14.11 36 27.27 8

    The third block of the output above reports a t statistic and its p-value. The fourth block

    contains several statistics of normality test. Since N is less than 2,000, you should read the

    Shapiro-Wilk W, which suggests that lungis normally distributed (p |t| CL for Mean CL for Mean

    19.6531818 4.2281217 0.6374133 30.83

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    8/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 8

    http://www.indiana.edu/~statmath

    The MEANS procedure does not, however, have an option to specify a hypothesized value to

    anything other than zero. Thus, the null hypothesis here is that the population mean of death

    rate from lung cancer is zero. The t statistic 30.83 is (19.6532-0)/.6374. The large t statistic and

    small p-value reject the null hypothesis, reporting a consistent conclusion.

    2.4 T-test in SPSS

    The SPSS has the T-TEST command for t-tests. The /TESTVALsubcommand specifies the value

    with which the sample mean is compared, whereas the /VARIABLESlist the variables to be tested.

    Like STATA, SPSS specifies a confidence level rather than a significance level in the

    /CRITERIA=CI()subcommand.

    T-TEST

    /TESTVAL = 20

    /VARIABLES = lung

    /MISSING = ANALYSIS

    /CRITERIA = CI(.99) .

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    9/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 9

    http://www.indiana.edu/~statmath

    3. Paired (Dependent) Samples

    When two variables are not independent, but paired, the difference of these two variables,

    iii yyd 21 = , is treated as if it were a single sample. This test is appropriate for pre-post

    treatment responses. The null hypothesis is that the true mean difference of the two variables is

    D0, 00 : DH d= .4The difference is typically assumed to be zero unless explicitly specified.

    3.1 T-test in STATA

    In order to conduct a paired sample t-test, you need to list two variables separated by an equal

    sign. The interpretation of the t-test remains almost unchanged. The -1.871 = (-10.1667-

    0)/5.4337 at 35 degrees of freedom does not reject the null hypothesis that the difference is zero.

    . ttest pre=post0, level(95)

    Pai red t t est- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Var i abl e | Obs Mean St d. Er r . St d. Dev. [ 95% Conf . I nt er val ]- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    pre | 36 176. 0278 6. 529723 39. 17834 162. 7717 189. 2838post 0 | 36 186. 1944 7. 826777 46. 96066 170. 3052 202. 0836

    - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -di f f | 36 - 10. 16667 5. 433655 32. 60193 - 21. 19757 . 8642387

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -mean(di f f ) = mean(pre post 0) t = - 1. 8711

    Ho: mean(di f f ) = 0 degrees of f r eedom = 35

    Ha: mean(di f f ) < 0 Ha: mean(di f f ) ! = 0 Ha: mean(di f f ) > 0Pr ( T < t ) = 0. 0349 Pr ( | T| > | t | ) = 0. 0697 Pr ( T > t ) = 0. 9651

    Alternatively, you may first compute the difference between the two variables, and thenconduct one-sample t-test. Note that the default confidence level, l evel ( 95) , can be omitted.

    . gen d=prepost0

    . ttest d=0

    3.2 T-test in SAS

    In the TTEST procedure, you have to use the PAIRED instead of the VAR statement. For theoutput of the following procedure, refer to the end of this section.

    PROCTTESTDATA=temp.drug;

    PAIREDpre*post0;

    RUN;

    4 )1(~0

    = nts

    Ddt

    d

    d, where

    n

    dd

    i= ,1

    )( 22

    =

    n

    dds

    i

    d , andn

    ss d

    d =

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    10/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 10

    http://www.indiana.edu/~statmath

    The PAIRED statement provides various ways of comparing variables using asterisk (*) and

    colon (:) operators. The asterisk requests comparisons between each variable on the left with

    each variable on the right. The colon requests comparisons between the first variable on the left

    and the first on the right, the second on the left and the second on the right, and so forth.Consider the following examples.

    PROCTTEST;

    PAIREDpro: post0;

    PAIRED(a b)*(c d); /* Equivalent toPAIREDa*c a*d b*c b*d; */

    PAIRED(a b):(c d); /* Equivalent toPAIREDa*c b*c; */

    PAIRED(a1-a10)*(b1-b10);

    RUN;

    The first PAIRED statement is the same as the PAIRED pre*post0. The second and the third

    PAIRED statements contrast differences between asterisk and colon operators. The hyphen ()

    operator in the last statement indicates a1through a10and b1through b10. Let us consider an

    example of the PAIRED statement.

    PROCTTEST DATA=temp.drug;

    PAIRED(pre)*(post0-post1);

    RUN;

    The TTEST Procedure

    Statistics

    Lower CL Upper CL Lower CL Upper CL

    Difference N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

    pre - post0 36 -21.2 -10.17 0.8642 26.443 32.602 42.527 5.4337

    pre - post1 36 -30.43 -20.39 -10.34 24.077 29.685 38.723 4.9475

    T-Tests

    Difference DF t Value Pr > |t|

    pre - post0 35 -1.87 0.0697

    pre - post1 35 -4.12 0.0002

    The first t statistic for preversus post0is identical to that of the previous section. The second

    for preversus post1rejects the null hypothesis of no mean difference at the .01 level (p

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    11/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 11

    http://www.indiana.edu/~statmath

    PROCUNIVARIATEMU0=0VARDEF=DF NORMAL; VARd1 d2; RUN;

    PROCMEANSMEANSTDSTDERRTPROBTCLM; VARd1 d2; RUN;

    PROCTTESTALPHA=.05; VARd1 d2; RUN;

    3.3 T-test in SPSS

    In SPSS, the PAIRS subcommand indicates a paired sample t-test.

    T-TEST PAIRS = pre post0

    /CRITERIA = CI(.95)

    /MISSING = ANALYSIS .

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    12/179

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    13/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 13

    http://www.indiana.edu/~statmath

    The SAS TTEST procedure and SPSS T-TEST command conduct F tests for equal variance.

    SAS reports the folded form F statistic, whereas SPSS computes Levene's weighted F statistic.

    In STATA, the . onewaycommand produces Bartletts statistic for the equal variance test. The

    following is an example of Bartlett's test that does not reject the null hypothesis of equal

    variance.

    . oneway lung smoke

    Anal ysi s of Vari anceSource SS df MS F Pr ob > F

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Bet ween groups 313. 031127 1 313. 031127 28. 85 0. 0000Wi t hi n groups 455. 680427 42 10. 849534

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Tot al 768. 711555 43 17. 8770129

    Bart l ett ' s t est f or equal vari ances: chi 2( 1) = 0. 1216 Prob>chi 2 = 0. 727

    STATA, SAS, and SPSS all compute Satterthwaites approximation of the degrees of freedom.In addition, the SAS TTEST procedure reports Cochran-Cox approximation and the

    STATA . t t est command provides Welchs degrees of freedom.

    4.2 T-test in STATA

    With the .ttestcommand, you have to specify a grouping variable smoke in this example in

    the parenthesis of thebyoption.

    . ttest lung, by(smoke) level(95)

    Two- sampl e t t est wi t h equal var i ances- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Gr oup | Obs Mean St d. Er r . St d. Dev. [ 95% Conf . I nt erval ]- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    0 | 22 16. 98591 . 6747158 3. 164698 15. 58276 18. 389061 | 22 22. 32045 . 7287523 3. 418151 20. 80493 23. 83598

    - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -combi ned | 44 19. 65318 . 6374133 4. 228122 18. 36772 20. 93865- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f | - 5. 334545 . 9931371 - 7. 338777 - 3. 330314- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f = mean(0) - mean(1) t = - 5. 3714Ho: di f f = 0 degr ees of f r eedom = 42

    Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0

    Pr ( T < t ) = 0. 0000 Pr ( | T| > | t | ) = 0. 0000 Pr ( T > t ) = 1. 0000

    Let us first check the equal variance. The F statistic is )21,21(~1647.3

    4182.317.1

    2

    2

    2

    2

    Fs

    s

    S

    L == . The

    degrees of freedom of the numerator and denominator are 21 (=22-1). The p-value of .7273,

    virtually the same as that of Bartletts test above, does not reject the null hypothesis of equal

    variance. Thus, the t-test here is valid (t=-5.3714 and p

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    14/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 14

    http://www.indiana.edu/~statmath

    )22222(~3714.5

    22

    1

    22

    1

    0)3205.229859.16(+=

    +

    = t

    s

    t

    pool

    , where

    8497.10

    22222

    4182.3)122(1647.3)122( 222 =

    +

    +=pools

    If only aggregate data of the two variables are available, use the . t t e s t i command and list the

    number of observations, mean, and standard deviation of the two variables.

    . ttesti 22 16.85591 3.164698 22 22.32045 3.418151, level(95)

    Suppose a data set is differently arranged (second type in Figure 1) so that one variable

    smk_l unghas data for smokers and the other non_l ungfor non-smokers. You have to use the

    unpai r edoption to indicate that two variables are not paired. A grouping variable here is not

    necessary. Compare the following output with what is printed above.

    . ttest smk_lung=non_lung, unpaired

    Two- sampl e t t est wi t h equal var i ances- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Var i abl e | Obs Mean St d. Er r . St d. Dev. [ 95% Conf . I nt er val ]- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -smk_l ung | 22 22. 32045 . 7287523 3. 418151 20. 80493 23. 83598non_l ung | 22 16. 98591 . 6747158 3. 164698 15. 58276 18. 38906- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -combi ned | 44 19. 65318 . 6374133 4. 228122 18. 36772 20. 93865- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f | 5. 334545 . 9931371 3. 330313 7. 338777- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f = mean( smk_l ung) - mean( non_l ung) t = 5. 3714Ho: di f f = 0 degr ees of f r eedom = 42

    Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0Pr ( T < t ) = 1. 0000 Pr ( | T| > | t | ) = 0. 0000 Pr ( T > t ) = 0. 0000

    This unpai r edoption is very useful since it enables you to conduct a t-test without additional

    data manipulation. You may run the . t t est command with the unpai r edoption to compare

    two variables, say l eukemi aand ki dney, as independent samples in STATA. In SAS and

    SPSS, however, you have to stack up two variables and generate a grouping variable before t-tests.

    . ttest leukemia=kidney, unpaired

    Two- sampl e t t est wi t h equal var i ances- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Var i abl e | Obs Mean St d. Er r . St d. Dev. [ 95% Conf . I nt er val ]- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -l eukemi a | 44 6. 829773 . 0962211 . 6382589 6. 635724 7. 023821

    ki dney | 44 2. 794545 . 0782542 . 5190799 2. 636731 2. 95236- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -combi ned | 88 4. 812159 . 2249261 2. 109994 4. 365094 5. 259224- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f | 4. 035227 . 1240251 3. 788673 4. 281781- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    15/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 15

    http://www.indiana.edu/~statmath

    di f f = mean( l eukemi a) - mean(ki dney) t = 32. 5356Ho: di f f = 0 degr ees of f r eedom = 86

    Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0Pr ( T < t ) = 1. 0000 Pr ( | T| > | t | ) = 0. 0000 Pr ( T > t ) = 0. 0000

    The F 1.5119 = (.6532589^2)/(.5190799^2) and its p-value (=.1797) do not reject the nullhypothesis of equal variance. The large t statistic 32.5356 rejects the null hypothesis that death

    rates from leukemia and kidney cancers have the same mean.

    4.3 T-test in SAS

    The TTEST procedure by default examines the hypothesis of equal variances, and provides T

    statistics for either case. The procedure by default reports Satterthwaites approximation for the

    degrees of freedom. Keep in mind that a variable to be tested is grouped by the variable that isspecified in the CLASS statement.

    PROCTTESTH0=0ALPHA=.05 DATA=masil.smoking;

    CLASSsmoke;

    VARlung;

    RUN;

    The TTEST Procedure

    Statistics

    Lower CL Upper CL Lower CL Upper CL

    Variable smoke N Mean Mean Mean Std Dev Std Dev Std Dev

    lung 0 22 15.583 16.986 18.389 2.4348 3.1647 4.5226

    lung 1 22 20.805 22.32 23.836 2.6298 3.4182 4.8848lung Diff (1-2) -7.339 -5.335 -3.33 2.7159 3.2939 4.1865

    Statistics

    Variable smoke Std Err Minimum Maximum

    lung 0 0.6747 12.01 25.45

    lung 1 0.7288 12.11 27.27

    lung Diff (1-2) 0.9931

    T-Tests

    Variable Method Variances DF t Value Pr > |t|

    lung Pooled Equal 42 -5.37

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    16/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 16

    http://www.indiana.edu/~statmath

    lung Folded F 21 21 1.17 0.7273

    The F test for equal variance does not reject the null hypothesis of equal variances. Thus, the t-

    test labeled as Pooled should be referred to in order to get the t -5.37 and its p-value .0001. Ifthe equal variance assumption is violated, the statistics of Satterthwaite and Cochran

    should be read.

    If you have a summary data set with the values of variables (lung) and their frequency (count),

    specify the count variable in the FREQ statement.

    PROCTTESTDATA=masil.smoking;

    CLASSsmoke;

    VARlung;

    FREQcount;

    RUN;

    Now, let us compare the death rates from leukemia and kidney in the second data arrangement

    type of Figure 1. As mentioned before, you need to rearrange the data set to stack up two

    variables into one and generate a grouping variable (first type in Figure 1).

    DATAmasil.smoking2;

    SETmasil.smoking;

    death = leukemia; leu_kid ='Leukemia'; OUTPUT;

    death = kidney; leu_kid ='Kidney'; OUTPUT;

    KEEPleu_kid death;

    RUN;

    PROCTTESTCOCHRAN DATA=masil.smoking2; CLASSleu_kid; VARdeath; RUN;

    The TTEST Procedure

    Statistics

    Lower CL Upper CL Lower CL Upper CL

    Variable leu_kid N Mean Mean Mean Std Dev Std Dev Std Dev Std Err

    death Kidney 44 2.6367 2.7945 2.9524 0.4289 0.5191 0.6577 0.0783

    death Leukemia 44 6.6357 6.8298 7.0238 0.5273 0.6383 0.8087 0.0962

    death Diff (1-2) -4.282 -4.035 -3.789 0.5063 0.5817 0.6838 0.124

    T-Tests

    Variable Method Variances DF t Value Pr > |t|

    death Pooled Equal 86 -32.54

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    17/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 17

    http://www.indiana.edu/~statmath

    death Folded F 43 43 1.51 0.1794

    Compare this SAS output with that of STATA in the previous section.

    4.4 T-test in SPSS

    In the T-TEST command, you need to use the /GROUPsubcommand in order to specify a

    grouping variable. SPSS reports Levene's F .0000 that does not reject the null hypothesis ofequal variance (p

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    18/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 18

    http://www.indiana.edu/~statmath

    5. Independent Samples with Unequal Variances

    If the assumption of equal variances is violated, we have to compute the adjusted t statisticusing individual sample standard deviations rather than a pooled standard deviation. It is also

    necessary to use the Satterthwaite, Cochran-Cox (SAS), or Welch (STATA) approximations of

    the degrees of freedom. In this chapter, you compare mean death rates from kidney cancerbetween the west (south) and east (north).

    5.1 T-test in STATA

    As discussed earlier, let us check equality of variances using the . onewaycommand. The

    t abul at eoption produces a table of summary statistics for the groups.

    . oneway kidney west, tabulate

    | Summar y of ki dney

    west | Mean Std. Dev. Freq.- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    0 | 3. 006 . 3001298 201 | 2. 6183333 . 59837219 24

    - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Tot al | 2. 7945455 . 51907993 44

    Anal ysi s of Vari anceSource SS df MS F Pr ob > F

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Bet ween groups 1. 63947758 1 1. 63947758 6. 92 0. 0118Wi t hi n groups 9. 94661333 42 . 236824127

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Tot al 11. 5860909 43 . 269443975

    Bart l ett ' s t est f or equal vari ances: chi 2( 1) = 8. 6506 Prob>chi 2 = 0. 003

    Bartletts chi-squared statistic rejects the null hypothesis of equal variance at the .01 level. It is

    appropriate to use the unequal option in the . t t es t command, which calculates

    Satterthwaites approximation for the degrees of freedom.

    Unlike the SAS TTEST procedure, the . t t est command cannot specify the mean difference

    D0 other than zero. Thus, the null hypothesis is that the mean difference is zero.

    . ttest kidney, by(west) unequal level(95)

    Two- sampl e t t est wi t h unequal var i ances

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Gr oup | Obs Mean St d. Er r . St d. Dev. [ 95% Conf . I nt erval ]

    - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -0 | 20 3. 006 . 0671111 . 3001298 2. 865535 3. 1464651 | 24 2. 618333 . 1221422 . 5983722 2. 365663 2. 871004

    - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -combi ned | 44 2. 794545 . 0782542 . 5190799 2. 636731 2. 95236- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f | . 3876667 . 139365 . 1047722 . 6705611- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    19/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 19

    http://www.indiana.edu/~statmath

    di f f = mean( 0) - mean( 1) t = 2. 7817Ho: di f f = 0 Satt ert hwai t e' s degr ees of f r eedom = 35. 1098

    Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0Pr ( T < t ) = 0. 9957 Pr ( | T| > | t | ) = 0. 0086 Pr ( T > t ) = 0. 0043

    See Satterthwaites approximation of 35.110 in the middle of the output. If you want to getWelchs approximation, use the wel chas well as unequal options; without the unequal option,

    the wel chis ignored.

    . ttest kidney, by(west) unequal welch

    Two- sampl e t t est wi t h unequal var i ances- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Gr oup | Obs Mean St d. Er r . St d. Dev. [ 95% Conf . I nt erval ]- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    0 | 20 3. 006 . 0671111 . 3001298 2. 865535 3. 1464651 | 24 2. 618333 . 1221422 . 5983722 2. 365663 2. 871004

    - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -combi ned | 44 2. 794545 . 0782542 . 5190799 2. 636731 2. 95236

    - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -di f f | . 3876667 . 139365 . 1050824 . 6702509- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f = mean( 0) - mean( 1) t = 2. 7817Ho: di f f = 0 Wel ch' s degrees of f r eedom = 36. 2258

    Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0Pr ( T < t ) = 0. 9957 Pr ( | T| > | t | ) = 0. 0085 Pr ( T > t ) = 0. 0043

    Satterthwaites approximation is slightly smaller than Welchs 36.2258. Again, keep in mindthat these approximations are not integers, but real numbers. The t statistic 2.7817 and its p-

    value .0086 reject the null hypothesis of equal population means. The north and east have

    larger death rates from kidney cancer per 100 thousand people than the south and west.

    For aggregate data, use the . t t e s t i command with the necessary options.

    . ttesti 20 3.006 .3001298 24 2.618333 .5983722, unequal welch

    As mentioned earlier, the unpai r edoption of the . t t est command directly compares two

    variables without data manipulation. The option treats the two variables as independent of each

    other. The following is an example of the unpaired and unequal options.

    . ttest bladder=kidney, unpaired unequal welch

    Two- sampl e t t est wi t h unequal var i ances- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Var i abl e | Obs Mean St d. Er r . St d. Dev. [ 95% Conf . I nt er val ]- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -bl adder | 44 4. 121136 . 1454679 . 9649249 3. 827772 4. 4145ki dney | 44 2. 794545 . 0782542 . 5190799 2. 636731 2. 95236

    - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -combi ned | 88 3. 457841 . 1086268 1. 019009 3. 241933 3. 673748- - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f | 1. 326591 . 1651806 . 9968919 1. 65629- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    di f f = mean(bl adder ) - mean(ki dney) t = 8. 0312Ho: di f f = 0 Wel ch' s degrees of f r eedom = 67. 0324

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    20/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 20

    http://www.indiana.edu/~statmath

    Ha: di f f < 0 Ha: di f f ! = 0 Ha: di f f > 0Pr ( T < t ) = 1. 0000 Pr ( | T| > | t | ) = 0. 0000 Pr ( T > t ) = 0. 0000

    The F 3.4556 = (.9649249^2)/(.5190799^2) rejects the null hypothesis of equal variance

    (p

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    21/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 21

    http://www.indiana.edu/~statmath

    F 3.9749 = (.5983722^2)/(.3001298^2) and p

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    22/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 22

    http://www.indiana.edu/~statmath

    Variable Method Num DF Den DF F Value Pr > F

    death Folded F 43 43 3.46

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    23/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 23

    http://www.indiana.edu/~statmath

    6. One-way ANOVA, GLM, and Regression

    The t-test is a special case of one-way ANOVA. Thus, one-way ANOVA produces equivalentresults to those of the t-test. ANOVA examines mean differences using the F statistic, whereas

    the t-test reports the t statistic. The one-way ANOVA (t-test), GLM, and linear regression

    present essentially the same things in different ways.

    6.1 One-way ANOVA

    Consider the following ANOVA procedure. The CLASS statement is used to specify

    categorical variables. The MODEL statement lists the variable to be compared and a grouping

    variable, separating them with an equal sign.

    PROCANOVADATA=masil.smoking;

    CLASSsmoke;

    MODELlung=smoke;

    RUN;

    The ANOVA Procedure

    Dependent Variable: lung

    Sum of

    Source DF Squares Mean Square F Value Pr > F

    Model 1 313.0311273 313.0311273 28.85 F

    smoke 1 313.0311273 313.0311273 28.85 F- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Model | 313. 031127 1 313. 031127 28. 85 0. 0000|

    smoke | 313. 031127 1 313. 031127 28. 85 0. 0000|

    Resi dual | 455. 680427 42 10. 849534- - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Tot al | 768. 711555 43 17. 8770129

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    24/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 24

    http://www.indiana.edu/~statmath

    In SPSS, the ONEWAY command is used.

    ONEWAY lung BY smoke

    /MISSING ANALYSIS .

    6.2 Generalized Linear Model (GLM)

    The SAS GLM and MIXED procedures and the SPSS UNIANOVA command also report the F

    statistic for one-way ANOVA. Note that STATAs . gl mcommand does not perform one-way

    ANOVA.

    PROCGLMDATA=masil.smoking;

    CLASSsmoke;

    MODELlung=smoke /SS3;

    RUN;

    The GLM Procedure

    Dependent Variable: lung

    Sum of

    Source DF Squares Mean Square F Value Pr > F

    Model 1 313.0311273 313.0311273 28.85 F

    smoke 1 313.0311273 313.0311273 28.85

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    25/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 25

    http://www.indiana.edu/~statmath

    The SAS REG procedure, STATA . r egr ess command, and SPSS REGRESSION commandestimate linear regression models.

    PROCREGDATA=masil.smoking;

    MODELlung=smoke;

    RUN;

    The REG Procedure

    Model: MODEL1

    Dependent Variable: lung

    Number of Observations Read 44

    Number of Observations Used 44

    Analysis of Variance

    Sum of Mean

    Source DF Squares Square F Value Pr > F

    Model 1 313.03113 313.03113 28.85 |t|

    Intercept 1 16.98591 0.70225 24.19

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    26/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 26

    http://www.indiana.edu/~statmath

    Model | 313. 031127 1 313. 031127 Pr ob > F = 0. 0000Resi dual | 455. 680427 42 10. 849534 R- squar ed = 0. 4072

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Adj R- s quar ed = 0. 3931Tot al | 768. 711555 43 17. 8770129 Root MSE = 3. 2939

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -l ung | Coef . St d. Err . t P>| t | [ 95% Conf . I nt erval ]

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -smoke | 5. 334545 . 9931371 5. 37 0. 000 3. 330314 7. 338777_cons | 16. 98591 . 702254 24. 19 0. 000 15. 5687 18. 40311

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    The SPSS REGRESSION command looks complicated compared to the SAS REG procedure

    and STATA . r egr esscommand.

    REGRESSION

    /MISSING LISTWISE

    /STATISTICS COEFF OUTS R ANOVA

    /CRITERIA=PIN(.05) POUT(.10)

    /NOORIGIN

    /DEPENDENT lung/METHOD=ENTER smoke.

    Note that ANOVA, GLM, and regression report the same F (1, 42) 28.85, which is equivalent

    to t (42) -5.3714. As long as the degrees of freedom of the numerator is 1, F is always t^2

    (28.85=-5.3714^2).

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    27/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 27

    http://www.indiana.edu/~statmath

    7. Conclusion

    The t-test is a basic statistical method for examining the mean difference between two groups.One-way ANOVA can compare means of more than two groups. The number of observations

    in individual groups does not matter in the t-test or one-way ANOVA; both balanced and

    unbalanced data are fine. One-way ANOVA, GLM, and linear regression models all use thevariance-covariance structure in their analysis, but present the results in different ways.

    Researchers must check four issues when performing t-tests. First, a variable to be testedshould be interval or ratio so that its mean is substantively meaningful. Do not, for example,

    run a t-test to compare the mean of skin colors (white=0, yellow=1, black=2) between two

    countries. If you have a latent variable measured by several Likert-scaled manifest variables,first run a factor analysis to get that latent variable.

    Second, examine the normality assumptions before conducting a t-test. It is awkward to

    compare means of variables that are not normally distributed. Figure 2 illustrates a normal

    probability distribution on top and a Poisson distribution skewed to the right on the bottom.Although the two distributions have the same mean and variance of 1, they are not likely to be

    substantively interpretable. This is the rationale to conduct normality test such as Shapiro-WilkW, Shapiro-Francia W, and Kolmogorov-Smirnov D statistics. If the normality assumption is

    violated, try to use nonparametric methods.

    Figure 2. Comparing Normal and Poisson Probability Distributions ( 2 =1 and =1)

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    28/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 28

    http://www.indiana.edu/~statmath

    Third, check the equal variance assumption. You should be careful when comparing means of

    normally distributed variables with different variances. You may conduct the folded form F test.

    If the equal variance assumption is violated, compute the adjusted t and approximations of thedegree of freedom.

    Finally, consider the types of t-tests, data arrangement, and functionalities available in eachstatistical software (e.g., STATA, SAS, and SPSS) to determine the best strategy for dataanalysis (Table 3). The first data arrangement in Figure 1 is commonly used for independent

    sample t-tests, whereas the second arrangement is appropriate for a paired sample test. Keep inmind that the type II data sets in Figure 1 needs to be reshaped into type I in SAS and SPSS.

    Table 3. Comparison of T-test Functionalities of STATA, SAS and SPSS

    STATA 9.0 SAS 9.1 SPSS 13.0Test for equal variance Bartletts chi-squared

    (. t t e s t command)Folded form F

    (TTESTprocedure)Levenes weighted F

    (T- TESTcommand)Approximation of thedegrees of freedom (DF)

    Satterthwaites DFWelchs DF

    Satterthwaites DFCochran-Cox DF

    Satterthwaites DF

    Second Data Arrangement var1=var2 Reshaping the data set Reshaping the data set

    Aggregate Data . t t est i command FREQoption N/A

    SAS has several procedures (e.g., TTEST, MEANS, and UNIVARIATE) and useful options for

    t-tests. The STATA . t t e s t and . t t e s t i commands provide very flexible ways of handling

    different data arrangements and aggregate data. Table 4 summarizes usages of options in these

    two commands.

    Table 4. Summary of the Usages of the . t t es t and . t t est Command Options

    Usage by(group var) unequal welch unpaired*

    Univariate sample var=c

    Paired (dependent) sample var1=var2

    Equal variance (1 variable) Var O

    Equal variance (2 variables)**

    var1=var2 O

    Unequal variance (1 variable) Var O O O

    Unequal variance (2 variables) var1=var2 O O O

    * The . t t e s t i command does not allow the unpai r edoption.** The var1=var2 assumes second type of data arrangement in Figure 1.

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    29/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 29

    http://www.indiana.edu/~statmath

    Appendix: Data Set

    Literature: Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract:Geographic Variations in the United States,"Journal of the National Cancer Institute, 41(5):

    1205-1211.

    Data Source: http://lib.stat.cmu.edu

    The data are per capita numbers of cigarettes smoked (sold) by 43 states and the District ofColumbia in 1960 together with death rates per 100 thousand people from various forms of

    cancer. The variables used in this document are,

    cigar= number of cigarettes smoked (hds per capita)bladder= deaths per 100k people from bladder cancer

    lung= deaths per 100k people from lung cancer

    kidney= deaths per 100k people from kidney cancer

    leukemia= deaths per 100k people from leukemiasmoke= 1 for those whose cigarette consumption is larger than the median and 0 otherwise.

    west= 1 for states in the South or West and 0 for those in the North, East or Midwest.

    The followings are summary statistics and normality tests of these variables.

    . sum cigar-leukemia

    Var i abl e | Obs Mean Std. Dev. Mi n Max- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    ci gar | 44 24. 91409 5. 573286 14 42. 4bl adder | 44 4. 121136 . 9649249 2. 86 6. 54

    l ung | 44 19. 65318 4. 228122 12. 01 27. 27

    ki dney | 44 2. 794545 . 5190799 1. 59 4. 32l eukemi a | 44 6. 829773 . 6382589 4. 9 8. 28

    . sfrancia cigar-leukemia

    Shapi r o- Franci a W' t est f or nor mal dat aVari abl e | Obs W' V' z Prob>z

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -ci gar | 44 0. 93061 3. 258 2. 203 0. 01381

    bl adder | 44 0. 94512 2. 577 1. 776 0. 03789l ung | 44 0. 97809 1. 029 0. 055 0. 47823

    ki dney | 44 0. 97732 1. 065 0. 120 0. 45217l eukemi a | 44 0. 97269 1. 282 0. 474 0. 31759

    . tab west smoke

    | smokewest | 0 1 | Tot al

    - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -0 | 7 13 | 201 | 15 9 | 24

    - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - +- - - - - - - - - -Tot al | 22 22 | 44

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    30/179

    2003-2005, The Trustees of Indiana University Comparing Group Means: 30

    http://www.indiana.edu/~statmath

    References

    Fraumeni, J. F. 1968. "Cigarette Smoking and Cancers of the Urinary Tract: GeographicVariations in the United States,"Journal of the National Cancer Institute, 41(5): 1205-

    1211.

    Ott, R. Lyman. 1993.An Introduction to Statistical Methods and Data Analysis. Belmont, CA:Duxbury Press.

    SAS Institute. 2005. SAS/STAT User's Guide, Version 9.1. Cary, NC: SAS Institute.

    SPSS. 2001. SPSS 11.0 Syntax Reference Guide. Chicago, IL: SPSS Inc.STATA Press. 2005. STATA Reference Manual Release 9. College Station, TX: STATA Press.

    Walker, Glenn A. 2002. Common Statistical Methods for Clinical Research with SAS

    Examples. Cary, NC: SAS Institute.

    Acknowledgements

    I am grateful to Jeremy Albright, Takuya Noguchi, and Kevin Wilhite at the UITS Center forStatistical and Mathematical Computing, Indiana University, who provided valuable comments

    and suggestions.

    Revision History

    2003. First draft 2004. Second draft 2005. Third draft (Added data arrangements and conclusion).

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    31/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 1

    http://www.indiana.edu/~statmath

    Regression Models for Event Count Data

    Using SAS, STATA, and LIMDEP

    Hun Myoung Park

    This document summarizes regression models for event count data and illustrates how to

    estimate individual models using SAS, STATA, and LIMDEP. Example models were tested in SAS

    9.1, STATA 9.0, and LIMDEP 8.0.

    1. Introduction2. The Poisson Regression Model (PRM)3. The Negative Binomial Regression Model (NBRM)4. The Zero-Inflated Poisson Regression Model (ZIP)5. The Zero-Inflated Negative Binomial Regression Model (ZINB)6. Conclusion

    7.

    Appendix

    1. Introduction

    An event count is the realization of a nonnegative integer-valued random variable (Cameron andTrivedi 1998). Examples are the number of car accidents per month, thunder storms per year, andwild fires per year. The ordinary least squares (OLS) method for event count data results inbiased, inefficient, and inconsistent estimates (Long 1997). Thus, researchers have developedvarious nonlinear models that are based on the Poisson distribution and negative binomialdistribution.

    1.1 Count Data Regression Models

    The left-hand side (LHS) of the equation has event count data. Independent variables are, as inthe OLS, located at the right-hand side (RHS). These RHS variables may be interval, ratio, orbinary (dummy). Table 1 below summarizes the categorical dependent variable regressionmodels (CDVMs) according to the level of measurement of the dependent variable.

    Table 1. Ordinary Least Squares and CDVMs

    Model Dependent (LHS) Method Independent (RHS)

    OLS Ordinary leastsquaresInterval or ratio Moment based

    method

    Binary response Binary (0 or 1)

    Ordinal response Ordinal (1st, 2nd, 3rd)

    Nominal response Nominal (A, B, C )CDVMs

    Event count data Count (0, 1, 2, 3)

    Maximumlikelihoodmethod

    A linear function ofinterval/ratio or binaryvariables

    ...22110 XX ++

    The Poisson regression model (PRM) and negative binomial regression model (NBRM) are basicmodels for count data analysis. Either the zero-inflated Poisson (ZIP) or the zero-inflated

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    32/179

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    33/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 3

    http://www.indiana.edu/~statmath

    )exp( iii x += (Long 1997). Thus, the conditional variance of y becomes larger than its

    conditional mean, iii xyE =)|( , which remains unchanged. Figure 2 illustrates how the

    probabilities for small and larger counts increase in the negative binomial distribution as theconditional variance of y increases, given 3= .

    Figure 2. Negative Binomial Probability Distribution with Alpha of .01, .5, 1, and 5

    The PRM and NBRM, however, have the same mean structure. If 0= , the NBRM reduces tothe PRM (Cameron and Trivedi 1998; Long 1997).

    1.3 Overdispersion

    When )|()|( iiii xyExyVar > , we are said to have overdispersion. Estimates of a PRM foroverdispersed data are unbiased, but inefficient with standard errors biased downward (Cameronand Trivedi 1998; Long 1997). The likelihood ratio test is developed to examine the nullhypothesis of no overdispersion, 0:0 =H .The likelihood ratio follows the Chi-squared

    distribution with one degree of freedom, )1(~)ln(ln*2 2PoissonNB LLLR = . If the null

    hypothesis is rejected, the NBRM is preferred to the PRM.

    Zero-inflated models handle overdispersion by changing the mean structure to explicitly modelthe production of zero counts (Long 1997). These models assume two latent groups. One is thealways-zero group and the other is the not-always-zero or sometime-zero group. Thus, zero

    counts come from the former group and some of the latter group with a certain probability.

    The likelihood ratio, )1(~)ln(ln*2 2ZIPZINB LLLR = , tests 0:0 =H to compare the ZIP

    and NBRM. The PRM and ZIP as well as NBRM and ZINB cannot, however, be tested by thislikelihood ratio, since they are not nested respectively. The Voungs statistic compares thesenon-nested models. If V is greater than 1.96, the ZIP or ZINB is favored. If V is less than -1.96,the PRM or NBRM is preferred (Long 1997).

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    34/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 4

    http://www.indiana.edu/~statmath

    1.4 Estimation in SAS, STATA, and LIMDEP

    The SAS GENMOD procedure estimates Poisson and negative binomial regression models.

    STATA has individual commands (e.g., . poi ssonand . nbr eg) for the corresponding count datamodels. LIMDEP has Poi sson$and Negbi n$commands to estimate various count data modelsincluding zero-inflated and zero-truncated models. Table 2 summarizes the procedures andcommands for count data regression models.

    Table 2. Comparison of the Procedures and Commands for Count Data Models

    Model SAS 9.1 STATA 9.0 LIMDEP 8.0Poisson Regression (PRM) GENMOD . poi sson Poi sson$

    Negative Binomial Regression (NBRM) GENMOD . nbr eg Negbi n$Zero-Inflated Poisson (ZIP) - . zi p Poi sson; Zi p; Rh2$Zero-inflated Negative Binomial (ZINB) - . zi nb Negbi n; Zi p; Rh2$Zero-truncated Poisson (ZTP) - . zt p Poi sson; Tr uncat i on$

    Zero-truncated Negative Binomial (ZTNB) - . zt nb Negbi n; Truncat i on$

    The example here examines how waste quotas (emps) and the strictness of policy implementation(st r i ct ) affect the frequency of waste spill accidents of plants (acci dent ).

    1. 5 Long and Freeses SPost Module

    STATA users may take advantages of user-written modules such as SPost written by J. ScottLong and Jeremy Freese. The module allows researchers to conduct follow-up analyses ofvarious CDVMs including event count data models. See 2.3 for examples of major SPost

    commands.

    In order to install SPost, execute the following commands consecutively. For more details, visit J.Scott Longs Web site at http://www.indiana.edu/~jslsoc/spost_install.htm.

    . net from http://www.indiana.edu/~jslsoc/stata/

    . net install spost9_ado, replace

    . net get spost9_do, replace

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    35/179

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    36/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 6

    http://www.indiana.edu/~statmath

    the number of regressors between the unrestricted and restricted models. The chi-squaredstatistic is 124.8218 = 2* [-667.2291 - (-729.6400)] (p ChiSq

    Intercept 1 0.3168 0.0306 0.2568 0.3768 107.20

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    37/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 7

    http://www.indiana.edu/~statmath

    Poi sson r egr essi on Number of obs = 778LR chi 2(2) = 124. 82Pr ob > chi 2 = 0. 0000

    Log l i kel i hood = - 1821. 5101 Pseudo R2 = 0. 0331

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    acci dent | Coef . St d. Er r. z P>| z| [ 95% Conf . I nt erval ]- - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -emps | . 0054186 . 0007434 7. 29 0. 000 . 0039615 . 0068757

    st r i ct | - . 7041664 . 0667619 - 10. 55 0. 000 - . 8350174 - . 5733154_cons | . 3900961 . 0466787 8. 36 0. 000 . 2986076 . 4815846

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Let us run a restricted model and then run the . di spl aycommand in order to double check thatthe likelihood ratio for goodness-of-fit is 124.8218.

    . poisson accident

    I t er at i on 0: l og l i kel i hood = - 1883. 921I t er at i on 1: l og l i kel i hood = - 1883. 921

    Poi sson r egr essi on Number of obs = 778LR chi 2(0) = 0. 00Pr ob > chi 2 = .

    Log l i kel i hood = - 1883. 921 Pseudo R2 = 0. 0000

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -acci dent | Coef . St d. Er r. z P>| z| [ 95% Conf . I nt erval ]

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -_cons | . 3168165 . 0305995 10. 35 0. 000 . 2568426 . 3767904

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    . display 2 * (-1821.5101 - (-1883.921))

    124. 8218

    2.3 Using the SPost Module in STATA

    The SPost module provides useful commands for follow-up analyses of various categoricaldependent variable models. The . f i t st at command calculates various goodness-of-fit statisticssuch as log likelihood, McFaddens R2(or Pseudo R2), Akaike Information Criterion (AIC), andBayesian Information Criterion (BIC).

    . quietly poisson accident emps strict

    . fitstat

    Measures of Fi t f or poi sson of acci dent

    Log- Li k I nt ercept Onl y: - 1883. 921 Log- Li k Ful l Model : - 1821. 510D( 775) : 3643. 020 LR( 2): 124. 822

    Pr ob > LR: 0. 000McFadden' s R2: 0. 033 McFadden' s Adj R2: 0. 032Maxi mumLi kel i hood R2: 0. 148 Cr agg & Uhl er ' s R2: 0. 149AI C: 4. 690 AI C*n: 3649. 020BI C: - 1515. 943 BI C' : - 111. 508

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    38/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 8

    http://www.indiana.edu/~statmath

    The . l i s t coef command lists unstandardized coefficients (parameter estimates), factor andpercent changes, and standardized coefficients to help interpret regression results.

    . listcoef, help

    poi sson (N=778) : Fact or Change i n Expect ed Count

    Obser ved SD: 2. 9482675

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -acci dent | b z P>| z| e b e bSt dX SDof X

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -emps | 0. 00542 7. 289 0. 000 1. 0054 1. 2297 38. 1548

    st r i ct | - 0. 70417 - 10. 547 0. 000 0. 4945 0. 7031 0. 5003- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    b = r aw coef f i ci entz = z- score f or t est of b=0

    P>| z| = p- val ue f or z- t este b = exp(b) = f act or change i n expect ed count f or uni t i ncrease i n X

    e bSt dX = exp( b*SD of X) = change i n expect ed count f or SD i ncrease i n XSDof X = st andar d devi at i on of X

    The . prt abcommand constructs a table of predicted values (events) for all combinations ofcategorical variables listed. The following example shows that the predicted number of accidentsunder the strict policy is .9172 at the mean waste quota (emps=42.0129).

    . prtab strict

    poi sson: Predi ct ed r at es f or acci dent

    - - - - - - - - - - - - - - - - - - - - - -s t r i ct | Predi ct i on

    - - - - - - - - - - +- - - - - - - - - - -0 | 1. 85471 | 0. 9172

    - - - - - - - - - - - - - - - - - - - - - -

    emps st r i ctx= 42. 012853 . 50771208

    The . pr val uelists predicted values for a given set of values for the independent variables. Forexample, the predicted probability of a zero count is .3996 at the mean waste quota under thestrict policy (str i ct=1). Note that the predicted rate of .917 is equivalent to .9172 in the . pr t ababove.

    . prvalue, x(strict=1) maxcnt(5)

    poi sson: Predi ct i ons f or acci dent

    Predi ct ed r at e: . 917 95% CI [ . 827 , 1. 02]

    Predi cted probabi l i t i es:

    Pr( y=0| x) : 0. 3996 Pr( y=1| x): 0. 3665Pr( y=2| x) : 0. 1681 Pr( y=3| x): 0. 0514Pr( y=4| x) : 0. 0118 Pr( y=5| x): 0. 0022

    emps st r i ctx= 42. 012853 1

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    39/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 9

    http://www.indiana.edu/~statmath

    The most useful command is the . pr changethat calculates marginal effects (changes) anddiscrete changes. For instance, a standard deviation increase in waste quota form its mean willincrease accidents by .3841 under the lenient policy (st r i ct =0).

    . prchange, x(strict=0)

    poi sson: Changes i n Pr edi ct ed Rat e f or acci dent

    mi n- >max 0- >1 - +1/ 2 - +sd/ 2 MargEf ctemps 2. 3070 0. 0080 0. 0101 0. 3841 0. 0101

    st r i ct - 0. 9375 - 0. 9375 - 1. 3332 - 0. 6568 - 1. 3060

    exp( xb) : 1. 8547

    emps st r i ctx= 42. 0129 0

    sd( x) = 38. 1548 . 500262

    SPost also includes the . prgen command, which computes a series of predictions by holding allvariables but one constant and allowing that variable to vary (Long and Freese 2003). These

    SPost commands work with most categorical and count data models suchas . l ogi t , . probi t , . poi sson, . nbr eg, . z i p, and . zi nb.

    2.4 PRM in LIMDEP

    The LIMDEP Poi sson$command estimates the PRM. LIMDEP reports log likelihoods of boththe unrestricted and restricted models. Keep in mind that you must include the ONE for theintercept.

    POISSON;

    Lhs=ACCIDENT;

    Rhs=ONE,EMPS,STRICT

    +---------------------------------------------+

    | Poisson Regression |

    | Maximum Likelihood Estimates |

    | Model estimated: Aug 24, 2005 at 04:56:45PM.|

    | Dependent variable ACCIDENT |

    | Weighting variable None |

    | Number of observations 778 |

    | Iterations completed 8 |

    | Log likelihood function -1821.510 |

    | Restricted log likelihood -1883.921 |

    | Chi squared 124.8218 |

    | Degrees of freedom 2 || Prob[ChiSqd > value] = .0000000 |

    | Chi- squared = 4944.94781 RsqP= -.0051 |

    | G - squared = 2827.20794 RsqD= .0423 |

    | Overdispersion tests: g=mu(i) : 4.720 |

    | Overdispersion tests: g=mu(i)^2: 4.253 |

    +---------------------------------------------+

    +---------+--------------+----------------+--------+---------+----------+

    |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

    +---------+--------------+----------------+--------+---------+----------+

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    40/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 10

    http://www.indiana.edu/~statmath

    Constant .3900961420 .46678663E-01 8.357 .0000

    EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

    STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208

    (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

    SAS, STATA, and LIMDEP produce almost the same parameter estimates and standard errors

    (Table 3). The log likelihood in SAS is different from that of STATA and LIMDEP (-667.291versus -1821.5101). This difference seems to come from the generalized linear model that theGENMOD procedure uses. These log likelihoods are, however, equivalent in the sense that theyresult in the same likelihood ratio.

    Table 3. Summary of the Poisson Regression Model in SAS, STATA, and LIMDEP

    Model SAS 9.1 STATA 9.0 LIMDEP 8.0

    Intercept. 3901

    ( . 0467). 3901

    ( . 0467). 3901

    ( . 0467)

    EMPS. 0054

    ( . 0007). 0054

    ( . 0007). 0054

    ( . 0007)

    STRICT

    - . 7042

    ( . 0668)

    - . 7042

    ( . 0668)

    - . 7042

    ( . 0668)Log Likelihood (unrestricted) - 667. 2291 - 1821. 5101 - 1821. 510Log Likelihood (restricted) - 729. 6400 - 1883. 921 - 1883. 921Likelihood Ratio for Goodness-of-fit 124. 8218 124. 82 124. 8218

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    41/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 11

    http://www.indiana.edu/~statmath

    3. The Negative Binomial Regression Model

    The SAS GENMODE procedure, STATA . nbr egcommand, and LIMDEP Negbi n$commandestimate the negative binomial regression model (NBRM).

    3.1 NBRM in SAS

    The GENMOD procedure estimates the NBRM using the /DIST=NEGBIN option. Note that thedispersion parameter is equivalent to the alpha in STATA and LIMDEP.

    PROC

    GENMOD

    DATA = masil.accident;

    MODELaccident=emps strict /DIST=NEGBIN LINK=LOG;

    RUN

    ;

    The GENMOD Procedure

    Model Information

    Data Set COUNT.WASTE

    Distribution Negative Binomial

    Link Function Log

    Dependent Variable Accident

    Observations Used 778

    Criteria For Assessing Goodness Of Fit

    Criterion DF Value Value/DF

    Deviance 775 589.7752 0.7610

    Scaled Deviance 775 589.7752 0.7610

    Pearson Chi-Square 775 845.6033 1.0911

    Scaled Pearson X2 775 845.6033 1.0911Log Likelihood 37.5628

    Algorithm converged.

    Analysis Of Parameter Estimates

    Standard Wald 95% Confidence Chi-

    Parameter DF Estimate Error Limits Square Pr > ChiSq

    Intercept 1 0.3851 0.1278 0.1345 0.6357 9.07 0.0026

    Emps 1 0.0052 0.0023 0.0008 0.0096 5.29 0.0214

    Strict 1 -0.6703 0.1671 -0.9978 -0.3427 16.09

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    42/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 12

    http://www.indiana.edu/~statmath

    The likelihood ratio for overdispersion is 1409.5838 = 2 * (37.5628 - (-667.2291)).

    3.2 NBRM in STATA

    STATA has the . nbr egcommand for the NBRM. The command reports three log likelihoodstatistics: for the PRM, restricted NBRM (constant-only model), and unrestricted NBRM (fullmodel), which make it easy to conduct likelihood ratio tests.

    . nbreg accident emps strict

    Fi t t i ng compari son Poi sson model :

    I t er at i on 0: l og l i kel i hood = - 1821. 5112I t er at i on 1: l og l i kel i hood = - 1821. 5101I t er at i on 2: l og l i kel i hood = - 1821. 5101

    Fi t t i ng const ant - onl y model :

    I t er at i on 0: l og l i kel i hood = - 1256. 6761I t er at i on 1: l og l i kel i hood = - 1152. 6155I t er at i on 2: l og l i kel i hood = - 1125. 6643I t er at i on 3: l og l i kel i hood = - 1125. 4183I t er at i on 4: l og l i kel i hood = - 1125. 4183

    Fi tt i ng f ul l model :

    I t er at i on 0: l og l i kel i hood = - 1117. 1731I t er at i on 1: l og l i kel i hood = - 1116. 7201I t er at i on 2: l og l i kel i hood = - 1116. 7182I t er at i on 3: l og l i kel i hood = - 1116. 7182

    Negat i ve bi nomi al r egr essi on Number of obs = 778LR chi 2(2) = 17. 40Pr ob > chi 2 = 0. 0002

    Log l i kel i hood = - 1116. 7182 Pseudo R2 = 0. 0077

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -acci dent | Coef . St d. Er r. z P>| z| [ 95% Conf . I nt erval ]

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -emps | . 0051981 . 0022595 2. 30 0. 021 . 0007694 . 0096267

    st r i ct | - . 6702548 . 1671191 - 4. 01 0. 000 - . 9978021 - . 3427074_cons | . 3851111 . 1278468 3. 01 0. 003 . 134536 . 6356861

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -/ l nal pha | 1. 37509 . 0885176 1. 201599 1. 548582

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -al pha | 3. 955434 . 3501257 3. 32543 4. 704793

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Li kel i hood r at i o t est of al pha=0: chi bar2( 01) = 1409. 58 Prob>=chi bar2 = 0. 000

    The restricted model or constant-only model gives us a log likelihood -1125.4183. Thus, thelikelihood ratio for goodness-of-fit is 17.4002 = 2 * [-1116.7182 - (-1125.4183)] (p

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    43/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 13

    http://www.indiana.edu/~statmath

    The likelihood ratio test for overdispersion results in a chi-squared of 1409.5838 (p1 of a binary variable st r i ct , since itsmarginal change at the mean (.5077) is meaningless.

    . prchange

    nbr eg: Changes i n Predi ct ed Rate f or acci dent

    mi n- >max 0- >1 - +1/ 2 - +sd/ 2 MargEf ctemps 1. 5326 0. 0055 0. 0068 0. 2585 0. 0068

    st r i ct - 0. 8931 - 0. 8931 - 0. 8885 - 0. 4383 - 0. 8721

    exp( xb) : 1. 3011

    emps st r i ctx= 42. 0129 . 507712

    sd( x) = 38. 1548 . 500262

    3.3 NBRM in LIMDEP

    LIMDEP has the Negbi n$command for the NBRM that reports the PRM as well. Note that thestandard errors of parameter estimates are slightly different from those of SAS and STATA. TheMargi nal Ef f ect s$ and the Means$ subcommands compute marginal effects at the mean ofindependent variables. You may not omit the Means$ subcommand.

    NEGBIN;

    Lhs=ACCIDENT;

    Rhs=ONE,EMPS,STRICT;

    Marginal Effects;

    Means

    +---------------------------------------------+

    | Poisson Regression |

    | Maximum Likelihood Estimates || Model estimated: Sep 08, 2005 at 09:35:36AM.|

    | Dependent variable ACCIDENT |

    | Weighting variable None |

    | Number of observations 778 |

    | Iterations completed 8 |

    | Log likelihood function -1821.510 |

    | Restricted log likelihood -1883.921 |

    | Chi squared 124.8218 |

    | Degrees of freedom 2 |

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    44/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 14

    http://www.indiana.edu/~statmath

    | Prob[ChiSqd > value] = .0000000 |

    | Chi- squared = 4944.94781 RsqP= -.0051 |

    | G - squared = 2827.20794 RsqD= .0423 |

    | Overdispersion tests: g=mu(i) : 4.720 |

    | Overdispersion tests: g=mu(i)^2: 4.253 |

    +---------------------------------------------+

    +---------+--------------+----------------+--------+---------+----------+

    |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

    +---------+--------------+----------------+--------+---------+----------+

    Constant .3900961420 .46678663E-01 8.357 .0000

    EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

    STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208

    (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

    Normal exit from iterations. Exit status=0.

    +---------------------------------------------+

    | Negative Binomial Regression |

    | Maximum Likelihood Estimates |

    | Model estimated: Sep 08, 2005 at 09:35:36AM.|

    | Dependent variable ACCIDENT || Weighting variable None |

    | Number of observations 778 |

    | Iterations completed 8 |

    | Log likelihood function -1116.718 |

    | Restricted log likelihood -1821.510 |

    | Chi squared 1409.584 |

    | Degrees of freedom 1 |

    | Prob[ChiSqd > value] = .0000000 |

    +---------------------------------------------+

    +---------+--------------+----------------+--------+---------+----------+

    |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

    +---------+--------------+----------------+--------+---------+----------+

    Constant .3851110699 .12855240 2.996 .0027

    EMPS .5198057234E-02 .22602075E-02 2.300 .0215 42.012853STRICT -.6702547660 .16729839 -4.006 .0001 .50771208

    Dispersion parameter for count data model

    Alpha 3.955434012 .35680876 11.086 .0000

    (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

    +-------------------------------------------+

    | Partial derivatives of expected val. with |

    | respect to the vector of characteristics. |

    | They are computed at the means of the Xs. |

    | Observations used for means are All Obs. |

    | Conditional Mean at Sample Point 1.3011 |

    | Scale Factor for Marginal Effects 1.3011 |

    +-------------------------------------------+

    +---------+--------------+----------------+--------+---------+----------+

    |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

    +---------+--------------+----------------+--------+---------+----------+

    Constant .5010628939 .19396434 2.583 .0098

    EMPS .6763123170E-02 .29746591E-02 2.274 .0230 42.012853

    STRICT -.8720595665 .22469308 -3.881 .0001 .50771208

    (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    45/179

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    46/179

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    47/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 17

    http://www.indiana.edu/~statmath

    4.2 ZIP in LIMDEP

    The LIMDEP Poi sson$command needs to have the Zi pand Rh2subcommands. The Rh2isequivalent to the i nf l at e( ) option in STATA. The Al g=Newt on$subcommand is needed to usethe Newton-Raphson algorithm because the default Broyden algorithm failed to converge.1

    POISSON;

    Lhs=ACCIDENT;

    Rhs=ONE,EMPS,STRICT;

    ZIP;

    Rh2=ONE,EMPS,STRICT;

    Alg=Newton

    +---------------------------------------------+

    | Poisson Regression |

    | Maximum Likelihood Estimates |

    | Model estimated: Sep 06, 2005 at 00:25:07PM.|

    | Dependent variable ACCIDENT || Weighting variable None |

    | Number of observations 778 |

    | Iterations completed 8 |

    | Log likelihood function -1821.510 |

    | Restricted log likelihood -1883.921 |

    | Chi squared 124.8218 |

    | Degrees of freedom 2 |

    | Prob[ChiSqd > value] = .0000000 |

    | Chi- squared = 4944.94781 RsqP= -.0051 |

    | G - squared = 2827.20794 RsqD= .0423 |

    | Overdispersion tests: g=mu(i) : 4.720 |

    | Overdispersion tests: g=mu(i)^2: 4.253 |

    +---------------------------------------------+

    +---------+--------------+----------------+--------+---------+----------+

    |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

    +---------+--------------+----------------+--------+---------+----------+

    Constant .3900961420 .46678663E-01 8.357 .0000

    EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

    STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208

    (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

    Normal exit from iterations. Exit status=0.

    +----------------------------------------------------------------------+

    | Zero Altered Poisson Regression Model |

    | Logistic distribution used for splitting model. || ZAP term in probability is F[tau x Z(i) ] |

    | Comparison of estimated models |

    | Pr[0|means] Number of zeros Log-likelihood |

    | Poisson .27329 Act.= 498 Prd.= 212.6 -1821.51007 |

    1If you get a warning message of Error: 806: Line search does not improve fn. Exit iterations.Status=3 or Error: 805: Initial iterations cannot improve function. Status=3, you maychange the optimization algorithm or increase the maximum number of iterations (e.g., Maxi t =1000$).

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    48/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 18

    http://www.indiana.edu/~statmath

    | Z.I.Poisson .64642 Act.= 498 Prd.= 502.9 -1259.88568 |

    | Note, the ZIP log-likelihood is not directly comparable. |

    | ZIP model with nonzero Q does not encompass the others. |

    | Vuong statistic for testing ZIP vs. unaltered model is 9.5740 |

    | Distributed as standard normal. A value greater than |

    | +1.96 favors the zero altered Z.I.Poisson model. |

    | A value less than -1.96 rejects the ZIP model. |

    +----------------------------------------------------------------------+

    +---------+--------------+----------------+--------+---------+----------+

    |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

    +---------+--------------+----------------+--------+---------+----------+

    Poisson/NB/Gamma regression model

    Constant 1.361977491 .23944641E-01 56.880 .0000

    EMPS -.2770010575E-03 .37770090E-03 -.733 .4633 42.012853

    STRICT -.9239125073E-01 .33326502E-01 -2.772 .0056 .50771208

    Zero inflation model

    Constant .4886559537 .12210013 4.002 .0001

    EMPS -.1098971050E-01 .22152492E-02 -4.961 .0000 42.012853

    STRICT 1.057031399 .17715551 5.967 .0000 .50771208

    (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

    In order to estimate the restricted model, run the following command with the ONE only in theLhs$subcommand. The Rh2$subcommand remains unchanged.

    POISSON;

    Lhs=ACCIDENT;

    Rhs=ONE;

    ZIP; Alg=Newton;

    Rh2=ONE,EMPS,STRICT

    Table 5 summarizes parameter estimates and goodness-of-fit statistics for the zero-inflatedPoisson model. STATA and LIMDEP report the same parameter estimates, but they produce

    different standard errors and log likelihoods. In particular, LIMDEP returned a suspicious loglikelihood for the restricted model, and thus ended up with the unlikely likelihood ratio of -.0304. In addition, the Vuong statistics in STATA and LIMDEP are different.

    Table 5. Summary of the Zero-Inflated Poisson Regression Model in STATA, and LIMDEP

    Model SAS 9.1 STATA 9.0 LIMDEP 8.0

    Intercept1. 3620( . 0493)

    1. 3620( . 0239)

    EMPS- . 0003( . 0009)

    - . 0003( . 0004)

    STRICT- . 0924( . 0729)

    - . 0924( . 0333)

    Intercept (Zero-inflated) . 4887( . 1211) . 4887( . 1221)

    EMPS (Zero-inflated)- . 0110( . 0023)

    - . 0110( . 0022)

    STRICT (Zero-inflated)1. 0570( . 1768)

    1. 0570( . 1772)

    Log Likelihood (unrestricted) - 1269. 7206 - 1259. 8857Log Likelihood (restricted) - 1270. 9523 - 1259. 8705Likelihood Ratio for Goodness-of-fit 2. 46 - . 0304Vuong Statistic (ZINB versus NBRM) 8. 40 9. 5740

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    49/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 19

    http://www.indiana.edu/~statmath

    5. The Zero-Inflated NB Regression Model

    STATA and LIMDEP can estimate the zero-inflated negative binomial regression model (ZINB).

    5.1 ZINB in STATA (.zinb)

    The STATA . zi nbcommand estimates the ZINB. The vuongoption computes the Vuongstatistic to compare the ZINB and NBRM.

    . zinb accident emps strict, inflate(emps strict) vuong

    Fi t t i ng const ant - onl y model :

    I t er ati on 0: l og l i kel i hood = - 1190. 5117 ( not concave)I t er at i on 1: l og l i kel i hood = - 1106. 9874I t er at i on 2: l og l i kel i hood = - 1098. 8642I t er at i on 3: l og l i kel i hood = - 1095. 3638I t er at i on 4: l og l i kel i hood = - 1094. 0237

    I t er at i on 5: l og l i kel i hood = - 1093. 063I t er at i on 6: l og l i kel i hood = - 1092. 6216I t er at i on 7: l og l i kel i hood = - 1091. 798I t er at i on 8: l og l i kel i hood = - 1091. 7332I t er at i on 9: l og l i kel i hood = - 1091. 7329I t er at i on 10: l og l i kel i hood = - 1091. 7329

    Fi tt i ng f ul l model :

    I t er at i on 0: l og l i kel i hood = - 1091. 7329I t er at i on 1: l og l i kel i hood = - 1089. 5565I t er at i on 2: l og l i kel i hood = - 1089. 5198I t er at i on 3: l og l i kel i hood = - 1089. 5198

    Zero- i nf l ated negat i ve bi nomi al r egr essi on Number of obs = 778

    Nonzer o obs = 280Zero obs = 498

    I nf l ati on model = l ogi t LR chi 2( 2) = 4. 43Log l i kel i hood = - 1089. 52 Prob > chi 2 = 0. 1094

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -| Coef . St d. Err . z P>| z| [ 95% Conf . I nt erval ]

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -acci dent |

    emps | - . 0004407 . 0020554 - 0. 21 0. 830 - . 0044691 . 0035877st r i ct | - . 3251317 . 1659173 - 1. 96 0. 050 - . 6503235 . 0000602_cons | . 7763065 . 1508037 5. 15 0. 000 . 4807367 1. 071876

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -i nf l at e |

    emps | - . 2087768 . 0955122 - 2. 19 0. 029 - . 3959772 - . 0215763st r i ct | 7. 562388 3. 055775 2. 47 0. 013 1. 573179 13. 5516_cons | . 1032115 . 3800045 0. 27 0. 786 - . 6415835 . 8480065

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -/ l nal pha | . 9252514 . 1351387 6. 85 0. 000 . 6603845 1. 190118

    - - - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -al pha | 2. 522502 . 3408876 1. 935536 3. 28747

    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Vuong t est of zi nb vs. st andard negat i ve bi nomi al : z = 4. 13 Pr>z = 0. 0000

  • 8/11/2019 Comparing Group Means: The T-test and One-way ANOVA Using STATA, SAS, and SPSS

    50/179

    2003-2005, The Trustees of Indiana University Regression Models for Event Count Data: 20

    http://www.indiana.edu/~statmath

    The likelihood ratio, 360.4024= 2*(-1089.5198 - (-1269.721)), rejects the null hypothesis of nooverdispersion, indicating that the ZINB can improve goodness-of-fit over the ZIP (p 1.96, suggests that the ZINB is preferred to the NBRM.

    5.2 ZINB in LIMDEP

    The LIMDEP Negbi n$command needs to have the Zi p and Rh2subcommands for the ZINB.The following command produces the Poisson regression model, negative binomial model, andzero-inflated negative binomial model. You may omit the Al g=Newt on$subcommand.

    NEGBIN;

    Lhs=ACCIDENT;

    Rhs=ONE,EMPS,STRICT; Rh2=ONE,EMPS,STRICT;

    ZIP; Alg=Newton

    +---------------------------------------------+

    | Poisson Regression |

    | Maximum Likelihood Estimates |

    | Model estimated: Sep 10, 2005 at 00:20:00AM.|

    | Dependent variable ACCIDENT |

    | Weighting variable None |

    | Number of observations 778 |

    | Iterations completed 8 |

    | Log likelihood function -1821.510 |

    | Restricted log likelihood -1883.921 |

    | Chi squared 124.8218 |

    | Degrees of freedom 2 |

    | Prob[ChiSqd > value] = .0000000 |

    | Chi- squared = 4944.94781 RsqP= -.0051 |

    | G - squared = 2827.20794 RsqD= .0423 |

    | Overdispersion tests: g=mu(i) : 4.720 |

    | Overdispersion tests: g=mu(i)^2: 4.253 |

    +---------------------------------------------+

    +---------+--------------+----------------+--------+---------+----------+

    |Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] | Mean of X|

    +---------+--------------+----------------+--------+---------+----------+

    Constant .3900961420 .46678663E-01 8.357 .0000

    EMPS .5418599057E-02 .74341923E-03 7.289 .0000 42.012853

    STRICT -.7041663804 .66761926E-01 -10.547 .0000 .50771208

    (Note: E+nn or E-nn means multiply by 10 to + or -nn power.)

    Normal exit from iterations. Exit status=0.

    +---------------------------------------------+

    | Negative Binomial Regression || Maximum Likelihood Estimates |

    | Model estimated: Sep 10, 2005 at 00:20:00AM.|

    | Dependent variable ACCIDENT |

    | Weighting variable None |

    | Number of observations 778 |

    | Iterations completed 12 |

    | Log likelihood function -1116.718 |

    | Restricted log likel