
Statistical issues arising in the Kerner v. Denver: A class action disparate impact case

1 March 2017

Reporter: Law, Probability and Risk (2017) 16(1): 35

Length: 13230 words

Author: Joseph L. Gastwirth, Weiwen Miao and Qing Pan, Department of Statistics, George Washington University, Washington, DC, 20052, USA; Department of Mathematics and Statistics, Haverford College, Haverford, PA 19041, USA

Text

When comparing the success rates of two groups, statisticians often stratify the data into subgroups of individuals with similar levels of other factors related to the response of interest. In the clinical trial context, one might create subgroups according to the number and strength of risk factors the subjects have, while in the equal employment context, one might form strata based on the type of position and seniority. Sometimes the data are naturally stratified, e.g. applicants for different jobs requiring different skills. However, plaintiffs may pool all the data into one large 2 × 2 table, making it easier to find that the pass rates are statistically significantly different. In contrast, defendants may argue that only when statistical significance is reached in many, e.g. at least one-half of the strata, should a court find that the data supports a prima facie case of discrimination. These issues arose in Kerner v. Denver, a case concerning the disparate impact of a pre-employment exam on minority applicants. The statistical presentations of both parties are reviewed and potential issues with them are noted. After demonstrating that the disparities in pass rates in the various jobs are sufficiently similar, the Mantel-Haenszel test is used to combine the data in each stratum into one overall test. Our analysis shows that there is a statistically significant disparity in the odds minority and majority applicants pass the test. Furthermore, the associated estimator of the common odds ratio of the pass rates is 0.40, indicating that the odds a minority applicant had of passing the test were less than one-half those of a Caucasian.

Keywords: Breslow-Day test; disparate impact; homogeneity; odds ratio; power of statistical tests; stratified data.

[Received on 8 August 2016; revised on 25 November 2016; accepted on 16 December 2016]

1. Introduction

The Kerner v. Denver 1 case concerns the propriety of the use of a test, the Accuplacer, by the City and County of Denver to establish eligibility lists for individuals applying for employment. The plaintiffs assert that the test has a disparate impact on minority applicants, i.e. they fail the exam at significantly higher rates than Caucasians and the exam is not a valid predictor of successful performance of the jobs. A class action covering African-American and Hispanic applicants for jobs requiring them to pass the test was certified in September 2015. The class consists of 386 individuals who applied for at least one of the 21 positions in Denver that required an applicant to have a minimum score on the Accuplacer exam. The exam had a reading component and a writing one. Some

1 Civil Action No. 11-cv-00256-MSK-KMT, 2015 WL 5698663 (D. Colo. 29 September 2015).


positions used both exams while others only used one of them. The passing score depended on the position, e.g. an Agency Support Technician required a reading score of at least 77, while the Administrative Support Assistant (Level 3) needed a reading score of 60 or more.

When deciding whether an employment practice has disparate impact on a minority group, the selection or passing rate of the minority applicants is compared to that of the majority applicants. Early cases often relied on guidelines issued by the government agencies concerned with enforcing the fair employment laws. The guidelines introduced a rule of thumb: if the pass rate of minorities was less than four-fifths, or 80%, of the pass rate of majority members, the test had a disparate impact. 2 The guidelines also state that statistical and practical significance may be considered. More recently, courts have indicated a preference for formal statistical tests such as the difference between two proportions, the chi-square test or Fisher′s exact test. 3 If plaintiffs succeed in establishing a prima facie case of disparate impact, the employer needs to show that the test or job requirement in question is job related, i.e. predicts success on the job. 4
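The four-fifths comparison and a formal two-proportion test can be sketched in a few lines of Python. The figures are the Isabel v. City of Memphis pass rates quoted in note 3; the function names and the z-test formulation are ours, a minimal illustration rather than the method used by any party in Kerner.

```python
# Four-fifths rule and a two-sided two-proportion z-test (pooled variance),
# applied to the Isabel v. City of Memphis pass rates: 47/63 minority, 51/57 white.
from math import sqrt, erfc

def four_fifths_ratio(p_min, p_maj):
    """Selection ratio of minority to majority pass rates, compared to 0.80."""
    return p_min / p_maj

def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pool = (x1 + x2) / (n1 + n2)
    se = sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value

p_minority = 47 / 63   # 74.6%
p_white = 51 / 57      # 89.5%
ratio = four_fifths_ratio(p_minority, p_white)  # about 0.833: satisfies the 80% rule
z, p = two_proportion_z(51, 57, 47, 63)         # yet the difference is significant at 0.05
```

This reproduces the point made in note 3: the same data can satisfy the four-fifths guideline while still yielding a statistically significant disparity.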

In order to certify a case as a class action, the plaintiffs need to demonstrate that the employment practice had an impact on all or nearly all the members of the class. When the putative class consists of employees in or applicants to different types of jobs, or employees who work at different locations, they need to show that the impact results in a disparity in most or all of the jobs or units of the company, not just in a relatively few units. The opinion in Bolden v. Walsh Corp 5 described these concerns in explaining why it denied class certification when the firm had over 200 different job locations, each with a different supervisor. On the other hand, the court certified a class of African-American employees who were not promoted in Brown v. Nucor. Although the plant was organized into six units, the plaintiffs submitted anecdotal evidence that similar practices, e.g. failure to follow the firm′s established rules for posting openings, occurred throughout the plant, together with a statistically significant difference in promotion rates. 6 This article shows that the data in Kerner are consistent with a ′common′ odds ratio for all the jobs using the Accuplacer test. The estimated common odds ratio of 0.4 is both statistically significantly and practically meaningfully less than one.

Section 2 reviews the statistical presentations of the experts for both parties and their rebuttal declarations. Because the test was used to screen applicants for many jobs, a major statistical issue in the case concerns which method of analysis is appropriate. In particular, the plaintiffs first pooled the data for all 21 positions into one large sample and found a statistically significant result. The defendant analysed the data for each position separately and found statistically significant differences in 8 of the 21 positions and concluded that this finding was not consistent with a general pattern of disparate impact in the positions that the test was used for. The authors′ analysis of the main data set concerning the pass rates of minority and majority applicants is presented in Section 3. The consistency of our finding that the test had a disparate impact with the defendant′s expert′s observing a statistically significant disparity in only 8 of the 21 jobs is demonstrated in Section 4. The importance of checking

2 Uniform Guidelines on Employee Selection Procedures. (1978). Federal Register Vol.43-166. 25 August. See also Vol. 44-43 (1979) and Vol.45-87 (1980).

3 For example in Isabel v. City of Memphis, 404 F.3d 404 (6th Cir. 2006), the court found that African-American applicants established a prima facie case of disparate impact by comparing their pass rate of 47/63 = 74.6% to the white pass rate of 51/57 = 89.5% because the difference was statistically significant. The ratio of the pass rates is 83.3%, which satisfies the four-fifths rule. Two other cases that relied on statistical significance testing rather than the ′four-fifths′ rule are: Stagi v. Nat′l. R.R. Passenger Corp. 391 F. Appx. 133, 144-45 (3rd Cir. 2010) and Jones v. City of Boston, 752 F.3d 38 (1st Cir. 2014).

4 The guidelines, supra n. 2 discuss the type of studies that can be used to demonstrate that a test with a disparate impact is a valid predictor of job performance. The Williams v. Ford Motor Co. 187 F.3d 533 (6th Cir. 1999) case discusses a sound validation study and Ernst et al. v. City of Chicago No. 14-3783 (7th Cir. 19 September 2016) describes a flawed one.

5 688 F.3d. 893 (2012).

6 785 F.3d 895 (4th Cir. 2015). African-Americans formed 19.25% of applicants for promotion but only 7.94% of those promoted, a practically and statistically significant difference (p-value < .02).


Table 1. Probability AccuPlacer results are independent of race for black and hispanic applicants

Race       Total   Failed   Fail rate (%)   Significant?   Significance level   Likelihood race neutral
White      1394    345      25              Yes            5 × 10^-15           5 in one thousand trillion trillion
Minority   2215    920      42

that the disparities in pass rates in the various strata (positions) are sufficiently similar before using the Cochran-Mantel-Haenszel (MH or CMH) test or other combination procedure to analyse the data is illustrated in Section 5. The last section discusses the parallels between the analysis of stratified data in the Equal Employment Opportunity Commission (EEOC) context and the analysis of multi-centre clinical trials and notes that the need for stratifying data into relatively homogeneous subgroups may be greater in the EEOC area because applicants and employees are not randomized to treatment or control as in clinical trials.

2. Summary of the statistical analyses submitted to the court

2.1 Plaintiff′s expert′s original analysis

The plaintiffs′ first analysis compared the failure rates of White and Black/Hispanic applicants by pooling the pass-fail data from each of the 21 jobs at issue into a single 2 × 2 table. Table 1 ′reproduces′ the Table 1 from their expert′s (P) report, which reports the data and summarizes their analysis.

Comment: It is important to point out that both the ′title′ of the table and the ′heading′ of the last column ′misinterpret′ the p-value of a statistical test. 7 The p-value is ′not′ the probability that the results are independent of race or the likelihood of the test being race-neutral. Rather it is the probability of observing a difference of ′at least as large as the actual one of 17%′ between the failure rates of the two groups assuming that members of both

7 The heading of the next to last column conflates two related but distinct concepts. The significance level of a test is a pre-set value, usually 0.05 but sometimes 0.01, used to determine whether or not the test of the data reaches statistical significance. The p-value is the probability, calculated assuming the null hypothesis is true, that data at least as far away as the actual data would occur. The p-value is more informative than just reporting whether or not the data is statistically significant, i.e. whether the p-value is less than the pre-set threshold, say 0.05, because one also knows how unlikely the data would be assuming the null hypothesis (both groups have the same pass rate) is true. Finally, there is a typo in the last column, as the p-value of 5 × 10^-15 is 5 in one thousand trillion, which is an extremely small probability.


groups had the same chance of failing the test. When this probability, called the p-value, is low, one infers that the observed data are inconsistent with the assumption that the failure rates of the two groups are equal and concludes that they differ. It is important to recall that the p-value of a statistical test depends on both the ′magnitude of the difference′ in pass rates and the ′sample sizes′ of the two groups. By aggregating applicants from all jobs into one large sample, small differences in failure or pass rates can be classified as statistically significant. 8

Expert P used the chi-square test to analyse the data in Table 1, obtaining a value of about 105 for a chi-square distribution with 1 degree of freedom corresponding to a very small p-value, less than one in a billion. The expert then compared the data on failure rates of each minority group to that of whites.

As 311 of 752 African-American applicants failed the test and 609 of 1463 Hispanic applicants failed, comparing their failure rates with the 345 of 1394 whites again yielded highly significant results. 9 The report concluded that the Accuplacer test had a consistent preferential impact on white applicants and a corresponding negative one on non-whites.
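Expert P′s pooled chi-square can be reproduced from the counts recoverable from the text (345 of 1394 white and 920 of 2215 minority applicants failed). A minimal sketch, using the standard Pearson formula for a 2 × 2 table:

```python
# Pearson chi-square (1 degree of freedom) for the pooled 2x2 pass-fail table.
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value

white_pass, white_fail = 1394 - 345, 345
min_pass, min_fail = 2215 - 920, 920
stat, p = chi_square_2x2(white_pass, white_fail, min_pass, min_fail)
# stat is about 105 and p is far below one in a billion, as the text reports
```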

2.2 Defendant′s original report 10

Defendant′s expert (D) presented a table giving the pass rates and average scores on the reading and writing tests of ′all′ applicants for each of the jobs. He showed that there was a statistically significant variation in the pass rates for the different positions. 11 He suggested that this variation could arise because the passing score was not the same for all positions. Indeed, some positions only administered one of the two parts of the AccuPlacer exam to applicants. Due to the variation in pass rates for the different positions, expert D argued that it was more appropriate to examine the impact of the test by controlling for the position applicants applied for. Furthermore, he claimed that this procedure allows us to determine if the test impacted minority applicants similarly across titles. Expert D noted that Fisher′s exact test is a commonly accepted test that ′measures the likelihood of differences in selection rates between groups′, in this case the pass rates of the minority group of African-Americans and Hispanics compared to the pass rates of Caucasians. 12

Expert D carried out two analyses. Some individuals applied for multiple position postings ′within′ a job title, so they took the exam several times. The first analysis included all times an applicant took the exam. The second considered each applicant for the same position or classification title once. Although the expert′s report does not state the criterion used to determine pass, it is sensible to assume that if someone took the test several times and eventually passed, then he/she would be classified as a pass. 13 Essentially, this classifies multiple applications for

8 To appreciate the importance of the sample size, suppose 375 Whites (26.9%) and 665 minorities (30%) failed the exam. The chi-square test yields a statistically significant result at the commonly used 0.05 level (p-value = 0.0479). In terms of the pass rates, 70% of minority applicants would pass while 73.1% of the Whites did. The selection ratio of 0.70/0.731 = 0.958 substantially exceeds the ′four-fifths′ or 80% guideline issued by the federal government and courts might well decide that the small difference in pass rates is not legally meaningful.
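As a check on the hypothetical counts in footnote 8 (this verification is ours, not in the footnote): the quoted p-value of 0.0479 matches the chi-square test with Yates′s continuity correction applied to those counts.

```python
# Chi-square with Yates's continuity correction for a 2x2 table [[a, b], [c, d]].
from math import erfc, sqrt

def yates_chi_square(a, b, c, d):
    n = a + b + c + d
    num = n * (abs(a * d - b * c) - n / 2) ** 2
    stat = num / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, erfc(sqrt(stat / 2))  # p-value from the 1-df chi-square tail

# Footnote 8's hypothetical counts: 375 of 1394 whites and 665 of 2215 minorities fail.
stat, p = yates_chi_square(1394 - 375, 375, 2215 - 665, 665)
# p is just under 0.05, matching the footnote's 0.0479
```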

9 The p-values of each of the tests were less than one in a billion. These very small values are due in part to the large samples sizes. For further discussion of the role of sample size and references to the literature, see Gastwirth and Xu (2014).

10 Analysis of the Racial Impact of the AccuPlacer Tests on the Hiring Process in the City and County of Denver and Estimate of Potential Economic Loss; Submitted by defendant′s expert in Kerner et al. v. City and County of Denver, Civil Case No. 11-cv-00256-MSK-KMT.

11 Ibid Table 1 on page 8. For example, the overall pass rates on the reading test ranged from 75.8% to 98.2%. Using the chi-square test for the equality of proportions, he found a statistically significant difference in the pass rates on both the reading and writing exams across the job titles, Ibid at 8. The p-values of the test were not reported.

12 Ibid at page 9. As noted in the previous sub-section, a test of hypothesis does not measure the likelihood of a difference in two rates.

13 The expert report, ibid at p. 10, n.13, says that ′Analyses were also conducted controlling for classification title.′


the same position by their scores on the last exam they took. Because the test results of the same individual are likely to be positively correlated, the first data set will not be discussed here. The second data set, including the results of Fisher′s exact test applied to each of the 21 positions, is given in Table 2 below. 14

Expert D then noted that the difference in pass rates between minority and white test takers reached statistical significance in only 8 of the 21 positions. Furthermore, he stated that the difference between the proportion of minorities taking the test and their proportion of test passers ranged from -1.6% to 7.6%. He concludes that there is no evidence to support the allegation that minority applicants had significantly higher failure rates than Caucasians in 13 of the 21 job titles because the differences in those positions were not statistically significant.

Table 2. Defence analysis of pass-fail data by job title and minority status

Classification   Minority       Caucasian      Percent        Minority   Caucasian   Percent minority     Prob.
title            test takers    test takers    minority (%)   passes     passes      among passes (%)
311 CSA          63             28             69.2           44         26          62.9                 0.0164
ASA I            18             17             51.4           18         17          51.4                 1
ASA II           592            302            66.2           536        293         64.7                 0.0003
ASA III          595            354            62.7           416        310         57.3                 <0.0001
ASA IV           1340           868            60.7           787        658         54.5                 <0.0001
ASA V            26             9              74.3           17         6           73.9                 1
AST              28             16             63.6           14         13          51.9                 0.0565
ACI              32             49             39.5           20         40          33.3                 0.0712
ACSA             18             17             51.4           12         11          52.2                 1
CCT              40             22             64.5           29         22          56.9                 0.0054
ET               529            219            70.7           274        164         62.6                 <0.0001
EA I             65             74             46.8           39         60          39.4                 0.0084
EA II            90             87             50.9           66         75          46.8                 0.0403
FIC              51             66             43.6           46         65          41.8                 0.237
HO               8              3              72.7           6          3           66.7                 1
HRST             33             33             50             26         30          46.4                 0.3034
HRT              6              4              60             3          3           50.0                 0.5714
MCT              8              7              53.3           5          5           50.0                 1
PT               26             28             48.1           22         26          45.8                 0.4126
VIC              42             14             75             34         13          72.3                 0.4239
WQI              18             25             41.9           12         16          42.9                 1

Source: The data referring to the number of individuals who took and passed the test are taken from Table 2 in the Original Report of expert D. We calculated the percentages and p-values (Prob.) of the Fisher′s exact test. Those percentages and p-values differ slightly from the expert D′s Report.
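Fisher′s exact test used in Table 2 can be computed exactly in pure Python (the two-sided version sums the probabilities of all tables no more likely than the observed one). The sketch below is our illustration; it checks two of the rows against the qualitative conclusions in the table.

```python
# Two-sided Fisher's exact test from the hypergeometric distribution.
from math import comb

def fisher_exact(a, b, c, d):
    """p-value for the 2x2 table [[a, b], [c, d]] with a, c the pass counts."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    lo, hi = max(0, c1 - r2), min(r1, c1)
    probs = {x: comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)
             for x in range(lo, hi + 1)}
    p_obs = probs[a]
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))

# 311 CSA: 44 of 63 minority and 26 of 28 Caucasian applicants passed
p_csa = fisher_exact(44, 63 - 44, 26, 28 - 26)   # small: significant at 0.05
# ASA I: every applicant in both groups passed, so no disparity is possible
p_asa1 = fisher_exact(18, 0, 17, 0)              # exactly 1
```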

14 Ibid Table A-1 in Appendix A.


Comments: (1) Comparing the minority proportion of all applicants to their proportion of test passers is not the preferred comparison. 15 Rather one should compare the pass rates directly as is done in Section 3. Both experts in the case used Fisher′s exact test, which is an appropriate test and will be used in that section. There are newer statistical tests that have more power to detect a difference in pass rates in the context of this case. 16

(2) Because the same exam, albeit with different criteria for passing, is used for all the positions, analysing one stratum or job category at a time loses information. Furthermore, in Section 4 it will be seen that, in three job titles the configuration of the data is such that it is mathematically ′impossible′ for Fisher′s exact test to reach statistical significance at the commonly used 0.05 or two-standard deviation level.

(3) The simple aggregation of stratified data into one single 2 × 2 table as Expert P did in Table 1 is statistically questionable due to the possibility of Simpson′s paradox (Agresti, 1996, pp. 57, 258 and Samuels, 1993), where the pass rates of two groups can be equal in each of the strata but be statistically significantly different in the aggregated data. This occurs when the demographic mix of the applicants in the different strata varies substantially and the strata with more minorities have lower pass rates. To illustrate the problem, consider two job categories (strata). One-half of the applicants pass in the first stratum, but only one-fourth pass in the second. Suppose there are 200 majority and 100 minority applicants in the first stratum and both groups have a 50% pass rate. In the second job category there are 100 majority and 200 minority applicants and both groups have a 25% pass rate. In the aggregated table 125 of the 300 majority applicants pass while only 100 of the 300 minorities pass. In the aggregated data, 41.67% of majority members pass while 33.33% of the minority members pass. Applying Fisher′s exact test to the aggregated data yields a statistically significant result (p-value = 0.043); nevertheless, the pass rates of the minority and majority applicants to each job were identical.
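The two-stratum illustration above can be worked numerically; a minimal sketch, using the counts from the example:

```python
# Simpson's paradox: equal pass rates within each stratum, but a sizeable
# disparity after the strata are aggregated.
strata = [
    # (majority applicants, majority passes, minority applicants, minority passes)
    (200, 100, 100, 50),   # stratum 1: both groups pass at 50%
    (100, 25, 200, 50),    # stratum 2: both groups pass at 25%
]
for maj_n, maj_x, min_n, min_x in strata:
    assert maj_x / maj_n == min_x / min_n  # identical rates within each stratum

maj_rate = sum(s[1] for s in strata) / sum(s[0] for s in strata)  # 125/300
min_rate = sum(s[3] for s in strata) / sum(s[2] for s in strata)  # 100/300
# 41.67% vs 33.33%: aggregation manufactures a disparity no stratum exhibits
```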

2.3 Response of plaintiffs′ expert

Expert P′s response notes that the defendant′s major criticism of his original report was that combining all the test results for all 21 positions into a single sample before applying Fisher′s exact test was inappropriate and that the results for each job group should be analysed individually. He claims: (a) that a test examining pass rates for all applicants is the more appropriate test and (b) a test making a single determination of disparate impact in each job group ignores a general trend and conceals a substantial disparity by creating such small groups that no statistical significance can be found. 17

To support these arguments, expert P considers hypothetical scenarios where the racial composition of the applicants is the same (62% minority) for all job categories and the minority pass rate is 67% in all of the positions and majority pass rate is 83% in all positions. Assuming various possible total sample sizes, from 10 to 140, he demonstrates that statistical significance would not be found until there were 140 or more applicants in the job

15 Courts do accept this type of comparison when the number of selections, passers here, is a small fraction of the total population of applicants. In this situation, it can be shown (Gastwirth and Greenhouse, 1987) that the statistic based on Expert D′s comparison is a good approximation to the test of the two pass rates. This approximation is known as the binomial model (Finkelstein, 1966) and is appropriate for the analysis of data arising in fair jury representation cases as the number of individuals called for service during a period of 6 months to a year is a small fraction of the jury eligible population. In that context, the method was accepted by the Court in Castaneda v. Partida, 430 U.S. 482 (1977). There is a minor discrepancy between the range (-1.6% to 7.6%) of the minority proportions in the report of Expert D and the range (-1.0% to 11.8%) we calculated from Table 2.

16 Unlike Fisher′s exact test, the p-value of the procedure of Berger and Boos (1994) is not calculated from the conditional distribution of the number of minority passers given the total number of passers. Their test finds statistical significance in all the positions Fisher′s exact test does and in an additional one, the position AST. For the position AST, 14 of 28 minorities passed and 13 of 16 majority applicants passed. The p-value of the two-sided Fisher′s exact test equaled 0.0565 while the p-value of the two-sided Berger-Boos test is 0.044, which reaches significance at the 0.05 level. A freely available R programme for calculating the Berger-Boos test was developed by Calhoun (2015).

17 Declaration of plaintiffs′ expert filed in support of plaintiffs′ response to defendant′s motion for summary judgment at 4-6, points 12-16.


category. Expert P then points out that the fact that the disparities in smaller groups do not reach statistical significance does not imply that the overall disparity is not statistically significant. He then cites a number of statistical publications that describe appropriate methods for combining evidence from all the groups into one summary test. 18

Then the well-established MH test, which expert P calls the Peer Group test, is applied to the data in Table 2, yielding a highly significant result. 19 The declaration states: ′The Peer Group test

computes a level of statistical significance for the disparity while comparing only similarly situated applicants, by controlling for membership in a group. The test also provides a way to determine whether we should accept that the overall level of impact is an appropriate measure of the disparate impact for each group. This is referred to as a test for homogeneity′. 20 The declaration of Expert P states that the Peer Group test checks the method used by Expert D ′by testing statistical significance of a disparity when analyzing applicants by job group, and a check of my method by indicting whether the subgroups are homogeneous (meaning there is a consistent impact across all groups) and should be treated as a single homogeneous group rather than distinct subgroups′. 21 Then Expert P claims that the peer group test rejects Expert D′s assertion that the job titles exhibit different levels of disparity. The declaration states ′The Peer Group test compares the racial disparity in test results for each job (Classification Title), and combines the evidence from all jobs to determine whether the disparity is different in different jobs or homogeneous. The Peer Group test determines whether the evidence from all jobs points to a common level of disparity in test results, or points to different levels of disparity job to job. In other words, are the differences we observe job to job the result of chance or of real differences in the racial disparity in test results. The result: the Peer Group test concludes that the disparity is homogeneous for all jobs … The test thereby supports my analysis of all applicants as a homogeneous group.′ 22

Comments: Expert P is correct when he argues the fact that a disparity is not statistically significant in each stratum does not imply that the overall pattern of disparity is not significant. This point is very important when the sample sizes in many strata are small. 23 The authors have not been able to find a reference for the ′Peer Group test′ which Expert P said is the MH test. The MH test is not a test for homogeneity. 24 The most commonly used

18 Ibid at 10. In particular, the declaration cites the books by Finkelstein and Levin (2001) and Fleiss (1986) and the article by Gastwirth and Xu (2014).

19 Ibid at 11, reporting a p-value of 10^-41.

20 Ibid at 11, part 24.

21 Ibid. at 11 (point 24).

22 Ibid at 11-12 (point 26).

23 A statistical test in the small sample situation may have low power, i.e. might not be able to classify a legally meaningful difference as statistically significant. This issue is discussed in Zeisel and Kaye (1997) at p. 88, Collins and Morris (2008) and Gastwirth and Xu (2014). For example, suppose one is examining data on hiring, where a minority forms 20% of the qualified labour force in the area from which the employer draws applicants. Suppose the employer has ten units and each hires 10 new employees from the area. Suppose all ten units do not hire a single minority individual this year. If one examines each unit separately, one would compare the zero number of minority hires to the number, two, that would be expected. Under random selection of ten hires from the large labour force, the probability that none would be minority is .107. This probability is greater than .05, so statistical significance is not reached in that unit. When one looks at the 100 hires in all ten units and observes that none of them were minority, although one would expect about 20 to be, one should not need formal statistical analysis to conclude that the employer did not hire minority members. Indeed, the probability that ten random samples of 10 from this large qualified labour force would contain ′no′ minorities is about 2.00 × 10^-10, or less than one in a billion.
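The arithmetic in this footnote is easy to confirm:

```python
# With a 20% minority labour force, the chance a random draw of 10 hires
# contains no minorities, and the chance that all ten units' draws
# (100 independent hires) contain none.
p_none_one_unit = 0.8 ** 10     # about 0.107: not significant at the 0.05 level
p_none_all_units = 0.8 ** 100   # about 2.0e-10: far below one in a billion
```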

24 The MH test, also referred to as the CMH or Cochran-Mantel-Haenszel test, in our application is a test of whether there is an association between pass rates and race of applicants, conditioning on the job applied for. The procedure is carefully described by Agresti (1996, pp. 60-65) who notes that individuals in different strata, e.g. job groups, might well vary with respect to other characteristics, e.g. educational level that could affect their probability of passing and the type of job they would apply for.


homogeneity test is the Breslow-Day test (Breslow and Day, 1980), which checks whether the ′odds ratios′ 25 of the two pass rates in the 21 job categories are the same, i.e. whether the data is consistent with a common odds ratio underlying all 21 data sets. It will

be used to analyse the data in Table 2 in Section 3 and will essentially confirm Expert P′s conclusions that the data are consistent with having a common odds ratio and that the disparity between the minority and majority pass rates in the job groups is statistically significant. On the other hand, the fact that a p-value obtained from a test for homogeneity is higher than 0.05 does ′not′ necessarily justify combining the data from all 21 positions into a single table for analysis. Indeed, Whittemore (1978) discusses this issue, gives the mathematical criteria the data must satisfy to justify pooling all the data and provides an illustrative example. 26 A related concern, discussed by Samuels (1993), is whether the summary odds ratio obtained from the MH procedure and the single odds ratio from the aggregated data are in opposite directions, e.g. one is less than 1.0 while the other is greater than 1.0.
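The Mantel-Haenszel calculation described here can be sketched directly from the counts in Table 2. The code below is our illustration (not the parties′ or the authors′ programme); it applies the standard MH common-odds-ratio estimator and the MH chi-square (without continuity correction) to the 21 strata.

```python
# Mantel-Haenszel common odds ratio and test across the 21 Table 2 strata.
# Each tuple: (minority takers, minority passes, Caucasian takers, Caucasian passes).
from math import erfc, sqrt

strata = [
    (63, 44, 28, 26), (18, 18, 17, 17), (592, 536, 302, 293),
    (595, 416, 354, 310), (1340, 787, 868, 658), (26, 17, 9, 6),
    (28, 14, 16, 13), (32, 20, 49, 40), (18, 12, 17, 11),
    (40, 29, 22, 22), (529, 274, 219, 164), (65, 39, 74, 60),
    (90, 66, 87, 75), (51, 46, 66, 65), (8, 6, 3, 3),
    (33, 26, 33, 30), (6, 3, 4, 3), (8, 5, 7, 5),
    (26, 22, 28, 26), (42, 34, 14, 13), (18, 12, 25, 16),
]

num = den = 0.0               # components of the MH common odds ratio
obs = expect = variance = 0.0  # components of the MH chi-square (1 df)
for mn, mx, cn, cx in strata:
    a, b, c, d = mx, mn - mx, cx, cn - cx  # minority pass/fail, Caucasian pass/fail
    n = mn + cn
    num += a * d / n
    den += b * c / n
    obs += a                                    # observed minority passes
    expect += mn * (a + c) / n                  # expected under no association
    variance += mn * cn * (a + c) * (b + d) / (n ** 2 * (n - 1))

or_mh = num / den                  # about 0.40, as reported in the text
stat = (obs - expect) ** 2 / variance
p = erfc(sqrt(stat / 2))           # overwhelmingly significant
```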

2.4 Defence expert′s declaration in response 27

Expert D disputes the argument made by Expert P that group-wide analysis is widely accepted and mandated by the Regulations issued by the Federal Civil Rights Agencies. He notes that one part of those regulations requires employers who use pre-employment tests to maintain records for each job so they can make adverse impact determinations for each job. 28

From a statistical point of view, Expert D agrees that a group-wise analysis of similarly situated individuals is reasonable; however, he argues that the groups of applicants applying to the various jobs were ′not′ similarly situated. He notes that individuals applying to different positions are likely to vary in educational background and skills, so they will have different probabilities of passing the test. In addition, he notes that the passing requirements varied among the job categories; some positions required applicants to pass both portions of the AccuPlacer test while other jobs used only one of the two component tests. He concludes that one would expect the test to have different impacts on applicants in the different job titles, so they must be examined separately. 29

To buttress his argument, he expands his examination of testing whether the pass rates for the different positions were equal. Previously, he looked at the pass rate of all applicants and found they varied with the job. Now he studies whether the 21 pass rates of applicants of each race-ethnic group were equal. For whites, he found a statistically significant variation, described as ′The test indicates there is less than a .0001 probability that the

Individuals in the same stratum who have applied for the same position are more likely to have similar educational and other characteristics than applicants to different jobs.

25 If p is the probability of an event, the odds of its occurrence are p/(1-p). For example, if the probability of an event is 1/3, the odds of its occurrence are (1/3)/(2/3) or ½ or 1:2. In horse racing, this means that if the probability that your horse will win is 1/3, a fair bet would have you win $2.00 if you bet $1.00. This article uses this formula to calculate the odds ratios for each of the positions in Table 3, infra. For example, the pass rate for white applicants for the ACI job equals 0.8163, so their odds = 0.8163/0.1837 = 4.4437. The odds for minorities equal 0.625/0.375 = 1.6667, so the odds ratio is 1.6667/4.4437 = 0.375. In situations where one knows the total number of passers from both groups there is another estimator, called the conditional maximum likelihood estimate, which uses this fact. For most data sets of reasonable size, the two estimators give very similar answers, so the simpler estimate of the odds ratio is used in this article.

26 One of the examples concerns two 2 × 2 tables. In the first table or job type, 4 of 12 minorities and 3 of 8 majority members pass; in the second, 2 of 14 minorities and 3 of 18 majority members pass. The odds ratios in both tables equal 5/6, i.e. the tables are homogeneous. In the combined (collapsed) table 6 of 26 minorities and 6 of 26 majority members pass, so the pass rates of both groups are identical and the odds ratio is 1.0.
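The pooling paradox in this footnote is easy to verify numerically. The sketch below uses a hypothetical helper, not code from the article; the second table′s majority count is taken as 3 of 18 passing, the value consistent with the combined majority total of 26. Exact fractions make the homogeneity of the two strata, and the vanishing of the disparity after collapsing, unmistakable:

```python
from fractions import Fraction

def odds_ratio(min_pass, min_fail, maj_pass, maj_fail):
    """Odds ratio = (minority odds of passing) / (majority odds of passing)."""
    return Fraction(min_pass, min_fail) / Fraction(maj_pass, maj_fail)

# Table 1: 4 of 12 minorities and 3 of 8 majority members pass.
or1 = odds_ratio(4, 8, 3, 5)
# Table 2: 2 of 14 minorities and 3 of 18 majority members pass.
or2 = odds_ratio(2, 12, 3, 15)
# Collapsed table: 6 of 26 pass in each group.
or_pooled = odds_ratio(6, 20, 6, 20)

print(or1, or2, or_pooled)   # prints: 5/6 5/6 1
```

Each stratum shows minorities at a disadvantage (odds ratio 5/6 < 1), yet the collapsed table shows no disparity at all, which is why homogeneity alone does not justify pooling.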

27 Declaration of Defence Expert for case No. 11-cv-00256-MSK-KMT.

28 Ibid at 3 (point 8).

29 Ibid at 4 (points 9 and 10).


AccuPlacer success rates are the same across titles′. He finds similar results for both African-Americans and Hispanics. 30

With respect to Expert P′s discussion of the effect of the sample size on finding a statistically significant result, Expert D stated that this is not an accepted reason for deciding how to analyse a selection process. Rather, the decision on whether or not groups should be analysed separately should depend on economic theory and the relevant facts of the case as well as statistical theory. He points out that in some job categories the minority and majority pass rates were essentially equal and argues that those categories should not be included either in a simple pooled analysis or in a method aggregating the within-group differences. He also questions Expert P′s table indicating that at least 140 applicants were needed to find statistical significance. Expert D points out that Fisher′s exact test did find statistically significant results in three positions that had fewer than 140 applicants. 31

He notes that several experts have cautioned against routinely combining data. He quotes Gastwirth (1988) who wrote that ′It is useful to look at data in each of the strata or subgroups to make sure that the general pattern of odds ratios or differences from expected are similar … had minorities received a significant excess of promotions in one category, usually, one should not routinely combine this data with that of the three strata showing underrepresentation.′ 32

His final point notes that while statistical aggregation methods such as the MH test are useful for determining whether there is an overall significant result, they provide no information with regard to any single group. He presents an example (Table 5), which will be discussed in Section 5, to illustrate how a very significant result in one group or stratum, when combined with data from another group in which the pass rates are essentially the same, would lead to a statistically significant result in the MH combined analysis.

Comments: (1) The purpose of stratifying or forming similarly situated subgroups is to control for, or eliminate, the effect of other variables, such as educational background, in the analysis. While Expert D is correct in suggesting that these background factors are likely to differ between applicants for higher level positions and lesser jobs, when one analyses data stratified by the job applied for, this would only be a ′potential problem′ if there were a substantial difference in the distributions of background qualifications of the minority and majority individuals applying for the ′same′ position. Although this might be the case, Expert D does not provide any evidence that applicants for the same position differ in their amounts of education or relevant experience. Thus, Expert D′s demonstration that the differences in the pass rates of applicants of each race across the ′different′ positions are statistically significant does not show that a stratified analysis, which compares the pass rates of minority and majority applicants ′separately′ for each position and then combines those comparisons, is inappropriate. It does, however, raise a legitimate question about the propriety of simply pooling the data for all positions into a single table, as in Table 1, for analysis.

(2) Expert D is correct when he points out that the sample size calculations of Expert P do not apply to the individual subgroups. In those calculations, Expert P assumed that the demographic mix and pass rates of minority

30 Ibid at 5 (point 11). Again, notice that Expert D makes the same misinterpretation of the p-value of a statistical test as Expert P made in Table 1 of Section 2. The probability 0.0001 is calculated assuming the pass rates in all 21 positions are equal. It means that the probability of observing a variation in these rates at least as large as that observed in the data is one in ten thousand, assuming the pass rates were equal. This result tells us that if the hypothesis or assumption of equal pass rates in the 21 positions is true, our data were very unlikely to occur. Such a rare outcome seems inconsistent with the assumption of equal pass rates, so we infer that this assumption is not correct and the alternative assumption that the pass rates vary is more likely to be correct.

31 Ibid at pages 7 and 8. In Table 2, the jobs CCT, 311 CSA and EA I had sample sizes of 62, 91 and 139, respectively. Note that for those three jobs, the proportions of minority candidates and the pass rates for minority and majority applicants all differ from the numbers in Expert P′s assumption. In particular, the proportions of minority candidates are 64.5%, 69.2% and 46.8%, different from the 62% in Expert P′s assumption. The minority pass rates for those three jobs are 72.5%, 69.8% and 60%, also different from the 67% in Expert P′s assumption. The majority pass rates are 100%, 92.9% and 81.1%, different from the 83% in Expert P′s assumption.

32 Ibid at page 8 quoting Gastwirth (1988) at p. 236.


and majority applicants were the same in all 21 job categories. Because the minority proportions of applicants for the positions in Table 2 vary noticeably, a preferable analysis would have determined the minimum sample size in each stratum needed to have reasonable power to detect a prespecified odds ratio or other measure of the difference in the pass rates. 33

(3) Courts have noted that plaintiffs cannot focus on hiring data for a few months in which relatively few minority members were hired to prove that the defendant had a discriminatory policy when the full year′s data indicate fair hiring. Considering a yearly hiring pattern for each month separately is similar to considering the pass rate differences separately for each position. Sampling variation in the qualifications of applicants over a year can produce data for one or two months that appear discriminatory, although the data for the entire year indicate fair hiring. Similarly, a test that generally has a disparate impact may yield roughly equal pass rates in a few groups. 34

3. An alternative analysis of the data

Before analysing the data it is useful to review the commonalities in the analysis of data arising in multi-centre clinical trials and other stratified data sets that occur in medical statistics and EEOC cases as many of the statistical methods were originally developed to analyse data arising in biostatistics. 35 There is not a perfect parallel because the statistical inferences drawn from the employment data of one defendant only apply to that specific defendant and indeed only to the time period relevant to the particular case. A survey by Fleiss (1986) emphasizes that multi-clinic trials are used to obtain an adequate sample size to detect a difference between treatments and the need to ′generalize′ the study′s results to more than one kind of patient and one kind of treatment facility. The patients in the different clinics may vary in the average levels of their response to the treatment received due to their varying demographics and prior medical histories. Additionally, Fleiss (1986) observes that there often will be some deviations from the study protocol between the clinics and they may use somewhat varying criteria to evaluate response to treatment. Thus, one expects some variation in the response to the treatments under investigation among the clinics and in the differences between the response rates of treatment and control groups. This variation in the differences in response is called treatment by clinic interaction and can cause difficulties in interpreting the results of the standard statistical analysis of stratified data. Consequently, Fleiss recommends using a significance level of 0.10 rather than the usual 0.05, to test for interactions. When interaction is not a serious issue, Fleiss (1986) observes that averaging ′within′ clinic differences is almost always justified, 36 while combining the data from each

33 In some applications, the difference between the pass rates or the ratio of the smaller rate to the larger rate is used as the measure of the magnitude of the disparity.

34 See Roman v. ESB, 550 F.2d 1343 (4th Cir. 1976), noting that plaintiffs cannot focus on data for an isolated time period (3 months) to prove a pattern of discrimination in hiring data for an entire year. Similarly, in jury discrimination cases, the representation of minorities on jury pools is determined by analysing demographic data on individuals called for jury service over a reasonable time frame, not a single defendant′s jury or venire. A fair system, which approximates random sampling of the jury-eligible population, will produce some venires in which minorities are under-represented due to normal sampling fluctuation. Similarly, sampling fluctuations in an unfair system will create some isolated periods where minorities are represented in proportion to their share of eligibles. Thus, data for all 12 months in a year, or all relevant job groups, should be considered in assessing whether a test used in all of them has a disparate impact.

35 For example, the MH (1959) test, which refined the earlier procedure of Cochran (1954), was developed for the analysis of epidemiologic studies.

36 While Fleiss (1986) focused on the difference in mean responses, the focus in disparate impact is on the difference in success rates. Government guidelines use the selection ratio or the minority pass rate divided by the majority pass rate to assess the difference and state that in general a test with a selection ratio less than four-fifths or 80% has a disparate impact. Statisticians often use the odds ratio of the rates to measure the magnitude of the disparity in success or response rates in the two groups (Mosteller, 1968). When there is no interaction, the appropriate average of the differences in the various clinics is considered the estimate of the overall effect of the treatment. In our application, no interaction means that the odds ratios of the


clinic into one large data set for analysis, i.e. ignoring differences between the clinics, is only rarely justified.

In the context of the Kerner v. Denver case, the strata are the different positions and correspond to the clinics in multi-centre clinical trials. Because the response under scrutiny is the passing of a test, the potential problem of differential measurement of the response to treatment between clinics does not arise. Also, applicants for a specific position are likely to have similar educational levels, so there probably is less variation between minority and majority applicants within each stratum than there is in the prior medical history of patients in a single clinic. On the other hand, interaction between the position and the odds ratios of the pass rates between minority and majority applicants could well exist and an appropriate statistical test will be used to check that this is not a problem in the Kerner data.

With respect to the expert reports in Kerner, these considerations support expert D′s questioning of the first analysis of expert P, who combined the data for all positions and did not provide a statistical justification for doing so. On the other hand, expert D′s demonstration that the pass rates of each of the race-ethnic groups varied significantly with respect to the position applied for, while interesting, is not especially relevant to the main issue, which is whether the odds ratios of the success rates of the two groups are similar across the different strata. 37

Because both the magnitude of a difference and the sample sizes in the two groups affect the p-value of a test, which determines whether a difference reaches statistical significance, plaintiffs and their experts often aggregate data before analysing it, while defendants and their experts argue for analysing the data in each subgroup. For example, plaintiffs in both Wal-Mart cases 38 aggregated data on promotions from all stores or districts within a region of the nation into one 2 × 2 table as in Table 1. In the context of certifying a nation-wide class action, the Supreme Court rejected that analysis because of a concern that a small number of stores or districts could create a statistically significant result, even if the promotions were fair in a majority of locations. 39 Following Miao and Gastwirth (2016) and Gastwirth et al. (2003), we will apply a statistical test to assess whether it is appropriate to combine the stratum-wide differences into a single summary statistical test. 40

While the information required for these tests is contained in Table 2, the results will be clearer if the data are reformulated so that the minority and majority pass rates for each position (stratum) are reported, rather than the minority shares of applicants and passers. Table 3 provides this information, along with the odds ratio for each position. Examining Table 3, one notices that ′all′ applicants for the Administrative Support Assistant I (ASA I) position passed the test. Hence, that data will not enter

Table 3. Reformulation of the minority and majority pass rate data for disparate impact analysis

pass rates in the different strata follow a consistent pattern, i.e. almost all are less than (or greater than) one and the ones that might be in the opposite direction are barely greater (less) than one. In this situation, the MH test can be applied to the data and has an associated estimate of the summary or overall odds ratio, which is a weighted average of the odds ratios in the individual strata.

37 Statisticians often use the odds ratio to assess the disparity in success rates but the difference between the pass rates is also used. Radhakrishna (1965) investigated the correlation between the most powerful test when the differences between the pass rates in the strata are equal and the MH test, which is optimal when the odds ratios are the same in all the strata. In most situations, the two procedures have a high correlation, so using the MH test when the differences in pass rates are consistent with a common value is still a reasonable procedure.

38 Wal-Mart Stores, Inc. v. Dukes, 131 S. Ct. 2541 (2011) and 964 F. Supp. 2d 1115 (N.D. Cal. 2013).

39 Ibid at 2555.

40 To check whether the odds ratios of the pass rates in several 2 × 2 tables are equal (or homogeneous), the Breslow-Day (1980) test is commonly used. This checks whether the data is consistent with a common odds ratio in all tables. Thus, it is testing for a stronger requirement than a test for interaction, which focuses on whether some odds ratios are less than 1.0, corresponding to the situation when the pass rates in that strata or group are equal, while others are greater than 1.0. The Gail-Simon (1985) test is appropriate for this problem.


Job title   White pass   White fail   White pass rate (%)   Minority pass   Minority fail   Minority pass rate (%)   Odds ratio
311 CSA             26            2                 92.86              44              19                   69.84       0.1781
ASA I               17            0                100.00              18               0                  100.00           NA
ASA II             293            9                 97.02             536              56                   90.54       0.2940
ASA III            310           44                 87.57             416             179                   69.92       0.3299
ASA IV             658          210                 75.81             787             553                   58.73       0.4542
ASA V                6            3                 66.67              17               9                   65.38       0.9444
AST                 13            3                 81.25              14              14                   50.00       0.2308
ACI                 40            9                 81.63              20              12                   62.50       0.3750
ACSA                11            6                 64.71              12               6                   66.67       1.0909
CCT                 22            0                100.00              29              11                   72.50           NA
ET                 164           55                 74.89             274             255                   51.80       0.3604
EA I                60           14                 81.08              39              26                   60.00       0.3500
EA II               75           12                 86.21              66              24                   73.33       0.4400
FIC                 64            2                 96.97              46               5                   90.20       0.2875
HO                   3            0                100.00               6               2                   75.00           NA
HRST                30            3                 90.91              26               7                   78.79       0.3714
HRT                  3            1                 75.00               3               3                   50.00       0.3333
MCT                  5            2                 71.43               5               3                   62.50       0.6667
PT                  26            2                 92.86              22               4                   84.62       0.4231
VIC                 13            1                 92.86              34               8                   80.95       0.3269
WQI                 16            9                 64.00              12               6                   66.67       1.1250

into the MH test. 41 Because there were only 35 applicants for that position out of a total of about 3600, it is sensible to delete this small portion of the data and examine the data for the 20 other positions. In the context of an EEOC case, this also makes sense because ′none′ of the applicants for this position would be eligible for damages, as the test had no effect on their chances of obtaining a job. Thus, it would have been appropriate for the court to exclude applicants in that job category from the class.

The reader should be aware that odds ratios calculated from small samples, e.g. the job categories FIC, HO, HRT, MCT and VIC, are quite variable, so one needs to apply a formal test to decide whether the variation is so great that the odds ratios are not consistent with a common underlying one. The appropriate statistical method is the Breslow-Day test (Breslow and Day, 1980). Firstly, it calculates the MH estimate of a common odds ratio and then predicts what the data in each stratum would look like if they were generated from a system with that common odds ratio. Then the test compares the actual data in Table 3 to those predictions.

Applying the test to the data for the 20 positions, yields a p-value of 0.69, far exceeding the recommended threshold (Fleiss, 1986) of 0.10 for statistical significance for interaction type tests. 42

41 Examining the formula for the MH statistic (Gastwirth, 1988 p. 231; Agresti, 1996, p. 61) the contributions of that job group or stratum to both the numerator and denominator are zero.

42 The reader might wonder why a less stringent criterion for statistical significance is recommended for the homogeneity test. The reason is that the homogeneity test is used to see whether the data are consistent with an assumption made by the MH test when applied in this context. In contrast, the main hypothesis under study is whether the common disparity on minority


Thus, the observed pass rates for the 20 jobs are consistent with a common odds ratio, and the MH estimated common odds ratio is 0.403, which is statistically significantly less than 1.0 (p-value < 10^-9). Furthermore, an odds ratio of 0.40 means that the odds a minority applicant had of passing the AccuPlacer test were ′less than one-half′ those of Caucasian applicants. This is a substantial disparity between the minority and majority pass rates. For example, suppose 50% of majority applicants pass an exam. For the odds ratio to equal 0.4, the pass rate for the minority group would need to be 28.6%. It is difficult to imagine a court not finding that such a large and highly statistically significant difference in pass rates establishes a prima facie case of disparate impact, even without any formal statistical hypothesis testing.
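As a cross-check, the MH summary odds ratio can be recomputed directly from the counts in Table 3 (ASA I excluded). The sketch below is a plain re-implementation of the textbook MH estimator, not the authors′ code; it yields roughly 0.41, in line with the reported 0.403, with the small gap plausibly reflecting a slightly different estimator (e.g. the conditional MLE) or rounding:

```python
# Per-position counts from Table 3: (minority_pass, minority_fail,
# white_pass, white_fail); ASA I (everyone passed) is excluded.
tables = [
    (44, 19, 26, 2),      # 311 CSA
    (536, 56, 293, 9),    # ASA II
    (416, 179, 310, 44),  # ASA III
    (787, 553, 658, 210), # ASA IV
    (17, 9, 6, 3),        # ASA V
    (14, 14, 13, 3),      # AST
    (20, 12, 40, 9),      # ACI
    (12, 6, 11, 6),       # ACSA
    (29, 11, 22, 0),      # CCT
    (274, 255, 164, 55),  # ET
    (39, 26, 60, 14),     # EA I
    (66, 24, 75, 12),     # EA II
    (46, 5, 64, 2),       # FIC
    (6, 2, 3, 0),         # HO
    (26, 7, 30, 3),       # HRST
    (3, 3, 3, 1),         # HRT
    (5, 3, 5, 2),         # MCT
    (22, 4, 26, 2),       # PT
    (34, 8, 13, 1),       # VIC
    (12, 6, 16, 9),       # WQI
]

# Mantel-Haenszel estimate of the common odds ratio:
# sum(a*d/n) / sum(b*c/n), with a = minority pass, b = minority fail,
# c = white pass, d = white fail, n = stratum total.
num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
or_mh = num / den
print(round(or_mh, 3))   # 0.409; the article reports 0.403

# Translating an odds ratio of 0.4 into pass rates: if 50% of the
# majority pass (odds 1.0), the minority odds are 0.4, i.e. a pass
# rate of 0.4 / 1.4, about 28.6%.
minority_rate = 0.4 / (1 + 0.4)
```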

4. Consistency of the observed number of statistically significant results with the estimated common odds ratio

The defendant argued that because applying Fisher′s exact test to the data in each of the twenty-one job titles yielded only eight statistically significant differences, one could not conclude the test had a disparate impact. First, notice that in three strata the sample size is too small for Fisher′s test to find a statistically significant disparity even if the pass rates of whites and minorities are quite different. 43 This means that the defendant found a statistically significant result at the .05 level in eight of the eighteen job categories where the sample sizes are large enough for significance to be reached. 44

Two types of calculation help one understand the strength of observing 8 statistically significant results in 21 comparisons. Firstly, when the hypothesis of equal pass rates is true and 21 data sets are analysed at the 0.05 level, one expects about ′one′ of the 21 to be statistically significant as a result of normal statistical fluctuation. 45 The probability of finding 8 or more statistically significant results at the 0.05 level when the pass rates are the same in all 21 groups is just 4.41 × 10^-6, or less than one in two hundred thousand. Thus, finding at least eight significant results in 21 tests should lead to rejecting the hypothesis that the minority and majority applicants had the same pass rate in each of the positions. 46
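The binomial calculation behind these figures can be reproduced in a few lines (a sketch, not the authors′ code): under the null, the number of rejections among 21 independent 0.05-level tests is Binomial(21, 0.05), and the quoted figure is its upper tail at 8.

```python
from math import comb

# Probability of 8 or more "significant" results among 21 independent
# 0.05-level tests when every null hypothesis is true.
n, p0 = 21, 0.05
tail = sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(8, n + 1))
print(tail)   # about 4.41e-06, i.e. less than one in 200,000

# The expected number of significant results under the null is n * p0.
expected = n * p0   # 1.05, matching the "about one" in the text
```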

A second calculation compares the number of statistically significant results one expects to observe when the disparities in the pass rates in the different strata are consistent with a common odds ratio of a legally meaningful size. The authors believe it would be reasonable for courts to consider odds ratios of 0.667 or 0.50 as

applicants is significantly below what its value would be if the pass rates were equal in all job groups. Statisticians often recommend the use of significance levels greater than 0.05 for a preliminary test that checks whether the data are consistent with the assumptions needed for the validity of the statistical procedure used to test the main hypothesis. For example, Bancroft (1944) suggested the use of 0.25 for checking whether several data sets that will be analysed in a regression can be pooled into one common sample, and Gastwirth et al. (2009) found that the level 0.15 was appropriate when checking, with the Levene (1960) test, whether the variances of several data sets are equal. From the model-building point of view, if we fit a logistic regression predicting the pass rates, the model using race and job title produces a BIC of 6255, while the model with race, job title and their interaction gives a BIC of 6277. Therefore, the model without interaction is preferred.

43 The three strata are HO, HRT and VIC.

44 One could argue that the configuration of the data for the ASA I position also precludes finding a statistically significant result, as all applicants passed; however, to be fair to the defendant we include it because, before one examined the data and saw that all applicants passed, a statistically significant disparity could have been observed.

45 Since each data set has probability 0.05 of yielding a statistically significant result, the expected total number of significant data sets is 21 × 0.05 = 1.05. Even though the data configuration in three positions had no possibility of yielding a statistically significant Fisher′s exact test, to be fair to the defendant we consider testing all 21 data sets. The relationship between sample size and the possibility of finding statistical significance was illustrated in the second declaration of Expert P, supra n. 12 at 8-10, points 19-21, but Expert D, supra n. 22 at 9-10, point 15, showed that the findings in the declaration of Expert P did not apply to all the data sets in the case.

46 It is important to remember that we are not testing that the pass rates of all applicants in all the job categories are the same; we are testing that the pass rates of minority and majority applicants for each of the jobs are the same.


indicating a disparity worth examination. 47 The power of Fisher′s exact test at a particular alternative, e.g. when the odds a minority passes the test are one-half (0.50) those of a white, is the probability that the test of the difference in pass rates will classify the observed disparity as statistically significant. 48 Table 4 shows the power of a one-sided Fisher′s exact test to detect odds ratios of 0.75, 0.667, 0.50 and 0.40 in each of the job titles. As noted previously, Fisher′s exact test of the data in three jobs can never find a statistically significant difference, i.e. its power is zero. For the position ASA I, where every applicant passed, the power is also zero. If the common odds ratio were two-thirds, one would expect Fisher′s exact test, focusing on a disparity disadvantaging minorities, to detect only 4.398 significant results, i.e. one expects to find a statistically significant difference in pass rates in about four or five job titles out of the 17 in which statistical significance is possible. If the common odds ratio were 0.50, one expects 6.68 of the tests to be significant. If the common odds ratio equalled 0.40, one expects 8.28 tests to reach significance, which is highly consistent with the eight significant results obtained by Expert D.

Note: The power of the test for each position is calculated from the distribution of the number of minority passers given the numbers of minority and majority applicants and the total number of passers from both groups. This distribution is known as the non-central hypergeometric distribution and is described in Agresti (2002, p. 99). Because the formula is complicated, the results in the table were obtained using the pFNCHypergeo function in the BiasedUrn package in R developed by Fog (2015).
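For readers without R, the same power computation can be sketched in pure Python (the function name `fisher_power` is hypothetical; this mirrors, but is not, the BiasedUrn computation): determine the one-sided rejection region from the central hypergeometric distribution, then evaluate its probability under the non-central distribution with odds ratio omega.

```python
from math import comb

def fisher_power(min_n, maj_n, total_pass, omega, alpha=0.05):
    """Power of a one-sided alpha-level Fisher exact test (alternative:
    minorities pass less often) when the true odds ratio is omega.
    Margins (applicant counts, total passers) are held fixed, so the
    minority pass count follows Fisher's noncentral hypergeometric law."""
    lo = max(0, total_pass - maj_n)
    hi = min(min_n, total_pass)
    support = list(range(lo, hi + 1))

    def weights(w):
        # Unnormalized noncentral hypergeometric weights at odds ratio w.
        return [comb(min_n, x) * comb(maj_n, total_pass - x) * w**x
                for x in support]

    null_w = weights(1.0)
    null_total = sum(null_w)
    cum, crit = 0.0, None
    for x, wgt in zip(support, null_w):
        cum += wgt / null_total
        if cum <= alpha:
            crit = x        # largest count still rejected at level alpha
        else:
            break
    if crit is None:        # no outcome can reach significance: power 0
        return 0.0
    alt_w = weights(omega)
    return sum(w for x, w in zip(support, alt_w) if x <= crit) / sum(alt_w)

# HO stratum: 3 whites (all passed), 8 minorities, 9 passers in total;
# no outcome can be significant, so the power is zero.
print(fisher_power(8, 3, 9, 0.4))   # 0.0
# 311 CSA: 63 minority and 28 white applicants, 70 passers in all.
print(fisher_power(63, 28, 70, 0.4))
```

Smaller odds ratios shift the noncentral distribution toward fewer minority passers, so the power at omega = 0.4 necessarily exceeds that at omega = 0.75, as Table 4 shows.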

5. Further discussion of the expert reports and rebuttal declarations

The analysis presented in the previous two sections clearly supports the plaintiffs′ claim that the AccuPlacer test had a statistically significant disparate impact on minority applicants to most of the positions for which Denver used it. Neither expert appears to have checked the data in Table 3 to ascertain whether it was appropriate to combine the differences or odds ratios of the strata into one summary measure of disparity. Expert D raised an interesting point in his rebuttal declaration. 49 After noting that one should check that the general pattern of the odds ratios or differences is similar, he points out that the MH and similar aggregation techniques do not provide information with respect to a particular group. He provides an example of data from a hypothetical sex discrimination case where there is a statistically significant shortfall of female hires in one position (Sales) while the hire rates are equal in the second position (Accounting). The data and statistical results are given in Table 5.

Following the approach used in our reanalysis in Section 3, before analysing the data with the MH test one should see whether the data for the two positions in Table 5 have a common odds ratio. Both

Table 4. The power of a 0.05 level one-sided Fisher′s exact test to detect various odds ratios

Classification title   OR = 0.75   OR = 0.667   OR = 0.5   OR = 0.4
311 CSA                   0.0408       0.0616      0.1443     0.2410
ASA II                    0.2190       0.3546      0.7220     0.9037
ASA III                   0.5376       0.7928      0.9948     0.9999

47 The determination of the measure and specification of a legally meaningful degree of disparity is a legal, rather than a statistical, question. The government guidelines specify that a selection ratio, i.e. the ratio of the minority to majority pass rate, less than 0.80 indicates the test or job requirement has a disparate impact. There is not a direct correspondence between the selection and odds ratios of a pair of pass rates or probabilities. When 40% of a minority group passes a test and 50% of the majority pass, the selection ratio is 0.80 and the odds ratio is 0.667. This suggests that a test or job requirement where the odds ratio of the pass rates is 0.667 is worth examination.

48 See Gastwirth (1988) at 132-150, Zeisel and Kaye (2000) at 88 for a discussion of the concept of power and its importance and Gastwirth and Xu (2014) for a description of the relationship of the power of a test and the size of the available sample.

49 Supra n. 22 at 9 and 10, points 17 and 18.


ASA IV                    0.9238       0.9965      1.0000     1.0000
ASA V                     0.0264       0.0357      0.0689     0.1063
AST                       0.0966       0.1314      0.2484     0.3652
ACI                       0.1396       0.1980      0.3912     0.5656
ACSA                      0.1044       0.1381      0.2483     0.3569
CCT                       0.0907       0.1193      0.2123     0.3033
ET                        0.5295       0.7831      0.9939     0.9999
EA I                      0.1522       0.2378      0.5221     0.7408
EA II                     0.1190       0.1925      0.4560     0.6799
FIC                       0.0591       0.0792      0.1488     0.2231
HO                             0            0           0          0
HRST                      0.0946       0.1266      0.2331     0.3392
HRT                            0            0           0          0
MCT                       0.0352       0.0446      0.0761     0.1097
PT                        0.0206       0.0281      0.0557     0.0881
VIC                            0            0           0          0
WQI                       0.0512       0.0739      0.1606     0.2606
Sum                         3.24         4.39        6.68       8.28

the original Breslow-Day (BD) test and the version incorporating the correction developed by Tarone (1985) yield a p-value of 0.03, ′well below′ the recommended threshold of 0.10. Because the odds ratios in the two strata in Table 5 are statistically significantly different, one should not routinely use the MH test as Expert D did.

The example given by Expert D demonstrates that one should not simply apply the MH or other statistical procedure combining stratum-specific analyses without ensuring that the odds ratios or other measure of disparity are consistent across the strata. 50 This is why the analysis in Section 3 used the BD test to check that the odds ratios of the pass rates in the 20 positions were sufficiently similar that the MH procedure can be applied.

6. Summary and discussion

The problem of analysing applicant or employee data that is either naturally stratified by location or by job type has arisen in several important cases. 51 The first set of expert reports in Kerner display a

Table 5. Expert D′s illustrative example of a potential pitfall in the analysis of two 2 × 2 tables

Dept.   Total applied   Female applied   Total hires   Female hires   Expected no. female hires   p-value   Stat. signif.

50 The BD test checks to see whether the odds ratios in each stratum have a common value. This is a more stringent hypothesis than there is no interaction between the odds ratios and the strata as all the odds ratios may be less than 1.0, indicating minorities performed worse on the test in all the strata but they may not have a common value. The appropriate test for no interaction is the Gail-Simon (1985) test.

51 See Wal-Mart Stores, Inc. v. Dukes, 131 S. Ct. 2541 (2011) (rejecting a statistical analysis in which plaintiffs pooled the employees in all the stores in each region of the country into one large sample because an overall disparity could have been a

Page 17: Content Type Narrowed by Reviews and Journals

Page 16 of 19

Statistical issues arising in the Kerner v. Denver: A class action disparate impact case

Signif.Sales 125 92 18 6 13.25 0.0001 Yes

Account. 158 65 7 3 2.88 0.2992 No

Total and MH analysis

283 157 25 9 16.13 0.0212 Yes

Source: Extracted from the table on page 9 of the declaration of defendant′s expert.

typical pattern. Plaintiffs create as large a sample as possible, which may lead to small differences in pass rates being statistically significant even though they may not be legally meaningful. As the Court noted in Dukes v. Wal-Mart a disparity in highly aggregated data may be due to disparities in a few of the component data sets. In contrast, the defendants argue that the data should be analysed separately for each stratum. Some courts have noted that repeatedly disaggregating the data into many small groups in which it is difficult to find statistically significant results may mask an overall decision making process that produces a discriminatory result. 52 Indeed, combination of stratified data is needed if the sample is so small that hiring ′zero′ minorities would not be classified as statistically significant. 53

The opposite problem arose in EEOC v. Autozone, where the plaintiffs' expert analysed hiring data for managerial jobs by categorizing them into three salary bands, analysing the female and male hiring data in two time periods separately, and also combining the data in each job category for both periods. 54 He found statistically significant under-hiring in one job category (middle-level managers). When the fact that he conducted nine tests on the data is taken into account, the result is no longer statistically significant. 55
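The effect of multiple testing noted here is easy to quantify. Assuming the nine tests were independent and each conducted at the 0.05 level, the chance of at least one nominally significant result when all null hypotheses are true is 1 - (0.95)^9, about 0.37; a Bonferroni-style correction would instead require each p-value to fall below 0.05/9. A short sketch (the function name is ours):

```python
# Probability of at least one "significant" result among k independent
# tests of true null hypotheses, each conducted at level alpha.
def familywise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

# Nine tests, as in the Autozone analysis discussed above.
fwe = familywise_error(9)   # roughly 0.37, exceeding one-third
bonferroni = 0.05 / 9       # per-test threshold keeping the overall level at 0.05
```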

Several judicial opinions have properly questioned whether strictly requiring the p-value of a statistical test to reach the 0.05 level is appropriate. Indeed, Judge P. Higginbotham in Vuyanich describes the p-value as a sliding scale, i.e. as its value decreases the evidence against the null hypothesis increases, and Judge Posner in Kadas v. MCI Systemhouse Corp. 56 points out that this threshold is used by journal editors to screen papers and should not be a rigid criterion for statistical evidence in court cases. These observations are especially relevant in the analysis of stratified data, where the sample sizes and configurations of some subgroups preclude finding statistical significance. There are several statistical procedures that utilize the p-values of the tests applied to the data in the strata to calculate one summary test statistic, rather than simply counting the number that are below 0.05. These procedures, like the MH test, use more of the information in the data and are usually more appropriate. Because courts need to examine expert testimony and reports that simply summarize the results of many statistical tests by whether or not they reach significance at the 0.05 level, it is important that they ask the expert to provide both the p-value and the power each test has of detecting a meaningful disparity, and why an overall or summary procedure was not used. 57

52 See Segar v. Smith, 738 F. 2d 1249 at 1286.

53 See Vuyanich v. Republic National Bank, 505 F. Supp. 224 (N.D. Texas 1980).

54 2005 WL 3591641 (W.D. Tenn.).

55 See Gastwirth (2008) for further discussion of the statistical issues in the case and the need to adjust the criteria for determining statistical significance when multiple tests are applied to data. If the nine tests were statistically independent, the probability of finding at least one statistically significant result at the 0.05 level, when the employer's hires were consistent with a random sample of the available labour force for each of the job categories, would be more than 0.333, or one-third. The need to consider the potential effect of multiple testing on the conclusions drawn from a statistical analysis was noted recently by Judge S. Williams in his concurrence in Shea v. Kerry, 796 F 3d 42 (2015).

56 255 F.3d 359 (7th Cir. 2001).
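One classic procedure of the kind described above, which combines the stratum p-values into a single summary statistic, is Fisher's combination method; the article does not name a specific procedure, so this choice is ours. Under the overall null hypothesis, -2 times the sum of the logarithms of k independent p-values follows a chi-square distribution with 2k degrees of freedom. A minimal pure-Python sketch with hypothetical stratum p-values:

```python
import math

# Fisher's method: combine independent stratum p-values into one summary test.
def fisher_combined(pvalues):
    """Return (test statistic, combined p-value)."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)
    # Chi-square survival function for even df = 2k has a closed form:
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return stat, math.exp(-half) * total

# Hypothetical stratum p-values: none reaches 0.05 on its own,
# yet the combined evidence is statistically significant.
stat, p = fisher_combined([0.08, 0.10, 0.07, 0.09])
```

This illustrates the point in the text: counting how many strata fall below 0.05 (here, none) discards information that a summary procedure retains.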

Many of the statistical techniques for analysing stratified data were originally developed for the analysis of bio-medical data, to ensure that the patients given a new drug and those given a placebo or an existing drug were balanced with respect to other factors, such as age and health status. For example, one could stratify patients in a heart study by the number of risk factors, e.g. high blood pressure, high cholesterol or a history of smoking. Similarly, case-control studies assessing whether a disease is related to an exposure are often carried out by several researchers in different locations, and all of the results need to be examined together. The Cochran (1954) and MH (1959) tests were developed to analyse such data.

In order to obtain an adequate sample size, many studies of new drugs are conducted in several clinics or hospitals. Because the hospitals may use somewhat different protocols and the patients in the different communities are exposed to different environmental factors, the Breslow-Day test and similar procedures were developed to check whether the results obtained from each clinic could be combined to provide an overall or summary estimate of the effectiveness of the drug. The Breslow-Day test checks whether the odds ratios are the same in all subgroups; however, a new drug would be an improvement even if the odds ratios differed, provided they were all in the same direction, indicating an increased survival rate. The Gail-Simon (1985) test was developed to test this weaker restriction on the odds ratios.
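The Breslow-Day homogeneity check can itself be sketched briefly. The Python code below is our own sketch: it plugs in the MH estimate as the common odds ratio, whereas the original BD proposal and Tarone's correction differ in details, so it need not reproduce the p-value of 0.03 quoted earlier for Table 5, though it reaches the same conclusion that the two stratum odds ratios are not homogeneous.

```python
import math

def breslow_day(tables):
    """Chi-square statistic for homogeneity of odds ratios across 2x2 strata.

    Each table is (a, b, c, d): rows group 1 / group 2, columns success /
    failure. Compare the result to a chi-square with (#strata - 1) df.
    """
    # MH estimate of the common odds ratio.
    psi = (sum(a * d / (a + b + c + d) for a, b, c, d in tables) /
           sum(b * c / (a + b + c + d) for a, b, c, d in tables))
    stat = 0.0
    for a, b, c, d in tables:
        n1, n2, m1 = a + b, c + d, a + c
        # Fitted count for cell a under odds ratio psi solves the quadratic
        # A * (n2 - m1 + A) = psi * (m1 - A) * (n1 - A).
        qa = 1 - psi
        qb = n2 - m1 + psi * (m1 + n1)
        qc = -psi * m1 * n1
        if abs(qa) < 1e-12:                 # psi == 1: equation is linear
            fit = -qc / qb
        else:
            fit = (-qb + math.sqrt(qb * qb - 4 * qa * qc)) / (2 * qa)
        # Variance of cell a under the fitted table.
        var = 1.0 / (1 / fit + 1 / (n1 - fit) + 1 / (m1 - fit)
                     + 1 / (n2 - m1 + fit))
        stat += (a - fit) ** 2 / var
    return stat

# Table 5 strata: (female hires, female non-hires, male hires, male non-hires)
bd = breslow_day([(6, 86, 12, 21), (3, 62, 4, 89)])   # exceeds 3.84, so the
# hypothesis of a common odds ratio is rejected at the 0.05 level (1 df)
```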

When analysing data obtained from multiple clinics or studies submitted to support an application for a new drug, the Food and Drug Administration (FDA) requires the overall statistical test to reach significance at the 0.05 level, but assesses whether the data are homogeneous using a 0.10 level test. 58 Thus, the basic analytic approach recommended here is a translation or adaptation of methods with a long history of successful use in the analysis of bio-medical and epidemiologic data to the EEOC context. Unlike clinical trials, where patients can be randomly assigned to the new or existing treatment, the individuals who apply for various jobs are self-selected. 59 Thus, there may be a greater need to consider data stratified into subsets containing similarly qualified minority and majority applicants in the EEOC context. An appropriate test for homogeneity, or for a common pattern, is then applied to the data to ensure the appropriateness of using the MH or other combination method to obtain a test of significance and an estimate of the overall disparity, if any, in the success rates of the minority and majority groups.

Neither expert in the Kerner case reports the results of an appropriate test to check that the odds ratios or other measures of disparity in pass rates were sufficiently similar that the stratified data could be analysed by the CMH or a similar test using the data in all the strata. Such a check is needed before examining whether the exam had an overall disparate impact and before combining the stratum-specific odds ratios into a summary estimate of that impact. Indeed, it was only in their rebuttal declarations that they applied the MH test. In contrast, our analysis showed that, with the exception of one position, it was appropriate to combine the stratum results into an overall summary statistical test, which demonstrated that minorities had a statistically significantly lower pass rate on the Accuplacer test than whites. 60

57 Similarly, if an expert uses the MH or other method combining analyses of individual strata into a summary test, a judge should ask whether the disparities in the separate strata were sufficiently similar that an overall procedure is appropriate.

58 See Cheng (2015) for the statistical review of application STN 125351/172. The drug Tacho-Sil had been approved for cardiovascular surgery and the producer provided additional data to support its use as an adjunct for haemostasis (stopping bleeding). The outcome studied was whether the patient achieved haemostasis after 3 min. The results of three studies were analysed. All three odds ratios were greater than 1.0 (indicating the drug was superior to the comparator) but not all were statistically significant. Because the p-value of the test for homogeneity equalled 0.128, exceeding 0.10, the results from all three studies were combined and showed that the drug achieved its desired objective.

59 The random allocation of patients, at least within each clinic, helps ensure that the two patient groups are comparable with respect to other factors, such as age or general health status, that indicate a good or poor prognosis. In the Kerner case, applicants for each type of job are likely to have education and prior experience appropriate for the job they applied for. When the two groups in each stratum are similar with respect to the major characteristics related to the response of interest, a test such as the Breslow-Day procedure should be used, before analysing the data stratum by stratum, to ascertain whether the disparities in the strata are sufficiently homogeneous that an overall summary test is appropriate.

The decision: After the trial in April 2016, the court ruled that the Accuplacer test had a disparate impact in 20 of the 21 positions and that the city had not demonstrated that it was job related. Because the experts estimating the damages used a variety of models and calculations, the court gave them more detailed instructions and asked them to recalculate the damages. Both sides presented their alternative damage estimates and in July 2016 the court awarded the plaintiffs damages of $1,674,807. 61

Acknowledgements

It is a pleasure for us to thank the students in the Legal Statistics class at George Washington University in the spring 2016 semester for exploring the data with us. Special thanks go to Ms. Yi Lu, who devoted extra time to the project. One author (JLG) would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme Probability and Statistics in Forensic Science, which was supported by EPSRC grant number EP/K032208/1.

References

Agresti, A. (1992) A survey of exact inference for contingency tables. Statistical Science, 7, 131-177.

Agresti, A. (1996) An Introduction to Categorical Data Analysis. New York, NY: John Wiley.

Agresti, A. (2002) Categorical Data Analysis. 2nd Edition. New York, NY: John Wiley.

Bancroft, T. A. (1944) On biases in estimation due to the use of preliminary tests of significance. Annals of Mathematical Statistics, 15, 190-204.

Berger, R. L. and Boos, D. D. (1994) P values maximized over a confidence set for the nuisance parameter. Journal of the American Statistical Association, 89, 1012-1016.

Breslow, N. E. and Day, N. E. (1980) Statistical Methods in Cancer Research, Vol. I: The Analysis of Case-Control Studies. Lyon: IARC.

Calhoun, P. (2015) The Exact Package in R. https://cran.r-project.org/web/packages/Exact/index.html

Cheng, C. (2015) Statistical Review of Application STN 125351/172 to the Food and Drug Administration.

Cochran, W. G. (1954) The combination of estimates from different experiments. Biometrics, 10, 101-129.

Collins, M. W. and Morris, S. B. (2008) Testing for adverse impact when sample size is small. Journal of Applied Psychology, 93, 463-471.

Finkelstein, M. O. (1966) The application of statistical decision theory to jury discrimination cases. Harvard Law Review, 80, 338-376.

Finkelstein, M. O. and Levin, B. (2001) Statistics for Lawyers. 2nd Edition. New York, NY: Springer.

Fleiss, J. L. (1986) Analysis of Data from Multiclinic Trials. Controlled Clinical Trials, 7, 267-275.

Fog, A. (2015) The BiasedUrn Package in R. https://cran.r-project.org/web/packages/BiasedUrn/index.html

60 As noted previously all 17 white and 18 minority applicants for the Administrative Support Assistant I position passed the exam.

61 CA No. 11-cv-00256-MSK-KMT (D. Colo. 8 July 2016).

Page 20: Content Type Narrowed by Reviews and Journals

Page 19 of 19

Statistical issues arising in the Kerner v. Denver: A class action disparate impact case

Gail, M. and Simon, R. (1985) Testing for qualitative interactions between treatment effects and patient subsets. Biometrics, 41, 361-372.

Gastwirth, J. L. and Greenhouse, S. W. (1987) Estimating a common relative risk: application in equal employment. Journal of the American Statistical Association, 82, 38-45.

Gastwirth, J. L., Miao, W. and Zheng, G. (2003) Statistical issues arising in disparate impact cases and the use of the expectancy curve in assessing the validity of employment tests. International Statistical Review, 71, 565-580.

Gastwirth, J. L. (2008) Case comment: an expert′s report criticizing plaintiff′s failure to account for multiple comparisons is deemed admissible in EEOC v. Autozone. Law, Probability and Risk, 7, 61-74.

Gastwirth, J. L., Gel, Y. and Miao, W. (2009) The impact of Levene′s test of equality of variances on statistical theory and practice. Statistical Science, 24, 343-360.

Gastwirth, J. L. and Xu, W. (2014) Statistical tools for evaluating the adequacy of the size of a sample on which statistical evidence is based. Law, Probability and Risk, 13, 277-306.

Levene, H. (1960) Robust tests for equality of variances. In Contributions to Probability and Statistics (Olkin, ed.), 278-292. Stanford University Press, Palo Alto, CA.

Miao, W. and Gastwirth, J. L. (2016) Statistical Issues Arising in Class Action Cases: A reanalysis of the Statistical Evidence in Dukes v. Wal-Mart II. Law, Probability and Risk, 15, 155-174.

Mosteller, F. (1968) Association and Estimation in Contingency Tables. Journal of the American Statistical Association, 63, 1-28.

Radhakrishna, S. (1965) Combination of results from several 2 x 2 tables. Biometrics, 21, 86-98.

Samuels, M. L. (1993) Simpson′s Paradox and Related Phenomena. Journal of the American Statistical Association, 88, 81-88.

Tarone, R. E. (1985) On heterogeneity tests based on efficient scores. Biometrika, 72, 91-95.

Whittemore, A. S. (1978) Collapsibility of Multidimensional Contingency Tables. Journal of the Royal Statistical Society, Series B, 40, 328-340.

Zeisel, H. and Kaye, D. H. (1997) Prove It with Figures: Empirical Methods in Litigation. New York, NY: Springer-Verlag.

Law, Probability and Risk. Copyright © Oxford University Press 2017.
