Testing Discrimination in Practice



1. We will use the terms unfairness and discrimination roughly synonymously. There is no overarching definition of either term, but we will make our discussion precise by referring to a specific criterion whenever possible. Linguistically, the term discrimination puts more emphasis on the agency of the decision maker.

2. We’ll use “system” as a shorthand for a decision-making system, such as hiring at a company. It may or may not involve any automation or machine learning.

3. Bogen and Rieke, “Help wanted: an examination of hiring algorithms, equity, and bias” (Technical report, Upturn, 2018).


In previous chapters, we have seen statistical, causal, and normative fairness criteria. This chapter is about the complexities that arise when we want to apply them in practice.

A running theme of this book is that there is no single test for fairness. Rather, there are many quantitative criteria that can be used to diagnose potential unfairness or discrimination.1 There’s often a gap between moral notions of fairness and what is measurable by available experimental or observational methods. This does not mean that we can select and apply a fairness test based on convenience. Far from it: we need moral reasoning and domain-specific considerations to determine which test(s) are appropriate, how to apply them, whether the findings indicate wrongful discrimination, and whether an intervention is called for. We will see examples of such reasoning throughout this chapter. Conversely, if a system passes a fairness test, we should not interpret it as a certificate that the system is fair.2

In this chapter, our primary objects of study will be real systems rather than models of systems. We must bear in mind that there are many necessary assumptions in creating a model which may not hold in practice. For example, so-called automated decision making systems rarely operate without any human judgment. Or, we may assume that a machine learning system is trained on a sample drawn from the same population on which it makes decisions, which is also almost never true in practice. Further, decision making in real life is rarely a single decision point, but rather a cumulative series of small decisions. For example, hiring includes sourcing, screening, interviewing, selection, and evaluation, and those steps themselves include many components.3

An important source of difficulty for testing discrimination in practice is that researchers have a limited ability to observe — much less manipulate — many of the steps in a real-world system. In fact, we’ll see that even the decision maker faces limitations in its ability to study the system.

Despite these limitations and difficulties, empirically testing fairness is vital. The studies that we’ll discuss serve as an existence proof of discrimination and provide a lower bound of its prevalence.


They enable tracking trends in discrimination over time. When the findings are sufficiently blatant, they justify the need for intervention regardless of any differences in interpretation. And when we do apply a fairness intervention, they help us measure its effectiveness. Finally, empirical research can also help uncover the mechanisms by which discrimination takes place, which enables more targeted and effective interventions. This requires carefully formulating and testing hypotheses using domain knowledge.

The first half of this chapter surveys classic tests for discrimination that were developed in the context of human decision-making systems. The underlying concepts are just as applicable to the study of fairness in automated systems. Much of the first half will build on the causality chapter and explain concrete techniques, including experiments, difference-in-differences, and regression discontinuity. While these are standard tools in the causal inference toolkit, we’ll learn about the specific ways in which they can be applied to fairness questions. Then we will turn to the application of the observational criteria from Chapter 2. The summary table at the end of the first half lists, for each test, the fairness criterion that it probes, the type of access to the system that is required, and other nuances and limitations. The second half of the chapter is about testing fairness in algorithmic decision making, focusing on issues specific to algorithmic systems.

Part 1: Traditional tests for discrimination

Audit studies

The audit study is a popular technique for diagnosing discrimination. It involves a study design called a field experiment. “Field” refers to the fact that it is an experiment on the actual decision making system of interest (in the “field,” as opposed to a lab simulation of decision making). Experiments on real systems are hard to pull off. For example, we usually have to keep participants unaware that they are in an experiment. But field experiments allow us to study decision making as it actually happens, rather than worrying that what we’re discovering is an artifact of a lab setting. At the same time, the experiment, by carefully manipulating and controlling variables, allows us to observe a treatment effect rather than merely observing a correlation.

How to interpret such a treatment effect is a trickier question. In our view, most audit studies, including the ones we’ll describe, are best seen as attempts to test blindness: whether a decision maker directly uses a sensitive attribute. Recall that this notion of discrimination is not necessarily a counterfactual in a valid causal model (Chapter 4).


4. Wienk et al., “Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey,” 1979.

5. Ayres and Siegelman, “Race and Gender Discrimination in Bargaining for a New Car,” The American Economic Review, 1995, 304–21.

6. In an experiment such as this, where the treatment is randomized, the addition or omission of control variables in a regression estimate of the treatment effect does not result in an incorrect estimate, but control variables can explain some of the noise in the observations and thus increase the precision of the treatment effect estimate, i.e., decrease the standard error of the coefficient.

Even as tests of blindness, there is debate about precisely what it is that they measure, since the researcher can at best signal race, gender, or another sensitive attribute. This will become clear when we discuss specific studies.

Audit studies were pioneered by the US Department of Housing and Urban Development in the 1970s for the purpose of studying the adverse treatment faced by minority home buyers and renters.4 They have since been successfully applied to many other domains.

In one landmark study, researchers recruited 38 testers to visit about 150 car dealerships to bargain for cars and record the price they were offered at the end of bargaining.5 Testers visited dealerships in pairs; testers in a pair differed in terms of race or gender. Both testers in a pair bargained for the same model of car at the same dealership, usually within a few days of each other.

Pulling off an experiment such as this in a convincing way requires careful attention to detail; here we describe just a few of the many details in the paper. Most significantly, the researchers went to great lengths to minimize any differences between the testers that might correlate with race or gender. In particular, all testers were 28–32 years old, had 3–4 years of postsecondary education, and “were subjectively chosen to have average attractiveness.” Further, to minimize the risk of testers’ interaction with dealers being correlated with race or gender, every aspect of their verbal or nonverbal behavior was governed by a script. For example, all testers “wore similar ‘yuppie’ sportswear and drove to the dealership in similar rented cars.” They also had to memorize responses to a long list of questions they were likely to encounter. All of this required extensive training and regular debriefs.

The paper’s main finding was a large and statistically significant price penalty in the offers received by Black testers. For example, Black males received final offers that were about $1,100 more than White males, which represents a threefold difference in dealer profits based on data on dealer costs. The analysis in the paper has alternative target variables (initial offers instead of final offers, percentage markup instead of dollar offers), alternate model specifications (e.g., to account for the two audits in each pair having correlated noise), and additional controls (e.g., bargaining strategy). Thus there are a number of different estimates, but the core findings remain robust.6
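To make the paired design concrete, here is a minimal sketch of how a within-pair estimate of the race effect could be computed. The data are synthetic, the variable names are ours, and the built-in $1,100 gap is assumed purely for illustration; this is not the paper’s actual specification.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n_pairs = 150
    base = rng.normal(11_000, 800, n_pairs)                    # dealership/car-specific price level
    offer_white = base + rng.normal(0, 300, n_pairs)           # final offer to the White tester
    offer_black = base + 1_100 + rng.normal(0, 300, n_pairs)   # assumed penalty, for illustration only

    pairs = pd.DataFrame({"offer_black": offer_black, "offer_white": offer_white})
    diff = pairs["offer_black"] - pairs["offer_white"]

    # Differencing within a pair removes the shared dealership/car price level,
    # so the mean within-pair difference estimates the effect of the signaled race.
    estimate = diff.mean()
    std_error = diff.std(ddof=1) / np.sqrt(n_pairs)
    print(f"estimated penalty: ${estimate:,.0f} (standard error ${std_error:,.0f})")

Differencing within pairs plays a role similar to the control variables mentioned in the note above: it absorbs variation that is unrelated to the treatment and tightens the standard error of the estimate.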

A tempting interpretation of this study is that if two people were identical except for race, with one being White and the other being Black, then the offers they should expect to receive would differ by about $1,100. But what does it mean for two people to be identical except for race? Which attributes about them would be the same and which would be different?


7. Freeman et al., “Looking the Part: Social Status Cues Shape Race Perception,” PloS One 6, no. 9 (2011): e25107.

8. In most other domains, say employment testing, demographic disparity would be less valuable because there are relevant differences between candidates. Price discrimination is unusual in that there are no morally salient qualities of buyers that may justify it.

9. Bertrand and Mullainathan, “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,” American Economic Review 94, no. 4 (2004): 991–1013.

With the benefit of the discussion of ontological instability in Chapter 4, we can understand the authors’ implicit framework for making these decisions. In our view, they treat race as a stable source node in a causal graph and attempt to hold constant all of its descendants, such as attire and behavior, in order to estimate the direct effect of race on the outcome. But what if one of the mechanisms of what we understand as “racial discrimination” is based on attire and behavior differences? The social construction of race suggests that this is plausible.7

Note that the authors did not attempt to eliminate differences in accent between testers. Why not? From a practical standpoint, accent is difficult to manipulate. But a more principled defense of the authors’ choice is that accent is a part of how we understand race, a part of what it means to be Black, White, etc., so that even if the testers could manipulate their accents, they shouldn’t. Accent is subsumed into the “race” node in the causal graph.

To take an informed stance on questions such as this, we need a deep understanding of cultural context and history. They are the subject of vigorous debate in sociology and critical race theory. Our point is this: the design and interpretation of audit studies requires taking positions on contested social questions. It may be futile to search for a single “correct” way to test even the seemingly straightforward fairness notion of whether the decision maker treats similar individuals similarly regardless of race. Controlling for a plethora of attributes is one approach; another is to simply recruit Black testers and White testers, have them behave and bargain as would be their natural inclination, and measure the demographic disparity. Each approach tells us something valuable, and neither is “better.”8

Another famous audit study tested discrimination in the labor market.9 Instead of sending testers in person, the researchers sent in fictitious resumes in response to job ads. Their goal was to test if an applicant’s race had an impact on the likelihood of an employer inviting them for an interview. They signaled race in the resumes by using White-sounding names (Emily, Greg) or Black-sounding names (Lakisha, Jamal). By creating pairs of resumes that were identical except for the name, they found that White names were 50% more likely to result in a callback than Black names. The magnitude of the effect was equivalent to an additional eight years of experience on a resume.

Despite the study’s careful design, debates over interpretation have inevitably arisen, primarily due to the use of candidate names as a way to signal race to employers. Did employers even notice the names in all cases, and might the effect have been even stronger if they had?


10. Pager, “The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future,” The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

11. Kohler-Hausmann, “Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination,” Nw. U. L. Rev. 113 (2018): 1163.

12. Bertrand and Duflo, “Field Experiments on Discrimination,” in Handbook of Economic Field Experiments, vol. 1 (Elsevier, 2017), 309–93.

13. Bertrand and Duflo.

14. Bertrand and Duflo.

15. Quillian et al., “Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time,” Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Or can the observed disparities be better explained based on factors correlated with race, such as a preference for more common and familiar names, or an inference of higher socioeconomic status for the candidates with White-sounding names? (Of course, the alternative explanations don’t make the observed behavior morally acceptable, but they are important to consider.) Although the authors provide evidence against these interpretations, debate has persisted. For a discussion of critiques of the validity of audit studies, see Pager’s survey.10

In any event, like other audit studies, this experiment tests fairness as blindness. Even simple proxies for race, such as residential neighborhood, were held constant between matched pairs of resumes. Thus the design likely underestimates the extent to which morally irrelevant characteristics affect callback rates in practice. This is just another way to say that attribute flipping does not generally produce counterfactuals that we care about, and it is unclear if the effect sizes measured have any meaningful interpretation that generalizes beyond the context of the experiment.

Rather, audit studies are valuable because they trigger a strong and valid moral intuition.11 They also serve a practical purpose: when designed well, they illuminate the mechanisms that produce disparities and help guide interventions. For example, the car bargaining study concluded that the preferences of owners of dealerships don’t explain the observed discrimination, that the preferences of other customers may explain some of it, and that there is strong evidence that dealers themselves (rather than owners or customers) are the primary source of the observed discrimination.

Resume-based audit studies, also known as correspondence studies, have been widely replicated. We briefly present some major findings, with the caveat that there may be publication biases. For example, studies finding no evidence of an effect are in general less likely to be published. Alternatively, published null findings might reflect poor experiment design or might simply indicate that discrimination is only expressed in certain contexts.

A 2016 survey lists 30 studies from 15 countries, covering nearly all continents, revealing pervasive discrimination against racial and ethnic minorities.12 The method has also been used to study discrimination based on gender, sexual orientation, and physical appearance.13 It has also been used outside the labor market, in retail and academia.14 Finally, trends over time have been studied: a meta-analysis found no change in racial discrimination in hiring against African Americans from 1989 to 2015. There was some indication of declining discrimination against Latinx Americans, although the data on this question was sparse.15


16. Blank, “The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review,” The American Economic Review, 1991, 1041–67.

17. Pischke, “Empirical Methods in Applied Economics: Lecture Notes,” 2005.

Collectively, audit studies have helped nudge the academic and policy debate away from the naive view that discrimination is a concern of a bygone era. From a methodological perspective, our main takeaway from the discussion of audit studies is the complexity of defining and testing blindness.

Testing the impact of blinding

In some situations, it is not possible to test blindness by randomizing the decision maker’s perception of race, gender, or another sensitive attribute. For example, suppose we want to test if there’s gender bias in peer review in a particular research field. Submitting real papers with fictitious author identities may result in the reviewer attempting to look up the author and realizing the deception. A design in which the researcher changes author names to those of real people is even more problematic.

There is a slightly different strategy that’s more viable: an editor of a scholarly journal in the research field could conduct an experiment in which each paper received is randomly assigned to be reviewed in either a single-blind fashion (in which the author identities are known to the referees) or double-blind fashion (in which author identities are withheld from referees). Indeed, such experiments have been conducted,16 but in general even this strategy can be impractical.

At any rate, suppose that a researcher has access to only observational data on journal review policies and statistics on published papers. Among ten journals in the research field, some introduced double-blind review, and did so in different years. The researcher observes that in each case, right after the switch, the fraction of female-authored papers rose, whereas there was no change for the journals that stuck with single-blind review. Under certain assumptions, this enables estimating the impact of double-blind reviewing on the fraction of accepted papers that are female-authored. This hypothetical example illustrates the idea of a “natural experiment,” so called because experiment-like conditions arise due to natural variation. Specifically, the study design in this case is called differences-in-differences. The first “difference” is between single-blind and double-blind reviewing, and the second “difference” is between journals (row 2 in the summary table).
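The sketch below shows how such an estimate could be computed on a hypothetical journal panel. The journal names, effect sizes, and column names are all invented for illustration, and the standard errors would need the corrections discussed below (e.g., for serial correlation).

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    journals = [f"J{i}" for i in range(10)]
    # Half the journals switch to double-blind review, in different years.
    switch_year = {j: (2005 + i if i < 5 else None) for i, j in enumerate(journals)}

    rows = []
    for j in journals:
        journal_level = rng.normal(0.25, 0.03)                  # stable journal-specific level
        for year in range(2000, 2015):
            blinded = switch_year[j] is not None and year >= switch_year[j]
            frac_female = (journal_level + 0.002 * (year - 2000)  # common trend
                           + (0.04 if blinded else 0.0)           # assumed effect of blinding
                           + rng.normal(0, 0.01))
            rows.append({"journal": j, "year": year,
                         "double_blind": int(blinded), "frac_female": frac_female})
    panel = pd.DataFrame(rows)

    # Journal and year fixed effects absorb stable journal differences and common
    # trends; the coefficient on double_blind is the diff-in-diff estimate.
    fit = smf.ols("frac_female ~ double_blind + C(journal) + C(year)", data=panel).fit()
    print(fit.params["double_blind"])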

Differences-in-differences is methodologically nuanced, and a full treatment is beyond our scope.17 We briefly note some pitfalls. There may be unobserved confounders: perhaps the switch to double-blind reviewing at each journal happened as a result of a change in editorship, and the new editors also instituted policies that encouraged female authors to submit strong papers.


18. Bertrand, Duflo, and Mullainathan, “How Much Should We Trust Differences-in-Differences Estimates?” The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

19. Kang et al., “Whitened Resumes: Race and Self-Presentation in the Labor Market,” Administrative Science Quarterly 61, no. 3 (2016): 469–502.

There may also be spillover effects (which violates the Stable Unit Treatment Value Assumption): a change in policy at one journal can cause a change in the set of papers submitted to other journals. Outcomes are serially correlated (if there is a random fluctuation in the gender composition of the research field due to an entry or exodus of some researchers, the effect will last many years); this complicates the computation of the standard error of the estimate.18 Finally, the effect of double blinding on the probability of acceptance of female-authored papers (rather than on the fraction of accepted papers that are female-authored) is not identifiable using this technique without additional assumptions or controls.

Even though testing the impact of blinding sounds similar to testing blindness, there is a crucial conceptual and practical difference. Since we are not asking a question about the impact of race, gender, or another sensitive attribute, we avoid running into ontological instability. The researcher doesn’t need to intervene on the observable features by constructing fictitious resumes or training testers to use a bargaining script. Instead, the natural variation in features is left unchanged: the study involves real decision subjects. The researcher only intervenes on the decision making procedure (or exploits natural variation) and evaluates the impact of that intervention on groups of candidates defined by the sensitive attribute A. Thus, A is not a node in a causal graph, but merely a way to split the units into groups for analysis. Questions of whether the decision maker actually inferred the sensitive attribute, or merely a feature correlated with it, are irrelevant to the interpretation of the study. Further, the effect sizes measured do have a meaning that generalizes to scenarios beyond the experiment. For example, a study tested the effect of “resume whitening,” in which minority applicants deliberately conceal cues of their racial or ethnic identity in job application materials to improve their chances of getting a callback.19 The effects reported in the study are meaningful to job seekers who engage in this practice.

Revealing extraneous factors in decisions

Sometimes natural experiments can be used to show the arbitrariness of decision making rather than unfairness in the sense of non-blindness (row 3 in the summary table). Recall that arbitrariness is one type of unfairness that we are concerned about in this book (Chapter 3). Arbitrariness may refer to the lack of a uniform decision making procedure or to the incursion of irrelevant factors into the procedure.

For example, a study looked at decisions made by judges in Louisiana juvenile courts, including sentence lengths.20


20. Eren and Mocan, “Emotional Judges and Unlucky Juveniles,” American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

21. For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that “Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana.”

22. Danziger, Levav, and Avnaim-Pesso, “Extraneous Factors in Judicial Decisions,” Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

23. In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges’ attention. See Lakens, “Impossibly Hungry Judges” (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017).

24. Weinshall-Margel and Shapard, “Overlooked Factors in the Analysis of Parole Decisions,” Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, before gradually dropping to nearly zero right before the judges’ food break, and returned to about 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges’ mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the “hungry judges” study and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn’t random and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn’t look hard enough. A follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.


25. Huq, “Racial Equity in Algorithmic Criminal Justice,” Duke L.J. 68 (2018): 1043.

26. For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84% and 86% is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice. Banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren’t admitted. Decisions may also have effects on people who are not decision subjects; for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision — or a fairness intervention — on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we’d like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects — say, on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85%. Then we can compare the salary of students with scores of 85% to 86% (and thus were awarded the scholarship) with those of students with scores of 84% to 85% (and thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus we can estimate the impact of the scholarship as if we did a randomized controlled trial.
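A minimal sketch of this comparison on simulated data follows. The score distribution, the salary model, and the size of the assumed scholarship effect are all illustrative, and a raw comparison of band means is used here for simplicity.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    n = 20_000
    score = rng.uniform(60, 100, n)
    scholarship = (score >= 85).astype(int)
    # Earnings depend smoothly on ability (proxied by the score) plus an assumed
    # effect of the scholarship itself.
    salary = 30_000 + 400 * score + 3_000 * scholarship + rng.normal(0, 5_000, n)
    students = pd.DataFrame({"score": score, "scholarship": scholarship, "salary": salary})

    # Compare students just below and just above the cutoff. Within this narrow
    # band, receiving the scholarship is close to random; a more careful analysis
    # would fit a regression on each side of the cutoff instead of comparing means.
    band = students[(students.score >= 84) & (students.score < 86)]
    effect = (band.loc[band.scholarship == 1, "salary"].mean()
              - band.loc[band.scholarship == 0, "salary"].mean())
    print(f"estimated effect near the cutoff: ${effect:,.0f}")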

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students who receive the scholarship is precisely those above the threshold.


27. Norton, “The Supreme Court’s Post-Racial Turn Towards a Zero-Sum Understanding of Equality,” Wm. & Mary L. Rev. 52 (2010): 197.

28. Ayres, “Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias,” Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

29. Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively “explain” the observed decisions28 (row 5 in the summary table). If they don’t, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA v. Harvard case discussed in Chapter 4.

We can, of course, try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
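The sketch below illustrates the kind of regression behind such competing claims, on simulated data. The variable names and coefficients are invented, and, as discussed next, the choice of controls is doing a lot of unacknowledged work.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 5_000
    female = rng.integers(0, 2, n)
    gpa = rng.normal(3.2, 0.4, n)
    review = rng.normal(3.5, 0.8, n) - 0.2 * female       # reviews may themselves be biased
    wage = 20 + 3 * gpa + 4 * review - 1.0 * female + rng.normal(0, 2, n)
    df = pd.DataFrame({"wage": wage, "female": female, "gpa": gpa, "review": review})

    raw = smf.ols("wage ~ female", data=df).fit()                      # raw pay gap
    adjusted = smf.ols("wage ~ female + gpa + review", data=df).fit()  # conditional gap
    print(raw.params["female"], adjusted.params["female"])
    # If review scores are themselves affected by gender bias, controlling for
    # them can mask discrimination rather than rule it out.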

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise.


In this case, causal modeling might alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter we’ve presented many scenarios — screening job candidates, peer review, parole hearings — that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank’s decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
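Here is a minimal sketch of such a check from the bank’s perspective, on simulated data in which the score is calibrated by construction; the column names (A, R, Y) follow the notation above and the distributions are illustrative.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(4)
    n = 100_000
    A = rng.choice(["a", "b"], n)              # group membership
    R = rng.beta(2, 5, n)                      # predicted probability of default
    Y = rng.binomial(1, R)                     # outcome drawn consistently with R (calibrated by construction)
    loans = pd.DataFrame({"A": A, "R": R, "Y": Y})

    # Within each narrow bin of the prediction R, the default rate should not
    # differ by group if sufficiency holds.
    loans["R_bin"] = pd.cut(loans["R"], bins=np.linspace(0, 1, 21))
    print(loans.groupby(["R_bin", "A"], observed=True)["Y"].mean().unstack("A").head(10))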

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ and not R (i.e., whether or not the loan was approved). Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same.


30. Simoiu, Corbett-Davies, and Goel, “The Problem of Infra-Marginality in Outcome Tests for Discrimination,” The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

This observational test is called the outcome test (row 7 in the summary table).

Here is a tempting argument based on the outcome test: if one group (say women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don’t know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.
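The following simulation replays the thought experiment above: both groups face the same threshold on a calibrated score, so sufficiency holds, yet the default rate among approved applicants differs by group. The distributions and the threshold are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200_000
    threshold = 0.3                      # approve applicants with predicted default risk below 0.3
    risk_women = rng.beta(2, 8, n)       # risk scores concentrated at lower values
    risk_men = rng.beta(3, 5, n)         # risk scores shifted toward the threshold

    def default_rate_among_approved(risk):
        approved = risk < threshold
        defaults = rng.binomial(1, risk[approved])   # outcomes consistent with the (calibrated) score
        return defaults.mean()

    print("women:", round(default_rate_among_approved(risk_women), 3))
    print("men:  ", round(default_rate_among_approved(risk_men), 3))
    # Approved men default more often than approved women even though the same
    # threshold was applied to a calibrated score for both groups: predictive
    # parity is violated without any violation of sufficiency.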

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit.


31. Lakkaraju et al., “The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32. Bird et al., “Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI,” in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

When using a marginal test, we treat the distribution of applicant characteristics as a given and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don’t allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates “upstream” of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates: it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31
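A small simulation can make the selection bias vivid: below, a metric computed only on the labeled (approved) sample diverges from its value over all applicants. The data and decision rule are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(6)
    n = 200_000
    p_repay = rng.beta(5, 2, n)                               # true repayment probability
    repaid = rng.binomial(1, p_repay)                         # 1 = would repay, 0 = would default
    score = np.clip(p_repay + rng.normal(0, 0.15, n), 0, 1)   # noisy model score
    approved = score > 0.7                                    # decision rule

    # Treat "approve" as a prediction of repayment. Accuracy over everyone
    # requires labels for rejected applicants too, which are never observed.
    accuracy_full = (approved == repaid).mean()
    accuracy_on_labeled_sample = repaid[approved].mean()      # all labeled applicants were approved
    print(round(accuracy_full, 3), round(accuracy_on_labeled_sample, 3))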

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the “prediction” is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications it is indeed possible, and important, to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

    Test / study design                           Fairness notion / application    Access             Notes / limitations
 1  Audit study                                   Blindness                        A-exp:=, X:=, R    Difficult to interpret
 2  Natural experiment, especially diff-in-diff   Impact of blinding               A-exp∼, R          Confounding, SUTVA violations, other
 3  Natural experiment                            Arbitrariness                    W∼, R              Unobserved confounders
 4  Natural experiment, especially regr. disc.    Impact of decision               R, Y or Y′         Sample size, confounding, other technical difficulties
 5  Regression analysis                           Blindness                        X, A, R            Unreliable due to proxies
 6  Regression analysis                           Conditional demographic parity   X, A, R            Weak moral justification
 7  Outcome test                                  Predictive parity                A, Y | Ŷ = 1       Infra-marginality
 8  Threshold test                                Sufficiency                      X′, A, Y | Ŷ = 1   Model-specific
 9  Experiment                                    Separation / error rate parity   A, R, Ŷ:=, Y       Often unethical or impractical
10  Observational test                            Demographic parity               A, R               See Chapter 2
11  Mediation analysis                            “Relevant” mechanism             X, A, R            See Chapter 4

Legend:

• := indicates intervention on some variable (that is, X:= does not represent a new random variable but is simply an annotation describing how X is used in the test)

• ∼: natural variation in some variable exploited by the researcher

• A-exp: exposure of a signal of the sensitive attribute to the decision maker

• W: a feature that is considered irrelevant to the decision

• X′: a set of features which may not coincide with those observed by the decision maker

• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor markets.33


33. Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34. Phelps, “The Statistical Theory of Racism and Sexism,” The American Economic Review 62, no. 4 (1972): 659–61; Arrow, “The Theory of Discrimination,” Discrimination in Labor Markets 3, no. 10 (1973): 3–33.


A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold: first, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There’s a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
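A minimal sketch of this setup, in the spirit of the statistical discrimination models cited above: the Bayes-optimal prediction shrinks the noisy signal toward the group mean, so two candidates with identical observables receive different predictions. The numbers and group labels are illustrative.

    # tau2: variance of true productivity within each group
    # sigma2: variance of the noise in the observed signal
    tau2 = 1.0
    sigma2 = 1.0
    group_mean = {"group_1": 0.0, "group_2": -0.5}   # assumed group means of the target variable

    def optimal_prediction(signal, group):
        # Posterior mean under a normal model: shrink the signal toward the group mean.
        weight = tau2 / (tau2 + sigma2)
        return weight * signal + (1 - weight) * group_mean[group]

    x = 1.0   # two candidates with the same observed qualifications
    print(optimal_prediction(x, "group_1"))   # 0.5
    print(optimal_prediction(x, "group_2"))   # 0.25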

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often what might seem to be a “taste” for discrimination is simply the result of an imperfect understanding of the decision maker’s information and beliefs. For example, at first sight the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don’t map to either of these criteria (as they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.


35. For example, laws restricting employers from asking about applicants’ criminal history resulted in employers using race as a proxy for it. See Agan and Starr, “Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment,” The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

36. Williams and Ceci, “National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track,” Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37. Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.


Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It’s also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary — firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, either it is argued to be justified or futile to proscribe because firms will find workarounds.35 Of course, that’s not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it’s just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, a nuanced and domain-specific approach is more useful: one where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations.


38. Neckerman and Kirschenman, “Hiring Strategies, Racial Bias, and Inner-City Workers,” Social Problems 38, no. 4 (1991): 433–47.

39. Pager and Shepherd, “The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets,” Annu. Rev. Sociol. 34 (2008): 181–209.

40. Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and “sell” events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment.


41. Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

The signals that firms do use as predictors of job performance, such as admission to elite universities — the pedigree in the book’s title — are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social class based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author’s observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants — notably applicants from China — would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant’s hobby being considered “cool” by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit substantive discussion of applicants’ race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

an assignment of applicants to hospitals that optimizes mutual desirability.42
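
To make the setup concrete, here is a minimal sketch of applicant-proposing deferred acceptance (the Gale-Shapley procedure) for the simplest case of one position per hospital. The names and preference lists are invented, and the real residency match additionally handles program quotas, couples, and other constraints:

# Minimal sketch of applicant-proposing deferred acceptance (Gale-Shapley).
# Illustrative only: the real residency match handles quotas, couples, etc.
def stable_match(applicant_prefs, hospital_prefs):
    """applicant_prefs: {applicant: [hospitals in preference order]}
       hospital_prefs:  {hospital: [applicants in preference order]}
       Returns a dict mapping applicants to hospitals (one slot per hospital)."""
    rank = {h: {a: i for i, a in enumerate(p)} for h, p in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}   # next hospital each applicant proposes to
    match = {}                                      # hospital -> tentatively held applicant
    free = list(applicant_prefs)
    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                                # applicant exhausted their list; stays unmatched
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if h not in match:
            match[h] = a
        elif rank[h][a] < rank[h][match[h]]:        # hospital prefers the new proposer
            free.append(match[h])
            match[h] = a
        else:
            free.append(a)
    return {a: h for h, a in match.items()}

prefs_a = {"alice": ["mercy", "city"], "bob": ["city", "mercy"]}
prefs_h = {"mercy": ["bob", "alice"], "city": ["alice", "bob"]}
print(stable_match(prefs_a, prefs_h))   # each applicant gets their top choice here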

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
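
As a rough illustration of how such an audit can be run in a lab setting, the sketch below trains a toy character n-gram language identifier and compares false negative rates on two hypothetical groups of English texts. The training sentences, evaluation corpora, and model are stand-ins, not the actual langid.py tool or the corpora from the study:

# Sketch: train a character n-gram language-ID model and compare false negative
# rates across two groups of English texts (toy data; stand-in for the real study).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["this is an english sentence", "the weather is nice today",
               "ceci est une phrase en francais", "il fait beau aujourd'hui"]
train_langs = ["en", "en", "fr", "fr"]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),   # character n-grams
    MultinomialNB())
model.fit(train_texts, train_langs)

# Two evaluation sets of English texts, one per group (hypothetical examples).
group_a = ["he going to the store rn", "she been working all day"]
group_b = ["he is going to the store right now", "she has been working all day"]

def false_negative_rate(texts):
    # fraction of English texts *not* recognized as English
    preds = model.predict(texts)
    return sum(p != "en" for p in preds) / len(texts)

print("group A FNR:", false_negative_rate(group_a))
print("group B FNR:", false_negative_rate(group_b))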

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word

embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task          Examples                            Sources of disparity                          Harm
Perception            Language id, speech-to-text         Underrepresentation in training corpus        Degraded service
Automating judgment   Toxicity detection, essay grading   Human labels; underrep. in training corpus    Adverse decisions
Predicting outcomes   Resume filtering                    Various, including human labels               Adverse decisions
Sequence prediction   Language generation, translation    Cultural stereotypes, historical prejudices   Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 McEntegart, "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
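
A sketch of this decomposition on synthetic scores: slice misclassification rates by gender and by skin tone, then search for a threshold that balances the two directions of gender error. The scores below are invented and encode only the first issue, so the skin-tone slice is included purely to illustrate the bookkeeping:

# Sketch: decompose gender-classifier errors by group and recalibrate the threshold.
# 'score' is a hypothetical model score for "female"; the data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
gender = rng.choice(["female", "male"], size=n)
skin = rng.choice(["darker", "lighter"], size=n)
# Invented scores: female faces score higher on average but with more spread.
score = np.where(gender == "female", rng.normal(0.6, 0.2, n), rng.normal(0.3, 0.15, n))

def error_rates(threshold):
    pred = np.where(score >= threshold, "female", "male")
    rates = {g: float(np.mean(pred[gender == g] != g)) for g in ["female", "male"]}
    # Per-skin-tone error rates (no disparity here by construction of the toy data).
    rates.update({s: float(np.mean(pred[skin == s] != gender[skin == s]))
                  for s in ["darker", "lighter"]})
    return rates

print(error_rates(0.5))

# Recalibrate: choose the threshold that balances the two directions of gender error.
grid = np.linspace(0.2, 0.8, 61)
best = min(grid, key=lambda t: abs(error_rates(t)["female"] - error_rates(t)["male"]))
print("rebalanced threshold:", round(float(best), 2), error_rates(best))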

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked …" feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.
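
The intuition that what the system learns about one group does not generalize to another can be illustrated with a deliberately simplified recommender (predicting each item's mean rating) on synthetic data in which a smaller group has different tastes. The data, group sizes, and noise level are all invented:

# Sketch: a recommender that predicts each item's mean rating underperforms for a
# minority group whose preferences differ from the majority's (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items = 200, 50
group = np.array(["majority"] * 180 + ["minority"] * 20)

# Two taste profiles: each group likes a different half of the items.
item_appeal = {"majority": np.r_[np.full(25, 4.0), np.full(25, 2.0)],
               "minority": np.r_[np.full(25, 2.0), np.full(25, 4.0)]}
ratings = np.vstack([item_appeal[g] for g in group]) + rng.normal(0, 0.5, (n_users, n_items))

observed = rng.random((n_users, n_items)) < 0.5                  # which ratings the system sees
item_mean = (ratings * observed).sum(axis=0) / observed.sum(axis=0)   # "fit": per-item mean
pred = np.tile(item_mean, (n_users, 1))

for g in ["majority", "minority"]:
    held_out = ~observed & (group[:, None] == g)                 # held-out entries for this group
    rmse = np.sqrt(np.mean((pred[held_out] - ratings[held_out]) ** 2))
    print(g, "held-out RMSE:", round(float(rmse), 2))

The item means are dominated by the larger group, so held-out error is substantially higher for the minority group, which is the qualitative effect described above.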

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate: reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, …) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
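
A sketch of such a calibration check, with made-up numbers: compare the share of women in the image-search results for each occupation against an external estimate of the share of women in that occupation, and look for results that are systematically more extreme than reality:

# Sketch: test whether gender proportions in image-search results exaggerate
# real-world occupational proportions. All numbers below are illustrative only.
import numpy as np

# occupation: (share of women in search results, share of women in the occupation)
data = {"bartender": (0.45, 0.55), "construction worker": (0.02, 0.04),
        "author": (0.50, 0.56), "nurse": (0.93, 0.88), "engineer": (0.10, 0.15)}

search = np.array([v[0] for v in data.values()])
world = np.array([v[1] for v in data.values()])

# Calibration would mean search ~= world. A fitted slope > 1 around the 50% point
# (results more extreme than reality) suggests stereotype exaggeration.
slope, intercept = np.polyfit(world - 0.5, search - 0.5, 1)
print("slope:", round(float(slope), 2), "(> 1 indicates exaggeration)")
print("mean absolute deviation:", round(float(np.mean(np.abs(search - world))), 3))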

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways: for example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative — that is, exerting covert influence instead of making forthright appeals — or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads that we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
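
A toy simulation of this market-effects mechanism, with invented costs: the platform fills the cheapest impressions first, so a small budget overrepresents the cheaper group, and the skew shrinks as the budget grows. This is a cartoon of the mechanism, not the delivery system studied in the cited work:

# Toy simulation: a fixed budget plus unequal per-impression costs skews the
# delivered audience toward the cheaper group (all numbers are invented).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.choice(["women", "men"], size=n)          # 50/50 eligible population
# Hypothetical costs: impressions for women cost more on average.
cost = np.where(group == "women", rng.gamma(2.0, 0.04, n), rng.gamma(2.0, 0.03, n))

order = np.argsort(cost)                              # platform buys cheapest impressions first
cum_spend = np.cumsum(cost[order])

for budget in [50, 500, 5000]:
    delivered = order[cum_spend <= budget]
    share_women = np.mean(group[delivered] == "women")
    print(f"budget ${budget}: delivered audience is {share_women:.0%} women")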

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
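
For a sense of scale, the headline comparison can be summarized with a simple two-proportion z-test. The counts below assume, purely for illustration, that the roughly 6,400 inquiries split evenly between the two name groups; the actual study reports its own estimates, standard errors, and controls:

# Sketch: two-proportion z-test for an audit study's acceptance rates.
# Group sizes are assumed for illustration; rates are the ones reported in the text.
from math import sqrt
from scipy.stats import norm

n_white, n_black = 3200, 3200            # assumed (illustrative) group sizes
p_white, p_black = 0.50, 0.42            # acceptance rates from the study
x_white, x_black = p_white * n_white, p_black * n_black

p_pool = (x_white + x_black) / (n_white + n_black)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_white + 1 / n_black))
z = (p_white - p_black) / se
print("z =", round(z, 1), " two-sided p =", f"{2 * norm.sf(abs(z)):.1e}")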

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94 For example, in 2017 US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
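
A sketch of the kind of rule the journalists reported, with invented store distances and incomes, showing how a discount keyed only to competitor proximity can still end up correlated with ZIP-code wealth:

# Sketch of a competitor-distance pricing rule (the reported mechanism), with
# invented data, illustrating correlation with income even though income is unused.
def discounted(distance_to_competitor_miles, threshold=20):
    # show the discounted price only if a competitor's store is nearby
    return distance_to_competitor_miles <= threshold

# Hypothetical ZIP codes: (median income, miles to nearest competitor store).
zips = {"10001": (90_000, 2), "60601": (85_000, 5), "59801": (45_000, 120),
        "71601": (38_000, 85), "94105": (110_000, 1)}

for z, (income, dist) in zips.items():
    print(z, "income", income, "-> discount" if discounted(dist) else "-> full price")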

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than a uniform


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks are not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result — e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
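
A minimal sketch of steps 4 and 5, using placeholder page texts: train a classifier to distinguish the pages collected for the two groups, and attach a permutation p-value to its cross-validated accuracy (here via scikit-learn's permutation_test_score). A real study would also need the randomization and controls mentioned above:

# Sketch of the adversarial test (steps 4-5): can a classifier tell apart the pages
# shown to group A and group B better than chance? Page texts are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

pages_a = ["ad for life insurance quotes", "health plan premium calculator"] * 20
pages_b = ["ad for running shoes", "weekend travel deals"] * 20
labels = [0] * len(pages_a) + [1] * len(pages_b)

X = TfidfVectorizer().fit_transform(pages_a + pages_b)
clf = LogisticRegression(max_iter=1000)

# Observed cross-validated accuracy, plus a p-value from random relabelings.
score, perm_scores, p_value = permutation_test_score(
    clf, X, labels, cv=5, n_permutations=200, random_state=0)
print("observed accuracy:", round(score, 2), " permutation p-value:", round(p_value, 3))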

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting — in which actions taken on one website, such as searching for a product, result in ads for that product on another website — to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
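
A sketch of that cookie-sweep technique follows. The URL, cookie name, and parsing step are hypothetical placeholders, and any real crawl would need to respect the site's terms of service and rate limits:

# Sketch of sweeping a location cookie to observe price differences by ZIP code.
# The URL, cookie name, and parser below are hypothetical placeholders.
import requests

ZIP_CODES = ["10001", "60601", "94105"]           # in the study, every US ZIP code

def parse_price(html):
    raise NotImplementedError("site-specific parsing goes here")

def price_for_zip(zip_code):
    cookies = {"inferred_location": zip_code}     # hypothetical cookie name
    resp = requests.get("https://example.com/product/12345", cookies=cookies, timeout=10)
    # A real study would extract the displayed price from resp.text here.
    return parse_price(resp.text)

prices = {}
for z in ZIP_CODES:
    try:
        prices[z] = price_for_zip(z)
    except NotImplementedError:
        prices[z] = None                          # placeholder until a parser is written
print(prices)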

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome, where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are common. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, a task that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.
Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.
Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.
Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.
Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.
Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.
Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.
Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


They enable tracking trends in discrimination over time. When the findings are sufficiently blatant, they justify the need for intervention regardless of any differences in interpretation. And when we do apply a fairness intervention, they help us measure its effectiveness. Finally, empirical research can also help uncover the mechanisms by which discrimination takes place, which enables more targeted and effective interventions. This requires carefully formulating and testing hypotheses using domain knowledge.

The first half of this chapter surveys classic tests for discrimination that were developed in the context of human decision-making systems. The underlying concepts are just as applicable to the study of fairness in automated systems. Much of the first half will build on the causality chapter and explain concrete techniques including experiments, differences-in-differences, and regression discontinuity. While these are standard tools in the causal inference toolkit, we'll learn about the specific ways in which they can be applied to fairness questions. Then we will turn to the application of the observational criteria from Chapter 2. The summary table at the end of the first half lists, for each test, the fairness criterion that it probes, the type of access to the system that is required, and other nuances and limitations. The second half of the chapter is about testing fairness in algorithmic decision making, focusing on issues specific to algorithmic systems.

Part 1: Traditional tests for discrimination

Audit studies

The audit study is a popular technique for diagnosing discrimination. It involves a study design called a field experiment. "Field" refers to the fact that it is an experiment on the actual decision-making system of interest (in the "field," as opposed to a lab simulation of decision making). Experiments on real systems are hard to pull off. For example, we usually have to keep participants unaware that they are in an experiment. But field experiments allow us to study decision making as it actually happens, rather than worrying that what we're discovering is an artifact of a lab setting. At the same time, the experiment, by carefully manipulating and controlling variables, allows us to observe a treatment effect rather than merely observing a correlation.

How to interpret such a treatment effect is a trickier question. In our view, most audit studies, including the ones we'll describe, are best seen as attempts to test blindness: whether a decision maker directly uses a sensitive attribute. Recall that this notion of discrimination is not necessarily a counterfactual in a valid causal model (Chapter 4). Even as tests of blindness, there is debate about precisely what it is that they measure, since the researcher can at best signal race, gender, or another sensitive attribute. This will become clear when we discuss specific studies.

4 Wienk et al., "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
5 Ayres and Siegelman, "Race and Gender Discrimination in Bargaining for a New Car," The American Economic Review, 1995, 304–21.
6 In an experiment such as this, where the treatment is randomized, the addition or omission of control variables in a regression estimate of the treatment effect does not result in an incorrect estimate, but control variables can explain some of the noise in the observations and thus increase the precision of the treatment effect estimate, i.e., decrease the standard error of the coefficient.

Audit studies were pioneered by the US Department of Housing and Urban Development in the 1970s for the purpose of studying the adverse treatment faced by minority home buyers and renters.4 They have since been successfully applied to many other domains.

In one landmark study, researchers recruited 38 testers to visit about 150 car dealerships to bargain for cars and record the price they were offered at the end of bargaining.5 Testers visited dealerships in pairs; testers in a pair differed in terms of race or gender. Both testers in a pair bargained for the same model of car at the same dealership, usually within a few days of each other.

Pulling off an experiment such as this in a convincing way requires careful attention to detail; here we describe just a few of the many details in the paper. Most significantly, the researchers went to great lengths to minimize any differences between the testers that might correlate with race or gender. In particular, all testers were 28–32 years old, had 3–4 years of postsecondary education, and "were subjectively chosen to have average attractiveness." Further, to minimize the risk of testers' interaction with dealers being correlated with race or gender, every aspect of their verbal or nonverbal behavior was governed by a script. For example, all testers "wore similar 'yuppie' sportswear and drove to the dealership in similar rented cars." They also had to memorize responses to a long list of questions they were likely to encounter. All of this required extensive training and regular debriefs.

The paper's main finding was a large and statistically significant price penalty in the offers received by Black testers. For example, Black males received final offers that were about $1,100 more than White males, which represents a threefold difference in dealer profits based on data on dealer costs. The analysis in the paper has alternative target variables (initial offers instead of final offers, percentage markup instead of dollar offers), alternate model specifications (e.g., to account for the two audits in each pair having correlated noise), and additional controls (e.g., bargaining strategy). Thus, there are a number of different estimates, but the core findings remain robust.6
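To see how the paired design enters the analysis, here is a minimal sketch of one of the simpler specifications: treat each audit pair as a unit and test whether the within-pair difference in final offers is nonzero. The numbers and column names are invented; the paper's actual models are richer, with alternative target variables, controls, and corrections for correlated noise.

```python
# Sketch of a paired analysis in the spirit of the car-bargaining audit:
# each row is one audit pair (same dealership, same car model), with the
# final offers made to the Black and White testers. Data are hypothetical.
import pandas as pd
from scipy import stats

audits = pd.DataFrame({
    "offer_black": [14600, 15200, 13900, 16100],
    "offer_white": [13500, 14000, 13100, 15000],
})

# Working with within-pair differences absorbs the shared (correlated)
# dealership- and car-level noise in the two audits of a pair.
diff = audits["offer_black"] - audits["offer_white"]
t_stat, p_value = stats.ttest_1samp(diff, popmean=0.0)
print(f"mean penalty: ${diff.mean():.0f}, t={t_stat:.2f}, p={p_value:.3f}")
```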

A tempting interpretation of this study is that if two people were identical except for race, with one being White and the other being Black, then the offers they should expect to receive would differ by about $1,100. But what does it mean for two people to be identical except for race? Which attributes about them would be the same and which would be different?


7 Freeman et al., "Looking the Part: Social Status Cues Shape Race Perception," PloS One 6, no. 9 (2011): e25107.
8 In most other domains, say employment testing, demographic disparity would be less valuable because there are relevant differences between candidates. Price discrimination is unusual in that there are no morally salient qualities of buyers that may justify it.
9 Bertrand and Mullainathan, "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination," American Economic Review 94, no. 4 (2004): 991–1013.

With the benefit of the discussion of ontological instability in Chapter 4, we can understand the authors' implicit framework for making these decisions. In our view, they treat race as a stable source node in a causal graph, and attempt to hold constant all of its descendants, such as attire and behavior, in order to estimate the direct effect of race on the outcome. But what if one of the mechanisms of what we understand as "racial discrimination" is based on attire and behavior differences? The social construction of race suggests that this is plausible.7

Note that the authors did not attempt to eliminate differences in accent between testers. Why not? From a practical standpoint, accent is difficult to manipulate. But a more principled defense of the authors' choice is that accent is a part of how we understand race, a part of what it means to be Black, White, etc., so that even if the testers could manipulate their accents, they shouldn't. Accent is subsumed into the "race" node in the causal graph.

To take an informed stance on questions such as this, we need a deep understanding of cultural context and history. They are the subject of vigorous debate in sociology and critical race theory. Our point is this: the design and interpretation of audit studies requires taking positions on contested social questions. It may be futile to search for a single "correct" way to test even the seemingly straightforward fairness notion of whether the decision maker treats similar individuals similarly regardless of race. Controlling for a plethora of attributes is one approach; another is to simply recruit Black testers and White testers, have them behave and bargain as would be their natural inclination, and measure the demographic disparity. Each approach tells us something valuable, and neither is "better."8

Another famous audit study tested discrimination in the labor market.9 Instead of sending testers in person, the researchers sent in fictitious resumes in response to job ads. Their goal was to test if an applicant's race had an impact on the likelihood of an employer inviting them for an interview. They signaled race in the resumes by using White-sounding names (Emily, Greg) or Black-sounding names (Lakisha, Jamal). By creating pairs of resumes that were identical except for the name, they found that White names were 50% more likely to result in a callback than Black names. The magnitude of the effect was equivalent to an additional eight years of experience on a resume.
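As a sketch of the basic quantitative comparison behind such a study, the snippet below runs a two-proportion z-test on callback counts by signaled race. The counts are made up for illustration and are not the study's actual data.

```python
# Back-of-the-envelope comparison of callback rates by (signaled) race in a
# resume correspondence study. Counts are illustrative, not the real numbers.
import math

callbacks_white, resumes_white = 240, 2445
callbacks_black, resumes_black = 160, 2445

p_w = callbacks_white / resumes_white
p_b = callbacks_black / resumes_black
p_pool = (callbacks_white + callbacks_black) / (resumes_white + resumes_black)

# Two-proportion z-test for the difference in callback rates.
se = math.sqrt(p_pool * (1 - p_pool) * (1 / resumes_white + 1 / resumes_black))
z = (p_w - p_b) / se
print(f"white={p_w:.3f}, black={p_b:.3f}, ratio={p_w / p_b:.2f}, z={z:.2f}")
```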

Despite the study's careful design, debates over interpretation have inevitably arisen, primarily due to the use of candidate names as a way to signal race to employers. Did employers even notice the names in all cases, and might the effect have been even stronger if they had? Or can the observed disparities be better explained based on factors correlated with race, such as a preference for more common and familiar names, or an inference of higher socioeconomic status for the candidates with White-sounding names? (Of course, the alternative explanations don't make the observed behavior morally acceptable, but they are important to consider.) Although the authors provide evidence against these interpretations, debate has persisted. For a discussion of critiques of the validity of audit studies, see Pager's survey.10

10 Pager, "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future," The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
11 Kohler-Hausmann, "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination," Nw. UL Rev. 113 (2018): 1163.
12 Bertrand and Duflo, "Field Experiments on Discrimination," in Handbook of Economic Field Experiments, vol. 1 (Elsevier, 2017), 309–93.
13 Bertrand and Duflo.
14 Bertrand and Duflo.
15 Quillian et al., "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time," Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

In any event, like other audit studies, this experiment tests fairness as blindness. Even simple proxies for race, such as residential neighborhood, were held constant between matched pairs of resumes. Thus, the design likely underestimates the extent to which morally irrelevant characteristics affect callback rates in practice. This is just another way to say that attribute flipping does not generally produce counterfactuals that we care about, and it is unclear if the effect sizes measured have any meaningful interpretation that generalizes beyond the context of the experiment.

Rather, audit studies are valuable because they trigger a strong and valid moral intuition.11 They also serve a practical purpose: when designed well, they illuminate the mechanisms that produce disparities and help guide interventions. For example, the car bargaining study concluded that the preferences of owners of dealerships don't explain the observed discrimination, that the preferences of other customers may explain some of it, and that there is strong evidence that dealers themselves (rather than owners or customers) are the primary source of the observed discrimination.

Resume-based audit studies, also known as correspondence studies, have been widely replicated. We briefly present some major findings, with the caveat that there may be publication biases. For example, studies finding no evidence of an effect are in general less likely to be published. Alternatively, published null findings might reflect poor experiment design, or might simply indicate that discrimination is only expressed in certain contexts.

A 2016 survey lists 30 studies from 15 countries, covering nearly all continents, revealing pervasive discrimination against racial and ethnic minorities.12 The method has also been used to study discrimination based on gender, sexual orientation, and physical appearance.13 It has also been used outside the labor market, in retail and academia.14 Finally, trends over time have been studied: a meta-analysis found no change in racial discrimination in hiring against African Americans from 1989 to 2015. There was some indication of declining discrimination against Latinx Americans, although the data on this question was sparse.15


16 Blank, "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review," The American Economic Review, 1991, 1041–67.

17 Pischke, "Empirical Methods in Applied Economics: Lecture Notes," 2005.

Collectively, audit studies have helped nudge the academic and policy debate away from the naive view that discrimination is a concern of a bygone era. From a methodological perspective, our main takeaway from the discussion of audit studies is the complexity of defining and testing blindness.

Testing the impact of blinding

In some situations, it is not possible to test blindness by randomizing the decision maker's perception of race, gender, or other sensitive attribute. For example, suppose we want to test if there's gender bias in peer review in a particular research field. Submitting real papers with fictitious author identities may result in the reviewer attempting to look up the author and realizing the deception. A design in which the researcher changes author names to those of real people is even more problematic.

There is a slightly different strategy that's more viable: an editor of a scholarly journal in the research field could conduct an experiment in which each paper received is randomly assigned to be reviewed in either a single-blind fashion (in which the author identities are known to the referees) or a double-blind fashion (in which author identities are withheld from referees). Indeed, such experiments have been conducted,16 but in general even this strategy can be impractical.

At any rate, suppose that a researcher has access to only observational data on journal review policies and statistics on published papers. Among ten journals in the research field, some introduced double-blind review, and did so in different years. The researcher observes that in each case, right after the switch, the fraction of female-authored papers rose, whereas there was no change for the journals that stuck with single-blind review. Under certain assumptions, this enables estimating the impact of double-blind reviewing on the fraction of accepted papers that are female-authored. This hypothetical example illustrates the idea of a "natural experiment," so called because experiment-like conditions arise due to natural variation. Specifically, the study design in this case is called differences-in-differences. The first "difference" is between single-blind and double-blind reviewing, and the second "difference" is between journals (row 2 in the summary table).
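A minimal sketch of how such an estimate could be computed is shown below, using a two-way fixed effects regression on synthetic journal-year data. The variable names and numbers are hypothetical; they simply illustrate the structure of the estimator.

```python
# Minimal differences-in-differences sketch for the hypothetical journal
# example: the unit of analysis is a journal-year, the outcome is the
# fraction of accepted papers that are female-authored, and `treated`
# indicates that double-blind review was in effect. Data are synthetic.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "journal": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year":    [2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012],
    "treated": [0, 1, 1, 0, 0, 1, 0, 0, 0],
    "frac_female_authored": [0.21, 0.26, 0.27, 0.19, 0.20, 0.25, 0.22, 0.21, 0.22],
})

# Journal effects absorb stable differences between journals, year effects
# absorb field-wide trends; the coefficient on `treated` is the DiD estimate.
model = smf.ols("frac_female_authored ~ treated + C(journal) + C(year)", data=df).fit()
print(model.params["treated"], model.bse["treated"])
```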

Differences-in-differences is methodologically nuanced, and a full treatment is beyond our scope.17 We briefly note some pitfalls. There may be unobserved confounders: perhaps the switch to double-blind reviewing at each journal happened as a result of a change in editorship, and the new editors also instituted policies that encouraged female authors to submit strong papers. There may also be spillover effects (which violates the Stable Unit Treatment Value Assumption): a change in policy at one journal can cause a change in the set of papers submitted to other journals. Outcomes are serially correlated (if there is a random fluctuation in the gender composition of the research field due to an entry or exodus of some researchers, the effect will last many years); this complicates the computation of the standard error of the estimate.18 Finally, the effect of double blinding on the probability of acceptance of female-authored papers (rather than on the fraction of accepted papers that are female-authored) is not identifiable using this technique without additional assumptions or controls.

18 Bertrand, Duflo, and Mullainathan, "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
19 Kang et al., "Whitened Resumes: Race and Self-Presentation in the Labor Market," Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Even though testing the impact of blinding sounds similar to testing blindness, there is a crucial conceptual and practical difference. Since we are not asking a question about the impact of race, gender, or another sensitive attribute, we avoid running into ontological instability. The researcher doesn't need to intervene on the observable features by constructing fictitious resumes or training testers to use a bargaining script. Instead, the natural variation in features is left unchanged: the study involves real decision subjects. The researcher only intervenes on the decision-making procedure (or exploits natural variation) and evaluates the impact of that intervention on groups of candidates defined by the sensitive attribute A. Thus, A is not a node in a causal graph, but merely a way to split the units into groups for analysis. Questions of whether the decision maker actually inferred the sensitive attribute, or merely a feature correlated with it, are irrelevant to the interpretation of the study. Further, the effect sizes measured do have a meaning that generalizes to scenarios beyond the experiment. For example, a study tested the effect of "resume whitening," in which minority applicants deliberately conceal cues of their racial or ethnic identity in job application materials to improve their chances of getting a callback.19 The effects reported in the study are meaningful to job seekers who engage in this practice.

Revealing extraneous factors in decisions

Sometimes natural experiments can be used to show the arbitrariness of decision making, rather than unfairness in the sense of non-blindness (row 3 in the summary table). Recall that arbitrariness is one type of unfairness that we are concerned about in this book (Chapter 3). Arbitrariness may refer to the lack of a uniform decision-making procedure, or to the incursion of irrelevant factors into the procedure.

For example, a study looked at decisions made by judges in Louisiana juvenile courts, including sentence lengths.20 It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

20 Eren and Mocan, "Emotional Judges and Unlucky Juveniles," American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
21 For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that "Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana."
22 Danziger, Levav, and Avnaim-Pesso, "Extraneous Factors in Judicial Decisions," Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
23 In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges' attention. See Lakens, "Impossibly Hungry Judges" (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017).
24 Weinshall-Margel and Shapard, "Overlooked Factors in the Analysis of Parole Decisions," Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, before gradually dropping to nearly zero right before the judges' food break; it returned to about 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges' mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the "hungry judges" study, and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn't random, and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn't look hard enough. A follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.

fairness and machine learning - 2021-06-04 9

25 Huq, "Racial Equity in Algorithmic Criminal Justice," Duke LJ 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84 and 86 is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice: banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren't admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision (or a fairness intervention) on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we'd like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects, say on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85%. Then we can compare the salary of students with scores of 85% to 86% (and who thus were awarded the scholarship) with those of students with scores of 84% to 85% (and who thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus, we can estimate the impact of the scholarship as if we did a randomized controlled trial.
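The snippet below sketches the simplest version of this comparison on synthetic data: restrict to a narrow band around the cutoff and compare mean outcomes on either side. The cutoff, bandwidth, and data-generating process are all assumptions made for illustration.

```python
# Regression discontinuity sketch for the scholarship example: restrict to a
# narrow band of test scores around the cutoff and compare mean salaries on
# the two sides. Data, column names, and the bandwidth are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
scores = rng.uniform(70, 100, size=5000)
awarded = scores >= 85                               # the scholarship rule
salary = 40000 + 300 * scores + 2000 * awarded + rng.normal(0, 5000, size=5000)
df = pd.DataFrame({"score": scores, "awarded": awarded, "salary": salary})

bandwidth = 1.0                                      # scores in [84, 86)
band = df[(df.score >= 85 - bandwidth) & (df.score < 85 + bandwidth)]
effect = band[band.awarded].salary.mean() - band[~band.awarded].salary.mean()
print(f"estimated effect of the scholarship near the cutoff: ${effect:,.0f}")
# A wider bandwidth gives more data but weakens the "as good as random"
# comparison; local linear regression on each side is the standard refinement.
```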

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.
28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions28 (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.

We can, of course, try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
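The dueling-regressions pattern can be made concrete with a short sketch: fit one model with only gender and one that adds the proposed controls, and compare the gender coefficients. The data-generating process and column names below are invented, and, as discussed next, whether the controls are appropriate is a normative question that the regression itself cannot answer.

```python
# Sketch of a conditional demographic parity check via regression: a raw
# pay-gap model versus one that controls for GPA and performance reviews.
# The synthetic data stand in for a hypothetical HR dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2000
gender = rng.choice(["m", "w"], size=n)
gpa = rng.normal(3.2, 0.4, size=n)
perf = rng.normal(3.0, 1.0, size=n) + 0.3 * (gender == "m")  # reviews may themselves be biased
wage = 20 + 5 * gpa + 3 * perf + rng.normal(0, 2, size=n)
employees = pd.DataFrame({"wage": wage, "gender": gender, "gpa": gpa, "perf": perf})

raw = smf.ols("wage ~ C(gender)", data=employees).fit()
adjusted = smf.ols("wage ~ C(gender) + gpa + perf", data=employees).fit()

# If the gender coefficient shrinks once controls are added, the data are
# consistent with conditional demographic parity given those controls; that
# says nothing about whether the controls (here, perf) are themselves fair.
print(raw.params.filter(like="gender"))
print(adjusted.params.filter(like="gender"))
```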

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter, we've presented many scenarios (screening job candidates, peer review, parole hearings) that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
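For a decision maker with access to the scores, a sufficiency check can be as simple as binning applicants by predicted probability and comparing observed default rates across groups within each bin. The sketch below does this on synthetic data where the scores are calibrated by construction.

```python
# Sketch of a sufficiency check available to the decision maker (who observes
# the score R): bin applicants by predicted default probability and compare
# observed default rates across groups within each bin. Data are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20000
group = rng.choice(["a", "b"], size=n)
score = rng.uniform(0, 1, size=n)        # R: predicted probability of default
default = rng.binomial(1, score)         # Y: calibrated by construction
df = pd.DataFrame({"group": group, "score": score, "default": default})

df["score_bin"] = pd.cut(df["score"], bins=np.linspace(0, 1, 11))
table = df.groupby(["score_bin", "group"], observed=True)["default"].mean().unstack()
print(table)  # under sufficiency, the two columns should match within each bin
```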

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ (i.e., whether or not the loan was approved) and not R. Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same. This observational test is called the outcome test (row 7 in the summary table).

30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Here is a tempting argument based on the outcome test: if one group (say women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.
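The thought experiment is easy to reproduce in simulation. In the sketch below, scores are calibrated and the same threshold is applied to both groups (so sufficiency holds), yet the default rate among approved applicants differs because the two groups' risk distributions differ. The distributions and threshold are arbitrary choices for illustration.

```python
# Simulation of infra-marginality: calibrated scores, a single approval
# threshold for both groups, and yet different default rates among approved
# applicants because the groups' risk distributions differ.
import numpy as np

rng = np.random.default_rng(2)
n = 100000
# Probability of default: women concentrated at lower risk than men.
risk_women = rng.beta(2, 8, size=n)
risk_men = rng.beta(3, 6, size=n)

threshold = 0.3                                      # approve if predicted risk is below this
for label, risk in [("women", risk_women), ("men", risk_men)]:
    approved = risk < threshold
    defaults = rng.binomial(1, risk[approved])       # outcomes drawn from calibrated risk
    print(label, "default rate among approved:", defaults.mean().round(3))
# Same threshold, calibrated scores, different default rates among the
# approved: the outcome test flags a disparity even though sufficiency holds.
```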

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).
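The thought experiment is easy to reproduce in a few lines of simulation. The two risk distributions below are invented stand-ins for the curves in the figure; the score is calibrated by construction, so sufficiency holds, yet the default rates among approved applicants differ.

```python
# A small simulation of the thought experiment: the bank applies the same
# threshold to both groups and its score is calibrated by construction, yet
# the default rate among approved applicants differs by group.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
threshold = 0.3  # applicants with predicted default probability below this get the loan

# Women's default risk is concentrated at lower values than men's (invented distributions).
risk = {
    "women": rng.beta(2, 8, size=n),
    "men": rng.beta(3, 5, size=n),
}

for group, r in risk.items():
    approved = r < threshold                   # same bar for both groups
    defaults = rng.binomial(1, r[approved])    # outcomes drawn from the true risk
    print(f"{group}: approval rate {approved.mean():.2f}, "
          f"default rate among approved {defaults.mean():.3f}")
# The outcome test flags the gap in default rates among approved applicants,
# even though no group faces a higher bar: this is infra-marginality.
```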

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups), rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y) whose parameters can be inferred from the data, whereas the outcome test is model-free.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit.


31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

When using a marginal test, we treat the distribution of applicant characteristics as a given and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision-making systems, due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates; it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31
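A small synthetic example makes the problem tangible. Everything below is simulated: an oracle sees all outcomes, while the naive analyst only sees outcomes for applicants approved under the incumbent policy, and the two estimates of the same error rate diverge.

```python
# A simulation sketch of the selective labels problem (all quantities are
# synthetic). An incumbent policy approves low-score applicants; we then try
# to evaluate a candidate classifier, but outcome labels exist only for
# approved applicants, so the naive estimate differs from the oracle value.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
true_risk = rng.beta(2, 5, size=n)
Y = rng.binomial(1, true_risk)                 # default outcome (oracle view)

incumbent_score = true_risk + rng.normal(0, 0.1, size=n)
approved = incumbent_score < 0.35              # only these applicants get labels

candidate_pred = (true_risk + rng.normal(0, 0.15, size=n)) > 0.4   # predicts "will default"

def false_negative_rate(pred, y):
    # Among true defaulters, how many did the candidate classifier miss?
    return np.mean(~pred[y == 1])

print("oracle FNR (all applicants):    ", round(false_negative_rate(candidate_pred, Y), 3))
print("naive FNR (labeled sample only):",
      round(false_negative_rate(candidate_pred[approved], Y[approved]), 3))
# The two numbers differ because approval is correlated with risk, so the
# labeled sample over-represents low-risk applicants.
```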

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications it is indeed possible, and important, to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

Test / study design | Fairness notion / application | Access | Notes / limitations
1. Audit study | Blindness | A-exp=, X=, R | Difficult to interpret
2. Natural experiment, especially diff-in-diff | Impact of blinding | A-exp∼, R | Confounding, SUTVA violations, other
3. Natural experiment | Arbitrariness | W∼, R | Unobserved confounders
4. Natural experiment, especially regression discontinuity | Impact of decision | R, Y or Y′ | Sample size, confounding, other technical difficulties
5. Regression analysis | Blindness | X, A, R | Unreliable due to proxies
6. Regression analysis | Conditional demographic parity | X, A, R | Weak moral justification
7. Outcome test | Predictive parity | A, Y given Ŷ = 1 | Infra-marginality
8. Threshold test | Sufficiency | X′, A, Y given Ŷ = 1 | Model-specific
9. Experiment | Separation / error rate parity | A, R, Y, Ŷ= | Often unethical or impractical
10. Observational test | Demographic parity | A, R | See Chapter 2
11. Mediation analysis | "Relevant" mechanism | X, A, R | See Chapter 4

Legend

• =: indicates intervention on some variable (that is, X= does not represent a new random variable but is simply an annotation describing how X is used in the test)
• ∼: natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice toward a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor markets.33


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.


A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group; the usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and, in fact, discriminates in favor of the female candidates who would take time off if hired).

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision-making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence toward this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy-maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.


Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, it is argued to be either justified or futile to proscribe, because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with a CV equivalent to a male candidate's is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision-making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision-making processes is through surveys of decision makers or organizations. Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.38


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.
39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to, and symbiotic with, quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews, in which she presented as a graduate student interested in internship opportunities.

The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects will behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree in the book's title), are also highly correlated with socioeconomic status.


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants, notably applicants from China, would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece. (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014.)


Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it was only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas: algorithmic decision making, various applications of natural language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.


45 For a treatise on AAE, see Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.


However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.


The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings.

Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task | Examples | Sources of disparity | Harm
Perception | Language id., speech-to-text | Underrepresentation in training corpus | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels; underrepresentation in training corpus | Adverse decisions
Predicting outcomes | Resume filtering | Various, including human labels | Adverse decisions
Sequence prediction | Language generation, translation | Cultural stereotypes, historical prejudices | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6% difference in error rate).


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition/, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 McEntegart, "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
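A sketch of the first, easier fix follows. It assumes access to a validation set of scores and ground-truth labels (synthetic here, not any vendor's data) and simply sweeps the decision threshold to balance the two error directions, leaving the trained model untouched.

```python
# Pick a decision threshold on a validation set that balances the two error
# directions (female faces classified as male vs. male faces classified as
# female) without retraining. Scores and labels are assumed, synthetic inputs.
import numpy as np

def balanced_threshold(scores_male, is_male, grid=None):
    """Return the threshold on the 'male' score that best equalizes
    the female-to-male and male-to-female error rates."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    best_t, best_gap = None, np.inf
    for t in grid:
        pred_male = scores_male >= t
        err_f_to_m = np.mean(pred_male[~is_male])    # female faces classified as male
        err_m_to_f = np.mean(~pred_male[is_male])    # male faces classified as female
        gap = abs(err_f_to_m - err_m_to_f)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Toy usage with synthetic scores skewed so that the default 0.5 threshold
# misclassifies female faces as male more often than the reverse:
rng = np.random.default_rng(3)
is_male = rng.random(10_000) < 0.5
scores = np.where(is_male, rng.beta(6, 2, 10_000), rng.beta(3, 4, 10_000))
print("recalibrated threshold:", balanced_threshold(scores, is_male))
```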

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place?


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over other content. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings.


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

In general, this type of unfairness is hard to study in real systems (not just by external researchers but also by the system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate: reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors', fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints.


67 For in-depth treatments of the history and politics of information platforms, see Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598.
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.
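For concreteness, here is what such a calibration check looks like in code, with made-up numbers standing in for the study's data on search results and occupational statistics.

```python
# A minimal sketch of the calibration comparison described above, with
# invented numbers: for each occupation, compare the fraction of women in
# image search results to the real-world fraction in that occupation.
import numpy as np

# occupation: (fraction of women in search results, fraction in the occupation)
data = {
    "author":              (0.50, 0.56),
    "bartender":           (0.45, 0.60),
    "construction worker": (0.04, 0.03),
    "nurse":               (0.93, 0.90),
}

search = np.array([v[0] for v in data.values()])
reality = np.array([v[1] for v in data.values()])

# Correlation tells us whether search results track occupational statistics;
# the signed gap tells us whether existing imbalances are exaggerated
# (stereotype exaggeration: over-representing the already-majority gender).
print("correlation:", np.corrcoef(search, reality)[0, 1].round(2))
exaggeration = np.where(reality > 0.5, search - reality, reality - search)
print("mean exaggeration:", exaggeration.mean().round(3))
```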

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals; the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs); and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study.


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89.

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads targeted by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly, one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform.


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
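A toy simulation illustrates the budget effect. The per-impression prices and group gap below are invented; the only point is that a fixed budget spent on the cheapest available impressions skews the audience, and the skew shrinks as the budget grows.

```python
# Toy simulation of the market-effects mechanism (all prices are invented):
# if one group costs more to reach and the platform spends a fixed budget on
# the cheapest available impressions, smaller budgets skew the audience
# toward the cheaper group.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
group = np.where(rng.random(n) < 0.5, "women", "men")
price = np.where(group == "women", rng.normal(1.3, 0.2, n), rng.normal(1.0, 0.2, n))

order = np.argsort(price)                     # platform fills cheapest impressions first
for budget in [5_000, 20_000, 80_000]:
    served = order[np.cumsum(price[order]) <= budget]
    frac_women = np.mean(group[served] == "women")
    print(f"budget {budget:>6}: audience is {frac_women:.1%} women")
# Larger budgets force the platform deeper into the more expensive group,
# so the skew shrinks as the budget grows.
```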

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interacting with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than it did the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917.

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those used in traditional settings such as housing and employment: a combination of audit studies and observational methods has been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but which were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced, entire property or shared) and the diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
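As a quick plausibility check on the reported gap, a standard test of equal proportions can be run on counts of this rough magnitude. The even split of inquiries between the two name groups assumed below is an approximation, not the study's exact design.

```python
# Back-of-the-envelope significance check, assuming (hypothetically) that the
# roughly 6,400 inquiries were split evenly between the two name groups; the
# exact counts in the study differ somewhat.
import numpy as np
from scipy.stats import chi2_contingency

n_per_group = 3200
accepted_white = int(0.50 * n_per_group)   # reported acceptance rates
accepted_black = int(0.42 * n_per_group)
table = np.array([
    [accepted_white, n_per_group - accepted_white],
    [accepted_black, n_per_group - accepted_black],
])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")
# An 8 percentage point gap on samples of this size is far too large to be
# attributed to chance.
```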

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check.


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me – Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race and ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms presents a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.) 87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).

92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.

93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question. 88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities. 89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis). 90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission 91 and prior research. 92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study. 93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94 For example, in 2017 US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up".

95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.

96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts". Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed. 94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral. 95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier. 96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price, or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than a uniform


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos. 97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria including demographic parity errorrate parity and calibration have received much attention in algorith-mic fairness studies Convenience has probably played a big role inthis choice these metrics are easy to gather and straightforward to re-port without necessarily connecting them to moral notions of fairnessWe reiterate our caution about the overuse of parity-based notionsparity should rarely be made a goal by itself At a minimum it isimportant to understand the sources and mechanisms that producedisparities as well as the harms that result from them before decidingon appropriate interventions
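To illustrate why these metrics are straightforward to report, here is a minimal sketch of how demographic parity, error rate parity, and calibration might be tabulated from a batch of audit data. The arrays y_true, y_pred, score, and group are hypothetical placeholders, and reporting such numbers is only the start of the kind of analysis we are advocating.

    # Minimal sketch: observational fairness metrics from (hypothetical) audit data.
    import numpy as np

    def group_rates(y_true, y_pred, group):
        """Per-group selection rate (demographic parity) and error rates (error rate parity)."""
        rates = {}
        for g in np.unique(group):
            m = group == g
            yt, yp = y_true[m], y_pred[m]
            rates[g] = {
                "selection_rate": yp.mean(),
                "false_positive_rate": yp[yt == 0].mean(),
                "false_negative_rate": 1 - yp[yt == 1].mean(),
            }
        return rates

    def calibration_by_group(y_true, score, group, bins=10):
        """Observed positive rate per score bin, per group (empty bins yield NaN)."""
        bin_ids = np.digitize(score, np.linspace(0, 1, bins + 1)[1:-1])
        return {
            g: [y_true[(group == g) & (bin_ids == b)].mean() for b in range(bins)]
            for g in np.unique(group)
        }

    # Example with synthetic placeholder data.
    rng = np.random.default_rng(0)
    group = rng.choice(["a", "b"], size=1000)
    score = rng.uniform(size=1000)
    y_pred = (score > 0.5).astype(int)
    y_true = rng.binomial(1, score)
    print(group_rates(y_true, y_pred, group))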

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity, such as searches and clicks, is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test. 98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

better) could have arisen by chance if there is in fact no systematic difference between the two groups. 99

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study. 100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
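Here is a minimal sketch of steps 4 and 5, assuming the recorded page contents have already been collected into two lists of strings. The placeholder documents below stand in for real crawl data, and scikit-learn's permutation_test_score is one off-the-shelf implementation of the permutation test mentioned above.

    # Sketch of the adversarial test's classification step (placeholder inputs).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import permutation_test_score

    # Placeholder page contents; in a real audit these are the pages recorded in step 3.
    pages_a = ["... life insurance quote premium ad ..."] * 50   # group A (visited health site)
    pages_b = ["... generic news article content ..."] * 50      # group B (control)

    docs = pages_a + pages_b
    labels = np.array([0] * len(pages_a) + [1] * len(pages_b))

    X = TfidfVectorizer().fit_transform(docs)
    clf = LogisticRegression(max_iter=1000)

    # Cross-validated accuracy, plus a permutation test of the null hypothesis that
    # page contents are independent of group membership (accuracy should be near 1/2).
    accuracy, _, p_value = permutation_test_score(
        clf, X, labels, cv=5, n_permutations=200, scoring="accuracy", random_state=0
    )
    print(f"accuracy = {accuracy:.2f}, permutation p-value = {p_value:.3f}")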

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting. 101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies. 102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States. 103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
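The following sketch conveys the idea of that cookie manipulation. The URL, cookie name, price-parsing step, and the local file of ZIP codes are all hypothetical placeholders rather than details of the Journal's actual tooling.

    # Hypothetical sketch: request the same product page once per ZIP code,
    # overriding the site's inferred location by rewriting a cookie value.
    import csv
    import re
    import requests

    PRODUCT_URL = "https://www.example-retailer.com/product/12345"  # placeholder URL
    COOKIE_NAME = "inferred_zip"                                     # assumed cookie name

    def extract_price(html: str) -> str:
        match = re.search(r"\$\d+(?:\.\d{2})?", html)   # naive placeholder parser
        return match.group(0) if match else ""

    with open("zip_codes.txt") as zips, open("prices.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["zip", "price"])
        for zip_code in (line.strip() for line in zips):
            response = requests.get(PRODUCT_URL, cookies={COOKIE_NAME: zip_code}, timeout=10)
            writer.writerow([zip_code, extract_price(response.text)])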

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

limited by the requirement for each account to have a distinct credit card associated with it. 104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome, where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit. 105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.

108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.

109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability, and Transparency, 2018, 134–48.

110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release. 106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing. 107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems. 108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers. 109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team. 110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, one that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these


methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.

Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.

Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.

Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.

Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.

Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.

Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.

Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, 2007:35. New York, NY, USA, 2007.

Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.

Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.

Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.

Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability, and Transparency, 77–91. 2018.

Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.

Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.

Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.

Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted


Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability, and Transparency, 134–48. 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.

D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.

Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation


Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke L.J. 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. U.L. Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. L.J. 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2


(1991): 163–78.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


4 Wienk et al., "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.

5 Ayres and Siegelman, "Race and Gender Discrimination in Bargaining for a New Car," The American Economic Review, 1995, 304–21.

6 In an experiment such as this, where the treatment is randomized, the addition or omission of control variables in a regression estimate of the treatment effect does not result in an incorrect estimate, but control variables can explain some of the noise in the observations and thus increase the precision of the treatment effect estimate, i.e., decrease the standard error of the coefficient.

(Chapter 4). Even as tests of blindness, there is debate about precisely what it is that they measure, since the researcher can at best signal race, gender, or another sensitive attribute. This will become clear when we discuss specific studies.

Audit studies were pioneered by the US Department of Housing and Urban Development in the 1970s for the purpose of studying the adverse treatment faced by minority home buyers and renters. 4 They have since been successfully applied to many other domains.

In one landmark study, researchers recruited 38 testers to visit about 150 car dealerships to bargain for cars, and record the price they were offered at the end of bargaining. 5 Testers visited dealerships in pairs; testers in a pair differed in terms of race or gender. Both testers in a pair bargained for the same model of car at the same dealership, usually within a few days of each other.

Pulling off an experiment such as this in a convincing way requires careful attention to detail; here we describe just a few of the many details in the paper. Most significantly, the researchers went to great lengths to minimize any differences between the testers that might correlate with race or gender. In particular, all testers were 28–32 years old, had 3–4 years of postsecondary education, and "were subjectively chosen to have average attractiveness." Further, to minimize the risk of testers' interaction with dealers being correlated with race or gender, every aspect of their verbal or nonverbal behavior was governed by a script. For example, all testers "wore similar 'yuppie' sportswear and drove to the dealership in similar rented cars." They also had to memorize responses to a long list of questions they were likely to encounter. All of this required extensive training and regular debriefs.

The paper's main finding was a large and statistically significant price penalty in the offers received by Black testers. For example, Black males received final offers that were about $1,100 more than White males, which represents a threefold difference in dealer profits, based on data on dealer costs. The analysis in the paper has alternative target variables (initial offers instead of final offers, percentage markup instead of dollar offers), alternate model specifications (e.g., to account for the two audits in each pair having correlated noise), and additional controls (e.g., bargaining strategy). Thus, there are a number of different estimates, but the core findings remain robust. 6
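As an illustration of the footnote's point about control variables under randomization, here is a minimal sketch of how the race effect on final offers might be estimated from paired audit data. The data file and column names are hypothetical, and the paper's actual specifications are considerably richer.

    # Sketch: estimate the race effect on final offers from hypothetical paired audit data.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed columns: pair_id (one audit pair per dealership visit),
    # black (1 if the tester is Black), final_offer (dollars).
    audits = pd.read_csv("audits.csv")

    # Baseline: difference in means (unbiased because tester race is randomized).
    m1 = smf.ols("final_offer ~ black", data=audits).fit()

    # Adding pair fixed effects absorbs dealership- and car-level variation, which
    # should tighten the standard error without changing the estimate much.
    m2 = smf.ols("final_offer ~ black + C(pair_id)", data=audits).fit()

    print(m1.params["black"], m1.bse["black"])
    print(m2.params["black"], m2.bse["black"])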

A tempting interpretation of this study is that if two people were identical except for race, with one being White and the other being Black, then the offers they should expect to receive would differ by about $1,100. But what does it mean for two people to be identical except for race? Which attributes about them would be the same and which would be different?


7 Freeman et al., "Looking the Part: Social Status Cues Shape Race Perception," PloS One 6, no. 9 (2011): e25107.

8 In most other domains, say employment, testing demographic disparity would be less valuable, because there are relevant differences between candidates. Price discrimination is unusual in that there are no morally salient qualities of buyers that may justify it.

9 Bertrand and Mullainathan, "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination," American Economic Review 94, no. 4 (2004): 991–1013.

With the benefit of the discussion of ontological instability in Chapter 4, we can understand the authors' implicit framework for making these decisions. In our view, they treat race as a stable source node in a causal graph and attempt to hold constant all of its descendants, such as attire and behavior, in order to estimate the direct effect of race on the outcome. But what if one of the mechanisms of what we understand as "racial discrimination" is based on attire and behavior differences? The social construction of race suggests that this is plausible. 7

Note that the authors did not attempt to eliminate differences in accent between testers. Why not? From a practical standpoint, accent is difficult to manipulate. But a more principled defense of the authors' choice is that accent is a part of how we understand race, a part of what it means to be Black, White, etc., so that even if the testers could manipulate their accents, they shouldn't. Accent is subsumed into the "race" node in the causal graph.

To take an informed stance on questions such as this, we need a deep understanding of cultural context and history. They are the subject of vigorous debate in sociology and critical race theory. Our point is this: the design and interpretation of audit studies requires taking positions on contested social questions. It may be futile to search for a single "correct" way to test even the seemingly straightforward fairness notion of whether the decision maker treats similar individuals similarly regardless of race. Controlling for a plethora of attributes is one approach; another is to simply recruit Black testers and White testers, have them behave and bargain as would be their natural inclination, and measure the demographic disparity. Each approach tells us something valuable, and neither is "better". 8

Another famous audit study tested discrimination in the labor market. 9 Instead of sending testers in person, the researchers sent in fictitious resumes in response to job ads. Their goal was to test if an applicant's race had an impact on the likelihood of an employer inviting them for an interview. They signaled race in the resumes by using White-sounding names (Emily, Greg) or Black-sounding names (Lakisha, Jamal). By creating pairs of resumes that were identical except for the name, they found that White names were 50% more likely to result in a callback than Black names. The magnitude of the effect was equivalent to an additional eight years of experience on a resume.

Despite the study's careful design, debates over interpretation have inevitably arisen, primarily due to the use of candidate names as a way to signal race to employers. Did employers even notice the names in all cases, and might the effect have been even stronger


10 Pager, "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future," The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

11 Kohler-Hausmann, "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination," Nw. U.L. Rev. 113 (2018): 1163.

12 Bertrand and Duflo, "Field Experiments on Discrimination," in Handbook of Economic Field Experiments, vol. 1 (Elsevier, 2017), 309–93.

13 Bertrand and Duflo.

14 Bertrand and Duflo.

15 Quillian et al., "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time," Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

if they had? Or can the observed disparities be better explained based on factors correlated with race, such as a preference for more common and familiar names, or an inference of higher socioeconomic status for the candidates with White-sounding names? (Of course, the alternative explanations don't make the observed behavior morally acceptable, but they are important to consider.) Although the authors provide evidence against these interpretations, debate has persisted. For a discussion of critiques of the validity of audit studies, see Pager's survey. 10

In any event, like other audit studies, this experiment tests fairness as blindness. Even simple proxies for race, such as residential neighborhood, were held constant between matched pairs of resumes. Thus, the design likely underestimates the extent to which morally irrelevant characteristics affect callback rates in practice. This is just another way to say that attribute flipping does not generally produce counterfactuals that we care about, and it is unclear if the effect sizes measured have any meaningful interpretation that generalizes beyond the context of the experiment.

Rather, audit studies are valuable because they trigger a strong and valid moral intuition. 11 They also serve a practical purpose: when designed well, they illuminate the mechanisms that produce disparities and help guide interventions. For example, the car bargaining study concluded that the preferences of owners of dealerships don't explain the observed discrimination, that the preferences of other customers may explain some of it, and that there is strong evidence that dealers themselves (rather than owners or customers) are the primary source of the observed discrimination.

Resume-based audit studies, also known as correspondence studies, have been widely replicated. We briefly present some major findings, with the caveat that there may be publication biases. For example, studies finding no evidence of an effect are in general less likely to be published. Alternately, published null findings might reflect poor experiment design, or might simply indicate that discrimination is only expressed in certain contexts.

A 2016 survey lists 30 studies from 15 countries, covering nearly all continents, revealing pervasive discrimination against racial and ethnic minorities. 12 The method has also been used to study discrimination based on gender, sexual orientation, and physical appearance. 13 It has also been used outside the labor market, in retail and academia. 14 Finally, trends over time have been studied: a meta-analysis found no change in racial discrimination in hiring against African Americans from 1989 to 2015. There was some indication of declining discrimination against Latinx Americans, although the data on this question was sparse. 15


16 Blank, "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review," The American Economic Review, 1991, 1041–67.

17 Pischke, "Empirical Methods in Applied Economics: Lecture Notes," 2005.

Collectively, audit studies have helped nudge the academic and policy debate away from the naive view that discrimination is a concern of a bygone era. From a methodological perspective, our main takeaway from the discussion of audit studies is the complexity of defining and testing blindness.

Testing the impact of blinding

In some situations, it is not possible to test blindness by randomizing the decision maker's perception of race, gender, or another sensitive attribute. For example, suppose we want to test if there's gender bias in peer review in a particular research field. Submitting real papers with fictitious author identities may result in the reviewer attempting to look up the author and realizing the deception. A design in which the researcher changes author names to those of real people is even more problematic.

There is a slightly different strategy that's more viable: an editor of a scholarly journal in the research field could conduct an experiment in which each paper received is randomly assigned to be reviewed in either a single-blind fashion (in which the author identities are known to the referees) or a double-blind fashion (in which author identities are withheld from referees). Indeed, such experiments have been conducted, 16 but in general even this strategy can be impractical.

At any rate, suppose that a researcher has access to only observational data on journal review policies and statistics on published papers. Among ten journals in the research field, some introduced double-blind review, and did so in different years. The researcher observes that in each case, right after the switch, the fraction of female-authored papers rose, whereas there was no change for the journals that stuck with single-blind review. Under certain assumptions, this enables estimating the impact of double-blind reviewing on the fraction of accepted papers that are female-authored. This hypothetical example illustrates the idea of a "natural experiment," so called because experiment-like conditions arise due to natural variation. Specifically, the study design in this case is called differences-in-differences. The first "difference" is between single-blind and double-blind reviewing, and the second "difference" is between journals (row 2 in the summary table).
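To make the estimator concrete, here is a minimal sketch of a differences-in-differences regression on a hypothetical journal-year panel; the file and column names are assumptions, and, as the next paragraph explains, a credible analysis must also confront confounders, spillovers, and serially correlated outcomes.

    # Sketch: generalized differences-in-differences on a hypothetical journal-year panel.
    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed columns: journal, year, female_share (fraction of accepted papers that
    # are female-authored), post_switch (1 in the years after that journal adopted
    # double-blind review, 0 otherwise and for journals that never switched).
    panel = pd.read_csv("journal_panel.csv")

    # Journal and year fixed effects absorb stable differences between journals and
    # field-wide trends; the coefficient on post_switch is the DiD estimate.
    did = smf.ols("female_share ~ C(journal) + C(year) + post_switch", data=panel).fit(
        cov_type="cluster", cov_kwds={"groups": panel["journal"]}   # journal-clustered SEs
    )
    print(did.params["post_switch"], did.bse["post_switch"])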

Differences-in-differences is methodologically nuanced, and a full treatment is beyond our scope.17 We briefly note some pitfalls. There may be unobserved confounders: perhaps the switch to double-blind reviewing at each journal happened as a result of a change in editorship, and the new editors also instituted policies that encouraged


18 Bertrand, Duflo, and Mullainathan, "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

19 Kang et al., "Whitened Resumes: Race and Self-Presentation in the Labor Market," Administrative Science Quarterly 61, no. 3 (2016): 469–502.

female authors to submit strong papers. There may also be spillover effects (which violates the Stable Unit Treatment Value Assumption): a change in policy at one journal can cause a change in the set of papers submitted to other journals. Outcomes are serially correlated (if there is a random fluctuation in the gender composition of the research field due to an entry or exodus of some researchers, the effect will last many years); this complicates the computation of the standard error of the estimate.18 Finally, the effect of double blinding on the probability of acceptance of female-authored papers (rather than on the fraction of accepted papers that are female authored) is not identifiable using this technique without additional assumptions or controls.

Even though testing the impact of blinding sounds similar to testing blindness, there is a crucial conceptual and practical difference. Since we are not asking a question about the impact of race, gender, or another sensitive attribute, we avoid running into ontological instability. The researcher doesn't need to intervene on the observable features by constructing fictitious resumes or training testers to use a bargaining script. Instead, the natural variation in features is left unchanged: the study involves real decision subjects. The researcher only intervenes on the decision making procedure (or exploits natural variation), and evaluates the impact of that intervention on groups of candidates defined by the sensitive attribute A. Thus, A is not a node in a causal graph, but merely a way to split the units into groups for analysis. Questions of whether the decision maker actually inferred the sensitive attribute, or merely a feature correlated with it, are irrelevant to the interpretation of the study. Further, the effect sizes measured do have a meaning that generalizes to scenarios beyond the experiment. For example, a study tested the effect of "resume whitening," in which minority applicants deliberately conceal cues of their racial or ethnic identity in job application materials to improve their chances of getting a callback.19 The effects reported in the study are meaningful to job seekers who engage in this practice.

Revealing extraneous factors in decisions

Sometimes natural experiments can be used to show the arbitrariness of decision making, rather than unfairness in the sense of non-blindness (row 3 in the summary table). Recall that arbitrariness is one type of unfairness that we are concerned about in this book (Chapter 3). Arbitrariness may refer to the lack of a uniform decision making procedure, or to the incursion of irrelevant factors into the procedure.

For example, a study looked at decisions made by judges in


20 Eren and Mocan, "Emotional Judges and Unlucky Juveniles," American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

21 For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that "Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana."

22 Danziger, Levav, and Avnaim-Pesso, "Extraneous Factors in Judicial Decisions," Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

23 In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges' attention. See (Lakens, "Impossibly Hungry Judges" (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017)).

24 Weinshall-Margel and Shapard, "Overlooked Factors in the Analysis of Parole Decisions," Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Louisiana juvenile courts, including sentence lengths.20 It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, before gradually dropping to nearly zero right before the judges' food break, and returned to about 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges' mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the "hungry judges" study, and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn't random, and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn't look hard enough: a follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.


25 Huq, "Racial Equity in Algorithmic Criminal Justice," Duke LJ 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84% and 86% is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice. Banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren't admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision – or a fairness intervention – on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we'd like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects, say on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85%. Then we can compare the salary of students with scores of 85% to 86% (and thus were awarded the scholarship) with those of students with scores of 84% to 85% (and thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus we can estimate the impact of the scholarship as if we did a randomized controlled trial.
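A minimal regression-discontinuity sketch for this hypothetical scholarship example follows. The file, column names, and bandwidth are assumptions for illustration only.

```python
# Regression-discontinuity sketch for the hypothetical scholarship example.
# Assumes a data frame with columns: score (test score in percent),
# scholarship (1 if awarded), salary (earnings after college).
import pandas as pd

df = pd.read_csv("students.csv")  # hypothetical file
cutoff, bandwidth = 85, 1  # one-point band on each side of the threshold

just_above = df[(df.score >= cutoff) & (df.score < cutoff + bandwidth)]
just_below = df[(df.score >= cutoff - bandwidth) & (df.score < cutoff)]

# Within this narrow band, treatment is (assumed to be) as good as random,
# so the difference in mean salaries estimates the effect of the scholarship.
effect = just_above.salary.mean() - just_below.salary.mean()
print(f"Estimated effect of the scholarship near the cutoff: {effect:.0f}")
```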

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students


27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.

28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions28 (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.
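One way such a regression test is often operationalized is a model comparison with and without the sensitive attribute. The sketch below is illustrative only, with hypothetical file and column names; it inherits all the brittleness just described.

```python
# Sketch of the regression test: do the non-sensitive attributes collectively
# "explain" the decisions, or does adding the sensitive attribute A improve fit?
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical file with a binary decision column and features x1, x2, A.
df = pd.read_csv("decisions.csv")

restricted = smf.logit("decision ~ x1 + x2", data=df).fit(disp=0)
full = smf.logit("decision ~ x1 + x2 + C(A)", data=df).fit(disp=0)

# Likelihood-ratio test: a significant improvement suggests the decisions
# depend on A beyond what x1 and x2 explain. (Caveat from the text: with a
# rich enough feature set, A can be reconstructed from the other attributes.)
lr_stat = 2 * (full.llf - restricted.llf)
df_diff = full.df_model - restricted.df_model
print("p-value:", stats.chi2.sf(lr_stat, df_diff))
```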

We can of course try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
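A minimal sketch of this kind of conditional demographic parity check is below, with hypothetical data and column names; which covariates to control for is precisely the contested question discussed next.

```python
# Sketch of a conditional demographic parity check for the pay gap example.
# Assumes a data frame with columns: wage, gender, gpa, review_score.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("employees.csv")  # hypothetical file

raw = smf.ols("wage ~ C(gender)", data=df).fit()
adjusted = smf.ols("wage ~ C(gender) + gpa + review_score", data=df).fit()

# Compare the gender coefficient before and after controlling for the chosen
# covariates; whether these are the right controls is a normative question,
# not a statistical one.
print(raw.params.filter(like="gender"))
print(adjusted.params.filter(like="gender"))
```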

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might


alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter we've presented many scenarios – screening job candidates, peer review, parole hearings – that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute, and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
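A minimal sketch of this binned comparison, as the decision maker might run it, is below; the file and column names are hypothetical.

```python
# Sketch of a sufficiency check available to the bank: within narrow bins of
# the predicted default probability R, compare observed default rates by group.
# Assumes a data frame with columns: R (score), A (group), Y (1 if defaulted).
import pandas as pd

df = pd.read_csv("loans.csv")  # hypothetical file
df["R_bin"] = pd.cut(df["R"], bins=10)

# Under sufficiency (Y independent of A given R), default rates should match
# across groups within each score bin, up to sampling noise.
print(df.groupby(["R_bin", "A"], observed=True)["Y"].agg(["mean", "count"]))
```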

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ (i.e., whether or not the loan was approved) and not R. Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same. This


30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

observational test is called the outcome test (row 7 in the summary table).
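For contrast with the sufficiency check above, here is a sketch of the outcome test from the external researcher's vantage point (hypothetical file and column names).

```python
# Sketch of the outcome test available to an external researcher who observes
# only the decision, the group, and the outcome for approved applicants.
# Assumes a data frame with columns: approved (1/0), A (group), Y (1 if defaulted).
import pandas as pd

df = pd.read_csv("loans_public.csv")  # hypothetical file

# Default rate among approved applicants, by group. Differences here indicate
# a violation of predictive parity -- which, as explained next, does not by
# itself establish that different thresholds were applied (infra-marginality).
print(df[df.approved == 1].groupby("A")["Y"].mean())
```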

Here is a tempting argument based on the outcome test: if one group (say women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups, women (orange) and men (blue).

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y) whose parameters can be inferred from the data, whereas the outcome test is model-free.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal


31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.
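A toy simulation can make the selection bias concrete. Everything below is made up for illustration; the point is only that quantities computed on the approved subsample need not reflect the full applicant population.

```python
# Toy simulation of the selective labels problem: outcomes are observed only
# for applicants who received a favorable decision, so statistics computed on
# that subsample are biased relative to the full population.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
risk = rng.uniform(0, 1, n)                           # true default probability
y = rng.binomial(1, risk)                             # realized outcome
score = np.clip(risk + rng.normal(0, 0.2, n), 0, 1)   # bank's noisy score R
approved = score < 0.5                                # favorable decision

print("population default rate:   ", y.mean())
print("default rate among approved:", y[approved].mean())
# Only the second number is observable to the bank; it says little about how
# the system would have performed on the rejected applicants.
```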

More generally the selective labels problem is the issue of selec-tion bias in evaluating decision making systems due to the fact thatthe very selection process we wish to study determines the sampleof instances that are observed It is not specific to the issue of testingseparation or error rates it affects the measurement of other fun-damental metrics such as accuracy as well It is a serious and oftenoverlooked issue that has been the subject of recent study31

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications it is indeed possible, and important, to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

| Test / study design | Fairness notion / application | Access | Notes / limitations |
|---|---|---|---|
| 1. Audit study | Blindness | A-exp =, X =, R | Difficult to interpret |
| 2. Natural experiment, especially diff-in-diff | Impact of blinding | A-exp ~, R | Confounding, SUTVA violations, other |
| 3. Natural experiment | Arbitrariness | W ~, R | Unobserved confounders |
| 4. Natural experiment, especially regression discontinuity | Impact of decision | R, Y or Y′ | Sample size, confounding, other technical difficulties |
| 5. Regression analysis | Blindness | X, A, R | Unreliable due to proxies |
| 6. Regression analysis | Conditional demographic parity | X, A, R | Weak moral justification |
| 7. Outcome test | Predictive parity | A, Y ∣ Ŷ = 1 | Infra-marginality |
| 8. Threshold test | Sufficiency | X′, A, Y ∣ Ŷ = 1 | Model-specific |
| 9. Experiment | Separation / error rate parity | A, R, Ŷ =, Y | Often unethical or impractical |
| 10. Observational test | Demographic parity | A, R | See Chapter 2 |
| 11. Mediation analysis | "Relevant" mechanism | X, A, R | See Chapter 4 |

Legend

• "=": indicates intervention on some variable (that is, "X =" does not represent a new random variable but is simply an annotation describing how X is used in the test)
• "~": natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice for a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold: first, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
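A small numerical illustration of this simplest model follows. All numbers are made up; the point is only that with different base rates and a noisy signal, the Bayes-optimal prediction differs between groups even for identical observed features.

```python
# Numerical illustration of statistical discrimination (all numbers made up).
# Two groups with different base rates of the target; a noisy binary signal X
# with the same accuracy in both groups.
base_rate = {"group_1": 0.10, "group_2": 0.20}
p_signal_given_positive = 0.7   # P(X = 1 | Y = 1), same for both groups
p_signal_given_negative = 0.3   # P(X = 1 | Y = 0), same for both groups

for group, prior in base_rate.items():
    # Bayes rule: P(Y = 1 | X = 1, group)
    num = p_signal_given_positive * prior
    den = num + p_signal_given_negative * (1 - prior)
    print(group, "P(Y=1 | X=1) =", round(num / den, 3))
# Identical observed signal, different optimal prediction: the defining
# feature of statistical discrimination in this simple model.
```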

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision-maker's information and beliefs. For example, at first sight the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects, and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business; as for statistical discrimination, either it is argued to be justified, or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men, a desire to compensate for past disadvantage suffered by women in STEM fields, a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated), a response to financial incentives for diversification frequently provided by universities to STEM departments, and an assumption by decision makers that due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit, and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120

interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities – the pedigree


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the book's title – are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social class based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants – notably applicants from China – would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece. (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014.)

an assignment of applicants to hospitals that optimizes mutual desirability.42

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints, and then match the other member to a proximate hospital if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas: algorithmic decision making, various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.

47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).

48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv preprint arXiv:1805.04508, 2018.

53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
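A sketch of this kind of disparity check is below. The corpus file paths are hypothetical; the langid package provides the classify function used here, but this is not the original study's code.

```python
# Sketch of a language-identification disparity check. Assumes two text files
# (hypothetical paths), one tweet per line, for the AAE-aligned and
# White-aligned corpora. Uses the langid package (`pip install langid`).
import langid

def non_english_rate(path):
    with open(path) as f:
        tweets = [line.strip() for line in f if line.strip()]
    labels = [langid.classify(t)[0] for t in tweets]  # (language, score)
    return sum(lab != "en" for lab in labels) / len(labels)

print("AAE-aligned:  ", non_english_rate("aae_tweets.txt"))
print("White-aligned:", non_english_rate("white_aligned_tweets.txt"))
```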

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word

embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

| Type of task | Examples | Sources of disparity | Harm |
|---|---|---|---|
| Perception | Language id, speech-to-text | Underrepresentation in training corpus | Degraded service |
| Automating judgment | Toxicity detection, essay grading | Human labels, underrepresentation in training corpus | Adverse decisions |
| Predicting outcomes | Resume filtering | Various, including human labels | Adverse decisions |
| Sequence prediction | Language generation, translation | Cultural stereotypes, historical prejudices | Representational harm |

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1–20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).

59 Martineau, "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition/, 2019).

60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.

61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).

62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8–19.2% difference in error rate), and worst on darker female faces (20.8–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
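To illustrate the first, easier fix, here is a minimal sketch of choosing a single decision threshold that balances the two error directions on a labeled evaluation set. The data and function are hypothetical; this is not how any of the vendors' systems work, and it does nothing about the disparity across skin tones.

```python
# Sketch: pick a decision threshold so that the two error directions
# (female classified as male, male classified as female) are balanced
# on a held-out labeled set. Hypothetical data; illustration only.
import numpy as np

def balanced_threshold(scores, labels, grid=np.linspace(0.01, 0.99, 99)):
    """scores: array of model outputs P(male); labels: 1 for male, 0 for female."""
    best_t, best_gap = 0.5, float("inf")
    for t in grid:
        pred_male = scores >= t
        err_female = np.mean(pred_male[labels == 0])   # female classified as male
        err_male = np.mean(~pred_male[labels == 1])    # male classified as female
        gap = abs(err_female - err_male)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```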

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.
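A minimal sketch of the kind of per-group evaluation this concern calls for is below; the file, column names, and metric are assumptions for illustration, not the methodology of the cited study.

```python
# Sketch of a per-group evaluation of a recommender: compare prediction error
# on held-out ratings across user groups. Assumes a data frame with columns
# rating (true rating), predicted (model output), group (user demographic).
import pandas as pd

df = pd.read_csv("heldout_ratings.csv")  # hypothetical file
df["sq_err"] = (df["rating"] - df["predicted"]) ** 2
rmse_by_group = df.groupby("group")["sq_err"].mean() ** 0.5

# Systematically higher error for a minority group is the kind of
# underperformance described in the text.
print(rmse_by_group)
```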

In general, this type of unfairness is hard to study in real systems (not just by external researchers but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate: reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).

68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).

69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.

70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse they become the ar-biters of conflicts between competing interests and viewpoints Froma legal perspective these issues fall primarily under antitrust law andtelecommunication regulation rather than antidiscrimination law67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.
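
As a rough illustration of such a calibration check (a simplification, not the study's analysis), the sketch below takes two parallel arrays, the fraction of women in the image search results and in the real-world occupation, one entry per occupation, and summarizes how far the relationship departs from perfect calibration.

import numpy as np

def calibration_check(search_frac, real_frac):
    """Compare gender skew in image search results to occupational statistics.

    search_frac, real_frac: arrays with one entry per occupation, giving the
    fraction of women among the search results and in the occupation itself.
    Perfect calibration corresponds to slope 1 and intercept 0; a slope above 1
    would indicate that real-world skews are exaggerated in the search results.
    """
    slope, intercept = np.polyfit(real_frac, search_frac, deg=1)
    mean_deviation = np.mean(np.asarray(search_frac) - np.asarray(real_frac))
    return slope, intercept, mean_deviation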

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.
72 The economic analysis of advertising includes a third category, complementary, that is related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).
74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative — that is, exerting covert influence instead of making forthright appeals — or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias: An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.
76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).
78 Ali et al., "Discrimination Through Optimization."
79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
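
A toy simulation of this budget effect, with entirely made-up per-impression costs rather than anything measured on a real platform, shows how the cheaper group comes to dominate the audience as the budget shrinks when delivery simply buys the cheapest impressions first.

import numpy as np

rng = np.random.default_rng(0)

def audience_composition(budget, n=10_000, cost_a=0.9, cost_b=1.1):
    """Fraction of delivered impressions going to group A under a fixed budget.

    Half the candidate impressions belong to group A (cheaper on average) and
    half to group B; the platform greedily buys the cheapest impressions first.
    """
    costs = np.concatenate([rng.normal(cost_a, 0.1, n // 2),   # group A
                            rng.normal(cost_b, 0.1, n // 2)])  # group B
    groups = np.array(["A"] * (n // 2) + ["B"] * (n // 2))
    order = np.argsort(costs)                                  # cheapest first
    delivered = groups[order][np.cumsum(costs[order]) <= budget]
    return (delivered == "A").mean()

for budget in [500, 2_000, 8_000]:
    print(budget, round(audience_composition(budget), 2))   # small budgets skew toward A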

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially for analyzing Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).
82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those in traditional settings such as housing and employment: a combination of audit studies and observational methods has been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race(African-American or White) and gender (female or male) but wereotherwise identical Twenty different names were used five in eachcombination of race and gender They then contacted the hosts of6400 listings in five cities through these accounts to inquire aboutavailability They found a 50 probability of acceptance of inquiriesfrom guests with White-sounding names compared to 42 for guestswith African-American-sounding names The effect was persistentregardless of the hostrsquos race gender and experience on the platformas well as listing type (high or low priced entire property or shared)and diversity of the neighborhood Note that the accounts did nothave profile pictures if inference of race by hosts happens in partbased on appearance a study design that varied the accountsrsquo profilepictures might find a greater effect
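
As a back-of-the-envelope check that a gap of this size is not chance variation, here is a two-proportion z-test sketch; the even split of the roughly 6,400 inquiries across the two name groups is our simplifying assumption, not the study's exact analysis.

import numpy as np
from scipy.stats import norm

def two_proportion_z(accepted_1, n_1, accepted_2, n_2):
    """Two-sided z-test for a difference between two acceptance rates."""
    p1, p2 = accepted_1 / n_1, accepted_2 / n_2
    p_pool = (accepted_1 + accepted_2) / (n_1 + n_2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))

# Assume ~3,200 inquiries per name group with 50% vs. 42% acceptance.
z, p = two_proportion_z(accepted_1=1600, n_1=3200, accepted_2=1344, n_2=3200)
print(round(z, 1), p)   # a gap this large is far outside chance variation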

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but it allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.
87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms presents a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv preprint arXiv:1812.00099, 2018.
89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).
90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to the field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and the immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-DeVries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

A final example, to reinforce the fact that disparity-producing mechanisms can be subtle and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
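
Stated as a testable hypothesis (with hypothetical inputs, since the journalists inferred the rule from observed prices rather than from the site's code), the reported rule is roughly:

def predicted_price_tier(distance_to_competitor_miles, threshold_miles=20.0):
    """Reported rule: customers near a competitor's physical store see a discount."""
    return "discounted" if distance_to_competitor_miles <= threshold_miles else "full"

# An audit would compare these predictions to the prices actually observed per
# ZIP code and ask how much of the variation the rule explains.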

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than through a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or on body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
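
For concreteness, here is a minimal sketch of how such observational metrics might be tabulated in an audit, given arrays of binary decisions, outcomes, and group labels; the variable names are ours, not from any particular study.

import numpy as np

def observational_metrics(decision, outcome, group):
    """Per-group acceptance rate, error rates, and precision from an audit dataset.

    decision, outcome, group: NumPy arrays of equal length; decision and outcome
    are binary (1 = accepted / positive outcome).
    """
    results = {}
    for g in np.unique(group):
        d, y = decision[group == g], outcome[group == g]
        results[g] = {
            "acceptance_rate": d.mean(),               # compared for demographic parity
            "false_positive_rate": d[y == 0].mean(),   # compared for error rate parity
            "false_negative_rate": 1 - d[y == 1].mean(),
            "precision": y[d == 1].mean(),             # related to calibration
        }
    return results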

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result — e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).
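
A minimal sketch of steps 4 and 5, assuming the page contents from step 3 are available as raw text and using a bag-of-words classifier (neither of which is prescribed by the original study), might look like this:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def adversarial_test(pages_a, pages_b, n_permutations=200, seed=0):
    """Steps 4-5: can a classifier tell group A's pages from group B's?

    pages_a, pages_b: lists of raw page text collected by the two bot groups.
    Returns the cross-validated accuracy and a permutation-test p-value for the
    null hypothesis that the two groups' pages are indistinguishable.
    """
    texts = pages_a + pages_b
    labels = np.array([0] * len(pages_a) + [1] * len(pages_b))
    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

    observed = cross_val_score(clf, texts, labels, cv=5).mean()

    rng = np.random.default_rng(seed)
    null_scores = [cross_val_score(clf, texts, rng.permutation(labels), cv=5).mean()
                   for _ in range(n_permutations)]   # shuffled labels break any real group difference
    p_value = np.mean(np.array(null_scores) >= observed)
    return observed, p_value

If the health website keeps its promise, the cross-validated accuracy should hover around 1/2 and the permutation p-value should be large.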

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on the observable behavior of the system is merely a proxy for it.

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting — in which actions taken on one website, such as searching for a product, result in ads for that product on another website — to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
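
In outline, such a sweep might look like the sketch below. The URL, cookie name, and price-extraction logic are hypothetical placeholders, since the article does not document the site's internals, and a real crawl would also need rate limiting and permission considerations.

import re
import time
import requests

BASE_URL = "https://www.example-retailer.com/product/12345"   # hypothetical
COOKIE_NAME = "inferred_zip"                                   # hypothetical

def price_for_zip(zip_code):
    """Fetch the product page as if the visitor were located in the given ZIP code."""
    resp = requests.get(BASE_URL, cookies={COOKIE_NAME: zip_code}, timeout=10)
    match = re.search(r"\$(\d+\.\d{2})", resp.text)            # naive price extraction
    return float(match.group(1)) if match else None

prices = {}
for zip_code in ["10001", "60601", "94103"]:   # the study swept every US ZIP code
    prices[zip_code] = price_for_zip(zip_code)
    time.sleep(1)                              # be polite to the server
print(prices)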

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violations in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107 Lum and Isaac, "To Predict and Serve," Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, a task that is further complicated by the limitations of the data available. The paper documents the substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, while another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan Amanda and Sonja Starr ldquoBan the Box Criminal Records andRacial Discrimination A Field Experimentrdquo The Quarterly Journalof Economics 133 no 1 (2017) 191ndash235

Ali Muhammad Piotr Sapiezynski Miranda Bogen Aleksandra Ko-rolova Alan Mislove and Aaron Rieke ldquoDiscrimination ThroughOptimization How Facebookrsquos Ad Delivery Can Lead to BiasedOutcomesrdquo Proceedings of the ACM on Human-Computer Interaction3 no CSCW (2019) 199

Amorim Evelin Marcia Canccedilado and Adriano Veloso ldquoAutomatedEssay Scoring in the Presence of Biased Ratingsrdquo In Proceedings ofthe 2018 Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language Technologies Volume1 (Long Papers) 229ndash37 2018

Andreou Athanasios Oana Goga Krishna Gummadi LoiseauPatrick and Alan Mislove ldquoAdAnalystrdquo httpsadanalyst

mpi-swsorg 2017Angwin Julia Madeleine Varner and Ariana Tobin ldquoFacebook En-

abled Advertisers to Reach lsquoJew Hatersrsquordquo ProPublica httpswww20propublica20orgarticlefacebook-enabled-advertisers-to-reach-jew-haters2017

Arrow Kenneth ldquoThe Theory of Discriminationrdquo Discrimination inLabor Markets 3 no 10 (1973) 3ndash33

Ayres Ian ldquoThree Tests for Measuring Unjustified Disparate Impactsin Organ Transplantation The Problem of Included VariableBiasrdquo Perspectives in Biology and Medicine 48 no 1 (2005) 68ndashS87

Ayres Ian Mahzarin Banaji and Christine Jolls ldquoRace Effects oneBayrdquo The RAND Journal of Economics 46 no 4 (2015) 891ndash917

Ayres Ian and Peter Siegelman ldquoRace and Gender Discrimination inBargaining for a New Carrdquo The American Economic Review 1995304ndash21

Bagwell Kyle ldquoThe Economic Analysis of Advertisingrdquo Handbook ofIndustrial Organization 3 (2007) 1701ndash844

38 solon barocas moritz hardt arvind narayanan

Bashir Muhammad Ahmad Sajjad Arshad William Robertsonand Christo Wilson ldquoTracing Information Flows Between AdExchanges Using Retargeted Adsrdquo In USENIX Security Symposium16 481ndash96 2016

Becker Gary S The Economics of Discrimination University of ChicagoPress 1957

Bennett James and Stan Lanning ldquoThe Netflix Prizerdquo In Proceedingsof KDD Cup and Workshop 200735 New York NY USA 2007

Bertrand Marianne and Esther Duflo ldquoField Experiments on Dis-criminationrdquo In Handbook of Economic Field Experiments 1309ndash93Elsevier 2017

Bertrand Marianne Esther Duflo and Sendhil Mullainathan ldquoHowMuch Should We Trust Differences-in-Differences Estimatesrdquo TheQuarterly Journal of Economics 119 no 1 (2004) 249ndash75

Bertrand Marianne and Sendhil Mullainathan ldquoAre Emily and GregMore Employable Than Lakisha and Jamal A Field Experimenton Labor Market Discriminationrdquo American Economic Review 94no 4 (2004) 991ndash1013

Bird Sarah Solon Barocas Kate Crawford Fernando Diaz andHanna Wallach ldquoExploring or Exploiting Social and EthicalImplications of Autonomous Experimentation in AIrdquo In Workshopon Fairness Accountability and Transparency in Machine Learning2016

Blank Rebecca M ldquoThe Effects of Double-Blind Versus Single-BlindReviewing Experimental Evidence from the American EconomicReviewrdquo The American Economic Review 1991 1041ndash67

Bogen Miranda and Aaron Rieke ldquoHelp wanted an examinationof hiring algorithms equity and biasrdquo Technical report Upturn2018

Buolamwini Joy and Timnit Gebru ldquoGender Shades IntersectionalAccuracy Disparities in Commercial Gender Classificationrdquo InConference on Fairness Accountability and Transparency 77ndash91 2018

Buranyi Stephen ldquoHow to Persuade a Robot That You Should Getthe Jobrdquo Guardian 2018

Chaney Allison JB Brandon M Stewart and Barbara E EngelhardtldquoHow Algorithmic Confounding in Recommendation SystemsIncreases Homogeneity and Decreases Utilityrdquo In Proceedings ofthe 12th ACM Conference on Recommender Systems 224ndash32 ACM2018

Chen Le Alan Mislove and Christo Wilson ldquoPeeking Beneath theHood of Uberrdquo In Proceedings of the 2015 Internet MeasurementConference 495ndash508 ACM 2015

Chouldechova Alexandra Diana Benavides-Prado Oleksandr Fialkoand Rhema Vaithianathan ldquoA Case Study of Algorithm-Assisted

fairness and machine learning - 2021-06-04 39

Decision Making in Child Maltreatment Hotline Screening Deci-sionsrdquo In Conference on Fairness Accountability and Transparency134ndash48 2018

Coltrane Scott and Melinda Messineo ldquoThe Perpetuation of SubtlePrejudice Race and Gender Imagery in 1990s Television Advertis-ingrdquo Sex Roles 42 no 5ndash6 (2000) 363ndash89

DrsquoOnfro Jillian ldquoGoogle Tests Changes to Its Search Algorithm HowSearch Worksrdquo httpswwwcnbccom20180917google-tests-changes-to-its-search-algorithm-how-search-works

html 2019Danziger Shai Jonathan Levav and Liora Avnaim-Pesso ldquoExtra-

neous Factors in Judicial Decisionsrdquo Proceedings of the NationalAcademy of Sciences 108 no 17 (2011) 6889ndash92

Dastin Jeffrey ldquoAmazon Scraps Secret AI Recruiting Tool ThatShowed Bias Against Womenrdquo Reuters 2018

Datta Amit Michael Carl Tschantz and Anupam Datta ldquoAutomatedExperiments on Ad Privacy Settingsrdquo Proceedings on PrivacyEnhancing Technologies 2015 no 1 (2015) 92ndash112

De-Arteaga Maria Alexey Romanov Hanna Wallach JenniferChayes Christian Borgs Alexandra Chouldechova Sahin GeyikKrishnaram Kenthapadi and Adam Tauman Kalai ldquoBias in BiosA Case Study of Semantic Representation Bias in a High-StakesSettingrdquo In Proceedings of the Conference on Fairness Accountabilityand Transparency 120ndash28 ACM 2019

Edelman Benjamin Michael Luca and Dan Svirsky ldquoRacial Dis-crimination in the Sharing Economy Evidence from a FieldExperimentrdquo American Economic Journal Applied Economics 9 no 2

(2017) 1ndash22Ensign Danielle Sorelle A Friedler Scott Neville Carlos Scheidegger

and Suresh Venkatasubramanian ldquoRunaway Feedback Loops inPredictive Policingrdquo arXiv Preprint arXiv170609847 2017

Eren Ozkan and Naci Mocan ldquoEmotional Judges and UnluckyJuvenilesrdquo American Economic Journal Applied Economics 10 no 3

(2018) 171ndash205Freeman Jonathan B Andrew M Penner Aliya Saperstein Matthias

Scheutz and Nalini Ambady ldquoLooking the Part Social StatusCues Shape Race Perceptionrdquo PloS One 6 no 9 (2011) e25107

Friedman Batya and Helen Nissenbaum ldquoBias in Computer Sys-temsrdquo ACM Transactions on Information Systems (TOIS) 14 no 3

(1996) 330ndash47Frucci Adam ldquoHP Face-Tracking Webcams Donrsquot Recognize Black

Peoplerdquo httpsgizmodocomhp-face-tracking-webcams-dont-recognize-black-people-54311902009

Ge Yanbo Christopher R Knittel Don MacKenzie and StephenZoepf ldquoRacial and Gender Discrimination in Transportation

40 solon barocas moritz hardt arvind narayanan

Network Companiesrdquo National Bureau of Economic Research2016

Gillespie Tarleton Custodians of the Internet Platforms Content Mod-eration and the Hidden Decisions That Shape Social Media YaleUniversity Press 2018

mdashmdashmdash ldquoThe Politics of lsquoPlatformsrsquordquo New Media amp Society 12 no 3

(2010) 347ndash64Golebiewski M and D Boyd ldquoData Voids Where Missing Data Can

Easily Be Exploitedrdquo Data amp Society 29 (2018)Green Lisa J African American English A Linguistic Introduction

Cambridge University Press 2002Hannak Aniko Piotr Sapiezynski Arash Molavi Kakhki Balachan-

der Krishnamurthy David Lazer Alan Mislove and ChristoWilson ldquoMeasuring Personalization of Web Searchrdquo In Proceed-ings of the 22nd International Conference on World Wide Web 527ndash38ACM 2013

Harvey Adam and Jules LaPlace ldquoMegaPixels Origins Ethics andPrivacy Implications of Publicly Available Face Recognition ImageDatasetsrdquo 2019 httpsmegapixelscc

Hern Alex ldquoFlickr Faces Complaints over lsquoOffensiversquoauto-Taggingfor Photosrdquo The Guardian 20 (2015)

Huq Aziz Z ldquoRacial Equity in Algorithmic Criminal Justicerdquo DukeLJ 68 (2018) 1043

Hutson Jevan A Jessie G Taft Solon Barocas and Karen Levy ldquoDe-biasing Desire Addressing Bias amp Discrimination on IntimatePlatformsrdquo Proceedings of the ACM on Human-Computer Interaction2 no CSCW (2018) 73

Kang Sonia K Katherine A DeCelles Andraacutes Tilcsik and Sora JunldquoWhitened Resumes Race and Self-Presentation in the LaborMarketrdquo Administrative Science Quarterly 61 no 3 (2016) 469ndash502

Kay Matthew Cynthia Matuszek and Sean A Munson ldquoUnequalRepresentation and Gender Stereotypes in Image Search Resultsfor Occupationsrdquo In Proceedings of the 33rd Annual ACM Conferenceon Human Factors in Computing Systems 3819ndash28 ACM 2015

Kiritchenko Svetlana and Saif M Mohammad ldquoExamining Genderand Race Bias in Two Hundred Sentiment Analysis SystemsrdquoarXiv Preprint arXiv180504508 2018

Klonick Kate ldquoThe New Governors The People Rules and Pro-cesses Governing Online Speechrdquo Harv L Rev 131 (2017) 1598

Kohler-Hausmann Issa ldquoEddie Murphy and the Dangers of Coun-terfactual Causal Thinking about Detecting Racial DiscriminationrdquoNw UL Rev 113 (2018) 1163

Lakens Daniel ldquoImpossibly Hungry Judgesrdquo httpsdaniellakens

blogspotcom201707impossibly-hungry-judgeshtml 2017

fairness and machine learning - 2021-06-04 41

Lakkaraju Himabindu Jon Kleinberg Jure Leskovec Jens Ludwigand Sendhil Mullainathan ldquoThe Selective Labels Problem Evalu-ating Algorithmic Predictions in the Presence of UnobservablesrdquoIn Proceedings of the 23rd ACM SIGKDD International Conference onKnowledge Discovery and Data Mining 275ndash84 ACM 2017

Lambrecht Anja and Catherine Tucker ldquoAlgorithmic Bias AnEmpirical Study of Apparent Gender-Based Discrimination in theDisplay of STEM Career Adsrdquo Management Science 2019

Lee Min Kyung Daniel Kusbit Evan Metsky and Laura DabbishldquoWorking with Machines The Impact of Algorithmic and Data-Driven Management on Human Workersrdquo In Proceedings of the33rd Annual ACM Conference on Human Factors in ComputingSystems 1603ndash12 ACM 2015

Levy Karen and Solon Barocas ldquoDesigning Against Discriminationin Online Marketsrdquo Berkeley Tech LJ 32 (2017) 1183

Lum Kristian and William Isaac ldquoTo Predict and Serverdquo Signifi-cance 13 no 5 (2016) 14ndash19

Martineau Paris ldquoCities Examine Propermdashand ImpropermdashUses ofFacial Recognition | WIREDrdquo httpswwwwiredcomstory

cities-examine-proper-improper-facial-recognition 2019McEntegart Jane ldquoKinect May Have Issues with Dark-Skinned Users

| Tomrsquos Guiderdquo httpswwwtomsguidecomusMicrosoft-Kinect-Dark-Skin-Facial-Recognition

news-8638html 2010Mehrotra Rishabh Ashton Anderson Fernando Diaz Amit Sharma

Hanna Wallach and Emine Yilmaz ldquoAuditing Search Enginesfor Differential Satisfaction Across Demographicsrdquo In Proceedingsof the 26th International Conference on World Wide Web Companion626ndash33 2017

Muthukumar Vidya Tejaswini Pedapati Nalini Ratha PrasannaSattigeri Chai-Wah Wu Brian Kingsbury Abhishek KumarSamuel Thomas Aleksandra Mojsilovic and Kush R VarshneyldquoUnderstanding Unequal Gender Classification Accuracy fromFace Imagesrdquo arXiv Preprint arXiv181200099 2018

Neckerman Kathryn M and Joleen Kirschenman ldquoHiring StrategiesRacial Bias and Inner-City Workersrdquo Social Problems 38 no 4

(1991) 433ndash47Noble Safiya Umoja Algorithms of Oppression How Search Engines

Reinforce Racism nyu Press 2018Norton Helen ldquoThe Supreme Courtrsquos Post-Racial Turn Towards a

Zero-Sum Understanding of Equalityrdquo Wm amp Mary L Rev 52

(2010) 197OrsquoTOOLE ALICE J KENNETH DEFFENBACHER Herveacute Abdi

and JAMES C BARTLETT ldquoSimulating the lsquoOther-Race Effectrsquoasa Problem in Perceptual Learningrdquo Connection Science 3 no 2

42 solon barocas moritz hardt arvind narayanan

(1991) 163ndash78Obermeyer Ziad Brian Powers Christine Vogeli and Sendhil Mul-

lainathan ldquoDissecting Racial Bias in an Algorithm Used to Man-age the Health of Populationsrdquo Science 366 no 6464 (2019)447ndash53

Ojala Markus and Gemma C Garriga ldquoPermutation Tests for Study-ing Classifier Performancerdquo Journal of Machine Learning Research11 no Jun (2010) 1833ndash63

Pager Devah ldquoThe Use of Field Experiments for Studies of Employ-ment Discrimination Contributions Critiques and Directionsfor the Futurerdquo The Annals of the American Academy of Political andSocial Science 609 no 1 (2007) 104ndash33

Pager Devah and Hana Shepherd ldquoThe Sociology of DiscriminationRacial Discrimination in Employment Housing Credit andConsumer Marketsrdquo Annu Rev Sociol 34 (2008) 181ndash209

Passi Samir and Solon Barocas ldquoProblem Formulation and Fair-nessrdquo In Proceedings of the Conference on Fairness Accountabilityand Transparency 39ndash48 ACM 2019

Phelps Edmund S ldquoThe Statistical Theory of Racism and SexismrdquoThe American Economic Review 62 no 4 (1972) 659ndash61

Pischke Jorn-Steffen ldquoEmpirical Methods in Applied EconomicsLecture Notesrdquo 2005

Posselt Julie R Inside Graduate Admissions Harvard University Press2016

Quillian Lincoln Devah Pager Ole Hexel and Arnfinn H MidtboslashenldquoMeta-Analysis of Field Experiments Shows No Change in RacialDiscrimination in Hiring over Timerdquo Proceedings of the NationalAcademy of Sciences 114 no 41 (2017) 10870ndash75

Raghavan Manish Solon Barocas Jon Kleinberg and Karen LevyldquoMitigating Bias in Algorithmic Employment Screening Eval-uating Claims and Practicesrdquo arXiv Preprint arXiv1906092082019

Ramineni Chaitanya and David Williamson ldquoUnderstanding MeanScore Differences Between the e-rater Automated Scoring Engineand Humans for Demographically Based Groups in the GREGeneral Testrdquo ETS Research Report Series 2018 no 1 (2018) 1ndash31

Rivera Lauren A Pedigree How Elite Students Get Elite Jobs PrincetonUniversity Press 2016

Robertson Ronald E Shan Jiang Kenneth Joseph Lisa FriedlandDavid Lazer and Christo Wilson ldquoAuditing Partisan AudienceBias Within Google Searchrdquo Proceedings of the ACM on Human-Computer Interaction 2 no CSCW (2018) 148

Roth Alvin E ldquoThe Origins History and Design of the ResidentMatchrdquo Jama 289 no 7 (2003) 909ndash12

fairness and machine learning - 2021-06-04 43

Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. “Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms.” ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. “The Risk of Racial Bias in Hate Speech Detection.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78, 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. “No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World.” In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. “The Problem of Infra-Marginality in Outcome Tests for Discrimination.” The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. “When It Comes to Gorillas, Google Photos Remains Blind.” Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. “How the Journal Tested Prices and Deals Online.” Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. “Release Strategies and the Social Impacts of Language Models.” arXiv Preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. “Online Manipulation: Hidden Influences in a Digital World.” Available at SSRN 3306006, 2018.

Tatman, Rachael. “Gender and Dialect Bias in YouTube’s Automatic Captions.” In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. “Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit.” ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. “Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy.” European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. “Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices.” Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. “Americans Reject Tailored Advertising and Three Activities That Enable It.” Available at SSRN 1478214, 2009.

Valentino-DeVries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. “Websites Vary Prices, Deals Based on Users’ Information.” Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. “Investigating Sources of PII Used in Facebook’s Targeted Advertising.” Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. “Does Object Recognition Work for Everyone?” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59, 2019.

Weinshall-Margel, Keren, and John Shapard. “Overlooked Factors in the Analysis of Parole Decisions.” Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. “Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey,” 1979.

Williams, Wendy M., and Stephen J. Ceci. “National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track.” Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. “Predictive Inequity in Object Detection.” arXiv Preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. “Beyond Parity: Fairness Objectives for Collaborative Filtering.” In Advances in Neural Information Processing Systems, 2921–30, 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


7 Freeman et al., “Looking the Part: Social Status Cues Shape Race Perception,” PloS One 6, no. 9 (2011): e25107.

8 In most other domains, say employment testing, demographic disparity would be less valuable because there are relevant differences between candidates. Price discrimination is unusual in that there are no morally salient qualities of buyers that may justify it.

9 Bertrand and Mullainathan, “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination,” American Economic Review 94, no. 4 (2004): 991–1013.

With the benefit of the discussion of ontological instability in Chapter 4, we can understand the authors’ implicit framework for making these decisions. In our view, they treat race as a stable source node in a causal graph and attempt to hold constant all of its descendants, such as attire and behavior, in order to estimate the direct effect of race on the outcome. But what if one of the mechanisms of what we understand as “racial discrimination” is based on attire and behavior differences? The social construction of race suggests that this is plausible.7

Note that the authors did not attempt to eliminate differences in accent between testers. Why not? From a practical standpoint, accent is difficult to manipulate. But a more principled defense of the authors’ choice is that accent is a part of how we understand race, a part of what it means to be Black, White, etc., so that even if the testers could manipulate their accents, they shouldn’t. Accent is subsumed into the “race” node in the causal graph.

To take an informed stance on questions such as this, we need a deep understanding of cultural context and history. They are the subject of vigorous debate in sociology and critical race theory. Our point is this: the design and interpretation of audit studies requires taking positions on contested social questions. It may be futile to search for a single “correct” way to test even the seemingly straightforward fairness notion of whether the decision maker treats similar individuals similarly regardless of race. Controlling for a plethora of attributes is one approach; another is to simply recruit Black testers and White testers, have them behave and bargain as would be their natural inclination, and measure the demographic disparity. Each approach tells us something valuable, and neither is “better.”8

Another famous audit study tested discrimination in the labor market.9 Instead of sending testers in person, the researchers sent in fictitious resumes in response to job ads. Their goal was to test if an applicant’s race had an impact on the likelihood of an employer inviting them for an interview. They signaled race in the resumes by using White-sounding names (Emily, Greg) or Black-sounding names (Lakisha, Jamal). By creating pairs of resumes that were identical except for the name, they found that White names were 50% more likely to result in a callback than Black names. The magnitude of the effect was equivalent to an additional eight years of experience on a resume.
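To make the measurement concrete, here is a minimal sketch of how callback disparities in a correspondence study can be quantified. The data are simulated, with callback probabilities chosen to roughly match the rates reported in the study (about 9.7% for White-sounding and 6.5% for Black-sounding names); the real analysis also exploits the matched design of the resumes and includes further robustness checks.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Simulated correspondence-study data: one row per fictitious resume.
rng = np.random.default_rng(0)
n = 4870  # roughly the number of resumes in the original study
black_name = rng.integers(0, 2, n).astype(bool)          # resume sent with a Black-sounding name?
callback = rng.random(n) < np.where(black_name, 0.065, 0.097)
df = pd.DataFrame({"black_name": black_name, "callback": callback})

rates = df.groupby("black_name")["callback"].mean()
print("callback rate, White-sounding names: %.3f" % rates.loc[False])
print("callback rate, Black-sounding names: %.3f" % rates.loc[True])
print("ratio (White / Black): %.2f" % (rates.loc[False] / rates.loc[True]))

# Two-proportion test of the null hypothesis that callback rates are equal.
counts = df.groupby("black_name")["callback"].agg(["sum", "count"])
table = [[counts.loc[g, "sum"], counts.loc[g, "count"] - counts.loc[g, "sum"]] for g in [False, True]]
chi2, p, _, _ = stats.chi2_contingency(table)
print("chi-squared = %.1f, p = %.4f" % (chi2, p))
```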

Despite the study’s careful design, debates over interpretation have inevitably arisen, primarily due to the use of candidate names as a way to signal race to employers. Did employers even notice the names in all cases, and might the effect have been even stronger


10 Pager, “The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future,” The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

11 Kohler-Hausmann, “Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination,” Nw. U. L. Rev. 113 (2018): 1163.

12 Bertrand and Duflo, “Field Experiments on Discrimination,” in Handbook of Economic Field Experiments, vol. 1 (Elsevier, 2017), 309–93.

13 Bertrand and Duflo.

14 Bertrand and Duflo.

15 Quillian et al., “Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time,” Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

if they had? Or can the observed disparities be better explained based on factors correlated with race, such as a preference for more common and familiar names, or an inference of higher socioeconomic status for the candidates with White-sounding names? (Of course, the alternative explanations don’t make the observed behavior morally acceptable, but they are important to consider.) Although the authors provide evidence against these interpretations, debate has persisted. For a discussion of critiques of the validity of audit studies, see Pager’s survey.10

In any event, like other audit studies, this experiment tests fairness as blindness. Even simple proxies for race, such as residential neighborhood, were held constant between matched pairs of resumes. Thus the design likely underestimates the extent to which morally irrelevant characteristics affect callback rates in practice. This is just another way to say that attribute flipping does not generally produce counterfactuals that we care about, and it is unclear if the effect sizes measured have any meaningful interpretation that generalizes beyond the context of the experiment.

Rather, audit studies are valuable because they trigger a strong and valid moral intuition.11 They also serve a practical purpose: when designed well, they illuminate the mechanisms that produce disparities and help guide interventions. For example, the car bargaining study concluded that the preferences of owners of dealerships don’t explain the observed discrimination, that the preferences of other customers may explain some of it, and that there is strong evidence that dealers themselves (rather than owners or customers) are the primary source of the observed discrimination.

Resume-based audit studies, also known as correspondence studies, have been widely replicated. We briefly present some major findings, with the caveat that there may be publication biases. For example, studies finding no evidence of an effect are, in general, less likely to be published. Alternately, published null findings might reflect poor experiment design, or might simply indicate that discrimination is only expressed in certain contexts.

A 2016 survey lists 30 studies from 15 countries covering nearly all continents, revealing pervasive discrimination against racial and ethnic minorities.12 The method has also been used to study discrimination based on gender, sexual orientation, and physical appearance.13 It has also been used outside the labor market, in retail and academia.14 Finally, trends over time have been studied: a meta-analysis found no change in racial discrimination in hiring against African Americans from 1989 to 2015. There was some indication of declining discrimination against Latinx Americans, although the data on this question was sparse.15


16 Blank, “The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review,” The American Economic Review, 1991, 1041–67.

17 Pischke, “Empirical Methods in Applied Economics: Lecture Notes,” 2005.

Collectively, audit studies have helped nudge the academic and policy debate away from the naive view that discrimination is a concern of a bygone era. From a methodological perspective, our main takeaway from the discussion of audit studies is the complexity of defining and testing blindness.

Testing the impact of blinding

In some situations, it is not possible to test blindness by randomizing the decision maker’s perception of race, gender, or another sensitive attribute. For example, suppose we want to test if there’s gender bias in peer review in a particular research field. Submitting real papers with fictitious author identities may result in the reviewer attempting to look up the author and realizing the deception. A design in which the researcher changes author names to those of real people is even more problematic.

There is a slightly different strategy that’s more viable: an editor of a scholarly journal in the research field could conduct an experiment in which each paper received is randomly assigned to be reviewed in either a single-blind fashion (in which the author identities are known to the referees) or a double-blind fashion (in which author identities are withheld from referees). Indeed, such experiments have been conducted,16 but in general even this strategy can be impractical.

At any rate, suppose that a researcher has access to only observational data on journal review policies and statistics on published papers. Among ten journals in the research field, some introduced double-blind review, and did so in different years. The researcher observes that in each case, right after the switch, the fraction of female-authored papers rose, whereas there was no change for the journals that stuck with single-blind review. Under certain assumptions, this enables estimating the impact of double-blind reviewing on the fraction of accepted papers that are female-authored. This hypothetical example illustrates the idea of a “natural experiment,” so called because experiment-like conditions arise due to natural variation. Specifically, the study design in this case is called differences-in-differences. The first “difference” is between single-blind and double-blind reviewing, and the second “difference” is between journals (row 2 in the summary table).
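A minimal sketch of the difference-in-differences estimate for this hypothetical journal example follows. The panel data, the true effect size, and the variable names are all made up; a real analysis would also need the robustness checks discussed next (corrected standard errors, checks for differential pre-trends, and so on).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
journals = [f"J{i}" for i in range(10)]
# The first five journals switch to double-blind review in different years; the rest never do.
switch_year = {j: (2005 + i if i < 5 else None) for i, j in enumerate(journals)}

rows = []
for j in journals:
    base = rng.normal(0.25, 0.02)  # journal-specific baseline share of female-authored papers
    for year in range(2000, 2015):
        treated = switch_year[j] is not None and year >= switch_year[j]
        share = base + 0.01 * (year - 2000) / 14 + 0.03 * treated + rng.normal(0, 0.01)
        rows.append({"journal": j, "year": year, "double_blind": int(treated), "female_share": share})
panel = pd.DataFrame(rows)

# Two-way fixed effects regression: the coefficient on double_blind is the
# difference-in-differences estimate of the effect of double-blind review.
model = smf.ols("female_share ~ double_blind + C(journal) + C(year)", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["journal"]}
)
print("estimated effect:", round(model.params["double_blind"], 3),
      "std. err.:", round(model.bse["double_blind"], 3))
```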

Differences-in-differences is methodologically nuanced, and a full treatment is beyond our scope.17 We briefly note some pitfalls. There may be unobserved confounders: perhaps the switch to double-blind reviewing at each journal happened as a result of a change in editorship, and the new editors also instituted policies that encouraged


18 Bertrand, Duflo, and Mullainathan, “How Much Should We Trust Differences-in-Differences Estimates?” The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

19 Kang et al., “Whitened Resumes: Race and Self-Presentation in the Labor Market,” Administrative Science Quarterly 61, no. 3 (2016): 469–502.

female authors to submit strong papers. There may also be spillover effects (which violate the Stable Unit Treatment Value Assumption): a change in policy at one journal can cause a change in the set of papers submitted to other journals. Outcomes are serially correlated (if there is a random fluctuation in the gender composition of the research field, due to an entry or exodus of some researchers, the effect will last many years); this complicates the computation of the standard error of the estimate.18 Finally, the effect of double blinding on the probability of acceptance of female-authored papers (rather than on the fraction of accepted papers that are female-authored) is not identifiable using this technique without additional assumptions or controls.

Even though testing the impact of blinding sounds similar to testing blindness, there is a crucial conceptual and practical difference. Since we are not asking a question about the impact of race, gender, or another sensitive attribute, we avoid running into ontological instability. The researcher doesn’t need to intervene on the observable features by constructing fictitious resumes or training testers to use a bargaining script. Instead, the natural variation in features is left unchanged: the study involves real decision subjects. The researcher only intervenes on the decision making procedure (or exploits natural variation in it) and evaluates the impact of that intervention on groups of candidates defined by the sensitive attribute A. Thus, A is not a node in a causal graph, but merely a way to split the units into groups for analysis. Questions of whether the decision maker actually inferred the sensitive attribute, or merely a feature correlated with it, are irrelevant to the interpretation of the study. Further, the effect sizes measured do have a meaning that generalizes to scenarios beyond the experiment. For example, a study tested the effect of “resume whitening,” in which minority applicants deliberately conceal cues of their racial or ethnic identity in job application materials to improve their chances of getting a callback.19 The effects reported in the study are meaningful to job seekers who engage in this practice.

Revealing extraneous factors in decisions

Sometimes natural experiments can be used to show the arbitrariness of decision making, rather than unfairness in the sense of non-blindness (row 3 in the summary table). Recall that arbitrariness is one type of unfairness that we are concerned about in this book (Chapter 3). Arbitrariness may refer to the lack of a uniform decision making procedure or to the incursion of irrelevant factors into the procedure.

For example, a study looked at decisions made by judges in


20 Eren and Mocan, “Emotional Judges and Unlucky Juveniles,” American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

21 For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that “Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana.”

22 Danziger, Levav, and Avnaim-Pesso, “Extraneous Factors in Judicial Decisions,” Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

23 In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges’ attention. See (Lakens, “Impossibly Hungry Judges” (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017)).

24 Weinshall-Margel and Shapard, “Overlooked Factors in the Analysis of Parole Decisions,” Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Louisiana juvenile courts, including sentence lengths.20 It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, before gradually dropping to nearly zero right before the judges’ food break, and returned to about 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges’ mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the “hungry judges” study, and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn’t random, and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn’t look hard enough. A follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.


25 Huq, “Racial Equity in Algorithmic Criminal Justice,” Duke L.J. 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84 and 86 is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice: banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren’t admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision (or a fairness intervention) on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we’d like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects, say, on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85. Then we can compare the salary of students with scores of 85 to 86 (and who thus were awarded the scholarship) with those of students with scores of 84 to 85 (and who thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus we can estimate the impact of the scholarship as if we did a randomized controlled trial.
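Here is a minimal sketch of this comparison on simulated data. The salary model, effect size, and bandwidth are assumptions for illustration; practical regression discontinuity analyses typically fit local regressions on both sides of the cutoff and check sensitivity to the choice of bandwidth.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 20000
score = rng.uniform(60, 100, n)                 # test score (the running variable)
scholarship = (score >= 85).astype(int)         # awarded strictly at the cutoff ("sharp" design)
ability = rng.normal(0, 1, n)
# Simulated later earnings: depend on score and ability, plus a true scholarship effect of 2000.
salary = 30000 + 400 * (score - 60) + 3000 * ability + 2000 * scholarship + rng.normal(0, 5000, n)
df = pd.DataFrame({"score": score, "scholarship": scholarship, "salary": salary})

# Naive comparison: confounded, because high scorers differ from low scorers in many ways.
naive = df[df.scholarship == 1].salary.mean() - df[df.scholarship == 0].salary.mean()
print("naive gap:", round(naive))

# Regression discontinuity: compare within a narrow band around the cutoff,
# allowing a separate linear trend in the score on each side.
band = df[(df.score >= 83) & (df.score <= 87)].copy()
band["centered"] = band.score - 85
rd = smf.ols("salary ~ scholarship + centered + centered:scholarship", data=band).fit()
print("RD estimate of scholarship effect:", round(rd.params["scholarship"]))
```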

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students


27 Norton, “The Supreme Court’s Post-Racial Turn Towards a Zero-Sum Understanding of Equality,” Wm. & Mary L. Rev. 52 (2010): 197.

28 Ayres, “Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias,” Perspectives in Biology and Medicine 48, no. 1 (2005): S68–S87.

29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively “explain” the observed decisions28 (row 5 in the summary table). If they don’t, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.

We can, of course, try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
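As a concrete illustration, the sketch below runs such a regression on hypothetical pay data (all variable names and effect sizes are made up). The coefficient on gender in the first model measures the raw disparity, and in the second the disparity conditional on the chosen controls; as discussed next, which controls are appropriate is a causal and normative question that the regression itself cannot answer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5000
female = rng.integers(0, 2, n)
gpa = rng.normal(3.2, 0.4, n)
# Suppose performance reviews are themselves lower for women, for reasons worth investigating.
review = rng.normal(3.5, 0.7, n) - 0.2 * female
wage = 20 + 4 * gpa + 3 * review - 1.0 * female + rng.normal(0, 2, n)
df = pd.DataFrame({"female": female, "gpa": gpa, "review": review, "wage": wage})

raw = smf.ols("wage ~ female", data=df).fit()
adjusted = smf.ols("wage ~ female + gpa + review", data=df).fit()
print("raw gender gap:         %.2f per hour" % raw.params["female"])
print("conditional gender gap: %.2f per hour" % adjusted.params["female"])
# The conditional gap is smaller because part of the disparity flows through 'review';
# whether that pathway is morally acceptable is not something the regression can tell us.
```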

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might


alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter we’ve presented many scenarios (screening job candidates, peer review, parole hearings) that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank’s decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
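For concreteness, here is a minimal sketch of how a decision maker with access to its own scores could run this check: bin applicants by predicted probability of default and compare observed default rates across groups within each bin. The data are simulated, with a data-generating process chosen so that the scores are calibrated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 100000
group = rng.choice(["a", "b"], n)
risk = np.clip(rng.beta(2, 5, n) + 0.05 * (group == "b"), 0, 1)  # predicted probability of default, R
default = rng.random(n) < risk                                   # outcome Y; drawn so R is calibrated

df = pd.DataFrame({"group": group, "risk": risk, "default": default})
df["bin"] = pd.cut(df.risk, bins=np.linspace(0, 1, 11))

# Sufficiency (Y ⊥ A | R) predicts similar default rates for both groups within each score bin.
table = df.groupby(["bin", "group"], observed=True)["default"].mean().unstack()
print(table.round(3))
```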

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ (i.e., whether or not the loan was approved) and not R. Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same. This


30 Simoiu, Corbett-Davies, and Goel, “The Problem of Infra-Marginality in Outcome Tests for Discrimination,” The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

observational test is called the outcome test (row 7 in the summary table).

Here is a tempting argument based on the outcome test: if one group (say women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).
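The thought experiment can be reproduced with a small simulation, sketched below with made-up risk distributions: both groups face the same approval threshold applied to calibrated risk scores (so the bank applies the same bar, and sufficiency holds), yet default rates among approved applicants differ, so the outcome test flags a violation of predictive parity.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200000

# Calibrated risk scores (probability of default), with different distributions by group.
risk_by_group = {
    "women": rng.beta(2, 8, n),  # concentrated at lower risk
    "men": rng.beta(3, 6, n),    # shifted toward higher risk
}
threshold = 0.3                  # same approval bar for everyone

for label, risk in risk_by_group.items():
    approved = risk < threshold
    defaults = rng.random(n) < risk          # outcomes drawn from the calibrated risk
    print(f"{label}: approval rate {approved.mean():.2f}, "
          f"default rate among approved {defaults[approved].mean():.3f}")

# The default rate among approved men exceeds that among approved women, i.e.
# predictive parity is violated, even though the identical threshold was applied
# to calibrated scores for both groups (sufficiency holds).
```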

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don’t know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal


31 Lakkaraju et al., “The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., “Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI,” in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don’t allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates “upstream” of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates; it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31
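The following sketch, on simulated lending data, illustrates how selective labels distort an error rate estimate: a false positive rate computed only on approved applicants (the ones for whom the outcome is observed) can differ substantially from the false positive rate on the full applicant pool, because the approval decision itself was based on the score. The thresholds and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200000

p = rng.beta(2, 5, n)                              # true probability of default
score = np.clip(p + rng.normal(0, 0.1, n), 0, 1)   # the lender's imperfect risk score R
default = rng.random(n) < p                        # outcome Y, observable only for approved loans

approved = score < 0.4     # the decision: lend only to low-scoring applicants
flagged = score >= 0.3     # for evaluation, call a score of 0.3 or above a "predicted defaulter"

def false_positive_rate(flag, y):
    # fraction of non-defaulters (y == 0) who were flagged as likely defaulters
    return flag[~y].mean()

print("FPR estimated only on approved loans (selective labels):",
      round(false_positive_rate(flagged[approved], default[approved]), 3))
print("FPR on the full applicant pool (normally unobservable):",
      round(false_positive_rate(flagged, default), 3))
```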

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the “prediction” is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible and important to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

Each row lists the test or study design; the fairness notion or application; the access required by the researcher; and notes or limitations.

1. Audit study. Notion: blindness. Access: A-exp=, X=, R. Limitations: difficult to interpret.
2. Natural experiment (especially diff-in-diff). Notion: impact of blinding. Access: A-exp~, R. Limitations: confounding, SUTVA violations, other.
3. Natural experiment. Notion: arbitrariness. Access: W~, R. Limitations: unobserved confounders.
4. Natural experiment (especially regression discontinuity). Notion: impact of decision. Access: R, Y or Y′. Limitations: sample size, confounding, other technical difficulties.
5. Regression analysis. Notion: blindness. Access: X, A, R. Limitations: unreliable due to proxies.
6. Regression analysis. Notion: conditional demographic parity. Access: X, A, R. Limitations: weak moral justification.
7. Outcome test. Notion: predictive parity. Access: A, Y | Ŷ = 1. Limitations: infra-marginality.
8. Threshold test. Notion: sufficiency. Access: X′, A, Y | Ŷ = 1. Limitations: model-specific.
9. Experiment. Notion: separation / error rate parity. Access: A, R, Ŷ=, Y. Limitations: often unethical or impractical.
10. Observational test. Notion: demographic parity. Access: A, R. Limitations: see Chapter 2.
11. Mediation analysis. Notion: “relevant” mechanism. Access: X, A, R. Limitations: see Chapter 4.

Legend:

  • = : indicates intervention on some variable (that is, X= does not represent a new random variable, but is simply an annotation describing how X is used in the test)
  • ~ : natural variation in some variable, exploited by the researcher
  • A-exp: exposure of a signal of the sensitive attribute to the decision maker
  • W: a feature that is considered irrelevant to the decision
  • X′: a set of features which may not coincide with those observed by the decision maker
  • Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, “The Statistical Theory of Racism and Sexism,” The American Economic Review 62, no. 4 (1972): 659–61; Arrow, “The Theory of Discrimination,” Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group; the usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There’s a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and, in fact, discriminates in favor of the female candidates who would take time off if hired).
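A small worked example of this model, with assumed numbers, is given below: the two groups differ in the base rate of the target variable, the observable signal is noisy, and the Bayes-optimal posterior therefore differs by group even for candidates with an identical signal.

```python
from scipy.stats import norm

# Assumed base rates of the target variable (e.g., "will stay on the job"), differing by group.
base_rate = {"group_1": 0.8, "group_2": 0.7}
# Assumed noisy signal: x ~ N(1, 1) if the target is 1, and x ~ N(0, 1) otherwise.
signal = 1.0  # an identical observed signal for two candidates from different groups

for group, prior in base_rate.items():
    like_1 = norm.pdf(signal, loc=1, scale=1) * prior
    like_0 = norm.pdf(signal, loc=0, scale=1) * (1 - prior)
    posterior = like_1 / (like_1 + like_0)
    print(f"{group}: P(target = 1 | x = {signal}) = {posterior:.3f}")

# The optimal prediction differs by group despite identical observables, because the
# signal is imperfect and the base rates differ: "statistical discrimination" in the
# economic sense. Whether acting on it is morally acceptable is a separate question.
```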

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a “taste” for discrimination is simply the result of an imperfect understanding of the decision maker’s information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don’t map to either of these criteria (as


35 For example, laws restricting employers from asking about applicants’ criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, “Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment,” The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, “National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track,” Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It’s also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business; as for statistical discrimination, either it is argued to be justified, or futile to proscribe because firms will find workarounds.35 Of course, that’s not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it’s just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with a CV equivalent to a male candidate’s is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, “Hiring Strategies, Racial Bias, and Inner-City Workers,” Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, “The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets,” Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to, and symbiotic with, quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews, in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and “sell” events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the book’s title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author’s observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants (notably applicants from China) would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors, such as an applicant’s hobby being considered “cool” by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants’ race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we’ve built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, “The Origins, History, and Design of the Resident Match,” JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, “Bias in Computer Systems,” ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece. (Sandvig et al., “Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms,” ICA Pre-Conference on Data and Discrimination, 2014.)

an assignment of applicants to hospitals that optimizes mutual desirability.42

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the “leading member”: the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is in the 2010s that testing unfairness in real-world algorithmic systems has become a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user’s preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, “Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women,” Reuters, 2018.

47 Buranyi, “How to Persuade a Robot That You Should Get the Job” (Guardian, 2018).

48 De-Arteaga et al., “Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, “Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test,” ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, “Automated Essay Scoring in the Presence of Biased Ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., “The Risk of Racial Bias in Hate Speech Detection,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, “Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems,” arXiv Preprint arXiv:1805.04508, 2018.

53 Tatman, “Gender and Dialect Bias in YouTube’s Automatic Captions,” in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of “White-aligned” English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors’ construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
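An audit of this kind reduces to computing per-group false negative rates on a corpus whose texts are known to be English, roughly as sketched below. The example corpora are placeholders for the dialect-aligned tweet sets constructed in the study, and the use of the langid package (assumed here to expose a classify function returning a language code and a score) is one possible choice of tool to audit.

```python
import langid  # pip install langid; langid.classify(text) is assumed to return (lang, score)

# Placeholder corpora: in the study, these were tweets labeled as AAE-aligned or
# White-aligned by a separate demographic-alignment model (see the Measurement chapter).
corpora = {
    "AAE-aligned": ["example tweet 1", "example tweet 2"],
    "White-aligned": ["example tweet 3", "example tweet 4"],
}

# All tweets in both corpora are English, so any non-'en' prediction is a false negative.
for group, tweets in corpora.items():
    predictions = [langid.classify(t)[0] for t in tweets]
    false_negative_rate = sum(lang != "en" for lang in predictions) / len(predictions)
    print(f"{group}: false negative rate for English = {false_negative_rate:.3f}")
```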

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to “explain” disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., “Release Strategies and the Social Impacts of Language Models,” arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task        | Examples                           | Sources of disparity                         | Harm
Perception          | Language id., speech-to-text       | Underrep. in training corpus                 | Degraded service
Automating judgment | Toxicity detection, essay grading  | Human labels, underrep. in training corpus   | Adverse decisions
Predicting outcomes | Resume filtering                   | Various, including human labels              | Adverse decisions
Sequence prediction | Language generation, translation   | Cultural stereotypes, historical prejudices  | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1% – 20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition/, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8% – 19.2% difference in error rate), and worst on darker female faces (20.8% – 34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
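
As a minimal sketch of this kind of decomposition, the snippet below tabulates misclassification rates by intersectional group from a table of predictions; the dataframe and column names are hypothetical, not those of any particular benchmark or API.

```python
# Minimal sketch: decompose a gender classifier's errors by intersectional group.
# The data and column names (label, prediction, skin_tone) are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "label":      ["F", "F", "M", "M", "F", "M", "F", "M"],
    "prediction": ["M", "F", "M", "M", "M", "M", "F", "F"],
    "skin_tone":  ["darker", "lighter", "darker", "lighter",
                   "darker", "darker", "lighter", "lighter"],
})

# Error rate per (skin tone, true gender) cell, i.e., per intersectional group.
df["error"] = df["label"] != df["prediction"]
print(df.groupby(["skin_tone", "label"])["error"].mean())
```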

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.
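
The following toy simulation (our own illustration, not the cited study's model) shows the basic effect: a recommender that learns pooled item averages serves the majority group's tastes better than those of a smaller group with different preferences.

```python
# Toy simulation (not the cited study's setup): a system that learns pooled item
# averages fits the majority group's preferences better than the minority's.
import numpy as np

rng = np.random.default_rng(0)
n_items = 50
majority_pref = rng.uniform(1, 5, n_items)   # "true" mean rating per item, group 1
minority_pref = rng.uniform(1, 5, n_items)   # a different preference profile, group 2

def sample_ratings(pref, n_users):
    return pref + rng.normal(0, 0.5, size=(n_users, n_items))

train = np.vstack([sample_ratings(majority_pref, 900),   # 90% of training users
                   sample_ratings(minority_pref, 100)])  # 10% of training users
item_means = train.mean(axis=0)                          # what the system "learns"

def rmse(pref, n_users=200):
    test = sample_ratings(pref, n_users)
    return np.sqrt(((test - item_means) ** 2).mean())

print("majority RMSE:", round(rmse(majority_pref), 2))
print("minority RMSE:", round(rmse(minority_pref), 2))
```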

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence of stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
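
A minimal sketch of this style of comparison is below; the occupations and fractions are made-up placeholders rather than the study's data.

```python
# Minimal sketch of a calibration-style comparison: for each occupation, compare
# the fraction of women in image search results against occupational statistics.
# All numbers are made up for illustration.
import numpy as np

occupations = ["author", "bartender", "construction worker"]
search_frac_women = np.array([0.48, 0.44, 0.04])   # hypothetical search-result fractions
labor_frac_women  = np.array([0.56, 0.58, 0.03])   # hypothetical real-world fractions

# Stereotype exaggeration would show up as search fractions that are more extreme
# (further from 0.5) than the corresponding real-world fractions.
exaggeration = np.abs(search_frac_women - 0.5) - np.abs(labor_frac_women - 0.5)
for occupation, gap in zip(occupations, exaggeration):
    print(f"{occupation}: exaggeration = {gap:+.2f}")
print("correlation:", round(np.corrcoef(search_frac_women, labor_frac_women)[0, 1], 2))
```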

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).

73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative – that is, exerting covert influence instead of making forthright appeals – or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads that we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
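
For a sense of how such a gap is usually tested statistically, here is a minimal sketch using a two-proportion z-test; the counts are placeholders chosen to match the reported acceptance rates, not the study's actual cell counts.

```python
# Minimal sketch: test whether acceptance rates differ by the racial signal of the
# guest's name. The counts are illustrative placeholders, not the study's data.
from statsmodels.stats.proportion import proportions_ztest

accepted  = [1600, 1344]   # hypothetical accepted inquiries: White / African-American names
contacted = [3200, 3200]   # hypothetical inquiries sent per group (50% vs. 42% acceptance)

stat, pvalue = proportions_ztest(count=accepted, nobs=contacted)
print(f"z = {stat:.2f}, p = {pvalue:.2g}")
```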

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me – Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-DeVries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were on average wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users, and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than a uniform


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks and are found to perform similarly to human raters on samples of actual essays, they are able to be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
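
Part of the convenience is that these metrics reduce to a few lines of code once a table of decisions is in hand. The sketch below, with hypothetical data and column names, is the kind of descriptive computation such studies report; as stressed above, it says nothing by itself about whether a disparity constitutes wrongful discrimination.

```python
# Minimal sketch: the three observational criteria computed from a (hypothetical)
# table of scores, decisions, and outcomes. Column names are illustrative.
import pandas as pd

df = pd.DataFrame({
    "group":    ["a", "a", "a", "b", "b", "b"],
    "outcome":  [1, 0, 1, 1, 0, 0],            # realized outcome
    "score":    [0.9, 0.2, 0.7, 0.8, 0.6, 0.3],
    "decision": [1, 0, 1, 1, 1, 0],            # thresholded decision
})

# Demographic parity: acceptance rates by group.
print(df.groupby("group")["decision"].mean())

# Error rate parity: false positive rates by group.
print(df[df["outcome"] == 0].groupby("group")["decision"].mean())

# Calibration-style check: mean outcome within score bins, by group.
df["score_bin"] = pd.cut(df["score"], bins=[0.0, 0.5, 1.0])
print(df.groupby(["score_bin", "group"])["outcome"].mean())
```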

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks are not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for testing violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result – e.g., which websites (or third parties) might receive sensitive data, whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online/, 2012).

better) could have arisen by chance if there is, in fact, no systematic difference between the two groups.99

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
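
A minimal sketch of steps 4 and 5, under the assumption that the page contents collected for the two bot groups are available as plain text, might look as follows; the two tiny corpora here are placeholders.

```python
# Minimal sketch of the adversarial test's final steps: train a classifier to tell
# the two groups' page contents apart and ask whether its cross-validated accuracy
# is distinguishable from chance. The page texts below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

pages_a = ["compare health insurance plans", "life insurance quotes today"]   # group A
pages_b = ["weather forecast this weekend", "local restaurant reviews"]       # group B

X = TfidfVectorizer().fit_transform(pages_a + pages_b)
y = [1] * len(pages_a) + [0] * len(pages_b)

score, _, pvalue = permutation_test_score(
    LogisticRegression(), X, y, cv=2, n_permutations=1000, scoring="accuracy")
print(f"cross-validated accuracy = {score:.2f}, permutation p-value = {pvalue:.3f}")
```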

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting – in which actions taken on one website, such as searching for a product, result in ads for that product on another website – to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
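
The general shape of that measurement technique is sketched below; the URL, cookie name, and response handling are hypothetical placeholders, not the site's actual internals, and any real crawl would also need rate limiting and attention to the site's terms.

```python
# Minimal sketch of the measurement idea: request the same product page while
# programmatically varying a location cookie. The URL, cookie name, and parsing
# are hypothetical placeholders, not the actual site's internals.
import requests

PRODUCT_URL = "https://www.example.com/product/12345"   # placeholder URL
zip_codes = ["02139", "10001", "94305"]                  # a real sweep would cover all ZIP codes

for zip_code in zip_codes:
    response = requests.get(PRODUCT_URL, cookies={"zipcode": zip_code}, timeout=10)
    # In a real study, the displayed price would be parsed out of response.text
    # and logged alongside the ZIP code for later analysis.
    print(zip_code, response.status_code, len(response.text))
```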

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task – a task that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate them.

Looking ahead

In this chapter we covered traditional tests for discrimination, as well as fairness studies of various algorithmic systems. Together, these


methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.

Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.

Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.

Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): S68–S87.

Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.

Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.

Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.

Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.

Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.

Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 309–93. Elsevier, 2017.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.

Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.

Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.

Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.

Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.

Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.

Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.

Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.

D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.

Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me – Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


10 Pager, "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future," The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

11 Kohler-Hausmann, "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination," Nw. UL Rev. 113 (2018): 1163.

12 Bertrand and Duflo, "Field Experiments on Discrimination," in Handbook of Economic Field Experiments, vol. 1 (Elsevier, 2017), 309–93.

13 Bertrand and Duflo.

14 Bertrand and Duflo.

15 Quillian et al., "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time," Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

if they had. Or can the observed disparities be better explained based on factors correlated with race, such as a preference for more common and familiar names, or an inference of higher socioeconomic status for the candidates with White-sounding names? (Of course, the alternative explanations don't make the observed behavior morally acceptable, but they are important to consider.) Although the authors provide evidence against these interpretations, debate has persisted. For a discussion of critiques of the validity of audit studies, see Pager's survey.10
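To make the measurement concrete, a matched-pair correspondence study is often summarized by the callback gap together with a paired test on the discordant pairs. The sketch below uses simulated data; the callback rates are illustrative assumptions in the rough range reported in this literature, not the study's actual data.

```python
# Minimal sketch of a matched-pair correspondence-study analysis (simulated,
# illustrative data). Each pair is one employer that received two otherwise
# identical resumes differing only in the name.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_pairs = 1000
callback_white = rng.binomial(1, 0.10, n_pairs)   # assumed callback rates,
callback_black = rng.binomial(1, 0.065, n_pairs)  # purely for illustration

print("callback gap:", callback_white.mean() - callback_black.mean())

# Exact McNemar-style test: if the name has no effect, employers who called
# back exactly one of the two resumes should split 50/50 between them.
only_white = int(((callback_white == 1) & (callback_black == 0)).sum())
only_black = int(((callback_white == 0) & (callback_black == 1)).sum())
print(binomtest(only_white, only_white + only_black, 0.5))
```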

In any event, like other audit studies, this experiment tests fairness as blindness. Even simple proxies for race, such as residential neighborhood, were held constant between matched pairs of resumes. Thus, the design likely underestimates the extent to which morally irrelevant characteristics affect callback rates in practice. This is just another way to say that attribute flipping does not generally produce counterfactuals that we care about, and it is unclear if the effect sizes measured have any meaningful interpretation that generalizes beyond the context of the experiment.

Rather, audit studies are valuable because they trigger a strong and valid moral intuition.11 They also serve a practical purpose: when designed well, they illuminate the mechanisms that produce disparities and help guide interventions. For example, the car bargaining study concluded that the preferences of owners of dealerships don't explain the observed discrimination, that the preferences of other customers may explain some of it, and that there is strong evidence that dealers themselves (rather than owners or customers) are the primary source of the observed discrimination.

Resume-based audit studies, also known as correspondence studies, have been widely replicated. We briefly present some major findings, with the caveat that there may be publication biases. For example, studies finding no evidence of an effect are in general less likely to be published. Alternatively, published null findings might reflect poor experiment design, or might simply indicate that discrimination is only expressed in certain contexts.

A 2016 survey lists 30 studies from 15 countries covering nearly all continents, revealing pervasive discrimination against racial and ethnic minorities.12 The method has also been used to study discrimination based on gender, sexual orientation, and physical appearance.13 It has also been used outside the labor market, in retail and academia.14 Finally, trends over time have been studied: a meta-analysis found no change in racial discrimination in hiring against African Americans from 1989 to 2015. There was some indication of declining discrimination against Latinx Americans, although the data on this question was sparse.15


16 Blank, "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review," The American Economic Review, 1991, 1041–67.

17 Pischke, "Empirical Methods in Applied Economics: Lecture Notes," 2005.

Collectively, audit studies have helped nudge the academic and policy debate away from the naive view that discrimination is a concern of a bygone era. From a methodological perspective, our main takeaway from the discussion of audit studies is the complexity of defining and testing blindness.

Testing the impact of blinding

In some situations, it is not possible to test blindness by randomizing the decision maker's perception of race, gender, or another sensitive attribute. For example, suppose we want to test if there's gender bias in peer review in a particular research field. Submitting real papers with fictitious author identities may result in the reviewer attempting to look up the author and realizing the deception. A design in which the researcher changes author names to those of real people is even more problematic.

There is a slightly different strategy that's more viable: an editor of a scholarly journal in the research field could conduct an experiment in which each paper received is randomly assigned to be reviewed in either a single-blind fashion (in which the author identities are known to the referees) or double-blind fashion (in which author identities are withheld from referees). Indeed, such experiments have been conducted,16 but in general even this strategy can be impractical.

At any rate, suppose that a researcher has access to only observational data on journal review policies and statistics on published papers. Among ten journals in the research field, some introduced double-blind review, and did so in different years. The researcher observes that in each case, right after the switch, the fraction of female-authored papers rose, whereas there was no change for the journals that stuck with single-blind review. Under certain assumptions, this enables estimating the impact of double-blind reviewing on the fraction of accepted papers that are female-authored. This hypothetical example illustrates the idea of a "natural experiment," so called because experiment-like conditions arise due to natural variation. Specifically, the study design in this case is called differences-in-differences. The first "difference" is between single-blind and double-blind reviewing, and the second "difference" is between journals (row 2 in the summary table).
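As a sketch of how such an estimate might be computed, suppose we had a hypothetical panel with one row per journal per year and an indicator for whether the journal had adopted double-blind review by that year. A two-way fixed effects regression then yields the differences-in-differences estimate; the file and column names below are assumptions for illustration, and, anticipating the pitfalls discussed next, standard errors are clustered by journal.

```python
# Sketch of a differences-in-differences estimate with journal and year fixed
# effects (hypothetical panel: journal, year, frac_female_authored, double_blind).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("journal_panel.csv")  # hypothetical file

model = smf.ols(
    "frac_female_authored ~ double_blind + C(journal) + C(year)", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["journal"]})

print(model.params["double_blind"])  # estimated effect of double-blind review
```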

Differences-in-differences is methodologically nuanced, and a full treatment is beyond our scope.17 We briefly note some pitfalls. There may be unobserved confounders: perhaps the switch to double-blind reviewing at each journal happened as a result of a change in editorship, and the new editors also instituted policies that encouraged


18 Bertrand, Duflo, and Mullainathan, "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

19 Kang et al., "Whitened Resumes: Race and Self-Presentation in the Labor Market," Administrative Science Quarterly 61, no. 3 (2016): 469–502.

female authors to submit strong papers. There may also be spillover effects (which violates the Stable Unit Treatment Value Assumption): a change in policy at one journal can cause a change in the set of papers submitted to other journals. Outcomes are serially correlated (if there is a random fluctuation in the gender composition of the research field due to an entry or exodus of some researchers, the effect will last many years), which complicates the computation of the standard error of the estimate.18 Finally, the effect of double blinding on the probability of acceptance of female-authored papers (rather than on the fraction of accepted papers that are female-authored) is not identifiable using this technique without additional assumptions or controls.

Even though testing the impact of blinding sounds similar to testing blindness, there is a crucial conceptual and practical difference. Since we are not asking a question about the impact of race, gender, or another sensitive attribute, we avoid running into ontological instability. The researcher doesn't need to intervene on the observable features by constructing fictitious resumes or training testers to use a bargaining script. Instead, the natural variation in features is left unchanged: the study involves real decision subjects. The researcher only intervenes on the decision making procedure (or exploits natural variation) and evaluates the impact of that intervention on groups of candidates defined by the sensitive attribute A. Thus, A is not a node in a causal graph, but merely a way to split the units into groups for analysis. Questions of whether the decision maker actually inferred the sensitive attribute, or merely a feature correlated with it, are irrelevant to the interpretation of the study. Further, the effect sizes measured do have a meaning that generalizes to scenarios beyond the experiment. For example, a study tested the effect of "resume whitening," in which minority applicants deliberately conceal cues of their racial or ethnic identity in job application materials to improve their chances of getting a callback.19 The effects reported in the study are meaningful to job seekers who engage in this practice.

Revealing extraneous factors in decisions

Sometimes natural experiments can be used to show the arbitrariness of decision making, rather than unfairness in the sense of non-blindness (row 3 in the summary table). Recall that arbitrariness is one type of unfairness that we are concerned about in this book (Chapter 3). Arbitrariness may refer to the lack of a uniform decision making procedure or to the incursion of irrelevant factors into the procedure.

For example, a study looked at decisions made by judges in


20 Eren and Mocan, "Emotional Judges and Unlucky Juveniles," American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

21 For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that "Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana."

22 Danziger, Levav, and Avnaim-Pesso, "Extraneous Factors in Judicial Decisions," Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

23 In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges' attention. See (Lakens, "Impossibly Hungry Judges" (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017)).

24 Weinshall-Margel and Shapard, "Overlooked Factors in the Analysis of Parole Decisions," Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Louisiana juvenile courts, including sentence lengths.20 It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, before gradually dropping to nearly zero right before the judges' food break, then returned to about 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges' mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the "hungry judges" study and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn't random, and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn't look hard enough. A follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.


25 Huq, "Racial Equity in Algorithmic Criminal Justice," Duke LJ 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84 and 86 is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice: banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren't admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision (or a fairness intervention) on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we'd like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects, say, on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85. Then we can compare the salary of students with scores of 85 to 86 (and thus were awarded the scholarship) with those of students with scores of 84 to 85 (and thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus we can estimate the impact of the scholarship as if we did a randomized controlled trial.
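Here is a minimal sketch of that comparison, assuming a hypothetical dataset with each student's test score and later earnings; a real regression discontinuity analysis would typically use local linear regression and check sensitivity to the bandwidth.

```python
# Sketch of a simple regression-discontinuity comparison around the cutoff
# (hypothetical data with columns: score, earnings).
import pandas as pd

df = pd.read_csv("students.csv")  # hypothetical file
cutoff, bandwidth = 85, 1.0

just_above = df[(df.score >= cutoff) & (df.score < cutoff + bandwidth)]
just_below = df[(df.score >= cutoff - bandwidth) & (df.score < cutoff)]

# Naive estimate of the scholarship's effect on earnings near the margin.
print(just_above.earnings.mean() - just_below.earnings.mean())
print(len(just_above), len(just_below))  # check that the band isn't too sparse
```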

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students


27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.

28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not take advantage of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions28 (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.

We can of course try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
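The two competing regressions might look as follows; this is only a sketch, with hypothetical file and column names. The first asks whether a raw gap exists; the second asks whether it persists after the chosen controls. As the sidenote and the discussion below emphasize, the choice of controls carries the normative weight.

```python
# Sketch of a conditional demographic parity check via regression
# (hypothetical data: wage, gender, college_gpa, review_score).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("employees.csv")  # hypothetical file

raw = smf.ols("wage ~ C(gender)", data=df).fit()
adjusted = smf.ols("wage ~ C(gender) + college_gpa + review_score", data=df).fit()

# The gender coefficient in `adjusted` is the "gap after controls"; whether
# these are the right controls is a normative question, not a statistical one.
print(raw.params)
print(adjusted.params)
```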

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might


alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter we've presented many scenarios (screening job candidates, peer review, parole hearings) that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
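From the bank's perspective, a minimal sketch of this test is to bin the score and compare observed default rates across groups within each bin (hypothetical file and column names):

```python
# Sketch of a sufficiency check: within narrow score bins, compare observed
# default rates across groups (hypothetical data: score, group, default).
import pandas as pd

df = pd.read_csv("loans.csv")  # hypothetical file
df["score_bin"] = pd.cut(df["score"], bins=10)

# If sufficiency holds, default rates within each score bin should agree
# across groups, up to sampling noise.
print(df.groupby(["score_bin", "group"])["default"].agg(["mean", "count"]))
```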

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ (i.e., whether or not the loan was approved) and not R. Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same. This


30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

observational test is called the outcome test (row 7 in the summary table).
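The outcome test itself needs only group membership and outcomes for the favorably classified applicants; a minimal sketch with hypothetical file and column names:

```python
# Sketch of the outcome test: among approved applicants, compare default rates
# by group (hypothetical data: group, approved, default).
import pandas as pd

df = pd.read_csv("loans.csv")  # hypothetical file
approved = df[df["approved"] == 1]
print(approved.groupby("group")["default"].mean())
# A gap here need not mean different thresholds were applied; see the
# discussion of infra-marginality that follows.
```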

Here is a tempting argument based on the outcome test: if one group (say women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups, women (orange) and men (blue).

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.
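The thought experiment is easy to reproduce in simulation: both groups below are scored by a calibrated (Bayes optimal) risk score and approved under the same threshold, so sufficiency holds by construction, yet default rates among approved applicants differ because the groups' risk distributions differ. The particular distributions are illustrative assumptions.

```python
# Simulation of infra-marginality: same calibrated score and same threshold
# for both groups, yet the outcome test shows a gap among approved applicants.
import numpy as np

rng = np.random.default_rng(0)

def default_rate_among_approved(mean_logit, n=200_000, threshold=0.3):
    # The score equals the true probability of default, so Y ~ Bernoulli(R)
    # and sufficiency (Y independent of group given R) holds by construction.
    risk = 1 / (1 + np.exp(-rng.normal(mean_logit, 1.0, n)))
    default = rng.binomial(1, risk)
    approved = risk < threshold  # identical threshold for everyone
    return default[approved].mean()

print("group A (lower average risk):", default_rate_among_approved(-2.0))
print("group B (higher average risk):", default_rate_among_approved(-1.0))
```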

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal


31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test, analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates: it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications it is indeed possible, and important, to test error rate parity.
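In such settings the ground-truth label (here, a human judgment) can in principle be obtained for every instance, so error rate parity reduces to comparing group-wise false positive and false negative rates. A minimal sketch with hypothetical file and column names:

```python
# Sketch of an error rate parity check for an automated-judgment task
# (hypothetical data: group, label = human judgment, pred = model output).
import pandas as pd

df = pd.read_csv("comments.csv")  # hypothetical file

for group, g in df.groupby("group"):
    fpr = ((g.pred == 1) & (g.label == 0)).sum() / max((g.label == 0).sum(), 1)
    fnr = ((g.pred == 0) & (g.label == 1)).sum() / max((g.label == 1).sum(), 1)
    print(group, "FPR:", round(fpr, 3), "FNR:", round(fnr, 3))
```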


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

#   Test / study design                          Fairness notion / application   Access              Notes / limitations
1   Audit study                                  Blindness                       A-exp=, X=, R       Difficult to interpret
2   Natural experiment (esp. diff-in-diff)       Impact of blinding              A-exp~, R           Confounding, SUTVA violations, other
3   Natural experiment                           Arbitrariness                   W~, R               Unobserved confounders
4   Natural experiment (esp. regression disc.)   Impact of decision              R, Y or Y′          Sample size, confounding, other technical difficulties
5   Regression analysis                          Blindness                       X, A, R             Unreliable due to proxies
6   Regression analysis                          Cond. demographic parity        X, A, R             Weak moral justification
7   Outcome test                                 Predictive parity               A, Y | Ŷ = 1        Infra-marginality
8   Threshold test                               Sufficiency                     X′, A, Y | Ŷ = 1    Model-specific
9   Experiment                                   Separation / error rate parity  A, R, Y, Ŷ=         Often unethical or impractical
10  Observational test                           Demographic parity              A, R                See Chapter 2
11  Mediation analysis                           "Relevant" mechanism            X, A, R             See Chapter 4

Legend:

• =: indicates intervention on some variable (that is, X= does not represent a new random variable but is simply an annotation describing how X is used in the test)
• ~: natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group; the usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
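A small simulation illustrates the point: when the target's distribution differs by group and the observable is noisy, the optimal prediction differs by group even for candidates with the same observed value. The numbers below are illustrative assumptions only.

```python
# Illustration of statistical discrimination: group-dependent optimal
# predictions despite identical observables (illustrative numbers).
import numpy as np

rng = np.random.default_rng(0)
n = 500_000
group = rng.binomial(1, 0.5, n)
perf = rng.normal(50 + 5 * group, 10, n)   # target's distribution differs by group
observed = perf + rng.normal(0, 10, n)     # noisy observable, same process for all

# Conditional mean of the target for each group among candidates with
# (nearly) the same observed value.
near_60 = np.abs(observed - 60) < 1
for g in (0, 1):
    print("group", g, "E[perf | observed near 60]:",
          round(perf[near_60 & (group == g)].mean(), 2))
```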

While some authors put much weight on understanding discrimination through the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision-maker's information and beliefs. For example, at first sight the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business; as for statistical discrimination, either it is argued to be justified, or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit, and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120

interviews, in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the bookrsquos title mdash are also highly correlated with socioeconomicstatus The authors argue that these hiring practices help explain whyelite status is perpetuated in society along hereditary lines In ourview the careful use of statistical methods in hiring despite theirlimits may mitigate the strong social class based preferences exposedin the book

Another book Inside Graduate Admissions focuses on educationrather than labor market41 It resulted from the authorrsquos observationsof decision making by graduate admissions committees in nine aca-demic disciplines over two years A striking theme that pervades thisbook is the tension between formalized and holistic decision makingFor instance committees arguably over-rely on GRE scores despitestating that they consider their predictive power to be limited Asit turns out one reason for the preference for GRE scores and otherquantitative criteria is that they avoid the difficulties of subjectiveinterpretation associated with signals such as reference letters Thisis considered valuable because it minimizes tensions between facultymembers in the admissions process On the other hand decision mak-ers are implicitly aware (and occasionally explicitly articulate) that ifadmissions criteria are too formal then some groups of applicants mdashnotably applicants from China mdash would be successful at a far greaterrate and this is considered undesirable This motivates a more holis-tic set of criteria which often include idiosyncratic factors such as anapplicantrsquos hobby being considered ldquocoolrdquo by a faculty member Theauthor argues that admissions committees use a facially neutral setof criteria characterized by an almost complete absence of explicitsubstantive discussion of applicantsrsquo race gender or socioeconomicstatus but which nonetheless perpetuates inequities For examplethere is a reluctance to take on students from underrepresented back-grounds whose profiles suggest that they would benefit from moreintensive mentoring

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.
43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece. (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014.)

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42
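To make the stability condition in note 42 concrete, here is a minimal sketch of a stability check for a many-to-one matching. The data structures and function names are illustrative; they are not taken from the actual Resident Match implementation.

```python
def is_stable(matching, applicant_prefs, hospital_prefs, capacities):
    """Check the stability condition: no applicant and hospital would
    prefer each other over their assigned outcomes.

    matching: dict mapping applicant -> hospital (or None if unmatched).
    applicant_prefs / hospital_prefs: ranked lists of acceptable partners,
    most preferred first. capacities: dict mapping hospital -> slots.
    """
    def applicant_prefers(a, h):
        # True if applicant a ranks hospital h above their current match.
        current = matching.get(a)
        prefs = applicant_prefs[a]
        return current is None or prefs.index(h) < prefs.index(current)

    def hospital_prefers(h, a):
        # True if hospital h has a free slot, or ranks applicant a above
        # at least one applicant it is currently matched to.
        if a not in hospital_prefs[h]:
            return False
        matched = [x for x, hh in matching.items() if hh == h]
        if len(matched) < capacities[h]:
            return True
        ranks = hospital_prefs[h]
        return any(ranks.index(a) < ranks.index(x) for x in matched)

    for a, prefs in applicant_prefs.items():
        for h in prefs:
            if applicant_prefers(a, h) and hospital_prefers(h, a):
                return False  # (a, h) is a blocking pair
    return True
```

Deferred-acceptance matching produces assignments that pass this check; the difficulty described next arises because couples' joint preferences fall outside this simple model.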

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints, and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it was only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods used in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
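As a rough illustration, an error-rate disparity of this kind can be measured in a few lines of code once labeled corpora for each group are available. The sketch below uses the off-the-shelf langid library; the file names are placeholders, and the corpora in the actual study were constructed far more carefully.

```python
import langid  # pip install langid

def non_english_rate(tweets):
    # Fraction of known-English tweets that the classifier labels as some
    # other language, i.e., the false negative rate for English detection.
    misses = sum(1 for t in tweets if langid.classify(t)[0] != "en")
    return misses / len(tweets)

# Hypothetical corpora, one tweet per line.
aae_tweets = open("aae_tweets.txt").read().splitlines()
white_aligned_tweets = open("white_aligned_tweets.txt").read().splitlines()

print("AAE false negative rate:          ", non_english_rate(aae_tweets))
print("White-aligned false negative rate:", non_english_rate(white_aligned_tweets))
```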

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for the screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54 The table below summarizes this discussion.

54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.
55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task        | Examples                          | Sources of disparity                        | Harm
Perception          | Language id, speech-to-text       | Underrepresentation in training corpus      | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels, underrep. in training corpus  | Adverse decisions
Predicting outcomes | Resume filtering                  | Various, including human labels             | Adverse decisions
Sequence prediction | Language generation, translation  | Cultural stereotypes, historical prejudices | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1–20.6% difference in error rate). Further, all perform better on lighter faces than darker faces (11.8–19.2% difference in error rate), and worst on darker female faces (20.8–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.
57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
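A minimal sketch of this decomposition, assuming a table of classifier outputs with hypothetical column names:

```python
import pandas as pd

# One row per test image: true gender, predicted gender, and skin tone.
df = pd.read_csv("gender_classifier_results.csv")
df["error"] = df["predicted_gender"] != df["true_gender"]

# Marginal error rates by the target variable and by the sensitive attribute.
print(df.groupby("true_gender")["error"].mean())
print(df.groupby("skin_tone")["error"].mean())

# Intersectional error rates (e.g., darker-skinned female faces).
print(df.groupby(["skin_tone", "true_gender"])["error"].mean().unstack())
```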

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.
64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.
65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There is a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.
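A toy simulation can make this mechanism concrete: if a model is fit mostly to majority-group feedback and the minority group has different preferences, prediction error is higher for the minority group. All numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 50
majority_taste = rng.normal(size=n_items)
minority_taste = rng.normal(size=n_items)   # a different preference profile

def sample_ratings(taste, n_users, noise=0.5):
    # Each user's ratings are their group's taste plus individual noise.
    return taste + rng.normal(scale=noise, size=(n_users, n_items))

train_majority = sample_ratings(majority_taste, n_users=900)
train_minority = sample_ratings(minority_taste, n_users=100)

# A crude "collaborative" model: predict each item's average training rating.
item_means = np.vstack([train_majority, train_minority]).mean(axis=0)

def rmse(true_taste):
    test = sample_ratings(true_taste, n_users=200)
    return np.sqrt(((test - item_means) ** 2).mean())

print("RMSE for majority-group users:", rmse(majority_taste))
print("RMSE for minority-group users:", rmse(minority_taste))
```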

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate: reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.
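As a simplified illustration of the reformulation-rate proxy (not the study's full method, which adds careful controls and a pairwise latent-difference model), one could compute, from a hypothetical query log:

```python
import pandas as pd

# Hypothetical log: one row per query, with the user's demographic group and
# whether the query was reformulated shortly afterwards.
log = pd.read_csv("search_log.csv")

# Reformulation rate by group; higher rates are a crude signal of dissatisfaction.
print(log.groupby("user_group")["reformulated"].mean())
```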

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms,'" New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence of stereotype exaggeration, that is, of imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.
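A sketch of such a calibration test, with a hypothetical CSV standing in for the study's data (fraction of women in search results versus in labor statistics for each occupation):

```python
import numpy as np
import pandas as pd

# Columns: occupation, pct_women_search, pct_women_bls (hypothetical names).
df = pd.read_csv("occupations.csv")

# Per-occupation deviation of search results from real-world statistics.
df["deviation"] = df["pct_women_search"] - df["pct_women_bls"]
print(df[["occupation", "deviation"]].sort_values("deviation"))
print("Mean absolute deviation:", df["deviation"].abs().mean())

# Perfect calibration corresponds to slope 1 and intercept 0; a slope above 1
# would be consistent with stereotype exaggeration (skews amplified in results).
slope, intercept = np.polyfit(df["pct_women_bls"], df["pct_women_search"], 1)
print("slope:", slope, "intercept:", intercept)
```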

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.
72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category. (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844.)
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).
74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself, or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.
76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).
78 Ali et al., "Discrimination Through Optimization."
79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
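A toy simulation of this budget effect, with invented prices: the platform fills an untargeted campaign with the cheapest available impressions, so smaller budgets skew the audience toward the group that is cheaper to reach.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.choice(["men", "women"], size=n)
# Hypothetically, impressions shown to women cost a bit more on average.
price = np.where(group == "women", rng.normal(7, 1, n), rng.normal(5, 1, n))

def audience_share_women(budget):
    order = np.argsort(price)                       # buy cheapest impressions first
    bought = order[np.cumsum(price[order]) <= budget]
    return (group[bought] == "women").mean()

for budget in [1_000, 10_000, 50_000]:
    print(f"budget ${budget}: share of women = {audience_share_women(budget):.2f}")
```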

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interacting with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than to the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially for analyzing Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of anti-semitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).
82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis, because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider them in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those used in traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83 The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and the diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
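As a back-of-the-envelope check that a gap of this size on roughly 6,400 inquiries is not a chance fluctuation, one can run a two-proportion z-test; the equal split between the two name groups below is an assumption made for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

n_white, n_black = 3200, 3200                          # assumed equal split
accepts = [int(0.50 * n_white), int(0.42 * n_black)]   # headline acceptance rates
stat, pvalue = proportions_ztest(accepts, [n_white, n_black])
print(f"z = {stat:.2f}, p = {pvalue:.2g}")             # far below conventional thresholds
```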

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.
87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms presents a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.
89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).
90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to the field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results, except based on searcher location and the immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting the players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

A final example to reinforce the fact that disparity-producing mechanisms can be subtle and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably, this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
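For illustration only, the reported rule is simple enough to state in a few lines of code; the distance computation and the store list below are our own stand-ins, not the retailer's actual logic.

```python
from math import asin, cos, radians, sin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    # Great-circle distance in miles (haversine formula).
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3958.8 * 2 * asin(sqrt(a))

def show_discount(customer_latlon, competitor_stores, radius_miles=20):
    # Discount if any competitor store lies within roughly 20 miles.
    return any(miles_between(*customer_latlon, *store) <= radius_miles
               for store in competitor_stores)

# Example: a customer near a competitor store sees the lower price.
print(show_discount((40.0, -75.0), [(40.1, -75.1), (45.0, -90.0)]))
```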

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than by a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they are able to be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or on body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on the observable behavior of the system is merely a proxy for it.
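A minimal sketch of steps 4 and 5 above, using a bag-of-words classifier and scikit-learn's built-in permutation test. The placeholder page contents stand in for the data recorded in step 3, and the original study's pipeline includes additional controls.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, permutation_test_score

# Placeholder page contents so the sketch runs end to end; in a real audit,
# these lists hold the pages recorded for each group of simulated users.
pages_a = ["insurance quotes and premium offers"] * 60 + ["weather news sports"] * 40
pages_b = ["weather news sports"] * 100

X = CountVectorizer().fit_transform(pages_a + pages_b)
y = np.array([1] * len(pages_a) + [0] * len(pages_b))

clf = LogisticRegression(max_iter=1000)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Permutation test: how often does the classifier do as well when the group
# labels are shuffled? A small p-value indicates that the content seen by the
# two groups differs systematically, i.e., evidence of information flow.
score, perm_scores, pvalue = permutation_test_score(
    clf, X, y, cv=5, n_permutations=200)
print("accuracy:", score, "p-value:", pvalue)
```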

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 The researchers accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
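A minimal sketch of this kind of measurement; the URL, cookie name, and price parsing below are hypothetical, and any real crawl would need to respect rate limits and the site's terms.

```python
import re
import requests

PRODUCT_URL = "https://www.example-retailer.com/product/12345"  # placeholder

def extract_price(html):
    # Site-specific parsing; here, a simple regex stand-in.
    match = re.search(r"\$(\d+\.\d{2})", html)
    return float(match.group(1)) if match else None

def price_for_zip(zip_code):
    # Pretend to be a visitor whose inferred location is `zip_code` by setting
    # the (hypothetical) location cookie directly.
    resp = requests.get(PRODUCT_URL, cookies={"zipcode": zip_code}, timeout=10)
    return extract_price(resp.text)

prices = {z: price_for_zip(z) for z in ["10001", "60606", "94105"]}
print(prices)
```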

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome in which accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are common. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.

108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.

109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.

110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108
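As a toy illustration of what such a lab study can capture, the following sketch simulates a minimal feedback loop in the spirit of the predictive policing result: attention is allocated in proportion to previously observed incidents, and only attended incidents are observed. All rates and parameters are invented; this is not the actual model from the cited papers.

```python
import random

# Toy simulation of a runaway feedback loop. Two neighborhoods have the
# same true incident rate, but incidents are only observed where attention
# is allocated, and attention is allocated in proportion to past
# observations. All numbers are invented for illustration.
random.seed(0)
true_rate = {"A": 0.3, "B": 0.3}   # identical underlying rates
observed = {"A": 1.0, "B": 1.0}    # small initial counts

for day in range(10_000):
    total = observed["A"] + observed["B"]
    for hood in ("A", "B"):
        attention = observed[hood] / total           # share of patrols sent here
        if random.random() < true_rate[hood] * attention:
            observed[hood] += 1                      # incident observed and fed back

share_a = observed["A"] / (observed["A"] + observed["B"])
print(f"Share of observed incidents in neighborhood A: {share_a:.2f}")
# Even though the neighborhoods are identical, early random fluctuations
# get reinforced, so the final share is often far from 0.50.
```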

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task — a task that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate them.

Looking ahead

In this chapter we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37, 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.
Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96, 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91, 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48, 2018.
Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.
Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.
Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33, 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.
Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes," 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.
Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78, 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.
Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-DeVries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59, 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30, 2017.



16 Blank, "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review," The American Economic Review, 1991, 1041–67.

17 Pischke, "Empirical Methods in Applied Economics: Lecture Notes," 2005.

Collectively, audit studies have helped nudge the academic and policy debate away from the naive view that discrimination is a concern of a bygone era. From a methodological perspective, our main takeaway from the discussion of audit studies is the complexity of defining and testing blindness.

Testing the impact of blinding

In some situations, it is not possible to test blindness by randomizing the decision maker's perception of race, gender, or another sensitive attribute. For example, suppose we want to test if there's gender bias in peer review in a particular research field. Submitting real papers with fictitious author identities may result in the reviewer attempting to look up the author and realizing the deception. A design in which the researcher changes author names to those of real people is even more problematic.

There is a slightly different strategy that's more viable: an editor of a scholarly journal in the research field could conduct an experiment in which each paper received is randomly assigned to be reviewed in either a single-blind fashion (in which the author identities are known to the referees) or double-blind fashion (in which author identities are withheld from referees). Indeed, such experiments have been conducted,16 but in general even this strategy can be impractical.

At any rate, suppose that a researcher has access to only observational data on journal review policies and statistics on published papers. Among ten journals in the research field, some introduced double-blind review, and did so in different years. The researcher observes that in each case, right after the switch, the fraction of female-authored papers rose, whereas there was no change for the journals that stuck with single-blind review. Under certain assumptions, this enables estimating the impact of double-blind reviewing on the fraction of accepted papers that are female-authored. This hypothetical example illustrates the idea of a "natural experiment," so called because experiment-like conditions arise due to natural variation. Specifically, the study design in this case is called differences-in-differences. The first "difference" is between single-blind and double-blind reviewing, and the second "difference" is between journals (row 2 in the summary table).
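In code, the hypothetical journal example corresponds to a two-way fixed-effects regression. The sketch below uses pandas and statsmodels on made-up data; the journals, years, and effect size are invented, and it ignores the serial-correlation and identification issues discussed next.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up panel of 10 journals over 15 years. Journals 0-5 switch to
# double-blind review in staggered years; journals 6-9 never switch.
rng = np.random.default_rng(0)
switch_year = {j: 2004 + j for j in range(6)}
rows = []
for j in range(10):
    for year in range(2000, 2015):
        double_blind = int(j in switch_year and year >= switch_year[j])
        # Outcome: fraction of accepted papers that are female-authored,
        # with a small simulated effect of double-blind review.
        frac = 0.25 + 0.02 * double_blind + rng.normal(0, 0.03)
        rows.append({"journal": j, "year": year,
                     "double_blind": double_blind, "female_frac": frac})
df = pd.DataFrame(rows)

# Two-way fixed-effects differences-in-differences: journal and year fixed
# effects absorb level differences; the coefficient on double_blind is the
# diff-in-diff estimate of the effect of switching.
model = smf.ols("female_frac ~ double_blind + C(journal) + C(year)", data=df).fit()
print(model.params["double_blind"], model.bse["double_blind"])
```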

Differences-in-differences is methodologically nuanced, and a full treatment is beyond our scope.17 We briefly note some pitfalls. There may be unobserved confounders: perhaps the switch to double-blind reviewing at each journal happened as a result of a change in editorship, and the new editors also instituted policies that encouraged female authors to submit strong papers. There may also be spillover effects (which violates the Stable Unit Treatment Value Assumption): a change in policy at one journal can cause a change in the set of papers submitted to other journals. Outcomes are serially correlated (if there is a random fluctuation in the gender composition of the research field due to an entry or exodus of some researchers, the effect will last many years); this complicates the computation of the standard error of the estimate.18 Finally, the effect of double blinding on the probability of acceptance of female-authored papers (rather than on the fraction of accepted papers that are female-authored) is not identifiable using this technique without additional assumptions or controls.

18 Bertrand, Duflo, and Mullainathan, "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

19 Kang et al., "Whitened Resumes: Race and Self-Presentation in the Labor Market," Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Even though testing the impact of blinding sounds similar to testing blindness, there is a crucial conceptual and practical difference. Since we are not asking a question about the impact of race, gender, or another sensitive attribute, we avoid running into ontological instability. The researcher doesn't need to intervene on the observable features by constructing fictitious resumes or training testers to use a bargaining script. Instead, the natural variation in features is left unchanged: the study involves real decision subjects. The researcher only intervenes on the decision making procedure (or exploits natural variation) and evaluates the impact of that intervention on groups of candidates defined by the sensitive attribute A. Thus, A is not a node in a causal graph, but merely a way to split the units into groups for analysis. Questions of whether the decision maker actually inferred the sensitive attribute, or merely a feature correlated with it, are irrelevant to the interpretation of the study. Further, the effect sizes measured do have a meaning that generalizes to scenarios beyond the experiment. For example, a study tested the effect of "resume whitening," in which minority applicants deliberately conceal cues of their racial or ethnic identity in job application materials to improve their chances of getting a callback.19 The effects reported in the study are meaningful to job seekers who engage in this practice.

Revealing extraneous factors in decisions

Sometimes natural experiments can be used to show the arbitrariness of decision making, rather than unfairness in the sense of non-blindness (row 3 in the summary table). Recall that arbitrariness is one type of unfairness that we are concerned about in this book (Chapter 3). Arbitrariness may refer to the lack of a uniform decision making procedure, or to the incursion of irrelevant factors into the procedure.

For example, a study looked at decisions made by judges in Louisiana juvenile courts, including sentence lengths.20 It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

20 Eren and Mocan, "Emotional Judges and Unlucky Juveniles," American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

21 For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that "Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana."

22 Danziger, Levav, and Avnaim-Pesso, "Extraneous Factors in Judicial Decisions," Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

23 In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges' attention. See Lakens, "Impossibly Hungry Judges" (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017).

24 Weinshall-Margel and Shapard, "Overlooked Factors in the Analysis of Parole Decisions," Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, before gradually dropping to nearly zero right before the judges' food break, and returned to roughly 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges' mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the "hungry judges" study and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn't random, and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn't look hard enough. A follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.


25 Huq, "Racial Equity in Algorithmic Criminal Justice," Duke LJ 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84 and 86 is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice. Banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren't admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision — or a fairness intervention — on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we'd like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects — say, on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85. Then we can compare the salary of students with scores of 85 to 86 (and thus were awarded the scholarship) with those of students with scores of 84 to 85 (and thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus, we can estimate the impact of the scholarship as if we did a randomized controlled trial.
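The simplest version of this comparison is easy to express in code: restrict attention to a narrow band of scores around the cutoff and compare mean outcomes on either side. The data, cutoff, and bandwidth below are invented for illustration; real regression discontinuity analyses typically fit local regressions on each side rather than comparing raw means.

```python
import numpy as np
from scipy import stats

# Simulated data: test scores, a scholarship awarded at a cutoff of 85,
# and later earnings that depend on ability (proxied by score) plus a
# true scholarship effect of 2000. All numbers are illustrative.
rng = np.random.default_rng(1)
scores = rng.uniform(60, 100, size=20_000)
scholarship = scores >= 85
earnings = 20_000 + 300 * scores + 2_000 * scholarship + rng.normal(0, 5_000, scores.size)

# Regression discontinuity, simplest form: compare outcomes just above
# and just below the cutoff, where treatment is as good as random.
bandwidth = 1.0
just_above = earnings[(scores >= 85) & (scores < 85 + bandwidth)]
just_below = earnings[(scores < 85) & (scores >= 85 - bandwidth)]

effect = just_above.mean() - just_below.mean()
t_stat, p_value = stats.ttest_ind(just_above, just_below, equal_var=False)
print(f"Estimated effect: {effect:.0f}, p-value: {p_value:.3f}")
# A wider bandwidth gives more data but mixes in students of different
# ability; a narrower one is cleaner but noisier (see the discussion below).
```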

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.

28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions28 (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.
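A bare-bones version of this regression test is sketched below: fit the decisions on the non-sensitive attributes alone, then check whether adding the sensitive attribute improves the fit (here via a likelihood-ratio test). The data are simulated; as the text notes, the result is hard to interpret when the other attributes contain proxies for the sensitive attribute.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Simulated decisions: the decision maker uses qualifications x1, x2 and,
# in this simulation, also the sensitive attribute a.
rng = np.random.default_rng(2)
n = 5_000
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "a": rng.integers(0, 2, size=n),
})
logit_index = 0.8 * df["x1"] + 0.5 * df["x2"] - 0.6 * df["a"]
df["decision"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit_index))).astype(int)

# Restricted model: non-sensitive attributes only.
restricted = smf.logit("decision ~ x1 + x2", data=df).fit(disp=0)
# Full model: add the sensitive attribute.
full = smf.logit("decision ~ x1 + x2 + a", data=df).fit(disp=0)

# Likelihood-ratio test: does A add explanatory power beyond X?
lr_stat = 2 * (full.llf - restricted.llf)
p_value = stats.chi2.sf(lr_stat, df=1)
print(f"LR statistic: {lr_stat:.1f}, p-value: {p_value:.4f}")
```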

We can of course try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
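In code, the competing claims amount to running the same wage regression with and without controls. The sketch below, on simulated data where the gap flows entirely through review scores, shows how the estimated gender coefficient changes; which specification is the morally relevant one is exactly the question taken up next.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated company data in which the gender gap in wages operates
# entirely through performance review scores (which may themselves
# be biased -- the regression cannot tell us that).
rng = np.random.default_rng(3)
n = 4_000
gender = rng.integers(0, 2, size=n)              # 1 = female (simulated)
gpa = rng.normal(3.2, 0.4, size=n)
review = 3.0 + 0.5 * gpa - 0.3 * gender + rng.normal(0, 0.5, size=n)
wage = 20 + 4 * review + 2 * gpa + rng.normal(0, 3, size=n)
df = pd.DataFrame({"gender": gender, "gpa": gpa, "review": review, "wage": wage})

# Unadjusted gap (demographic parity in wages):
raw = smf.ols("wage ~ gender", data=df).fit()
# Gap after controlling for GPA and review scores
# (a test of conditional demographic parity):
adjusted = smf.ols("wage ~ gender + gpa + review", data=df).fit()

print("raw gender coefficient:     ", round(raw.params["gender"], 2))
print("adjusted gender coefficient:", round(adjusted.params["gender"], 2))
```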

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter, we've presented many scenarios — screening job candidates, peer review, parole hearings — that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
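For a decision maker who observes the score R, this check is straightforward: bin the scores and compare outcome rates by group within each bin. The sketch below assumes a data frame with hypothetical columns `score`, `group`, and `default`, and simulates data in which sufficiency holds by construction.

```python
import numpy as np
import pandas as pd

def sufficiency_table(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    """Default rate by group within narrow score bins.

    Under sufficiency (Y independent of A given R), the rates in each row
    should be approximately equal across groups, up to sampling noise.
    Column names `score`, `group`, `default` are assumed for illustration.
    """
    binned = df.assign(score_bin=pd.cut(df["score"], bins=n_bins))
    return binned.pivot_table(index="score_bin", columns="group",
                              values="default", aggfunc="mean", observed=True)

# Example with simulated data in which the same score means the same risk
# for both groups, i.e., sufficiency holds by construction.
rng = np.random.default_rng(4)
n = 50_000
group = rng.integers(0, 2, size=n)
score = np.clip(rng.beta(2, 5, size=n) + 0.1 * group, 0, 1)  # score distributions differ
default = (rng.uniform(size=n) < score).astype(int)          # but risk given score is equal
data = pd.DataFrame({"score": score, "group": group, "default": default})
print(sufficiency_table(data))
```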

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe the decision Ŷ (i.e., whether or not the loan was approved) and not the score R. Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) among favorably classified applicants (Ŷ = 1) be the same for every group (A). This observational test is called the outcome test (row 7 in the summary table).

30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Here is a tempting argument based on the outcome test: if one group (say women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).

This phenomenon is called infra-marginality: the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y) whose parameters can be inferred from the data, whereas the outcome test is model-free.
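The thought experiment above is easy to simulate. In the sketch below, both groups are judged by the same threshold on a well-calibrated risk score, so sufficiency holds by construction; yet, because the risk distributions differ, the default rate among approved applicants differs by group, which is what the outcome test picks up. The distributions are invented for illustration.

```python
import numpy as np

# Two groups with different distributions of true default risk, judged by
# the same Bayes-optimal rule: approve if risk < 0.5. Sufficiency holds by
# construction; the outcome test nevertheless shows a disparity.
rng = np.random.default_rng(5)
n = 200_000
risk_women = rng.beta(2, 6, size=n)   # risk concentrated at lower values
risk_men = rng.beta(3, 5, size=n)     # risk shifted towards the threshold

threshold = 0.5
for name, risk in [("women", risk_women), ("men", risk_men)]:
    approved = risk < threshold                      # same bar for both groups
    defaults = rng.uniform(size=n) < risk            # outcomes drawn from true risk
    default_rate = defaults[approved].mean()         # outcome-test quantity
    print(f"{name}: approval rate {approved.mean():.2f}, "
          f"default rate among approved {default_rate:.2f}")
# The default rate among approved men exceeds that among approved women,
# even though the same threshold was applied -- infra-marginality at work.
```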

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates; it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible and important to test error rate parity.
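When labels are available for every instance, as in such content-moderation settings, testing error rate parity reduces to comparing false positive and false negative rates across groups. A minimal sketch, assuming arrays of true labels, predictions, and group membership:

```python
import numpy as np

def error_rates_by_group(y_true, y_pred, group):
    """False positive and false negative rates for each group.

    Large gaps between groups indicate violations of error rate parity
    (the binary version of separation). Inputs are illustrative arrays.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        mask = group == g
        negatives = mask & (y_true == 0)
        positives = mask & (y_true == 1)
        fpr = y_pred[negatives].mean() if negatives.any() else float("nan")
        fnr = (1 - y_pred[positives]).mean() if positives.any() else float("nan")
        rates[g] = {"FPR": round(float(fpr), 3), "FNR": round(float(fnr), 3)}
    return rates

# Toy example: a harassment classifier evaluated on comments with known
# labels and an annotation of the relevant group.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 1, 1]
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(error_rates_by_group(y_true, y_pred, group))
```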


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

Test / study design | Fairness notion / application | Access | Notes / limitations
1. Audit study | Blindness | A-exp=, X=, R | Difficult to interpret
2. Natural experiment (especially diff-in-diff) | Impact of blinding | A-exp∼, R | Confounding, SUTVA violations, other
3. Natural experiment | Arbitrariness | W∼, R | Unobserved confounders
4. Natural experiment (especially regression discontinuity) | Impact of decision | R, Y or Y′ | Sample size, confounding, other technical difficulties
5. Regression analysis | Blindness | X, A, R | Unreliable due to proxies
6. Regression analysis | Conditional demographic parity | X, A, R | Weak moral justification
7. Outcome test | Predictive parity | A, Y (given Ŷ = 1) | Infra-marginality
8. Threshold test | Sufficiency | X′, A, Y (given Ŷ = 1) | Model-specific
9. Experiment | Separation / error rate parity | A, R, Ŷ=, Y | Often unethical or impractical
10. Observational test | Demographic parity | A, R | See Chapter 2
11. Mediation analysis | "Relevant" mechanism | X, A, R | See Chapter 4

Legend

• =: indicates intervention on some variable (that is, X= does not represent a new random variable, but is simply an annotation describing how X is used in the test)
• ∼: natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor markets.33

33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision-maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, either it is argued to be justified, or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize: rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, a more useful strategy is a nuanced and domain-specific one, in which we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

38 Neckerman and Kirschenman, “Hiring Strategies, Racial Bias, and Inner-City Workers,” Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, “The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets,” Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to, and symbiotic with, quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews, in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for nine months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects will behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and “sell” events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree in the book’s title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author’s observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants, notably applicants from China, would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant’s hobby being considered “cool” by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants’ race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we’ve built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42

42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, “The Origins, History, and Design of the Resident Match,” JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, “Bias in Computer Systems,” ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece. (Sandvig et al., “Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms,” ICA Pre-Conference on Data and Discrimination, 2014.)

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the “leading member”: the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.
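To make the mechanics concrete, here is a minimal sketch of applicant-proposing deferred acceptance, one standard way to compute a match with the stability property described in footnote 42. It is not the historical residency algorithm (which has changed over the decades), and it assumes every hospital has a single slot; the point it illustrates is that the interface accepts only individual preference lists, so a couple’s joint preference (“hospital X, but only if my partner is nearby”) simply cannot be expressed.

```python
# A sketch of applicant-proposing deferred acceptance with unit hospital
# capacities. Preference lists are most-preferred first.
def deferred_acceptance(applicant_prefs, hospital_prefs):
    # rank[h][a]: position of applicant a in hospital h's list (lower is better)
    rank = {h: {a: i for i, a in enumerate(prefs)}
            for h, prefs in hospital_prefs.items()}
    free = list(applicant_prefs)                    # applicants without a tentative match
    next_choice = {a: 0 for a in applicant_prefs}   # next hospital each applicant proposes to
    match = {}                                      # hospital -> tentatively held applicant

    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                                # exhausted list; stays unmatched
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if h not in match:
            match[h] = a                            # hospital tentatively accepts
        elif rank[h][a] < rank[h][match[h]]:
            free.append(match[h])                   # displaced applicant proposes again later
            match[h] = a
        else:
            free.append(a)                          # rejected; will propose to next choice
    return match

applicants = {"ana": ["mercy", "city"], "bob": ["city", "mercy"], "cam": ["city", "mercy"]}
hospitals = {"mercy": ["cam", "ana", "bob"], "city": ["ana", "bob", "cam"]}
print(deferred_acceptance(applicants, hospitals))   # {'city': 'ana', 'mercy': 'cam'}
```

Note that each applicant submits a single ranking over hospitals; there is no way for two partners to state that each prefers a hospital only conditional on where the other is matched.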

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

45 For a treatise on AAE, see Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, “Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women,” Reuters, 2018.

47 Buranyi, “How to Persuade a Robot That You Should Get the Job” (Guardian, 2018).

48 De-Arteaga et al., “Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting,” in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, “Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test,” ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, “Automated Essay Scoring in the Presence of Biased Ratings,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., “The Risk of Racial Bias in Hate Speech Detection,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, “Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems,” arXiv preprint arXiv:1805.04508, 2018.

53 Tatman, “Gender and Dialect Bias in YouTube’s Automatic Captions,” in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user’s preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.
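As a rough illustration of what such a model looks like, the following trains a character n-gram classifier on a handful of made-up example sentences using scikit-learn. A real tool (such as langid.py, discussed next) is trained on large multilingual corpora and uses a more carefully engineered model, but the core idea is similar.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Character n-grams (here 1- to 3-grams) are simple but effective features
# for language identification. The training data below is purely illustrative.
train_texts = [
    "the cat sat on the mat", "where are you going today",
    "el gato se sentó en la alfombra", "¿adónde vas hoy?",
    "le chat est assis sur le tapis", "où vas-tu aujourd'hui",
]
train_langs = ["en", "en", "es", "es", "fr", "fr"]

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_langs)
print(model.predict(["c'est la vie", "see you tomorrow"]))
```

A disparity audit of such a tool then amounts to comparing its error rates across corpora written in different dialects, as in the 2016 study discussed next.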

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of “White-aligned” English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors’ construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to “explain” disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54 The table below summarizes this discussion.

54 Solaiman et al., “Release Strategies and the Social Impacts of Language Models,” arXiv preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task         | Examples                          | Sources of disparity                         | Harm
Perception           | Language id, speech-to-text       | Underrepresentation in training corpus       | Degraded service
Automating judgment  | Toxicity detection, essay grading | Human labels, underrep. in training corpus   | Adverse decisions
Predicting outcomes  | Resume filtering                  | Various, including human labels              | Adverse decisions
Sequence prediction  | Language generation, translation  | Cultural stereotypes, historical prejudices  | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person’s gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6% difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

56 Vries et al., “Does Object Recognition Work for Everyone?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., “No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World,” in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

58 Simonite, “When It Comes to Gorillas, Google Photos Remains Blind,” Wired, January 13 (2018); Hern, “Flickr Faces Complaints over ‘Offensive’ Auto-Tagging for Photos,” The Guardian 20 (2015).

59 Martineau, “Cities Examine Proper—and Improper—Uses of Facial Recognition,” WIRED, https://www.wired.com/story/cities-examine-proper-improper-facial-recognition/, 2019.

60 O’Toole et al., “Simulating the ‘Other-Race Effect’ as a Problem in Perceptual Learning,” Connection Science 3, no. 2 (1991): 163–78.

61 “Kinect May Have Issues with Dark-Skinned Users,” Tom’s Guide, https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

62 Wilson, Hoffman, and Morgenstern, “Predictive Inequity in Object Detection,” arXiv preprint arXiv:1902.11097, 2019.

If we treat the classifier’s target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
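The first issue is essentially about where the decision threshold sits on the classifier’s score. As a minimal sketch (with synthetic scores standing in for a real model’s outputs), one could pick a threshold on a labeled validation set that balances the two error types; nothing about the trained model changes.

```python
import numpy as np

def balanced_threshold(scores_male, labels, grid=np.linspace(0, 1, 201)):
    """Pick a single threshold on the model's 'male' score so that the rate of
    females classified as male and of males classified as female are as close
    as possible. labels: 1 = male, 0 = female. No retraining is involved."""
    best_t, best_gap = 0.5, float("inf")
    for t in grid:
        pred_male = scores_male >= t
        err_f_as_m = pred_male[labels == 0].mean()     # females predicted male
        err_m_as_f = (~pred_male[labels == 1]).mean()  # males predicted female
        if abs(err_f_as_m - err_m_as_f) < best_gap:
            best_t, best_gap = t, abs(err_f_as_m - err_m_as_f)
    return best_t

# Synthetic validation data, purely for illustration:
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = np.clip(0.35 * labels + rng.normal(0.35, 0.2, size=2000), 0, 1)
print(balanced_threshold(scores, labels))
```

The second issue, higher error on darker faces, cannot be fixed by moving a threshold; it requires changes to the training data or the model itself.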

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place. By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won’t recap the objections to targeted advertising here, but it is an extensively discussed topic and the practice is strongly opposed by the public, at least in the United States.63

63 Turow et al., “Americans Reject Tailored Advertising and Three Activities That Enable It,” available at SSRN 1478214, 2009.

64 Raghavan et al., “Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices,” arXiv preprint arXiv:1906.09208, 2019.

65 Yao and Huang, “Beyond Parity: Fairness Objectives for Collaborative Filtering,” in Advances in Neural Information Processing Systems, 2017, 2921–30.

Morally dubious computer vision technology goes well beyond this example. It includes apps that “beautify” images of users’ faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don’t provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

66 Mehrotra et al., “Auditing Search Engines for Differential Satisfaction Across Demographics,” in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the “users who liked this item also liked . . .” feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn’t generalize to other groups.
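The following toy simulation, in the spirit of the simulation methods mentioned above (it is not the model from the cited study), illustrates the mechanism: a recommender fit mostly on majority-group feedback has higher held-out error for a minority group whose preferences differ. All numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_major, n_minor = 50, 900, 100          # the minority group is underrepresented

# The two groups have different "true" preferences over items.
pref_major = rng.normal(3.5, 1.0, n_items)
pref_minor = rng.normal(3.5, 1.0, n_items)

def sample_ratings(pref, n_users, p_observe):
    """Each user rates a random subset of items, noisily, around the group preference."""
    ratings = pref + rng.normal(0, 0.5, size=(n_users, n_items))
    observed = rng.random((n_users, n_items)) < p_observe
    return ratings, observed

r_maj, o_maj = sample_ratings(pref_major, n_major, p_observe=0.2)
r_min, o_min = sample_ratings(pref_minor, n_minor, p_observe=0.1)  # minority also rates less

# A deliberately simple recommender: predict every rating by the item's mean
# observed rating, which is dominated by majority-group feedback.
all_r, all_o = np.vstack([r_maj, r_min]), np.vstack([o_maj, o_min])
item_means = (all_r * all_o).sum(axis=0) / all_o.sum(axis=0)

def heldout_rmse(r, o):
    pred = np.broadcast_to(item_means, r.shape)
    return np.sqrt(np.mean((r[~o] - pred[~o]) ** 2))

print("held-out RMSE, majority group:", round(heldout_rmse(r_maj, o_maj), 2))
print("held-out RMSE, minority group:", round(heldout_rmse(r_min, o_min), 2))
```

The gap disappears if the groups happen to have similar preferences, which is exactly the underlying assumption flagged above.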

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users’ satisfaction with the results, or how well the results served the users’ needs. Metrics such as clicks and ratings serve as crude proxies for the target and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm, rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate: reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube’s algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

67 For in-depth treatments of the history and politics of information platforms, see Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, “The Politics of ‘Platforms,’” New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, “The New Governors: The People, Rules, and Processes Governing Online Speech,” Harv. L. Rev. 131 (2017): 1598.

68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).

69 Kay, Matuszek, and Munson, “Unequal Representation and Gender Stereotypes in Image Search Results for Occupations,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.

70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let’s discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, . . .) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
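A minimal sketch of such a calibration check follows. The occupations and numbers are made up for illustration (the study used 45 occupations and US occupational statistics), but the logic is the same: compare the search-result fraction to the real-world fraction and ask whether the skew is amplified.

```python
import numpy as np

occupations = ["author", "bartender", "construction worker", "nurse", "engineer"]
real_fraction_women = np.array([0.56, 0.60, 0.03, 0.88, 0.15])    # hypothetical
search_fraction_women = np.array([0.50, 0.55, 0.02, 0.93, 0.10])  # hypothetical

# Under perfect calibration the points lie on the diagonal. Fitting a line
# through (0.5, 0.5), a slope greater than 1 indicates stereotype exaggeration:
# whichever gender dominates an occupation dominates the search results even more.
x = real_fraction_women - 0.5
y = search_fraction_women - 0.5
slope = np.sum(x * y) / np.sum(x ** 2)
print("slope:", round(slope, 2))   # ~1.07 for these made-up numbers
```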

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might carry ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals; the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs); and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, “Online Manipulation: Hidden Influences in a Digital World,” available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that’s related to the persuasive or manipulative category. (Bagwell, “The Economic Analysis of Advertising,” Handbook of Industrial Organization 3 (2007): 1701–844.)

73 See, e.g., Coltrane and Messineo, “The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising,” Sex Roles 42, no. 5–6 (2000): 363–89.

74 Angwin, Varner, and Tobin, “Facebook Enabled Advertisers to Reach ‘Jew Haters,’” ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair, and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative, that is, exerting covert influence instead of making forthright appeals, or from ads exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don’t necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself, or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users’ free-form text descriptions of their interests. These categories were found to include “Jew haters” and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user’s probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., “Discrimination Through Optimization: How Facebook’s Ad Delivery Can Lead to Biased Outcomes,” Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, “Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads,” Management Science, 2019.

76 Datta, Tschantz, and Datta, “Automated Experiments on Ad Privacy Settings,” Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., “AdAnalyst,” https://adanalyst.mpi-sws.org, 2017.

78 Ali et al., “Discrimination Through Optimization.”

79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
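Here is a toy simulation of that budget effect, under the invented assumptions that impressions for women cost more on average and that the platform simply buys the cheapest available impressions until the budget runs out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
is_woman = rng.random(n) < 0.5
# Assumed costs per impression: higher on average for women.
cost = np.where(is_woman, rng.normal(0.06, 0.01, n), rng.normal(0.04, 0.01, n))

def share_of_women(budget):
    order = np.argsort(cost)               # cheapest impressions first
    bought = order[np.cumsum(cost[order]) <= budget]
    return is_woman[bought].mean()

for budget in [50, 200, 400]:
    print(f"budget {budget}: share of women in audience = {share_of_women(budget):.2f}")
```

With a small budget the audience is mostly men (the cheaper group); as the budget grows, the composition approaches the underlying 50/50 population.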

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the “gender” attribute in Google’s Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency, which promised large salaries, more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google’s ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially for analyzing Facebook’s advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad’s audience. The optimization effects are enabled by Facebook’s analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., “Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms,” Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

81 This is not to say that discrimination is nonexistent. See, e.g., Ayres, Banaji, and Jolls, “Race Effects on eBay,” The RAND Journal of Economics 46, no. 4 (2015): 891–917.

82 Lee et al., “Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.

83 Edelman, Luca, and Svirsky, “Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment,” American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people’s livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those in traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host’s race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and the diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts’ profile pictures might find a greater effect.
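As a back-of-the-envelope check that a gap of this size is unlikely to be due to chance, one can run a two-proportion z-test on the acceptance rates. The even split of the roughly 6,400 inquiries between the two conditions below is an assumption made for illustration, not a figure from the study.

```python
from math import erf, sqrt

n_white, n_black = 3200, 3200          # assumed even split (hypothetical)
acc_white, acc_black = 0.50, 0.42      # acceptance rates reported by the study

p_pool = (acc_white * n_white + acc_black * n_black) / (n_white + n_black)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_white + 1 / n_black))
z = (acc_white - acc_black) / se
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))   # two-sided normal tail
print(f"z = {z:.1f}, p = {p_value:.1e}")
```

The study itself reports regression-based analyses with controls for host and listing characteristics; the calculation above only conveys the order of magnitude of the statistical evidence.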

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check: when the analysis was restricted to the 29 hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study’s findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study’s external validity.

84 Thebault-Spieker, Terveen, and Hecht, “Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit,” ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., “Racial and Gender Discrimination in Transportation Network Companies” (National Bureau of Economic Research, 2016).

86 Levy and Barocas, “Designing Against Discrimination in Online Markets,” Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, “Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy,” European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users’ ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a “geographic reputation” score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., “Understanding Unequal Gender Classification Accuracy from Face Images,” arXiv preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see Harvey and LaPlace, “MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets,” 2019, https://megapixels.cc.

90 Robertson et al., “Auditing Partisan Audience Bias Within Google Search,” Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D’Onfro, “Google Tests Changes to Its Search Algorithm; How Search Works,” https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

92 Hannak et al., “Measuring Personalization of Web Search,” in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.

93 Tripodi, “Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices,” Data & Society, 2018.

Mechanisms of discrimination

We’ve looked at a number of studies on detecting unfairness in algorithmic systems. Let’s take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn’t generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers’ existing views (the “filter bubble” hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants’ computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users’ existing beliefs.

This finding is consistent with the fact that Google doesn’t personalize search results except based on searcher location and the immediate (10-minute) history of searches. This is known based on Google’s own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to “verify the facts.” Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers’ beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players’ actions, or that it had increased despite the protests. Search terms reflecting these views might be “NFL ratings down” versus “NFL ratings up.”

95 But see Golebiewski and Boyd, “Data Voids: Where Missing Data Can Easily Be Exploited,” Data & Society 29 (2018) (“Data Void Type 4: Fragmented Concepts”) for an argument that search engines’ decision not to collapse related concepts contributes to this fragmentation.

96 Valentino-DeVries, Singer-Vine, and Soltani, “Websites Vary Prices, Deals Based on Users’ Information,” Wall Street Journal 10 (2012): 60–68.

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes, and these ZIP codes were on average wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor’s physical store located within 20 miles or so of the customer’s inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer’s reservation price or willingness to pay. Incidentally, this is a similar kind of “statistical discrimination” as seen in the car sales discrimination study discussed at the beginning of this chapter.
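A toy simulation makes the point that such a facially location-based rule can produce an income disparity even though income is never an input to the rule. The relationship assumed below, that competitor stores are more likely to be near wealthier customers, is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
income = rng.lognormal(mean=10.8, sigma=0.4, size=n)        # household income (toy)
# Assumption: wealthier areas tend to be closer to a competitor's store.
dist_to_competitor = rng.exponential(scale=40 / (income / income.mean()), size=n)
discount = dist_to_competitor < 20                          # the pricing rule

rich = income > np.median(income)
print("discount rate, higher-income half:", round(discount[rich].mean(), 2))
print("discount rate, lower-income half: ", round(discount[~rich].mean(), 2))
```

The rule never looks at income, yet the two halves see systematically different prices, which is what the journalists observed at the ZIP-code level.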

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than through a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., “Mitigating Bias in Algorithmic Employment Screening.”

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates’ suitability for jobs based on personality tests, or on body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don’t produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, “Automated Experiments on Ad Privacy Settings.”

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity, such as searches and clicks, is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let’s revisit the example of the health website. The adversarial test operates as follows:

1 Create two groups of simulated users (A and B) ie bots that areidentical except for the fact that users in group A but not group Bbrowse the sensitive website in question

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result (e.g., which websites or third parties might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content).

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online/, 2012).


There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
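To make steps 4 and 5 concrete, here is a minimal sketch of the classification and permutation-test steps. The bag-of-words feature extraction, the choice of classifier, and all variable names are illustrative assumptions, not details of the original study.

```python
# Minimal sketch of the adversarial test (steps 4-5), under illustrative assumptions:
# pages_a and pages_b are lists of raw page text collected for bot groups A and B.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def adversarial_test(pages_a, pages_b, n_permutations=200, seed=0):
    texts = pages_a + pages_b
    y = np.array([0] * len(pages_a) + [1] * len(pages_b))
    X = CountVectorizer(max_features=5000).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)

    # Step 4: cross-validated accuracy of the group-A vs group-B classifier.
    observed = cross_val_score(clf, X, y, cv=5).mean()

    # Permutation test: how often does a random relabeling of the pages
    # yield an accuracy at least as high as the observed one?
    rng = np.random.default_rng(seed)
    null_accs = []
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)
        null_accs.append(cross_val_score(clf, X, y_perm, cv=5).mean())
    p_value = (np.sum(np.array(null_accs) >= observed) + 1) / (n_permutations + 1)
    return observed, p_value
```

If the constraint holds, the observed accuracy should hover around 1/2 and the p-value will be large; a small p-value indicates that something about the pages systematically differs between the two groups.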

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting, in which actions taken on one website (such as searching for a product) result in ads for that product on another website, to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
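A stripped-down version of that measurement strategy might look like the sketch below. The cookie name ("zipcode"), the product URL, and the price-extraction step are hypothetical placeholders; the actual cookie format and page structure would have to be reverse-engineered for the site being studied.

```python
# Hypothetical sketch: query the same product page once per ZIP code,
# overriding the location cookie that the site uses for personalization.
import re
import requests

PRODUCT_URL = "https://www.example.com/product/12345"   # placeholder URL
ZIP_CODES = ["02139", "94305", "70803"]                 # the study covered every US ZIP code

def extract_price(html):
    # Site-specific parsing; a naive regex stands in for it here.
    match = re.search(r"\$([0-9]+\.[0-9]{2})", html)
    return float(match.group(1)) if match else None

def price_for_zip(zip_code):
    # The cookie name "zipcode" is an assumption about how the site stores location.
    response = requests.get(PRODUCT_URL, cookies={"zipcode": zip_code}, timeout=10)
    return extract_price(response.text)

prices = {z: price_for_zip(z) for z in ZIP_CODES}
```

In practice one would also want to repeat and interleave these requests, to separate location effects from price fluctuations over time.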

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome, where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, one that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases; both are potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these


methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J. B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online/, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


18 Bertrand, Duflo, and Mullainathan, "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

19 Kang et al., "Whitened Resumes: Race and Self-Presentation in the Labor Market," Administrative Science Quarterly 61, no. 3 (2016): 469–502.

female authors to submit strong papers. There may also be spillover effects (which violates the Stable Unit Treatment Value Assumption): a change in policy at one journal can cause a change in the set of papers submitted to other journals. Outcomes are serially correlated (if there is a random fluctuation in the gender composition of the research field due to an entry or exodus of some researchers, the effect will last many years); this complicates the computation of the standard error of the estimate.18 Finally, the effect of double blinding on the probability of acceptance of female-authored papers (rather than on the fraction of accepted papers that are female-authored) is not identifiable using this technique without additional assumptions or controls.
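For concreteness, a bare-bones difference-in-differences regression of the kind referenced in the sidenote might look as follows. The synthetic data, column names, and effect size are invented for illustration, and clustering by journal only partially addresses the serial-correlation concern raised above.

```python
# Illustrative difference-in-differences sketch with synthetic journal-year data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for j in [f"J{i}" for i in range(6)]:
    treated = int(j == "J0")                 # assume only J0 adopts double-blind review
    for year in range(2000, 2010):
        post = int(year >= 2005)
        effect = 0.03 * treated * post       # assumed true effect, for illustration only
        rows.append({
            "journal": j, "year": year, "treated": treated, "post": post,
            "frac_female_authored": 0.18 + effect + rng.normal(0, 0.02),
        })
df = pd.DataFrame(rows)

# The coefficient on treated:post is the difference-in-differences estimate;
# clustering standard errors by journal is a partial guard against serial correlation.
model = smf.ols("frac_female_authored ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["journal"]})
print(model.params["treated:post"], model.bse["treated:post"])
```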

Even though testing the impact of blinding sounds similar to testing blindness, there is a crucial conceptual and practical difference. Since we are not asking a question about the impact of race, gender, or another sensitive attribute, we avoid running into ontological instability. The researcher doesn't need to intervene on the observable features by constructing fictitious resumes or training testers to use a bargaining script. Instead, the natural variation in features is left unchanged: the study involves real decision subjects. The researcher only intervenes on the decision making procedure (or exploits natural variation) and evaluates the impact of that intervention on groups of candidates defined by the sensitive attribute A. Thus, A is not a node in a causal graph, but merely a way to split the units into groups for analysis. Questions of whether the decision maker actually inferred the sensitive attribute, or merely a feature correlated with it, are irrelevant to the interpretation of the study. Further, the effect sizes measured do have a meaning that generalizes to scenarios beyond the experiment. For example, a study tested the effect of "resume whitening," in which minority applicants deliberately conceal cues of their racial or ethnic identity in job application materials to improve their chances of getting a callback.19 The effects reported in the study are meaningful to job seekers who engage in this practice.

Revealing extraneous factors in decisions

Sometimes natural experiments can be used to show the arbitrariness of decision making, rather than unfairness in the sense of non-blindness (row 3 in the summary table). Recall that arbitrariness is one type of unfairness that we are concerned about in this book (Chapter 3). Arbitrariness may refer to the lack of a uniform decision making procedure, or to the incursion of irrelevant factors into the procedure.

For example, a study looked at decisions made by judges in


20 Eren and Mocan, "Emotional Judges and Unlucky Juveniles," American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

21 For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that "Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana."
22 Danziger, Levav, and Avnaim-Pesso, "Extraneous Factors in Judicial Decisions," Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

23 In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges' attention. See (Lakens, "Impossibly Hungry Judges" (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017)).
24 Weinshall-Margel and Shapard, "Overlooked Factors in the Analysis of Parole Decisions," Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Louisiana juvenile courts, including sentence lengths.20 It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, before gradually dropping to nearly zero right before the judges' food break; it returned to about 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges' mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the "hungry judges" study, and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn't random, and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn't look hard enough. A follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.


25 Huq, "Racial Equity in Algorithmic Criminal Justice," Duke LJ 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84% and 86% is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice. Banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren't admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision, or a fairness intervention, on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we'd like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects, say on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85%. Then we can compare the salary of students with scores of 85% to 86% (and thus were awarded the scholarship) with those of students with scores of 84% to 85% (and thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus, we can estimate the impact of the scholarship as if we did a randomized controlled trial.

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.
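As a rough illustration, a minimal version of this comparison simply contrasts mean outcomes in a narrow band on either side of the cutoff. The data, the cutoff value, and the bandwidth below are all made up for the example; practical analyses would typically use local regression and check sensitivity to several bandwidths.

```python
# Minimal regression-discontinuity sketch with synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
score = rng.uniform(60, 100, n)                  # test score (percent)
awarded = score >= 85                            # sharp cutoff at 85%
ability = (score - 60) / 40 + rng.normal(0, 0.2, n)
salary = 40000 + 20000 * ability + 3000 * awarded + rng.normal(0, 5000, n)

bandwidth = 1.0                                  # compare [84, 85) vs [85, 86)
just_below = (score >= 85 - bandwidth) & (score < 85)
just_above = (score >= 85) & (score < 85 + bandwidth)

# Local difference in means approximates the effect of the scholarship at the margin.
effect = salary[just_above].mean() - salary[just_below].mean()
print(f"estimated effect near the cutoff: {effect:.0f}")
```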

Another pitfall arises because we assumed that the set of students


27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.
28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions28 (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.

We can, of course, try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
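In code, such a test often amounts to comparing the coefficient on the group indicator with and without controls. The synthetic data and column names below are illustrative only, and, per the footnote, the conclusions depend heavily on the assumed functional form.

```python
# Illustrative regression test for (conditional) demographic parity in wages.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
female = rng.integers(0, 2, n)
gpa = rng.normal(3.0, 0.4, n)
review = rng.normal(3.5, 0.5, n) - 0.2 * female   # assumed: review scores themselves differ by gender
wage = 20 + 5 * gpa + 4 * review + rng.normal(0, 2, n)
df = pd.DataFrame({"wage": wage, "female": female, "gpa": gpa, "review": review})

raw = smf.ols("wage ~ female", data=df).fit()                      # raw gap
adjusted = smf.ols("wage ~ female + gpa + review", data=df).fit()  # gap after controls
print(raw.params["female"], adjusted.params["female"])
```

In this synthetic example the "adjusted" gap is close to zero even though the review scores themselves differ by gender, which is exactly the kind of pattern the following paragraphs caution against over-interpreting.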

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might


alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter, we've presented many scenarios (screening job candidates, peer review, parole hearings) that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute, and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ and not R (i.e., whether or not the loan was approved). Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same.


30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

This observational test is called the outcome test (row 7 in the summary table).
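Both tests are easy to state in code. The sketch below assumes a data frame with one row per applicant, a group label, the predicted default probability (available only to the bank), the decision, and the observed default outcome; all column names are illustrative.

```python
# Sketch of the sufficiency check (bank's view) and the outcome test (external view).
import pandas as pd

def sufficiency_check(df, bins=10):
    # Within narrow bins of the score R, compare default rates across groups
    # (in practice, only for applicants whose outcome is actually observed).
    df = df.assign(r_bin=pd.cut(df["score"], bins))
    return df.groupby(["r_bin", "group"], observed=True)["default"].mean().unstack()

def outcome_test(df):
    # Among approved applicants only, compare default rates across groups.
    approved = df[df["approved"] == 1]
    return approved.groupby("group")["default"].mean()
```

A large gap between groups within the same score bin points to a sufficiency violation; a gap in the second table is a predictive parity violation, which, as the next paragraphs explain, is not the same thing.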

Here is a tempting argument based on the outcome test: if one group (say, women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).
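A small simulation reproduces the point under made-up score distributions: both groups are thresholded on the same (perfectly calibrated) default probability, so sufficiency holds by construction, yet default rates among approved applicants differ.

```python
# Simulation of infra-marginality: same threshold, calibrated scores,
# yet the outcome test flags a difference between groups.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Made-up default-probability distributions: "women" concentrated at lower risk.
p_default_women = rng.beta(2, 8, n)    # mean 0.2
p_default_men = rng.beta(3, 5, n)      # mean 0.375

threshold = 0.3                        # approve anyone with predicted risk below 30%

def default_rate_among_approved(p):
    approved = p < threshold
    outcomes = rng.random(approved.sum()) < p[approved]   # defaults drawn from true risk
    return outcomes.mean()

print("women:", default_rate_among_approved(p_default_women))
print("men:  ", default_rate_among_approved(p_default_men))
# Men who receive loans default more often, even though the bar is identical
# and the score equals the true default probability (sufficiency holds).
```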

This phenomenon is called infra-marginality: the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal


31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

test, we treat the distribution of applicant characteristics as a given and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates: it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32
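If such a randomized holdout exists, the estimation itself is simple: error rates are computed per group only on the subjects whose decision was randomized to be positive, so outcomes are observed without selection bias. The column names in this sketch are assumptions for illustration.

```python
# Sketch: false negative rate by group, estimated on a randomized holdout
# in which everyone received a positive decision regardless of the prediction.
import pandas as pd

def fnr_by_group(holdout: pd.DataFrame) -> pd.Series:
    # holdout columns (assumed): group, prediction (0/1), outcome (0/1)
    positives = holdout[holdout["outcome"] == 1].copy()
    positives["missed"] = (positives["prediction"] == 0).astype(int)
    return positives.groupby("group")["missed"].mean()
```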

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible and important to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

Each entry lists the test or study design, the fairness notion or application, the access required, and notes or limitations.

1. Audit study. Fairness notion: blindness. Access: A-exp =, X =, R. Limitations: difficult to interpret.
2. Natural experiment (especially diff-in-diff). Fairness notion: impact of blinding. Access: A-exp ~, R. Limitations: confounding, SUTVA violations, other.
3. Natural experiment. Fairness notion: arbitrariness. Access: W ~, R. Limitations: unobserved confounders.
4. Natural experiment (especially regression discontinuity). Fairness notion: impact of decision. Access: R, Y or Y′. Limitations: sample size, confounding, other technical difficulties.
5. Regression analysis. Fairness notion: blindness. Access: X, A, R. Limitations: unreliable due to proxies.
6. Regression analysis. Fairness notion: conditional demographic parity. Access: X, A, R. Limitations: weak moral justification.
7. Outcome test. Fairness notion: predictive parity. Access: A, Y | Ŷ = 1. Limitations: infra-marginality.
8. Threshold test. Fairness notion: sufficiency. Access: X′, A, Y | Ŷ = 1. Limitations: model-specific.
9. Experiment. Fairness notion: separation / error rate parity. Access: A, R, Ŷ =, Y. Limitations: often unethical or impractical.
10. Observational test. Fairness notion: demographic parity. Access: A, R. Notes: see Chapter 2.
11. Mediation analysis. Fairness notion: "relevant" mechanism. Access: X, A, R. Notes: see Chapter 4.

Legend

• "=" indicates intervention on some variable (that is, "X =" does not represent a new random variable, but is simply an annotation describing how X is used in the test).

• "~": natural variation in some variable, exploited by the researcher.
• A-exp: exposure of a signal of the sensitive attribute to the decision maker.
• W: a feature that is considered irrelevant to the decision.
• X′: a set of features which may not coincide with those observed by the decision maker.
• Y′: an outcome that may or may not be the one that is the target of prediction.

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice toward a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
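A toy calculation makes the mechanism explicit. Under assumed numbers (base rates of the target variable that differ by group and an equally noisy qualification signal for everyone), the Bayes optimal posterior differs between two candidates with identical observables.

```python
# Toy illustration of statistical discrimination: identical observables,
# different optimal predictions, driven only by group base rates (all numbers made up).
def posterior_high_performance(signal_prob_given_high, signal_prob_given_low, base_rate):
    # P(high performance | positive signal) via Bayes' rule.
    numerator = signal_prob_given_high * base_rate
    denominator = numerator + signal_prob_given_low * (1 - base_rate)
    return numerator / denominator

# Same noisy signal quality for everyone; base rates assumed to differ by group.
p_signal_high, p_signal_low = 0.8, 0.4
print(posterior_high_performance(p_signal_high, p_signal_low, base_rate=0.7))  # group 1: ~0.82
print(posterior_high_performance(p_signal_high, p_signal_low, base_rate=0.5))  # group 2: ~0.67
```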

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business; as for statistical discrimination, either it is argued to be justified, or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men, a desire to compensate for past disadvantage suffered by women in STEM fields, a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated), a response to financial incentives for diversification frequently provided by universities to STEM departments, and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize: rather than a one-size-fits-all approach to understanding mechanisms such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.
39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the book's title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants (notably applicants from China) would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

an assignment of applicants to hospitals that optimizes mutual desirability.42
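As a toy illustration of the stability requirement described in the sidenote, the sketch below (with hypothetical preference lists, not the actual residency match data or algorithm) checks whether a proposed matching leaves any applicant and hospital who would both prefer to be matched to each other.

```python
def is_stable(match, applicant_prefs, hospital_prefs, capacities):
    """match maps each applicant to a hospital. The matching is stable if no
    applicant and hospital would both rather be matched to each other."""
    assigned = {h: [a for a, m in match.items() if m == h] for h in hospital_prefs}
    for a, prefs in applicant_prefs.items():
        current = match[a]
        cutoff = prefs.index(current) if current in prefs else len(prefs)
        for h in prefs[:cutoff]:  # hospitals that applicant a ranks above their match
            ranked = hospital_prefs[h]
            if len(assigned[h]) < capacities[h]:
                return False      # h has an open slot and a prefers h
            worst = max(assigned[h], key=ranked.index)
            if ranked.index(a) < ranked.index(worst):
                return False      # h prefers a to someone it actually matched
    return True

applicant_prefs = {"a1": ["h1", "h2"], "a2": ["h1", "h2"]}
hospital_prefs = {"h1": ["a1", "a2"], "h2": ["a1", "a2"]}
capacities = {"h1": 1, "h2": 1}
print(is_stable({"a1": "h1", "a2": "h2"}, applicant_prefs, hospital_prefs, capacities))  # True
print(is_stable({"a1": "h2", "a2": "h1"}, applicant_prefs, hospital_prefs, capacities))  # False
```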

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it was in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
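The error-rate comparison behind this kind of finding is straightforward to sketch. Assuming two lists of English-language tweets annotated by dialect (the placeholder tweets below are invented), and that the langid package is installed (its classify function returns a language code and a score), one can compute the per-group rate at which English tweets are misclassified as non-English.

```python
import langid

def non_english_rate(tweets):
    """Fraction of (actually English) tweets that are not labeled as English."""
    return sum(1 for t in tweets if langid.classify(t)[0] != "en") / len(tweets)

# Placeholder examples; a real audit would use dialect-annotated corpora.
aae_tweets = ["he woke af", "she be workin late"]
white_aligned_tweets = ["heading to work now", "so tired today"]

print("AAE false negative rate:", non_english_rate(aae_tweets))
print("White-aligned false negative rate:", non_english_rate(white_aligned_tweets))
```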

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task        | Examples                          | Sources of disparity                        | Harm
Perception          | Language id, speech-to-text       | Underrepresentation in training corpus      | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels, underrep. in training corpus  | Adverse decisions
Predicting outcomes | Resume filtering                  | Various, including human labels             | Adverse decisions
Sequence prediction | Language generation, translation  | Cultural stereotypes, historical prejudices | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.
57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper–and Improper–Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
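The threshold recalibration mentioned for the first issue can be sketched as follows: given a classifier's female-probability scores and true labels (the arrays below are synthetic placeholders), search for the cutoff that balances the two error rates, leaving the trained model untouched.

```python
import numpy as np

def balanced_threshold(scores, is_female):
    """Return the cutoff on P(female) that makes the rate of females classified
    as male closest to the rate of males classified as female."""
    best_t, best_gap = 0.5, float("inf")
    for t in np.linspace(0.05, 0.95, 181):
        female_as_male = np.mean(scores[is_female] < t)
        male_as_female = np.mean(scores[~is_female] >= t)
        gap = abs(female_as_male - male_as_female)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Toy data: scores skewed so that the default 0.5 cutoff misclassifies women more.
rng = np.random.default_rng(1)
is_female = rng.random(10_000) < 0.5
scores = np.where(is_female, rng.beta(4, 3, 10_000), rng.beta(1, 6, 10_000))
print("recalibrated threshold:", round(balanced_threshold(scores, is_female), 2))
```

Note that this kind of post-hoc adjustment does nothing for the second issue, since it only shifts the decision boundary rather than improving the underlying representation of darker faces.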

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

In general, this type of unfairness is hard to study in real systems (not just by external researchers but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But to reiterate, such tests almost always emphasize metrics of interest to the firm, rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate; reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms,'" New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
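The calibration-style comparison at the heart of such a study is simple: for each occupation, compare the share of women in the image search results with the share of women in that occupation's real-world workforce. The numbers below are made-up placeholders, not the study's data.

```python
# Hypothetical shares of women, per occupation, in search results vs. workforce.
search_share = {"author": 0.52, "bartender": 0.48, "construction worker": 0.05}
workforce_share = {"author": 0.56, "bartender": 0.55, "construction worker": 0.09}

for occ in search_share:
    gap = search_share[occ] - workforce_share[occ]
    print(f"{occ:>20}: search {search_share[occ]:.0%}, "
          f"workforce {workforce_share[occ]:.0%}, gap {gap:+.0%}")

# Stereotype exaggeration would show up as gaps that push already-skewed
# occupations even further from parity.
```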

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly, one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads that we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
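A toy simulation makes this third mechanism concrete. The prices and the rule for filling the order below are invented (not how any particular platform works): if the platform fills an order with the cheapest available impressions first, a small budget buys mostly impressions from the cheaper-to-reach group, while a large budget approaches the underlying 50/50 population split.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.choice(["men", "women"], size=n)                  # 50/50 population
# Hypothetical per-impression prices: women cost more to reach on average.
price = np.where(group == "women", 1.2, 0.8) * rng.lognormal(0, 0.3, n)

order = np.argsort(price)                                     # cheapest first
for budget in [1_000, 10_000, 60_000]:
    bought = order[np.cumsum(price[order]) <= budget]
    share_women = np.mean(group[bought] == "women")
    print(f"budget {budget:>6}: {len(bought)} impressions, {share_women:.0%} women")
```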

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of anti-semitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis, because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced, entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
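The headline comparison in an audit like this reduces to comparing two acceptance proportions. A minimal sketch using statsmodels is below; the counts are illustrative stand-ins chosen to match the reported rates, not the study's raw data.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts: accepted inquiries and total inquiries per name group
# (White-sounding names first, African-American-sounding names second).
accepted = [1600, 1344]
inquiries = [3200, 3200]

stat, pvalue = proportions_ztest(accepted, inquiries)
print(f"acceptance rates: {accepted[0]/inquiries[0]:.0%} vs {accepted[1]/inquiries[1]:.0%}, "
      f"z = {stat:.2f}, p = {pvalue:.4f}")
```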

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me–Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results, except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
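Restated as code, the reported rule is simple, which is exactly why it was easy to miss without the right hypothesis; the discount size and the exact distance cutoff below are assumptions for illustration.

```python
def displayed_price(base_price, miles_to_nearest_competitor_store, discount=0.05):
    """Show a discount only when a competitor's physical store is nearby (~20 miles)."""
    if miles_to_nearest_competitor_store <= 20:
        return round(base_price * (1 - discount), 2)
    return base_price

print(displayed_price(19.99, 8))   # near a competitor: discounted
print(displayed_price(19.99, 45))  # no nearby competitor: full price
```

Because store locations correlate with population density and income, and those correlate with race, a rule that never mentions any demographic attribute can still produce demographic disparities.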

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than a uniform


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they are able to be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria including demographic parity errorrate parity and calibration have received much attention in algorith-mic fairness studies Convenience has probably played a big role inthis choice these metrics are easy to gather and straightforward to re-port without necessarily connecting them to moral notions of fairnessWe reiterate our caution about the overuse of parity-based notionsparity should rarely be made a goal by itself At a minimum it isimportant to understand the sources and mechanisms that producedisparities as well as the harms that result from them before decidingon appropriate interventions
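As an illustration of how easily such metrics are tabulated (and how little they say by themselves), here is a minimal sketch with hypothetical column names (`group`, `score`, `decision`, `outcome`); error rates are discussed further in the section on selective labels below.

```python
import pandas as pd

# Hypothetical audit data: one row per decision subject, with columns
# group (sensitive attribute), score (model score in [0, 1]),
# decision (1 if favorable), and outcome (1 if the target event occurred).
df = pd.read_csv("audit_data.csv")

# Demographic parity: rate of favorable decisions per group.
print(df.groupby("group")["decision"].mean())

# Calibration: observed outcome rate per group within score bins.
# Under calibration by group, the group columns should roughly agree
# within each score bin.
df["score_bin"] = pd.cut(df["score"], bins=10)
print(df.groupby(["score_bin", "group"])["outcome"].mean().unstack())
```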

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity, such as searches and clicks, is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99
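To make steps 4 and 5 concrete, here is a minimal sketch, assuming the recorded page contents have been collected into two text files; the file names, feature choice, classifier, and number of permutations are illustrative, not taken from the original studies.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# pages_a / pages_b: raw text of pages recorded for simulated users in
# groups A and B (step 3), as produced by some crawling harness.
pages_a = open("pages_group_a.txt").read().split("\n---\n")
pages_b = open("pages_group_b.txt").read().split("\n---\n")

X = pages_a + pages_b
y = np.array([1] * len(pages_a) + [0] * len(pages_b))

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Step 4: cross-validated accuracy of distinguishing A-pages from B-pages.
observed = cross_val_score(clf, X, y, cv=5).mean()

# Step 5: permutation test -- how often does the same pipeline, trained on
# randomly shuffled labels, do at least as well as the observed accuracy?
rng = np.random.default_rng(0)
perm_scores = [
    cross_val_score(clf, X, rng.permutation(y), cv=5).mean() for _ in range(200)
]
p_value = (1 + sum(s >= observed for s in perm_scores)) / (1 + len(perm_scores))
print(f"accuracy={observed:.3f}, permutation p-value={p_value:.3f}")
```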

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on the observable behavior of the system is merely a proxy for it.

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
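A minimal sketch of this kind of measurement follows; the cookie name, URL, and price-extraction logic are hypothetical, since the journalists' actual harness is not public.

```python
import requests
from bs4 import BeautifulSoup

PRODUCT_URL = "https://www.example-retailer.com/product/12345"  # hypothetical

def price_for_zip(zip_code: str) -> float:
    """Fetch the product page as if the shopper were located in zip_code."""
    # The site is assumed to read the shopper's inferred location from a
    # cookie; overriding that cookie simulates a user in a different place.
    resp = requests.get(PRODUCT_URL, cookies={"zipcode": zip_code}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    tag = soup.select_one(".price")  # hypothetical CSS selector
    return float(tag.text.strip().lstrip("$"))

# Sweep over ZIP codes of interest and record the displayed price.
prices = {z: price_for_zip(z) for z in ["10001", "60614", "94305"]}
print(prices)
```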

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome, where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical, but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy, rather than the fairness, of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, one that is further complicated by the limitations of the available data. The paper documents the substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these


methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37, 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96, 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91, 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48, 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33, 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes," 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." JAMA 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78, 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-DeVries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59, 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30, 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


20 Eren and Mocan, "Emotional Judges and Unlucky Juveniles," American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

21 For readers unfamiliar with the culture of college football in the United States, the paper helpfully notes that "Describing LSU football just as an event would be a huge understatement for the residents of the state of Louisiana."
22 Danziger, Levav, and Avnaim-Pesso, "Extraneous Factors in Judicial Decisions," Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

23 In fact, it would be so extraordinary that it has been argued that the study should be dismissed simply based on the fact that the effect size observed is far too large to be caused by psychological phenomena such as judges' attention. See Lakens, "Impossibly Hungry Judges" (https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017).

24 Weinshall-Margel and Shapard, "Overlooked Factors in the Analysis of Parole Decisions," Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Louisiana juvenile courts, including sentence lengths.20 It found that in the week following an upset loss suffered by the Louisiana State University (LSU) football team, judges imposed sentences that were 7% longer on average. The impact was greater for Black defendants. The effect was driven entirely by judges who got their undergraduate degrees at LSU, suggesting that the effect is due to the emotional impact of the loss.21

Another well-known study on the supposed unreliability of judicial decisions is in fact a poster child for the danger of confounding variables in natural experiments. The study tested the relationship between the order in which parole cases are heard by judges and the outcomes of those cases.22 It found that the percentage of favorable rulings started out at about 65% early in the day, gradually dropped to nearly zero right before the judges' food break, and returned to about 65% after the break, with the same pattern repeated for the following food break. The authors suggested that judges' mental resources are depleted over the course of a session, leading to poorer decisions. It quickly became known as the "hungry judges" study, and has been widely cited as an example of the fallibility of human decision makers.

Figure 1 (from Danziger et al.): fraction of favorable rulings over the course of a day. The dotted lines indicate food breaks.

The finding would be extraordinary if the order of cases was truly random.23 The authors were well aware that the order wasn't random, and performed a few tests to see if it is associated with factors pertinent to the case (since those factors might also impact the probability of a favorable outcome in a legitimate way). They did not find such factors. But it turned out they didn't look hard enough. A follow-up investigation revealed multiple confounders and potential confounders, including the fact that prisoners without an attorney are presented last within each session and tend to prevail at a much lower rate.24 This invalidates the conclusion of the original study.


25 Huq, "Racial Equity in Algorithmic Criminal Justice," Duke LJ 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84% and 86% is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice. Banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren't admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision (or a fairness intervention) on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we'd like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects, say, on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85%. Then we can compare the salary of students with scores of 85% to 86% (and who were thus awarded the scholarship) with those of students with scores of 84% to 85% (and who were thus not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus, we can estimate the impact of the scholarship as if we did a randomized controlled trial.
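A minimal sketch of such a comparison follows, assuming a dataframe with hypothetical columns `score` and `salary`; a real regression discontinuity analysis would typically use local linear regression and check robustness across bandwidths.

```python
import pandas as pd

df = pd.read_csv("scholarship_outcomes.csv")  # hypothetical data file
CUTOFF, BANDWIDTH = 85.0, 1.0

# Keep only students within a narrow band around the scholarship cutoff.
window = df[(df.score >= CUTOFF - BANDWIDTH) & (df.score < CUTOFF + BANDWIDTH)]
treated = window[window.score >= CUTOFF]   # just above the cutoff: awarded
control = window[window.score < CUTOFF]    # just below the cutoff: not awarded

# Naive regression-discontinuity estimate: difference in mean salaries
# between students just above and just below the cutoff.
effect = treated.salary.mean() - control.salary.mean()
print(f"estimated effect of scholarship on salary: {effect:.0f} "
      f"(n_treated={len(treated)}, n_control={len(control)})")
```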

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students


27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.
28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

who receive the scholarship is precisely those who are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions28 (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.

We can, of course, try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
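A minimal sketch of such a regression-based comparison, with hypothetical column names and using statsmodels; the coefficient on gender before and after adding controls is what the competing claims turn on.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("payroll.csv")  # hypothetical: wage, gender, gpa, review_score

# Raw gap: regress hourly wage on gender alone.
raw = smf.ols("wage ~ C(gender)", data=df).fit()

# "Adjusted" gap: a test for conditional demographic parity with respect to
# the chosen controls (which controls to include is the contested part).
adjusted = smf.ols("wage ~ C(gender) + gpa + review_score", data=df).fit()

print(raw.params.filter(like="gender"))
print(adjusted.params.filter(like="gender"))
```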

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might


alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter, we've presented many scenarios (screening job candidates, peer review, parole hearings) that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
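For a decision maker with access to its own scores and realized outcomes, this is a simple tabulation; a minimal sketch with hypothetical columns `score`, `group`, and `default`:

```python
import pandas as pd

loans = pd.read_csv("loan_outcomes.csv")  # hypothetical internal bank data

# Sufficiency check: within narrow score bins, the default rate should be
# (approximately) the same for every group.
loans["score_bin"] = pd.cut(loans["score"], bins=20)
table = loans.groupby(["score_bin", "group"])["default"].agg(["mean", "size"])
print(table.unstack("group"))
```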

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ (i.e., whether or not the loan was approved) and not R. Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) be the same for every group (A). This


30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

observational test is called the outcome test (row 7 in the summary table).

Here is a tempting argument based on the outcome test: if one group (say, women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification to women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups, women (orange) and men (blue).
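This thought experiment is easy to simulate; a minimal sketch with made-up group distributions shows sufficiency holding by construction (same threshold, scores equal to true default probabilities) while default rates among approved applicants differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# True default probabilities (the Bayes optimal score R) for two groups,
# drawn from different made-up distributions.
r_women = rng.beta(2, 8, size=100_000)  # concentrated at lower default risk
r_men = rng.beta(3, 6, size=100_000)    # shifted towards higher default risk

threshold = 0.3  # same approval bar for both groups: approve if R < threshold

# Outcomes are drawn from the scores themselves, so within any narrow band
# of R the default rate is the same for both groups (sufficiency holds).
y_women = rng.random(r_women.size) < r_women
y_men = rng.random(r_men.size) < r_men

# Outcome test: default rate among approved applicants, by group.
print("women:", y_women[r_women < threshold].mean())
print("men:  ", y_men[r_men < threshold].mean())
# The rates differ even though the same bar was applied to both groups.
```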

This phenomenon is called infra-marginality: the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal


31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward, even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates; it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32
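If such a randomized holdout exists, error rates can be estimated without selection bias by restricting to it; a minimal sketch with hypothetical columns `randomized`, `group`, `predicted`, and `outcome`:

```python
import pandas as pd

df = pd.read_csv("decisions.csv")  # hypothetical log of decisions and outcomes

# Restrict to the randomized holdout: subjects who received a positive
# decision regardless of the model's prediction, so outcomes are observed
# for high- and low-scoring subjects alike.
holdout = df[df.randomized == 1]

# False positive rate: predicted positive among those with a negative outcome.
fpr = holdout[holdout.outcome == 0].groupby("group")["predicted"].mean()
# False negative rate: predicted negative among those with a positive outcome.
fnr = 1 - holdout[holdout.outcome == 1].groupby("group")["predicted"].mean()
print(pd.DataFrame({"FPR": fpr, "FNR": fnr}))
```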

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible and important to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

Test / study design | Fairness notion / application | Access | Notes / limitations
1. Audit study | Blindness | A-exp =, X =, R | Difficult to interpret
2. Natural experiment (especially diff-in-diff) | Impact of blinding | A-exp ~, R | Confounding, SUTVA violations, other
3. Natural experiment | Arbitrariness | W ~, R | Unobserved confounders
4. Natural experiment (especially regression discontinuity) | Impact of decision | R, Y or Y′ | Sample size, confounding, other technical difficulties
5. Regression analysis | Blindness | X, A, R | Unreliable due to proxies
6. Regression analysis | Conditional demographic parity | X, A, R | Weak moral justification
7. Outcome test | Predictive parity | A, Y (given Ŷ = 1) | Infra-marginality
8. Threshold test | Sufficiency | X′, A, Y (given Ŷ = 1) | Model-specific
9. Experiment | Separation / error rate parity | A, R, Y, Ŷ = | Often unethical or impractical
10. Observational test | Demographic parity | A, R | See Chapter 2
11. Mediation analysis | "Relevant" mechanism | X, A, R | See Chapter 4

Legend:

  • = : indicates intervention on some variable (that is, "X =" does not represent a new random variable, but is simply an annotation describing how X is used in the test)
  • ~ : natural variation in some variable, exploited by the researcher
  • A-exp: exposure of a signal of the sensitive attribute to the decision maker
  • W: a feature that is considered irrelevant to the decision
  • X′: a set of features which may not coincide with those observed by the decision maker
  • Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
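To see why the optimal prediction differs by group, here is a minimal worked example in the style of the classic statistical discrimination models; the normality assumptions and notation are ours, added for illustration.

```latex
% Suppose the target variable q (e.g., productivity) satisfies
% q ~ N(\mu_a, \sigma_q^2) in group a, and the decision maker observes only
% a noisy signal x = q + \varepsilon with \varepsilon ~ N(0, \sigma_\varepsilon^2).
% The Bayes optimal prediction given the signal and the group is
\mathbb{E}[q \mid x, A = a]
  = \lambda x + (1 - \lambda)\,\mu_a,
  \qquad
  \lambda = \frac{\sigma_q^2}{\sigma_q^2 + \sigma_\varepsilon^2}.
% Whenever the group means \mu_a differ and the signal is noisy (\lambda < 1),
% two candidates with the identical signal x receive different predictions,
% each shrunk toward their own group's mean.
```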

While some authors put much weight understanding discrimina-tion based on the taste-based vs statistical categorization we willde-emphasize it in this book Several reasons motivate our choiceFirst since we are interested in extracting lessons for statistical deci-sion making systems the distinction is not that helpful such systemswill not exhibit taste-based discrimination unless prejudice is explic-itly programmed into them (while that is certainly a possibility it isnot a primary concern of this book)

Second there are practical difficulties in distinguishing betweentaste-based and statistical discrimination Often what might seemto be a ldquotasterdquo for discrimination is simply the result of an imperfectunderstanding of the decision-makerrsquos information and beliefsFor example at first sight the findings of the car bargaining studymay look like a clear-cut case of taste-based discrimination Butmaybe the dealer knows that different customers have differentaccess to competing offers and therefore have different willingnessto pay for the same item Then the dealer uses race as a proxy forthis amount (correctly or not) In fact the paper provides tentativeevidence towards this interpretation The reverse is also possible ifthe researcher does not know the full set of features observed by thedecision maker taste-based discrimination might be mischaracterizedas statistical discrimination

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and statistical discrimination is argued to be either justified or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize: rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews, in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree in the book's title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants (notably applicants from China) would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42

42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece. (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014.)

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.
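For concreteness, the following small sketch (entirely hypothetical names and preferences, not the matching system's actual code) checks the stability requirement described in sidenote 42 for a proposed applicant–hospital assignment:

```python
# Check the requirement from sidenote 42: applicant A may be left out of hospital H
# only if A got a hospital they ranked above H, or H filled its slots with
# applicants it ranked above A. All data here is hypothetical.
applicant_prefs = {"ana": ["mercy", "city"], "bo": ["city", "mercy"]}
hospital_prefs = {"mercy": ["bo", "ana"], "city": ["ana", "bo"]}
capacity = {"mercy": 1, "city": 1}
match = {"ana": "city", "bo": "mercy"}  # proposed assignment to verify

def rank(prefs, who, choice):
    return prefs[who].index(choice)

def is_stable(match):
    assigned = {h: [a for a, m in match.items() if m == h] for h in hospital_prefs}
    for a, prefs in applicant_prefs.items():
        for h in prefs:
            if match[a] == h:
                break  # a prefers their own match to every hospital below this point
            # a ranks h above their current match; acceptable only if h is full
            # with applicants it ranks above a
            filled_with_better = len(assigned[h]) >= capacity[h] and all(
                rank(hospital_prefs, h, b) < rank(hospital_prefs, h, a)
                for b in assigned[h]
            )
            if not filled_with_better:
                return False  # (a, h) would rather be matched to each other
    return True

print(is_stable(match))  # True for this toy instance
```

The couples problem in the text is precisely a preference ("we want the same city") that cannot be written as two such individual ranked lists.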

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems has become a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

45 For a treatise on AAE, see Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.

47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).

48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.

53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
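As an illustration of the kind of audit involved (not the cited study's code or data), the sketch below trains a toy character n-gram language identifier with scikit-learn and compares false negative rates on two small, made-up sets of English tweets:

```python
# Illustrative sketch: train a simple character n-gram language identifier, then
# compare false negative rates ("classified as non-English") across two dialect
# samples. The training data and tweets are tiny and invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["where are you going today", "this is a very good movie",
               "no voy a la escuela manana", "el clima esta muy bonito hoy"]
train_langs = ["en", "en", "es", "es"]

langid = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
langid.fit(train_texts, train_langs)

def false_negative_rate(tweets, english_label="en"):
    """Fraction of known-English tweets that the model labels as non-English."""
    preds = langid.predict(tweets)
    return sum(p != english_label for p in preds) / len(tweets)

aae_tweets = ["he be workin late", "they tryna leave rn"]            # hypothetical
white_aligned_tweets = ["he is working late", "they are trying to leave"]
print(false_negative_rate(aae_tweets), false_negative_rate(white_aligned_tweets))
```

A real audit would use large, validated corpora for each dialect; the point here is only the shape of the measurement.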

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups,49 compared to human graders who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54

54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task        | Examples                          | Sources of disparity                                  | Harm
Perception          | Language id, speech-to-text       | Underrepresentation in training corpus                | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels, underrepresentation in training corpus  | Adverse decisions
Predicting outcomes | Resume filtering                  | Various, including human labels                       | Adverse decisions
Sequence prediction | Language generation, translation  | Cultural stereotypes, historical prejudices           | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6% difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).

59 Martineau, "Cities Examine Proper - and Improper - Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).

60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.

61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).

62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
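A minimal sketch of the first issue and its threshold fix, using synthetic classifier scores rather than the audited systems' outputs:

```python
# Synthetic illustration: measure the two directions of misclassification, then
# move the decision threshold so they are roughly balanced. Not the audited
# systems' code; scores below are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_female = rng.random(n) < 0.5
# Hypothetical scores for P(female); female faces get somewhat lower scores,
# producing excess female->male errors at the default threshold of 0.5.
scores = np.where(true_female, rng.normal(0.62, 0.15, n), rng.normal(0.30, 0.15, n))

def error_rates(threshold):
    pred_female = scores >= threshold
    female_to_male = np.mean(~pred_female[true_female])   # misclassified female faces
    male_to_female = np.mean(pred_female[~true_female])   # misclassified male faces
    return female_to_male, male_to_female

print("at threshold 0.5:", error_rates(0.5))

# Choose the threshold that makes the two error rates as close as possible.
candidates = np.linspace(0.05, 0.95, 181)
best = min(candidates, key=lambda t: abs(error_rates(t)[0] - error_rates(t)[1]))
print("rebalanced threshold:", round(best, 2), error_rates(best))
```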

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place. By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.
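The following toy simulation (ours, not the cited paper's model) illustrates the underperformance mechanism: a pooled recommender fit mostly on the majority group's feedback predicts the minority group's preferences poorly.

```python
# Toy simulation: one group is both smaller and has different preferences, so a
# recommender fit to pooled ratings serves it worse. The "recommender" here is
# just per-item mean ratings; everything is synthetic.
import numpy as np

rng = np.random.default_rng(1)
n_items = 50
pref_major = rng.uniform(1, 5, n_items)   # majority group's true item appeal
pref_minor = rng.uniform(1, 5, n_items)   # minority group's differs

def sample_ratings(prefs, n_ratings):
    items = rng.integers(0, n_items, n_ratings)
    ratings = np.clip(prefs[items] + rng.normal(0, 0.5, n_ratings), 1, 5)
    return items, ratings

items_a, ratings_a = sample_ratings(pref_major, 9000)   # majority: many ratings
items_b, ratings_b = sample_ratings(pref_minor, 1000)   # minority: few ratings

all_items = np.concatenate([items_a, items_b])
all_ratings = np.concatenate([ratings_a, ratings_b])
item_means = np.array([all_ratings[all_items == i].mean() for i in range(n_items)])

def rmse(prefs):
    """Error of the pooled predictions against one group's true preferences."""
    return np.sqrt(np.mean((item_means - prefs) ** 2))

print("majority RMSE:", round(rmse(pref_major), 2))
print("minority RMSE:", round(rmse(pref_minor), 2))   # typically much larger
```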

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate; reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.
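As a much-simplified illustration of the proxy idea only (the paper's estimator additionally controls for demographic-specific behavior and infers pairwise latent differences), one could compare raw reformulation rates by group in a hypothetical search log:

```python
# Crude sketch, not the cited paper's method: reformulation rate by group in a
# hypothetical log with columns user_group and reformulated (0/1).
import pandas as pd

log = pd.DataFrame({
    "user_group": ["18-24", "18-24", "18-24", "50+", "50+", "50+"],
    "reformulated": [0, 1, 0, 1, 1, 0],
})

# Raw group-level rates; attributing gaps to satisfaction requires the further
# controls described in the text, since groups also differ in search behavior.
print(log.groupby("user_group")["reformulated"].mean())
```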

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation, rather than antidiscrimination law.67

67 For in-depth treatments of the history and politics of information platforms, see Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598.

68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).

69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.

70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.
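An illustrative version of such a calibration check, with made-up numbers standing in for the measured search-result and occupational shares:

```python
# Illustrative calibration check in the spirit of the occupation study; all
# numbers are invented. Compare the share of women in image search results for
# an occupation against the real-world share for that occupation.
occupations = {
    # occupation:          (share of women in search results, share in workforce)
    "author":              (0.52, 0.56),
    "bartender":           (0.48, 0.60),
    "construction worker": (0.05, 0.04),
}

for occ, (search_share, real_share) in occupations.items():
    gap = search_share - real_share
    # Systematic gaps that push female-skewed jobs lower and male-skewed jobs
    # higher would indicate stereotype exaggeration, not just miscalibration.
    print(f"{occ:20s} search={search_share:.2f} real={real_share:.2f} gap={gap:+.2f}")
```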

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways: for example, a health magazine might carry ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals; the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs); and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category. (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844.)

73 See, e.g., Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89.

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly, one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
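A toy illustration of this market effect, with entirely made-up prices: a fixed budget spent on the cheapest available impressions first overrepresents the cheaper group, and more so at small budgets.

```python
# Toy model, invented numbers: the platform buys the cheapest impressions first
# until the budget runs out. If one group's impressions are cheaper on average,
# small budgets skew the audience toward that group.
import numpy as np

rng = np.random.default_rng(2)
men_prices = rng.uniform(0.005, 0.011, 50_000)     # hypothetical $ per impression
women_prices = rng.uniform(0.009, 0.015, 50_000)   # pricier, e.g. higher click rates

inventory = np.concatenate([
    np.stack([men_prices, np.zeros_like(men_prices)], axis=1),     # 0 = man
    np.stack([women_prices, np.ones_like(women_prices)], axis=1),  # 1 = woman
])
inventory = inventory[inventory[:, 0].argsort()]    # cheapest impressions first
cumulative_cost = np.cumsum(inventory[:, 0])

for budget in (20, 200, 600):
    bought = inventory[cumulative_cost <= budget]
    print(f"budget ${budget}: share of women reached = {bought[:, 1].mean():.2f}")
```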

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interacting with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of anti-semitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

81 This is not to say that discrimination is nonexistent. See, e.g., Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917.

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.

83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those in traditional settings such as housing and employment: a combination of audit studies and observational methods has been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
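For a sense of the analysis such an audit enables (counts below are hypothetical, chosen only to roughly match the reported 50% and 42% rates, and are not the study's data), a two-proportion z-test compares acceptance rates across the two name groups:

```python
# Hypothetical counts for illustration; a two-proportion z-test on acceptance
# rates for inquiries from White-sounding vs. African-American-sounding names.
from statsmodels.stats.proportion import proportions_ztest

accepted = [1600, 1344]    # hypothetical accepted inquiries per name group
inquiries = [3200, 3200]   # hypothetical total inquiries per name group

stat, pvalue = proportions_ztest(count=accepted, nobs=inquiries)
print(f"acceptance rates: {accepted[0]/inquiries[0]:.2f} vs {accepted[1]/inquiries[1]:.2f}")
print(f"z = {stat:.2f}, p = {pvalue:.4f}")
```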

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me - Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms offers a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc.

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).

92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.

93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results, except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."

95 But see Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.

96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

A final example to reinforce the fact that disparity-producing mechanisms can be subtle and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably, this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
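A tiny sketch of the reported rule as we understand it (our paraphrase, with made-up prices and discount size):

```python
# Our paraphrase of the reported distance-based pricing rule; numbers are
# invented. The wealth correlation with ZIP codes is only an indirect effect.
def displayed_price(base_price, miles_to_nearest_competitor, threshold_miles=20):
    """Return the price a customer sees under the distance-based rule."""
    if miles_to_nearest_competitor <= threshold_miles:
        return round(base_price * 0.95, 2)   # hypothetical 5% discount
    return base_price

print(displayed_price(30.00, 8))    # near a competitor's store: discounted
print(displayed_price(30.00, 45))   # far from competitors: full price
```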

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than through a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

The second sense of arbitrariness applies even when there is auniform procedure if that procedure relies on a consideration offactors that are thought to be irrelevant either statistically or morallySince machine learning excels at finding correlations it commonlyidentifies factors that seem puzzling or blatantly unacceptable Forexample in aptitude tests such as the Graduate Record Examinationessays are graded automatically Although e-rater and other toolsused for this purpose are subject to validation checks and are foundto perform similarly to human raters on samples of actual essaysthey are able to be fooled into giving perfect scores to machine-generated gibberish Recall that there is no straightforward criterionthat allows us to assess if a feature is morally valid (Chapter 3) andthis question must be debated on a case-by-case basis

More serious issues arise when classifiers are not even subjectedto proper validity checks For example there are a number of com-panies that claim to predict candidatesrsquo suitability for jobs basedon personality tests or body language and other characteristics invideos97 There is no peer-reviewed evidence that job performance ispredictable using these factors and no basis for such a belief Thuseven if these systems donrsquot produce demographic disparities they areunfair in the sense of being arbitrary candidates receiving an adversedecision lack due process to understand the basis for the decisioncontest it or determine how to improve their chances of success

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful method for testing violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online/, 2012).

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
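To make steps 4 and 5 concrete, here is a minimal sketch of the classification step, under assumptions of our own: the recorded pages are available as two lists of text (the names pages_a and pages_b are placeholders), a bag-of-words model with logistic regression stands in for whatever classifier the researcher prefers, and scikit-learn's permutation_test_score is used as a stand-in for the permutation test described in the cited paper.

```python
# Minimal sketch of the adversarial test's classification step (steps 4-5).
# pages_a and pages_b are assumed to be lists of recorded page contents
# (strings) for the two groups of simulated users; these names are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

def adversarial_test(pages_a, pages_b, n_permutations=1000):
    texts = pages_a + pages_b
    labels = [0] * len(pages_a) + [1] * len(pages_b)

    # Bag-of-words representation of the recorded pages.
    X = TfidfVectorizer(max_features=5000).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)

    # Cross-validated accuracy plus a permutation test: labels are shuffled
    # many times to estimate how often an accuracy this high would arise if
    # there were no systematic difference between the two groups.
    score, perm_scores, p_value = permutation_test_score(
        clf, X, labels, scoring="accuracy",
        n_permutations=n_permutations, cv=5)
    return score, p_value

# Accuracy close to 0.5 with a large p-value is consistent with no detectable
# information flow; significantly higher accuracy suggests the content seen by
# the two groups differed systematically.
```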

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
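As a rough illustration of this kind of measurement, here is a hypothetical sketch using the requests library. The cookie name, URL, and price-parsing logic are placeholders of our own invention, not the actual details of the staples.com investigation; in practice the researcher must first identify which cookie stores the inferred location and how prices appear in the page.

```python
# Hypothetical sketch of per-ZIP-code price measurement via a location cookie.
# The cookie name, URL, and parsing logic are placeholders for illustration.
import re
import requests

PRODUCT_URL = "https://www.example.com/product/12345"  # placeholder URL

def extract_price(html):
    # Placeholder parser: grabs the first dollar amount found on the page.
    match = re.search(r"\$(\d+(?:\.\d{2})?)", html)
    return float(match.group(1)) if match else None

def fetch_price_for_zip(zip_code):
    # Overwrite the cookie that (hypothetically) stores the inferred location.
    response = requests.get(PRODUCT_URL, cookies={"zipcode": zip_code}, timeout=10)
    response.raise_for_status()
    return extract_price(response.text)

def measure_all_zips(zip_codes):
    # Record the displayed price for each ZIP code for later analysis, e.g.,
    # correlating discounts with distance to competitors' stores.
    return {z: fetch_price_for_zip(z) for z in zip_codes}
```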

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108
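To give a flavor of the dynamics such simulations can expose, here is a toy model of our own, loosely inspired by the feedback-loop phenomenon studied in the predictive policing work cited above (it is not the cited authors' simulation). Two neighborhoods have the same true crime rate, but patrols are allocated in proportion to previously recorded incidents, so an initial imbalance in the records never self-corrects and grows in absolute terms.

```python
# Toy simulation of a feedback loop: patrols are allocated in proportion to
# recorded incidents, even though both neighborhoods have the same true rate.
# This is an illustrative model only.
import random

def simulate(days=1000, true_rate=0.3, patrols_per_day=10, seed=0):
    random.seed(seed)
    recorded = [1, 2]  # slight initial imbalance in historical records
    for _ in range(days):
        total = sum(recorded)
        for n in (0, 1):
            # Allocate patrols in proportion to past recorded incidents.
            patrols = round(patrols_per_day * recorded[n] / total)
            # Each patrol records a crime with the same probability in either
            # neighborhood, since the underlying rates are equal.
            recorded[n] += sum(random.random() < true_rate for _ in range(patrols))
    return recorded

# The gap in recorded incidents grows over time even though the true rates
# are identical; records reflect where patrols went, not where crime is.
print(simulate())
```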

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, one that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.
Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.
Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes," 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.

Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.
Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References

25 Huq, "Racial Equity in Algorithmic Criminal Justice," Duke LJ 68 (2018): 1043.

26 For example, if the variation (standard error) in test scores for students of identical ability is 5 percentage points, then the difference between 84 and 86 is of minimal significance.

Testing the impact of decisions and interventions

An underappreciated aspect of fairness in decision making is the impact of the decision on the decision subject. In our prediction framework, the target variable (Y) is not impacted by the score or prediction (R). But this is not true in practice. Banks set interest rates for loans based on the predicted risk of default, but setting a higher interest rate makes a borrower more likely to default. The impact of the decision on the outcome is a question of causal inference.

There are other important questions we can ask about the impact of decisions. What is the utility or cost of a positive or negative decision to different decision subjects (and groups)? For example, admission to a college may have a different utility to different applicants based on the other colleges where they were or weren't admitted. Decisions may also have effects on people who are not decision subjects: for instance, incarceration impacts not just individuals but communities.25 Measuring these costs allows us to be more scientific about setting decision thresholds and adjusting the tradeoff between false positives and negatives in decision systems.

One way to measure the impact of decisions is via experiments, but again, they can be infeasible for legal, ethical, and technical reasons. Instead, we highlight a natural experiment design for testing the impact of a decision (or a fairness intervention) on the candidates, called regression discontinuity (row 4 in the summary table).

Suppose we'd like to test if a merit-based scholarship program for first-generation college students has lasting beneficial effects, say, on how much they earn after college. We cannot simply compare the average salary of students who did and did not win the scholarship, as those two variables may be confounded by intrinsic ability or other factors. But suppose the scholarships were awarded based on test scores, with a cutoff of 85. Then we can compare the salary of students with scores of 85 to 86 (and thus were awarded the scholarship) with those of students with scores of 84 to 85 (and thus were not awarded the scholarship). We may assume that within this narrow range of test scores, scholarships are awarded essentially randomly.26 Thus, we can estimate the impact of the scholarship as if we did a randomized controlled trial.
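As a minimal sketch of what this comparison might look like in code, here is a local regression discontinuity estimate using statsmodels. The data frame, its column names (score, salary), and the one-point bandwidth are illustrative assumptions of ours, not details of any actual study.

```python
# Minimal sketch of a regression discontinuity estimate around a cutoff of 85.
# df is assumed to be a pandas DataFrame with columns "score" and "salary";
# the one-point bandwidth is an illustrative choice.
import statsmodels.formula.api as smf

def rd_estimate(df, cutoff=85.0, bandwidth=1.0):
    # Keep only observations close to the threshold, where treatment is
    # plausibly as good as random.
    window = df[(df["score"] >= cutoff - bandwidth) &
                (df["score"] < cutoff + bandwidth)].copy()
    window["treated"] = (window["score"] >= cutoff).astype(int)
    window["centered"] = window["score"] - cutoff

    # Local linear regression: the coefficient on "treated" estimates the jump
    # in expected salary at the cutoff, i.e., the scholarship's effect.
    model = smf.ols("salary ~ treated + centered", data=window).fit()
    return model.params["treated"], model.bse["treated"]
```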

We need to be careful, though. If we consider too narrow a band of test scores around the threshold, we may end up with insufficient data points for inference. If we consider a wider band of test scores, the students in this band may no longer be exchangeable units for the analysis.

Another pitfall arises because we assumed that the set of students who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.
28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.27

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions28 (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.

We can, of course, try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).29
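Here is a rough sketch of what such a regression looks like in statsmodels. The data frame and column names are illustrative assumptions of ours, and, as the margin note warns, this style of test leans on strong assumptions about the functional form relating the controls to the outcome.

```python
# Sketch of a conditional demographic parity test via regression.
# df is assumed to be a pandas DataFrame with illustrative columns
# "log_wage", "gender", "gpa", and "review_score".
import statsmodels.formula.api as smf

def pay_gap_regressions(df):
    # Unconditional gap: the coefficient on gender measures the raw disparity.
    raw = smf.ols("log_wage ~ C(gender)", data=df).fit()

    # Conditional gap: the same coefficient after controlling for GPA and
    # performance review scores (a test of conditional demographic parity).
    conditional = smf.ols(
        "log_wage ~ C(gender) + gpa + review_score", data=df).fit()
    return raw, conditional
```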

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter, we've presented many scenarios (screening job candidates, peer review, parole hearings) that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe the decision Ŷ (i.e., whether or not the loan was approved) and not R. Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same. This observational test is called the outcome test (row 7 in the summary table).

30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Here is a tempting argument based on the outcome test: if one group (say, women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.
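The following toy simulation, our own illustration rather than anything from the cited study, reproduces the logic of the figure: default risk is distributed differently in the two groups, the bank approves everyone below a single risk threshold (so sufficiency holds by construction), and yet the outcome test flags a disparity in default rates among approved applicants.

```python
# Toy illustration of infra-marginality: a single threshold on a calibrated
# risk score (sufficiency holds by construction), yet default rates among
# approved applicants differ by group. Distribution parameters are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

risk_a = rng.beta(2, 8, size=n)   # group A: mean default risk 0.2
risk_b = rng.beta(3, 5, size=n)   # group B: mean default risk 0.375

threshold = 0.4                   # the same bar for both groups
default_a = rng.random(n) < risk_a  # defaults occur at the stated risk,
default_b = rng.random(n) < risk_b  # so the scores are calibrated

# Outcome test: default rate among approved (below-threshold) applicants.
print("Group A approved default rate:", default_a[risk_a < threshold].mean())
print("Group B approved default rate:", default_b[risk_b < threshold].mean())
# The rates differ because approved applicants in group B sit closer to the
# threshold on average, not because the bank applied a different bar.
```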

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.
32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test, analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward, even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates: it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31
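A small simulation of our own construction makes the distortion concrete: outcomes are only observed for approved applicants, so any estimate computed from the observed labels differs from the population quantity, and error rates involving rejected applicants cannot be computed at all.

```python
# Toy illustration of the selective labels problem: labels are observed only
# for approved applicants, so naive performance estimates are biased.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
risk = rng.beta(2, 5, size=n)       # true default probability
default = rng.random(n) < risk      # outcome, observable only if approved
approved = risk < 0.35              # decision based on the risk score

print("Default rate in full population:   ", round(default.mean(), 3))
print("Default rate among approved (seen):", round(default[approved].mean(), 3))
# An evaluator who only sees labels for approved applicants would conclude
# that defaults are much rarer than they are, and could not measure, e.g.,
# how many rejected applicants would in fact have repaid.
```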

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible and important to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

1. Audit study. Fairness notion: blindness. Access: A-exp=, X=; R. Notes/limitations: difficult to interpret.
2. Natural experiment (especially diff-in-diff). Fairness notion: impact of blinding. Access: A-exp∼; R. Notes/limitations: confounding, SUTVA violations, other.
3. Natural experiment. Fairness notion: arbitrariness. Access: W∼; R. Notes/limitations: unobserved confounders.
4. Natural experiment (especially regression discontinuity). Fairness notion: impact of decision. Access: R; Y or Y′. Notes/limitations: sample size, confounding, other technical difficulties.
5. Regression analysis. Fairness notion: blindness. Access: X, A, R. Notes/limitations: unreliable due to proxies.
6. Regression analysis. Fairness notion: conditional demographic parity. Access: X, A, R. Notes/limitations: weak moral justification.
7. Outcome test. Fairness notion: predictive parity. Access: A, Y | Ŷ = 1. Notes/limitations: infra-marginality.
8. Threshold test. Fairness notion: sufficiency. Access: X′, A, Y | Ŷ = 1. Notes/limitations: model-specific.
9. Experiment. Fairness notion: separation / error rate parity. Access: A, R, Y, Ŷ=. Notes/limitations: often unethical or impractical.
10. Observational test. Fairness notion: demographic parity. Access: A, R. Notes/limitations: see Chapter 2.
11. Mediation analysis. Fairness notion: "relevant" mechanism. Access: X, A, R. Notes/limitations: see Chapter 4.

Legend

• =: indicates intervention on some variable (that is, X= does not represent a new random variable, but is simply an annotation describing how X is used in the test)
• ∼: natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor markets.33

33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).
34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
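A tiny numerical sketch, with numbers of our own choosing, illustrates the second condition: when a noisy signal is combined with group-level base rates, the Bayes-optimal prediction differs between two candidates with identical observed characteristics.

```python
# Toy model of statistical discrimination: the optimal prediction combines an
# individual's noisy signal with the group's base rate, so identical signals
# yield different predictions. The numbers are arbitrary illustrations.
def p_good_given_positive_signal(base_rate, signal_accuracy=0.7):
    # Bayes rule: P(good fit | positive signal), where the signal is correct
    # with probability signal_accuracy regardless of group.
    true_pos = signal_accuracy * base_rate
    false_pos = (1 - signal_accuracy) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# Same observed signal, but the employer believes the groups have different
# base rates of the target variable:
print(p_good_given_positive_signal(base_rate=0.6))  # about 0.78
print(p_good_given_positive_signal(base_rate=0.5))  # exactly 0.70
```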

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
37 Note that if this assumption is correct, then a preference for female candidates is both accuracy-maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects, and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, either it is argued to be justified, or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize: rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree in the book's title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants (notably applicants from China) would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems, but we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42

42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece. (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014.)

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.
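To make the matching step concrete, the sketch below implements the textbook deferred-acceptance (Gale-Shapley) procedure on a toy instance. It is not the actual Resident Match implementation, and all names and preference lists are made up. The output satisfies the stability property described in the sidenote above. Note also that the interface accepts only each applicant's individual ranking: there is no way for a couple to state a joint preference such as "hospital X for me only if hospital Y for my partner," which is the root of the problem described above.

```python
from collections import deque

def deferred_acceptance(applicant_prefs, hospital_prefs, capacities):
    """Applicant-proposing deferred acceptance (Gale-Shapley).

    applicant_prefs: dict applicant -> list of hospitals, most preferred first
    hospital_prefs:  dict hospital  -> list of applicants, most preferred first
    capacities:      dict hospital  -> number of slots
    Returns a stable matching as a dict applicant -> hospital.
    """
    # Rank lookup so a hospital can compare applicants quickly.
    rank = {h: {a: i for i, a in enumerate(p)} for h, p in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}   # index of next hospital to propose to
    held = {h: [] for h in hospital_prefs}          # tentatively accepted applicants
    free = deque(applicant_prefs)                   # applicants with no tentative match

    while free:
        a = free.popleft()
        prefs = applicant_prefs[a]
        if next_choice[a] >= len(prefs):
            continue                                # list exhausted: applicant stays unmatched
        h = prefs[next_choice[a]]
        next_choice[a] += 1
        if a not in rank[h]:
            free.append(a)                          # hospital did not rank this applicant
            continue
        held[h].append(a)
        held[h].sort(key=lambda x: rank[h][x])      # keep the hospital's favorites
        if len(held[h]) > capacities[h]:
            free.append(held[h].pop())              # reject the least preferred

    return {a: h for h, accepted in held.items() for a in accepted}

# Hypothetical toy instance: two hospitals with one slot each, three applicants.
applicants = {"ann": ["mercy", "city"], "bob": ["city", "mercy"], "cal": ["city", "mercy"]}
hospitals = {"mercy": ["bob", "ann", "cal"], "city": ["ann", "cal", "bob"]}
print(deferred_acceptance(applicants, hospitals, {"mercy": 1, "city": 1}))
```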

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems has become a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

45 For a treatise on AAE, see Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.

47 Buranyi, "How to Persuade a Robot That You Should Get the Job," Guardian, 2018.

48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.

53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59. https://doi.org/10.18653/v1/W17-1606

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
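As a rough illustration of the kind of model and disparity measurement involved, here is a sketch that trains a simple character n-gram language identifier and compares false negative rates on two groups of English tweets. It is not langid.py, not the study's corpora, and all of the texts below are invented; a real audit would use the validated AAE and White-aligned corpora and the actual pre-trained model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: (text, language) pairs covering several languages.
train_texts = ["good morning everyone", "je suis fatigué", "vamos a la playa",
               "he is going to the store", "das ist sehr gut", "on my way home"]
train_langs = ["en", "fr", "es", "en", "de", "en"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),  # character n-gram features
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_langs)

def false_negative_rate(tweets, target="en"):
    """Fraction of known-English tweets NOT labeled as English."""
    preds = model.predict(tweets)
    return sum(p != target for p in preds) / len(preds)

# Hypothetical evaluation sets, both written in English.
aae_tweets = ["he be workin late", "we was at the crib"]          # AAE-aligned examples
white_aligned_tweets = ["he is working late", "we were at home"]  # White-aligned examples
print(false_negative_rate(aae_tweets), false_negative_rate(white_aligned_tweets))
```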

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups,49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54

54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task        | Examples                          | Sources of disparity                        | Harm
Perception          | Language id., speech-to-text      | Underrepresentation in training corpus      | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels; underrep. in training corpus  | Adverse decisions
Predicting outcomes | Resume filtering                  | Various, including human labels             | Adverse decisions
Sequence prediction | Language generation, translation  | Cultural stereotypes, historical prejudices | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1–20.6% difference in error rate). Further, all perform better on lighter faces than darker faces (11.8–19.2% difference in error rate), and worst on darker female faces (20.8–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

56 de Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).

59 Martineau, "Cities Examine Proper - and Improper - Uses of Facial Recognition | WIRED," https://www.wired.com/story/cities-examine-proper-improper-facial-recognition/, 2019.

60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.

61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide," https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
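The first issue can be illustrated with a small sketch: given a held-out set of face images and the classifier's score for "male," one can search for the decision threshold that balances the two error rates, with no retraining. The scores and label encoding below are simulated stand-ins, not data from any of the commercial classifiers.

```python
import numpy as np

def pick_balanced_threshold(scores, labels, grid=np.linspace(0.05, 0.95, 181)):
    """Choose a cutoff on the 'male' score that balances the two error rates:
    P(predict male | female) and P(predict female | male)."""
    labels = np.asarray(labels)   # 1 = male, 0 = female (hypothetical encoding)
    scores = np.asarray(scores)   # classifier's predicted probability of 'male'
    best_t, best_gap = 0.5, float("inf")
    for t in grid:
        pred_male = scores >= t
        fem_as_male = pred_male[labels == 0].mean()     # error rate on female faces
        male_as_fem = (~pred_male[labels == 1]).mean()  # error rate on male faces
        gap = abs(fem_as_male - male_as_fem)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Simulated validation scores: female faces receive higher 'male' scores than they should.
rng = np.random.default_rng(0)
labels = np.r_[np.ones(500), np.zeros(500)]
scores = np.clip(np.r_[rng.normal(0.75, 0.15, 500), rng.normal(0.45, 0.15, 500)], 0, 1)
print(pick_balanced_threshold(scores, labels))  # noticeably above 0.5 for this data
```

Note that this kind of post-hoc recalibration does nothing for the second issue: reducing the gap between lighter and darker faces requires changes to the data or the model.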

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked …" feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.
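A toy simulation in the spirit of that argument (though much simpler than the cited paper's models): two groups have different preferences over the same items, the minority group contributes far fewer ratings, and a recommender that just predicts each item's global average rating serves the minority group worse.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 50
# Two groups with different true preferences over the same items (hypothetical).
pref_major = rng.uniform(1, 5, n_items)
pref_minor = rng.uniform(1, 5, n_items)

def simulate_ratings(pref, n_users, p_observe):
    """Each user rates a random subset of items; ratings are noisy versions of
    the group's true preferences. Unrated entries are NaN."""
    mask = rng.random((n_users, n_items)) < p_observe
    ratings = np.clip(pref + rng.normal(0, 0.5, (n_users, n_items)), 1, 5)
    return np.where(mask, ratings, np.nan)

# Majority: many users, dense feedback. Minority: few users, sparse feedback.
R_major = simulate_ratings(pref_major, n_users=900, p_observe=0.3)
R_minor = simulate_ratings(pref_minor, n_users=100, p_observe=0.1)
R_all = np.vstack([R_major, R_minor])

# A very simple recommender: predict every user's rating by the item's global mean.
item_means = np.nanmean(R_all, axis=0)

def group_rmse(R):
    return np.sqrt(np.nanmean((R - item_means) ** 2))

print("majority RMSE:", group_rmse(R_major))  # close to the noise floor
print("minority RMSE:", group_rmse(R_minor))  # larger: the model learned the majority's tastes
```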

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined this with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate: reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

67 For in-depth treatments of the history and politics of information platforms, see Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598.

68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).

69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.

70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, …) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.
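The calibration check itself is simple to express. The sketch below compares the fraction of women in the top image results for each occupation against the real-world fraction from occupational statistics; the occupations and numbers are made up for illustration and are not the study's data.

```python
import numpy as np

# Hypothetical data: for each occupation, the fraction of women among the top image
# search results and the fraction of women in that occupation (e.g., from labor
# statistics). The real study covered 45 occupations; these numbers are invented.
occupations = ["author", "bartender", "construction worker", "nurse", "engineer"]
frac_women_search = np.array([0.48, 0.45, 0.04, 0.92, 0.12])
frac_women_actual = np.array([0.56, 0.55, 0.09, 0.88, 0.16])

# Treat the search proportion as a predictor of the real-world proportion.
# Perfect calibration would give slope ~1 and intercept ~0; a slope below 1 suggests
# that imbalances are exaggerated in search results relative to reality.
slope, intercept = np.polyfit(frac_women_search, frac_women_actual, deg=1)
corr = np.corrcoef(frac_women_search, frac_women_actual)[0, 1]
mean_abs_gap = np.abs(frac_women_search - frac_women_actual).mean()
print(f"slope={slope:.2f} intercept={intercept:.2f} r={corr:.2f} mean |gap|={mean_abs_gap:.2f}")
```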

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals; the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs); and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that is related to the persuasive or manipulative category. (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844.)

73 See, e.g., Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89.

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'," ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself, or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst," https://adanalyst.mpi-sws.org, 2017.

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
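A toy model of this third mechanism (not a description of any real platform's delivery system): suppose impressions for one group cost more on average, and the platform fills the advertiser's budget with the cheapest available impressions first. The audience composition then depends on the budget.

```python
import numpy as np

rng = np.random.default_rng(2)

def delivered_audience(budget, n_candidates=10_000, frac_women=0.5):
    """Toy delivery model: each candidate impression has a cost, impressions of
    women cost more on average (hypothetically, because other advertisers bid
    more for them), and the platform greedily buys the cheapest impressions
    until the budget runs out. Returns the fraction of women in the audience."""
    is_woman = rng.random(n_candidates) < frac_women
    cost = np.where(is_woman,
                    rng.normal(0.06, 0.01, n_candidates),   # more expensive group
                    rng.normal(0.04, 0.01, n_candidates))   # less expensive group
    order = np.argsort(cost)                 # cheapest impressions first
    cumulative = np.cumsum(cost[order])
    bought = order[cumulative <= budget]
    return is_woman[bought].mean()

for budget in [50, 150, 400]:
    print(f"budget ${budget}: {delivered_audience(budget):.0%} women")
```

Greedy cheapest-first delivery is a caricature of real pacing and auction logic, but it is enough to show how audience composition shifts with the budget.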

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially for analyzing Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

81 This is not to say that discrimination is nonexistent. See, e.g., Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917.

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.

83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis, because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those in traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used, five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
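The core comparison in such an audit is a difference in acceptance rates, typically accompanied by a significance test and regression controls. A minimal sketch with illustrative counts (not the study's data, which also included host and listing covariates):

```python
import numpy as np
from scipy.stats import norm

# Illustrative counts (invented): positive responses, by the race signaled
# by the guest account's name.
accepted = np.array([1600, 1344])    # [White-sounding names, African-American-sounding names]
inquiries = np.array([3200, 3200])   # inquiries sent with each set of names

rates = accepted / inquiries                              # 0.50 vs 0.42
pooled = accepted.sum() / inquiries.sum()                 # pooled rate under the null
se = np.sqrt(pooled * (1 - pooled) * (1 / inquiries[0] + 1 / inquiries[1]))
z = (rates[0] - rates[1]) / se                            # two-proportion z statistic
p_value = 2 * norm.sf(abs(z))                             # two-sided p-value
print(f"acceptance {rates[0]:.0%} vs {rates[1]:.0%}, z={z:.1f}, p={p_value:.1e}")
```

The actual study additionally examined heterogeneity across host race, gender, experience, listing type, and neighborhood, which a simple two-proportion test does not capture.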

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me: Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms presents a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc.

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm: How Search Works," https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.

93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results, except based on searcher location and the immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more, or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."

95 But see Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.

96 Valentino-DeVries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes, and these ZIP codes were on average wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably, this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than by a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or on body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
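Part of why these criteria are so widely reported is that they are straightforward to compute once decisions, outcomes, and group membership are in hand. A minimal sketch on simulated data (with binary decisions, the calibration analog reported here is the positive predictive value per group):

```python
import numpy as np

def observational_report(y_true, y_pred, group):
    """Report, per group: acceptance rate (demographic parity), false positive and
    false negative rates (error rate parity), and positive predictive value
    (the binary-decision analog of calibration)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    for g in np.unique(group):
        m = group == g
        acceptance = y_pred[m].mean()
        fpr = y_pred[m][y_true[m] == 0].mean()        # false positive rate
        fnr = 1 - y_pred[m][y_true[m] == 1].mean()    # false negative rate
        ppv = y_true[m][y_pred[m] == 1].mean()        # positive predictive value
        print(f"group {g}: acceptance={acceptance:.2f} FPR={fpr:.2f} FNR={fnr:.2f} PPV={ppv:.2f}")

# Hypothetical outcomes, binary decisions, and group membership.
rng = np.random.default_rng(3)
group = rng.integers(0, 2, 1000)
y_true = rng.binomial(1, np.where(group == 0, 0.4, 0.3))
y_pred = rng.binomial(1, np.where(y_true == 1, 0.7, 0.2))
observational_report(y_true, y_pred, group)
```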

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks are not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result — e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.⁹⁹

99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.¹⁰⁰ Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
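To make steps 4 and 5 concrete, here is a minimal sketch of the classification and permutation-test logic, assuming the page contents collected in step 3 are available as raw text labeled by group. The variable names, the bag-of-words featurization, and the choice of classifier are our own illustrative assumptions, not details of the original study:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def adversarial_test(pages, groups, n_permutations=200, seed=0):
    """pages: list of raw page contents; groups: 0/1 labels (group A or B)."""
    rng = np.random.default_rng(seed)
    X = CountVectorizer(max_features=5000).fit_transform(pages)
    y = np.asarray(groups)

    # Step 4: cross-validated accuracy of a classifier that tries to tell
    # apart pages shown to group A vs. group B.
    observed = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

    # Step 5 / permutation test: under the null of no information flow, the
    # group labels are exchangeable, so we recompute the accuracy on shuffled
    # labels to build a null distribution.
    null_accs = []
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)
        null_accs.append(
            cross_val_score(LogisticRegression(max_iter=1000), X, y_perm, cv=5).mean()
        )
    p_value = (1 + sum(acc >= observed for acc in null_accs)) / (1 + n_permutations)
    return observed, p_value
```

A small p-value indicates that the classifier distinguishes the two groups' pages better than chance, i.e., evidence of information flow; a large p-value, as noted above, does not establish that the constraint is satisfied.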

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.¹⁰¹ This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting — in which actions taken on one website, such as searching for a product, result in ads for that product on another website — to infer the exchange of user data between advertising companies.¹⁰² Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.¹⁰³ They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
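The mechanics of such a measurement are simple to script. The sketch below is a hypothetical illustration, not the Journal's actual code: the cookie name, URL, and price-parsing logic are placeholders, and a real crawl would also need rate limiting and attention to the site's terms.

```python
import re
import requests

def extract_price(html):
    # Placeholder parser: grabs the first "$xx.xx" pattern on the page.
    match = re.search(r"\$(\d+\.\d{2})", html)
    return float(match.group(1)) if match else None

def fetch_prices_by_zip(url, zip_codes, cookie_name="inferred_zip"):
    # The cookie name is hypothetical; a real study would identify the
    # site's actual location cookie by inspecting its traffic.
    prices = {}
    for zip_code in zip_codes:
        resp = requests.get(url, cookies={cookie_name: zip_code}, timeout=10)
        prices[zip_code] = extract_price(resp.text)
    return prices
```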

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.¹⁰⁴ Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.¹⁰⁵

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.¹⁰⁶ So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.¹⁰⁷ Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.¹⁰⁸

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.¹⁰⁹ It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.¹¹⁰ The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing, who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task — a task that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases — both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.

Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.

Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.

Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.

Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.

Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.

Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.

Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, 2007, 35. New York, NY, USA, 2007.

Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.

Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.

Bogen, Miranda, and Aaron Rieke. "Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias." Technical report, Upturn, 2018.

Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.

Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.

Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.

Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.

Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.

D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.

Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." JAMA 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


27 Norton, "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality," Wm. & Mary L. Rev. 52 (2010): 197.
28 Ayres, "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias," Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

29 Testing conditional demographic parity using regression requires strong assumptions about the functional form of the relationship between the independent variables and the target variable.

who receive the scholarship is precisely those that are above the threshold. If this assumption fails, it immediately introduces the possibility of confounders. Perhaps the test score is not the only scholarship criterion, and income is used as a secondary criterion. Or some students offered the scholarship may decline it because they already received another scholarship. Other students may not avail themselves of the offer because the paperwork required to claim it is cumbersome. If it is possible to take the test multiple times, wealthier students may be more likely to do so until they meet the eligibility threshold.

Purely observational tests

The final category of quantitative tests for discrimination is purely observational. When we are not able to do experiments on the system of interest, nor have the conditions that enable quasi-experimental studies, there are still many questions we can answer with purely observational data.

One question that is often studied using observational data is whether the decision maker used the sensitive attribute; this can be seen as a loose analog of audit studies. This type of analysis is often used in the legal analysis of disparate treatment, although there is a deep and long-standing legal debate on whether and when explicit consideration of the sensitive attribute is necessarily unlawful.²⁷

The most common way to do this is to use regression analysis to see if attributes other than the protected attributes can collectively "explain" the observed decisions²⁸ (row 5 in the summary table). If they don't, then the decision maker must have used the sensitive attribute. However, this is a brittle test. As discussed in Chapter 2, given a sufficiently rich dataset, the sensitive attribute can be reconstructed using the other attributes. It is no surprise that attempts to apply this test in a legal context can turn into dueling expert reports, as seen in the SFFA vs. Harvard case discussed in Chapter 4.

We can of course try to go deeper with observational data and regression analysis. To illustrate, consider the gender pay gap. A study might reveal that there is a gap between genders in wage per hour worked for equivalent positions in a company. A rebuttal might claim that the gap disappears after controlling for college GPA and performance review scores. Such studies can be seen as tests for conditional demographic parity (row 6 in the summary table).²⁹
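As a rough sketch of what such a test looks like in practice, the regression below compares the gender coefficient with and without the controls. The file and column names are hypothetical, and, as footnote 29 notes, the conclusion hinges on the assumed linear functional form:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset with one row per employee.
df = pd.read_csv("salaries.csv")  # columns: hourly_wage, gender, gpa, review_score

# Raw gap: a test of (unconditional) demographic parity in wages.
raw = smf.ols("hourly_wage ~ C(gender)", data=df).fit()

# Conditional gap: a test of conditional demographic parity, controlling
# for college GPA and performance review scores.
adjusted = smf.ols("hourly_wage ~ C(gender) + gpa + review_score", data=df).fit()

print(raw.params, adjusted.params)  # compare the gender coefficients
```

Whether these are the right controls is precisely the normative question raised in the next paragraph; the regression itself cannot answer it.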

It can be hard to make sense of competing claims based on regression analysis. Which variables should we control for, and why? There are two ways in which we can put these observational claims on a more rigorous footing. The first is to use a causal framework to make our claims more precise. In this case, causal modeling might alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter we've presented many scenarios — screening job candidates, peer review, parole hearings — that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
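For a decision maker with access to its own scores, a sufficiency check can be as simple as binning the score and comparing outcome rates by group within each bin. The sketch below assumes a dataframe with hypothetical column names for the score (R), group (A), and outcome (Y):

```python
import pandas as pd

def sufficiency_table(df, score="R", group="A", outcome="Y", n_bins=10):
    """Outcome rate per group within narrow score bins.

    Under sufficiency, the rates within each bin should be (statistically)
    indistinguishable across groups."""
    binned = df.assign(score_bin=pd.qcut(df[score], q=n_bins, duplicates="drop"))
    return binned.pivot_table(index="score_bin", columns=group, values=outcome, aggfunc="mean")
```

In practice one would add confidence intervals or a formal test within each bin rather than eyeballing the means.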

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ and not R (i.e., whether or not the loan was approved). Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same. This observational test is called the outcome test (row 7 in the summary table).

30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Here is a tempting argument based on the outcome test: if one group (say, women) has a lower rate of default among those who receive loans than another group (men), it suggests that the bank applies a higher bar for loan qualification to women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).
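The thought experiment is easy to reproduce numerically. The sketch below draws default probabilities from different distributions for the two groups, applies the same approval threshold to both (so sufficiency holds by construction), and then runs the outcome test. The distributions and threshold are illustrative assumptions, not estimates from any real lending data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# True probability of default R for each group (Bayes optimal scores).
# The shapes are chosen so that approved men sit closer to the threshold.
r_women = rng.beta(2, 8, n)   # mass concentrated at low default risk
r_men = rng.beta(4, 6, n)     # mass shifted toward the threshold

threshold = 0.4  # same bar for both groups: approve if R < threshold
for name, r in [("women", r_women), ("men", r_men)]:
    y = rng.random(n) < r            # realized defaults
    approved = r < threshold
    print(name, "default rate among approved:", round(float(y[approved].mean()), 3))
```

Both groups are scored by the same calibrated R and face the same threshold, yet the outcome test flags a disparity; this is the infra-marginality problem described next.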

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.³⁰ One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal test, we treat the distribution of applicant characteristics as a given and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.
32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test, analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward, even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates; it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.³¹
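A small simulation makes the bias concrete. Here we assume a synthetic setting in which the true outcome exists for everyone but is observable only for approved applicants; all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

risk = rng.random(n)                        # true probability of default
y = rng.random(n) < risk                    # default outcome
old_score = np.clip(risk + rng.normal(0, 0.3, n), 0, 1)
approved = old_score < 0.5                  # past policy: labels exist only here

# Evaluate a new model (a different noisy score) at the same threshold.
new_score = np.clip(risk + rng.normal(0, 0.1, n), 0, 1)
pred_default = new_score >= 0.5

acc_full = (pred_default == y).mean()                  # needs labels for everyone
acc_selective = (pred_default == y)[approved].mean()   # what historical data allows
print(round(float(acc_full), 3), round(float(acc_selective), 3))
```

Because the approved pool is skewed toward low-risk applicants, the selectively labeled estimate is systematically optimistic, and collecting more historical data does not fix it.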

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns, and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.³²

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible, and important, to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

1. Audit study. Fairness notion: blindness. Access: A-exp=, X=, R. Limitations: difficult to interpret.
2. Natural experiment (especially diff-in-diff). Fairness notion: impact of blinding. Access: A-exp∼, R. Limitations: confounding, SUTVA violations, other.
3. Natural experiment. Fairness notion: arbitrariness. Access: W∼, R. Limitations: unobserved confounders.
4. Natural experiment (especially regression discontinuity). Fairness notion: impact of decision. Access: R, Y or Y′. Limitations: sample size, confounding, other technical difficulties.
5. Regression analysis. Fairness notion: blindness. Access: X, A, R. Limitations: unreliable due to proxies.
6. Regression analysis. Fairness notion: conditional demographic parity. Access: X, A, R. Limitations: weak moral justification.
7. Outcome test. Fairness notion: predictive parity. Access: A, Y | Ŷ = 1. Limitations: infra-marginality.
8. Threshold test. Fairness notion: sufficiency. Access: X′, A, Y | Ŷ = 1. Limitations: model-specific.
9. Experiment. Fairness notion: separation / error rate parity. Access: A, R, Y, Ŷ=. Limitations: often unethical or impractical.
10. Observational test. Fairness notion: demographic parity. Access: A, R. See Chapter 2.
11. Mediation analysis. Fairness notion: "relevant" mechanism. Access: X, A, R. See Chapter 4.

Legend:

• "=" indicates intervention on some variable (that is, X= does not represent a new random variable but is simply an annotation describing how X is used in the test).
• "∼" indicates natural variation in some variable, exploited by the researcher.
• A-exp: exposure of a signal of the sensitive attribute to the decision maker.
• W: a feature that is considered irrelevant to the decision.
• X′: a set of features which may not coincide with those observed by the decision maker.
• Y′: an outcome that may or may not be the one that is the target of prediction.

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor markets.³³

33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).
34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.³⁴ In the simplest model of statistical discrimination, two conditions hold: first, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
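A toy calculation shows how the two conditions combine. Suppose the target is productivity, each candidate provides a noisy signal of it, and the employer forms the Bayes-optimal estimate by shrinking the signal toward the believed group mean; all numbers below are invented for illustration:

```python
# Bayes-optimal prediction when productivity ~ Normal(group_mean, tau^2)
# and the observed signal = productivity + Normal(0, sigma^2) noise.
def predicted_productivity(signal, group_mean, tau2=1.0, sigma2=1.0):
    weight = tau2 / (tau2 + sigma2)          # how much the signal is trusted
    return weight * signal + (1 - weight) * group_mean

# Two candidates with identical signals but different (believed) group means.
print(predicted_productivity(signal=0.8, group_mean=1.0))  # 0.9
print(predicted_productivity(signal=0.8, group_mean=0.5))  # 0.65
```

Because the signal is imperfect (condition two), the group mean (condition one) never washes out, so two candidates with identical observables receive different predictions.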

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision maker's information and beliefs. For example, at first sight the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects, and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, it is either argued to be justified, or futile to proscribe because firms will find workarounds.³⁵ Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies with the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.³⁶ Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with a CV equivalent to a male candidate's is of greater intrinsic ability.³⁷

To summarize, rather than a one-size-fits-all approach to understanding mechanisms such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.³⁸ Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.³⁹ Discrimination tends to operate in more subtle, indirect, and covert ways.

38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.
39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.
40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to, and symbiotic with, quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit, and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.⁴⁰ These firms together account for the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews, in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for nine months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects will behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities — the pedigree in the book's title — are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.⁴¹ It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants — notably applicants from China — would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42

42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.
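To make the setup concrete, here is a minimal sketch (in Python) of applicant-proposing deferred acceptance over individual preference lists, with invented names and hospital capacities of one; it is a simplified stand-in for the resident match, not its actual implementation. Note that the input format gives each applicant only an individual ranking, so a couple has no way to express a joint preference such as "hospital X, but only if my partner also matches there."

# Minimal sketch of applicant-proposing deferred acceptance (illustrative data,
# capacity 1 per hospital). There is no way to encode joint preferences of couples.
def deferred_acceptance(applicant_prefs, hospital_prefs):
    """applicant_prefs: applicant -> list of hospitals, most preferred first.
    hospital_prefs: hospital -> list of applicants, most preferred first."""
    rank = {h: {a: i for i, a in enumerate(prefs)} for h, prefs in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}      # next hospital index to propose to
    match = {}                                         # hospital -> applicant
    free = list(applicant_prefs)
    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                                   # a has exhausted their list
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if a not in rank[h]:
            free.append(a)                             # h did not rank a at all
        elif h not in match:
            match[h] = a                               # h is empty: tentatively accept
        elif rank[h][a] < rank[h][match[h]]:
            free.append(match[h])                      # h prefers a: bump current match
            match[h] = a
        else:
            free.append(a)                             # h rejects a; a proposes onward
    return match

applicant_prefs = {"Ann": ["Mercy", "City"], "Bob": ["City", "Mercy"], "Cal": ["City"]}
hospital_prefs = {"Mercy": ["Bob", "Ann"], "City": ["Ann", "Cal", "Bob"]}
print(deferred_acceptance(applicant_prefs, hospital_prefs))   # {'City': 'Ann', 'Mercy': 'Bob'}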

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.

47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).

48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.

53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
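The underlying measurement is simple to reproduce in outline. The sketch below assumes two hypothetical files of tweets already assigned to the AAE and White-aligned corpora and uses the langid.py package to estimate the fraction of each corpus classified as non-English; it illustrates the kind of disparity the study measured rather than its exact protocol.

# Minimal sketch of the error-rate comparison, assuming hypothetical input files
# with one tweet per line. Uses the real langid.py package (pip install langid).
import langid

def non_english_rate(path):
    """Fraction of lines that langid.py classifies as something other than English."""
    with open(path, encoding="utf-8") as f:
        tweets = [line.strip() for line in f if line.strip()]
    misses = sum(1 for t in tweets if langid.classify(t)[0] != "en")
    return misses / len(tweets)

for name, path in [("AAE-aligned", "aae_tweets.txt"), ("White-aligned", "wh_tweets.txt")]:
    print(f"{name}: {non_english_rate(path):.1%} classified as non-English")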

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54

54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task         | Examples                           | Sources of disparity                        | Harm
Perception           | Language id, speech-to-text        | Underrepresentation in training corpus      | Degraded service
Automating judgment  | Toxicity detection, essay grading  | Human labels, underrep. in training corpus  | Adverse decisions
Predicting outcomes  | Resume filtering                   | Various, including human labels             | Adverse decisions
Sequence prediction  | Language generation, translation   | Cultural stereotypes, historical prejudices | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6% difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).

59 Martineau, "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).

60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.

61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).

62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
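A rough sketch of this decomposition, assuming a hypothetical table of model scores, ground-truth gender labels, and a binarized skin-tone attribute (all column names and data invented for illustration):

# Minimal sketch: per-direction error rates at a threshold, overall and by skin-tone group.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score":  np.random.rand(1000),              # stand-in for real model scores
    "male":   np.random.randint(0, 2, 1000),     # ground-truth gender label
    "darker": np.random.randint(0, 2, 1000),     # binarized skin-tone group
})

def error_rates(d, threshold=0.5):
    pred_male = d["score"] >= threshold
    female_to_male = (pred_male & (d["male"] == 0)).sum() / (d["male"] == 0).sum()
    male_to_female = (~pred_male & (d["male"] == 1)).sum() / (d["male"] == 1).sum()
    return female_to_male, male_to_female

# Issue 1: asymmetric errors between gender classes; shifting the decision
# threshold can rebalance these without retraining.
for t in (0.4, 0.5, 0.6):
    print(t, error_rates(df, t))

# Issue 2: higher error rates on darker faces; a threshold change alone cannot
# fix this, since it reflects how well the model separates faces at all.
print("darker:", error_rates(df[df["darker"] == 1]))
print("lighter:", error_rates(df[df["darker"] == 0]))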

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India, compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.
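A toy simulation can illustrate the mechanism, though it is not the model used in the study: an item-average recommender is fit on ratings from a large majority group and a small minority group with different preferences, and its error is higher for held-out minority users.

# Toy simulation (illustrative numbers only): majority preferences dominate what the
# recommender learns, so predictions are worse for the minority group.
import numpy as np

rng = np.random.default_rng(0)
n_items, n_major, n_minor = 50, 900, 100
pref_major, pref_minor = rng.uniform(1, 5, n_items), rng.uniform(1, 5, n_items)

def sample_ratings(pref, n_users):
    return np.clip(pref + rng.normal(0, 0.5, (n_users, n_items)), 1, 5)

train = np.vstack([sample_ratings(pref_major, n_major), sample_ratings(pref_minor, n_minor)])
item_means = train.mean(axis=0)                      # the "recommender": per-item average rating

def rmse(pred, truth):
    return np.sqrt(((pred - truth) ** 2).mean())

print("majority RMSE:", rmse(item_means, sample_ratings(pref_major, 200)))
print("minority RMSE:", rmse(item_means, sample_ratings(pref_minor, 200)))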

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm, rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.
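As a crude illustration only: the sketch below compares raw reformulation rates across groups from a hypothetical impression log. The actual study goes well beyond this naive comparison, inferring latent satisfaction from pairs of impressions and controlling for demographic-specific behavior, but the raw comparison shows the kind of proxy being used.

# Naive group comparison of a dissatisfaction proxy (reformulation), assuming a
# hypothetical log file; not the study's method, just the starting point it improves on.
import pandas as pd
from scipy.stats import chi2_contingency

logs = pd.read_csv("impressions.csv")        # columns: group, reformulated (0/1)

print(logs.groupby("group")["reformulated"].mean())   # reformulation rate per group

table = pd.crosstab(logs["group"], logs["reformulated"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")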

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation, rather than antidiscrimination law.67

67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).

68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).

69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.

70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
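The calibration check itself is straightforward to express. The sketch below assumes a hypothetical CSV with, for each occupation, the fraction of women in the top image-search results and the fraction of women in that occupation according to labor statistics.

# Minimal sketch of the calibration comparison (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("occupations.csv")   # columns: occupation, frac_women_search, frac_women_real

# Calibration view: the search fraction should track the real-world fraction.
print("correlation:", round(df["frac_women_search"].corr(df["frac_women_real"]), 2))

# Stereotype exaggeration: male-dominated occupations underrepresent women in results
# even relative to reality, and vice versa for female-dominated occupations.
df["deviation"] = df["frac_women_search"] - df["frac_women_real"]
male_dom = df["frac_women_real"] < 0.5
print("mean deviation, male-dominated:", df.loc[male_dom, "deviation"].mean())
print("mean deviation, female-dominated:", df.loc[~male_dom, "deviation"].mean())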

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).

73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias: An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
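A toy simulation of this market effect, with made-up per-impression costs and a deliberately simplified delivery rule (buy the cheapest available impressions first), shows how audience composition can shift with budget:

# Toy simulation of budget-dependent audience composition (illustrative numbers,
# not estimates from any study or any real platform's delivery logic).
def audience_composition(budget, cost_a=0.90, cost_b=1.20, audience_each=10_000):
    """Greedy delivery: spend the budget on the cheaper group's impressions first."""
    shown_a = min(audience_each, int(budget // cost_a))
    remaining = budget - shown_a * cost_a
    shown_b = min(audience_each, int(remaining // cost_b))
    total = shown_a + shown_b
    return shown_a / total if total else 0.0

for budget in (1_000, 5_000, 10_000, 20_000):
    print(f"budget ${budget}: share of cheaper group = {audience_composition(budget):.0%}")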

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.

83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis, because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those used in traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced, entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
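The core comparison can be sketched as follows, assuming a hypothetical file of inquiries with the signaled race and the host's response; the published study estimated regression models with controls, so this is only the raw two-group comparison.

# Raw acceptance-rate comparison with a two-proportion z-test (hypothetical data file).
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("airbnb_inquiries.csv")     # columns: signaled_race, accepted (0/1)

counts = df.groupby("signaled_race")["accepted"].agg(["sum", "count"])
print(counts["sum"] / counts["count"])       # acceptance rate per group

stat, pvalue = proportions_ztest(count=counts["sum"].values, nobs=counts["count"].values)
print(f"z = {stat:.2f}, p = {pvalue:.4f}")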

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check. When the analysis was restricted to the 29 hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).

92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.

93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results, except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more, or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."

95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.

96 Valentino-DeVries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

A final example, to reinforce the fact that disparity-producing mechanisms can be subtle and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
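The inferred rule is easy to state as code. The sketch below uses invented store coordinates and prices; only the shape of the rule (a discount when a competitor's store is within roughly 20 miles) comes from the reporting.

# Sketch of the inferred pricing rule (invented store locations and prices).
from math import radians, sin, cos, asin, sqrt

def miles(lat1, lon1, lat2, lon2):
    """Great-circle distance via the haversine formula, in miles."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))      # 3959 = Earth radius in miles

competitor_stores = [(40.7357, -74.1724), (40.9168, -74.1718)]   # hypothetical locations

def quoted_price(customer_lat, customer_lon, list_price=15.79, discounted=14.29):
    near = any(miles(customer_lat, customer_lon, lat, lon) <= 20 for lat, lon in competitor_stores)
    return discounted if near else list_price

print(quoted_price(40.73, -74.17))   # near a competitor store -> discounted price
print(quoted_price(44.00, -71.00))   # no competitor nearby -> list price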

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users, and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than by a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or on body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
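For concreteness, here is a minimal sketch of how these observational metrics are typically computed from decision records, assuming a hypothetical table with a sensitive attribute, a binary decision, a ground-truth outcome, and a model score (all column names invented).

# Minimal sketch of per-group observational metrics (hypothetical columns:
# group, decision 0/1, outcome 0/1, score in [0, 1]).
import pandas as pd

def observational_metrics(df):
    out = {}
    for g, d in df.groupby("group"):
        pos, neg = d["outcome"] == 1, d["outcome"] == 0
        out[g] = {
            "acceptance_rate": d["decision"].mean(),              # demographic parity
            "false_positive_rate": d.loc[neg, "decision"].mean(), # error rate parity
            "false_negative_rate": 1 - d.loc[pos, "decision"].mean(),
            # crude calibration check: outcome rate among high-scoring individuals
            "outcome_rate_at_high_score": d.loc[d["score"] > 0.7, "outcome"].mean(),
        }
    return pd.DataFrame(out).T

# Usage: df = pd.read_csv("decisions.csv"); print(observational_metrics(df))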

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
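A minimal sketch of the final classification-and-permutation step (steps 4 and 5), assuming the page contents seen by the two groups of simulated users have already been collected into two directories of text files (hypothetical paths):

# Sketch of steps 4-5 of the adversarial test using scikit-learn.
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

def load(pattern):
    return [open(p, encoding="utf-8").read() for p in glob.glob(pattern)]

pages_a, pages_b = load("group_a/*.txt"), load("group_b/*.txt")
texts = pages_a + pages_b
labels = [0] * len(pages_a) + [1] * len(pages_b)

X = TfidfVectorizer(max_features=5000).fit_transform(texts)

# Cross-validated accuracy plus a permutation test: the p-value estimates how often
# an accuracy this high would arise if group labels were unrelated to page content.
score, _, pvalue = permutation_test_score(
    LogisticRegression(max_iter=1000), X, labels, cv=5, n_permutations=200)
print(f"cross-validated accuracy = {score:.2f}, permutation p-value = {pvalue:.3f}")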

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
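In outline, such a sweep looks like the sketch below; the URL, cookie name, and price-extraction pattern are invented, whereas the actual investigation reverse-engineered the site's real cookie format.

# Sketch of a cookie-based location sweep (hypothetical URL, cookie name, and regex).
import re
import requests

PRODUCT_URL = "https://www.example-retailer.com/product/12345"

def price_for_zip(zip_code):
    cookies = {"inferred_location": zip_code}            # hypothetical cookie name
    html = requests.get(PRODUCT_URL, cookies=cookies, timeout=10).text
    match = re.search(r"\$(\d+\.\d{2})", html)            # first dollar amount on the page
    return float(match.group(1)) if match else None

prices = {z: price_for_zip(z) for z in ["07024", "10001", "59715"]}   # a few test ZIPs
print(prices)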

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome, where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.

108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.

109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.

110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task — a task that is further complicated by the limitations of the data available. The paper documents the substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): S68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.
Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J. B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.
Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.
Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.
Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.
Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.
Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.
Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1 Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2 Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow fairness privacy
  • Comparison of research methods
  • Looking ahead
  • References


alert us to unresolved questions: why do performance review scores differ by gender? What about the gender composition of different roles and levels of seniority? Exploring these questions may reveal unfair practices. Of course, in this instance the questions we raised are intuitively obvious, but other cases may be more intricate.

The second way to go deeper is to apply our normative understanding of fairness to determine which paths from gender to wage are morally problematic. If the pay gap is caused by the (well-known) gender differences in negotiating for pay raises, does the employer bear the moral responsibility to mitigate it? This is, of course, a normative and not a technical question.

Outcome-based tests

So far in this chapter we've presented many scenarios — screening job candidates, peer review, parole hearings — that have one thing in common: while they all aim to predict some outcome (job performance, paper quality, recidivism), the researcher does not have access to data on the true outcomes.

Lacking ground truth, the focus shifts to the observable characteristics at decision time, such as job qualifications. A persistent source of difficulty in these settings is for the researcher to construct two sets of samples that differ only in the sensitive attribute and not in any of the relevant characteristics. This is often an untestable assumption. Even in an experimental setting such as a resume audit study, there is substantial room for different interpretations: did employers infer race from names, or socioeconomic status? And in observational studies, the findings might turn out to be invalid because of unobserved confounders (such as in the hungry judges study).

But if outcome data are available, then we can do at least one test of fairness without needing any of the observable features (other than the sensitive attribute): specifically, we can test for sufficiency, which requires that the true outcome be conditionally independent of the sensitive attribute given the prediction (Y ⊥ A | R). For example, in the context of lending, if the bank's decisions satisfy sufficiency, then among applicants in any narrow interval of predicted probability of default (R), we should find the same rate of default (Y) for applicants of any group (A).
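As a concrete illustration, here is a minimal sketch of the check a decision maker holding (R, A, Y) could run, comparing outcome rates across groups within narrow score bins; the bin width and the synthetic data are placeholders, not the procedure used in any particular study.

    import numpy as np
    import pandas as pd

    def sufficiency_table(r, a, y, n_bins=10):
        """Outcome rate by group within narrow bins of the score R.
        Under sufficiency (Y independent of A given R), the rates within
        each bin should agree across groups up to sampling noise."""
        df = pd.DataFrame({"R": r, "A": a, "Y": y})
        df["bin"] = pd.cut(df["R"], np.linspace(0, 1, n_bins + 1), include_lowest=True)
        return df.groupby(["bin", "A"])["Y"].agg(["mean", "count"]).unstack("A")

    # Toy data standing in for a bank's records: here outcomes are drawn
    # directly from the score, so sufficiency holds by construction.
    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, size=10_000)        # group membership
    r = rng.uniform(0, 1, size=10_000)         # predicted probability of default
    y = rng.binomial(1, r)                     # realized defaults
    print(sufficiency_table(r, a, y))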

Typically, the decision maker (the bank) can test for sufficiency, but an external researcher cannot, since the researcher only gets to observe Ŷ and not R (i.e., whether or not the loan was approved). Such a researcher can test predictive parity rather than sufficiency. Predictive parity requires that the rate of default (Y) for favorably classified applicants (Ŷ = 1) of any group (A) be the same. This observational test is called the outcome test (row 7 in the summary table).

30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Here is a tempting argument based on the outcome test: if one group (say, women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups: women (orange) and men (blue).
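The thought experiment is easy to reproduce in a small simulation. The sketch below uses made-up risk distributions: outcomes are drawn from a calibrated score and a single threshold is applied to both groups, so sufficiency holds by construction, yet the default rates among approved applicants differ by group, violating predictive parity.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000

    # Two groups with different (made-up) distributions of true default risk.
    group = rng.integers(0, 2, size=n)          # 0 = "women", 1 = "men"
    risk = np.where(group == 0,
                    rng.beta(2, 8, size=n),     # risk concentrated at lower values
                    rng.beta(3, 6, size=n))     # risk concentrated at higher values
    default = rng.binomial(1, risk)             # outcomes drawn from the true risk

    # The bank scores applicants by their true risk and applies one threshold
    # to everyone, so sufficiency holds by construction.
    approved = risk < 0.30

    for g, label in [(0, "women"), (1, "men")]:
        sel = approved & (group == g)
        print(f"default rate among approved {label}: {default[sel].mean():.3f}")
    # The rates differ even though the same bar was applied to both groups.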

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y), whose parameters can be inferred from the data, whereas the outcome test is model-free.

31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward, even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates: it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31
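A small simulation makes the resulting bias concrete. In the sketch below (synthetic data, arbitrary parameters), outcomes are observable only for approved applicants, and a performance estimate computed on that selected sample differs from the estimate on the full applicant pool, which only the simulation lets us see.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000

    # Synthetic applicants: true repayment probability, a noisy score, and outcomes.
    p_repay = rng.uniform(0, 1, size=n)
    score = np.clip(p_repay + rng.normal(0, 0.2, size=n), 0, 1)
    repaid = rng.binomial(1, p_repay)           # ground truth, known only in simulation

    approved = score >= 0.4                     # the decision rule under study
    predicted_repay = score >= 0.6              # the prediction we want to evaluate

    # What can be computed from observed data: performance on approved applicants only.
    observed_acc = (predicted_repay[approved] == repaid[approved]).mean()

    # What we would actually want: performance on all applicants (unobservable in practice).
    true_acc = (predicted_repay == repaid).mean()

    print(f"accuracy on the labeled (approved) sample: {observed_acc:.3f}")
    print(f"accuracy on the full applicant pool:       {true_acc:.3f}")
    # The labeled sample is not representative of the applicant pool, so the naive
    # estimate is biased; group-wise error rate comparisons inherit the same bias.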

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible and important to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

Test / study design | Fairness notion / application | Access | Notes / limitations
1. Audit study | Blindness | A-exp=, X=, R | Difficult to interpret
2. Natural experiment (especially diff-in-diff) | Impact of blinding | A-exp~, R | Confounding, SUTVA violations, other
3. Natural experiment | Arbitrariness | W~, R | Unobserved confounders
4. Natural experiment (especially regression discontinuity) | Impact of decision | R, Y or Y′ | Sample size, confounding, other technical difficulties
5. Regression analysis | Blindness | X, A, R | Unreliable due to proxies
6. Regression analysis | Conditional demographic parity | X, A, R | Weak moral justification
7. Outcome test | Predictive parity | A, Y (for Ŷ = 1) | Infra-marginality
8. Threshold test | Sufficiency | X′, A, Y (for Ŷ = 1) | Model-specific
9. Experiment | Separation / error rate parity | A, R, Y, Ŷ= | Often unethical or impractical
10. Observational test | Demographic parity | A, R | See Chapter 2
11. Mediation analysis | "Relevant" mechanism | X, A, R | See Chapter 4

Legend:

• "=" indicates intervention on some variable (that is, "X=" does not represent a new random variable, but is simply an annotation describing how X is used in the test)
• "~": natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice for a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor markets.33

33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and, in fact, discriminates in favor of the female candidates who would take time off if hired).
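A toy simulation, with invented numbers, illustrates the mechanism: when base rates differ by group and the observed signal is noisy, the optimal prediction differs between candidates with identical observed signals.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500_000

    # Two groups with different (invented) base rates of the target variable.
    group = rng.integers(0, 2, size=n)
    base_rate = np.where(group == 0, 0.6, 0.8)
    productive = rng.binomial(1, base_rate)

    # The decision maker observes only a noisy signal of the target variable;
    # the signal has the same meaning in both groups.
    signal = productive + rng.normal(0, 1.0, size=n)

    # Estimate P(productive | signal, group) for candidates with (nearly) the same signal.
    near_one = np.abs(signal - 1.0) < 0.05
    for g in (0, 1):
        sel = near_one & (group == g)
        print(f"group {g}: P(productive | signal near 1.0) = {productive[sel].mean():.3f}")
    # The optimal prediction differs by group even for identical observed signals,
    # because group membership carries information about the base rate.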

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision-maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business; as for statistical discrimination, either it is argued to be justified, or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews, in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities — the pedigree in the book's title — are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social class based preferences exposed in the book.

41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants — notably applicants from China — would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2 Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42

42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.
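For intuition about how such a matching system works, here is a minimal sketch of applicant-proposing deferred acceptance (the Gale-Shapley algorithm) on a toy instance with one seat per hospital; this is a simplification, not the actual residency match algorithm, which must handle program quotas, couples, and other constraints. It also makes concrete why individual rank lists cannot express a couple's joint preference.

    # Toy applicant-proposing deferred acceptance, one seat per hospital.
    def deferred_acceptance(applicant_prefs, hospital_prefs):
        rank = {h: {a: i for i, a in enumerate(prefs)} for h, prefs in hospital_prefs.items()}
        next_choice = {a: 0 for a in applicant_prefs}     # next hospital to propose to
        matched_to = {}                                   # hospital -> applicant
        free = list(applicant_prefs)
        while free:
            a = free.pop()
            if next_choice[a] >= len(applicant_prefs[a]):
                continue                                  # applicant exhausted their list
            h = applicant_prefs[a][next_choice[a]]
            next_choice[a] += 1
            current = matched_to.get(h)
            if current is None:
                matched_to[h] = a
            elif rank[h][a] < rank[h][current]:           # hospital prefers the new proposer
                matched_to[h] = a
                free.append(current)
            else:
                free.append(a)
        return {a: h for h, a in matched_to.items()}

    applicant_prefs = {"alice": ["city", "rural"], "bob": ["city", "rural"], "carol": ["rural", "city"]}
    hospital_prefs = {"city": ["bob", "alice", "carol"], "rural": ["alice", "carol", "bob"]}
    print(deferred_acceptance(applicant_prefs, hospital_prefs))
    # Each partner in a couple submits an individual list like these, so a preference
    # such as "rural for me, but only if my partner also matches nearby" is inexpressible.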

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it was in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.

47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).

48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.

53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
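The error rate comparison itself is straightforward to reproduce given a dialect-labeled corpus of English tweets. Here is a minimal sketch assuming the langid package and a hypothetical tab-separated file of labeled tweets; the original study's corpus construction was far more involved.

    # Sketch: compare language-identification false negative rates across dialect groups.
    # Assumes `pip install langid` and a hypothetical file with lines of the form
    # <dialect_label>\t<text>, where every tweet is known to be English.
    from collections import defaultdict

    import langid

    counts = defaultdict(lambda: [0, 0])   # dialect -> [classified_as_non_english, total]

    with open("tweets_by_dialect.tsv", encoding="utf-8") as f:   # hypothetical file
        for line in f:
            dialect, text = line.rstrip("\n").split("\t", 1)
            predicted_lang, _score = langid.classify(text)
            counts[dialect][0] += int(predicted_lang != "en")
            counts[dialect][1] += 1

    for dialect, (misses, total) in counts.items():
        print(f"{dialect}: {misses / total:.1%} of English tweets classified as non-English")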

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups,49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54

54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task          Examples                             Sources of disparity                           Harm
Perception            Language id., speech-to-text         Underrep. in training corpus                   Degraded service
Automating judgment   Toxicity detection, essay grading    Human labels, underrep. in training corpus     Adverse decisions
Predicting outcomes   Resume filtering                     Various, including human labels                Adverse decisions
Sequence prediction   Language generation, translation     Cultural stereotypes, historical prejudices    Repres. harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6%


56. Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.
57. Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58. Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59. Martineau, "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED," https://www.wired.com/story/cities-examine-proper-improper-facial-recognition/, 2019.
60. O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide," https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
62. Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
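The first issue can be illustrated with a short sketch of threshold recalibration. The data file and column names below are hypothetical stand-ins for a classifier's scored outputs; the point is only to show how a threshold can be chosen to balance the two error types, and why no threshold choice can fix the second, deeper issue.

```python
# Minimal sketch of threshold recalibration on hypothetical classifier outputs
# with columns "score" (predicted probability of "male"), "label" (true gender),
# and "skin_tone".
import numpy as np
import pandas as pd

df = pd.read_csv("gender_classifier_scores.csv")  # hypothetical file
scores = df["score"].to_numpy()
labels = df["label"].to_numpy()

def error_gap(threshold):
    # Gap between the two error types: women classified as men
    # vs. men classified as women.
    preds = np.where(scores >= threshold, "male", "female")
    f_as_m = np.mean(preds[labels == "female"] == "male")
    m_as_f = np.mean(preds[labels == "male"] == "female")
    return abs(f_as_m - m_as_f)

# Pick the threshold that best balances the two error types. The same search
# can be repeated within each skin-tone group, but no threshold fixes a model
# whose features are simply less informative for darker faces.
thresholds = np.linspace(0.05, 0.95, 91)
best = min(thresholds, key=error_gap)
print("balanced threshold:", best)
```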

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63. Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64. Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv preprint arXiv:1906.09208, 2019.

65. Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example. It includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66. Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.
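A minimal simulation can convey the intuition. The setup below is entirely made up: two groups with different item preferences, with the minority group much smaller, and a deliberately simple "recommender" that predicts each item's overall mean rating. Prediction error ends up higher for the minority group because the learned averages are dominated by the majority.

```python
# Minimal simulation sketch: a simple predictor underperforms for a minority
# group whose preferences differ from the majority's.
import numpy as np

rng = np.random.default_rng(0)
n_major, n_minor, n_items = 900, 100, 50

item_appeal_major = rng.normal(0, 1, n_items)   # groups prefer different items
item_appeal_minor = rng.normal(0, 1, n_items)

ratings_major = item_appeal_major + rng.normal(0, 0.5, size=(n_major, n_items))
ratings_minor = item_appeal_minor + rng.normal(0, 0.5, size=(n_minor, n_items))

# "Train" on everyone: predict each item's overall mean rating.
all_ratings = np.vstack([ratings_major, ratings_minor])
item_means = all_ratings.mean(axis=0)

def rmse(ratings):
    return np.sqrt(np.mean((ratings - item_means) ** 2))

print("majority RMSE:", rmse(ratings_major))
print("minority RMSE:", rmse(ratings_minor))  # typically higher
```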

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct from a fairness perspective is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67. For in-depth treatments of the history and politics of information platforms, see Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598.
68. Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69. Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70. Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
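The calibration check itself is simple to express. The sketch below assumes a hypothetical occupations.csv with one row per occupation and two columns: the share of women among the image search results and the share of women in the occupation according to some source of labor statistics; the deviation and slope computations are one reasonable way to summarize miscalibration and exaggeration, not the published study's exact analysis.

```python
# Minimal sketch of a calibration check for occupation image search results.
import numpy as np
import pandas as pd

occ = pd.read_csv("occupations.csv")  # hypothetical file
x = occ["frac_women_stats"].clip(0.01, 0.99)    # real-world share of women
y = occ["frac_women_search"].clip(0.01, 0.99)   # share of women in results

# Perfect calibration would put every occupation on the line y = x.
print("mean deviation (search minus real world):", (y - x).mean())

# Stereotype exaggeration shows up as a slope greater than 1 in log-odds:
# occupations that skew male in reality skew even more male in results.
logit = lambda p: np.log(p / (1 - p))
slope = np.polyfit(logit(x), logit(y), 1)[0]
print("exaggeration slope (log-odds):", slope)
```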

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals; the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs); and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71. Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72. The economic analysis of advertising includes a third category, complementary, that is related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73. See, e.g., Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89.
74. Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'," ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative – that is, exerting covert influence instead of making forthright appeals – or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75. Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76. Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77. Andreou et al., "AdAnalyst," https://adanalyst.mpi-sws.org, 2017.

78. Ali et al., "Discrimination Through Optimization."
79. Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
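A highly stylized simulation can illustrate this market effect. The numbers below are made up, and real platforms run auctions rather than simply buying the cheapest impressions; the sketch only shows how a fixed budget plus cost differences between groups mechanically skews the audience, more so at small budgets.

```python
# Stylized simulation: budget size affects audience composition when
# impressions for one group are more expensive on average.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
group = rng.choice(["women", "men"], size=n)
# Assumption for illustration: impressions for women cost more on average.
cost = np.where(group == "women", rng.normal(12, 2, n), rng.normal(8, 2, n))

def share_women(budget):
    order = np.argsort(cost)              # buy cheapest impressions first
    cumulative = np.cumsum(cost[order])
    served = order[cumulative <= budget]
    return np.mean(group[served] == "women")

for budget in [5_000, 20_000, 80_000]:
    print(budget, "share of women in audience:", round(share_women(budget), 3))
```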

In terms of methods to detect these disparities, researchers and journalists have broadly used two approaches: interacting with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the Ad Settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of anti-semitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80. Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81. This is not to say that discrimination is nonexistent. See, e.g., Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917.
82. Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83. Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those used in traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
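The basic comparison in such an audit is simple to compute. The sketch below assumes a hypothetical results.csv with one row per inquiry, a column "name_race" for the race signaled by the guest name, and a 0/1 column "accepted"; the published study goes further, with regressions that add host and listing controls, but the raw difference in acceptance rates and a two-proportion test capture the core idea.

```python
# Minimal sketch: acceptance rates by signaled race and a two-proportion z-test.
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

df = pd.read_csv("results.csv")  # hypothetical file of audit inquiries
counts = df.groupby("name_race")["accepted"].agg(["sum", "count"])
print(counts["sum"] / counts["count"])   # acceptance rate per group

stat, pval = proportions_ztest(count=counts["sum"].to_numpy(),
                               nobs=counts["count"].to_numpy())
print("z =", stat, "p =", pval)
```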

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84. Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85. Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86. Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.
87. Tjaden, Schwemmer, and Khadjavi, "Ride with Me – Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88. Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv preprint arXiv:1812.00099, 2018.
89. This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc.
90. Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
91. D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works," https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
92. Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93. Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level the different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94. For example, in 2017 US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95. But see Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96. Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
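One simple way to adjudicate between such competing hypotheses is to fit each candidate predictor alone and then jointly, and see which one absorbs the explanatory power. The sketch below is not the journalists' analysis; it assumes a hypothetical prices.csv with a 0/1 "discount" column, a "zip_income" column, and a "miles_to_competitor" column.

```python
# Minimal sketch: compare candidate explanations for who sees a discount.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("prices.csv")  # hypothetical file, one row per ZIP code
df["near_competitor"] = (df["miles_to_competitor"] <= 20).astype(int)

# If near_competitor explains most of the variation, the income correlation
# is likely incidental rather than the operative pricing rule.
for formula in ["discount ~ zip_income",
                "discount ~ near_competitor",
                "discount ~ near_competitor + zip_income"]:
    model = smf.logit(formula, data=df).fit(disp=0)
    print(formula, "pseudo R^2:", round(model.prsquared, 3))
```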

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than a uniform


97. Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98. Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks are not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result – e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or


99. Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100. Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101. Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102. Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103. Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online," Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

better) could have arisen by chance if there is in fact no systematic difference between the two groups.99
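Steps 4 and 5 are easy to sketch with standard tools. The stand-in data below is made up so the snippet runs end to end; in a real audit, page_texts would be the recorded page contents from step 3 and groups would record which bot group saw each page. The permutation test asks how often a classifier this accurate would arise if the group labels carried no information.

```python
# Minimal sketch of steps 4-5 of the adversarial test.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Stand-in data; replace with recorded page contents and bot group labels.
page_texts = ["cheap insurance quotes today", "sports scores and news",
              "health insurance premium offers", "weather forecast"] * 50
groups = ["A", "B", "A", "B"] * 50

X = TfidfVectorizer(max_features=5000).fit_transform(page_texts)
clf = LogisticRegression(max_iter=1000)

score, perm_scores, pvalue = permutation_test_score(
    clf, X, groups, cv=5, n_permutations=200, scoring="accuracy")
print("cross-validated accuracy:", score, "permutation p-value:", pvalue)
# Accuracy significantly above 1/2 suggests that information about the
# sensitive browsing in step 1 reached the pages seen in step 2.
```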

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting – in which actions taken on one website, such as searching for a product, result in ads for that product on another website – to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
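The general technique looks roughly like the sketch below, which is not the Journal's actual code: the URL, the cookie name ("zipcode"), and the parsing step are hypothetical placeholders for whatever the site under study uses, which the researcher must first discover by inspecting its traffic.

```python
# Minimal sketch of cookie manipulation to observe location-dependent pages.
import requests

def page_for_zip(zip_code):
    # Hypothetical cookie name; a real study must first observe which cookie
    # the site uses to store the inferred location.
    resp = requests.get("https://www.example.com/product/12345",
                        cookies={"zipcode": zip_code}, timeout=10)
    return resp.text

# Record the page served for each ZIP code of interest (with rate limiting),
# then extract the displayed price with site-specific parsing.
pages = {z: page_for_zip(z) for z in ["10001", "94305"]}
```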

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited


104. Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105. Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome in which accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106. Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107. Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.
108. Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109. Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110. Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108
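A toy simulation illustrates why such dynamics are well suited to lab study. The numbers below are made up and the model is far simpler than the cited work: two districts have identical true incident rates, but incidents are only recorded where patrols are sent, and patrols follow the recorded counts, so a small initial imbalance amplifies over time.

```python
# Toy simulation of a runaway feedback loop in allocation based on recorded data.
import numpy as np

rng = np.random.default_rng(2)
true_rate = np.array([0.1, 0.1])     # identical true rates in both districts
recorded = np.array([5.0, 4.0])      # slight initial imbalance in records

for day in range(200):
    # Send most patrols to the district with more recorded incidents.
    patrols = np.array([8, 2]) if recorded[0] >= recorded[1] else np.array([2, 8])
    # Incidents are observed only in proportion to patrol presence.
    recorded += rng.binomial(patrols * 10, true_rate)

print("recorded incidents per district:", recorded)
```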

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task – a task that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these


methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.

Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.

Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.

Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): S68–S87.

Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.

Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.

Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.

Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.

Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.

Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.

Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.

Bogen, Miranda, and Aaron Rieke. "Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias." Technical report, Upturn, 2018.

Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.

Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.

Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.

Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.

Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.

D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.

Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33, 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78, 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-DeVries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59, 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30, 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


30 Simoiu, Corbett-Davies, and Goel, "The Problem of Infra-Marginality in Outcome Tests for Discrimination," The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

observational test is called the outcome test (row 7 in the summary table).

Here is a tempting argument based on the outcome test: if one group (say, women) who receive loans have a lower rate of default than another (men), it suggests that the bank applies a higher bar for loan qualification for women. Indeed, this type of argument was the original motivation behind the outcome test. But it is a logical fallacy: sufficiency does not imply predictive parity (or vice versa). To see why, consider a thought experiment involving the Bayes optimal predictor. In the hypothetical figure below, applicants to the left of the vertical line qualify for the loan. Since the area under the curve to the left of the line is concentrated further to the right for men than for women, men who receive loans are more likely to default than women. Thus, the outcome test would reveal that predictive parity is violated, whereas it is clear from the construction that sufficiency is satisfied and the bank applies the same bar to all groups.

Figure 2: Hypothetical probability density of loan default for two groups, women (orange) and men (blue).
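The thought experiment is easy to reproduce numerically. Below is a minimal simulation sketch, not drawn from any real study: the group risk distributions and the threshold are invented, the score is the true default probability (so sufficiency holds by construction), and yet the outcome test reports a gap.

```python
# Minimal simulation of the infra-marginality thought experiment.
# All numbers (group means, the 0.5 threshold) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def default_rate_among_approved(mean_risk, n=100_000):
    # r is each applicant's true probability of default (the Bayes optimal score).
    r = np.clip(rng.normal(mean_risk, 0.15, size=n), 0, 1)
    approved = r < 0.5            # the SAME bar applied to every group
    defaults = rng.random(n) < r  # outcomes drawn from the true risk
    return defaults[approved].mean()

print("women:", default_rate_among_approved(mean_risk=0.25))  # risk mass far from the threshold
print("men:  ", default_rate_among_approved(mean_risk=0.45))  # risk mass close to the threshold
# Among approved applicants, men default more often than women: the outcome
# test flags a predictive parity violation even though the bank's bar is identical.
```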

This phenomenon is called infra-marginality, i.e., the measurement is aggregated over samples that are far from the decision threshold (margin). If we are indeed interested in testing sufficiency (equivalently, whether the bank applied the same threshold to all groups) rather than predictive parity, this is a problem. To address it, we can somehow try to narrow our attention to samples that are close to the threshold. This is not possible with (Ŷ, A, Y) alone: without knowing R, we don't know which instances are close to the threshold. However, if we also had access to some set of features X′ (which need not coincide with the set of features X observed by the decision maker), it becomes possible to test for violations of sufficiency. The threshold test is a way to do this (row 8 in the summary table). A full description is beyond our scope.30 One limitation is that it requires a model of the joint distribution of (X′, A, Y) whose parameters can be inferred from the data, whereas the outcome test is model-free.

While we described infra-marginality as a limitation of the outcome test, it can also be seen as a benefit. When using a marginal


31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

test, we treat the distribution of applicant characteristics as a given and miss the opportunity to ask: why are some individuals so far from the margin? Ideally we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test, analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward, even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates: it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32
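A small synthetic sketch can make the selective labels problem concrete. Everything below (the distributions, the threshold, the 5% experimental rate) is an invented assumption; the point is only that an outcome rate estimated on the selectively labeled sample differs systematically from one estimated on a randomized sample of the kind such an experiment would provide.

```python
# Sketch of selection bias from selective labels (synthetic, illustrative numbers).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

true_risk = rng.beta(2, 5, size=n)                         # probability of a bad outcome
outcome = rng.random(n) < true_risk                        # Y, observable only if approved
score = np.clip(true_risk + rng.normal(0, 0.15, n), 0, 1)  # decision maker's noisy score R
approved = score < 0.4                                     # favorable decision below threshold

# Naive estimate: computed only on the selectively labeled (approved) sample.
naive_rate = outcome[approved].mean()

# Experimental estimate: a random 5% receive positive decisions regardless of
# the score, as in row 9 of the summary table.
randomized = rng.random(n) < 0.05
experimental_rate = outcome[randomized].mean()

print("bad-outcome rate, selectively labeled sample:", round(naive_rate, 3))
print("bad-outcome rate, randomized sample:         ", round(experimental_rate, 3))
# The first estimate is biased downward because the selection process screened
# out high-risk cases; error-rate and accuracy estimates inherit the same bias.
```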

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications it is indeed possible, and important, to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

#  | Test / study design | Fairness notion / application | Access | Notes / limitations
1  | Audit study | Blindness | A-exp=, X=, R | Difficult to interpret
2  | Natural experiment, especially diff-in-diff | Impact of blinding | A-exp∼, R | Confounding, SUTVA violations, other
3  | Natural experiment | Arbitrariness | W∼, R | Unobserved confounders
4  | Natural experiment, especially regression discontinuity | Impact of decision | R, Y or Y′ | Sample size, confounding, other technical difficulties
5  | Regression analysis | Blindness | X, A, R | Unreliable due to proxies
6  | Regression analysis | Conditional demographic parity | X, A, R | Weak moral justification
7  | Outcome test | Predictive parity | A, Y given Ŷ = 1 | Infra-marginality
8  | Threshold test | Sufficiency | X′, A, Y given Ŷ = 1 | Model-specific
9  | Experiment | Separation / error rate parity | A, R, Ŷ=, Y | Often unethical or impractical
10 | Observational test | Demographic parity | A, R | See Chapter 2
11 | Mediation analysis | "Relevant" mechanism | X, A, R | See Chapter 4

Legend

• = : indicates an intervention on some variable (that is, X= does not represent a new random variable, but is simply an annotation describing how X is used in the test)
• ∼ : natural variation in some variable, exploited by the researcher
• A-exp : exposure of a signal of the sensitive attribute to the decision maker
• W : a feature that is considered irrelevant to the decision
• X′ : a set of features which may not coincide with those observed by the decision maker
• Y′ : an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice toward a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and, in fact, discriminates in favor of the female candidates who would take time off if hired).
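The logic of this simple model can be written out as a few lines of Bayes' rule. The sketch below is ours, with invented base rates and noise levels: the two groups differ only in the prevalence of the target, the observable signal is imperfect, and so the accuracy-maximizing prediction differs for identical observables.

```python
# Toy model of statistical discrimination: optimal predictions differ by group
# even when the observed characteristics are identical. All numbers are invented.
from scipy.stats import norm

def p_target_given_signal(x, base_rate, noise=1.0):
    """P(target = 1 | signal x, group) by Bayes' rule; the signal is centered
    at 1 when the target is 1 and at 0 otherwise, with Gaussian noise."""
    like1 = norm.pdf(x, loc=1.0, scale=noise)
    like0 = norm.pdf(x, loc=0.0, scale=noise)
    return base_rate * like1 / (base_rate * like1 + (1 - base_rate) * like0)

x = 0.8  # identical observed characteristics for two candidates
print(p_target_given_signal(x, base_rate=0.30))  # group with higher prevalence of the target
print(p_target_given_signal(x, base_rate=0.10))  # group with lower prevalence
# A decision maker optimizing accuracy will treat the two candidates differently
# purely because of group membership.
```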

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, it is argued to be either justified or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize: rather than applying a one-size-fits-all scheme for understanding mechanisms, such as taste-based vs. statistical discrimination, it is more useful to take a nuanced and domain-specific approach, formulating hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to, and symbiotic with, quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews, in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the "pedigree"


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

of the book's title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants, notably applicants from China, would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

an assignment of applicants to hospitals that optimizes mutual desirability.42

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas: algorithmic decision making, various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
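For readers who want to see what such an audit involves mechanically, here is a rough sketch using the off-the-shelf langid.py model. The corpus file names are hypothetical placeholders; constructing and validating the dialect-aligned corpora was the hard part of the original study and is not shown.

```python
# Sketch of a language-identification error-rate audit with langid.py.
# The two corpus files are hypothetical stand-ins for dialect-aligned tweet sets.
import langid

def non_english_rate(texts):
    # Fraction of (English) texts that the pre-trained model labels as non-English.
    labels = [langid.classify(t)[0] for t in texts]
    return sum(lab != "en" for lab in labels) / len(labels)

aae = open("aae_tweets.txt", encoding="utf-8").read().splitlines()
white_aligned = open("white_aligned_tweets.txt", encoding="utf-8").read().splitlines()

print("AAE tweets classified as non-English:          ", non_english_rate(aae))
print("White-aligned tweets classified as non-English:", non_english_rate(white_aligned))
```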

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups,49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task | Examples | Sources of disparity | Harm
Perception | Language identification, speech-to-text | Underrepresentation in training corpus | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels, underrepresentation in training corpus | Adverse decisions
Predicting outcomes | Resume filtering | Various, including human labels | Adverse decisions
Sequence prediction | Language generation, translation | Cultural stereotypes, historical prejudices | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).

59 Martineau, "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).

60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.

61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).

62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
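Evaluations like this one are, at their core, disaggregated error-rate computations. A sketch of the bookkeeping is below; the CSV file and its column names are assumptions about how an audit dataset with skin type and gender labels might be organized, not the original study's data.

```python
# Sketch of a disaggregated (intersectional) error-rate evaluation.
# The audit file and its columns (predicted_gender, label_gender, skin_type)
# are hypothetical.
import pandas as pd

df = pd.read_csv("gender_classifier_audit.csv")
df["error"] = df["predicted_gender"] != df["label_gender"]

print("overall error rate:", df["error"].mean())          # hides the disparity
print(df.groupby(["skin_type", "label_gender"])["error"]
        .agg(["mean", "size"])
        .rename(columns={"mean": "error_rate", "size": "n"}))
# The grouped view surfaces which intersections (e.g., darker-skinned women)
# bear the highest error rates.
```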

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place. By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.
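In a simulation study of this kind, the basic measurement is per-group prediction error on held-out feedback. The sketch below assumes a log of held-out ratings with the model's predictions and a group label; it is not the methodology of the cited paper, just the simplest version of the comparison.

```python
# Sketch: compare recommendation quality (RMSE on held-out ratings) by group.
# The file and its columns (user_group, rating, predicted_rating) are assumptions.
import pandas as pd

preds = pd.read_csv("heldout_predictions.csv")
preds["sq_err"] = (preds["rating"] - preds["predicted_rating"]) ** 2

rmse_by_group = preds.groupby("user_group")["sq_err"].mean() ** 0.5
print(rmse_by_group)
# A persistently higher RMSE for a minority group suggests the system has
# learned the majority group's preferences and generalizes poorly to others.
```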

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct from a fairness perspective is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm, rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction, such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.
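The study's latent pairwise model is beyond a short sketch, but its first ingredient, controlling for demographic-specific variation in a behavioral proxy, can be illustrated with a simple regression. Everything below (the log file, its columns, the choice of covariates) is a hypothetical simplification, not the Bing methodology.

```python
# Simplified sketch: regress a dissatisfaction proxy (query reformulation) on
# demographic group while controlling for covariates that affect behavior for
# reasons unrelated to satisfaction. Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

logs = pd.read_csv("search_impressions.csv")
# expected columns: reformulated (0/1), age_group, gender, query_topic, device

model = smf.logit(
    "reformulated ~ C(age_group) + C(gender) + C(query_topic) + C(device)",
    data=logs,
).fit()
print(model.summary())
# Group coefficients that survive the controls are crude evidence of
# differential satisfaction, subject to all the caveats about proxies above.
```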

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (nyu Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, . . . ) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
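As a minimal sketch of what such a calibration check involves (the numbers below are made up, not the study's data): for each occupation, compare the fraction of women in the image results against the real-world fraction, and look for systematic exaggeration of the skew.

# Hypothetical data: fraction of women in top image results vs. in the occupation.
import numpy as np

occupations = ["author", "bartender", "construction worker", "nurse"]
search_frac = np.array([0.45, 0.48, 0.05, 0.92])   # made-up search-result skew
actual_frac = np.array([0.51, 0.55, 0.09, 0.88])   # made-up occupational statistics

# Under calibration, the points lie near the diagonal (search skew = real skew).
deviation = search_frac - actual_frac
slope, intercept = np.polyfit(actual_frac, search_frac, deg=1)
print(f"mean absolute deviation: {np.mean(np.abs(deviation)):.3f}")
print(f"fit of search skew on real-world skew: slope={slope:.2f}, intercept={intercept:.2f}")
# A fitted slope noticeably above 1 would indicate stereotype exaggeration
# (gender imbalances amplified in the results).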

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative — that is, exerting covert influence instead of making forthright appeals — or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly, one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
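A stylized simulation of this market effect is below (all numbers invented, and the cheapest-first delivery rule is a simplification of real ad auctions): if reaching one group costs more and a fixed budget is filled with the cheapest available impressions first, small budgets skew the delivered audience toward the cheaper group.

# Stylized model, not the platform's actual delivery mechanism: impressions are
# bought cheapest-first until the budget runs out.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.choice(["A", "B"], size=n)
# Assumption for illustration: group B is more expensive to reach on average.
cost = np.where(group == "A", rng.gamma(2.0, 0.5, n), rng.gamma(2.0, 0.8, n))

def group_b_share(budget):
    order = np.argsort(cost)                              # cheapest impressions first
    served = order[np.cumsum(cost[order]) <= budget]
    return np.mean(group[served] == "B")                  # group B's share of the audience

for budget in [200, 1_000, 5_000]:
    print(f"budget {budget}: share of group B in audience = {group_b_share(budget):.2f}")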

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those used in traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced, entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
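The headline comparison in such an audit reduces to a difference in acceptance rates between the two name groups. The sketch below uses hypothetical counts chosen only to roughly match the reported 50% and 42% rates (the actual study estimates the gap with regression controls), and applies a standard two-proportion z-test.

# Hypothetical counts, standard library only; not the study's actual data or model.
from math import sqrt
from statistics import NormalDist

accepted  = [1600, 1344]   # inquiries accepted: White-sounding vs. African-American-sounding names
contacted = [3200, 3200]   # hypothetical even split of the 6,400 contacted listings

p1, p2 = accepted[0] / contacted[0], accepted[1] / contacted[1]
p_pool = sum(accepted) / sum(contacted)
se = sqrt(p_pool * (1 - p_pool) * (1 / contacted[0] + 1 / contacted[1]))
z = (p1 - p2) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"acceptance rates: {p1:.2f} vs {p2:.2f}; z = {z:.2f}, two-sided p = {p_value:.2g}")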

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. LJ 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29 hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier, and their centralized nature presents a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up".
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price, or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
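The reported rule is simple enough to state as code. The sketch below (hypothetical store coordinates, haversine distance) shows a discount whenever a competitor's store is within roughly 20 miles of the customer's inferred location; the point is that a rule with no demographic inputs can still produce wealth- and race-correlated disparities.

# Stylized reconstruction of the reported pricing rule; locations are made up.
from math import radians, sin, cos, asin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in miles.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

competitor_stores = [(40.7357, -74.1724), (34.0522, -118.2437)]  # hypothetical store locations

def show_discount(customer_lat, customer_lon, radius_miles=20):
    return any(miles_between(customer_lat, customer_lon, s_lat, s_lon) <= radius_miles
               for s_lat, s_lon in competitor_stores)

print(show_discount(40.73, -74.17))    # near a hypothetical competitor store -> discounted price
print(show_discount(44.50, -100.00))   # far from any competitor -> full price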

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than a uniform


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or on body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
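For concreteness, all three criteria can be read off from the same four quantities: group membership, scores, decisions, and outcomes. The sketch below computes, per group, the acceptance rate (compared across groups for demographic parity), the false positive rate (error rate parity), and the outcome rate among accepted individuals (calibration at the decision threshold) on synthetic data.

# Synthetic data; the point is only which per-group quantities each criterion compares.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.choice(["a", "b"], n)
score = np.clip(rng.beta(2, 2, n) + (group == "a") * 0.05, 0, 1)  # toy risk scores
decision = score > 0.5                      # accept if the score exceeds a threshold
outcome = rng.random(n) < score             # toy outcomes, calibrated by construction

for g in ["a", "b"]:
    m = group == g
    acceptance = decision[m].mean()          # demographic parity compares this across groups
    fpr = decision[m & ~outcome].mean()      # error rate parity compares this (and FNR)
    ppv = outcome[m & decision].mean()       # calibration compares outcome rates among the accepted
    print(f"group {g}: acceptance={acceptance:.2f}, FPR={fpr:.2f}, P(outcome|accepted)={ppv:.2f}")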

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks are not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result — e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

better) could have arisen by chance if there is, in fact, no systematic difference between the two groups.99

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on the observable behavior of the system is merely a proxy for it.
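A minimal sketch of steps 4 and 5 is below, using scikit-learn and stand-in features in place of the recorded page contents. Under the null hypothesis of no information flow, cross-validated accuracy should hover around 1/2, and the permutation test quantifies how surprising the observed accuracy would be if the two groups were in fact indistinguishable.

# Stand-in data: in a real audit, rows would be features of the pages each bot group saw.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, permutation_test_score

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 50)).astype(float)   # e.g., word counts per recorded page
y = np.repeat([0, 1], 100)                           # 0 = group A, 1 = group B

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=5).mean()
score, perm_scores, pvalue = permutation_test_score(
    clf, X, y, cv=5, n_permutations=200, random_state=0)

# Accuracy near 0.5 with a large p-value is consistent with no detectable flow;
# accuracy significantly above 0.5 indicates the two groups' pages differ systematically.
print(f"cross-validated accuracy = {acc:.2f}, permutation p-value = {pvalue:.3f}")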

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting — in which actions taken on one website, such as searching for a product, result in ads for that product on another website — to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
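A sketch of that technique is below; the URL, cookie name, and price parser are placeholders (a real audit would first inspect the site to find where it stores the inferred location and how prices appear in the page), but the overall structure, sweeping the stored location and recording the displayed price, is the same.

# Placeholder URL, cookie name, and parser; illustrative structure only.
import re
import requests

def extract_price(html):
    m = re.search(r"\$(\d+\.\d{2})", html)           # naive price parser for illustration
    return float(m.group(1)) if m else None

def price_for_zip(zipcode, url="https://www.example.com/product/123"):
    # Override the site's stored location by setting the (hypothetical) cookie directly.
    resp = requests.get(url, cookies={"zipcode": zipcode}, timeout=10)
    return extract_price(resp.text)

# Sweep a few ZIP codes and compare the prices shown:
# for z in ["10001", "57501", "94103"]:
#     print(z, price_for_zip(z))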

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., biased towards people who are easy to contact), which is a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task — a task that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these


methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.

Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37, 2018.

Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.

Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.

Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.

Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96, 2016.

Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.

Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.

Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.

Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.

Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.

Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91, 2018.

Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.

Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.

Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.

Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48, 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.

D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.

Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33, 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. nyu Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78, 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-DeVries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


31 Lakkaraju et al., "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), 275–84.

32 Bird et al., "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI," in Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

test, we treat the distribution of applicant characteristics as a given, and miss the opportunity to ask: why are some individuals so far from the margin? Ideally, we can use causal inference to answer this question, but when the data at hand don't allow this, non-marginal tests might be a useful starting point for diagnosing unfairness that originates "upstream" of the decision maker. Similarly, error rate disparity, to which we will now turn, while crude by comparison to more sophisticated tests for discrimination, attempts to capture some of our moral intuitions for why certain disparities are problematic.

Separation and selective labels

Recall that separation is defined as R ⊥ A | Y. At first glance, it seems that there is a simple observational test, analogous to our test for sufficiency (Y ⊥ A | R). However, this is not straightforward, even for the decision maker, because outcome labels can be observed only for some of the applicants (i.e., the ones who received favorable decisions). Trying to test separation using this sample suffers from selection bias. This is an instance of what is called the selective labels problem. The issue also affects the computation of false positive and false negative rate parity, which are binary versions of separation.

More generally, the selective labels problem is the issue of selection bias in evaluating decision making systems due to the fact that the very selection process we wish to study determines the sample of instances that are observed. It is not specific to the issue of testing separation or error rates; it affects the measurement of other fundamental metrics, such as accuracy, as well. It is a serious and often overlooked issue that has been the subject of recent study.31
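To make the selective labels problem concrete, here is a minimal simulation of our own (with made-up parameters, not drawn from any study): a quantity like positive predictive value can be computed from the labels the decision maker actually observes, whereas a false negative rate requires outcomes for rejected applicants, which only the simulation's ground truth provides.

```python
# Illustration of selective labels: outcomes are observed only for accepted
# applicants, so error-rate-style quantities cannot be estimated from the
# decision maker's data alone. All parameters are made up.
import numpy as np

rng = np.random.default_rng(0)

def simulate(base_rate, n=200_000):
    # Latent outcome Y and a noisy score; a favorable decision is given when
    # the score clears a threshold.
    y = rng.random(n) < base_rate
    score = y + rng.normal(scale=1.2, size=n)
    accepted = score > 0.8
    return y, accepted

# Hypothetical groups with different base rates of the outcome.
for group, base_rate in [("group_a", 0.5), ("group_b", 0.2)]:
    y, accepted = simulate(base_rate)
    ppv = y[accepted].mean()        # computable from observed labels (accepted applicants)
    fnr = (~accepted[y]).mean()     # needs outcomes of rejected applicants too
    print(f"{group}: PPV={ppv:.2f} (observable), FNR={fnr:.2f} (hidden by selective labels)")
```

Only the simulation can report the false negative rate, because it knows the outcomes of the people who were rejected; a real decision maker does not.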

One way to get around this barrier is for the decision maker to employ an experiment in which some sample of decision subjects receive positive decisions regardless of the prediction (row 9 in the summary table). However, such experiments raise ethical concerns and are rarely done in practice. In machine learning, some experimentation is necessary in settings where there does not exist offline data for training the classifier, which must instead simultaneously learn and make decisions.32

One scenario where it is straightforward to test separation is when the "prediction" is not actually a prediction of a future event, but rather when machine learning is used for automating human judgment, such as harassment detection in online comments. In these applications, it is indeed possible and important to test error rate parity.


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

Test / study design                           Fairness notion / application    Access             Notes / limitations
1.  Audit study                               Blindness                        A-exp=, X=, R      Difficult to interpret
2.  Natural experiment                        Impact of blinding               A-exp~, R          Confounding, SUTVA violations, other
    (especially diff-in-diff)
3.  Natural experiment                        Arbitrariness                    W~, R              Unobserved confounders
4.  Natural experiment                        Impact of decision               R, Y or Y′         Sample size, confounding, other technical difficulties
    (especially regr. disc.)
5.  Regression analysis                       Blindness                        X, A, R            Unreliable due to proxies
6.  Regression analysis                       Cond. demographic parity         X, A, R            Weak moral justification
7.  Outcome test                              Predictive parity                A, Y | Ŷ = 1       Infra-marginality
8.  Threshold test                            Sufficiency                      X′, A, Y | Ŷ = 1   Model-specific
9.  Experiment                                Separation / error rate parity   A, R, Ŷ=, Y        Often unethical or impractical
10. Observational test                        Demographic parity               A, R               See Chapter 2
11. Mediation analysis                        "Relevant" mechanism             X, A, R            See Chapter 4

Legend

• =: indicates an intervention on some variable (that is, X= does not represent a new random variable, but is simply an annotation describing how X is used in the test)
• ~: natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice against a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
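A toy calculation (ours, with made-up numbers) shows how the two conditions combine: when group base rates differ and the observed signal is noisy, the Bayes-optimal posterior differs by group even for an identical signal.

```python
# Toy illustration of the simplest model of statistical discrimination:
# two groups with different base rates of the target variable, an identical
# noisy feature, and a Bayes-optimal predictor that uses group membership.
from scipy.stats import norm

def posterior(x, base_rate, noise_sd=1.0):
    """P(Y=1 | feature x, group base rate), where x ~ Normal(Y, noise_sd)."""
    like1 = norm.pdf(x, loc=1, scale=noise_sd) * base_rate
    like0 = norm.pdf(x, loc=0, scale=noise_sd) * (1 - base_rate)
    return like1 / (like1 + like0)

x = 0.5  # the same observed qualification signal for two candidates
print("Group A (base rate 0.6):", round(posterior(x, 0.6), 3))
print("Group B (base rate 0.4):", round(posterior(x, 0.4), 3))
# The optimal prediction differs by group even though the observable
# characteristic x is identical for the two candidates.
```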

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision-maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects, and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business; as for statistical discrimination, it is argued to be either justified or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize: rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.
39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit, and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120

interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the book's title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores, despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants, notably applicants from China, would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors, such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuate inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

an assignment of applicants to hospitals that optimizes mutual desirability.42
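For concreteness, here is a minimal sketch of deferred acceptance, the family of stable matching algorithms behind the residency match, with one position per hospital and made-up preference lists; the real match additionally handles program quotas, couples, and other constraints.

```python
# A minimal applicant-proposing deferred-acceptance sketch (one position per
# hospital, complete preference lists assumed). Names and preferences are
# invented for illustration only.
def stable_match(applicant_prefs, hospital_prefs):
    """applicant_prefs / hospital_prefs: dict mapping name -> ranked list of names."""
    rank = {h: {a: i for i, a in enumerate(prefs)} for h, prefs in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}  # index of next hospital to propose to
    matched = {}                                    # hospital -> applicant
    free = list(applicant_prefs)
    while free:
        a = free.pop()
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        current = matched.get(h)
        if current is None:
            matched[h] = a
        elif rank[h][a] < rank[h][current]:         # hospital prefers the new proposer
            matched[h] = a
            free.append(current)
        else:
            free.append(a)
    return matched

prefs_a = {"ann": ["mercy", "city"], "bob": ["city", "mercy"]}
prefs_h = {"mercy": ["bob", "ann"], "city": ["ann", "bob"]}
print(stable_match(prefs_a, prefs_h))
```

The output is a stable assignment in the sense of footnote 42: no applicant and hospital would both prefer each other to their assigned match.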

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints, and then match the other member to a proximate hospital if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
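As a rough illustration of the measurement involved, the sketch below computes per-dialect false negative rates for an off-the-shelf language identifier. It is not the study's code: it assumes the langid package's classify interface and two hypothetical files of English-language tweets, one per dialect corpus.

```python
# Minimal sketch of a per-dialect error rate comparison for language
# identification. The filenames are hypothetical placeholders.
import langid

def non_english_rate(path):
    """Fraction of lines (all actually English) that langid labels as non-English."""
    with open(path, encoding="utf-8") as f:
        tweets = [line.strip() for line in f if line.strip()]
    misses = sum(1 for t in tweets if langid.classify(t)[0] != "en")
    return misses / len(tweets)

for name, path in [("AAE", "aae_tweets.txt"), ("White-aligned", "wa_tweets.txt")]:
    print(f"{name}: false negative rate = {non_english_rate(path):.3f}")
```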

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups,49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word

embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task          Examples                            Sources of disparity                          Harm
Perception            Language id., speech-to-text        Underrep. in training corpus                  Degraded service
Automating judgment   Toxicity detection, essay grading   Human labels, underrep. in training corpus    Adverse decisions
Predicting outcomes   Resume filtering                    Various, including human labels               Adverse decisions
Sequence prediction   Language generation, translation    Cultural stereotypes, historical prejudices   Repres. harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1% – 20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 McEntegart, "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8% – 19.2% difference in error rate), and worst on darker female faces (20.8% – 34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
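A minimal sketch of the first, easier fix: assuming we have a held-out validation set with the classifier's probability-of-male scores and true labels, we can search for a decision threshold that balances the two misclassification rates without retraining. All data below is synthetic.

```python
# Per-threshold error balancing on a held-out validation set (synthetic data).
import numpy as np

def balanced_threshold(scores, is_male, grid=np.linspace(0.05, 0.95, 181)):
    """Pick the threshold that minimizes the gap between the two error rates."""
    best_t, best_gap = 0.5, np.inf
    for t in grid:
        pred_male = scores >= t
        female_to_male = np.mean(pred_male[~is_male])   # females misclassified as male
        male_to_female = np.mean(~pred_male[is_male])   # males misclassified as female
        gap = abs(female_to_male - male_to_female)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Toy validation data, skewed so that females are often scored as male.
rng = np.random.default_rng(1)
is_male = rng.random(10_000) < 0.5
scores = np.clip(rng.normal(loc=np.where(is_male, 0.75, 0.45), scale=0.2), 0, 1)
print("recalibrated threshold:", balanced_threshold(scores, is_male))
```

Note that this only addresses the first issue; it does nothing for the disparity across skin tones, which reflects the training data and model itself.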

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct from a fairness perspective is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression among a randomly selected pair of impressions led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (nyu Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
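The following sketch, with entirely made-up numbers rather than the study's data, illustrates the kind of calibration check described above: compare the fraction of women in image search results for each occupation to the real-world fraction, and look for systematic exaggeration of the skew.

```python
# Calibration-style comparison of search-result gender fractions against
# occupational statistics. All numbers are invented for illustration.
import numpy as np

occupations = ["author", "bartender", "construction worker", "nurse"]
real_frac = np.array([0.56, 0.60, 0.03, 0.90])     # hypothetical labor statistics
search_frac = np.array([0.61, 0.65, 0.02, 0.95])   # hypothetical search-result tallies

# Perfect calibration would put every point on the line search_frac = real_frac.
residuals = search_frac - real_frac
print(f"mean absolute deviation: {np.abs(residuals).mean():.3f}")

# Stereotype exaggeration: the search results push fractions further from 0.5
# than the real-world statistics do.
exaggerated = np.sign(residuals) == np.sign(real_frac - 0.5)
print("occupations with exaggerated skew:", [o for o, e in zip(occupations, exaggerated) if e])
```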

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly, one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads that we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself, or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of anti-semitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis, because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used, five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
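A minimal sketch of the statistical comparison underlying such an audit, using made-up counts rather than the study's data: a two-proportion z-test on acceptance rates for the two name groups, via statsmodels. The published analysis additionally uses regressions with host and listing controls.

```python
# Two-proportion z-test on audit-style acceptance data (hypothetical counts).
from statsmodels.stats.proportion import proportions_ztest

accepted = [1600, 1344]   # hypothetical acceptances: White-sounding, African-American-sounding
contacted = [3200, 3200]  # hypothetical inquiries sent per group

stat, pvalue = proportions_ztest(count=accepted, nobs=contacted)
rates = [a / n for a, n in zip(accepted, contacted)]
print(f"acceptance rates: {rates[0]:.2f} vs {rates[1]:.2f}, z={stat:.2f}, p={pvalue:.4f}")
```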

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. LJ 32 (2017): 1183.
87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier, and the centralized nature of these platforms presents a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.
89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).
90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and the immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up".
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably, this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than through a uniform


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or on body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
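Part of the appeal of these metrics is how little is needed to compute them: binary decisions, outcomes, and group labels. The sketch below is a generic illustration, not tied to any particular study; the function name and output keys are our own.

```python
import numpy as np

def observational_report(y_true, y_pred, group):
    """Per-group acceptance rate, error rates, and positive predictive value."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    report = {}
    for g in np.unique(group):
        m = group == g
        report[g] = {
            "acceptance_rate": y_pred[m].mean(),                    # demographic parity
            "false_positive_rate": y_pred[m][y_true[m] == 0].mean(),
            "false_negative_rate": 1 - y_pred[m][y_true[m] == 1].mean(),
            "ppv": y_true[m][y_pred[m] == 1].mean(),                # calibration at the threshold
        }
    return report
```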

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99
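Steps 4 and 5 are straightforward to implement once the recorded pages have been converted to feature vectors (e.g., bag-of-words counts). The following is a minimal sketch, using scikit-learn's generic permutation-test helper rather than the exact procedure of the cited study:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, permutation_test_score

def adversarial_test(X, y, n_permutations=1000, random_state=0):
    """X: one feature vector per recorded page; y: 1 for group-A bots, 0 for group B."""
    clf = LogisticRegression(max_iter=1000)
    # Step 4: cross-validated accuracy of telling A-pages and B-pages apart.
    accuracy = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    # Step 5: with no information flow, accuracy should hover around 1/2.
    # The permutation test estimates how often the observed accuracy would
    # arise if there were no systematic difference between the two groups.
    _, _, p_value = permutation_test_score(
        clf, X, y, cv=5, scoring="accuracy",
        n_permutations=n_permutations, random_state=random_state)
    return accuracy, p_value
```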

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
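The crawling technique is easy to sketch. The snippet below is a hypothetical illustration of the general approach, not the Journal's actual code: the URL and the cookie name are placeholders, price extraction is site-specific, and any such crawl should respect the site's terms of service and rate limits.

```python
import requests

PRODUCT_URL = "https://www.example.com/product/12345"  # placeholder, not the real site

def fetch_page_as_if_from(zip_code, cookie_name="zipcode"):
    """Request the product page while overriding the (hypothetical) location cookie."""
    resp = requests.get(PRODUCT_URL, cookies={cookie_name: zip_code}, timeout=10)
    return resp.text  # parse the displayed price out of this HTML (site-specific)

# Sweep a handful of ZIP codes (the Journal swept all ~42,000):
# pages = {z: fetch_page_as_if_from(z) for z in ["10001", "60601", "94103"]}
```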

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108
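To see why simulation is well suited to questions about dynamics, here is a minimal sketch of a runaway feedback loop in the spirit of the predictive policing analyses cited above; it is not a reproduction of either paper's model. Two neighborhoods have identical true crime rates, but patrols follow the recorded data, so a small initial imbalance in the records is amplified over time.

```python
import numpy as np

rng = np.random.default_rng(0)

true_rate = np.array([0.5, 0.5])     # identical underlying crime rates
recorded = np.array([11.0, 10.0])    # slight initial imbalance in historical records
total_patrols = 100

for day in range(365):
    # Prioritize the apparent hot spot: weight past records super-linearly,
    # which makes the allocation self-reinforcing.
    weights = recorded ** 2
    patrols = total_patrols * weights / weights.sum()
    # Recorded crime depends on where police are sent, not only on true crime.
    recorded += rng.poisson(patrols * true_rate)

# The share of recorded incidents drifts toward one neighborhood even though
# the true rates are equal; the data never self-corrects toward 50/50.
print(recorded / recorded.sum())
```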

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, one that is further complicated by the limitations of the available data. The paper documents the substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.
Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.
Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.
Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.
Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.
Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.
Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.
Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


Summary of traditional tests and methods

Table 1: Summary of traditional tests and methods, highlighting the relationship to fairness, the observational and experimental access required by the researcher, and limitations.

1. Audit study. Fairness notion/application: Blindness. Access: A-exp =, X =, R. Notes/limitations: Difficult to interpret.
2. Natural experiment (especially diff-in-diff). Fairness notion/application: Impact of blinding. Access: A-exp ~, R. Notes/limitations: Confounding, SUTVA violations, other.
3. Natural experiment. Fairness notion/application: Arbitrariness. Access: W ~, R. Notes/limitations: Unobserved confounders.
4. Natural experiment (especially regression discontinuity). Fairness notion/application: Impact of decision. Access: R, Y or Y′. Notes/limitations: Sample size, confounding, other technical difficulties.
5. Regression analysis. Fairness notion/application: Blindness. Access: X, A, R. Notes/limitations: Unreliable due to proxies.
6. Regression analysis. Fairness notion/application: Conditional demographic parity. Access: X, A, R. Notes/limitations: Weak moral justification.
7. Outcome test. Fairness notion/application: Predictive parity. Access: A, Y | Ŷ = 1. Notes/limitations: Infra-marginality.
8. Threshold test. Fairness notion/application: Sufficiency. Access: X′, A, Y | Ŷ = 1. Notes/limitations: Model-specific.
9. Experiment. Fairness notion/application: Separation / error rate parity. Access: A =, R, Y = Ŷ. Notes/limitations: Often unethical or impractical.
10. Observational test. Fairness notion/application: Demographic parity. Access: A, R. Notes/limitations: See Chapter 2.
11. Mediation analysis. Fairness notion/application: "Relevant" mechanism. Access: X, A, R. Notes/limitations: See Chapter 4.

Legend

• = : indicates intervention on some variable (that is, X = does not represent a new random variable, but is simply an annotation describing how X is used in the test)
• ~ : natural variation in some variable, exploited by the researcher
• A-exp: exposure of a signal of the sensitive attribute to the decision maker
• W: a feature that is considered irrelevant to the decision
• X′: a set of features which may not coincide with those observed by the decision maker
• Y′: an outcome that may or may not be the one that is the target of prediction

Taste-based and statistical discrimination

We have reviewed several methods of detecting discrimination, but we have not addressed the question of why discrimination happens. A long-standing way to try to answer this question from an economic perspective is to classify discrimination as taste-based or statistical. A taste-based discriminator is motivated by an irrational animus or prejudice toward a group. As a result, they are willing to make suboptimal decisions by passing up opportunities to select candidates from that group, even though they will incur a financial penalty for doing so. This is the classic model of discrimination in labor


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold. First, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
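The logic of this simplest model can be made precise with a short calculation in the style of Phelps's model; the notation below is local to this sketch. Suppose the target variable of a candidate from group a is drawn from a group-specific prior and the employer observes only a noisy signal:

```latex
Y \mid A = a \;\sim\; \mathcal{N}(\mu_a, \sigma^2), \qquad
X = Y + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2).

\mathbb{E}[\,Y \mid X = x, A = a\,] \;=\; \lambda x + (1 - \lambda)\,\mu_a,
\qquad \lambda = \frac{\sigma^2}{\sigma^2 + \sigma_\varepsilon^2}.
```

Whenever the signal is noisy (λ < 1), two candidates with the identical signal x receive different predictions if their group means differ: the optimal prediction shrinks toward the group mean, which is exactly the behavior described in the example above.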

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers, and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects, and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, it is either argued to be justified or futile to proscribe, because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies with the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with a CV equivalent to a male candidate's is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms (such as taste-based vs. statistical discrimination), a nuanced and domain-specific approach is more useful: we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.
39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to, and symbiotic with, quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120

interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for nine months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects will behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the book's title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants, notably applicants from China, would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful; in fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.
43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

an assignment of applicants to hospitals that optimizes mutual desirability.42
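To make the mechanism concrete, here is a minimal sketch of applicant-proposing deferred acceptance, the family of algorithms that produces stable matchings of this kind, with single-slot hospitals and invented preference lists. It is an illustration of the idea, not the actual residency match implementation, and it ignores couples, program capacities, and other practical details.

```python
# Minimal sketch of applicant-proposing deferred acceptance (stable matching).
# Single-slot hospitals; preference lists are invented for illustration.

def deferred_acceptance(applicant_prefs, hospital_prefs):
    """applicant_prefs: dict applicant -> hospitals, most preferred first.
       hospital_prefs: dict hospital -> applicants, most preferred first."""
    # rank[h][a]: position of applicant a on hospital h's list (lower is better)
    rank = {h: {a: i for i, a in enumerate(p)} for h, p in hospital_prefs.items()}
    free = list(applicant_prefs)                       # applicants with no tentative match
    next_choice = {a: 0 for a in applicant_prefs}      # index of next hospital to propose to
    held = {}                                          # hospital -> tentatively held applicant

    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                                   # exhausted list; stays unmatched
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if h not in held:
            held[h] = a                                # hospital holds its first proposal
        elif rank[h][a] < rank[h][held[h]]:
            free.append(held[h])                       # hospital trades up; old match freed
            held[h] = a
        else:
            free.append(a)                             # rejected; a proposes elsewhere next
    return {a: h for h, a in held.items()}

# Illustrative toy instance:
applicants = {"ann": ["mercy", "city"], "bob": ["city", "mercy"], "cal": ["city", "mercy"]}
hospitals  = {"mercy": ["bob", "ann", "cal"], "city": ["ann", "cal", "bob"]}
print(deferred_acceptance(applicants, hospitals))      # {'ann': 'city', 'bob': 'mercy'}; cal unmatched
```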

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, which involved designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at the time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on character n-grams achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
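Quantifying such a disparity is conceptually simple once English texts labeled by dialect group are available: compute the false negative rate (English classified as non-English) separately for each group. A minimal sketch, with a placeholder model and placeholder tweets rather than the study's corpora or langid.py itself:

```python
# Minimal sketch: per-group false negative rates for a language identifier.
# `is_english` stands in for any pretrained model (e.g., a wrapper around a
# language-ID library); the tweet lists are placeholders, not the study's data.

def false_negative_rate(english_tweets, is_english):
    # Every tweet passed in is in fact English, so any "non-English"
    # prediction counts as a false negative.
    misses = sum(1 for tweet in english_tweets if not is_english(tweet))
    return misses / len(english_tweets)

is_english = lambda tweet: True            # placeholder classifier for illustration
aae_tweets = ["..."]                       # English tweets written in AAE (placeholders)
white_aligned_tweets = ["..."]             # English tweets in "White-aligned" dialect (placeholders)

print("FNR (AAE):          ", false_negative_rate(aae_tweets, is_english))
print("FNR (White-aligned):", false_negative_rate(white_aligned_tweets, is_english))
```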

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.
55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion. There is a line of research on cultural stereotypes reflected in word

embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task        | Examples                          | Sources of disparity                         | Harm
Perception          | Language id., speech-to-text      | Underrepresentation in training corpus       | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels, underrep. in training corpus   | Adverse decisions
Predicting outcomes | Resume filtering                  | Various, including human labels              | Adverse decisions
Sequence prediction | Language generation, translation  | Cultural stereotypes, historical prejudices  | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.
57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper–and Improper–Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition/, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
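To illustrate the first issue, one possible recalibration is to sweep the decision threshold on the model's "male" score over a held-out set and choose the value that balances the two error rates (female classified as male versus male classified as female). The scores and labels below are synthetic placeholders, not data from the study or from any commercial classifier.

```python
# Minimal sketch: pick a decision threshold that balances the two error
# rates of a binary gender classifier. Synthetic scores for illustration.
import numpy as np

def balanced_threshold(scores, labels, grid=np.linspace(0.01, 0.99, 99)):
    """scores: model's probability of 'male'; labels: 1 = male, 0 = female."""
    best_t, best_gap = 0.5, float("inf")
    for t in grid:
        pred_male = scores >= t
        err_f_as_m = np.mean(pred_male[labels == 0])    # females predicted male
        err_m_as_f = np.mean(~pred_male[labels == 1])   # males predicted female
        gap = abs(err_f_as_m - err_m_as_f)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Synthetic held-out set where female scores skew high (so the default 0.5
# threshold misclassifies many female faces as male):
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(0.45 + 0.40 * labels + 0.2 * rng.standard_normal(1000), 0, 1)
print(balanced_threshold(scores, labels))   # lands above 0.5 for these synthetic scores
```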

Image classification tools have found it particularly challenging to achieve geographic equity due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.
64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.
65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct from a fairness perspective is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression among a randomly selected pair of impressions led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors', fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.
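One way to sketch such a calibration test: for each occupation, compare the fraction of women in the image search results against the fraction of women in the occupation according to labor statistics, and check whether real-world skews are amplified. The numbers and the slope-based check below are illustrative assumptions, not the study's data or exact methodology.

```python
# Minimal sketch of the calibration check for occupational image search results.
# All numbers are invented for illustration.
import numpy as np

occupations = {
    # occupation: (fraction women in search results, fraction women in workforce)
    "bartender":           (0.55, 0.60),
    "construction worker": (0.05, 0.03),
    "author":              (0.45, 0.56),
}

search = np.array([v[0] for v in occupations.values()])
world = np.array([v[1] for v in occupations.values()])

# Perfect calibration puts every occupation on the 45-degree line. A regression
# slope greater than 1 (after centering at 50%) is one signal of stereotype
# exaggeration: real-world skews appear even more extreme in search results.
slope = np.polyfit(world - 0.5, search - 0.5, 1)[0]
print(f"mean absolute deviation: {np.mean(np.abs(search - world)):.3f}, slope: {slope:.2f}")
```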

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals; the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs); and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.
72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).
74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.
76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).
78 Ali et al., "Discrimination Through Optimization."
79 Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
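A toy simulation can make this budget dependence concrete: if delivery greedily buys the cheapest available impressions until the budget is exhausted, and one group's impressions are (hypothetically) pricier on average, then that group's share of the audience grows with the budget. All numbers below are invented.

```python
# Toy simulation of the "market effects" mechanism: a costlier group is
# underrepresented at small budgets. Prices are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.choice(["men", "women"], size=n)
# Hypothetical assumption: impressions shown to women cost more on average.
price = np.where(group == "women",
                 rng.uniform(0.03, 0.09, n),
                 rng.uniform(0.02, 0.06, n))

def audience_share_women(budget):
    order = np.argsort(price)                          # buy cheapest impressions first
    bought = order[np.cumsum(price[order]) <= budget]  # stop when the budget runs out
    return np.mean(group[bought] == "women")

for budget in [100, 1000, 3000]:
    print(f"budget ${budget}: women are {audience_share_women(budget):.0%} of the audience")
```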

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the Ad Settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of anti-semitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).
82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those used in traditional settings such as housing and employment: a combination of audit studies and observational methods has been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male) but were otherwise identical. Twenty different names were used, five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and the diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
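The headline comparison in such an audit reduces to a difference between two proportions, which can be checked with a standard two-proportion z-test. A minimal sketch with invented counts follows; the actual study fits richer models with host- and listing-level controls.

```python
# Minimal sketch: compare acceptance rates for two groups of inquiries with a
# two-proportion z-test. The counts are invented for illustration only.
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(accept_a, n_a, accept_b, n_b):
    p_a, p_b = accept_a / n_a, accept_b / n_b
    pooled = (accept_a + accept_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided
    return p_a - p_b, z, p_value

# Hypothetical counts: 3,200 inquiries per group.
diff, z, p = two_proportion_ztest(accept_a=1600, n_a=3200, accept_b=1344, n_b=3200)
print(f"difference in acceptance rate: {diff:.1%}, z = {z:.2f}, p = {p:.2g}")
```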

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.
87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me–Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms offers a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.
89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).
90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to the field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example, to reinforce the fact that disparity-producing mechanisms can be subtle and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price, or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary and hence unfair. The first is when decisions are made on a whim rather than through a uniform


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess whether a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online/, 2012).

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on the observable behavior of the system is merely a proxy for it.
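To make steps 4 and 5 concrete, here is a minimal sketch in Python using scikit-learn. It assumes the page contents recorded in step 3 have been gathered into two lists of strings, pages_a and pages_b (hypothetical names), and it illustrates the general approach rather than the cited study's actual pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

def adversarial_test(pages_a, pages_b, n_permutations=1000):
    # pages_a, pages_b: raw page contents recorded for the two groups of bots.
    texts = pages_a + pages_b
    y = np.array([0] * len(pages_a) + [1] * len(pages_b))

    # Represent each page by its word counts (a simple bag-of-words model).
    X = CountVectorizer(min_df=2).fit_transform(texts)

    # Cross-validated accuracy of a classifier that tries to tell the groups
    # apart, plus a permutation-based p-value for the null hypothesis that
    # there is no systematic difference between the two groups.
    accuracy, _, p_value = permutation_test_score(
        LogisticRegression(max_iter=1000), X, y,
        scoring="accuracy", cv=5, n_permutations=n_permutations,
    )
    return accuracy, p_value

# Accuracy near 1/2 with a large p-value is consistent with no information
# flow; accuracy significantly above 1/2 suggests the two groups' browsing
# produced systematically different content.
```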

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting – in which actions taken on one website, such as searching for a product, result in ads for that product on another website – to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 The researchers accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
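The cookie-manipulation approach can be sketched as follows; the cookie name, URL, and price-extraction logic below are placeholders for illustration, not the Journal's actual implementation.

```python
import re
import requests

def extract_price(html):
    # Site-specific parsing would go here; a naive regex stands in for it.
    match = re.search(r"\$(\d+\.\d{2})", html)
    return float(match.group(1)) if match else None

def collect_prices(product_url, zip_codes):
    prices = {}
    for zip_code in zip_codes:
        # Present a different inferred location on each request by overriding
        # the cookie in which the site stores the user's location.
        response = requests.get(product_url, cookies={"zipcode": zip_code})
        prices[zip_code] = extract_price(response.text)
    return prices
```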

That said, practical obstacles commonly arise in the fake-profile approach.


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry.


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.

108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.

109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.

110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task – a task that is further complicated by the limitations of the available data. The paper documents the substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems.


Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.

Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.

Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.

Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.

Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.

Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.

Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.

Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, 2007:35. New York, NY, USA, 2007.

Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.

Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.

Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.

Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.

Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.

Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.

Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.

Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.


Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.

D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.

Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.


Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.


Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


33 Becker, The Economics of Discrimination (University of Chicago Press, 1957).

34 Phelps, "The Statistical Theory of Racism and Sexism," The American Economic Review 62, no. 4 (1972): 659–61; Arrow, "The Theory of Discrimination," Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

markets.33

A statistical discriminator, in contrast, aims to make optimal predictions about the target variable using all available information, including the protected attribute.34 In the simplest model of statistical discrimination, two conditions hold: first, the distribution of the target variable differs by group. The usual example is of gender discrimination in the workplace, involving an employer who believes that women are more likely to take time off due to pregnancy (resulting in lower job performance). The second condition is that the observable characteristics do not allow a perfect prediction of the target variable, which is essentially always the case in practice. Under these two conditions, the optimal prediction will differ by group even when the relevant characteristics are identical. In this example, the employer would be less likely to hire a woman than an equally qualified man. There's a nuance here: from a moral perspective, we would say that the employer above discriminates against all female candidates. But under the definition of statistical discrimination, the employer only discriminates against the female candidates who would not have taken time off if hired (and in fact discriminates in favor of the female candidates who would take time off if hired).
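To make the argument explicit, here is a minimal formalization in our own notation (not taken from the works cited above): let $A$ denote group membership, $X$ the observable characteristics, and $Y$ the target variable. By Bayes' rule,

$$\Pr[Y=1 \mid X=x, A=a] \;=\; \frac{\Pr[X=x \mid Y=1, A=a]\,\Pr[Y=1 \mid A=a]}{\Pr[X=x \mid A=a]}.$$

Even if the characteristics carry the same evidence in every group (so the likelihood terms do not depend on $a$), the posterior still depends on $a$ through the base rate $\Pr[Y=1 \mid A=a]$ whenever $X$ does not determine $Y$. This is exactly the combination of the two conditions above under which the optimal prediction differs for candidates with identical characteristics.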

While some authors put much weight on understanding discrimination based on the taste-based vs. statistical categorization, we will de-emphasize it in this book. Several reasons motivate our choice. First, since we are interested in extracting lessons for statistical decision making systems, the distinction is not that helpful: such systems will not exhibit taste-based discrimination unless prejudice is explicitly programmed into them (while that is certainly a possibility, it is not a primary concern of this book).

Second, there are practical difficulties in distinguishing between taste-based and statistical discrimination. Often, what might seem to be a "taste" for discrimination is simply the result of an imperfect understanding of the decision maker's information and beliefs. For example, at first sight, the findings of the car bargaining study may look like a clear-cut case of taste-based discrimination. But maybe the dealer knows that different customers have different access to competing offers and therefore have different willingness to pay for the same item. Then the dealer uses race as a proxy for this amount (correctly or not). In fact, the paper provides tentative evidence towards this interpretation. The reverse is also possible: if the researcher does not know the full set of features observed by the decision maker, taste-based discrimination might be mischaracterized as statistical discrimination.

Third, many of the fairness questions of interest to us, such as structural discrimination, don't map to either of these criteria (as they only consider causes that are relatively proximate to the decision point).


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business, and as for statistical discrimination, either it is argued to be justified or futile to proscribe, because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men; a desire to compensate for past disadvantage suffered by women in STEM fields; a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated); a response to financial incentives for diversification frequently provided by universities to STEM departments; and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms, such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations.


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.

39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

Sometimes such studies reveal blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to, and symbiotic with, quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities – the pedigree in the book's title – are also highly correlated with socioeconomic status.


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social class based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants – notably applicants from China – would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants.


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42
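The flavor of algorithm involved can be illustrated with a sketch of applicant-proposing deferred acceptance for the simplified one-to-one case (one position per hospital). The real Match handles capacities, couples, and other complications, so this is only meant to make the stability property described in the footnote concrete.

```python
def deferred_acceptance(applicant_prefs, hospital_prefs):
    # applicant_prefs: dict applicant -> list of hospitals, most preferred first.
    # hospital_prefs: dict hospital -> list of applicants, most preferred first.
    rank = {h: {a: i for i, a in enumerate(prefs)}
            for h, prefs in hospital_prefs.items()}
    free = list(applicant_prefs)           # applicants who still need to propose
    next_choice = {a: 0 for a in applicant_prefs}
    match = {}                             # hospital -> tentatively held applicant
    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                       # exhausted their list; stays unmatched
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        held = match.get(h)
        if held is None:
            match[h] = a                   # hospital tentatively accepts
        elif rank[h].get(a, len(rank[h])) < rank[h].get(held, len(rank[h])):
            match[h] = a                   # hospital prefers the new proposer
            free.append(held)              # displaced applicant proposes again later
        else:
            free.append(a)                 # rejected; will try the next hospital
    return match

# Example with two applicants and two single-position hospitals:
# deferred_acceptance({"ana": ["mercy", "city"], "bo": ["mercy", "city"]},
#                     {"mercy": ["bo", "ana"], "city": ["ana", "bo"]})
# returns {'mercy': 'bo', 'city': 'ana'}, a stable matching in the sense above.
```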

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it was in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.


45 For a treatise on AAE, see Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.

46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.

47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).

48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.

49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.

50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.

51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.

52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv preprint arXiv:1805.04508, 2018.

53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
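For readers unfamiliar with this class of models, here is a minimal sketch of a character n-gram language identifier and of how one might measure a false negative rate disparity between two corpora. The training data and the two evaluation corpora (aae_tweets, white_aligned_tweets) are assumed to be available; this is not langid.py's actual model or the study's exact protocol.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_language_id(train_texts, train_langs):
    # Character n-grams (here, lengths 1-3) are the standard simple feature set.
    model = make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    return model.fit(train_texts, train_langs)

def false_negative_rate(model, english_texts):
    # Fraction of (actually English) texts labeled as some other language.
    predictions = model.predict(english_texts)
    return sum(p != "en" for p in predictions) / len(predictions)

# Disparity check: compare false_negative_rate(model, aae_tweets)
# with false_negative_rate(model, white_aligned_tweets).
```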

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54 The table below summarizes this discussion.


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task         | Examples                          | Sources of disparity                         | Harm
Perception           | Language id, speech-to-text       | Underrepresentation in training corpus       | Degraded service
Automating judgment  | Toxicity detection, essay grading | Human labels, underrep. in training corpus   | Adverse decisions
Predicting outcomes  | Resume filtering                  | Various, including human labels              | Adverse decisions
Sequence prediction  | Language generation, translation  | Cultural stereotypes, historical prejudices  | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6% difference in error rate).


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' auto-Tagging for Photos," The Guardian 20 (2015).

59 Martineau, "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).

60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.

61 McEntegart, "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).

62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv preprint arXiv:1902.11097, 2019.

Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
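As an illustration of what such a recalibration might look like, here is a minimal sketch (our own illustration, not the vendors' procedure). It assumes held-out scores for the class "male" in [0, 1] and true labels (1 = male, 0 = female), and searches for the single threshold that best balances the two error rates.

```python
import numpy as np

def balanced_threshold(scores, labels, grid=np.linspace(0, 1, 101)):
    # scores: hypothetical model scores for "male"; labels: 1 = male, 0 = female.
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_gap = 0.5, np.inf
    for t in grid:
        pred_male = scores >= t
        err_female_as_male = np.mean(pred_male[labels == 0])
        err_male_as_female = np.mean(~pred_male[labels == 1])
        gap = abs(err_female_as_male - err_male_as_female)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```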

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place.


63. Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.
64. Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.
65. Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66. Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate; reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination rather than a broader perspective of power and accountability. The core issue is that when information


67. For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68. Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69. Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70. Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
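
For intuition, here is a minimal sketch of the kind of calibration check described above. The numbers are made up for illustration; the study used labeled image-search results for 45 occupations and official occupational statistics.

import numpy as np

# Hypothetical data: fraction of women among top image-search results vs.
# fraction of women in the occupation according to labor statistics.
occupations = ["author", "bartender", "construction worker"]
search_frac = np.array([0.55, 0.45, 0.05])
real_frac = np.array([0.52, 0.55, 0.03])

# A simple check: regress real-world proportions on search-result proportions.
# Calibration corresponds to points lying on the identity line (slope 1, intercept 0);
# systematic deviation, with search proportions more extreme than real ones,
# would indicate stereotype exaggeration.
slope, intercept = np.polyfit(search_frac, real_frac, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")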

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals; the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs); and the ability to measure conversion (conversion is when someone who views the ad clicks on it, and then takes another action such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71. Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.
72. The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73. See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).
74. Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself, or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75. Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.
76. Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
77. Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).
78. Ali et al., "Discrimination Through Optimization."
79. Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
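
A toy simulation helps make this mechanism concrete. The per-impression costs and the greedy cheapest-first delivery rule below are our own illustrative assumptions, not a description of any ad platform's actual delivery algorithm.

import numpy as np

rng = np.random.default_rng(0)

def delivered_audience(budget, n_users=10_000):
    """Greedy delivery: show the ad to the cheapest available impressions first."""
    # Hypothetical per-impression costs: group B is more expensive on average.
    group = rng.choice(["A", "B"], size=n_users)
    cost = np.where(group == "A",
                    rng.normal(0.8, 0.1, n_users),
                    rng.normal(1.2, 0.1, n_users))
    order = np.argsort(cost)          # platform fills the order cheapest-first
    spend = np.cumsum(cost[order])
    shown = order[spend <= budget]
    return np.mean(group[shown] == "B")  # fraction of the audience from group B

for budget in [500, 2000, 8000]:
    print(budget, round(delivered_audience(budget), 2))
# Smaller budgets yield audiences dominated by the cheaper group A.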

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of anti-semitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80. Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81. This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).
82. Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83. Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis, because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those for traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
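
As a quick illustration of how one might check whether a gap like 50% versus 42% is statistically meaningful, here is a minimal sketch of a two-proportion z-test. The per-group counts are assumptions for illustration only; the published study reports its own standard errors and regression-based estimates.

from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(acc_a, n_a, acc_b, n_b):
    """Two-sided z-test for a difference in acceptance rates between two groups."""
    p_a, p_b = acc_a / n_a, acc_b / n_b
    p_pool = (acc_a + acc_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative counts: roughly half of the 6,400 inquiries per name group.
z, p = two_proportion_ztest(acc_a=1600, n_a=3200, acc_b=1344, n_b=3200)
print(f"z = {z:.2f}, p = {p:.4f}")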

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84. Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85. Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86. Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. LJ 32 (2017): 1183.
87. Tjaden, Schwemmer, and Khadjavi, "Ride with Me - Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88. Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.
89. This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).
90. Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
91. D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92. Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93. Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news


94. For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95. But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96. Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably, this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
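
To see how such a facially neutral rule can nonetheless produce demographic disparities, here is a minimal sketch. The store coordinates and ZIP centroids are hypothetical stand-ins; only the 20-mile competitor rule comes from the reporting.

from math import radians, sin, cos, asin, sqrt

COMPETITOR_STORES = [(40.75, -73.99), (41.88, -87.63)]  # hypothetical store coordinates

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

def show_discount(zip_centroid, threshold_miles=20):
    """Discount if any competitor store is within the threshold of the inferred location."""
    return any(haversine_miles(*zip_centroid, *store) <= threshold_miles
               for store in COMPETITOR_STORES)

print(show_discount((40.71, -74.01)))  # dense urban ZIP near a competitor: discounted
print(show_discount((44.26, -72.58)))  # rural ZIP far from competitors: full price

Because competitor stores cluster in denser, often wealthier areas, the rule correlates with wealth and geography even though neither appears in it.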

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim, rather than a uniform


97. Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they are able to be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98. Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks are not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result (e.g., which websites or third parties might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content).

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or


99. Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100. Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101. Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102. Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103. Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

There are additional nuances, relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.
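
The following is a minimal sketch of steps 4 and 5, using a bag-of-words classifier and a label-permutation test. The feature extraction and classifier choice are our own illustrative assumptions, not the instrumentation used in the cited study.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def adversarial_test(pages_a, pages_b, n_permutations=200, seed=0):
    """Cross-validated accuracy for distinguishing pages served to group A from
    pages served to group B, plus a permutation p-value for that accuracy."""
    texts = list(pages_a) + list(pages_b)
    y = np.array([0] * len(pages_a) + [1] * len(pages_b))
    X = CountVectorizer(min_df=2).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    observed = cross_val_score(clf, X, y, cv=5).mean()

    rng = np.random.default_rng(seed)
    null_accs = np.empty(n_permutations)
    for i in range(n_permutations):
        y_perm = rng.permutation(y)  # destroy any true association with group membership
        null_accs[i] = cross_val_score(clf, X, y_perm, cv=5).mean()
    p_value = float(np.mean(null_accs >= observed))
    return observed, p_value

An observed accuracy near 1/2 with a large p-value is consistent with no detectable flow; an accuracy significantly above chance suggests that browsing the sensitive site left a footprint in what the bots were later shown.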

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
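
As an illustration of this measurement technique, here is a minimal sketch of a location sweep. The cookie name, URL, and the use of response length as a stand-in for the parsed price are all hypothetical placeholders; a real study would use the site's actual cookie and parsing logic, and would need to consider rate limits and terms of service.

import requests

ZIP_CODES = ["10001", "60606", "05602"]  # the study swept all ~42,000 US ZIP codes
PRODUCT_URL = "https://www.example.com/product/12345"  # hypothetical product page

def fetch_page(zip_code):
    """Request the same product page while overriding the site's location cookie."""
    cookies = {"inferred_location": zip_code}  # hypothetical cookie name
    resp = requests.get(PRODUCT_URL, cookies=cookies, timeout=10)
    # Price extraction is site-specific; return the raw response size as a placeholder.
    return resp.status_code, len(resp.text)

for z in ZIP_CODES:
    print(z, fetch_page(z))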

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited


104. Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105. Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome, where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical, but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach


106. Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107. Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108. Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109. Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110. Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, a task that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these


methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan Amanda and Sonja Starr ldquoBan the Box Criminal Records andRacial Discrimination A Field Experimentrdquo The Quarterly Journalof Economics 133 no 1 (2017) 191ndash235

Ali Muhammad Piotr Sapiezynski Miranda Bogen Aleksandra Ko-rolova Alan Mislove and Aaron Rieke ldquoDiscrimination ThroughOptimization How Facebookrsquos Ad Delivery Can Lead to BiasedOutcomesrdquo Proceedings of the ACM on Human-Computer Interaction3 no CSCW (2019) 199

Amorim Evelin Marcia Canccedilado and Adriano Veloso ldquoAutomatedEssay Scoring in the Presence of Biased Ratingsrdquo In Proceedings ofthe 2018 Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language Technologies Volume1 (Long Papers) 229ndash37 2018

Andreou Athanasios Oana Goga Krishna Gummadi LoiseauPatrick and Alan Mislove ldquoAdAnalystrdquo httpsadanalyst

mpi-swsorg 2017Angwin Julia Madeleine Varner and Ariana Tobin ldquoFacebook En-

abled Advertisers to Reach lsquoJew Hatersrsquordquo ProPublica httpswww20propublica20orgarticlefacebook-enabled-advertisers-to-reach-jew-haters2017

Arrow Kenneth ldquoThe Theory of Discriminationrdquo Discrimination inLabor Markets 3 no 10 (1973) 3ndash33

Ayres Ian ldquoThree Tests for Measuring Unjustified Disparate Impactsin Organ Transplantation The Problem of Included VariableBiasrdquo Perspectives in Biology and Medicine 48 no 1 (2005) 68ndashS87

Ayres Ian Mahzarin Banaji and Christine Jolls ldquoRace Effects oneBayrdquo The RAND Journal of Economics 46 no 4 (2015) 891ndash917

Ayres Ian and Peter Siegelman ldquoRace and Gender Discrimination inBargaining for a New Carrdquo The American Economic Review 1995304ndash21

Bagwell Kyle ldquoThe Economic Analysis of Advertisingrdquo Handbook ofIndustrial Organization 3 (2007) 1701ndash844

38 solon barocas moritz hardt arvind narayanan

Bashir Muhammad Ahmad Sajjad Arshad William Robertsonand Christo Wilson ldquoTracing Information Flows Between AdExchanges Using Retargeted Adsrdquo In USENIX Security Symposium16 481ndash96 2016

Becker Gary S The Economics of Discrimination University of ChicagoPress 1957

Bennett James and Stan Lanning ldquoThe Netflix Prizerdquo In Proceedingsof KDD Cup and Workshop 200735 New York NY USA 2007

Bertrand Marianne and Esther Duflo ldquoField Experiments on Dis-criminationrdquo In Handbook of Economic Field Experiments 1309ndash93Elsevier 2017

Bertrand Marianne Esther Duflo and Sendhil Mullainathan ldquoHowMuch Should We Trust Differences-in-Differences Estimatesrdquo TheQuarterly Journal of Economics 119 no 1 (2004) 249ndash75

Bertrand Marianne and Sendhil Mullainathan ldquoAre Emily and GregMore Employable Than Lakisha and Jamal A Field Experimenton Labor Market Discriminationrdquo American Economic Review 94no 4 (2004) 991ndash1013

Bird Sarah Solon Barocas Kate Crawford Fernando Diaz andHanna Wallach ldquoExploring or Exploiting Social and EthicalImplications of Autonomous Experimentation in AIrdquo In Workshopon Fairness Accountability and Transparency in Machine Learning2016

Blank Rebecca M ldquoThe Effects of Double-Blind Versus Single-BlindReviewing Experimental Evidence from the American EconomicReviewrdquo The American Economic Review 1991 1041ndash67

Bogen Miranda and Aaron Rieke ldquoHelp wanted an examinationof hiring algorithms equity and biasrdquo Technical report Upturn2018

Buolamwini Joy and Timnit Gebru ldquoGender Shades IntersectionalAccuracy Disparities in Commercial Gender Classificationrdquo InConference on Fairness Accountability and Transparency 77ndash91 2018

Buranyi Stephen ldquoHow to Persuade a Robot That You Should Getthe Jobrdquo Guardian 2018

Chaney Allison JB Brandon M Stewart and Barbara E EngelhardtldquoHow Algorithmic Confounding in Recommendation SystemsIncreases Homogeneity and Decreases Utilityrdquo In Proceedings ofthe 12th ACM Conference on Recommender Systems 224ndash32 ACM2018

Chen Le Alan Mislove and Christo Wilson ldquoPeeking Beneath theHood of Uberrdquo In Proceedings of the 2015 Internet MeasurementConference 495ndash508 ACM 2015

Chouldechova Alexandra Diana Benavides-Prado Oleksandr Fialkoand Rhema Vaithianathan ldquoA Case Study of Algorithm-Assisted

fairness and machine learning - 2021-06-04 39

Decision Making in Child Maltreatment Hotline Screening Deci-sionsrdquo In Conference on Fairness Accountability and Transparency134ndash48 2018

Coltrane Scott and Melinda Messineo ldquoThe Perpetuation of SubtlePrejudice Race and Gender Imagery in 1990s Television Advertis-ingrdquo Sex Roles 42 no 5ndash6 (2000) 363ndash89

DrsquoOnfro Jillian ldquoGoogle Tests Changes to Its Search Algorithm HowSearch Worksrdquo httpswwwcnbccom20180917google-tests-changes-to-its-search-algorithm-how-search-works

html 2019Danziger Shai Jonathan Levav and Liora Avnaim-Pesso ldquoExtra-

neous Factors in Judicial Decisionsrdquo Proceedings of the NationalAcademy of Sciences 108 no 17 (2011) 6889ndash92

Dastin Jeffrey ldquoAmazon Scraps Secret AI Recruiting Tool ThatShowed Bias Against Womenrdquo Reuters 2018

Datta Amit Michael Carl Tschantz and Anupam Datta ldquoAutomatedExperiments on Ad Privacy Settingsrdquo Proceedings on PrivacyEnhancing Technologies 2015 no 1 (2015) 92ndash112

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

———. "The Politics of 'Platforms.'" New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


35 For example, laws restricting employers from asking about applicants' criminal history resulted in employers using race as a proxy for it. See (Agan and Starr, "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment," The Quarterly Journal of Economics 133, no. 1 (2017): 191–235).

36 Williams and Ceci, "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track," Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

37 Note that if this assumption is correct, then a preference for female candidates is both accuracy maximizing (as a predictor of career success) and required under some notions of fairness, such as counterfactual fairness.

they only consider causes that are relatively proximate to the decision point). We will discuss structural discrimination in Chapter 6.

Finally, the distinction is also not especially valuable from a normative perspective. Recall that our moral understanding of fairness emphasizes the effects on the decision subjects and does not put much weight on the mental state of the decision maker. It's also worth noting that this dichotomy is associated with the policy position that fairness interventions are unnecessary: firms that practice taste-based discrimination will go out of business; as for statistical discrimination, it is argued to be either justified or futile to proscribe because firms will find workarounds.35 Of course, that's not necessarily a reason to avoid discussing taste-based and statistical discrimination, as the policy position in no way follows from the technical definitions and models themselves; it's just a relevant caveat for the reader who might encounter these dubious arguments in other sources.

Although we de-emphasize this distinction, we consider it critical to study the sources and mechanisms of discrimination. This helps us design effective and well-targeted interventions. For example, several studies (including the car bargaining study) test whether the source of discrimination lies in the owner, employees, or customers.

An example of a study that can be difficult to interpret without understanding the mechanism is a 2015 resume-based audit study that revealed a 2:1 faculty preference for women for STEM tenure-track positions.36 Consider the range of possible explanations: animus against men, a desire to compensate for past disadvantage suffered by women in STEM fields, a preference for a more diverse faculty (assuming that the faculties in question are currently male dominated), a response to financial incentives for diversification frequently provided by universities to STEM departments, and an assumption by decision makers that, due to prior discrimination, a female candidate with an equivalent CV to a male candidate is of greater intrinsic ability.37

To summarize, rather than a one-size-fits-all approach to understanding mechanisms such as taste-based vs. statistical discrimination, more useful is a nuanced and domain-specific approach, where we formulate hypotheses in part by studying decision making processes and organizations, especially in a qualitative way. Let us now turn to those studies.

Studies of decision making processes and organizations

One way to study decision making processes is through surveys of decision makers or organizations. Sometimes such studies reveal


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.
39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers.38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed.39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms.40 These firms together constitute the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews in which she presented as a graduate student interested in internship opportunities. The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects would behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal of predicting job performance based on a standardized set of attributes, albeit noisy ones, that we described in Chapter 1. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment. The signals that firms do use as predictors of job performance, such as admission to elite universities (the pedigree


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the book's title), are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social class based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants (notably applicants from China) would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

an assignment of applicants to hospitals that optimizes mutual desirability.42
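For concreteness, here is a minimal sketch of applicant-proposing deferred acceptance (the Gale–Shapley procedure), which produces a stable assignment in the sense of the sidenote above. The names, preferences, and capacities are made up for illustration; the actual resident match handles couples and many other constraints that this toy version ignores.

```python
# Minimal sketch of applicant-proposing deferred acceptance (Gale-Shapley).
# Toy inputs only; not the real NRMP algorithm.

def deferred_acceptance(applicant_prefs, hospital_prefs, capacities):
    """applicant_prefs: applicant -> list of hospitals, most preferred first.
    hospital_prefs: hospital -> list of applicants, most preferred first.
    capacities: hospital -> number of slots.
    Returns applicant -> hospital (or None if unmatched)."""
    # Rank lookup so hospitals can compare applicants quickly.
    rank = {h: {a: i for i, a in enumerate(p)} for h, p in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}   # index of next hospital to propose to
    tentative = {h: [] for h in hospital_prefs}     # applicants tentatively held by each hospital
    free = list(applicant_prefs)                    # applicants not currently held

    while free:
        a = free.pop()
        prefs = applicant_prefs[a]
        if next_choice[a] >= len(prefs):
            continue                                # list exhausted: a stays unmatched
        h = prefs[next_choice[a]]
        next_choice[a] += 1
        if a not in rank[h]:
            free.append(a)                          # hospital did not rank this applicant
            continue
        tentative[h].append(a)
        # Keep only the hospital's top `capacity` applicants; bump the rest.
        tentative[h].sort(key=lambda x: rank[h][x])
        while len(tentative[h]) > capacities[h]:
            free.append(tentative[h].pop())

    assignment = {a: None for a in applicant_prefs}
    for h, held in tentative.items():
        for a in held:
            assignment[a] = h
    return assignment

# Toy example: two hospitals with one slot each, three applicants.
applicants = {"ann": ["mercy", "city"], "bob": ["city", "mercy"], "cal": ["city"]}
hospitals = {"mercy": ["bob", "ann"], "city": ["ann", "cal", "bob"]}
print(deferred_acceptance(applicants, hospitals, {"mercy": 1, "city": 1}))
# {'ann': 'city', 'bob': 'mercy', 'cal': None}
```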

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv Preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
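As a rough sketch of this kind of audit, one could run an off-the-shelf language identifier over two corpora of tweets known to be written in English and compare the rates at which they are mislabeled as non-English. The corpus file names below are hypothetical, and the use of langid.py's classify interface is an assumption for illustration, not a reproduction of the original study's code.

```python
# Sketch: compare false negative rates of a language identifier on two
# English-language corpora (e.g., AAE-aligned vs. White-aligned tweets).
import langid  # assumed interface: langid.classify(text) -> (language_code, score)

def non_english_rate(tweets):
    """Fraction of known-English tweets that the classifier labels as non-English."""
    misses = sum(1 for t in tweets if langid.classify(t)[0] != "en")
    return misses / len(tweets)

def load_tweets(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

aae = load_tweets("aae_tweets.txt")                        # hypothetical corpus file
white_aligned = load_tweets("white_aligned_tweets.txt")    # hypothetical corpus file

print("AAE false negative rate:", non_english_rate(aae))
print("White-aligned false negative rate:", non_english_rate(white_aligned))
```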

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural


54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.

55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability, and Transparency, 2018, 77–91.

prejudices, resulting in representational harm to a group of people.54

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

| Type of task | Examples | Sources of disparity | Harm |
|---|---|---|---|
| Perception | Language identification, speech-to-text | Underrepresentation in training corpus | Degraded service |
| Automating judgment | Toxicity detection, essay grading | Human labels, underrepresentation in training corpus | Adverse decisions |
| Predicting outcomes | Resume filtering | Various, including human labels | Adverse decisions |
| Sequence prediction | Language generation, translation | Cultural stereotypes, historical prejudices | Representational harm |

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6%


56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.

57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold without changing the training process. The second and deeper issue is that darker faces are misclassified more often than lighter faces.
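A minimal sketch of the recalibration idea, under the simplifying assumption that we have validation scores and binary labels from such a classifier: sweep the decision threshold and pick the one that balances the two directional error rates, with no retraining. The scores and labels below are synthetic placeholders.

```python
# Sketch: recalibrate a binary gender classifier's threshold so that
# female-to-male and male-to-female error rates are balanced.
import numpy as np

def balanced_threshold(scores, labels, grid=np.linspace(0.05, 0.95, 181)):
    """labels: 1 = male, 0 = female. Higher score = more 'male'.
    Returns the cutoff minimizing |P(predict male | female) - P(predict female | male)|."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_gap = 0.5, np.inf
    for t in grid:
        pred_male = scores >= t
        female_to_male = pred_male[labels == 0].mean()      # female faces read as male
        male_to_female = (~pred_male)[labels == 1].mean()   # male faces read as female
        gap = abs(female_to_male - male_to_female)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Toy validation data: at the default 0.5 cutoff, many female faces are read as male.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = np.clip(rng.normal(0.45 + 0.25 * labels, 0.15), 0, 1)
print("recalibrated cutoff:", balanced_threshold(scores, labels))
```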

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender


63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.

64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.

65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

classification tool in the first place. By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide


66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.
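One way to make this kind of disparity measurable, at least in a simulation, is to compare held-out prediction error across user groups. The sketch below uses a trivial item-mean predictor and synthetic ratings in which the minority group's tastes deviate from the item averages; the column names and numbers are illustrative assumptions, not the setup of the study cited above.

```python
# Sketch: per-group held-out error of a trivial recommender on synthetic ratings.
import numpy as np
import pandas as pd

def per_group_rmse(train, test):
    """Predict each rating by the item's training-set mean; report RMSE by user group."""
    item_means = train.groupby("item")["rating"].mean()
    preds = test["item"].map(item_means).fillna(train["rating"].mean())
    err = (test["rating"] - preds) ** 2
    return np.sqrt(err.groupby(test["group"]).mean())

# Toy data: minority users' ratings deviate more from the majority-driven item means.
rng = np.random.default_rng(1)
n = 5000
items = rng.integers(0, 50, n)
group = np.where(rng.random(n) < 0.2, "minority", "majority")
base = rng.normal(3.5, 0.3, 50)[items]   # per-item quality, reflecting majority tastes
noise = np.where(group == "minority", rng.normal(0, 1.2, n), rng.normal(0, 0.5, n))
df = pd.DataFrame({"item": items, "group": group, "rating": np.clip(base + noise, 1, 5)})

train, test = df.iloc[: n // 2], df.iloc[n // 2 :]
print(per_group_rmse(train, test))   # expect higher RMSE for the minority group
```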

In general, this type of unfairness is hard to study in real systems (not just by external researchers but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information


67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms,'" New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.
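The calibration-style comparison can be expressed in a few lines: for each occupation, compare the fraction of women among the top image results to the fraction of women in the occupation, and check whether the search skew is more extreme than the real-world skew. The numbers below are hypothetical placeholders, not the study's data.

```python
# Sketch: measure "stereotype exaggeration" by comparing gender proportions in
# image search results to (hypothetical) real-world occupational statistics.
import numpy as np

# occupation -> (fraction of women in top search results, fraction in the workforce)
data = {
    "bartender": (0.55, 0.60),
    "construction worker": (0.05, 0.10),
    "nurse": (0.95, 0.88),
    "software developer": (0.15, 0.20),
}

search = np.array([v[0] for v in data.values()])
workforce = np.array([v[1] for v in data.values()])

# Exaggeration: search skews pulled further away from 0.5 than the real statistics.
exaggeration = np.abs(search - 0.5) - np.abs(workforce - 0.5)
for occ, e in zip(data, exaggeration):
    print(f"{occ:>20s}: exaggeration = {e:+.2f}")
print("mean exaggeration:", round(float(exaggeration.mean()), 3))
```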

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues


71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.

72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).

74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads that we consider to fall under the persuasion framework.73

However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this


75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.

76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).

78 Ali et al., "Discrimination Through Optimization."

79 Ali et al.

does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75

Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
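A toy model of this market effect, assuming made-up prices and inventories and a cheapest-impressions-first delivery rule (a crude stand-in for click-per-dollar optimization), shows how the audience composition can shift with the budget even though the advertiser never targets by gender.

```python
# Toy model: fixed budget, two audience groups with different impression prices,
# delivery buys the cheapest available impressions first. All numbers are made up.
def audience_composition(budget, price_women=0.012, price_men=0.008,
                         inventory_women=50_000, inventory_men=50_000):
    """Return the fraction of women in the delivered audience for a given budget."""
    men = min(inventory_men, int(budget / price_men))        # cheaper impressions first
    remaining = budget - men * price_men
    women = min(inventory_women, int(remaining / price_women))
    total = men + women
    return women / total if total else 0.0

for budget in [100, 400, 800, 1200]:
    print(f"budget ${budget:>5}: share of women = {audience_composition(budget):.2f}")
# Small budgets deliver almost entirely to the cheaper group; the share of the
# more expensive group grows only once the cheaper inventory is exhausted.
```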

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).

82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced, entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
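The headline comparison in such an audit reduces to a difference in acceptance rates between the two sets of accounts, which can be checked with a standard two-proportion z-test. The counts in the sketch below are hypothetical placeholders, not the study's raw data.

```python
# Sketch: two-proportion z-test for acceptance rates by inferred race of guest name.
import math

def two_proportion_ztest(accept_a, n_a, accept_b, n_b):
    """Return the two acceptance rates and the z statistic for their difference."""
    p_a, p_b = accept_a / n_a, accept_b / n_b
    pooled = (accept_a + accept_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return p_a, p_b, (p_a - p_b) / se

# Hypothetical counts: inquiries from White-sounding vs. African-American-sounding accounts.
p_white, p_black, z = two_proportion_ztest(accept_a=1600, n_a=3200, accept_b=1344, n_b=3200)
print(f"acceptance: {p_white:.2f} vs {p_black:.2f}, z = {z:.1f}")
```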

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design


84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).

86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. LJ 32 (2017): 1183.

87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

of the study, but allowed an interesting validity check. When the analysis was restricted to the 29% of hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design, rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street,85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface). 86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.) 87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level the different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question. 88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to the field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities. 89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis). 90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission 91 and prior research. 92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study. 93 Simplified somewhat for our purposes, it goes as follows.


94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

When an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed. 94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral. 95

A final example to reinforce the fact that disparity-producing mechanisms can be subtle and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier. 96 However, the actual pricing rule that explained most of the variation, as they reported, was that if a competitor's physical store was located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than by a uniform procedure.


97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess whether a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or on body language and other characteristics in videos. 97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
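As a concrete reference point, here is a minimal sketch of how these observational metrics can be computed from a table of decisions; the column names (group, score, decision, outcome) are illustrative placeholders rather than the schema of any particular study.

import pandas as pd

def observational_metrics(df, group_col="group", score_col="score",
                          decision_col="decision", outcome_col="outcome"):
    """Per-group selection rate (demographic parity), error rates, and
    calibration (observed outcome rate within each score bin)."""
    rows = []
    for g, d in df.groupby(group_col):
        pos = d[d[outcome_col] == 1]
        neg = d[d[outcome_col] == 0]
        rows.append({
            group_col: g,
            "selection_rate": d[decision_col].mean(),
            "fnr": 1 - pos[decision_col].mean() if len(pos) else float("nan"),
            "fpr": neg[decision_col].mean() if len(neg) else float("nan"),
        })
    summary = pd.DataFrame(rows)
    # Calibration: within each score bin, the outcome rate should be
    # similar across groups.
    df = df.copy()
    df["score_bin"] = pd.cut(df[score_col], bins=10)
    calibration = (df.groupby(["score_bin", group_col], observed=True)[outcome_col]
                     .mean().unstack())
    return summary, calibration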

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test. 98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result — e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2.


99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.

103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).

The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups. 99

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study. 100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on the observable behavior of the system is merely a proxy for it.
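A minimal sketch of steps 4 and 5, under the assumption that the recorded page contents have already been converted into feature vectors; the classifier choice and the permutation count are arbitrary, and a real study would add the randomization and controls mentioned above.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def adversarial_test(X, y, n_permutations=1000, seed=0):
    """X: feature matrix built from recorded page contents (e.g., bag of words);
    y: 0/1 array indicating which bot group saw each page.
    Returns the cross-validated accuracy and a permutation p-value."""
    rng = np.random.default_rng(seed)
    clf = LogisticRegression(max_iter=1000)
    observed = cross_val_score(clf, X, y, cv=5).mean()
    # Permutation test: how often does a classifier trained on shuffled
    # group labels do at least as well as on the real labels?
    null_accuracies = []
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)
        null_accuracies.append(cross_val_score(clf, X, shuffled, cv=5).mean())
    p_value = (1 + sum(a >= observed for a in null_accuracies)) / (1 + n_permutations)
    return observed, p_value

An accuracy close to 1/2 together with a large p-value is consistent with no detectable information flow, though, as noted above, it does not prove that the constraint is satisfied.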

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting. 101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting — in which actions taken on one website, such as searching for a product, result in ads for that product on another website — to infer the exchange of user data between advertising companies. 102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States. 103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
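The following is a hypothetical sketch of that cookie-sweeping technique; the cookie name, product URL, and price-parsing regex are placeholders, not the actual details of the site or of the Journal's measurement code.

import re
import requests

def extract_price(html):
    # Placeholder parser: grab the first dollar amount on the page.
    match = re.search(r"\$(\d+\.\d{2})", html)
    return float(match.group(1)) if match else None

def price_for_zip(session, product_url, zip_code):
    session.cookies.set("zipcode", zip_code)   # override the site's inferred location
    response = session.get(product_url, timeout=10)
    return extract_price(response.text)

def sweep_zip_codes(product_url, zip_codes):
    session = requests.Session()
    return {z: price_for_zip(session, product_url, z) for z in zip_codes}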

That said, practical obstacles commonly arise in the fake-profile approach.


104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.

105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it. 104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome, where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit. 105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry.


106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.

107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release. 106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing. 107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems. 108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers. 109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team. 110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task — a task that is further complicated by the limitations of the available data. The paper documents the substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems.


Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37. 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): S68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96. 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, vol. 2007, 35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91. 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48. 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
———. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking About Detecting Racial Discrimination." Nw. U.L. Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. L.J. 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper—and Improper—Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33. 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." Jama 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78. 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me—Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59. 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30. 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


38 Neckerman and Kirschenman, "Hiring Strategies, Racial Bias, and Inner-City Workers," Social Problems 38, no. 4 (1991): 433–47.
39 Pager and Shepherd, "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets," Annu. Rev. Sociol. 34 (2008): 181–209.

40 Rivera, Pedigree: How Elite Students Get Elite Jobs (Princeton University Press, 2016).

blatant discrimination, such as strong racial preferences by employers. 38 Over the decades, however, such overt attitudes have become less common, or at least less likely to be expressed. 39 Discrimination tends to operate in more subtle, indirect, and covert ways.

Ethnographic studies excel at helping us understand covert discrimination. Ethnography is one of the main research methods in the social sciences, and is based on the idea of the researcher being embedded among the research subjects for an extended period of time as they go about their daily activities. It is a set of qualitative methods that are complementary to and symbiotic with quantitative ones. Ethnography allows us to ask questions that are deeper than quantitative methods permit, and to produce richly detailed accounts of culture. It also helps formulate hypotheses that can be tested quantitatively.

A good illustration is the book Pedigree, which examines hiring practices in a set of elite consulting, banking, and law firms. 40 These firms together account for the majority of the highest-paying and most desirable entry-level jobs for college graduates. The author used two standard ethnographic research methods. The first is a set of 120 interviews in which she presented as a graduate student interested in internship opportunities.

The second method is called participant observation: she worked in an unpaid Human Resources position at one of the firms for 9 months, after obtaining consent to use her observations for research. There are several benefits to the researcher becoming a participant in the culture: it provides a greater level of access, allows the researcher to ask more nuanced questions, and makes it more likely that the research subjects will behave as they would when not being observed.

Several insights from the book are relevant to us. First, the hiring process has about nine stages, including outreach, recruitment events, screening, multiple rounds of interviews and deliberations, and "sell" events. This highlights why any quantitative study that focuses on a single slice of the process (say, evaluation of resumes) is limited in scope. Second, the process bears little resemblance to the ideal, described in Chapter 1, of predicting job performance based on a standardized set of attributes, albeit noisy ones. Interviewers pay a surprising amount of attention to attributes that should be irrelevant or minimally relevant, such as leisure activities, but which instead serve as markers of class. Applicants from privileged backgrounds are more likely to be viewed favorably, both because they are able to spare more time for such activities and because they have the insider knowledge that these seemingly irrelevant attributes matter in recruitment.


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

The signals that firms do use as predictors of job performance, such as admission to elite universities — the pedigree in the book's title — are also highly correlated with socioeconomic status. The author argues that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market. 41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants — notably applicants from China — would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit, substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants.


42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," Jama 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability. 42
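For intuition about how such a matching is computed, here is a simplified sketch of applicant-proposing deferred acceptance, the style of algorithm underlying the modern match; it assumes one slot per hospital and, importantly for the discussion below, has no way to express the joint preferences of couples.

def deferred_acceptance(applicant_prefs, hospital_prefs):
    """Applicant-proposing deferred acceptance with one slot per hospital.
    applicant_prefs: dict mapping applicant -> list of hospitals, best first.
    hospital_prefs: dict mapping hospital -> list of applicants, best first.
    Returns a stable matching as a dict hospital -> applicant."""
    rank = {h: {a: i for i, a in enumerate(prefs)} for h, prefs in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}   # index of the next hospital to propose to
    unmatched = list(applicant_prefs)
    matched = {}                                    # hospital -> tentatively held applicant
    while unmatched:
        a = unmatched.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                                # a has exhausted their list
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if a not in rank[h]:
            unmatched.append(a)                     # hospital did not rank a; propose elsewhere
        elif h not in matched:
            matched[h] = a                          # tentatively accept
        elif rank[h][a] < rank[h][matched[h]]:
            unmatched.append(matched[h])            # bump the less-preferred tentative match
            matched[h] = a
        else:
            unmatched.append(a)                     # rejected; a proposes again later
    return matched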

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital. 43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.

There was a crude attempt in the residency matching system to capture joint preferences, involving designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints, and then match the other member to a proximate hospital if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research. 44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms.


45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

It is considered a more-or-less solved problem, with relatively simple models based on character n-grams achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all). 45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
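As a rough sketch of how such an error rate disparity can be measured, assuming one already has tweet corpora labeled by dialect (the hard part, as discussed above): langid.py exposes a classify function that returns a (language, score) pair, so the comparison reduces to a difference in non-English classification rates.

import langid  # pip install langid

def non_english_rate(texts):
    """Fraction of texts that the off-the-shelf model labels as non-English."""
    labels = [langid.classify(t)[0] for t in texts]
    return sum(lang != "en" for lang in labels) / len(labels)

# aae_tweets and white_aligned_tweets are assumed to be pre-labeled corpora.
# disparity = non_english_rate(aae_tweets) - non_english_rate(white_aligned_tweets)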

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54

54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv Preprint arXiv:1908.09203, 2019.
55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

The table below summarizes this discussion.

There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task        | Examples                          | Sources of disparity                        | Harm
Perception          | Language id., speech-to-text      | Underrepresentation in training corpus      | Degraded service
Automating judgment | Toxicity detection, essay grading | Human labels, underrep. in training corpus  | Adverse decisions
Predicting outcomes | Resume filtering                  | Various, including human labels             | Adverse decisions
Sequence prediction | Language generation, translation  | Cultural stereotypes, historical prejudices | Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s, due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++, respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6% difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.
57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv Preprint arXiv:1902.11097, 2019.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues: first, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
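As an illustration of the first, easier fix, here is a minimal sketch (not any vendor's actual procedure) of recalibrating the decision threshold so that the two directions of misclassification are balanced. It assumes we have a hypothetical classifier's scores for the "male" class together with reference labels; note that this does nothing about the second, deeper issue of disparity across skin tones.

```python
# Pick the threshold on the "male" score that best balances the rate at which
# female faces are classified as male with the rate at which male faces are
# classified as female. Scores and labels are hypothetical inputs.
import numpy as np

def balanced_threshold(scores, labels, grid=np.linspace(0, 1, 101)):
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_gap = 0.5, np.inf
    for t in grid:
        female_to_male = np.mean(scores[labels == "female"] >= t)
        male_to_female = np.mean(scores[labels == "male"] < t)
        gap = abs(female_to_male - male_to_female)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```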

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.
64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv Preprint arXiv:1906.09208, 2019.
65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over others. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.

In general, this type of unfairness is hard to study in real systems (not just by external researchers but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target, and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm, rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate. Reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms, as well as human moderators, suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics are exaggerated in image search results. However, the deviations were minor.
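As a rough illustration, here is one way such a calibration check could be operationalized, with entirely made-up numbers: if image search results exaggerate occupational stereotypes, then deviations from gender parity in the search results should be systematically larger than the corresponding deviations in labor statistics, i.e., the slope below should exceed one.

```python
# Hypothetical (occupation, fraction of women in search results,
# fraction of women in occupational statistics) triples.
import numpy as np

data = [("author", 0.55, 0.56),
        ("bartender", 0.65, 0.60),
        ("construction worker", 0.04, 0.03)]

search = np.array([row[1] for row in data])
real = np.array([row[2] for row in data])

# Regress deviations from parity in search results on deviations in reality;
# a slope greater than 1 indicates stereotype exaggeration.
slope, intercept = np.polyfit(real - 0.5, search - 0.5, 1)
print(f"slope = {slope:.2f}")
```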

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways: for example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it, and then takes another action such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.
72 The economic analysis of advertising includes a third category, complementary, that's related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).
74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly, one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads that we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.
76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).
78 Ali et al., "Discrimination Through Optimization."
79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
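A toy simulation (with made-up prices and inventory, not data from any platform) illustrates this budget effect: if delivery buys the cheapest available impressions first, a small budget reaches mostly the cheaper-to-advertise-to group, while a larger budget comes closer to parity.

```python
# Greedy, cheapest-first ad delivery under a fixed budget (a simplifying
# assumption about how the platform spends the budget).
def delivered_audience(budget, inventory):
    shown = []
    for group, price in sorted(inventory, key=lambda item: item[1]):
        if budget < price:
            break
        budget -= price
        shown.append(group)
    return {g: shown.count(g) / len(shown) for g in set(shown)}

# Hypothetical inventory: impressions shown to men cost less than those shown
# to women (e.g., because women click more, raising their price in the auction).
inventory = [("men", 0.40)] * 1000 + [("women", 0.60)] * 1000

print(delivered_audience(100, inventory))   # small budget: all impressions go to men
print(delivered_audience(900, inventory))   # larger budget: closer to parity
```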

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interact with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially to analyze Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).
82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis, because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to traditional settings such as housing and employment: a combination of audit studies and observational methods have been used. A notable example is a field experiment targeting Airbnb.83

The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
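The headline comparison in such an audit can be reproduced with a simple two-proportion test; the sketch below uses hypothetical counts chosen to match the reported 50% and 42% acceptance rates (the study itself also examined persistence across host and listing characteristics, as noted above).

```python
# Two-proportion z-test on acceptance of inquiries, by the racial signal of
# the guest's name. Counts are hypothetical, roughly 3,200 inquiries per group.
from statsmodels.stats.proportion import proportions_ztest

accepted = [1600, 1344]    # White-sounding names, African-American-sounding names
inquiries = [3200, 3200]

z_stat, p_value = proportions_ztest(accepted, inquiries)
print(f"acceptance rates: {accepted[0]/inquiries[0]:.2f} vs {accepted[1]/inquiries[1]:.2f}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
```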

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check. When the analysis was restricted to the 29 hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. L.J. 32 (2017): 1183.
87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me – Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier. As well, the centralized nature of these platforms is a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.
89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).
90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender, in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus, searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017 US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were on average wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim, rather than a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks, and are found to perform similarly to human raters on samples of actual essays, they are able to be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess if a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks are not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for testing violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows (a code sketch appears after the steps):

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result (e.g., which websites or third parties might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content).

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus, the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012).
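Here is a minimal sketch of steps 4 and 5 together with the permutation test, assuming the page contents recorded in step 3 have been saved to two text files, one page per line; the file names and the choice of classifier are placeholders.

```python
# Can a classifier tell apart pages served to group A and group B?
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

pages_a = open("group_a_pages.txt").read().splitlines()  # hypothetical file
pages_b = open("group_b_pages.txt").read().splitlines()  # hypothetical file
texts = pages_a + pages_b
labels = np.array([1] * len(pages_a) + [0] * len(pages_b))

clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
observed = cross_val_score(clf, texts, labels, cv=5).mean()

# Permutation test: how often does the same pipeline do at least as well when
# the group labels are shuffled and therefore carry no information?
rng = np.random.default_rng(0)
null_accuracies = [cross_val_score(clf, texts, rng.permutation(labels), cv=5).mean()
                   for _ in range(200)]
p_value = np.mean([acc >= observed for acc in null_accuracies])
print(f"observed accuracy: {observed:.3f}, permutation p-value: {p_value:.3f}")
```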

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on observable behavior of the system is merely a proxy for it.

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
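A minimal sketch of this kind of measurement is below. The URL, the cookie name, and the price-parsing logic are hypothetical placeholders standing in for the details that the journalists reverse-engineered for the actual site.

```python
# Query the same product page while programmatically varying the location
# cookie, and record the displayed price for each ZIP code.
import re
import requests

def extract_price(html):
    # Naive parse; a real audit would need page-specific extraction logic.
    match = re.search(r"\$(\d+\.\d{2})", html)
    return float(match.group(1)) if match else None

def price_for_zip(zip_code):
    cookies = {"inferred_location": zip_code}  # hypothetical cookie name
    response = requests.get("https://retailer.example/product/12345",
                            cookies=cookies, timeout=10)
    return extract_price(response.text)

prices = {z: price_for_zip(z) for z in ["10001", "60605", "94103"]}
print(prices)
```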

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical, but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy rather than the fairness of algorithmic systems.

106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, vol. 2007 (New York, NY, USA, 2007), 35.
107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous, high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, one that is further complicated by the limitations of the data available. The paper documents how there is substantial latitude in problem formulation, and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, and another would alleviate dealers' existing biases, both potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such a question by looking at individual systems. The next chapter is all about broadening our view of discrimination, and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.

Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.

Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37, 2018.

Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.

Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.

Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.

Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): 68–S87.

Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.

Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.

Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.


Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96, 2016.

Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.

Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, 2007, 35. New York, NY, USA, 2007.

Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.

Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.

Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.

Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.

Bogen, Miranda, and Aaron Rieke. "Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias." Technical report, Upturn, 2018.

Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91, 2018.

Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.

Chaney, Allison J.B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.

Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.

Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48, 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.

D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.

Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.

Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.

Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.

De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.

Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv preprint arXiv:1706.09847, 2017.

Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.

Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.

Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.

Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.

Gillespie, Tarleton. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.

Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).

Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.

Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.

Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets." 2019. https://megapixels.cc.

Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).

Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.

Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.

Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.

Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.

Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv preprint arXiv:1805.04508, 2018.

Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.

Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.

Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.


Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.

Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.

Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.

Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.

Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.

Martineau, Paris. "Cities Examine Proper - and Improper - Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.

McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.

Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33, 2017.

Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv preprint arXiv:1812.00099, 2018.

Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.

Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.

Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.

O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.

Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.

Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.

Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.

Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.

Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.

Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes." 2005.

Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.

Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.

Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv preprint arXiv:1906.09208, 2019.

Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.

Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.

Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

Roth, Alvin E. "The Origins, History, and Design of the Resident Match." JAMA 289, no. 7 (2003): 909–12.


Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.

Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.

Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78, 2019.

Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.

Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.

Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13 (2018).

Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online, 2012.

Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv preprint arXiv:1908.09203, 2019.

Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.

Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.

Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.

Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me - Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.


Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.

Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.

Valentino-DeVries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.

Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.

Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59, 2019.

Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.

Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey." 1979.

Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.

Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv preprint arXiv:1902.11097, 2019.

Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.

Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30, 2017.

  • Part 1: Traditional tests for discrimination
  • Audit studies
  • Testing the impact of blinding
  • Revealing extraneous factors in decisions
  • Testing the impact of decisions and interventions
  • Purely observational tests
  • Summary of traditional tests and methods
  • Taste-based and statistical discrimination
  • Studies of decision making processes and organizations
  • Part 2: Testing discrimination in algorithmic systems
  • Fairness considerations in applications of natural language processing
  • Demographic disparities and questionable applications of computer vision
  • Search and recommendation systems: three types of harms
  • Understanding unfairness in ad targeting
  • Fairness considerations in the design of online marketplaces
  • Mechanisms of discrimination
  • Fairness criteria in algorithmic audits
  • Information flow, fairness, privacy
  • Comparison of research methods
  • Looking ahead
  • References


41 Posselt, Inside Graduate Admissions (Harvard University Press, 2016).

in the book's title – are also highly correlated with socioeconomic status. The authors argue that these hiring practices help explain why elite status is perpetuated in society along hereditary lines. In our view, the careful use of statistical methods in hiring, despite their limits, may mitigate the strong social-class-based preferences exposed in the book.

Another book, Inside Graduate Admissions, focuses on education rather than the labor market.41 It resulted from the author's observations of decision making by graduate admissions committees in nine academic disciplines over two years. A striking theme that pervades this book is the tension between formalized and holistic decision making. For instance, committees arguably over-rely on GRE scores despite stating that they consider their predictive power to be limited. As it turns out, one reason for the preference for GRE scores and other quantitative criteria is that they avoid the difficulties of subjective interpretation associated with signals such as reference letters. This is considered valuable because it minimizes tensions between faculty members in the admissions process. On the other hand, decision makers are implicitly aware (and occasionally explicitly articulate) that if admissions criteria are too formal, then some groups of applicants, notably applicants from China, would be successful at a far greater rate, and this is considered undesirable. This motivates a more holistic set of criteria, which often include idiosyncratic factors such as an applicant's hobby being considered "cool" by a faculty member. The author argues that admissions committees use a facially neutral set of criteria, characterized by an almost complete absence of explicit substantive discussion of applicants' race, gender, or socioeconomic status, but which nonetheless perpetuates inequities. For example, there is a reluctance to take on students from underrepresented backgrounds whose profiles suggest that they would benefit from more intensive mentoring.

This concludes the first part of the chapter. Now let us turn to algorithmic systems. The background we've built up so far will prove useful. In fact, the traditional tests of discrimination are just as applicable to algorithmic systems. But we will also encounter many novel issues.

Part 2: Testing discrimination in algorithmic systems

An early example of discrimination in an algorithmic system is from the 1950s. In the United States, applicants for medical residency programs provide a ranked list of their preferred hospital programs to a centralized system, and hospitals likewise rank applicants. A matching algorithm takes these preferences as input and produces an assignment of applicants to hospitals that optimizes mutual desirability.42

42 Specifically, it satisfies the requirement that if applicant A is not matched to hospital H, then either A matched to a hospital that he ranked higher than H, or H matched to a set of applicants all of whom it ranked higher than A.

43 Roth, "The Origins, History, and Design of the Resident Match," JAMA 289, no. 7 (2003): 909–12; Friedman and Nissenbaum, "Bias in Computer Systems," ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.

44 A 2014 paper issued a call to action towards this type of research. Most of the studies that we cite postdate that piece (Sandvig et al., "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms," ICA Pre-Conference on Data and Discrimination, 2014).

Early versions of the system discriminated against couples who wished to stay geographically close, because couples could not accurately express their joint preferences: for example, each partner might prefer a hospital over all others, but only if the other partner also matched to the same hospital.43 This is a non-comparative notion of discrimination: the system does injustice to an applicant (or a couple) when it does not allow them to express their preferences, regardless of how other applicants are treated. Note that none of the tests for fairness that we have discussed are capable of detecting this instance of discrimination, as it arises because of dependencies between pairs of units, which is not something we have modeled.
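
To make the setup concrete, here is a minimal sketch of applicant-proposing deferred acceptance with one seat per hospital (a simplification; the historical match differed in details such as which side proposes and multi-seat programs). The point relevant to the discussion above is visible in the data structures: preferences are expressed per individual, so a couple's joint preference ("this hospital only if my partner matches nearby") simply cannot be represented.

```python
def deferred_acceptance(applicant_prefs, hospital_prefs):
    """Applicant-proposing deferred acceptance; one seat per hospital.

    applicant_prefs: dict applicant -> list of hospitals, most preferred first
    hospital_prefs:  dict hospital  -> list of applicants, most preferred first
    """
    rank = {h: {a: i for i, a in enumerate(prefs)} for h, prefs in hospital_prefs.items()}
    next_choice = {a: 0 for a in applicant_prefs}   # next hospital each applicant will try
    match = {}                                      # hospital -> tentatively held applicant
    free = list(applicant_prefs)

    while free:
        a = free.pop()
        if next_choice[a] >= len(applicant_prefs[a]):
            continue                                # applicant exhausted their list
        h = applicant_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if h not in match:
            match[h] = a
        elif rank[h][a] < rank[h][match[h]]:        # hospital prefers the new proposer
            free.append(match[h])
            match[h] = a
        else:
            free.append(a)
    return match

# Toy instance: each applicant ranks hospitals independently -- there is no way
# to express a joint preference with a partner.
applicants = {"ana": ["sf", "ny"], "ben": ["ny", "sf"], "eve": ["sf", "ny"]}
hospitals = {"sf": ["ben", "ana", "eve"], "ny": ["ana", "eve", "ben"]}
print(deferred_acceptance(applicants, hospitals))   # {'sf': 'ben', 'ny': 'ana'}
```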

There was a crude attempt in the residency matching system to capture joint preferences. It involved designating one partner in each couple as the "leading member": the algorithm would match the leading member without constraints and then match the other member to a proximate hospital, if possible. Given the prevailing gender norms at that time, it is likely that this method had a further discriminatory impact on women in heterosexual couples.

Despite these early examples, it is only in the 2010s that testing unfairness in real-world algorithmic systems became a pressing concern and a distinct area of research.44 This work has much in common with the social science research that we reviewed, but the targets of research have expanded considerably. In the rest of this chapter, we will review and attempt to systematize the research methods in several areas of algorithmic decision making: various applications of natural-language processing and computer vision, ad targeting platforms, search and information retrieval tools, and online markets (ride hailing, vacation rentals, etc.). Much of this research has focused on drawing attention to the discriminatory effects of specific widely-used tools and platforms at specific points in time. While that is a valuable goal, we will aim to highlight broader, generalizable themes in our review. We will close the chapter by identifying common principles and methods behind this body of research.

Fairness considerations in applications of natural language processing

One of the most central tasks in NLP is language identification: determining the language that a given text is written in. It is a precursor to virtually any other NLP operation on the text, such as translation to the user's preferred language on social media platforms. It is considered a more-or-less solved problem, with relatively simple models based on n-grams of characters achieving high accuracies on standard benchmarks, even for short texts that are a few words long.

45 For a treatise on AAE, see (Green, African American English: A Linguistic Introduction (Cambridge University Press, 2002)). The linguistic study of AAE highlights the complexity and internal consistency of its grammar, vocabulary, and other distinctive features, and refutes the basis of prejudiced views of AAE as inferior to standard English.
46 Dastin, "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women," Reuters, 2018.
47 Buranyi, "How to Persuade a Robot That You Should Get the Job" (Guardian, 2018).
48 De-Arteaga et al., "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 120–28.
49 Ramineni and Williamson, "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test," ETS Research Report Series 2018, no. 1 (2018): 1–31.
50 Amorim, Cançado, and Veloso, "Automated Essay Scoring in the Presence of Biased Ratings," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, 229–37.
51 Sap et al., "The Risk of Racial Bias in Hate Speech Detection," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, 1668–78.
52 Kiritchenko and Mohammad, "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems," arXiv preprint arXiv:1805.04508, 2018.
53 Tatman, "Gender and Dialect Bias in YouTube's Automatic Captions," in Proceedings of the First ACL Workshop on Ethics in Natural Language Processing (Valencia, Spain: Association for Computational Linguistics, 2017), 53–59, https://doi.org/10.18653/v1/W17-1606.

However, a 2016 study showed that a widely used tool, langid.py, which incorporates a pre-trained model, had substantially more false negatives for tweets written in African-American English (AAE) compared to those written in more common dialectal forms: 13.2% of AAE tweets were classified as non-English, compared to 7.6% of "White-aligned" English tweets. AAE is a set of English dialects commonly spoken by Black people in the United States (of course, there is no implication that all Black people in the United States primarily speak AAE, or even speak it at all).45 The authors' construction of the AAE and White-aligned corpora themselves involved machine learning, as well as validation based on linguistic expertise; we will defer a full discussion to the Measurement chapter. The observed error rate disparity is likely a classic case of underrepresentation in the training data.
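
A disparity of this kind can be measured with a few lines of code once the two corpora are in hand. The sketch below uses the real langid.py package; the short tweet lists are hypothetical stand-ins for the carefully constructed corpora described in the text.

```python
# pip install langid
import langid


def english_false_negative_rate(tweets):
    """Fraction of known-English tweets that langid.py labels as non-English."""
    misses = sum(1 for t in tweets if langid.classify(t)[0] != "en")
    return misses / len(tweets)


# Hypothetical placeholder examples; in the study, each corpus contained
# many thousands of tweets validated as English.
aae_tweets = ["I'm finna head out", "that movie was wild fr"]
white_aligned_tweets = ["heading to the gym now", "great game tonight"]

print("AAE false negative rate:          ", english_false_negative_rate(aae_tweets))
print("White-aligned false negative rate:", english_false_negative_rate(white_aligned_tweets))
```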

Unlike the audit studies of car sales or labor markets discussed earlier, here it is not necessary (or justifiable) to control for any features of the texts, such as the level of formality. While it may certainly be possible to "explain" disparate error rates based on such features, that is irrelevant to the questions of interest in this context, such as whether NLP tools will perform less well for one group of users compared to another.

NLP tools range in their application from aids to online interaction to components of decisions with major career consequences. In particular, NLP is used in predictive tools for screening of resumes in the hiring process. There is some evidence of potential discriminatory impacts of such tools, both from employers themselves46 and from applicants,47 but it is limited to anecdotes. There is also evidence from lab experiments on the task of predicting occupation from online biographies.48

We briefly survey other findings. Automated essay grading software tends to assign systematically lower scores to some demographic groups49 compared to human graders, who may themselves provide biased ratings.50 Hate speech detection models use markers of dialect as predictors of toxicity, according to a lab study,51 resulting in discrimination against minority speakers. Many sentiment analysis tools assign systematically different scores to text based on race-aligned or gender-aligned names of people mentioned in the text.52 Speech-to-text systems perform worse for speakers with certain accents.53 In all these cases, the author or speaker of the text is potentially harmed. In other NLP systems, i.e., those involving natural language generation or translation, there is a different type of fairness concern, namely the generation of text reflecting cultural prejudices, resulting in representational harm to a group of people.54

54 Solaiman et al., "Release Strategies and the Social Impacts of Language Models," arXiv preprint arXiv:1908.09203, 2019.
55 Buolamwini and Gebru, "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification," in Conference on Fairness, Accountability and Transparency, 2018, 77–91.

The table below summarizes this discussion. There is a line of research on cultural stereotypes reflected in word embeddings. Word embeddings are representations of linguistic units; they do not correspond to any linguistic or decision-making task. As such, lacking any notion of ground truth or harms to people, it is not meaningful to ask fairness questions about word embeddings without reference to specific downstream tasks in which they might be used. More generally, it is meaningless to ascribe fairness as an attribute of models, as opposed to actions, outputs, or decision processes.

Table 2: Four types of NLP tasks and the types of unfairness that can result. Note that the traditional tests discussed in Part 1 operate in the context of predicting outcomes (row 3 in this table).

Type of task          Examples                            Sources of disparity                           Harm
Perception            Language id, speech-to-text         Underrepresentation in training corpus         Degraded service
Automating judgment   Toxicity detection, essay grading   Human labels; underrep. in training corpus     Adverse decisions
Predicting outcomes   Resume filtering                    Various, including human labels                Adverse decisions
Sequence prediction   Language generation, translation    Cultural stereotypes, historical prejudices    Representational harm

Demographic disparities and questionable applications of computer vision

Like NLP, computer vision technology has made major headway in the 2010s due to the availability of large-scale training corpora and improvements in hardware for training neural networks. Today, many types of classifiers are used in commercial products to analyze images and videos of people. Unsurprisingly, they often exhibit disparities in performance based on gender, race, skin tone, and other attributes, as well as deeper ethical problems.

A prominent demonstration of error rate disparity comes from an analysis of three commercial tools designed to classify a person's gender as female or male based on an image, developed by Microsoft, IBM, and Face++ respectively.55 The study found that all three classifiers perform better on male faces than female faces (8.1%–20.6% difference in error rate). Further, all perform better on lighter faces than darker faces (11.8%–19.2% difference in error rate), and worst on darker female faces (20.8%–34.7% error rate). Finally, since all classifiers treat gender as binary, the error rate for people of nonbinary gender can be considered to be 100%.

56 Vries et al., "Does Object Recognition Work for Everyone?" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, 52–59.
57 Shankar et al., "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World," in NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
58 Simonite, "When It Comes to Gorillas, Google Photos Remains Blind," Wired, January 13 (2018); Hern, "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos," The Guardian 20 (2015).
59 Martineau, "Cities Examine Proper - and Improper - Uses of Facial Recognition | WIRED" (https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019).
60 O'Toole et al., "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning," Connection Science 3, no. 2 (1991): 163–78.
61 McEntegart, "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide" (https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010).
62 Wilson, Hoffman, and Morgenstern, "Predictive Inequity in Object Detection," arXiv preprint arXiv:1902.11097, 2019.

If we treat the classifier's target variable as gender and the sensitive attribute as skin tone, we can decompose the observed disparities into two separate issues. First, female faces are classified as male more often than male faces are classified as female. This can be addressed relatively easily by recalibrating the classification threshold, without changing the training process. The second, and deeper, issue is that darker faces are misclassified more often than lighter faces.
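
The following sketch illustrates this decomposition on synthetic scores (not the commercial classifiers): moving the decision threshold rebalances female-to-male versus male-to-female errors, but the gap between the darker and lighter groups persists because it lives in the scores themselves.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical scores from a binary gender classifier: p_male is the predicted
# probability that a face is male. We assume scores are noisier (less separated)
# for the darker-skin group, mimicking the kind of disparity described above.
n = 4000
df = pd.DataFrame({
    "male": rng.integers(0, 2, n),
    "darker": rng.integers(0, 2, n),
})
noise = np.where(df["darker"] == 1, 0.35, 0.15)
df["p_male"] = np.clip(df["male"] + rng.normal(0, noise, n), 0, 1)


def error_rates(threshold):
    pred = (df["p_male"] >= threshold).astype(int)
    # Misclassification rate disaggregated by skin-tone group and gender.
    return (pred != df["male"]).groupby([df["darker"], df["male"]]).mean()


# Raising the threshold shifts errors between the two gender classes, but the
# darker group's error rates stay higher at every threshold.
print("threshold 0.5:\n", error_rates(0.5), "\n")
print("threshold 0.6:\n", error_rates(0.6))
```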

Image classification tools have found it particularly challenging to achieve geographic equity, due to the skew in training datasets. A 2019 study evaluated five popular object recognition services on images of household objects from 54 countries.56 It found significant accuracy disparities between countries, with images from lower-income countries being less accurately classified. The authors point out that household objects such as dish soap or spice containers tend to look very different in different countries. These issues are exacerbated when images of people are being classified. A 2017 analysis found that models trained on ImageNet and Open Images, two prominent datasets for object recognition, performed dramatically worse at recognizing images of bridegrooms from countries such as Pakistan and India compared to those from North American and European countries (the former were often classified as chain mail, a type of armor).57

Several other types of unfairness are known through anecdotal evidence in image classification and face recognition systems. At least two different image classification systems are known to have applied demeaning and insulting labels to photos of people.58 Face recognition systems have been anecdotally reported to exhibit the cross-race effect, wherein they are more likely to confuse faces of two people who are from a racial group that is underrepresented in the training data.59 This possibility was shown in a simple linear model of face recognition as early as 1991.60 Many commercial products have had difficulty detecting faces of darker-skinned people.61 Similar results are known from lab studies of publicly available object detection models.62

More broadly, computer vision techniques seem to be particularly prone to use in ways that are fundamentally ethically questionable, regardless of accuracy. Consider gender classification: while Microsoft, IBM, and Face++ have worked to mitigate the accuracy disparities discussed above, a more important question is why build a gender classification tool in the first place? By far the most common application appears to be displaying targeted advertisements based on inferred gender (and many other inferred characteristics, including age, race, and current mood) in public spaces such as billboards, stores, or screens in the back seats of taxis. We won't recap the objections to targeted advertising here, but it is an extensively discussed topic, and the practice is strongly opposed by the public, at least in the United States.63

63 Turow et al., "Americans Reject Tailored Advertising and Three Activities That Enable It," Available at SSRN 1478214, 2009.
64 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices," arXiv preprint arXiv:1906.09208, 2019.
65 Yao and Huang, "Beyond Parity: Fairness Objectives for Collaborative Filtering," in Advances in Neural Information Processing Systems, 2017, 2921–30.

Morally dubious computer vision technology goes well beyond this example, and includes apps that "beautify" images of users' faces, i.e., edit them to better conform to mainstream notions of attractiveness; emotion recognition, which has been alleged to be a pseudoscience; and the analysis of video footage for cues such as body language for screening job applicants.64

Search and recommendation systems: three types of harms

Search engines, social media platforms, and recommendation systems have different goals and underlying algorithms, but they do have many things in common from a fairness perspective. They are not decision systems and don't provide or deny people opportunities, at least not directly. Instead, there are (at least) three types of disparities and attendant harms that may arise in these systems. First, they may serve the informational needs of some consumers (searchers or users) better than others. Second, they may create inequities among producers (content creators) by privileging certain content over other content. Third, they may create representational harms by amplifying and perpetuating cultural stereotypes. There are a plethora of other ethical concerns about information platforms, such as the potential to contribute to the political polarization of society. However, we will limit our attention to harms that can be considered to be forms of discrimination.

Unfairness to consumers. An illustration of unfairness to consumers comes from a study of collaborative filtering recommender systems that used theoretical and simulation methods (rather than a field study of a deployed system).65 Collaborative filtering is an approach to recommendations that is based on the explicit or implicit feedback (e.g., ratings and consumption, respectively) provided by other users of the system. The intuition behind it is seen in the "users who liked this item also liked ..." feature on many services. The study found that such systems can underperform for minority groups, in the sense of being worse at recommending content that those users would like. A related but distinct reason for underperformance occurs when users from one group are less observable, e.g., less likely to provide ratings. The underlying assumption is that different groups have different preferences, so that what the system learns about one group doesn't generalize to other groups.

66 Mehrotra et al., "Auditing Search Engines for Differential Satisfaction Across Demographics," in Proceedings of the 26th International Conference on World Wide Web Companion, 2017, 626–33.
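
A toy version of the collaborative-filtering underperformance described above can be reproduced in a few lines. The sketch below is not the cited paper's model; it uses an item-average recommender on synthetic ratings in which a minority group with different (equally valid) tastes contributes only 10% of the data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two user groups with different tastes over 50 items; the majority group
# supplies 90% of the ratings, so pooled item averages mostly reflect it.
n_items = 50
taste = {"majority": rng.normal(3.5, 1.0, n_items),
         "minority": rng.normal(3.5, 1.0, n_items)}
counts = {"majority": 900, "minority": 100}

rows = []
for group, n in counts.items():
    items = rng.integers(0, n_items, n)
    ratings = np.clip(taste[group][items] + rng.normal(0, 0.5, n), 1, 5)
    rows += [(group, i, r) for i, r in zip(items, ratings)]

# "Train": predict each item's rating by its mean observed rating.
item_sums, item_counts = np.zeros(n_items), np.zeros(n_items)
for _, i, r in rows:
    item_sums[i] += r
    item_counts[i] += 1
item_mean = np.where(item_counts > 0, item_sums / np.maximum(item_counts, 1), 3.5)

# Per-group error: much larger for the minority group, whose preferences
# are drowned out in the pooled averages.
for group in counts:
    errs = [(item_mean[i] - r) ** 2 for g, i, r in rows if g == group]
    print(group, "RMSE:", round(float(np.sqrt(np.mean(errs))), 2))
```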

In general, this type of unfairness is hard to study in real systems (not just by external researchers, but also by system operators themselves). The main difficulty is accurately measuring the target variable. The relevant target construct, from a fairness perspective, is users' satisfaction with the results, or how well the results served the users' needs. Metrics such as clicks and ratings serve as crude proxies for the target and are themselves subject to demographic measurement biases. Companies do expend significant resources on A/B testing or other experimental methods for optimizing search and recommendation systems, and frequently measure demographic differences as well. But, to reiterate, such tests almost always emphasize metrics of interest to the firm rather than benefit or payoff for the user.

A rare attempt to transcend this limitation comes from an (internal) audit study of the Bing search engine.66 The authors devised methods to disentangle user satisfaction from other demographic-specific variation by controlling for the effects of demographic factors on behavioral metrics. They combined it with a method for inferring latent differences directly, instead of estimating user satisfaction for each demographic group and then comparing these estimates. This method infers which impression, among a randomly selected pair of impressions, led to greater user satisfaction. They did this using proxies for satisfaction such as reformulation rate: reformulating a search query is a strong indicator of dissatisfaction with the results. Based on these methods, they found no gender differences in satisfaction, but mild age differences.
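
As a very crude sketch of the naive version of such an analysis (the paper's pairwise method is considerably more sophisticated), one might regress a dissatisfaction proxy on group membership while controlling for query type. All data below are invented, and the column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-impression log: whether the query was reformulated (a crude
# dissatisfaction proxy), the searcher's inferred group, and a coarse query
# topic used as a control. Real audits require far more careful controls.
df = pd.DataFrame({
    "reformulated": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "group":        ["a", "a", "a", "b", "b", "b", "a", "b", "a", "b", "a", "b"],
    "topic":        ["nav", "info", "nav", "nav", "info", "info",
                     "info", "nav", "nav", "info", "info", "nav"],
})

# Does group predict reformulation after controlling for query topic?
model = smf.logit("reformulated ~ C(group) + C(topic)", data=df).fit(disp=0)
print(model.summary())
```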

Unfairness to producers. In 2019, a group of content creators sued YouTube, alleging that YouTube's algorithms as well as human moderators suppressed the reach of LGBT-focused videos and the ability to earn ad revenue from them. This is a distinct type of issue from that discussed above, as the claim is about a harm to producers rather than consumers (although, of course, YouTube viewers interested in LGBT content are also presumably harmed). There are many other ongoing allegations and controversies that fall into this category: partisan bias in search results and social media platforms, search engines favoring results from their own properties over competitors, fact-checking of online political ads, and inadequate (or, conversely, over-aggressive) policing of purported copyright violations. It is difficult to meaningfully discuss and address these issues through the lens of fairness and discrimination, rather than a broader perspective of power and accountability. The core issue is that when information platforms have control over public discourse, they become the arbiters of conflicts between competing interests and viewpoints. From a legal perspective, these issues fall primarily under antitrust law and telecommunication regulation rather than antidiscrimination law.67

67 For in-depth treatments of the history and politics of information platforms, see (Wu, The Master Switch: The Rise and Fall of Information Empires (Vintage, 2010); Gillespie, "The Politics of 'Platforms'," New Media & Society 12, no. 3 (2010): 347–64; Gillespie, Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media (Yale University Press, 2018); Klonick, "The New Governors: The People, Rules, and Processes Governing Online Speech," Harv. L. Rev. 131 (2017): 1598).
68 Noble, Algorithms of Oppression: How Search Engines Reinforce Racism (NYU Press, 2018).
69 Kay, Matuszek, and Munson, "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 3819–28.
70 Specifically, instances are occupations, and the fraction of women in the search results is viewed as a predictor of the fraction of women in the occupation in the real world.

Representational harms. The book Algorithms of Oppression drew attention to the ways in which search engines reinforce harmful racial, gender, and intersectional stereotypes.68 There have also been quantitative studies of some aspects of these harms. In keeping with our quantitative focus, let's discuss a study that measured how well the gender skew in Google image search results for 45 occupations (author, bartender, construction worker, ...) corresponded to the real-world gender skew of the respective occupations.69 This can be seen as a test for calibration.70 The study found weak evidence for stereotype exaggeration, that is, imbalances in occupational statistics being exaggerated in image search results. However, the deviations were minor.

Consider a thought experiment: suppose the study had found no evidence of miscalibration. Is the resulting system fair? It would be simplistic to answer in the affirmative, for at least two reasons. First, the study tested calibration between image search results and occupational statistics in the United States. Gender stereotypes of occupations, as well as occupational statistics, differ substantially between countries and cultures. Second, accurately reflecting real-world statistics may still constitute a representational harm when those statistics are skewed and themselves reflect a history of prejudice. Such a system contributes to the lack of visible role models for underrepresented groups. To what extent information platforms should bear responsibility for minimizing these imbalances, and what types of interventions are justified, remain matters of debate.
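
A calibration check of this kind is straightforward to express in code. All numbers below are invented for illustration; the study compared search-result skew against occupational statistics for the United States.

```python
# Hypothetical data: for each occupation, the fraction of women in the top
# image-search results and the fraction of women in that occupation according
# to labor statistics (all values invented).
search_frac = {"author": 0.45, "bartender": 0.48, "construction worker": 0.04,
               "chef": 0.22, "nurse": 0.92}
stats_frac = {"author": 0.56, "bartender": 0.58, "construction worker": 0.09,
              "chef": 0.25, "nurse": 0.88}

# Calibration question: is the search-result skew a faithful predictor of the
# real-world skew, or does it exaggerate the imbalance?
for occ in search_frac:
    gap = search_frac[occ] - stats_frac[occ]
    print(f"{occ:22s} search={search_frac[occ]:.2f} stats={stats_frac[occ]:.2f} gap={gap:+.2f}")

mean_abs_gap = sum(abs(search_frac[o] - stats_frac[o]) for o in search_frac) / len(search_frac)
print("mean absolute deviation from calibration:", round(mean_abs_gap, 2))
```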

Understanding unfairness in ad targeting

Ads have long been targeted in relatively crude ways. For example, a health magazine might have ads for beauty products, exploiting a coarse correlation. In contrast to previous methods, online targeting offers several key advantages to advertisers: granular data collection about individuals, the ability to reach niche audiences (in theory, the audience size can be one, since ad content can be programmatically generated and customized with user attributes as inputs), and the ability to measure conversion (conversion is when someone who views the ad clicks on it and then takes another action, such as a purchase). To date, ad targeting has been one of the most commercially impactful applications of machine learning.

The complexity of modern ad targeting results in many avenues for disparities in the demographics of ad views, which we will study. But it is not obvious how to connect these disparities to fairness. After all, many types of demographic targeting, such as clothing ads by gender, are considered innocuous.

71 Susser, Roessler, and Nissenbaum, "Online Manipulation: Hidden Influences in a Digital World," Available at SSRN 3306006, 2018.
72 The economic analysis of advertising includes a third category, complementary, that is related to the persuasive or manipulative category (Bagwell, "The Economic Analysis of Advertising," Handbook of Industrial Organization 3 (2007): 1701–844).
73 See, e.g., (Coltrane and Messineo, "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising," Sex Roles 42, no. 5–6 (2000): 363–89).
74 Angwin, Varner, and Tobin, "Facebook Enabled Advertisers to Reach 'Jew Haters'" (ProPublica, https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017).

There are two frameworks for understanding potential harms from ad targeting. The first framework sees ads as unlocking opportunities for their recipients, because they provide information that the viewer might not have. This is why targeting employment or housing ads based on protected categories may be unfair and unlawful. The domains where targeting is legally prohibited broadly correspond to those which impact civil rights, and reflect the complex histories of discrimination in those domains, as discussed in Chapter 3.

The second framework views ads as tools of persuasion rather than information dissemination. In this framework, harms arise from ads being manipulative (that is, exerting covert influence instead of making forthright appeals) or exploiting stereotypes.71 Users are harmed by being targeted with ads that provide them negative utility, as opposed to the first framework, in which the harm comes from missing out on ads with positive utility. The two frameworks don't necessarily contradict each other. Rather, individual ads or ad campaigns can be seen as either primarily informational or primarily persuasive, and accordingly, one or the other framework might be appropriate for analysis.72

There is a vast literature on how race and gender are portrayed in ads, which we consider to fall under the persuasion framework.73 However, this line of inquiry has yet to turn its attention to online targeted advertising, which has the potential for accentuating the harms of manipulation and stereotyping by targeting specific people and groups. Thus, the empirical research that we will highlight falls under the informational framework.

There are roughly three mechanisms by which the same targeted ad may reach one group more often than another. The most obvious is the use of explicit targeting criteria by advertisers: either the sensitive attribute itself or a proxy for it (such as ZIP code as a proxy for race). For example, Facebook allows thousands of targeting categories, including categories that are automatically constructed by the system based on users' free-form text descriptions of their interests. These categories were found to include "Jew haters" and many other antisemitic terms.74 The company has had difficulty eliminating even direct proxies for sensitive categories, resulting in repeated exposés.

The second disparity-producing mechanism is optimization of click rate (or another measure of effectiveness), which is one of the core goals of algorithmic targeting. Unlike the first category, this does not require explicit intent by the advertiser or the platform. The algorithmic system may predict a user's probability of engaging with an ad based on her past behavior, her expressed interests, and other factors (including, potentially, explicitly expressed sensitive attributes).

75 Ali et al., "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes," Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199; Lambrecht and Tucker, "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads," Management Science, 2019.
76 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings," Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
77 Andreou et al., "AdAnalyst" (https://adanalyst.mpi-sws.org, 2017).
78 Ali et al., "Discrimination Through Optimization."
79 Ali et al.

The third mechanism is market effects: delivering an ad to different users may cost the advertiser different amounts. For example, some researchers have observed that women cost more to advertise to than men, and hypothesized that this is because women clicked on ads more often, leading to a higher measure of effectiveness.75 Thus, if the advertiser simply specifies a total budget and leaves the delivery up to the platform (which is a common practice), then the audience composition will vary depending on the budget: smaller budgets will result in the less expensive group being overrepresented.
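
A stylized simulation (not Facebook's or any platform's actual delivery logic) shows how budget alone can skew the audience when one group is cheaper to reach:

```python
# Illustrative only: two equally sized audiences, but impressions for
# group A cost more than for group B (costs are invented).
COST = {"A": 0.9, "B": 0.5}          # hypothetical cost per impression
AUDIENCE = ["A"] * 5000 + ["B"] * 5000


def deliver(budget):
    """Greedy delivery: keep buying the cheapest available impressions."""
    shown = {"A": 0, "B": 0}
    for group in sorted(AUDIENCE, key=lambda g: COST[g]):
        if budget < COST[group]:
            break
        budget -= COST[group]
        shown[group] += 1
    total = sum(shown.values())
    return {g: round(shown[g] / total, 2) for g in shown}


for budget in (500, 3000, 7000):
    print(budget, deliver(budget))
# Small budgets are spent almost entirely on the cheaper group; only large
# budgets approach the audience's true 50/50 composition.
```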

In terms of methods to detect these disparities, researchers and journalists have used broadly two approaches: interacting with the system either as a user or as an advertiser. Tschantz et al. created simulated users that had the "gender" attribute in Google's Ad Settings page set to female or male, and found that Google showed the simulated male users ads from a certain career coaching agency that promised large salaries more frequently than the simulated female users.76 While this type of study establishes that employment ads through Google's ad system are not blind to gender (as expressed in the ad settings page), it cannot uncover the mechanism, i.e., distinguish between explicit targeting by the advertiser and platform effects of various kinds.

Interacting with ad platforms as an advertiser has proved to be a more fruitful approach so far, especially for analyzing Facebook's advertising system. This is because Facebook exposes vastly more details about its advertising system to advertisers than to users. In fact, it allows advertisers to learn more information it has inferred or purchased about a user than it will allow the user himself to access.77

The existence of antisemitic auto-generated targeting categories mentioned above was uncovered using the advertiser interface. Ad delivery on Facebook has been found to introduce demographic disparities due to both market effects and effectiveness optimization effects.78 To reiterate, this means that even if the advertiser does not explicitly target an ad by (say) gender, there may be a systematic gender skew in the ad's audience. The optimization effects are enabled by Facebook's analysis of the contents of ads. Interestingly, this includes image analysis, which researchers revealed using the clever technique of serving ads with transparent content that is invisible to humans but nonetheless had an effect on ad delivery.79


80 Hutson et al., "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
81 This is not to say that discrimination is nonexistent. See, e.g., (Ayres, Banaji, and Jolls, "Race Effects on eBay," The RAND Journal of Economics 46, no. 4 (2015): 891–917).
82 Lee et al., "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (ACM, 2015), 1603–12.
83 Edelman, Luca, and Svirsky, "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment," American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.

Fairness considerations in the design of online marketplaces

Online platforms for ride hailing, short-term housing, and freelance (gig) work have risen to prominence in the 2010s; notable examples are Uber, Lyft, Airbnb, and TaskRabbit. They are important targets for the study of fairness because they directly impact people's livelihoods and opportunities. We will set aside some types of markets from our discussion. Online dating apps share some similarities with these markets, but they require an entirely separate analysis because the norms governing romance are different from those governing commerce and employment.80 Then there are marketplaces for goods, such as Amazon and eBay. In these markets, the characteristics of the participants are less salient than the attributes of the product, so discrimination is less of a concern.81

Unlike the domains studied so far, machine learning is not a core component of the algorithms in online marketplaces. (Nonetheless, we consider it in scope because of our broad interest in decision making and fairness, rather than just machine learning.) Therefore, fairness concerns are less about training data or algorithms; the far more serious issue is discrimination by buyers and sellers. For example, one study found that Uber drivers turned off the app in areas where they did not want to pick up passengers.82

Methods to detect discrimination in online marketplaces are fairly similar to those used in traditional settings such as housing and employment: a combination of audit studies and observational methods has been used. A notable example is a field experiment targeting Airbnb.83 The authors created fake guest accounts whose names signaled race (African-American or White) and gender (female or male), but were otherwise identical. Twenty different names were used: five in each combination of race and gender. They then contacted the hosts of 6,400 listings in five cities through these accounts to inquire about availability. They found a 50% probability of acceptance of inquiries from guests with White-sounding names, compared to 42% for guests with African-American-sounding names. The effect was persistent regardless of the host's race, gender, and experience on the platform, as well as listing type (high or low priced; entire property or shared) and the diversity of the neighborhood. Note that the accounts did not have profile pictures; if inference of race by hosts happens in part based on appearance, a study design that varied the accounts' profile pictures might find a greater effect.
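
The headline comparison can be checked with a standard two-proportion test. The even split of inquiries across the two name groups below is assumed for illustration; the acceptance rates are the ones reported above.

```python
from math import sqrt
from statistics import NormalDist

# Stylized numbers: assume roughly half of the 6,400 inquiries used
# White-sounding names and half African-American-sounding names
# (the actual design was more involved); acceptance rates 50% vs. 42%.
n_white, p_white = 3200, 0.50
n_black, p_black = 3200, 0.42

acc_white = round(n_white * p_white)
acc_black = round(n_black * p_black)
p_pool = (acc_white + acc_black) / (n_white + n_black)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_white + 1 / n_black))
z = (p_white - p_black) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.1f}, two-sided p = {p_value:.2g}")
# A gap this large over thousands of inquiries is far outside what chance
# alone would produce.
```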

Compared to traditional settings, some types of observational data are readily available on online platforms, which can be useful to the researcher. In the above study, the public availability of reviews of listed properties proved useful. It was not essential to the design of the study, but allowed an interesting validity check. When the analysis was restricted to the 29 hosts in the sample who had received at least one review from an African-American guest, the racial disparity in responses declined sharply. If the study's findings were a result of a quirk of the experimental design rather than actual racial discrimination by Airbnb hosts, it would be difficult to explain why the effect would disappear for this subset of hosts. This supports the study's external validity.

84 Thebault-Spieker, Terveen, and Hecht, "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit," ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
85 Ge et al., "Racial and Gender Discrimination in Transportation Network Companies" (National Bureau of Economic Research, 2016).
86 Levy and Barocas, "Designing Against Discrimination in Online Markets," Berkeley Tech. LJ 32 (2017): 1183.
87 Tjaden, Schwemmer, and Khadjavi, "Ride with Me - Ethnic Discrimination, Social Markets, and the Sharing Economy," European Sociological Review 34, no. 4 (2018): 418–32.

In addition to discrimination by participants, another fairness issue that many online marketplaces must contend with is geographic differences in effectiveness. One study of TaskRabbit and Uber found that neighborhoods with high population density and high-income neighborhoods receive the largest benefits from the sharing economy.84 Due to the pervasive correlation between poverty and race/ethnicity, these also translate to racial disparities.

Of course, geographic and structural disparities in these markets are not caused by online platforms, and no doubt exist in offline analogs such as word-of-mouth gig work. In fact, the magnitude of racial discrimination is much larger in scenarios such as hailing taxis on the street85 compared to technologically mediated interactions. However, in comparison to markets regulated by antidiscrimination law, such as hotels, discrimination in online markets is more severe. In any case, the formalized nature of online platforms makes audits easier, and their centralized nature presents a powerful opportunity for fairness interventions.

There are many ways in which platforms can use design to minimize users' ability to discriminate (such as by withholding information about counterparties) and the impetus to discriminate (such as by making participant characteristics less salient compared to product characteristics in the interface).86 There is no way for platforms to take a neutral stance towards discrimination by participants: even choices made without explicit regard for discrimination can affect how vulnerable users are to bias.

As a concrete example, the authors of the Airbnb study recommend that the platform withhold guest information from hosts prior to booking. (Note that ride hailing services do withhold customer information. Carpooling services, on the other hand, allow users to view names when selecting matches; unsurprisingly, this enables discrimination against ethnic minorities.)87 The authors of the study on geographic inequalities suggest, among other interventions, that ride hailing services provide a "geographic reputation" score to drivers, to combat the fact that drivers often incorrectly perceive neighborhoods to be more dangerous than they are.


88 Muthukumar et al., "Understanding Unequal Gender Classification Accuracy from Face Images," arXiv Preprint arXiv:1812.00099, 2018.

89 This overrepresentation is because photos of celebrities are easier to gather publicly, and celebrities are thought to have weakened privacy rights due to the competing public interest in their activities. However, for a counterpoint, see (Harvey and LaPlace, "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019, https://megapixels.cc).

90 Robertson et al., "Auditing Partisan Audience Bias Within Google Search," Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.

91 D'Onfro, "Google Tests Changes to Its Search Algorithm; How Search Works" (https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019).
92 Hannak et al., "Measuring Personalization of Web Search," in Proceedings of the 22nd International Conference on World Wide Web (ACM, 2013), 527–38.
93 Tripodi, "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices," Data & Society, 2018.

Mechanisms of discrimination

We've looked at a number of studies on detecting unfairness in algorithmic systems. Let's take stock.

In the introductory chapter, we discussed at a high level different ways in which unfairness could arise in machine learning systems. Here we see that the specific sources and mechanisms of unfairness can be intricate and domain-specific. Researchers need an understanding of the domain to effectively formulate and test hypotheses about sources and mechanisms of unfairness.

For example, consider the study of gender classification systems discussed above. It is easy to guess that unrepresentative training datasets contributed to the observed accuracy disparities, but unrepresentative in what way? A follow-up paper considered this question.88 It analyzed several state-of-the-art gender classifiers (in a lab setting, as opposed to the field tests of commercial APIs in the original paper) and argued that underrepresentation of darker skin tones in the training data is not a reason for the observed disparity. Instead, one mechanism suggested by the authors is based on the fact that many training datasets of human faces comprise photos of celebrities.89 They found that photos of female celebrities have more prominent makeup compared to photos of women in general. This led to classifiers using makeup as a proxy for gender in a way that didn't generalize to the rest of the population.

Slightly different hypotheses can produce vastly different conclusions, especially in the presence of complex interactions between content producers, consumers, and platforms. For example, one study tested claims of partisan bias by search engines, as well as related claims that search engines return results that reinforce searchers' existing views (the "filter bubble" hypothesis).90 The researchers recruited participants with different political views, collected Google search results on a political topic in both standard and incognito windows from those participants' computers, and found that standard (personalized) search results were no more partisan than incognito (non-personalized) ones, seemingly finding evidence against the claim that online search reinforces users' existing beliefs.

This finding is consistent with the fact that Google doesn't personalize search results except based on searcher location and the immediate (10-minute) history of searches. This is known based on Google's own admission91 and prior research.92

However, a more plausible hypothesis for the filter bubble effect in search comes from a qualitative study.93 Simplified somewhat for our purposes, it goes as follows: when an event with political significance unfolds, key influencers (politicians, partisan news outlets, interest groups, political message boards) quickly craft their own narratives of the event. Those narratives selectively reach their respective partisan audiences through partisan information networks. Those people then turn to search engines to learn more or to "verify the facts." Crucially, however, they use different search terms to refer to the same event, reflecting the different narratives to which they have been exposed.94 The results for these different search terms are often starkly different, because the producers of news and commentary selectively and strategically cater to partisans using these same narratives. Thus searchers' beliefs are reinforced. Note that this filter-bubble-producing mechanism operates effectively even though the search algorithm itself is arguably neutral.95

94 For example, in 2017, US president Donald Trump called for the National Football League to fire players who engaged in a much-publicized political protest during games. Opposing narratives of this event were that NFL viewership had declined due to fans protesting players' actions, or that it had increased despite the protests. Search terms reflecting these views might be "NFL ratings down" versus "NFL ratings up."
95 But see (Golebiewski and Boyd, "Data Voids: Where Missing Data Can Easily Be Exploited," Data & Society 29 (2018)) ("Data Void Type 4: Fragmented Concepts") for an argument that search engines' decision not to collapse related concepts contributes to this fragmentation.
96 Valentino-Devries, Singer-Vine, and Soltani, "Websites Vary Prices, Deals Based on Users' Information," Wall Street Journal 10 (2012): 60–68.

A final example to reinforce the fact that disparity-producing mechanisms can be subtle, and that domain expertise is required to formulate the right hypothesis: an investigation by journalists found that staples.com showed discounted prices to individuals in some ZIP codes; these ZIP codes were, on average, wealthier.96 However, the actual pricing rule that explained most of the variation, as they reported, was that if there was a competitor's physical store located within 20 miles or so of the customer's inferred location, then the customer would see a discount. Presumably this strategy is intended to infer the customer's reservation price, or willingness to pay. Incidentally, this is a similar kind of "statistical discrimination" as seen in the car sales discrimination study discussed at the beginning of this chapter.
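
The journalists' reasoning about competing explanations can be mimicked with a simple model comparison: regress the discount indicator on both candidate explanations and see which one carries the weight. This is only a sketch; the data file and column names are hypothetical.

```python
# Sketch: which hypothesis explains the discounts, ZIP-level income or
# proximity to a competitor's store? (hypothetical observational data)
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("price_observations.csv")  # hypothetical file
df["near_competitor"] = (df["miles_to_competitor"] <= 20).astype(int)

# If the distance rule drives the discounts, its coefficient should dominate
# once both variables are included in the same model.
model = smf.logit("saw_discount ~ near_competitor + zip_median_income", data=df).fit()
print(model.summary())
```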

Fairness criteria in algorithmic audits

While the mechanisms of unfairness are different in algorithmic systems, the applicable fairness criteria are the same for algorithmic decision making as for other kinds of decision making. That said, some fairness notions are more often relevant, and others less so, in algorithmic decision making compared to human decision making. We offer a few selected observations on this point.

Fairness as blindness is seen less often in audit studies of algorithmic systems: such systems are generally designed to be blind to sensitive attributes. Besides, fairness concerns often arise precisely from the fact that blindness is generally not an effective fairness intervention in machine learning. Two exceptions are ad targeting and online marketplaces (where the non-blind decisions are in fact being made by users and not the platform).

Unfairness as arbitrariness. There are roughly two senses in which decision making could be considered arbitrary, and hence unfair. The first is when decisions are made on a whim rather than according to a uniform procedure. Since automated decision making results in procedural uniformity, this type of concern is generally not salient.

97 Raghavan et al., "Mitigating Bias in Algorithmic Employment Screening."

The second sense of arbitrariness applies even when there is a uniform procedure, if that procedure relies on a consideration of factors that are thought to be irrelevant, either statistically or morally. Since machine learning excels at finding correlations, it commonly identifies factors that seem puzzling or blatantly unacceptable. For example, in aptitude tests such as the Graduate Record Examination, essays are graded automatically. Although e-rater and other tools used for this purpose are subject to validation checks and are found to perform similarly to human raters on samples of actual essays, they can be fooled into giving perfect scores to machine-generated gibberish. Recall that there is no straightforward criterion that allows us to assess whether a feature is morally valid (Chapter 3), and this question must be debated on a case-by-case basis.

More serious issues arise when classifiers are not even subjected to proper validity checks. For example, there are a number of companies that claim to predict candidates' suitability for jobs based on personality tests, or body language and other characteristics in videos.97 There is no peer-reviewed evidence that job performance is predictable using these factors, and no basis for such a belief. Thus, even if these systems don't produce demographic disparities, they are unfair in the sense of being arbitrary: candidates receiving an adverse decision lack due process to understand the basis for the decision, contest it, or determine how to improve their chances of success.

Observational fairness criteria, including demographic parity, error rate parity, and calibration, have received much attention in algorithmic fairness studies. Convenience has probably played a big role in this choice: these metrics are easy to gather and straightforward to report, without necessarily connecting them to moral notions of fairness. We reiterate our caution about the overuse of parity-based notions: parity should rarely be made a goal by itself. At a minimum, it is important to understand the sources and mechanisms that produce disparities, as well as the harms that result from them, before deciding on appropriate interventions.
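
As an illustration of how mechanical these metrics are to compute (part of their appeal, and part of the danger of over-relying on them), here is a minimal sketch with hypothetical columns for group membership, true outcomes, and model predictions.

```python
# Sketch: per-group observational metrics from binary predictions (toy data).
import pandas as pd

df = pd.DataFrame({
    "group":  ["a", "a", "a", "a", "b", "b", "b", "b"],
    "y_true": [1, 0, 1, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 1, 0],
})

for g, sub in df.groupby("group"):
    selection_rate = sub["y_pred"].mean()               # demographic parity
    fpr = sub.loc[sub["y_true"] == 0, "y_pred"].mean()  # error rate parity
    fnr = 1 - sub.loc[sub["y_true"] == 1, "y_pred"].mean()
    ppv = sub.loc[sub["y_pred"] == 1, "y_true"].mean()  # calibration-style check
    print(g, selection_rate, fpr, fnr, ppv)
```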

Representational harms. Traditionally, allocative and representational harms were studied in separate literatures, reflecting the fact that they are mostly seen in separate spheres of life (for instance, housing discrimination versus stereotypes in advertisements). Many algorithmic systems, on the other hand, are capable of generating both types of harms. A failure of face recognition for darker-skinned people is demeaning, but it could also prevent someone from being able to access a digital device or enter a building that uses biometric security.


98 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."

Information flow, fairness, privacy

A notion called information flow is seen frequently in algorithmic audits. This criterion requires that sensitive information about subjects not flow from one information system to another, or from one part of a system to another. For example, a health website may promise that user activity such as searches and clicks is not shared with third parties such as insurance companies (since that may lead to potentially discriminatory effects on insurance premiums). It can be seen as a generalization of blindness: whereas blindness is about not acting on available sensitive information, restraining information flow ensures that the sensitive information is not available to act upon in the first place.

There is a powerful test for detecting violations of information flow constraints, which we will call the adversarial test.98 It does not directly detect information flow, but rather decisions that are made on the basis of that information. It is powerful because it does not require specifying a target variable, which minimizes the domain knowledge required of the researcher. To illustrate, let's revisit the example of the health website. The adversarial test operates as follows:

1. Create two groups of simulated users (A and B), i.e., bots, that are identical except for the fact that users in group A, but not group B, browse the sensitive website in question.

2. Have both groups of users browse other websites that are thought to serve ads from insurance companies, or personalize content based on users' interests, or somehow tailor content to users based on health information. This is the key point: the researcher does not need to hypothesize a mechanism by which potentially unfair outcomes result, e.g., which websites (or third parties) might receive sensitive data, or whether the personalization might take the form of ads, prices, or some other aspect of content.

3. Record the contents of the web pages seen by all users in the previous step.

4. Train a binary classifier to distinguish between web pages encountered by users in group A and those encountered by users in group B. Use cross-validation to measure its accuracy.

5. If the information flow constraint is satisfied (i.e., the health website did not share any user information with any third parties), then the websites browsed in step 2 are blind to user activities in step 1; thus the two groups of users look identical, and there is no way to systematically distinguish the content seen by group A from that seen by group B. The classifier's test accuracy should not significantly exceed 1/2. The permutation test can be used to quantify the probability that the classifier's observed accuracy (or better) could have arisen by chance if there is in fact no systematic difference between the two groups.99

99 Ojala and Garriga, "Permutation Tests for Studying Classifier Performance," Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
100 Datta, Tschantz, and Datta, "Automated Experiments on Ad Privacy Settings."
101 Venkatadri et al., "Investigating Sources of PII Used in Facebook's Targeted Advertising," Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
102 Bashir et al., "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads," in USENIX Security Symposium 16, 2016, 481–96.
103 Singer-Vine, Valentino-DeVries, and Soltani, "How the Journal Tested Prices and Deals Online" (Wall Street Journal, http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online/, 2012).

There are additional nuances relating to proper randomization and controls, for which we refer the reader to the study.100 Note that if the adversarial test fails to detect an effect, it does not mean that the information flow constraint is satisfied. Also note that the adversarial test is not capable of measuring an effect size. Such a measurement would be meaningless anyway, since the goal is to detect information flow, and any effect on the observable behavior of the system is merely a proxy for it.
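
Here is a minimal sketch of steps 4 and 5, assuming the page contents recorded in step 3 have already been turned into feature vectors. The data below is randomly generated stand-in data, so the test should (correctly) find nothing.

```python
# Sketch of steps 4-5 of the adversarial test (synthetic stand-in data).
# X holds feature vectors of recorded pages; y is the bot's group (0 = B, 1 = A).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, permutation_test_score

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 300)).astype(float)  # stand-in for page features
y = np.repeat([0, 1], 100)                           # group labels

clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X, y, cv=5).mean()

# Permutation test: how often does the accuracy on label-shuffled data reach
# the observed accuracy? A small p-value is evidence of information flow.
score, perm_scores, pvalue = permutation_test_score(
    clf, X, y, cv=5, n_permutations=200, random_state=0)
print(f"cross-validated accuracy = {acc:.2f}, permutation p-value = {pvalue:.3f}")
```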

This view of information flow as a generalization of blindness reveals an important connection between privacy and fairness. Many studies based on this principle can be seen as either privacy or fairness investigations. For example, a study found that Facebook solicits phone numbers from users with the stated purpose of improving account security, but uses those numbers for ad targeting.101 This is an example of undisclosed information flow from one part of the system to another. Another study used ad retargeting (in which actions taken on one website, such as searching for a product, result in ads for that product on another website) to infer the exchange of user data between advertising companies.102 Neither study used the adversarial test.

Comparison of research methods

For auditing user fairness on online platforms, there are two main approaches: creating fake profiles, and recruiting real users as testers. Each has its pros and cons. Both approaches have the advantage, compared to traditional audit studies, of allowing a potentially greater scale, due to the ease of creating fake accounts or recruiting testers online (e.g., through crowd-sourcing).

Scaling is especially relevant for testing geographic differences, given the global reach of many online platforms. It is generally possible to simulate geographically dispersed users by manipulating testing devices to report faked locations. For example, the above-mentioned investigation of regional price differences on staples.com actually included a measurement from each of the 42,000 ZIP codes in the United States.103 They accomplished this by observing that the website stored the user's inferred location in a cookie, and proceeding to programmatically change the value stored in the cookie to each possible value.
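
A sketch of this technique: if a site stores the inferred location in a cookie, the audit script can simply override that cookie for each ZIP code of interest. The cookie name, URL, and parsing step below are hypothetical, not the ones from the actual investigation.

```python
# Sketch: simulating geographically dispersed users by overriding a location
# cookie (cookie name, URL, and price extraction are hypothetical).
import requests

ZIP_CODES = ["10001", "60601", "94103"]  # the study covered every US ZIP code

def page_for_zip(zip_code: str) -> str:
    cookies = {"zipcode": zip_code}  # pretend the site keeps location here
    resp = requests.get("https://www.example-retailer.com/product/12345",
                        cookies=cookies, timeout=10)
    return resp.text  # in practice, parse the displayed price out of the HTML

pages = {z: page_for_zip(z) for z in ZIP_CODES}
```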

That said, practical obstacles commonly arise in the fake-profile approach. In one study, the number of test units was practically limited by the requirement for each account to have a distinct credit card associated with it.104 Another issue is bot detection. For example, the Airbnb study was limited to five cities, even though the researchers originally planned to test more, because the platform's bot-detection algorithms kicked in during the course of the study to detect and shut down the anomalous pattern of activity. It's easy to imagine an even worse outcome where accounts detected as bots are somehow treated differently by the platform (e.g., messages from those accounts are more likely to be hidden from intended recipients), compromising the validity of the study.

104 Chen, Mislove, and Wilson, "Peeking Beneath the Hood of Uber," in Proceedings of the 2015 Internet Measurement Conference (ACM, 2015), 495–508.
105 Salganik, Bit by Bit: Social Research in the Digital Age (Princeton University Press, 2019).

As this example illustrates, the relationship between audit researchers and the platforms being audited is often adversarial. Platforms' efforts to hinder researchers can be technical, but also legal. Many platforms, notably Facebook, prohibit both fake-account creation and automated interaction in their Terms of Service. The ethics of Terms-of-Service violation in audit studies is a matter of ongoing debate, paralleling some of the ethical discussions during the formative period of traditional audit studies. In addition to ethical questions, researchers incur a legal risk when they violate Terms of Service. In fact, under laws such as the US Computer Fraud and Abuse Act, it is possible that they may face criminal, as opposed to just civil, penalties.

Compared to the fake-profile approach, recruiting real users allows less control over profiles, but is better able to capture the natural variation in attributes and behavior between demographic groups. Thus, neither design is always preferable, and they are attuned to different fairness notions. When testers are recruited via crowd-sourcing, the result is generally a convenience sample (i.e., the sample is biased towards people who are easy to contact), resulting in a non-probability (non-representative) sample. It is generally infeasible to train such a group of testers to carry out an experimental protocol; instead, such studies typically handle the interaction between testers and the platform via software tools (e.g., browser extensions) created by the researcher and installed by the tester. For more on the difficulties of research using non-probability samples, see the book Bit by Bit.105

Due to the serious limitations of both approaches, lab studies of algorithmic systems are commonly seen. The reason that lab studies have value at all is that, since automated systems are fully specified using code, the researcher can hope to simulate them relatively faithfully. Of course, there are limitations: the researcher typically doesn't have access to training data, user interaction data, or configuration settings. But simulation is a valuable way for developers of algorithmic systems to test their own systems, and this is a common approach in the industry. Companies often go so far as to make de-identified user interaction data publicly available so that external researchers can conduct lab studies to develop and test algorithms. The Netflix Prize is a prominent example of such a data release.106 So far, these efforts have almost always been about improving the accuracy, rather than the fairness, of algorithmic systems.

106 Bennett and Lanning, "The Netflix Prize," in Proceedings of KDD Cup and Workshop, 2007:35 (New York, NY, USA, 2007).
107 Lum and Isaac, "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19; Ensign et al., "Runaway Feedback Loops in Predictive Policing," arXiv Preprint arXiv:1706.09847, 2017.
108 Chaney, Stewart, and Engelhardt, "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility," in Proceedings of the 12th ACM Conference on Recommender Systems (ACM, 2018), 224–32.
109 Obermeyer et al., "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations," Science 366, no. 6464 (2019): 447–53; Chouldechova et al., "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions," in Conference on Fairness, Accountability and Transparency, 2018, 134–48.
110 Passi and Barocas, "Problem Formulation and Fairness," in Proceedings of the Conference on Fairness, Accountability, and Transparency (ACM, 2019), 39–48.

Lab studies are especially useful for getting a handle on questions that cannot be studied by other empirical methods, notably the dynamics of algorithmic systems, i.e., their evolution over time. One prominent result from this type of study is the quantification of feedback loops in predictive policing.107 Another insight is the increasing homogeneity of users' consumption patterns over time in recommender systems.108

Observational studies and observational fairness criteria continue to be important. Such studies are typically carried out by algorithm developers or decision makers, often in collaboration with external researchers.109 It is relatively rare for observational data to be made publicly available. A rare exception, the COMPAS dataset, involved a Freedom of Information Act request.

Finally, it is worth reiterating that quantitative studies are narrow in what they can conceptualize and measure. Qualitative and ethnographic studies of decision makers thus provide an invaluable complementary perspective. To illustrate, we'll discuss one study that reports on six months of ethnographic fieldwork in a corporate data science team.110 The team worked on a project in the domain of car financing that aimed to "improve the quality" of leads (leads are potential car buyers in need of financing who might be converted to actual buyers through marketing). Given such an amorphous high-level goal, formulating a concrete and tractable data science problem is a necessary and nontrivial task, one that is further complicated by the limitations of the available data. The paper documents the substantial latitude in problem formulation and spotlights the iterative process that was used, resulting in the use of a series of proxies for lead quality. The authors show that different proxies have different fairness implications: one proxy would maximize people's lending opportunities, while another would alleviate dealers' existing biases; both are potentially valuable fairness goals. However, the data scientists were not aware of the normative implications of their decisions and did not explicitly deliberate on them.

Looking ahead

In this chapter, we covered traditional tests for discrimination as well as fairness studies of various algorithmic systems. Together, these methods constitute a powerful toolbox for interrogating a single decision system at a single point in time. But there are other types of fairness questions we can ask: what is the cumulative effect of the discrimination faced by a person over the course of a lifetime? What structural aspects of society result in unfairness? We cannot answer such questions by looking at individual systems. The next chapter is all about broadening our view of discrimination and then using that broader perspective to study a range of possible fairness interventions.

References

Agan, Amanda, and Sonja Starr. "Ban the Box, Criminal Records, and Racial Discrimination: A Field Experiment." The Quarterly Journal of Economics 133, no. 1 (2017): 191–235.
Ali, Muhammad, Piotr Sapiezynski, Miranda Bogen, Aleksandra Korolova, Alan Mislove, and Aaron Rieke. "Discrimination Through Optimization: How Facebook's Ad Delivery Can Lead to Biased Outcomes." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 199.
Amorim, Evelin, Marcia Cançado, and Adriano Veloso. "Automated Essay Scoring in the Presence of Biased Ratings." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 229–37, 2018.
Andreou, Athanasios, Oana Goga, Krishna Gummadi, Patrick Loiseau, and Alan Mislove. "AdAnalyst." https://adanalyst.mpi-sws.org, 2017.
Angwin, Julia, Madeleine Varner, and Ariana Tobin. "Facebook Enabled Advertisers to Reach 'Jew Haters'." ProPublica. https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters, 2017.
Arrow, Kenneth. "The Theory of Discrimination." Discrimination in Labor Markets 3, no. 10 (1973): 3–33.
Ayres, Ian. "Three Tests for Measuring Unjustified Disparate Impacts in Organ Transplantation: The Problem of Included Variable Bias." Perspectives in Biology and Medicine 48, no. 1 (2005): S68–S87.
Ayres, Ian, Mahzarin Banaji, and Christine Jolls. "Race Effects on eBay." The RAND Journal of Economics 46, no. 4 (2015): 891–917.
Ayres, Ian, and Peter Siegelman. "Race and Gender Discrimination in Bargaining for a New Car." The American Economic Review, 1995, 304–21.
Bagwell, Kyle. "The Economic Analysis of Advertising." Handbook of Industrial Organization 3 (2007): 1701–844.

Bashir, Muhammad Ahmad, Sajjad Arshad, William Robertson, and Christo Wilson. "Tracing Information Flows Between Ad Exchanges Using Retargeted Ads." In USENIX Security Symposium 16, 481–96, 2016.
Becker, Gary S. The Economics of Discrimination. University of Chicago Press, 1957.
Bennett, James, and Stan Lanning. "The Netflix Prize." In Proceedings of KDD Cup and Workshop, 2007:35. New York, NY, USA, 2007.
Bertrand, Marianne, and Esther Duflo. "Field Experiments on Discrimination." In Handbook of Economic Field Experiments, 1:309–93. Elsevier, 2017.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics 119, no. 1 (2004): 249–75.
Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991–1013.
Bird, Sarah, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. "Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI." In Workshop on Fairness, Accountability, and Transparency in Machine Learning, 2016.
Blank, Rebecca M. "The Effects of Double-Blind Versus Single-Blind Reviewing: Experimental Evidence from the American Economic Review." The American Economic Review, 1991, 1041–67.
Bogen, Miranda, and Aaron Rieke. "Help wanted: an examination of hiring algorithms, equity, and bias." Technical report, Upturn, 2018.
Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." In Conference on Fairness, Accountability and Transparency, 77–91, 2018.
Buranyi, Stephen. "How to Persuade a Robot That You Should Get the Job." Guardian, 2018.
Chaney, Allison J. B., Brandon M. Stewart, and Barbara E. Engelhardt. "How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility." In Proceedings of the 12th ACM Conference on Recommender Systems, 224–32. ACM, 2018.
Chen, Le, Alan Mislove, and Christo Wilson. "Peeking Beneath the Hood of Uber." In Proceedings of the 2015 Internet Measurement Conference, 495–508. ACM, 2015.
Chouldechova, Alexandra, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. "A Case Study of Algorithm-Assisted Decision Making in Child Maltreatment Hotline Screening Decisions." In Conference on Fairness, Accountability and Transparency, 134–48, 2018.

Coltrane, Scott, and Melinda Messineo. "The Perpetuation of Subtle Prejudice: Race and Gender Imagery in 1990s Television Advertising." Sex Roles 42, no. 5–6 (2000): 363–89.
D'Onfro, Jillian. "Google Tests Changes to Its Search Algorithm; How Search Works." https://www.cnbc.com/2018/09/17/google-tests-changes-to-its-search-algorithm-how-search-works.html, 2019.
Danziger, Shai, Jonathan Levav, and Liora Avnaim-Pesso. "Extraneous Factors in Judicial Decisions." Proceedings of the National Academy of Sciences 108, no. 17 (2011): 6889–92.
Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias Against Women." Reuters, 2018.
Datta, Amit, Michael Carl Tschantz, and Anupam Datta. "Automated Experiments on Ad Privacy Settings." Proceedings on Privacy Enhancing Technologies 2015, no. 1 (2015): 92–112.
De-Arteaga, Maria, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 120–28. ACM, 2019.
Edelman, Benjamin, Michael Luca, and Dan Svirsky. "Racial Discrimination in the Sharing Economy: Evidence from a Field Experiment." American Economic Journal: Applied Economics 9, no. 2 (2017): 1–22.
Ensign, Danielle, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. "Runaway Feedback Loops in Predictive Policing." arXiv Preprint arXiv:1706.09847, 2017.
Eren, Ozkan, and Naci Mocan. "Emotional Judges and Unlucky Juveniles." American Economic Journal: Applied Economics 10, no. 3 (2018): 171–205.
Freeman, Jonathan B., Andrew M. Penner, Aliya Saperstein, Matthias Scheutz, and Nalini Ambady. "Looking the Part: Social Status Cues Shape Race Perception." PloS One 6, no. 9 (2011): e25107.
Friedman, Batya, and Helen Nissenbaum. "Bias in Computer Systems." ACM Transactions on Information Systems (TOIS) 14, no. 3 (1996): 330–47.
Frucci, Adam. "HP Face-Tracking Webcams Don't Recognize Black People." https://gizmodo.com/hp-face-tracking-webcams-dont-recognize-black-people-5431190, 2009.
Ge, Yanbo, Christopher R. Knittel, Don MacKenzie, and Stephen Zoepf. "Racial and Gender Discrimination in Transportation Network Companies." National Bureau of Economic Research, 2016.

Gillespie, Tarleton. Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions That Shape Social Media. Yale University Press, 2018.
Gillespie, Tarleton. "The Politics of 'Platforms'." New Media & Society 12, no. 3 (2010): 347–64.
Golebiewski, M., and D. Boyd. "Data Voids: Where Missing Data Can Easily Be Exploited." Data & Society 29 (2018).
Green, Lisa J. African American English: A Linguistic Introduction. Cambridge University Press, 2002.
Hannak, Aniko, Piotr Sapiezynski, Arash Molavi Kakhki, Balachander Krishnamurthy, David Lazer, Alan Mislove, and Christo Wilson. "Measuring Personalization of Web Search." In Proceedings of the 22nd International Conference on World Wide Web, 527–38. ACM, 2013.
Harvey, Adam, and Jules LaPlace. "MegaPixels: Origins, Ethics, and Privacy Implications of Publicly Available Face Recognition Image Datasets," 2019. https://megapixels.cc.
Hern, Alex. "Flickr Faces Complaints over 'Offensive' Auto-Tagging for Photos." The Guardian 20 (2015).
Huq, Aziz Z. "Racial Equity in Algorithmic Criminal Justice." Duke LJ 68 (2018): 1043.
Hutson, Jevan A., Jessie G. Taft, Solon Barocas, and Karen Levy. "Debiasing Desire: Addressing Bias & Discrimination on Intimate Platforms." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 73.
Kang, Sonia K., Katherine A. DeCelles, András Tilcsik, and Sora Jun. "Whitened Resumes: Race and Self-Presentation in the Labor Market." Administrative Science Quarterly 61, no. 3 (2016): 469–502.
Kay, Matthew, Cynthia Matuszek, and Sean A. Munson. "Unequal Representation and Gender Stereotypes in Image Search Results for Occupations." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3819–28. ACM, 2015.
Kiritchenko, Svetlana, and Saif M. Mohammad. "Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems." arXiv Preprint arXiv:1805.04508, 2018.
Klonick, Kate. "The New Governors: The People, Rules, and Processes Governing Online Speech." Harv. L. Rev. 131 (2017): 1598.
Kohler-Hausmann, Issa. "Eddie Murphy and the Dangers of Counterfactual Causal Thinking about Detecting Racial Discrimination." Nw. UL Rev. 113 (2018): 1163.
Lakens, Daniel. "Impossibly Hungry Judges." https://daniellakens.blogspot.com/2017/07/impossibly-hungry-judges.html, 2017.

Lakkaraju, Himabindu, Jon Kleinberg, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. "The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 275–84. ACM, 2017.
Lambrecht, Anja, and Catherine Tucker. "Algorithmic Bias? An Empirical Study of Apparent Gender-Based Discrimination in the Display of STEM Career Ads." Management Science, 2019.
Lee, Min Kyung, Daniel Kusbit, Evan Metsky, and Laura Dabbish. "Working with Machines: The Impact of Algorithmic and Data-Driven Management on Human Workers." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 1603–12. ACM, 2015.
Levy, Karen, and Solon Barocas. "Designing Against Discrimination in Online Markets." Berkeley Tech. LJ 32 (2017): 1183.
Lum, Kristian, and William Isaac. "To Predict and Serve?" Significance 13, no. 5 (2016): 14–19.
Martineau, Paris. "Cities Examine Proper – and Improper – Uses of Facial Recognition | WIRED." https://www.wired.com/story/cities-examine-proper-improper-facial-recognition, 2019.
McEntegart, Jane. "Kinect May Have Issues with Dark-Skinned Users | Tom's Guide." https://www.tomsguide.com/us/Microsoft-Kinect-Dark-Skin-Facial-Recognition,news-8638.html, 2010.
Mehrotra, Rishabh, Ashton Anderson, Fernando Diaz, Amit Sharma, Hanna Wallach, and Emine Yilmaz. "Auditing Search Engines for Differential Satisfaction Across Demographics." In Proceedings of the 26th International Conference on World Wide Web Companion, 626–33, 2017.
Muthukumar, Vidya, Tejaswini Pedapati, Nalini Ratha, Prasanna Sattigeri, Chai-Wah Wu, Brian Kingsbury, Abhishek Kumar, Samuel Thomas, Aleksandra Mojsilovic, and Kush R. Varshney. "Understanding Unequal Gender Classification Accuracy from Face Images." arXiv Preprint arXiv:1812.00099, 2018.
Neckerman, Kathryn M., and Joleen Kirschenman. "Hiring Strategies, Racial Bias, and Inner-City Workers." Social Problems 38, no. 4 (1991): 433–47.
Noble, Safiya Umoja. Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press, 2018.
Norton, Helen. "The Supreme Court's Post-Racial Turn Towards a Zero-Sum Understanding of Equality." Wm. & Mary L. Rev. 52 (2010): 197.
O'Toole, Alice J., Kenneth Deffenbacher, Hervé Abdi, and James C. Bartlett. "Simulating the 'Other-Race Effect' as a Problem in Perceptual Learning." Connection Science 3, no. 2 (1991): 163–78.
Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447–53.

Ojala, Markus, and Gemma C. Garriga. "Permutation Tests for Studying Classifier Performance." Journal of Machine Learning Research 11, no. Jun (2010): 1833–63.
Pager, Devah. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future." The Annals of the American Academy of Political and Social Science 609, no. 1 (2007): 104–33.
Pager, Devah, and Hana Shepherd. "The Sociology of Discrimination: Racial Discrimination in Employment, Housing, Credit, and Consumer Markets." Annu. Rev. Sociol. 34 (2008): 181–209.
Passi, Samir, and Solon Barocas. "Problem Formulation and Fairness." In Proceedings of the Conference on Fairness, Accountability, and Transparency, 39–48. ACM, 2019.
Phelps, Edmund S. "The Statistical Theory of Racism and Sexism." The American Economic Review 62, no. 4 (1972): 659–61.
Pischke, Jorn-Steffen. "Empirical Methods in Applied Economics: Lecture Notes," 2005.
Posselt, Julie R. Inside Graduate Admissions. Harvard University Press, 2016.
Quillian, Lincoln, Devah Pager, Ole Hexel, and Arnfinn H. Midtbøen. "Meta-Analysis of Field Experiments Shows No Change in Racial Discrimination in Hiring over Time." Proceedings of the National Academy of Sciences 114, no. 41 (2017): 10870–75.
Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Employment Screening: Evaluating Claims and Practices." arXiv Preprint arXiv:1906.09208, 2019.
Ramineni, Chaitanya, and David Williamson. "Understanding Mean Score Differences Between the e-rater Automated Scoring Engine and Humans for Demographically Based Groups in the GRE General Test." ETS Research Report Series 2018, no. 1 (2018): 1–31.
Rivera, Lauren A. Pedigree: How Elite Students Get Elite Jobs. Princeton University Press, 2016.
Robertson, Ronald E., Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. "Auditing Partisan Audience Bias Within Google Search." Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (2018): 148.
Roth, Alvin E. "The Origins, History, and Design of the Resident Match." JAMA 289, no. 7 (2003): 909–12.

Salganik, Matthew. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
Sandvig, C., K. Hamilton, K. Karahalios, and C. Langbort. "Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms." ICA Pre-Conference on Data and Discrimination, 2014.
Sap, Maarten, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. "The Risk of Racial Bias in Hate Speech Detection." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1668–78, 2019.
Shankar, Shreya, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. "No Classification Without Representation: Assessing Geodiversity Issues in Open Data Sets for the Developing World." In NIPS 2017 Workshop: Machine Learning for the Developing World, 2017.
Simoiu, Camelia, Sam Corbett-Davies, and Sharad Goel. "The Problem of Infra-Marginality in Outcome Tests for Discrimination." The Annals of Applied Statistics 11, no. 3 (2017): 1193–1216.
Simonite, Tom. "When It Comes to Gorillas, Google Photos Remains Blind." Wired, January 13, 2018.
Singer-Vine, Jeremy, Jennifer Valentino-DeVries, and Ashkan Soltani. "How the Journal Tested Prices and Deals Online." Wall Street Journal. http://blogs.wsj.com/digits/2012/12/23/how-the-journal-tested-prices-and-deals-online/, 2012.
Solaiman, Irene, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and Jasmine Wang. "Release Strategies and the Social Impacts of Language Models." arXiv Preprint arXiv:1908.09203, 2019.
Susser, Daniel, Beate Roessler, and Helen Nissenbaum. "Online Manipulation: Hidden Influences in a Digital World." Available at SSRN 3306006, 2018.
Tatman, Rachael. "Gender and Dialect Bias in YouTube's Automatic Captions." In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, 53–59. Valencia, Spain: Association for Computational Linguistics, 2017. https://doi.org/10.18653/v1/W17-1606.
Thebault-Spieker, Jacob, Loren Terveen, and Brent Hecht. "Toward a Geographic Understanding of the Sharing Economy: Systemic Biases in UberX and TaskRabbit." ACM Transactions on Computer-Human Interaction (TOCHI) 24, no. 3 (2017): 21.
Tjaden, Jasper Dag, Carsten Schwemmer, and Menusch Khadjavi. "Ride with Me – Ethnic Discrimination, Social Markets, and the Sharing Economy." European Sociological Review 34, no. 4 (2018): 418–32.
Tripodi, Francesca. "Searching for Alternative Facts: Analyzing Scriptural Inference in Conservative News Practices." Data & Society, 2018.
Turow, Joseph, Jennifer King, Chris Jay Hoofnagle, Amy Bleakley, and Michael Hennessy. "Americans Reject Tailored Advertising and Three Activities That Enable It." Available at SSRN 1478214, 2009.
Valentino-Devries, Jennifer, Jeremy Singer-Vine, and Ashkan Soltani. "Websites Vary Prices, Deals Based on Users' Information." Wall Street Journal 10 (2012): 60–68.
Venkatadri, Giridhari, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. "Investigating Sources of PII Used in Facebook's Targeted Advertising." Proceedings on Privacy Enhancing Technologies 2019, no. 1 (2019): 227–44.
Vries, Terrance de, Ishan Misra, Changhan Wang, and Laurens van der Maaten. "Does Object Recognition Work for Everyone?" In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 52–59, 2019.
Weinshall-Margel, Keren, and John Shapard. "Overlooked Factors in the Analysis of Parole Decisions." Proceedings of the National Academy of Sciences 108, no. 42 (2011): E833–33.
Wienk, Ronald E., Clifford E. Reid, John C. Simonson, and Frederick J. Eggers. "Measuring Racial Discrimination in American Housing Markets: The Housing Market Practices Survey," 1979.
Williams, Wendy M., and Stephen J. Ceci. "National Hiring Experiments Reveal 2:1 Faculty Preference for Women on STEM Tenure Track." Proceedings of the National Academy of Sciences 112, no. 17 (2015): 5360–65.
Wilson, Benjamin, Judy Hoffman, and Jamie Morgenstern. "Predictive Inequity in Object Detection." arXiv Preprint arXiv:1902.11097, 2019.
Wu, Tim. The Master Switch: The Rise and Fall of Information Empires. Vintage, 2010.
Yao, Sirui, and Bert Huang. "Beyond Parity: Fairness Objectives for Collaborative Filtering." In Advances in Neural Information Processing Systems, 2921–30, 2017.
