Measuring Model Biases in the Absence of Ground Truth
Osman Aka∗, Google, U.S.A.
Ken Burke∗, Google, U.S.A.
Alex Bäuerle†, Ulm University, Germany
Christina Greer, Google, U.S.A.
Margaret Mitchell‡, Ethical AI LLC, U.S.A.
ABSTRACT
The measurement of bias in machine learning often focuses on model performance across identity subgroups (such as man and woman) with respect to groundtruth labels [15]. However, these methods do not directly measure the associations that a model may have learned, for example between labels and identity subgroups. Further, measuring a model's bias requires a fully annotated evaluation dataset, which may not be easily available in practice.
We present an elegant mathematical solution that tackles both issues simultaneously, using image classification as a working example. By treating a classification model's predictions for a given image as a set of labels analogous to a "bag of words" [17], we rank the biases that a model has learned with respect to different identity labels. We use {man, woman} as a concrete example of an identity label set (although this set need not be binary), and present rankings for the labels that are most biased towards one identity or the other. We demonstrate how the statistical properties of different association metrics can lead to different rankings of the most "gender biased" labels, and conclude that normalized pointwise mutual information (nPMI) is most useful in practice. Finally, we announce an open-sourced nPMI visualization tool using TensorBoard.
CCS CONCEPTS
• General and reference → Measurement; Metrics; • Applied computing → Document analysis.
KEYWORDS
datasets, image tagging, fairness, bias, stereotypes, information extraction, model analysis
∗Equal contribution. †Work conducted during internship at Google. ‡Work conducted while author was at Google.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
AIES '21, May 19–21, 2021, Virtual Event, USA.
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8473-5/21/05. . . $15.00
https://doi.org/10.1145/3461702.3462557
ACM Reference Format:
Osman Aka, Ken Burke, Alex Bäuerle, Christina Greer, and Margaret Mitchell. 2021. Measuring Model Biases in the Absence of Ground Truth. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES '21), May 19–21, 2021, Virtual Event, USA. ACM, New York, NY, USA, 22 pages. https://doi.org/10.1145/3461702.3462557
1 INTRODUCTION
The impact of algorithmic bias in computer vision models has been well-documented [cf. 5, 24]. Examples of the negative fairness impacts of machine learning models include decreased pedestrian detection accuracy on darker skin tones [29], gender stereotyping in image captioning [6], and perceived racial identities impacting unrelated labels [27]. Many of these examples are directly related to currently deployed technology, which highlights the urgency of solving these fairness problems as adoption of these technologies continues to grow.
Many common metrics for quantifying fairness in machine learning models, such as Statistical Parity [11], Equality of Opportunity [15], and Predictive Parity [8], rely on datasets with a significant amount of ground truth annotations for each label under analysis. However, some of the most commonly used datasets in computer vision have relatively sparse ground truth [19]. One reason for this is the significant growth in the number of predicted labels. The benchmark challenge dataset PASCAL VOC, introduced in 2008, had only 20 categories [12], while less than 10 years later, the benchmark challenge dataset ImageNet provided hundreds of categories [22]. As systems have rapidly improved, it is now common to use the full set of ImageNet categories, which number more than 20,000 [10, 26].
While large label spaces offer a more fine-grained ontology of the visual world, they also increase the cost of implementing groundtruth-dependent fairness metrics. This concern is compounded by the common practice of collecting training datasets from multiple online resources [20]. This can lead to patterns where specific labels are omitted in a biased way, either through human bias (e.g., crowdsourcing where certain valid labels or tags are omitted systematically) or through algorithmic bias (e.g., selecting labels for human verification based on the predictions of another model [19]). If the ground truth annotations in a sparsely labelled dataset are potentially biased, then the premise of a fairness metric that "normalizes" model prediction patterns to groundtruth patterns may be incomplete. In light of this difficulty, we argue that it is important
to develop bias metrics that do not explicitly rely on "unbiased" ground truth labels.
In this work, we introduce a novel approach for measuring problematic model biases, focusing on the associations between model predictions directly. This has several advantages compared to common fairness approaches in the context of large label spaces, making it possible to identify biases after the regular practice of running model inference over a dataset. We study several different metrics that measure associations between labels, building upon work in Natural Language Processing [9] and information theory. We perform experiments on these association metrics using the Open Images Dataset [19], which has a large enough label space to illustrate how this framework can be generally applied, but we note that the focus of this paper is on introducing the relevant techniques, which do not require any specific dataset. We demonstrate that normalized pointwise mutual information (nPMI) is particularly useful for detecting the associations between model predictions and sensitive identity labels in this setting, and can uncover stereotype-aligned associations that the model has learned. This metric is particularly promising because:
• It requires no ground truth annotations.
• It provides a method for uncovering biases that the model itself has learned.
• It can be used to provide insight into per-label associations between model predictions and identity attributes.
• Biases for both low- and high-frequency labels can be detected and compared.
Finally, we announce an open-sourced visualization tool in TensorBoard that allows users to explore patterns of label bias in large datasets using the nPMI metric.
2 RELATED WORK
In 1990, Church and Hanks [9] introduced a novel approach to quantifying associations between words, based on mutual information [13, 23] and inspired by psycholinguistic work on word norms [28] that catalogue words that people closely associate. For example, subjects respond more quickly to the word nurse if it follows a highly associated word such as doctor [7, 14]. Church and Hanks' proposed metric applies mutual information to words using a pointwise approach, measuring co-occurrences of distinct word pairs rather than averaging over all words. This enables a quantification of the question, "How closely related are these words?", by measuring their co-occurrence rates relative to chance in the dataset of interest. In this case, the dataset of interest is a computer vision evaluation dataset, and the words are the labels that the model predicts.
This information theoretic approach to uncovering word associations became a prominent method in the field of Natural Language Processing, with applications ranging from measuring topic coherence [1] to collocation extraction [4] to great effect, although often requiring a good deal of preprocessing in order to incorporate details of a sentence's syntactic structure. However, without preprocessing, this method functions to simply measure word associations regardless of their order in sentences or relationship to one another, treating words as an unordered set of tokens (a so-called "bag-of-words") [16].
As we show, this simple approach can be newly applied to an emergent problem in the machine learning ethics space: the identification of problematic associations that an ML model has learned. This approach is comparable to measuring correlations, although the common correlation metric of Spearman Rank [25] operates on assumptions that are not suitable for this task, such as linearity and monotonicity. The related Kendall Rank Correlation [18] does not require such behavior, and we include comparisons with this approach.
Additionally, many potentially applicable metrics for this problem rely on simple counts of paired words, which do not take into consideration how the words are distributed with other words (e.g., sentence syntax or context); we elaborate on how this information can be formally incorporated into a bias metric in the Discussion and Future Work sections.
This work is motivated by recent research on fairness in machine learning (e.g., [15]), which at a high level seeks to define criteria that result in equal outcomes across different subpopulations. The focus in this paper is complementary to previous fairness work, honing in on ways to identify and quantify the specific problematic associations that a model may learn rather than providing an overall measurement of a model's unfairness. It also offers an alternative to fairness metrics that rely on comprehensive ground truth labelling, which is not always available for large datasets.
Open Images Dataset. The Open Images dataset we use in this work was chosen because it is open-sourced, with millions of diverse images and a large label space of thousands of visual concepts (see [19] for more details). Furthermore, the dataset comes with pre-computed labels generated by a non-trivial algorithm that combines machine-generated predictions and human verification; this allowed us to focus on analysis of label associations (rather than training a new classifier ourselves) and uncover the most common concepts related to sensitive characteristics, which in this dataset are man and woman.
We now turn to a formal description of the problem we seek tosolve.
3 PROBLEM DEFINITION
We have a dataset D which contains image examples and labels generated by an image classifier. This classifier takes one image example and predicts "Is label yi relevant to the image?" for each label in L = {y1, y2, ..., yn}.¹ We infer p(yi) and p(yi, yj) from D such that, for a given random image in D, p(yi) is the probability of having yi as a positive prediction and p(yi, yj) is the joint probability of having both yi and yj as positive predictions. We further assume that we have identity labels x1, x2, ..., xk ∈ L that belong to some sensitive identity group for which we wish to compute a bias metric (e.g., man and woman as labels for gender).² These identity labels may even be generated by the same process that generates the labels y, as in our case, where man, woman, and all other y are elements of L predicted by the classifier. We then measure bias with respect

¹For ease of notation, we use y and x rather than ŷ and x̂.
²For the rest of the paper, we focus on only two identity labels with notation x1 and x2 for simplicity; however, the identity labels need not be binary (or one-dimensional) in this way. The remainder of this work is straightforwardly extended to any number of identity labels by using the pairwise comparison of all identity labels, or by using a one-vs-all gap (e.g., A(x, y) − E[A(x′, y)], where x′ is the set of all other x).
Figure 1: Label ranking shifts for metric-to-metric comparison. Each point represents the rank of a single label when sorted by G(y | x1, x2, A(·)) (i.e., the label rank by gap). The coordinates represent the rankings by gap for different association metrics A(·) on the x and y axes. A highly-correlated plot along y = x would imply that the two metrics lead to very similar bias rankings according to G(·).
to these identity labels for all other labels y ∈ L. As the size of this label space |L| approaches tens of thousands and increases year by year for modern machine learning models, it is important to have a simple bias metric that can be computed and reasoned about at scale.
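The probabilities p(yi) and p(yi, yj) above are simple empirical frequencies over the model's positive predictions. A minimal sketch of how they can be estimated (function and variable names are our own, not from the paper):

```python
from collections import Counter
from itertools import combinations

def estimate_probabilities(predictions):
    """Estimate marginal and joint label probabilities from a dataset.

    `predictions` is a list of sets, one per image, each holding the
    labels the classifier predicted as positive for that image.
    Returns (p, p_joint): p[y] is the fraction of images with label y,
    and p_joint[(yi, yj)] the fraction with both labels.
    """
    n = len(predictions)
    marginal = Counter()
    joint = Counter()
    for labels in predictions:
        marginal.update(labels)
        # Count each unordered pair once per image; store both orderings
        # so lookups do not depend on argument order.
        for yi, yj in combinations(sorted(labels), 2):
            joint[(yi, yj)] += 1
            joint[(yj, yi)] += 1
    p = {y: c / n for y, c in marginal.items()}
    p_joint = {pair: c / n for pair, c in joint.items()}
    return p, p_joint
```

In practice one would stream over the inference output of the full evaluation set; the counting logic is unchanged.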
For ease of discussion in the rest of this paper, we denote any generic association metric A(xi, y), where xi is an identity label and y is any other label. We define an association gap for label y between two identity labels [x1, x2] with respect to the association metric A(xi, y) as G(y | x1, x2, A(·)) = A(x1, y) − A(x2, y). For example, the association between the labels woman and lipstick, A(woman, lipstick), can be compared to the association between the labels man and lipstick, A(man, lipstick). The difference between them is the association gap for the label lipstick:

G(lipstick | woman, man, A(·)) = A(woman, lipstick) − A(man, lipstick)

We use this association gap G(·) as a measurement of the "bias" or "skew" of a given label across sensitive identity subgroups.
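The gap G(·) is metric-agnostic: any association function A can be plugged in. A sketch with a deliberately naive choice of A (the numbers are illustrative, not from Open Images):

```python
def association_gap(A, y, x1, x2):
    """G(y | x1, x2, A) = A(x1, y) - A(x2, y).

    Positive values mean y skews towards x1, negative towards x2.
    """
    return A(x1, y) - A(x2, y)

# Toy association metric: raw co-occurrence probability between an
# identity label and a target label (hypothetical values).
P_JOINT = {("woman", "lipstick"): 0.04, ("man", "lipstick"): 0.01}

def cooccurrence(x, y):
    return P_JOINT[(x, y)]

gap = association_gap(cooccurrence, "lipstick", "woman", "man")
# 0.04 - 0.01 = 0.03: "lipstick" skews towards "woman" under this A.
```

The sections below swap in progressively better-behaved choices of A.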
The first objective we are interested in is the question "Is the prediction of label y biased towards either x1 or x2?". The second objective is then ranking the labels y by the size of this bias. If x1 and x2 both belong to the same identity group (e.g., man and woman are both labels for "gender"), then one may consider these measurements to approximate the gender bias.
We choose this gender example because of the abundance of these specific labels in the Open Images Dataset; however, this choice should not be interpreted to mean that gender representation is one-dimensional, nor that paired labels are required for the general approach. Nonetheless, this simplification is important because it allows us to demonstrate how a single per-label approximation of "bias" can be measured between paired labels, and we leave details of further expansions, including calculations across multiple sensitive identity labels, to the Discussion section.
3.1 Association Metrics
We consider several sets of related association metrics A(·) that can be applied given the constraints of the problem at hand: limited groundtruth, non-linearity, and limited assumptions about the underlying distribution of the data. All of these metrics share in common the general intuition of measuring how labels associate with each other in a dataset, but as we will demonstrate, they yield very different results due to differences in how they quantify this notion of "association".
We first consider fairness metrics as types of association metrics. One of the most common fairness metrics, Demographic (or Statistical) Parity [3, 11, 15], a quantification of the legal doctrine of Disparate Impact [2], can be applied directly for the given task constraints.³ Other metrics that are possible to adopt for this task include those based on Intersection-over-Union (IOU) measurements, and metrics based on correlation and statistical tests. We next describe these metrics in further detail and their relationship to the task at hand. In summary, we compare the following families of metrics:
• Fairness: Demographic Parity (DP).
• Entropy: Pointwise Mutual Information (PMI), Normalized Pointwise Mutual Information (nPMI).
• IOU: Sørensen-Dice Coefficient (SDC), Jaccard Index (JI).
• Correlation and Statistical Tests: Kendall Rank Correlation (KT), Log-Likelihood Ratio (LLR), t-test.
One of the important aspects of our problem setting is the counts of images with labels and label intersections, i.e., C(y), C(x1, y), and C(x2, y). These values can span a large range for different labels y in the label set L, depending on how common they are in the dataset. Some metrics are theoretically more sensitive to the frequencies/counts of the label y, as determined by their nonzero partial derivatives with respect to p(y) (see Table 2). However, as we further discuss in the Experiments and Discussion sections, our experiments indicate that in practice, metrics with non-zero partial derivatives are surprisingly better able to capture biases across a range of label frequencies than metrics with a zero partial derivative. Differential sensitivity to label frequency could be problematic in practice for two reasons:

(1) It would not be possible to compare G(y | x1, x2, A(·)) between different labels y with different marginal frequencies (counts) C(y). For example, the ideal bias metric should be able to capture gender bias equally well for both a common label like Long Hair and a rare label like Dido Flip, even though the first label is far more common than the second.

(2) The alternative, bucketizing labels by marginal frequency and setting distinct thresholds per bucket, would add significantly more hyperparameters and essentially amount to manual frequency-normalization.
The following sections contain basic explanations of these metrics for a general audience, with the running example of lipstick, man, and woman. We leave further mathematical analyses of the metrics to the Appendix. However, integral to the application of nPMI in this task is the choice of normalization factor, and so we discuss this in further detail in the Normalizing PMI subsection.
Demographic Parity

G(y | x1, x2, DP) = p(y | x1) − p(y | x2)

Demographic Parity focuses on differences between the conditional probability of y given x1 and x2: how likely lipstick is for man vs. woman.

³Other common fairness metrics, such as Equality of Opportunity, require both a model prediction and a groundtruth label, which makes the correct way to apply them to this task less clear. We leave this for further work.
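The DP gap follows directly from the estimated probabilities. A minimal sketch (the dictionaries and numbers are illustrative stand-ins, not Open Images values):

```python
def dp_gap(p, p_joint, y, x1, x2):
    """Demographic Parity gap: p(y | x1) - p(y | x2).

    `p` maps labels to marginal probabilities; `p_joint` maps
    (identity, label) pairs to joint probabilities.
    """
    return p_joint[(x1, y)] / p[x1] - p_joint[(x2, y)] / p[x2]

p = {"man": 0.5, "woman": 0.4}
p_joint = {("man", "lipstick"): 0.01, ("woman", "lipstick"): 0.04}

gap = dp_gap(p, p_joint, "lipstick", "woman", "man")
# p(lipstick | woman) = 0.1, p(lipstick | man) = 0.02, so the gap is 0.08.
```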
Entropy

G(y | x1, x2, PMI) = ln( p(x1, y) / (p(x1) p(y)) ) − ln( p(x2, y) / (p(x2) p(y)) )

Pointwise Mutual Information, adapted from information theory, is the main entropy-based metric studied here. In this form, we are analyzing the entropy difference between [x1, y] and [x2, y]. This essentially examines the dependence of, for example, the lipstick label distribution on two other label distributions: man and woman.
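A sketch of the PMI gap (illustrative probabilities, not Open Images values). Note that the ln p(y) terms cancel when the two PMI terms are subtracted, which is why Table 2 lists a zero partial derivative of this gap with respect to p(y):

```python
import math

def pmi(p, p_joint, x, y):
    # PMI(x, y) = ln( p(x, y) / (p(x) p(y)) )
    return math.log(p_joint[(x, y)] / (p[x] * p[y]))

def pmi_gap(p, p_joint, y, x1, x2):
    # The ln p(y) terms cancel in the subtraction, so the gap depends
    # only on p(x1, y)/p(x1) versus p(x2, y)/p(x2).
    return pmi(p, p_joint, x1, y) - pmi(p, p_joint, x2, y)

p = {"man": 0.5, "woman": 0.4, "lipstick": 0.05}
p_joint = {("woman", "lipstick"): 0.04, ("man", "lipstick"): 0.01}

gap = pmi_gap(p, p_joint, "lipstick", "woman", "man")
# ln(2) - ln(0.4) = ln(5): "lipstick" skews towards "woman".
```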
Remaining Metrics
We use the Sørensen-Dice Coefficient (SDC), which has the commonly-used F1-score as one of its variants; the Jaccard Index (JI), a common metric in Computer Vision also known as Intersection Over Union (IOU); the Log-Likelihood Ratio (LLR), a classic flexible comparison approach; Kendall Rank Correlation (KT), which is the particular correlation method that can be reasonably applied in this setting; and the t-test, a common statistical significance test that can be adapted to this setting [17]. Each of these metrics has different behaviours; however, we limit our mathematical explanation to the Appendix, as we found these metrics are either less useful in practice or behave similarly to other metrics in this use case.
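For reference, the two IOU-family metrics can be computed directly from image counts. This sketch uses the standard set-overlap definitions (the paper's exact formulations are in its Appendix), with c_xy the number of images carrying both labels:

```python
def sdc(c_xy, c_x, c_y):
    # Sørensen-Dice Coefficient: 2|X ∩ Y| / (|X| + |Y|).
    return 2.0 * c_xy / (c_x + c_y)

def jaccard(c_xy, c_x, c_y):
    # Jaccard Index / Intersection over Union: |X ∩ Y| / |X ∪ Y|.
    return c_xy / (c_x + c_y - c_xy)

# Hypothetical counts: 40 images with the identity label, 60 with the
# target label, 10 with both.
dice_value = sdc(10, 40, 60)      # 20 / 100 = 0.2
iou_value = jaccard(10, 40, 60)   # 10 / 90 ≈ 0.111
```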
3.2 Normalizing PMI
One major challenge in our problem setting is the sensitivity of these association metrics to the frequencies/counts of the labels in L. Some metrics are weighted more heavily towards common labels (i.e., large marginal counts C(y)) in spite of differences in their joint probabilities with identity labels (p(x1, y), p(x2, y)). The opposite is true for other metrics, which are weighted towards rare labels with smaller marginal frequencies. In order to compensate for this problem, several different normalization techniques have been applied to PMI [4, 21]. Common normalizations include:

• nPMI_y: Normalizing each term by p(y).
• nPMI_xy: Normalizing the two terms by p(x1, y) and p(x2, y), respectively.
• PMI²: Using p(x1, y)² and p(x2, y)² instead of p(x1, y) and p(x2, y), the normalization effects of which are further illustrated in the Appendix.

Each of these normalization methods has a different impact on the PMI metric. The main advantage of these normalizations is the ability to compare association gaps between label pairs [y1, y2] (e.g., comparing the gender skews of two labels like Long Hair and Dido Flip) even if p(y1) and p(y2) are very different. In the Experiments section, we discuss which of these is most effective and meaningful for the fairness and bias use case motivating this work.
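A sketch of the two nPMI gap variants. We assume the standard normalized-PMI convention [21], dividing each PMI term by the negative log of the normalizing probability; the numbers below are illustrative:

```python
import math

def npmi_y_gap(p, p_joint, y, x1, x2):
    # Each PMI term divided by -ln p(y): the nPMI_y normalization.
    norm = -math.log(p[y])
    t1 = math.log(p_joint[(x1, y)] / (p[x1] * p[y])) / norm
    t2 = math.log(p_joint[(x2, y)] / (p[x2] * p[y])) / norm
    return t1 - t2

def npmi_xy_gap(p, p_joint, y, x1, x2):
    # Each PMI term divided by its own -ln p(xi, y): the nPMI_xy
    # normalization, which bounds every term to [-1, 1].
    t1 = math.log(p_joint[(x1, y)] / (p[x1] * p[y])) / -math.log(p_joint[(x1, y)])
    t2 = math.log(p_joint[(x2, y)] / (p[x2] * p[y])) / -math.log(p_joint[(x2, y)])
    return t1 - t2

p = {"man": 0.5, "woman": 0.4, "lipstick": 0.05}
p_joint = {("woman", "lipstick"): 0.04, ("man", "lipstick"): 0.01}
```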
4 EXPERIMENTS
In order to compare these metrics, we use the Open Images Dataset (OID) [19] described above in the Related Work section. This dataset is useful for demonstrating realistic bias detection use cases because of the number of distinct labels that may be applied to the images (the dataset itself is annotated with nearly 20,000 labels). The label space is also diverse, including objects (cat, car, tree), materials (leather, basketball), moments/actions (blowing out candles, woman playing guitar), and scene descriptors and attributes (beautiful, smiling, forest). We seek to measure the gender bias as described above for each of these labels, and compare these bias values directly in order to determine which of the labels are most skewed towards associating with the labels man and woman.
In our experiments, we apply each of the association metrics A(xi, y) to the machine-generated predictions for identity labels x1 = man, x2 = woman, and all other labels y in the Open Images Dataset. We then compute the gap value G(·) between the identity labels for each label y and sort, providing a ranking of labels that are most biased towards x1 or x2. As we will show, sorting labels by this association gap creates different rankings for different association metrics. We examine which labels are ranked within the top 100 for the different association metrics in the Top 100 Labels by Metric Gaps subsection.
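The compute-gap-and-sort pipeline reduces to a few lines. A sketch (the toy gap values stand in for a real metric computed over OID):

```python
def rank_labels_by_gap(labels, gap_fn, x1, x2, top_k=100):
    """Rank non-identity labels by association gap, most skewed
    towards x1 first. gap_fn(y, x1, x2) is any gap function G(.)."""
    gaps = {y: gap_fn(y, x1, x2) for y in labels if y not in (x1, x2)}
    return sorted(gaps, key=gaps.get, reverse=True)[:top_k]

# Illustrative gap values standing in for a real metric over OID.
toy_gaps = {"lipstick": 0.8, "beard": -0.9, "tree": 0.0}

ranking = rank_labels_by_gap(
    list(toy_gaps) + ["man", "woman"],
    lambda y, x1, x2: toy_gaps[y],
    "woman", "man",
)
# Most woman-skewed first: ["lipstick", "tree", "beard"]
```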
4.1 Label Ranks
The first experiment we performed is to compute the association metrics and the gaps between them for different labels (A(x1, y), A(x2, y), and G(y | x1, x2, A(·))) over the OID dataset. We then sorted the labels by G(·) and studied how the ranking by gap differed between different association metrics A(·). Figure 1 shows examples of these metric-to-metric comparisons of label rankings by gap (all other metric comparisons can be found in the Appendix). We can see that a single label can have quite a different ranking depending on the metric.
When comparing metrics, we found that they grouped together in a few clusters based on similar ranking patterns when sorting by G(y | man, woman, A(·)). In the first cluster, pairwise comparisons between PMI, PMI², and LLR show linear relationships when sorting labels by G(·). Indeed, while some labels show modest changes in rank between these metrics, they share all of their top 100 labels, and > 99% of label pairs maintain the same relative ranking between metrics. By contrast, there are only 7 labels in common between the top 100 labels of PMI² and SDC. Due to the similar behavior of this cluster of metrics, we chose to focus on PMI as representative of these 3 metrics moving forward (see the Appendix for further details on these relationships).
Similar results were obtained for another cluster of metrics: DP, JI, and SDC. All pairwise comparisons generated linear plots, with about 70% of the top 100 labels shared in common between these metrics when sorted by G(·). Furthermore, about 95% of pairs of those overlapping labels maintained the same relative ranking between metrics. Similar to the PMI cluster, we chose to focus on Demographic Parity (DP) as the representative metric from its cluster, due to its mathematical simplicity and prominence in the fairness literature.
We next sought to understand how incremental changes to the counts of the labels, C(y), affect these association gap rankings in a real dataset (see Appendix Section E.2). To achieve this, we added a fake label to the real labels of OID, setting initial values for its
Metric     Min/Max C(y)       Min/Max C(x1, y)   Min/Max C(x2, y)
PMI        15 / 10,551        1 / 1,059          8 / 7,755
PMI²       15 / 10,551        1 / 1,059          8 / 7,755
LLR        15 / 10,551        1 / 1,059          8 / 7,755
DP         6,104 / 785,045    628 / 239,950      5,347 / 197,795
JI         4,158 / 562,445    399 / 144,185      3,359 / 183,132
SDC        2,906 / 562,445    139 / 144,185      2,563 / 183,132
nPMI_y     35 / 562,445       1 / 144,185        9 / 183,132
nPMI_xy    34 / 270,748       1 / 144,185        20 / 183,132
KT         6,104 / 785,045    628 / 207,723      5,347 / 183,132
t-test     960 / 562,445      72 / 144,185       870 / 183,132

Table 1: Minimum and maximum counts C(y) of the top 100 labels with the largest association gaps for each metric. Note that these min/max values for C(y) vary by orders of magnitude for different metrics. A larger version of this table is available in Appendix, Table 7.
counts and co-occurrences in the dataset, p(y), p(x1, y), and p(x2, y). Then we incrementally increased or decreased the count of label y, and measured whether its bias ranking in G would change relative to the other labels in OID. We repeated this procedure for different orders of magnitude of label count C(y) while maintaining the ratio p(x1, y)/p(x2, y) as constant.
This experiment allowed us to determine whether the theoretical sensitivities of each metric to label frequency p(y), as determined by the partial derivatives ∂A(·)/∂p(y) (see Table 2), would hold in the context of real-world data, where the underlying distribution of label frequencies may not be uniform. If certain subregions of the label distribution are relatively sparse, for example, then the gap ranking of our hypothetical label may not change even if ∂A(·)/∂C(y) ≠ 0. However, in practice we do not observe this behavior in the tested settings (see Appendix for plots of these experiments), where label rank moves with label count roughly as predicted by the partial derivatives in Table 2. In fact, we observed that metrics with larger partial derivatives for x1, x2, or y often led to a larger change in rank. For example, slightly increasing p(x1, y) when y always co-occurs with x1 (p(y) = p(x1, y)) affects ranking more for A = nPMI_y compared to A = PMI (see Appendix).
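The zero partial derivative of the PMI gap with respect to p(y) can also be checked numerically: scaling a label's counts up tenfold while holding the ratio p(x1, y)/p(x2, y) fixed leaves the PMI gap unchanged, while the nPMI_xy gap moves (illustrative probabilities, not OID values):

```python
import math

def pmi_gap(pj1, pj2, px1, px2):
    # G(y | x1, x2, PMI); the p(y) terms cancel in the subtraction.
    return math.log(pj1 / px1) - math.log(pj2 / px2)

def npmi_xy_gap(pj1, pj2, px1, px2, py):
    # Each PMI term normalized by its own -ln p(xi, y).
    t1 = math.log(pj1 / (px1 * py)) / -math.log(pj1)
    t2 = math.log(pj2 / (px2 * py)) / -math.log(pj2)
    return t1 - t2

px1, px2 = 0.5, 0.4
# The same label at two overall frequencies, ten times apart, with the
# ratio p(x1, y) / p(x2, y) = 4 held constant.
rare = dict(pj1=0.004, pj2=0.001, py=0.005)
common = dict(pj1=0.04, pj2=0.01, py=0.05)
```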
4.2 Top 100 Labels by Metric Gaps
When applying these metrics to fairness and bias use cases, model users may be most interested in surfacing the labels with the largest association gaps. If one filters results to a "top K" label set, then the normalization chosen could lead to vastly different sets of labels
Figure 2: Top 100 count distributions for PMI, nPMI_xy, and KT. The distribution of C(y), C(x1, y), and C(x2, y) for the top 100 labels sorted by gap G(y | x1, x2, A(·)) for PMI, nPMI_xy, and KT. The x-axis shows logarithmic-scaled bins and the y-axis the number of labels whose count values fall in that bin.
∂DP:      ∂/∂p(y) = 0;  ∂/∂p(x1,y) = 1/p(x1);  ∂/∂p(x2,y) = −1/p(x2)
∂PMI:     ∂/∂p(y) = 0;  ∂/∂p(x1,y) = 1/p(x1,y);  ∂/∂p(x2,y) = −1/p(x2,y)
∂nPMI_y:  ∂/∂p(y) = ln(p(x2|y)/p(x1|y)) / (ln²(p(y)) p(y));  ∂/∂p(x1,y) = 1/(ln(p(y)) p(x1,y));  ∂/∂p(x2,y) = −1/(ln(p(y)) p(x2,y))
∂nPMI_xy: ∂/∂p(y) = 1/(ln(p(x1,y)) p(y)) − 1/(ln(p(x2,y)) p(y));  ∂/∂p(x1,y) = (ln(p(y)) − ln(p(x1))) / (ln²(p(x1,y)) p(x1,y));  ∂/∂p(x2,y) = (ln(p(x2)) − ln(p(y))) / (ln²(p(x2,y)) p(x2,y))
∂PMI²:    ∂/∂p(y) = 0;  ∂/∂p(x1,y) = 2/p(x1,y);  ∂/∂p(x2,y) = −2/p(x2,y)
∂SDC:     ∂/∂p(y): see Appendix C;  ∂/∂p(x1,y) = 1/(p(x1) + p(y));  ∂/∂p(x2,y) = −1/(p(x2) + p(y))
∂JI:      ∂/∂p(y): see Appendix C;  ∂/∂p(x1,y) = (p(x1) + p(y)) / (p(x1) + p(y) − p(x1,y))²;  ∂/∂p(x2,y) = −(p(x2) + p(y)) / (p(x2) + p(y) − p(x2,y))²
∂LLR:     ∂/∂p(y) = 0;  ∂/∂p(x1,y) = 1/p(x1,y);  ∂/∂p(x2,y) = −1/p(x2,y)
∂KT:      ∂/∂p(y): see Appendix C;  ∂/∂p(x1,y) = (2 − 4/π) / √((p(x1) − p(x1)²)(p(y) − p(y)²));  ∂/∂p(x2,y) = (4/π − 2) / √((p(x2) − p(x2)²)(p(y) − p(y)²))
∂t-test:  ∂/∂p(y) = (√p(x2) − √p(x1)) / (2√p(y));  ∂/∂p(x1,y) = 1/√(p(x1) p(y));  ∂/∂p(x2,y) = −1/√(p(x2) p(y))

Table 2: Metric orientations. This table shows the partial derivatives of the metrics with respect to p(y), p(x1, y), and p(x2, y). This provides a quantification of the theoretical sensitivity of the metrics for different probability values of p(y), p(x1, y), and p(x2, y). We see similar experimental results for the metrics with similar orientations (see Appendix).
(e.g., as mentioned earlier, PMI² and SDC only shared 7 labels in their top 100 sets for OID).
To further analyze this issue, we calculated simple values for each metric's top 100 labels sorted by G(·): minimum and maximum values of C(y), C(x1, y), and C(x2, y), as shown in Table 1. The most salient point is that the clusters of metrics from the Label Ranks subsection also appear to hold in this analysis; PMI, PMI², and LLR have low C(y), C(x1, y), and C(x2, y) ranges, whereas DP, JI, and SDC have relatively high ranges. Another straightforward observation is that the nPMI_y and nPMI_xy ranges are much broader than the first two clusters, and include the other metrics' ranges, especially for the joint co-occurrences C(x1, y) and C(x2, y).
To demonstrate this point more clearly, we plot the distributions of these counts for PMI, nPMI_xy, and KT skews in Figure 2 (all other combinations can be found in the Appendix). These three metric distributions show that gap calculations based on PMI (blue distribution) exclusively rank labels with low counts in the top 100 most skewed labels, whereas KT calculations (green distribution) almost exclusively rank labels with much higher counts. The exceptions are nPMI_y and nPMI_xy (orange distribution); these two metrics are capable of capturing labels across a range of marginal frequencies. In other words, ranking labels by PMI gaps is likely to highlight rare labels, ranking by KT will highlight common labels, and ranking by nPMI_xy will highlight both rare and common labels.
An example of this relationship between association metric choice and label commonality can be seen in Table 3 (note that here we use Demographic Parity instead of KT because they behave similarly in this respect). In this table, we show the "Top 15 labels" most heavily skewed towards woman relative to man according to DP, unnormalized PMI, and nPMI_xy. DP almost exclusively highlights common labels predicted for over 100,000 images in the dataset (e.g., Happiness and Fashion), whereas PMI largely highlights rarer labels predicted for fewer than 1,000 images (e.g., Treggings and Boho-chic). By contrast, nPMI_xy highlights both common and rare labels (e.g., Long Hair as well as Boho-chic).
5 DISCUSSION
In the previous section, we first showed that some association metrics behave very similarly when ranking labels from the Open Images Dataset (OID). We then showed that the mathematical orientations and sensitivities of these metrics align with experimental results from OID. Finally, we showed that the different normalizations affect whether labels with high or low marginal frequencies are likely to be detected as having a significant bias according to G(y | x1, x2, A(·)) in this dataset. We arrive at the conclusion that the nPMI metrics are preferable to other commonly used association metrics in the problem setting of detecting biases without groundtruth labels.
What is the intuition behind this particular association metric as a bias metric? All of the studied entropy-based metrics (gaps in nPMI as well as PMI) approximately correspond to whether one identity label x1 co-occurs with the target label y more often than another identity label x2, relative to chance levels. This chance-level normalization is important because even completely unbiased labels would still co-occur at some baseline rate by chance alone.
The further normalization of PMI by either the marginal or joint probability (nPMI_y and nPMI_xy, respectively) takes this one step further in practice by surfacing labels with larger marginal counts at higher ranks in G(y | x1, x2, nPMI) alongside labels with smaller marginal counts. This is a somewhat surprising result, because in theory PMI should already be independent of the marginal frequency p(y) (because ∂PMI/∂p(y) = 0), whereas this derivative for nPMI is non-zero. When we examined this pattern in practice, we found that labels with smaller counts can achieve very large p(x1, y)/p(x2, y) ratios (and therefore their bias rankings can get very high) merely by reducing the denominator to a single image example. PMI is unable to compensate for this noise, whereas the normalizations we use for nPMI allow us to capture a significant number of common labels in the top 100 labels by G(y | x1, x2, nPMI) in spite of this pattern. This result is indicated by the ranges in Table 1 and Figure 2, as well as the set of both common and rare labels for nPMI in Table 3.
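This effect can be reproduced with a small numerical example: a rare label whose x2 co-occurrence is a single image outranks a heavily skewed common label under PMI, while nPMI_xy reverses the order. All counts and marginals here are invented for illustration:

```python
import math

N = 1_000_000          # hypothetical dataset size
p_x1 = p_x2 = 0.2      # identity label marginals, kept equal for clarity

def pmi_gap(c1, c2):
    # PMI gap from co-occurrence counts; with equal identity marginals
    # this reduces to ln( C(x1, y) / C(x2, y) ).
    return math.log((c1 / N) / p_x1) - math.log((c2 / N) / p_x2)

def npmi_xy_gap(c1, c2, cy):
    # Each PMI term normalized by its own -ln p(xi, y).
    pj1, pj2, py = c1 / N, c2 / N, cy / N
    t1 = math.log(pj1 / (p_x1 * py)) / -math.log(pj1)
    t2 = math.log(pj2 / (p_x2 * py)) / -math.log(pj2)
    return t1 - t2

# Rare label: 16 images total, only one co-occurring with x2.
rare = (15, 1, 16)
# Common label: 50,000 images, skewed 4:1 towards x1.
common = (40_000, 10_000, 50_000)
```

Under PMI the rare label's ratio of 15:1 dominates; under nPMI_xy the common, consistently skewed label ranks higher.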
Indeed, if the evaluation set is properly designed to match the distribution of use cases of a classification model "in the wild", then we argue that more common labels with a smaller p(x1,y)/p(x2,y) ratio are still critical to audit for biases. Normalization strategies must be titrated carefully to balance this simple ratio of joint probabilities against the label's rarity in the dataset.
An alternative solution to this problem could be to bucket labels by their marginal frequency. We argue this is suboptimal for two reasons. First, determining even a single threshold hyperparameter is a painful process when defining fairness constraints; systems that prevent models from being published if their fairness discrepancies exceed a threshold would then be required to titrate this threshold for every bucket. Second, bucketing labels by frequency is essentially a manual and discontinuous form of normalization; we argue that building normalization into the metric directly is a more elegant solution.
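A toy sketch of this argument (the threshold, counts, and function names are hypothetical): hard frequency bucketing is a step function of the label count, while the normalization built into nPMI_xy varies smoothly across the same boundary:

```python
import math

def npmi_xy_gap(n_x1y, n_x2y, n_x1, n_x2, n_y, n):
    """nPMI_xy gap computed from raw co-occurrence counts (illustrative)."""
    def term(n_xy, n_x):
        p_xy, p_x, p_y = n_xy / n, n_x / n, n_y / n
        return math.log(p_xy / (p_x * p_y)) / math.log(p_xy)
    return term(n_x1y, n_x1) - term(n_x2y, n_x2)

def bucket(n_y, threshold=100):
    """Hard frequency bucketing: a step function of the label count."""
    return "rare" if n_y < threshold else "common"

# Two labels straddling the (hypothetical) threshold land in different
# buckets, each requiring its own titrated fairness threshold...
print(bucket(99), bucket(101))  # rare common
# ...while their nPMI_xy gaps barely differ: the normalization is smooth.
n = 1_000_000
print(npmi_xy_gap(33, 11, 50_000, 50_000, 99, n))
print(npmi_xy_gap(34, 11, 50_000, 50_000, 101, n))
```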
Finally, to enable detailed investigation of model predictions, we implemented and open-sourced a tool that visualizes nPMI metrics as a TensorBoard plugin for developers⁴ (see Figure 3). It allows users to investigate discrepancies between two or more identity labels and their pairwise comparisons. Users can visualize probabilities, image counts, and sample and label distributions, and can filter, flag, and download these results.
⁴https://github.com/tensorflow/tensorboard/tree/master/tensorboard/plugins/npmi
Table 3: Top 15 labels skewed towards woman, ranking by the gap of label y between woman (x1) and man (x2) according to G(y|x1, x2, A(·)). Identity label names are omitted.

Rank | DP: Label y / Count | PMI: Label y / Count | nPMI_xy: Label y / Count
0 | (omitted) / 265,853 | Dido Flip / 140 | (omitted) / 610
1 | (omitted) / 270,748 | Webcam Model / 184 | Dido Flip / 140
2 | (omitted) / 221,017 | Boho-chic / 151 | (omitted) / 2,906
3 | (omitted) / 166,186 | (omitted) / 610 | Eye Liner / 3,144
4 | Beauty / 562,445 | Treggings / 126 | Long Hair / 56,832
5 | Long Hair / 56,832 | Mascara / 539 | Mascara / 539
6 | Happiness / 117,562 | (omitted) / 145 | Lipstick / 8,688
7 | Hairstyle / 145,151 | Lace Wig / 70 | Step Cutting / 6,104
8 | Smile / 144,694 | Eyelash Extension / 1,167 | Model / 10,551
9 | Fashion / 238,100 | Bohemian Style / 460 | Eye Shadow / 1,235
10 | Fashion Designer / 101,854 | (omitted) / 78 | Photo Shoot / 8,775
11 | Iris / 120,411 | Gravure Idole / 200 | Eyelash Extension / 1,167
12 | Skin / 202,360 | (omitted) / 165 | Boho-chic / 460
13 | Textile / 231,628 | Eye Shadow / 1,235 | Webcam Model / 151
14 | Adolescence / 221,940 | (omitted) / 156 | Bohemian Style / 184
Figure 3: Open-sourced tool we implemented for bias investigation. In A, annotations are filtered by an association metric, in this case the nPMI difference (nPMI_dif) between male and female. In B, the distribution of association values. In C, the filtered annotations view. In D, different models or datasets can be selected for display. In E, a parallel coordinates visualization of selected annotations to assess where association differences come from.
6 CONCLUSION AND FUTURE WORK
In this paper we have described association metrics that can measure gaps (biases or skews) towards specific labels in the large label space of current computer vision classification models. These metrics do not require ground-truth annotations, which allows them to be applied in contexts where it is difficult to apply standard fairness metrics such as Equality of Opportunity [15]. In our experiments, Normalized Pointwise Mutual Information (nPMI) is a particularly useful metric for measuring specific biases in a real-world dataset with a large label space, e.g., the Open Images Dataset.
This paper also raises several questions for future work. The first is whether nPMI is similarly useful as a bias metric in small label spaces (e.g., credit and loan applications). Second, if we had exhaustive ground-truth labels for such a dataset, how would the sensitivity of nPMI in detecting biases compare to ground-truth-dependent fairness metrics? Finally, in this work we treated the labels predicted for an image as a flat set. However, just as sentences have rich syntactic structure beyond the "bag-of-words" model in NLP, images also have rich structure and relationships between objects that are not captured by mere rates of binary co-occurrence. This opens up the possibility that within-image label relationships could be leveraged to better understand how concepts are associated in a large computer vision dataset. We leave these questions for future work.
REFERENCES
[1] Nikolaos Aletras and Mark Stevenson. 2013. Evaluating Topic Coherence Using Distributional Semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers. Association for Computational Linguistics, Potsdam, Germany, 13–22. https://www.aclweb.org/anthology/W13-0102
[2] Solon Barocas and Andrew D. Selbst. 2014. Big Data's Disparate Impact. SSRN eLibrary (2014).
[3] Richard Berk. 2016. A primer on fairness in criminal justice risk assessments.The Criminologist 41, 6 (2016), 6โ9.
[4] G. Bouma. 2009. Normalized (pointwise) mutual information in collocationextraction.
[5] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional AccuracyDisparities in Commercial Gender Classification (Proceedings of Machine LearningResearch, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, New York,NY, USA, 77โ91. http://proceedings.mlr.press/v81/buolamwini18a.html
[6] Kaylee Burns, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. 2018.Women also Snowboard: Overcoming Bias in Captioning Models. CoRRabs/1803.09797 (2018). arXiv:1803.09797 http://arxiv.org/abs/1803.09797
[7] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Seman-tics derived automatically from language corpora contain human-like biases.Science 356, 6334 (2017), 183โ186. https://doi.org/10.1126/science.aal4230arXiv:https://science.sciencemag.org/content/356/6334/183.full.pdf
[8] Alexandra Chouldechova. 2016. Fair prediction with disparate impact: A studyof bias in recidivism prediction instruments. arXiv:1610.07524 [stat.AP]
[9] Kenneth Ward Church and Patrick Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16, 1 (1990), 22–29. https://www.aclweb.org/anthology/J90-1003
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
[11] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2011. Fairness Through Awareness. CoRR abs/1104.3913 (2011). arXiv:1104.3913 http://arxiv.org/abs/1104.3913
[12] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams,John Winn, and Andrew Zisserman. 2015. The Pascal Visual Object ClassesChallenge: A Retrospective. International Journal of Computer Vision 111, 1 (Jan.2015), 98โ136. https://doi.org/10.1007/s11263-014-0733-5
[13] Robert M Fano. 1961. Transmission of information: A statistical theory of com-munications. American Journal of Physics 29 (1961), 793โ794.
[14] A. G. Greenwald, D. E. McGhee, and J. L. Schwartz. 1998. Measuring individ-ual differences in implicit cognition: the implicit association test. Journal ofpersonality and social psychology 74 (1998). Issue 6.
[15] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of Opportunity inSupervised Learning. arXiv:1610.02413 [cs.LG]
[16] Zellig Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146โ162.https://doi.org/10.1007/978-94-009-8467-7_1
[17] Dan Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Prentice Hall, Upper Saddle River, N.J.
[18] M. G. Kendall. 1938. A New Measure of Rank Correlation. Biometrika 30,1-2 (06 1938), 81โ93. arXiv:https://academic.oup.com/biomet/article-pdf/30/1-2/81/423380/30-1-2-81.pdf https://doi.org/10.1093/biomet/30.1-2.81
[19] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin,Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, andVittorio Ferrari. 2018. The Open Images Dataset V4: Unified image classification,object detection, and visual relationship detection at scale. CoRR abs/1811.00982(2018). arXiv:1811.00982 http://arxiv.org/abs/1811.00982
[20] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312 (2014). arXiv:1405.0312 http://arxiv.org/abs/1405.0312
[21] François Role and Mohamed Nadif. 2011. Handling the Impact of Low Frequency Events on Co-Occurrence Based Measures of Word Similarity - A Case Study of Pointwise Mutual Information. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - KDIR (IC3K 2011). SciTePress, 218–223.
[22] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
[23] Claude E. Shannon. 1948. A mathematical theory of communication. Bell Syst.Tech. J. 27, 3 (1948), 379โ423. http://dblp.uni-trier.de/db/journals/bstj/bstj27.html#Shannon48
[24] Jacob Snow. 2018. Amazonโs Face Recognition Falsely Matched 28 Members ofCongress With Mugshots. (2018).
[25] C. Spearman. 1904. The Proof and Measurement of Association Between TwoThings. American Journal of Psychology 15 (1904), 88โ103.
[26] Stanford Vision Lab. 2020. ImageNet. http://image-net.org/explore (2020). Accessed 6 Oct. 2020.
[27] Pierre Stock and Moustapha Cisse. 2018. Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases. In Proceedings of the European Conference on Computer Vision (ECCV). 498–512.
[28] M. P. Toglia and W. F. Battig. 1978. Handbook of semantic word norms. LawrenceErlbaum.
[29] Benjamin Wilson, Judy Hoffman, and Jamie Morgenstern. 2019. PredictiveInequity in Object Detection. CoRR abs/1902.11097 (2019). arXiv:1902.11097http://arxiv.org/abs/1902.11097
APPENDIX
The following sections break down all of the different metrics we examined. In Section A, Gap Calculations, we present the calculations of the label association metric gaps, a measurement of the skew with respect to the given sensitive labels x1 and x2, such as woman (x1) and man (x2). In Section B, Mathematical Reductions, we reduce the entropy-based metrics to simpler forms. In Section C, Metric Orientations, we present the derivations used to calculate their orientations, which represent the theoretical sensitivities of each metric to various marginal or joint label probabilities.
A GAP CALCULATIONS
• Demographic Parity (DP):
  DP_gap = p(y|x1) - p(y|x2)
• Sørensen-Dice Coefficient (SDC):
  SDC_gap = p(x1,y)/(p(x1) + p(y)) - p(x2,y)/(p(x2) + p(y))
• Jaccard Index (JI):
  JI_gap = p(x1,y)/(p(x1) + p(y) - p(x1,y)) - p(x2,y)/(p(x2) + p(y) - p(x2,y))
• Log-Likelihood Ratio (LLR):
  LLR_gap = ln(p(x1|y)) - ln(p(x2|y))
• Pointwise Mutual Information (PMI):
  PMI_gap = ln(p(x1,y)/(p(x1)p(y))) - ln(p(x2,y)/(p(x2)p(y)))
• Normalized Pointwise Mutual Information, p(y) normalization (nPMI_y):
  nPMI_y_gap = ln(p(x1,y)/(p(x1)p(y)))/ln(p(y)) - ln(p(x2,y)/(p(x2)p(y)))/ln(p(y))
• Normalized Pointwise Mutual Information, p(x,y) normalization (nPMI_xy):
  nPMI_xy_gap = ln(p(x1,y)/(p(x1)p(y)))/ln(p(x1,y)) - ln(p(x2,y)/(p(x2)p(y)))/ln(p(x2,y))
• Squared Pointwise Mutual Information (PMI^2):
  PMI^2_gap = ln(p(x1,y)^2/(p(x1)p(y))) - ln(p(x2,y)^2/(p(x2)p(y)))
• Kendall Rank Correlation (τ_b):
  τ_b = (n_c - n_d) / sqrt((n_0 - n_1)(n_0 - n_2))
  τ_b_gap = τ_b(r1 = x1, r2 = y) - τ_b(r1 = x2, r2 = y)
• t-test:
  t-test_gap = (p(x1,y) - p(x1)p(y))/sqrt(p(x1)p(y)) - (p(x2,y) - p(x2)p(y))/sqrt(p(x2)p(y))
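As a reference sketch, the gap calculations above can be written directly in terms of estimated probabilities (the dictionary keys and function names are ours, not the released tool's API):

```python
import math

def association_gaps(p):
    """Gap of label y between identity labels x1 and x2 for each metric.

    `p` maps illustrative keys ("x1", "y", "x1,y", ...) to probabilities
    estimated from co-occurrence counts.
    """
    ln = math.log
    pmi_x1 = ln(p["x1,y"] / (p["x1"] * p["y"]))
    pmi_x2 = ln(p["x2,y"] / (p["x2"] * p["y"]))
    return {
        "DP": p["x1,y"] / p["x1"] - p["x2,y"] / p["x2"],
        "SDC": (p["x1,y"] / (p["x1"] + p["y"])
                - p["x2,y"] / (p["x2"] + p["y"])),
        "JI": (p["x1,y"] / (p["x1"] + p["y"] - p["x1,y"])
               - p["x2,y"] / (p["x2"] + p["y"] - p["x2,y"])),
        "LLR": ln(p["x1,y"] / p["y"]) - ln(p["x2,y"] / p["y"]),
        "PMI": pmi_x1 - pmi_x2,
        "nPMI_y": pmi_x1 / ln(p["y"]) - pmi_x2 / ln(p["y"]),
        "nPMI_xy": pmi_x1 / ln(p["x1,y"]) - pmi_x2 / ln(p["x2,y"]),
        "PMI^2": (ln(p["x1,y"] ** 2 / (p["x1"] * p["y"]))
                  - ln(p["x2,y"] ** 2 / (p["x2"] * p["y"]))),
    }

# Made-up probabilities for illustration only.
gaps = association_gaps(
    {"x1": 0.2, "x2": 0.1, "y": 0.05, "x1,y": 0.02, "x2,y": 0.005})
print(round(gaps["DP"], 3), round(gaps["PMI"], 3))  # 0.05 0.693
```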
B MATHEMATICAL REDUCTIONS
• PMI:
  PMI_gap = ln(p(x1,y)/(p(x1)p(y))) - ln(p(x2,y)/(p(x2)p(y)))  (1)
          = ln((p(x1,y)/(p(x1)p(y))) · (p(x2)p(y)/p(x2,y)))  (2)
          = ln((p(x1,y)/p(x1)) · (p(x2)/p(x2,y)))  (3)
          = ln(p(y|x1)/p(y|x2))  (4)
          = ln(p(y|x1)) - ln(p(y|x2))  (5)
• nPMI, p(y) normalization:
  nPMI_gap = ln(p(x1,y)/(p(x1)p(y)))/ln(p(y)) - ln(p(x2,y)/(p(x2)p(y)))/ln(p(y))  (6)
           = (ln(p(y|x1)) - ln(p(y|x2)))/ln(p(y))  (7)
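These reductions are easy to spot-check numerically; a small sketch with arbitrary, illustrative probabilities (not OID values):

```python
import math

# Arbitrary consistent probabilities for illustration.
p_x1, p_x2, p_y = 0.2, 0.1, 0.05
p_x1y, p_x2y = 0.02, 0.004

# Eq. (1) vs. Eq. (5)
pmi_def = (math.log(p_x1y / (p_x1 * p_y))
           - math.log(p_x2y / (p_x2 * p_y)))
pmi_red = math.log(p_x1y / p_x1) - math.log(p_x2y / p_x2)  # ln p(y|x1) - ln p(y|x2)

# Eq. (6) vs. Eq. (7)
npmi_y_def = (math.log(p_x1y / (p_x1 * p_y)) / math.log(p_y)
              - math.log(p_x2y / (p_x2 * p_y)) / math.log(p_y))
npmi_y_red = pmi_red / math.log(p_y)

assert abs(pmi_def - pmi_red) < 1e-9
assert abs(npmi_y_def - npmi_y_red) < 1e-9
```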
• nPMI, p(x,y) normalization:
  nPMI_gap = ln(p(x1,y)/(p(x1)p(y)))/ln(p(x1,y)) - ln(p(x2,y)/(p(x2)p(y)))/ln(p(x2,y))  (8)
           = ln((p(x1,y)/(p(x1)p(y)))^(1/ln(p(x1,y)))) - ln((p(x2,y)/(p(x2)p(y)))^(1/ln(p(x2,y))))  (9)
           = ln( (p(x1,y)/(p(x1)p(y)))^(1/ln(p(x1,y))) / (p(x2,y)/(p(x2)p(y)))^(1/ln(p(x2,y))) )  (10)
           = ln( (p(x1,y)/p(x1))^(1/ln(p(x1,y))) / (p(x2,y)/p(x2))^(1/ln(p(x2,y))) · p(y)^(1/ln(p(x2,y)) - 1/ln(p(x1,y))) )  (11)
           = ln( (p(x1,y)/p(x1))^(1/ln(p(x1,y))) / (p(x2,y)/p(x2))^(1/ln(p(x2,y))) ) + ln( p(y)^(1/ln(p(x2,y)) - 1/ln(p(x1,y))) )  (12)
           = ln( p(y|x1)^(1/ln(p(x1,y))) / p(y|x2)^(1/ln(p(x2,y))) ) + ln(p(y)) · (1/ln(p(x2,y)) - 1/ln(p(x1,y)))  (13)
           = ln(p(y|x1))/ln(p(x1,y)) - ln(p(y|x2))/ln(p(x2,y)) + ln(p(y)) · ln(p(x1,y)/p(x2,y)) / (ln(p(x1,y)) ln(p(x2,y)))  (14)
           = ln(p(x2))/ln(p(x2,y)) - ln(p(x1))/ln(p(x1,y)) + ln(p(y)) · ln(p(x1,y)/p(x2,y)) / (ln(p(x1,y)) ln(p(x2,y)))  (15)
           = ln(p(x2))/ln(p(x2,y)) - ln(p(x1))/ln(p(x1,y)) + ln(p(y))/ln(p(x2,y)) - ln(p(y))/ln(p(x1,y))  (16)
• PMI^2:
  PMI^2_gap = ln(p(x1,y)^2/(p(x1)p(y))) - ln(p(x2,y)^2/(p(x2)p(y)))  (18)
            = ln((p(x1,y)^2/(p(x1)p(y))) · (p(x2)p(y)/p(x2,y)^2))  (19)
            = ln((p(x1,y)^2/p(x1)) · (p(x2)/p(x2,y)^2))  (20)
            = ln( p(y|x1)p(x1,y) / (p(y|x2)p(x2,y)) )  (21)
            = ln(p(y|x1)) - ln(p(y|x2)) + ln(p(x1,y)) - ln(p(x2,y))  (22)
• Kendall Rank Correlation (τ_b):
  Formal definition:
  τ_b = (n_c - n_d) / sqrt((n_0 - n_1)(n_0 - n_2))
  where
  - n_0 = n(n - 1)/2
  - n_1 = Σ_i t_i(t_i - 1)/2
  - n_2 = Σ_j u_j(u_j - 1)/2
  - n_c is the number of concordant pairs
  - n_d is the number of discordant pairs
  - t_i is the number of tied values in the i-th group of ties for the first quantity
  - u_j is the number of tied values in the j-th group of ties for the second quantity

  New notation for our use case:
  τ_b(r1 = x1, r2 = y) = (n_c(r1 = x1, r2 = y) - n_d(r1 = x1, r2 = y)) / sqrt((C(n, 2) - n_t(r = x1))(C(n, 2) - n_t(r = y)))
  where
  - n_c(r1, r2) = C(N_{r1=0, r2=0}, 2) + C(N_{r1=1, r2=1}, 2)
  - n_d(r1, r2) = C(N_{r1=0, r2=1}, 2) + C(N_{r1=1, r2=0}, 2)
  - n_t(r) = C(N_{r=0}, 2) + C(N_{r=1}, 2)
  - C(m, 2) = m(m - 1)/2 is the number of pairs among m examples
  - N_{condition} is the number of examples which satisfy the conditions
  - n is the number of examples in the data set

  τ_b(r1 = x, r2 = y) = [n^2(1 - 2p(x) - 2p(y) + 2p(x,y) + 2p(x)p(y)) + n(2p(x) + 2p(y) - 4p(x,y) - 1)] / [n^2 sqrt((p(x) - p(x)^2)(p(y) - p(y)^2))]  (23)

  τ_b_gap = τ_b(r1 = x1, r2 = y) - τ_b(r1 = x2, r2 = y)  (24)
C METRIC ORIENTATIONS
In this section, we derive the orientations of each metric, which can be thought of as the sensitivity of the metric to the given label probability. Table 4 provides a summary, followed by the longer derivations.
Table 4: Metric Orientations

Metric | ∂/∂p(y) | ∂/∂p(x1,y) | ∂/∂p(x2,y)
∂DP | 0 | 1/p(x1) | -1/p(x2)
∂PMI | 0 | 1/p(x1,y) | -1/p(x2,y)
∂nPMI_y | ln(p(y|x2)/p(y|x1)) / (ln^2(p(y)) p(y)) | 1/(ln(p(y)) p(x1,y)) | -1/(ln(p(y)) p(x2,y))
∂nPMI_xy | 1/(ln(p(x2,y)) p(y)) - 1/(ln(p(x1,y)) p(y)) | (ln(p(x1)) + ln(p(y)))/(ln^2(p(x1,y)) p(x1,y)) | -(ln(p(x2)) + ln(p(y)))/(ln^2(p(x2,y)) p(x2,y))
∂PMI^2 | 0 | 2/p(x1,y) | -2/p(x2,y)
∂SDC | p(x2,y)/(p(x2)+p(y))^2 - p(x1,y)/(p(x1)+p(y))^2 | 1/(p(x1)+p(y)) | -1/(p(x2)+p(y))
∂JI | p(x2,y)/(p(x2)+p(y)-p(x2,y))^2 - p(x1,y)/(p(x1)+p(y)-p(x1,y))^2 | (p(x1)+p(y))/(p(x1)+p(y)-p(x1,y))^2 | -(p(x2)+p(y))/(p(x2)+p(y)-p(x2,y))^2
∂LLR | 0 | 1/p(x1,y) | -1/p(x2,y)
∂τ_b | annoyingly long | (2 - 4/n)/sqrt((p(x1)-p(x1)^2)(p(y)-p(y)^2)) | (4/n - 2)/sqrt((p(x2)-p(x2)^2)(p(y)-p(y)^2))
∂t-test | (sqrt(p(x2)) - sqrt(p(x1)))/(2 sqrt(p(y))) | 1/sqrt(p(x1)p(y)) | -1/sqrt(p(x2)p(y))
• Demographic Parity:
  DP_gap = p(y|x1) - p(y|x2)  (25)
  ∂DP_gap/∂p(y) = 0  (26)
  ∂DP_gap/∂p(x1,y) = 1/p(x1)  (27)
  ∂DP_gap/∂p(x2,y) = -1/p(x2)  (28)
• Sørensen-Dice Coefficient:
  SDC_gap = p(x1,y)/(p(x1) + p(y)) - p(x2,y)/(p(x2) + p(y))  (29)
  ∂SDC_gap/∂p(y) = p(x2,y)/(p(x2) + p(y))^2 - p(x1,y)/(p(x1) + p(y))^2  (30)
  ∂SDC_gap/∂p(x1,y) = 1/(p(x1) + p(y))  (31)
  ∂SDC_gap/∂p(x2,y) = -1/(p(x2) + p(y))  (32)
• Jaccard Index:
  JI_gap = p(x1,y)/(p(x1) + p(y) - p(x1,y)) - p(x2,y)/(p(x2) + p(y) - p(x2,y))  (34)
  ∂JI_gap/∂p(y) = p(x2,y)/(p(x2) + p(y) - p(x2,y))^2 - p(x1,y)/(p(x1) + p(y) - p(x1,y))^2  (35)
  ∂JI_gap/∂p(x1,y) = (p(x1) + p(y))/(p(x1) + p(y) - p(x1,y))^2  (36)
  ∂JI_gap/∂p(x2,y) = -(p(x2) + p(y))/(p(x2) + p(y) - p(x2,y))^2  (37)
• Log-Likelihood Ratio:
  LLR_gap = ln(p(x1|y)) - ln(p(x2|y))  (39)
  ∂LLR_gap/∂p(y) = p(x2,y)/(p(x2|y)p(y)^2) - p(x1,y)/(p(x1|y)p(y)^2) = 0  (40)
  ∂LLR_gap/∂p(x1,y) = 1/(p(x1|y)p(y)) = 1/p(x1,y)  (41)
  ∂LLR_gap/∂p(x2,y) = -1/p(x2,y)  (42)
• PMI:
  PMI_gap = ln(p(y|x1)) - ln(p(y|x2))  (44)
  ∂PMI_gap/∂p(y) = 0  (45)
  ∂PMI_gap/∂p(x1,y) = 1/p(x1,y)  (46)
  ∂PMI_gap/∂p(x2,y) = -1/p(x2,y)  (47)
• PMI^2:
  PMI^2_gap = ln(p(y|x1)) - ln(p(y|x2)) + ln(p(x1,y)) - ln(p(x2,y))  (48)
  ∂PMI^2_gap/∂p(y) = 0  (49)
  ∂PMI^2_gap/∂p(x1,y) = 2/p(x1,y)  (50)
  ∂PMI^2_gap/∂p(x2,y) = -2/p(x2,y)  (51)
• nPMI, normalized by p(y):
  nPMI_gap = ln(p(y|x1))/ln(p(y)) - ln(p(y|x2))/ln(p(y))  (52)
  ∂nPMI_gap/∂p(y) = ln(p(y|x2))/(ln^2(p(y))p(y)) - ln(p(y|x1))/(ln^2(p(y))p(y)) = ln(p(y|x2)/p(y|x1))/(ln^2(p(y))p(y))  (53)
  ∂nPMI_gap/∂p(x1,y) = 1/(ln(p(y))p(x1,y))  (54)
  ∂nPMI_gap/∂p(x2,y) = -1/(ln(p(y))p(x2,y))  (55)
• nPMI, normalized by p(x,y):
  nPMI_gap = ln(p(x2))/ln(p(x2,y)) - ln(p(x1))/ln(p(x1,y)) + ln(p(y))/ln(p(x2,y)) - ln(p(y))/ln(p(x1,y))  (56)
  ∂nPMI_gap/∂p(y) = 1/(ln(p(x2,y))p(y)) - 1/(ln(p(x1,y))p(y))  (57)
  ∂nPMI_gap/∂p(x1,y) = (ln(p(x1)) + ln(p(y)))/(ln^2(p(x1,y))p(x1,y))  (58)
  ∂nPMI_gap/∂p(x2,y) = -(ln(p(x2)) + ln(p(y)))/(ln^2(p(x2,y))p(x2,y))  (59)
• Kendall rank correlation (τ_b):
  τ_b_gap = τ_b(r1 = x1, r2 = y) - τ_b(r1 = x2, r2 = y)  (61)
  ∂τ_b_gap/∂p(x1,y) = (2 - 4/n)/sqrt((p(x1) - p(x1)^2)(p(y) - p(y)^2))  (62)
  ∂τ_b_gap/∂p(x2,y) = (4/n - 2)/sqrt((p(x2) - p(x2)^2)(p(y) - p(y)^2))  (63)
• t-test:
  t-test_gap = (p(x1,y) - p(x1)p(y))/sqrt(p(x1)p(y)) - (p(x2,y) - p(x2)p(y))/sqrt(p(x2)p(y))  (65)
  ∂t-test_gap/∂p(y) = (sqrt(p(x2)) - sqrt(p(x1)))/(2 sqrt(p(y)))  (66)
  ∂t-test_gap/∂p(x1,y) = 1/sqrt(p(x1)p(y))  (67)
  ∂t-test_gap/∂p(x2,y) = -1/sqrt(p(x2)p(y))  (68)
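The orientations with respect to the joint probabilities can be spot-checked with central finite differences, treating each probability as a free variable, as the derivations above do (all values are illustrative):

```python
import math

def central_diff(f, x, h=1e-7):
    """Two-sided finite difference, used as a numeric derivative check."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Hold everything fixed except p(x1,y); values are illustrative.
p_x1, p_x2, p_y, p_x2y = 0.2, 0.1, 0.05, 0.005

def dp_gap(p_x1y):
    return p_x1y / p_x1 - p_x2y / p_x2

def pmi_gap(p_x1y):
    return (math.log(p_x1y / (p_x1 * p_y))
            - math.log(p_x2y / (p_x2 * p_y)))

# dDP_gap/dp(x1,y) = 1/p(x1); dPMI_gap/dp(x1,y) = 1/p(x1,y)
print(central_diff(dp_gap, 0.02), 1 / p_x1)   # both ~5.0
print(central_diff(pmi_gap, 0.02), 1 / 0.02)  # both ~50.0
```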
D COMPARISON TABLES

Table 5: Mean/Std of top 100 Male Association Gaps (scales of top 100, x1 = man)

Metric | Mean/Std C(y) | Mean/Std C(x1,y)
DP | 101.38 (±199.32) | 1.00 (±0.00)
nPMI_xy | 24371.15 (±27597.10) | 237.92 (±604.29)
nPMI_y | 75160.45 (±119779.13) | 5680.44 (±17193.19)
PMI | 1875.86 (±3344.23) | 2.43 (±4.72)
PMI^2 | 831.43 (±620.02) | 1.09 (±0.32)
SDC | 676.33 (±412.23) | 1.00 (±0.00)
τ_b | 189294.69 (±149505.10) | 24524.89 (±32673.77)
Table 6: Mean/Std of top 100 Male-Female Association Gaps (counts for top 100 gaps, x1 = MALE, x2 = FEMALE)

Metric | Mean/Std C(y) | Mean/Std C(x1,y) | Mean/Std C(x2,y)
DP | 154960.81 (±147823.81) | 71552.62 (±67400.44) | 70398.07 (±56824.32)
JI | 70403.46 (±86766.14) | 29670.94 (±35381.09) | 35723.13 (±39308.46)
LLR | 1020.60 (±1971.77) | 57.95 (±178.59) | 558.36 (±1420.15)
nPMI_xy | 13869.42 (±46244.17) | 5771.73 (±23564.63) | 9664.46 (±31475.19)
nPMI_y | 23156.11 (±73662.90) | 6632.79 (±25631.26) | 10376.39 (±33559.27)
PMI | 1020.60 (±1971.77) | 57.95 (±178.59) | 558.36 (±1420.15)
PMI^2 | 1020.60 (±1971.77) | 57.95 (±178.59) | 558.36 (±1420.15)
SDC | 67648.20 (±87258.41) | 28193.83 (±35295.25) | 34443.22 (±39501.21)
τ_b | 134076.65 (±144354.70) | 53845.80 (±58016.12) | 56032.09 (±52124.92)
Table 7: Minimum/Maximum of top 100 Male-Female Association Gaps

Metric | Min/Max C(y) | Min/Max C(x1,y) | Min/Max C(x2,y)
PMI | 15 / 10,551 | 1 / 1,059 | 8 / 7,755
PMI^2 | 15 / 10,551 | 1 / 1,059 | 8 / 7,755
LLR | 15 / 10,551 | 1 / 1,059 | 8 / 7,755
DP | 6,104 / 785,045 | 628 / 239,950 | 5,347 / 197,795
JI | 4,158 / 562,445 | 399 / 144,185 | 3,359 / 183,132
SDC | 2,906 / 562,445 | 139 / 144,185 | 2,563 / 183,132
nPMI_y | 35 / 562,445 | 1 / 144,185 | 9 / 183,132
nPMI_xy | 34 / 270,748 | 1 / 144,185 | 20 / 183,132
τ_b | 6,104 / 785,045 | 628 / 207,723 | 5,347 / 183,132
t-test | 960 / 562,445 | 72 / 144,185 | 870 / 183,132
E COMPARISON PLOTS
E.1 Overall Rank Changes
Table 8: Demographic Parity (di), Jaccard Index (ji), Sørensen-Dice Coefficient (sdc) Comparison
Table 9: Pointwise Mutual Information (pmi), Squared PMI (pmi_2), Log-Likelihood Ratio (ll) Comparison
Table 10: Demographic Parity (di), Pointwise Mutual Information (pmi), Normalized Pointwise Mutual Information (npmi_xy), τ_b (tau_b) Comparison
E.2 Movement plots
Table 11: Demographic Parity (di), Jaccard Index (ji), Sørensen-Dice Coefficient (sdc) Comparison
Table 12: Pointwise Mutual Information (pmi), Squared PMI (pmi_2), Log-Likelihood Ratio (ll) Comparison
Table 13: Demographic Parity (di), Pointwise Mutual Information (pmi), Normalized Pointwise Mutual Information (npmi_xy), τ_b (tau_b) Comparison
Table 14: Example Top Results

Rank | DP: Label / Count | PMI: Label / Count | nPMI_xy: Label / Count
0 | Female / 265853 | Dido Flip / 140 | (omitted) / 610
1 | Woman / 270748 | Webcam Model / 184 | Dido Flip / 140
2 | Girl / 221017 | Boho-chic / 151 | (omitted) / 2906
3 | Lady / 166186 | (omitted) / 610 | Eye Liner / 3144
4 | Beauty / 562445 | Treggings / 126 | Long Hair / 56832
5 | Long Hair / 56832 | Mascara / 539 | Mascara / 539
6 | Happiness / 117562 | (omitted) / 145 | Lipstick / 8688
7 | Hairstyle / 145151 | Lace Wig / 70 | Step Cutting / 6104
8 | Smile / 144694 | Eyelash Extension / 1167 | Model / 10551
9 | Fashion / 238100 | Bohemian Style / 460 | Eye Shadow / 1235
10 | Fashion Designer / 101854 | (omitted) / 78 | Photo Shoot / 8775
11 | Iris / 120411 | Gravure Idole / 200 | Eyelash Extension / 1167
12 | Skin / 202360 | (omitted) / 165 | Boho-chic / 460
13 | Textile / 231628 | Eye Shadow / 1235 | Webcam Model / 151
14 | Adolescence / 221940 | (omitted) / 156 | Bohemian Style / 184