A Methodology for Direct and Indirect Discrimination Prevention in
Data Mining
Presented By: Rucha Bhutada
Guided By: Prof. M. R. Wanjari
Outline:
• Introduction
• Challenges
• Discrimination analysis
• Why discrimination
• Papers read
• Findings of the base paper
• Future plans
Introduction:
Data mining is an increasingly important technology for extracting useful knowledge hidden in large collections of data.
Data mining can also have negative social consequences, such as: potential privacy invasion and potential discrimination.
If the training data sets are biased with respect to discriminatory attributes such as gender, race, or religion, discriminatory decisions may follow.
Challenges:
To handle both direct and indirect discrimination, instead of only direct discrimination.
To find a good tradeoff between discrimination removal and the quality of the resulting training data sets and data mining models.
Why this topic:
It is an extension of association rule mining, and a novel application of association rule mining in a social setting.
Most people do not want to be discriminated against on any sensitive attribute.
It can be useful for deriving a discrimination-free rule base for decision-making systems such as insurance scoring, loan approval, and hiring.
Example:
U.S. federal laws prohibit discrimination on the basis of: Race, Color, Religion, Nationality, Marital status, Age
In a number of settings:
• Credit/insurance scoring
• Sale, rental, and financing of housing
• Personnel selection and wages
• Access to public accommodations, education, nursing homes, adoptions, and health care
Papers read:
Sr. No. | Paper Name | Author | Year | Conclusion
1 | A Methodology for Direct and Indirect Discrimination Prevention in Data Mining | Sara Hajian and Josep Domingo-Ferrer | 2013 | To develop a new preprocessing discrimination prevention methodology
2 | Rule Protection for Indirect Discrimination Prevention in Data Mining | S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté | 2011 | To protect the decision rules against discrimination
3 | Classification with No Discrimination by Preferential Sampling | F. Kamiran and T. Calders | 2010 | To refine the model of discrimination
Discussion On Findings Of Base Paper
Discrimination is unfair or unequal treatment of people based on membership in a category or a minority, without regard to individual merit.
Discrimination can be either direct or indirect:
Direct discrimination occurs when decisions are made based on sensitive attributes.
Indirect discrimination occurs when decisions are made based on non-sensitive attributes which are strongly correlated with biased sensitive ones.
Approach: Anti-discrimination techniques have been introduced in data
mining:
- Discrimination discovery: consists of supporting the discovery of discriminatory decisions hidden, either directly or indirectly, in a data set of historical decision records.
- Discrimination prevention: consists of inducing patterns that do not lead to discriminatory decisions even if the original data sets are biased.
Approach: (cont’d) Preprocessing approach
• A data set is a collection of data objects (records) and their attributes. An item is an attribute together with its value (e.g., Race = black), and an item set X is a collection of one or more items.
• The support of an item set, supp(X), is the fraction of records that contain the item set X. We say that a rule X → C is completely supported by a record if both X and C appear in the record.
• The confidence of a rule, conf(X → C), measures how often the class item C appears in records that contain X. Hence, if supp(X) > 0 then conf(X → C) = supp(X, C) / supp(X).
• Support and confidence range over [0,1].
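The support and confidence definitions above can be sketched in a few lines of Python. The toy records and attribute names below are illustrative assumptions, not data from the paper:

```python
# Sketch: support and confidence of a classification rule X -> C over a
# toy data set of attribute = value items (illustrative data, not the
# paper's Adult or German Credit data sets).

records = [
    {"gender": "female", "city": "NYC", "hire": "no"},
    {"gender": "female", "city": "NYC", "hire": "no"},
    {"gender": "male",   "city": "NYC", "hire": "yes"},
    {"gender": "male",   "city": "LA",  "hire": "yes"},
]

def supp(itemset, db):
    """supp(X): fraction of records that contain every item in X."""
    matches = [r for r in db if all(r.get(a) == v for a, v in itemset.items())]
    return len(matches) / len(db)

def conf(antecedent, consequent, db):
    """conf(X -> C) = supp(X, C) / supp(X), defined when supp(X) > 0."""
    both = {**antecedent, **consequent}
    return supp(both, db) / supp(antecedent, db)

print(supp({"city": "NYC"}, records))                       # 0.75
print(conf({"gender": "female"}, {"hire": "no"}, records))  # 1.0
```

Both measures fall in [0,1], matching the slide's note.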
Approach: (cont’d):
• A frequent classification rule is a classification rule with support and confidence greater than respective specified lower bounds.
• The negated item set ¬X is an item set with the same attributes as X, but the attributes in ¬X take any value except those taken by the attributes in X.
Approach: (cont’d):
o Potentially Discriminatory and Nondiscriminatory Classification Rules
Let DI be the set of predetermined discriminatory items in DB (e.g., DI = {Foreign worker = yes, Race = black, Gender = female}). Frequent classification rules in FR (the set of frequent classification rules) fall into one of the following two classes:
• A classification rule X → C is potentially discriminatory (PD) when X = A,B with A ⊆ DI a nonempty discriminatory item set and B a nondiscriminatory item set. For example, {Foreign worker = yes, City = NYC} → Hire = no.
• A classification rule X → C is potentially nondiscriminatory (PND) when X = D,B with both D and B nondiscriminatory item sets. For example, {Zip = 10451, City = NYC} → Hire = no or {Experience = low, City = NYC} → Hire = no.
The word “potentially” means that a PD rule could lead to discriminatory decisions. Likewise, a PND rule could lead to discriminatory decisions in combination with some background knowledge.
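The PD/PND split reduces to a set-intersection test against DI. A minimal sketch, with rules encoded as (antecedent, consequent) pairs of (attribute, value) tuples (an illustrative encoding, not the paper's):

```python
# Sketch: classify frequent rules as PD or PND against the predetermined
# discriminatory item set DI (items below follow the slide's examples).

DI = {("foreign_worker", "yes"), ("race", "black"), ("gender", "female")}

rules = [
    ({("foreign_worker", "yes"), ("city", "NYC")}, ("hire", "no")),
    ({("zip", "10451"), ("city", "NYC")}, ("hire", "no")),
    ({("experience", "low"), ("city", "NYC")}, ("hire", "no")),
]

def classify(rule):
    """PD if the antecedent contains at least one discriminatory item."""
    antecedent, _ = rule
    return "PD" if antecedent & DI else "PND"

for r in rules:
    print(classify(r))   # PD, PND, PND
```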
Approach: (cont’d)
o Direct Discrimination Measure
Definition 1. Let A,B → C be a classification rule such that conf(B → C) > 0. The extended lift of the rule is
elift(A,B → C) = conf(A,B → C) / conf(B → C).
The idea here is to evaluate the discrimination of a rule as the gain of confidence due to the presence of the discriminatory items.
Definition 2. Let α ∈ ℝ be a fixed threshold and let A be a discriminatory item set. A PD classification rule c: A,B → C is α-protective w.r.t. elift if elift(c) < α. Otherwise, c is α-discriminatory.
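Definitions 1 and 2 can be checked directly from the data. A minimal sketch; the toy records and the threshold α = 1.2 are illustrative assumptions:

```python
# Sketch: extended lift elift(A,B -> C) = conf(A,B -> C) / conf(B -> C)
# and the alpha-protective test (toy data; alpha chosen for illustration).

records = [
    {"gender": "female", "city": "NYC", "hire": "no"},
    {"gender": "female", "city": "NYC", "hire": "no"},
    {"gender": "male",   "city": "NYC", "hire": "yes"},
    {"gender": "male",   "city": "NYC", "hire": "no"},
]

def conf(antecedent, consequent, db):
    """conf(X -> C): fraction of records matching X that also match C."""
    covered = [r for r in db if all(r.get(a) == v for a, v in antecedent.items())]
    hits = [r for r in covered if all(r.get(a) == v for a, v in consequent.items())]
    return len(hits) / len(covered)

A = {"gender": "female"}   # discriminatory item set
B = {"city": "NYC"}        # nondiscriminatory context
C = {"hire": "no"}         # class item
alpha = 1.2

elift = conf({**A, **B}, C, records) / conf(B, C, records)
print(round(elift, 3))     # 1.333
print("alpha-discriminatory" if elift >= alpha else "alpha-protective")
```

Here conf(A,B → C) = 1.0 and conf(B → C) = 0.75, so the gain of confidence due to the discriminatory item is 4/3.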
Approach: (cont’d)
o Indirect Discrimination Measure
Definition 3. A PND classification rule r: D,B → C is a redlining rule if it could yield an α-discriminatory rule r′: A,B → C in combination with currently available background knowledge rules of the form rb1: A,B → D and rb2: D,B → A, where A is a discriminatory item set.
For example: {Zip = 10451, City = NYC} → Hire = no.
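In the simplest special case, when both background rules rb1 and rb2 hold with confidence 1, the records matching D,B are exactly those matching A,B, so the hidden rule r′ inherits r's confidence and its elift can be checked directly. A sketch under that simplifying assumption (toy data, illustrative α; the paper's general bound is more involved):

```python
# Sketch of Definition 3 in a special case: zip 10451 and race = black
# coincide in this toy data, so rb1: A,B -> D and rb2: D,B -> A both
# hold with confidence 1, and r: D,B -> C yields r': A,B -> C.

records = [
    {"zip": "10451", "race": "black", "city": "NYC", "hire": "no"},
    {"zip": "10451", "race": "black", "city": "NYC", "hire": "no"},
    {"zip": "10452", "race": "white", "city": "NYC", "hire": "yes"},
    {"zip": "10452", "race": "white", "city": "NYC", "hire": "no"},
]

def conf(antecedent, consequent, db):
    """conf(X -> C): fraction of records matching X that also match C."""
    covered = [r for r in db if all(r.get(a) == v for a, v in antecedent.items())]
    hits = [r for r in covered if all(r.get(a) == v for a, v in consequent.items())]
    return len(hits) / len(covered)

D, B = {"zip": "10451"}, {"city": "NYC"}   # PND antecedent of r
A, C = {"race": "black"}, {"hire": "no"}   # discriminatory items, class
alpha = 1.2

# Background rules hold with confidence 1 in this toy data:
assert conf({**A, **B}, D, records) == 1.0   # rb1: A,B -> D
assert conf({**D, **B}, A, records) == 1.0   # rb2: D,B -> A

# So r' : A,B -> C is derivable; check whether it is alpha-discriminatory.
elift = conf({**A, **B}, C, records) / conf(B, C, records)
is_redlining = elift >= alpha
print(is_redlining)   # True: r is a redlining rule here
```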
Approach: (cont’d)
o Data Transformation for Direct Discrimination:
Direct rule protection: converts an α-discriminatory rule into an α-protective rule.
o Data Transformation for Indirect Discrimination:
Indirect rule protection: turns a redlining rule into a non-redlining rule.
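The effect of direct rule protection can be illustrated with a simplified stand-in for the paper's transformation methods: while the PD rule stays α-discriminatory, change the class item of one supporting record to the negated class and re-measure. The loop below is a sketch, not the paper's exact algorithm; records and α are illustrative:

```python
# Sketch (NOT the paper's exact method): perturb records supporting
# A,B -> C until the rule becomes alpha-protective, i.e. elift < alpha.

records = (
    [{"gender": "female", "city": "NYC", "hire": "no"} for _ in range(4)]
    + [{"gender": "female", "city": "NYC", "hire": "yes"}]
    + [{"gender": "male", "city": "NYC", "hire": "no"} for _ in range(3)]
    + [{"gender": "male", "city": "NYC", "hire": "yes"} for _ in range(4)]
)

def conf(antecedent, consequent, db):
    """conf(X -> C): fraction of records matching X that also match C."""
    covered = [r for r in db if all(r.get(a) == v for a, v in antecedent.items())]
    hits = [r for r in covered if all(r.get(a) == v for a, v in consequent.items())]
    return len(hits) / len(covered) if covered else 0.0

A, B, C = {"gender": "female"}, {"city": "NYC"}, {"hire": "no"}
alpha = 1.2

def elift(db):
    return conf({**A, **B}, C, db) / conf(B, C, db)

flips = 0
while elift(records) >= alpha:
    # Pick one record that completely supports A,B -> C and flip its
    # class item to the negated class.
    victim = next(r for r in records
                  if all(r[a] == v for a, v in {**A, **B, **C}.items()))
    victim["hire"] = "yes"
    flips += 1

print(flips, elift(records) < alpha)
```

A real transformation must also control how much the data changes, which is exactly the data-quality tradeoff listed under Challenges.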
Data sets:
• Adult data set: This data set consists of 48,842 records, split into a “train” part with 32,561 records and a “test” part with 16,281 records. The data set has 14 attributes (without class attribute).
• German Credit data set: This data set consists of 1,000 records and 20 attributes (without class attribute) of bank account holders. It is a well-known real-life data set containing both numerical and categorical attributes.
Result: (table 1)
• Misses cost (MC). This measure quantifies the percentage of rules among those extractable from the original data set that cannot be extracted from the transformed data set (side effect of the transformation process).
• Ghost cost (GC). This measure quantifies the percentage of the rules among those extractable from the transformed data set that were not extractable from the original data set (side effect of the transformation process).
Result: (table 2)
Result: (tables 3 and 4)
Tables 3 and 4 show lower information loss in terms of the GC measure for the Adult data set than for the German Credit data set.
Future plans:
• Implement the methodology in the Indian scenario
• Apply it to check corruption
• Apply it to gender discrimination
References:
1. S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, “Rule Protection for Indirect Discrimination Prevention in Data Mining,” Proc. Eighth Int’l Conf. Modeling Decisions for Artificial Intelligence (MDAI ’11), pp. 211-222, 2011.
2. D. Pedreschi, S. Ruggieri, and F. Turini, “Discrimination-Aware Data Mining,” Proc. 14th ACM Int’l Conf. Knowledge Discovery and Data Mining (KDD ’08), pp. 560-568, 2008.
3. S. Ruggieri, D. Pedreschi, and F. Turini, “Data Mining for Discrimination Discovery,” ACM Trans. Knowledge Discovery from Data, vol. 4, no. 2, article 9, 2010.
4. S. Ruggieri, D. Pedreschi, and F. Turini, “DCUBE: Discrimination Discovery in Databases,” Proc. ACM Int’l Conf. Management of Data (SIGMOD ’10), pp. 1127-1130, 2010.
THANK YOU…!!!