Subgroup Discovery
![Page 1: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/1.jpg)
Subgroup Discovery
Finding Local Patterns in Data
![Page 2: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/2.jpg)
Exploratory Data Analysis
Scan the data without much prior focus
Find unusual parts of the data
Analyse attribute dependencies and interpret them as a ‘rule’:
if X=x and Y=y then Z is unusual
(the condition ‘X=x and Y=y’ is the subgroup)
Complex data: nominal, numeric, relational?
![Page 3: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/3.jpg)
Exploratory Data Analysis
Classification: model the dependency of the target on the remaining attributes.
Problem: sometimes the classifier is a black box, or uses only some of the available dependencies.
For example: in decision trees, some attributes may not appear because of overshadowing.
Exploratory Data Analysis: understanding the effects of all attributes on the target.
![Page 4: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/4.jpg)
Interactions between Attributes
Single-attribute effects are not enough
The XOR problem is an extreme example: 2 attributes with no info gain form a good subgroup
Apart from A=a, B=b, C=c, …
consider also
A=a ∧ B=b, A=a ∧ C=c, …, B=b ∧ C=c, …
A=a ∧ B=b ∧ C=c, …
…
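The XOR case can be checked numerically. The following is a minimal sketch, assuming a four-row toy dataset with two binary attributes and the target defined as their XOR; the helper names `entropy` and `info_gain` are illustrative and not part of the slides.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(v) / n for v in set(labels)]
    return -sum(p * log2(p) for p in probs)


def info_gain(rows, labels, test):
    """Information gain of splitting the data by the boolean condition `test`."""
    inside = [y for r, y in zip(rows, labels) if test(r)]
    outside = [y for r, y in zip(rows, labels) if not test(r)]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (inside, outside) if part)
    return entropy(labels) - remainder


# Toy XOR data (an assumption for illustration): target = A xor B
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

gain_a = info_gain(rows, labels, lambda r: r[0] == 1)                 # single attribute
gain_ab = info_gain(rows, labels, lambda r: r[0] == 1 and r[1] == 0)  # conjunction

print(gain_a, gain_ab)
```

On this toy data the single condition A=1 has zero gain, while the conjunction A=1 ∧ B=0 has a clearly positive gain, which is why conjunctions have to be searched explicitly.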
![Page 5: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/5.jpg)
Subgroup Discovery Task
“Find all subgroups within the inductive constraints that show a significant deviation in the distribution of the target attributes”
Inductive constraints:
Minimum support
(Maximum support)
Minimum quality (Information gain, χ², WRAcc)
Maximum complexity
…
![Page 6: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/6.jpg)
Subgroup Discovery: the Binary Target Case
![Page 7: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/7.jpg)
Confusion Matrix
A confusion matrix (or contingency table) describes the frequency of the four combinations of subgroup and target:
within subgroup, positive
within subgroup, negative
outside subgroup, positive
outside subgroup, negative

| | T | F | total |
|---|---|---|---|
| T | .42 | .13 | .55 |
| F | .12 | .33 | |
| total | .54 | | 1.0 |

(the two axes of the table are subgroup and target)
![Page 8: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/8.jpg)
Confusion Matrix
High numbers along the TT-FF diagonal mean a positive correlation between subgroup and target
High numbers along the TF-FT diagonal mean a negative correlation between subgroup and target
Target distribution on DB is fixed
Only two degrees of freedom

| | T | F | total |
|---|---|---|---|
| T | .42 | .13 | .55 |
| F | .12 | .33 | .45 |
| total | .54 | .46 | 1.0 |

(the two axes of the table are subgroup and target)
![Page 9: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/9.jpg)
Quality Measures
A quality measure for subgroups summarizes the interestingness of its confusion matrix into a single number
WRAcc, weighted relative accuracy
Balance between coverage and unexpectedness
WRAcc(S,T) = p(ST) − p(S)p(T)
between −.25 and .25, 0 means uninteresting

| | T | F | total |
|---|---|---|---|
| T | .42 | .13 | .55 |
| F | .12 | .33 | |
| total | .54 | | 1.0 |

(the two axes of the table are subgroup and target)
WRAcc(S,T) = p(ST) − p(S)p(T) = .42 − .55 × .54 = .42 − .297 = .123
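The WRAcc arithmetic on this slide can be reproduced in a few lines. This is a minimal sketch using the three probabilities from the confusion matrix; the function name `wracc` and its argument names are illustrative choices.

```python
def wracc(p_st, p_s, p_t):
    """Weighted relative accuracy: p(ST) - p(S) * p(T)."""
    return p_st - p_s * p_t


# Values from the confusion matrix: joint cell .42, marginals .55 and .54
quality = wracc(0.42, 0.55, 0.54)

# The extreme value .25 is reached when subgroup and target coincide
# and each covers half of the data.
upper = wracc(0.5, 0.5, 0.5)

print(round(quality, 3), upper)
```

Because the formula is symmetric in the two marginals, the result does not depend on which axis of the table is read as the subgroup.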
![Page 10: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/10.jpg)
Quality Measures
WRAcc: Weighted Relative Accuracy
Information gain
χ²
Correlation Coefficient
Laplace
Jaccard
Specificity
…
![Page 11: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/11.jpg)
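Subgroup Discovery as Search
[Diagram: search lattice of subgroup descriptions; each candidate is evaluated through its confusion matrix]
Level 0: true
Level 1: A=a1, A=a2, B=b1, B=b2, C=c1, …
Level 2: A=a1 ∧ B=b1, A=a1 ∧ B=b2, A=a2 ∧ B=b1, …
Level 3: A=a1 ∧ B=b1 ∧ C=c1, …
A branch ends where the minimum support level is reached.

Confusion matrix of one candidate:

| | T | F | total |
|---|---|---|---|
| T | .42 | .13 | .55 |
| F | .12 | .33 | |
| total | .54 | | 1.0 |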
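The lattice search can be sketched as a level-wise refinement loop. The code below is an illustrative sketch only, assuming nominal attributes stored as dictionaries, a 0/1 target, WRAcc as the quality measure, and exhaustive search up to a fixed depth; the names `covers`, `wracc` and `search` are hypothetical and do not describe any particular tool.

```python
def covers(description, row):
    """True if the row satisfies every (attribute, value) condition."""
    return all(row[a] == v for a, v in description)


def wracc(description, rows, target):
    """WRAcc of a description: p(ST) - p(S) * p(T), with a 0/1 target."""
    n = len(rows)
    covered = [r for r in rows if covers(description, r)]
    p_s = len(covered) / n
    p_t = sum(r[target] for r in rows) / n
    p_st = sum(r[target] for r in covered) / n
    return p_st - p_s * p_t


def search(rows, target, min_support, max_depth):
    """Level-wise search over conjunctions, pruned on minimum support."""
    attrs = sorted(k for k in rows[0] if k != target)
    conditions = sorted({(a, r[a]) for r in rows for a in attrs})
    results = []
    level = [()]  # the empty description 'true' covers everything
    for _ in range(max_depth):
        next_level = []
        for desc in level:
            used = {a for a, _ in desc}
            for cond in conditions:
                # use each attribute once, and add conditions in a fixed
                # order so every conjunction is generated exactly once
                if cond[0] in used or (desc and cond <= desc[-1]):
                    continue
                refined = desc + (cond,)
                support = sum(covers(refined, r) for r in rows)
                if support < min_support:
                    continue  # support only shrinks under refinement
                results.append((wracc(refined, rows, target), refined))
                next_level.append(refined)
        level = next_level
    return sorted(results, reverse=True)


# Toy XOR data (an assumption for illustration)
rows = [{"A": a, "B": b, "T": a ^ b} for a in (0, 1) for b in (0, 1)]
results = search(rows, "T", min_support=1, max_depth=2)
print(results[0])
```

Pruning on support is safe because a refinement can never cover more rows than its parent; no such guarantee holds for the quality itself.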
![Page 12: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/12.jpg)
Refinements are (anti-)monotonic
[Diagram: nested regions within the entire database: subgroup S1, S2 (refinement of S1), S3 (refinement of S2), and the target concept]
Refinements are (anti-)monotonic in their support…
…but not in interestingness. This may go up or down.
![Page 13: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/13.jpg)
Subgroup Discovery and ROC space
![Page 14: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/14.jpg)
ROC Space
Each subgroup forms a point in ROC space, in terms of its False Positive Rate and True Positive Rate.
TPR = TP/Pos = TP/(TP+FN) (fraction of positive cases in the subgroup)
FPR = FP/Neg = FP/(FP+TN) (fraction of negative cases in the subgroup)
ROC = Receiver Operating Characteristics
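As a small illustration of the two rates, the sketch below maps the four confusion-matrix counts of a subgroup to its point in ROC space. The counts in the example are hypothetical and were chosen only to give round numbers.

```python
def roc_point(tp, fp, fn, tn):
    """Return (FPR, TPR) for a subgroup given its confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of all positives that fall in the subgroup
    fpr = fp / (fp + tn)  # share of all negatives that fall in the subgroup
    return fpr, tpr


# Hypothetical counts: the subgroup covers 40 of 50 positives and 10 of 50 negatives
fpr, tpr = roc_point(tp=40, fp=10, fn=10, tn=40)
print(fpr, tpr)
```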
![Page 15: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/15.jpg)
ROC Space Properties
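[Figure: ROC space with the following landmarks]
‘ROC heaven’: perfect subgroup (FPR = 0, TPR = 1)
‘ROC hell’: random subgroups (on the diagonal TPR = FPR)
entire database (FPR = 1, TPR = 1)
empty subgroup (FPR = 0, TPR = 0)
perfect negative subgroup (FPR = 1, TPR = 0)
minimum support threshold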
![Page 16: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/16.jpg)
Measures in ROC Space
[Figure: isometrics of WRAcc and Information Gain in ROC space, with the 0 isometric and the positive and negative regions marked. Source: Flach & Fürnkranz]
![Page 17: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/17.jpg)
Other Measures
Precision, Gini index
Correlation coefficient, Foil gain
![Page 18: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/18.jpg)
Refinements in ROC Space
Refinements of S will reduce the FPR and TPR, so will appear to the left and below S.
The blue polygon represents the possible refinements of S. With a convex measure f, the quality of any refinement is bounded by the value of f at the corners of the polygon.
If the corners are not above the minimum quality or the current best (top k?), prune the search space below S.
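The corner-based bound can be sketched as follows. This is a simplified illustration, assuming the region of possible refinements is the full rectangle between the origin and S (the polygon on the slide may be smaller, for instance because of the minimum support threshold) and using WRAcc written in ROC coordinates as p(1 − p)(TPR − FPR), where p is the fraction of positives. All function names are hypothetical.

```python
def wracc_roc(fpr, tpr, pos_fraction):
    """WRAcc expressed in ROC coordinates: p * (1 - p) * (TPR - FPR)."""
    return pos_fraction * (1 - pos_fraction) * (tpr - fpr)


def optimistic_estimate(fpr, tpr, pos_fraction, measure=wracc_roc):
    """Best value of a convex measure over the rectangle of refinements of S.

    A convex function on a polygon attains its maximum at a corner,
    so evaluating the corners is sufficient.
    """
    corners = [(0.0, 0.0), (fpr, 0.0), (0.0, tpr), (fpr, tpr)]
    return max(measure(f, t, pos_fraction) for f, t in corners)


def can_prune(fpr, tpr, pos_fraction, threshold):
    """Prune below S when no corner is above the required quality."""
    return optimistic_estimate(fpr, tpr, pos_fraction) <= threshold


# Hypothetical subgroup at FPR = 0.2, TPR = 0.6 with half of the data positive
bound = optimistic_estimate(0.2, 0.6, 0.5)
print(bound, can_prune(0.2, 0.6, 0.5, threshold=0.2))
```

For WRAcc the best corner is always (0, TPR): the refinement that keeps all covered positives and drops all covered negatives.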
![Page 19: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/19.jpg)
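Multi-class problems
Generalising to problems with more than 2 classes is fairly straightforward:
χ²
Information gain
Combine the values of the table into a quality measure.

| subgroup | C1 | C2 | C3 | total |
|---|---|---|---|---|
| T | .27 | .06 | .22 | .55 |
| F | .03 | .19 | .23 | .45 |
| total | .3 | .25 | .45 | 1.0 |

(rows: subgroup membership; columns: target classes)
Source: Nijssen & Kok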
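One way to combine the cells of such a table into a single number is the χ² statistic. The sketch below is an illustration under stated assumptions: the table holds joint probabilities, and the database size `n` is a hypothetical value, since the slide does not give one.

```python
def chi_squared(joint, n):
    """Chi-squared statistic for a table of joint probabilities and n records."""
    row_totals = [sum(row) for row in joint]
    col_totals = [sum(col) for col in zip(*joint)]
    stat = 0.0
    for i, row in enumerate(joint):
        for j, observed in enumerate(row):
            # expected cell probability under independence
            expected = row_totals[i] * col_totals[j]
            stat += (observed - expected) ** 2 / expected
    return n * stat


# Joint probabilities from the slide (rows: in / out of subgroup; columns: C1..C3)
joint = [[0.27, 0.06, 0.22],
         [0.03, 0.19, 0.23]]

per_record = chi_squared(joint, 1)       # size-independent part of the statistic
statistic = chi_squared(joint, 1000)     # n = 1000 is an assumed database size
print(per_record, statistic)
```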
![Page 20: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/20.jpg)
Subgroup Discovery for Numeric targets
![Page 21: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/21.jpg)
Numeric Subgroup Discovery
Target is numeric: find subgroups with significantly higher or lower average value
Trade-off between size of subgroup and average target value
[Figure labels: h = 2200, h = 3100, h = 3600]
![Page 22: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/22.jpg)
Types of SD for Numeric Targets
Regression subgroup discovery: numeric target has order and scale
Ordinal subgroup discovery: numeric target has order
Ranked subgroup discovery: numeric target has order or scale
![Page 23: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/23.jpg)
Vancouver 2010 Winter Olympics
Partial ranking: objects share a rank
ordinal target
regression target
![Page 24: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/24.jpg)
Official IOC ranking of countries (medals > 0)

| Rank | Country | Medals | Athletes | Continent | Population (millions) | Language Family | Republic | Polar |
|---|---|---|---|---|---|---|---|---|
| 1 | USA | 37 | 214 | N. America | 309 | Germanic | y | y |
| 2 | Germany | 30 | 152 | Europe | 82 | Germanic | y | n |
| 3 | Canada | 26 | 205 | N. America | 34 | Germanic | n | y |
| 4 | Norway | 23 | 100 | Europe | 4.8 | Germanic | n | y |
| 5 | Austria | 16 | 79 | Europe | 8.3 | Germanic | y | n |
| 6 | Russ. Fed. | 15 | 179 | Asia | 142 | Slavic | y | y |
| 7 | Korea | 14 | 46 | Asia | 73 | Altaic | y | n |
| 9 | China | 11 | 90 | Asia | 1338 | Sino-Tibetan | y | n |
| 9 | Sweden | 11 | 107 | Europe | 9.3 | Germanic | n | y |
| 9 | France | 11 | 107 | Europe | 65 | Italic | y | n |
| 11 | Switzerland | 9 | 144 | Europe | 7.8 | Germanic | y | n |
| 12 | Netherlands | 8 | 34 | Europe | 16.5 | Germanic | n | n |
| 13.5 | Czech Rep. | 6 | 92 | Europe | 10.5 | Slavic | y | n |
| 13.5 | Poland | 6 | 50 | Europe | 38 | Slavic | y | n |
| 16 | Italy | 5 | 110 | Europe | 60 | Italic | y | n |
| 16 | Japan | 5 | 94 | Asia | 127 | Japonic | n | n |
| 16 | Finland | 5 | 95 | Europe | 5.3 | Finno-Ugric | y | y |
| 20 | Australia | 3 | 40 | Australia | 22 | Germanic | y | n |
| 20 | Belarus | 3 | 49 | Europe | 9.6 | Slavic | y | n |
| 20 | Slovakia | 3 | 73 | Europe | 5.4 | Slavic | y | n |
| 20 | Croatia | 3 | 18 | Europe | 4.5 | Slavic | y | n |
| 20 | Slovenia | 3 | 49 | Europe | 2 | Slavic | y | n |
| 23 | Latvia | 2 | 58 | Europe | 2.2 | Slavic | y | n |
| 25 | Great Britain | 1 | 52 | Europe | 61 | Germanic | n | n |
| 25 | Estonia | 1 | 30 | Europe | 1.3 | Finno-Ugric | y | n |
| 25 | Kazakhstan | 1 | 38 | Asia | 16 | Turkic | y | n |

Fractional ranks: shared ranks are averaged
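The fractional ranks in the table can be reproduced from the medal counts alone. The following is a minimal sketch; the function name `fractional_ranks` is an illustrative choice, and ranking by total medal count is assumed from the table.

```python
def fractional_ranks(scores):
    """Rank scores in descending order; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the block of ties that share the same score
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average = (pos + 1 + end + 1) / 2  # mean of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


# Medal counts in table order (USA first, Kazakhstan last)
medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6,
          5, 5, 5, 3, 3, 3, 3, 3, 2, 1, 1, 1]
ranks = fractional_ranks(medals)
print(ranks)
```

For example, China, Sweden and France occupy positions 8, 9 and 10 and therefore all receive rank 9; the Czech Republic and Poland share positions 13 and 14 and receive 13.5.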
![Page 25: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/25.jpg)
Interesting Subgroups
‘polar = yes’
1. United States
3. Canada
4. Norway
6. Russian Federation
9. Sweden
16. Finland
‘language_family = Germanic & athletes ≥ 60’
1. United States
2. Germany
3. Canada
4. Norway
5. Austria
9. Sweden
11. Switzerland
![Page 26: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/26.jpg)
Intuitions
Size: larger subgroups are more reliable
Rank: majority of objects appear at the top
Position: ‘middle’ of subgroup should differ from middle of ranking
Deviation: objects should have similar rank
Example subgroup: language_family = Slavic
![Page 27: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/27.jpg)
Intuitions (size, rank, position, deviation, as above)
Example subgroup: polar = yes
![Page 28: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/28.jpg)
Intuitions (size, rank, position, deviation, as above)
Example subgroup: population < 10M
![Page 29: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/29.jpg)
Intuitions (size, rank, position, deviation, as above)
Example subgroup: language_family = Slavic & population < 10M
![Page 30: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/30.jpg)
Quality Measures
Average
Mean test
z-Score
t-Statistic
Median χ² statistic
AUC of ROC
Wilcoxon-Mann-Whitney Ranks statistic
Median MAD Metric
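The slides list the measures without formulas. As an illustration of the size versus deviation trade-off, here is a sketch of one common formulation of the z-score, √n · (μ_S − μ_0) / σ_0, applied to the ‘polar = yes’ subgroup with the medal count as target. The formula choice and the function name `z_score` are assumptions, not taken from the slides.

```python
from math import sqrt
from statistics import mean, pstdev


def z_score(subgroup_values, all_values):
    """z-score of the subgroup mean against the whole population.

    Larger subgroups and larger shifts of the mean both increase the score.
    """
    n = len(subgroup_values)
    return sqrt(n) * (mean(subgroup_values) - mean(all_values)) / pstdev(all_values)


# Medal counts of all 26 countries, in table order
medals = [37, 30, 26, 23, 16, 15, 14, 11, 11, 11, 9, 8, 6, 6,
          5, 5, 5, 3, 3, 3, 3, 3, 2, 1, 1, 1]
# 'polar = yes': USA, Canada, Norway, Russian Federation, Sweden, Finland
polar = [37, 26, 23, 15, 11, 5]

z = z_score(polar, medals)
print(z)
```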
![Page 31: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/31.jpg)
Meet Cortana
the open source Subgroup Discovery tool
![Page 32: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/32.jpg)
Cortana Features
Generic Subgroup Discovery algorithm: quality measure, search strategy, inductive constraints
Flat file: .txt, .arff (DB connection to come)
Support for complex targets
41 quality measures
ROC plots
Statistical validation
![Page 33: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/33.jpg)
Target Concepts
‘Classical’ Subgroup Discovery: nominal targets (classification), numeric targets (regression)
Exceptional Model Mining (to be discussed in a few slides): multiple targets; regression, correlation; multi-label classification
![Page 34: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/34.jpg)
Mixed Data
Data types: binary, nominal, numeric
Numeric data is treated dynamically (no discretisation as preprocessing)
all: consider all available thresholds
bins: discretise the current candidate subgroup
best: find best threshold, and search from there
![Page 35: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/35.jpg)
Statistical Validation
Determine distribution of random results: random subsets, random conditions, swap-randomization
Determine minimum quality
Significance of individual results
Validate quality measures: how exceptional?
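The ‘random subsets’ idea can be sketched as follows: draw many random subsets of a given size, score each with the quality measure, and take a high quantile of those scores as the minimum quality a real subgroup must exceed. This is an illustrative sketch only; it assumes WRAcc as the measure and a 95% quantile, and it does not reproduce how Cortana implements validation. All names are hypothetical.

```python
import random


def wracc(subgroup, target):
    """WRAcc from two boolean membership lists of equal length."""
    n = len(target)
    p_s = sum(subgroup) / n
    p_t = sum(target) / n
    p_st = sum(s and t for s, t in zip(subgroup, target)) / n
    return p_st - p_s * p_t


def random_subset_qualities(target, size, runs, seed=0):
    """Qualities of `runs` random subsets, each containing `size` records."""
    rng = random.Random(seed)
    n = len(target)
    qualities = []
    for _ in range(runs):
        members = set(rng.sample(range(n), size))
        subgroup = [i in members for i in range(n)]
        qualities.append(wracc(subgroup, target))
    return qualities


def quality_threshold(qualities, alpha=0.05):
    """Quality exceeded by roughly a fraction alpha of the random results."""
    ranked = sorted(qualities)
    index = min(len(ranked) - 1, int((1 - alpha) * len(ranked)))
    return ranked[index]


# Toy data (an assumption): 100 records, half of them positive
target = [True] * 50 + [False] * 50
qualities = random_subset_qualities(target, size=20, runs=200)
threshold = quality_threshold(qualities)

# A subgroup of 20 records that are all positive should clear the threshold
pure = [i < 20 for i in range(100)]
print(threshold, wracc(pure, target))
```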
![Page 36: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/36.jpg)
Open Source
You can:
Use Cortana binary: datamining.liacs.nl/cortana.html
Use and modify Cortana sources (Java)
![Page 37: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/37.jpg)
Exceptional Model Mining
Subgroup Discovery with multiple target attributes
![Page 38: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/38.jpg)
Mixture of Distributions
[Figure: plot with both axes ranging from 0 to 100]
![Page 39: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/39.jpg)
Mixture of Distributions
[Figure: three plots, all axes ranging from 0 to 100]
![Page 40: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/40.jpg)
Mixture of Distributions
[Figure: three plots, all axes ranging from 0 to 100]
For each datapoint it is unclear whether it belongs to G or its complement Ḡ
Intensional description of exceptional subgroup G?
Model class unknown
Model parameters unknown
![Page 41: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/41.jpg)
Solution: extend Subgroup Discovery
Use other information than X and Y: object descriptions D
Use Subgroup Discovery to scan sub-populations in terms of D
Subgroup Discovery: find subgroups of the database where the target attribute shows an unusual distribution.
![Page 42: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/42.jpg)
Solution: extend Subgroup Discovery
Use other information than X and Y: object descriptions D
Use Subgroup Discovery to scan sub-populations in terms of D
Model over subgroup becomes target of SD
Subgroup Discovery: find subgroups of the database where the target attributes show an unusual distribution, by means of modeling over the target attributes.
Exceptional Model Mining
![Page 43: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/43.jpg)
Exceptional Model Mining
Define a target concept (X and y)
X y
object description | target concept
![Page 44: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/44.jpg)
Exceptional Model Mining
Define a target concept (X and y)
Choose a model class C
Define a quality measure φ over C
X y
modeling
object description | target concept
![Page 45: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/45.jpg)
Exceptional Model Mining
Define a target concept (X and y)
Choose a model class C
Define a quality measure φ over C
Use Subgroup Discovery to find exceptional subgroups G and associated model M
X y
modeling
Subgroup Discovery
object description | target concept
![Page 46: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/46.jpg)
Quality Measure
Specify what defines an exceptional subgroup G based on properties of model M
Absolute measure (absolute quality of M): correlation coefficient, predictive accuracy
Difference measure (difference between M and M̄, the model on the complement of G): difference in slope, qualitative properties of classifier
Reliable results: minimum support level, statistical significance of G
![Page 47: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/47.jpg)
Correlation Model
Correlation coefficient
φρ = ρ(G)
Absolute difference in correlation
φabs = |ρ(G) − ρ(Ḡ)|
Entropy weighted absolute difference
φent = H(p)·|ρ(G) − ρ(Ḡ)|
Statistical significance of correlation difference φscd
compute z-score from ρ through Fisher transform
compute p-value from z-score
$$z^* = \frac{z' - \bar{z}'}{\sqrt{\frac{1}{n-3} + \frac{1}{\bar{n}-3}}}$$
where z' and z̄' are the Fisher-transformed correlations of G and Ḡ, and n and n̄ are their sizes.
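The two steps behind φscd can be sketched with the standard library alone. This is a minimal illustration, assuming the usual Fisher transform z' = atanh(ρ) and a two-sided p-value under the standard normal distribution; how φscd itself is derived from the p-value is not stated on the slide, so it is left out. The function names and the example numbers are hypothetical.

```python
from math import atanh, erfc, sqrt


def correlation_difference_z(rho_g, n_g, rho_c, n_c):
    """z-score for the difference between two correlation coefficients.

    rho_g, n_g: correlation and size of the subgroup G
    rho_c, n_c: correlation and size of its complement
    """
    z_g = atanh(rho_g)  # Fisher transform: 0.5 * ln((1 + rho) / (1 - rho))
    z_c = atanh(rho_c)
    return (z_g - z_c) / sqrt(1 / (n_g - 3) + 1 / (n_c - 3))


def two_sided_p_value(z):
    """Probability of a standard normal value at least as extreme as z."""
    return erfc(abs(z) / sqrt(2))


# Hypothetical example: strong correlation inside G, weak outside
z_star = correlation_difference_z(rho_g=0.9, n_g=50, rho_c=0.1, n_c=200)
p_value = two_sided_p_value(z_star)
print(z_star, p_value)
```

Both groups need more than three records, otherwise the denominator is undefined.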
![Page 48: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/48.jpg)
Regression Model
Compare slope b of
yi = a + b·xi + e (fitted on G), and
yi = ā + b̄·xi + e (fitted on the complement Ḡ)
Compute significance of slope difference φssd
y = 41 568 + 3.31·x
y = 30 723 + 8.45·x
Subgroup: drive = 1 ∧ basement = 0 ∧ #baths ≤ 1
![Page 49: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/49.jpg)
Gene Expression Data
Subgroup: 11_band = ‘no deletion’ ∧ survival time ≤ 1919 ∧ XP_498569.1 ≤ 57
y = 3313 − 1.77·x
y = 360 + 0.40·x
![Page 50: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/50.jpg)
Classification Model
Decision Table Majority classifier
BDeu measure (predictiveness)
Hellinger (unusual distribution)
whole database
RIF1 160.45
prognosis = ‘unknown’
![Page 51: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/51.jpg)
General Framework
X y
modeling
Subgroup Discovery
object description | target concept
![Page 52: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/52.jpg)
General Framework
X y
Regression ●, Classification ●, Clustering, Association, Graphical modeling ●, …
Subgroup Discovery
object description | target concept
![Page 53: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/53.jpg)
General Framework
X y
Regression, Classification, Clustering, Association, Graphical modeling, …
Subgroup Discovery ●, Decision Trees, SVM, …
object description | target concept
![Page 54: Subgroup Discovery](https://reader035.fdocuments.us/reader035/viewer/2022081504/568150e4550346895dbf01dc/html5/thumbnails/54.jpg)
General Framework
X y
Regression, Classification, Clustering, Association, Graphical modeling, …
Subgroup Discovery, Decision Trees, SVM, …
propositional ●, multi-relational ●, …
target concept