Actionability and Formal Concepts: a Data Mining Perspective
Jean-François Boulicaut
INSA Lyon, LIRIS CNRS UMR 5205, France
Montréal (Canada), ICFCA 2008
2
Preamble
Actionability and Formal Concepts: a Data Mining Perspective
− Knowledge Discovery based on Formal Concepts
− Data Mining as the “Art of Counting”
− Complete solvers for Inductive Queries are useful
− A “personal” and obviously biased perspective
Joint work with:
− J. Besson (co-author of the invited paper)
− R. G. Pensa, C. Rigotti, C. Robardet
− inspired by many other colleagues
3
A conceptual view on KDD processes
Inductive Database Management System
− Data
− Patterns
− Models
e.g., Inductive queries on 0/1 data for targeted applications in functional genomics
4
Motivation (1)
Pattern discovery from large 0/1 data sets
Objects x Properties (O x P)
     P1  P2  P3  P4  P5
O1    1   0   1   1   0
O2    1   1   1   0   1
O3    1   0   1   1   1
O4    0   0   1   1   1
O5    1   1   0   0   1
O6    1   1   0   0   1
O7    0   1   0   0   1
Size may be a problem: 10^2 ... 10^6 x 10^2 ... 10^4
5
Motivation (2)
Pattern discovery from large 0/1 data sets
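Y ∈ 2^P ∧ |g(Y)| > 2
(Same Objects x Properties matrix as in Motivation (1).)
A 3-frequent itemset: {{o2,o5,o6},{p2,p5}}
(In the figure, g maps a set of properties to the objects that have all of them, and f maps a set of objects to the properties they all share.)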
6
Motivation (3)
Pattern discovery from large 0/1 data sets
X ∈ 2^O ∧ Y ∈ 2^P ∧ X = g(Y) ∧ Y = f(X)
(Same Objects x Properties matrix as in Motivation (1).)
A formal concept: {{o2,o5,o6},{p1,p2,p5}}
(g and f are the two mappings between sets of properties and sets of objects.)
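The two mappings can be made concrete on the toy matrix. The following is a minimal Python sketch, assuming the matrix is read with O1 to O7 as rows and P1 to P5 as columns; the names `DATA`, `g`, `f` and `is_formal_concept` are illustrative choices, not taken from the talk.

```python
# Toy 0/1 data set from the slides: object -> set of properties whose value is 1.
# (Assumed reading of the matrix; names are illustrative.)
DATA = {
    "o1": {"p1", "p3", "p4"},
    "o2": {"p1", "p2", "p3", "p5"},
    "o3": {"p1", "p3", "p4", "p5"},
    "o4": {"p3", "p4", "p5"},
    "o5": {"p1", "p2", "p5"},
    "o6": {"p1", "p2", "p5"},
    "o7": {"p2", "p5"},
}
ALL_PROPS = set().union(*DATA.values())


def g(props):
    """Objects that have every property in props."""
    return {o for o, ps in DATA.items() if set(props) <= ps}


def f(objs):
    """Properties shared by every object in objs (all properties if objs is empty)."""
    objs = list(objs)
    if not objs:
        return set(ALL_PROPS)
    return set.intersection(*(DATA[o] for o in objs))


def is_formal_concept(objs, props):
    """(X, Y) is a formal concept when X = g(Y) and Y = f(X)."""
    return g(props) == set(objs) and f(objs) == set(props)


if __name__ == "__main__":
    Y = {"p1", "p2", "p5"}
    X = g(Y)
    print(sorted(X), sorted(f(X)), is_formal_concept(X, Y))
```

Under this reading, `g` applied to {p1,p2,p5} returns {o2,o5,o6} and `f` maps that set of objects back to {p1,p2,p5}, which is the formal concept shown on the slide.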
7
Motivation (4)
Pattern discovery from large 0/1 data sets
− Fault-tolerant extensions to formal concepts?
− Dense itemsets?
(Same Objects x Properties matrix as in Motivation (1).)
Examples of subspace clusters (bi-clusters)
8
Motivation (5)
Pattern discovery from large 0/1 data sets
From local patterns to global patterns
(Same Objects x Properties matrix as in Motivation (1).)
A co-clustering
9
Targeted applications
Understanding gene regulation?
− Genes
− Proteins
− Transcription factors
− Promoter sequences
10
… by means of formal concepts
Tf x G
      G1  G2  G3  G4  G5
Tf1    1   0   1   0   1
Tf2    1   1   1   0   1
Tf3    0   0   0   1   0

E x G
      G1  G2  G3  G4  G5
E1     1   0   1   1   0
E2     1   1   1   0   1
E3     1   0   1   1   1
E4     0   0   1   1   0

E x C
      C1  C2
E1     0   1
E2     1   0
E3     1   0
E4     0   1

G1, G3, and G5 are co-expressed genes for all individuals of class C1. Tf1 and Tf2 might explain this co-expression
11
A successful example (Ph.D. J. Besson)
Actionability and Knowledge Discovery
An outcome of a concrete KDD process:
Human genes from {SPOP, ABCA7, FEM1B, HK2, MAPRE1, MORF4F4L2, ARF4, SF1, VSP29, CRYBA4, HIG1, SDC1, PGRMC2} appear to be co-regulated by insulin for all control individuals, and the transcription factors from {SREBP, SP1, NF-Y, GATA-1, GATA-2, AML-1a} might support this co-regulation
12
Cont.
Discovering (new) regulation mechanisms
Identification of genes regulated by insulin from microarray data
Looking for formal concepts in a 156 x 344 matrix that encodes the association of transcription factors with genes that are regulated by insulin
14
Cont.
Formal Concepts (X,Y) s.t. SREBP1 ∈ X
> 3.6 million
SREBP1 (Sterol-responsive-element binding protein 1) is known to be involved in the transcriptional response to insulin
      G1  G2  G3  G4
TF1    1   1   0   1
TF2    1   1   0   1
TF3    1   0   1   1
TF4    0   1   1   1
15
Cont.
Formal Concepts (X,Y) s.t. SREBP1 ∈ X ∧ SP1 ∈ X ∧ NF-Y ∈ X
1,477
SP1 and NF-Y are known for « cooperating » with SREBP1
(Same TF x G illustration as on the previous slide.)
16
Biological validation
The formal concept ({SREBP, SP1, NF-Y, GATA-1, GATA-2, AML-1a}, {SPOP, ABCA7, FEM1B, HK2, MAPRE1, MORF4F4L2, ARF4, SF1, VSP29, CRYBA4, HIG1, SDC1, PGRMC2}) has been “studied” (wet biology, S. Rome et al.)
90% of these genes indeed have an active binding site for SREBP1 when SP1 and NF-Y are present
New targeted genes for SREBP1, SP1, NF-Y
17
Our thesis (1)
Discovery based on formal concepts makes sense … but
Mining formal concepts is a special case of bi-set mining under constraints and may be studied as such
(X,Y): X ∈ 2^O ∧ Y ∈ 2^P ∧ X = g(Y) ∧ Y = f(X) ∧ ...
X = g(Y) ∧ Y = f(g(Y)) ∧ ...
Z ∈ 2^P ∧ Z is free ∧ X = g(Z) ∧ Y = f(g(Z)) ∧ ...
« Pushing » constraints is a key technology for solving inductive queries
18
Our thesis (1’)
Discovery based on formal concepts makes sense … but
Mining formal concepts is a special case of bi-set mining under constraints and may be studied as such
(X,Y): X ∈ 2^O ∧ Y ∈ 2^P ∧ X = g(Y) ∧ Y = f(X) ∧ |X| > γ
X = g(Y) ∧ Y = f(g(Y)) ∧ |X| > γ
Z ∈ 2^P ∧ Z is free ∧ X = g(Z) ∧ Y = f(g(Z)) ∧ |X| > γ
« Pushing » constraints is a key technology for solving inductive queries
19
Our thesis (2)
Discovery based on formal concepts makes sense … but
Mining formal concepts that satisfy user-defined constraints may be difficult
Specialization order, enumeration and pruning strategies have to be designed
Monotonicity properties are important
− … ∧ |X| > α ∧ |Y| < β : not that hard
− … ∧ |X| > α ∧ |Y| > β : much harder
− … ∧ |X| x |Y| > α : much harder as well
20
Using such constraints in practice?
[Figure: number of formal concepts as a function of the minimal size (objects) and the minimal size (properties), for a given biological data set (TF x G)]
21
Our thesis (3)
Discovery based on formal concepts makes sense … but
... a somewhat disappointing actionability
Too many patterns
Many patterns denote false positive associations or uninteresting ones
− Using randomization techniques may help
− Fault-tolerance?
22
Problems w.r.t. closed set mining (1)
     p1  p2  p3  p4  p5  p6  p7
o1    1   1   1   1   1   0   0
o2    1   1   1   1   1   0   0
o3    1   1   1   1   1   0   0
o4    1   1   1   1   1   0   0
o5    0   0   0   0   0   1   1
o6    0   0   0   0   0   1   1
o7    0   0   0   0   0   0   0
({o1,o2,o3,o4},{p1,p2,p3,p4,p5})
({o5,o6},{p6,p7})
23
Problems w.r.t. closed set mining (2)
Introducing « 10% errors »
     p1  p2  p3  p4  p5  p6  p7
o1    1   0   1   1   1   0   0
o2    1   1   1   1   0   0   0
o3    1   1   1   1   1   0   0
o4    1   1   1   1   1   0   0
o5    0   0   0   0   0   1   1
o6    0   0   0   0   1   1   0
o7    1   0   0   0   0   0   0
({o1,o2,o3,o4,o7},{p1})
({o1,o2,o3,o4},{p1,p3,p4})
({o2,o3,o4},{p1,p2,p3,p4})
({o3,o4},{p1,p2,p3,p4,p5})
({o1,o3,o4},{p1,p3,p4,p5})
({o1,o3,o4,o6},{p5})
({o5,o6},{p6})
({o5},{p6,p7})
({o6},{p5,p6})
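The effect of noise on the number of formal concepts can be reproduced with a brute-force enumeration. This is a minimal Python sketch, not one of the solvers discussed in the talk: it closes the family of object intents under intersection, which is only practical for tiny matrices. The names `CLEAN`, `NOISY`, `parse` and `concepts` are illustrative, and the two matrices are written in p1..p7 order as an assumed reading of the slides.

```python
from itertools import combinations

PROPS = [f"p{i}" for i in range(1, 8)]


def parse(rows):
    """Turn {object: '0/1 string over p1..p7'} into {object: set of properties}."""
    return {o: {p for p, bit in zip(PROPS, bits) if bit == "1"}
            for o, bits in rows.items()}


CLEAN = parse({"o1": "1111100", "o2": "1111100", "o3": "1111100",
               "o4": "1111100", "o5": "0000011", "o6": "0000011",
               "o7": "0000000"})
NOISY = parse({"o1": "1011100", "o2": "1111000", "o3": "1111100",
               "o4": "1111100", "o5": "0000011", "o6": "0000110",
               "o7": "1000000"})


def concepts(data):
    """Formal concepts (extent, intent) with non-empty extent and intent."""
    intents = {frozenset(ps) for ps in data.values()}
    changed = True
    while changed:  # close the family of intents under pairwise intersection
        changed = False
        for a, b in combinations(list(intents), 2):
            if a & b not in intents:
                intents.add(a & b)
                changed = True
    result = []
    for intent in intents:
        extent = frozenset(o for o, ps in data.items() if intent <= ps)
        if intent and extent:
            result.append((extent, intent))
    return result


if __name__ == "__main__":
    print(len(concepts(CLEAN)), len(concepts(NOISY)))
```

Under this reading the clean matrix yields 2 such concepts and the noisy one yields 9, the nine bi-sets listed above.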
24
Elements of solution
Supporting actionable pattern discovery
1 0
Looking for « large enough » patterns may help
0 1
Looking for fault-tolerant patterns
− For a more or less declarative specification of fault-tolerance
Post-processing collections of local patterns in general and formal concepts in particular
25
Condensed representations are useful
Answering frequency queries
{(Y,e) ∈ 2^P x ℕ s.t. e = |g(Y)| ∧ e > γ}
0/1 data → Frequent itemsets → Rules, clusters, etc.
26
Condensed representations are useful’
Answering frequency queries
{(Y,e) ∈ 2^P x ℕ s.t. e = |g(Y)| ∧ e > γ}
0/1 data → Condensed representations of frequent sets (closed sets, δ-free sets, k-free sets, …) → Frequent itemsets → Rules, clusters, etc.
27
Condensed representations are useful’’
Answering frequency queries
{(Y,e) ∈ 2^P x ℕ s.t. e = |g(Y)| ∧ e > γ}
0/1 data → Condensed representations of frequent sets (closed sets, δ-free sets, k-free sets, …) → Rules, clusters, etc.
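Why closed sets are enough to answer frequency queries can be checked on a small example. The sketch below is a brute-force illustration, not an efficient miner: it uses the 6 x 5 toy matrix from the condensed-representation slides (read in p1..p5 order, an assumption) and recovers the support of any itemset as the largest support among its closed supersets. The names `DATA`, `support`, `closed_sets` and `support_from_closed` are illustrative.

```python
from itertools import combinations

# Toy 0/1 data: object -> set of properties whose value is 1 (assumed reading).
DATA = {
    "o1": {"p2", "p3", "p5"},
    "o2": {"p2", "p3", "p4", "p5"},
    "o3": {"p1", "p2", "p4"},
    "o4": {"p2", "p4"},
    "o5": {"p1", "p2", "p3", "p5"},
    "o6": {"p1", "p2", "p3", "p4"},
}
ITEMS = sorted(set().union(*DATA.values()))


def support(itemset):
    """Number of objects that have every property of itemset."""
    return sum(1 for ps in DATA.values() if set(itemset) <= ps)


def closed_sets():
    """Map each closed itemset (shared by at least one object) to its support."""
    closed = {}
    for r in range(1, len(DATA) + 1):
        for objs in combinations(DATA, r):
            intent = frozenset(set.intersection(*(DATA[o] for o in objs)))
            closed[intent] = support(intent)
    return closed


def support_from_closed(itemset, closed):
    """Support of itemset, using only the condensed representation."""
    candidates = [s for c, s in closed.items() if set(itemset) <= c]
    return max(candidates) if candidates else 0


if __name__ == "__main__":
    closed = closed_sets()
    print(len(closed), support_from_closed({"p3", "p5"}, closed))
```

The design choice here is readability over speed: enumerating all subsets of objects is exponential, whereas dedicated closed set miners avoid that cost.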
28
About condensed representations (1)
     p1  p2  p3  p4  p5
o1    0   1   1   0   1
o2    0   1   1   1   1
o3    1   1   0   1   0
o4    0   1   0   1   0
o5    1   1   1   0   1
o6    1   1   1   1   0
An equivalence class perspective: {p5} is a 0-free set and {p2,p3,p5} is its 0-closure
Equivalence class supported by {o1,o2,o5}: {p5}, {p3,p5}, {p2,p5}, {p2,p3,p5}
29
About condensed representations (2)
(Same 6 x 5 matrix as in About condensed representations (1).)
A near-equivalence class perspective: {p3} is a 1-free set and {p2,p3,p5} is its 1-closure
Near-equivalence class supported by {o1,o2, ...}: {p3}, {p2,p3}, {p3,p5}, {p2,p3,p5}
30
Using ac-like (1)
Computing δ-free sets and their δ-closures
δ = 0: p3 ∧ p4 => p2 : 2
{p3, p4} {p3, p4, p2} : {o2, o6} : 2
Given γ = 1, 11 sets with δ = 0
NB. 0-free sets are also called key patterns
Association rules, closed sets and formal concepts
(Same 6 x 5 matrix as in About condensed representations (1).)
31
Using ac-like (2)
Computing δ-free sets and their δ-closures
δ = 0: p3 ∧ p4 => p2 : 2
{p3, p4} {p3, p4, p2} : {o2, o6} : 2
δ = 1: p3 => p2 ∧ p5(-1) : 4
δ = 1: p3 ∧ p4 => p1(-1) ∧ p2 ∧ p5(-1) : 2
Given γ = 1, 11 sets with δ = 0, 7 sets with δ = 1
Association rules, (almost) closed sets (FBS patterns)
(Same 6 x 5 matrix as in About condensed representations (1).)
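The rules above can be recomputed directly from the matrix. This is a minimal Python sketch of δ-closure and δ-freeness under the usual definitions (a property is in the δ-closure of Y when at most δ of the objects supporting Y lack it; Y is δ-free when no rule between its own items holds with at most δ exceptions). It is not the ac-like tool itself, and the names `DATA`, `objects_of`, `delta_closure` and `is_delta_free` are illustrative.

```python
# 6 x 5 toy matrix: object -> set of properties whose value is 1 (assumed reading).
DATA = {
    "o1": {"p2", "p3", "p5"},
    "o2": {"p2", "p3", "p4", "p5"},
    "o3": {"p1", "p2", "p4"},
    "o4": {"p2", "p4"},
    "o5": {"p1", "p2", "p3", "p5"},
    "o6": {"p1", "p2", "p3", "p4"},
}
ITEMS = ["p1", "p2", "p3", "p4", "p5"]


def objects_of(itemset):
    """Objects that have every property of itemset."""
    return {o for o, ps in DATA.items() if set(itemset) <= ps}


def delta_closure(itemset, delta):
    """Properties missing from at most delta of the objects supporting itemset."""
    objs = objects_of(itemset)
    return {p for p in ITEMS
            if sum(1 for o in objs if p not in DATA[o]) <= delta}


def is_delta_free(itemset, delta):
    """True if no rule (itemset minus a) => a holds with at most delta exceptions.

    Checking the largest left-hand sides is enough, because enlarging the
    left-hand side of a rule can only reduce its number of exceptions.
    """
    items = set(itemset)
    return all(len(objects_of(items - {a})) - len(objects_of(items)) > delta
               for a in items)


if __name__ == "__main__":
    print(sorted(delta_closure({"p3", "p4"}, 0)))  # rule p3 ∧ p4 => p2
    print(sorted(delta_closure({"p3"}, 1)))        # rule p3 => p2 ∧ p5(-1)
```

Under this reading of the matrix, the 0-closure of {p3,p4} is {p2,p3,p4} with support 2, and the 1-closure of {p3} is {p2,p3,p5} with support 4, matching the rules on the slide.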
33
Introducing DMiner
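Data:
     p1  p2  p3
o1    0   1   1
o2    1   1   1
o3    1   1   1
o4    1   0   0
Enumeration tree (each split removes a block of 0 values):
{p1,p2,p3} – {o1,o2,o3,o4}
  split on p1 – o1
  {p1,p2,p3} – {o2,o3,o4}
    split on p2p3 – o4
    {p1,p2,p3} – {o2,o3}
    {p1} – {o2,o3,o4}
  {p2,p3} – {o1,o2,o3,o4}
    split on p2p3 – o4
    {p2,p3} – {o1,o2,o3}
    ∅ – {o1,o2,o3,o4}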
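The splitting idea behind the tree can be sketched in a few lines. This is a simplified Python illustration in the spirit of the enumeration shown above, not DMiner itself (see the Ph.D. of J. Besson for the actual algorithm): it splits on one block of 0 values per object and then filters the leaves with an explicit closedness test instead of pruning during the search. The names `CUTTERS`, `split`, `is_concept` and `mine` are illustrative.

```python
# Toy matrix from the slide: object -> set of properties whose value is 1.
DATA = {
    "o1": {"p2", "p3"},
    "o2": {"p1", "p2", "p3"},
    "o3": {"p1", "p2", "p3"},
    "o4": {"p1"},
}
PROPS = frozenset({"p1", "p2", "p3"})

# One block of 0 values per object: the properties that the object lacks.
CUTTERS = [(o, PROPS - DATA[o]) for o in sorted(DATA) if PROPS - DATA[o]]


def split(objs, props, cutters):
    """Recursively cut away 0 values; leaves are bi-sets containing only 1 values."""
    for i, (o, zeros) in enumerate(cutters):
        if o in objs and zeros & props:
            rest = cutters[i + 1:]
            # Either drop the object, or drop the properties it lacks.
            return (split(objs - {o}, props, rest)
                    + split(objs, props - zeros, rest))
    return [(objs, props)]


def is_concept(objs, props):
    """Closedness test: objs = g(props) and props = f(objs)."""
    extent = frozenset(o for o, ps in DATA.items() if props <= ps)
    intent = PROPS.intersection(*(DATA[o] for o in objs))
    return extent == objs and intent == props


def mine():
    leaves = split(frozenset(DATA), PROPS, CUTTERS)
    return {leaf for leaf in leaves if is_concept(*leaf)}


if __name__ == "__main__":
    for objs, props in mine():
        print(sorted(props), sorted(objs))
```

On this matrix the four leaves of the tree are all formal concepts, so the filter removes nothing; on larger data the filter (or the pruning used by a real solver) is what guarantees correctness.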
36
Properties (see Ph.D. J. Besson, 2005)
DMiner is a correct and complete algorithm
Time complexity when n = |O|, m = |P| and n < m
− Delay complexity (worst case): O(n^2 * m)
− Delay complexity on average: (n - log_2(|C|) + 1) O(n*m)
37
Back to fault-tolerance
Introducing « 10% errors »
(Same noisy matrix and same nine formal concepts as in Problems w.r.t. closed set mining (2).)
38
Specifying fault-tolerance
Fault-tolerance extensions of formal concepts are proposals for “almost-closed” set patterns
Various attempts
FBS patterns
DRBS patterns
See also dense itemsets, large tiles, subspace clusters, support envelopes in the Data Mining community
NB. Probably much more has been done within the “Concept Lattice” community
42
Pros and Cons for FBS patterns
Efficient algorithms for mining δ-free itemsets and thus FBS patterns
Non symmetrical
When δ=0, we get formal concepts
43
DR-bi-sets
Formal concept:
− No « 0 »
− At least one « 0 » per column
− At least one « 0 » per row
DR-bi-set:
− Cdense: at most α « 0 » per row and per column
− Crelevant: at least ε « 0 » more than inside per column, at least ε « 0 » more than inside per row
+ Maximality Constraint
44
An example of a DR-bi-set
α = α’ = ε = ε’ = 1
− At most 1 “0” value per row and per column
− 1 more “0” value outside (at least 2)
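The two constraints can be checked mechanically on the noisy matrix used earlier. The sketch below is one possible reading of the slide, not the published definition: "dense" means at most α zeros per row and per column inside the bi-set, and "relevant" means every outside row or column has at least ε more zeros than any inside one. For brevity a single α and a single ε are used for rows and columns; the maximality constraint is not checked. All names are illustrative.

```python
PROPS = [f"p{i}" for i in range(1, 8)]

# Noisy matrix ("10% errors"): object -> set of properties whose value is 1.
DATA = {
    "o1": {"p1", "p3", "p4", "p5"},
    "o2": {"p1", "p2", "p3", "p4"},
    "o3": {"p1", "p2", "p3", "p4", "p5"},
    "o4": {"p1", "p2", "p3", "p4", "p5"},
    "o5": {"p6", "p7"},
    "o6": {"p5", "p6"},
    "o7": {"p1"},
}


def zeros_in_row(o, cols):
    """Number of 0 values of object o restricted to the properties in cols."""
    return sum(1 for p in cols if p not in DATA[o])


def zeros_in_col(p, rows):
    """Number of 0 values of property p restricted to the objects in rows."""
    return sum(1 for o in rows if p not in DATA[o])


def is_dense(X, Y, alpha):
    """At most alpha zeros per row and per column inside the bi-set (X, Y)."""
    return (all(zeros_in_row(o, Y) <= alpha for o in X)
            and all(zeros_in_col(p, X) <= alpha for p in Y))


def is_relevant(X, Y, eps):
    """Each outside row/column has at least eps more zeros than any inside one."""
    max_row = max(zeros_in_row(o, Y) for o in X)
    max_col = max(zeros_in_col(p, X) for p in Y)
    rows_ok = all(zeros_in_row(o, Y) >= max_row + eps for o in DATA if o not in X)
    cols_ok = all(zeros_in_col(p, X) >= max_col + eps for p in PROPS if p not in Y)
    return rows_ok and cols_ok


if __name__ == "__main__":
    X = {"o1", "o2", "o3", "o4"}
    Y = {"p1", "p2", "p3", "p4", "p5"}
    print(is_dense(X, Y, 1), is_relevant(X, Y, 1), is_dense(X, Y, 0))
```

With α = ε = 1, the bi-set ({o1,o2,o3,o4},{p1,...,p5}) that was a formal concept before noise was introduced passes both checks under this reading, whereas it fails the density check with α = 0.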
48
Pros and Cons w.r.t. DRBS
When α=α’=0 and ε=ε’=1, we get formal concepts
Symmetrical approach
Hard to compute … a correct and complete algorithm exists
Some nice properties but « preserving more » of the Galois connection properties would be nicer …
... It remains open … at least for a data miner
49
Post-processing local patterns
A pragmatic way to tackle fault-tolerance
Post-processing collections of formal concepts
− e.g., grouping patterns w.r.t. their similarities to decrease the number of hypotheses that have to be interpreted
clustering formal concepts
− To increase hypothesis relevancy thanks to fault-tolerance
This can be applied on many other pattern types
50
An application
Mining Synexpression Groups from SAGE data
90 x 5237 Boolean matrix encoding over-expression of human genes in various organs and tissues
Looking for sets of co-over-expressed genes and their associated samples
− From 64836 to 1669 formal concepts that have been grouped into quasi-synexpression groups (QSGs) by means of a hierarchical clustering
One of these QSGs has been analyzed in depth (Blachon et al. ISB 2007)
51
Mining quasi-synexpression groups
1. Measuring similarities between formal concepts
2. Hierarchical clustering
3. Visualization
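The first two steps can be sketched on the nine formal concepts obtained from the noisy toy matrix. The slides do not state which similarity measure or linkage was used, so both are assumptions here: the similarity is the mean of the Jaccard indices of the object sets and of the property sets, and the clustering is a naive average-linkage agglomeration. The names `CONCEPTS`, `jaccard`, `similarity` and `cluster` are illustrative.

```python
def fs(*items):
    return frozenset(items)


# The nine formal concepts of the noisy toy matrix: (objects, properties).
CONCEPTS = [
    (fs("o1", "o2", "o3", "o4", "o7"), fs("p1")),
    (fs("o1", "o2", "o3", "o4"), fs("p1", "p3", "p4")),
    (fs("o2", "o3", "o4"), fs("p1", "p2", "p3", "p4")),
    (fs("o3", "o4"), fs("p1", "p2", "p3", "p4", "p5")),
    (fs("o1", "o3", "o4"), fs("p1", "p3", "p4", "p5")),
    (fs("o1", "o3", "o4", "o6"), fs("p5")),
    (fs("o5", "o6"), fs("p6")),
    (fs("o5"), fs("p6", "p7")),
    (fs("o6"), fs("p5", "p6")),
]


def jaccard(a, b):
    """Jaccard index of two sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def similarity(c1, c2):
    """Assumed measure: mean of the Jaccard indices on objects and on properties."""
    return 0.5 * (jaccard(c1[0], c2[0]) + jaccard(c1[1], c2[1]))


def cluster(concepts, k):
    """Naive average-linkage agglomerative clustering down to k groups."""
    clusters = [[c] for c in concepts]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [similarity(a, b) for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two most similar groups
    return clusters


if __name__ == "__main__":
    for group in cluster(CONCEPTS, 2):
        print([(sorted(x), sorted(y)) for x, y in group])
```

Each resulting group can then be summarized and inspected as a single hypothesis instead of several overlapping formal concepts.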
52
Relationship to co-clustering
Clustering and Co-Clustering
Useful feedback on global structures
Heuristic local optimization
Clustering objects and/or properties vs. clustering local patterns
     p1  p2  p3  p4  p5
o1    1   0   1   1   0
o2    0   1   0   0   1
o3    1   0   1   1   0
o4    0   0   1   1   0
o5    1   1   0   0   1
o6    0   1   0   0   1
o7    0   0   0   0   1
53
A proposal (Ph.D. R. G. Pensa, 2006)
     P1  P2  P3  P4  P5
O1    1   0   1   1   0
O2    0   1   0   0   1
O3    1   0   1   1   0
O4    0   0   1   1   0
O5    1   1   0   0   1
O6    0   1   0   0   1
O7    0   0   0   0   1
Panels (same matrix in both): Local patterns | Global structure
Compute K (e.g., 2) co-clusters from available bi-sets
54
Back to the Inductive Database vision
Inductive Database Management System
− Extensional/intensional data
− Patterns
− Models
55
SQUAT: a concrete example
− SAGE data (H. sapiens, G. gallus, M. musculus)
− Domain knowledge, e.g., GO
− Collections of formal concepts
http://bsmc.insa-lyon.fr/squat/login.php
56
Perspectives
Closed Set Mining Under Constraints
− Fault tolerance: δ-bi-sets, DR-bi-sets, δ-tolerant closed sets, …
− Pattern domains: n-ary relations, multi-relational data, sequences, trees, graphs ...
FCA & extensions: ICFCA may help
     p1  p2  p3
o1    1   0   1
o2    0   1   0
o3    1   1   0
o4    1   1   1
< (o1,o4),(p1,p3)>
In a binary relation (a subset of O x P), closed 2-sets are formal concepts
Many solvers are available
An example of an extension
q1   p1  p2  p3
o1    1   0   1
o2    0   1   0
q2   p1  p2  p3
o1    1   1   1
o2    1   0   1
< (o1),(p1,p3),(q1,q2)>
In an n-ary relation (a subset of A1 x ... x An), a closed n-set generalizes a formal concept: it binds subsets of every Ai s.t. each of them is closed w.r.t. all the others.
Computing the patterns is much harder
Trias (ICDM’06) – CubeMiner (VLDB’06) – Data Peeler (SDM’08)
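Checking that a given pattern is a closed n-set is easy, even though enumerating all of them is hard. The following Python sketch tests the example pattern on the ternary relation above by brute force: the pattern must cover only 1 values, and adding any single element to any one of its sets must break that property. This maximality test is the reading of closedness assumed here; it is not the algorithm of Trias, CubeMiner or Data Peeler, and the names `REL`, `DOMAINS`, `all_ones` and `is_closed_nset` are illustrative.

```python
from itertools import product

# Ternary relation as the set of (object, property, q) triples whose value is 1.
REL = {
    ("o1", "p1", "q1"), ("o1", "p3", "q1"), ("o2", "p2", "q1"),
    ("o1", "p1", "q2"), ("o1", "p2", "q2"), ("o1", "p3", "q2"),
    ("o2", "p1", "q2"), ("o2", "p3", "q2"),
}
DOMAINS = [{"o1", "o2"}, {"p1", "p2", "p3"}, {"q1", "q2"}]


def all_ones(sets):
    """True if every tuple of the Cartesian product of sets is in the relation."""
    return all(t in REL for t in product(*sets))


def is_closed_nset(sets):
    """All-ones pattern that cannot be extended by one element on any dimension."""
    if not all_ones(sets):
        return False
    for i, domain in enumerate(DOMAINS):
        for extra in domain - sets[i]:
            enlarged = list(sets)
            enlarged[i] = sets[i] | {extra}
            if all_ones(enlarged):
                return False  # the pattern is not maximal
    return True


if __name__ == "__main__":
    print(is_closed_nset([{"o1"}, {"p1", "p3"}, {"q1", "q2"}]))
```

The same function works for any arity, since it only relies on the list of domains and on tuple membership.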
Same relation with the columns reordered as p1 p3 p2:
q1   p1  p3  p2
o1    1   1   0
o2    0   0   1
q2   p1  p3  p2
o1    1   1   1
o2    1   1   0
< (o1),(p1,p3),(q1,q2)>
Trias – CubeMiner - Data Peeler
61
« Local to Global »
Closed Sets Under Constraints
− Local patterns: collections of local patterns, knowledge nuggets
− Global patterns: clustering, co-clustering, classifiers
− Condensed representations, new features, model characterization…
62
Summary
Formal concepts are an interesting special case of constrained bi-sets and may be studied as such
Formal concepts are not actionable patterns in many real-case applications
Complete but also heuristic solvers that exploit user-defined constraints can support the search for actionable patterns based on formal concepts, including fault-tolerant ones
63
To know more
Inductive Databases & Constraint-based Mining
Outcome of the Black Forest meeting organized in March 2004 in Hinterzarten (D)
J-F. Boulicaut, L. De Raedt, H. Mannila (Eds.)
Constraint-based Mining and Inductive Databases
Springer-Verlag LNAI 3848, 2006, 399 pages.
http://liris.cnrs.fr/~jboulica/
http://iq.ijs.si/IQ/
64
Thanks to EU funded FET projects
cInQ (2001-2004)
consortium on knowledge discovery by Inductive Queries
− Theory for local pattern mining
IQ (2005-2008)
Inductive Queries for mining patterns and models
− Theory for global pattern/model mining