Djamel A. Zighed and Nicolas Nicoloyannis ERIC Laboratory University of Lyon 2 (France)...
Djamel A. Zighed and Nicolas Nicoloyannis
ERIC Laboratory, University of Lyon 2 (France)
Prague Sept. 04
About the Computer Science Department
• In Lyon, there are 3 universities with 100,000 students
• Lumière University Lyon 2 has 22,000 students
• Lyon 2 is mainly a liberal arts university
• The Faculty of Economics has three departments, among them the computer science one
• We belong to this department
• We have Bachelor, Master and PhD programs for 300 students
ERIC Lab at the University
Faculties of the University of Lyon 2: Economics, Sociology, Linguistics, Law
Research centers of the university: ERIC (Knowledge Engineering Research Center)
- The budget of ERIC does not depend on the university; it is given by the national ministry of education
- We have a large autonomy in decision making
ERIC Lab
• Born in 1995
• 11 professors (N. Nicoloyannis, director)
• 15 PhD students
• Grants + contracts + WK + … = 200 K€/year
• Research topics
– Data mining (theory, tools and applications)
– Data warehouse management (T,T,A)
Data Mining (T,T,A)
• Theory
– Induction graphs
– Learning and classification
• Tools
– SIPINA: platform for data mining
• Applications
– Medical fields
– Chemical applications
– Human science
– …
Data mining TTA for complex data
Data mining on complex data
• An example : Breast cancer diagnosis
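Motivations
Let $X = \{x_1, \dots, x_r\}$ and $Y = \{y_1, \dots, y_c\}$ be two attributes and $\Omega$ be a set of data.
Contingency table $T_{XY}$:
$$
\begin{array}{c|ccc|c}
X \backslash Y & y_1 & \cdots & y_c & \\ \hline
x_1 & n_{11} & \cdots & n_{1c} & n_{1.} \\
\vdots & \vdots & & \vdots & \vdots \\
x_r & n_{r1} & \cdots & n_{rc} & n_{r.} \\ \hline
 & n_{.1} & \cdots & n_{.c} & n
\end{array}
$$
Association measure: it measures the strength of the relationship between $X$ and $Y$ in $T_{XY}$.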
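As a minimal sketch of how such a contingency table and its margins could be built from raw observations, the Python below cross-tabulates two categorical columns. The function names and the toy data are hypothetical and only serve the illustration.

```python
from collections import Counter


def contingency_table(xs, ys):
    """Cross-tabulate two equally long sequences of categorical values."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    row_labels = sorted(set(xs))   # categories x_1, ..., x_r
    col_labels = sorted(set(ys))   # categories y_1, ..., y_c
    counts = Counter(zip(xs, ys))  # n_ij for every observed pair
    table = [[counts[(x, y)] for y in col_labels] for x in row_labels]
    return row_labels, col_labels, table


def margins(table):
    """Return row totals n_i., column totals n_.j and the grand total n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)


# Hypothetical toy data: six cases described by two attributes.
xs = ["a", "a", "b", "b", "b", "c"]
ys = ["u", "v", "u", "u", "v", "v"]
rows, cols, table = contingency_table(xs, ys)
row_totals, col_totals, n = margins(table)
```

Sorting the labels is a design choice made here to get a reproducible row and column order; any fixed order would do.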
According to a specific association measure, can we improve the strength of the relationship by merging some rows and/or some columns?
Formally: is there a reduced table $T'_{XY}$, with $r'$ rows and $c'$ columns, such that $r' \le r$, $c' \le c$ and the association measured on $T'_{XY}$ is at least that measured on $T_{XY}$?
An example
Goal: Find the groupings that maximize the association between attributes
Original table $T_{XY}$: $\hat{t} = 0.14$
For the preceding example, the maximization of Tschuprow's $t$ gives the reduced table $T'_{XY}$: $\hat{t}' = 0.32$
Yes, we can improve the association by reducing the size of the contingency table: $\hat{t}' \ge \hat{t}$
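The effect can be reproduced on a small scale. The sketch below assumes the usual definition of Tschuprow's coefficient, $t = \sqrt{\chi^2 / (n \sqrt{(r-1)(c-1)})}$, and uses a made-up 3 × 2 table (not the table of the example above) in which two rows have similar profiles. Function names are hypothetical.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells of empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def tschuprow_t(table):
    """Tschuprow's t; returns 0.0 for degenerate tables (assumption)."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    if r < 2 or c < 2 or n == 0:
        return 0.0
    return math.sqrt(chi_square(table) / (n * math.sqrt((r - 1) * (c - 1))))


def merge_rows(table, i, j):
    """Return a new table in which rows i and j are summed into one row."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    return [row for k, row in enumerate(table) if k not in (i, j)] + [merged]


# Made-up table: rows 0 and 1 have nearly the same distribution over Y.
original = [[30, 10], [28, 12], [10, 30]]
reduced = merge_rows(original, 0, 1)
print(tschuprow_t(original), tschuprow_t(reduced))
```

Merging the two similar rows loses little chi-square while shrinking the normalizing term $\sqrt{(r-1)(c-1)}$, which is why the coefficient of the reduced table comes out higher here.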
Extension
Let $T_{XY}$ be the $r \times c$ contingency table of the attributes $X$ and $Y$ over the data set $\Omega$.
According to a specific association measure, can we find the optimal reduced contingency table?
That is, among all reduced tables $T^i_{XY}$, find the table $T^*_{XY}$, with $r^* \le r$ rows and $c^* \le c$ columns, whose association is maximal.
Optimal solution (exhaustive search)
Goal: Find the best cross partition on $T$
$P_{T_X}$: the set of all partitions brought about over $X$
$P_{T_Y}$: the set of all partitions brought about over $Y$
$\#P_{T_X}$: the size of the set $P_{T_X}$
$\#P_{T_Y}$: the size of the set $P_{T_Y}$
The number of cases we have to check is $\#P_T = \#P_{T_X} \times \#P_{T_Y}$
The value of $\#P_{T_X}$ differs between the nominal case and the ordinal case.
Optimal solution (exhaustive search)
According to a specific association measure, can we find the optimal reduced contingency table?
Yes, but the solution is intractable in real-world settings because of its high time complexity
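To get a feel for this growth, the sketch below counts the candidate cross partitions under two standard combinatorial assumptions: for a nominal attribute with $m$ categories every set partition is allowed, so the count is the Bell number $B_m$; for an ordinal attribute only adjacent categories may be grouped, which gives $2^{m-1}$ partitions. Trivial groupings are included in both counts, and the function names are hypothetical.

```python
def bell_number(m):
    """Number of partitions of a set of m elements (nominal case)."""
    row = [1]  # first row of the Bell triangle
    for _ in range(m - 1):
        new_row = [row[-1]]  # each row starts with the last entry of the previous one
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


def ordinal_partitions(m):
    """Number of partitions of m ordered categories into contiguous groups."""
    return 2 ** (m - 1) if m > 0 else 1


def cross_partitions(r, c, nominal=True):
    """Number of row-partition and column-partition pairs to examine."""
    count = bell_number if nominal else ordinal_partitions
    return count(r) * count(c)


for size in (4, 6, 10):
    print(size, cross_partitions(size, size), cross_partitions(size, size, nominal=False))
```

Under these assumptions a 6 × 6 table already has 203 × 203 = 41209 nominal cross partitions, and the count grows much faster than exponentially with the number of categories.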
Heuristic
Proceed successively to the grouping of the 2 (row or column) values that maximizes the increase in the association criterion.
• Starting with the finest categories, $T^0 = T_{XY}$ (the $r \times c$ table)
• The algorithm successively determines $T^k$, $k = 1, 2, \dots$
• Stop when the association of $T^{k+1}$ no longer exceeds that of $T^k$
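The greedy procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses Tschuprow's $t$ as the criterion, treats both attributes as nominal (any two rows or any two columns may be merged), and never reduces the table below 2 × 2. All function names are hypothetical.

```python
import math
from itertools import combinations


def tschuprow_t(table):
    """Tschuprow's t of a contingency table given as a list of rows."""
    r, c = len(table), len(table[0])
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    n = sum(rows)
    if r < 2 or c < 2 or n == 0:
        return 0.0
    chi2 = sum(
        (table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(r) for j in range(c) if rows[i] * cols[j] > 0
    )
    return math.sqrt(chi2 / (n * math.sqrt((r - 1) * (c - 1))))


def merge(table, i, j):
    """Return a new table in which rows i and j are summed into one row."""
    merged = [a + b for a, b in zip(table[i], table[j])]
    return [row for k, row in enumerate(table) if k not in (i, j)] + [merged]


def transpose(table):
    return [list(col) for col in zip(*table)]


def greedy_grouping(table, measure=tschuprow_t):
    """Merge two rows or two columns at a time while the measure increases."""
    current = [list(row) for row in table]
    best = measure(current)
    while True:
        candidates = []
        if len(current) > 2:      # row merges (assumption: keep at least 2 rows)
            pairs = combinations(range(len(current)), 2)
            candidates += [merge(current, i, j) for i, j in pairs]
        if len(current[0]) > 2:   # column merges, done on the transposed table
            flipped = transpose(current)
            pairs = combinations(range(len(flipped)), 2)
            candidates += [transpose(merge(flipped, i, j)) for i, j in pairs]
        if not candidates:
            return current, best
        challenger = max(candidates, key=measure)
        score = measure(challenger)
        if score <= best:         # stop: no grouping improves the criterion
            return current, best
        current, best = challenger, score


# Made-up table used only for illustration.
grouped, score = greedy_grouping([[30, 10], [28, 12], [10, 30]])
```

For ordinal attributes the same loop would apply with the candidate pairs restricted to adjacent rows or columns.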
Complexity
Simulation
Goal: How far is the quasi-optimal solution from the true optimum?
The comparison is tractable only for tables no larger than 6 × 6.
Simulation Design
Randomly generate 200 tables
Analysis of the distribution of the deviations between optima and quasi-optima.
Generating the Tables
10,000 cases distributed among the $c \times r$ cells of the table with a uniform distribution (worst case).
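A minimal sketch of this generation step is given below. It assumes 6 × 6 tables (the largest size for which the comparison is said to be tractable) and an arbitrary seed; the function name is hypothetical.

```python
import random


def random_table(r, c, n_cases=10_000, rng=None):
    """Drop n_cases cases into the r x c cells, each cell being equally likely."""
    rng = rng or random.Random()
    table = [[0] * c for _ in range(r)]
    for _ in range(n_cases):
        cell = rng.randrange(r * c)          # uniform choice among the r*c cells
        table[cell // c][cell % c] += 1
    return table


rng = random.Random(2004)                    # arbitrary seed, for reproducibility
tables = [random_table(6, 6, rng=rng) for _ in range(200)]
```

Each generated table can then be passed both to an exhaustive search and to the heuristic, and the two resulting association values compared.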
Quasi-optimal solution
Conclusion
• Implementation for a new approach to decision tree induction.
– Zighed, D.A., Ritschard, G., Erray, W. and Scuturici, V.-M. (2003), Abogodaï, a New Approach for Decision Trees, in Lavrac, N., Gamberger, D., Todorovski, L. and Blockeel, H. (eds), Knowledge Discovery in Databases: PKDD 2003, LNAI 2838, Berlin: Springer, 495-506.
– Zighed, D.A., Ritschard, G., Erray, W. and Scuturici, V.-M. (2003), Decision tree with optimal join partitioning, to appear in Journal of Information Intelligent Systems, Kluwer (2004).
• Divisive top-down approach
• Extension to the multidimensional case