A dynamic-programming algorithm for hierarchical discretization of continuous attributes
Amit Goyal (15th April 2008)
Department of Computer Science
The University of British Columbia
Reference
Ching-Cheng Shen and Yen-Liang Chen. A dynamic-programming algorithm for hierarchical discretization of continuous attributes. European Journal of Operational Research 184 (2008) 636-651. Elsevier.
Amit Goyal (UBC Computer Science)
Overview
- Motivation
- Background
- Why do we need discretization?
- Related Work
- DP Solution
- Analysis
- Conclusion
Motivation
Situation: the attrition rate for mobile phone customers is around 25-30% per year.
Task: given customer information for the past N months, predict who is likely to attrite next month. Also estimate the customer's value and what cost-effective offer should be made to this customer.
Customer attributes: Age, Gender, Location, Phone bills, Income, Occupation
Pattern Discovery
Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume: min_support = 30%, min_confidence = 80%.
An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
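The support and confidence figures above can be verified with a short Python sketch (a minimal check over the seven transactions listed on the slide):

```python
# Transactions t1..t7 from the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(A ∪ C) / support(A) for the rule A → C."""
    return support(antecedent | consequent) / support(antecedent)
```

For instance, `support({"Chicken", "Clothes", "Milk"})` gives 3/7 (t5, t6, t7), matching the slide.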
Issues with Numeric Attributes
The size of the discretized intervals affects support and confidence:
{Occupation = SE, Income = $70,000} → {Attrition = Yes}
{Occupation = SE, 60K ≤ Income ≤ 80K} → {Attrition = Yes}
{Occupation = SE, 0K ≤ Income ≤ 1B} → {Attrition = Yes}
If the intervals are too small, a rule may not have enough support.
If the intervals are too large, a rule may not have enough confidence.
There is a loss of information (how do we minimize it?). A potential solution is to use all possible intervals, but that produces far too many rules!
Background
Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
Why do we need discretization?
- Data warehousing and mining: data reduction, association rule mining, sequential pattern mining
- Some machine learning algorithms, such as Bayesian approaches and decision trees
- Granular computing
Related Work
- Manual
- Equal-Width Partition
- Equal-Depth Partition
- Chi-Square Partition
- Entropy-Based Partition
- Clustering
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size (a uniform grid).
- If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B − A)/N.
- The most straightforward method.
Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples.
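Both schemes fit in a few lines of Python (a minimal sketch; edge handling when N does not divide the sample count evenly is a simplification):

```python
def equal_width(values, n):
    """Equal-width binning: boundaries A, A+W, ..., B with W = (B-A)/n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    return [a + i * w for i in range(n + 1)]

def equal_depth(values, n):
    """Equal-depth binning: n groups with ~equal numbers of samples."""
    s = sorted(values)
    size = len(s) / n
    return [s[int(i * size):int((i + 1) * size)] for i in range(n)]
```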
Chi-Square Based Partitioning
Χ² (chi-square) test:
Χ² = Σ (Observed − Expected)² / Expected
The larger the Χ² value, the more likely the variables are related.
Merge: find the best pair of neighboring intervals and merge them to form larger intervals, recursively.
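The merge step can be sketched as follows. Representing each interval by its per-class counts is my own framing (the slide does not fix a data structure); ChiMerge-style procedures merge the adjacent pair with the *smallest* Χ², i.e. the pair whose class distributions are least likely to differ:

```python
def chi2_pair(a, b):
    """Chi-square statistic for two adjacent intervals.

    a, b: per-class counts for each interval, e.g. a = [c0, c1]
    for a two-class problem.
    """
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for cls in range(len(a)):
            col = a[cls] + b[cls]
            expected = sum(row) * col / total
            if expected:  # skip classes absent from both intervals
                stat += (row[cls] - expected) ** 2 / expected
    return stat

def merge_once(intervals):
    """One merge pass: join the adjacent pair with the smallest chi-square."""
    i = min(range(len(intervals) - 1),
            key=lambda v: chi2_pair(intervals[v], intervals[v + 1]))
    merged = [x + y for x, y in zip(intervals[i], intervals[i + 1])]
    return intervals[:i] + [merged] + intervals[i + 2:]
```

Repeating `merge_once` until a Χ² threshold or interval-count limit is reached gives the recursive merging the slide describes.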
Entropy Based Partition
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1|/|S|) · Ent(S1) + (|S2|/|S|) · Ent(S2)
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is applied recursively to the partitions obtained until some stopping criterion is met.
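A small sketch of one binary split (taking candidate boundaries at midpoints of adjacent distinct values is a common choice, not something the slide specifies):

```python
from math import log2

def ent(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def split_entropy(pairs, t):
    """E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2) for boundary t.

    pairs: list of (value, class_label) samples.
    """
    s1 = [c for v, c in pairs if v <= t]
    s2 = [c for v, c in pairs if v > t]
    n = len(pairs)
    return len(s1) / n * ent(s1) + len(s2) / n * ent(s2)

def best_boundary(pairs):
    """Boundary minimizing E(S, T) over midpoints of adjacent values."""
    vals = sorted({v for v, _ in pairs})
    cands = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
    return min(cands, key=lambda t: split_entropy(pairs, t))
```

Applying `best_boundary` recursively to each side until a stopping criterion fires yields the full entropy-based discretization.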
Clustering
- Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter).
- Can be very effective if the data is clustered, but not if the data is "smeared".
- Clustering can be hierarchical and stored in multi-dimensional index tree structures.
- There are many choices of clustering definitions and clustering algorithms.
Weaknesses of the existing methods
- They seek a locally optimal solution instead of a global optimum.
- They are subject to the constraint that each interval can only be partitioned into a fixed number of subintervals.
- The constructed tree may be unbalanced.
Notations
- val(i): the value of the i-th data item
- num(i): the number of occurrences of value val(i)
- R: the depth of the output tree
- ub: the upper bound on the number of subintervals spawned from an interval
- lb: the corresponding lower bound
Example
R = 2, lb = 2, ub = 3
Problem Definition
Given parameters R, ub, and lb, and input data val(1), val(2), …, val(n) and num(1), num(2), …, num(n), our goal is to build a minimum-volume tree subject to the constraints that all leaf nodes are at level R and that the branch degree of every internal node is between lb and ub.
Distances and Volume
Intra-distance of a node containing data i through j:
intradist(i, j) = Σ_{x=i}^{j} |val(x) − mean(i, j)| · num(x)
Inter-distance between two adjacent siblings, the first containing data i through u and the second containing data u+1 through j:
interdist(i, u, j) = (val(u+1) − val(u)) · totalnum(i, j)
where totalnum(i, j) = Σ_{x=i}^{j} num(x).
The volume of a tree is the total intra-distance minus the total inter-distance in the tree.
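These quantities can be sketched directly, assuming intradist(i, j) = Σ |val(x) − mean(i, j)| · num(x) with a num-weighted mean, and interdist(i, u, j) = (val(u+1) − val(u)) · totalnum(i, j); the weighting of the mean is my reading of the slides:

```python
def intradist(val, num, i, j):
    """Sum of |val(x) - mean(i,j)| * num(x) for x = i..j (0-based, inclusive)."""
    total = sum(num[i:j + 1])
    mean = sum(val[x] * num[x] for x in range(i, j + 1)) / total
    return sum(abs(val[x] - mean) * num[x] for x in range(i, j + 1))

def interdist(val, num, i, u, j):
    """(val(u+1) - val(u)) * totalnum(i, j): separation between adjacent
    siblings covering data i..u and u+1..j."""
    return (val[u + 1] - val[u]) * sum(num[i:j + 1])
```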
Theorem
The volume of a tree = the intra-distance of the root node + the volumes of all its sub-trees - the inter-distances among its children
Notations
- T*(i,j,r): the minimum-volume tree that contains data i through j and has depth r
- T(i,j,r,k): the minimum-volume tree that contains data i through j, has depth r, and whose root has k branches
- D*(i,j,r): the volume of T*(i,j,r)
- D(i,j,r,k): the volume of T(i,j,r,k)
Notations Cont.
QD(i,j,r,k): the minimum, over all ways of splitting data i through j into k consecutive groups (the v-th group forming the root of a depth-r subtree), of
Σ_{v=1}^{k} (the volume of the v-th subtree) − Σ_{v=2}^{k} (the inter-distance between the (v−1)-th and the v-th subtrees)
Algorithm
For k > 1, split off the first group at position u and recurse on the rest:
QD(i, j, r, k) = min_{i ≤ u < j} { D*(i, u, r) + QD(u+1, j, r, k−1) − interdist(i, u, j) }
Algorithm Cont.
Base case, a single group:
QD(i, j, r, 1) = D*(i, j, r)
Algorithm Cont.
The volume of a k-branch tree is its root's intra-distance plus the best arrangement of its k depth-(r−1) subtrees:
D(i, j, r, k) = intradist(i, j) + QD(i, j, r−1, k)
The complete DP algorithm
QD(i, j, r, k) = min_{i ≤ u < j} { D*(i, u, r) + QD(u+1, j, r, k−1) − interdist(i, u, j) }, for k > 1
QD(i, j, r, 1) = D*(i, j, r)
D(i, j, r, k) = intradist(i, j) + QD(i, j, r−1, k)
D*(i, j, r) = min_{lb ≤ k ≤ ub} { D(i, j, r, k) }
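The recurrences translate into a memoized Python sketch under my reading of the slides: a depth-0 tree is a single leaf whose volume is its intra-distance, infeasible splits get infinite volume, and intradist/interdist follow the definitions from the Distances and Volume slide:

```python
from functools import lru_cache
import math

def build_dp(val, num, R, lb, ub):
    """Sketch of the DP: returns D*(0, n-1, R), the minimum tree volume.

    Indices are 0-based; R is the tree depth, lb/ub bound the branch degree.
    """
    n = len(val)

    def totalnum(i, j):
        return sum(num[i:j + 1])

    def intradist(i, j):
        t = totalnum(i, j)
        mean = sum(val[x] * num[x] for x in range(i, j + 1)) / t
        return sum(abs(val[x] - mean) * num[x] for x in range(i, j + 1))

    def interdist(i, u, j):
        return (val[u + 1] - val[u]) * totalnum(i, j)

    @lru_cache(maxsize=None)
    def Dstar(i, j, r):
        if r == 0:                      # depth 0: a single leaf node
            return intradist(i, j)
        return min((D(i, j, r, k) for k in range(lb, ub + 1)),
                   default=math.inf)

    @lru_cache(maxsize=None)
    def D(i, j, r, k):
        # Volume = intra-distance of the root + best split into k subtrees
        # of depth r-1.
        q = QD(i, j, r - 1, k)
        return math.inf if q == math.inf else intradist(i, j) + q

    @lru_cache(maxsize=None)
    def QD(i, j, r, k):
        # Best way to cover data i..j with k adjacent depth-r subtrees,
        # subtracting the inter-distances between adjacent siblings.
        if k == 1:
            return Dstar(i, j, r)
        if j - i + 1 < k:               # not enough data points to split
            return math.inf
        return min(Dstar(i, u, r) + QD(u + 1, j, r, k - 1)
                   - interdist(i, u, j)
                   for u in range(i, j))

    return Dstar(0, n - 1, R)
```

Memoizing on (i, j, r, k) keeps the table cubic-ish in n, in line with the cubic complexity stated in the conclusion.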
Experimental Results
[Chart: Volume of trees constructed]
[Chart: Run times of different algorithms]
[Chart: Gain ratios of different algorithms (Monthly Household Income)]
[Chart: Gain ratios of different algorithms (Money Spent Monthly)]
Conclusion
- Finds a global optimum instead of a local one.
- Each interval is partitioned into the most appropriate number of subintervals.
- The constructed trees are balanced.
- The time complexity is cubic, so the algorithm is slightly slower than the heuristic methods.
http://www.cs.ubc.ca/~goyal ([email protected])
Thank you!
Gain Ratio
entropy(T) = − Σ_{i=1}^{n} p_i log2 p_i
purity(S1, …, Sr) = Σ_{i=1}^{r} (|Si|/|S|) · entropy(Si)
The information gain due to a particular split of S into Si, i = 1, 2, …, r:
Gain(S, {S1, S2, …, Sr}) = purity(S) − purity(S1, S2, …, Sr)
IntrinsicInfo(S, A) = − Σ_{i} (|Si|/|S|) · log2(|Si|/|S|)
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
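These formulas can be checked with a short sketch (representing S and the parts Si as lists of class labels is my own choice):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    ps = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in ps)

def gain_ratio(parent, parts):
    """Gain(S, {S1..Sr}) / IntrinsicInfo(S, A), per the formulas above.

    parent: class labels of S; parts: list of label lists S1..Sr.
    """
    n = len(parent)
    purity = sum(len(p) / n * entropy(p) for p in parts)
    gain = entropy(parent) - purity
    intrinsic = -sum(len(p) / n * log2(len(p) / n) for p in parts)
    return gain / intrinsic
```

A perfectly class-separating binary split of a balanced two-class set gives gain 1 and intrinsic info 1, hence a gain ratio of 1.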
Chi-Square Test Example
              | Heads | Tails | Total
Observed      |  53   |  47   | 100
Expected      |  50   |  50   | 100
(O − E)²      |   9   |   9   |
(O − E)² / E  |  0.18 |  0.18 | X² = 0.36
To see whether this result is statistically significant, the P-value (the probability of observing a deviation at least this large if the coin is fair) must be calculated or looked up in a table. Here P(X²₁ ≥ 0.36) = 0.5485: there is a probability of about 55% of seeing data that deviates at least this much from the expected results if the coin is indeed fair. Hence we cannot reject the hypothesis that the coin is fair.
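The table can be reproduced in a few lines; for one degree of freedom the tail probability has the closed form erfc(√(x/2)), which matches the 0.5485 above:

```python
from math import erfc, sqrt

def chi2_stat(observed, expected):
    """Chi-square statistic: sum of (O - E)^2 / E over the cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(x2):
    """P(X^2 >= x2) for 1 degree of freedom: erfc(sqrt(x2 / 2))."""
    return erfc(sqrt(x2 / 2))
```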