KDD 2010

Direct Mining of Discriminative Patterns for Classifying Uncertain Data

Chuancong Gao, Jianyong Wang

Department of Computer Science and Technology, Tsinghua University, Beijing, China

Uncertain Dataset

  Evaluation    Price  Looking  Tech. Spec.  Quality
  Unacceptable  +      -        /            {-: 0.8, /: 0.1, +: 0.1}
  Acceptable    /      -        /            {-: 0.1, /: 0.8, +: 0.1}
  Good          -      +        /            {-: 0.1, /: 0.8, +: 0.1}
  Very Good     /      +        +            {-: 0.1, /: 0.1, +: 0.8}

(+: Good, /: Medium, -: Bad. In each Quality entry, the 0.8 marks the probability of the original value from the certain dataset.)

Our Solution

• For each training instance, find a set of the most discriminative patterns, ensuring that the probability of the instance being covered exceeds a given threshold.

• For each selected pattern, generate a binary feature for each instance indicating whether the instance contains the pattern.

• Train an SVM classifier on the generated features.

• Classify all the testing instances.
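The feature-generation step above can be sketched as follows. The instance encoding (item sets of attribute:value strings) and the mined patterns are hypothetical placeholders; the SVM training step is left to a library such as LIBSVM and is omitted here.

```python
# Minimal sketch of the pattern-to-feature step, assuming each instance is a
# set of items and each discriminative pattern is an itemset (both are
# illustrative encodings, not the paper's exact representation).

def to_features(instances, patterns):
    """One 0/1 feature per pattern: does the instance contain the pattern?"""
    return [[1 if pattern <= instance else 0 for pattern in patterns]
            for instance in instances]

instances = [{"price:+", "looking:-", "quality:-"},
             {"price:/", "looking:+", "quality:+"}]
patterns = [frozenset({"quality:-"}),
            frozenset({"price:/", "quality:+"})]

features = to_features(instances, patterns)
# features[0] == [1, 0]; features[1] == [0, 1]
```

The resulting 0/1 matrix is what would be handed to the SVM trainer.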

Measure a Pattern’s Discriminative Power

Via the confidence value on each class label:

• For patterns involving only certain attributes: calculate each confidence value directly.

• For patterns involving at least one uncertain attribute: calculate the expected value of each confidence.

Definition of Expected Confidence

Given a set of transactions $T$ and the set of possible worlds $W$ w.r.t. $T$, the expected confidence of an itemset $x$ on class $c$ is:

$$E(conf_x^c) = \sum_{w_i \in W} P(w_i)\, conf_{x,w_i}^c = \sum_{w_i \in W} P(w_i)\, \frac{sup_{x,w_i}^c}{sup_{x,w_i}}$$

where $P(w_i)$ is the probability of world $w_i$, $conf_{x,w_i}^c$ is the confidence of $x$ on class $c$ in world $w_i$, and $sup_{x,w_i}^c$ (resp. $sup_{x,w_i}$) is the support of $x$ on class $c$ (resp. overall) in world $w_i$.
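The definition can be evaluated literally by enumerating possible worlds. The sketch below is for illustration only: each transaction is reduced to a pair (class label, p), where p is the assumed-independent probability that the transaction contains the itemset x, and the confidence in a world with zero support is taken as 0 (a simplifying convention not stated in the source). The enumeration is exponential in |T|.

```python
from itertools import product

# Brute-force expected confidence of itemset x on class c by enumerating all
# possible worlds. Each transaction is (label, p): p is the probability that
# the transaction contains x, assumed independent across transactions.

def expected_confidence(transactions, c):
    total = 0.0
    for world in product([0, 1], repeat=len(transactions)):
        p_world = 1.0
        sup = sup_c = 0
        for (label, p), contains in zip(transactions, world):
            p_world *= p if contains else (1.0 - p)
            if contains:
                sup += 1
                if label == c:
                    sup_c += 1
        if sup > 0:                # conf treated as 0 when support is 0
            total += p_world * sup_c / sup
    return total

transactions = [("pos", 0.8), ("pos", 0.5), ("neg", 0.2)]
val = expected_confidence(transactions, "pos")
```

With all probabilities equal to 1 this degenerates to the ordinary (certain) confidence, which is a useful sanity check.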

Efficient Computation of Expected Confidence

[Figure: a grid over transaction index n (0 … |T|) and support i (0 … |T|) on which the expected confidences E_{i,n}(conf_x^c) are computed incrementally, step by step. Stop condition: computation within a step is skipped, and the next step started, once maxbound_i(conf_x^c) < conf_{cur_db}^c, i.e., once the upper bound on the pattern's expected confidence falls below the current best confidence.]
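Enumerating possible worlds is exponential in |T|, but when each transaction contains x independently, a dynamic program over transactions can track the joint distribution of (class-c support, total support) and recover the expectation in polynomial time. The sketch below illustrates that generic idea only; it does not reproduce the paper's exact recurrence or the maxbound-based early stopping from the figure.

```python
from collections import defaultdict

# DP sketch: dist maps (sup_c, sup) -> probability after processing a prefix
# of transactions. Each transaction (label, p) contains x with probability p,
# independently (an illustrative assumption, as above).

def expected_confidence_dp(transactions, c):
    dist = {(0, 0): 1.0}
    for label, p in transactions:
        nxt = defaultdict(float)
        for (sc, s), pr in dist.items():
            nxt[(sc, s)] += pr * (1.0 - p)             # x absent from t
            nxt[(sc + (label == c), s + 1)] += pr * p  # x present in t
        dist = nxt
    # conf taken as 0 in worlds with zero total support
    return sum(pr * sc / s for (sc, s), pr in dist.items() if s > 0)
```

The state space is O(|T|^2), so the whole computation is cubic in |T| rather than exponential; an upper bound like the figure's maxbound would allow abandoning unpromising patterns even earlier.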

Accuracy Evaluation

Avg. Accuracy on 30 UCI Datasets:

  Uncertain Degree  # Uncertain Attr.  Ours      DTU [1]   uRule [2]
  10%               1                  79.0138%  74.8738%  75.2111%
  10%               2                  78.6970%  73.1629%  73.4107%
  10%               4                  77.9657%  72.2670%  69.4649%
  20%               1                  78.9537%  74.6577%  74.6287%
  20%               2                  78.6073%  72.5642%  72.5460%
  20%               4                  77.8352%  69.9157%  68.2066%

References

[1] B. Qin, Y. Xia, and F. Li. DTU: A decision tree for uncertain data. PAKDD 2009.

[2] B. Qin, Y. Xia, S. Prabhakar, and Y.-C. Tu. A rule-based classification algorithm for uncertain data. ICDE 2009 MOUND Workshop.