KDD 2010 - Direct mining of discriminative patterns for classifying uncertain data



Direct Mining of Discriminative Patterns for Classifying Uncertain Data

Chuancong Gao, Jianyong Wang

Department of Computer Science and Technology, Tsinghua University, Beijing, China

Uncertain Dataset

Evaluation    Price  Looking  Tech. Spec.  Quality
Unacceptable  +      -        /            {-: 0.8, /: 0.1, +: 0.1}
Acceptable    /      -        /            {-: 0.1, /: 0.8, +: 0.1}
Good          -      +        /            {-: 0.1, /: 0.8, +: 0.1}
Very Good     /      +        +            {-: 0.1, /: 0.1, +: 0.8}

0.8: probability of the original value in the certain dataset

+: Good, /: Medium, -: Bad

Our Solution

• For each training instance, find a set of the most discriminative patterns, ensuring that the probability of the instance being covered exceeds a threshold.

• For each selected pattern, generate a feature for each instance indicating whether the instance contains the pattern.

• Train an SVM classifier on the generated features.

• Classify all the testing instances.
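The feature-generation step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionary representation of instances and patterns and the helper names `contains` and `to_features` are assumptions, and only the certain attributes from the example dataset are used, since the poster does not spell out how containment is decided for uncertain attributes. Training the SVM on the resulting matrix is left out.

```python
from typing import Dict, List

# Hypothetical representation: attribute name -> value.
Instance = Dict[str, str]
Pattern = Dict[str, str]


def contains(instance: Instance, pattern: Pattern) -> bool:
    """True if the instance matches every attribute value in the pattern."""
    return all(instance.get(attr) == val for attr, val in pattern.items())


def to_features(instances: List[Instance], patterns: List[Pattern]) -> List[List[int]]:
    """One 0/1 feature per (instance, pattern) pair."""
    return [[1 if contains(inst, p) else 0 for p in patterns] for inst in instances]


# Certain attributes of the four example instances.
instances = [
    {"Price": "+", "Looking": "-", "Tech. Spec.": "/"},
    {"Price": "/", "Looking": "-", "Tech. Spec.": "/"},
    {"Price": "-", "Looking": "+", "Tech. Spec.": "/"},
    {"Price": "/", "Looking": "+", "Tech. Spec.": "+"},
]

# Two illustrative patterns (chosen by hand, not mined).
patterns = [{"Looking": "-"}, {"Looking": "+", "Tech. Spec.": "+"}]

features = to_features(instances, patterns)
```

Each row of `features` would then serve as the input vector for one instance when training the classifier.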

Measure a Pattern’s Discriminative Power

Via the confidence value on each class label:

• For patterns involving only certain attributes: calculate each confidence value directly.

• For patterns involving at least one uncertain attribute: calculate the expected value of each confidence.
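For the certain-attribute case, the direct calculation can be sketched as below. The sketch assumes that Evaluation is the class label in the example dataset (the layout suggests this, but the poster does not state it), and the function name `confidence` and the zero result for an uncovered pattern are illustrative conventions.

```python
from typing import Dict, List, Tuple

# (class label, certain attribute values); Evaluation is assumed to be the class.
Row = Tuple[str, Dict[str, str]]

rows: List[Row] = [
    ("Unacceptable", {"Price": "+", "Looking": "-", "Tech. Spec.": "/"}),
    ("Acceptable", {"Price": "/", "Looking": "-", "Tech. Spec.": "/"}),
    ("Good", {"Price": "-", "Looking": "+", "Tech. Spec.": "/"}),
    ("Very Good", {"Price": "/", "Looking": "+", "Tech. Spec.": "+"}),
]


def confidence(rows: List[Row], pattern: Dict[str, str], label: str) -> float:
    """Fraction of rows containing the pattern that carry the given class label."""
    covered = [cls for cls, attrs in rows
               if all(attrs.get(a) == v for a, v in pattern.items())]
    if not covered:
        return 0.0  # assumed convention when the pattern covers nothing
    return covered.count(label) / len(covered)


# Two of the four rows have Looking = "-", and one of them is "Unacceptable".
conf_unacceptable = confidence(rows, {"Looking": "-"}, "Unacceptable")
```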

Definition of Expected Confidence

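Given a set of transactions $T$ and the set of possible worlds $W$ w.r.t. $T$, the expected confidence of an itemset $x$ on class $c$ is:

$$E(conf_x^c) = \sum_{w_i \in W} conf_{x,w_i}^c \times P(w_i) = \sum_{w_i \in W} \frac{sup_{x,w_i}^c}{sup_{x,w_i}} \times P(w_i)$$

where $P(w_i)$ is the probability of world $w_i$, $conf_{x,w_i}^c$ is the confidence of $x$ on class $c$ in world $w_i$, and $sup_{x,w_i}^c$ ($sup_{x,w_i}$) is the support of $x$ on class $c$ (on all classes) in world $w_i$.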
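To make the definition concrete, the following sketch enumerates every possible world and sums the probability-weighted confidence. It rests on assumptions the poster does not state: each transaction contains the itemset independently with a known probability, and a world in which the support is zero contributes a confidence of zero. The function name `expected_confidence_bruteforce` is illustrative. The enumeration is exponential in the number of transactions, so it is only useful for checking small cases.

```python
from itertools import product
from typing import List, Tuple

# (class label, probability that the transaction contains itemset x)
Transaction = Tuple[str, float]


def expected_confidence_bruteforce(transactions: List[Transaction], target: str) -> float:
    """Sum over all possible worlds of P(world) * sup^c / sup."""
    total = 0.0
    for world in product([False, True], repeat=len(transactions)):
        prob = 1.0   # probability of this world (independence assumed)
        sup = 0      # support of x in this world
        sup_c = 0    # support of x on the target class in this world
        for (label, p), present in zip(transactions, world):
            prob *= p if present else 1.0 - p
            if present:
                sup += 1
                if label == target:
                    sup_c += 1
        if sup > 0:  # assumed convention: confidence is 0 when support is 0
            total += prob * sup_c / sup
    return total


# Itemset {Quality: "-"} on the example dataset: the per-instance probabilities
# of Quality = "-" are 0.8, 0.1, 0.1 and 0.1.
example = [("Unacceptable", 0.8), ("Acceptable", 0.1), ("Good", 0.1), ("Very Good", 0.1)]
e_conf = expected_confidence_bruteforce(example, "Unacceptable")
```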

Efficient Computation of Expected Confidence

[Figure: grid indexed by #Transaction ($n$, from 0 to $|T|$) and Support ($i$, from 0 to $|T|$); the cell $E_{i,|T|}(conf_x^c)$ is shown together with its neighboring cells. Legend: skipped computation in one step; start of next step; explanation. Stop condition: expressed in terms of $bound_i(conf_x^c)$ and $max\_conf^c_{cur\_db}$.]
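One way to avoid enumerating all possible worlds is a table indexed by the number of transactions processed and the support, in the spirit of the grid above. The sketch below is a reconstruction under stated assumptions, not necessarily the authors' exact recurrence: transactions contain the itemset independently, a support of zero contributes a confidence of zero, and the names `expected_confidence_dp`, `prob`, and `sup_c` are illustrative. The bound-based stop condition is not implemented because its exact form is not recoverable here.

```python
from typing import List, Tuple

# (class label, probability that the transaction contains itemset x)
Transaction = Tuple[str, float]


def expected_confidence_dp(transactions: List[Transaction], target: str) -> float:
    """Expected confidence in O(|T|^2) time instead of O(2^|T|).

    After processing n transactions:
      prob[i]  = P(support == i)
      sup_c[i] = E[class support * 1(support == i)]
    """
    prob = [1.0]
    sup_c = [0.0]
    for label, p in transactions:
        hit = 1.0 if label == target else 0.0
        new_prob = [0.0] * (len(prob) + 1)
        new_sup_c = [0.0] * (len(prob) + 1)
        for i in range(len(prob)):
            # Transaction does not contain x: support stays at i.
            new_prob[i] += prob[i] * (1.0 - p)
            new_sup_c[i] += sup_c[i] * (1.0 - p)
            # Transaction contains x: support moves from i to i + 1.
            new_prob[i + 1] += prob[i] * p
            new_sup_c[i + 1] += (sup_c[i] + hit * prob[i]) * p
        prob, sup_c = new_prob, new_sup_c
    # Confidence in a world with support i is (class support) / i.
    return sum(sup_c[i] / i for i in range(1, len(sup_c)))


# Same example as the dataset above: P(Quality = "-") per instance.
example = [("Unacceptable", 0.8), ("Acceptable", 0.1), ("Good", 0.1), ("Very Good", 0.1)]
e_conf_dp = expected_confidence_dp(example, "Unacceptable")
```

Each transaction updates one row of the table from the previous one, so only two rows need to be kept in memory at a time.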

Accuracy Evaluation

Avg. Accuracy on 30 UCI Datasets

Uncertain Degree  # Uncertain Attr.  Ours      DTU [1]   uRule [2]
10%               1                  79.0138%  74.8738%  75.2111%
10%               2                  78.6970%  73.1629%  73.4107%
10%               4                  77.9657%  72.2670%  69.4649%
20%               1                  78.9537%  74.6577%  74.6287%
20%               2                  78.6073%  72.5642%  72.5460%
20%               4                  77.8352%  69.9157%  68.2066%

References

[1] B. Qin, Y. Xia, and F. Li. DTU: A decision tree for uncertain data. PAKDD’09.

[2] B. Qin, Y. Xia, S. Prabhakar, and Y.-C. Tu. A rule-based classification algorithm for uncertain data. ICDE’09 MOUND Workshop.