Classification and Novel Class Detection in Data Streams
Mehedy Masud¹, Latifur Khan¹, Jing Gao², Jiawei Han², and Bhavani Thuraisingham¹
¹ Department of Computer Science, University of Texas at Dallas
² Department of Computer Science, University of Illinois at Urbana Champaign
This work was funded in part by
Presentation Overview
Stream Mining Background
Novel Class Detection – Concept Evolution
Data Streams
Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records
Data Stream Classification
Uses past labeled data to build a classification model
Predicts the labels of future instances using the model
Helps decision making
[Figure: network traffic enters the classification model; attack traffic is sent to the firewall (block and quarantine) and benign traffic is passed to the server; expert analysis and labeling feeds the model update.]
Introduction
Challenges
◦ Infinite length
◦ Concept-drift
◦ Concept-evolution (emergence of novel class)
◦ Recurrence (seasonal) class
ICDM 2012, Brussels, Belgium 12/11/2012
Infinite Length
Impractical to store and use all historical data
◦ Requires infinite storage
◦ And running time
Concept-Drift
[Figure: a data chunk of positive and negative instances with the previous hyperplane and the current hyperplane; the instances lying between the two are victims of concept-drift.]
Concept-Evolution
[Figure: feature space (x, y) divided by the thresholds x1, y1 and y2 into regions A, B, C and D that hold the existing + and - instances; a cluster of novel class instances (X) appears later.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances
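A minimal sketch of rules R1 and R2, to show why a fixed rule set cannot flag a novel class: every point off the thresholds receives one of the two known labels. The threshold values and test points are hypothetical and chosen only for illustration.

```python
def classify(x, y, x1, y1, y2):
    """Apply rules R1 and R2 from the slide."""
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return "+"   # R1
    if (x > x1 and y > y2) or (x < x1 and y > y1):
        return "-"   # R2
    return None      # point lies exactly on a threshold; the rules leave this open


# Hypothetical thresholds (not taken from the slides).
X1, Y1, Y2 = 5.0, 3.0, 6.0

print(classify(2.0, 1.0, X1, Y1, Y2))      # a point in an existing '+' region
print(classify(8.0, 9.0, X1, Y1, Y2))      # a point in an existing '-' region
# A point from a never-seen class is still forced into '+' or '-'.
print(classify(100.0, 100.0, X1, Y1, Y2))
```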
Background: Ensemble of Classifiers
[Figure: an unlabeled input (x, ?) is passed to classifiers C1, C2 and C3; their individual outputs are +, + and -, and voting yields the ensemble output +.]
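The voting step in the figure can be sketched as a plain majority vote. The three classifiers below are hypothetical stand-ins that return fixed labels, mirroring the +, +, - outputs shown on the slide.

```python
from collections import Counter


def ensemble_predict(classifiers, x):
    """Collect one vote per classifier and return the majority label."""
    votes = [clf(x) for clf in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label, votes


# Hypothetical classifiers C1, C2, C3 with fixed outputs.
c1 = lambda x: "+"
c2 = lambda x: "+"
c3 = lambda x: "-"

label, votes = ensemble_predict([c1, c2, c3], x=None)
print(votes, "->", label)
```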
Background: Ensemble Classification of Data Streams
Divide the data stream into equal sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
[Figure: data chunks D1 to D6 and classifiers C1 to C5. A classifier Ci is trained from each labeled chunk Di; the current ensemble (for example C1, C2, C3) predicts the latest unlabeled chunk (for example D4), and the ensemble is updated with the newly trained classifier once that chunk is labeled.]
Addresses infinite length and concept-drift
Note: Di may contain data points from different classes
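A rough sketch of the chunk-based ensemble update. Two things are assumptions, not from the slides: the base learner (a nearest class mean on 1-D features) and the meaning of "best" (accuracy on the newest labeled chunk).

```python
from statistics import mean


def train(chunk):
    """Hypothetical base learner: nearest class mean on 1-D features."""
    by_label = {}
    for x, y in chunk:
        by_label.setdefault(y, []).append(x)
    means = {y: mean(xs) for y, xs in by_label.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def accuracy(model, chunk):
    return sum(model(x) == y for x, y in chunk) / len(chunk)


def update_ensemble(ensemble, chunk, L=3):
    """Train on the newest labeled chunk, then keep the L most accurate models."""
    candidates = ensemble + [train(chunk)]
    candidates.sort(key=lambda m: accuracy(m, chunk), reverse=True)
    return candidates[:L]


# Toy stream whose class boundary drifts upward from chunk to chunk.
chunks = [
    [(b + d, "-") for d in (0, 1, 2)] + [(b + d, "+") for d in (4, 5, 6)]
    for b in range(0, 10, 2)
]

ensemble = []
for chunk in chunks:
    ensemble = update_ensemble(ensemble, chunk, L=3)

print(len(ensemble))  # the ensemble never grows beyond L models
```

Because only L models are stored, memory stays bounded regardless of stream length, and re-ranking on the newest chunk lets outdated models drop out as the concept drifts.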
Examples of Recurrence and Novel Classes
Twitter stream: a stream of messages
Each message may be given a category or “class”
◦ based on the topic
Examples
◦ “Election 2012”, “London Olympic”, “Halloween”, “Christmas”, “Hurricane Sandy”, etc.
Among these
◦ “Election 2012” or “Hurricane Sandy” are novel classes because they are new events.
Also
◦ “Halloween” is a recurrence class because it “recurs” every year.
Concept-Evolution and Feature Space
[Figure: feature space (x, y) divided by the thresholds x1, y1 and y2 into regions A, B, C and D holding the existing + and - classes; novel class instances (X) fall in a part of the space that rules R1 and R2 assign to an existing class, so existing classification models misclassify novel class instances.]
Novel Class Detection – Prior Work
Prior work
Three steps:
◦ Training and building decision boundary
◦ Outlier detection and filtering
◦ Computing cohesion and separation
Training: Creating Decision Boundary
[Figure: left panel, raw training data (+ and - instances) in the feature space (x, y) with thresholds x1, y1, y2 and regions A, B, C, D; right panel, clusters are created over the same data and kept as pseudopoints.]
Addresses infinite length problem
• Training is done chunk-by-chunk (one classifier per chunk)
• An ensemble of classifiers is used for classification
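A minimal sketch of the training step, assuming plain k-means for the clustering and a (centroid, radius, weight) summary per cluster as the pseudopoint; the slides do not spell out either detail, and all function names here are hypothetical. Once the summaries exist, the raw points can be discarded, which is how this step addresses the infinite length problem.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means; returns the clusters as lists of points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters


def pseudopoint(cluster):
    """Summarize one cluster; the raw points are not needed afterwards."""
    centroid = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    radius = max(math.dist(p, centroid) for p in cluster)
    return {"centroid": centroid, "radius": radius, "weight": len(cluster)}


def build_model(points, k):
    return [pseudopoint(cl) for cl in kmeans(points, k) if cl]


def inside_boundary(model, x):
    """The decision boundary is the union of the pseudopoint spheres."""
    return any(math.dist(x, pp["centroid"]) <= pp["radius"] for pp in model)


# Hypothetical chunk with two well-separated groups of points.
chunk = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
model = build_model(chunk, k=2)

print(inside_boundary(model, (0.5, 0.5)))    # near the training data
print(inside_boundary(model, (50.0, 50.0)))  # far from every pseudopoint
```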
Outlier Detection and Filtering
[Figure: a test instance x is checked against the decision boundary of each model M1, M2, ..., ML in the ensemble of L models. A test instance inside the decision boundary is not an outlier; a test instance outside the decision boundary is a raw outlier (Routlier). The L Routlier decisions are combined with AND: if true, x is a filtered outlier (Foutlier), a potential novel class instance; if false, x is an existing class instance.]
Routliers may appear as a result of novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
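The AND-filter described above can be sketched as follows, assuming each model is stored as a list of (centroid, radius) pseudopoints; the ensemble values are hypothetical.

```python
import math


def is_routlier(model, x):
    """Raw outlier: x lies outside every pseudopoint sphere of one model."""
    return all(math.dist(x, centroid) > radius for centroid, radius in model)


def is_foutlier(ensemble, x):
    """Filtered outlier: all L models agree that x is a raw outlier (AND)."""
    return all(is_routlier(model, x) for model in ensemble)


# Hypothetical ensemble of L = 3 models.
ensemble = [
    [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)],
    [((0.2, 0.0), 1.0), ((5.0, 5.2), 1.0)],
    [((0.0, 0.3), 1.2), ((5.1, 5.0), 1.0), ((2.5, 2.5), 0.5)],
]

print(is_foutlier(ensemble, (9.0, 9.0)))  # outside every model
print(is_foutlier(ensemble, (2.5, 2.5)))  # inside the third model only
print(is_foutlier(ensemble, (0.1, 0.1)))  # inside all three models
```

Requiring agreement from all L models means a point that is unusual for only one model, for instance because of noise or drift in a single chunk, is not passed on to the novel class detector.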
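Computing Cohesion & Separation
a(x) = mean distance from an Foutlier x to the instances in $\lambda_{o,q}(x)$, the q nearest neighbors of x among the Foutliers
$b_{\min}(x)$ = minimum among all $b_c(x)$ (e.g. $b_+(x)$ in figure), where $b_c(x)$ is the mean distance from x to the instances in $\lambda_{c,q}(x)$, its q nearest neighbors in existing class c
q-Neighborhood Silhouette Coefficient (q-NSC):
$$ q\text{-NSC}(x) = \frac{b_{\min}(x) - a(x)}{\max\left(b_{\min}(x),\, a(x)\right)} $$
If q-NSC(x) is positive, it means x is closer to Foutliers than any other class.
[Figure: an Foutlier x with its q = 5 neighborhoods $\lambda_{o,5}(x)$, $\lambda_{+,5}(x)$ and $\lambda_{-,5}(x)$, and the corresponding mean distances a(x), $b_+(x)$ and $b_-(x)$.]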
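A small worked sketch of q-NSC on hypothetical 2-D points. One assumption: the existing classes are represented here by raw instances, whereas a stream implementation would more likely work from stored cluster summaries.

```python
import math


def mean_dist_to_q_nearest(x, points, q):
    """Mean distance from x to its q nearest neighbors among `points`."""
    dists = sorted(math.dist(x, p) for p in points)[:q]
    return sum(dists) / len(dists)


def q_nsc(x, other_foutliers, classes, q):
    """q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x))."""
    a = mean_dist_to_q_nearest(x, other_foutliers, q)          # cohesion
    b_min = min(mean_dist_to_q_nearest(x, pts, q)              # separation
                for pts in classes.values())
    return (b_min - a) / max(b_min, a)


# Hypothetical data: three other Foutliers near x, two existing classes far away.
foutliers = [(10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
classes = {
    "+": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "-": [(0.0, 10.0), (1.0, 10.0), (0.0, 11.0)],
}

score = q_nsc((10.0, 10.0), foutliers, classes, q=2)
print(score)  # a = 1.0, b_min = 9.5, so the score is 8.5 / 9.5

# A point sitting among the '+' instances gets a negative score instead.
print(q_nsc((0.5, 0.5), foutliers, classes, q=2))
```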
Limitation: Recurrence Class
[Figure: a stream timeline running from chunk0 to chunk150, marking where a class first appears as novel and where it later reappears as a recurrence class.]
Why Are Recurrence Classes Forgotten?
Divide the data stream into equal sized chunks
◦ Train a classifier from the whole data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
◦ Therefore, old models are discarded
◦ Old classes are “forgotten” after a while
[Figure: data chunks D1 to D6 and classifiers C1 to C5; the ensemble holds only L = 3 classifiers at a time, predicts the latest unlabeled chunk, and is updated once that chunk is labeled.]
Addresses infinite length and concept-drift
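A toy simulation of the forgetting effect. Each model is reduced to the set of classes present in its training chunk, and the oldest model is dropped when the ensemble is full; both are simplifying assumptions, since the slides keep the "best" L models rather than the newest. Class "H" is a hypothetical seasonal class.

```python
from collections import deque

L = 3
ensemble = deque(maxlen=L)  # assumption: the oldest model is discarded first

# Classes present in each successive labeled chunk.
chunk_classes = [{"A", "B", "H"}, {"A", "B"}, {"A", "B"}, {"A", "B"}, {"A", "B", "H"}]

history = []
for classes in chunk_classes:
    known = set().union(*ensemble)   # classes some model in the ensemble has seen
    history.append("H" in known)     # is H known before this chunk arrives?
    ensemble.append(classes)

print(history)
```

By the time "H" returns in the last chunk, every model that was trained on it has been discarded, so the ensemble would treat it as a novel class again.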
CLAM: The Proposed Approach
CLAss Based Micro-Classifier Ensemble
Proposed method
[Figure: CLAM workflow. The latest labeled chunk from the stream is used to train a new model, which updates the ensemble M (keeps all classes). Each latest unlabeled instance goes through outlier detection: if it is not an outlier, it is classified using M as an existing class; if it is an outlier, it goes to buffering and novel class detection.]
Training and Updating
• Each chunk is first separated into different classes
• A micro-classifier is trained from each class’s data
• Each micro-classifier replaces one existing micro-classifier
• A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
• C such MCEs constitute the whole ensemble, E
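The training and updating steps above can be sketched as follows. The micro-classifier here is a single (centroid, radius) sphere per class and the replaced micro-classifier is always the oldest one; both are assumptions made for illustration, as the slides do not say which existing micro-classifier is replaced.

```python
import math
from collections import defaultdict, deque


def train_micro_classifier(points):
    """Hypothetical micro-classifier: one sphere around a single class's points."""
    centroid = tuple(sum(c) / len(points) for c in zip(*points))
    radius = max(math.dist(p, centroid) for p in points)
    return centroid, radius


def update_ensemble(E, chunk, L=3):
    """E maps each class label to its MCE, holding at most L micro-classifiers."""
    by_class = defaultdict(list)
    for x, y in chunk:                      # separate the chunk into classes
        by_class[y].append(x)
    for y, points in by_class.items():
        mce = E.setdefault(y, deque(maxlen=L))
        mce.append(train_micro_classifier(points))  # drops the oldest when full
    return E


# Hypothetical stream: class H appears only in the first chunk.
chunks = [
    [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((9.0, 9.0), "H"), ((9.0, 8.0), "H")],
] + [
    [((0.0, float(i)), "A"), ((1.0, float(i)), "A"),
     ((5.0, float(i)), "B"), ((6.0, float(i)), "B")]
    for i in range(4)
]

E = {}
for chunk in chunks:
    update_ensemble(E, chunk, L=3)

print(sorted(E), {y: len(mce) for y, mce in E.items()})
```

Because a micro-classifier only ever replaces one of its own class, the MCE for H survives the four chunks in which H is absent, which is what lets the ensemble keep all classes.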
Outlier Detection and Classification
• A test instance x is first classified with each micro-classifier ensemble
• Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
• If all ensembles flag x as an outlier, then it is buffered and sent to the novel class detector
• Otherwise, the partial outputs are combined and a class label is predicted
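A minimal sketch of this decision flow. The slides do not define the partial output Yr or how partial outputs are combined; here Yr is assumed to be the distance to the nearest micro-classifier centroid, and the label with the smallest Yr wins. The ensemble E is hypothetical.

```python
import math


def mce_output(mce, x):
    """Return (partial output Yr, outlier flag) for one micro-classifier ensemble."""
    dists = [math.dist(x, centroid) for centroid, _radius in mce]
    outlier = all(d > radius for d, (_c, radius) in zip(dists, mce))
    return min(dists), outlier   # assumption: Yr = distance to nearest centroid


def classify(E, x):
    """Return a class label, or None when x must be buffered as a possible novel instance."""
    outputs = {label: mce_output(mce, x) for label, mce in E.items()}
    if all(flag for _yr, flag in outputs.values()):
        return None              # every MCE flags x as an outlier
    return min(outputs, key=lambda label: outputs[label][0])


# Hypothetical ensemble E: one MCE per class, each a list of (centroid, radius).
E = {
    "+": [((0.0, 0.0), 1.0), ((0.5, 0.5), 1.0)],
    "-": [((5.0, 5.0), 1.0)],
}

print(classify(E, (0.2, 0.1)))    # inside the '+' micro-classifiers
print(classify(E, (5.0, 4.5)))    # inside the '-' micro-classifier
print(classify(E, (20.0, 20.0)))  # outlier for every MCE, so buffered
```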
Evaluation
Competitors:
◦ CLAM (CL) – proposed work
◦ SCANR (SC) [1] – prior work
◦ ECSMiner (EM) [2] – prior work
◦ Olindda [3]-WCE [4] (OW) – another baseline
Datasets: Synthetic, KDD Cup 1999 & Forest covertype
1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. Detecting recurring and novel classes in concept-drifting data streams. In Proc. ICDM ’11, pages 1176–1181, Dec. 2011.
2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6): 859–874, 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. 2008 ACM Symposium on Applied Computing, pages 976–980, 2008.
4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, Washington, DC, USA, Aug. 2003. ACM.
Overall Error
[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest and (d) KDD]
Number of Recurring Classes vs Error
Error vs Drift and Chunk Size
Summary Table
Conclusion
◦ Detect recurrence
◦ Improved accuracy
◦ Running time
◦ Reduced human interaction
◦ Future work: use other base learners
QUESTIONS?
THANKS