An extended K-means++ with mixed attributes for outlier detection
![Page 1: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/1.jpg)
An extended K-means++ with mixed attributes for outlier detection
Presented by Miss Sarunya Kanjanawattana
![Page 2: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/2.jpg)
Examination Committee
Dr. Sumanta Guha (Chairperson)
Prof. Dr. Phan Minh Dung (Committee)
Dr. Matthew N. Dailey (Committee)
![Page 3: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/3.jpg)
:: Agenda ::
• Background
• Literature review
• Methodologies
![Page 4: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/4.jpg)
Background
• Problem statement
• Objective of the study
• Scope and Limitation
• Contribution
![Page 5: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/5.jpg)
« Background »
• Data mining:
– Huge volumes of data and information are collected in databases.
– This volume of data far exceeds the human ability to analyze it and extract valuable information for decision-making support.
“Data mining helps to transform the collected data into valuable information.”
![Page 6: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/6.jpg)
« Background »
• Outlier detection:
– Outlier clustering is a popular methodology used to detect fraud in data sets.
– It identifies each data point as “normal” or “outlier”.
Outlier data point => fraudulent sample
![Page 7: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/7.jpg)
« Background »
• Fraud detection
– Health insurance fraud detection is a beneficial and challenging task.
– Detection helps to reveal patterns of fraud and abuse.
Example: Health insurance fraud led by institutions or health professionals includes the falsification of information on forms.
![Page 8: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/8.jpg)
« Background »
• The National Health Security Office (NHSO)
– is an autonomous state agency, officially founded in 2002 under the National Health Security Act.
– The vital duties of the NHSO
• are to manage the health security fund and allocate the subsidiary budget to 236 clinics and 963 hospitals, in order to promote and develop a good health care system for all Thai people.
![Page 9: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/9.jpg)
« Problem statement »
• Fraud and abuse
• lead to significant additional expense in the health care system.
• A case study: the NHSO database
• It holds a large amount of data.
• New transactions arrive constantly, every hour of the day.
• The data become too large for human inspection to detect fraud.
• Outlier clustering approach:
• A faster and more accurate algorithm is needed to monitor outliers.
![Page 10: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/10.jpg)
« Objective of the study »
• To provide a process for extracting fraud instances and uncovering unusual activities in the NHSO data.
• To extend K-means++, a variant of the standard K-means algorithm, so that it handles datasets with mixed attributes for outlier detection.
• To answer what is the optimal “”.
![Page 11: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/11.jpg)
« Scope and Limitation »
• The data source covers only 4 provinces in Thailand
– Nakhon Ratchasima, Chaiyaphum, Buriram and Surin.
• The transactions come from the group of high-cost diseases
– Fraudulent behavior is more likely to occur in this group than in other disease groups.
![Page 12: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/12.jpg)
« Contribution »
• The proposed study provides a methodology to detect fraud and abuse in the NHSO, Thailand, and will present some results of outlier clustering.
• This study proposes a novel algorithm based on extended K-means++ to work with mixed attributes and detect outliers.
![Page 13: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/13.jpg)
Literature review
• Fraud detection
• The process of data mining
![Page 14: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/14.jpg)
« Literature review »
Yi et al. 2006:
• Aim to understand and detect suspicious health care fraud in large databases using clustering techniques.
• Compare two clustering tools: SAS EM and CLUTO.
• The experimental results indicate that CLUTO is faster than SAS EM, while SAS EM provides more useful clusters than CLUTO.
Fraud detection
![Page 15: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/15.jpg)
« Literature review »
Liou, Tang, and Chen 2008:
• Apply data mining techniques to detect fraudulent or abusive reporting by healthcare providers, using their invoices for diabetic outpatient services.
• Techniques compared: logistic regression, neural network, classification trees.
• The classification tree model performs best, with an overall correct identification rate of 99%.
Fraud detection
![Page 16: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/16.jpg)
« Literature review »
• Data preprocessing
– Data obtained from real databases are often incomplete, noisy and inconsistent.
– The goal of data preprocessing is to clean a raw data set in order to improve accuracy.
– The steps of data preprocessing:
• data cleaning, data transformation and integration, and data reduction.
The process of data mining
![Page 17: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/17.jpg)
« Literature review »
• Data preprocessing
Wang and Chiang 2009:
– Present an efficient data preprocessing procedure for support vector clustering (SVC) that reduces the size of the training dataset.
The process of data mining
![Page 18: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/18.jpg)
« Literature review »
• K-means algorithm
The process of data mining
![Page 19: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/19.jpg)
« Literature review »
• K-means algorithm
– The benefits of K-means
• Speed and simplicity: the algorithm is easy to understand and implement.
– The shortcomings of K-means
• dependency on the number of clusters
• degeneracy
The process of data mining
![Page 20: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/20.jpg)
« Literature review »
• K-means++ algorithm
The process of data mining
![Page 21: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/21.jpg)
« Literature review »
• K-means++ algorithm
• Arthur and Vassilvitskii 2007
– Fast and more efficient
• K-means: O(i * n * k) running time
• K-means++: O(log k)-competitive with the optimal clustering, in expectation
– Not well suited to datasets that combine categorical and numerical attributes
The process of data mining
![Page 22: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/22.jpg)
« Literature review »
• K-means++ algorithm
• Example
The process of data mining
(k=3)
D(x) = the shortest distance from a data point x to the closest center we have already chosen.
![Page 23: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/23.jpg)
« Literature review »
• K-means++ algorithm
• Example
The process of data mining
(k=3)
![Page 24: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/24.jpg)
« Literature review »
• K-means++ algorithm
• Example
The process of data mining
$D^2 = 8^2 + 4^2$
$D^2 = 7^2 + 3^2$
$D^2 = 1^2 + 7^2$
$D^2 = 2^2 + 1^2$
(k=3)
![Page 25: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/25.jpg)
« Literature review »
• K-means++ algorithm
• Example
The process of data mining
$D^2 = 8^2 + 4^2$
$D^2 = 7^2 + 3^2$
$D^2 = 1^2 + 7^2$
$D^2 = 2^2 + 1^2$
(k=3)
![Page 26: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/26.jpg)
« Literature review »
• K-means++ algorithm
• Example
The process of data mining
$D^2 = 1^2 + 1^2$
$D^2 = 1^2 + 7^2$
$D^2 = 2^2 + 1^2$
(k=3)
![Page 27: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/27.jpg)
« Literature review »
• K-means++ algorithm
• Example
The process of data mining
$D^2 = 1^2 + 1^2$
$D^2 = 1^2 + 7^2$
$D^2 = 2^2 + 1^2$
(k=3)
![Page 28: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/28.jpg)
« Literature review »
• K-means++ algorithm
• Example
The process of data mining
(k=3)
![Page 29: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/29.jpg)
« Literature review »
• Y-means algorithm
The process of data mining
![Page 30: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/30.jpg)
« Literature review »
• Y-means algorithm
• Guan, Ghorbani, and Belacel 2003
– Based on the K-means algorithm
– It overcomes two shortcomings of K-means:
• dependency on the number of clusters, and degeneracy
The process of data mining
![Page 31: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/31.jpg)
« Literature review »
• Koufakou, Ortiz, Georgiopoulos, Anagnostopoulos, and Reynolds 2007
– Introduced a strategy named “Attribute Value Frequency (AVF)”.
– It is a fast and scalable outlier detection strategy for categorical data.
The process of data mining
![Page 32: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/32.jpg)
Methodologies
• Methodology
• Data collection
• Data evaluation
• Tasks and timeline
![Page 33: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/33.jpg)
« Methodologies »
• The method is divided into 3 phases.
• Phase 1: Data preprocessing
– Convert categorical data to numeric data
• Phase 2: Clustering
– Follows the K-means++ algorithm
• Phase 3: Outlier detection
– Local and global outliers
– Determine which clusters are outliers
![Page 34: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/34.jpg)
« Methodologies »
• Overview of the extended K-means++ algorithm
![Page 35: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/35.jpg)
« Methodologies »
• Phase 1: Data preprocessing
![Page 36: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/36.jpg)
« Methodologies »
• Phase 1: Data preprocessing
1) Normalize the numeric attributes’ values into the range 0 to 1.

| Attribute W | Attribute X | Attribute Y | Attribute Z |
|---|---|---|---|
| A | C | 100 | 100 |
| A | C | 300 | 900 |
| A | D | 800 | 800 |
| B | D | 900 | 200 |
| B | C | 200 | 800 |
| B | E | 600 | 900 |
| A | D | 700 | 100 |
![Page 37: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/37.jpg)
« Methodologies »
• Phase 1: Data preprocessing
1) Normalize the numeric attributes’ values into the range 0 to 1.

| Attribute W | Attribute X | Attribute Y | Attribute Z |
|---|---|---|---|
| A | C | 0.1 | 0.1 |
| A | C | 0.3 | 0.9 |
| A | D | 0.8 | 0.8 |
| B | D | 0.9 | 0.2 |
| B | C | 0.2 | 0.8 |
| B | E | 0.6 | 0.9 |
| A | D | 0.7 | 0.1 |
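
The normalization step can be sketched as follows. The slides do not state the scaling rule; the example values (100 becomes 0.1, 900 becomes 0.9) are consistent with dividing by 1000, so the `scale` parameter and the function name `scale_to_unit` are assumptions made for illustration only.

```python
# Raw numeric columns from the example table (Attribute Y and Attribute Z).
raw_y = [100, 300, 800, 900, 200, 600, 700]
raw_z = [100, 900, 800, 200, 800, 900, 100]


def scale_to_unit(values, scale=1000):
    """Divide every value by an assumed constant so results fall in [0, 1]."""
    return [v / scale for v in values]


norm_y = scale_to_unit(raw_y)
norm_z = scale_to_unit(raw_z)

print(norm_y)
print(norm_z)
```

A min-max rescaling would also map values into the range 0 to 1, but it would send the smallest value to exactly 0, which is not what the example table shows.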
![Page 38: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/38.jpg)
« Methodologies »
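• Phase 1: Data preprocessing
2) The categorical attribute with the largest number of distinct items is selected as the base attribute.

| Attribute W | Attribute X | Attribute Y | Attribute Z |
|---|---|---|---|
| A | C | 0.1 | 0.1 |
| A | C | 0.3 | 0.9 |
| A | D | 0.8 | 0.8 |
| B | D | 0.9 | 0.2 |
| B | C | 0.2 | 0.8 |
| B | E | 0.6 | 0.9 |
| A | D | 0.7 | 0.1 |

Attribute W has 2 items (A, B); Attribute X has 3 items (C, D, E), so Attribute X is the base attribute.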
![Page 39: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/39.jpg)
« Methodologies »
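• Phase 1: Data preprocessing
3) Count the frequency of co-occurrence between items, represented by matrix M (rows and columns are ordered A, B, C, D, E).

| Attribute W | Attribute X | Attribute Y | Attribute Z |
|---|---|---|---|
| A | C | 0.1 | 0.1 |
| A | C | 0.3 | 0.9 |
| A | D | 0.8 | 0.8 |
| B | D | 0.9 | 0.2 |
| B | C | 0.2 | 0.8 |
| B | E | 0.6 | 0.9 |
| A | D | 0.7 | 0.1 |

$$
M = \begin{pmatrix}
4 & 0 & 2 & 2 & 0 \\
0 & 3 & 1 & 1 & 1 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$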
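
The counting in step 3 can be sketched as below. This is a minimal illustration, assuming each item name belongs to exactly one categorical attribute and that only the upper triangle of M is filled, as in the matrix on the slide; the variable names are hypothetical.

```python
from itertools import combinations

# Categorical part of the example table: (Attribute W, Attribute X) per row.
rows = [("A", "C"), ("A", "C"), ("A", "D"), ("B", "D"),
        ("B", "C"), ("B", "E"), ("A", "D")]
items = ["A", "B", "C", "D", "E"]
index = {item: i for i, item in enumerate(items)}

# M[i][i] counts how often item i occurs;
# M[i][j] with i < j counts how often items i and j occur in the same row.
M = [[0] * len(items) for _ in items]
for row in rows:
    for item in row:
        M[index[item]][index[item]] += 1
    for a, b in combinations(row, 2):
        i, j = sorted((index[a], index[b]))
        M[i][j] += 1

for label, counts in zip(items, M):
    print(label, counts)
```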
![Page 40: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/40.jpg)
« Methodologies »
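• Phase 1: Data preprocessing
4) Calculate the similarity between items, represented by equation D:

$$
D_{xy} = \frac{M_{xy}}{M_{xx} + M_{yy} - M_{xy}}
$$

$$
M = \begin{pmatrix}
4 & 0 & 2 & 2 & 0 \\
0 & 3 & 1 & 1 & 1 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 0 & 3 & 0 \\
0 & 0 & 0 & 0 & 1
\end{pmatrix}
$$

(rows and columns are ordered A, B, C, D, E)

| Similarity | Calculated value |
|---|---|
| $D_{AC}$ | 2/(4+3-2) = 0.4 |
| $D_{AD}$ | 2/(4+3-2) = 0.4 |
| $D_{AE}$ | 0/(4+1-0) = 0 |
| $D_{BC}$ | 1/(3+3-1) = 0.2 |
| $D_{BD}$ | 1/(3+3-1) = 0.2 |
| $D_{BE}$ | 1/(3+1-1) = 0.33 |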
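
The similarity values in the table follow the pattern "co-occurrences divided by (occurrences of the first item + occurrences of the second item - co-occurrences)". The sketch below assumes that reading of equation D; the function name `similarity` is hypothetical.

```python
items = ["A", "B", "C", "D", "E"]
index = {item: i for i, item in enumerate(items)}

# Co-occurrence matrix M from step 3 (upper triangle filled).
M = [
    [4, 0, 2, 2, 0],
    [0, 3, 1, 1, 1],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 3, 0],
    [0, 0, 0, 0, 1],
]


def similarity(x, y):
    """Shared rows divided by rows containing x or y (inferred form of equation D)."""
    i, j = sorted((index[x], index[y]))
    both = M[i][j]
    return both / (M[i][i] + M[j][j] - both)


for x in ("A", "B"):
    for y in ("C", "D", "E"):
        print(f"D_{x}{y} = {similarity(x, y):.2f}")
```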
![Page 41: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/41.jpg)
« Methodologies »
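• Phase 1: Data preprocessing
5) Find the group variance of each numeric attribute, grouped by base item, using the following equation:

$$
SS_w = \sum_{j} \sum_{i} \left( x_{ij} - \bar{x}_j \right)^2
$$

Y attribute

| Base item | Mean | SSw |
|---|---|---|
| C | (0.1+0.3+0.2)/3 = 0.2 | 0.01+0.01+0 = 0.02 |
| D | (0.8+0.9+0.7)/3 = 0.8 | 0+0.01+0.01 = 0.02 |
| E | 0.6/1 = 0.6 | 0 |

Z attribute

| Base item | Mean | SSw |
|---|---|---|
| C | (0.1+0.9+0.8)/3 = 0.6 | 0.25+0.09+0.04 = 0.38 |
| D | (0.8+0.2+0.1)/3 = 0.37 | 0.185+0.029+0.073 = 0.287 |
| E | 0.9/1 = 0.9 | 0 |

Σ SSw(Y) = 0.04
Σ SSw(Z) = 0.667
Select Y, which has the smaller within-group sum of squares.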
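
The selection in step 5 can be checked with a short sketch. It assumes that SSw is the standard within-group sum of squared deviations from each group mean, which matches the per-group sums on the slide; the function name `within_group_ss` is hypothetical.

```python
from collections import defaultdict

# Base attribute X and the normalized numeric attributes Y and Z, row by row.
base = ["C", "C", "D", "D", "C", "E", "D"]
y = [0.1, 0.3, 0.8, 0.9, 0.2, 0.6, 0.7]
z = [0.1, 0.9, 0.8, 0.2, 0.8, 0.9, 0.1]


def within_group_ss(groups, values):
    """Sum over groups of squared deviations from that group's mean."""
    by_group = defaultdict(list)
    for group, value in zip(groups, values):
        by_group[group].append(value)
    total = 0.0
    for members in by_group.values():
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total


ss_y = within_group_ss(base, y)
ss_z = within_group_ss(base, z)
# The attribute with the smaller within-group sum of squares is selected.
selected = "Y" if ss_y <= ss_z else "Z"
print(round(ss_y, 3), round(ss_z, 3), selected)
```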
![Page 42: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/42.jpg)
« Methodologies »
• Phase 1: Data preprocessing
6) Every base item is quantified by assigning it the mean of its mapped values in the selected numeric attribute.

Y attribute

| Base item | Mean |
|---|---|
| C | (0.1+0.3+0.2)/3 = 0.2 |
| D | (0.8+0.9+0.7)/3 = 0.8 |
| E | 0.6/1 = 0.6 |

| Attribute W | Attribute X | Attribute Y | Attribute Z |
|---|---|---|---|
| A | 0.2 (C) | 0.1 | 0.1 |
| A | 0.2 (C) | 0.3 | 0.9 |
| A | 0.8 (D) | 0.8 | 0.8 |
| B | 0.8 (D) | 0.9 | 0.2 |
| B | 0.2 (C) | 0.2 | 0.8 |
| B | 0.6 (E) | 0.6 | 0.9 |
| A | 0.8 (D) | 0.7 | 0.1 |
![Page 43: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/43.jpg)
« Methodologies »
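• Phase 1: Data preprocessing
7) All other categorical items are quantified by applying the function:

$$
F(x) = \sum_{b} D_{xb} \, v_b
$$

where b ranges over the base items, $D_{xb}$ is the similarity from step 4 and $v_b$ is the quantified value of base item b from step 6.

| Attribute W | Attribute X | Attribute Y | Attribute Z |
|---|---|---|---|
| 0.4 (A) | 0.2 (C) | 0.1 | 0.1 |
| 0.4 (A) | 0.2 (C) | 0.3 | 0.9 |
| 0.4 (A) | 0.8 (D) | 0.8 | 0.8 |
| 0.398 (B) | 0.8 (D) | 0.9 | 0.2 |
| 0.398 (B) | 0.2 (C) | 0.2 | 0.8 |
| 0.398 (B) | 0.6 (E) | 0.6 | 0.9 |
| 0.4 (A) | 0.8 (D) | 0.7 | 0.1 |

F(A) = 0.4 * 0.2 + 0.4 * 0.8 + 0 * 0.6 = 0.4
F(B) = 0.2 * 0.2 + 0.2 * 0.8 + 0.33 * 0.6 = 0.398
*All data in the data set are now numeric.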
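
The two worked values F(A) and F(B) can be reproduced with the sketch below. It assumes F is the similarity-weighted sum of the base item values, which is the pattern both worked lines follow; the dictionary layout and the name `quantify` are hypothetical, and the similarity 0.33 is the rounded value used on the slide.

```python
# Quantified base items (means of Attribute Y from step 6).
base_value = {"C": 0.2, "D": 0.8, "E": 0.6}

# Similarities between non-base items and base items (step 4, rounded as on the slide).
similarity_to_base = {
    "A": {"C": 0.4, "D": 0.4, "E": 0.0},
    "B": {"C": 0.2, "D": 0.2, "E": 0.33},
}


def quantify(item):
    """Similarity-weighted sum of the base item values (inferred form of F)."""
    weights = similarity_to_base[item]
    return sum(weights[b] * base_value[b] for b in base_value)


for item in ("A", "B"):
    print(f"F({item}) = {quantify(item):.3f}")
```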
![Page 44: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/44.jpg)
« Methodologies »
• Phase 2: Clustering
Probability: $P(x) = \dfrac{D(x)^2}{\sum_{x'} D(x')^2}$
D(x): the shortest distance from a data point x to the closest center we have already chosen.
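
The seeding step that uses D(x) can be sketched as follows. This is a minimal illustration of standard K-means++ seeding (Arthur and Vassilvitskii 2007), where each new center is drawn with probability proportional to D(x) squared; it is not the extended algorithm proposed in these slides. The sample points and the function names are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """Pick k initial centers: the first uniformly, the rest weighted by D(x)^2."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its closest chosen center.
        d2 = [min(squared_distance(p, c) for c in centers) for p in points]
        # random.choices samples proportionally to the given weights, so
        # already-chosen centers (weight 0) cannot be picked again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.4, 0.2, 0.1, 0.1), (0.4, 0.2, 0.3, 0.9), (0.4, 0.8, 0.8, 0.8),
          (0.398, 0.8, 0.9, 0.2), (0.398, 0.2, 0.2, 0.8),
          (0.398, 0.6, 0.6, 0.9), (0.4, 0.8, 0.7, 0.1)]
centers = kmeanspp_seed(points, 3, random.Random(0))
print(centers)
```

The sample points are the fully numeric rows produced by Phase 1, so the sketch runs on the same worked example as the preceding slides.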
![Page 45: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/45.jpg)
« Methodologies »
• Phase 2: Clustering
– Define initial values:
• Cluster width
– used to detect local outliers
– set to 2.32, following a previous study
• Cluster population ratio
– used to detect global outliers
– my assumption: 0.9
The detection rate and false negative rate should reach their highest values at the optimal “”.
![Page 46: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/46.jpg)
« Methodologies »
• Phase 3: Outlier detection
![Page 47: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/47.jpg)
« Methodologies »
• Phase 3: Outlier detection
– There are 2 stages
• Local outlier detection:
• uses the cluster width
![Page 48: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/48.jpg)
« Methodologies »
• Phase 3: Outlier detection
– There are 2 stages
• Global outlier detection
• uses the population ratio
![Page 49: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/49.jpg)
« Data collection »
• A real dataset provided by the National Health Security Office of Thailand is used to demonstrate the effectiveness of the proposed method.
• The primary data will be gathered from the database, especially the statement information, which contains all financial transactions.
![Page 50: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/50.jpg)
« Data collection »
• Overview of data set
![Page 51: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/51.jpg)
« Data evaluation »
• Outlier detection accuracy rate: the proportion of outliers that this approach correctly identifies as outliers.
• False positive rate: the proportion of normal points erroneously identified as outliers.
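
The two evaluation measures can be sketched as below, assuming the usual definitions: detection rate = correctly flagged outliers / all true outliers, and false positive rate = wrongly flagged normal points / all true normal points. The labels in the example are made up purely to exercise the functions and are not results from the study.

```python
def detection_rate(actual, predicted):
    """Fraction of true outliers that were flagged as outliers."""
    true_outliers = sum(actual)
    hits = sum(1 for a, p in zip(actual, predicted) if a and p)
    return hits / true_outliers


def false_positive_rate(actual, predicted):
    """Fraction of normal points that were wrongly flagged as outliers."""
    true_normals = sum(1 for a in actual if not a)
    false_alarms = sum(1 for a, p in zip(actual, predicted) if p and not a)
    return false_alarms / true_normals


# Hypothetical labels: True means "outlier", False means "normal".
actual = [True, True, False, False, False, True]
predicted = [True, False, False, True, False, True]

print(detection_rate(actual, predicted))
print(false_positive_rate(actual, predicted))
```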
![Page 52: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/52.jpg)
« Tasks and timeline »
![Page 53: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/53.jpg)
Thank you
Dr. Sumanta Guha (Chairperson)
Prof. Dr. Phan Minh Dung (Committee)
Dr. Matthew N. Dailey (Committee)
![Page 54: An extended K-means++ with mixed attributes for outlier detection](https://reader034.fdocuments.us/reader034/viewer/2022042608/56814503550346895db1cdcf/html5/thumbnails/54.jpg)
Questions?